Hbase集群挂掉的一次惊险经历
本文转载自微信公众号「Java大数据与数据仓库」,作者柯同学。转载本文请联系Java大数据与数据仓库公众号。
这是以前的一次hbase集群异常事故,由于不规范操作,集群无法启动,在腾讯云大佬的帮助下,花了一个周末才修好,真的是一次难忘的回忆。
版本信息
cdh-6.0.1 hadoop-3.0 hbase-2.0.0
问题
想在空闲时候重启一下hbase释放一下内存,顺便修改一下yarn的一些配置,结果停掉后,hbase起不来了,错误信息就是hbase:namespace表is not online,master一直初始化,具体错误信息
15:41:59.313 [ProcExecTimeout] WARN .apache.hadoop.hbase.master.assignment.AssignmentManager - STUCK Region-In-Transition rit=OPENING, location=node4,16020,1589648302672, table=real_time_data, region=74cac15d22e99800ad0ace14c9ed74d6 15:41:59.313 [ProcExecTimeout] WARN .apache.hadoop.hbase.master.assignment.AssignmentManager - STUCK Region-In-Transition rit=OPENING, location=node3,16020,1596598630022, table=real_time_data, region=8e68891d5826c09974d81ad5d705c3b6 15:41:59.313 [ProcExecTimeout] WARN .apache.hadoop.hbase.master.assignment.AssignmentManager - STUCK Region-In-Transition rit=OPENING, location=node3,16020,1596598630022, table=real_time_data, region=75c42d75e2556bf70ff527f2425e8509 15:41:59.313 [ProcExecTimeout] WARN .apache.hadoop.hbase.master.assignment.AssignmentManager - STUCK Region-In-Transition rit=OPENING, location=node3,16020,1596598630022, table=real_time_data, region=2eee04869ac2c35984d4d22e6e9f2f31 15:42:08.264 [master/node3:16000] INFO .apache.hadoop.hbase.client.RpcRetryingCallerImpl - Call exception, tries=15, retries=15, started=128887 ms ago, cancelled=false, msg=.apache.hadoop.hbase.NotServingRegionException: hbase:namespace,,1558205786137.40562c48c9210c06813adce48773cb6a. is not online on node1,16020,1596957741742 at .apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3273) at .apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3250) at .apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414) at .apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2446) at .apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998) at .apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409) at .apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131) at .apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) at .apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) , details=ro 'default' on table 'hbase:namespace' at region=hbase:namespace,,1558205786137.40562c48c9210c06813adce48773cb6a., hostname=node1,16020,1589648239142, seqNum=55 ... ... 15:44:58.229 [qtp1792826268-435] WARN .eclipse.jetty.servlet.ServletHandler - /master-status .apache.hadoop.hbase.PleaseHoldException: Master is initializing at .apache.hadoop.hbase.master.HMaster.isInMaintenanceMode(HMaster.java:2827) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634] at .apache.hadoop.hbase.tmpl.master.MasterStatusTmplImpl.renderNoFlush(MasterStatusTmplImpl.java:271) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634] at .apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.renderNoFlush(MasterStatusTmpl.java:389) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634] at .apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.render(MasterStatusTmpl.java:380) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634] at .apache.hadoop.hbase.master.MasterStatusServlet.doGet(MasterStatusServlet.java:81) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634] at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) ~[javax.servlet-api-3.1.0.jar:3.1.0] at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) ~[javax.servlet-api-3.1.0.jar:3.1.0] at .eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502] at .apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:112) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634] at .eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502] at .apache.hadoop.hbase.http.ClickjackingPreventionFilter.doFilter(ClickjackingPreventionFilter.java:48) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634] at .eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502] at .apache.hadoop.hbase.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1374) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634] at .eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502] at .apache.hadoop.hbase.http.NoCacheFilter.doFilter(NoCacheFilter.java:49) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634] at .eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502] at .apache.hadoop.hbase.http.NoCacheFilter.doFilter(NoCacheFilter.java:49) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634] at .eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) [jetty-security-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.server.Server.handle(Server.java:534) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.io.AbstractConnection$ReadCallback.sueeded(AbstractConnection.java:283) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502] at .eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502] at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
常规操作
到这里,我尝试使用hbck命令查看详情并修复,发现hbase2.0.0版本hbck已经废弃了修复的命令。
----------------------------------------------------------------------- NOTE: As of HBase version 2.0, the hbck tool is significantly changed. In general, all Read-Only options are supported and can be be used safely. Most -fix/ -repair options are NOT supported. Please see usage belo for details on hich options are not supported. ----------------------------------------------------------------------- 省略若干... 省略若干... 省略若干... NOTE: Folloing options are NOT supported as of HBase version 2.0+. UNSUPPORTED Metadata Repair options: (expert features, use ith caution!) -fix Try to fix region assignments. This is for backards patiblity -fixAssignments Try to fix region assignments. Replaces the old -fix -fixMeta Try to fix meta problems. This assumes HDFS region info is good. -fixHdfsHoles Try to fix region holes in hdfs. -fixHdfsOrphans Try to fix region dirs ith no .regioninfo file in hdfs -fixTableOrphans Try to fix table dirs ith no .tableinfo file in hdfs (online mode only) -fixHdfsOverlaps Try to fix region overlaps in hdfs. -maxMerge When fixing region overlaps, allo at most regions to merge. (n=5 by default) -sidelineBigOverlaps When fixing region overlaps, allo to sideline big overlaps -maxOverlapsToSideline When fixing region overlaps, allo at most regions to sideline per group. (n=2 by default) -fixSplitParents Try to force offline split parents to be online. -removeParents Try to offline and sideline lingering parents and keep daughter regions. -fixEmptyMetaCells Try to fix hbase:meta entries not referencing any region (empty REGIONINFO_QUALIFIER ros) UNSUPPORTED Metadata Repair shortcuts -repair Shortcut for -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans -fixHdfsOverlaps -fixVersionFile -sidelineBigOverlaps -fixReferenceFiles-fixHFileLinks -repairHoles Shortcut for -fixAssignments -fixMeta -fixHdfsHoles
然后,查阅资料看到了hbck2,官方地址https://github./apache/hbase-operator-tools/tree/master/hbase-hbck2, 这个工具,本来以为抓住了救命的稻草,结果
=================================================================== HBCK2 Overvie HBCK2 is currently a simple tool that does one thing at a time only. In hbase-2.x, the Master is the final arbiter of all state, so a general principal for most HBCK2 mands is that it asks the Master to effect all repair. This means a Master must be up before you can run HBCK2 mands. The HBCK2 implementation approach is to make use of an HbckService hosted on the Master. The Service publishes a fe methods for the HBCK2 tool to pull on. Therefore, for HBCK2 mands relying on Master's HbckService facade, first thing HBCK2 does is poke the cluster to ensure the service is available. This ill fail if the remote Server does not publish the Service or if the HbckService is lacking the requested method. For the latter case, if you can, update your cluster to obtain more fix facility. HBCK2 versions should be able to ork across multiple hbase-2 releases. It ill fail ith a plaint if it is unable to run. There is no HbckService in versions of hbase before 2.0.3 and 2.1.1. HBCK2 ill not ork against these versions. Next e look first at ho you 'find' issues in your running cluster folloed by a section on ho you 'fix' found problems. ===================================================================
tm,服了。hbase2.0.0 ~ 2.0.2以及hbase2.1.0 ~ 2.1.0是不适用的,既不能使用hbck,也不能使用hbck2,这里出现了断层。
解决办法
1. 修复master,让集群正常启动
由于目前master无法初始化,集群无法启动,因为元数据表hbase:meta信息有损坏,hbase:namespace表is not online,需要让hbase:namespace表上线,启动hbase集群再说,否则后续的修复工作都进行不了;然后修复那些表(此时内心是崩溃的,都准备重搭建集群了)。
查看hbase源码,发现hbase元数据表hbase:namespace表如果没有会重建,TableNamespaceManager.java
思路备份hbase:namespace表hdfs数据,删除hbase:namespace表,启动时让其重建,然后将备份的数据bulkload进新建的hbase:namespace表中去。
删除hbase:meta中hbase:namespace那一行数据,并且mv走hbase:namespace表对应的hdfs目录到临时目录备份,这样相当于把hbase:namespace这个表删除了。
然后,重启hbase集群,namespace表会被重建,集群终于起来了。此时,hbase:namespace这张表里面保存的namespace只有default这个默认的namespace,我们通过bulkload命令,把临时目录里面的hfile文件移到hbase:namespace这张表里面,这样就还原了命名空间表。
2. 修复hbase表
很不容易,hbase集群已经起来了,通过eb ui发现,此时里面的表都是空的,无法找到每个region对应的hdfs数据文件。
由于hbase中的hbase:meta表保存所有表的region分配等信息,现在由于集群异常停止,破坏了hbase:meta表,应该是hbase:meta表有损坏,导致hbase:namespace表无法找到对应分配的region。
思路通过.regioninfo来修复hbase:meta表,参考博客https://blog.csdn./xyzkenan/article/details/103476160
工具地址https://github./DarkPhoenixs/hbase-meta-repair
总算解决了,虚惊一场,珍惜美好的生活吧。这次异常,hbase集群无法启动,两个表现
namespace region is not online; Master is initializing;
解决思路是
删除hbase:namespace表,让hbase启动服务时自动创建,解决了hbase无法启动问题 然后,通过.regioninfo文件修复hbase:meta表。
人工智能培训
- 真正能和人交流的机器人什么时候实现
- 国产机器人成功完成首例远程冠脉介入手术
- 人工智能与第四次工业革命
- 未来30年的AI和物联网
- 新三板创新层公司东方水利新增专利授权:“一
- 发展人工智能是让人和机器更好地合作
- 新春贺喜! 经开区持续推进工业互联网平台建设
- 以工业机器人为桥 传统企业如何趟过智造这条河
- 山立滤芯SAGL-1HH SAGL-2HH
- 2015国际智能星创师大赛火热报名中!
- 未来机器人会咋看人类?递归神经网络之父-像蚂
- 成都新川人工智能创新中心二期主体结构封顶
- 斯坦德机器人完成数亿元人民币C轮融资,小米产
- 到2020年,智能手机将拥有十项AI功能,有些可能
- 寻找AI机器人的增长“跳板”:老龄化为支点的产
- 力升高科耐高温消防机器人参加某支队性能测试