---
title: HBase Hbck command reporting holes in the chain of regions | Microsoft Docs
description: Troubleshooting the cause of holes in the chain of regions.
services: hdinsight
documentationcenter: ''
author: nitinver
manager: ashitg
ms.service: hdinsight
ms.custom: hdinsightactive
ms.devlang: na
ms.topic: article
ms.tgt_pltfrm: na
ms.workload: big-data
ms.date: 04/11/2017
ms.author: nitinver
---
## Error "Passed in file status is for something other than a regular file"

Regions were offline, and running `hbase hbck` reported holes in the region chain. The HBase Master UI showed a region stuck in transition or a WAL split that had been running for a long time. The region server logs showed the same WAL split failing repeatedly with the following error:
```
2017-05-12 11:55:01,993 ERROR [RS_LOG_REPLAY_OPS-10.0.0.14:16020-1] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
java.lang.IllegalArgumentException: passed in file status is for something other than a regular file.
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:271)
        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:235)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)
        at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
```
Checking the `/hbase/WALs/xxxx-splitting` directories, I noticed that only one folder still contained a WAL file; the others contained only a meta file, which means their WAL splits had already finished. Running `hdfs dfs -ls` on that remaining file showed a length of 0, which is why the region server did not treat it as a valid WAL file. Most likely something went wrong while the WAL file was being created.

The mitigation was to delete this zero-length file and restart the active HMaster, so that HBase skipped splitting this WAL file and the offline regions came back online.
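As a rough sketch (the paths below are placeholders; the real splitting directory and file names come from your own `hdfs dfs -ls` output), finding and removing the zero-length WAL file can look like this:

```bash
# List everything under /hbase/WALs and keep only zero-length files.
# In `hdfs dfs -ls` output, the fifth column is the file size and the
# eighth column is the path.
hdfs dfs -ls -R /hbase/WALs | awk '$5 == 0 {print $8}'

# After confirming the file really is a stray zero-length WAL, remove it.
# (Placeholder path; use the path reported by the command above.)
hdfs dfs -rm /hbase/WALs/xxxx-splitting/<zero-length-wal-file>

# Finally, restart the active HMaster (for example, from the Ambari UI) so
# HBase skips splitting this WAL and reassigns the offline regions.
```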
## Multiple unassigned regions or holes in the chain of regions

It is a common issue to see multiple regions unassigned, or holes in the chain of regions, when you run the `hbase hbck` command. In the HBase Master UI, the region count looks unbalanced across the region servers, and running `hbase hbck` afterwards reports holes in the region chain. Fix the assignments first, because the holes may be caused by those offline regions.
Follow these steps to bring the unassigned regions back to a normal state:

1. Sign in to the HDInsight HBase cluster by using SSH.
2. Run the `hbase zkcli` command to connect to the ZooKeeper shell.
3. Run the `rmr /hbase/regions-in-transition` or `rmr /hbase-unsecure/regions-in-transition` command.
4. Exit the `hbase zkcli` shell by using the `exit` command.
5. Open the Ambari UI and restart the Active HBase Master service.
6. Run the `hbase hbck` command again (without any further options).

Check the output of the command in step 6 and make sure that all regions are assigned. A sample session for these steps is shown below.
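Put together, the commands might look like the following sketch (the SSH endpoint is a placeholder, and the znode path depends on whether your cluster uses the `/hbase` or `/hbase-unsecure` base znode):

```bash
# Step 1: SSH to the cluster (placeholder endpoint; use your cluster's SSH host).
ssh sshuser@mycluster-ssh.azurehdinsight.net

# Steps 2-4: remove the regions-in-transition znode from ZooKeeper.
hbase zkcli
#   Inside the zkcli shell, run one of the following, then exit:
#   rmr /hbase/regions-in-transition
#   rmr /hbase-unsecure/regions-in-transition
#   exit

# Step 5: restart the Active HBase Master service from the Ambari UI.

# Step 6: run hbck again and review the output to confirm all regions are assigned.
hbase hbck
```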
## Region server fails to start because of multiple splitting WAL directories

We have also seen incidents where the region server fails to start because of multiple splitting WAL directories, with errors like the following:
```
2019-03-06 01:51:06,045 ERROR [RS_LOG_REPLAY_OPS-wn0-mycluster:16020-0] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
2019-03-06 01:51:13,129 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
2019-03-06 01:58:04,847 ERROR [RS_LOG_REPLAY_OPS-wn0-mycluster:16020-0] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
2019-03-06 01:58:10,422 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/region-in-transition/c26a119d9014497f11f7945ee5765707
2019-03-06 01:58:04,847 ERROR [RS_LOG_REPLAY_OPS-wn0-mycluster:16020-0] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
2019-03-06 01:58:06,828 ERROR [RS_OPEN_REGION-wn0-mycluster:16020-2] handler.OpenRegionHandler: Failed open of region=SALES:MYSALES_TRANS,2016-04-16_14940401167_276107,1549148647976.3b69ed680668388f7c5723f491edc1c7., starting to roll back the global memstore size.
2019-03-06 01:58:08,672 ERROR [RS_CLOSE_META-wn0-mycluster:16020-0] regionserver.HRegion: Memstore size is 490232
2019-03-06 01:58:10,056 ERROR [RS_OPEN_REGION-wn0-mycluster:16020-8] regionserver.HRegion: Could not initialize all stores for the region=SALES:MYSALES_TRANS,2016-07-26_16381202590_129836,1549502700390.348bd64e19335f27a9f50f0b4bba5802.
2019-03-06 01:58:10,056 ERROR [RS_OPEN_REGION-wn0-mycluster:16020-1] regionserver.HRegion: Could not initialize all stores for the region=SALES:MYSALES_TRANS,2016-06-20_15881302257_70096,1549355619258.8eadc37b6743124b2b1c228fbc7fcd69.
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/region-in-transition/d7107da93effb7d0476952020a45e93f
2019-03-06 01:58:10,424 ERROR [RS_OPEN_REGION-wn0-mycluster:16020-10] zookeeper.ZooKeeperWatcher: regionserver:16020-0x16884cf76828906, quorum=zk4-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net:2181,zk3-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net:2181,zk0-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net:2181, baseZNode=/hbase-unsecure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/region-in-transition/c26a119d9014497f11f7945ee5765707
2019-03-06 01:58:10,425 ERROR [RS_OPEN_REGION-wn0-mycluster:16020-9] zookeeper.ZooKeeperWatcher: regionserver:16020-0x16884cf76828906, quorum=zk4-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net:2181,zk3-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net:2181,zk0-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net:2181, baseZNode=/hbase-unsecure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/region-in-transition/d7107da93effb7d0476952020a45e93f
2019-03-06 01:58:10,427 ERROR [RS_OPEN_REGION-wn0-mycluster:16020-9] coordination.ZkOpenRegionCoordination: Failed transitioning node SALES:MYSALES_TRANS,2016-03-30_14396103271_227839,1549950999690.d7107da93effb7d0476952020a45e93f. from OPENING to FAILED_OPEN
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/region-in-transition/d7107da93effb7d0476952020a45e93f
2019-03-06 01:58:10,427 ERROR [RS_OPEN_REGION-wn0-mycluster:16020-10] coordination.ZkOpenRegionCoordination: Failed transitioning node SALES:MYSALES_TRANS,2016-07-22_16395391578_222034,1549409261112.c26a119d9014497f11f7945ee5765707. from OPENING to FAILED_OPEN
```
To recover:

- Get a list of the current WALs: `hadoop fs -ls -R /hbase/WALs/ > /tmp/wals.out`.
- Inspect `wals.out` to see whether there are empty (zero-length) files. For example, empty files from the WALs output:

  ```
  Line 110: -rw-rwx---+ 1 sshuser sshuser 0 2019-01-24 08:42 /hbase/WALs/wn1-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net,16020,1543299776294-splitting/wn1-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net,16020,1543299776294/wn1-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net%2C16020%2C1543299776294.default.1548319335505
  Line 490: -rw-r-----+ 1 sshuser sshuser 0 2019-03-08 04:21 /hbase/WALs/wn11-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net,16020,1552018852657-splitting/wn11-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net%2C16020%2C1552018852657..meta.1552018872799.meta
  Line 788: -rw-r-----+ 1 sshuser sshuser 0 2019-03-06 01:51 /hbase/WALs/wn2-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net,16020,1548362872417-splitting/wn2-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net,16020,1548362872417/wn2-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net%2C16020%2C1548362872417.default.1551837050362
  ```

- If there are too many splitting directories (directories matching `*-splitting`), the region server is probably failing because of them. In that case:
  - Stop HBase from the Ambari portal.
  - Rerun `hadoop fs -ls -R /hbase/WALs/ > /tmp/wals.out` to get a fresh list of WALs.
  - Move the `*-splitting` directories to a temporary folder, and then delete the `*-splitting` directories.
  - Run the `hbase zkcli` command to connect to the ZooKeeper shell.
  - Run `rmr /hbase-unsecure/splitWAL`.
  - Restart the HBase service.
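As a consolidated sketch of the steps above (the backup location `/tmp/wals-splitting-backup` and the `/hbase-unsecure` base znode are assumptions; adjust them for your cluster), with HBase stopped from Ambari first, the cleanup could look roughly like this:

```bash
# Take a fresh listing of the WAL directories.
hadoop fs -ls -R /hbase/WALs/ > /tmp/wals.out

# Show the *-splitting directories that are blocking the region servers.
hadoop fs -ls /hbase/WALs/ | grep -- '-splitting'

# Move the *-splitting directories to a temporary backup location
# (assumed path; any location outside /hbase/WALs works), then delete
# the backup once the cluster is healthy again.
hadoop fs -mkdir -p /tmp/wals-splitting-backup
hadoop fs -mv /hbase/WALs/*-splitting /tmp/wals-splitting-backup/

# Clear the split-WAL tasks from ZooKeeper, then restart HBase from Ambari.
hbase zkcli
#   Inside the zkcli shell:
#   rmr /hbase-unsecure/splitWAL
#   exit
```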