---
title: HBase Hbck command reporting holes in the chain of regions | Microsoft Docs
description: Troubleshooting the cause of holes in the chain of regions.
services: hdinsight
documentationcenter: ''
author: nitinver
manager: ashitg
ms.service: hdinsight
ms.custom: hdinsightactive
ms.devlang: na
ms.topic: article
ms.tgt_pltfrm: na
ms.workload: big-data
ms.date: 04/11/2017
ms.author: nitinver
---
## Error "Passed in file status is for something other than a regular file"

Regions were offline, and running `hbase hbck` reported holes in the region chain. The HBase Master UI showed a region stuck in transition or a WAL split that had been running for a long time. The region server logs showed the same WAL split failing repeatedly with the following error:
```
2017-05-12 11:55:01,993 ERROR [RS_LOG_REPLAY_OPS-10.0.0.14:16020-1] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
java.lang.IllegalArgumentException: passed in file status is for something other than a regular file.
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:271)
        at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:235)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:104)
        at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
```
Checking the `/hbase/WALs/xxxx-splitting` directories, I noticed that only one folder still contained a WAL file; the others contained only a meta file, which means their WAL splits had already finished. Running `hdfs dfs -ls` on that remaining file showed a length of 0, which is why the region server did not treat it as a valid WAL file. Most likely something went wrong while the WAL file was being created.

The mitigation was to delete this zero-length file and restart the active HMaster, so that HBase skipped splitting this WAL file and the offline regions came back online.
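As a rough sketch (the paths below are placeholders; the real splitting directory and file names come from your own `hdfs dfs -ls` output), finding and removing the zero-length WAL file can look like this:

```bash
# List everything under /hbase/WALs and keep only zero-length files.
# In `hdfs dfs -ls` output, the fifth column is the file size and the
# eighth column is the path.
hdfs dfs -ls -R /hbase/WALs | awk '$5 == 0 {print $8}'

# After confirming the file really is a stray zero-length WAL, remove it.
# (Placeholder path; use the path reported by the command above.)
hdfs dfs -rm /hbase/WALs/xxxx-splitting/<zero-length-wal-file>

# Finally, restart the active HMaster (for example, from the Ambari UI) so
# HBase skips splitting this WAL and reassigns the offline regions.
```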
## Multiple unassigned regions or holes in the chain of regions

It is a common issue to see multiple regions unassigned, or holes in the chain of regions, when you run the `hbase hbck` command. In the HBase Master UI, the region count looks unbalanced across the region servers, and running `hbase hbck` afterwards reports holes in the region chain. Fix the assignments first, because the holes may be caused by those offline regions.
Follow these steps to bring the unassigned regions back to a normal state:

1. Sign in to the HDInsight HBase cluster by using SSH.
2. Run the `hbase zkcli` command to connect to the ZooKeeper shell.
3. Run the `rmr /hbase/regions-in-transition` or `rmr /hbase-unsecure/regions-in-transition` command.
4. Exit the `hbase zkcli` shell by using the `exit` command.
5. Open the Ambari UI and restart the Active HBase Master service.
6. Run the `hbase hbck` command again (without any further options).

Check the output of the command in step 6 and make sure that all regions are assigned. A sample session for these steps is shown below.
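Put together, the commands might look like the following sketch (the SSH endpoint is a placeholder, and the znode path depends on whether your cluster uses the `/hbase` or `/hbase-unsecure` base znode):

```bash
# Step 1: SSH to the cluster (placeholder endpoint; use your cluster's SSH host).
ssh sshuser@mycluster-ssh.azurehdinsight.net

# Steps 2-4: remove the regions-in-transition znode from ZooKeeper.
hbase zkcli
#   Inside the zkcli shell, run one of the following, then exit:
#   rmr /hbase/regions-in-transition
#   rmr /hbase-unsecure/regions-in-transition
#   exit

# Step 5: restart the Active HBase Master service from the Ambari UI.

# Step 6: run hbck again and review the output to confirm all regions are assigned.
hbase hbck
```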
## Region server fails to start because of multiple splitting WAL directories

We have also seen incidents where the region server fails to start because of multiple splitting WAL directories, with errors like the following:
```
2019-03-06 01:51:06,045 ERROR [RS_LOG_REPLAY_OPS-wn0-mycluster:16020-0] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
2019-03-06 01:51:13,129 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
2019-03-06 01:58:04,847 ERROR [RS_LOG_REPLAY_OPS-wn0-mycluster:16020-0] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
2019-03-06 01:58:10,422 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/region-in-transition/c26a119d9014497f11f7945ee5765707
2019-03-06 01:58:04,847 ERROR [RS_LOG_REPLAY_OPS-wn0-mycluster:16020-0] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
2019-03-06 01:58:06,828 ERROR [RS_OPEN_REGION-wn0-mycluster:16020-2] handler.OpenRegionHandler: Failed open of region=SALES:MYSALES_TRANS,2016-04-16_14940401167_276107,1549148647976.3b69ed680668388f7c5723f491edc1c7., starting to roll back the global memstore size.
2019-03-06 01:58:08,672 ERROR [RS_CLOSE_META-wn0-mycluster:16020-0] regionserver.HRegion: Memstore size is 490232
2019-03-06 01:58:10,056 ERROR [RS_OPEN_REGION-wn0-mycluster:16020-8] regionserver.HRegion: Could not initialize all stores for the region=SALES:MYSALES_TRANS,2016-07-26_16381202590_129836,1549502700390.348bd64e19335f27a9f50f0b4bba5802.
2019-03-06 01:58:10,056 ERROR [RS_OPEN_REGION-wn0-mycluster:16020-1] regionserver.HRegion: Could not initialize all stores for the region=SALES:MYSALES_TRANS,2016-06-20_15881302257_70096,1549355619258.8eadc37b6743124b2b1c228fbc7fcd69.
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/region-in-transition/d7107da93effb7d0476952020a45e93f
2019-03-06 01:58:10,424 ERROR [RS_OPEN_REGION-wn0-mycluster:16020-10] zookeeper.ZooKeeperWatcher: regionserver:16020-0x16884cf76828906, quorum=zk4-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net:2181,zk3-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net:2181,zk0-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net:2181, baseZNode=/hbase-unsecure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/region-in-transition/c26a119d9014497f11f7945ee5765707
2019-03-06 01:58:10,425 ERROR [RS_OPEN_REGION-wn0-mycluster:16020-9] zookeeper.ZooKeeperWatcher: regionserver:16020-0x16884cf76828906, quorum=zk4-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net:2181,zk3-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net:2181,zk0-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net:2181, baseZNode=/hbase-unsecure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/region-in-transition/d7107da93effb7d0476952020a45e93f
2019-03-06 01:58:10,427 ERROR [RS_OPEN_REGION-wn0-mycluster:16020-9] coordination.ZkOpenRegionCoordination: Failed transitioning node SALES:MYSALES_TRANS,2016-03-30_14396103271_227839,1549950999690.d7107da93effb7d0476952020a45e93f. from OPENING to FAILED_OPEN
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/region-in-transition/d7107da93effb7d0476952020a45e93f
2019-03-06 01:58:10,427 ERROR [RS_OPEN_REGION-wn0-mycluster:16020-10] coordination.ZkOpenRegionCoordination: Failed transitioning node SALES:MYSALES_TRANS,2016-07-22_16395391578_222034,1549409261112.c26a119d9014497f11f7945ee5765707. from OPENING to FAILED_OPEN
```
To recover:

- Get a list of the current WALs: `hadoop fs -ls -R /hbase/WALs/ > /tmp/wals.out`.
- Inspect `wals.out` to see whether there are empty (zero-length) files. For example, empty files from the WALs output:

  ```
  Line 110: -rw-rwx---+ 1 sshuser sshuser 0 2019-01-24 08:42 /hbase/WALs/wn1-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net,16020,1543299776294-splitting/wn1-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net,16020,1543299776294/wn1-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net%2C16020%2C1543299776294.default.1548319335505
  Line 490: -rw-r-----+ 1 sshuser sshuser 0 2019-03-08 04:21 /hbase/WALs/wn11-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net,16020,1552018852657-splitting/wn11-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net%2C16020%2C1552018852657..meta.1552018872799.meta
  Line 788: -rw-r-----+ 1 sshuser sshuser 0 2019-03-06 01:51 /hbase/WALs/wn2-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net,16020,1548362872417-splitting/wn2-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net,16020,1548362872417/wn2-mycluster.svtv1pmtfjtunof0lhxfjaxxoe.cx.internal.cloudapp.net%2C16020%2C1548362872417.default.1551837050362
  ```

- If there are too many splitting directories (directories matching `*-splitting`), the region server is probably failing because of them. In that case:
  - Stop HBase from the Ambari portal.
  - Rerun `hadoop fs -ls -R /hbase/WALs/ > /tmp/wals.out` to get a fresh list of WALs.
  - Move the `*-splitting` directories to a temporary folder, and then delete the `*-splitting` directories.
  - Run the `hbase zkcli` command to connect to the ZooKeeper shell.
  - Run `rmr /hbase-unsecure/splitWAL`.
  - Restart the HBase service.
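As a consolidated sketch of the steps above (the backup location `/tmp/wals-splitting-backup` and the `/hbase-unsecure` base znode are assumptions; adjust them for your cluster), with HBase stopped from Ambari first, the cleanup could look roughly like this:

```bash
# Take a fresh listing of the WAL directories.
hadoop fs -ls -R /hbase/WALs/ > /tmp/wals.out

# Show the *-splitting directories that are blocking the region servers.
hadoop fs -ls /hbase/WALs/ | grep -- '-splitting'

# Move the *-splitting directories to a temporary backup location
# (assumed path; any location outside /hbase/WALs works), then delete
# the backup once the cluster is healthy again.
hadoop fs -mkdir -p /tmp/wals-splitting-backup
hadoop fs -mv /hbase/WALs/*-splitting /tmp/wals-splitting-backup/

# Clear the split-WAL tasks from ZooKeeper, then restart HBase from Ambari.
hbase zkcli
#   Inside the zkcli shell:
#   rmr /hbase-unsecure/splitWAL
#   exit
```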