title | description | services | documentationcenter | author | manager | ms.service | ms.custom | ms.devlang | ms.topic | ms.tgt_pltfrm | ms.workload | ms.date | ms.author |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HBase Region Server fails to Restart | Microsoft Docs |
Troubleshooting the cause of failure of HBase regionserver restart operation. |
hdinsight |
nitinver |
ashitg |
hdinsight |
hdinsightactive |
na |
article |
na |
big-data |
04/11/2017 |
nitinver |
This situation can usually be prevented by following best practices. Pause heavy workload activity before restarting HBase region servers. If the application continues to connect to the region servers while shutdown is in progress, the restart operation slows down by several minutes. It is also advisable to flush all tables first; see HDInsight HBase: How to Improve HBase cluster restart time by Flushing tables for reference.
When a user initiates a restart of the HBase region servers from the Ambari UI, the region servers go down immediately but do not come back up for a long time.
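As a rough illustration of flushing tables before a restart, the sketch below builds one HBase shell `flush` command per table and leaves it to the operator to pipe the result into the HBase shell. The function name `build_flush_commands` and the table names in the usage comment are hypothetical, and the non-interactive `-n` flag is assumed to be supported by the installed HBase shell.

```shell
#!/usr/bin/env bash
# Print one HBase shell "flush" command per table name passed as an argument.
# The function only generates text; it does not contact the cluster itself.
build_flush_commands() {
  local table
  for table in "$@"; do
    printf "flush '%s'\n" "$table"
  done
}

# Example usage on a node with the HBase client installed
# (table names are placeholders, replace them with your own):
#   build_flush_commands orders events | hbase shell -n
```

Generating the commands separately from running them makes it easy to review what will be flushed before anything is sent to the cluster.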
Below is what happens behind the scenes:
- The Ambari agent sends a stop request to the region server.
- The Ambari agent then waits 30 seconds for the region server to shut down gracefully.
- If the customer's application continues to connect to the region server, the server does not shut down immediately, and the 30-second timeout expires before the shutdown completes.
- After the 30 seconds expire, the Ambari agent sends a force kill (kill -9) to the region server. This can be observed in the ambari-agent log (in the /var/log/ directory of the respective worker node), as shown below:
2017-03-21 13:22:09,171 - Execute['/usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh --config /usr/hdp/current/hbase-regionserver/conf stop regionserver'] {'only_if': 'ambari-sudo.sh -H -E test -f /var/run/hbase/hbase-hbase-regionserver.pid && ps -p `ambari-sudo.sh -H -E cat /var/run/hbase/hbase-hbase-regionserver.pid` >/dev/null 2>&1', 'on_timeout': '! ( ambari-sudo.sh -H -E test -f /var/run/hbase/hbase-hbase-regionserver.pid && ps -p `ambari-sudo.sh -H -E cat /var/run/hbase/hbase-hbase-regionserver.pid` >/dev/null 2>&1 ) || ambari-sudo.sh -H -E kill -9 `ambari-sudo.sh -H -E cat /var/run/hbase/hbase-hbase-regionserver.pid`', 'timeout': 30, 'user': 'hbase'}
2017-03-21 13:22:40,268 - Executing '! ( ambari-sudo.sh -H -E test -f /var/run/hbase/hbase-hbase-regionserver.pid && ps -p `ambari-sudo.sh -H -E cat /var/run/hbase/hbase-hbase-regionserver.pid` > /dev/null 2>&1 ) || ambari-sudo.sh -H -E kill -9 `ambari-sudo.sh -H -E cat /var/run/hbase/hbase-hbase-regionserver.pid`'. Reason: Execution of 'ambari-sudo.sh su hbase -l -s /bin/bash -c 'export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/var/lib/ambari-agent ; /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh --config /usr/hdp/current/hbase-regionserver/conf stop regionserver was killed due timeout after 30 seconds
2017-03-21 13:22:40,285 - File['/var/run/hbase/hbase-hbase-regionserver.pid'] {'action': ['delete']}
2017-03-21 13:22:40,285 - Deleting File['/var/run/hbase/hbase-hbase-regionserver.pid']
- Because of this abrupt shutdown, the port associated with the region server process may not be released even though the process itself has been killed. This eventually leads to a java.net.BindException (Address already in use) when the region server starts, as shown in the logs below. This can be verified in region-server.log in the /var/log/hbase directory on the worker nodes where the region server fails to start.
2017-03-21 13:25:47,061 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: Failed construction of Regionserver: class org.apache.hadoop.hbase.regionserver.HRegionServer
    at org.apache.hadoop.hbase.regionserver.HRegionServer.constructRegionServer(HRegionServer.java:2636)
    at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:64)
    at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2651)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.constructRegionServer(HRegionServer.java:2634)
    ... 5 more
Caused by: java.net.BindException: Problem binding to /10.2.0.4:16020 : Address already in use
    at org.apache.hadoop.hbase.ipc.RpcServer.bind(RpcServer.java:2497)
    at org.apache.hadoop.hbase.ipc.RpcServer$Listener.<init>(RpcServer.java:580)
    at org.apache.hadoop.hbase.ipc.RpcServer.<init>(RpcServer.java:1982)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.<init>(RSRpcServices.java:863)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.createRpcServices(HRegionServer.java:632)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:532)
    ... 10 more
Caused by: java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:463)
    at sun.nio.ch.Net.bind(Net.java:455)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at org.apache.hadoop.hbase.ipc.RpcServer.bind(RpcServer.java:2495)
    ... 15 more
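To check whether the port named in the stack trace (16020 in this log) is still bound on a worker node, one option is to inspect the listening TCP sockets. The sketch below is only an illustration: it assumes the `ss` utility is installed on the node, and the helper name `port_is_listening` is made up for this example. It reads `ss -ltn` output from standard input, so it can also be run against saved output.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) if any line read from stdin has a local address,
# the fourth column of `ss -ltn` output, that ends in ":PORT".
port_is_listening() {
  local port="$1"
  awk -v port="$port" '
    $4 ~ (":" port "$") { found = 1 }
    END { exit !found }
  '
}

# Example usage on a worker node:
#   ss -ltn | port_is_listening 16020 && echo "port 16020 is still bound"
# Running `sudo ss -ltnp` additionally lists the owning process for each socket.
```

If the port is still reported as bound after the region server process has been killed, starting the region server again is likely to hit the same BindException.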
If this happens, try the following workarounds:
- Reduce the load on the HBase region servers before initiating a restart.
- Alternatively (if the step above doesn't help), manually restart the region servers on the worker nodes using the following commands:
sudo su - hbase -c "/usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh stop regionserver"
sudo su - hbase -c "/usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh start regionserver"
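When restarting manually, it can help to confirm that the old region server process has really exited before issuing the start command. The following is a minimal sketch under a few assumptions: the helper name `wait_for_exit` is hypothetical, the check relies on the Linux `/proc` filesystem, the PID file path is the one that appears in the ambari-agent log above, and the 300-second limit is an arbitrary choice. The `hbase-daemon.sh` script may already wait for the process on its own, in which case this is only an extra guard.

```shell
#!/usr/bin/env bash
# Wait up to TIMEOUT seconds for the process with the given PID to exit.
# Returns 0 once /proc/PID no longer exists, 1 if it still exists at the deadline.
wait_for_exit() {
  local pid="$1" timeout="$2" waited=0
  while [ -d "/proc/$pid" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Example usage on a worker node (read the PID before stopping,
# because the PID file is deleted as part of the stop):
#   pid="$(sudo cat /var/run/hbase/hbase-hbase-regionserver.pid)"
#   sudo su - hbase -c "/usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh stop regionserver"
#   wait_for_exit "$pid" 300 && \
#     sudo su - hbase -c "/usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh start regionserver"
```

Checking `/proc` rather than signalling the process keeps the helper usable from an account other than `hbase`.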