
Unnecessary failovers for HDFS namenode #36

Open
dmichal opened this issue Feb 27, 2019 · 0 comments
dmichal commented Feb 27, 2019

Hello,
I've noticed that in my cluster the webhdfs output plugin occasionally performs a failover between my HDFS namenodes, even though the namenodes themselves have not failed over. After some investigation I found that the exception actually triggering the failover in the plugin is: "Failed to connect to host hd4.local:50075, Net::ReadTimeout.", where hd4 is one of my datanodes, i.e. the plugin fails over even when the connection error comes from a datanode. This happens because the plugin merely searches the error string for the pattern "Failed to connect". Perhaps a more specific match should be performed, e.g. one that also checks for the namenode host and port? These unnecessary failovers cause significant problems for me, as they sometimes lead to HDFS lease issues.
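
For illustration, here is a minimal sketch of the stricter check suggested above, in Ruby (the plugin's language). The namenode address (hd1.local:50070), the constants, and the method names are assumptions made up for this example, not the plugin's actual identifiers; the only detail taken from the plugin's behavior is that it currently matches the bare pattern "Failed to connect":

```ruby
# Hypothetical namenode address -- substitute the active namenode's host:port.
NAMENODE_HOST = "hd1.local"
NAMENODE_PORT = 50070

# Current behavior (simplified): any "Failed to connect" triggers a failover,
# even when the unreachable endpoint is a datanode.
def failover_on_any_connect_error?(message)
  message.include?("Failed to connect")
end

# Suggested behavior: fail over only when the unreachable endpoint is the
# namenode itself, not a datanode such as hd4.local:50075.
def failover_on_namenode_error?(message)
  message.match?(/Failed to connect to host #{Regexp.escape(NAMENODE_HOST)}:#{NAMENODE_PORT}\b/)
end

datanode_error = "Failed to connect to host hd4.local:50075, Net::ReadTimeout."
namenode_error = "Failed to connect to host hd1.local:50070, Net::ReadTimeout."

puts failover_on_any_connect_error?(datanode_error)  # => true  (the unnecessary failover)
puts failover_on_namenode_error?(datanode_error)     # => false (datanode error, no failover)
puts failover_on_namenode_error?(namenode_error)     # => true  (genuine namenode failure)
```

Anchoring the match on the namenode's own host:port would let datanode read timeouts surface as ordinary retryable errors instead of triggering a namenode switch.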

Observed with Logstash 6.6.0 and HDFS 2.7.3; the Logstash and Hadoop machines run CentOS 7.
