8.2.2: Kibana with multiple ES hosts won't connect to remaining ES hosts after failure of active host #134301
Update: After a lot of testing I've discovered that Kibana sometimes does eventually fail over to the remaining nodes, but it can take a considerable amount of time - 1-2 minutes. Attempting to connect to Kibana during that time results in a 503 error in the browser. Netstat shows me that Kibana will make a single connection to one of the recovered nodes during this period, but will not fetch any license or auth info at that point. After a couple of minutes it will make additional connections to Elasticsearch, and license/auth starts to work correctly again. I've downgraded all the way to 7.17.4, where this behaviour doesn't occur (i.e., Kibana reconnects to a recovered node instantly if it needs to), but I don't have enough consistent results to say which 8.x version the problem actually starts occurring in.
Sounds like it's the same issue as described in elastic/elasticsearch-js#1714. /cc @pgayvallet
Pinging @elastic/kibana-core (Team:Core)
@azasypkin Hi there, that issue seems similar, but I've not seen this behaviour with Elasticsearch itself, and I'm not restarting master nodes when I see this issue. FWIW I have two ES nodes set up as load balancers (ingest & remote_cluster_client roles), which a pair of Kibana nodes connect to. These are separate from the three masters in the cluster, and also separate from the various dedicated data nodes. The ES cluster is 100% available throughout this process, as I had a watch + curl on _cat/nodes while testing and could see the restarted nodes leave and join the cluster.

Testing back on 7.17.4, I noticed in netstat that even if Kibana hadn't yet contacted a newly-restarted LB node, when the remaining node was removed it would instantly try to connect to the restarted node, i.e., would go from 0 connections to several, and the Kibana interface would be available the entire time. From 8.x this isn't the case, and it can take a considerable amount of time before Kibana retries nodes it considers to have failed, which results in license errors in the logs, and 503 errors in the interface. Happy to test further but I'm short on time for this until next week. Cheers :)
I didn't observe that during my initial testing, but I just tried again and I can confirm it. The issue is unrelated to having all nodes down at the same time; it's just that there is a delay before the client re-attempts to connect to a node it previously identified as down. So I can confirm this is the issue tracked in elastic/elasticsearch-js#1714. I will update that issue to reflect the new observations.
Yeah, the ES client library got a full rewrite in 8.0, especially in its transport layer. It's unsurprising that 8.0 is the version that introduced this problem.
Sounds good. 8.x is just my hand-wavey guess after some inconclusive testing. I originally thought this problem was introduced in 8.2.1, but it may have been my upgrade procedure masking that it has existed since 8.0.0.
That's actually only a guess; I tested myself with an 8.2.2 version of the client. I will try with version 8.0.0 ASAP to see if the problem was already present.
I confirm I reproduced it using the
Fixed by #134628 in |
Kibana version: 8.2.2
Elasticsearch version: 8.2.2
Server OS version: Ubuntu 18.04.6 LTS
Browser version: n/a
Browser OS version: n/a
Original install method (e.g. download page, yum, from source, etc.): apt
Describe the bug:
Since 8.2.1, Kibana instances configured with multiple `elasticsearch.hosts` entries do not fail over when the active host becomes unavailable. E.g., with `elasticsearch.hosts: ["es-host1:9200", "es-host2:9200"]`, stopping or restarting one of the ES hosts in the `elasticsearch.hosts` array causes `[ERROR][http.server.Kibana] NoLivingConnectionsError: There are no living connections`, as if Kibana were configured to only contact a single host. This does not resolve until the failed host returns, or Kibana is restarted, allowing it to connect to the remaining hosts in the `elasticsearch.hosts` array.

If Kibana is restarted while es-host1 is down, it will happily connect to es-host2 and operate fine. If es-host1 returns, and es-host2 is then stopped, Kibana will not attempt to reconnect to es-host1.

This is problematic because, in an HA environment, Elasticsearch cannot be restarted without causing a Kibana service outage. The expected behaviour below had been the case for a long time prior to 8.2.1, and the HA architecture of our environments depends on it working this way. This has only been happening since 8.2.1; versions prior to this failed over to remaining Elasticsearch nodes with no issue.

Steps to reproduce:

1. Configure Kibana with multiple `elasticsearch.hosts` IPs.
2. Restart one node in the `elasticsearch.hosts` array, wait for it to rejoin the cluster, then restart the remaining node.

Expected behavior:

Kibana should continue to operate normally after an ES host failure, as long as one of the hosts in the `elasticsearch.hosts` array is still available.

Screenshots (if relevant):
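For reference, the relevant part of a kibana.yml for this reproduction might look like the following; the hostnames are the placeholders from the report, and everything else about the configuration is assumed.

```yaml
# Two coordinating/LB Elasticsearch nodes; Kibana is expected to fail
# over between them when one becomes unavailable.
elasticsearch.hosts:
  - "http://es-host1:9200"
  - "http://es-host2:9200"
```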
n/a
Errors in browser console (if relevant):
n/a
Provide logs and/or server output (if relevant):
Any additional context: