
8.2.2: Kibana with multiple ES hosts won't connect to remaining ES hosts after failure of active host #134301

Closed
ceeeekay opened this issue Jun 14, 2022 · 9 comments
Labels
bug Fixes for quality problems that affect the customer experience Feature:elasticsearch Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@ceeeekay commented Jun 14, 2022

Kibana version: 8.2.2

Elasticsearch version: 8.2.2

Server OS version: Ubuntu 18.04.6 LTS

Browser version: n/a

Browser OS version: n/a

Original install method (e.g. download page, yum, from source, etc.): apt

Describe the bug: Since 8.2.1, Kibana instances configured with multiple elasticsearch.hosts entries do not fail over when the active host becomes unavailable. For example, with elasticsearch.hosts: ["es-host1:9200", "es-host2:9200"], stopping or restarting one of the ES hosts in the elasticsearch.hosts array causes [ERROR][http.server.Kibana] NoLivingConnectionsError: There are no living connections, as if Kibana were configured to contact only a single host. This does not resolve until either the failed host returns, or Kibana is restarted, allowing it to connect to the remaining hosts in the elasticsearch.hosts array.
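For reference, the corresponding kibana.yml fragment (the http:// scheme is added here for illustration; the original report lists the hosts without a scheme):

```yaml
# kibana.yml -- Kibana is expected to fail over to the second
# host when the first becomes unavailable.
elasticsearch.hosts: ["http://es-host1:9200", "http://es-host2:9200"]
```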

If Kibana is restarted while es-host1 is down, it will happily connect to es-host2 and operate fine. If es-host1 returns, and es-host2 is then stopped, Kibana will not attempt to reconnect to es-host1.

This is problematic because, in an HA environment, Elasticsearch cannot be restarted without causing a Kibana service outage. The expected behaviour below had been the case for a long time prior to 8.2.1, and the HA architecture of our environments depends on it working this way.

This has only been happening since 8.2.1, and versions prior to this failed over to remaining Elasticsearch nodes with no issue.

Steps to reproduce:

  1. Configure Kibana 8.2.2 to use two elasticsearch.hosts IPs.
  2. Restart one of the Elasticsearch nodes in Kibana's elasticsearch.hosts array, wait for it to rejoin the cluster, then restart the remaining node.
  3. Observe Kibana logs, and attempt to access Kibana interface.

Expected behavior:
Kibana should continue to operate normally after an ES host failure, as long as one of the hosts in the elasticsearch.hosts array is still available.

Screenshots (if relevant):
n/a

Errors in browser console (if relevant):
n/a

Provide logs and/or server output (if relevant):

[2022-06-14T16:42:52.771+12:00][ERROR][plugins.security.session.index] Failed to retrieve session value: There are no living connections
[2022-06-14T16:42:52.772+12:00][ERROR][http.server.Kibana] NoLivingConnectionsError: There are no living connections
    at KibanaTransport.request (/usr/share/kibana/node_modules/@elastic/transport/lib/Transport.js:408:27)
    at KibanaTransport.wrappedRequest (/usr/share/kibana/node_modules/elastic-apm-node/lib/instrumentation/modules/@elastic/elasticsearch.js:117:28)
    at KibanaTransport.request (/usr/share/kibana/src/core/server/elasticsearch/client/create_transport.js:58:28)
    at ClientTraced.GetApi [as get] (/usr/share/kibana/node_modules/@elastic/elasticsearch/lib/api/api/get.js:36:33)
    at SessionIndex.get (/usr/share/kibana/x-pack/plugins/security/server/session_management/session_index.js:129:50)
    at Session.get (/usr/share/kibana/x-pack/plugins/security/server/session_management/session.js:89:63)
    at Authenticator.getSessionValue (/usr/share/kibana/x-pack/plugins/security/server/authentication/authenticator.js:461:34)
    at Authenticator.authenticate (/usr/share/kibana/x-pack/plugins/security/server/authentication/authenticator.js:259:34)
    at /usr/share/kibana/x-pack/plugins/security/server/authentication/authentication_service.js:87:36
    at Object.interceptAuth [as authenticate] (/usr/share/kibana/src/core/server/http/lifecycle/auth.js:90:22)
    at exports.Manager.execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/toolkit.js:60:28)
    at module.exports.internals.Auth._authenticate (/usr/share/kibana/node_modules/@hapi/hapi/lib/auth.js:273:30)
    at Request._lifecycle (/usr/share/kibana/node_modules/@hapi/hapi/lib/request.js:371:32)
    at Request._execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/request.js:281:9)
[2022-06-14T16:42:52.783+12:00][WARN ][plugins.licensing] License information could not be obtained from Elasticsearch due to NoLivingConnectionsError: There are no living connections error
[...]
[2022-06-14T16:42:52.811+12:00][ERROR][plugins.security.authentication] License is not available, authentication is not possible.

Any additional context:

@ceeeekay ceeeekay added the bug Fixes for quality problems that affect the customer experience label Jun 14, 2022
@botelastic botelastic bot added the needs-team Issues missing a team label label Jun 14, 2022
@ceeeekay ceeeekay changed the title 8.2.2: Kibana with multiple ES hosts refuses to connect to remaining ES hosts after failure of active host 8.2.2: Kibana with multiple ES hosts won't connect to remaining ES hosts after failure of active host Jun 14, 2022
@ceeeekay (Author) commented Jun 15, 2022

Update: After a lot of testing, I've discovered that Kibana sometimes does eventually fail over to the remaining nodes, but it can take a considerable amount of time (1-2 minutes). Attempting to connect to Kibana during that window results in a 503 error in the browser.

Netstat shows that Kibana will make a single connection to one of the recovered nodes during this period, but will not fetch any license or auth info at that point. After a couple of minutes it makes additional connections to Elasticsearch, and license/auth starts working correctly again.
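The 1-2 minute gap looks like a connection pool that marks a node dead and only allows retries after an exponentially growing backoff window. The sketch below is a simplified, hypothetical model of that kind of timed-resurrect logic (class name, field names, and the 60 s constant are invented for illustration; this is not the actual @elastic/transport implementation):

```javascript
// Simplified model of timed dead-node resurrection with exponential backoff.
// All names and constants are illustrative, not actual @elastic/transport values.
class SimplePool {
  constructor(nodes, { resurrectTimeout = 60_000 } = {}) {
    this.resurrectTimeout = resurrectTimeout;
    this.connections = nodes.map((url) => ({
      url,
      alive: true,
      deadCount: 0,    // consecutive failures
      resurrectAt: 0,  // epoch ms after which the node may be retried
    }));
  }

  // Called when a request to `url` fails at the transport level.
  markDead(url, now = Date.now()) {
    const conn = this.connections.find((c) => c.url === url);
    conn.alive = false;
    conn.deadCount += 1;
    // The backoff window doubles with each consecutive failure.
    conn.resurrectAt = now + this.resurrectTimeout * 2 ** (conn.deadCount - 1);
  }

  // Called when a request to `url` succeeds again.
  markAlive(url) {
    const conn = this.connections.find((c) => c.url === url);
    conn.alive = true;
    conn.deadCount = 0;
    conn.resurrectAt = 0;
  }

  // Returns the nodes a request is allowed to use at time `now`.
  selectable(now = Date.now()) {
    return this.connections.filter((c) => c.alive || now >= c.resurrectAt);
  }
}

const pool = new SimplePool(['http://es-host1:9200', 'http://es-host2:9200']);
const t0 = Date.now();
pool.markDead('http://es-host1:9200', t0);
// Immediately after the failure only es-host2 is eligible; es-host1 only
// becomes eligible again once the backoff window has elapsed.
console.log(pool.selectable(t0).length);          // 1
console.log(pool.selectable(t0 + 61_000).length); // 2
```

Under a model like this, a node that failed repeatedly can stay ineligible for minutes at a time, which would match the observed window where Kibana reports no living connections even though a host has already recovered.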

I've downgraded all the way to 7.17.4, where this behaviour doesn't occur (i.e., Kibana reconnects to a recovered node instantly if it needs to), but I don't have enough consistent results to say which 8.x version the problem actually starts occurring in.

@azasypkin (Member) commented:

Sounds like it's the same issue as described in elastic/elasticsearch-js#1714.

/cc @pgayvallet

@azasypkin azasypkin added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Jun 16, 2022
@elasticmachine (Contributor) commented:

Pinging @elastic/kibana-core (Team:Core)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Jun 16, 2022
@ceeeekay (Author) commented:

@azasypkin Hi there, that issue seems similar, but I've not seen this behaviour with Elasticsearch itself, and I'm not restarting master nodes when I see this issue.

FWIW I have two ES nodes set up as load balancers (ingest & remote_cluster_client roles), which a pair of Kibana nodes connect to. These are separate from the three masters in the cluster, and also separate from the various dedicated data nodes.

The ES cluster is 100% available throughout this process as I had a watch + curl on _cat/nodes while testing and could see the restarted nodes leave and join the cluster.

Testing back on 7.17.4, I noticed in netstat that even if Kibana hadn't yet contacted a newly-restarted LB node, it would instantly try to connect to that node when the remaining node was removed, i.e., it would go from 0 connections to several, and the Kibana interface would be available the entire time. From 8.x onwards this isn't the case: it can take a considerable amount of time before Kibana retries nodes it considers failed, which results in license errors in the logs and 503 errors in the interface.
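The 7.17.4 behaviour described above (going from 0 connections straight to several the moment the last live node disappears) is what you would get from a pool that force-resurrects a dead node whenever no living connections remain, instead of failing with NoLivingConnectionsError. A hypothetical sketch of that difference (illustrative only; function and field names are invented, not the actual client code):

```javascript
// Illustrative contrast: when every node is dead, a "forced resurrect" pool
// immediately revives one node rather than reporting no living connections.
function pickConnection(connections, now, forceResurrect) {
  // A node is usable if it is alive or its retry deadline has passed.
  const usable = connections.filter((c) => c.alive || now >= c.resurrectAt);
  if (usable.length > 0) return usable[0];
  if (!forceResurrect) return null; // caller sees NoLivingConnectionsError
  // Force-resurrect the dead node whose retry deadline is closest.
  const candidate = [...connections].sort(
    (a, b) => a.resurrectAt - b.resurrectAt
  )[0];
  candidate.alive = true;
  return candidate;
}

const conns = [
  { url: 'http://es-host1:9200', alive: false, resurrectAt: 120_000 },
  { url: 'http://es-host2:9200', alive: false, resurrectAt: 90_000 },
];
// Without forced resurrection (the observed 8.x behaviour): nothing is usable.
console.log(pickConnection(conns.map((c) => ({ ...c })), 0, false)); // null
// With forced resurrection (the observed 7.x behaviour): a node is retried at once.
console.log(pickConnection(conns.map((c) => ({ ...c })), 0, true).url);
```

This would explain why 7.17.4 recovers instantly when the last live node drops, while 8.x waits out the backoff window first.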

Happy to test further but I'm short on time for this until next week.

Cheers :)

@pgayvallet (Contributor) commented:

> If Kibana is restarted while es-host1 is down, it will happily connect to es-host2 and operate fine. If es-host1 returns, and es-host2 is then stopped, Kibana will not attempt to reconnect to es-host1.

I didn't observe that during my initial testing, but I just tried again and I can confirm it. The issue is unrelated to having all nodes down at the same time; there is simply a delay before the client re-attempts to connect to a node it previously identified as down.

So I can confirm this is the issue tracked in elastic/elasticsearch-js#1714. I will update it to reflect the new observations, though.

> Testing back on 7.17.4, I noticed [...]. From 8.x this isn't the case

Yeah, the ES client library got a full rewrite in 8.0, especially in its transport layer. It's unsurprising that 8.0 is the version that introduced this problem.

@ceeeekay (Author) commented:

Sounds good. 8.x is just my hand-wavy guess after some inconclusive testing. I originally thought this problem was introduced in 8.2.1, but my upgrade procedure may have been masking it since 8.0.0.

@pgayvallet (Contributor) commented Jun 16, 2022

That's actually only a guess; I tested with an 8.2.2 version of the client myself. I will try version 8.0.0 ASAP to see whether the problem was already present.

@pgayvallet (Contributor) commented:

I can confirm I reproduced it using v8.0.0 of the client, and that the issue is not present in 7.17.0.

@pgayvallet (Contributor) commented:

Fixed by #134628 in 8.2.4, 8.3.1 and 8.4.0+
