
8.2.2: Kibana with multiple ES hosts won't connect to remaining ES hosts after failure of active host #134301

Closed
ceeeekay opened this issue Jun 14, 2022 · 9 comments
Labels
bug Fixes for quality problems that affect the customer experience Feature:elasticsearch Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@ceeeekay commented Jun 14, 2022

Kibana version: 8.2.2

Elasticsearch version: 8.2.2

Server OS version: Ubuntu 18.04.6 LTS

Browser version: n/a

Browser OS version: n/a

Original install method (e.g. download page, yum, from source, etc.): apt

Describe the bug: Since 8.2.1, Kibana instances configured with multiple elasticsearch.hosts entries do not fail over when the active host becomes unavailable. For example, with elasticsearch.hosts: ["es-host1:9200", "es-host2:9200"], stopping or restarting one of the ES hosts in the elasticsearch.hosts array causes [ERROR][http.server.Kibana] NoLivingConnectionsError: There are no living connections, as if Kibana were configured to contact only a single host. This does not resolve until either the failed host returns, or Kibana is restarted, allowing it to connect to the remaining hosts in the elasticsearch.hosts array.
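For reference, the corresponding kibana.yml fragment (the http:// scheme is added here for illustration; the original report lists the hosts without a scheme):

```yaml
# kibana.yml -- Kibana is expected to fail over to the second
# host when the first becomes unavailable.
elasticsearch.hosts: ["http://es-host1:9200", "http://es-host2:9200"]
```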

If Kibana is restarted while es-host1 is down, it will happily connect to es-host2 and operate fine. If es-host1 returns, and es-host2 is then stopped, Kibana will not attempt to reconnect to es-host1.

This is problematic because, in an HA environment, Elasticsearch cannot be restarted without causing a Kibana service outage. The expected behaviour below had been the case for a long time prior to 8.2.1, and the HA architecture of our environments depends on it working this way.

This has only been happening since 8.2.1, and versions prior to this failed over to remaining Elasticsearch nodes with no issue.

Steps to reproduce:

  1. Configure Kibana 8.2.2 to use two elasticsearch.hosts IPs.
  2. Restart one of the Elasticsearch nodes in Kibana's elasticsearch.hosts array, wait for it to rejoin the cluster, then restart the remaining node.
  3. Observe Kibana logs, and attempt to access Kibana interface.

Expected behavior:
Kibana should continue to operate normally after an ES host failure, as long as one of the hosts in the elasticsearch.hosts array is still available.

Screenshots (if relevant):
n/a

Errors in browser console (if relevant):
n/a

Provide logs and/or server output (if relevant):

[2022-06-14T16:42:52.771+12:00][ERROR][plugins.security.session.index] Failed to retrieve session value: There are no living connections
[2022-06-14T16:42:52.772+12:00][ERROR][http.server.Kibana] NoLivingConnectionsError: There are no living connections
    at KibanaTransport.request (/usr/share/kibana/node_modules/@elastic/transport/lib/Transport.js:408:27)
    at KibanaTransport.wrappedRequest (/usr/share/kibana/node_modules/elastic-apm-node/lib/instrumentation/modules/@elastic/elasticsearch.js:117:28)
    at KibanaTransport.request (/usr/share/kibana/src/core/server/elasticsearch/client/create_transport.js:58:28)
    at ClientTraced.GetApi [as get] (/usr/share/kibana/node_modules/@elastic/elasticsearch/lib/api/api/get.js:36:33)
    at SessionIndex.get (/usr/share/kibana/x-pack/plugins/security/server/session_management/session_index.js:129:50)
    at Session.get (/usr/share/kibana/x-pack/plugins/security/server/session_management/session.js:89:63)
    at Authenticator.getSessionValue (/usr/share/kibana/x-pack/plugins/security/server/authentication/authenticator.js:461:34)
    at Authenticator.authenticate (/usr/share/kibana/x-pack/plugins/security/server/authentication/authenticator.js:259:34)
    at /usr/share/kibana/x-pack/plugins/security/server/authentication/authentication_service.js:87:36
    at Object.interceptAuth [as authenticate] (/usr/share/kibana/src/core/server/http/lifecycle/auth.js:90:22)
    at exports.Manager.execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/toolkit.js:60:28)
    at module.exports.internals.Auth._authenticate (/usr/share/kibana/node_modules/@hapi/hapi/lib/auth.js:273:30)
    at Request._lifecycle (/usr/share/kibana/node_modules/@hapi/hapi/lib/request.js:371:32)
    at Request._execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/request.js:281:9)
[2022-06-14T16:42:52.783+12:00][WARN ][plugins.licensing] License information could not be obtained from Elasticsearch due to NoLivingConnectionsError: There are no living connections error
[...]
[2022-06-14T16:42:52.811+12:00][ERROR][plugins.security.authentication] License is not available, authentication is not possible.

Any additional context:

@ceeeekay ceeeekay added the bug Fixes for quality problems that affect the customer experience label Jun 14, 2022
@botelastic botelastic bot added the needs-team Issues missing a team label label Jun 14, 2022
@ceeeekay ceeeekay changed the title 8.2.2: Kibana with multiple ES hosts refuses to connect to remaining ES hosts after failure of active host 8.2.2: Kibana with multiple ES hosts won't connect to remaining ES hosts after failure of active host Jun 14, 2022
@ceeeekay (Author) commented Jun 15, 2022

Update: After a lot of testing, I've discovered that Kibana sometimes does eventually fail over to the remaining nodes, but it can take a considerable amount of time (1-2 minutes). Attempting to connect to Kibana during that window results in a 503 error in the browser.

Netstat shows that Kibana will make a single connection to one of the recovered nodes during this period, but will not fetch any license or auth info at that point. After a couple of minutes it makes additional connections to Elasticsearch, and license/auth starts working correctly again.
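The 1-2 minute gap looks like a connection pool that marks a node dead and only allows retries after an exponentially growing backoff window. The sketch below is a simplified, hypothetical model of that kind of timed-resurrect logic (class name, field names, and the 60 s constant are invented for illustration; this is not the actual @elastic/transport implementation):

```javascript
// Simplified model of timed dead-node resurrection with exponential backoff.
// All names and constants are illustrative, not actual @elastic/transport values.
class SimplePool {
  constructor(nodes, { resurrectTimeout = 60_000 } = {}) {
    this.resurrectTimeout = resurrectTimeout;
    this.connections = nodes.map((url) => ({
      url,
      alive: true,
      deadCount: 0,    // consecutive failures
      resurrectAt: 0,  // epoch ms after which the node may be retried
    }));
  }

  // Called when a request to `url` fails at the transport level.
  markDead(url, now = Date.now()) {
    const conn = this.connections.find((c) => c.url === url);
    conn.alive = false;
    conn.deadCount += 1;
    // The backoff window doubles with each consecutive failure.
    conn.resurrectAt = now + this.resurrectTimeout * 2 ** (conn.deadCount - 1);
  }

  // Called when a request to `url` succeeds again.
  markAlive(url) {
    const conn = this.connections.find((c) => c.url === url);
    conn.alive = true;
    conn.deadCount = 0;
    conn.resurrectAt = 0;
  }

  // Returns the nodes a request is allowed to use at time `now`.
  selectable(now = Date.now()) {
    return this.connections.filter((c) => c.alive || now >= c.resurrectAt);
  }
}

const pool = new SimplePool(['http://es-host1:9200', 'http://es-host2:9200']);
const t0 = Date.now();
pool.markDead('http://es-host1:9200', t0);
// Immediately after the failure only es-host2 is eligible; es-host1 only
// becomes eligible again once the backoff window has elapsed.
console.log(pool.selectable(t0).length);          // 1
console.log(pool.selectable(t0 + 61_000).length); // 2
```

Under a model like this, a node that failed repeatedly can stay ineligible for minutes at a time, which would match the observed window where Kibana reports no living connections even though a host has already recovered.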

I've downgraded all the way to 7.17.4, where this behaviour doesn't occur (i.e., Kibana reconnects to a recovered node instantly if it needs to), but I don't have enough consistent results to say which 8.x version the problem actually starts occurring in.

@azasypkin (Member) commented:

Sounds like it's the same issue as described in elastic/elasticsearch-js#1714.

/cc @pgayvallet

@azasypkin azasypkin added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Jun 16, 2022
@elasticmachine (Contributor) commented:

Pinging @elastic/kibana-core (Team:Core)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Jun 16, 2022
@ceeeekay (Author) commented:

@azasypkin Hi there, that issue seems similar, but I've not seen this behaviour with Elasticsearch itself, and I'm not restarting master nodes when I see this issue.

FWIW I have two ES nodes set up as load balancers (ingest & remote_cluster_client roles), which a pair of Kibana nodes connect to. These are separate from the three masters in the cluster, and also separate from the various dedicated data nodes.

The ES cluster is 100% available throughout this process as I had a watch + curl on _cat/nodes while testing and could see the restarted nodes leave and join the cluster.

Testing back on 7.17.4, I noticed in netstat that even if Kibana hadn't yet contacted a newly-restarted LB node, it would instantly try to connect to that node when the remaining node was removed, i.e., it would go from 0 connections to several, and the Kibana interface would be available the entire time. From 8.x onwards this isn't the case: it can take a considerable amount of time before Kibana retries nodes it considers failed, which results in license errors in the logs and 503 errors in the interface.
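The 7.17.4 behaviour described above (going from 0 connections straight to several the moment the last live node disappears) is what you would get from a pool that force-resurrects a dead node whenever no living connections remain, instead of failing with NoLivingConnectionsError. A hypothetical sketch of that difference (illustrative only; function and field names are invented, not the actual client code):

```javascript
// Illustrative contrast: when every node is dead, a "forced resurrect" pool
// immediately revives one node rather than reporting no living connections.
function pickConnection(connections, now, forceResurrect) {
  // A node is usable if it is alive or its retry deadline has passed.
  const usable = connections.filter((c) => c.alive || now >= c.resurrectAt);
  if (usable.length > 0) return usable[0];
  if (!forceResurrect) return null; // caller sees NoLivingConnectionsError
  // Force-resurrect the dead node whose retry deadline is closest.
  const candidate = [...connections].sort(
    (a, b) => a.resurrectAt - b.resurrectAt
  )[0];
  candidate.alive = true;
  return candidate;
}

const conns = [
  { url: 'http://es-host1:9200', alive: false, resurrectAt: 120_000 },
  { url: 'http://es-host2:9200', alive: false, resurrectAt: 90_000 },
];
// Without forced resurrection (the observed 8.x behaviour): nothing is usable.
console.log(pickConnection(conns.map((c) => ({ ...c })), 0, false)); // null
// With forced resurrection (the observed 7.x behaviour): a node is retried at once.
console.log(pickConnection(conns.map((c) => ({ ...c })), 0, true).url);
```

This would explain why 7.17.4 recovers instantly when the last live node drops, while 8.x waits out the backoff window first.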

Happy to test further but I'm short on time for this until next week.

Cheers :)

@pgayvallet (Contributor) commented:

> If Kibana is restarted while es-host1 is down, it will happily connect to es-host2 and operate fine. If es-host1 returns, and es-host2 is then stopped, Kibana will not attempt to reconnect to es-host1.

I didn't observe that during my initial testing, but I just tried again and I can confirm it. The issue is unrelated to having all nodes down at the same time; there is simply a delay before the client re-attempts to connect to a node it previously identified as down.

So I can confirm this is the issue tracked in elastic/elasticsearch-js#1714. I will update it to reflect the new observations, though.

> Testing back on 7.17.4, I noticed [...]. From 8.x this isn't the case

Yeah, the ES client library got a full rewrite in 8.0, especially in its transport layer. It's unsurprising that 8.0 is the version that introduced this problem.

@ceeeekay (Author) commented:

Sounds good. 8.x is just my hand-wavy guess after some inconclusive testing. I originally thought this problem was introduced in 8.2.1, but my upgrade procedure may have been masking it since 8.0.0.

@pgayvallet (Contributor) commented Jun 16, 2022

That's actually only a guess; I tested with an 8.2.2 version of the client myself. I will try version 8.0.0 ASAP to see whether the problem was already present.

@pgayvallet (Contributor) commented:

I can confirm I reproduced it using v8.0.0 of the client, and that the issue is not present in 7.17.0.

@pgayvallet (Contributor) commented:

Fixed by #134628 in 8.2.4, 8.3.1 and 8.4.0+
