
Metricbeat: The beat/stats module will frequently log errors about missing cluster UUIDs #34217

Open
cmacknz opened this issue Jan 9, 2023 · 13 comments
cmacknz commented Jan 9, 2023

The Elastic Agent uses the Metricbeat beat/stats module to collect metrics for the Beats it starts. Until those Beats connect to Elasticsearch, the agent logs fill up with errors like the one below, which aren't particularly helpful. A Beat only obtains a cluster UUID when it publishes its first event, so if, for example, a log source never updates or changes slowly, this error can appear in the agent logs quite frequently.

```json
{"log.level":"error","@timestamp":"2022-12-22T14:26:36.306Z","message":"Error fetching data for metricset beat.stats: monitored beat is using Elasticsearch output but cluster UUID cannot be determined","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"ecs.version":"1.6.0","log.origin":{"file.line":256,"file.name":"module/wrapper.go"},"service.name":"metricbeat","ecs.version":"1.6.0"}
```

This error is coming from this code:

```go
func (m *MetricSet) getClusterUUID() (string, error) {
	state, err := beat.GetState(m.MetricSet)
	if err != nil {
		return "", errors.Wrap(err, "could not get state information")
	}

	clusterUUID := state.Monitoring.ClusterUUID
	if clusterUUID != "" {
		return clusterUUID, nil
	}

	if state.Output.Name != "elasticsearch" {
		return "", nil
	}

	clusterUUID = state.Outputs.Elasticsearch.ClusterUUID
	if clusterUUID == "" {
		// Output is ES but cluster UUID could not be determined. No point sending monitoring
		// data with empty cluster UUID since it will not be associated with the correct ES
		// production cluster. Log error instead.
		return "", beat.ErrClusterUUID
	}

	return clusterUUID, nil
}
```

Why do we need an ES cluster UUID to collect beat stats? Is there a way to bypass this or suppress this warning?

@cmacknz added the Agent, Team:Elastic-Agent, and Team:Infra Monitoring UI - DEPRECATED labels Jan 9, 2023
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@JAmorimNeon

I'm facing this problem too! Elastic version 8.6.0.

@herbc2

herbc2 commented Jan 20, 2023

Same here with 8.6.0

@engarpe

engarpe commented Jan 21, 2023

I'm facing the same issues with 8.6.0 self managed

@belimawr
Contributor

@cmacknz I'm not quite sure, but there seems to be a related issue that leads to panic: #34384

@klacabane
Contributor

klacabane commented Jan 25, 2023

The cluster UUID is required for the Stack Monitoring application to properly tie a Beat to its Elasticsearch cluster. This is mainly driven by the business logic of Stack Monitoring; without this information the application would show an incorrect state for the affected Beat processes.

Given that this issue should be transient and disappear once the Beat successfully connects to ES, is there a need to suppress this warning? If the issue persists, it would surface a deeper problem in the monitored Beat process, and at that point it is valuable to have it logged. Should we consider a lower logging level? Should the Beats API not return a successful response unless it is consistent with its configuration?

@cmacknz
Member Author

cmacknz commented Jan 25, 2023

I think the root cause here is that the Beats lazily connect to Elasticsearch when they have events to send. So Filebeat for example will not connect for the first time until there is data to send.

This can lead to valid situations where we are repeatedly seeing this log message because the file being monitored hasn't updated since the last time Filebeat was started.

@belimawr and I spoke, and a better solution to this problem is likely to make an initial connection attempt as soon as the Beat is initialized, so we can grab the cluster UUID and also detect problems in the output configuration much earlier.

@cmacknz
Member Author

cmacknz commented Jan 25, 2023

Generally this log message is harmless and is just log spam, because if the Beat has tried and failed to connect to Elasticsearch there will be other more obvious errors related to that in the logs.

@yevgenytrcloudzone

@cmacknz the importance of the message is not in question. The problem is the flood of error-severity messages in the agent log, which creates way too much noise.

@klacabane
Contributor

I'll look into reducing how often this log occurs and lowering the severity of the message, considering that a failure to connect to the ES output is already logged separately.

@miltonhultgren
Contributor

miltonhultgren commented Mar 16, 2023

@cmacknz Is there some way to verify which Beat is still waiting to connect to Elasticsearch?
And is there some Beat setup in the default Agent settings that would lazily connect like this?
So that we can check that the error indeed goes away once that Beat has a reason to send its first document.

@cmacknz
Member Author

cmacknz commented Mar 23, 2023

All the Beats lazily connect as far as I know; Metricbeat and Filebeat certainly do.

If you can modify the Beat code for this experiment, I would just add a log statement when the clusterUUIDFetchingCallback is registered and another one when it is actually executed.

```go
func (b *Beat) clusterUUIDFetchingCallback() elasticsearch.ConnectCallback {
```

Without modifying the Beat, in the agent logs you'll see something like the following when a Beat does eventually connect to Elasticsearch:

```json
{"log.level":"info","@timestamp":"2023-03-22T08:54:21.468Z","message":"Connection to backoff(elasticsearch(https://$domain.europe-west1.gcp.cloud.es.io:443)) established","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"log-default","type":"log"},"log":{"source":"log-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"publisher_pipeline_output","log.origin":{"file.line":147,"file.name":"pipeline/client_worker.go"},"ecs.version":"1.6.0"}
```

@smith removed the Team:Infra Monitoring UI - DEPRECATED label Nov 9, 2023
@botelastic

botelastic bot commented Nov 8, 2024

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!

@botelastic bot added the Stalled label Nov 8, 2024