Define strategy for debugging problems in the event data path #41993

Open
cmacknz opened this issue Dec 11, 2024 · 5 comments
Labels
Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@cmacknz
Member

cmacknz commented Dec 11, 2024

When root-causing user problems and support cases, we need a way to tell what has happened to data on the event path. The most obvious way to do this is to add log lines for interesting things that happen as the event travels through the pipeline. This works at low event rates but is not practical at high event rates.

Each event can generate many additional log lines. A single event could become 10+ additional log lines, making the rate of monitoring data 10x or more the rate of the data actually collected. This floods the logs, so rapid log rotation removes any guarantee that we will observe a rare but important log line, and it makes the cost of storing the monitoring data dominate the cost of the data collection. For example, in low-volume Elastic Agent use cases, monitoring data collection can dominate the ingest and storage volume.

We need to decide on a strategy for ensuring we can diagnose data pipeline problems without generating excessive amounts of observability data about ourselves. Many people solve this problem with distributed tracing, which comes paired with various built-in sampling strategies. We cannot default to distributed tracing because we do not always have a place where the traces can be sent or analyzed.

One strategy would be to build up context about each event as it progresses through the pipeline and emit a single, large log line only when the event is considered complete. See canonical log lines for a reference on this strategy. We could then rate limit or sample these completed-event lines, instead of needing a rate limiter for each individual log line.
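To make the idea concrete, here is a minimal sketch of what "build up context, then emit one line" could look like. It is not existing Beats or Elastic Agent code: `eventTrace`, `sampler`, and `finish` are hypothetical names, and the 1-in-N sampler is only one of several possible ways to thin out the completed-event lines.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// eventTrace is a hypothetical per-event context. Pipeline stages record
// notes on it instead of writing a log line each time.
type eventTrace struct {
	id    string
	steps []string
}

// note records one pipeline step as a key=value pair.
func (t *eventTrace) note(stage, detail string) {
	t.steps = append(t.steps, stage+"="+detail)
}

// canonicalLine renders everything recorded for the event as a single line.
func (t *eventTrace) canonicalLine(outcome string) string {
	return fmt.Sprintf("event_id=%s outcome=%s %s", t.id, outcome, strings.Join(t.steps, " "))
}

// sampler lets through the first of every n completed events.
// n must be greater than zero.
type sampler struct {
	n     uint64
	count atomic.Uint64
}

func (s *sampler) allow() bool {
	return (s.count.Add(1)-1)%s.n == 0
}

// finish is called once per event when it is considered complete.
// It emits the canonical line only if the sampler allows it.
func finish(t *eventTrace, outcome string, s *sampler, emit func(string)) {
	if s.allow() {
		emit(t.canonicalLine(outcome))
	}
}

func main() {
	s := &sampler{n: 3} // assumption: keep 1 in 3 completed events
	for i := 0; i < 6; i++ {
		t := &eventTrace{id: fmt.Sprintf("evt-%d", i)}
		t.note("input", "read")
		t.note("processor", "ok")
		t.note("queue", "enqueued")
		finish(t, "acked", s, func(line string) { fmt.Println(line) })
	}
}
```

The sampling decision is made once per event rather than once per log call, so the per-stage cost is a slice append. A real implementation would probably want to bypass sampling for dropped or failed events.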

@cmacknz cmacknz added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Dec 11, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@cmacknz
Member Author

cmacknz commented Dec 11, 2024

elastic/elastic-agent-libs#256 proposes a rate limiter for individual log lines which looks nice and makes sense, but likely comes with some performance overhead we would not want to put in our critical performance path (e.g. a mutex for each log line under the hood).
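For comparison, a per-line limiter does not strictly need a mutex. The sketch below is not what elastic/elastic-agent-libs#256 implements; it is a hypothetical `lineLimiter` that allows at most one line per interval for a single call site using one atomic compare-and-swap. Whether that is cheap enough for the critical path would still need benchmarking.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// lineLimiter is a hypothetical limiter for one log call site. It allows at
// most one line per interval and uses an atomic timestamp instead of a mutex.
type lineLimiter struct {
	interval int64        // minimum gap between lines, in nanoseconds
	last     atomic.Int64 // timestamp of the last allowed line; 0 means none yet
}

func newLineLimiter(interval time.Duration) *lineLimiter {
	return &lineLimiter{interval: int64(interval)}
}

// allowAt reports whether a line may be logged at nowNanos.
// Taking the time as an argument keeps the limiter easy to test.
func (l *lineLimiter) allowAt(nowNanos int64) bool {
	prev := l.last.Load()
	if prev != 0 && nowNanos-prev < l.interval {
		return false
	}
	// If several goroutines race here, only one wins the swap.
	// The others drop their line, which is acceptable for a limiter.
	return l.last.CompareAndSwap(prev, nowNanos)
}

// allow is the convenience form using the wall clock.
func (l *lineLimiter) allow() bool {
	return l.allowAt(time.Now().UnixNano())
}

func main() {
	lim := newLineLimiter(time.Second)
	start := time.Now()
	for i := 0; i < 5; i++ {
		// Simulate calls arriving 400ms apart.
		now := start.Add(time.Duration(i) * 400 * time.Millisecond)
		if lim.allowAt(now.UnixNano()) {
			fmt.Printf("t+%dms: processor failed (logged)\n", i*400)
		}
	}
}
```

Even without a lock, every suppressed call still pays for an atomic load, and each call site needs its own limiter, which is part of why limiting once per completed event may be simpler.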

@leehinman
Contributor

There are probably good arguments against this, but for things like processor failures we add to the error.message field and send the event along. And it is certainly possible to just append to this field. Could we maybe extend that to fit more use cases?
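As an illustration of the append behavior, here is a small sketch. It models the event as a plain nested map rather than the real Beats event type, and `appendErrorMessage` along with the `"; "` separator are assumptions made for the example, not the existing helper.

```go
package main

import "fmt"

// appendErrorMessage is a hypothetical helper that appends msg to the
// event's error.message field, creating the field if it does not exist.
// The event is modeled as a plain nested map for illustration only.
func appendErrorMessage(fields map[string]any, msg string) {
	errObj, ok := fields["error"].(map[string]any)
	if !ok {
		errObj = map[string]any{}
		fields["error"] = errObj
	}
	if prev, ok := errObj["message"].(string); ok && prev != "" {
		// Assumption: earlier messages are kept and joined with "; ".
		errObj["message"] = prev + "; " + msg
		return
	}
	errObj["message"] = msg
}

func main() {
	event := map[string]any{"message": "raw log line"}
	appendErrorMessage(event, "dissect: pattern did not match")
	appendErrorMessage(event, "timestamp: parse failed")
	fmt.Println(event["error"])
}
```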

@cmacknz
Member Author

cmacknz commented Dec 11, 2024

> Could we maybe extend that to fit more use cases?

I don't think we want to start appending arbitrary data to events in a way that requires us to store it in ES and make sure the fields are always mapped. It makes sense in the specific case of "there was an error processing this event but I could still ship it", like a processor failure.

We could start storing more things as event metadata and then build a single log message out of that when the event is acked, dropped, or published, as appropriate.

@belimawr
Contributor

> We could start storing more things as event metadata and then build a single log message out of that when the event is acked, dropped, or published, as appropriate.

I like this idea. It seems to offer a good trade-off because it doesn't rely on sampling and it's bound to specific events. If needed we could add sampling later, but I'd try to avoid it if possible.

We could also log it to a separate log file, something like an 'event errors log', so that other verbose loggers don't drown out errors that happen rarely.
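A rough sketch of the separate-file idea, using Go's standard `log/slog` package as a stand-in for the project's own logging library. The function name `newEventErrorLogger`, the file name, and the field names are all hypothetical, and rotation of the dedicated file is not handled here.

```go
package main

import (
	"log/slog"
	"os"
	"path/filepath"
)

// newEventErrorLogger opens (or creates) a dedicated file and returns a JSON
// logger that writes only to it, so per-event error records do not share
// rotation with the main, more verbose log. The caller closes the file.
func newEventErrorLogger(path string) (*slog.Logger, *os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o600)
	if err != nil {
		return nil, nil, err
	}
	return slog.New(slog.NewJSONHandler(f, nil)), f, nil
}

func main() {
	// Assumption: the file location is arbitrary and chosen for the example.
	path := filepath.Join(os.TempDir(), "event-errors.ndjson")
	logger, f, err := newEventErrorLogger(path)
	if err != nil {
		slog.Error("cannot open event errors log", "error", err)
		os.Exit(1)
	}
	defer f.Close()

	// One record per problematic event, built from its accumulated metadata.
	logger.Error("event dropped",
		"event_id", "evt-42",
		"stage", "output",
		"reason", "mapping conflict",
	)
}
```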
