Define strategy for debugging problems in the event data path #41993
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
elastic/elastic-agent-libs#256 proposes a rate limiter for individual log lines, which looks nice and makes sense, but it likely comes with performance overhead we would not want in our critical path (e.g. a mutex taken for each log line under the hood).
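For illustration, here is a minimal Go sketch of what such a per-message rate limiter could look like; the names are hypothetical, not the API proposed in elastic-agent-libs#256. Note the mutex acquired on every log call, which is exactly the hot-path overhead in question:

```go
// Hypothetical sketch only; not the API from elastic-agent-libs#256.
package ratelimit

import (
	"sync"
	"time"
)

// limitedLogger lets each distinct message key through at most once per
// interval. The mutex guarding the map is taken on every log call, which
// is the hot-path cost discussed above.
type limitedLogger struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
	interval time.Duration
	log      func(msg string)
}

func newLimitedLogger(interval time.Duration, log func(msg string)) *limitedLogger {
	return &limitedLogger{
		lastSeen: make(map[string]time.Time),
		interval: interval,
		log:      log,
	}
}

// Infof logs msg unless a message with the same key was logged within
// the configured interval.
func (l *limitedLogger) Infof(key, msg string) {
	l.mu.Lock()
	now := time.Now()
	if last, ok := l.lastSeen[key]; ok && now.Sub(last) < l.interval {
		l.mu.Unlock()
		return // suppressed: this key was logged too recently
	}
	l.lastSeen[key] = now
	l.mu.Unlock()
	l.log(msg)
}
```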
There are probably good arguments against this, but for things like processor failures we could add the error details to the event itself.
I don't think we want to start appending arbitrary data to events in a way that requires us to store it in ES and keep the fields mapped. In the specific case of "there was an error processing this event, but I could still ship it", like a processor failure, this makes sense. We could start storing more things as event metadata and then build a single log message out of that when the event is acked/dropped/published, as appropriate.
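A minimal sketch of that idea, assuming a hypothetical `Event` type and completion hook rather than the actual `beat.Event` API: errors accumulate outside the indexed fields and are folded into one log message at completion.

```go
// Hypothetical sketch; Event and OnFinished stand in for the real
// beat.Event type and pipeline completion callbacks.
package pipeline

import "log"

type Event struct {
	// Fields is what gets shipped and indexed in Elasticsearch.
	Fields map[string]interface{}
	// processingErrors accumulates what went wrong along the way; it is
	// deliberately kept out of Fields so nothing extra needs ES mappings.
	processingErrors []string
}

// RecordError attaches a processing error to the event without touching
// the fields that will be shipped.
func (e *Event) RecordError(err error) {
	e.processingErrors = append(e.processingErrors, err.Error())
}

// OnFinished runs when the event is acked, dropped, or published, and
// folds everything we learned into a single log message.
func OnFinished(e *Event, outcome string) {
	if len(e.processingErrors) == 0 {
		return
	}
	log.Printf("event %s with %d processing error(s): %v",
		outcome, len(e.processingErrors), e.processingErrors)
}
```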
I like this idea; it seems to offer a good trade-off because it won't use sampling and it's bound to specific events. If needed we could later add sampling, but I'd try to avoid it if possible. We could also log it to a separate log file, something like an "event errors log", so we won't face the problem of other verbose loggers drowning out errors that don't happen often.
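As a sketch of the separate log file idea, here is a dedicated event-errors logger built directly on go.uber.org/zap (which Beats' logp package wraps); the file name and log fields are illustrative assumptions:

```go
// Sketch of a dedicated event-errors log file; uses zap directly, and
// the file path and log fields are illustrative assumptions.
package main

import (
	"os"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// newEventErrorLogger returns a logger that writes JSON lines to its own
// file, so rare event-path errors are not rotated away by verbose logs.
func newEventErrorLogger(path string) (*zap.Logger, error) {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return nil, err
	}
	core := zapcore.NewCore(
		zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()),
		zapcore.AddSync(f),
		zapcore.ErrorLevel, // only errors end up in this file
	)
	return zap.New(core), nil
}

func main() {
	logger, err := newEventErrorLogger("event-errors.ndjson")
	if err != nil {
		panic(err)
	}
	defer logger.Sync()
	logger.Error("processor failed",
		zap.String("processor", "add_host_metadata"),
		zap.String("reason", "example"))
}
```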
When root-causing user problems and support cases, we need a way to tell what has happened to data on the event path. The most obvious approach is to add log lines for interesting things that happen as an event travels through the pipeline. This works at low event rates, but it is not practical at high event rates.
For each event, we can generate many additional log lines: a single event could become 10+ log lines, making the rate of monitoring data 10x the rate of the data actually collected. This both floods the logs, eliminating any guarantee that we observe a rare but important log line before rapid log rotation discards it, and makes the cost of storing the monitoring data dominate the cost of the data collection. For example, in low-volume Elastic Agent use cases, monitoring data collection can already dominate the ingest and storage volume.
We need to decide on a strategy that lets us diagnose data pipeline problems without generating excessive amounts of observability data about ourselves. Many people solve this with distributed tracing, which comes paired with various built-in sampling strategies. We cannot default to distributed tracing because we do not always have a place for the traces to go or a way to analyze them.
One strategy would be to build up context about events as they progress through the pipeline and only emit a single, large log line when the event is considered complete; see canonical log lines as a reference for this approach. We could then rate limit or sample these summary lines, instead of needing a rate limiter for each individual log line. A rough sketch follows.
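A minimal Go sketch of the canonical-log-line idea under stated assumptions: the trace type and field names are hypothetical, and the rate limiting uses golang.org/x/time/rate to cap how many summary lines are emitted, rather than limiting each message independently.

```go
// Sketch of canonical log lines with a single shared rate limiter; type
// and field names are hypothetical.
package pipeline

import (
	"log"

	"golang.org/x/time/rate"
)

// One limiter for the summary lines, instead of one per individual log
// message: allow at most 10 canonical lines per second, burst of 20.
var summaryLimiter = rate.NewLimiter(rate.Limit(10), 20)

// eventTrace collects everything interesting that happened to one event
// as it moved through the pipeline.
type eventTrace struct {
	input      string
	processors []string
	errors     []string
	outcome    string // "acked", "dropped", "failed", ...
}

// emit writes the single canonical line for this event, if the limiter
// allows it.
func (t *eventTrace) emit() {
	if !summaryLimiter.Allow() {
		return // sampled out under load
	}
	log.Printf("event outcome=%s input=%s processors=%v errors=%v",
		t.outcome, t.input, t.processors, t.errors)
}
```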