Define strategy for debugging problems in the event data path #41993

cmacknz · 2024-12-11T16:17:09Z

When root causing user problems and support cases, we need a way to tell what has happened to data on the event path. The most obvious way to do this is to add log lines for interesting things that happen as the event travels through the pipeline. This works for low event rates, but is not practical at high event rates.

For each event, we can generate many additional log lines. A single event could become 10+ additional log lines, making the rate of monitoring data 10x the rate of actual data collected. This floods the logs, so rapid log rotation eliminates any guarantee that we observe a rare but important log line, and it makes the cost of storing the monitoring data dominate the cost of the data collection. For example, in low volume Elastic Agent use cases, monitoring data collection can dominate the ingest and storage volume.

We need to decide on a strategy for ensuring we can diagnose data pipeline problems without generating excessive amounts of observability data about ourselves. Many people solve this problem with distributed tracing, which comes paired with various built-in sampling strategies. We cannot default to distributed tracing because we do not always have a place for the traces to go or be analyzed.

One strategy would be to build up context about events as they progress through the pipeline and only emit a single, large log line when the event is considered complete. Canonical log lines are a reference for this strategy. We could then rate limit or sample these complete lines, instead of needing a rate limiter for each individual log line.
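A minimal sketch of that idea, assuming a per-event context that collects key/value notes and a simple keep-every-nth sampler applied only to completed events. The names `eventTrace`, `note`, `line`, and `sampler` are hypothetical and are not existing Beats or Elastic Agent types; the standard library stands in for the real logger.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// eventTrace is a hypothetical per-event context (not a Beats type).
// It collects what happened to one event as it moves through the pipeline.
type eventTrace struct {
	fields map[string]string
}

func newEventTrace() *eventTrace {
	return &eventTrace{fields: make(map[string]string)}
}

// note records one interesting thing that happened to the event.
func (t *eventTrace) note(key, value string) {
	t.fields[key] = value
}

// line renders everything recorded so far as a single key="value" log line.
// Keys are sorted so the output is stable.
func (t *eventTrace) line() string {
	keys := make([]string, 0, len(t.fields))
	for k := range t.fields {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, t.fields[k]))
	}
	return strings.Join(parts, " ")
}

// sampler keeps the first of every n completed events. n must be >= 1.
// It is not safe for concurrent use; that is left out of this sketch.
type sampler struct {
	n, seen uint64
}

func (s *sampler) keep() bool {
	s.seen++
	return (s.seen-1)%s.n == 0
}

func main() {
	s := &sampler{n: 2}
	for i := 0; i < 4; i++ {
		tr := newEventTrace()
		tr.note("input", "filestream")
		tr.note("processor", "ok")
		tr.note("outcome", "acked")
		// One sampling decision per event, made when the event completes.
		if s.keep() {
			fmt.Println(tr.line())
		}
	}
}
```

The sampling decision happens once per event rather than once per log call, so the hot path only pays for map writes until the event completes.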

elasticmachine · 2024-12-11T16:17:11Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz · 2024-12-11T16:18:14Z

elastic/elastic-agent-libs#256 proposes a rate limiter for individual log lines which looks nice and makes sense, but likely comes with some performance overhead we would not want to put in our critical performance path (e.g. a mutex for each log line under the hood).

leehinman · 2024-12-11T16:55:02Z

There are probably good arguments against this, but for things like processor failures we add to the error.message field and send the event along. And it is certainly possible to just append to this field. Could we maybe extend that to fit more use cases?

cmacknz · 2024-12-11T17:27:03Z

> Could we maybe extend that to fit more use cases?

I don't think we want to start appending arbitrary data to events in a way that we have to store it in ES and make sure the fields are always mapped. In the specific case of "there was an error processing this event but I could still ship it" like a processor failure this makes sense.

We could start storing more things as event metadata and then build a single log message out of that when the event is acked/dropped/published as appropriate.

belimawr · 2024-12-11T21:21:00Z

> We could start storing more things as event metadata and then build a single log message out of that when the event is acked/dropped/published as appropriate.

I like this idea. It seems to offer a good trade-off because it won't use sampling and it's tied to specific events. If needed we could add sampling later, but I'd try to avoid it if possible.

We could also log it to a separate log file, something like an 'event errors log' so we won't face the problem of other verbose loggers overwhelming errors that don't happen often.

cmacknz added the Team:Elastic-Agent-Data-Plane label · Dec 11, 2024
