Message retry performance implications and architectural issues #613

Open
nicklas-dohrn opened this issue Sep 26, 2024 · 6 comments

@nicklas-dohrn
Contributor

This issue is meant to discuss the current state of the retry logic for syslog messages, as it has some problematic implications.
Here is a short overview:

  • Having a syslog drain fail under high load will drop messages for other drains.
    This also pushes the CPU consumption of the syslog agent above 1 CPU; I am not sure why.
  • The syslog-batching implementation is not able to use the retry mechanism, as the retry writer keeps no state about the batching being done.
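
To illustrate the second point, here is a minimal Go sketch of a batch writer that keeps its own pending state, so that a failed flush can be retried as a whole batch. This is not the agent's code: `batcher`, `sendFunc`, `maxRetries`, and `errUnavailable` are hypothetical names chosen for the example, and a real retry would likely add a backoff between attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable is a hypothetical error used to simulate a failing drain.
var errUnavailable = errors.New("drain unavailable")

// sendFunc is a hypothetical transport hook that delivers one whole batch.
type sendFunc func(batch []string) error

// batcher keeps unsent messages until a flush succeeds, so the retry
// logic has the batch state it needs.
type batcher struct {
	pending    []string
	send       sendFunc
	maxRetries int
}

// add appends a message to the pending batch.
func (b *batcher) add(msg string) {
	b.pending = append(b.pending, msg)
}

// flush sends the pending batch as one unit and retries the same batch
// on failure. The batch is cleared only after a successful send.
func (b *batcher) flush() error {
	if len(b.pending) == 0 {
		return nil
	}
	var err error
	for attempt := 0; attempt <= b.maxRetries; attempt++ {
		if err = b.send(b.pending); err == nil {
			b.pending = nil
			return nil
		}
	}
	return fmt.Errorf("flush failed after %d attempts: %w", b.maxRetries+1, err)
}

func main() {
	calls := 0
	// flaky fails twice and then succeeds.
	flaky := func(batch []string) error {
		calls++
		if calls < 3 {
			return errUnavailable
		}
		return nil
	}

	b := &batcher{send: flaky, maxRetries: 3}
	b.add("log 1")
	b.add("log 2")
	fmt.Println("flush error:", b.flush(), "send calls:", calls)
}
```

The design choice shown is that the component owning the batch also owns the retry, which avoids a separate retry writer having to know about batch boundaries.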

I will add details and my testing results here later in a better formatted way.

@ctlong
Member

ctlong commented Sep 26, 2024

I dived a little deeper into the syslog writer code recently and I think that we were incorrect in some of our previous assertions about the synchronized nature of the agent. If you check out the syslog connector, which the manager uses to create new drains, each drain is provided with an egress diode. Since writing to the diode should be non-blocking, I think that the envelope writing loop is in fact asynchronous to some degree.

At least, a problematic syslog drain shouldn't directly prevent other drains from continuing to receive messages.
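
A minimal Go sketch of that idea, assuming one bounded buffer per drain with a non-blocking write. `drainBuffer` and `offer` are hypothetical names, and a buffered channel stands in for the diode here; the real diode may differ, for example in which message it discards when full.

```go
package main

import "fmt"

// drainBuffer is a hypothetical stand-in for a per-drain egress diode.
// It is not safe for concurrent writers; the counter is a plain int.
type drainBuffer struct {
	name    string
	ch      chan string
	dropped int
}

// newDrainBuffer creates a buffer that holds at most size messages.
func newDrainBuffer(name string, size int) *drainBuffer {
	return &drainBuffer{name: name, ch: make(chan string, size)}
}

// offer never blocks: if the buffer is full, the message is dropped
// and counted, and the caller moves on to the next drain.
func (d *drainBuffer) offer(msg string) bool {
	select {
	case d.ch <- msg:
		return true
	default:
		d.dropped++
		return false
	}
}

func main() {
	healthy := newDrainBuffer("healthy", 4)
	stuck := newDrainBuffer("stuck", 1) // nobody reads from this one

	// The writing loop fans each message out to every drain.
	for i := 0; i < 3; i++ {
		msg := fmt.Sprintf("log %d", i)
		stuck.offer(msg)   // fills up and starts dropping
		healthy.offer(msg) // unaffected by the stuck drain
	}

	fmt.Printf("%s dropped=%d, %s dropped=%d\n",
		stuck.name, stuck.dropped, healthy.name, healthy.dropped)
}
```

In this sketch the stuck drain only loses its own messages; the loop never waits on it, so the healthy drain keeps receiving.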

@ctlong
Member

ctlong commented Sep 26, 2024

High CPU usage of the agent is a known problem. Unfortunately, none of the logging and metrics agents currently have any kind of memory or CPU limitation placed upon them. They will expand as necessary to meet demand.

We took a pprof dump a while ago and saw that marshalling/unmarshalling envelopes was the primary performance issue of most of our agents. Part of what I hope to accomplish by merging every agent into the OTel collector is to reduce the number of marshal/unmarshal steps required to egress an individual envelope from a VM.

@nicklas-dohrn
Contributor Author

I did some testing as well. That every drain gets its own diode also matches my understanding of why there is some degree of concurrency.
IMHO, this is also unwanted behaviour: it does not allow setting a maximum resource consumption, so the syslog-agent is able to overload other components.
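
One generic Go-level knob for this, shown only as a sketch and not as something the agent currently does: `runtime.GOMAXPROCS` limits how many OS threads may execute Go code simultaneously. `capParallelism` is a hypothetical helper name. Threads blocked in system calls do not count against this limit, so it is not a hard CPU quota.

```go
package main

import (
	"fmt"
	"runtime"
)

// capParallelism limits the number of OS threads that may execute Go
// code at the same time and returns the previous setting.
// Passing 0 leaves the setting unchanged and only reports it.
func capParallelism(n int) int {
	return runtime.GOMAXPROCS(n)
}

func main() {
	prev := capParallelism(1)
	fmt.Printf("GOMAXPROCS was %d, now %d\n", prev, capParallelism(0))
}
```

The same limit can be set without a code change through the `GOMAXPROCS` environment variable.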

@nicklas-dohrn
Contributor Author

> At least, a problematic syslog drain shouldn't directly prevent other drains from continuing to receive messages.

Yes, this is what I see in testing.
It only allows a "DoS"-style overload, where the messages dropped on the other, non-malicious receiver seem to be random.
[Screenshot 2024-09-26 at 08 38 16: inflowing data on the receiving side; the rate should be 50 logs/s]

@ctlong
Member

ctlong commented Nov 7, 2024

@nicklas-dohrn to confirm the state of this issue, the current concerns are:

  1. Syslog Agent can reach high CPU usage under load, which has the potential to overload other components.
  2. https-batch drains in Syslog Agent have no retry logic when they fail and do not propagate any error log message.

Is that correct? If so, I'd move to ignore the first concern in this issue as I consider it to be a general known issue with CF-D components – what we really want is some BPM-specific way to indicate CPU shares.

@nicklas-dohrn
Contributor Author

  1. Syslog Agent can reach high CPU usage under load, which has the potential to overload other components.
     Your assumption is correct. As I see it, this is the expected behaviour for the loggregator agent as well.
  2. https-batch drains in Syslog Agent have no retry logic when they fail and do not propagate any error log message.
     This behaviour was observed with the normal http endpoint. I was not using the https_batch implementation, to make sure this is indeed a problem I did not introduce.
     To me, it looked like the uptake of new messages was disrupted when the agent was overloaded, not the sending side.

Projects
Status: Pending Merge | Prioritized