Fix deadlock in error handling in OTelManager #6927
This pull request does not have a backport label. Could you fix it @blakerouse? 🙏
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
I think we should do the same thing with status reporting.
@swiatekm We could do that and it could result in less chance of deadlock. But we do not do that with the runtime manager for a reason, and I think the same reason applies here. The reason we do not do that is for logging and
I'm fine merging this PR as is, but I also think we should ensure we don't create the same problem with status updates. If all status updates need to be delivered, maybe it's worth having a reasonably large buffered channel for them? I don't think we're really concerned about the Coordinator just not ever consuming the updates, more about unlucky streaks of select choices in the Coordinator loop. What I really do not want to do is leave another potential race condition in this code just because it's much less probable (right now).
The changes themselves look good to me. I would feel better about them if we added a simple test to verify that it's possible to, say, submit two empty config changes to the Otel manager without blocking. Especially since it's a bit undertested in general.
@swiatekm I added to the existing
* Fix race in error handling. * Add changelog. * Add testing. * Changelog bug-fix. (cherry picked from commit 9b572ef)
* Fix race in error handling. * Add changelog. * Add testing. * Changelog bug-fix. (cherry picked from commit 9b572ef) # Conflicts: # internal/pkg/otel/manager/manager.go # internal/pkg/otel/manager/manager_test.go
* Fix race in error handling. * Add changelog. * Add testing. * Changelog bug-fix. (cherry picked from commit 9b572ef) Co-authored-by: Blake Rouse <[email protected]>
…6953) * Fix deadlock in error handling in OTelManager (#6927) * Fix race in error handling. * Add changelog. * Add testing. * Changelog bug-fix. (cherry picked from commit 9b572ef) # Conflicts: # internal/pkg/otel/manager/manager.go # internal/pkg/otel/manager/manager_test.go * Fix cherry-pick. --------- Co-authored-by: Blake Rouse <[email protected]>
What does this PR do?
Fixes an issue where a deadlock would cause the OTelManager to get stuck.
This also seems to be the cause of the flaky test TestOTelManager_Run. I have run this over 100 times and hit no more errors with this PR.
Why is it important?
The deadlock prevents the running internal OTel collector from receiving any more updates or actions.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
./changelog/fragments using the changelog tool
[ ] I have added an integration test or an E2E test
Disruptive User Impact
None
How to test this PR locally
go test -race -v -count=100 github.com/elastic/elastic-agent/internal/pkg/otel/manager
Related issues