
Fix deadlock in error handling in OTelManager #6927

Merged
merged 4 commits into elastic:main from fix-race-in-otel-manager
Feb 20, 2025

Conversation

blakerouse
Contributor

@blakerouse blakerouse commented Feb 19, 2025

What does this PR do?

Fixes an issue where a deadlock would cause the OTelManager to get stuck.

This also seems to be the cause of the flaky test TestOTelManager_Run. I have run this over 100 times and hit no errors with this PR.

Why is it important?

The deadlock prevents the running internal OTel collector from receiving any further updates or actions.

Checklist

  • [x] I have read and understood the pull request guidelines of this project.
  • [x] My code follows the style guidelines of this project.
  • [x] I have commented my code, particularly in hard-to-understand areas.
  • [ ] I have made corresponding changes to the documentation.
  • [ ] I have made corresponding changes to the default configuration files.
  • [x] I have added tests that prove my fix is effective or that my feature works.
  • [x] I have added an entry in ./changelog/fragments using the changelog tool.
  • [ ] I have added an integration test or an E2E test.

Disruptive User Impact

None

How to test this PR locally

go test -race -v -count=100 github.com/elastic/elastic-agent/internal/pkg/otel/manager

Related issues

@blakerouse blakerouse added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Feb 19, 2025
@blakerouse blakerouse self-assigned this Feb 19, 2025
Contributor

mergify bot commented Feb 19, 2025

This pull request does not have a backport label. Could you fix it @blakerouse? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label that automatically backports to the 8./d branch, where /d is the digit.
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@blakerouse blakerouse marked this pull request as ready for review February 19, 2025 16:10
@blakerouse blakerouse requested a review from a team as a code owner February 19, 2025 16:10
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@blakerouse blakerouse added backport-active-9 Automated backport with mergify to all the active 9.[0-9]+ branches backport-active-8 Automated backport with mergify to all the active 8.[0-9]+ branches labels Feb 19, 2025
@swiatekm
Contributor

I think we should do the same thing with status reporting.

@blakerouse
Contributor Author

@swiatekm We could do that and it could result in less chance of deadlock. But we do not do that with the runtime manager for a reason, and I think the same reason applies here. The reason we do not do that is for logging and StateWatch over the control protocol. If we don't ensure that every state change is sent over the channel, then logging or watching of that state could be missed. At the moment the runtime manager doesn't do that for that exact reason.

@swiatekm
Contributor

> @swiatekm We could do that and it could result in less chance of deadlock. But we do not do that with the runtime manager for a reason, and I think the same reason applies here. The reason we do not do that is for logging and StateWatch over the control protocol. If we don't ensure that every state change is sent over the channel, then logging or watching of that state could be missed. At the moment the runtime manager doesn't do that for that exact reason.

I'm fine merging this PR as is, but I also think we should ensure we don't create the same problem with status updates. If all status updates need to be delivered, maybe it's worth having a reasonably large buffered channel for them? I don't think we're really concerned about the Coordinator just not ever consuming the updates, more about unlucky streaks of select choices in the Coordinator loop.

What I really do not want to do is leave another potential race condition in this code just because it's much less probable (right now).
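The buffered-channel idea floated above can be sketched briefly. This is an illustration of the suggestion, not code from the PR; the capacity of 64 and the string payload are arbitrary assumptions standing in for real status updates:

```go
package main

import "fmt"

// fillStatus sketches a "reasonably large buffered channel" for status
// updates: the producer can queue up to `capacity` updates without
// blocking, even if the consumer (the Coordinator, in the PR's terms)
// is momentarily busy. Capacity and payload are illustrative.
func fillStatus(capacity int) chan string {
	statusCh := make(chan string, capacity)
	for i := 0; i < capacity; i++ {
		statusCh <- fmt.Sprintf("status-%d", i) // non-blocking until the buffer fills
	}
	return statusCh
}

func main() {
	statusCh := fillStatus(64)
	fmt.Println(len(statusCh)) // all 64 updates queued in order; none dropped
}
```

The trade-off being debated is visible here: buffering preserves every update in order (unlike overwrite-on-full), but a sufficiently long consumer stall can still fill the buffer and block the sender, so it shrinks the deadlock window rather than closing it.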

Contributor

@swiatekm swiatekm left a comment


The changes themselves look good to me. I would feel better about them if we added a simple test to verify that it's possible to, say, submit two empty config changes to the Otel manager without blocking. Especially since it's a bit undertested in general.

@blakerouse
Copy link
Contributor Author

@swiatekm I added to the existing TestOTelManager_ConfigError test to send an invalid configuration multiple times to ensure that a deadlock is not caused.

@blakerouse blakerouse merged commit 9b572ef into elastic:main Feb 20, 2025
14 checks passed
@blakerouse blakerouse deleted the fix-race-in-otel-manager branch February 20, 2025 18:30
mergify bot pushed a commit that referenced this pull request Feb 20, 2025
* Fix race in error handling.

* Add changelog.

* Add testing.

* Changelog bug-fix.

(cherry picked from commit 9b572ef)
mergify bot pushed a commit that referenced this pull request Feb 20, 2025
* Fix race in error handling.

* Add changelog.

* Add testing.

* Changelog bug-fix.

(cherry picked from commit 9b572ef)

# Conflicts:
#	internal/pkg/otel/manager/manager.go
#	internal/pkg/otel/manager/manager_test.go
mergify bot pushed a commit that referenced this pull request Feb 20, 2025
* Fix race in error handling.

* Add changelog.

* Add testing.

* Changelog bug-fix.

(cherry picked from commit 9b572ef)

# Conflicts:
#	internal/pkg/otel/manager/manager.go
#	internal/pkg/otel/manager/manager_test.go
mergify bot pushed a commit that referenced this pull request Feb 20, 2025
* Fix race in error handling.

* Add changelog.

* Add testing.

* Changelog bug-fix.

(cherry picked from commit 9b572ef)
mergify bot pushed a commit that referenced this pull request Feb 20, 2025
* Fix race in error handling.

* Add changelog.

* Add testing.

* Changelog bug-fix.

(cherry picked from commit 9b572ef)

# Conflicts:
#	internal/pkg/otel/manager/manager.go
#	internal/pkg/otel/manager/manager_test.go
blakerouse added a commit that referenced this pull request Feb 20, 2025
* Fix race in error handling.

* Add changelog.

* Add testing.

* Changelog bug-fix.

(cherry picked from commit 9b572ef)

Co-authored-by: Blake Rouse <[email protected]>
blakerouse added a commit that referenced this pull request Feb 21, 2025
* Fix race in error handling.

* Add changelog.

* Add testing.

* Changelog bug-fix.

(cherry picked from commit 9b572ef)

Co-authored-by: Blake Rouse <[email protected]>
blakerouse added a commit that referenced this pull request Feb 21, 2025
…6953)

* Fix deadlock in error handling in OTelManager (#6927)

* Fix race in error handling.

* Add changelog.

* Add testing.

* Changelog bug-fix.

(cherry picked from commit 9b572ef)

# Conflicts:
#	internal/pkg/otel/manager/manager.go
#	internal/pkg/otel/manager/manager_test.go

* Fix cherry-pick.

---------

Co-authored-by: Blake Rouse <[email protected]>
Labels
backport-active-8 Automated backport with mergify to all the active 8.[0-9]+ branches backport-active-9 Automated backport with mergify to all the active 9.[0-9]+ branches Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Development

Successfully merging this pull request may close these issues.

[Flaky Test]: TestOTelManager_Run – otel collector never got healthy
3 participants