Refactor ProcessWatcher to better handle lots of short-lived processes #37366
base: main
Conversation
Proposed commit message
closes(?): #37266
Addresses an issue discovered while profiling #37121: in cases where the ProcessWatcher is running on a system where short-lived processes make network connections, it can use a considerable amount of CPU, as every failed PID lookup refreshes the internal endpoint->pid mapping, which traverses all of `/proc/` to gather socket inodes for every running process.
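As a rough illustration of why that refresh is expensive, here is a minimal, self-contained sketch (not the actual Packetbeat code; `scanProcInodes` is a hypothetical name) of what building an inode->PID map from `/proc` entails:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// scanProcInodes walks /proc/<pid>/fd for every running process and
// records which PID owns each socket inode. Every refresh pays this
// full-traversal cost, regardless of how many sockets actually changed.
func scanProcInodes() map[uint64]int {
	inodes := make(map[uint64]int)
	procs, err := os.ReadDir("/proc")
	if err != nil {
		return inodes
	}
	for _, p := range procs {
		pid, err := strconv.Atoi(p.Name())
		if err != nil {
			continue // not a PID directory (e.g. /proc/net)
		}
		fdDir := filepath.Join("/proc", p.Name(), "fd")
		fds, err := os.ReadDir(fdDir)
		if err != nil {
			continue // the process may have exited already
		}
		for _, fd := range fds {
			// Socket fds are symlinks of the form "socket:[12345]".
			link, err := os.Readlink(filepath.Join(fdDir, fd.Name()))
			if err != nil || !strings.HasPrefix(link, "socket:[") {
				continue
			}
			numStr := strings.TrimSuffix(strings.TrimPrefix(link, "socket:["), "]")
			if ino, err := strconv.ParseUint(numStr, 10, 64); err == nil {
				inodes[ino] = pid
			}
		}
	}
	return inodes
}

func main() {
	fmt.Printf("mapped %d socket inodes\n", len(scanProcInodes()))
}
```

The cost scales with the total number of processes and open fds on the host, which is why triggering it on every failed lookup hurts under churn.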
This is a fairly modest performance boost (see the pprof screenshots below), with `FindProcessTuple` going from 89% of all samples in pprof to 63% of samples. While this would be useful for #37121, as it hits `FindProcessTuple` far more often, I'm on the fence as to whether we should merge this as-is, since we're redoing a lot of critical-path code for a relatively small performance change.
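The description above doesn't spell out the exact refactor, so purely as a hedged illustration of the general shape of such a fix, a lookup cache can rate-limit how often a miss is allowed to trigger a full rescan (all names here are hypothetical; `scanProcInodes` is the sketch above):

```go
package main

import "time"

// inodeCache throttles cache refreshes so a burst of short-lived
// connections cannot force one full /proc traversal per failed lookup.
type inodeCache struct {
	inodes      map[uint64]int
	lastRefresh time.Time
	minInterval time.Duration
}

func (c *inodeCache) lookup(ino uint64) (int, bool) {
	if pid, ok := c.inodes[ino]; ok {
		return pid, true
	}
	// Cache miss: only rescan if enough time has passed since the
	// last refresh; otherwise report the PID as unknown.
	if time.Since(c.lastRefresh) >= c.minInterval {
		c.inodes = scanProcInodes()
		c.lastRefresh = time.Now()
		pid, ok := c.inodes[ino]
		return pid, ok
	}
	return 0, false
}
```

The trade-off is that a genuinely new socket may briefly resolve to no PID until the next permitted refresh.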
I ran a series of performance tests with Packetbeat, running a `main` build and an 8.12 build while running `while true; do wget elastic.co/robots.txt; sleep 2; done` in a separate window.

Before:

*(pprof screenshot: CPU profile before the change)*
After:

*(pprof screenshot: CPU profile after the change)*
As we can see, after the optimization the biggest CPU hog becomes `parseProcNetProto`, which is responsible for parsing `/proc/net/{tcp,udp}` and which is hard to avoid.
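For context on what that parsing involves, here is a hedged sketch of reading `/proc/net/tcp` (the real `parseProcNetProto` differs; this only shows the format being decoded):

```go
package main

import (
	"bufio"
	"encoding/hex"
	"fmt"
	"net"
	"os"
	"strconv"
	"strings"
)

// parseHexEndpoint decodes an endpoint like "0100007F:0016" from
// /proc/net/tcp. On little-endian hosts the four IPv4 bytes appear
// reversed, and the port is hex-encoded.
func parseHexEndpoint(s string) (net.IP, uint16, error) {
	ipHex, portHex, ok := strings.Cut(s, ":")
	if !ok {
		return nil, 0, fmt.Errorf("malformed endpoint %q", s)
	}
	raw, err := hex.DecodeString(ipHex)
	if err != nil || len(raw) != 4 {
		return nil, 0, fmt.Errorf("bad address %q", ipHex)
	}
	port, err := strconv.ParseUint(portHex, 16, 16)
	if err != nil {
		return nil, 0, err
	}
	return net.IPv4(raw[3], raw[2], raw[1], raw[0]), uint16(port), nil
}

func main() {
	f, err := os.Open("/proc/net/tcp")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Scan() // skip the header line
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 10 {
			continue
		}
		local, lport, err := parseHexEndpoint(fields[1])
		if err != nil {
			continue
		}
		// fields[9] is the socket inode: the key used to join this
		// table against the /proc/<pid>/fd scan.
		fmt.Printf("local=%s:%d inode=%s\n", local, lport, fields[9])
	}
}
```

Every refresh re-reads and re-decodes the whole table, so this cost stays with us as long as the watcher is `/proc`-based.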
If we want further improvements or additional optimization, I think our best bet is to avoid the "refresh" approach of constantly parsing `/proc/` and instead refactor the entire ProcessWatcher to use netlink's sock_diag and proc connector APIs, which should allow us to receive events on process/socket creation.
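As a very rough sketch of that direction, here is a minimal, hedged example of subscribing to the kernel's proc connector over netlink (assumptions: the `golang.org/x/sys/unix` module, and a locally defined `CN_IDX_PROC` constant from `<linux/connector.h>`; the message framing needed to actually enable `PROC_CN_MCAST_LISTEN`, and the event read loop, are elided):

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// cnIdxProc is CN_IDX_PROC from <linux/connector.h>: the connector
// index carrying fork/exec/exit events.
const cnIdxProc = 0x1

func main() {
	// NETLINK_CONNECTOR carries proc connector events; similarly,
	// NETLINK_SOCK_DIAG can answer socket->process queries without
	// parsing /proc/net/*. Both push work to the kernel instead of
	// polling the filesystem.
	fd, err := unix.Socket(unix.AF_NETLINK, unix.SOCK_DGRAM, unix.NETLINK_CONNECTOR)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer unix.Close(fd)

	sa := &unix.SockaddrNetlink{
		Family: unix.AF_NETLINK,
		Groups: cnIdxProc, // join the proc-events multicast group
		Pid:    uint32(unix.Getpid()),
	}
	if err := unix.Bind(fd, sa); err != nil {
		fmt.Fprintln(os.Stderr, err) // typically needs CAP_NET_ADMIN
		return
	}
	// A real implementation would now send a PROC_CN_MCAST_LISTEN
	// request and read fork/exec/exit events from the socket in a loop,
	// updating the endpoint->pid mapping incrementally.
}
```

This is event-driven rather than poll-driven, which is what makes it attractive for hosts with heavy process churn.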
Checklist

- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding change to the default configuration files
- [ ] I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.