Refactor ProcessWatcher to better handle lots of short-lived processes #37366

fearful-symmetry · 2023-12-08T18:34:21Z

Proposed commit message

closes(?): #37266

Addresses an issue discovered while profiling #37121. When the ProcessWatcher runs on a system with many short-lived processes making network connections, it can use a considerable amount of CPU: every failed PID lookup refreshes the internal endpoint->pid mapping, which traverses all of /proc/ to gather inodes for every running process.
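
To give a sense of why each refresh is costly, here is a rough sketch of what collecting socket inodes from /proc involves. It is not the ProcessWatcher code: `parseSocketLink` and `socketInodesByPID` are hypothetical names used only for illustration, and the sketch assumes the usual Linux layout where `/proc/<pid>/fd/*` symlinks to sockets read as `socket:[<inode>]`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseSocketLink extracts the inode from a symlink target of the form
// "socket:[12345]". ok is false for any other target (files, pipes, ...).
func parseSocketLink(target string) (inode uint64, ok bool) {
	const prefix = "socket:["
	if !strings.HasPrefix(target, prefix) || !strings.HasSuffix(target, "]") {
		return 0, false
	}
	n, err := strconv.ParseUint(target[len(prefix):len(target)-1], 10, 64)
	if err != nil {
		return 0, false
	}
	return n, true
}

// socketInodesByPID walks every numeric directory under procRoot and
// returns a map of socket inode -> owning PID. The cost grows with the
// number of processes times the number of open file descriptors.
func socketInodesByPID(procRoot string) map[uint64]int {
	result := make(map[uint64]int)
	entries, err := os.ReadDir(procRoot)
	if err != nil {
		return result
	}
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a process directory
		}
		fdDir := filepath.Join(procRoot, e.Name(), "fd")
		fds, err := os.ReadDir(fdDir)
		if err != nil {
			continue // process already exited, or permission denied
		}
		for _, fd := range fds {
			target, err := os.Readlink(filepath.Join(fdDir, fd.Name()))
			if err != nil {
				continue
			}
			if inode, ok := parseSocketLink(target); ok {
				result[inode] = pid
			}
		}
	}
	return result
}

func main() {
	m := socketInodesByPID("/proc")
	fmt.Printf("found %d socket inodes\n", len(m))
}
```

With short-lived processes, a walk like this can finish after the process that triggered it has already exited, so the lookup fails again and the next packet triggers another full walk.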

This is a fairly modest performance boost (see the pprof screenshots below): FindProcessTuple goes from 89% of all pprof samples to 63%.

While this would be useful for #37121, which hits FindProcessTuple far more often, I'm on the fence as to whether we should merge this as-is, since we're redoing a lot of critical-path code for a relatively small performance change.

I ran a series of performance tests with Packetbeat, using a main and an 8.12 build, while running `while true; do wget elastic.co/robots.txt; sleep 2; done` in a separate window.

Before:

After:

As we can see, after the optimization the biggest CPU hog becomes parseProcNetProto, which parses /proc/net/{tcp,udp} and is hard to avoid.
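
For context on what that parsing involves, below is a minimal sketch of decoding one row of /proc/net/tcp. It is not the actual parseProcNetProto implementation: `parseTCPLine` is a hypothetical helper, it handles IPv4 rows only, and it assumes a little-endian host (the kernel prints the address in host byte order, so the bytes are reversed here).

```go
package main

import (
	"encoding/hex"
	"fmt"
	"net"
	"strconv"
	"strings"
)

// parseTCPLine decodes the local address, local port and socket inode
// from a single data row of /proc/net/tcp. Field positions follow the
// header "sl local_address rem_address st ... uid timeout inode".
func parseTCPLine(line string) (net.IP, uint16, uint64, error) {
	fields := strings.Fields(line)
	if len(fields) < 10 {
		return nil, 0, 0, fmt.Errorf("expected at least 10 fields, got %d", len(fields))
	}
	addrPort := strings.SplitN(fields[1], ":", 2)
	if len(addrPort) != 2 {
		return nil, 0, 0, fmt.Errorf("malformed local address %q", fields[1])
	}
	raw, err := hex.DecodeString(addrPort[0])
	if err != nil {
		return nil, 0, 0, fmt.Errorf("decoding address: %w", err)
	}
	if len(raw) != 4 {
		return nil, 0, 0, fmt.Errorf("only IPv4 rows are handled, got %d bytes", len(raw))
	}
	// Assumption: little-endian host, so reverse the byte order.
	ip := net.IPv4(raw[3], raw[2], raw[1], raw[0])
	port, err := strconv.ParseUint(addrPort[1], 16, 16)
	if err != nil {
		return nil, 0, 0, fmt.Errorf("decoding port: %w", err)
	}
	inode, err := strconv.ParseUint(fields[9], 10, 64)
	if err != nil {
		return nil, 0, 0, fmt.Errorf("decoding inode: %w", err)
	}
	return ip, uint16(port), inode, nil
}

func main() {
	line := "   0: 0100007F:0277 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12345 1 0000000000000000 100 0 0 10 0"
	ip, port, inode, err := parseTCPLine(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s:%d inode=%d\n", ip, port, inode)
}
```

The inode in the last decoded column is what ties a socket row back to a process, via the `socket:[<inode>]` links under `/proc/<pid>/fd`.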

If we want further improvements, I think our best bet is to avoid the "refresh" approach of constantly parsing /proc/ and instead refactor the entire ProcessWatcher to use netlink's sock_diag and proc connector APIs, which should allow us to receive events on process/socket creation.

Checklist

- My code follows the style guidelines of this project
- I have commented my code, particularly in hard-to-understand areas
- ~~I have made corresponding changes to the documentation~~
- ~~I have made corresponding change to the default configuration files~~
- I have added tests that prove my fix is effective or that my feature works
- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

mergify · 2023-12-08T18:34:57Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

elasticmachine · 2023-12-08T19:02:22Z

💔 Build Failed

Build stats

Duration: 28 min 30 sec

Pipeline error

This error is likely related to the pipeline itself (either incorrect syntax or an invalid configuration).

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine · 2023-12-08T21:09:48Z

💔 Tests Failed

Build stats

Start Time: 2023-12-08T20:42:18.628+0000
Duration: 27 min 21 sec

Test stats 🧪

| Test | Results |
|---|---|
| Failed | 8 |
| Passed | 1603 |
| Skipped | 1 |
| Total | 1612 |