Refactor ProcessWatcher to better handle lots of short-lived processes #37366

Draft: wants to merge 6 commits into main

fearful-symmetry
Contributor

Proposed commit message

closes(?): #37266

Addresses an issue discovered while profiling #37121. When the ProcessWatcher runs on a system where short-lived processes make network connections, it can use a considerable amount of CPU: every failed PID lookup refreshes the internal endpoint->pid mapping, and each refresh traverses all of /proc/ to gather inodes for every running process.
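
To make the cost of a refresh concrete, here is a minimal sketch of the per-process step: listing the fd directory of one PID and extracting socket inodes from the symlink targets, which on Linux look like socket:[12345]. This is not the Packetbeat implementation; the names socketInode and inodesForPID are hypothetical and only illustrate the kind of work that gets repeated for every running process on each refresh.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// socketInode extracts the inode number from an fd symlink target such as
// "socket:[12345]". It reports false for any other kind of target.
// Hypothetical helper, not taken from the PR.
func socketInode(link string) (uint64, bool) {
	const prefix, suffix = "socket:[", "]"
	if !strings.HasPrefix(link, prefix) || !strings.HasSuffix(link, suffix) {
		return 0, false
	}
	n, err := strconv.ParseUint(link[len(prefix):len(link)-len(suffix)], 10, 64)
	if err != nil {
		return 0, false
	}
	return n, true
}

// inodesForPID lists the socket inodes held by one process by reading the
// symlinks under <procRoot>/<pid>/fd. A full refresh repeats this for every
// PID, which is why it gets expensive on busy systems.
func inodesForPID(procRoot string, pid int) ([]uint64, error) {
	fdDir := filepath.Join(procRoot, strconv.Itoa(pid), "fd")
	entries, err := os.ReadDir(fdDir)
	if err != nil {
		// The process may already have exited, or we may lack permission.
		return nil, err
	}
	var inodes []uint64
	for _, e := range entries {
		link, err := os.Readlink(filepath.Join(fdDir, e.Name()))
		if err != nil {
			continue // fd was closed between ReadDir and Readlink
		}
		if ino, ok := socketInode(link); ok {
			inodes = append(inodes, ino)
		}
	}
	return inodes, nil
}

func main() {
	inodes, err := inodesForPID("/proc", os.Getpid())
	if err != nil {
		fmt.Println("could not read fd directory:", err)
		return
	}
	fmt.Println("socket inodes held by this process:", inodes)
}
```

A short-lived process can exit before its fd directory is read, so the lookup fails again on the next packet and triggers another full walk.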

This is a fairly modest performance boost (see the pprof screenshots below): FindProcessTuple goes from 89% of all pprof samples to 63%.

While this would be useful for #37121, which hits FindProcessTuple far more often, I'm on the fence about whether we should merge this as-is, since we're reworking a lot of critical-path code for a relatively small performance gain.

I ran a series of performance tests with Packetbeat, on a main build and an 8.12 build, while the following loop ran in a separate window:

while true; do wget elastic.co/robots.txt; sleep 2; done

Before:
Screenshot 2023-12-08 at 10 15 43 AM

After:
Screenshot 2023-12-08 at 10 16 29 AM

As we can see, after the optimization the biggest CPU hog becomes parseProcNetProto, which parses /proc/net/{tcp,udp}; that cost is hard to avoid.
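
For readers unfamiliar with the format, the sketch below parses one IPv4 row of /proc/net/tcp into a local endpoint and socket inode. It is an illustration only and does not reproduce parseProcNetProto; the names procNetEntry, parseIPv4Endpoint and parseProcNetLine are hypothetical, the sample row is made up, and the byte reversal assumes a little-endian host.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"net"
	"strconv"
	"strings"
)

// procNetEntry holds the fields needed to map an endpoint to a socket inode.
// Hypothetical type, not taken from Packetbeat.
type procNetEntry struct {
	LocalIP   net.IP
	LocalPort uint16
	Inode     uint64
}

// parseIPv4Endpoint decodes "0100007F:0035" style endpoints. The address is
// hex in host byte order (assumed little-endian here); the port is plain hex.
func parseIPv4Endpoint(s string) (net.IP, uint16, error) {
	host, port, ok := strings.Cut(s, ":")
	if !ok {
		return nil, 0, fmt.Errorf("malformed endpoint %q", s)
	}
	raw, err := hex.DecodeString(host)
	if err != nil || len(raw) != 4 {
		return nil, 0, fmt.Errorf("malformed IPv4 address %q", host)
	}
	p, err := strconv.ParseUint(port, 16, 16)
	if err != nil {
		return nil, 0, fmt.Errorf("malformed port %q: %w", port, err)
	}
	// Reverse the bytes to get the usual dotted-quad order.
	return net.IPv4(raw[3], raw[2], raw[1], raw[0]), uint16(p), nil
}

// parseProcNetLine parses a single data row. The column header row fails
// with an error, so callers are expected to skip it.
func parseProcNetLine(line string) (procNetEntry, error) {
	f := strings.Fields(line)
	if len(f) < 10 {
		return procNetEntry{}, fmt.Errorf("expected at least 10 fields, got %d", len(f))
	}
	ip, port, err := parseIPv4Endpoint(f[1]) // second column: local_address
	if err != nil {
		return procNetEntry{}, err
	}
	inode, err := strconv.ParseUint(f[9], 10, 64) // tenth column: inode
	if err != nil {
		return procNetEntry{}, fmt.Errorf("malformed inode %q: %w", f[9], err)
	}
	return procNetEntry{LocalIP: ip, LocalPort: port, Inode: inode}, nil
}

func main() {
	// Made-up sample row in the /proc/net/tcp layout.
	const sample = "   0: 0100007F:0035 00000000:0000 0A 00000000:00000000 00:00000000 00000000   101        0 23456 1 0000000000000000 100 0 0 10 0"
	e, err := parseProcNetLine(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s:%d inode=%d\n", e.LocalIP, e.LocalPort, e.Inode)
}
```

Every open socket adds one such row, so the parsing cost grows with the number of connections on the host regardless of how the PID lookup is cached.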

If we want further improvements, I think our best bet is to drop the "refresh" approach of constantly parsing /proc/ and instead refactor the entire ProcessWatcher to use netlink's sock_diag and proc connector APIs, which should allow us to receive events on process/socket creation.
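
As a rough picture of what an event-driven design could look like, the sketch below keeps the endpoint->pid table up to date by applying events rather than rescanning /proc/. It makes no netlink calls: the event type, its kinds and the endpointTable are all hypothetical stand-ins for whatever a sock_diag or proc connector listener would actually decode, so treat it as a shape for discussion rather than a design.

```go
package main

import "fmt"

// eventKind and event are hypothetical: they stand in for messages a netlink
// listener might produce. No real netlink API is used in this sketch.
type eventKind int

const (
	socketOpened eventKind = iota
	socketClosed
	processExited
)

type endpoint struct {
	IP   string
	Port uint16
}

type event struct {
	Kind     eventKind
	PID      int
	Endpoint endpoint // ignored for processExited
}

// endpointTable maps endpoints to owning PIDs and is mutated only by events.
type endpointTable struct {
	byEndpoint map[endpoint]int
}

func newEndpointTable() *endpointTable {
	return &endpointTable{byEndpoint: make(map[endpoint]int)}
}

// apply updates the table for a single event; there is no full rescan.
func (t *endpointTable) apply(ev event) {
	switch ev.Kind {
	case socketOpened:
		t.byEndpoint[ev.Endpoint] = ev.PID
	case socketClosed:
		delete(t.byEndpoint, ev.Endpoint)
	case processExited:
		// Drop every endpoint owned by the exiting process.
		for ep, pid := range t.byEndpoint {
			if pid == ev.PID {
				delete(t.byEndpoint, ep)
			}
		}
	}
}

// lookup is a plain map read; a miss does not trigger any refresh.
func (t *endpointTable) lookup(ep endpoint) (int, bool) {
	pid, ok := t.byEndpoint[ep]
	return pid, ok
}

func main() {
	table := newEndpointTable()
	ep := endpoint{IP: "10.0.0.5", Port: 443}

	table.apply(event{Kind: socketOpened, PID: 4242, Endpoint: ep})
	pid, ok := table.lookup(ep)
	fmt.Println("after open:", pid, ok)

	table.apply(event{Kind: processExited, PID: 4242})
	_, ok = table.lookup(ep)
	fmt.Println("after exit, still present:", ok)
}
```

The trade-off is that lookups become cheap map reads, while correctness depends on never missing an event, so some fallback scan would probably still be needed at startup.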

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
    - [ ] I have made corresponding changes to the documentation
    - [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@fearful-symmetry fearful-symmetry added the Team:Elastic-Agent Label for the Agent team label Dec 8, 2023
@fearful-symmetry fearful-symmetry self-assigned this Dec 8, 2023
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner December 8, 2023 18:34
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Dec 8, 2023
Contributor

mergify bot commented Dec 8, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Collaborator

💔 Build Failed


Build stats

  • Duration: 28 min 30 sec

Pipeline error 1

This error is likely related to the pipeline itself. Click here
and then you will see the error (either incorrect syntax or an invalid configuration).

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Collaborator

💔 Tests Failed


Build stats

  • Start Time: 2023-12-08T20:42:18.628+0000

  • Duration: 27 min 21 sec

Test stats 🧪

Test Results
Failed 8
Passed 1603
Skipped 1
Total 1612

Test errors 8

Expand to view the tests failures

Build&Test / packetbeat-rhel-9-rhel-9 / TestFindProcessTuple/New_client – github.com/elastic/beats/v7/packetbeat/procs
     === RUN   TestFindProcessTuple/New_client
        procs_test.go:308: 
            	Error Trace:	/var/lib/jenkins/workspace/PR-37366-2-ca18e25a-ab4d-4c49-b49f-7780b50fc9a3/src/github.com/elastic/beats/packetbeat/procs/procs_test.go:308
            	Error:      	Not equal: 
            	            	expected: "NMap"
            	            	actual  : ""
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1 +1 @@
            	            	-NMap
            	            	+
            	Test:       	TestFindProcessTuple/New_client
        procs_test.go:310: 
            	Error Trace:	/var/lib/jenkins/workspace/PR-37366-2-ca18e25a-ab4d-4c49-b49f-7780b50fc9a3/src/github.com/elastic/beats/packetbeat/procs/procs_test.go:310
            	Error:      	Not equal: 
            	            	expected: []string{"/usr/bin/nmap", "-sT", "-P443", "10.0.0.0/8"}
            	            	actual  : []string(nil)
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1,7 +1,2 @@
            	            	-([]string) (len=4) {
            	            	- (string) (len=13) "/usr/bin/nmap",
            	            	- (string) (len=3) "-sT",
            	            	- (string) (len=5) "-P443",
            	            	- (string) (len=10) "10.0.0.0/8"
            	            	-}
            	            	+([]string) <nil>
            	            	 
            	Test:       	TestFindProcessTuple/New_client
    --- FAIL: TestFindProcessTuple/New_client (0.00s)
     
    

Build&Test / packetbeat-rhel-9-rhel-9 / TestFindProcessTuple – github.com/elastic/beats/v7/packetbeat/procs
     === RUN   TestFindProcessTuple
    --- FAIL: TestFindProcessTuple (0.00s)
     
    

Build&Test / packetbeat-unitTest / TestFindProcessTuple/New_client – github.com/elastic/beats/v7/packetbeat/procs
     === RUN   TestFindProcessTuple/New_client
        procs_test.go:308: 
            	Error Trace:	/var/lib/jenkins/workspace/PR-37366-2-8ab235a3-c375-45d1-9ab3-3bd69025d6a7/src/github.com/elastic/beats/packetbeat/procs/procs_test.go:308
            	Error:      	Not equal: 
            	            	expected: "NMap"
            	            	actual  : ""
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1 +1 @@
            	            	-NMap
            	            	+
            	Test:       	TestFindProcessTuple/New_client
        procs_test.go:310: 
            	Error Trace:	/var/lib/jenkins/workspace/PR-37366-2-8ab235a3-c375-45d1-9ab3-3bd69025d6a7/src/github.com/elastic/beats/packetbeat/procs/procs_test.go:310
            	Error:      	Not equal: 
            	            	expected: []string{"/usr/bin/nmap", "-sT", "-P443", "10.0.0.0/8"}
            	            	actual  : []string(nil)
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1,7 +1,2 @@
            	            	-([]string) (len=4) {
            	            	- (string) (len=13) "/usr/bin/nmap",
            	            	- (string) (len=3) "-sT",
            	            	- (string) (len=5) "-P443",
            	            	- (string) (len=10) "10.0.0.0/8"
            	            	-}
            	            	+([]string) <nil>
            	            	 
            	Test:       	TestFindProcessTuple/New_client
    --- FAIL: TestFindProcessTuple/New_client (0.00s)
     
    

Build&Test / packetbeat-unitTest / TestFindProcessTuple – github.com/elastic/beats/v7/packetbeat/procs
     === RUN   TestFindProcessTuple
    --- FAIL: TestFindProcessTuple (0.00s)
     
    

Build&Test / packetbeat-windows-2022-windows-2022 / TestFindProcessTuple/New_client – github.com/elastic/beats/v7/packetbeat/procs
     === RUN   TestFindProcessTuple/New_client
        procs_test.go:308: 
            	Error Trace:	C:/Users/jenkins/workspace/PR-37366-2-531d2560-2972-4e97-be47-03209b795452/src/github.com/elastic/beats/packetbeat/procs/procs_test.go:308
            	Error:      	Not equal: 
            	            	expected: "NMap"
            	            	actual  : ""
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1 +1 @@
            	            	-NMap
            	            	+
            	Test:       	TestFindProcessTuple/New_client
        procs_test.go:310: 
            	Error Trace:	C:/Users/jenkins/workspace/PR-37366-2-531d2560-2972-4e97-be47-03209b795452/src/github.com/elastic/beats/packetbeat/procs/procs_test.go:310
            	Error:      	Not equal: 
            	            	expected: []string{"/usr/bin/nmap", "-sT", "-P443", "10.0.0.0/8"}
            	            	actual  : []string(nil)
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1,7 +1,2 @@
            	            	-([]string) (len=4) {
            	            	- (string) (len=13) "/usr/bin/nmap",
            	            	- (string) (len=3) "-sT",
            	            	- (string) (len=5) "-P443",
            	            	- (string) (len=10) "10.0.0.0/8"
            	            	-}
            	            	+([]string) <nil>
            	            	 
            	Test:       	TestFindProcessTuple/New_client
    --- FAIL: TestFindProcessTuple/New_client (0.00s)
     
    

Build&Test / packetbeat-windows-2022-windows-2022 / TestFindProcessTuple – github.com/elastic/beats/v7/packetbeat/procs
     === RUN   TestFindProcessTuple
    --- FAIL: TestFindProcessTuple (0.00s)
     
    

Build&Test / packetbeat-windows-2016-windows-2016 / TestFindProcessTuple/New_client – github.com/elastic/beats/v7/packetbeat/procs
     === RUN   TestFindProcessTuple/New_client
        procs_test.go:308: 
            	Error Trace:	C:/Users/jenkins/workspace/PR-37366-2-305185a5-1a46-4aea-8d88-9ccac6a4cc8c/src/github.com/elastic/beats/packetbeat/procs/procs_test.go:308
            	Error:      	Not equal: 
            	            	expected: "NMap"
            	            	actual  : ""
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1 +1 @@
            	            	-NMap
            	            	+
            	Test:       	TestFindProcessTuple/New_client
        procs_test.go:310: 
            	Error Trace:	C:/Users/jenkins/workspace/PR-37366-2-305185a5-1a46-4aea-8d88-9ccac6a4cc8c/src/github.com/elastic/beats/packetbeat/procs/procs_test.go:310
            	Error:      	Not equal: 
            	            	expected: []string{"/usr/bin/nmap", "-sT", "-P443", "10.0.0.0/8"}
            	            	actual  : []string(nil)
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1,7 +1,2 @@
            	            	-([]string) (len=4) {
            	            	- (string) (len=13) "/usr/bin/nmap",
            	            	- (string) (len=3) "-sT",
            	            	- (string) (len=5) "-P443",
            	            	- (string) (len=10) "10.0.0.0/8"
            	            	-}
            	            	+([]string) <nil>
            	            	 
            	Test:       	TestFindProcessTuple/New_client
    --- FAIL: TestFindProcessTuple/New_client (0.00s)
     
    

Build&Test / packetbeat-windows-2016-windows-2016 / TestFindProcessTuple – github.com/elastic/beats/v7/packetbeat/procs
     === RUN   TestFindProcessTuple
    --- FAIL: TestFindProcessTuple (0.00s)
     
    

Steps errors 13


packetbeat-rhel-9-rhel-9 - mage build unitTest
  • Took 2 min 18 sec . View more details here
  • Description: mage build unitTest
packetbeat-rhel-9-rhel-9 - mage build unitTest
  • Took 0 min 22 sec . View more details here
  • Description: mage build unitTest
packetbeat-rhel-9-rhel-9 - mage build unitTest
  • Took 0 min 22 sec . View more details here
  • Description: mage build unitTest
packetbeat-windows-2022-windows-2022 - mage build unitTest
  • Took 3 min 12 sec . View more details here
  • Description: mage build unitTest
packetbeat-windows-2022-windows-2022 - mage build unitTest
  • Took 1 min 31 sec . View more details here
  • Description: mage build unitTest
packetbeat-windows-2022-windows-2022 - mage build unitTest
  • Took 1 min 30 sec . View more details here
  • Description: mage build unitTest
packetbeat-windows-2016-windows-2016 - mage build unitTest
  • Took 3 min 4 sec . View more details here
  • Description: mage build unitTest
packetbeat-windows-2016-windows-2016 - mage build unitTest
  • Took 1 min 30 sec . View more details here
  • Description: mage build unitTest
packetbeat-windows-2016-windows-2016 - mage build unitTest
  • Took 1 min 30 sec . View more details here
  • Description: mage build unitTest
Error signal
  • Took 0 min 0 sec . View more details here
  • Description: Error 'hudson.AbortException: script returned exit code 1'

🐛 Flaky test report

❕ There are test failures but not known flaky tests.

Expand to view the summary

Genuine test errors 8

💔 There are test failures but not known flaky tests, most likely a genuine test failure.

  • Name: Build&Test / packetbeat-rhel-9-rhel-9 / TestFindProcessTuple/New_client – github.com/elastic/beats/v7/packetbeat/procs
  • Name: Build&Test / packetbeat-rhel-9-rhel-9 / TestFindProcessTuple – github.com/elastic/beats/v7/packetbeat/procs
  • Name: Build&Test / packetbeat-unitTest / TestFindProcessTuple/New_client – github.com/elastic/beats/v7/packetbeat/procs
  • Name: Build&Test / packetbeat-unitTest / TestFindProcessTuple – github.com/elastic/beats/v7/packetbeat/procs
  • Name: Build&Test / packetbeat-windows-2022-windows-2022 / TestFindProcessTuple/New_client – github.com/elastic/beats/v7/packetbeat/procs
  • Name: Build&Test / packetbeat-windows-2022-windows-2022 / TestFindProcessTuple – github.com/elastic/beats/v7/packetbeat/procs
  • Name: Build&Test / packetbeat-windows-2016-windows-2016 / TestFindProcessTuple/New_client – github.com/elastic/beats/v7/packetbeat/procs
  • Name: Build&Test / packetbeat-windows-2016-windows-2016 / TestFindProcessTuple – github.com/elastic/beats/v7/packetbeat/procs


@elasticmachine
Collaborator

💔 Build Failed


Build stats

  • Duration: 28 min 7 sec

Pipeline error 1

This error is likely related to the pipeline itself. Click here
and then you will see the error (either incorrect syntax or an invalid configuration).

❕ Flaky test report

No test was executed to be analysed.


@elasticmachine
Collaborator

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.


Build stats

  • Start Time: 2023-12-11T23:31:24.536+0000

  • Duration: 21 min 15 sec

Test stats 🧪

Test Results
Failed 0
Passed 814
Skipped 1
Total 815

Steps errors 1

Expand to view the steps failures

Error signal
  • Took 0 min 0 sec . View more details here
  • Description: Error 'org.jenkinsci.plugins.workflow.steps.FlowInterruptedException'


@elasticmachine
Collaborator

💚 Build Succeeded


Build stats

  • Start Time: 2023-12-11T23:45:44.611+0000

  • Duration: 50 min 16 sec

Test stats 🧪

Test Results
Failed 0
Passed 2369
Skipped 25
Total 2394

💚 Flaky test report

Tests succeeded.


@fearful-symmetry fearful-symmetry marked this pull request as draft December 18, 2023 15:48
Contributor

mergify bot commented Feb 5, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

Contributor

mergify bot commented Dec 26, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Dec 26, 2024
Contributor

mergify bot commented Dec 26, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b ProcessWatcherPerformanceRefactor upstream/ProcessWatcherPerformanceRefactor
git merge upstream/main
git push upstream ProcessWatcherPerformanceRefactor

Labels
backport-8.x Automated backport to the 8.x branch with mergify Team:Elastic-Agent Label for the Agent team

2 participants