
Memory queue: free event data sooner after acknowledgments #38042

Closed · faec wants to merge 1 commit from the memqueue-eventfree branch

Conversation

@faec (Contributor) commented Feb 15, 2024

Proposed commit message

Refactor the memory queue's ackLoop goroutine to allow earlier freeing of event data:

  • Previously, each event batch had its own acknowledgment channel, and ackLoop listened to each of them strictly in order. Now, all event batches use a shared acknowledgment channel, and ackLoop frees the data for acknowledged events immediately on receiving the signal from a batch, regardless of the order in which the signals are received.
  • Since acknowledgments no longer happen strictly in queue order, each batch now has a "done" flag owned by the ackLoop goroutine, and this is used to collect acknowledged batches when advancing the queue instead of reading the sequence of acknowledgment channels.

In benchmarks, CPU usage and ingestion rate were unaffected, while memory usage dropped by 0–5% depending on configuration, with the biggest improvements in the scale and throughput performance presets.

There is no visible functional difference from this change, except that event pointers are reset to null slightly sooner in the acknowledgment process.
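
To make the described mechanism concrete, here is a minimal Go sketch of the pattern (the batch, ackLoop, and ackedChan names follow the PR's terminology, but the types and fields are simplified assumptions, not the actual beats implementation):

// Sketch: one shared acknowledgment channel plus per-batch done flags.
type batch struct {
	events []interface{} // event pointers, nilled out on acknowledgment
	done   bool          // owned exclusively by the ackLoop goroutine
}

type ackLoop struct {
	ackedChan    chan *batch // shared by all batches in flight
	pendingOrder []*batch    // batches still held, in queue order
}

func (l *ackLoop) run() {
	for b := range l.ackedChan {
		// Free event data immediately, even if earlier batches in
		// queue order have not been acknowledged yet.
		for i := range b.events {
			b.events[i] = nil
		}
		b.done = true
		l.advanceQueue()
	}
}

// advanceQueue releases leading batches whose done flag is set, so the
// queue still advances strictly in order without per-batch channels.
func (l *ackLoop) advanceQueue() {
	for len(l.pendingOrder) > 0 && l.pendingOrder[0].done {
		l.pendingOrder = l.pendingOrder[1:]
	}
}

Because the done flag is owned by the ackLoop goroutine, as the PR description notes, no additional locking is needed around it in this sketch.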

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@faec faec added enhancement Team:Elastic-Agent Label for the Agent team labels Feb 15, 2024
@faec faec self-assigned this Feb 15, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Feb 15, 2024

mergify bot commented Feb 15, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @faec? 🙏
To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch, where \d is the version digit

@elasticmachine (Collaborator) commented Feb 15, 2024

💛 Build succeeded, but was flaky


cc @faec


@faec faec marked this pull request as ready for review February 15, 2024 20:41
@faec faec requested a review from a team as a code owner February 15, 2024 20:41
@faec faec requested review from belimawr and leehinman February 15, 2024 20:41
@elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@leehinman (Contributor) commented:

@faec this might be more trouble than it is worth, but what do you think about adding a benchmark test for this?

I was thinking two scenarios might be interesting: (1) a single producer and a single consumer going as fast as possible; (2) multiple producers going as fast as possible and multiple consumers, with one of those consumers being "slow".
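
A hypothetical shape for scenario (1), just to illustrate the structure (newTestQueue, Publish, Get, ACK, and testEvent are illustrative stand-ins, not the real memqueue API):

// Hypothetical benchmark sketch: a single producer and a single
// consumer, each going as fast as possible.
func BenchmarkSingleProducerConsumer(b *testing.B) {
	q := newTestQueue() // assumed test helper
	done := make(chan struct{})
	go func() {
		// Consumer: read and acknowledge one-event batches.
		for i := 0; i < b.N; i++ {
			batch := q.Get(1)
			batch.ACK()
		}
		close(done)
	}()
	b.ResetTimer()
	// Producer: publish b.N events.
	for i := 0; i < b.N; i++ {
		q.Publish(testEvent())
	}
	<-done
}

Scenario (2) would follow the same shape with several producer goroutines and a consumer that sleeps between Get calls to simulate slowness.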

@elasticmachine (Collaborator) commented:

💚 Build Succeeded


Build stats

  • Start Time: 2024-02-15T18:57:27.975+0000

  • Duration: 145 min 9 sec

Test stats 🧪

  • Failed: 0
  • Passed: 29109
  • Skipped: 2046
  • Total: 31155

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@leehinman (Contributor) left a comment

LGTM

// ackedChan is buffered so output workers don't block on acknowledgment
// if ackLoop is busy. (10 is probably more than we really need, but
// no harm in being safe.)
ackedChan: make(chan *batch, 10),

Not a blocker.

If the buffering is for the workers, could we perhaps make this dynamic based on the number of workers? Something like number of workers + 1? I'm just wondering: if you are on a 64-core Graviton server and the number of workers is 64, is 10 enough?
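
For illustration, the suggestion might look something like this (a sketch; numWorkers is an assumed settings field, not an existing variable in this code):

// Hypothetical: scale the buffer with the configured worker count
// instead of a fixed 10, so 64 workers can't all block at once.
ackedChan: make(chan *batch, numWorkers+1),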


faec commented Mar 5, 2024

This PR is on hold because #38166 achieves a much bigger improvement with a much smaller change, and might make this one completely redundant. Once that one's merged, I'll reevaluate whether this one is still worth pursuing.


faec commented Apr 8, 2024

Closing, since this is redundant with other changes that have been merged in the meantime.

@faec faec closed this Apr 8, 2024
@faec faec deleted the memqueue-eventfree branch April 8, 2024 13:54