Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repeater should leave CG on graceful shutdown #62

Merged
merged 2 commits into from
Dec 9, 2024

Conversation

travisdowns
Copy link
Member

Listen for SIGTERM and call all Shutdown() in the verifiers.

See commits and CORE-8470 for additional context.

We listen for SIGINT and do a graceful shutdown, but we should do the
same thing on SIGTERM too, as that's what systemd will send a
service, and we run kgo as a service (and anyway it just makes sense
to do graceful shutdown on SIGTERM, that's kind of what it's there
for).

The main reason to do graceful shutdown is to commit any cached offsets
and (especially) to leave the consumer group explicitly, otherwise the
CG will stay in the Stable state until heartbeat timeout, which causes
problems for re-joining members.
The verifiers expose a Shutdown() method but we never call it.

This means that the clients are never closed, so consumers never
explicitly leave their CG, which causes problems.

Fixes CORE-8470.
@travisdowns travisdowns changed the title Td kgo leave group Repeater should leave CG on graceful shutdown Dec 6, 2024
@travisdowns
Copy link
Member Author

Example of pre-fix and post-fix restarts:

image

Note that many of the topics have 0 batches written on the left, these are hung (lines are overlapping so it's hard to see, but it's about half the topics). The first two red lines, are two restarts without the fix. Note that all topics go non-zero (since we always write the initial number of tokens), but many go back to zero (not all, it's racy so you can see some which "cross over" and keep running).

The third restart is after the fix, and all topics have load now. Interestingly, the batch rate of the highest throughput topics drops (e.g., blue series at the top) goes down once all generators are up: I attribute this to increased latency due to more load generators which directly impacts these "repeater" loads since the throughput (if not throttled) is directly related to latency as there are a fixed number of tokens in flight.

@travisdowns
Copy link
Member Author

@StephanDollberg I added you on here for a lack of a better idea, feel free to recommend other/alternate reviewers as well.

@travisdowns travisdowns merged commit bdc41ee into main Dec 9, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants