KAFKA-18569: New consumer close may wait on unneeded FindCoordinator #18590

frankvicky · 2025-01-17T08:05:44Z

JIRA: KAFKA-18569
Please refer to the ticket for further details.

In short, when closing, the new consumer may wait for an unsent FindCoordinator request to go out, even after the commit/leaveGroup stages of close are done.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

frankvicky · 2025-01-17T08:08:42Z

Hi @lianetm @kirktrue @chia7712
Please take a look when you have a free cycle.
Many thanks 🙇🏼

kirktrue · 2025-01-17T18:52:30Z

Thanks for the PR @frankvicky!

I was curious if you had a chance to look into using the pollOnClose() approach that was suggested in the Jira. If that approach works, we wouldn't need an extra ApplicationEvent.

Thanks!

frankvicky · 2025-01-18T09:18:36Z

Hi @kirktrue
Thanks for the review.
You're right, it's easier than having a new event. 😺
Previously I thought that following CommitRequestManager would give us a unified close style.

kirktrue

Thanks for the refresh on the PR @frankvicky! This looks much more succinct.

I'm still unsure what the behavior is for this sequence of events:

1. The coordinator is marked as unknown
2. CoordinatorRequestManager.poll() is called and creates a new FindCoordinatorRequest
3. The NetworkClientDelegate sends the request to the broker
4. Consumer.close() is called with a timeout of 30 seconds
5. ConsumerNetworkThread.sendUnsentRequests() is called

In step 5, won't it continue to loop for ~30 seconds because the find request created in step 2 (and sent in step 3) is still inflight when ConsumerNetworkThread.sendUnsentRequests() is called?

do {
    networkClientDelegate.poll(timer.remainingMs(), timer.currentTimeMs());
    timer.update();
} while (timer.notExpired() && networkClientDelegate.hasAnyPendingRequests());

NetworkClientDelegate.hasAnyPendingRequests() will return true while there are any in-flight requests.

Any thoughts?

Thanks!
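
To make the concern above concrete, here is a minimal simulation of that do/while shape. It is only a sketch: `CloseLoopSketch`, `FakeTimer`, and `blockTimeMs` are hypothetical names, not Kafka classes, and the network poll is replaced by a manual clock. It suggests that if the in-flight request never completes, the loop blocks for the full close timeout.

```java
/** Hypothetical stand-in for the close loop; none of these types are Kafka classes. */
final class CloseLoopSketch {

    /** Manual clock with a deadline, loosely modelled on the timer used in the quoted loop. */
    static final class FakeTimer {
        private final long deadlineMs;
        private long nowMs;

        FakeTimer(long timeoutMs) {
            this.deadlineMs = timeoutMs;
        }

        void advance(long ms) {
            nowMs += ms;
        }

        boolean notExpired() {
            return nowMs < deadlineMs;
        }

        long elapsedMs() {
            return nowMs;
        }
    }

    /**
     * Runs the same do/while shape as the quoted loop. Each simulated poll takes pollMs;
     * responseAtMs is the time at which the in-flight request completes
     * (Long.MAX_VALUE means it never completes). Returns how long the loop blocked.
     */
    static long blockTimeMs(long closeTimeoutMs, long pollMs, long responseAtMs) {
        FakeTimer timer = new FakeTimer(closeTimeoutMs);
        boolean pending;
        do {
            timer.advance(pollMs);                        // stands in for poll() + timer.update()
            pending = timer.elapsedMs() < responseAtMs;   // stands in for hasAnyPendingRequests()
        } while (timer.notExpired() && pending);
        return timer.elapsedMs();
    }

    public static void main(String[] args) {
        // The response arrives after 500 ms: the loop exits early.
        System.out.println(blockTimeMs(30_000, 100, 500));            // prints 500
        // The response never arrives: the loop runs until the close timeout.
        System.out.println(blockTimeMs(30_000, 100, Long.MAX_VALUE)); // prints 30000
    }
}
```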

frankvicky · 2025-01-22T15:39:09Z

Hi @kirktrue,

Thanks for the review.
It's tricky to have NetworkClientDelegate ignore the FindCoordinatorRequest since the request is stored in NetworkClient#inFlightRequests, and we shouldn't manipulate this property outside of NetworkClient.
I considered calling completeExceptionally on the in-flight FindCoordinatorRequest when closing, but I don't see any existing logic that does this.

frankvicky · 2025-01-23T08:21:44Z

Currently, testClose will time out at this line:

kafka/core/src/test/scala/integration/kafka/api/ConsumerBounceTest.scala

Line 251 in 3276759

checkCloseWithClusterFailure(numRecords, "group4", "group5", groupProtocol)

It seems that the behavior described in the comment is not followed:

kafka/core/src/test/scala/integration/kafka/api/ConsumerBounceTest.scala

Lines 300 to 304 in 3276759

    
/**
 * Consumer is closed while all brokers are unavailable. Cannot rebalance or commit offsets since
 * there is no coordinator, but close should timeout and return. If close is invoked with a very
 * large timeout, close should timeout after request timeout.
 */

Updated:

kafka/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AsyncKafkaConsumer.java

Line 1496 in bdc92fd

ConsumerUtils.getResult(futureToAwait, timer);

Now testClose gets stuck here because of the max-value timeout. We should have a way to let the background thread know that the consumer is closing.

lianetm · 2025-01-23T21:50:15Z

Hey here, I don't quite get how the pollOnClose approach will solve the issue, basically because it still leaves a lot of time for unneeded FindCoord to be generated and block the network thread close, doesn't it?

The pollOnClose runs when we're closing the network thread, actually right before we block on sendUnsentRequests. This means that we may have already a FindCoord generated after the consumer close committed and left the group, correct?

To find a solution, let's look at the classic consumer first, this is my understanding:

it does attempt to FindCoord to commit on close of course, here:

kafka/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java

Lines 980 to 982 in 40890fa

    
maybeAutoCommitOffsetsSync(timer);
while (pendingAsyncCommits.get() > 0 && timer.notExpired()) {
    ensureCoordinatorReady(timer);

(both maybeAutoCommitOffsetsSync and the pending async will wait until they findCoord if needed).

it does not attempt to FindCoord when attempting to maybeLeaveGroup. After the commit-on-close phase completes, the classic consumer shuts down the HB thread (so no more proactive FindCoord), and then sends the leave request (only if there is a known coordinator; no-op if it's unknown)

kafka/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java

Line 1170 in 40890fa

if (isDynamicMember() && !coordinatorUnknown() &&

Correct me if I'm wrong, but if that's the behaviour, could it be achieved in the new consumer by letting autoCommitOnClose run with the background thread ensuring a coordinator, and then, right after it, signalling to the CoordinatorRequestManager that it's closing (the same effect as the HB thread shutdown in the classic consumer, I would say), so that it does not generate any more FindCoord requests? That's what comes to mind, but let me know your thoughts. Thanks!
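
The idea proposed above could be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: `CoordinatorLookupSketch`, `signalClose()`, and the string-valued request are all hypothetical, chosen only to show a manager that stops producing lookups once it has been told the consumer is closing.

```java
import java.util.Optional;

/** Hypothetical, simplified request manager; not the real CoordinatorRequestManager. */
final class CoordinatorLookupSketch {
    private boolean coordinatorKnown;
    private boolean closing;

    void markCoordinatorKnown() {
        coordinatorKnown = true;
    }

    void markCoordinatorUnknown() {
        coordinatorKnown = false;
    }

    /** Assumed to be called once the commit and leave-group stages of close are done. */
    void signalClose() {
        closing = true;
    }

    /** Produces a lookup request only if the coordinator is unknown and we are not closing. */
    Optional<String> poll() {
        if (closing || coordinatorKnown) {
            return Optional.empty();
        }
        return Optional.of("FindCoordinator");
    }

    public static void main(String[] args) {
        CoordinatorLookupSketch manager = new CoordinatorLookupSketch();
        System.out.println(manager.poll().isPresent()); // prints true: coordinator unknown
        manager.signalClose();
        System.out.println(manager.poll().isPresent()); // prints false: no lookups while closing
    }
}
```

With a flag like this, nothing new would be queued for the network thread to wait on after the coordinator-dependent stages of close have finished; requests already in flight are a separate question.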

frankvicky · 2025-01-24T14:24:33Z

Hi @lianetm @kirktrue,
Sorry for interrupting the topic of this patch.
I noticed a potential behavior difference between the classic consumer and the async consumer while preparing this patch.

In the classic consumer, the timeout respects request.timeout.ms.

kafka/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ClassicKafkaConsumer.java

Line 1140 in 8c0a0e0

final Timer closeTimer = createTimerForRequest(timeout);

kafka/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ClassicKafkaConsumer.java

Lines 1130 to 1134 in 8c0a0e0

    
private Timer createTimerForRequest(final Duration timeout) {
    // this.time could be null if an exception occurs in constructor prior to setting the this.time field
    final Time localTime = (time == null) ? Time.SYSTEM : time;
    return localTime.timer(Math.min(timeout.toMillis(), requestTimeoutMs));
}

However, in the async consumer, this logic is either missing or only applies to individual requests.
Unlike the classic consumer, where request.timeout.ms works for the entire coordinator closing behavior, the async implementation handles timeouts differently.

kafka/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java

Lines 976 to 989 in 8c0a0e0

    
public void close(final Timer timer) {
    // we do not need to re-enable wakeups since we are closing already
    client.disableWakeups();
    try {
        maybeAutoCommitOffsetsSync(timer);
        while (pendingAsyncCommits.get() > 0 && timer.notExpired()) {
            ensureCoordinatorReady(timer);
            client.poll(timer);
            invokeCompletedOffsetCommitCallbacks();
        }
    } finally {
        super.close(timer);
    }
}

Should we align the behavior between async and classic consumers?
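
As a small worked example of the capping described above, the sketch below applies the same `Math.min` expression seen in `createTimerForRequest` to a few inputs. The class and method names (`CloseTimeoutSketch`, `effectiveCloseTimeoutMs`) are hypothetical, and the 30-second request timeout is only an assumed value for illustration.

```java
import java.time.Duration;

/** Hypothetical helper illustrating the min(close timeout, request timeout) rule. */
final class CloseTimeoutSketch {

    /** Caps the caller-supplied close timeout at the request timeout, both in milliseconds. */
    static long effectiveCloseTimeoutMs(Duration closeTimeout, long requestTimeoutMs) {
        return Math.min(closeTimeout.toMillis(), requestTimeoutMs);
    }

    public static void main(String[] args) {
        long requestTimeoutMs = 30_000L; // assumed value, for illustration only

        // A short close timeout is used as-is.
        System.out.println(effectiveCloseTimeoutMs(Duration.ofSeconds(5), requestTimeoutMs));
        // prints 5000

        // A very large close timeout is capped at the request timeout.
        System.out.println(effectiveCloseTimeoutMs(Duration.ofMillis(Long.MAX_VALUE), requestTimeoutMs));
        // prints 30000
    }
}
```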

lianetm · 2025-01-24T21:19:18Z

Hey @frankvicky, good finding. Agree that the behaviour is not aligned in the close timeout handling, so in practice the classic consumer.close will never wait for more than the request timeout if there is a call to close with a larger timeout (and that's indeed missing on the async close timeout)

Actually, the behaviour is explicitly called out in one of the tests:

https://github.com/lianetm/kafka/blob/023f9c26e60c0710891abd148cce52c1dadaf7cd/core/src/test/scala/integration/kafka/api/ConsumerBounceTest.scala#L300-L305

So I do agree that we need to align this. But just for my understanding, this is something else we need here to unblock these tests (the testClose specifically I imagine?), but not enough right? I expect we still need to deal with the initial situation to avoid issuing/blocking on unneeded FindCoord requests on close after the commit/leave have completed, agree? (just to make sure I'm not missing anything).

If my understanding is right then I think we should file a separate jira for the close timeout considering the request timeout, and if you can validate locally that it's the only fix required to enable the testClose then we enable it in that other PR (leaving this PR for the unneeded FindCoord issue and the testCloseDuringRebalance), let me know what you think.

frankvicky · 2025-01-25T03:21:31Z

Hi @lianetm
Yes, you are right. The request.timeout.ms is a separate issue.
I will create a new ticket to track this timeout-handling problem.

chia7712 · 2025-01-25T03:51:39Z

so in practice the classic consumer.close will never wait for more than the request timeout if there is a call to close with a larger timeout (and that's indeed missing on the async close timeout)

I agree that we should align the behavior with how it has functioned for a long time (f72203e). Additionally, we should document this behavior for both request.timeout.ms and close method.

frankvicky · 2025-01-25T06:04:13Z

Hi everyone,
I have updated the patch and run the test in a loop locally; testCloseDuringRebalance now passes consistently.

…assic consumer JIRA: KAFKA-18645 see discussion: apache#18590 (comment) In the classic consumer, the timeout respects request.timeout.ms. However, in the async consumer, this logic is either missing or only applies to individual requests. Unlike the classic consumer, where request.timeout.ms works for the entire coordinator closing behavior, the async implementation handles timeouts differently. We should align the close timeout-handling to enable ConsumerBounceTest#testClose

kirktrue · 2025-01-27T17:10:26Z

The old/new approach to include a specialized event makes sense. Thanks for the suggestion @lianetm!

lianetm

Thanks @frankvicky! Just one nit left. Also, please merge the latest trunk changes to pick up the latest test fixes, and I will check the build again. Thanks!

lianetm · 2025-01-27T18:58:31Z

...java/org/apache/kafka/clients/consumer/internals/events/StopFindCoordinatorOnCloseEvent.java

+ * limitations under the License.
+ */
+package org.apache.kafka.clients.consumer.internals.events;
+public class StopFindCoordinatorOnCloseEvent extends ApplicationEvent {


Should we add a Javadoc here? Mainly to describe that the purpose of this event is to ensure that the CoordinatorRequestManager does not generate FindCoordinator requests when the consumer is closing and has already completed the operations that require a coordinator.

frankvicky · 2025-01-28

Sure, I have just written a description for it. PTAL 😺

JIRA: KAFKA-18569 Please refer to ticker for further details

Co-authored-by: Lianet Magrans <[email protected]>

frankvicky · 2025-01-29T05:58:37Z

Failed test is handled by #18735

…18590) Reviewers: Lianet Magrans <[email protected]>, Kirk True <[email protected]>, Chia-Ping Tsai <[email protected]>

lianetm · 2025-01-29T19:29:22Z

Merged to trunk and cherry-picked to 4.0

github-actions bot added triage PRs from the community consumer clients small Small PRs labels Jan 17, 2025

chia7712 added the ci-approved label Jan 17, 2025

frankvicky force-pushed the KAKFA-18569 branch from fae88bc to 3bb4130 Compare January 17, 2025 12:40

kirktrue added KIP-848 The Next Generation of the Consumer Rebalance Protocol ctr Consumer Threading Refactor (KIP-848) labels Jan 17, 2025

frankvicky force-pushed the KAKFA-18569 branch from 3bb4130 to f6a878e Compare January 18, 2025 09:05

github-actions bot added the core Kafka Broker label Jan 18, 2025

frankvicky force-pushed the KAKFA-18569 branch 3 times, most recently from f3def68 to ba51a9d Compare January 19, 2025 23:46

kirktrue reviewed Jan 21, 2025

github-actions bot removed the triage PRs from the community label Jan 22, 2025

frankvicky force-pushed the KAKFA-18569 branch 4 times, most recently from b9fa0df to 97e53cb Compare January 23, 2025 04:34

frankvicky force-pushed the KAKFA-18569 branch from 97e53cb to 4eb61e0 Compare January 25, 2025 05:59

frankvicky mentioned this pull request Jan 25, 2025

KAFKA-18645: New consumer should align close timeout handling with classic consumer #18702

Open

3 tasks

frankvicky force-pushed the KAKFA-18569 branch from 4eb61e0 to fee2041 Compare January 25, 2025 07:11

frankvicky force-pushed the KAKFA-18569 branch from fee2041 to d236715 Compare January 26, 2025 02:04

frankvicky force-pushed the KAKFA-18569 branch from d236715 to 09fd01b Compare January 28, 2025 05:22

lianetm reviewed Jan 28, 2025

frankvicky force-pushed the KAKFA-18569 branch from 09fd01b to 5fede5a Compare January 28, 2025 16:53

lianetm reviewed Jan 28, 2025

frankvicky and others added 2 commits January 29, 2025 10:55

KAFKA-18569: New consumer close may wait on unneeded FindCoordinator

33581d2

JIRA: KAFKA-18569 Please refer to ticker for further details

Fix the typo at javadoc

9a2e706

Co-authored-by: Lianet Magrans <[email protected]>

frankvicky force-pushed the KAKFA-18569 branch from 7c92923 to 9a2e706 Compare January 29, 2025 02:55

lianetm approved these changes Jan 29, 2025

lianetm merged commit 9dd73d4 into apache:trunk Jan 29, 2025
7 of 9 checks passed

lianetm pushed a commit that referenced this pull request Jan 29, 2025

KAFKA-18569: New consumer close may wait on unneeded FindCoordinator (#…

90573b4

…18590) Reviewers: Lianet Magrans <[email protected]>, Kirk True <[email protected]>, Chia-Ping Tsai <[email protected]>
