System.ArgumentException: Unexpected records polled potentially thrown during a rebalance #415
Comments
I have added this app to test with; steps are documented here: https://github.com/jblackburn21/akka-streams-kafka-lab/blob/main/docs/akka-streams-kafka-issue-415.md
I have also pushed more detailed logging to this branch on my fork: https://github.com/jblackburn21/Akka.Streams.Kafka/tree/kafka-consumer-logging

Thanks @jblackburn21 - we're going to look into this; we're wondering if there's a relationship with #414, which we're also looking into right now.

@jblackburn21 we've been able to reproduce the issue! We'll update this ticket as we go.

Here's what has been observed so far:

The original Akka.Streams.Kafka code was ported from the alpakka-kafka repository in 2019. As far as we can tell, the event handling is the same.

There are significant differences between the Java org.apache.kafka and .NET Confluent.Kafka implementations: .NET Confluent.Kafka is based on the native C/C++ librdkafka library, while org.apache.kafka implements its own low-level network I/O.

The org.apache.kafka consumer client implementation is tightly coupled with the underlying driver; it implements a batching fetch on the network level for each

Some speculations:

I'm not sure how the org.apache.kafka client handles a partition revoked/lost event from the broker, but in the .NET consumer client it seems that the event is passed directly by the consumer actor to the partition event handlers.

It is possible that the .NET consumer client does not validate its buffer content when a repartition happens, so the buffer might retain a message with an invalid partition after the partition has been revoked/lost.

Researching what the client is doing, it appears that the newest default value of "partition.assignment.strategy" uses the new "cooperative strategy", which brings a behavior change: https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html#partition-assignment-strategy

In the new default, "partition.assignment.strategy" is set to

Here is the XML-DOC from ConsumerBuilder.SetPartitionsAssignedHandler:

Version Information
Version of Akka.NET? 1.5.33
Which Akka.NET Modules? Akka.Streams.Kafka
Describe the bug
We use the KafkaConsumer.CommittablePartitionedSource running in Kubernetes with autoscaling enabled. When pods are spun up or down, it is possible for the SubSourceLogic to get out of sync with the TopicPartition assignment on the consumer, which results in an exception being thrown and the consumer shutting down. The PartitionsAssignedHandler and PartitionsRevokedHandler handlers run as a side effect of _consumer.Consume(), and this doesn't appear to handle timing properly.
To Reproduce
Steps to reproduce the behavior:
NOTE: I added additional logging to KafkaConsumerActor in order to better capture the current state during rebalancing.
1. Run Kafka locally using Docker.
2. Create a topic with 10 partitions; in this case I'm using a members topic.
3. Run a producer so that messages are generated with a key to produce to all partitions.
4. Start 3 instances of a consumer:
5. Stop Consumer 1, then check the logs for Consumers 2 and 3 to see if they stopped. It may take multiple restarts of Consumer 1 to trigger the error.
Expected behavior
The KafkaConsumerActor should properly update the _requests and _requestors on PartitionsRevokedHandler so that it isn't expected to consume messages from partitions that are no longer assigned.
Actual behavior
When a rebalance happens, partitions are revoked and then new partitions are assigned.
First, the next _consumer.Consume() completes with zero records to process and zero assignments since the partitions have been consumed.
This can be seen with these logs from Consumer 2:
[13:53:40 INF] [2cadbea5-c6a8-40cf-ad57-3a757c315b7b] Messages requested from: [akka://KafkaStream/user/members-consumer/StreamSupervisor-1/$$j#1512204890], for: members [[2]]
[13:53:40 INF] [2cadbea5-c6a8-40cf-ad57-3a757c315b7b] Delayed poll when messages requested, periodic: False
[13:53:40 INF] [2cadbea5-c6a8-40cf-ad57-3a757c315b7b] Poll requested, periodic: False
[13:53:40 INF] [2cadbea5-c6a8-40cf-ad57-3a757c315b7b] Starting poll with rebalancing: False, 5 requests: members [[4]], members [[0]], members [[2]], members [[3]], members [[1]], 5 assignments: members [[0]], members [[1]], members [[2]], members [[3]], members [[4]]
[13:53:40 INF] [2cadbea5-c6a8-40cf-ad57-3a757c315b7b] Processing 0 records, 0 assignments:
The PartitionsRevokedHandler is then called, where the IPartitionEventHandler notifies the SubSourceLogic that partitions have been revoked and _rebalanceInProgress is set to true.
Depending on timing, a Poll message can be sent to the KafkaConsumerActor after the revoke and before new partitions are assigned. These logs show that when _consumer.Consume() is called, the assignments are empty, but there are still 3 active _requests from the SubSourceLogic.
When the _consumer.Consume() completes, it returns records from the newly assigned partitions.
Inside ProcessResult() there is a check to validate that the messages that were consumed are from the partitions that were requested. However, due to timing during a rebalance, these can be out of sync and a System.ArgumentException is thrown.
The exception is handled by ProcessExceptions() and the KafkaConsumerActor is stopped.
Screenshots
N/A
Environment
Dotnet 8 running on Mac, with kafka running in docker desktop
Additional context
Add any other context about the problem here.
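To make the mismatch described under Actual behavior easier to reason about, here is a minimal, dependency-free C# sketch. It is only a model: the TopicPartition record, the RebalanceRaceSketch class, the ValidatePolled method and the partition numbers are all hypothetical names and values chosen for illustration, and this is not the actual KafkaConsumerActor or ProcessResult() code. It assumes the check amounts to "every polled record must come from a requested partition".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for Confluent.Kafka's TopicPartition, so the sketch has no dependencies.
public readonly record struct TopicPartition(string Topic, int Partition);

public static class RebalanceRaceSketch
{
    // Models the check described for ProcessResult(): every polled record must come
    // from a partition with an outstanding request. Not the library's actual code.
    public static void ValidatePolled(
        IReadOnlySet<TopicPartition> requested,
        IEnumerable<TopicPartition> polled)
    {
        var unexpected = polled.Where(tp => !requested.Contains(tp)).Distinct().ToList();
        if (unexpected.Count > 0)
            throw new ArgumentException(
                "Unexpected records polled: " + string.Join(", ", unexpected));
    }

    public static void Main()
    {
        // Requests issued before the rebalance; partition numbers are illustrative.
        var staleRequests = new HashSet<TopicPartition>
        {
            new TopicPartition("members", 2),
            new TopicPartition("members", 3),
            new TopicPartition("members", 4),
        };

        // Records returned by the poll that straddles the rebalance: they belong to
        // newly assigned partitions that no sub-source has requested yet.
        var polled = new[]
        {
            new TopicPartition("members", 5),
            new TopicPartition("members", 6),
        };

        try
        {
            ValidatePolled(staleRequests, polled);
            Console.WriteLine("No mismatch");
        }
        catch (ArgumentException ex)
        {
            // This branch is taken: the polled partitions are not in the stale request set.
            Console.WriteLine(ex.Message);
        }
    }
}
```

In this model the exception comes purely from the request set and the polled records disagreeing, which matches the timing described above: the requests still reference the pre-rebalance partitions while the poll already returns records from the new assignment.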