How do we perform tests on a stretched kafka cluster? What parameters would need testing? #16
-
Having a stretched Kafka cluster in hand, how do we compare its performance with a regular Kafka cluster? Some things I can come up with are:
What else should be tested against a stretched Kafka cluster, and what methods/tools can I use for testing these?
Replies: 2 comments
-
I think these are great starting points for testing the performance and resilience of a stretched Kafka cluster. Each of these tests is highly relevant, and their impact is crucial to evaluate in a multi-cluster setup. Here's why they matter and how we can test them.

**Replication Lag.** Cross-cluster replication inherently experiences higher network latency than replication within a single cluster. Increased lag can affect ISR synchronization and lead to under-replicated partitions, impacting availability. We can test it by generating sustained produce load across clusters and monitoring ISR membership and under-replicated partitions.
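A minimal sketch of such a lag test, assuming a standard Kafka installation with its CLI tools on the PATH; the broker address and topic name are placeholders, not from the original discussion:

```shell
# Drive sustained load at a test topic (broker address and topic are placeholders).
kafka-producer-perf-test.sh \
  --topic stretch-lag-test \
  --num-records 1000000 \
  --record-size 1024 \
  --throughput 10000 \
  --producer-props bootstrap.servers=cluster-a-broker-1:9092 acks=all

# While the load runs, list partitions whose followers have fallen out of the ISR.
kafka-topics.sh --bootstrap-server cluster-a-broker-1:9092 \
  --describe --under-replicated-partitions

# Per-follower lag is also exposed over JMX, e.g. the broker metric
# kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica
```

Running the same load against a single-site cluster gives the baseline to compare the stretched cluster's lag against.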
**Failover Scenarios.** A failed non-central cluster should not disrupt operations, but a failed central cluster might. Cross-cluster failover introduces network overhead and potential data inconsistencies.
**Central Cluster Failure.** If the central cluster is responsible for the controller quorum or metadata, its failure could cause a catastrophic outage. Ensuring metadata redundancy across clusters is crucial for recovery. How to test:
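One way to sketch the central-cluster failure test in a KRaft deployment; host and service names are placeholders, and `kafka-metadata-quorum.sh` requires Kafka 3.3+:

```shell
# 1. Record the current metadata quorum leader before the failure.
kafka-metadata-quorum.sh --bootstrap-server cluster-a-broker-1:9092 \
  describe --status

# 2. Simulate loss of the central cluster by stopping its controller nodes
#    (host names and the systemd unit name are assumptions).
for host in controller-a-1 controller-a-2 controller-a-3; do
  ssh "$host" 'sudo systemctl stop kafka'
done

# 3. From a surviving cluster, check whether metadata operations still succeed.
kafka-topics.sh --bootstrap-server cluster-b-broker-1:9092 --list

# 4. Restart the controllers and re-run step 1 to see how long quorum recovery takes.
```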
-
**Network Partition Testing.** In a stretched Kafka cluster, network issues between clusters can cause partitioning (i.e., temporary loss of communication). We need to determine how well Kafka handles ISR shrinking, leader re-elections, and producer/consumer failovers in such cases. Can we use iptables or tc (Traffic Control) to create artificial network disruptions between clusters? We could then monitor Kafka logs for ISR shrinkage and leader re-election events, use `kafka-topics.sh --describe` to check whether partition replicas are still available, then restore connectivity (`iptables -F` to remove the rules) and track how long it takes for…

**Consumer Read Locality.** Consumers should prefer local replicas over cross-cluster replicas whenever possible to minimize network latency. Without proper rack awareness, consumers might unnecessarily fetch data from remote brokers, increasing latency and egress costs. Set broker.rack in Kafka's config to match the cluster's region or availability zone.

**Cross-Cluster Topic Balancing.** Topics in a stretched cluster should be evenly distributed across clusters; unbalanced topic placement can cause some brokers to be overloaded while others remain idle. Manually create topics with skewed partition replication (e.g., all replicas in Cluster A) and observe the resulting imbalance.

**Metadata Quorum Survival.** In KRaft-based Kafka, metadata leaders must survive cross-cluster failures. Identify the metadata leaders using …

**Cross-Cluster Client Failover.** Another important test is cross-cluster client failover. If a Kafka broker fails, clients should automatically reconnect to another available broker without manual intervention. To test this we can follow the steps below:
- Producer Failover
- Consumer Failover
- Measuring Recovery Time
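The producer-failover and recovery-time steps above can be sketched roughly as follows. This is a sketch under assumptions: host and topic names are placeholders, and it assumes the console producer exits non-zero while no broker is reachable, which may need tightening in practice:

```shell
BOOTSTRAP=cluster-a-broker-1:9092,cluster-b-broker-1:9092
TOPIC=failover-test   # hypothetical test topic

# Simulate broker loss on one side of the stretch (host/unit names are assumptions).
ssh cluster-a-broker-1 'sudo systemctl stop kafka'
start=$(date +%s)

# Probe until a produce round-trips again via the surviving bootstrap servers.
until echo "probe" | kafka-console-producer.sh \
    --bootstrap-server "$BOOTSTRAP" --topic "$TOPIC" \
    --request-required-acks all >/dev/null 2>&1; do
  sleep 1
done

echo "producer recovered after $(( $(date +%s) - start ))s"
```

The same loop with `kafka-console-consumer.sh` gives a rough consumer-failover time.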
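For the consumer read-locality point above, a sketch of the rack-awareness settings (KIP-392 follower fetching); the rack and topic names are placeholders:

```shell
# Broker side (server.properties), per site:
#   broker.rack=dc-a
#   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# Consumer side: set client.rack to the consumer's local site, then verify that
# fetches are served by local brokers (e.g. via broker request metrics or logs).
kafka-console-consumer.sh --bootstrap-server cluster-a-broker-1:9092 \
  --topic failover-test --consumer-property client.rack=dc-a \
  --from-beginning --max-messages 10
```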