How do we perform tests on a stretched kafka cluster? What parameters would need testing? #16
-
Having a stretched Kafka cluster in hand, how do we compare its performance with a regular Kafka cluster? Some things I can come up with are:
What else should be tested against a stretched Kafka cluster, and what methods/tools can I use for testing these?
Replies: 2 comments
-
I think these are great starting points for testing the performance and resilience of a stretched Kafka cluster. Each of these tests is highly relevant, and their impact is crucial to evaluate in a multi-cluster setup. Here's why they matter and how we can test them.

**Replication Lag.** Cross-cluster replication inherently experiences higher network latency than replication within a single cluster. Increased lag can affect ISR synchronization and lead to under-replicated partitions, impacting availability. We can test it by generating sustained produce load across clusters and monitoring ISR membership and under-replicated partitions.
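A minimal sketch of such a lag test, assuming a standard Kafka installation with its CLI tools on the PATH; the broker address and topic name are placeholders, not from the original discussion:

```shell
# Drive sustained load at a test topic (broker address and topic are placeholders).
kafka-producer-perf-test.sh \
  --topic stretch-lag-test \
  --num-records 1000000 \
  --record-size 1024 \
  --throughput 10000 \
  --producer-props bootstrap.servers=cluster-a-broker-1:9092 acks=all

# While the load runs, list partitions whose followers have fallen out of the ISR.
kafka-topics.sh --bootstrap-server cluster-a-broker-1:9092 \
  --describe --under-replicated-partitions

# Per-follower lag is also exposed over JMX, e.g. the broker metric
# kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica
```

Running the same load against a single-site cluster gives the baseline to compare the stretched cluster's lag against.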
**Failover Scenarios.** A failed non-central cluster should not disrupt operations, but a failed central cluster might. Cross-cluster failover introduces network overhead and potential data inconsistencies.
**Central Cluster Failure.** If the central cluster is responsible for the controller quorum or metadata, its failure could cause a catastrophic outage. Ensuring metadata redundancy across clusters is crucial for recovery. How to test:
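One way to sketch the central-cluster failure test in a KRaft deployment; host and service names are placeholders, and `kafka-metadata-quorum.sh` requires Kafka 3.3+:

```shell
# 1. Record the current metadata quorum leader before the failure.
kafka-metadata-quorum.sh --bootstrap-server cluster-a-broker-1:9092 \
  describe --status

# 2. Simulate loss of the central cluster by stopping its controller nodes
#    (host names and the systemd unit name are assumptions).
for host in controller-a-1 controller-a-2 controller-a-3; do
  ssh "$host" 'sudo systemctl stop kafka'
done

# 3. From a surviving cluster, check whether metadata operations still succeed.
kafka-topics.sh --bootstrap-server cluster-b-broker-1:9092 --list

# 4. Restart the controllers and re-run step 1 to see how long quorum recovery takes.
```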
-
**Network Partition Testing.** In a stretched Kafka cluster, network issues between clusters can cause partitioning (i.e., temporary loss of communication). We need to determine how well Kafka handles ISR shrinking, leader re-elections, and producer/consumer failovers in such cases. Can we use iptables or tc (Traffic Control) to create artificial network disruptions between clusters? We could then monitor Kafka logs for ISR shrinkage and leader re-election events, use `kafka-topics.sh --describe` to check whether partition replicas are still available, then restore connectivity (`iptables -F` to remove the rules) and track how long it takes for…

**Consumer Read Locality.** Consumers should prefer local replicas over cross-cluster replicas whenever possible to minimize network latency. Without proper rack awareness, consumers might unnecessarily fetch data from remote brokers, increasing latency and egress costs. Set broker.rack in Kafka's config to match the cluster's region or availability zone.

**Cross-Cluster Topic Balancing.** Topics in a stretched cluster should be evenly distributed across clusters; unbalanced topic placement can cause some brokers to be overloaded while others remain idle. Manually create topics with skewed partition replication (e.g., all replicas in Cluster A) and observe the resulting imbalance.

**Metadata Quorum Survival.** In KRaft-based Kafka, metadata leaders must survive cross-cluster failures. Identify the metadata leaders using …

**Cross-Cluster Client Failover.** Another important test is cross-cluster client failover. If a Kafka broker fails, clients should automatically reconnect to another available broker without manual intervention. To test this we can follow the steps below:
- Producer Failover
- Consumer Failover
- Measuring Recovery Time
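The producer-failover and recovery-time steps above can be sketched roughly as follows. This is a sketch under assumptions: host and topic names are placeholders, and it assumes the console producer exits non-zero while no broker is reachable, which may need tightening in practice:

```shell
BOOTSTRAP=cluster-a-broker-1:9092,cluster-b-broker-1:9092
TOPIC=failover-test   # hypothetical test topic

# Simulate broker loss on one side of the stretch (host/unit names are assumptions).
ssh cluster-a-broker-1 'sudo systemctl stop kafka'
start=$(date +%s)

# Probe until a produce round-trips again via the surviving bootstrap servers.
until echo "probe" | kafka-console-producer.sh \
    --bootstrap-server "$BOOTSTRAP" --topic "$TOPIC" \
    --request-required-acks all >/dev/null 2>&1; do
  sleep 1
done

echo "producer recovered after $(( $(date +%s) - start ))s"
```

The same loop with `kafka-console-consumer.sh` gives a rough consumer-failover time.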
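For the consumer read-locality point above, a sketch of the rack-awareness settings (KIP-392 follower fetching); the rack and topic names are placeholders:

```shell
# Broker side (server.properties), per site:
#   broker.rack=dc-a
#   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# Consumer side: set client.rack to the consumer's local site, then verify that
# fetches are served by local brokers (e.g. via broker request metrics or logs).
kafka-console-consumer.sh --bootstrap-server cluster-a-broker-1:9092 \
  --topic failover-test --consumer-property client.rack=dc-a \
  --from-beginning --max-messages 10
```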