Optimizing Kafka Stretch Clusters: Tackling Leader Election Instability with Submariner, Istio, and Cilium #14

ghostroot007 · 2025-02-13T04:37:36Z

ghostroot007
Feb 13, 2025

In a Kafka stretch cluster deployed across multiple Kubernetes clusters using Submariner for cross-cluster communication, a leader election issue arises due to high network latency between clusters. The Raft-based controller quorum struggles to achieve consensus, leading to frequent leadership changes and instability.

What are the possible root causes of this instability?
How can you optimize the configuration of both Kafka and Submariner to mitigate this issue?
If using an alternative to Submariner (e.g., Istio or Cilium), how would you approach solving the issue differently?

Answered by aswinayyolath

aswinayyolath · 2025-02-13T05:24:59Z

aswinayyolath
Feb 13, 2025
Maintainer

I hope the following answers your question.

Root Causes

Submariner connects clusters through IPsec or WireGuard tunnels, and the tunnel overhead combined with high inter-cluster latency can slow Raft leader election.
Kafka's controller quorum requires a majority of voters to elect a leader; if inter-cluster communication is slow, vote and fetch requests time out and leader re-election becomes unstable.
If the voters in controller.quorum.voters are not well distributed across clusters, one cluster can hold a disproportionate share of the votes, and losing it (or being partitioned from it) leaves the remaining voters without a majority, so no leader can be elected.
If the Submariner gateway nodes are under heavy load, packet drops or delayed responses can disrupt Kafka's leader election.
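
To make the voter-placement point concrete, here is a minimal sketch of the majority arithmetic. The cluster names and voter counts are hypothetical, and the helper functions are illustrative only, not part of Kafka or Submariner.

```python
from typing import Dict, List


def majority(total_voters: int) -> int:
    """Smallest number of voters that forms a Raft majority."""
    return total_voters // 2 + 1


def clusters_whose_loss_breaks_quorum(voters_per_cluster: Dict[str, int]) -> List[str]:
    """Return the clusters whose outage leaves fewer than a majority of voters."""
    total = sum(voters_per_cluster.values())
    needed = majority(total)
    return sorted(
        name
        for name, count in voters_per_cluster.items()
        if total - count < needed
    )


# Hypothetical layouts: names and counts are assumptions for illustration.
two_site = {"cluster-a": 2, "cluster-b": 1}
three_site = {"cluster-a": 1, "cluster-b": 1, "cluster-c": 1}

print(clusters_whose_loss_breaks_quorum(two_site))
print(clusters_whose_loss_breaks_quorum(three_site))
```

With three voters split 2/1, losing the cluster that holds two voters makes the quorum unavailable, whereas a 1/1/1 split survives the loss of any single cluster.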

Optimizations for Kafka & Submariner

Increase the election timeout rather than reducing it, but keep it within reasonable limits (e.g., raise controller.quorum.election.timeout.ms, and the related controller.quorum.fetch.timeout.ms, from 2000 ms to 5000 ms to tolerate minor latency while still detecting real failures quickly).
If you are using IPsec, consider switching the Submariner cable driver to WireGuard for lower overhead.
Place the controller.quorum.voters carefully so that the controllers are not all in a high-latency cluster. This is the most important point.
Tune network buffer sizes and TCP settings (net.ipv4.tcp_rmem, net.ipv4.tcp_wmem, net.ipv4.tcp_congestion_control) for more efficient data flow.
Review log.retention.ms and log.segment.bytes so that frequent log cleanup does not cause unnecessary I/O during elections.
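
One rough way to sanity-check a timeout value is to compare it with the tail of the measured inter-cluster round-trip time. The sketch below is only an illustration: the RTT samples are made up, the helper names are hypothetical, and the idea that the timeout should cover several tail round trips is a rule-of-thumb assumption rather than official Kafka guidance.

```python
import math
from typing import List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def timeout_headroom(rtt_ms: List[float], timeout_ms: float, pct: float = 99.0) -> float:
    """How many tail (pct-th percentile) round trips fit inside the timeout."""
    return timeout_ms / percentile(rtt_ms, pct)


# Hypothetical RTT samples (ms) between controllers in different clusters.
rtt_samples = [40, 42, 45, 48, 50, 55, 60, 120, 300, 450]

for timeout in (2000, 5000):
    print(timeout, round(timeout_headroom(rtt_samples, timeout), 1))
```

If the headroom is only a few round trips, occasional latency spikes through the tunnel are more likely to trigger an unnecessary election, which is the case for raising the timeout.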

Alternative Approach Using Istio or Cilium

Istio-based Solution

I am still investigating Istio, so I don't have an answer for it yet.

Cilium-based Solution

I think you should use Cilium's ClusterMesh for more efficient inter-cluster service discovery and load balancing. Enable BPF-based policies so that packets are processed efficiently instead of relying on traditional tunneling. Monitor XDP (eXpress Data Path) statistics to analyze and reduce packet processing delays.

1 reply

ghostroot007 Feb 13, 2025
Author

Sounds good, thanks!
