CP/DP split: Add leader election #3092

bjee19 · 2025-02-04T22:11:01Z

Add leader election to allow data plane pods to only connect to the lead NGF pod. If control plane is scaled, only the leader is marked as ready and the backups are Unready so the data plane doesn't connect to them.

Problem: We want the NGF control plane to fail-over to another pod when the control plane pod goes down.

Solution: Only the leader pod is marked as ready by Kubernetes, and all connections from data plane pods are connected to the leader pod.

Testing: Added unit tests.

Verified that the NGF control plane is able to be scaled and the non-leader pods are marked as Unready.
Verified that all data plane pods only connect to the leader pod.
Verified upon deletion of the leader pod, another NGF pod takes its place as leader, becomes ready, and connects to the data plane pods.
Verified that the nginx gateway service only has the ready leader NGF pod in its endpoints.

Closes #2850

Checklist

Before creating a PR, run through this checklist and mark each as complete.

I have read the CONTRIBUTING doc
I have added tests that prove my fix is effective or that my feature works
I have checked that all unit tests pass after adding my changes
I have updated necessary documentation
I have rebased my branch onto main
I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.

internal/framework/events/event.go

codecov · 2025-02-04T22:14:35Z

Codecov Report

Attention: Patch coverage is 84.70588% with 13 lines in your changes missing coverage. Please review.

Project coverage is 89.96%. Comparing base (6337c97) to head (461bec3).
Report is 94 commits behind head on change/control-data-plane-split.

Files with missing lines	Patch %	Lines
internal/mode/static/manager.go	0.00%	13 Missing ⚠️

Additional details and impacted files

@@                         Coverage Diff                         @@
##           change/control-data-plane-split    #3092      +/-   ##
===================================================================
+ Coverage                            89.74%   89.96%   +0.21%     
===================================================================
  Files                                  109      116       +7     
  Lines                                11150    12478    +1328     
  Branches                                50       50              
===================================================================
+ Hits                                 10007    11226    +1219     
- Misses                                1083     1182      +99     
- Partials                                60       70      +10

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

salonichf5 · 2025-02-05T17:22:17Z

So, the leader lease will also change when we scale down the pods? Did you check for that case?

bjee19 · 2025-02-05T17:27:58Z

err, if we scale down the leader lease doesn't change, the unready non-leader NGF pods are the ones that get terminated, probably some kubernetes magic

salonichf5 · 2025-02-05T17:29:52Z

err, if we scale down the leader lease doesn't change, the unready non-leader NGF pods are the ones that get terminated, probably some kubernetes magic

okay good to know! thank you. PR looks good to me

sjberman · 2025-02-05T17:42:52Z

I would rephrase this PR/commit message a bit. We already do support leader election for failover. These changes specifically are to only allow the data plane pods to connect to the leader. If control plane is scaled, only the leader is marked as ready and the backups are Unready so the data plane doesn't connect to them.

internal/mode/static/manager.go

sjberman · 2025-02-05T18:12:46Z

Would you mind updating the image in the design document that we fixed?

…eader

bjee19 · 2025-02-06T20:04:34Z

Here's what the logs look like when an NGF Pod starts up and acquires the leader lease:

{"level":"info","ts":"2025-02-06T19:12:41Z","msg":"Starting NGINX Gateway Fabric in static mode","version":"edge","commit":"265b813dbe048e5a55b78e7d84729b571ded6294","date":"2025-02-05T19:40:39Z","dirty":"true"}
{"level":"info","ts":"2025-02-06T19:12:41Z","msg":"Starting manager"}
{"level":"info","ts":"2025-02-06T19:12:41Z","msg":"Nginx Gateway Fabric Pod will be marked as unready until it has the leader lease"}
{"level":"info","ts":"2025-02-06T19:12:41Z","msg":"starting server","name":"health probe","addr":"[::]:8081"}
{"level":"info","ts":"2025-02-06T19:12:41Z","logger":"controller-runtime.metrics","msg":"Starting metrics server"}
{"level":"info","ts":"2025-02-06T19:12:41Z","logger":"controller-runtime.metrics","msg":"Serving metrics server","bindAddress":":9113","secure":false}
{"level":"info","ts":"2025-02-06T19:12:42Z","msg":"attempting to acquire leader lease nginx-gateway/my-release-nginx-gateway-fabric-leader-election..."}
{"level":"info","ts":"2025-02-06T19:12:42Z","logger":"eventLoop.eventHandler","msg":"Reconfigured control plane.","batchID":2}
{"level":"info","ts":"2025-02-06T19:13:48Z","msg":"successfully acquired lease nginx-gateway/my-release-nginx-gateway-fabric-leader-election"}
{"level":"info","ts":"2025-02-06T19:13:48Z","logger":"telemetryJob","msg":"Starting cronjob"}
{"level":"info","ts":"2025-02-06T19:13:49Z","logger":"eventLoop.eventHandler","msg":"Handling events didn't result into NGINX configuration changes","batchID":9}
{"level":"info","ts":"2025-02-06T19:14:05Z","logger":"nginxUpdater.commandService","msg":"Creating connection for nginx pod: tmp-nginx-deployment-98f54f6bd-wnp95"}
{"level":"info","ts":"2025-02-06T19:14:06Z","logger":"nginxUpdater.commandService","msg":"Successfully connected to nginx agent tmp-nginx-deployment-98f54f6bd-wnp95"}
{"level":"info","ts":"2025-02-06T19:14:06Z","logger":"nginxUpdater.commandService","msg":"Successfully configured nginx for new subscription","pod":"tmp-nginx-deployment-98f54f6bd-wnp95"}
{"level":"info","ts":"2025-02-06T19:14:06Z","logger":"eventHandler","msg":"NGINX configuration was successfully updated"}

And here's what a non-leader NGF Pod looks like when its been running for a while:

{"level":"info","ts":"2025-02-06T19:12:42Z","msg":"Starting NGINX Gateway Fabric in static mode","version":"edge","commit":"265b813dbe048e5a55b78e7d84729b571ded6294","date":"2025-02-05T19:40:39Z","dirty":"true"}
{"level":"info","ts":"2025-02-06T19:12:42Z","msg":"Starting manager"}
{"level":"info","ts":"2025-02-06T19:12:42Z","msg":"Nginx Gateway Fabric Pod will be marked as unready until it has the leader lease"}
{"level":"info","ts":"2025-02-06T19:12:42Z","msg":"starting server","name":"health probe","addr":"[::]:8081"}
{"level":"info","ts":"2025-02-06T19:12:42Z","logger":"controller-runtime.metrics","msg":"Starting metrics server"}
{"level":"info","ts":"2025-02-06T19:12:42Z","logger":"controller-runtime.metrics","msg":"Serving metrics server","bindAddress":":9113","secure":false}
{"level":"info","ts":"2025-02-06T19:12:42Z","msg":"attempting to acquire leader lease nginx-gateway/my-release-nginx-gateway-fabric-leader-election..."}
{"level":"info","ts":"2025-02-06T19:12:42Z","logger":"eventLoop.eventHandler","msg":"Reconfigured control plane.","batchID":2}

internal/mode/static/manager.go

internal/mode/static/handler.go

bjee19 · 2025-02-07T23:11:33Z

@salonichf5 , re-requesting your review since a large chunk of restructuring was done

bjee19 requested a review from a team as a code owner February 4, 2025 22:11

github-actions bot added the enhancement New feature or request label Feb 4, 2025

bjee19 commented Feb 4, 2025

View reviewed changes

internal/framework/events/event.go Outdated Show resolved Hide resolved

bjee19 changed the title ~~Add leader election~~ CP/DP split: Add leader election Feb 4, 2025

bjee19 commented Feb 5, 2025

View reviewed changes

internal/mode/static/manager.go Outdated Show resolved Hide resolved

bjee19 requested a review from a team as a code owner February 5, 2025 19:36

Add leader election to allow data plane pods to only connect to the l…

265b813

…eader

bjee19 force-pushed the enh/leader-election-agent branch from 193d232 to 265b813 Compare February 5, 2025 19:41

Remove dependency on controller-runtime health probe and use ours

754a98b

sjberman reviewed Feb 6, 2025

View reviewed changes

internal/mode/static/manager.go Outdated Show resolved Hide resolved

bjee19 added 3 commits February 6, 2025 12:46

Refactor function placement

ed10855

Add health probe test

6f63196

Add code coverage on health probe test

a4df94c

sindhushiv reviewed Feb 7, 2025

View reviewed changes

internal/mode/static/handler.go Outdated Show resolved Hide resolved

bjee19 added 4 commits February 7, 2025 13:31

Refactor leader election to use runnable

f8ee138

Change some variable names and error messages

41cbf08

Add comments and more changes to variable names

f1c0389

Add another handler test

461bec3

bjee19 requested a review from sjberman February 7, 2025 23:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CP/DP split: Add leader election #3092

CP/DP split: Add leader election #3092

bjee19 commented Feb 4, 2025 •

edited

Loading

codecov bot commented Feb 4, 2025 •

edited

Loading

salonichf5 commented Feb 5, 2025

bjee19 commented Feb 5, 2025

salonichf5 commented Feb 5, 2025

sjberman commented Feb 5, 2025

sjberman commented Feb 5, 2025

bjee19 commented Feb 6, 2025

bjee19 commented Feb 7, 2025

CP/DP split: Add leader election #3092

Are you sure you want to change the base?

CP/DP split: Add leader election #3092

Conversation

bjee19 commented Feb 4, 2025 • edited Loading

Checklist

Release notes

codecov bot commented Feb 4, 2025 • edited Loading

Codecov Report

salonichf5 commented Feb 5, 2025

bjee19 commented Feb 5, 2025

salonichf5 commented Feb 5, 2025

sjberman commented Feb 5, 2025

sjberman commented Feb 5, 2025

bjee19 commented Feb 6, 2025

bjee19 commented Feb 7, 2025

bjee19 commented Feb 4, 2025 •

edited

Loading

codecov bot commented Feb 4, 2025 •

edited

Loading