Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CP/DP split: Add leader election #3092

Open
wants to merge 9 commits into
base: change/control-data-plane-split
Choose a base branch
from

Conversation

bjee19
Copy link
Contributor

@bjee19 bjee19 commented Feb 4, 2025

Add leader election to allow data plane pods to only connect to the lead NGF pod. If control plane is scaled, only the leader is marked as ready and the backups are Unready so the data plane doesn't connect to them.

Problem: We want the NGF control plane to fail-over to another pod when the control plane pod goes down.

Solution: Only the leader pod is marked as ready by Kubernetes, and all connections from data plane pods are connected to the leader pod.

Testing: Added unit tests.

  • Verified that the NGF control plane is able to be scaled and the non-leader pods are marked as Unready.
  • Verified that all data plane pods only connect to the leader pod.
  • Verified upon deletion of the leader pod, another NGF pod takes its place as leader, becomes ready, and connects to the data plane pods.
  • Verified that the nginx gateway service only has the ready leader NGF pod in its endpoints.

Closes #2850

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.


@bjee19 bjee19 requested a review from a team as a code owner February 4, 2025 22:11
@github-actions github-actions bot added the enhancement New feature or request label Feb 4, 2025
@bjee19 bjee19 changed the title Add leader election CP/DP split: Add leader election Feb 4, 2025
Copy link

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 84.70588% with 13 lines in your changes missing coverage. Please review.

Project coverage is 89.96%. Comparing base (6337c97) to head (461bec3).
Report is 94 commits behind head on change/control-data-plane-split.

Files with missing lines Patch % Lines
internal/mode/static/manager.go 0.00% 13 Missing ⚠️
Additional details and impacted files
@@                         Coverage Diff                         @@
##           change/control-data-plane-split    #3092      +/-   ##
===================================================================
+ Coverage                            89.74%   89.96%   +0.21%     
===================================================================
  Files                                  109      116       +7     
  Lines                                11150    12478    +1328     
  Branches                                50       50              
===================================================================
+ Hits                                 10007    11226    +1219     
- Misses                                1083     1182      +99     
- Partials                                60       70      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@salonichf5
Copy link
Contributor

So, the leader lease will also change when we scale down the pods? Did you check for that case?

@bjee19
Copy link
Contributor Author

bjee19 commented Feb 5, 2025

err, if we scale down the leader lease doesn't change, the unready non-leader NGF pods are the ones that get terminated, probably some kubernetes magic

@salonichf5
Copy link
Contributor

err, if we scale down the leader lease doesn't change, the unready non-leader NGF pods are the ones that get terminated, probably some kubernetes magic

okay good to know! thank you. PR looks good to me

@sjberman
Copy link
Collaborator

sjberman commented Feb 5, 2025

I would rephrase this PR/commit message a bit. We already do support leader election for failover. These changes specifically are to only allow the data plane pods to connect to the leader. If control plane is scaled, only the leader is marked as ready and the backups are Unready so the data plane doesn't connect to them.

@sjberman
Copy link
Collaborator

sjberman commented Feb 5, 2025

Would you mind updating the image in the design document that we fixed?

@bjee19 bjee19 requested a review from a team as a code owner February 5, 2025 19:36
@bjee19 bjee19 force-pushed the enh/leader-election-agent branch from 193d232 to 265b813 Compare February 5, 2025 19:41
@bjee19
Copy link
Contributor Author

bjee19 commented Feb 6, 2025

Here's what the logs look like when an NGF Pod starts up and acquires the leader lease:

{"level":"info","ts":"2025-02-06T19:12:41Z","msg":"Starting NGINX Gateway Fabric in static mode","version":"edge","commit":"265b813dbe048e5a55b78e7d84729b571ded6294","date":"2025-02-05T19:40:39Z","dirty":"true"}
{"level":"info","ts":"2025-02-06T19:12:41Z","msg":"Starting manager"}
{"level":"info","ts":"2025-02-06T19:12:41Z","msg":"Nginx Gateway Fabric Pod will be marked as unready until it has the leader lease"}
{"level":"info","ts":"2025-02-06T19:12:41Z","msg":"starting server","name":"health probe","addr":"[::]:8081"}
{"level":"info","ts":"2025-02-06T19:12:41Z","logger":"controller-runtime.metrics","msg":"Starting metrics server"}
{"level":"info","ts":"2025-02-06T19:12:41Z","logger":"controller-runtime.metrics","msg":"Serving metrics server","bindAddress":":9113","secure":false}
{"level":"info","ts":"2025-02-06T19:12:42Z","msg":"attempting to acquire leader lease nginx-gateway/my-release-nginx-gateway-fabric-leader-election..."}
{"level":"info","ts":"2025-02-06T19:12:42Z","logger":"eventLoop.eventHandler","msg":"Reconfigured control plane.","batchID":2}
{"level":"info","ts":"2025-02-06T19:13:48Z","msg":"successfully acquired lease nginx-gateway/my-release-nginx-gateway-fabric-leader-election"}
{"level":"info","ts":"2025-02-06T19:13:48Z","logger":"telemetryJob","msg":"Starting cronjob"}
{"level":"info","ts":"2025-02-06T19:13:49Z","logger":"eventLoop.eventHandler","msg":"Handling events didn't result into NGINX configuration changes","batchID":9}
{"level":"info","ts":"2025-02-06T19:14:05Z","logger":"nginxUpdater.commandService","msg":"Creating connection for nginx pod: tmp-nginx-deployment-98f54f6bd-wnp95"}
{"level":"info","ts":"2025-02-06T19:14:06Z","logger":"nginxUpdater.commandService","msg":"Successfully connected to nginx agent tmp-nginx-deployment-98f54f6bd-wnp95"}
{"level":"info","ts":"2025-02-06T19:14:06Z","logger":"nginxUpdater.commandService","msg":"Successfully configured nginx for new subscription","pod":"tmp-nginx-deployment-98f54f6bd-wnp95"}
{"level":"info","ts":"2025-02-06T19:14:06Z","logger":"eventHandler","msg":"NGINX configuration was successfully updated"}

And here's what a non-leader NGF Pod looks like when its been running for a while:

{"level":"info","ts":"2025-02-06T19:12:42Z","msg":"Starting NGINX Gateway Fabric in static mode","version":"edge","commit":"265b813dbe048e5a55b78e7d84729b571ded6294","date":"2025-02-05T19:40:39Z","dirty":"true"}
{"level":"info","ts":"2025-02-06T19:12:42Z","msg":"Starting manager"}
{"level":"info","ts":"2025-02-06T19:12:42Z","msg":"Nginx Gateway Fabric Pod will be marked as unready until it has the leader lease"}
{"level":"info","ts":"2025-02-06T19:12:42Z","msg":"starting server","name":"health probe","addr":"[::]:8081"}
{"level":"info","ts":"2025-02-06T19:12:42Z","logger":"controller-runtime.metrics","msg":"Starting metrics server"}
{"level":"info","ts":"2025-02-06T19:12:42Z","logger":"controller-runtime.metrics","msg":"Serving metrics server","bindAddress":":9113","secure":false}
{"level":"info","ts":"2025-02-06T19:12:42Z","msg":"attempting to acquire leader lease nginx-gateway/my-release-nginx-gateway-fabric-leader-election..."}
{"level":"info","ts":"2025-02-06T19:12:42Z","logger":"eventLoop.eventHandler","msg":"Reconfigured control plane.","batchID":2}

@bjee19 bjee19 requested a review from sjberman February 7, 2025 23:11
@bjee19
Copy link
Contributor Author

bjee19 commented Feb 7, 2025

@salonichf5 , re-requesting your review since a large chunk of restructuring was done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: 🆕 New
Development

Successfully merging this pull request may close these issues.

4 participants