
cni-repair controller #306

Merged: 16 commits into main on Jan 2, 2024
Conversation

@alpeb (Member) commented Dec 5, 2023


Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire a proper network config because linkerd-cni and/or the cluster's network CNI haven't fully started. Such pods are left in a permanent crash loop and, once the CNI is ready, have to be restarted externally, which is what this controller does.

This controller, "linkerd-cni-repair-controller", watches pod events on the current node for injected pods that are in a terminated state and whose linkerd-network-validator container exited with code 95, and deletes them so they can restart with a proper network config.

The controller is to be deployed as an additional container in the linkerd-cni DaemonSet (addressed in linkerd/linkerd2#11699).

This exposes two custom counter metrics: linkerd_cni_repair_controller_queue_overflow (in the spirit of the destination controller's endpoint_updates_queue_overflow) and linkerd_cni_repair_controller_deleted.
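To make the description concrete, here is a minimal sketch of the detection-and-deletion step. It is not the code in this PR: the use of the kube and k8s-openapi crates, the helper names (`needs_repair`, `repair`), and the exact container-status fields inspected are assumptions for illustration.

```rust
// Minimal sketch (not the PR's code): decide whether a pod needs repair and,
// if so, delete it so its owning workload recreates it with a working
// network config. Helper names are illustrative assumptions.
use k8s_openapi::api::core::v1::Pod;
use kube::{
    api::{Api, DeleteParams},
    Client,
};

/// Exit code used by linkerd-network-validator to signal a bad network config.
const VALIDATOR_EXIT_CODE: i32 = 95;

/// Returns true when the pod's `linkerd-network-validator` container has
/// terminated with exit code 95. In a crash loop the terminated state is
/// usually found under `last_state`, so both `state` and `last_state` are
/// checked, across init and regular container statuses.
fn needs_repair(pod: &Pod) -> bool {
    let Some(status) = pod.status.as_ref() else {
        return false;
    };
    status
        .init_container_statuses
        .iter()
        .chain(status.container_statuses.iter())
        .flatten()
        .filter(|cs| cs.name == "linkerd-network-validator")
        .flat_map(|cs| cs.state.iter().chain(cs.last_state.iter()))
        .filter_map(|state| state.terminated.as_ref())
        .any(|term| term.exit_code == VALIDATOR_EXIT_CODE)
}

/// Deletes a broken pod; its owning workload (Deployment, etc.) recreates it.
async fn repair(client: Client, pod: &Pod) -> kube::Result<()> {
    let ns = pod.metadata.namespace.as_deref().unwrap_or("default");
    let name = pod.metadata.name.as_deref().unwrap_or_default();
    let pods: Api<Pod> = Api::namespaced(client, ns);
    pods.delete(name, &DeleteParams::default()).await?;
    Ok(())
}
```

The actual controller additionally scopes the watch to the node it runs on and pushes matching pods onto a bounded queue, which is where the linkerd_cni_repair_controller_queue_overflow counter comes in.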

@alpeb requested a review from a team as a code owner on December 5, 2023 15:06
alpeb added a commit to linkerd/linkerd2 that referenced this pull request Dec 5, 2023
Followup to linkerd/linkerd2-proxy-init#306
Fixes #11073

This adds the `reinitialize-pods` container to the `linkerd-cni`
DaemonSet, along with its config in `values.yaml`.

Also, the `linkerd-cni` version is bumped so it contains the new binary
for this controller.

## TO-DOs

- Integration test
@alpeb force-pushed the alpeb/linkerd-reinitialize-pods branch from 673650b to 3ab1ec6 on December 5, 2023 15:30
@mateiidavid (Member) left a comment

This looks good! Exciting to finally have it committed. Two final questions:

  1. Do we anticipate needing to emit events? I'm thinking for successful evictions.
  2. Do we need to retry evictions? Would it ever fail due to a transient request and we wouldn't process it again?

Both of those are more like follow-ups, if they're required at all. I think what we have now is great 👍🏻

@alpeb (Member, Author) commented Dec 6, 2023

> 1. Do we anticipate needing to emit events? I'm thinking for successful evictions.

I was expecting k8s itself to surface eviction events, but after checking it turns out that's not the case, so I'll be adding that (see the sketch below).

> 2. Do we need to retry evictions? Would it ever fail due to a transient request and we wouldn't process it again?

If the eviction fails, I think the pod will continue on its backed-off crash loop anyway, so we'll get the event for the next failure.
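As a rough illustration of that follow-up, recording a core/v1 Event after a successful deletion could look something like this. It is not the implementation from this PR or its follow-ups: the function name and the reason/message strings are made up for the example, and it assumes the kube and k8s-openapi crates.

```rust
// Illustrative only: publish a core/v1 Event noting that the controller
// deleted a pod. The reason/message strings and the function name are
// assumptions for this sketch, not values used by the real controller.
use k8s_openapi::api::core::v1::{Event, ObjectReference};
use k8s_openapi::apimachinery::pkg::apis::meta::v1::ObjectMeta;
use kube::{
    api::{Api, PostParams},
    Client,
};

async fn record_deletion(client: Client, ns: &str, pod_name: &str) -> kube::Result<()> {
    let event = Event {
        metadata: ObjectMeta {
            // Let the API server pick a unique name for the event object.
            generate_name: Some(format!("{pod_name}.")),
            namespace: Some(ns.to_string()),
            ..ObjectMeta::default()
        },
        // Point the event at the pod that was deleted.
        involved_object: ObjectReference {
            kind: Some("Pod".to_string()),
            namespace: Some(ns.to_string()),
            name: Some(pod_name.to_string()),
            ..ObjectReference::default()
        },
        reason: Some("CniRepair".to_string()),
        message: Some("Deleted pod so it can restart with a working network config".to_string()),
        type_: Some("Normal".to_string()),
        ..Event::default()
    };
    Api::<Event>::namespaced(client, ns)
        .create(&PostParams::default(), &event)
        .await?;
    Ok(())
}
```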

Base automatically changed from alpeb/build-toolchain-updates to main December 6, 2023 17:56
alpeb added 3 commits December 6, 2023 15:25
Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire proper network
config because `linkerd-cni` and/or the cluster's network CNI haven't
fully started. They are left in a permanent crash loop and once CNI is
ready, they need to be restarted externally, which is what this
controller does.

This controller "`linkerd-reinitialize-pods`" watches over events on
pods in the current node, which have been injected but are in a
terminated state and whose `linkerd-network-validator` container exited
with code 95, and proceeds to evict them so they can restart with a
proper network config.

The controller is to be deployed as an additional container in the
`linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#xxx).

## TO-DOs

- Figure why `/metrics` is returning a 404 (should show process metrics)
- Integration test
@alpeb force-pushed the alpeb/linkerd-reinitialize-pods branch from a756322 to c9e7e19 on December 6, 2023 20:26
@alpeb force-pushed the alpeb/linkerd-reinitialize-pods branch from c9e7e19 to d0e2384 on December 6, 2023 20:33
@alpeb marked this pull request as draft on December 12, 2023 19:55
@alpeb force-pushed the alpeb/linkerd-reinitialize-pods branch from e3388b9 to e74450f on December 13, 2023 10:09
@alpeb force-pushed the alpeb/linkerd-reinitialize-pods branch from e74450f to c7d9a91 on December 13, 2023 11:07
@alpeb marked this pull request as ready for review on December 13, 2023 11:07
@mateiidavid (Member) left a comment

This looks good to me. I still left some comments, but they're nitpicky and shouldn't block. Let me know what you think :D

@mateiidavid (Member) left a comment

Great work @alpeb! This looks ready to ship to me.

@alpeb changed the title from "reinitialize-pods controller" to "cni-repair controller" on Dec 20, 2023
@alpeb force-pushed the alpeb/linkerd-reinitialize-pods branch from 6344b1e to 956a8c7 on December 28, 2023 21:07
@alpeb (Member, Author) commented Dec 28, 2023

Thanks for all the feedback @olix0r. The latest commits address all your comments, plus the tasks handling we talked about offline. And I added a unit test that also checks the lifecycle of `process_events`.

@alpeb (Member, Author) commented Dec 28, 2023

Oh, also looking forward to the suggestions about the timer histogram on these tasks 😉

Comment on lines 54 to 58
```rust
let (tx, rx) = mpsc::channel(EVENT_CHANNEL_CAPACITY);
tokio::spawn(process_events(pod_evts, tx, metrics.clone()));

let client = rt.client();
tokio::spawn(process_pods(client, controller_pod_name, rx, metrics))
```
Member

It's probably desirable to put tracing contexts on each span for better logging/diagnostics:

Suggested change:

```diff
 let (tx, rx) = mpsc::channel(EVENT_CHANNEL_CAPACITY);
-tokio::spawn(process_events(pod_evts, tx, metrics.clone()));
+tokio::spawn(
+    process_events(pod_evts, tx, metrics.clone())
+        .instrument(tracing::info_span!("watch").or_current()),
+);
 let client = rt.client();
-tokio::spawn(process_pods(client, controller_pod_name, rx, metrics))
+tokio::spawn(
+    process_pods(client, controller_pod_name, rx, metrics)
+        .instrument(tracing::info_span!("repair").or_current()),
+)
```

@alpeb (Member, Author)

Sounds good 👍

@alpeb merged commit 67cc03d into main on Jan 2, 2024
17 checks passed
@alpeb deleted the alpeb/linkerd-reinitialize-pods branch on January 2, 2024 16:25
olix0r pushed a commit to linkerd/linkerd2 that referenced this pull request Jan 5, 2024
Followup to linkerd/linkerd2-proxy-init#306
Fixes #11073

This adds the `reinitialize-pods` container to the `linkerd-cni`
DaemonSet, along with its config in `values.yaml`.

Also, the `linkerd-cni` version is bumped so it contains the new binary
for this controller.
adleong pushed a commit to linkerd/linkerd2 that referenced this pull request Jan 18, 2024
Followup to linkerd/linkerd2-proxy-init#306
Fixes #11073

This adds the `reinitialize-pods` container to the `linkerd-cni`
DaemonSet, along with its config in `values.yaml`.

Also, the `linkerd-cni` version is bumped so it contains the new binary
for this controller.

Successfully merging this pull request may close these issues.

When linkerd-network-validator catches missing iptables config, pod is left in a failure state
3 participants