
cni-repair controller #306

Merged: 16 commits into main on Jan 2, 2024
Conversation

@alpeb (Member) commented Dec 5, 2023


Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire a proper network config because linkerd-cni and/or the cluster's network CNI haven't fully started. Such pods are left in a permanent crash loop and, once the CNI is ready, have to be restarted externally, which is what this controller does.

This controller, "linkerd-cni-repair-controller", watches pod events on the current node for injected pods that are in a terminated state and whose linkerd-network-validator container exited with code 95, and deletes them so they can restart with a proper network config.

The controller is to be deployed as an additional container in the linkerd-cni DaemonSet (addressed in linkerd/linkerd2#11699).

This exposes two custom counter metrics: linkerd_cni_repair_controller_queue_overflow (in the spirit of the destination controller's endpoint_updates_queue_overflow) and linkerd_cni_repair_controller_deleted.
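To make the description concrete, here is a minimal sketch of the detection-and-deletion step. It is not the code in this PR: the use of the kube and k8s-openapi crates, the helper names (`needs_repair`, `repair`), and the exact container-status fields inspected are assumptions for illustration.

```rust
// Minimal sketch (not the PR's code): decide whether a pod needs repair and,
// if so, delete it so its owning workload recreates it with a working
// network config. Helper names are illustrative assumptions.
use k8s_openapi::api::core::v1::Pod;
use kube::{
    api::{Api, DeleteParams},
    Client,
};

/// Exit code used by linkerd-network-validator to signal a bad network config.
const VALIDATOR_EXIT_CODE: i32 = 95;

/// Returns true when the pod's `linkerd-network-validator` container has
/// terminated with exit code 95. In a crash loop the terminated state is
/// usually found under `last_state`, so both `state` and `last_state` are
/// checked, across init and regular container statuses.
fn needs_repair(pod: &Pod) -> bool {
    let Some(status) = pod.status.as_ref() else {
        return false;
    };
    status
        .init_container_statuses
        .iter()
        .chain(status.container_statuses.iter())
        .flatten()
        .filter(|cs| cs.name == "linkerd-network-validator")
        .flat_map(|cs| cs.state.iter().chain(cs.last_state.iter()))
        .filter_map(|state| state.terminated.as_ref())
        .any(|term| term.exit_code == VALIDATOR_EXIT_CODE)
}

/// Deletes a broken pod; its owning workload (Deployment, etc.) recreates it.
async fn repair(client: Client, pod: &Pod) -> kube::Result<()> {
    let ns = pod.metadata.namespace.as_deref().unwrap_or("default");
    let name = pod.metadata.name.as_deref().unwrap_or_default();
    let pods: Api<Pod> = Api::namespaced(client, ns);
    pods.delete(name, &DeleteParams::default()).await?;
    Ok(())
}
```

The actual controller additionally scopes the watch to the node it runs on and pushes matching pods onto a bounded queue, which is where the linkerd_cni_repair_controller_queue_overflow counter comes in.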

@alpeb requested a review from a team as a code owner on December 5, 2023 15:06
alpeb added a commit to linkerd/linkerd2 that referenced this pull request Dec 5, 2023
Followup to linkerd/linkerd2-proxy-init#306
Fixes #11073

This adds the `reinitialize-pods` container to the `linkerd-cni`
DaemonSet, along with its config in `values.yaml`.

Also, the `linkerd-cni` version is bumped so it contains the new binary
for this controller.

## TO-DOs

- Integration test
@alpeb force-pushed the alpeb/linkerd-reinitialize-pods branch from 673650b to 3ab1ec6 on December 5, 2023 15:30
@mateiidavid (Member) left a comment

This looks good! Exciting to finally have it committed. Two final questions:

  1. Do we anticipate needing to emit events? I'm thinking for successful evictions.
  2. Do we need to retry evictions? Would it ever fail due to a transient request and we wouldn't process it again?

Both of those are more like follow-ups, if they're required at all. I think what we have now is great 👍🏻

@alpeb (Member, Author) commented Dec 6, 2023

> 1. Do we anticipate needing to emit events? I'm thinking for successful evictions.

I was expecting k8s itself to surface eviction events, but after checking it turns out that's not the case, so I'll be adding that (see the sketch below).

> 2. Do we need to retry evictions? Would it ever fail due to a transient request and we wouldn't process it again?

If the eviction fails, I think the pod will continue on its backed-off crash loop anyway, so we'll get the event for the next failure.
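As a rough illustration of that follow-up, recording a core/v1 Event after a successful deletion could look something like this. It is not the implementation from this PR or its follow-ups: the function name and the reason/message strings are made up for the example, and it assumes the kube and k8s-openapi crates.

```rust
// Illustrative only: publish a core/v1 Event noting that the controller
// deleted a pod. The reason/message strings and the function name are
// assumptions for this sketch, not values used by the real controller.
use k8s_openapi::api::core::v1::{Event, ObjectReference};
use k8s_openapi::apimachinery::pkg::apis::meta::v1::ObjectMeta;
use kube::{
    api::{Api, PostParams},
    Client,
};

async fn record_deletion(client: Client, ns: &str, pod_name: &str) -> kube::Result<()> {
    let event = Event {
        metadata: ObjectMeta {
            // Let the API server pick a unique name for the event object.
            generate_name: Some(format!("{pod_name}.")),
            namespace: Some(ns.to_string()),
            ..ObjectMeta::default()
        },
        // Point the event at the pod that was deleted.
        involved_object: ObjectReference {
            kind: Some("Pod".to_string()),
            namespace: Some(ns.to_string()),
            name: Some(pod_name.to_string()),
            ..ObjectReference::default()
        },
        reason: Some("CniRepair".to_string()),
        message: Some("Deleted pod so it can restart with a working network config".to_string()),
        type_: Some("Normal".to_string()),
        ..Event::default()
    };
    Api::<Event>::namespaced(client, ns)
        .create(&PostParams::default(), &event)
        .await?;
    Ok(())
}
```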

Base automatically changed from alpeb/build-toolchain-updates to main December 6, 2023 17:56
alpeb added 3 commits December 6, 2023 15:25
Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire proper network
config because `linkerd-cni` and/or the cluster's network CNI haven't
fully started. They are left in a permanent crash loop and once CNI is
ready, they need to be restarted externally, which is what this
controller does.

This controller "`linkerd-reinitialize-pods`" watches over events on
pods in the current node, which have been injected but are in a
terminated state and whose `linkerd-network-validator` container exited
with code 95, and proceeds to evict them so they can restart with a
proper network config.

The controller is to be deployed as an additional container in the
`linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#xxx).

## TO-DOs

- Figure why `/metrics` is returning a 404 (should show process metrics)
- Integration test
@alpeb force-pushed the alpeb/linkerd-reinitialize-pods branch from a756322 to c9e7e19 on December 6, 2023 20:26
@alpeb force-pushed the alpeb/linkerd-reinitialize-pods branch from c9e7e19 to d0e2384 on December 6, 2023 20:33
@alpeb marked this pull request as draft on December 12, 2023 19:55
@alpeb force-pushed the alpeb/linkerd-reinitialize-pods branch from e3388b9 to e74450f on December 13, 2023 10:09
@alpeb force-pushed the alpeb/linkerd-reinitialize-pods branch from e74450f to c7d9a91 on December 13, 2023 11:07
@alpeb marked this pull request as ready for review on December 13, 2023 11:07
@mateiidavid (Member) left a comment

This looks good to me. I still left some comments, but they're nitpicky and shouldn't block. Let me know what you think :D

@mateiidavid (Member) left a comment

Great work @alpeb! This looks ready to ship to me.

@alpeb changed the title from "reinitialize-pods controller" to "cni-repair controller" on Dec 20, 2023
@alpeb force-pushed the alpeb/linkerd-reinitialize-pods branch from 6344b1e to 956a8c7 on December 28, 2023 21:07
@alpeb (Member, Author) commented Dec 28, 2023

Thanks for all the feedback @olix0r. The latest commits address all your comments, plus the tasks handling we talked about offline. And I added a unit test that also checks the lifecycle of `process_events`.

@alpeb (Member, Author) commented Dec 28, 2023

Oh, also looking forward to the suggestions about the timer histogram on these tasks 😉

Comment on lines 54 to 58
```rust
let (tx, rx) = mpsc::channel(EVENT_CHANNEL_CAPACITY);
tokio::spawn(process_events(pod_evts, tx, metrics.clone()));

let client = rt.client();
tokio::spawn(process_pods(client, controller_pod_name, rx, metrics))
```
Member

It's probably desirable to put tracing contexts on each span for better logging/diagnostics:

Suggested change:

```diff
 let (tx, rx) = mpsc::channel(EVENT_CHANNEL_CAPACITY);
-tokio::spawn(process_events(pod_evts, tx, metrics.clone()));
+tokio::spawn(
+    process_events(pod_evts, tx, metrics.clone())
+        .instrument(tracing::info_span!("watch").or_current()),
+);
 let client = rt.client();
-tokio::spawn(process_pods(client, controller_pod_name, rx, metrics))
+tokio::spawn(
+    process_pods(client, controller_pod_name, rx, metrics)
+        .instrument(tracing::info_span!("repair").or_current()),
+)
```

@alpeb (Member, Author)

Sounds good 👍

@alpeb merged commit 67cc03d into main on Jan 2, 2024
17 checks passed
@alpeb deleted the alpeb/linkerd-reinitialize-pods branch on January 2, 2024 16:25
olix0r pushed a commit to linkerd/linkerd2 that referenced this pull request Jan 5, 2024
Followup to linkerd/linkerd2-proxy-init#306
Fixes #11073

This adds the `reinitialize-pods` container to the `linkerd-cni`
DaemonSet, along with its config in `values.yaml`.

Also, the `linkerd-cni` version is bumped so it contains the new binary
for this controller.
adleong pushed a commit to linkerd/linkerd2 that referenced this pull request Jan 18, 2024
Followup to linkerd/linkerd2-proxy-init#306
Fixes #11073

This adds the `reinitialize-pods` container to the `linkerd-cni`
DaemonSet, along with its config in `values.yaml`.

Also, the `linkerd-cni` version is bumped so it contains the new binary
for this controller.

Successfully merging this pull request may close these issues.

When linkerd-network-validator catches missing iptables config, pod is left in a failure state
3 participants