cni-repair controller #306
Conversation
Followup to linkerd/linkerd2-proxy-init#306. Fixes #11073.

This adds the `reinitialize-pods` container to the `linkerd-cni` DaemonSet, along with its config in `values.yaml`. The `linkerd-cni` image version is also bumped so that it contains the new binary for this controller.

## TO-DOs

- Integration test
Force-pushed from 673650b to 3ab1ec6.
This looks good! Exciting to finally have it committed. Two final questions:
- Do we anticipate needing to emit events? I'm thinking for successful evictions.
- Do we need to retry evictions? Could one ever fail due to a transient request error and then never be processed again?
Both of those are more follow-ups, if they're required at all. I think what we have now is great 👍🏻
I was expecting k8s itself to surface eviction events, but after checking it turns out that's not the case, so I'll be adding that.
If the eviction fails, I think the pod will continue on its backed-off crashloop anyway, so we'll get the event for the next failure.
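For the follow-up on surfacing events, here is a minimal sketch of what publishing a Kubernetes Event from the controller could look like with the kube crate's events `Recorder` (kube ≈0.87 API, which changed in later releases); the helper name, reason, and note are illustrative, not the PR's actual code:

```rust
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::events::{Event, EventType, Recorder, Reporter},
    Client, Resource,
};

/// Hypothetical helper: publish an Event against the pod we just deleted,
/// so the repair shows up in `kubectl describe pod` / `kubectl get events`.
async fn publish_repair_event(client: Client, pod: &Pod) -> kube::Result<()> {
    let reporter = Reporter {
        controller: "linkerd-cni-repair-controller".into(),
        instance: None,
    };
    // The event is attached to the pod's object reference.
    let recorder = Recorder::new(client, reporter, pod.object_ref(&()));
    recorder
        .publish(Event {
            type_: EventType::Normal,
            reason: "Deleted".into(),
            note: Some("Deleted pod stuck waiting for linkerd-cni".into()),
            action: "Deleting".into(),
            secondary: None,
        })
        .await
}
```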
Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire a proper network config because `linkerd-cni` and/or the cluster's network CNI haven't fully started. They are left in a permanent crash loop and, once the CNI is ready, they need to be restarted externally, which is what this controller does.

This controller, `linkerd-reinitialize-pods`, watches events on pods in the current node that have been injected but are in a terminated state and whose `linkerd-network-validator` container exited with code 95, and proceeds to evict them so they can restart with a proper network config. The controller is to be deployed as an additional container in the `linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#xxx).

## TO-DOs

- Figure out why `/metrics` is returning a 404 (should show process metrics)
- Integration test
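As a rough illustration of that flow (not the PR's actual code), here is a minimal kube-rs sketch that watches pods on the current node, checks for a `linkerd-network-validator` container that terminated with exit code 95, and deletes the pod so it restarts once the CNI is ready. The function names and the exact status fields consulted are assumptions; the merged description below uses a plain delete rather than an eviction, which is what this sketch shows:

```rust
use futures::{StreamExt, TryStreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::{
    api::{Api, DeleteParams},
    runtime::{watcher, WatchStreamExt},
    Client, ResourceExt,
};

const VALIDATOR_EXIT_CODE: i32 = 95;

/// True if the pod's `linkerd-network-validator` container terminated with
/// exit code 95, i.e. it never saw a proper network config.
fn needs_repair(pod: &Pod) -> bool {
    pod.status
        .iter()
        .flat_map(|s| s.init_container_statuses.iter().flatten())
        .filter(|cs| cs.name == "linkerd-network-validator")
        .filter_map(|cs| cs.last_state.as_ref().or(cs.state.as_ref()))
        .filter_map(|state| state.terminated.as_ref())
        .any(|t| t.exit_code == VALIDATOR_EXIT_CODE)
}

async fn run(client: Client, node_name: String) -> anyhow::Result<()> {
    // Only watch pods scheduled on this node.
    let pods: Api<Pod> = Api::all(client.clone());
    let cfg = watcher::Config::default().fields(&format!("spec.nodeName={node_name}"));
    let mut stream = watcher(pods, cfg).applied_objects().boxed();

    while let Some(pod) = stream.try_next().await? {
        if needs_repair(&pod) {
            let ns = pod.namespace().unwrap_or_default();
            // Delete the pod; its owner re-creates it, and by then the CNI
            // config should be in place so the proxy can start normally.
            Api::<Pod>::namespaced(client.clone(), &ns)
                .delete(&pod.name_any(), &DeleteParams::default())
                .await?;
        }
    }
    Ok(())
}
```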
…y on reinitialize-pods
Force-pushed from a756322 to c9e7e19.
Force-pushed from c9e7e19 to d0e2384.
Force-pushed from e3388b9 to e74450f.
Force-pushed from e74450f to c7d9a91.
This looks good to me. I still left some comments, but they're nitpicky and shouldn't block. Let me know what you think :D
Great work @alpeb! This looks ready to ship to me.
Force-pushed from 6344b1e to 956a8c7.
Thanks for all the feedback @olix0r. The latest commits address all your comments, plus the task handling we talked about offline. I also added a unit test that checks on the lifecycle of
Oh, also looking forward to the suggestions about the timer histogram on these tasks 😉
cni-repair-controller/src/lib.rs (outdated)
let (tx, rx) = mpsc::channel(EVENT_CHANNEL_CAPACITY);
tokio::spawn(process_events(pod_evts, tx, metrics.clone()));

let client = rt.client();
tokio::spawn(process_pods(client, controller_pod_name, rx, metrics))
It's probably desirable to put tracing contexts on each spawned task for better logging/diagnostics:
Existing:

let (tx, rx) = mpsc::channel(EVENT_CHANNEL_CAPACITY);
tokio::spawn(process_events(pod_evts, tx, metrics.clone()));
let client = rt.client();
tokio::spawn(process_pods(client, controller_pod_name, rx, metrics))

Suggested:

let (tx, rx) = mpsc::channel(EVENT_CHANNEL_CAPACITY);
tokio::spawn(
    process_events(pod_evts, tx, metrics.clone())
        .instrument(tracing::info_span!("watch").or_current()),
);
let client = rt.client();
tokio::spawn(
    process_pods(client, controller_pod_name, rx, metrics)
        .instrument(tracing::info_span!("repair").or_current()),
)
Sounds good 👍
Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire a proper network config because `linkerd-cni` and/or the cluster's network CNI haven't fully started. They are left in a permanent crash loop and, once the CNI is ready, they need to be restarted externally, which is what this controller does.

This controller, `linkerd-cni-repair-controller`, watches events on pods in the current node that have been injected but are in a terminated state and whose `linkerd-network-validator` container exited with code 95, and proceeds to delete them so they can restart with a proper network config. The controller is to be deployed as an additional container in the `linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#11699).

This exposes two custom counter metrics: `linkerd_cni_repair_controller_queue_overflow` (in the spirit of the destination controller's `endpoint_updates_queue_overflow`) and `linkerd_cni_repair_controller_deleted`.
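As a rough sketch of how such counters might be registered with the prometheus-client crate (the struct and field names here are illustrative, not the PR's actual code; the crate appends a `_total` suffix to counters when encoding):

```rust
use prometheus_client::{metrics::counter::Counter, registry::Registry};

#[derive(Clone, Default)]
pub struct Metrics {
    pub queue_overflow: Counter,
    pub deleted: Counter,
}

impl Metrics {
    /// Register the counters on a (prefixed) registry and return handles
    /// that the watch/repair tasks can increment.
    pub fn register(registry: &mut Registry) -> Self {
        let metrics = Self::default();
        registry.register(
            "queue_overflow",
            "Incremented whenever the pod-event queue is full and an event is dropped",
            metrics.queue_overflow.clone(),
        );
        registry.register(
            "deleted",
            "Number of pods deleted so they can be re-created with a proper network config",
            metrics.deleted.clone(),
        );
        metrics
    }
}

// Usage sketch: carve out a prefixed sub-registry, then bump the counters.
// let mut root = Registry::default();
// let metrics =
//     Metrics::register(root.sub_registry_with_prefix("linkerd_cni_repair_controller"));
// metrics.deleted.inc();
```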