-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't ignore inotifywait failures #424
Conversation
## Problem When the node has hit the inotify limit, deploying the linkerd-cni daemonset will silently fail. ## Repro ```bash # 1. Install linkerd-cni and then the linkerd control plane in CNI mode. This is an easy way of doing that from the linkerd2 repo: bin/tests --name cni-calico-deep --skip-cluster-delete $PWD/target/cli/linux-amd64/linkerd` # 2. Delete the linkerd-cni namespace kubectl delete ns linkerd-cni # 3. Deploy an ubuntu image and install inotify-tools kubectl run -it --rm --restart Never eraseme --image ubuntu bash $ apt-get update && apt-get install -y inotify-tools # 4. Consume all the inotify instances. Stop when you start seeing errors, but leave the shell open $ mkdir /app $ while true; do inotifywait -m -e create -e modify -e delete --format '%w%f %e' /app & done # 5. Re-install linkerd-cni linkerd install-cni | kubectl apply -f - # 6. Check there's an error but the pod doesn't automatically restart $ kubectl -n linkerd-cni logs linkerd-cni-zlztm 2024-10-14 20:51:39] Created CNI config /host/etc/cni/net.d/10-calico.conflist Couldn't initialize inotify: No file descriptors available Try increasing the value of /proc/sys/fs/inotify/max_user_instances ``` # 7. Bounce the calico daemonset -> linkerd-cni will no longer detect this change, so its config gets overridden and lost kubectl -n kube-system delete po calico-node-wchf2 # 8. Inject and deploy emojivoto linkerd inject https://run.linkerd.io/emojivoto.yml | kubectl apply -f - # 9. All emojivoto's pods fail to start, because the network-validator doesn't detect a proper network config $ k -n emojivoto logs web-8559b97f7c-ht6b8 linkerd-network-validator 2024-10-14T20:57:47.940801Z INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140 2024-10-14T20:57:47.940824Z DEBUG linkerd_network_validator: token="Zax4sMUFWS5khtN6WjuX6335uT46W7WFa6ghDlftvKRUnQOqLnydrK5bHjPK75W\n" 2024-10-14T20:57:47.940828Z INFO linkerd_network_validator: Connecting to 1.1.1.1:20001 2024-10-14T20:57:57.941789Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s ``` ## Resolution The problem is that the script is blocked on waiting for `sleep infinity`, and the `wait` isn't unblocked on a `monitor` failure even despite the traps in place, and even after enabling `-o pipefail` on the script. The solution consisted on removing the infinte loop at the end, and simply `wait` on the `monitor` call. ## Test ``` # In cluster from the steps above, reinstall linkerd-cni using this fix, published on my repo linkerd install-cni --cni-image ghcr.io/alpeb/cni-plugin --cni-image-version fixup | kubectl apply -f - # You should see the same error as above, but this time the pod enters into a crash loop. # Exit the ubuntu pod, which will kill all the inotify wait instances. # The linkerd-cni pod should recover by itself. # Bounce again the calico-node daemonset kubectl -n kube-system delete po calico-node-wchf2 # You should see in the linkerd-cni logs that the config change was detected. # After rolling out the emojivoto workloads they come out fine. ```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i appreciate your detailed repro and explanation, @alpeb. great work!
i'd love a second approval from someone else with more pre-existing context on this problem, but this looks good to me. 🕵️♀️ ✔️
Thanks for the review @cratelyn 👍 |
64ab1a5
to
a5be3f8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
Problem
When the node has hit the inotify limit, deploying the linkerd-cni daemonset will silently fail.
Repro
Resolution
The problem is that the script is blocked on waiting for
sleep infinity
, and thewait
isn't unblocked on amonitor
failure.The solution consisted on:
set -o pipefail
so that the pipe in monitor fails if any command in the pipe failsconfig_file_count
is computed because an emptyfind
piped intogrep -v
is now getting caught as an errorTest
You should see the same error as above, but this time the pod enters into a crash loop.
Exit the ubuntu pod, which will kill all the inotifywait instances.
The linkerd-cni pod should recover by itself.
Bounce again the calico-node daemonset
You should see in the linkerd-cni logs that the config change was detected.
After rolling out the emojivoto workloads they come out fine.