Don't ignore inotifywait failures #424

alpeb · 2024-10-14T22:08:46Z

Problem

When the node has hit the inotify limit, deploying the linkerd-cni daemonset will silently fail.

Repro

# 1. Install linkerd-cni and then the linkerd control plane in CNI mode. This is an easy way of doing that from the linkerd2 repo:
bin/tests --name cni-calico-deep --skip-cluster-delete $PWD/target/cli/linux-amd64/linkerd`

# 2. Delete the linkerd-cni namespace
kubectl delete ns linkerd-cni

# 3. Deploy an ubuntu image and install inotify-tools
kubectl run -it --rm --restart Never eraseme --image ubuntu bash
$ apt-get update && apt-get install -y inotify-tools

# 4. Consume all the inotify instances. Stop when you start seeing errors, but leave the shell open
$ mkdir /app
$ while true; do inotifywait -m -e create -e modify -e delete --format '%w%f %e' /app & done

# 5. Re-install linkerd-cni
linkerd install-cni | kubectl apply -f -

# 6. Check there's an error but the pod doesn't automatically restart
$ kubectl -n linkerd-cni logs linkerd-cni-zlztm
2024-10-14 20:51:39] Created CNI config /host/etc/cni/net.d/10-calico.conflist
Couldn't initialize inotify: No file descriptors available
Try increasing the value of /proc/sys/fs/inotify/max_user_instances

# 7. Bounce the calico daemonset -> linkerd-cni will no longer detect this change, so its config gets overridden and lost
kubectl -n kube-system delete po calico-node-wchf2

# 8. Inject and deploy emojivoto
linkerd inject https://run.linkerd.io/emojivoto.yml | kubectl apply -f -

# 9. All emojivoto's pods fail to start, because the network-validator doesn't detect a proper network config
$ k -n emojivoto logs web-8559b97f7c-ht6b8 linkerd-network-validator
2024-10-14T20:57:47.940801Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140 2024-10-14T20:57:47.940824Z DEBUG linkerd_network_validator: _token="Zax4sMUFWS5khtN6WjuX6335uT46W7WFa6ghDlftvKRUnQOqLnydrK5bHjPK75W\n"
2024-10-14T20:57:47.940828Z  INFO linkerd_network_validator: Connecting to 1.1.1.1:20001
2024-10-14T20:57:57.941789Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s

Resolution

The problem is that the script is blocked on waiting for sleep infinity, and the wait isn't unblocked on a monitor failure.
The solution consisted on:

replacing the last loop with a wait on the monitor job
adding set -o pipefail so that the pipe in monitor fails if any command in the pipe fails
adding a trap on ERR to catch that error and perform cleanup
this also implied splitting how config_file_count is computed because an empty find piped into grep -v is now getting caught as an error

Test

In cluster from the steps above, reinstall linkerd-cni using this fix, published on my repo linkerd

linkerd install-cni --cni-image ghcr.io/alpeb/cni-plugin --cni-image-version fixup11 | kubectl apply -f -

You should see the same error as above, but this time the pod enters into a crash loop.
Exit the ubuntu pod, which will kill all the inotifywait instances.
The linkerd-cni pod should recover by itself.
Bounce again the calico-node daemonset

kubectl -n kube-system delete po calico-node-wchf2

You should see in the linkerd-cni logs that the config change was detected.
After rolling out the emojivoto workloads they come out fine.

## Problem When the node has hit the inotify limit, deploying the linkerd-cni daemonset will silently fail. ## Repro ```bash # 1. Install linkerd-cni and then the linkerd control plane in CNI mode. This is an easy way of doing that from the linkerd2 repo: bin/tests --name cni-calico-deep --skip-cluster-delete $PWD/target/cli/linux-amd64/linkerd` # 2. Delete the linkerd-cni namespace kubectl delete ns linkerd-cni # 3. Deploy an ubuntu image and install inotify-tools kubectl run -it --rm --restart Never eraseme --image ubuntu bash $ apt-get update && apt-get install -y inotify-tools # 4. Consume all the inotify instances. Stop when you start seeing errors, but leave the shell open $ mkdir /app $ while true; do inotifywait -m -e create -e modify -e delete --format '%w%f %e' /app & done # 5. Re-install linkerd-cni linkerd install-cni | kubectl apply -f - # 6. Check there's an error but the pod doesn't automatically restart $ kubectl -n linkerd-cni logs linkerd-cni-zlztm 2024-10-14 20:51:39] Created CNI config /host/etc/cni/net.d/10-calico.conflist Couldn't initialize inotify: No file descriptors available Try increasing the value of /proc/sys/fs/inotify/max_user_instances ``` # 7. Bounce the calico daemonset -> linkerd-cni will no longer detect this change, so its config gets overridden and lost kubectl -n kube-system delete po calico-node-wchf2 # 8. Inject and deploy emojivoto linkerd inject https://run.linkerd.io/emojivoto.yml | kubectl apply -f - # 9. All emojivoto's pods fail to start, because the network-validator doesn't detect a proper network config $ k -n emojivoto logs web-8559b97f7c-ht6b8 linkerd-network-validator 2024-10-14T20:57:47.940801Z INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140 2024-10-14T20:57:47.940824Z DEBUG linkerd_network_validator: token="Zax4sMUFWS5khtN6WjuX6335uT46W7WFa6ghDlftvKRUnQOqLnydrK5bHjPK75W\n" 2024-10-14T20:57:47.940828Z INFO linkerd_network_validator: Connecting to 1.1.1.1:20001 2024-10-14T20:57:57.941789Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s ``` ## Resolution The problem is that the script is blocked on waiting for `sleep infinity`, and the `wait` isn't unblocked on a `monitor` failure even despite the traps in place, and even after enabling `-o pipefail` on the script. The solution consisted on removing the infinte loop at the end, and simply `wait` on the `monitor` call. ## Test ``` # In cluster from the steps above, reinstall linkerd-cni using this fix, published on my repo linkerd install-cni --cni-image ghcr.io/alpeb/cni-plugin --cni-image-version fixup | kubectl apply -f - # You should see the same error as above, but this time the pod enters into a crash loop. # Exit the ubuntu pod, which will kill all the inotify wait instances. # The linkerd-cni pod should recover by itself. # Bounce again the calico-node daemonset kubectl -n kube-system delete po calico-node-wchf2 # You should see in the linkerd-cni logs that the config change was detected. # After rolling out the emojivoto workloads they come out fine. ```

cratelyn

i appreciate your detailed repro and explanation, @alpeb. great work!

i'd love a second approval from someone else with more pre-existing context on this problem, but this looks good to me. 🕵️‍♀️ ✔️

alpeb · 2024-10-15T20:34:40Z

Thanks for the review @cratelyn 👍
After taking another look I realized that when inotifywait failed we were just terminating the script successfully, which does trigger a container restart, but it'd be more correct to exit with an error code. So I've made a few more changes and updated the Resolution section above.

olix0r

lgtm!

alpeb requested a review from a team as a code owner October 14, 2024 22:08

cratelyn approved these changes Oct 15, 2024

View reviewed changes

Properly handle ERR

c4a6af9

Split config_file_count computation

a5be3f8

alpeb force-pushed the alpeb/inotifywait-fail-on-errors branch from 64ab1a5 to a5be3f8 Compare October 15, 2024 22:54

olix0r approved these changes Oct 16, 2024

View reviewed changes

alpeb merged commit e2eeeb1 into main Oct 16, 2024
12 checks passed

alpeb deleted the alpeb/inotifywait-fail-on-errors branch October 16, 2024 22:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't ignore inotifywait failures #424

Don't ignore inotifywait failures #424

alpeb commented Oct 14, 2024 •

edited

Loading

cratelyn left a comment

alpeb commented Oct 15, 2024

olix0r left a comment

Don't ignore inotifywait failures #424

Don't ignore inotifywait failures #424

Conversation

alpeb commented Oct 14, 2024 • edited Loading

Problem

Repro

Resolution

Test

cratelyn left a comment

Choose a reason for hiding this comment

alpeb commented Oct 15, 2024

olix0r left a comment

Choose a reason for hiding this comment

alpeb commented Oct 14, 2024 •

edited

Loading