Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't ignore inotifywait failures #424

Merged
merged 3 commits into from
Oct 16, 2024
Merged

Conversation

alpeb
Copy link
Member

@alpeb alpeb commented Oct 14, 2024

Problem

When the node has hit the inotify limit, deploying the linkerd-cni daemonset will silently fail.

Repro

# 1. Install linkerd-cni and then the linkerd control plane in CNI mode. This is an easy way of doing that from the linkerd2 repo:
bin/tests --name cni-calico-deep --skip-cluster-delete $PWD/target/cli/linux-amd64/linkerd`

# 2. Delete the linkerd-cni namespace
kubectl delete ns linkerd-cni

# 3. Deploy an ubuntu image and install inotify-tools
kubectl run -it --rm --restart Never eraseme --image ubuntu bash
$ apt-get update && apt-get install -y inotify-tools

# 4. Consume all the inotify instances. Stop when you start seeing errors, but leave the shell open
$ mkdir /app
$ while true; do inotifywait -m -e create -e modify -e delete --format '%w%f %e' /app & done

# 5. Re-install linkerd-cni
linkerd install-cni | kubectl apply -f -

# 6. Check there's an error but the pod doesn't automatically restart
$ kubectl -n linkerd-cni logs linkerd-cni-zlztm
2024-10-14 20:51:39] Created CNI config /host/etc/cni/net.d/10-calico.conflist
Couldn't initialize inotify: No file descriptors available
Try increasing the value of /proc/sys/fs/inotify/max_user_instances

# 7. Bounce the calico daemonset -> linkerd-cni will no longer detect this change, so its config gets overridden and lost
kubectl -n kube-system delete po calico-node-wchf2

# 8. Inject and deploy emojivoto
linkerd inject https://run.linkerd.io/emojivoto.yml | kubectl apply -f -

# 9. All emojivoto's pods fail to start, because the network-validator doesn't detect a proper network config
$ k -n emojivoto logs web-8559b97f7c-ht6b8 linkerd-network-validator
2024-10-14T20:57:47.940801Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140 2024-10-14T20:57:47.940824Z DEBUG linkerd_network_validator: _token="Zax4sMUFWS5khtN6WjuX6335uT46W7WFa6ghDlftvKRUnQOqLnydrK5bHjPK75W\n"
2024-10-14T20:57:47.940828Z  INFO linkerd_network_validator: Connecting to 1.1.1.1:20001
2024-10-14T20:57:57.941789Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s

Resolution

The problem is that the script is blocked on waiting for sleep infinity, and the wait isn't unblocked on a monitor failure.
The solution consisted on:

  • replacing the last loop with a wait on the monitor job
  • adding set -o pipefail so that the pipe in monitor fails if any command in the pipe fails
  • adding a trap on ERR to catch that error and perform cleanup
  • this also implied splitting how config_file_count is computed because an empty find piped into grep -v is now getting caught as an error

Test

  • In cluster from the steps above, reinstall linkerd-cni using this fix, published on my repo linkerd
linkerd install-cni --cni-image ghcr.io/alpeb/cni-plugin --cni-image-version fixup11 | kubectl apply -f -
  • You should see the same error as above, but this time the pod enters into a crash loop.

  • Exit the ubuntu pod, which will kill all the inotifywait instances.

  • The linkerd-cni pod should recover by itself.

  • Bounce again the calico-node daemonset

kubectl -n kube-system delete po calico-node-wchf2
  • You should see in the linkerd-cni logs that the config change was detected.

  • After rolling out the emojivoto workloads they come out fine.

## Problem

When the node has hit the inotify limit, deploying the linkerd-cni daemonset will silently fail.

## Repro

```bash
# 1. Install linkerd-cni and then the linkerd control plane in CNI mode. This is an easy way of doing that from the linkerd2 repo:
bin/tests --name cni-calico-deep --skip-cluster-delete $PWD/target/cli/linux-amd64/linkerd`

# 2. Delete the linkerd-cni namespace
kubectl delete ns linkerd-cni

# 3. Deploy an ubuntu image and install inotify-tools
kubectl run -it --rm --restart Never eraseme --image ubuntu bash
$ apt-get update && apt-get install -y inotify-tools

# 4. Consume all the inotify instances. Stop when you start seeing errors, but leave the shell open
$ mkdir /app
$ while true; do inotifywait -m -e create -e modify -e delete --format '%w%f %e' /app & done

# 5. Re-install linkerd-cni
linkerd install-cni | kubectl apply -f -

# 6. Check there's an error but the pod doesn't automatically restart
$ kubectl -n linkerd-cni logs linkerd-cni-zlztm
2024-10-14 20:51:39] Created CNI config /host/etc/cni/net.d/10-calico.conflist
Couldn't initialize inotify: No file descriptors available
Try increasing the value of /proc/sys/fs/inotify/max_user_instances
```

# 7. Bounce the calico daemonset -> linkerd-cni will no longer detect this change, so its config gets overridden and lost
kubectl -n kube-system delete po calico-node-wchf2

# 8. Inject and deploy emojivoto
linkerd inject https://run.linkerd.io/emojivoto.yml | kubectl apply -f -

# 9. All emojivoto's pods fail to start, because the network-validator doesn't detect a proper network config
$ k -n emojivoto logs web-8559b97f7c-ht6b8 linkerd-network-validator
2024-10-14T20:57:47.940801Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-10-14T20:57:47.940824Z DEBUG linkerd_network_validator: token="Zax4sMUFWS5khtN6WjuX6335uT46W7WFa6ghDlftvKRUnQOqLnydrK5bHjPK75W\n"
2024-10-14T20:57:47.940828Z  INFO linkerd_network_validator: Connecting to 1.1.1.1:20001
2024-10-14T20:57:57.941789Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s
```

## Resolution

The problem is that the script is blocked on waiting for `sleep infinity`, and the `wait` isn't unblocked on a `monitor` failure even despite the traps in place, and even after enabling `-o pipefail` on the script.
The solution consisted on removing the infinte loop at the end, and simply `wait` on the `monitor` call.

## Test

```
# In cluster from the steps above, reinstall linkerd-cni using this fix, published on my repo
linkerd install-cni --cni-image ghcr.io/alpeb/cni-plugin --cni-image-version fixup | kubectl apply -f -

# You should see the same error as above, but this time the pod enters into a crash loop.

# Exit the ubuntu pod, which will kill all the inotify wait instances.

# The linkerd-cni pod should recover by itself.

# Bounce again the calico-node daemonset
kubectl -n kube-system delete po calico-node-wchf2

# You should see in the linkerd-cni logs that the config change was detected.

# After rolling out the emojivoto workloads they come out fine.
```
@alpeb alpeb requested a review from a team as a code owner October 14, 2024 22:08
Copy link

@cratelyn cratelyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i appreciate your detailed repro and explanation, @alpeb. great work!

i'd love a second approval from someone else with more pre-existing context on this problem, but this looks good to me. 🕵️‍♀️ ✔️

@alpeb
Copy link
Member Author

alpeb commented Oct 15, 2024

Thanks for the review @cratelyn 👍
After taking another look I realized that when inotifywait failed we were just terminating the script successfully, which does trigger a container restart, but it'd be more correct to exit with an error code. So I've made a few more changes and updated the Resolution section above.

@alpeb alpeb force-pushed the alpeb/inotifywait-fail-on-errors branch from 64ab1a5 to a5be3f8 Compare October 15, 2024 22:54
Copy link
Member

@olix0r olix0r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@alpeb alpeb merged commit e2eeeb1 into main Oct 16, 2024
12 checks passed
@alpeb alpeb deleted the alpeb/inotifywait-fail-on-errors branch October 16, 2024 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants