Allow configuration of readinessProbe and livenessProbe timeouts in linkerd-proxy-injector #11453

Closed
jan-kantert opened this issue Oct 4, 2023 · 3 comments · Fixed by #12053

Comments

@jan-kantert
Contributor

What problem are you trying to solve?

I have an issue when our Kubernetes cluster is under high CPU load. In that situation the kubelet is slow to read readiness and liveness probe responses, and in some cases we see Kubernetes restarting linkerd-proxy pods because of failed liveness probes. To reduce the chance of this happening, we would like to increase the probe timeout from the default of 1s to something like 10s or 20s, which would be enough even under very high load.

How should the problem be solved?

Add a configuration parameter for probe timeouts to linkerd-proxy-injector. I would also set the livenessProbe timeout a bit higher by default, to follow Kubernetes best practice.

Any alternatives you've considered?

This is partially caused by kubernetes/kubernetes#89898. There have already been some improvements, and more are coming. As a workaround we can reserve more CPU for the kubelet, but that harms resource utilization because less CPU is available for payload on our nodes.

How would users interact with this feature?

Users can optionally set this timeout in their Helm chart or in their linkerd-proxy-injector config.
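A minimal sketch of how an optional setting like this could be resolved, written in Go. The `ProbeConfig` type, its fields, and `withDefaults` are hypothetical names for illustration and are not Linkerd's actual code or Helm value names. The only fact assumed is that Kubernetes defaults a probe's timeout to 1s and requires it to be at least 1.

```go
package main

import "fmt"

// ProbeConfig is a hypothetical holder for user-supplied probe settings.
// A zero TimeoutSeconds means the user did not set a value.
type ProbeConfig struct {
	TimeoutSeconds      int32
	InitialDelaySeconds int32
}

// defaultTimeoutSeconds mirrors the Kubernetes default probe timeout (1s).
const defaultTimeoutSeconds int32 = 1

// withDefaults returns a copy of cfg in which an unset or invalid timeout
// is replaced by the default and a negative delay is clamped to zero.
func withDefaults(cfg ProbeConfig) ProbeConfig {
	if cfg.TimeoutSeconds < 1 {
		cfg.TimeoutSeconds = defaultTimeoutSeconds
	}
	if cfg.InitialDelaySeconds < 0 {
		cfg.InitialDelaySeconds = 0
	}
	return cfg
}

func main() {
	// The user raises only the liveness timeout; readiness is left unset.
	liveness := withDefaults(ProbeConfig{TimeoutSeconds: 10})
	readiness := withDefaults(ProbeConfig{})

	fmt.Printf("liveness timeout: %ds\n", liveness.TimeoutSeconds)
	fmt.Printf("readiness timeout: %ds\n", readiness.TimeoutSeconds)
}
```

Keeping the override optional means existing installations keep today's behavior unless they explicitly opt in to a longer timeout.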

Would you like to work on this feature?

yes

@kflynn
Member

kflynn commented Oct 11, 2023

Hey @jan-kantert! This seems like a good thing and if you're willing to work on it, we'd love to support you. What would help you? 🙂

@jan-kantert
Contributor Author

> Hey @jan-kantert! This seems like a good thing and if you're willing to work on it, we'd love to support you. What would help you? 🙂

Have a look at my PR #11458. Is that the kind of change you had in mind?

stale bot commented Jan 25, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 25, 2024
@stale stale bot closed this as completed Feb 8, 2024
mateiidavid added a commit that referenced this issue Feb 9, 2024
This release addresses some issues in the destination service that could cause
it to behave unexpectedly when processing updates.

* Fixed a race condition in the destination service that could cause panics
  under very specific conditions ([#12022]; fixes [#12010])
* Changed how updates to a `Server` selector are handled in the destination
  service. When a `Server` that marks a port as opaque no longer selects a
  resource, the resource's opaqueness will be reverted to default settings
  ([#12031]; fixes [#11995])
* Introduced Helm configuration values for liveness and readiness probe
  timeouts and delays ([#11458]; fixes [#11453]) (thanks @jan-kantert!)

[#12010]: #12010
[#12022]: #12022
[#11995]: #11995
[#12031]: #12031
[#11453]: #11453
[#11458]: #11458

Signed-off-by: Matei David <[email protected]>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 10, 2024