Alert when nodes fail to join cluster #3828
Comments
Debugging
Karpenter
Unfortunately, the Karpenter alerts that are mentioned aren't working because the metrics they use don't exist. https://github.com/giantswarm/adidas/issues/1325#issuecomment-2553426385 Example:
Cluster Autoscaler
Had no useful metrics.
Metrics
I also checked whether the bootstrap token secrets showed anything, such as tokens having expired at that time, but according to this metric source there was always at least one bootstrap token available and within its 15-minute expiration time. Additionally, there's a problem with CAPI Machine Pools: because we use them in CAPA clusters, we cannot check whether nodes are failing to join, since no useful metrics cover that case. For CAPI
Options
Theoretically, we could use Loki to aggregate the logs for certain API requests, which would help us detect these failures earlier, especially:
The other option would be to rely on the creation of nodes via the API server.
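A minimal sketch of that second option, assuming we can list provisioned instances with their launch times and the names of nodes registered with the API server. Everything here (`Instance`, `find_unjoined`, the 15-minute grace period) is hypothetical and not taken from Karpenter, cluster-autoscaler, or CAPI:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Instance:
    """A provisioned machine as reported by the provider (hypothetical shape)."""
    name: str
    launched_at: datetime


def find_unjoined(instances, joined_node_names, now, grace=timedelta(minutes=15)):
    """Return instances older than `grace` that never registered as a node.

    The grace period is an assumption; it only needs to exceed normal boot time.
    """
    joined = set(joined_node_names)
    return [
        i for i in instances
        if i.name not in joined and now - i.launched_at > grace
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    instances = [
        Instance("node-a", now - timedelta(minutes=40)),  # joined
        Instance("node-b", now - timedelta(minutes=30)),  # launched, never joined
        Instance("node-c", now - timedelta(minutes=5)),   # still within grace period
    ]
    stuck = find_unjoined(instances, {"node-a"}, now)
    print([i.name for i in stuck])
```

An instance that is still inside the grace period is not reported, so normal scale-ups would not trigger the check.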
We discussed a couple of options in the standup, but nothing really catches this issue. @giantswarm/team-phoenix, would it be possible to check why the metrics in the Karpenter alert rules are missing? I was not able to find any of those in
Hey @njuettner. I've checked the problem with the The problem is that the I believe the app is managed through GitOps by the customer, so I've asked them to upgrade it. Once they do, we should hopefully have metrics. If you want to play around with these metrics, you can try
the |
Related to https://github.com/giantswarm/adidas/issues/1325#issuecomment-2554383452
TL;DR: both Karpenter and cluster-autoscaler were trying to scale up, but new nodes couldn't join the cluster because of issues with the bootstrap token (link). That resulted in continuous termination and recreation of the failing nodes.
We should be able to detect this situation as early as possible instead of relying on unrelated alerts to catch the symptoms.
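As a rough sketch of what such an early signal could look like: count machines that were terminated without ever joining within a sliding window, and flag the cluster once that count crosses a threshold. The record shape, function names, window, and threshold are all hypothetical and would need to be mapped onto whatever data source is actually available:

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple, Optional


class MachineRecord(NamedTuple):
    """Lifecycle summary of one machine (hypothetical shape)."""
    name: str
    terminated_at: Optional[datetime]  # None while the machine is still running
    joined: bool                       # True if it ever registered as a node


def failed_join_count(records, now, window=timedelta(minutes=30)):
    """Count machines terminated inside `window` that never joined the cluster."""
    cutoff = now - window
    return sum(
        1 for r in records
        if not r.joined
        and r.terminated_at is not None
        and cutoff <= r.terminated_at <= now
    )


def should_alert(records, now, threshold=3, window=timedelta(minutes=30)):
    """Flag a create/terminate loop: too many never-joined terminations recently."""
    return failed_join_count(records, now, window) >= threshold


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    records = [
        MachineRecord("m1", now - timedelta(minutes=25), joined=False),
        MachineRecord("m2", now - timedelta(minutes=15), joined=False),
        MachineRecord("m3", now - timedelta(minutes=5), joined=False),
        MachineRecord("m4", None, joined=True),  # healthy node, ignored
    ]
    print(should_alert(records, now))
```

Counting only never-joined terminations keeps ordinary scale-downs of healthy nodes from contributing to the signal.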