Alert when nodes fail to join cluster #3828

Open
yulianedyalkova opened this issue Jan 13, 2025 · 4 comments
Assignees
Labels
team/tenet Team Tenet

Comments

@yulianedyalkova

Related to https://github.com/giantswarm/adidas/issues/1325#issuecomment-2554383452

Tl;dr both Karpenter and cluster-autoscaler were trying to scale up but new nodes couldn't join the cluster because of issues with the bootstrap token (link). That resulted in continuous termination and recreation of the failing nodes.

We should be able to detect this situation as early as possible instead of relying on unrelated alerts to catch the symptoms.

@yulianedyalkova yulianedyalkova converted this from a draft issue Jan 13, 2025
@yulianedyalkova yulianedyalkova added the team/tenet Team Tenet label Jan 13, 2025
@yulianedyalkova yulianedyalkova moved this from Inbox 📥 to Up Next ➡️ in Roadmap Jan 14, 2025
@njuettner njuettner self-assigned this Jan 22, 2025
@njuettner njuettner moved this from Up Next ➡️ to In Progress ⛏️ in Roadmap Jan 22, 2025
@njuettner
Member

njuettner commented Jan 22, 2025

Debugging

Karpenter

Unfortunately, the Karpenter alerts mentioned there aren't working because the metrics they use don't exist.

https://github.com/giantswarm/adidas/issues/1325#issuecomment-2553426385

Example:

- alert: KarpenterCanNotRegisterNewNodes
  annotations:
    description: |
      Karpenter provisioner {{`{{ $labels.provisioner }}`}} on cluster {{`{{ $labels.cluster_id }}`}} launched new nodes, but some of nodes did not registered in the cluster
  expr: sum by (provisioner, cluster_id, installation, pipeline, provider) (karpenter_machines_launched) - sum by (provisioner, cluster_id, installation, pipeline, provider)(karpenter_machines_registered) != 0
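
To make it easier to reason about what this expression evaluates to, here is a minimal Python sketch of the same arithmetic over in-memory samples. It is an illustration only: the function names and the sample format (a list of label-dict/value pairs) are assumptions, not anything Karpenter or Prometheus provides. It also shows one consequence of PromQL vector matching: if either metric is missing for a label group, the subtraction yields nothing for that group, so the alert stays silent.

```python
from collections import defaultdict

# Labels the rule aggregates by, copied from the `sum by (...)` clause.
GROUP_BY = ("provisioner", "cluster_id", "installation", "pipeline", "provider")


def sum_by(samples, labels=GROUP_BY):
    """Sum sample values per label group, similar to PromQL `sum by (...)`."""
    totals = defaultdict(float)
    for sample_labels, value in samples:
        key = tuple(sample_labels.get(name, "") for name in labels)
        totals[key] += value
    return dict(totals)


def unregistered(launched, registered):
    """Return label groups where launched - registered != 0."""
    launched_totals = sum_by(launched)
    registered_totals = sum_by(registered)
    result = {}
    # A PromQL binary operator only produces output for label groups
    # that exist on both sides, so we iterate over the intersection.
    for key in launched_totals.keys() & registered_totals.keys():
        diff = launched_totals[key] - registered_totals[key]
        if diff != 0:
            result[key] = diff
    return result
```

Calling `unregistered(launched, [])` returns an empty dict, which mirrors the situation described above: with the metrics absent, the rule can never fire.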

Cluster Autoscaler

Cluster Autoscaler had no useful metrics for this case.

Metrics

I also checked whether the bootstrap token secrets had expired at that time, but according to this metric source there was always at least one bootstrap token available and within its 15-minute expiration time.
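
As a rough illustration of that check, the sketch below filters bootstrap token secrets by their `expiration` field. It assumes the secret data has already been base64-decoded into plain dicts with the `token-id` and `expiration` keys that Kubernetes bootstrap token secrets use; the function names and the sample values are hypothetical.

```python
from datetime import datetime, timezone


def parse_rfc3339(value):
    """Parse an RFC 3339 timestamp such as 2025-01-13T10:15:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def usable_tokens(secrets, now):
    """Return the IDs of bootstrap tokens that have not expired at `now`."""
    usable = []
    for secret in secrets:
        expiration = secret.get("expiration")
        # The expiration field is optional; a token without it does not expire.
        if expiration is None or parse_rfc3339(expiration) > now:
            usable.append(secret["token-id"])
    return usable


# Hypothetical sample data: one valid token, one expired token.
example_now = datetime(2025, 1, 13, 10, 0, tzinfo=timezone.utc)
example_secrets = [
    {"token-id": "aaaaaa", "expiration": "2025-01-13T10:10:00Z"},
    {"token-id": "bbbbbb", "expiration": "2025-01-13T09:59:00Z"},
]
example_usable = usable_tokens(example_secrets, example_now)
```

An empty result at any point in time would indicate that joining nodes had no valid token to authenticate with, which was not the case here.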


Additionally, there's a problem with CAPI Machine Pools: because we use them in CAPA clusters, we cannot check whether nodes fail to join, as there are no useful metrics covering that case.

For CAPI MachineDeployments it would be possible, because there the status on the Machine CRs is set when something is wrong.
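
For the MachineDeployment case, a check along these lines could work. This is a sketch under assumptions: it takes Machine objects as plain dicts (for example as returned by the Kubernetes API), treats a missing `status.nodeRef` as "node has not joined", and uses a hypothetical 15-minute grace period. The function names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_rfc3339(value):
    """Parse an RFC 3339 timestamp such as 2025-01-13T10:15:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def machines_not_joined(machines, now, grace=timedelta(minutes=15)):
    """Return names of Machines older than `grace` that still lack status.nodeRef."""
    stuck = []
    for machine in machines:
        created = parse_rfc3339(machine["metadata"]["creationTimestamp"])
        has_node = bool(machine.get("status", {}).get("nodeRef"))
        # The grace period is an assumption; tune it to the usual join time.
        if not has_node and now - created > grace:
            stuck.append(machine["metadata"]["name"])
    return stuck


# Hypothetical sample: one Machine joined, one is stuck, one is still new.
example_now = datetime(2025, 1, 13, 10, 0, tzinfo=timezone.utc)
example_machines = [
    {"metadata": {"name": "m-joined", "creationTimestamp": "2025-01-13T09:00:00Z"},
     "status": {"nodeRef": {"name": "node-1"}}},
    {"metadata": {"name": "m-stuck", "creationTimestamp": "2025-01-13T09:00:00Z"},
     "status": {}},
    {"metadata": {"name": "m-new", "creationTimestamp": "2025-01-13T09:55:00Z"}},
]
example_stuck = machines_not_joined(example_machines, example_now)
```

This does not help for Machine Pools, where no per-node Machine status is available.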

Options

Theoretically, we could use Loki to aggregate the logs for certain API requests. This would help us detect those things earlier, especially for:

•	/api/v1/nodes
•	/api/v1/bootstraptokens
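
A minimal sketch of that aggregation, assuming the log lines are Kubernetes API server audit events in JSON (with the `requestURI` and `responseStatus.code` fields) and have already been fetched from Loki. The Loki query itself is not shown, and the function name is hypothetical; the path prefixes are the ones listed above.

```python
import json
from collections import Counter

# Path prefixes taken from the list above.
WATCHED_PREFIXES = ("/api/v1/nodes", "/api/v1/bootstraptokens")


def count_failed_requests(log_lines, prefixes=WATCHED_PREFIXES):
    """Count non-2xx responses per watched path prefix in JSON audit log lines."""
    failures = Counter()
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(event, dict):
            continue
        uri = event.get("requestURI", "")
        code = (event.get("responseStatus") or {}).get("code")
        if code is None:
            continue  # events without a response code carry no result yet
        for prefix in prefixes:
            if uri.startswith(prefix) and not 200 <= code < 300:
                failures[prefix] += 1
    return failures
```

A sustained non-zero count for `/api/v1/nodes` while the autoscalers are launching instances would be a hint that nodes are being rejected when they try to register.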

The other option would be to rely on the creation of nodes via the API server.


@njuettner
Member

We discussed a couple of options in the standup, but none of them really catches this issue.

@giantswarm/team-phoenix would it be possible to check why the metrics used in the Karpenter alert rules are missing? I was not able to find any of them in alba. I think this would be our best option to detect this specific issue, because unfortunately cluster-autoscaler doesn't have a good metric for machine pools 😞.

@njuettner njuettner moved this from In Progress ⛏️ to Blocked / Waiting ⛔️ in Roadmap Jan 23, 2025
@fiunchinho
Member

Hey @njuettner. I've checked the problem with the Karpenter metrics and this is what I found. When I added the alerts for Karpenter, I also made a new release of our karpenter-app to make it deploy the ServiceMonitor by default. Without the ServiceMonitor, the metrics exported by Karpenter are not scraped.
We deploy karpenter using the karpenter-bundle, so to include that change I also made a new release v1.4.0 of the bundle. I tested all these changes in our MCs, and they worked fine.

The problem is that the karpenter-bundle has not been updated in the customer's clusters, so they are using an older version which is not deploying the ServiceMonitor.

I believe the app is managed through gitops by the customer, so I've asked them to upgrade it. Once they do, hopefully we'd have metrics.

If you want to play around with these metrics, you can try grizzly. The metrics are currently being scraped there.

@fiunchinho
Member

fiunchinho commented Jan 27, 2025

The karpenter app has been upgraded to v1.4.0, so metrics should be available now.

Projects
Status: Blocked / Waiting ⛔️
Development

No branches or pull requests

3 participants