Alert when nodes fail to join cluster #3828

yulianedyalkova · 2025-01-13T15:04:10Z

Related to https://github.com/giantswarm/adidas/issues/1325#issuecomment-2554383452

TL;DR: both Karpenter and cluster-autoscaler were trying to scale up, but new nodes couldn't join the cluster because of issues with the bootstrap token (link). This resulted in continuous termination and recreation of the failing nodes.

We should be able to detect this situation as early as possible instead of relying on unrelated alerts to catch the symptoms.

njuettner · 2025-01-22T16:51:42Z

Debugging

Karpenter

Unfortunately, the Karpenter alerts mentioned there aren't working, because the metrics they use don't exist.

https://github.com/giantswarm/adidas/issues/1325#issuecomment-2553426385

Example:

- alert: KarpenterCanNotRegisterNewNodes
  annotations:
    description: |
      Karpenter provisioner {{`{{ $labels.provisioner }}`}} on cluster {{`{{ $labels.cluster_id }}`}} launched new nodes, but some of the nodes did not register in the cluster.
  expr: sum by (provisioner, cluster_id, installation, pipeline, provider) (karpenter_machines_launched) - sum by (provisioner, cluster_id, installation, pipeline, provider) (karpenter_machines_registered) != 0
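For anyone reading the expression for the first time, here is a minimal Python sketch of what it is meant to compute: per label group, the sum of launched machines minus the sum of registered machines, keeping only the groups where the difference is non-zero. The input format (a list of label-dict/value pairs) and the function names are assumptions made for illustration; they are not part of Karpenter or Prometheus.

```python
from collections import defaultdict

# Grouping labels taken from the alert expression.
GROUP_LABELS = ("provisioner", "cluster_id", "installation", "pipeline", "provider")


def sum_by(samples, labels=GROUP_LABELS):
    """Sum sample values per label group, similar to PromQL's `sum by (...)`."""
    totals = defaultdict(float)
    for sample_labels, value in samples:
        key = tuple(sample_labels.get(name, "") for name in labels)
        totals[key] += value
    return dict(totals)


def unregistered_by_group(launched, registered):
    """Return {group: launched - registered} where the difference is non-zero.

    Only groups present on both sides are compared, mirroring how PromQL
    drops series that have no match in a binary operation.
    """
    launched_totals = sum_by(launched)
    registered_totals = sum_by(registered)
    result = {}
    for key, launched_total in launched_totals.items():
        if key not in registered_totals:
            continue
        diff = launched_total - registered_totals[key]
        if diff != 0:
            result[key] = diff
    return result
```

A group with 5 launched and 4 registered machines would show up with a difference of 1, which is the case the alert is supposed to fire on.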

Cluster Autoscaler

cluster-autoscaler had no useful metrics for this case.

Metrics

I also checked whether anything was visible on the bootstrap token secrets, for example whether they had expired at that time. According to this source of metrics, there was always at least one bootstrap token available and within its 15-minute expiration time.
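The check described above can be sketched roughly like this, assuming the `expiration` values of the bootstrap token secrets have already been collected as RFC 3339 strings. The function names and the `min_remaining` parameter are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_expiration(value):
    """Parse an RFC 3339 timestamp such as '2025-01-13T15:04:10Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def usable_tokens(expirations, now=None, min_remaining=timedelta(0)):
    """Return the expirations that lie more than `min_remaining` in the future."""
    if now is None:
        now = datetime.now(timezone.utc)
    return [e for e in expirations if parse_expiration(e) - now > min_remaining]
```

An empty result at any point in time would mean no node could have bootstrapped with a valid token, which is not what the metrics showed here.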

Additionally, there's a problem with CAPI Machine Pools. Because we're using them in CAPA clusters, we cannot check whether nodes are failing to join: there are no useful metrics that cover this case.

For CAPI MachineDeployments it would be possible, because there the status is set on the Machine CRs when something is wrong.
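As a rough sketch of what such a check could look like for MachineDeployments: given Machine objects already fetched as dictionaries, list those that still have no `status.nodeRef` after a grace period. The function name and the 15-minute default are hypothetical, and this does not apply to Machine Pools, which is exactly the gap described above.

```python
from datetime import datetime, timedelta, timezone


def machines_not_joined(machines, now=None, grace=timedelta(minutes=15)):
    """Return names of Machines older than `grace` that have no nodeRef yet."""
    if now is None:
        now = datetime.now(timezone.utc)
    stuck = []
    for machine in machines:
        metadata = machine["metadata"]
        created = datetime.fromisoformat(
            metadata["creationTimestamp"].replace("Z", "+00:00")
        )
        # A Machine whose node has joined carries a reference to that node.
        has_node = bool(machine.get("status", {}).get("nodeRef"))
        if not has_node and now - created > grace:
            stuck.append(metadata["name"])
    return stuck
```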

Options

Theoretically we could use Loki to aggregate the logs for certain API requests. This would help us detect these situations earlier, especially for:

•	/api/v1/nodes
•	/api/v1/bootstraptokens
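To illustrate the kind of aggregation meant here, below is a minimal Python sketch that counts API server audit events per path, verb and response code. It assumes the audit log is available as JSON lines and uses the `requestURI`, `verb` and `responseStatus.code` fields of Kubernetes audit events; the watched paths are copied as-is from the list above, and the function name is hypothetical. In Loki the same grouping would be expressed as a query rather than code.

```python
import json
from collections import Counter

# Paths copied from the list above.
WATCHED_PATHS = ("/api/v1/nodes", "/api/v1/bootstraptokens")


def count_requests(lines, watched_paths=WATCHED_PATHS):
    """Count audit events per (path, verb, status code) for the watched paths."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip log lines that are not JSON
        if not isinstance(event, dict):
            continue
        uri = event.get("requestURI", "")
        path = next((p for p in watched_paths if uri.startswith(p)), None)
        if path is None:
            continue
        code = (event.get("responseStatus") or {}).get("code")
        counts[(path, event.get("verb"), code)] += 1
    return counts
```

A sudden rise of failed `create` requests on the nodes path would be the signal to alert on.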

The other option would be to rely on the creation of nodes via the API server.

njuettner · 2025-01-23T11:39:25Z

We discussed a couple of options in the standup, but none of them really catches this issue.

@giantswarm/team-phoenix would it be possible to check why the metrics used in the Karpenter alert rules are missing? I was not able to find any of them in alba. I think this would be our best option to detect this specific issue, because unfortunately cluster-autoscaler doesn't have a good metric for machine pools 😞.

fiunchinho · 2025-01-23T15:30:43Z

Hey @njuettner. I've checked the problem with the Karpenter metrics and this is what I found out. When I added the alerts for Karpenter, I also made a new release of our karpenter-app to make it deploy the ServiceMonitor by default. Without the ServiceMonitor, the metrics exported by Karpenter are not scraped.
We deploy karpenter using the karpenter-bundle, so to include that change I also made a new release v1.4.0 of the bundle. I tested all these changes in our MCs, and they worked fine.

The problem is that the karpenter-bundle has not been updated in the customer's clusters, so they are using an older version which is not deploying the ServiceMonitor.

I believe the app is managed through gitops by the customer, so I've asked them to upgrade it. Once they do, hopefully we'd have metrics.

If you want to play around with these metrics, you can try grizzly. The metrics are currently being scraped there.

fiunchinho · 2025-01-27T10:49:23Z

The karpenter app has been upgraded to v1.4.0, so there should be metrics available now.

yulianedyalkova added this to Roadmap Jan 13, 2025

yulianedyalkova converted this from a draft issue Jan 13, 2025

yulianedyalkova added the team/tenet Team Tenet label Jan 13, 2025

yulianedyalkova moved this from Inbox 📥 to Up Next ➡️ in Roadmap Jan 14, 2025

njuettner self-assigned this Jan 22, 2025

njuettner moved this from Up Next ➡️ to In Progress ⛏️ in Roadmap Jan 22, 2025

njuettner moved this from In Progress ⛏️ to Blocked / Waiting ⛔️ in Roadmap Jan 23, 2025
