Ensure pods that can't drain aren't considered for provisioning rescheduling #1928

jorgeperezc · 2025-01-24T13:27:34Z

Description

Observed Behavior:
The releases of v1 brings a gift of a new approach regarding the node lifecycle, with graceful and forceful methods for draining and expiration.

Regarding node expiration, according to the Nodepool schema specification, the NodeTerminationGracePeriod property acts as a feature flag to enable a maximum threshold for recycling. If defined as null, the expiration controller will wait indefinitely, respecting do-not-disrupt, PDBs and so on.(Note that I do not remember reading this in the documentation).

  terminationGracePeriod	<string>
    TerminationGracePeriod is the maximum duration the controller will wait
    before forcefully deleting the pods on a node, measured from when deletion
    is first initiated.
    
    [...]

    The feature can also be used to allow maximum time limits for long-running
    jobs which can delay node termination with preStop hooks.
    If left undefined, the controller will wait indefinitely for pods to be
    drained.

Having said that, two erratic behaviors can be observed:

Forceful method enabled. The terminationGracePeriod property is defined as the maximum grace period threshold for draining the node's pods. When the expiration of NodeClaims begins (TTL specified in the expireAfter setting), they are marked with the annotation karpenter.sh/nodeclaim-termination-timestamp, indicating the maximum datetime for decommissioning, and the grace period countdown starts. The affected node workloads, regardless of PDBs and the do-not-disrupt annotation, are identified by the provisioner controller as reschedulable pods, causing the scheduler to determine whether to generate a new NodeClaim as a replacement based on the available capacity. We have use cases with extended grace periods and workloads with significant sizing but the scheduler does not consider the potential maximum grace period, provisioning replacements that might not be used until the application terminates. Additionally, there are pods nominated to be scheduled on the newly provisioned NodeClaims, blocking possible disruptions, lack of synchronization of the cluster state with the in-memory snapshot of Karpenter and extensive enqueuing of reconciliations by the provisioner, creating the perfect ingredients for a disaster. I believe it does not make sense to flip between provisioning and consolidation with resources that may not be used, leading to unnecessary costs. For example, jobs with a TTL of 5 hours that could use the entire grace period but from t0 already have an unused replacement. Aggressive consolidation budgets tend to worsen the situation, leading to more chaos.
Forceful method disabled. The terminationGracePeriod property is left undefined, which generates behavior similar to previous releases of v1 where PDBs and do-not-disrupt annotations are respected, causing the controller to wait indefinitely for the expired NodeClaims workloads to be drained. There are scenarios where this behavior is desired to minimize maximum disruption. In this case, an anomalous behavior occurs similar to the one mentioned earlier, with the difference that pods that cannot be drained are identified as reschedulable pods, leading to the provisioning of new NodeClaims that will never be used. The same flipping behavior persists along with the possibility risk of massive, uncontrolled scaling.

In addition to everything already mentioned, we must also consider the entropy generated by Kubernetes controllers: HPA scaling, new deployments, cronjobs leading to a possible reset of the consolidateAfter setting, suspending potential disruptions and the incorrect sizing of Karpenter pods. As a result of this last point, It could lead to a conflict between goroutines from different controllers, leading to excessive context switching, which degrades performance. Certain controllers may end up consuming more CPU time, resulting in greater disparity from the expected behavior. I am not sure if this is addressed in the documentation but it would be valuable to outline best practices for users who are unaware of the runtime and code behavior to avoid poor performance or discrepancies in the actions performed by controllers.

Example events observed:

Events:
  Type    Reason       Age       From       Message
  ----    ------       ----      ----       -------
  Normal   Nominated   38m     karpenter    Pod should schedule on: nodeclaim/test-5tu1t
  Normal   Nominated   27m     karpenter    Pod should schedule on: nodeclaim/test-qqkw6
  Normal   Nominated   16m     karpenter    Pod should schedule on: nodeclaim/test-nshgw
  Normal   Nominated   5m44s   karpenter    Pod should schedule on: nodeclaim/test-0tgjd

Events:
  Type    Reason           Age                  From       Message
  ----    ------           ----                 ----       -------
Normal  DisruptionBlocked  14s (x4 over 4m17s)  karpenter  Cannot disrupt NodeClaim: state node is nominated for a pending pod

Expected Behavior:

Forceful method enabled (expiration controller): The provisioner controller, particularly the scheduler, should consider the maximum time it may take to drain workloads before creating a replacement NodeClaim. It should also account for the average time required to provision new nodes. For example, if a workload consumes its 3 hours grace period (similar to nodeTerminationGracePeriod) and the average provisioning time for new nodes, the scheduler will create new NodeClaims with enough time before the forced decomission. This ensures the new replacement capacity is available while balancing both costs and reliability.
Forceful method disabled (expiration controller). The controller will respect workloads with PDBs and do-not-disrupt annotations on expired NodeClaims. The provisioner (scheduler) should not identify these pods as reschedulable, preventing the generation of new NodeClaims that will never be used, thus avoiding unnecessary costs.

I have submitted an illustrative PR demonstrating the expected behavior. It’s likely that the code’s current placement is not ideal and should be moved to the expiration or lifecycle controllers. I compiled those modifications and tested them in a development environment. They appear stable although I’m unsure if they might impact any other functionality.

Let me know if there is anything else I can do to help, as this issue is having a significant impact on costs and preventing access to features in EKS 1.31 that are unsupported by earlier v1 releases.

Reproduction Steps (Please include YAML):

Forceful method enabled:
- Generate a statefulSet with the do-not-disrupt annotation set to true and a terminationGracePeriodSeconds with a wide window. Keep in mind that, although the process may capture SIGTERM, it should continue running without terminating immediately to simulate the described behavior.
- The nodeTerminationGracePeriod property of the NodePool should be equal to or greater than the terminationGracePeriodSeconds of the StatefulSet.
- Define expireAfter in the NodePool to ensure the NodeClaims hosting the pods from your StatefulSet expire.
Forceful method disabled:
- Generate a statefulSet with the do-not-disrupt annotation set to true.
- Do not define a value for the nodeTerminationGracePeriod property of the NodePool or explicitly set it to null.
- Define expireAfter in the NodePool to ensure the NodeClaims hosting the pods of your StatefulSet expire.

Versions:

Karpenter v1.0.7
Chart Karpenter Version: 1.0.7
Chart Karpenter CRDs Version: 1.0.7
EKS v1.29

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

The text was updated successfully, but these errors were encountered:

jmdeal · 2025-02-01T01:47:45Z

I think there's some good points here, but I'd like to clarify a couple of things:

If defined as null, the expiration controller will wait indefinitely, respecting do-not-disrupt, PDBs and so on

This is not referring to the expiration controller, it is referring to the termination controller. Expiration will take place regardless of TGP or any other factor.

Forceful method disabled (expiration controller). The controller will respect workloads with PDBs and do-not-disrupt annotations on expired NodeClaims

This isn't the intended expiration behavior on v1+. The main purpose of expiration is to enforce maximum node lifetimes for compliance purposes, which isn't possible if it respects PDBs or do-not-disrupt annotations. We could consider reimplementing a "graceful" expiration method in the future, but most use cases we have heard are more appropriately addressed by drift. There's some more context in the design doc.

The change as you've suggested (IIUC) would have some unintended consequences for the remaining graceful disruption modes (drift and consolidation). At a high level, graceful disruption follows these steps:

Determine disruption candidate(s)
Provision replacement capacity
- If replacement capacity can't be created, bail
Set deletion timestamp on node and begin drain

If Karpenter does not continue to nominate nodes for pods on the disrupted nodes, those nodes may be consolidated during the drain process. At least when considering pods on nodes which were gracefully disrupted, this could end up increasing node churn.

That being said, I think there's a good argument for ignoring pods with the do-not-disrupt annotation. The intended use-case for the do-not-disrupt annotation is jobs, and ideally your TGP is large enough that it allows jobs to complete before the node is terminated. Pods being forcibly deleted should be the exception, not the norm. I don't think we want to ignore pods which are being blocked by PDBs though, since we can't predict when those PDBs will be unblocked. In a healthy cluster, PDBs should really just serve to rate limit disruption, rather than blocking indefinitely.

These are just my initial thoughts, I'd have to think through the ramifications for disruption some more, but I definitely think there's paths for improvement that we should explore.

jorgeperezc · 2025-02-04T11:09:51Z

Apologies for mixing up the concepts, you are right. I am trying to learn and find an explanation in the source code for the observed behavior.

The big problem I see is that if you do not set a TGP and you have workloads with the do-not-disrupt annotation, the termination controller will wait indefinitely. Meanwhile, if the cluster has not enough capacity, it will create another nodeClaim for the workload that cannot be drained. Therefore, we end up paying cloud providers for nothing.

https://github.com/kubernetes-sigs/karpenter/blob/v1.0.5/pkg/controllers/node/termination/terminator/terminator.go#L165

func (t *Terminator) DeleteExpiringPods(ctx context.Context, pods []*corev1.Pod, nodeGracePeriodTerminationTime *time.Time) error {
	for _, pod := range pods {
		// check if the node has an expiration time and the pod needs to be deleted
		deleteTime := t.podDeleteTimeWithGracePeriod(nodeGracePeriodTerminationTime, pod)
		if deleteTime != nil && ...
        }
...
return nil
}

* In this case, I see the deleteTime always will be nil, waiting for an eviction that will never happen.

I want to understand the possibility of not setting a TGP, as it seems to make no sense and try to align our config with the new behavior.

I am grateful for your time and help @jmdeal . What would be a use case for not setting a TGP, considering the waste of resources?

PS: Although migrating to the new version is being a bit painful, kudos to the team for the project!

Thanks.

jmdeal · 2025-02-06T19:47:12Z

The big problem I see is that if you do not set a TGP and you have workloads with the do-not-disrupt annotation, the termination controller will wait indefinitely.

One of the important bits here is that this should only occur due to a forceful disruption method: expiration, interruption, or a kubectl delete node / nodeclaim. Graceful disruption methods (drift and consolidation) are both blocked by do-not-disrupt pods when TGP isn't configured. Interruption, at least for the AWS provider, occurs when a spot interruption request is received which implies that the instance will be terminated regardless after ~2 minutes. It would get stuck if you're manually deleting nodes / nodeclaims, but given manual intervention is already taking place I don't know if this is too much of an issue. That leaves the remaining method, expiration, as the only real gotcha I think.

The use-cases for forceful expiration without configuring TGP are a bit tenuous IMO. The main use case for expiration is ensuring a maximum node lifetime, and TGP is an essential part of that process. While I don't think there's a strong use-case for configuring expiration without TGP, I'm definitely open to examples. When we were collecting use-cases before changing expiration to forceful, we found that drift was the better mechanism most of the time. Regardless, since you can configure expireAfter without TGP, we should ensure the behavior is as good as possible, even if I don't think there's a strong use case.

Minimally, I think it would be reasonable to not provision capacity for do-not-disrupt pods on nodes without TGP configured. As you've called out, since there's no upper bound on node drain time, we could be creating spare capacity which is never used. Figuring out how to extend this to nodes with TGP configured seems a little trickier. For forceful disruption, I think it makes sense to wait of the pod to be drained before creating a replacement. This will marginally increase the restart latency for the pod, but I think that's likely worth the tradeoff. It's a little trickier with drift, since we will only drift a node iff we can create replacements for all scheduled pods. If we ignore the do-not-disrupt pods, we run the risk of disrupting the nodes via drift without being able to create replacements (e.g. if the NodePools have reached their resource limits). I think this would be a great discussion for our next working group meeting, I'll plan on adding it to the doc.

I'm going to change this issue from a bug to a feature request, Karpenter is currently behaving as intended but I agree that enhancements could be made to improve the experience around forceful disruption and do-not-disrupt pods.

/kind feature
/triage accepted

jmdeal · 2025-02-06T19:47:24Z

/remove-kind bug

jorgeperezc · 2025-02-10T15:39:34Z

Great! Thank you very much. I will stay tuned for any updates

otoupin-nsesi · 2025-02-21T15:56:04Z

Reading this, it’s exactly what I reported back in November in this ticket.

In this case, an anomalous behaviour occurs similar to the one mentioned earlier, with the difference that pods that cannot be drained are identified as reschedulable pods, leading to the provisioning of new NodeClaims that will never be used.

It’s exactly the behaviour we are seeing in our tests (and reported previously). 

The same flipping behaviour persists along with the possibility risk of massive, uncontrolled scaling.

In our tests (15min node expiry to trigger the behaviour as often as possible), it’s not a risk, it’s exactly what is happening, a small test cluster, which normally run 6-8 nodes, jump to 20-30 nodes, most of them unused.  It's trying to reschedule pods but it is blocked by their PDB (in our case, a few small CNPG databases, CNPG flag the primary with a PDB)

It’s a bit better with TGP, the node count stabilizes. However, while we understand the warning about expireAfter and terminationGracePeriod, Doing so can result in partially drained nodes stuck in the cluster, driving up cluster cost and potentially requiring manual intervention to resolve, in practice with the old behaviour it was rarely a major problem, not killing an important singleton pod, or a database primary was a good tradeoff for the extra partial nodes and costs. Manual intervention or normal “daily life” restarts and deployments would slowly expire the problematic nodes.

We are revisiting the Karpenter v1.x upgrade, for the same reason stated above (the requirement to upgrade to Karpenter 1.31 very soon), so this issue is for us considered a blocker, and a bug. For now, the only really real solution is to continue w/ the upgrade but disable expiry completely as the current behaviour is not exactly confidence inspiring .

To summarize:

We confirm this issue is valid, and exist in the wild, we can reproduce it fairly easily.
Expiry w/o TGP might not be ideal, but should be considered a valid use case and a valid feature. An operator can weigh the pro & con for the use cases, and make an informed choice.
Shouldn’t the massive scaling issue be considered a bug? What we see is new NodeClaims being created over and over again. That Karpenter tries to create a new node and NodeClaim to fit the “to be evicted pods” is one thing, but it should do so once, not every 30s. With this fixed we can debate if the new forceful behaviour is good or not, but at least we won’t have clusters going crazy and creating infinite nodes.

jonathan-innis · 2025-02-24T11:12:14Z

Shouldn’t the massive scaling issue be considered a bug? What we see is new NodeClaims being created over and over again. That Karpenter tries to create a new node and NodeClaim to fit the “to be evicted pods” is one thing, but it should do so once, not every 30s.

You're right that this is definitely not expected. Can you share some logs from this occurring? What Karpenter should be doing in this case is launching nodes for the pods one time and then leaving it alone. If you are seeing something different that definitely sounds like a bug (but that's sort of a separate problem from what this issue is calling out).

jorgeperezc · 2025-02-24T11:14:04Z

As @otoupin-nsesi describes, this is also a blocker for us in upgrading to EKS 1.31 due to the considerable impact on the platform.

@jmdeal @jonathan-innis, could you reconsider classifying this as a bug and at least release a patch for cases where TGP is undefined and expireAfter is set, to prevent unnecessary node provisioning?

jorgeperezc · 2025-02-24T11:18:43Z

@jonathan-innis Don't you think provisioning unnecessary resources is a problem?

You're right that this is definitely not expected. Can you share some logs from this occurring? What Karpenter should be doing in this case is launching nodes for the pods one time and then leaving it alone.

vinanrra · 2025-02-24T12:47:42Z

Hi, we are having the same issue, it's creating a bunch of nodes, and it's increasing our cots, we had to revert to an older version, to avoid this

otoupin-nsesi · 2025-02-24T20:45:19Z

Shouldn’t the massive scaling issue be considered a bug? What we see is new NodeClaims being created over and over again. That Karpenter tries to create a new node and NodeClaim to fit the “to be evicted pods” is one thing, but it should do so once, not every 30s.

You're right that this is definitely not expected. Can you share some logs from this occurring? What Karpenter should be doing in this case is launching nodes for the pods one time and then leaving it alone. If you are seeing something different that definitely sounds like a bug (but that's sort of a separate problem from what this issue is calling out).

The following logs reflect what we are seeing. The same pods can't be evicted as they have an active PDB, a new nodeclaim is created and not used (or barely used). In our case we have budget 10% for consolidation, so Karpenter can't consolidate fast enough (however it appears those nodeclaims shouldn't be created in the first place anyway) .

Are we sure it's a different issue than reported here? The title is "massive scaling", not "a few extra nodes to fit the to be evicted pods", so I assumed it was related.

2025-02-20 14:36:35.750	
provisioner found provisionable pod(s) app-xyz/app-xyz-cluster-pg-1, app-xyz/app-xyz-cluster-pg-2, monitoring/loki-chunks-cache-0, kubecost/kubecost-cost-analyzer-9d47bb889-n7p6b, app-abc/app-abc-pg-2 and 18 other(s)
2025-02-20 14:36:35.751	
provisioner computed new nodeclaim(s) to fit pod(s) 
2025-02-20 14:36:35.751	
provisioner computed 4 unready node(s) will fit 21 pod(s) 
2025-02-20 14:36:35.783	
provisioner created nodeclaim default-m66wn
2025-02-20 14:36:37.481	
nodeclaim.lifecycle launched nodeclaim default-m66wn
2025-02-20 14:36:39.984	
provisioner found provisionable pod(s) app-xyz/app-xyz-cluster-pg-1, app-xyz/app-xyz-cluster-pg-2, monitoring/loki-chunks-cache-0, kubecost/kubecost-cost-analyzer-9d47bb889-n7p6b, app-abc/app-abc-pg-2 and 18 other(s)
2025-02-20 14:36:39.984	
provisioner computed new nodeclaim(s) to fit pod(s) 
2025-02-20 14:36:39.984	
provisioner computed 4 unready node(s) will fit 22 pod(s) 
2025-02-20 14:36:40.006	
provisioner created nodeclaim default-s6njn
2025-02-20 14:36:41.469	
nodeclaim.lifecycle initialized nodeclaim default-5fh96
2025-02-20 14:36:41.608	
nodeclaim.lifecycle launched nodeclaim default-s6njn
2025-02-20 14:36:44.844	
nodeclaim.lifecycle initialized nodeclaim default-wnmd5
2025-02-20 14:36:49.113	
nodeclaim.lifecycle initialized nodeclaim default-dxph4
2025-02-20 14:36:53.068	
nodeclaim.lifecycle registered nodeclaim default-m66wn
2025-02-20 14:36:58.115	
nodeclaim.lifecycle registered nodeclaim default-s6njn
2025-02-20 14:37:03.800	
provisioner found provisionable pod(s) app-xyz/app-xyz-cluster-pg-1, app-abc/app-abc-pg-2, monitoring/monitoring-grafana-pg-1, kube-system/metrics-server-ff56cc7c9-xlmzb, monitoring/kubernetes-event-exporter-785ccf7bb6-nrd7b and 7 other(s)
2025-02-20 14:37:03.800	
provisioner computed new nodeclaim(s) to fit pod(s) 
2025-02-20 14:37:03.800	
provisioner computed 2 unready node(s) will fit 10 pod(s) 
2025-02-20 14:37:03.821	
provisioner created nodeclaim default-5t2ft
2025-02-20 14:37:06.533	
nodeclaim.lifecycle launched nodeclaim default-5t2ft
2025-02-20 14:37:08.178	
nodeclaim.lifecycle initialized nodeclaim default-m66wn
2025-02-20 14:37:13.157	
nodeclaim.lifecycle initialized nodeclaim default-s6njn
2025-02-20 14:37:22.584	
nodeclaim.lifecycle registered nodeclaim default-5t2ft
2025-02-20 14:37:23.850	
provisioner found provisionable pod(s) app-xyz/app-xyz-cluster-pg-1, app-abc/app-abc-pg-2, monitoring/monitoring-grafana-pg-1, kube-system/metrics-server-ff56cc7c9-xlmzb, monitoring/kubernetes-event-exporter-785ccf7bb6-nrd7b and 7 other(s)
2025-02-20 14:37:23.850	
provisioner computed new nodeclaim(s) to fit pod(s) 
2025-02-20 14:37:23.850	
provisioner computed 2 unready node(s) will fit 10 pod(s) 
2025-02-20 14:37:23.870	
provisioner created nodeclaim default-qd65c
2025-02-20 14:37:25.514	
nodeclaim.lifecycle launched nodeclaim default-qd65c
2025-02-20 14:37:37.246	
nodeclaim.lifecycle initialized nodeclaim default-5t2ft
2025-02-20 14:37:42.688	
nodeclaim.lifecycle registered nodeclaim default-qd65c
2025-02-20 14:37:43.793	
provisioner found provisionable pod(s) app-xyz/app-xyz-cluster-pg-1, app-abc/app-abc-pg-2, monitoring/monitoring-grafana-pg-1, kube-system/metrics-server-ff56cc7c9-xlmzb, monitoring/kubernetes-event-exporter-785ccf7bb6-nrd7b and 7 other(s)
2025-02-20 14:37:43.793	
provisioner computed new nodeclaim(s) to fit pod(s) 
2025-02-20 14:37:43.793	
provisioner computed 2 unready node(s) will fit 10 pod(s) 
2025-02-20 14:37:43.809	
provisioner created nodeclaim default-spmvd
2025-02-20 14:37:46.414	
nodeclaim.lifecycle launched nodeclaim default-spmvd
2025-02-20 14:37:57.531	
nodeclaim.lifecycle initialized nodeclaim default-qd65c
2025-02-20 14:38:02.594	
nodeclaim.lifecycle registered nodeclaim default-spmvd
2025-02-20 14:38:03.791	
provisioner found provisionable pod(s) app-xyz/app-xyz-cluster-pg-1, app-abc/app-abc-pg-2, monitoring/monitoring-grafana-pg-1, kube-system/metrics-server-ff56cc7c9-xlmzb, monitoring/kubernetes-event-exporter-785ccf7bb6-nrd7b and 6 other(s)
2025-02-20 14:38:03.792	
provisioner computed new nodeclaim(s) to fit pod(s) 
2025-02-20 14:38:03.792	
provisioner computed 2 unready node(s) will fit 9 pod(s) 
2025-02-20 14:38:03.817	
provisioner created nodeclaim default-6hn8d
2025-02-20 14:38:06.486	
nodeclaim.lifecycle launched nodeclaim default-6hn8d
2025-02-20 14:38:17.645	
nodeclaim.lifecycle initialized nodeclaim default-spmvd
2025-02-20 14:38:23.007	
nodeclaim.lifecycle registered nodeclaim default-6hn8d

logs are from json log w/ this formatting/filtering {namespace="karpenter"} |= `` | json | line_format `{{default .controller .reason}} {{.message}} {{default .Pods (default .Node_name .NodeClaim_name)}}. This test was done with Karpenter v1.0.8.

I would expect that Karpenter wouldn't issue "provisioner found provisionable pod(s)" and a new nodeclaim for the same set of pods in such a short timeframe. Hopefully this set of logs is not too noisy, as it's creating both legitimate nodeclaim (nodes were expiring so there are legitimate workloads to reschedule), and superfluous capacity (apparently creating capacity for the same un-evictable pods over and over again).

saurav-agarwalla · 2025-02-26T19:05:44Z

I put up a PR to solve this for pods with karpenter.sh/do-not-disrupt=true since they can't really reschedule to another node: #2033. After this is merged, you'll not see new nodeclaims being created for these pods.

This was the same issue that was raised in aws/karpenter-provider-aws#7521.

jorgeperezc · 2025-02-27T09:30:59Z

Thanks for your help @saurav-agarwalla but that solution is not working.

And if you have a PDB, do you provision a node that we won't use? If I define TGP, do you provision another one from the very first second?

On the other hand, what will be the strategy? Update from v0.3X.X to v1.2.2?

One thing is clear: every day that passes, we're giving away money to AWS for resources we do not use. So I guess you could be in for a nice bonus this year xD

Jokes aside, let the experts speak, because this issue has been around for a while and has been reported by several people. The only action taken has been to add notes to the documentation. If the plan is to do nothing, then say it clearly so we can stop wasting time.

saurav-agarwalla · 2025-02-27T17:21:54Z

@jorgeperezc

Thanks for your help @saurav-agarwalla but that solution is not working.

When you say not working, do you mean it doesn't fully solve the problem for other pods or it will not solve the issue for karpenter.sh/do-not-disrupt=true? As far as I understand, it should solve the problem for karpenter.sh/do-not-disrupt=true pods.

I totally agree that other pods like ones with PDBs shouldn't run into the same problem as well but they are a little more nuanced. I am thinking that factoring in the TGP when nominating these pods for a new node will ensure that Karpenter doesn't spin up a new node immediately which is going to stay underutilised until the TGP expires (or the pods are terminated due to any other reason).

Not yet sure if there are any downside to this.

saurav-agarwalla · 2025-02-27T17:53:10Z

I brought this up in the community meeting today as well. In general, there's consensus that we can do better here and one of the ways is to consider the TGP when spinning up new replacement nodes. I am going to dig into this a little more to understand if there are any other gotchas to this or if we can simplify this further (e.g. one thing that was brought up was does Karpenter even need to care about the TGP or if there are other ways to handle cases where a nodeclaim is stuck in terms of spinning new capacity).

I'll assign this to myself for now. Not sure if I'll be able to get to this immediately but this is high on my priority list.

saurav-agarwalla · 2025-02-27T17:53:19Z

/assign

jorgeperezc · 2025-02-28T15:00:08Z

@jorgeperezc

Thanks for your help @saurav-agarwalla but that solution is not working.

When you say not working, do you mean it doesn't fully solve the problem for other pods or it will not solve the issue for karpenter.sh/do-not-disrupt=true? As far as I understand, it should solve the problem for karpenter.sh/do-not-disrupt=true pods.

I totally agree that other pods like ones with PDBs shouldn't run into the same problem as well but they are a little more nuanced. I am thinking that factoring in the TGP when nominating these pods for a new node will ensure that Karpenter doesn't spin up a new node immediately which is going to stay underutilised until the TGP expires (or the pods are terminated due to any other reason).

Not yet sure if there are any downside to this.

What I mean is that if you don't define TGP, we will have blocks due to both do-not-disrupt annotations and PDBs. The PR only solves half of the problem, so we will continue provisioning capacity that will not be used.

I saw you discussed this with @otoupin-nsesi in the PR 👍. And you're right that checking if it is really evictable involves logic that IMO doesn't make sense to apply in those pod utils functions. From my understanding, I think you'll have to play with provisioner, utils/pdb and utils/pod:

To recap and give you my perspective on the current situation, I see two clear problems:

TGP is undefined and the expiration time is set. I think the fix for this issue should be urgently ported to all v1 minor versions so that we can proceed with the migration path without major inconveniences, especially since we need to update to EKS 1.31 soon to avoid paying for extended support and to comply with the compatibility matrix.
TGP and expiration times are defined. I’ve watched the meeting and there are a few points I’d like to share from my perspective:
- It’s mentioned that if we don’t create a replacement node, there’s a possibility that the node marked as expired could also become underutilized. Yes and no. There are workloads that may reserve the capacity of a node and Karpenter will always try to provision the cheapest instance that has enough capacity. There will always be use cases that fall outside the norm, no matter how well you define the requirements for your nodePools. And in the case that the node marked as expired becomes underutilized, at least we’re not contributing to making the ball bigger by provisioning capacity that won’t be used.
- With TGP defined, you don’t have to worry about PDBs. If I’m not mistaken, there’s a delete in the code as big as a cathedral that bypasses everything.
- I share @GnatorX perspective, which is that we could completely ignore workloads in the termination flow for expired nodes or when TGP is greater than x during the scheduling phase and simply wait for pods to enter a pending status. A slower replacement? I don’t think it’s a problem. If your workloads are well configured with multiple replicas and distributed across different AZs. In the case of a singleton, it’s not a problem either. They are already assuming it could happen and if not, the user should use Kubernetes leases and have multiple replicas via a leader/follower pattern.

For my part, I’ll see how I can help further. In addition to these 3 attached issues, there are many more related to the same problem so the community will definitely appreciate it.

saurav-agarwalla · 2025-02-28T15:16:21Z

I'm definitely going in that direction. I dug into the code yesterday and didn't see a reason to continue provisioning capacity for pods which are blocked on a deleting node due to PDB regardless of TGP like @GnatorX suggested yesterday. I have a fix staged locally which solved this by ignoring pods stuck in eviction due to PDB. I'll clean it up and push it onto my existing PR today. With that, it looks like we will cover all cases of Karpenter provisioning unnecessary capacity when pods are stuck on a deleting node regardless of the reason.

GnatorX · 2025-02-28T19:58:55Z

I want to add a few more points that is related we likely should put some thoughts into moving forward in the future.

I think Karpenter is trying to be safe while efficient however these two concept tends to collide. Similar to the scenario mentioned here, Karpenter is trying to "safe" by pre-creating nodes for pods it is evicting however knowing when pods will deleted isn't something Karpenter knows about it just knows about a range (0 sec - TGP or infinite if TGP isn't set). This comes at the cost of efficiency where it could be creating nodes that won't be used and leading to node churn. I do wonder if Karpenter should get out of the business of trying to precreating nodes.
Karpenter should know if the nodes it created are usable/ being used regardless of if the reason is user self-inflicted or Karpenter bug. In cases like this where nodes are pre-created but left unused then terminated over and over again as a user I want to know this as a metric in addition to Karpenter likely should give up after some time and change the condition of the NodePool. Karpenter should handle when it is creating nodes that aren't used #2043

IMO Karpenter should lean more towards "efficiency" rather than "safety" since it has limited knowledge that allows it to perform safe actions vs efficiency where it knows much more

saurav-agarwalla · 2025-02-28T20:54:48Z

I've updated #2033 to account for pods with PDB holding up their eviction as well.

jonathan-innis · 2025-03-05T00:24:59Z

/retitle Ensure pods that can't drain aren't considered for provisioning rescheduling

jorgeperezc added the kind/bug label Jan 24, 2025

k8s-ci-robot added the needs-triage label Jan 24, 2025

jorgeperezc mentioned this issue Jan 24, 2025

Avoid overprovisioning capacity for workloads on expired nodes #1929

Draft

k8s-ci-robot added kind/feature triage/accepted and removed needs-triage labels Feb 6, 2025

k8s-ci-robot removed the kind/bug label Feb 6, 2025

This was referenced Feb 27, 2025

Cannot disrupt Node: state node is nominated for a pending pod aws/karpenter-provider-aws#7521

Open

Endless nodes are created after expireAfter elapse on a node in some scenarios #1842

Closed

k8s-ci-robot assigned saurav-agarwalla Feb 27, 2025

saurav-agarwalla mentioned this issue Feb 27, 2025

fix: don't provision unnecessary capacity for pods which can't move to a new node #2033

Open

GnatorX mentioned this issue Feb 28, 2025

Karpenter should handle when it is creating nodes that aren't used #2043

Open

k8s-ci-robot changed the title ~~Massive scaling and anomalous behaviors with the new forceful disruption method~~ Ensure pods that can't drain aren't considered for provisioning rescheduling Mar 5, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure pods that can't drain aren't considered for provisioning rescheduling #1928

Ensure pods that can't drain aren't considered for provisioning rescheduling #1928

jorgeperezc commented Jan 24, 2025 •

edited

Loading

jmdeal commented Feb 1, 2025

jorgeperezc commented Feb 4, 2025

jmdeal commented Feb 6, 2025

jmdeal commented Feb 6, 2025

jorgeperezc commented Feb 10, 2025

otoupin-nsesi commented Feb 21, 2025

jonathan-innis commented Feb 24, 2025

jorgeperezc commented Feb 24, 2025

jorgeperezc commented Feb 24, 2025 •

edited

Loading

vinanrra commented Feb 24, 2025

otoupin-nsesi commented Feb 24, 2025

saurav-agarwalla commented Feb 26, 2025

jorgeperezc commented Feb 27, 2025 •

edited

Loading

saurav-agarwalla commented Feb 27, 2025

saurav-agarwalla commented Feb 27, 2025

saurav-agarwalla commented Feb 27, 2025

jorgeperezc commented Feb 28, 2025 •

edited

Loading

saurav-agarwalla commented Feb 28, 2025

GnatorX commented Feb 28, 2025 •

edited

Loading

saurav-agarwalla commented Feb 28, 2025

jonathan-innis commented Mar 5, 2025

Ensure pods that can't drain aren't considered for provisioning rescheduling #1928

Ensure pods that can't drain aren't considered for provisioning rescheduling #1928

Comments

jorgeperezc commented Jan 24, 2025 • edited Loading

Description

jmdeal commented Feb 1, 2025

jorgeperezc commented Feb 4, 2025

jmdeal commented Feb 6, 2025

jmdeal commented Feb 6, 2025

jorgeperezc commented Feb 10, 2025

otoupin-nsesi commented Feb 21, 2025

jonathan-innis commented Feb 24, 2025

jorgeperezc commented Feb 24, 2025

jorgeperezc commented Feb 24, 2025 • edited Loading

vinanrra commented Feb 24, 2025

otoupin-nsesi commented Feb 24, 2025

saurav-agarwalla commented Feb 26, 2025

jorgeperezc commented Feb 27, 2025 • edited Loading

saurav-agarwalla commented Feb 27, 2025

saurav-agarwalla commented Feb 27, 2025

saurav-agarwalla commented Feb 27, 2025

jorgeperezc commented Feb 28, 2025 • edited Loading

saurav-agarwalla commented Feb 28, 2025

GnatorX commented Feb 28, 2025 • edited Loading

saurav-agarwalla commented Feb 28, 2025

jonathan-innis commented Mar 5, 2025

jorgeperezc commented Jan 24, 2025 •

edited

Loading

jorgeperezc commented Feb 24, 2025 •

edited

Loading

jorgeperezc commented Feb 27, 2025 •

edited

Loading

jorgeperezc commented Feb 28, 2025 •

edited

Loading

GnatorX commented Feb 28, 2025 •

edited

Loading