[EKS] Ensure ASG Max Size Reverts to Original Value After EKS Upgrade Workflow with Cluster Autoscaler #2500

olekszhel · 2024-12-15T20:35:23Z

Description:
During an EKS managed node group (MNG) upgrade, the Auto Scaling Group (ASG) max size can temporarily increase as part of the upgrade workflow. However, if the Cluster Autoscaler (CA) triggers scale-up activities during the upgrade process (especially during the scale-down phase), the ASG max size is not reverted to its original value after the upgrade completes. This results in unpredictable increases in the ASG max size, creating operational challenges and deviating from the documented behavior.

Expected Behavior:

The EKS upgrade workflow should ensure that the ASG max size is always reverted to its original value after the upgrade completes, even if the Cluster Autoscaler scales up the node group during the upgrade process.
This behavior should align with the documented logic for temporary ASG max size increases during the scale-up phase (e.g., twice the number of availability zones or based on the maxUnavailable value).
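To make the expected numbers concrete, here is a minimal sketch of the temporary increase as I read the managed node group update documentation: the max size is raised by the larger of twice the number of Availability Zones and the maxUnavailable value. The function name `temporary_max_size` and the sample values are hypothetical, and the exact rule is my assumption of the documented behavior, not a statement of how EKS implements it.

```python
def temporary_max_size(original_max: int, num_azs: int, max_unavailable: int) -> int:
    """Expected ASG max size during the scale-up phase (assumed rule).

    Assumption: the increment is the larger of twice the AZ count
    and the node group's maxUnavailable setting.
    """
    increment = max(2 * num_azs, max_unavailable)
    return original_max + increment


original = 10
during_upgrade = temporary_max_size(original, num_azs=3, max_unavailable=1)
print(during_upgrade)  # 16 while the upgrade is running
print(original)        # 10 is what the max size should return to afterwards
```

Under this reading, any max size other than the original value after the upgrade finishes would be a deviation from the documented behavior.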

Current Behavior:

If the Cluster Autoscaler triggers scaling during the upgrade process, the ASG max size is left above its original value after the upgrade completes, and the size of the increase is unpredictable.
This causes potential overprovisioning of nodes and additional costs due to an unintended increase in ASG max size.

Proposed Solution:
Introduce a mechanism in the EKS managed node group upgrade workflow to:

Track the original ASG max size before the upgrade begins.
Automatically restore the original ASG max size once the upgrade process is complete, regardless of any Cluster Autoscaler-triggered activities.
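The two steps above could be sketched roughly as follows. This is only an illustration of the track-and-restore idea from the caller's side, not how the EKS workflow is implemented. It assumes a boto3-style Auto Scaling client (for example `boto3.client("autoscaling")`) is passed in; `describe_auto_scaling_groups` and `update_auto_scaling_group` are the boto3 method names, while `get_max_size` and `preserve_max_size` are hypothetical helpers.

```python
from contextlib import contextmanager


def get_max_size(asg_client, asg_name: str) -> int:
    """Read the current MaxSize of one Auto Scaling group."""
    resp = asg_client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )
    return resp["AutoScalingGroups"][0]["MaxSize"]


@contextmanager
def preserve_max_size(asg_client, asg_name: str):
    """Record MaxSize before the upgrade and restore it afterwards.

    asg_client is assumed to be a boto3 "autoscaling" client or any
    object exposing the same two methods.
    """
    original_max = get_max_size(asg_client, asg_name)
    try:
        yield original_max
    finally:
        # Restore only if something (the upgrade workflow, or Cluster
        # Autoscaler activity during it) left MaxSize changed.
        if get_max_size(asg_client, asg_name) != original_max:
            asg_client.update_auto_scaling_group(
                AutoScalingGroupName=asg_name,
                MaxSize=original_max,
            )
```

Usage would look like `with preserve_max_size(client, "my-node-group-asg"): ...`, with the upgrade started and awaited inside the block (the group name here is a placeholder). One caveat, as I understand the UpdateAutoScalingGroup documentation: setting MaxSize below the current group size also lowers the desired capacity, so a real implementation would need to decide how to treat nodes the Cluster Autoscaler added during the upgrade.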

Benefit to Customers:

Ensures predictable and stable behavior of ASGs post-upgrade.
Prevents unexpected resource usage and costs due to unintended increases in ASG max size.
Improves alignment between EKS upgrade workflows and documented behavior, enhancing operational trust and ease of use.

Additional Context:
This feature is especially critical for users leveraging Cluster Autoscaler to dynamically manage node group sizes in production environments. The current behavior introduces unnecessary manual intervention to reset ASG configurations after upgrades, which this enhancement could automate and eliminate.

Case ID of the support case in which I was advised to create this feature request: 173143486100288

olekszhel added the Proposed (Community submitted issue) label Dec 15, 2024

mikestef9 added the EKS (Amazon Elastic Kubernetes Service) and EKS Managed Nodes labels Dec 16, 2024

mikestef9 changed the title ~~Ensure ASG Max Size Reverts to Original Value After EKS Upgrade Workflow with Cluster Autoscaler~~ [EKS] Ensure ASG Max Size Reverts to Original Value After EKS Upgrade Workflow with Cluster Autoscaler Dec 16, 2024
