chore: Add duplicate nodeclaim check for Repair Controller #1916

Closed
62 changes: 40 additions & 22 deletions pkg/controllers/node/health/controller.go
@@ -77,32 +77,24 @@ func (c *Controller) Reconcile(ctx context.Context, node *corev1.Node) (reconcil
ctx = log.IntoContext(ctx, log.FromContext(ctx).WithValues("Node", klog.KRef(node.Namespace, node.Name)))

// Validate that the node is owned by us
if !nodeutils.IsManaged(node, c.cloudProvider) {
return reconcile.Result{}, nil
}
nodeClaim, err := nodeutils.NodeClaimForNode(ctx, c.kubeClient, node)
if err != nil {
return reconcile.Result{}, nodeutils.IgnoreNodeClaimNotFoundError(err)
}

// If a nodeclaim has a nodepool label, validate that the nodeclaims inside the nodepool are healthy (i.e. below the allowed threshold)
// In the case of a standalone nodeclaim, validate that the nodes inside the cluster are healthy before proceeding
// to repair the nodes
nodePoolName, found := nodeClaim.Labels[v1.NodePoolLabelKey]
if found {
nodePoolHealthy, err := c.isNodePoolHealthy(ctx, nodePoolName)
if err != nil {
return reconcile.Result{}, client.IgnoreNotFound(err)
}
if !nodePoolHealthy {
return reconcile.Result{}, c.publishNodePoolHealthEvent(ctx, node, nodeClaim, nodePoolName)
}
} else {
clusterHealthy, err := c.isClusterHealthy(ctx)
if err != nil {
return reconcile.Result{}, err
}
if !clusterHealthy {
c.recorder.Publish(NodeRepairBlockedUnmanagedNodeClaim(node, nodeClaim, fmt.Sprintf("more than %s nodes are unhealthy in the cluster", allowedUnhealthyPercent.String()))...)
if nodeutils.IsDuplicateNodeClaimError(err) {
log.FromContext(ctx).Error(err, "failed to validate node health")
Member:
It's odd that we would just be logging the error and not returning the error. If you are looking to handle cases where you don't want to keep requeueing because the error isn't retryable, you can consider doing something like a TerminalError

Member:
Anyway, do you need to log an error for this kind of failure? I think this should propagate up as a registration error if it even happens at all -- I'd probably just consider ignoring this rather than logging it

@engedaam (Contributor, Author) on Jan 15, 2025:
I can use reconcile.TerminalError here (didn't know this was an option). The idea was just to fail loud and not reconcile on the error. Customers should really intervene and fix the broken state

Member:
IMO, I don't think that you need to code this state in because we don't really even know of a true instance of this happening -- I would be for just removing the error logging that you have here and then just do the standard ignore pattern that we have with other errors that we just don't want to reconcile on -- there are potential options that we could consider with marking NodeClaims as unhealthy in our status conditions if we haven't already, but handling it here feels like the wrong place to do it -- if every controller handled it this way, the logs would start to look really noisy when you ran into this state

@engedaam (Contributor, Author):
Yeah, that seems reasonable to me. I can drop this check altogether. I'll open a new PR to avoid needing to alter too much in this commit

return reconcile.Result{}, nil
}
return reconcile.Result{}, nodeutils.IgnoreNodeClaimNotFoundError(fmt.Errorf("validating node health, %w", err))
}

healthyOwnerResources, err := c.vaildateNodeClaimOwnerHealth(ctx, nodeClaim, node)
if err != nil {
return reconcile.Result{}, client.IgnoreNotFound(err)
}
if !healthyOwnerResources {
return reconcile.Result{}, nil
}

unhealthyNodeCondition, policyTerminationDuration := c.findUnhealthyConditions(node)
@@ -185,6 +177,32 @@ func (c *Controller) annotateTerminationGracePeriod(ctx context.Context, nodeCla
return nil
}

// If a nodeclaim has a nodepool label, validate that the nodeclaims inside the nodepool are healthy (i.e. below the allowed threshold)
// In the case of a standalone nodeclaim, validate that the nodes inside the cluster are healthy before proceeding
// to repair the nodes
func (c *Controller) vaildateNodeClaimOwnerHealth(ctx context.Context, nodeClaim *v1.NodeClaim, node *corev1.Node) (bool, error) {
nodePoolName, found := nodeClaim.Labels[v1.NodePoolLabelKey]
if found {
nodePoolHealthy, err := c.isNodePoolHealthy(ctx, nodePoolName)
if err != nil {
return false, err
}
if !nodePoolHealthy {
return false, c.publishNodePoolHealthEvent(ctx, node, nodeClaim, nodePoolName)
}
} else {
clusterHealthy, err := c.isClusterHealthy(ctx)
if err != nil {
return false, err
}
if !clusterHealthy {
c.recorder.Publish(NodeRepairBlockedUnmanagedNodeClaim(node, nodeClaim, fmt.Sprintf("more than %s nodes are unhealthy in the cluster", allowedUnhealthyPercent.String()))...)
return false, nil
}
}
return true, nil
}

// isNodePoolHealthy checks if the number of unhealthy nodes managed by the given NodePool exceeds the health threshold
// defined by the cloud provider.
// Up to 20% of Nodes may be unhealthy before the NodePool becomes unhealthy (or the nearest whole number, rounding up).