chore: Add duplicate nodeclaim chack for Repair Controller #1916

Closed

Conversation

engedaam
Contributor

@engedaam engedaam commented Jan 15, 2025

Fixes #N/A

Description

  • Add a check to the repair controller for duplicate NodeClaims. When a node maps to multiple NodeClaims, we expect manual intervention, so we should not re-reconcile on that node.

How was this change tested?

  • make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 15, 2025
@engedaam engedaam changed the title chore" Add duplicate nodeclaim chack for Repair Controller chore: Add duplicate nodeclaim chack for Repair Controller Jan 15, 2025
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jan 15, 2025
@engedaam engedaam force-pushed the duplicate-nodeclaim-check branch 2 times, most recently from cb8f8b9 to 12964ef, January 15, 2025 21:25
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 15, 2025
@engedaam engedaam force-pushed the duplicate-nodeclaim-check branch 2 times, most recently from 96f3161 to e6baba8, January 15, 2025 21:29
Member

@jmdeal jmdeal left a comment

One nit

/lgtm
/hold

pkg/controllers/node/health/controller.go (outdated)
@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 15, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
@engedaam engedaam force-pushed the duplicate-nodeclaim-check branch from e6baba8 to 6023fb7, January 15, 2025 21:40
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
@engedaam engedaam force-pushed the duplicate-nodeclaim-check branch 3 times, most recently from 2697970 to 164084c, January 15, 2025 21:51
Member

@jmdeal jmdeal left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: engedaam, jmdeal
Once this PR has been reviewed and has the lgtm label, please assign bwagner5 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

if !clusterHealthy {
c.recorder.Publish(NodeRepairBlockedUnmanagedNodeClaim(node, nodeClaim, fmt.Sprintf("more then %s nodes are unhealthy in the cluster", allowedUnhealthyPercent.String()))...)
if nodeutils.IsDuplicateNodeClaimError(err) {
log.FromContext(ctx).Error(err, "failed to validate node health")
Member

It's odd that we would just log the error and not return it. If you want to handle cases where you shouldn't keep requeueing because the error isn't retryable, consider something like a TerminalError.
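
As a rough illustration of the terminal-error idea, here is a minimal, stdlib-only Go sketch. It does not use controller-runtime: `terminalError`, `terminal`, `shouldRequeue`, `reconcileNode`, and both sentinel errors are hypothetical names made up for this example. In controller-runtime the corresponding helper is `reconcile.TerminalError`, which marks an error as one that should not be retried.

```go
package main

import (
	"errors"
	"fmt"
)

// terminalError is a hypothetical stand-in for the wrapper produced by
// controller-runtime's reconcile.TerminalError. It marks an error as
// non-retryable while keeping the original error reachable via Unwrap.
type terminalError struct{ err error }

func (t *terminalError) Error() string { return "terminal error: " + t.err.Error() }
func (t *terminalError) Unwrap() error { return t.err }

// terminal wraps err so callers can tell it should not be retried.
func terminal(err error) error {
	if err == nil {
		return nil
	}
	return &terminalError{err: err}
}

// Hypothetical sentinel errors used only in this sketch.
var (
	errDuplicateNodeClaim = errors.New("node maps to multiple nodeclaims")
	errTransient          = errors.New("temporary API failure")
)

// shouldRequeue mimics the decision a work queue would make:
// requeue on ordinary errors, but not on nil or terminal errors.
func shouldRequeue(err error) bool {
	if err == nil {
		return false
	}
	var te *terminalError
	return !errors.As(err, &te)
}

// reconcileNode is a hypothetical reconciler body. A duplicate mapping
// needs manual intervention, so it is returned as a terminal error.
func reconcileNode(duplicate bool) error {
	if duplicate {
		return terminal(errDuplicateNodeClaim)
	}
	return nil
}

func main() {
	err := reconcileNode(true)
	fmt.Println(err, "requeue:", shouldRequeue(err))
	fmt.Println(errTransient, "requeue:", shouldRequeue(errTransient))
}
```

The error is still returned, so the caller can log and count it once, but the reconciler does not retry a state that only a human can fix.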

Member

Anyway, do you need to log an error for this kind of failure? I think this should propagate up as a registration error if it happens at all. I'd probably just ignore this rather than log it.

Contributor Author

@engedaam engedaam Jan 15, 2025

I can use reconcile.TerminalError here (didn't know this was an option). The idea was just to fail loudly and not reconcile on the error. The customer should really be intervening and fixing the broken state.

Member

IMO, you don't need to code this state in, because we don't know of a true instance of it happening. I would just remove the error logging that you have here and use the standard ignore pattern that we have for other errors that we don't want to reconcile on. There are options we could consider, such as marking NodeClaims as unhealthy in our status conditions if we haven't already, but handling it here feels like the wrong place. If every controller handled it this way, the logs would get really noisy when you ran into this state.
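
The "standard ignore pattern" mentioned above can be sketched as follows, assuming it resembles controller-runtime's `client.IgnoreNotFound`: a helper turns one specific error into nil so the reconciler returns quietly, without logging or requeueing. The diff above calls `nodeutils.IsDuplicateNodeClaimError`, but everything in this sketch (the `duplicateNodeClaimError` type, `isDuplicateNodeClaimError`, `ignoreDuplicateNodeClaimError`, and `errOther`) is a hypothetical stand-in, not Karpenter's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// duplicateNodeClaimError is a hypothetical error type reported when a
// node maps to more than one NodeClaim.
type duplicateNodeClaimError struct{ nodeName string }

func (e *duplicateNodeClaimError) Error() string {
	return fmt.Sprintf("multiple nodeclaims found for node %q", e.nodeName)
}

// isDuplicateNodeClaimError reports whether err is, or wraps, a
// duplicateNodeClaimError.
func isDuplicateNodeClaimError(err error) bool {
	var d *duplicateNodeClaimError
	return errors.As(err, &d)
}

// ignoreDuplicateNodeClaimError returns nil for duplicate-mapping errors
// and passes every other error (including nil) through unchanged.
func ignoreDuplicateNodeClaimError(err error) error {
	if isDuplicateNodeClaimError(err) {
		return nil
	}
	return err
}

// errOther is a hypothetical unrelated error that must still surface.
var errOther = errors.New("failed to list nodeclaims")

func main() {
	dup := fmt.Errorf("validating node: %w", &duplicateNodeClaimError{nodeName: "node-a"})
	fmt.Println(ignoreDuplicateNodeClaimError(dup) == nil)      // duplicate error is swallowed
	fmt.Println(ignoreDuplicateNodeClaimError(errOther) != nil) // other errors are kept
}
```

A reconciler would typically return the result of the ignore helper directly, so the duplicate state is skipped in one place instead of being logged by every controller that encounters it.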

Contributor Author

Yeah, that seems reasonable to me. I can drop this check altogether. I'm going to open a new PR to avoid altering too much in this commit.

pkg/controllers/node/health/controller.go
@engedaam engedaam force-pushed the duplicate-nodeclaim-check branch from 164084c to 149bc28, January 15, 2025 23:14
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
@engedaam engedaam closed this Jan 16, 2025
@engedaam engedaam deleted the duplicate-nodeclaim-check branch January 17, 2025 18:22
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
4 participants