Assume instance exists within eventual-consistency grace period #1024

Open: cartermckinnon wants to merge 2 commits into base: master

@cartermckinnon (Contributor) commented Sep 17, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

For a short period of time after an EC2 instance is launched, the ec2:DescribeInstances API may not return details for the instance. When this happens, the node lifecycle controller may delete the Node erroneously, believing the corresponding EC2 instance does not exist.

This PR adds a configurable "grace period" after Node creation, during which the EC2 instance is assumed to exist.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This is possible in our InstancesV2 implementation, because the controller will pass us the Node: https://github.com/kubernetes/cloud-provider/blob/912e64449ce4cb3645436a768d4a8d5c834652ed/controllers/nodelifecycle/node_lifecycle_controller.go#L253

⚠️ it will not be possible to cherry-pick this back to older release branches that do not implement InstancesV2.

Does this PR introduce a user-facing change?:

NodeEventualConsistencyGracePeriod is a new configuration option to account for propagation delays in EC2 APIs. InstanceExists will now return an error for newly created nodes until the grace period has elapsed.

@k8s-ci-robot added the release-note, kind/bug, and cncf-cla: yes labels Sep 17, 2024
@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from cartermckinnon. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor) commented

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-triage and size/S labels Sep 17, 2024
@cartermckinnon (Contributor, Author) commented

/assign mmerkes

@@ -44,6 +49,11 @@ func (c *Cloud) getProviderID(ctx context.Context, node *v1.Node) (string, error
// InstanceExists returns true if the instance for the given node exists according to the cloud provider.
// Use the node.name or node.spec.providerID field to find the node in the cloud provider.
func (c *Cloud) InstanceExists(ctx context.Context, node *v1.Node) (bool, error) {
	if time.Since(node.CreationTimestamp.Time) < instanceExistsGracePeriod {

Review comment from a Member:

Should we apply this check only when the node is not being deleted, by checking whether the deletion timestamp is set?


	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	cloudprovider "k8s.io/cloud-provider"
)

const (
	instanceExistsGracePeriod = 2 * time.Minute

Review comment from a Member:

How did we come up with this time period?

@cartermckinnon (Contributor, Author) commented

Closing this in favor of kubernetes/kubernetes#127424

@cartermckinnon (Contributor, Author) commented

Rebooting this.

@cartermckinnon reopened this Nov 1, 2024
@k8s-ci-robot added the needs-rebase label Nov 1, 2024
@k8s-ci-robot added the size/M label and removed the needs-rebase and size/S labels Nov 1, 2024
@cartermckinnon (Contributor, Author) commented

/retest

@k8s-ci-robot added the needs-rebase label Nov 13, 2024
@k8s-ci-robot removed the needs-rebase label Nov 15, 2024
@@ -45,6 +47,12 @@ func (c *Cloud) getProviderID(ctx context.Context, node *v1.Node) (string, error
// InstanceExists returns true if the instance for the given node exists according to the cloud provider.
// Use the node.name or node.spec.providerID field to find the node in the cloud provider.
func (c *Cloud) InstanceExists(ctx context.Context, node *v1.Node) (bool, error) {
	if time.Since(node.CreationTimestamp.Time) < c.nodeEventualConsistencyGracePeriod {
		// recently-launched EC2 instances may not appear in `ec2:DescribeInstances`
		// we return an error to cause the cloud-node-lifecycle-controller to ignore this node

Review comment from a Contributor:

Is there any downside to just forcing it to fail? It could increase the error logs. Can we move this check further down in the code, so that we only return an error if the node is within the grace period AND it doesn't have the eventually consistent fields set? That would drastically reduce the noise.

Reply from the PR author (Contributor):

Yep, had the same thought. We have to handle InvalidInstanceID.NotFound differently in the Instances v1/v2 impls, since we only have the context to apply this grace period in v2.

@k8s-ci-robot (Contributor) commented

@cartermckinnon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cloud-provider-aws-e2e-kubetest2-quick | 2a7f048 | link | false | /test pull-cloud-provider-aws-e2e-kubetest2-quick |
| pull-cloud-provider-aws-e2e-kubetest2 | 2a7f048 | link | false | /test pull-cloud-provider-aws-e2e-kubetest2 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


@k8s-ci-robot added the needs-rebase label Dec 19, 2024
@k8s-ci-robot (Contributor) commented

PR needs rebase.


Labels: cncf-cla: yes, kind/bug, needs-rebase, needs-triage, release-note, size/M
4 participants