Add retry logic to UCP GetAWSResourceWithPost handler #8170

Open: willdavsmith wants to merge 2 commits into main

Contributor

@willdavsmith willdavsmith commented Dec 26, 2024

Description

We've seen flaky functional test failures with AWS S3: #5963

This PR adds retries to the handler that I think is causing this 404 error.

  • Add pkg/retry directory for standard retries
  • Use pkg/retry in UCP GetAWSResourceWithPost handler

Type of change

  • This pull request fixes a bug in Radius and has an approved issue (issue link required).
  • This pull request adds or changes features of Radius and has an approved issue (issue link required).
  • This pull request is a minor refactor, code cleanup, test improvement, or other maintenance task and doesn't change the functionality of Radius (issue link optional).

Fixes: #7352

Contributor checklist

Please verify that the PR meets the following requirements, where applicable:

  • An overview of proposed schema changes is included in a linked GitHub issue.
  • A design document PR is created in the design-notes repository, if new APIs are being introduced.
  • If applicable, design document has been reviewed and approved by Radius maintainers/approvers.
  • A PR for the samples repository is created, if existing samples are affected by the changes in this PR.
  • A PR for the documentation repository is created, if the changes in this PR affect the documentation or any user facing updates are made.
  • A PR for the recipes repository is created, if existing recipes are affected by the changes in this PR.

Signed-off-by: willdavsmith <[email protected]>

codecov bot commented Dec 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.15%. Comparing base (c6b2fce) to head (395ecce).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8170      +/-   ##
==========================================
+ Coverage   60.07%   60.15%   +0.07%     
==========================================
  Files         579      580       +1     
  Lines       38504    38551      +47     
==========================================
+ Hits        23133    23189      +56     
+ Misses      13669    13663       -6     
+ Partials     1702     1699       -3     


radius-functional-tests bot commented Dec 27, 2024

Radius functional test overview

🔍 Go to test action run

Name        Value
Repository  willdavsmith/radius
Commit ref  395ecce
Unique ID   funcd8bba2bb19
Image tag   pr-funcd8bba2bb19
Tools in the current test run:
  • gotestsum 1.12.0
  • KinD: v0.20.0
  • Dapr:
  • Azure KeyVault CSI driver: 1.4.2
  • Azure Workload identity webhook: 1.3.0
  • Bicep recipe location ghcr.io/radius-project/dev/test/testrecipes/test-bicep-recipes/<name>:pr-funcd8bba2bb19
  • Terraform recipe location http://tf-module-server.radius-test-tf-module-server.svc.cluster.local/<name>.zip (in cluster)
  • applications-rp test image location: ghcr.io/radius-project/dev/applications-rp:pr-funcd8bba2bb19
  • dynamic-rp test image location: ghcr.io/radius-project/dev/dynamic-rp:pr-funcd8bba2bb19
  • controller test image location: ghcr.io/radius-project/dev/controller:pr-funcd8bba2bb19
  • ucp test image location: ghcr.io/radius-project/dev/ucpd:pr-funcd8bba2bb19
  • deployment-engine test image location: ghcr.io/radius-project/deployment-engine:latest

Test Status

⌛ Building Radius and pushing container images for functional tests...
✅ Container images build succeeded
⌛ Publishing Bicep Recipes for functional tests...
✅ Recipe publishing succeeded
⌛ Starting corerp-cloud functional tests...
⌛ Starting ucp-cloud functional tests...
✅ corerp-cloud functional tests succeeded
✅ ucp-cloud functional tests succeeded

@@ -139,6 +139,7 @@ require (
  github.com/sagikazarmark/locafero v0.6.0 // indirect
  github.com/sagikazarmark/slog-shim v0.1.0 // indirect
  github.com/sergi/go-diff v1.3.2-0.20230802210424-5b0b94c5c0d3 // indirect
+ github.com/sethvargo/go-retry v0.3.0 // indirect
Contributor Author

I added this package to simplify our retry logic across the project. It looks well tested and has no dependencies, so I think it is a good choice. Let's discuss in this PR.

Contributor

Last commit seems to be from 6 months ago. Just wondering if that could be an issue.

}

// NewNoOpRetryer creates a new Retryer that does not retry.
func NewNoOpRetryer() *Retryer {
Contributor Author

This is useful for testing. Using this retryer should give the same behavior as we have today.

@@ -125,7 +125,6 @@ func (p *CreateOrUpdateAWSResource) Run(ctx context.Context, w http.ResponseWrit

if existing {
// Get resource type schema

Contributor Author

extra space

@@ -87,7 +87,7 @@ func (p *CreateOrUpdateAWSResource) Run(ctx context.Context, w http.ResponseWrit
}

cloudControlOpts := []func(*cloudcontrol.Options){CloudControlRegionOption(region)}
cloudFormationOpts := []func(*cloudformation.Options){CloudFormationWithRegionOption(region)}
Contributor Author

I renamed this function to match the cloudcontrol version

@@ -74,7 +75,7 @@ func Test_GetAWSResourceWithPost(t *testing.T) {
CloudControl: testOptions.AWSCloudControlClient,
CloudFormation: testOptions.AWSCloudFormationClient,
}
- awsController, err := NewGetAWSResourceWithPost(armrpc_controller.Options{DatabaseClient: testOptions.DatabaseClient}, awsClients)
+ awsController, err := NewGetAWSResourceWithPost(armrpc_controller.Options{DatabaseClient: testOptions.DatabaseClient}, awsClients, retry.NewNoOpRetryer())
Contributor Author

This should match the behavior we have today, i.e. these tests should pass with no other changes.

}

// NewGetAWSResourceWithPost creates a new GetAWSResourceWithPost controller with the given options and AWS clients.
- func NewGetAWSResourceWithPost(opts armrpc_controller.Options, awsClients ucpaws.Clients) (armrpc_controller.Controller, error) {
+ func NewGetAWSResourceWithPost(opts armrpc_controller.Options, awsClients ucpaws.Clients, retryer *retry.Retryer) (armrpc_controller.Controller, error) {
Contributor Author

We could consider adding the retryer to the other UCP awsproxy routes too, either in this PR or in a follow-up. I wanted to get some feedback first.

Contributor

rynowak commented Dec 28, 2024

I'm sure this is written up somewhere already, but I'd like to understand the failure pattern that we're addressing with this change.

Is it something like this? (please help me fill in the blanks)

  • A PUT operation is initiated.
  • The PUT operation succeeds asynchronously.
  • A GET operation against the same resource then fails with a 404.
  • At some point in the future, the same GET operation will succeed if retried, without initiating any additional operations.

The background context is that any multi-regional control plane is eventually consistent. Azure/ARM/Bicep has a similar eventually consistent behavior underneath it (see notes above), and it's mostly hidden from users via the deployment engine.

@willdavsmith
Contributor Author

This is exactly what I think is happening. In the cases I investigated, the resource had actually been created, but this route returned a 404 during deployment. My understanding of the deployment engine (DE) is that it performs a PUT, monitors the operation, and then does a final GET, which calls UCP (the GetAWSResourceWithPost handler). That handler returns a 404 because AWS reports that the resource doesn't exist yet. My hope is that adding retries here makes this path more reliable without too much overhead. We can verify the fix if this issue shows up less often in the future; until then, both the diagnosis and the solution are an educated guess.

}

return &Retryer{
config: retryConfig,
Contributor

Can retryConfig ever be effectively empty? For example, config is not nil but config.BackoffStrategy is?

Contributor

Is that an okay case?

Comment on lines +42 to +45
func TestNewRetryer(t *testing.T) {
	config := &RetryConfig{
		BackoffStrategy: goretry.NewConstant(1 * time.Second),
	}
Contributor

We can test this with a RetryConfig that has a nil BackoffStrategy.

Successfully merging this pull request may close these issues.

Add resiliency to the GET operation of AWS resource deployments