Add retry logic to UCP GetAWSResourceWithPost handler #8170
base: main
Signed-off-by: willdavsmith <[email protected]>
Signed-off-by: willdavsmith <[email protected]>
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8170      +/-   ##
==========================================
+ Coverage   60.07%   60.15%   +0.07%
==========================================
  Files         579      580       +1
  Lines       38504    38551      +47
==========================================
+ Hits        23133    23189      +56
+ Misses      13669    13663       -6
+ Partials     1702     1699       -3
@@ -139,6 +139,7 @@ require (
github.com/sagikazarmark/locafero v0.6.0 // indirect
github.com/sagikazarmark/slog-shim v0.1.0 // indirect
github.com/sergi/go-diff v1.3.2-0.20230802210424-5b0b94c5c0d3 // indirect
+ github.com/sethvargo/go-retry v0.3.0 // indirect
I added this package to simplify our retry logic across the project. It looks well tested and has no dependencies, so I think it is a good choice. Let's discuss in this PR.
Last commit seems to be from 6 months ago. Just wondering if that could be an issue.
}

// NewNoOpRetryer creates a new Retryer that does not retry.
func NewNoOpRetryer() *Retryer {
This is useful for testing. Using this retryer should give the same behavior we have today.
@@ -125,7 +125,6 @@ func (p *CreateOrUpdateAWSResource) Run(ctx context.Context, w http.ResponseWrit

if existing {
// Get resource type schema
-
extra space
@@ -87,7 +87,7 @@ func (p *CreateOrUpdateAWSResource) Run(ctx context.Context, w http.ResponseWrit
}

cloudControlOpts := []func(*cloudcontrol.Options){CloudControlRegionOption(region)}
cloudFormationOpts := []func(*cloudformation.Options){CloudFormationWithRegionOption(region)}
I renamed this function to match the cloudcontrol version.
@@ -74,7 +75,7 @@ func Test_GetAWSResourceWithPost(t *testing.T) {
CloudControl: testOptions.AWSCloudControlClient,
CloudFormation: testOptions.AWSCloudFormationClient,
}
- awsController, err := NewGetAWSResourceWithPost(armrpc_controller.Options{DatabaseClient: testOptions.DatabaseClient}, awsClients)
+ awsController, err := NewGetAWSResourceWithPost(armrpc_controller.Options{DatabaseClient: testOptions.DatabaseClient}, awsClients, retry.NewNoOpRetryer())
This should match the functionality we have today, i.e. these tests should pass with no other changes.
}

// NewGetAWSResourceWithPost creates a new GetAWSResourceWithPost controller with the given options and AWS clients.
- func NewGetAWSResourceWithPost(opts armrpc_controller.Options, awsClients ucpaws.Clients) (armrpc_controller.Controller, error) {
+ func NewGetAWSResourceWithPost(opts armrpc_controller.Options, awsClients ucpaws.Clients, retryer *retry.Retryer) (armrpc_controller.Controller, error) {
We could consider adding the retryer to the other UCP awsproxy routes too, either in the future or in this PR. I wanted to get some feedback first.
I'm sure this is written up somewhere already, but I'd like to understand the failure pattern that we're addressing with this change. Is it something like this? (please help me fill in the blanks)
The background context is that any multi-regional control plane is eventually consistent. Azure/ARM/Bicep has a similar eventually consistent behavior underneath it (see notes above), and it's mostly hidden from users via the deployment engine.
This is exactly what I think is happening. In the cases I investigated, the resource had actually been created, but this route returned a 404 during deployment. My understanding of the deployment engine (DE) is that it performs a PUT, monitors the operation, and then does a GET at the end. That GET calls UCP (the GetAWSResourceWithPost handler), which returns a 404 because AWS says the resource doesn't exist yet. My hope is that adding retries here will make this situation more reliable without too much overhead. We can verify that it works if we see this issue less often in the future; until then, both the diagnosis and the fix are an educated guess.
}

return &Retryer{
config: retryConfig,
Can retryConfig ever be just empty? For example, config is not nil but config.BackoffStrategy is?
Is that an okay case?
func TestNewRetryer(t *testing.T) {
config := &RetryConfig{
BackoffStrategy: goretry.NewConstant(1 * time.Second),
}
We can test this with a RetryConfig that has a nil BackoffStrategy.
Description
We've seen flaky functional test failures with AWS S3: #5963
This PR adds retries to the handler that I think is causing this 404 error.
pkg/retry directory for standard retries
pkg/retry in UCP GetAWSResourceWithPost handler
Type of change
Fixes: #7352
Contributor checklist
Please verify that the PR meets the following requirements, where applicable: