[Heartbeat] Adjust State loader to only retry for failed requests and not for 4xx #37981
Conversation
```diff
@@ -94,7 +103,8 @@ func MakeESLoader(esc *eslegclient.Connection, indexPattern string, beatLocation
 		sh := stateHits{}
 		err = json.Unmarshal(body, &sh)
 		if err != nil {
-			return nil, fmt.Errorf("could not unmarshal state hits for %s: %w", sf.ID, err)
+			errMsg := fmt.Errorf("could not unmarshal state hits for %s: %w", sf.ID, err).Error()
+			return nil, LoaderError{Message: errMsg, Retry: true}
```
I set the retry property to true to avoid changing the behaviour. But now that we are revising this... do we want to retry if there is an error while unmarshaling?
Even though this shouldn't happen in practice, I don't think there is any value in doing retries for malformed data.
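For reference, a sketch of what a non-retryable variant of this branch could look like with the LoaderError shape as it stands at this point in the review; the PR later moves in this direction via the "do not retry when there is malformed data" commit, though the exact code may differ:

```go
if err != nil {
	// A body that fails to unmarshal is malformed data, not a transient
	// failure, so retrying adds no value.
	errMsg := fmt.Errorf("could not unmarshal state hits for %s: %w", sf.ID, err).Error()
	return nil, LoaderError{Message: errMsg, Retry: false}
}
```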
```go
etc := &esTestContext{
	namespace: namespace.String(),
	esc:       esc,
	loader:    IntegESLoader(t, fmt.Sprintf("synthetics-*-%s", namespace.String()), location),
	ec:        ec,
```
Change made to make the HTTP API object "fakeable", as you can see in esloader_test.go.
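Not the actual interface from esloader.go, just a rough sketch of the kind of seam that makes the HTTP call fakeable in tests; all names here are hypothetical:

```go
// search is a hypothetical stand-in for the HTTP call the loader issues
// against Elasticsearch; the real seam in esloader.go may be shaped differently.
type search func(indexPattern string, body []byte) (status int, respBody []byte, err error)

// fakeSearch returns a canned status and body and counts invocations, so a
// test can assert how many times the loader retried without a live cluster.
func fakeSearch(status int, respBody []byte, calls *int) search {
	return func(indexPattern string, body []byte) (int, []byte, error) {
		*calls++
		return status, respBody, nil
	}
}
```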
Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)
```go
	expectedCalls int
}{
	{
		"should retry 3 times when fails with retryable error",
```
This test takes 5 seconds, given that time.Sleep is executed for every retry. Let me know if you want me to make this parameterizable or if you believe the delay is negligible.
Maybe we should make it configurable.
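A sketch of what a configurable retry wait could look like. RetryConfig does appear in the GetCurrentState signature later in the diff, but the field names (Attempts, Wait) and the retry helper below are assumptions for illustration only:

```go
package monitorstate // hypothetical package name

import (
	"errors"
	"time"
)

// RetryConfig controls the state loader's retry behaviour; field names are
// assumptions for this sketch.
type RetryConfig struct {
	Attempts int           // maximum number of loader calls before giving up
	Wait     time.Duration // pause between attempts; tests can set this to 0
}

// retry runs fn up to cfg.Attempts times, sleeping cfg.Wait between attempts,
// and stops early on success or on a non-retryable LoaderError.
func retry(cfg RetryConfig, fn func() error) error {
	var err error
	for i := 0; i < cfg.Attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		var le LoaderError
		if errors.As(err, &le) && !le.Retry {
			return err
		}
		if i < cfg.Attempts-1 {
			time.Sleep(cfg.Wait)
		}
	}
	return err
}
```

With Wait set to 0 in the test setup, the retry test above would no longer need to spend 5 real seconds sleeping.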
Force-pushed from e479310 to aa29aa8
A couple of needed changes:
- Avoid retry on document not found (404). ES won't return status 404 for this; it will be in the response body (see #37424 (comment)). Please confirm if that is the case.
- Avoid retries on context cancellations / DeadlineExceeded timeouts, e.g. net/http: request canceled (Client.Timeout exceeded while awaiting headers).
- Please do verify what happens for 410 Gone, if the resource was deleted.
```diff
-			return nil, fmt.Errorf("could not unmarshal state hits for %s: %w", sf.ID, err)
+			errMsg := fmt.Errorf("could not unmarshal state hits for %s: %w", sf.ID, err).Error()
```
Question: why was .Error() called here?
Before the change, fmt.Errorf was receiving an error as an argument, which means it was wrapping an error.
Given that the new logic creates a new error type (LoaderError), I wanted to ensure the error message was identical when logging it.
Having said that, after thinking about this further, I will change this logic a bit.
Rather than:
```go
type LoaderError struct {
	Message string
	Retry   bool
}
```
I'll create:
```go
type LoaderError struct {
	err   error
	Retry bool
}
```
and will pass an error instance to LoaderError. This way, we move that concern to the error obj/struct, something like this:
```go
func (e LoaderError) Error() string {
	return e.err.Error()
}
```
which partially matches how other places in Heartbeat resolve similar scenarios:
beats/heartbeat/reason/reason.go, line 50 in 85e4e46:
```go
func (e ValidateError) Error() string { return e.err.Error() }
```
I will add the commit shortly.
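Putting the pieces from this comment together, a minimal sketch of the revised error type; the Unwrap method is an assumption added here so errors.Is / errors.As keep seeing the inner error, and is not something the comment commits to:

```go
package monitorstate // hypothetical package name

import "fmt"

// LoaderError carries the underlying failure plus a flag telling the
// tracker whether the request is worth retrying.
type LoaderError struct {
	err   error
	Retry bool
}

// Error returns the wrapped error's message unchanged, so log output stays
// identical to the pre-LoaderError behaviour.
func (e LoaderError) Error() string { return e.err.Error() }

// Unwrap exposes the inner error for errors.Is / errors.As (assumption, not
// shown in the comment above).
func (e LoaderError) Unwrap() error { return e.err }

// Example: an unmarshal failure is malformed data, so it is not retryable.
func wrapUnmarshalErr(monitorID string, err error) error {
	return LoaderError{
		err:   fmt.Errorf("could not unmarshal state hits for %s: %w", monitorID, err),
		Retry: false,
	}
}
```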
```diff
@@ -107,3 +117,11 @@ func MakeESLoader(esc *eslegclient.Connection, indexPattern string, beatLocation
 		return state, nil
 	}
 }
+
+func shouldRetry(status int) bool {
+	if status > 200 && status <= 499 {
```
nit: this could just return `status >= 500` (true for 5xx, false otherwise).
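The simplification the nit asks for, as a sketch:

```go
// shouldRetry reports whether a request should be retried: only server-side
// failures (5xx) count as transient; 2xx-4xx responses and the status-0
// timeout case are left alone.
func shouldRetry(status int) bool {
	return status >= 500
}
```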
- If a document is not found, the response contains 0 hits, which is not treated as a situation that requires a retry. Hence, there is no retry.
- If the ES index is not found, it returns 404. There will not be a retry either.
- If there is a timeout, the status code is 0. There will not be a retry either.
- For 410 Gone: I forced this case with Charles Proxy, since I was unable to reproduce a "normal" 410 (I always got the 0-hits response). Anyway, no retry either.

Note: bear in mind that from now on, we will only retry if the status code is >= 500 (also reproduced this with Charles by adding a breakpoint to a particular URL request and aborting it; great tool, by the way).
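These scenarios map naturally onto the table-driven test shown earlier; a sketch of what the cases could look like (names, statuses, and call counts here are illustrative, not the PR's actual test data):

```go
cases := []struct {
	name          string
	status        int
	expectedCalls int
}{
	// server-side failure: retryable, so the loader is invoked the full 3 times
	{"should retry 3 times when fails with retryable error", 500, 3},
	// index not found: 404 is not retried
	{"should not retry on 404", 404, 1},
	// timeout surfaces as status 0: not retried
	{"should not retry on timeout (status 0)", 0, 1},
	// resource deleted: 410 Gone is not retried
	{"should not retry on 410", 410, 1},
}
```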
Force-pushed from 480a6a1 to 3d0aba5
LGTM except for the wrapped error formatting.
LGTM
```diff
@@ -111,7 +111,7 @@ func (t *Tracker) GetCurrentState(sf stdfields.StdMonitorFields, rc RetryConfig)
 	}
 	var loaderError LoaderError
 	if errors.As(err, &loaderError) && !loaderError.Retry {
-		logp.L().Warnf("could not load last externally recorded state: %w", err)
+		logp.L().Warnf("could not load last externally recorded state: %v", loaderError)
```
I guess if we wanted to use %w, we would need to unwrap with errors.Unwrap. Not super sure. But %v should call the Error() method.
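A small, self-contained illustration of the point; LoaderError is reduced here to the shape discussed above, and the messages are made up:

```go
package main

import (
	"errors"
	"fmt"
)

type LoaderError struct {
	err   error
	Retry bool
}

func (e LoaderError) Error() string { return e.err.Error() }
func (e LoaderError) Unwrap() error { return e.err }

func main() {
	inner := errors.New("unexpected end of JSON input")
	err := LoaderError{err: inner, Retry: false}

	// %v on a value implementing error calls Error(), so the message comes
	// out intact; this is what the logp.L().Warnf call above relies on.
	fmt.Printf("could not load last externally recorded state: %v\n", err)

	// %w only has meaning inside fmt.Errorf, where it wraps the error so
	// errors.Is / errors.As can still find it later.
	wrapped := fmt.Errorf("could not load last externally recorded state: %w", err)
	var le LoaderError
	fmt.Println(errors.As(wrapped, &le)) // true
}
```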
Force-pushed from 70a7d46 to 0b8a902
💚 Build Succeeded
cc @devcorpio
… not for 4xx (#37981)
* only retry when the status is 5xx
* remove test AAA comments
* add changelog
* correct changelog modification
* fix ES query
* change error handling strategy
* do not retry when there is malformed data
* improve retry mechanism
* improve log message
* improve changelog
* fix log format
(cherry picked from commit 27cde87)
… not for 4xx (#37981) (#38163)
* only retry when the status is 5xx
* remove test AAA comments
* add changelog
* correct changelog modification
* fix ES query
* change error handling strategy
* do not retry when there is malformed data
* improve retry mechanism
* improve log message
* improve changelog
* fix log format
(cherry picked from commit 27cde87)
Co-authored-by: Alberto Delgado Roda <[email protected]>
Proposed commit message
Closes #37424
From now on, the state loader will only retry failed requests (such as a 500 status code) and not 4xx responses.
Additional context
Checklist
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.