
[Heartbeat] Adjust State loader to only retry for failed requests and not for 4xx #37981

Merged: 11 commits merged into elastic:main from state-loader-retry on Mar 4, 2024

Conversation

@devcorpio (Contributor) opened this pull request:

Proposed commit message

Closes #37424

From now on, the state loader will only retry failed requests (such as those returning a 500 status code), not requests that fail with a 4xx status. A sketch of the new decision follows the context below.

Additional context

  • If the query is valid and returns zero results, there is no retry; this was already the existing behavior.
  • If the API key is invalid, the query fails with a 4xx error (also the existing behavior), as you can see in the screenshot below:
screenshot-invalid-api-key
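
A minimal sketch of the resulting decision, reusing the LoaderError shape and the 5xx-only rule that appear in this PR's diff further below; the classify helper, the package name, and the error messages are illustrative, not the PR's actual code:

package monitorstate

import "fmt"

// LoaderError carries the retry decision alongside the failure.
type LoaderError struct {
	err   error
	Retry bool
}

func (e LoaderError) Error() string { return e.err.Error() }

// classify marks only server-side (5xx) failures as retryable; 4xx
// responses (invalid API key, missing index, 410 Gone) fail fast.
func classify(status int) error {
	switch {
	case status >= 500:
		return LoaderError{err: fmt.Errorf("server error (status %d)", status), Retry: true}
	case status >= 400:
		return LoaderError{err: fmt.Errorf("request rejected (status %d)", status), Retry: false}
	default:
		return nil
	}
}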

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@devcorpio devcorpio added bug Team:obs-ds-hosted-services Label for the Observability Hosted Services team backport-skip Skip notification from the automated backport with mergify labels Feb 12, 2024
@@ -94,7 +103,8 @@ func MakeESLoader(esc *eslegclient.Connection, indexPattern string, beatLocation
 	sh := stateHits{}
 	err = json.Unmarshal(body, &sh)
 	if err != nil {
-		return nil, fmt.Errorf("could not unmarshal state hits for %s: %w", sf.ID, err)
+		errMsg := fmt.Errorf("could not unmarshal state hits for %s: %w", sf.ID, err).Error()
+		return nil, LoaderError{Message: errMsg, Retry: true}
@devcorpio (Contributor, Author) commented Feb 12, 2024:

I set the Retry property to true to avoid changing the behaviour. But now that we are revising this... do we want to retry if there is an error while unmarshalling?

A reviewer (Member) replied:

Even though this shouldn't happen in practice, I don't think there is any value in doing retries for malformed data.

etc := &esTestContext{
	namespace: namespace.String(),
	esc:       esc,
	loader:    IntegESLoader(t, fmt.Sprintf("synthetics-*-%s", namespace.String()), location),
	ec:        ec,
@devcorpio (Contributor, Author) commented Feb 12, 2024:

Change made to make the HTTP API object "fakeable", as you can see in esloader_test.go; a sketch of the idea follows.
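
A rough sketch of what "fakeable" can mean here, assuming the loader's HTTP access was narrowed to a small function type that tests can stub; every name below is illustrative, not the PR's actual API:

package monitorstate

import "context"

// searchFn is the seam: production code passes a closure over the real
// eslegclient.Connection, while tests pass a stub with canned responses.
type searchFn func(ctx context.Context, index, query string) (status int, body []byte, err error)

// newFakeSearch builds a stub that always answers with the given status
// and body, letting tests exercise the retry logic without a live ES.
func newFakeSearch(status int, body []byte) searchFn {
	return func(ctx context.Context, index, query string) (int, []byte, error) {
		return status, body, nil
	}
}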

@devcorpio devcorpio marked this pull request as ready for review February 12, 2024 15:39
@devcorpio devcorpio requested a review from a team as a code owner February 12, 2024 15:39
@elasticmachine (Collaborator) commented:

Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

@elasticmachine (Collaborator) commented Feb 12, 2024:

💚 Build Succeeded


Build stats

  • Duration: 50 min 58 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

	expectedCalls int
}{
	{
		"should retry 3 times when fails with retryable error",
@devcorpio (Contributor, Author) commented Feb 12, 2024:

This test takes 5 seconds, given that time.Sleep executes on every retry. Let me know if you want me to make this parameterizable or if you consider it negligible.

A reviewer (Member) replied:

Maybe we should make it configurable.
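
One hedged way to make the wait configurable; the merged diff does thread a RetryConfig into GetCurrentState, but the fields and loop below are a sketch, so tests could shrink Wait to near zero instead of sleeping ~5 seconds:

package monitorstate

import (
	"errors"
	"time"
)

// Illustrative stand-ins for the real heartbeat types.
type State struct{}

type LoaderError struct {
	err   error
	Retry bool
}

func (e LoaderError) Error() string { return e.err.Error() }

// RetryConfig is the hypothetical knob: Attempts caps the retries and
// Wait is the pause between them (near zero in tests).
type RetryConfig struct {
	Attempts int
	Wait     time.Duration
}

func loadWithRetry(load func() (*State, error), rc RetryConfig) (*State, error) {
	var err error
	for i := 0; i < rc.Attempts; i++ {
		var state *State
		if state, err = load(); err == nil {
			return state, nil
		}
		var le LoaderError
		if errors.As(err, &le) && !le.Retry {
			return nil, err // non-retryable (4xx): fail fast, no sleep
		}
		time.Sleep(rc.Wait)
	}
	return nil, err
}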

@devcorpio devcorpio added backport-v8.12.0 Automated backport with mergify and removed backport-skip Skip notification from the automated backport with mergify labels Feb 12, 2024
@devcorpio devcorpio force-pushed the state-loader-retry branch 2 times, most recently from e479310 to aa29aa8 Compare February 12, 2024 17:23
@vigneshshanmugam (Member) left a comment:

A couple of needed changes:

  • Avoid retry on document not found - 404 - ES won't return status: 404, it will be on the response.
  • Avoid retries on Context cancellations / DeadlineExceeded - Timeouts, e.g. net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  • Please do verify what happens for 410 - Gone, if the resource was deleted.


@@ -94,7 +103,8 @@ func MakeESLoader(esc *eslegclient.Connection, indexPattern string, beatLocation
 	sh := stateHits{}
 	err = json.Unmarshal(body, &sh)
 	if err != nil {
-		return nil, fmt.Errorf("could not unmarshal state hits for %s: %w", sf.ID, err)
+		errMsg := fmt.Errorf("could not unmarshal state hits for %s: %w", sf.ID, err).Error()
A reviewer (Member) commented:

ques: why was .Error() called here?

@devcorpio (Contributor, Author) commented Feb 21, 2024:

Before the change, fmt.Errorf was getting an error as an argument, which means it was wrapping an error.

Given that the new logic creates a new error type (LoaderError), I wanted to ensure the error message stayed identical when logging it.

Having said that, after thinking about this further, I will change this logic a bit:

Rather than:

type LoaderError struct {
	Message string
	Retry   bool
}

I'll create:

type LoaderError struct {
	err   error
	Retry bool
}

and will pass an error instance to LoaderError. This way, we move that concern to the error struct itself, something like this:

func (e LoaderError) Error() string {
	return e.err.Error()
}

which partially matches how other places in Heartbeat resolve similar scenarios:

func (e ValidateError) Error() string { return e.err.Error() }

I will add the commit shortly.

@devcorpio (Contributor, Author) commented:

Will check why the log appears "weird":

Screenshot 2024-02-22 at 20 38 04

@@ -107,3 +117,11 @@ func MakeESLoader(esc *eslegclient.Connection, indexPattern string, beatLocation
 		return state, nil
 	}
 }
+
+func shouldRetry(status int) bool {
+	if status > 200 && status <= 499 {
A reviewer (Member) commented:

nit: status >= 500 - true, else false.
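
Read literally, the nit collapses the predicate to a single comparison; a sketch, since the merged naming may differ:

// shouldRetry reports whether a state-loader request should be
// retried; only server-side (5xx) failures qualify.
func shouldRetry(status int) bool {
	return status >= 500
}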


@devcorpio (Contributor, Author) commented:

"Avoid retry on document not found - 404 - ES won't return status: 404, it will be on the response"

If a document is not found, the response contains 0 hits, and that is not treated as a situation that requires a retry. Hence, there is no retry:

screenshot-document-not-found

If the ES index is not found, it returns 404. There will not be a retry either.

Screenshot-wrong-index

"Avoid retries on Context cancellations / DeadlineExceeded - Timeouts"

If there is a timeout, the status code is 0. There will not be a retry either.

Screenshot-timeout

"Please do verify what happens for 410 - Gone, if the resource was deleted."

I forced this case with Charles Proxy, since I was unable to reproduce a "normal" 410; I was always getting the 0-hits response. Anyway, there is no retry either.

screenshot-410-gone-forced

Note: bear in mind that from now on we will only retry if the status code is >= 500 (I also reproduced this with Charles by adding a breakpoint to a particular URL request and aborting it; great tool, by the way).
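
A hypothetical table-driven test for the cases verified above, assuming the shouldRetry predicate from the diff and the status-0 convention for timeouts shown in the screenshots:

package monitorstate

import "testing"

// shouldRetry mirrors the predicate discussed above: only 5xx retries.
func shouldRetry(status int) bool { return status >= 500 }

func TestShouldRetry(t *testing.T) {
	cases := []struct {
		name   string
		status int
		want   bool
	}{
		{"timeout or cancelled context surfaces as status 0", 0, false},
		{"index not found", 404, false},
		{"gone (forced via proxy)", 410, false},
		{"internal server error", 500, true},
		{"bad gateway", 502, true},
	}
	for _, tc := range cases {
		if got := shouldRetry(tc.status); got != tc.want {
			t.Errorf("%s: shouldRetry(%d) = %v, want %v", tc.name, tc.status, got, tc.want)
		}
	}
}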

@devcorpio devcorpio removed the backport-v8.12.0 Automated backport with mergify label Feb 22, 2024
mergify bot (Contributor) commented Feb 22, 2024:

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @devcorpio? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8.\d.0 is the label to automatically backport to the 8.\d branch, where \d is the digit.

@vigneshshanmugam (Member) left a comment:

LGTM except for the wrapped error formatting.

@devcorpio (Contributor, Author) replied:

"LGTM except for the wrapped error formatting."

It turns out that if we want to log the error as a string, we need to use "%v". It also seems that we had the very same format issue before creating the new error, as you can see here:

Screenshot before change

Replacing it with %v makes things show with the expected format:

Screenshot after change

🎉

@devcorpio devcorpio added the backport-v8.13.0 Automated backport with mergify label Feb 23, 2024
@vigneshshanmugam (Member) left a comment:

LGTM

@@ -111,7 +111,7 @@ func (t *Tracker) GetCurrentState(sf stdfields.StdMonitorFields, rc RetryConfig)
 	}
 	var loaderError LoaderError
 	if errors.As(err, &loaderError) && !loaderError.Retry {
-		logp.L().Warnf("could not load last externally recorded state: %w", err)
+		logp.L().Warnf("could not load last externally recorded state: %v", loaderError)
@vigneshshanmugam (Member) commented:

I guess if we want to use %w, we would need to unwrap via errors.Unwrap. Not super sure. But %v should call the Error() method.
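
A self-contained sketch of why %v prints cleanly where %w did not; the LoaderError shape matches the discussion above, while the Unwrap method is an added assumption rather than something confirmed by the diff:

package main

import (
	"errors"
	"fmt"
)

type LoaderError struct {
	err   error
	Retry bool
}

func (e LoaderError) Error() string { return e.err.Error() }
func (e LoaderError) Unwrap() error { return e.err }

func main() {
	le := LoaderError{err: errors.New("search failed: 503"), Retry: true}

	// %w is only meaningful inside fmt.Errorf, where it wraps the error
	// for errors.Is/errors.As; plain formatters treat it as a bad verb,
	// printing something like "%!w(main.LoaderError=...)".
	fmt.Printf("could not load state: %w\n", le)

	// %v formats through the Error() method, which is what a log wants:
	// "could not load state: search failed: 503".
	fmt.Printf("could not load state: %v\n", le)

	// Wrapping still works when %w is used where it belongs:
	wrapped := fmt.Errorf("outer: %w", le)
	var target LoaderError
	fmt.Println(errors.As(wrapped, &target)) // true
}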

@devcorpio devcorpio force-pushed the state-loader-retry branch from 70a7d46 to 0b8a902 Compare March 4, 2024 09:15
@elasticmachine (Collaborator) commented:

💚 Build Succeeded

cc @devcorpio


@devcorpio devcorpio merged commit 27cde87 into elastic:main Mar 4, 2024
42 of 44 checks passed
@devcorpio devcorpio deleted the state-loader-retry branch March 4, 2024 10:09
mergify bot pushed a commit that referenced this pull request Mar 4, 2024
… not for 4xx (#37981)

* only retry when the status is 5xx

* remove test AAA comments

* add changelog

* correct changelog modification

* fix ES query

* change error handling strategy

* do not retry when there is malformed data

* improve retry mechanism

* improve log message

* improve changelog

* fix log format

(cherry picked from commit 27cde87)
devcorpio added a commit that referenced this pull request Mar 4, 2024
… not for 4xx (#37981) (#38163)

* only retry when the status is 5xx

* remove test AAA comments

* add changelog

* correct changelog modification

* fix ES query

* change error handling strategy

* do not retry when there is malformed data

* improve retry mechanism

* improve log message

* improve changelog

* fix log format

(cherry picked from commit 27cde87)

Co-authored-by: Alberto Delgado Roda <[email protected]>