Set scrape timeout based on Prometheus header #60
Open
jdlafaye wants to merge 4 commits into micrometer-metrics:main from StateFarmIns:scrape-timeout
Why is this needed? Shouldn't we set it to whatever value the client wants?
If the timeout were set to the exact value of the Prometheus scrape timeout (given by the header), any one app instance that is too slow to respond (and thus times out) would also cause us to hit the Prometheus scrape timeout, since the two are the same in this hypothetical. As a result of a single instance timing out, no metrics would be successfully scraped from any of the other connected instances, because Prometheus hit its scrape timeout and closed the connection.
So a small offset is needed to make the individual instance scrape timeout slightly less than the Prometheus scrape timeout, leaving time for the metrics to be scraped from all connected instances, concatenated, and returned over the network before Prometheus itself times out.
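The idea above can be sketched as follows. This is an illustrative sketch, not the PR's actual code; the class and method names are made up, and the offset-handling behavior (falling back to no timeout when the header is absent or unparsable) is an assumption.

```java
import java.time.Duration;

/**
 * Sketch (not the actual PR code): derive a per-instance scrape timeout from
 * Prometheus' X-Prometheus-Scrape-Timeout-Seconds header, minus a small offset
 * so aggregation and the network response can finish before Prometheus itself
 * times out and closes the connection.
 */
public class ScrapeTimeoutCalculator {

    private final Duration offset;

    public ScrapeTimeoutCalculator(Duration offset) {
        this.offset = offset;
    }

    /**
     * @param headerValue value of X-Prometheus-Scrape-Timeout-Seconds, e.g. "10.0"
     * @return effective timeout for individual instance scrapes,
     *         or null (no explicit timeout, the previous behavior) if the
     *         header is absent or unparsable
     */
    public Duration effectiveTimeout(String headerValue) {
        if (headerValue == null) {
            return null; // no header: keep the old, unbounded behavior
        }
        try {
            double seconds = Double.parseDouble(headerValue);
            long millis = (long) (seconds * 1000) - offset.toMillis();
            // Never produce a negative timeout if the offset exceeds the header value
            return millis > 0 ? Duration.ofMillis(millis) : Duration.ZERO;
        }
        catch (NumberFormatException e) {
            return null;
        }
    }

}
```

With a 500 ms offset, a `10.0` header yields a 9.5 s per-instance timeout, leaving headroom for concatenating and returning the combined response.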
If this is needed for the reason you describe, shouldn't we set the default to a small but nonzero value?
It depends. Defaulting to a zero offset maintains backwards compatibility with the old behavior but doesn't give the benefit of the offset from a timeout perspective.
On the other hand, defaulting to a nonzero offset gives the timeout benefit described above but technically is a breaking change. For example, consider a Prometheus setup with a 10s scrape timeout and assume we chose an offset of 1s. In this case, the RSocket timeout is 9s. Let's say the user running this Prometheus setup previously had RSocket scrape duration of about 9.5 seconds. Prior to this change, they'd be running their setup without issue since there's no explicit RSocket timeout and 9.5 seconds is lower than the Prometheus 10s timeout. However, with the introduction of this change, suddenly this user's scrapes are timing out on the RSocket side from the 9 second timeout and thus aren't getting the metrics they were getting previously.
I don't think this is necessarily a likely scenario and the odds of this happening decrease as the default offset also decreases, but it's definitely a consideration.
Thoughts?
I really appreciate you thinking about backward compatibility and the potential negative effects of this. It's important for us to take these things seriously and avoid as much pain for users as possible. That said, if we change this behavior in a minor release (as opposed to a patch release), note it in the release notes, and keep it configurable so that people negatively affected can effectively undo the behavior change, I think it is worthwhile to move things in a better direction. That's how I feel anyway. @jonatan-ivanov What do you think?
I looked into this and it makes sense: Prometheus sends the scrape_timeout value in X-Prometheus-Scrape-Timeout-Seconds, and I did not find a way to account for latency on the Prometheus server side. I find this unfortunate, but even if the offset were on the Prometheus side, the issue would be somewhat similar. So I guess this is what we have; it makes sense to me.
I think I agree with @shakuzen about having a small but non-zero default value in a milestone release. To me the interesting part is what that value should be.
It should be small enough that the negative effect is minimal, yet big enough to account for network delay, which includes the transfer cost and therefore depends on the response size. My first guess is around 100-500ms.
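As a quick sanity check on candidate offsets in that range, here is the arithmetic against Prometheus' default 10s scrape timeout. This is illustrative only, not code from the PR:

```java
import java.time.Duration;

// Illustrative arithmetic: how candidate offsets from the discussion
// interact with Prometheus' default 10s scrape timeout.
public class OffsetTradeoff {

    // Effective per-instance timeout = Prometheus scrape timeout minus the offset,
    // clamped at zero so an oversized offset can't produce a negative timeout.
    static Duration effective(Duration scrapeTimeout, Duration offset) {
        Duration result = scrapeTimeout.minus(offset);
        return result.isNegative() ? Duration.ZERO : result;
    }

    public static void main(String[] args) {
        Duration defaultScrapeTimeout = Duration.ofSeconds(10); // Prometheus default
        for (long offsetMs : new long[] { 100, 250, 500 }) {
            Duration eff = effective(defaultScrapeTimeout, Duration.ofMillis(offsetMs));
            System.out.println(offsetMs + " ms offset -> " + eff.toMillis() + " ms effective timeout");
        }
    }

}
```

Even the largest candidate (500 ms) still leaves 9.5 s of the default 10 s budget for the instances themselves, which is why the negative effect should be small in typical setups.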
What do you think @jdlafaye about having a small default offset?
I think it's reasonable to set a small default offset as it will provide an immediate benefit in a number of setups. I'd probably lean towards 500ms as that meshes well with the default Prometheus scrape timeout of 10s and also happens to be the same as what Blackbox Exporter uses.
Let me know your thoughts and I can make that change.