dex: 🎩 investigate search path duration of 100ms #4464
Comments
It would also be good to get that dashboard into the deployments. I’m not sure how to do that (I sent the JSON privately).
There are manual bucket configs in the dex metrics module that I don't understand. Could those be part of the cause?
for posterity, the buckets are defined here. i would agree that the
Do you understand the purpose of those bucket configs? I don’t, so it would be good to document them.
The surrounding code has a comment saying the purpose is to avoid using "Summaries". Why do we want that? Is there a reason not to record all the data and then have the quantiles figured out after the fact?
i can't speak to the original rationale, but i dug up this relevant bit from the prometheus docs:
adding some additional buckets (1 second, 10 seconds) is probably the easiest fix here. a histogram already figures out the quantiles after the fact, so it seems like the metric type we should stick with.
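To illustrate why a 100ms top bucket would show up as a flat ceiling, here is a minimal sketch of bucket-based quantile estimation. It is a simplified approximation of how Prometheus's `histogram_quantile` interpolates, not the dex component's code. The bucket bounds and the observations are hypothetical values chosen for illustration; they are not the real dex configuration or real measurements.

```python
import math

# Hypothetical bucket upper bounds in seconds (NOT the real dex config).
OLD_BOUNDS = [0.001, 0.005, 0.01, 0.05, 0.1, math.inf]
# The same bounds with the proposed 1 second and 10 second buckets added.
NEW_BOUNDS = [0.001, 0.005, 0.01, 0.05, 0.1, 1.0, 10.0, math.inf]

# Hypothetical durations in seconds; half of them exceed 100ms.
OBSERVATIONS = [0.02] * 5 + [0.3] * 3 + [2.0] * 2


def cumulative_counts(bounds, observations):
    """Count observations <= each upper bound (cumulative, like `le` buckets)."""
    return [sum(1 for x in observations if x <= b) for b in bounds]


def estimate_quantile(q, bounds, counts):
    """Estimate the q-quantile from cumulative bucket counts.

    Simplified: interpolates linearly inside the bucket that contains the
    target rank, and treats the first bucket's lower bound as 0.
    """
    total = counts[-1]
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    i = next(idx for idx, c in enumerate(counts) if c >= rank)
    if math.isinf(bounds[i]):
        # The rank falls in the +Inf bucket, so the best available answer
        # is the highest finite bound. This is the source of the ceiling.
        return bounds[-2]
    lower = bounds[i - 1] if i > 0 else 0.0
    prev = counts[i - 1] if i > 0 else 0
    in_bucket = counts[i] - prev
    if in_bucket == 0:
        return bounds[i]
    return lower + (bounds[i] - lower) * (rank - prev) / in_bucket


old_counts = cumulative_counts(OLD_BOUNDS, OBSERVATIONS)
new_counts = cumulative_counts(NEW_BOUNDS, OBSERVATIONS)

# With a 100ms top bucket, the p99 estimate cannot exceed 0.1 seconds.
p99_old = estimate_quantile(0.99, OLD_BOUNDS, old_counts)
# With the larger buckets, the slow observations become visible.
p99_new = estimate_quantile(0.99, NEW_BOUNDS, new_counts)
print(p99_old, p99_new)
```

With the narrower bounds, every observation slower than 100ms lands in the `+Inf` bucket, so any high quantile is reported as exactly the highest finite bound. Adding larger buckets moves the ceiling but keeps the estimate coarse inside each wide bucket.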
fixes #4464. this adds two larger buckets to the dex component's histograms. when our dashboards calculate quantiles, we observed signals that some operations were taking longer than 100ms. to help obtain more accurate performance data, we add a 1 second and 10 second bucket.
Refs #4464. Adds a new hand-written dashboard to display DEX event durations. We're still massaging the bucketing logic, but getting this dashboard into version-control immediately, so we can iterate on it.
Once #4571 lands in main, we will have non-broken DEX buckets so we should be able to close this, defer future performance investigation, and redirect towards closing the loop on other streams like indexing, candlestick data integration/debugging. (cc @aubrika @hdevalence @cratelyn Y/N?)
We can close this now, we already identified the cause (manual bucket configs) and fixed the DEX buckets. Further changes to the DEX bucket configs are not necessary based on the information we have now, because we know that we can't get more useful signal out of the GCP testnet deployment, as a result of its virtualized I/O, which prevents us from drawing any conclusions about I/O driven performance. #4571 makes changes based on data we know is bad, so it's not a helpful step towards understanding and improving performance. If we want to do that, we should first make sure that we have good data, collecting measurements on dedicated hardware (cf #4565)
#4571 doesn't currently alter bucket configuration. we haven't found consensus about how to configure histograms, but i've opted to drop further changes and stop trying to fix the "ceiling" problem shown in the original image in this issue. #4524 and #4502 (superseding #4489) changed bucket configuration. #4571 introduces telemetry measuring the duration of path relaxation and makes changes to allow us to add other metric types like counters, while #4581 adds telemetry to measure scheduler latency. those are changes that build upon @erwanor's fantastic work that suggested that path search accounted for 85% of dex latency. i'm happy to close this for now as well.
Describe the bug
@hdevalence noticed that there is a conspicuous ceiling of 100ms in the dex component's search path duration metrics, reported in Grafana.
we should investigate this further.