Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dex: 🎩 investigate search path duration of 100ms #4464

Closed
cratelyn opened this issue May 24, 2024 · 10 comments
Closed

dex: 🎩 investigate search path duration of 100ms #4464

cratelyn opened this issue May 24, 2024 · 10 comments
Assignees
Labels
A-dex Area: Relates to the dex A-telemetry Area: Metrics, logging, and other observability-related features needs-refinement unclear, incomplete, or stub issue that needs work
Milestone

Comments

@cratelyn
Copy link
Contributor

Describe the bug

image

@hdevalence noticed that there is a conspicuous ceiling of 100ms in the dex component's search path duration metrics, reported in Grafana.

we should investigate this further.

@cratelyn cratelyn added A-telemetry Area: Metrics, logging, and other observability-related features A-dex Area: Relates to the dex labels May 24, 2024
@cratelyn cratelyn self-assigned this May 24, 2024
@github-project-automation github-project-automation bot moved this to Backlog in Penumbra May 24, 2024
@github-actions github-actions bot added the needs-refinement unclear, incomplete, or stub issue that needs work label May 24, 2024
@hdevalence
Copy link
Member

It would also be good to get that dashboard into the deployments, I’m not sure how to do that (sent the JSON privately)

@hdevalence
Copy link
Member

There are manual bucket configs in the dex metrics module that I don't understand. Could those be part of the cause?

@cratelyn
Copy link
Contributor Author

There are manual bucket configs in the dex metrics module that I don't understand. Could those be part of the cause?

https://github.com/penumbra-zone/penumbra/blob/main/crates/core/component/dex/src/component/metrics.rs#L49

for posterity, the buckets are defined here.

i would agree that the .1 does appear to be the cause, but i will confirm this hypothesis after the weekend!

@hdevalence
Copy link
Member

Do you understand the purpose of those bucket configs? I don’t, so it could be good to document.

@hdevalence
Copy link
Member

The surrounding code has a comment that the purpose is to avoid using "Summaries". Why do we want that? Is there a reason to not record all the data and then have the quantiles figured out after the fact?

@cratelyn
Copy link
Contributor Author

i can't speak to the original rationale, but i dug up this relevant bit from the prometheus docs:

The essential difference between summaries and histograms is that summaries calculate streaming φ-quantiles on the client side and expose them directly, while histograms expose bucketed observation counts and the calculation of quantiles from the buckets of a histogram happens on the server side using the histogram_quantile() function.

source

adding some additional buckets (1 second, 10 seconds) is probably the easiest fix here. a histogram is figuring out the quantiles after the fact, so it seems like the metric type we should stick with.

@cratelyn

This comment was marked as resolved.

cratelyn added a commit that referenced this issue May 28, 2024
fixes #4464.

this adds two larger buckets to the dex component's histograms.

when our dashboards calculate quantiles, we observed signals that some
operations were taking longer than 100ms. to help obtain more accurate
performance data, we add a 1 second and 10 second bucket.
cratelyn added a commit that referenced this issue May 28, 2024
fixes #4464.

this adds two larger buckets to the dex component's histograms.

when our dashboards calculate quantiles, we observed signals that some
operations were taking longer than 100ms. to help obtain more accurate
performance data, we add a 1 second and 10 second bucket.
cratelyn added a commit that referenced this issue May 28, 2024
fixes #4464.

this adds two larger buckets to the dex component's histograms.

when our dashboards calculate quantiles, we observed signals that some
operations were taking longer than 100ms. to help obtain more accurate
performance data, we add a 1 second and 10 second bucket.
conorsch added a commit that referenced this issue May 28, 2024
Refs #4464. Adds a new hand-written dashboard to display DEX event
durations. We're still massaging the bucketing logic, but getting this
dashboard into version-control immediately, so we can iterate on it.
conorsch added a commit that referenced this issue May 28, 2024
Refs #4464. Adds a new hand-written dashboard to display DEX event
durations. We're still massaging the bucketing logic, but getting this
dashboard into version-control immediately, so we can iterate on it.
conorsch added a commit that referenced this issue May 28, 2024
Refs #4464. Adds a new hand-written dashboard to display DEX event
durations. We're still massaging the bucketing logic, but getting this
dashboard into version-control immediately, so we can iterate on it.
@cratelyn cratelyn added this to the Sprint 7 milestone May 29, 2024
@cratelyn cratelyn moved this from Backlog to In progress in Penumbra May 29, 2024
cratelyn added a commit that referenced this issue Jun 3, 2024
`metrics` histograms use seconds as a unit of measurement, update the
bucket configuration to reflect that.

see:
* #4502
* #4464
cratelyn added a commit that referenced this issue Jun 3, 2024
`metrics` histograms use seconds as a unit of measurement, update the
bucket configuration to reflect that.

see:
* #4502
* #4464
@aubrika aubrika modified the milestones: Sprint 7, Sprint 8 Jun 3, 2024
conorsch pushed a commit that referenced this issue Jun 3, 2024
`metrics` histograms use seconds as a unit of measurement, update the
bucket configuration to reflect that.

see:
* #4502
* #4464

(cherry picked from commit f35b190)
conorsch pushed a commit that referenced this issue Jun 3, 2024
`metrics` histograms use seconds as a unit of measurement, update the
bucket configuration to reflect that.

see:
* #4502
* #4464

(cherry picked from commit f35b190)
@erwanor
Copy link
Member

erwanor commented Jun 10, 2024

Once #4571 lands in main, we will have non-broken DEX buckets so we should be able to close this, defer future performance investigation, and redirect towards closing the loop on other streams like indexing, candlestick data integration/debugging. (cc @aubrika @hdevalence @cratelyn Y/N?)

@hdevalence
Copy link
Member

We can close this now, we already identified the cause (manual bucket configs) and fixed the DEX buckets. Further changes to the DEX bucket configs are not necessary based on the information we have now, because we know that we can't get more useful signal out of the GCP testnet deployment, as a result of its virtualized I/O, which prevents us from drawing any conclusions about I/O driven performance. #4571 makes changes based on data we know is bad, so it's not a helpful step towards understanding and improving performance. If we want to do that, we should first make sure that we have good data, collecting measurements on dedicated hardware (cf #4565)

@github-project-automation github-project-automation bot moved this from In progress to Done in Penumbra Jun 10, 2024
@cratelyn
Copy link
Contributor Author

cratelyn commented Jun 10, 2024

#4571 doesn't currently alter bucket configuration. we haven't found consensus about how to configure histograms, but i've opted to drop further changes and stop trying to fix the "ceiling" problem shown in the original image in this issue.

#4524 and #4502 (superseding #4489) changed bucket configuration.

#4571 introduces telemetry measuring the duration of path relaxation and makes changes to allow us to add other metric types like counters, while #4581 adds telemetry to measure scheduler latency. those are changes that build upon @erwanor's fantastic work that suggested that path search accounted for 85% of dex latency.

i'm happy to close this for now as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-dex Area: Relates to the dex A-telemetry Area: Metrics, logging, and other observability-related features needs-refinement unclear, incomplete, or stub issue that needs work
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants