dex: 🎩 investigate search path duration of 100ms #4464

cratelyn · 2024-05-24T15:40:44Z

Describe the bug

@hdevalence noticed that there is a conspicuous ceiling of 100ms in the dex component's search path duration metrics, reported in Grafana.

we should investigate this further.

hdevalence · 2024-05-24T15:44:42Z

It would also be good to get that dashboard into the deployments, I’m not sure how to do that (sent the JSON privately)

hdevalence · 2024-05-24T18:51:45Z

There are manual bucket configs in the dex metrics module that I don't understand. Could those be part of the cause?

cratelyn · 2024-05-25T13:03:24Z

There are manual bucket configs in the dex metrics module that I don't understand. Could those be part of the cause?

https://github.com/penumbra-zone/penumbra/blob/main/crates/core/component/dex/src/component/metrics.rs#L49

for posterity, the buckets are defined here.

i would agree that the .1 does appear to be the cause, but i will confirm this hypothesis after the weekend!

hdevalence · 2024-05-25T19:22:40Z

Do you understand the purpose of those bucket configs? I don’t, so it could be good to document.

hdevalence · 2024-05-25T19:29:15Z

The surrounding code has a comment that the purpose is to avoid using "Summaries". Why do we want that? Is there a reason to not record all the data and then have the quantiles figured out after the fact?

cratelyn · 2024-05-26T03:29:07Z

i can't speak to the original rationale, but i dug up this relevant bit from the prometheus docs:

The essential difference between summaries and histograms is that summaries calculate streaming φ-quantiles on the client side and expose them directly, while histograms expose bucketed observation counts and the calculation of quantiles from the buckets of a histogram happens on the server side using the histogram_quantile() function.

source

adding some additional buckets (1 second, 10 seconds) is probably the easiest fix here. a histogram is figuring out the quantiles after the fact, so it seems like the metric type we should stick with.

fixes #4464. this adds two larger buckets to the dex component's histograms. when our dashboards calculate quantiles, we observed signals that some operations were taking longer than 100ms. to help obtain more accurate performance data, we add a 1 second and 10 second bucket.

Refs #4464. Adds a new hand-written dashboard to display DEX event durations. We're still massaging the bucketing logic, but getting this dashboard into version-control immediately, so we can iterate on it.

`metrics` histograms use seconds as a unit of measurement, update the bucket configuration to reflect that. see: * #4502 * #4464

`metrics` histograms use seconds as a unit of measurement, update the bucket configuration to reflect that. see: * #4502 * #4464 (cherry picked from commit f35b190)

erwanor · 2024-06-10T23:12:19Z

Once #4571 lands in main, we will have non-broken DEX buckets so we should be able to close this, defer future performance investigation, and redirect towards closing the loop on other streams like indexing, candlestick data integration/debugging. (cc @aubrika @hdevalence @cratelyn Y/N?)

hdevalence · 2024-06-10T23:25:40Z

We can close this now, we already identified the cause (manual bucket configs) and fixed the DEX buckets. Further changes to the DEX bucket configs are not necessary based on the information we have now, because we know that we can't get more useful signal out of the GCP testnet deployment, as a result of its virtualized I/O, which prevents us from drawing any conclusions about I/O driven performance. #4571 makes changes based on data we know is bad, so it's not a helpful step towards understanding and improving performance. If we want to do that, we should first make sure that we have good data, collecting measurements on dedicated hardware (cf #4565)

cratelyn · 2024-06-10T23:25:54Z

#4571 doesn't currently alter bucket configuration. we haven't found consensus about how to configure histograms, but i've opted to drop further changes and stop trying to fix the "ceiling" problem shown in the original image in this issue.

#4524 and #4502 (superseding #4489) changed bucket configuration.

#4571 introduces telemetry measuring the duration of path relaxation and makes changes to allow us to add other metric types like counters, while #4581 adds telemetry to measure scheduler latency. those are changes that build upon @erwanor's fantastic work that suggested that path search accounted for 85% of dex latency.

i'm happy to close this for now as well.

cratelyn added A-telemetry Area: Metrics, logging, and other observability-related features A-dex Area: Relates to the dex labels May 24, 2024

cratelyn self-assigned this May 24, 2024

cratelyn added this to Penumbra May 24, 2024

github-project-automation bot moved this to Backlog in Penumbra May 24, 2024

github-actions bot added the needs-refinement unclear, incomplete, or stub issue that needs work label May 24, 2024

hdevalence mentioned this issue May 24, 2024

Ensure fullnodes in testnet deployment do not use debug logging #4455

Closed

This comment was marked as resolved.

Sign in to view

cratelyn mentioned this issue May 28, 2024

dex: 🪣 add two larger dex buckets #4489

Closed

1 task

conorsch mentioned this issue May 28, 2024

metrics: add dex dashboard #4491

Merged

1 task

cratelyn added this to the Sprint 7 milestone May 29, 2024

cratelyn moved this from Backlog to In progress in Penumbra May 29, 2024

hdevalence mentioned this issue May 30, 2024

dex: update bucket values #4502

Merged

1 task

cratelyn added a commit that referenced this issue Jun 3, 2024

dex: 🧮 shift dex buckets over by a factor of 1000

a267e84

`metrics` histograms use seconds as a unit of measurement, update the bucket configuration to reflect that. see: * #4502 * #4464

cratelyn mentioned this issue Jun 3, 2024

dex: 🧮 shift dex buckets over by a factor of 1000 #4524

Merged

1 task

cratelyn added a commit that referenced this issue Jun 3, 2024

dex: 🧮 shift dex buckets over by a factor of 1000

f35b190

`metrics` histograms use seconds as a unit of measurement, update the bucket configuration to reflect that. see: * #4502 * #4464

aubrika modified the milestones: Sprint 7, Sprint 8 Jun 3, 2024

conorsch pushed a commit that referenced this issue Jun 3, 2024

dex: 🧮 shift dex buckets over by a factor of 1000

0a58ee3

`metrics` histograms use seconds as a unit of measurement, update the bucket configuration to reflect that. see: * #4502 * #4464 (cherry picked from commit f35b190)

conorsch mentioned this issue Jun 3, 2024

dex: 🧮 shift dex buckets over by a factor of 1000 #4530

Merged

1 task

conorsch pushed a commit that referenced this issue Jun 3, 2024

dex: 🧮 shift dex buckets over by a factor of 1000

4ebd75d

`metrics` histograms use seconds as a unit of measurement, update the bucket configuration to reflect that. see: * #4502 * #4464 (cherry picked from commit f35b190)

cratelyn mentioned this issue Jun 6, 2024

dex: 🔖 instrument search path relaxation #4571

Merged

1 task

conorsch mentioned this issue Jun 10, 2024

pd: 💤 a sleep worker, tracking scheduler latency #4581

Merged

1 task

hdevalence closed this as completed Jun 10, 2024

github-project-automation bot moved this from In progress to Done in Penumbra Jun 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dex: 🎩 investigate search path duration of 100ms #4464

dex: 🎩 investigate search path duration of 100ms #4464

cratelyn commented May 24, 2024

hdevalence commented May 24, 2024

hdevalence commented May 24, 2024

cratelyn commented May 25, 2024

hdevalence commented May 25, 2024

hdevalence commented May 25, 2024

cratelyn commented May 26, 2024

This comment was marked as resolved.

erwanor commented Jun 10, 2024

hdevalence commented Jun 10, 2024

cratelyn commented Jun 10, 2024 •

edited

Loading

dex: 🎩 investigate search path duration of 100ms #4464

dex: 🎩 investigate search path duration of 100ms #4464

Comments

cratelyn commented May 24, 2024

hdevalence commented May 24, 2024

hdevalence commented May 24, 2024

cratelyn commented May 25, 2024

hdevalence commented May 25, 2024

hdevalence commented May 25, 2024

cratelyn commented May 26, 2024

This comment was marked as resolved.

erwanor commented Jun 10, 2024

hdevalence commented Jun 10, 2024

cratelyn commented Jun 10, 2024 • edited Loading

cratelyn commented Jun 10, 2024 •

edited

Loading