dex: 🔖 instrument search path relaxation #4571
Conversation
this commit moves the logic responsible for configuring dex metrics into the dex's metrics module. there are some lurking footguns here, so moving the regex/prefix logic next to the metrics lets us avoid easily forgotten non-local reasoning.
we can't add counters or gauges with names prefixed by `penumbra_dex_` because of the use of `Matcher::Prefix`. instead, we can tweak this and provide an explicit list of the metrics we want buckets for.
```rust
let entry = cache.lock().0.remove(&dst);
let Some(PathEntry { path, spill, .. }) = entry else {
    record_duration();
    return Ok((None, None));
};
```
☝️ this looks like an arm we should still record the duration of. even if we didn't find a path to return, we still searched for one.
i'd like to put a counter here and in some other arms below, in a subsequent branch.
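for illustration, this is roughly how a counter could slot into the arm quoted above; the counter name is hypothetical, and this assumes the handle-based macros from recent versions of the `metrics` crate:

```rust
let entry = cache.lock().0.remove(&dst);
let Some(PathEntry { path, spill, .. }) = entry else {
    // no path was found for this target: still record the search duration,
    // and count the dead end so it shows up in the component's telemetry.
    // (hypothetical counter name.)
    metrics::counter!("penumbra_dex_path_search_dead_ends_total").increment(1);
    record_duration();
    return Ok((None, None));
};
```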
```diff
-.set_buckets_for_metric(
-    metrics_exporter_prometheus::Matcher::Prefix("penumbra_dex_".to_string()),
-    penumbra_dex::component::metrics::DEX_BUCKETS,
-)?
+.set_buckets_for_dex_metrics()?
```
see the relevant commit for more info, but this (a) cuts down on non-local reasoning, and (b) sets the stage for us to add counters and gauges to this component.
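for illustration only, a sketch of how such a builder helper could be structured; the trait, the bucket values, and the histogram names here are all hypothetical stand-ins for whatever lives in `penumbra_dex::component::metrics`:

```rust
use metrics_exporter_prometheus::{BuildError, Matcher, PrometheusBuilder};

/// hypothetical extension trait; the real `set_buckets_for_dex_metrics`
/// lives in the dex's metrics module and may be shaped differently.
pub trait DexMetricsExt: Sized {
    fn set_buckets_for_dex_metrics(self) -> Result<Self, BuildError>;
}

impl DexMetricsExt for PrometheusBuilder {
    fn set_buckets_for_dex_metrics(mut self) -> Result<Self, BuildError> {
        // illustrative bucket boundaries, in seconds.
        const DEX_BUCKETS: &[f64] = &[0.005, 0.025, 0.1, 0.25];
        // an explicit list of histogram names, instead of `Matcher::Prefix`,
        // so future counters and gauges under `penumbra_dex_` are unaffected.
        const DEX_HISTOGRAMS: &[&str] = &[
            "penumbra_dex_path_search_duration_seconds",
            "penumbra_dex_path_relaxation_duration_seconds",
        ];
        for name in DEX_HISTOGRAMS {
            self = self.set_buckets_for_metric(Matcher::Full(name.to_string()), DEX_BUCKETS)?;
        }
        Ok(self)
    }
}
```

at the call site this keeps the one-liner from the diff above, `builder.set_buckets_for_dex_metrics()?`, while the metric list stays next to the code that emits those metrics.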
I'm not sure why we want to change the bucket resolution for the metrics in question. If they're longer than 250ms, we already have all the information we need (they take too long), and the place where these metrics are currently used (a google cloud vm with virtualized disks) cannot be relied on for performance analysis because it's (1) unpredictable, as backend system changes can cause wild fluctuations in the observations, and (2) not the case we care about anyway (node operators running on non-virtualized SSDs). We don't particularly care to measure the performance of our specific cloud testnet deployment, since that's not how the software will be used in practice. What we care about is improving its performance generally, but changing the bucket sizes does not help us with this.
I don't think we should be tuning the bucket sizes like this, for the reason I mentioned in my comment above.
force-pushed from d159da2 to 1e71cda
⚖️ making the case
i have force pushed to drop d159da2 from this branch. i still feel very strongly that by refraining from adding larger buckets, we are opting not to collect information about how our software behaves in production.
prometheus metrics are not the proper tool for measuring performance; they are for observability and alerting. if we want performance measurements, we should write benchmarks with a dedicated benchmarking library. a bucket counting observations greater than 5 seconds:
- allows operators to configure alerts so that they can be informed when exceptional events occur in their system.
- does not imply that 5 seconds is an acceptable duration for path searches; it helps differentiate significant outages from minor performance hiccups.
- allows operators to compare long-tail latencies between a VM and an on-prem device, to confirm that the latter does not experience the same periods of elevated latency.
- is how we can identify the source of elevated path search latency. we have observed elevated durations like this in the wild, but are not able to make any confident statements about whether those searches took 251ms or 5 seconds. a 1 second and a 5 second bucket answer that question without any significant downside.
✂️ the buckets are gone now
with that case made, i've removed the buckets. this branch contains other changes, like 0108a4b, that will facilitate the addition of other metrics to the dex, and that will help us improve the dex component's telemetry regardless of our buckets' upper bound. i would appreciate it if you could take another look, when able. thanks.
This makes sense to me. Thanks for the interesting write-up; I really liked the distinction between using prometheus for observability vs. latency measurement. I think, to henry's point, 200ms on a single path search would already make someone's pager go red, so buckets higher than 250ms have really diminishing returns.
PR has been revised to address comments
Thanks, @cratelyn, both for the diff itself and for the detailed discussion you've been adding to the PRs and tickets related to this issue. Extremely high-quality work.
💭 describe your changes
this makes a few changes to the dex component's prometheus metrics.
see #4464 for previous context.
importantly, this adds a new metric that tracks the time spent processing a
distinct, individual path.
this specific metric is configured to use a larger set of buckets on the upper end, so that we can capture the latencies observed while running pd on a testnet as positions are being routed.
this is a sibling metric to the existing path search metric, which records the
time spent finding the final output path. as part of this branch, we also fix a small
bug so that the duration is still recorded when a dead end is reached and no path
could be found.
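as a rough illustration of the shape of that per-path measurement (the helper and the metric name are hypothetical, assuming the handle-based `metrics` macros):

```rust
use std::time::Instant;

/// hypothetical helper: time one unit of path-relaxation work and record it,
/// whether or not that work ends in a dead end. the metric name is illustrative.
fn record_relaxation_duration<T>(relax: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let output = relax();
    metrics::histogram!("penumbra_dex_path_relaxation_duration_seconds")
        .record(start.elapsed().as_secs_f64());
    output
}
```

a call site could wrap each candidate-path relaxation in this helper, so dead ends are timed the same way as successful relaxations.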
finally, the configuration of the prometheus metrics builder is tweaked so that
we can add counters to this component in the future. additional counters could track
things like: (1) paths that were ruled out for being too expensive, (2) dead ends
reached, (3) errors encountered, etc. i would love for reviewers to offer insight
about other conditions worth counting.
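for discussion, a sketch of how those counters (and the new histogram) might be described when registering this component's metrics; every name and description below is illustrative, not an existing metric:

```rust
use metrics::Unit;

/// hypothetical registration for the proposed dex metrics.
pub fn register_metrics() {
    metrics::describe_histogram!(
        "penumbra_dex_path_relaxation_duration_seconds",
        Unit::Seconds,
        "time spent relaxing a single candidate path during routing"
    );
    metrics::describe_counter!(
        "penumbra_dex_path_search_pruned_paths_total",
        Unit::Count,
        "paths ruled out of the search for being too expensive"
    );
    metrics::describe_counter!(
        "penumbra_dex_path_search_dead_ends_total",
        Unit::Count,
        "path searches that reached a dead end and returned no path"
    );
    metrics::describe_counter!(
        "penumbra_dex_path_search_errors_total",
        Unit::Count,
        "errors encountered while searching for a path"
    );
}
```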
🔖 issue ticket number and link
✅ checklist before requesting a review
if this code contains consensus-breaking changes, i have added the
"consensus-breaking" label. otherwise, i declare my belief that there are not
consensus-breaking changes, for the following reason: