Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dex: 🔖 instrument search path relaxation #4571

Merged
merged 5 commits into from
Jun 12, 2024
Merged

Conversation

cratelyn
Copy link
Contributor

@cratelyn cratelyn commented Jun 6, 2024

💭 describe your changes

this makes a few changes to the dex component's prometheus metrics.

see #4464 for previous context.

importantly, this adds a new metric that tracks the time spent processing a
distinct, individual path. this specific metric is configured to use a larger
set of buckets on the upper end, so that we can capture and observe latencies
observed running pd on a testnet while positions are being routed.

this is a sibling metric to the existing path search metric, which records the
time spent finding the final output path. as a part of this branch, we also
fix a small bug, which is that the duration is still recorded if a dead-end
is reached, and no path could be found.

finally, the configuration of the prometheus metrics builder is tweaked so that
we can add counters to this component in the future. additional counters tracking
things like: (1) paths that were ruled out for being too expensive, (2) dead-ends
reached, (3) errors encountered, etc. i would love for reviewers to offer insight
about other conditions worth counting.

🔖 issue ticket number and link

✅ checklist before requesting a review

  • if this code contains consensus-breaking changes, i have added the
    "consensus-breaking" label. otherwise, i declare my belief that there are not
    consensus-breaking changes, for the following reason:

    this only affects metrics emission.

cratelyn added 5 commits June 6, 2024 14:53
this commit moves the logic responsible for configuring dex metrics into
the dex's metrics module.

there are some lurking footguns here, so we can avoid easily forgotten
non-local reasoning and move the regex/prefix logic next to the metrics.
we can't add counters or gauges named `penumbra_dex_` because of the use
of `Matcher::Prefix`. rather, we can tweak this and provide the explicit
list of metrics we want buckets for.
@cratelyn cratelyn added this to the Sprint 8 milestone Jun 6, 2024
@cratelyn cratelyn self-assigned this Jun 6, 2024
@cratelyn cratelyn added A-telemetry Area: Metrics, logging, and other observability-related features A-dex Area: Relates to the dex labels Jun 6, 2024
Comment on lines 55 to 59
let entry = cache.lock().0.remove(&dst);
let Some(PathEntry { path, spill, .. }) = entry else {
record_duration();
return Ok((None, None));
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

☝️ this looks like an arm we should still record the duration of. even if we didn't find a path to return, we still searched for one.

i'd like to put a counter here and in some other arms below, in a subsequent branch.

Comment on lines -173 to +174
.set_buckets_for_metric(
metrics_exporter_prometheus::Matcher::Prefix("penumbra_dex_".to_string()),
penumbra_dex::component::metrics::DEX_BUCKETS,
)?
.set_buckets_for_dex_metrics()?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see the relevant commit for more info, but this (a) cuts down on non-local reasoning, and (b) sets the stage for us to add counters and gauges to this component.

@cratelyn cratelyn requested a review from hdevalence June 6, 2024 21:18
@cratelyn cratelyn marked this pull request as ready for review June 6, 2024 21:18
@hdevalence
Copy link
Member

hdevalence commented Jun 6, 2024

I'm not sure why we want to change the bucket resolution for the metrics in question?

If they're larger than 250ms we already have all the information we need — they take too long — and the place where these metrics are used currently (a google cloud vm with virtualized disks) cannot be relied on for performance analysis because it's (1) unpredictable, as backend system changes can cause wild fluctuations in the observations and (2) not the case we care about anyways (node operators running on non-virtualized SSDs).

We don't particularly care to measure the performance of our specific cloud testnet deployment, since that's not how the software will be used in practice. What we care about is improving its performance generally, but changing the bucket sizes does not help us with this.

Copy link
Member

@hdevalence hdevalence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be tuning the bucket sizes like this for the reason I mentioned in my comment

@cratelyn cratelyn force-pushed the kate/4464-dex-metrics branch from d159da2 to 1e71cda Compare June 7, 2024 15:58
@cratelyn
Copy link
Contributor Author

cratelyn commented Jun 7, 2024

⚖️ making the case

i have force pushed to drop d159da2 from this branch. i still feel very strongly that we are opting not to collect information about how our software behaves in production by refraining from adding larger buckets.

prometheus metrics are not the proper tool for measuring performance, they are for observability and alerting. if we want performance measurements, we should write benchmarks and use a library like criterion. prometheus metrics are also not for measurements of our specific testnet deployment, they are an operational tool for any deployment, regardless of that being a VM or an on-prem device.

having a bucket counting observations greater than 5 seconds allows operators to configure alerts so that they can be informed when exceptional events occurring in their system. having such a bucket also does not imply that 5 seconds is an acceptable duration for path searches, it helps differentiate significant outages from minor performance hiccups. having such a bucket also allows operators to compare long-tail latencies between a VM and an on-prem device to confirm that the latter does not experience these same periods of elevated latency. having such a bucket is how we can identify the source of elevated path search latency.

we have observed elevated durations like this in the wild, but are not able to make any confident statements about whether those searches took 251ms, or 5 seconds. a 1 second and 5 second bucket answers that question without any significant downside.


✂️ the buckets are gone now

with that case made, i've removed the buckets.

this branch contains other changes like 0108a4b that will facilitate the addition of other metrics to the dex, which will help us improve the dex component's telemetry regardless of our buckets' upper-bound.

i would appreciate if you could take another look, when able. thanks.

@cratelyn cratelyn requested a review from hdevalence June 7, 2024 16:31
Copy link
Member

@erwanor erwanor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, thanks for the interesting write up, really liked the distinction between using prometheus for observability vs. latency measurement. I think to henry's point, 200ms on a single path search would make someone's pager go red so higher than 250ms has really diminishing returns.

@conorsch conorsch dismissed hdevalence’s stale review June 12, 2024 15:24

PR has been revised to address comments

@conorsch
Copy link
Contributor

Thanks, @cratelyn, both for the diff itself and for the detailed discussion you've been adding to the PRs and tickets related to this issue. Extremely high-quality work.

@conorsch conorsch merged commit fb47486 into main Jun 12, 2024
13 checks passed
@conorsch conorsch deleted the kate/4464-dex-metrics branch June 12, 2024 15:25
@conorsch conorsch mentioned this pull request Jun 20, 2024
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-dex Area: Relates to the dex A-telemetry Area: Metrics, logging, and other observability-related features
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants