dex: 🔖 instrument search path relaxation #4571
Conversation
this commit moves the logic responsible for configuring dex metrics into the dex's metrics module. there are some lurking footguns here, so moving the regex/prefix logic next to the metrics lets us avoid easily forgotten non-local reasoning.
we can't add counters or gauges with names prefixed by `penumbra_dex_` because of the use of `Matcher::Prefix`. instead, we can tweak this and provide an explicit list of the metrics we want buckets for.
```rust
let entry = cache.lock().0.remove(&dst);
let Some(PathEntry { path, spill, .. }) = entry else {
    record_duration();
    return Ok((None, None));
};
```
☝️ this looks like an arm we should still record the duration of. even if we didn't find a path to return, we still searched for one.
i'd like to put a counter here and in some other arms below, in a subsequent branch.
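for illustration, this is roughly how a counter could slot into the arm quoted above; the counter name is hypothetical, and this assumes the handle-based macros from recent versions of the `metrics` crate:

```rust
let entry = cache.lock().0.remove(&dst);
let Some(PathEntry { path, spill, .. }) = entry else {
    // no path was found for this target: still record the search duration,
    // and count the dead end so it shows up in the component's telemetry.
    // (hypothetical counter name.)
    metrics::counter!("penumbra_dex_path_search_dead_ends_total").increment(1);
    record_duration();
    return Ok((None, None));
};
```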
```diff
-.set_buckets_for_metric(
-    metrics_exporter_prometheus::Matcher::Prefix("penumbra_dex_".to_string()),
-    penumbra_dex::component::metrics::DEX_BUCKETS,
-)?
+.set_buckets_for_dex_metrics()?
```
see the relevant commit for more info, but this (a) cuts down on non-local reasoning, and (b) sets the stage for us to add counters and gauges to this component.
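for illustration only, a sketch of how such a builder helper could be structured; the trait, the bucket values, and the histogram names here are all hypothetical stand-ins for whatever lives in `penumbra_dex::component::metrics`:

```rust
use metrics_exporter_prometheus::{BuildError, Matcher, PrometheusBuilder};

/// hypothetical extension trait; the real `set_buckets_for_dex_metrics`
/// lives in the dex's metrics module and may be shaped differently.
pub trait DexMetricsExt: Sized {
    fn set_buckets_for_dex_metrics(self) -> Result<Self, BuildError>;
}

impl DexMetricsExt for PrometheusBuilder {
    fn set_buckets_for_dex_metrics(mut self) -> Result<Self, BuildError> {
        // illustrative bucket boundaries, in seconds.
        const DEX_BUCKETS: &[f64] = &[0.005, 0.025, 0.1, 0.25];
        // an explicit list of histogram names, instead of `Matcher::Prefix`,
        // so future counters and gauges under `penumbra_dex_` are unaffected.
        const DEX_HISTOGRAMS: &[&str] = &[
            "penumbra_dex_path_search_duration_seconds",
            "penumbra_dex_path_relaxation_duration_seconds",
        ];
        for name in DEX_HISTOGRAMS {
            self = self.set_buckets_for_metric(Matcher::Full(name.to_string()), DEX_BUCKETS)?;
        }
        Ok(self)
    }
}
```

at the call site this keeps the one-liner from the diff above, `builder.set_buckets_for_dex_metrics()?`, while the metric list stays next to the code that emits those metrics.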
I'm not sure why we want to change the bucket resolution for the metrics in question. If they're longer than 250ms, we already have all the information we need (they take too long), and the place where these metrics are currently used (a google cloud vm with virtualized disks) cannot be relied on for performance analysis because it's (1) unpredictable, as backend system changes can cause wild fluctuations in the observations, and (2) not the case we care about anyway (node operators running on non-virtualized SSDs). We don't particularly care to measure the performance of our specific cloud testnet deployment, since that's not how the software will be used in practice. What we care about is improving its performance generally, but changing the bucket sizes does not help us with this.
I don't think we should be tuning the bucket sizes like this, for the reason I mentioned in my comment above.
force-pushed from d159da2 to 1e71cda
⚖️ making the case
i have force pushed to drop d159da2 from this branch. i still feel very strongly that by refraining from adding larger buckets, we are opting not to collect information about how our software behaves in production.
prometheus metrics are not the proper tool for measuring performance; they are for observability and alerting. if we want performance measurements, we should write benchmarks with a dedicated benchmarking library. a bucket counting observations greater than 5 seconds:
- allows operators to configure alerts so that they can be informed when exceptional events occur in their system.
- does not imply that 5 seconds is an acceptable duration for path searches; it helps differentiate significant outages from minor performance hiccups.
- allows operators to compare long-tail latencies between a VM and an on-prem device, to confirm that the latter does not experience the same periods of elevated latency.
- is how we can identify the source of elevated path search latency. we have observed elevated durations like this in the wild, but are not able to make any confident statements about whether those searches took 251ms or 5 seconds. a 1 second and a 5 second bucket answer that question without any significant downside.
✂️ the buckets are gone now
with that case made, i've removed the buckets. this branch contains other changes, like 0108a4b, that will facilitate the addition of other metrics to the dex, and that will help us improve the dex component's telemetry regardless of our buckets' upper bound. i would appreciate it if you could take another look, when able. thanks.
This makes sense to me. Thanks for the interesting write-up; I really liked the distinction between using prometheus for observability vs. latency measurement. I think, to henry's point, 200ms on a single path search would already make someone's pager go red, so buckets higher than 250ms have really diminishing returns.
PR has been revised to address comments
Thanks, @cratelyn, both for the diff itself and for the detailed discussion you've been adding to the PRs and tickets related to this issue. Extremely high-quality work.
💭 describe your changes
this makes a few changes to the dex component's prometheus metrics.
see #4464 for previous context.
importantly, this adds a new metric that tracks the time spent processing a
distinct, individual path.
this specific metric is configured to use a larger set of buckets on the upper end, so that we can capture the latencies observed while running pd on a testnet as positions are being routed.
this is a sibling metric to the existing path search metric, which records the
time spent finding the final output path. as part of this branch, we also fix a small
bug so that the duration is still recorded when a dead end is reached and no path
could be found.
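as a rough illustration of the shape of that per-path measurement (the helper and the metric name are hypothetical, assuming the handle-based `metrics` macros):

```rust
use std::time::Instant;

/// hypothetical helper: time one unit of path-relaxation work and record it,
/// whether or not that work ends in a dead end. the metric name is illustrative.
fn record_relaxation_duration<T>(relax: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let output = relax();
    metrics::histogram!("penumbra_dex_path_relaxation_duration_seconds")
        .record(start.elapsed().as_secs_f64());
    output
}
```

a call site could wrap each candidate-path relaxation in this helper, so dead ends are timed the same way as successful relaxations.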
finally, the configuration of the prometheus metrics builder is tweaked so that
we can add counters to this component in the future. additional counters could track
things like: (1) paths that were ruled out for being too expensive, (2) dead ends
reached, (3) errors encountered, etc. i would love for reviewers to offer insight
about other conditions worth counting.
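for discussion, a sketch of how those counters (and the new histogram) might be described when registering this component's metrics; every name and description below is illustrative, not an existing metric:

```rust
use metrics::Unit;

/// hypothetical registration for the proposed dex metrics.
pub fn register_metrics() {
    metrics::describe_histogram!(
        "penumbra_dex_path_relaxation_duration_seconds",
        Unit::Seconds,
        "time spent relaxing a single candidate path during routing"
    );
    metrics::describe_counter!(
        "penumbra_dex_path_search_pruned_paths_total",
        Unit::Count,
        "paths ruled out of the search for being too expensive"
    );
    metrics::describe_counter!(
        "penumbra_dex_path_search_dead_ends_total",
        Unit::Count,
        "path searches that reached a dead end and returned no path"
    );
    metrics::describe_counter!(
        "penumbra_dex_path_search_errors_total",
        Unit::Count,
        "errors encountered while searching for a path"
    );
}
```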
🔖 issue ticket number and link
✅ checklist before requesting a review
if this code contains consensus-breaking changes, i have added the
"consensus-breaking" label. otherwise, i declare my belief that there are not
consensus-breaking changes, for the following reason: