Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: introduce first-class origin tags abstraction #441

Merged
merged 2 commits into from
Jan 28, 2025

Conversation

tobz
Copy link
Member

@tobz tobz commented Jan 23, 2025

Context

Currently, we've frontloaded origin detection/enrichment in the ingestion phase, where we strive to both resolve the origin of a metric and then also resolve the tags for that origin, cache them, and return a stable reference to those tags.... all before we resolve the context.

This technically works, but has one major issue: origin tags arrive incrementally from the Core Agent, especially during startup. Taking this approach, which necessitates caching to maintain reasonable performance, leads to issues with missing tags and effectively breaks origin enrichment.

Solution

We've taken the tack of breaking apart the resolving of an origin -- taking the raw identifiers and computing the unique origin identifier -- and the resolving of an origins tags.

Our new approach maintains much of the same machinery -- turning a raw OriginInfo<'a> into an OriginKey that is later used to look up the origin tags -- but we've added the aspect of deferring the actual resolution of the origin tags until they're queried.

To support this, we've created a new abstraction for origin tags called OriginTags. This type is the facade to visiting the tags attached to a given origin. It is powered by OriginKey, and a new trait, OriginTagsResolver, that replaces the old OriginEnricher. This allows us to resolve a stable origin identifier (OriginKey), bundle that identifier (and the resolver used to get it) into OriginTags itself, and then at the point where the origin tags are querier, the resolver is used again to look up the tags associated with the OriginKey.

Instead of merging and allocating storage for these tags, we've also added a new abstraction -- OriginTagsVisitor -- that is used to visit the tags attached to an origin. This lets us maintain a single copy of the tags in TagStore and simply iterate over them on-demand when requested.

A new type, OriginTagsQuerier, manages the link between the tag store, External Data resolver, and the logic to determine how to "resolve" an origin to its constituent parts -- pod, container, etc -- and sits within saluki-env. This type is wired into the ADP-specific workload provider, and bridged back to saluki-context by implementing the new OriginTagsResolver, such that configuring the DogStatsD source to use it is as simple as referencing the existing workload provider used in ADP.

Additionally, the Tagged implementation for Context now transparently visits the origin tags as well, which means both instrumented and origin tags are transparently chained together when visiting the tags for a context.

@github-actions github-actions bot added area/core Core functionality, event model, etc. area/config Configuration. area/components Sources, transforms, and destinations. source/dogstatsd DogStatsD source. labels Jan 23, 2025
@pr-commenter
Copy link

pr-commenter bot commented Jan 23, 2025

Regression Detector (DogStatsD)

Regression Detector Results

Run ID: b95e7750-c31f-4dbc-81d2-1ba814485db5

Baseline: 7.62.0-rc.8
Comparison: 7.62.0-rc.8

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI trials links
dsd_uds_100mb_3k_contexts_distributions_only memory utilization +1.56 [+1.36, +1.76] 1
dsd_uds_100mb_3k_contexts ingress throughput +0.00 [-0.05, +0.05] 1
dsd_uds_40mb_12k_contexts_40_senders ingress throughput +0.00 [-0.00, +0.00] 1
dsd_uds_500mb_3k_contexts ingress throughput +0.00 [-0.01, +0.01] 1
dsd_uds_1mb_3k_contexts_dualship ingress throughput +0.00 [-0.00, +0.00] 1
dsd_uds_1mb_50k_contexts_memlimit ingress throughput +0.00 [-0.00, +0.00] 1
dsd_uds_100mb_250k_contexts ingress throughput -0.00 [-0.00, +0.00] 1
dsd_uds_1mb_50k_contexts ingress throughput -0.00 [-0.00, +0.00] 1
dsd_uds_512kb_3k_contexts ingress throughput -0.00 [-0.01, +0.01] 1
dsd_uds_1mb_3k_contexts ingress throughput -0.00 [-0.00, +0.00] 1
quality_gates_idle_rss memory utilization -0.00 [-0.08, +0.08] 1
dsd_uds_10mb_3k_contexts ingress throughput -0.01 [-0.02, +0.00] 1

Bounds Checks: ❌ Failed

perf experiment bounds_check_name replicates_passed links
quality_gates_idle_rss memory_usage 0/10

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

@tobz tobz marked this pull request as ready for review January 23, 2025 20:22
@tobz tobz requested review from a team as code owners January 23, 2025 20:22
@pr-commenter
Copy link

pr-commenter bot commented Jan 23, 2025

Regression Detector Links

Experiment Result Links

experiment link(s)
dsd_uds_100mb_250k_contexts [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_100mb_3k_contexts [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_100mb_3k_contexts_distributions_only [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_10mb_3k_contexts [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_1mb_3k_contexts [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_1mb_3k_contexts_dualship [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_1mb_50k_contexts [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_1mb_50k_contexts_memlimit [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_40mb_12k_contexts_40_senders [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_500mb_3k_contexts [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_512kb_3k_contexts [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
quality_gates_idle_rss [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_50mb_10k_contexts_no_inlining (ADP only) [Profiling (ADP)] [SMP Dashboard]
dsd_uds_50mb_10k_contexts_no_inlining_no_allocs (ADP only) [Profiling (ADP)] [SMP Dashboard]

@tobz tobz force-pushed the tobz/external-data-expose-pod-container-entity-ids branch from d9aba07 to 80ef087 Compare January 27, 2025 14:51
@tobz tobz force-pushed the tobz/origin-enrichment-origin-tags branch from ec65bf8 to 30c0540 Compare January 27, 2025 14:51
@tobz tobz force-pushed the tobz/external-data-expose-pod-container-entity-ids branch from 80ef087 to b21a8c5 Compare January 27, 2025 14:59
@tobz tobz force-pushed the tobz/origin-enrichment-origin-tags branch from 30c0540 to 71ee9c4 Compare January 27, 2025 14:59
Base automatically changed from tobz/external-data-expose-pod-container-entity-ids to main January 28, 2025 15:53
@tobz tobz requested a review from a team as a code owner January 28, 2025 15:53
@github-actions github-actions bot added area/io General I/O and networking. destination/datadog-metrics Datadog Metrics destination. destination/prometheus Prometheus Scrape destination. area/observability Internal observability of ADP and Saluki. destination/datadog Common Datadog destination code. labels Jan 28, 2025
@tobz tobz force-pushed the tobz/origin-enrichment-origin-tags branch from fe4c64e to 71ee9c4 Compare January 28, 2025 16:01
@github-actions github-actions bot removed area/io General I/O and networking. area/observability Internal observability of ADP and Saluki. labels Jan 28, 2025
@tobz tobz force-pushed the tobz/origin-enrichment-origin-tags branch from 71ee9c4 to 623ad73 Compare January 28, 2025 16:01
@github-actions github-actions bot removed destination/datadog-metrics Datadog Metrics destination. destination/prometheus Prometheus Scrape destination. destination/datadog Common Datadog destination code. labels Jan 28, 2025
@tobz tobz removed request for a team January 28, 2025 16:02
Copy link
Contributor

@lukesteensen lukesteensen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note on something it'd be nice to simplify and/or document a bit more, which can certainly be a followup.

/// See [`OriginEnrichmentConfiguration`] for more details.
#[serde(default)]
origin_enrichment: OriginEnrichmentConfiguration,
pub struct DogStatsDConfiguration<R = ()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice not to add this generic, since I don't think it's really serving any meaningful purpose. You can't build the source with the default unit type, and there's really only one implementation of the trait.

More generally, abstracting with the trait leads to a bigger stack of nouns to hold in your head when reviewing this stuff (i.e. WorkloadProvider -> OriginTagsResolver -> OriginTagsQuerier -> TagStoreQuerier + DogStatsDOriginTagResolver), which is especially confusing because at least TagStoreQuerier is used at multiple levels. That's not to say the design is wrong, it's just complex. If there's a workable way to simplify it, great, and if not then it might be useful have some kind of ascii diagram in a comment that illustrates how all the pieces fit together 😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do an immediate follow-up PR to clean up all of this and remove the generic because I don't disagree that it's a bit intertwined.

@tobz tobz merged commit 06ccd7f into main Jan 28, 2025
17 checks passed
@tobz tobz deleted the tobz/origin-enrichment-origin-tags branch January 28, 2025 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/components Sources, transforms, and destinations. area/config Configuration. area/core Core functionality, event model, etc. source/dogstatsd DogStatsD source.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants