
fix: merge gauges properly and avoid dropping existing points #444

Merged
2 commits merged into main on Jan 28, 2025

Conversation

tobz (Member) commented on Jan 24, 2025

Context

When metrics are merged in the aggregate transform, we call MetricValues::merge, which is meant to uniformly handle the merge of different points for all metric types. This method handles both the merging of points based on their timestamp and the type-specific merge behavior for the point values themselves.
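
For illustration, that uniform merge can be sketched as follows; the `Point` struct and `merge_points` helper below are hypothetical stand-ins, not the actual Saluki types:

```rust
// Minimal sketch (hypothetical names, not the real Saluki code): a uniform,
// timestamp-aware merge where the per-type behavior is supplied as a closure.
#[derive(Debug, Clone, PartialEq)]
struct Point {
    timestamp: u64,
    value: f64,
}

/// Merges `incoming` points into `existing`, keyed by timestamp. Points with a
/// new timestamp are appended; points with a matching timestamp are combined
/// via `merge_value` (e.g. sum for counters, last-write-wins for gauges).
fn merge_points<F>(existing: &mut Vec<Point>, incoming: Vec<Point>, merge_value: F)
where
    F: Fn(&mut f64, f64),
{
    for point in incoming {
        match existing.iter_mut().find(|p| p.timestamp == point.timestamp) {
            Some(found) => merge_value(&mut found.value, point.value),
            None => existing.push(point),
        }
    }
}
```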

An issue was pointed out with how ADP was handling gauges, and after some debugging, it became apparent that when merging gauges, we were actually ignoring the timestamp, which meant that if you had an existing point for a gauge, say with timestamp X, and you merged in another point with timestamp Y, the first point would be entirely dropped.

This manifested itself as a fun bug where, due to the bucket width and flush interval defaults, a gauge being reported every 10 seconds was only ever emitted every 30 seconds.
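
Concretely, using the hypothetical sketch above, the difference between overwriting a gauge's points and merging them looks like this:

```rust
// Illustration of the bug described above, using the hypothetical `Point` /
// `merge_points` sketch from earlier (not the real Saluki code).
fn main() {
    // Existing gauge point at timestamp X, incoming point at timestamp Y.
    let mut existing = vec![Point { timestamp: 10, value: 1.0 }];
    let incoming = vec![Point { timestamp: 20, value: 2.0 }];

    // Buggy behavior (`*a = b`): wholesale replacement drops the point at timestamp X.
    let overwritten = incoming.clone();
    assert_eq!(overwritten, vec![Point { timestamp: 20, value: 2.0 }]);

    // Intended behavior: both points survive, because their timestamps differ.
    merge_points(&mut existing, incoming, |dest, src| *dest = src);
    assert_eq!(existing.len(), 2);
}
```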

Solution

While gauges nominally have "last write wins" behavior, that logic only applies for a given timestamp. This PR fixes the logic of merging gauges such that we don't drop the previous points, and only apply the LWW behavior when merging two points with an identical timestamp.

Further, we've added a number of unit tests to ensure the merging behavior is correct for each metric type. We've additionally added a number of From and Display impls on the related types... some of which we only needed for initial debugging (but might be valuable in the future) and some of which are used by the unit tests.
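
A unit test for the corrected gauge semantics might look roughly like the following (a sketch using the hypothetical helpers above, not the tests actually added in this PR):

```rust
// Sketch of a test for the corrected behavior: points at distinct timestamps
// are all kept, and last-write-wins only applies when timestamps collide.
#[test]
fn gauge_merge_keeps_points_and_applies_lww_per_timestamp() {
    let last_write_wins = |dest: &mut f64, src: f64| *dest = src;

    // Distinct timestamps: both points are kept.
    let mut points = vec![Point { timestamp: 10, value: 1.0 }];
    merge_points(&mut points, vec![Point { timestamp: 20, value: 2.0 }], last_write_wins);
    assert_eq!(points.len(), 2);

    // Identical timestamps: the incoming value wins, no new point is added.
    merge_points(&mut points, vec![Point { timestamp: 20, value: 3.0 }], last_write_wins);
    assert_eq!(points.len(), 2);
    assert_eq!(points[1], Point { timestamp: 20, value: 3.0 });
}
```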

@tobz requested review from a team as code owners on January 24, 2025, 18:39
@tobz added the type/bug (Bug fixes) label on Jan 24, 2025
@github-actions bot added the area/core (Core functionality, event model, etc.), area/config (Configuration), area/components (Sources, transforms, and destinations), area/ci (CI/CD, automated testing, etc.), source/dogstatsd (DogStatsD source), destination/datadog-metrics (Datadog Metrics destination), destination/prometheus (Prometheus Scrape destination), and destination/datadog (Common Datadog destination code) labels on Jan 24, 2025
@tobz changed the base branch from main to tobz/context-resolver-origin-uniqueness-test on January 24, 2025, 18:40
```diff
         }
     }
-    (Self::Gauge(a), Self::Gauge(b)) => *a = b,
+    (Self::Gauge(a), Self::Gauge(b)) => a.merge(b, merge_scalar_latest),
```
tobz (Member Author) commented on this change:

This is the fix, basically. Right here. This one lil guy.
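
Judging by its name and the PR description, `merge_scalar_latest` is presumably the per-timestamp combiner passed into the timestamp-aware merge; a plausible reading of it (an assumption, not the actual implementation) is:

```rust
// Assumed shape of the combiner named in the diff: for two scalar values that
// share a timestamp, the incoming ("latest") value wins.
fn merge_scalar_latest(existing: &mut f64, incoming: f64) {
    *existing = incoming;
}
```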


pr-commenter bot commented on Jan 24, 2025

Regression Detector (DogStatsD)

Regression Detector Results

Run ID: a37173bd-7754-4503-9ef7-af66b2e5a1c7

Baseline: 7.62.0-rc.8
Comparison: 7.62.0-rc.8

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|------|------------|------|----------|-------------|--------|-------|
| | dsd_uds_100mb_3k_contexts_distributions_only | memory utilization | +1.48 | [+1.28, +1.69] | 1 | |
| | quality_gates_idle_rss | memory utilization | +0.42 | [+0.31, +0.52] | 1 | |
| | dsd_uds_10mb_3k_contexts | ingress throughput | +0.00 | [-0.00, +0.00] | 1 | |
| | dsd_uds_1mb_3k_contexts | ingress throughput | +0.00 | [-0.00, +0.00] | 1 | |
| | dsd_uds_512kb_3k_contexts | ingress throughput | +0.00 | [-0.01, +0.01] | 1 | |
| | dsd_uds_1mb_50k_contexts | ingress throughput | +0.00 | [-0.00, +0.00] | 1 | |
| | dsd_uds_100mb_250k_contexts | ingress throughput | -0.00 | [-0.00, +0.00] | 1 | |
| | dsd_uds_1mb_3k_contexts_dualship | ingress throughput | -0.00 | [-0.00, +0.00] | 1 | |
| | dsd_uds_1mb_50k_contexts_memlimit | ingress throughput | -0.00 | [-0.00, +0.00] | 1 | |
| | dsd_uds_100mb_3k_contexts | ingress throughput | -0.00 | [-0.05, +0.04] | 1 | |
| | dsd_uds_40mb_12k_contexts_40_senders | ingress throughput | -0.00 | [-0.00, +0.00] | 1 | |
| | dsd_uds_500mb_3k_contexts | ingress throughput | -0.27 | [-0.30, -0.24] | 1 | |

Bounds Checks: ❌ Failed

| perf | experiment | bounds_check_name | replicates_passed | links |
|------|------------|-------------------|-------------------|-------|
| | quality_gates_idle_rss | memory_usage | 0/10 | |

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".
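
Put another way, the decision rule above amounts to something like this sketch (illustrative only; hypothetical types, not the Regression Detector's actual code):

```rust
// Sketch of the regression criteria stated above.
struct ExperimentResult {
    delta_mean_pct: f64,   // Δ mean %
    ci_90_pct: (f64, f64), // 90% confidence interval on Δ mean %
    erratic: bool,         // experiment configuration marks it "erratic"
}

fn is_regression(r: &ExperimentResult) -> bool {
    let big_enough = r.delta_mean_pct.abs() >= 5.0;
    let ci_excludes_zero = r.ci_90_pct.0 > 0.0 || r.ci_90_pct.1 < 0.0;
    big_enough && ci_excludes_zero && !r.erratic
}
```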

pr-commenter bot commented on Jan 24, 2025

Regression Detector (Saluki)

Regression Detector Results

Run ID: bb0a943d-79e1-47f1-9d95-58230bc95bbd

Baseline: 3576732
Comparison: 0d1856f
Diff

❌ Experiments with missing or malformed data

This is a critical error. No usable optimization goal data was produced by the listed experiments. This may be a result of misconfiguration. Ping #single-machine-performance and we can help out.

  • dsd_uds_100mb_250k_contexts

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|------|------------|------|----------|-------------|--------|-------|
| | dsd_uds_500mb_3k_contexts | ingress throughput | +3.26 | [+3.14, +3.39] | 1 | |
| | dsd_uds_100mb_3k_contexts_distributions_only | memory utilization | +0.86 | [+0.73, +0.99] | 1 | |
| | dsd_uds_50mb_10k_contexts_no_inlining | ingress throughput | +0.02 | [-0.05, +0.09] | 1 | |
| | dsd_uds_50mb_10k_contexts_no_inlining_no_allocs | ingress throughput | +0.01 | [-0.05, +0.07] | 1 | |
| | dsd_uds_1mb_3k_contexts_dualship | ingress throughput | +0.00 | [-0.00, +0.01] | 1 | |
| | dsd_uds_40mb_12k_contexts_40_senders | ingress throughput | +0.00 | [-0.02, +0.03] | 1 | |
| | dsd_uds_100mb_3k_contexts | ingress throughput | +0.00 | [-0.05, +0.05] | 1 | |
| | dsd_uds_512kb_3k_contexts | ingress throughput | +0.00 | [-0.01, +0.01] | 1 | |
| | dsd_uds_1mb_50k_contexts | ingress throughput | -0.01 | [-0.02, +0.01] | 1 | |
| | dsd_uds_1mb_3k_contexts | ingress throughput | -0.01 | [-0.03, +0.00] | 1 | |
| | dsd_uds_10mb_3k_contexts | ingress throughput | -0.02 | [-0.04, +0.01] | 1 | |
| | quality_gates_idle_rss | memory utilization | -0.12 | [-0.15, -0.09] | 1 | |
| | dsd_uds_1mb_50k_contexts_memlimit | ingress throughput | -3.50 | [-4.00, -3.01] | 1 | |

Bounds Checks: ✅ Passed

| perf | experiment | bounds_check_name | replicates_passed | links |
|------|------------|-------------------|-------------------|-------|
| | quality_gates_idle_rss | memory_usage | 10/10 | |

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

pr-commenter bot commented on Jan 24, 2025

Regression Detector Links

Experiment Result Links

| experiment | link(s) |
|------------|---------|
| dsd_uds_100mb_250k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_100mb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_100mb_3k_contexts_distributions_only | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_10mb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_1mb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_1mb_3k_contexts_dualship | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_1mb_50k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_1mb_50k_contexts_memlimit | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_40mb_12k_contexts_40_senders | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_500mb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_512kb_3k_contexts | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| quality_gates_idle_rss | [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard] |
| dsd_uds_50mb_10k_contexts_no_inlining (ADP only) | [Profiling (ADP)] [SMP Dashboard] |
| dsd_uds_50mb_10k_contexts_no_inlining_no_allocs (ADP only) | [Profiling (ADP)] [SMP Dashboard] |

@tobz force-pushed the tobz/context-resolver-origin-uniqueness-test branch from b1cda7c to d59b379 on January 27, 2025, 14:51
@tobz force-pushed the tobz/fix-broken-gauge-merge-behavior branch from c695664 to 0a73476 on January 27, 2025, 14:52
@github-actions bot removed the area/config (Configuration), area/components (Sources, transforms, and destinations), and area/ci (CI/CD, automated testing, etc.) labels on Jan 27, 2025
@github-actions bot removed the source/dogstatsd (DogStatsD source), destination/datadog-metrics (Datadog Metrics destination), destination/prometheus (Prometheus Scrape destination), and destination/datadog (Common Datadog destination code) labels on Jan 27, 2025
@tobz force-pushed the tobz/context-resolver-origin-uniqueness-test branch from d59b379 to 36f5486 on January 27, 2025, 14:59
@tobz force-pushed the tobz/fix-broken-gauge-merge-behavior branch 2 times, most recently from 5287ff7 to 0d1856f, on January 27, 2025, 16:52
@tobz changed the base branch from tobz/context-resolver-origin-uniqueness-test to main on January 27, 2025, 16:53
@tobz changed the base branch from main to tobz/context-resolver-origin-uniqueness-test on January 27, 2025, 16:55
lukesteensen previously approved these changes on Jan 28, 2025
@tobz changed the base branch from tobz/context-resolver-origin-uniqueness-test to main on January 28, 2025, 16:39
@tobz dismissed lukesteensen’s stale review on January 28, 2025, 16:39

The base branch was changed.

@tobz requested a review from a team as a code owner on January 28, 2025, 16:39
@tobz changed the base branch from main to tobz/context-resolver-origin-uniqueness-test on January 28, 2025, 16:39
lukesteensen previously approved these changes on Jan 28, 2025
@tobz force-pushed the tobz/fix-broken-gauge-merge-behavior branch from 0d1856f to 0549c44 on January 28, 2025, 16:40
@tobz changed the base branch from tobz/context-resolver-origin-uniqueness-test to main on January 28, 2025, 16:40
@tobz dismissed lukesteensen’s stale review on January 28, 2025, 16:40

The base branch was changed.

@tobz merged commit e278a97 into main on Jan 28, 2025
10 of 14 checks passed
@tobz deleted the tobz/fix-broken-gauge-merge-behavior branch on January 28, 2025, 16:44
Labels: area/core (Core functionality, event model, etc.), type/bug (Bug fixes)