ddtrace/tracer: add integration tag to spans_started/finished #3023

Open · wants to merge 51 commits into base: main from apm-rd/span-source-health-metric
Conversation

hannahkm (Contributor) commented Dec 10, 2024:

What does this PR do?

Add an integration tag to the existing datadog.tracer.spans_started and datadog.tracer.spans_finished metrics. The value of the tag is the name of the component from which the span was started: for a contrib, it is the name of the contrib package (chi, net/http, etc.); for spans that were created manually, the tag is manual.

For the purpose of adding tags, we move the logic for counting finished spans from trace.finishChunk() to span.finish(). Since we must read the integration data from each individual span, we would rather increment the counter each time a single span finishes; counting in finishChunk() as we did previously would require a for loop over the chunk's spans, which might impact efficiency.
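
To make the shape of the new metric concrete, here is a minimal, hypothetical sketch: the count helper, the numbers, and the specific tag values are illustrative only (the real tracer reports through its statsd client), but it shows how one data point per integration would share the existing metric names.

```go
// Hypothetical sketch, not the PR's code: the integration tag pairs with the
// existing metric names, one data point per originating component.
package main

import "fmt"

// count stands in for a statsd-style Count(name, value, tags, rate) call.
func count(name string, value int64, tags []string, rate float64) {
	fmt.Println(name, value, tags, rate)
}

func main() {
	count("datadog.tracer.spans_started", 42, []string{"integration:chi"}, 1)
	count("datadog.tracer.spans_started", 7, []string{"integration:manual"}, 1)
	count("datadog.tracer.spans_finished", 40, []string{"integration:chi"}, 1)
}
```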

Motivation

In addition to knowing when a span is started, we want to know where it originated: a contrib integration or manual instrumentation.

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.
  • For internal contributors, a matching PR should be created to the v2-dev branch and reviewed by @DataDog/apm-go.

Unsure? Have a question? Request a review!

datadog-datadog-prod-us1 bot commented Dec 10, 2024:

Datadog Report

Branch report: apm-rd/span-source-health-metric
Commit report: 2756596
Test service: dd-trace-go

✅ 0 Failed, 5211 Passed, 72 Skipped, 2m 46.57s Total Time

pr-commenter bot commented Dec 10, 2024:

Benchmarks

Benchmark execution time: 2025-01-21 22:23:34

Comparing candidate commit 3e8b1df in PR branch apm-rd/span-source-health-metric with baseline commit e394045 in branch main.

Found 0 performance improvements and 5 performance regressions! Performance is the same for 53 metrics; 1 metric is unstable.

scenario:BenchmarkInjectW3C-24

  • 🟥 execution_time [+171.144ns; +209.856ns] or [+4.241%; +5.200%]

scenario:BenchmarkSingleSpanRetention/no-rules-24

  • 🟥 execution_time [+9.279µs; +10.036µs] or [+3.961%; +4.284%]

scenario:BenchmarkSingleSpanRetention/with-rules/match-all-24

  • 🟥 execution_time [+9.726µs; +10.871µs] or [+4.126%; +4.611%]

scenario:BenchmarkSingleSpanRetention/with-rules/match-half-24

  • 🟥 execution_time [+9.122µs; +10.036µs] or [+3.857%; +4.244%]

scenario:BenchmarkTracerAddSpans-24

  • 🟥 execution_time [+162.275ns; +249.325ns] or [+4.212%; +6.472%]

@hannahkm hannahkm changed the title from "ddtrace/tracer: add source tag to spans_started health metric" to "ddtrace/tracer: add integration tag to spans_started/finished" on Dec 12, 2024
@github-actions github-actions bot added the "apm:ecosystem contrib/* related feature requests or bugs" label on Dec 12, 2024
mtoffl01 (Contributor) left a comment:

Ok, so you're reporting spansStarted/spansFinished on span.Start/span.Finish if the integration is not empty, and leaving the chunk reporting to any spans that are manual... I understand why you did this, but I'm not totally sure about the approach.

span.Start and span.Finish are typically called quite frequently, so if a majority of the spans are from automatic integrations, this will be very noisy (and defeats the purpose of reporting the metrics at a specified interval, to reduce noise).

One alternative idea:
Change the way we track spansStarted and spansFinished to some kind of counter map keyed by integration name, e.g. map[string]uint32, where the key is the integration name and the value is the count of spans started/finished for that integration. Then, in the reporting goroutine, we'd iterate over the map and report the spans started/finished per integration.
(Or some other idea I haven't thought of?)
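
A minimal sketch of that counter-map idea, under illustrative assumptions: a mutex-guarded standard map and a hypothetical reportAtInterval loop standing in for the tracer's health-metrics goroutine. The PR ultimately uses xsync.MapOf with atomic values rather than this exact shape.

```go
// Hypothetical sketch of the counter-map idea; not the PR's implementation.
package main

import (
	"fmt"
	"sync"
	"time"
)

type spanCounter struct {
	mu     sync.Mutex
	counts map[string]uint32 // integration name -> spans started (or finished)
}

func (c *spanCounter) inc(integration string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[integration]++
}

// snapshotAndReset swaps the map out so each interval reports only the spans
// observed since the previous report.
func (c *spanCounter) snapshotAndReset() map[string]uint32 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.counts
	c.counts = make(map[string]uint32)
	return out
}

// reportAtInterval mirrors the periodic health-metrics loop: iterate over the
// map and emit one data point per integration.
func reportAtInterval(c *spanCounter, interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			for integration, n := range c.snapshotAndReset() {
				fmt.Printf("datadog.tracer.spans_started %d integration:%s\n", n, integration)
			}
		case <-stop:
			return
		}
	}
}

func main() {
	c := &spanCounter{counts: make(map[string]uint32)}
	stop := make(chan struct{})
	go reportAtInterval(c, 10*time.Millisecond, stop)
	c.inc("chi")
	c.inc("net/http")
	c.inc("manual")
	time.Sleep(30 * time.Millisecond)
	close(stop)
}
```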

Resolved review threads (outdated): ddtrace/mocktracer/mockspan.go (×2), ddtrace/mocktracer/mockspan_test.go
hannahkm (Contributor, Author) commented:

@mtoffl01 Good points! A map would probably work better; I was hesitant at first since I didn't want to change too much of what already exists, but knowing that these metrics are pretty old... I'm more down to change it up now.

@hannahkm hannahkm marked this pull request as ready for review December 19, 2024 21:27
@hannahkm hannahkm requested review from a team as code owners December 19, 2024 21:27
darccio (Member) commented Dec 20, 2024:

@hannahkm I'm approving this, but we should investigate why the benchmarks report increased allocations.

mtoffl01 (Contributor) left a comment:

Overall, I definitely have some concerns 🤔 Maybe you can write some additional tests to provide peace of mind....

  1. Tests designed to try to make the system fail: what happens when multiple goroutines call the start-span / finish-span methods concurrently? Can we prove that we've protected against race conditions?
  2. Maybe you want to write dedicated benchmarks to show how much performance is impacted.

Resolved review threads (mostly outdated): ddtrace/tracer/tracer.go (×5), ddtrace/tracer/metrics.go (×5), ddtrace/tracer/spancontext.go
mtoffl01 (Contributor) left a comment:

It LGTM, but I would recommend more complex tests for SpansStarted and SpansFinished; e.g., generating multiple spans from different integrations and checking the reported metrics (rather than just one span).

Resolved review threads (outdated): ddtrace/mocktracer/mockspan.go, ddtrace/tracer/metrics.go
felixge (Member) left a comment:

Overall LGTM, but I think there is one small bug; see below. I might have some more feedback on the tests and benchmarks, but not enough time for it right now. I think these comments should already be useful 🙇.

@@ -523,7 +520,6 @@ func (t *trace) finishedOne(s *span) {
}

func (t *trace) finishChunk(tr *tracer, ch *chunk) {
atomic.AddUint32(&tr.spansFinished, uint32(len(ch.spans)))
felixge (Member) commented:

NIT: Why did we relocate the trigger point for the span start/finish tracking from this location to the new locations? Would be great to capture the answer to this question in the PR description.

hannahkm (Contributor, Author) replied:

Good point! I'll add that. tl;dr: finishChunk reports a slice of spans at a time, so counting there would require looping over the slice to get each span's integration value and report it as a tag. Rather than introduce that for loop, we moved the counting into span.finish().
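
For concreteness, a hypothetical sketch of the two placements; the span/chunk types and the integration field here are illustrative stand-ins, not dd-trace-go's actual types.

```go
// Illustrative stand-in types only; not dd-trace-go's implementation.
package sketch

type span struct{ integration string }
type chunk struct{ spans []*span }

// Counting at chunk flush would need an extra pass over every span.
func countAtFinishChunk(ch *chunk, inc func(integration string)) {
	for _, s := range ch.spans {
		inc(s.integration)
	}
}

// Counting inside each span's finish is a single increment, with no extra loop.
func countAtSpanFinish(s *span, inc func(integration string)) {
	inc(s.integration)
}
```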

spansStarted, spansFinished, tracesDropped uint32
// These maps count the spans started and finished from
// each component, including contribs and "manual" spans.
spansStarted, spansFinished *xsync.MapOf[string, *atomic.Int64]
felixge (Member) commented:

NIT: It would be nice to abstract the concept of a counting map into a dedicated type that lives in an internal package. We have a use for this in profiling as well.

However, I don't think this needs to be done in this PR. I can do it as a follow-up change for profiling.
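
A rough sketch of what such a shared counting type might look like, using only the standard library for illustration (the PR itself uses xsync.MapOf[string, *atomic.Int64]); the package name and API here are hypothetical.

```go
// Hypothetical internal counting-map type; a sketch, not an existing
// dd-trace-go package.
package countmap

import "sync"

// Counts is a concurrency-safe string-keyed counter that could back the
// tracer's per-integration health metrics and, potentially, profiling.
type Counts struct {
	mu sync.Mutex
	m  map[string]int64
}

func New() *Counts {
	return &Counts{m: make(map[string]int64)}
}

// Inc adds n to the counter for key.
func (c *Counts) Inc(key string, n int64) {
	c.mu.Lock()
	c.m[key] += n
	c.mu.Unlock()
}

// Flush returns the current counts and resets them, so callers can report one
// data point per key at each interval.
func (c *Counts) Flush() map[string]int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.m
	c.m = make(map[string]int64)
	return out
}
```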

Resolved review thread (outdated): ddtrace/tracer/metrics.go
value.Swap(0)
if err != nil {
log.Debug("Error while reporting spans started from integration %s: %s", key, err.Error())
}
felixge (Member) commented:

Why are we logging the error here? I mean, in general it's a good idea, but it wasn't being done in the old code.

What kind of errors could happen here? Could we end up flooding the debug log? Are we concerned about locking the xmaps for too long while doing the logging?

hannahkm (Contributor, Author) replied on Jan 21, 2025:

Generally, the idea was to be able to track down a reason for potentially missing span metrics. Looking at the code again, though, it doesn't seem that t.statsd.Count() ever actually returns an error (see the function it calls, addMetric()). So the short answer is that we're probably not worried about flooding the log or holding the lock for too long.

Previously existing calls to this function didn't check err either, so for now I'll remove the check to match those instances.

@@ -48,7 +52,7 @@ func TestReportHealthMetricsAtInterval(t *testing.T) {
var tg statsdtest.TestStatsdClient

defer func(old time.Duration) { statsInterval = old }(statsInterval)
statsInterval = time.Nanosecond
statsInterval = time.Millisecond
felixge (Member) commented:

753232f mentions this was done to fix flakiness. Was this flakiness pre-existing, or introduced by this PR? How did you verify that this fixes the problem?

hannahkm (Contributor, Author) replied:

[New commit is at bb7be55. Rebase accidentally moved some things around].

The flake was introduced by this PR. It was caused by tg.Wait(), which waits for n (in this case 4) reported metrics. Since we now report the number of spans started/finished even when the value is 0, at least two metrics are reported every statsInterval, so the wait's count can be satisfied by those periodic reports alone, which is why I decided to increase the interval here.

I verified it by running the test 1000+ times locally. It didn't fail at all, which suggests it is no longer flaky.

@hannahkm hannahkm requested review from a team as code owners January 21, 2025 19:21
@hannahkm hannahkm requested a review from liashenko January 21, 2025 19:21
@hannahkm hannahkm force-pushed the apm-rd/span-source-health-metric branch from bc2c799 to 9425178 on January 21, 2025 19:41
@hannahkm hannahkm force-pushed the apm-rd/span-source-health-metric branch from 9425178 to 4cc70c1 on January 21, 2025 19:58
@hannahkm hannahkm removed request for a team and liashenko January 21, 2025 20:53
Labels: apm:ecosystem (contrib/* related feature requests or bugs)
Projects: none yet
4 participants