Otel Exporter panics after a few minutes, complaining about invalid metrics #9336

Closed
slonka opened this issue Feb 21, 2024 · 11 comments
Labels
area/observability, area/policies, kind/bug, triage/rotten

Comments

@slonka
Contributor

slonka commented Feb 21, 2024

What happened?

Not sure whether the fault is on our side or in the Datadog exporter/mapping for OTel.

panic: runtime error: index out of range [0] with length 0

goroutine 450 [running]:
github.com/DataDog/opentelemetry-mapping-go/pkg/quantile.(*Agent).InsertInterpolate(0xc001deaf58, 0x414b774000000000, 0x3fe0000000000000, 0x0)
	github.com/DataDog/opentelemetry-mapping-go/pkg/quantile@<version>/agent.go:94 +0x4b4
github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics.(*Translator).getSketchBuckets(0xc002aefb90, {0x911ee78, 0xc002e9d7a0}, {0x7dc81df15470, 0xc001d2e540}, 0xc0020af5c0, {0xc003420c60?, 0xc00206a240?}, {0x0, 0x0, ...}, ...)
	github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics@<version>/metrics_translator.go:351 +0xaf5
github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics.(*Translator).mapHistogramMetrics(0xc002aefb90, {0x911ee78, 0xc002e9d7a0}, {0x90fc310, 0xc001d2e540}, 0x5b3a2273746e696f?, {0xc002149580?, 0xc00206a240?}, 0x0)
	github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics@<version>/metrics_translator.go:515 +0x7c7
github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics.(*Translator).mapToDDFormat(0xc002aefb90, {0x911ee78, 0xc002e9d7a0}, {0xc0024b2640?, 0xc00206a240?}, {0x90fc310?, 0xc001d2e540?}, {0xc001bc6580, 0x1, 0x4}, ...)
	github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics@<version>/metrics_translator.go:847 +0xabe
github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics.(*Translator).MapMetrics(0xc002aefb90, {0x911ee78, 0xc002e9d7a0}, {0xc0031ae000?, 0xc00206a240?}, {0x90fc310?, 0xc001d2e540?})
	github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics@<version>/metrics_translator.go:797 +0xd27
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter.(*metricsExporter).PushMetricsData(0xc002afea20, {0x911ee78, 0xc002e9d7a0}, {0xc0031ae000?, 0xc00206a240?})
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter@<version>/metrics_exporter.go:212 +0x21d
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter.(*metricsExporter).PushMetricsDataScrubbed(0xc002afea20, {0x911ee78?, 0xc002e9d7a0?}, {0xc0031ae000?, 0xc00206a240?})
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter@<version>/metrics_exporter.go:185 +0x2c
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsRequest).Export(0x0?, {0x911ee78?, 0xc002e9d7a0?})
	go.opentelemetry.io/collector/exporter@<version>/exporterhelper/metrics.go:59 +0x31
go.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send(0xc001bdd980?, {0x911ee78?, 0xc002e9d7a0?}, {0x90d5d50?, 0xc0034429f0?})
	go.opentelemetry.io/collector/exporter@<version>/exporterhelper/timeout_sender.go:43 +0x48
go.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send(0xc00280e8c0?, {0x911ee78?, 0xc002e9d7a0?}, {0x90d5d50?, 0xc0034429f0?})
	go.opentelemetry.io/collector/exporter@<version>/exporterhelper/common.go:35 +0x30
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send(0xc002d8c690, {0x911f350?, 0xc002879af0?}, {0x90d5d50?, 0xc0034429f0?})
	go.opentelemetry.io/collector/exporter@<version>/exporterhelper/metrics.go:171 +0x7e
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1({0x911f350?, 0xc002879af0?}, {0x90d5d50?, 0xc0034429f0?})
	go.opentelemetry.io/collector/exporter@<version>/exporterhelper/queue_sender.go:95 +0x84
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume(0x912a020, 0xc002d8c6f0)
	go.opentelemetry.io/collector/exporter@<version>/internal/queue/bounded_memory_queue.go:57 +0xc7
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1()
	go.opentelemetry.io/collector/exporter@<version>/internal/queue/consumers.go:43 +0x79
created by go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start in goroutine 1
	go.opentelemetry.io/collector/exporter@<version>/internal/queue/consumers.go:39 +0x7d
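
For context, the failing frame reports an out-of-range index on an empty slice. Below is a minimal, hypothetical Go sketch of that failure class — a stand-in, not the actual opentelemetry-mapping-go code — assuming the translator produced a histogram data point with an empty bucket-counts slice:

package main

import "fmt"

// insertInterpolate is a hypothetical stand-in for the failing frame,
// quantile.(*Agent).InsertInterpolate: code that assumes at least one
// bucket panics with "index out of range [0] with length 0" when handed
// an empty slice.
func insertInterpolate(bucketCounts []uint64) {
	first := bucketCounts[0] // panics when len(bucketCounts) == 0
	fmt.Println(first)
}

func main() {
	insertInterpolate(nil) // e.g. a histogram data point with no buckets
}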

Repro / setup:

kubectl --context $CTX_CLUSTER3 create namespace observability

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

# otel collector config via helm
cat > otel-config-datadog.yaml <<EOF
mode: deployment
config:
  exporters:
    datadog:
      api:
        site: datadoghq.eu
        key: <key>
  service:
    pipelines:
      logs:
        exporters:
          - datadog
      traces:
        exporters:
          - datadog
      metrics:
        exporters:
          - datadog
EOF

helm upgrade --install \
  --kube-context ${CTX_CLUSTER3} \
  -n observability \
  --set mode=deployment \
  -f otel-config-datadog.yaml \
  opentelemetry-collector open-telemetry/opentelemetry-collector

# enable Metrics
kumactl apply -f - <<EOF
type: MeshMetric
name: metrics-default
mesh: default
spec:
  targetRef:
    kind: Mesh
  # applications:
  #  - name: "backend"
  default:
    backends:
    - type: OpenTelemetry
      openTelemetry: 
        endpoint: "opentelemetry-collector.observability.svc:4317"
EOF
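
To observe the crash, tail the collector logs once the data plane starts pushing metrics (the deployment name below is assumed from the Helm release name used above):

kubectl --context ${CTX_CLUSTER3} -n observability \
  logs deploy/opentelemetry-collector -f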
@slonka added the triage/pending, kind/bug, and area/observability labels Feb 21, 2024
@slonka
Contributor Author

slonka commented Feb 21, 2024

Original author @bcollard

@Automaat
Contributor

Automaat commented Feb 26, 2024

We can add a debug exporter, for example:

metrics:
  exporters:
    - datadog
    - debug

This will log all collected metrics, so we can find the metrics on which the Datadog exporter fails and open an issue against the OpenTelemetry Collector. @bcollard could you look at it?

@jakubdyszkiewicz added triage/needs-information and removed triage/pending labels Feb 26, 2024
@bcollard

otel-exporter-2.log
otel-exporter-1.log

The OTel Collector keeps crashing with the debug exporter enabled for metrics.

@Automaat
Contributor

Automaat commented Mar 4, 2024

I see that I forgot about the rest of the debug exporter config. @bcollard can you run this again with this config:

mode: deployment
config:
  exporters:
    debug:
      verbosity: detailed
    datadog:
      api:
        site: datadoghq.eu
        key: <key>
  service:
    pipelines:
      logs:
        exporters:
          - datadog
      traces:
        exporters:
          - datadog
      metrics:
        exporters:
          - datadog
          - debug

This should properly log the collected metrics so we can debug further.
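
Assuming the same Helm-based setup as in the original report, the updated values can be rolled out by saving them over otel-config-datadog.yaml and re-running the upgrade:

helm upgrade --install \
  --kube-context ${CTX_CLUSTER3} \
  -n observability \
  -f otel-config-datadog.yaml \
  opentelemetry-collector open-telemetry/opentelemetry-collector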

@bcollard

bcollard commented Mar 4, 2024

Attached here:
otel-cluster1.log
otel-cluster2.log

"kuma" appears a lot in the otel-cluster1.log file, not in the other.

@Automaat
Contributor

The logs look fine, but we could also verify whether this is a Datadog-exporter-only issue by pushing metrics to another SaaS product, such as Grafana, and checking whether the problem persists. There is an example of how to set this up in the demo-scene repo. Could you try this without the Datadog exporter, @bcollard?
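
A minimal sketch of swapping in Grafana Cloud via the collector's otlphttp exporter; the endpoint and credentials below are placeholders (the real values come from the Grafana Cloud stack settings):

mode: deployment
config:
  exporters:
    debug:
      verbosity: detailed
    otlphttp:
      endpoint: https://otlp-gateway-<region>.grafana.net/otlp  # placeholder
      headers:
        Authorization: Basic <base64 of instance-id:token>      # placeholder
  service:
    pipelines:
      metrics:
        exporters:
          - otlphttp
          - debug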

@Automaat reopened this Mar 12, 2024
@github-actions
Contributor

github-actions bot commented Mar 12, 2024

Removing closed state labels due to the issue being reopened.

@github-actions bot added the triage/pending label Mar 12, 2024
@jakubdyszkiewicz removed the triage/pending label Mar 12, 2024
@jakubdyszkiewicz added triage/accepted and removed triage/needs-information labels Mar 25, 2024
@github-actions bot added the triage/stale label Jul 3, 2024
@github-actions
Contributor

github-actions bot commented Jul 3, 2024

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@jakubdyszkiewicz removed the triage/stale label Jul 10, 2024
@github-actions bot added the triage/stale label Oct 9, 2024
@github-actions
Contributor

github-actions bot commented Oct 9, 2024

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@lukidzi
Contributor

lukidzi commented Oct 21, 2024

@Automaat Is it still the case?

@lukidzi added triage/pending and removed triage/stale, triage/accepted labels Oct 21, 2024
@jakubdyszkiewicz removed the triage/pending label Oct 28, 2024
@jakubdyszkiewicz added the triage/needs-reproducing label Oct 28, 2024
@jakubdyszkiewicz
Contributor

We tried to reproduce it on Test Friday and did not see this problem.
We tested otel/opentelemetry-collector-contrib:0.115.1 with Kong Mesh 2.9.2.
Please reopen if this is still a problem.

@jakubdyszkiewicz closed this as not planned Jan 10, 2025
@jakubdyszkiewicz added triage/rotten and removed triage/needs-reproducing labels Jan 10, 2025