Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create token metrics only when they are available #1092

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

eero-t
Copy link
Contributor

@eero-t eero-t commented Dec 30, 2024

Description

This avoids generating useless token / request histogram metrics for services that use Orchestrator class, but never call its token processing functionality. Such dummy metrics can confuse telemetry users.

(It also helps in differentiating frontend megaservice metrics from backend megaservice ones, especially when multiple OPEA applications with wrapper microservices run in the same cluster.)

Issues

n/a.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Dependencies

n/a.

Tests

Manual testing with latest versions, to verify that:

  • services processing tokens, generate token histogram metrics
  • ones not processing them, produce only pending requests gauge

Copy link

codecov bot commented Dec 30, 2024

Codecov Report

Attention: Patch coverage is 96.42857% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
comps/cores/mega/orchestrator.py 96.42% 1 Missing ⚠️
Files with missing lines Coverage Δ
comps/cores/mega/orchestrator.py 91.36% <96.42%> (+0.40%) ⬆️

@eero-t
Copy link
Contributor Author

eero-t commented Jan 3, 2025

@Spycsh Could you review this?

(And maybe also #1107.)

@eero-t
Copy link
Contributor Author

eero-t commented Jan 3, 2025

Rebased to main.

@Spycsh
Copy link
Member

Spycsh commented Jan 6, 2025

This avoids generating useless token / request histogram metrics for services that use Orchestrator class, but never call its token processing functionality. Such dummy metrics can confuse telemetry users.

Why will this happen? Metrics are only updated when calling self.metrics.pending_update-like methods in schedule, right? These are all controllable code. So what you mean is that there are some other thing that can update the metrics?

@eero-t
Copy link
Contributor Author

eero-t commented Jan 7, 2025

Why will this happen?

@Spycsh Because Prometheus client will start providing metrics after they've been created.

In current code, all metrics are created when Orchestrator/OrhestratorMetrics is instantiated: https://github.com/opea-project/GenAIComps/blob/main/comps/cores/mega/orchestrator.py#L33

Metrics are only updated when calling self.metrics.pending_update-like methods in schedule, right?

Those methods only update the value of the metric, they do not create them. This PR changes Histogram metric creation to be delayed until first call of the update methods.

@eero-t
Copy link
Contributor Author

eero-t commented Jan 7, 2025

I dropped pending metric doc update & rebased to main. I'll have it in separate PR where I fix additional issues I noticed, which require pending requests metric type / name change.

@eero-t
Copy link
Contributor Author

eero-t commented Jan 7, 2025

opea/dataprep-redis:latest does not seem to generate megaservice_* metrics anymore.

Has ServiceOrchestrator use been dropped from backend services?

@eero-t
Copy link
Contributor Author

eero-t commented Jan 7, 2025

I dropped pending metric doc update & rebased to main. I'll have it in separate PR where I fix additional issues I noticed, which require pending requests metric type / name change.

Could not find any good fix for it, so I just filed a ticket on it: #1121

@Spycsh
Copy link
Member

Spycsh commented Jan 8, 2025

This avoids generating useless token / request histogram metrics for services that use Orchestrator class, but never call its token processing functionality. Such dummy metrics can confuse telemetry users.

(It also helps in differentiating frontend megaservice metrics from backend megaservice ones, especially when multiple OPEA applications with wrapper microservices run in the same cluster.)

OK so what you mean it that the dummy metrics will show zeros after initialization and before the first request and users should not see wrong values of request number... But you think the k8s will scrape the metrics even there are no requests and it is resource-consuming so you decide to delay the initialization only when there are requests. I agree with this approach.

@Spycsh
Copy link
Member

Spycsh commented Jan 8, 2025

opea/dataprep-redis:latest does not seem to generate megaservice_* metrics anymore.

Has ServiceOrchestrator use been dropped from backend services?

dataprep microservice itself should not generate megaservice_* metrics. Only megaservices like opea/chatqna do.

@eero-t
Copy link
Contributor Author

eero-t commented Jan 8, 2025

OK so what you mean it that the dummy metrics will show zeros after initialization and before the first request and users should not see wrong values of request number...

Technically the zero counts are not wrong, but presence of token / LLM metrics is misleading for services that will never generate tokens (or use LLM). That's the main reason for this PR.

But you think the k8s will scrape the metrics even there are no requests and it is resource-consuming so you decide to delay the initialization only when there are requests. I agree with this approach.

Visibility

All OPEA originated services use HttpService i.e. provide HTTP access metrics [1]. To see those, serviceMonitors are installed for them when monitoring option is enabled in Helm charts. Meaning that any megaservice_* metrics they generate, will also be visible to user e.g. in Grafana.

Perf

I doubt skipping generation of extra metrics has any noticeable perf impact on the service providing the metrics (currently serviceMonitors are configured to poll them at 5s interval), but every little bit can help.

Each Prometheus Histogram type provides about dozen different metrics, and in larger clusters, amount of metrics needs to be reduced to keep telemetry stack resource usage & perf reasonable. Telemetry stack resource usage should be significant concern only when there's larger number of such pods though.


[1] There's large number of HTTP metrics, and some Python ones too. It would be good to have controls for limiting those in larger clusters, but I did not see any options for that in prometheus_fastapi_instrumentator API.

@eero-t
Copy link
Contributor Author

eero-t commented Jan 10, 2025

@Spycsh from you comment in the bug #1121 (comment)

I realized that changing the method on first metric access is racy. It's possible that multiple threads end up in create method, before that method is changed to update one. Meaning that multiple identical metrics would be created, and Prometheus would barf on that.

=> I'll add lock & check to handle that.

This avoids generation of useless token/request histogram metrics
for services that use Orchestrator class, but never call its token
processing functionality.

(Helps in differentiating frontend megaservice metrics from backend
megaservice ones, especially when multiple OPEA applications run in
the same cluster.)

Also change Orchestrator CI test workaround to use unique prefix for
each metric instance, instead of metrics being (singleton) class
variables.

Signed-off-by: Eero Tamminen <[email protected]>
As that that could be called from multiple request handling threads.

Signed-off-by: Eero Tamminen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants