Adapt Periodic Puma Metrics #3521
Conversation
The current 'thread_info' metrics are tightly coupled to Thin/EventMachine, so don't emit them when running the Puma webserver; other thread-related metrics for Puma may be added at a later point in time.
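A guard for this could look roughly like the following sketch (the `webserver` config key and the surrounding method names are assumptions for illustration, not necessarily the PR's actual code):

```ruby
# Illustrative sketch: skip the EventMachine-specific thread metrics
# when Cloud Controller is configured to run Puma.
def update_thread_info!
  return if config.get(:webserver) == 'puma' # no EM reactor to inspect

  thread_info = {
    thread_count: Thread.list.size,
    event_machine: {
      connection_count: EventMachine.connection_count
    }
  }
  updaters.each { |updater| updater.update_thread_info(thread_info) }
end
```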
Return the summed-up memory bytes and CPU for the Puma process and its worker subprocesses.
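One hedged way to implement this, assuming `ps` is available and that the workers are direct children of the Puma main process (method name and fields are illustrative):

```ruby
# Illustrative sketch: sum RSS (converted to bytes) and %CPU over the
# Puma main process and its direct children, as reported by `ps`.
def puma_process_stats
  main_pid = Process.pid
  rows = `ps ax -o pid=,ppid=,rss=,pcpu=`.each_line.map(&:split)
  family = rows.select { |pid, ppid, _rss, _pcpu| [pid.to_i, ppid.to_i].include?(main_pid) }
  {
    mem_bytes: family.sum { |_pid, _ppid, rss, _pcpu| rss.to_i * 1024 }, # ps reports RSS in KiB
    cpu: family.sum { |_pid, _ppid, _rss, pcpu| pcpu.to_f }
  }
end
```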
- Replace VCAP::CloudController::Metrics::StatsdUpdater.new with CloudController::DependencyLocator.instance.statsd_updater
- Replace Statsd.new with CloudController::DependencyLocator.instance.statsd_client (locator wiring sketched below)
- Add statsd_host + statsd_port to ClockSchema; don't rely on default values
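A sketch of the locator side of this change; the config keys follow the commit message, but the constructor arguments (statsd-ruby's client takes host and port) and whether the updater receives the client are assumptions:

```ruby
# Illustrative sketch: expose one shared statsd client and updater via
# the dependency locator instead of constructing them at each call site.
module CloudController
  class DependencyLocator
    def statsd_client
      @statsd_client ||= Statsd.new(config.get(:statsd_host), config.get(:statsd_port))
    end

    def statsd_updater
      @statsd_updater ||= VCAP::CloudController::Metrics::StatsdUpdater.new(statsd_client)
    end
  end
end
```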
When using Puma, the periodic_updater shall run in the main process only; thus direct invocations of 'update!' from a controller (executed within a worker process) must be removed.
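A rough sketch of how main-process-only scheduling could be wired up via Puma's launcher events; the launcher setup, interval, and updater names are assumptions:

```ruby
# Illustrative sketch: start the periodic updater from the Puma main
# process once the server has booted; workers never schedule it.
launcher = Puma::Launcher.new(puma_configuration)
launcher.events.on_booted do
  Thread.new do
    loop do
      periodic_updater.update! # emits once per VM, not once per worker
      sleep 30                 # assumed interval
    end
  end
end
launcher.run
```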
- When running Puma, the DirectFileStore data store is used for Prometheus metrics (see the sketch below).
- Aggregation methods for (gauge) metrics are specified and applied when using the DirectFileStore.
- For 'cc_requests_outstanding_total' the values per worker are summed up.
- For metrics that are collected by the periodic_updater running in the main process, the aggregation method MOST_RECENT has been specified to eliminate the unnecessary 'pid' label.
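With the prometheus-client gem this maps to the DirectFileStore plus per-metric store_settings; a minimal sketch (the directory and the second metric name are illustrative):

```ruby
require 'prometheus/client'
require 'prometheus/client/data_stores/direct_file_store'

# Each process writes its values to files in this directory; the
# exporter aggregates across those files at scrape time.
Prometheus::Client.config.data_store =
  Prometheus::Client::DataStores::DirectFileStore.new(dir: '/tmp/cc_prometheus')

registry = Prometheus::Client.registry

# Summed over all worker processes -> one value per VM:
registry.gauge(:cc_requests_outstanding_total,
               docstring: 'Requests outstanding',
               store_settings: { aggregation: :sum })

# Written only by the main process; :most_recent avoids per-pid series:
registry.gauge(:cc_thread_count,
               docstring: 'Number of threads (illustrative metric)',
               store_settings: { aggregation: :most_recent })
```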
Looks good! Just a few conceptual questions.
Review comment on:

```ruby
  File.unlink(file_path)
end

Prometheus::Client.config.data_store = Prometheus::Client::DataStores::DirectFileStore.new(dir: prometheus_dir)
```
What's the reason for using a file datastore instead of an in-memory one, if we want to reset the metrics on startup anyway?
The in-memory data store does not work for metrics collected by different processes (e.g. the outstanding requests), so we need some kind of "central" storage; the direct file store is the easiest option.
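A tiny illustration of the failure mode with the default in-memory store (metric name invented; requires a fork-capable platform):

```ruby
require 'prometheus/client'

# With the default in-memory store, each process only sees its own
# copy of the metric after fork - the child's update is lost.
gauge = Prometheus::Client.registry.gauge(:demo_outstanding, docstring: 'demo gauge')
gauge.set(5)

pid = fork { gauge.set(8) } # happens in the child's copy of memory
Process.wait(pid)

puts gauge.get # => 5.0; the child's value never reaches the parent
```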
Review comment on:

```diff
   { type: :counter, name: :cc_staging_requests_total, docstring: 'Number of staging requests' },
   { type: :histogram, name: :cc_staging_succeeded_duration_seconds, docstring: 'Durations of successful staging events', buckets: DURATION_BUCKETS },
   { type: :histogram, name: :cc_staging_failed_duration_seconds, docstring: 'Durations of failed staging events', buckets: DURATION_BUCKETS },
-  { type: :gauge, name: :cc_requests_outstanding_total, docstring: 'Requests outstanding' },
+  { type: :gauge, name: :cc_requests_outstanding_total, docstring: 'Requests outstanding', aggregation: :sum },
```
It would be interesting to see what the metrics look like if we don't aggregate them. Would the different values emitted by each worker be stored in the same time series, using a label like "worker", or would they simply overwrite each other and lead to wrong metrics?
The metrics would be labelled with the pid of the worker processes. I'm in favor of aggregating them to be compatible with the Thin metrics, i.e. to have one metric per VM. In the end we are not even interested in the outstanding requests per VM but across all APIs, so I don't see a use case for having metrics per process.
Description
- The current thread_info is tightly coupled to Thin/EventMachine; don't emit those metrics when running the Puma webserver; maybe we will come up with other thread-related metrics for Puma at a later point in time.
- When using Puma, the periodic_updater shall run in the main process only; thus direct invocations of update! from a controller (executed within a worker process) must be removed.
- When running Puma, the DirectFileStore data store is used for Prometheus metrics.
- Aggregation methods for (gauge) metrics are specified and applied when using the DirectFileStore.
- For cc_requests_outstanding_total the values per worker are summed up.
- For metrics that are collected by the periodic_updater running in the main process, the aggregation method MOST_RECENT has been specified to eliminate the unnecessary pid label.
- Replace VCAP::CloudController::Metrics::StatsdUpdater.new with CloudController::DependencyLocator.instance.statsd_updater.
- Replace Statsd.new with CloudController::DependencyLocator.instance.statsd_client.
- Add statsd_host + statsd_port to ClockSchema; don't rely on default values (see "Add statsd_host + statsd_port to cloud_controller_clock job" capi-release#359).

Checklist
- I have reviewed the contributing guide
- I have viewed, signed, and submitted the Contributor License Agreement
- I have made this pull request to the main branch
- I have run all the unit tests using bundle exec rake
- I have run CF Acceptance Tests