title | authors | creation-date | last-updated | status
---|---|---|---|---
tekton-metrics | | 2020-07-13 | 2020-07-13 | proposed
Add a set of metrics and tracing for monitoring and measuring the performance of Tekton pipeline runs. These metrics target the time spent in different parts of a pipeline run, including overall execution, reconciling logic, fetching resources, pulling images, and running containers.
Currently there is only one metric, which captures the end-to-end time of pipeline runs. More granular metrics are needed to investigate possible regressions caused by Tekton changes and possible causes of slow Tekton pipelines in production. They would help narrow down regressions and help Tekton developers and users find the root cause faster.
- Allow currently supported third-party metric backends to get a more granular view of different parts of a pipeline run.
- Add a handful of (sub-)metrics that are believed to be useful to the current implementation, while leaving the door open to add more in the future if needed.
- Add support for more metric backends.
- Migrate the current way of reporting metrics (which is OpenCensus via Knative libraries) to the new OpenTelemetry.
- Implement and document the new (sub-)metrics.
- Add telemetry tests based on the current value of the metrics.
The new metrics will have unit tests that verify the recording of the metrics, as is done for the existing end-to-end metric.
To prevent regressions in the metrics caused by changes in Tekton, there will be e2e tests that measure the metrics and check them against expected values. One challenge is the inherent flakiness of metric values across test runs. To overcome that, the telemetry tests would need to run multiple times and compare the median or 95th percentile against the expected value within a tolerance range.
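
The comparison described above could look roughly like the following sketch. It is illustrative only and not part of the proposal. The helper names `percentile` and `withinTolerance`, the nearest-rank percentile method, the sample values, the baseline, and the 20% tolerance are all assumptions chosen for the example. Durations are plain `float64` seconds.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0 < p <= 100) of samples using the
// nearest-rank method. It returns 0 for an empty slice.
// Hypothetical helper, not an existing Tekton function.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// withinTolerance reports whether observed deviates from baseline by at most
// the given fraction of baseline (for example 0.2 for 20%).
// Hypothetical helper, not an existing Tekton function.
func withinTolerance(observed, baseline, tolerance float64) bool {
	return math.Abs(observed-baseline) <= tolerance*baseline
}

func main() {
	// Made-up durations (seconds) from repeated runs of one telemetry test.
	runs := []float64{9, 10, 11, 10, 30}
	baseline := 10.0 // assumed expected duration in seconds
	tolerance := 0.2 // assumed acceptable relative deviation

	p50 := percentile(runs, 50)
	p95 := percentile(runs, 95)

	fmt.Println("median within tolerance:", withinTolerance(p50, baseline, tolerance))
	fmt.Println("p95 within tolerance:", withinTolerance(p95, baseline, tolerance))
}
```

With a single slow outlier among the runs, the median check is unaffected while the 95th-percentile check is not, which is one reason to pick the statistic per metric rather than globally.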