-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: First iteration of a prometheus exporter for ara #483
Draft
dmsimard
wants to merge
9
commits into
ansible-community:master
Choose a base branch
from
dmsimard:prometheus_exporter
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
Show all changes
9 commits
Select commit
Hold shift + click to select a range
98cb485
WIP: contrib: A prometheus exporter for ara
dmsimard 14741ad
WIP: prometheus_exporter iteration 2
dmsimard 86dfdf8
WIP: prometheus_exporter iteration 3
dmsimard 7c01a0b
WIP: prometheus_exporter iteration 4
dmsimard 479ce45
WIP: prometheus_exporter iteration 5
dmsimard 0355d43
WIP: prometheus_exporter iteration 6
dmsimard 2d7cd30
WIP: prometheus_exporter iteration 7
dmsimard 7558a6f
WIP: prometheus_exporter iteration 8
dmsimard 6283872
WIP: prometheus_exporter iteration 9
dmsimard File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,282 @@ | ||
# Copyright (c) 2023 The ARA Records Ansible authors | ||
# GNU General Public License v3.0+ (see COPYING or https://www.gnu.org/licenses/gpl-3.0.txt) | ||
|
||
import logging | ||
import sys | ||
import time | ||
from collections import defaultdict | ||
from datetime import datetime, timedelta | ||
|
||
from cliff.command import Command | ||
|
||
import ara.cli.utils as cli_utils | ||
from ara.cli.base import global_arguments | ||
from ara.clients.utils import get_client | ||
|
||
try: | ||
from prometheus_client import Gauge, Summary, start_http_server | ||
|
||
HAS_PROMETHEUS_CLIENT = True | ||
except ImportError: | ||
HAS_PROMETHEUS_CLIENT = False | ||
|
||
# Where possible and relevant, apply these labels to the metrics so we can write prometheus | ||
# queries to filter and aggregate by these properties | ||
# TODO: make configurable | ||
DEFAULT_PLAYBOOK_LABELS = [ | ||
"ansible_version", | ||
"client_version", | ||
"controller", | ||
"name", | ||
"path", | ||
"python_version", | ||
"server_version", | ||
"status", | ||
"updated", | ||
"user", | ||
] | ||
DEFAULT_TASK_LABELS = ["action", "name", "path", "playbook", "status", "updated"] | ||
DEFAULT_HOST_LABELS = ["name", "playbook", "updated"] | ||
|
||
|
||
# TODO: This could be made more flexible and live in a library | ||
def get_search_results(client, kind, limit, created_after): | ||
""" | ||
kind: string, one of ["playbooks", "hosts", "tasks"] | ||
limit: int, the number of items to return per page | ||
created_after: string, a date formatted as such: 2020-01-31T15:45:36.737000Z | ||
""" | ||
query = f"/api/v1/{kind}?order=-id&limit={limit}" | ||
if created_after is not None: | ||
query += f"&created_after={created_after}" | ||
|
||
response = client.get(query) | ||
items = response["results"] | ||
|
||
# Iterate through multiple pages of results if necessary | ||
while response["next"]: | ||
# For example: | ||
# "next": "https://demo.recordsansible.org/api/v1/playbooks?limit=1000&offset=2000", | ||
uri = response["next"].replace(client.endpoint, "") | ||
response = client.get(uri) | ||
items.extend(response["results"]) | ||
|
||
return items | ||
|
||
|
||
class AraPlaybookCollector(object): | ||
def __init__(self, client, log, limit, labels=DEFAULT_PLAYBOOK_LABELS): | ||
self.client = client | ||
self.log = log | ||
self.limit = limit | ||
self.labels = labels | ||
|
||
self.metrics = { | ||
"range": Gauge("ara_playbooks_range", "Limit metric collection to the N most recent playbooks"), | ||
"total": Gauge("ara_playbooks_total", "Total number of playbooks recorded by ara"), | ||
"playbooks": Summary( | ||
"ara_playbooks", "Labels and duration (in seconds) of playbooks recorded by ara", labels | ||
), | ||
} | ||
self.metrics["range"].set(self.limit) | ||
|
||
def collect_metrics(self, created_after=None): | ||
playbooks = get_search_results(self.client, "playbooks", self.limit, created_after) | ||
# Save the most recent timestamp so we only scrape beyond it next time | ||
if playbooks: | ||
created_after = cli_utils.increment_timestamp(playbooks[0]["created"]) | ||
self.log.info(f"updating metrics for {len(playbooks)} playbooks...") | ||
|
||
for playbook in playbooks: | ||
# The API returns a duration in string format, convert it back to seconds | ||
# so we can use it as a value for the metric. | ||
if playbook["duration"] is not None: | ||
# TODO: parse_timedelta throws an exception for playbooks that last longer than a day | ||
# That was meant to be fixed in https://github.com/ansible-community/ara/commit/db8243c3af938ece12c9cd59dd7fe4d9a711b76d | ||
try: | ||
seconds = cli_utils.parse_timedelta(playbook["duration"]) | ||
except ValueError: | ||
seconds = 0 | ||
else: | ||
seconds = 0 | ||
|
||
# Gather the values of each label so we can attach them to our metrics | ||
labels = {label: playbook[label] for label in self.labels} | ||
|
||
self.metrics["playbooks"].labels(**labels).observe(seconds) | ||
self.metrics["total"].inc() | ||
|
||
return created_after | ||
|
||
|
||
class AraTaskCollector(object): | ||
def __init__(self, client, log, limit, labels=DEFAULT_TASK_LABELS): | ||
self.client = client | ||
self.log = log | ||
self.limit = limit | ||
self.labels = labels | ||
|
||
self.metrics = { | ||
"range": Gauge("ara_tasks_range", "Limit metric collection to the N most recent tasks"), | ||
"total": Gauge("ara_tasks_total", "Number of tasks recorded by ara in prometheus"), | ||
"tasks": Summary("ara_tasks", "Labels and duration, in seconds, of playbook tasks recorded by ara", labels), | ||
} | ||
self.metrics["range"].set(self.limit) | ||
|
||
def collect_metrics(self, created_after=None): | ||
tasks = get_search_results(self.client, "tasks", self.limit, created_after) | ||
# Save the most recent timestamp so we only scrape beyond it next time | ||
if tasks: | ||
created_after = cli_utils.increment_timestamp(tasks[0]["created"]) | ||
self.log.info(f"updating metrics for {len(tasks)} tasks...") | ||
|
||
for task in tasks: | ||
# The API returns a duration in string format, convert it back to seconds | ||
# so we can use it as a value for the metric. | ||
if task["duration"] is not None: | ||
# TODO: parse_timedelta throws an exception for tasks that last longer than a day | ||
# That was meant to be fixed in https://github.com/ansible-community/ara/commit/db8243c3af938ece12c9cd59dd7fe4d9a711b76d | ||
try: | ||
seconds = cli_utils.parse_timedelta(task["duration"]) | ||
except ValueError: | ||
seconds = 0 | ||
else: | ||
seconds = 0 | ||
|
||
# Gather the values of each label so we can attach them to our metrics | ||
labels = {label: task[label] for label in self.labels} | ||
|
||
self.metrics["tasks"].labels(**labels).observe(seconds) | ||
self.metrics["total"].inc() | ||
|
||
return created_after | ||
|
||
|
||
class AraHostCollector(object): | ||
def __init__(self, client, log, limit, labels=DEFAULT_HOST_LABELS): | ||
self.client = client | ||
self.log = log | ||
self.limit = limit | ||
self.labels = labels | ||
|
||
self.metrics = { | ||
"changed": Gauge("ara_hosts_changed", "Number of changes on a host", labels), | ||
"failed": Gauge("ara_hosts_failed", "Number of failures on a host", labels), | ||
"ok": Gauge("ara_hosts_ok", "Number of successful tasks without changes on a host", labels), | ||
"range": Gauge("ara_hosts_range", "Limit metric collection to the N most recent hosts"), | ||
"skipped": Gauge("ara_hosts_skipped", "Number of skipped tasks on a host", labels), | ||
"total": Gauge("ara_hosts_total", "Hosts recorded by ara"), | ||
"unreachable": Gauge("ara_hosts_unreachable", "Number of unreachable errors on a host", labels), | ||
} | ||
self.metrics["range"].set(self.limit) | ||
|
||
def collect_metrics(self, created_after=None): | ||
hosts = get_search_results(self.client, "hosts", self.limit, created_after) | ||
# Save the most recent timestamp so we only scrape beyond it next time | ||
if hosts: | ||
created_after = cli_utils.increment_timestamp(hosts[0]["created"]) | ||
self.log.info(f"updating metrics for {len(hosts)} hosts...") | ||
|
||
for host in hosts: | ||
self.metrics["total"].inc() | ||
|
||
# Gather the values of each label so we can attach them to our metrics | ||
labels = {label: host[label] for label in self.labels} | ||
|
||
# The values of "changed", "failed" and so on are integers so we can | ||
# use them as values for our metric | ||
for status in ["changed", "failed", "ok", "skipped", "unreachable"]: | ||
if host[status]: | ||
self.metrics[status].labels(**labels).set(host[status]) | ||
|
||
return created_after | ||
|
||
|
||
class PrometheusExporter(Command): | ||
"""Exposes a prometheus exporter to provide metrics from an instance of ara""" | ||
|
||
log = logging.getLogger(__name__) | ||
|
||
def get_parser(self, prog_name): | ||
parser = super().get_parser(prog_name) | ||
parser = global_arguments(parser) | ||
# fmt: off | ||
parser.add_argument( | ||
'--playbook-limit', | ||
help='Max number of playbooks to request at once (default: 1000)', | ||
default=1000, | ||
type=int | ||
) | ||
parser.add_argument( | ||
'--task-limit', | ||
help='Max number of tasks to request at once (default: 2500)', | ||
default=2500, | ||
type=int | ||
) | ||
parser.add_argument( | ||
'--host-limit', | ||
help='Max number of hosts to request at once (default: 2500)', | ||
default=2500, | ||
type=int | ||
) | ||
parser.add_argument( | ||
'--poll-frequency', | ||
help='Seconds to wait until querying ara for new metrics (default: 60)', | ||
default=60, | ||
type=int | ||
) | ||
parser.add_argument( | ||
'--prometheus-port', | ||
help='Port on which the prometheus exporter will listen (default: 8001)', | ||
default=8001, | ||
type=int | ||
) | ||
parser.add_argument( | ||
'--max-days', | ||
help='Maximum number of days to backfill metrics for (default: 90)', | ||
default=90, | ||
type=int | ||
) | ||
return parser | ||
|
||
def take_action(self, args): | ||
if not HAS_PROMETHEUS_CLIENT: | ||
self.log.error("The prometheus_client python package must be installed to run this command") | ||
sys.exit(2) | ||
|
||
verify = False if args.insecure else True | ||
if args.ssl_ca: | ||
verify = args.ssl_ca | ||
client = get_client( | ||
client=args.client, | ||
endpoint=args.server, | ||
timeout=args.timeout, | ||
username=args.username, | ||
password=args.password, | ||
cert=args.ssl_cert, | ||
key=args.ssl_key, | ||
verify=verify, | ||
run_sql_migrations=False, | ||
) | ||
|
||
# Prepare collectors so we can gather various metrics | ||
playbooks = AraPlaybookCollector(client=client, log=self.log, limit=args.playbook_limit) | ||
hosts = AraHostCollector(client=client, log=self.log, limit=args.host_limit) | ||
tasks = AraTaskCollector(client=client, log=self.log, limit=args.task_limit) | ||
|
||
start_http_server(args.prometheus_port) | ||
self.log.info(f"ara prometheus exporter listening on http://0.0.0.0:{args.prometheus_port}/metrics") | ||
|
||
created_after = (datetime.now() - timedelta(days=args.max_days)).isoformat() | ||
self.log.info( | ||
f"Backfilling metrics for the last {args.max_days} days since {created_after}... This can take a while." | ||
) | ||
|
||
latest = defaultdict(lambda: created_after) | ||
while True: | ||
latest["playbooks"] = playbooks.collect_metrics(latest["playbooks"]) | ||
latest["hosts"] = hosts.collect_metrics(latest["hosts"]) | ||
latest["tasks"] = tasks.collect_metrics(latest["tasks"]) | ||
|
||
time.sleep(args.poll_frequency) | ||
self.log.info("Checking for updated metrics...") |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it could be interesting for the exporter to be able to filter queries like the general CLI commands work, for example
ara playbook list
(docs) has: