Sciuro is a bridge between Alertmanager and Kubernetes that syncs alerts to Node Conditions. It is designed to work in tandem with other controllers that observe Node Conditions, such as draino or the cluster-api.
- Alertmanager API v2
- Kubernetes 1.12+
- Download the manifests from the latest GitHub release:

  ```
  wget https://github.com/cloudflare/sciuro/releases/latest/download/cluster.yaml
  wget https://github.com/cloudflare/sciuro/releases/latest/download/stable.yaml
  ```
- Apply the cluster-scoped resources that allow Sciuro to read nodes and modify their status. If you choose a different namespace, adjust the namespace name and ClusterRoleBinding accordingly.

  ```
  # Review manifests and make adjustments for a different namespace
  kubectl apply -f cluster.yaml
  ```
- Edit the `sciuro` ConfigMap, referencing the Sciuro Configuration section below, then apply the namespaced resources.

  ```
  # Review manifests and make adjustments to the config map
  kubectl apply -f stable.yaml
  ```
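After both applies, you can confirm the controller is running. The `sciuro` namespace and Deployment name below are assumptions; substitute whatever the manifests in your release create:

```
kubectl -n sciuro get deployments,pods
```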
The following environment variables can be set to configure Sciuro. Modifying the supplied ConfigMap will set the environment variables on the Deployment.
You must set the URL of the Alertmanager instance to sync from. In addition, filtering should be configured at two levels: the Alertmanager receiver filters alerts globally (server side), while the node filters match alerts to a specific node.
```
# AlertmanagerURL is the URL of the Alertmanager instance to sync from
SCIURO_ALERTMANAGER_URL: "https://CHANGEME.example.com"
# AlertReceiver is the receiver to use for server-side filtering of alerts.
# It must be the same across all targeted nodes in the cluster.
SCIURO_ALERT_RECEIVER: "CHANGEME"
# NodeFiltersTemplate is a Go template that renders a comma-separated list of
# filters to use for each node. These filters are logically OR'd when
# associating alerts with a node. Two variables are available for substitution:
# FullName and ShortName, where ShortName is FullName up to the first . (dot)
SCIURO_NODE_FILTERS: "instance=~({{.FullName}}|{{.ShortName}})"
```
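For example, for a node whose Kubernetes name is `worker01.example.com` (a hypothetical name), the default template above renders the following filter, matching alerts whose `instance` label equals either the full or the short name:

```
instance=~(worker01.example.com|worker01)
```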
Some additional optional settings are as follows:
```
# AlertSilenced controls whether silenced alerts are retrieved from Alertmanager
SCIURO_ALERT_SILENCED: "false"
# AlertCacheTTL is the time between fetching alerts
SCIURO_ALERT_CACHE_TTL: "60s"
```
The following optional settings configure how reconciliation with the Kubernetes node resources behaves:
```
# NodeResync is the period at which a node fully syncs with the current alerts
SCIURO_NODE_RESYNC: "2m"
# ReconcileTimeout is the maximum time given to reconcile a node
SCIURO_RECONCILE_TIMEOUT: "45s"
# LingerResolvedDuration is the time that non-firing alerts are kept as conditions
# with the False status. After this time, the condition will be removed entirely.
# A value of 0 means these conditions are never removed.
SCIURO_LINGER_DURATION: "96h"
```
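As an illustration of the linger behavior, a condition for an alert that has stopped firing stays on the node with status `False` until `SCIURO_LINGER_DURATION` elapses. The field values below are hypothetical, and `reason` and `message` are omitted because their resolved-state strings are not shown in this document:

```
{
  "lastHeartbeatTime": "2021-06-16T16:07:10Z",
  "lastTransitionTime": "2021-06-16T15:50:00Z",
  "status": "False",
  "type": "AlertManager_NodeUpTooLong"
}
```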
The remaining settings control metrics serving, logging verbosity, and leader election:
```
# MetricsAddr is the address and port to serve metrics from
SCIURO_METRICS_ADDR: "0.0.0.0:8080"
# DevMode toggles additional logging information
SCIURO_DEV_MODE: "false"
# LeaderElectionNamespace is the namespace where the leader election config map will be
# managed. Defaults to the current namespace.
SCIURO_LEADER_NAMESPACE: ""
# LeaderElectionID is the name of the configmap used to manage leader elections
SCIURO_LEADER_ID: "sciuro-leader"
```
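As a quick sketch for checking the metrics endpoint locally (the `sciuro` namespace, Deployment name, and standard `/metrics` path are assumptions here):

```
kubectl -n sciuro port-forward deploy/sciuro 8080 &
curl -s http://localhost:8080/metrics | head
```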
Sciuro is recommended to have its own Alertmanager receiver. Since Sciuro currently works in a pull model, this receiver does not need to push anywhere and can simply be an empty receiver. In addition, a route needs to be set up to match alerts to this receiver. There are many configurations that will achieve this; the following partial Alertmanager configuration is one example that allows alerts with a `notify: node-condition-k8s` label to be picked up by Sciuro:
```
route:
  routes:
    - match_re:
        notify: (?:.*\s+)?node-condition-k8s(?:\s+.*)?
      receiver: node-condition-k8s
      continue: true
receivers:
  - name: node-condition-k8s
```
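The `continue: true` setting lets matching alerts carry on to later routes, so existing notification routing is unaffected. You can validate the resulting configuration with amtool (the `alertmanager.yml` file name is an assumption):

```
amtool check-config alertmanager.yml
```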
Assuming Prometheus as a source of alerts, an alert like the following can be created to add a condition to nodes with high uptime:
```
alert: NodeUpTooLong
expr: (time() - node_boot_time_seconds) / 60 / 60 / 24 > 7
labels:
  notify: node-condition-k8s
  priority: "8"
annotations:
  description: Node '{{ $labels.instance }}' has been up for more than 7 days
  summary: Node '{{ $labels.instance }}' uptime too long
```
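If you manage this alert in a Prometheus rules file, it can be validated with promtool (the file name here is hypothetical):

```
promtool check rules node-conditions.rules.yml
```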
With this alert in place, conditions are added to the affected nodes:
```
$ kubectl get node worker01 -o json | jq '.status.conditions[] | select(.type | test("^AlertManager_"))'
{
  "lastHeartbeatTime": "2021-06-16T16:07:10Z",
  "lastTransitionTime": "2021-06-16T15:34:07Z",
  "message": "[P8] Node 'worker01' uptime too long",
  "reason": "AlertIsFiring",
  "status": "True",
  "type": "AlertManager_NodeUpTooLong"
}
```
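Beyond a single node, you can list every node that currently has a firing Sciuro-synced condition; this jq query is a sketch that uses only the fields shown above:

```
kubectl get nodes -o json | jq -r '
  .items[]
  | select(any(.status.conditions[]?;
      (.type | startswith("AlertManager_")) and .status == "True"))
  | .metadata.name'
```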
Sciuro is built and tested with Bazel. To run the tests:

```
make test
```
To build and push images, define the Docker repository base when running the manifests targets:

```
bazel run --define repo=quay.io/myrepo //manifests:cluster > /tmp/cluster.yaml
bazel run --define repo=quay.io/myrepo //manifests:stable > /tmp/stable.yaml
```
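The generated manifests can then be applied just as in the installation steps above:

```
kubectl apply -f /tmp/cluster.yaml
kubectl apply -f /tmp/stable.yaml
```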