
Cluster Load Metric Rationale


The Problem

We define the load of the Airflow worker cluster in terms of messages queued for execution, rather than by observing machine resource usage inside each worker. We say the cluster is overloaded if there are no idle workers but there are still messages waiting for execution, and underloaded if there are many idle workers.

Naive Approach: Queue Length

Initially we used the queue length - in AWS terms, the Approximate Number of Messages Visible metric, or ANOMV for short - as a proxy for cluster load. This didn't work well because that metric alone gives no way to tell when a task has finished executing: when ANOMV drops to zero, the last tasks may still be running, and if they take long enough the workers get killed prematurely.
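
For context, a minimal sketch of how the queue length can be read (assuming boto3; the queue URL handling is illustrative and not this project's actual code):

import boto3

sqs = boto3.client("sqs")

def queue_length(queue_url):
    # ApproximateNumberOfMessages is the real-time SQS counterpart of the
    # CloudWatch ApproximateNumberOfMessagesVisible (ANOMV) metric.
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])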

Solution

Empty Receives vs In Service Instances

An improvement came from the observation that idle workers will poll the queue regularly trying to find a task for execution. This means that if there are many failed attempts to dequeue a message - in AWS terms, the Number of Empty Receives metric, or NOER for short - we know we have many idle workers. In particular, we have to worry about the proportion of machines in the cluster - in AWS terms, the Group In Service Instances metric, or GISI for short - that are constantly polling.
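
For reference, a minimal sketch of how these two metrics could be read from CloudWatch (the boto3 calls are standard, but the queue name and ASG name are placeholders, not taken from this project):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

def metric_value(namespace, metric, dimensions, statistic, minutes=5):
    # Fetch a single statistic for the most recent 5-minute interval,
    # the granularity at which SQS metrics are published.
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    resp = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=start,
        EndTime=end,
        Period=minutes * 60,
        Statistics=[statistic],
    )
    points = resp["Datapoints"]
    return points[0][statistic] if points else 0.0

# NOER: total empty receives on the task queue ("airflow-tasks" is a placeholder).
noer = metric_value("AWS/SQS", "NumberOfEmptyReceives",
                    [{"Name": "QueueName", "Value": "airflow-tasks"}], "Sum")

# GISI: in-service instances of the worker ASG ("airflow-workers" is a placeholder).
gisi = metric_value("AWS/AutoScaling", "GroupInServiceInstances",
                    [{"Name": "AutoScalingGroupName", "Value": "airflow-workers"}], "Average")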

(for a given interval of time t)
NOER  := NumberOfEmptyReceives
W_i   := (average) number of idle workers
fpoll := (average) idle worker's frequency of queue polling

[Eq. 1]
NOER ~ W_i * fpoll * t
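
Rearranging Eq. 1 gives a direct estimate of the number of idle workers. A small sketch with purely illustrative numbers:

def estimate_idle_workers(noer, fpoll, t):
    # Eq. 1 rearranged: W_i ~ NOER / (fpoll * t)
    return noer / (fpoll * t)

# Illustrative numbers only: 450 empty receives over one 5-minute interval,
# each idle worker polling roughly 30 times per interval (so t = 1 interval).
print(estimate_idle_workers(noer=450, fpoll=30, t=1))  # ~15 idle workers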

We may then define the cluster load average as the fraction of workers that are busy, making the gross approximation that workers are either busy or idle (setup time is negligible):

l   := cluster load average
W   := (average) number of workers
W_b := (average) number of busy workers

[Eq. 2]
W = W_b + W_i
[Eq. 3]
l = W_b/W

We can plug equations 1 and 2 into 3 and get:

[Eq. 4]
l = 1 - W_i/W
  ~ 1 - NOER / (W * fpoll * t)

Experimentally, we've found the polling frequency to be about 29 to 30 polls per 5-minute interval. We deal mostly with 5-minute intervals because that's the granularity AWS offers for SQS metrics.

The formula breaks down for an empty cluster (W = 0), but that edge case is simple: if there are messages in the queue (ANOMV > 0) the load is maximal, otherwise there is no load.
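
Putting Eq. 4 and the empty-cluster edge case together, a minimal sketch of the load calculation (function and parameter names are illustrative, not taken from the original scaling code):

def cluster_load(noer, anomv, gisi, fpoll=30.0):
    # gisi plays the role of W; noer, anomv and fpoll are all expressed per
    # 5-minute interval, so t = 1 and drops out of Eq. 4.
    if gisi == 0:
        # Empty cluster: maximum load if messages are waiting, no load otherwise.
        return 1.0 if anomv > 0 else 0.0
    load = 1.0 - noer / (gisi * fpoll)
    # Clamp: sampling noise can push the raw value slightly outside [0, 1].
    return max(0.0, min(1.0, load))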

Worker Survival and Graceful Shutdown

There's no way to control which instance in an ASG is chosen for termination. If we have 10 idle workers and one very busy worker, and we ask the cluster to scale down by 1 instance, there's a chance that the instance getting killed is the one that shouldn't be.

To mitigate this issue, we have life-cycle hooks in place that delay the termination of a worker until it has no more tasks. In addition, Airflow is configured to retry tasks that failed because the process was killed.
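
As a rough sketch of what the termination-delay logic could look like on a worker (the boto3 lifecycle calls are standard, but the hook name, ASG name, and the running-task check are assumptions, not this project's actual code):

import time
import boto3

autoscaling = boto3.client("autoscaling")

def wait_and_release(instance_id, has_running_tasks,
                     hook_name="airflow-worker-drain",  # placeholder name
                     asg_name="airflow-workers",        # placeholder name
                     poll_seconds=30):
    # Keep the lifecycle action alive while the worker still has tasks,
    # then let the ASG proceed with termination.
    while has_running_tasks():
        autoscaling.record_lifecycle_action_heartbeat(
            LifecycleHookName=hook_name,
            AutoScalingGroupName=asg_name,
            InstanceId=instance_id,
        )
        time.sleep(poll_seconds)
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )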