Cluster Load Metric Rationale
We define the load of the Airflow workers cluster in terms of messages queued for execution, rather than by observing machine resource usage inside each worker. We say the cluster is overloaded if there are no idle workers but there are still messages waiting for execution, and underloaded if there are many idle workers.
Initially we used the queue length - in AWS terms, the Approximate Number of Messages Visible metric, or ANOMV for short - as a proxy for cluster load. This didn't work well because there's no way to tell, from that metric alone, when a task has finished executing. When ANOMV goes to zero it means the last tasks are being executed, but if they take too long the workers will get killed in the middle of a task because the queue stayed empty for a while. This approach only works if the Auto Scaling Group cooldown is greater than the time the slowest task takes.
An improvement came from the observation that idle workers poll the queue regularly, trying to find a task to execute. This means that if there are many failed attempts to dequeue a message - in AWS terms, the Number of Empty Receives metric, or NOER for short - we know we have many idle workers. In particular, we have to look at what proportion of the machines in the cluster - in AWS terms, the Group In Service Instances metric, or GISI for short - are constantly polling.
(for a given interval of time t)

NOER   := NumberOfEmptyReceives
W_i    := (average) number of idle workers
f_poll := (average) idle worker's frequency of queue polling

[Eq. 1]
NOER ~ W_i * f_poll * t
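For example, with f_poll at about 30 polls per 5-minute interval (the empirical value reported below), 4 idle workers over one such interval would generate NOER ~ 4 * 30 = 120 empty receives.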
We may then define the cluster load average as the fraction of workers that are busy, making the gross approximation that workers are either busy or idle (negligible setup time):
l   := cluster load average
W   := (average) number of workers
W_b := (average) number of busy workers

[Eq. 2]
W = W_b + W_i

[Eq. 3]
l = W_b / W
We can plug equations 1 and 2 into 3 and get:
[Eq. 4]
l = 1 - W_i/W
  ~ 1 - NOER / (W * f_poll * t)
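As a worked example: with W = 10 workers and NOER = 60 empty receives over a 5-minute interval (so f_poll * t ~ 30 polls per worker), Eq. 1 gives W_i ~ 60/30 = 2 idle workers, and Eq. 4 gives l ~ 1 - 2/10 = 0.8.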
Experimentally, we've found the polling frequency to be about 29-30 polls per 5-minute interval. We deal mostly with 5-minute intervals because that's the finest granularity AWS offers for SQS metrics.
The formula will fail for empty clusters (W = 0), but that edge case is simple: if there are messages in the queue (ANOMV > 0) we have maximum load; otherwise we have no load.
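To make this concrete, here is a minimal sketch of Eq. 4 on top of boto3 and CloudWatch. The queue name, ASG name, and helper functions are hypothetical, and F_POLL is just the empirical value above; this illustrates the idea rather than reproducing the exact implementation.

```python
import datetime

import boto3

F_POLL = 30 / 300.0   # assumption: ~30 polls per idle worker per 5-minute interval
PERIOD = 300          # 5 minutes, the granularity of SQS metrics

cloudwatch = boto3.client("cloudwatch")

def latest_datapoint(namespace, metric, dimensions, stat):
    """Fetch the most recent CloudWatch datapoint for a metric."""
    end = datetime.datetime.now(datetime.timezone.utc)
    start = end - datetime.timedelta(seconds=2 * PERIOD)
    resp = cloudwatch.get_metric_statistics(
        Namespace=namespace, MetricName=metric, Dimensions=dimensions,
        StartTime=start, EndTime=end, Period=PERIOD, Statistics=[stat],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1][stat] if points else 0.0

def cluster_load(queue_name, asg_name):
    """Estimate the cluster load average l from Eq. 4."""
    noer = latest_datapoint("AWS/SQS", "NumberOfEmptyReceives",
                            [{"Name": "QueueName", "Value": queue_name}], "Sum")
    anomv = latest_datapoint("AWS/SQS", "ApproximateNumberOfMessagesVisible",
                             [{"Name": "QueueName", "Value": queue_name}], "Average")
    w = latest_datapoint("AWS/AutoScaling", "GroupInServiceInstances",
                         [{"Name": "AutoScalingGroupName", "Value": asg_name}],
                         "Average")
    if w == 0:  # empty cluster: full load if work is waiting, zero otherwise
        return 1.0 if anomv > 0 else 0.0
    # [Eq. 4]  l ~ 1 - NOER / (W * f_poll * t)
    return max(0.0, 1.0 - noer / (w * F_POLL * PERIOD))
```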
The approximation had one major flaw: it assumed machines would immediately start working after being considered "in service" by the ASG. This could sometimes lead to load overestimation, so we added a discount factor to the average GISI to account for a worker's late entry into the polling process.
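The text doesn't spell out the discount, but one plausible form - an assumption, not necessarily what was used here - is to weight each in-service instance by the fraction of the measurement interval it has actually been running:

```python
import datetime

import boto3

ec2 = boto3.client("ec2")

def discounted_worker_count(instance_ids, period=300):
    """Hypothetical discount: weight each worker by the fraction of the
    last `period` seconds it has been up, as a proxy for polling time."""
    now = datetime.datetime.now(datetime.timezone.utc)
    resp = ec2.describe_instances(InstanceIds=instance_ids)
    weight = 0.0
    for reservation in resp["Reservations"]:
        for instance in reservation["Instances"]:
            uptime = (now - instance["LaunchTime"]).total_seconds()
            weight += min(1.0, uptime / period)
    return weight  # use in place of the raw GISI as W in Eq. 4
```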
There's no way to control which instance in an ASG is chosen to die. If we have 10 idle workers and one very busy one, and we ask the cluster to scale down by 1 instance, there's a chance that the instance getting murdered is the only one that shouldn't be. So in the end we had to be prepared for tasks getting interrupted because their worker was killed unjustly.
Later on we were able to mitigate this issue by giving workers a little extra time to finish their business, thanks to Lifecycle Hooks.
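For reference, a minimal sketch of such a termination hook with boto3; the hook and ASG names are placeholders, and the exact timeout used here isn't documented. The ASG holds a terminating instance in the Terminating:Wait state until the worker signals it has drained its current task (or the timeout expires):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hold terminating instances for up to 10 minutes before proceeding,
# even if nobody completes the lifecycle action.
autoscaling.put_lifecycle_hook(
    LifecycleHookName="drain-airflow-worker",   # placeholder name
    AutoScalingGroupName="airflow-workers",     # placeholder name
    LifecycleTransition="autoscaling:EC2_INSTANCE_TERMINATING",
    HeartbeatTimeout=600,
    DefaultResult="CONTINUE",
)

# On the worker, once its current task has finished:
autoscaling.complete_lifecycle_action(
    LifecycleHookName="drain-airflow-worker",
    AutoScalingGroupName="airflow-workers",
    LifecycleActionResult="CONTINUE",
    InstanceId="i-0123456789abcdef0",           # this worker's instance id
)
```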