
\section{Fixed Delay Scheduling in Cluster Frameworks}\label{sec:systems}

In real systems, fixed delay scheduling is usually implemented with a preset delay
interval (a length of time, typically in milliseconds) and a per-job timestamp, which is
first taken when the job is first considered for scheduling (whether or not a task is actually
scheduled).
This timestamp is checked whenever a slot opens. If the slot is on a machine
without input data for a job, then tasks from that job may be scheduled there only if
the time elapsed since the timestamp exceeds the fixed delay
interval. If the slot \textit{does} have input data for the job, then tasks can be scheduled
regardless of the timestamp. Whenever a task is scheduled for a job, for whatever reason, the
job's timestamp is reset and the process begins again. Very little extra
state is required (just one additional field for each job currently in progress). In this
paper, we focus on how fixed delay scheduling operates in Apache Spark (detailed
below), but the principles and implementations are the same in other systems, such as
Hadoop, YARN, and Mesos.
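
As a compact restatement of the rule above, in notation introduced here purely for
illustration: let $D$ be the fixed delay interval, $\tau_j$ the current timestamp of job $j$,
and $L_j$ the set of machines holding input data for $j$. When a slot opens on machine $m$ at
time $t$, a task of job $j$ may be launched there if and only if
\[
  m \in L_j \quad \mbox{or} \quad t - \tau_j > D ,
\]
and whenever a task of $j$ is launched, the timestamp is reset, $\tau_j \leftarrow t$. For
instance, with $D = 3000$\,ms and $\tau_j = 0$, a slot that opens at $t = 2000$\,ms on a
machine outside $L_j$ is declined by $j$, whereas one that opens at $t = 3500$\,ms is accepted
and resets $\tau_j$ to $3500$\,ms.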

Apache Spark~\cite{Zaharia2012} is a general execution engine for distributed ``Big Data'' processing, 
similar to Hadoop, that breaks jobs into stages of tasks, and uses fixed delay scheduling to 
improve data locality when launching these tasks. Among other features, Spark provides an
abstraction for representing distributed datasets (Resilient Distributed Datasets,
or RDDs). These RDDs, which represent data either on disk (in the Hadoop Distributed File
System, or HDFS) or in memory, are divided into partitions spread across worker processes, called
executors, in the cluster. Each executor has a finite number of slots (usually equal to the number
of CPU cores available) with which to run 
tasks. The Spark scheduler, which resides in a centralized driver process separate from 
the executors, uses fixed delay scheduling to provide greater data locality when scheduling 
tasks into slots that become open. The fixed delay interval (the maximum waiting time) is set 
statically in a configuration file.
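
As an illustration, Spark exposes this interval through the \texttt{spark.locality.wait}
property, which can be placed in \texttt{conf/spark-defaults.conf}. The value shown is the
documented default in recent Spark releases; older releases expect a plain number of
milliseconds (e.g., \texttt{3000}), so the exact syntax should be checked against the version
in use.
\begin{verbatim}
# conf/spark-defaults.conf
# Fixed delay interval: how long to wait for a more local slot
# before accepting a less local one.
spark.locality.wait    3s
\end{verbatim}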

Spark, like the other systems mentioned above, also provides hierarchical fixed delay scheduling.
That is, a separate delay interval can be specified for each locality tier: node-local, rack-local,
or even datacenter-local. If a job's delay interval expires at one tier, its tasks are not immediately
scheduled on an arbitrary slot; instead, the job falls back to the next tier and waits there (for a
possibly different interval) for slots that satisfy that tier's relaxed locality condition (for
example, slots in the same rack as the input data).
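
In Spark, for example, the per-tier intervals correspond to separate configuration properties,
each of which defaults to the value of \texttt{spark.locality.wait} when left unset. The fragment
below is a sketch with arbitrary illustrative values, not recommended settings.
\begin{verbatim}
# conf/spark-defaults.conf
# Wait for a slot in the executor process that holds the data.
spark.locality.wait.process  1s
# Then wait for a slot on the same machine.
spark.locality.wait.node     3s
# Then wait for a slot in the same rack.
spark.locality.wait.rack     6s
\end{verbatim}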