\section{Dynamic Delay Scheduling}\label{sec:dynamic}
\subsection{Beyond Constant Delays}
The frameworks described above use a fixed delay interval that remains the same
throughout an entire job. Our observation is that there is a
calculable tipping point beyond which waiting (skipping scheduling opportunities)
is no longer beneficial. We can define the total running time of a task \textit{t} as:
\[
\mathit{t\_total} = \mathit{t\_compute} + \mathit{t\_netOH} + \mathit{sch\_delay}
\]
where \textit{t\_compute} is the local computation time of the task, \textit{t\_netOH}
is the network latency incurred from data transfer (this is zero if the task is scheduled
locally), and \textit{sch\_delay} is the amount of time \textit{t} is delayed
(skipped) to wait for locality, from the moment \textit{t} is first considered for
scheduling.
Ideally, we want to minimize the completion time of every task within a given job,
while also bounding the maximum latency that the scheduler itself introduces
through delay scheduling. The goal of delay scheduling is to eliminate \textit{t\_netOH} by
scheduling a task locally, but if the task is delayed for longer than \textit{t\_netOH},
then in hindsight it would have been faster (in terms of task completion time)
to schedule the task immediately in the first open slot considered.
Conversely, if the delay interval is too short, the task may miss data-local
scheduling opportunities that would have yielded a shorter completion time
than running without locality.
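Phrased in the terms defined above, the comparison behind this tipping point is
\[
\underbrace{\mathit{t\_compute} + \mathit{sch\_delay}}_{\text{wait, then run locally}}
\quad\text{vs.}\quad
\underbrace{\mathit{t\_compute} + \mathit{t\_netOH}}_{\text{run remotely immediately}},
\]
so waiting pays off precisely when $\mathit{sch\_delay} < \mathit{t\_netOH}$.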
Dynamic Delay Scheduling improves upon fixed delay
scheduling by dynamically capping the maximum
\textit{sch\_delay} at \textit{t\_netOH} on a per-task basis. This permits
the longest useful wait for a data-local slot while bounding the worst-case task
running time at $\mathit{t\_compute} + 2\,\mathit{t\_netOH}$ (derived below). In this way,
the delay interval adapts to the severity of the network overhead: if \textit{t\_netOH} is low,
locality matters less and the scheduler shortens its delay; if \textit{t\_netOH} grows,
the scheduler waits longer, since locality matters more under poor network conditions.
This adaptivity requires that the network overhead incurred by running tasks be
monitored and reported to the scheduler over time, so that it can track changing
network conditions and potentially heterogeneous data block sizes.
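The worst-case bound above follows directly: a task waits at most \textit{t\_netOH}
before giving up on locality, and then, at worst, still pays the full remote-read cost:
\[
\mathit{t\_total} \;\le\; \mathit{t\_compute}
+ \underbrace{\mathit{t\_netOH}}_{\text{maximum } \mathit{sch\_delay}}
+ \underbrace{\mathit{t\_netOH}}_{\text{remote read}}
\;=\; \mathit{t\_compute} + 2\,\mathit{t\_netOH}.
\]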
\subsection{An Adaptive Solution Using Task Feedback}
The network overhead incurred by missing data locality depends on many factors, including
network traffic, the distance of a task from its data, and the input data size. These
factors vary from job to job, and some (like network traffic) can change during the job
execution itself. To adapt the delay scheduling interval to these changing conditions,
we have designed a simple feedback mechanism: upon completion, each task reports
the network overhead it experienced (\textit{t\_netOH} for a task \textit{t}). This metric is sent to a
centralized fair scheduler, which updates its delay scheduling interval using a
running average of the feedback from completed tasks. The updated interval is
then used when scheduling subsequent tasks of the same type.
The feedback metric itself (\textit{t\_netOH}) can be any time measurement that
reflects the network overhead incurred by a task. In the case of Dynamic Delay Scheduling, we
consider tasks that read data from distributed file systems,
so the \textit{t\_netOH} for each task is the time spent
reading data blocks remotely; however, any type of task that can measure its own
overhead could report this metric and benefit from the adaptation.
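As a concrete illustration, the following Python sketch shows both halves of this
feedback loop; the class, function, and parameter names are hypothetical stand-ins
rather than the API of any particular framework.
\begin{verbatim}
import time

class DynamicDelayScheduler:
    """Scheduler-side half of the feedback loop (illustrative sketch):
    maintains the delay interval from per-task t_netOH reports."""

    def __init__(self, default_interval=3.0):
        # Default of 3 seconds, matching Algorithm 1's initialization.
        self.delay_interval = default_interval

    def on_task_completion(self, t_net_oh):
        # Algorithm 1's running average: each new reading is averaged
        # with the current interval, so recent measurements dominate.
        # For very frequent completions, this could instead average a
        # wider window of recent readings (e.g., a collections.deque).
        self.delay_interval = (self.delay_interval + t_net_oh) / 2

def timed_block_read(read_block, is_local, task_stats):
    """Task-side half: time one block read and, if the block is remote,
    charge the elapsed time to the task's t_netOH accumulator."""
    start = time.monotonic()
    data = read_block()  # stand-in for the framework's block reader
    if not is_local:
        task_stats["t_net_oh"] += time.monotonic() - start
    return data

# Example: a task reads one remote block, then reports its overhead.
scheduler = DynamicDelayScheduler()
stats = {"t_net_oh": 0.0}
timed_block_read(lambda: b"block bytes", is_local=False, task_stats=stats)
scheduler.on_task_completion(stats["t_net_oh"])
\end{verbatim}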
\begin{algorithm}
\footnotesize
\DontPrintSemicolon
\caption{Dynamic Delay Scheduling}\label{alg:dynamic}
\textbf{Initialization:}\;
delay\_interval $\leftarrow$ default (3 seconds)\;
\hrule width 0.4\textwidth
\;
\textbf{Upon receiving task $t$ completion event:}\;
delay\_interval $\leftarrow$ (delay\_interval + t\_netOH) / 2\;
\hrule width 0.4\textwidth
\;
\textbf{Scheduling Procedure:}\;
\For{each open slot $s$ on a node $n$ in $N$}{
sort $J$ in increasing order of currently running tasks\;
\For{$j$ in $J$}{
\If{$j$ has just launched a task}{
$j$.wait\_begin $\leftarrow$ current\_time\;
}
\eIf{a task $t$ in $j$ has data on $s$'s node $n$}{
launch $t$ in $s$\;
\textbf{break} (slot $s$ is filled)\;
}{
\If{a task $t$ in $j$ is unlaunched}{
\eIf{current\_time $-$ $j$.wait\_begin $>$ delay\_interval}{
launch $t$ in $s$\;
\textbf{break} (slot $s$ is filled)\;
}{
continue (skip $j$ and consider the next job)\;
}
}
}
}
}
\end{algorithm}
Algorithm~\ref{alg:dynamic} describes our adaptive solution in detail, for a cluster with \textit{N} nodes and \textit{J} jobs.
As tasks complete, the delay interval is updated to reflect
recent overhead measurements. Because feedback arrives only when tasks complete,
the update rule weights recent readings heavily, so the interval adapts relatively
quickly (it is effectively an average of the last few readings). For workloads
in which tasks complete very frequently, the window of this average can be broadened.
During scheduling, the current interval is used for delay scheduling, and it can
change even while a job is waiting, as other tasks complete. It is worth noting that
nothing prevents this adaptive solution from also being hierarchical, since many
frameworks implement separate delay intervals for node-local and rack-local
scheduling. A sketch of one scheduling pass follows.
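To make the procedure concrete, one scheduling pass could look roughly as follows
in Python; the slot, job, and task objects and their fields (\texttt{slot.launch},
\texttt{task.data\_nodes}, and so on) are hypothetical placeholders for whatever
the host framework provides, not a real scheduler API.
\begin{verbatim}
import time

def scheduling_pass(open_slots, jobs, scheduler):
    """One pass of the scheduling procedure in Algorithm 1 (a sketch;
    the slot, job, and task object shapes are assumed)."""
    for slot in open_slots:
        # Fair sharing: consider jobs with the fewest running tasks first.
        for job in sorted(jobs, key=lambda j: j.running_tasks):
            if job.just_launched_task:
                job.wait_begin = time.monotonic()  # reset the wait clock
            # Prefer a task whose input data lives on this slot's node.
            local = next((t for t in job.unlaunched_tasks
                          if slot.node in t.data_nodes), None)
            if local is not None:
                slot.launch(local)  # data-local launch: no t_netOH
                break               # slot is filled; move to the next slot
            if job.unlaunched_tasks:
                waited = time.monotonic() - job.wait_begin
                if waited > scheduler.delay_interval:
                    # Past the adaptive interval: give up on locality
                    # and launch the task non-locally in this slot.
                    slot.launch(job.unlaunched_tasks[0])
                    break
                # Otherwise skip this job for now and try the next one.
\end{verbatim}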