\documentclass[sigconf, nonacm]{acmart}
\usepackage{amsmath,amsfonts}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{multirow}
\usepackage{xcolor,colortbl}
\usepackage{algorithmic, algorithm}
\usepackage{caption}
\usepackage{subcaption}
\usepackage[draft]{fixme}
\usepackage{xspace}
\usepackage{hyperref} % makes cross-refs (biblio, figures, algos, ...) clickable
%% The following content must be adapted for the final version
% paper-specific
\newcommand\vldbdoi{XX.XX/XXX.XX}
\newcommand\vldbpages{XXX-XXX}
% issue-specific
\newcommand\vldbvolume{14}
\newcommand\vldbissue{1}
\newcommand\vldbyear{2020}
% should be fine as it is
\newcommand\vldbauthors{\authors}
\newcommand\vldbtitle{\shorttitle}
% leave empty if no availability url should be set
\newcommand\vldbavailabilityurl{}%http://vldb.org/pvldb/format_vol14.html}
% whether page numbers should be shown or not, use 'plain' for review versions, 'empty' for camera ready
\newcommand\vldbpagestyle{plain}
\newcommand{\tristan}[1]{\fxnote{\small \color{red} \textbf{From Tristan:}#1} \color{black}}
\newcommand{\timothee}[1]{\color{blue}\textbf{From Timothée:}#1\color{black}}
\newcommand{\mytodo}[1]{\color{red}\textbf{TODO:}#1\color{black}}
\newcommand{\keep}[0]{\textsc{keep}\xspace}
\begin{document}
\title{A heuristic to repartition large multi-dimensional arrays with reduced disk seeking}
\author{Timoth\'ee Gu\'edon$^{1,2}$, Val\'erie Hayot-Sasson$^1$, Tristan Glatard$^1$}
\affiliation{$^1$ Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada\\
$^2$ ESIEA school of engineering, France}
\begin{abstract}
Multi-dimensional arrays have become the main scientific data structures,
but their manipulation raises performance issues when they exceed memory
capacity. In particular, accessing specific array regions can require
millions to billions of disk seek operations, with important consequences
on I/O performance. While traditional approaches to address this problem
focus on file format optimizations, we are looking for algorithmic
solutions where applications control their I/O to reduce
seeking. In this paper, we propose the \keep heuristic to minimize the
number of seeks required for the repartitioning of large multi-dimensional
arrays. The \keep heuristic uses a memory cache to reconstruct contiguous
data sections in memory. We evaluate it on arrays ranging from 85.7~GiB to
1~TiB, and with memory amounts ranging from 4 to 275~GiB. Repartitioning
time is reduced by a factor of up to 2.5, and seeking is reduced by four orders
of magnitude. Due to the effect of asynchronous writes to the memory page cache
enabled by the Linux kernel, speed-up is only observed for arrays that
exceed the size of working memory. The \keep heuristic could be applied in
platforms that manipulate large data arrays, as is commonly the case in
scientific imaging.
\end{abstract}
\listoffixmes
\newpage
\maketitle
%%% do not modify the following VLDB block %%
%%% VLDB block start %%%
\pagestyle{\vldbpagestyle}
\begingroup\small\noindent\raggedright\textbf{PVLDB Reference Format:}\\
\vldbauthors. \vldbtitle. PVLDB, \vldbvolume(\vldbissue): \vldbpages, \vldbyear.\\
%\href{https://doi.org/\vldbdoi}{doi:\vldbdoi}
\endgroup
\begingroup
\renewcommand\thefootnote{}\footnote{\noindent
This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit \url{https://creativecommons.org/licenses/by-nc-nd/4.0/} to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing \href{mailto:info@vldb.org}{info@vldb.org}. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. \\
\raggedright Proceedings of the VLDB Endowment, Vol. \vldbvolume, No. \vldbissue\ %
ISSN 2150-8097. \\
\href{https://doi.org/\vldbdoi}{doi:\vldbdoi} \\
}\addtocounter{footnote}{-1}\endgroup
%%% VLDB block end %%%
%%% do not modify the following VLDB block %%
%%% VLDB block start %%%
\ifdefempty{\vldbavailabilityurl}{}{
\vspace{.3cm}
\begingroup\small\noindent\raggedright\textbf{PVLDB Artifact Availability:}\\
The source code, data, and/or other artifacts have been made available at \url{\vldbavailabilityurl}.
\endgroup
}
%%% VLDB block end %%%
\vspace*{-0.3cm}
%----------------------------------------
\section{Introduction}
%----------------------------------------
% Multidimensional arrays are important
Multidimensional arrays are
used in many scientific disciplines as the main data structure~\cite{harris2020array}. In neuroimaging for instance, arrays
commonly store 2D, 3D or 3D+t brain images acquired from histology, magnetic
resonance imaging (MRI), or other modalities. Due to the improvement of
acquisition methods, it is now common for such images to exceed the amount
of working memory available on typical computers, requiring that they be
partitioned into blocks stored in independent files or in multiple HDF5
datasets. BigBrain~\cite{Amunts1472}, for instance, is a human brain model
providing microscopic data at the resolution of 20 micrometers that comes
in the form of many 3D blocks. Partitioning large arrays enables efficient
queries, flexibility in adding new data~\cite{optimal_chuking}, granular data
transfers, and parallel processing with frameworks such as
Dask~\cite{matthew_rocklin-proc-scipy-2015}. Applications may adopt
different block geometries depending on their processing requirements, such
as 2D slices or 3D blocks.
% Seeks create performance issues
Manipulating large multidimensional arrays, however, comes with performance
challenges related to the ordering and organization of array elements on
storage devices. In particular, reading or writing elements scattered in different
disk regions can generate millions to billions of \emph{seeks}, the
operations by which storage drives move to a different location, resulting
in important penalties on both SSD and HDD drives. Our goal is to minimize the number of seeks required when manipulating
large multi-dimensional arrays.
% Existing work
Currently, the main approach to this problem is to adapt the order of array
elements on storage devices to the type of queries done by applications.
For instance, the NDStore platform~\cite{lillaney2018building} orders voxels of 3D images
using Morton space-filling curves~\cite{morton1966computer}, so that nearby
voxels are stored at nearby disk addresses with high
probability. Other works determine from historical queries a
block shape that minimizes seeking
in future queries~\cite{optimal_chuking}.
While such approaches are efficient, they also require that data
be converted to a storage format specific to the application, which is
impractical and hampers data sharing. Instead, we are looking for
algorithmic approaches where applications control their I/Os to reduce seeking.
% Goal
This paper presents the \keep heuristic, an algorithm to reduce the
number of seeks created when repartitioning large multi-dimensional arrays in blocks of
arbitrary shape. Repartitioning may be triggered either by application requirements
or for performance reasons, to improve
memory usage and I/O efficiency. For instance, the Python HDF5 library
\texttt{h5py} recommends a chunk size ``between 10~KiB and 1~MiB''~\cite{collette_2014}, while the Dask parallel processing framework
~\cite{matthew_rocklin-proc-scipy-2015} suggests a chunk size greater than
100~MB~\cite{rocklin_bourbeau_2019}.
The \keep heuristic leverages a memory cache to read and write data as
contiguously as possible, similarly to the ``clustered'' and ``multiple''
algorithms described in~\cite{seqalgorithms}. The storage order of array
elements on disk is assumed to be known to the application, but unconstrained.
Implementations based on our algorithm could therefore use arbitrary file
formats. We focus on sequential algorithms, assuming that arrays are stored
on a single device and accessed by single-threaded I/Os. Extensions to parallel
environments are part of our future work. Although in practice seek times
depend on various factors, such as the distance between the initial and
target drive positions, we focus on minimizing seek \emph{numbers}, for
simplicity. Likewise, the effects of I/O optimizations such as page caching
or readahead will be discussed experimentally but not modelled.
In summary, the paper makes the following contributions:
\begin{itemize}
\item We define the repartitioning problem for data arrays;
\item We propose the \keep heuristic for the re-partitioning problem;
\item We evaluate it for various memory amounts and data sizes.
\end{itemize}
The remainder of this paper presents these contributions and discusses the
validity and implications of our results.
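For illustration, a hedged sketch of such a repartitioning step in the Dask/HDF5 stack could look as follows (the file names and block shapes are hypothetical):
\begin{verbatim}
import dask.array as da
import h5py

# Hypothetical example: an array stored with small
# HDF5 chunks is repartitioned into larger blocks.
f = h5py.File("brain.h5", "r")
x = da.from_array(f["volume"],
                  chunks=(175, 175, 175))
y = x.rechunk((500, 500, 500))
y.to_hdf5("brain_out.h5", "/volume")
\end{verbatim}
Repartitioning of this kind is the operation whose disk seeks the \keep heuristic aims to reduce.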
% % need for tools to repartition
% Using chunked multidimensional arrays requires tools to efficiently split,
% merge, and ``resplit" or ``repartition" data files. Previous work in~\cite{seqalgorithms}
% showed that naive algorithms to split an array into several chunks or merge
% array chunks into one output file do not perform well due to millions of seeks
% occurring on disk. The present study is focused on sequential algorithms for
% repartitioning multidimensional arrays, letting the parallel and distributed
% cases as future work, although they are relevant, too. To the best of our
% knowledge, the repartition problem has not been extensively studied.
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{./figures/figure_1.pdf}
\caption{Ordering of 3D array elements in files in C order. The first stored element is colored in green,
the last one is in blue. Seeking to the red elements is required when extracting the 3D block with plain borders.}
\label{fig:seeks_and_rowmajor}
\end{figure}
\section{Problem definition and baseline}
For the sake of clarity and without loss of generality, we assume that
files are written in row-major order (Fig.~\ref{fig:seeks_and_rowmajor},
a.k.a. C order), where the fastest moving dimension in the file is the
last dimension of the array, and the slowest moving dimension in the file
is the first dimension of the array. This convention is, for instance, the
one used in the HDF5 format~\cite{hdf5}.
Accessing data from an array stored on disk generates seeks in two
situations: (1) when an array block is opened for reading or writing, and
(2) when the reading or writing process moves to a different address within
the block.
\subsection{The re-partitioning problem}
\tristan{Maybe ask Harutyunan if he wants to formulate it better and check if it's NP complete}
\begin{table}
\caption{Notations}
\begin{tabular}{ll}
\multicolumn{2}{c}{\cellcolor{black!25}Mono-space fonts are used for array partitions}\\
\texttt{A} & 3D array to repartition \\
\texttt{inBlocks} & set of input blocks \\
\texttt{outBlocks} & set of output blocks \\
\texttt{readBlocks} & set of read blocks \\
\texttt{writeBlocks} & set of write blocks \\
&\\
\multicolumn{2}{c}{\cellcolor{black!25}Square capitals are used for block shapes}\\
A = ($A_0$, $A_1$, $A_2$) & shape of \texttt{A}\\
I = ($I_0$, $I_1$, $I_2$)& shape of input blocks\\
O = ($O_0$, $O_1$, $O_2$) & shape of output blocks\\
R = ($R_0$, $R_1$, $R_2$) & shape of read blocks\\
W = ($W_0$, $W_1$, $W_2$) & shape of write blocks\\
&\\
\multicolumn{2}{c}{\cellcolor{black!25}Round capitals are used for block end coordinates}\\
$\mathcal{I}$ = ($\mathcal{I}_0$, $\mathcal{I}_1$, $\mathcal{I}_2$), \enskip $\mathcal{I}_d \subset \mathbb{N} $ & input block end coordinates \\
$\mathcal{O}$ = ($\mathcal{O}_0$, $\mathcal{O}_1$, $\mathcal{O}_2$), \enskip $\mathcal{O}_d \subset \mathbb{N} $ & output block end coordinates\\
$\mathcal{R}$ = ($\mathcal{R}_0$, $\mathcal{R}_1$, $\mathcal{R}_2$), \enskip $\mathcal{R}_d \subset \mathbb{N} $ & read block end coordinates\\
$\mathcal{W}$ = ($\mathcal{W}_0$, $\mathcal{W}_1$, $\mathcal{W}_2$), \enskip $\mathcal{W}_d \subset \mathbb{N} $ & write block end coordinates\\
&\\
\multicolumn{2}{c}{\cellcolor{black!25}$n_X$ are used for block numbers}\\
$n_I$ = number of input blocks\\
$n_O$ = number of output blocks\\
$n_R$ = number of read blocks\\
$n_W$ = number of write blocks\\
\end{tabular}
\label{table:notations}
\end{table}
\begin{algorithm}[b]
\caption{General re-partitioning algorithm}
\label{algo:generalrepartition}
\begin{algorithmic}[1]
\STATE \textbf{Inputs}
\STATE inBlocks: input blocks of shape $I$, stored on disk
\STATE outBlocks: output blocks of shape $O$, to be written
\STATE $m$: memory available
\STATE
\STATE \textbf{Outputs}
\STATE none (outBlocks are written)
\STATE
\STATE \textbf{Algorithm}
\STATE readBlocks, writeBlocks $\leftarrow$ getReadWriteBlocks($I$, $O$, $m$)
\STATE cache $\leftarrow$ init\_cache(readBlocks, writeBlocks)
\FOR{readBlock in readBlocks}
\STATE data $\leftarrow$ read(readBlock, inBlocks)
\STATE completeBlocks = cache.insert(data)
\FOR{writeBlock in completeBlocks}
\STATE write(writeBlock, outBlocks)
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}
We focus on 3D arrays for simplicity. Consider a 3D array \texttt{A} of shape $A =
(A_0, A_1, A_2)$, partitioned in uniformly-shaped input blocks of shape $I =
(I_0, I_1, I_2)$ stored on disk. Our goal is to re-partition the input
blocks into uniform output blocks of shape $O = (O_0, O_1, O_2)$,
where $O \neq I$. Notations are summarized in
Table~\ref{table:notations}.
We formalize the re-partitioning problem as in Algorithm~\ref{algo:generalrepartition}.
A re-partitioning algorithm takes a
list of input blocks \texttt{inBlocks}, a list of output
blocks \texttt{outBlocks}, and the amount of memory \texttt{m}
available for the algorithm. Subject to $m$, the algorithm determines (1)
\texttt{readBlocks}, a list of uniformly-shaped blocks of shape $R$ to be read from the
input blocks, and (2) \texttt{writeBlocks}, a list of blocks,
usually not uniformly-shaped, to be written to output blocks.
If read blocks and input blocks have a different shape, then reading read blocks requires
more than one seek. Similarly, if write blocks and output blocks have a different shape,
then writing write blocks requires more than one seek. Input, output, read
and write blocks all form a partition of the 3D array \texttt{A}. The
main loop of the algorithm (line 12) loads one read block at a time (line
13), inserts it into a cache (line 14), and writes write blocks from the
cache when they are complete (lines 15 -- 17). Function \texttt{write} opens
each output block only once, and writes the required elements of \texttt{data} in it.
It should be noted that blocks are passed to the algorithm by reference,
not by data. Only variables \texttt{data} and \texttt{cache}
hold actual data, contributing to the consumed memory.
Since read and write blocks both represent a partition of $A$, all the
elements in \texttt{inBlocks} are read exactly once and written exactly once too. Other
problem formalizations may allow for input elements to be read and
discarded, to reduce seeking. Exploring such a trade-off
between seeks and redundant reads could be an interesting extension to this problem.
Our goal is to minimize the number of seeks done by the algorithm, which
occurs during reads (line 13) and writes (line 16). The re-partitioning
problem is therefore to find \texttt{readBlocks} and \texttt{writeBlocks} that
minimize the number of
seeks done by the algorithm, subject to the memory constraint $m$.
% Problem: re-partitioning of multi-dimensional arrays
% Find: readBlocks and writeBlocks
% Subject to: m
% Such that: seeks is minimal
% \tristan{formalize problem definition}.
Solutions of this problem materialize as implementations of function
\texttt{getReadWriteBlocks} (Alg.~\ref{algo:generalrepartition}, line 10). A
lower bound on the number of seeks for the repartitioning problem is $n_I +
n_O$, with $n_I$ the number of input blocks and $n_O$ the number of output
blocks. Indeed, input and output blocks all have to be opened at least
once, which requires a seek.
For simplicity, we require that all blocks in \texttt{readBlocks} have the
same shape $R$. We also assume that the size of array elements is 1~B.
To the best of our knowledge, no algorithm has been
proposed for the repartitioning problem.
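As a reference, a minimal Python sketch of Algorithm~\ref{algo:generalrepartition} could look as follows; the I/O and cache helpers are assumptions passed in as parameters, not the actual implementation:
\begin{verbatim}
# Hedged sketch of Algorithm 1; the I/O and cache
# helpers are passed as parameters (assumptions).
def repartition(in_blocks, out_blocks, I, O, m,
                get_read_write_blocks, init_cache,
                read, write):
    read_blocks, write_blocks = \
        get_read_write_blocks(I, O, m)
    cache = init_cache(read_blocks, write_blocks)
    for read_block in read_blocks:
        # line 13: one read, possibly several seeks
        data = read(read_block, in_blocks)
        # lines 14-17: cache the data, then flush
        # the write blocks that are now complete
        for write_block in cache.insert(data):
            write(write_block, out_blocks)
\end{verbatim}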
\subsection{Baseline}
Our baseline algorithm for the repartitioning problem loads one input block
at a time ($R$=$I$), and directly writes it to the appropriate output
blocks ($W$=$R$). It assumes that input blocks fit in memory. The number of
generated seeks $s_b$
depends on how write blocks overlap with
output blocks in each dimension (Fig.~\ref{fig:baseline_figure} \tristan{check}).
To determine $s_b$, we define
$\mathcal{W}_d$ and $\mathcal{O}_d$, the sets of write and output block end
coordinates in dimension $d$:
\[
\forall d \in \llbracket 0, 2 \rrbracket,
\]
\begin{equation}
\mathcal{W}_d = \Big\{ iW_d, i \in \llbracket 1, A_d/W_d \rrbracket \Big\} \quad \mathrm{and} \quad
\mathcal{O}_d = \Big\{ iO_d, i \in \llbracket 1, A_d/O_d \rrbracket \Big\} ~\label{eq:baseline-blocks}
\end{equation}
We then define the \emph{cuts} of the $i^{th}$ output block along dimension
$d$ as follows:
\[
\forall i \in \llbracket 1, A_d/O_d \rrbracket,
\]
\[
C_{i, d} = C_{i, d}\left( \mathcal{W}_d, \mathcal{O}_d \right) = \Big\{ w \in \mathcal{W}_d, (i-1)O_d < w < iO_d \Big\}
\]
The number of output block pieces along dimension $d$ is then:
\[
c_d = c_d\left( \mathcal{W}_d, \mathcal{O}_d \right) =
\sum_{i=1}^{A_d/O_d} \vert C_{i, d}\left( \mathcal{W}_d, \mathcal{O}_d \right) \vert + 1 - \delta_{\vert C_{i, d}\left( \mathcal{W}_d, \mathcal{O}_d \right) \vert, 0}
\]
where $\vert \cdot \vert$ is the cardinality operator and $\delta$ is the Kronecker delta \tristan{I don't think we need the 1 and delta}.
The number of \emph{matches} between block endings is:
\[
m_d = m_d\left( \mathcal{W}_d, \mathcal{O}_d \right) = \vert \mathcal{W}_d \cup \mathcal{O}_d \vert - c_d
\]
Finally, the number of seeks created by the baseline algorithm is:
\begin{equation}
s_b = A_0 A_1 c_2 +
A_0 c_1 m_2 +
c_0 m_1 m_2
+ n_I + (m_0+c_0)(m_1+c_1)(m_2+c_2) \label{eq:seeks-baseline}
\end{equation}
with $(m_0+c_0)(m_1+c_1)(m_2+c_2)$ the number of output block openings.
\tristan{update}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{./figures/baseline_figure.pdf}
\caption{Overlapping of write and output blocks in the plan
($X_1$,$X_2$), showing cuts (red lines) and matches (green lines)}
\label{fig:baseline_figure}
\end{figure}
As can be seen from Equation~\ref{eq:seeks-baseline}, cuts along dimension 2 generate $\mathcal{O}(A_0 A_1)$
seeks, which is prohibitive when \texttt{A} is large, cuts along dimension 1 generate $\mathcal{O}(A_0)$ seeks and cuts in
dimension 0 generate $\mathcal{O}(1)$ seeks. The \keep heuristic described
in the next section aims at avoiding cuts primarily in dimension 2, and if
possible in dimensions 1 and 0. An experimental verification of the model will be provided in
Section~\ref{sec:experiments}. \tristan{this is not a model, it is a count}
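The following hedged Python sketch illustrates how $s_b$ can be computed from Equation~\ref{eq:seeks-baseline} for uniform write and output blocks (helper names are illustrative, not the actual implementation):
\begin{verbatim}
# Hedged sketch of the baseline seek count s_b;
# A, W, O are per-dimension array and block extents.
def ends(extent, block):
    return {i * block
            for i in range(1, extent // block + 1)}

def cuts_matches(w_ends, o_ends, extent, o_block):
    c = 0
    for i in range(1, extent // o_block + 1):
        inside = [w for w in w_ends
                  if (i - 1) * o_block < w < i * o_block]
        if inside:            # only cut blocks count
            c += len(inside) + 1
    m = len(w_ends | o_ends) - c
    return c, m

def baseline_seeks(A, W, O, n_inputs):
    (c0, m0), (c1, m1), (c2, m2) = [
        cuts_matches(ends(A[d], W[d]),
                     ends(A[d], O[d]), A[d], O[d])
        for d in range(3)]
    openings = (m0 + c0) * (m1 + c1) * (m2 + c2)
    return (A[0] * A[1] * c2 + A[0] * c1 * m2
            + c0 * m1 * m2 + n_inputs + openings)
\end{verbatim}
For instance, configuration 2 of the small array in Table~\ref{tab:exp} would correspond to the call \texttt{baseline\_seeks((3500,)*3, (350,)*3, (500,)*3, 1000)}.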
%----------------------------------------
\section{The \keep heuristic}
%----------------------------------------
As mentioned previously, the baseline strategy empties
the cache at each iteration, which generates many seeks when input and output
blocks have different shapes. Instead, the proposed
\keep heuristic keeps in cache the array elements that cannot be written
contiguously to the output blocks.
\subsection{Overview}
\begin{algorithm}
\caption{\keep heuristic (implements \texttt{getReadWriteBlocks}) \tristan{update}}
\label{algo:keep}
\begin{algorithmic}[1]
\STATE \textbf{Inputs}
\STATE inBlocks: input blocks of shape $I$, stored on disk
\STATE outBlocks: output blocks of shape $O$, to be written
\STATE $m$: memory available
\STATE
\STATE \textbf{Outputs}
\STATE readBlocks: a list of read blocks
\STATE writeBlocks: a list of write blocks
\STATE
\STATE \textbf{Algorithm}
% r = rat
% for x in sorted(divisors(A0), x < r0 ascending=False)
%   if (x, r1, r2) fits in mem:
%     return
\STATE readShapes = candidateReadShapes($I$, $O$, $A$)
\STATE minSeeks = None
\FOR{$R$ in readShapes}
\STATE readBlocks = blocks($R$, $A$)
\STATE writeBlocks = createWriteBlocks(readBlocks, outBlocks)
\STATE mc = peakMemory($R$)
\IF{mc > m}
\STATE \textbf{continue} \COMMENT{Memory constraint cannot be fulfilled}
\ENDIF
\IF{$R$ == $\hat r$}
\STATE bestReadBlocks = readBlocks
\STATE \textbf{break} \COMMENT{$\hat r$ fulfills memory constraint}
\ENDIF
\STATE s = generatedSeeks($I$, $R$, writeBlocks, $O$)
\IF{minSeeks == None or s < minSeeks}
\STATE minSeeks, bestReadBlocks = s, readBlocks
\ENDIF
\ENDFOR
\RETURN bestReadBlocks, writeBlocks
\end{algorithmic}
\end{algorithm}
The \keep heuristic (Alg.~\ref{algo:keep}) is a brute-force search
of the best read shape among a list of candidates (line 11).
For each candidate read shape, it creates a list of write blocks (line
15), determines the memory consumed by this solution (line 16), and provided that the
memory constraint is respected, computes
the number of seeks created by the solution (line
24), and selects the read shape that minimizes the number of seeks (lines
25-27).
The \keep heuristic uses the following functions, described in the
remainder of this section: \texttt{candidateReadShapes} (line 11),
\texttt{blocks} (line 14), \texttt{createWriteBlocks} (line 15),
\texttt{peakMemory} (line 16), and
\texttt{generatedSeeks} (line 24).
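As an illustration, the search loop can be sketched in Python as follows; the helper names mirror Algorithm~\ref{algo:keep} and are passed in as assumptions, not the actual implementation:
\begin{verbatim}
# Hedged sketch of Algorithm 2; helper functions
# named in the text are passed as parameters.
def get_read_write_blocks(I, O, A, out_blocks, m,
        candidate_read_shapes, blocks,
        create_write_blocks, peak_memory,
        generated_seeks, r_hat):
    best, min_seeks = None, None
    for R in candidate_read_shapes(I, O, A):
        read_blocks = blocks(R, A)
        write_blocks = create_write_blocks(
            read_blocks, out_blocks)
        if peak_memory(R) > m:
            continue      # memory constraint violated
        if R == r_hat:    # optimal shape fits: stop
            return read_blocks, write_blocks
        s = generated_seeks(I, R, write_blocks, O)
        if min_seeks is None or s < min_seeks:
            min_seeks = s
            best = (read_blocks, write_blocks)
    return best
\end{verbatim}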
\subsection{Candidate read shapes}
% Read shape candidates are generated as to avoid cuts along dimension 2
% during reads and writes. Ideally, the read shape would cover an exact
% number of input blocks, to avoid seeks during reads, and would include at
% least one output block, to avoid seeks during writes. In practice, this
% might not be possible due to memory limitations, since at each iteration
% the parts of the read block that are not written must remain in cache.
To generate candidate read shapes, we first define a shape $\hat r$ such
that in each dimension $d$, $\hat r_d$ is the smallest multiple of $I_d$
that is greater or equal to $O_d$:
\[
\forall d \in \{0, 1, 2\},
\]
\[
\hat r_d = I_d \left \lceil \frac{O_d}{I_d} \right \rceil
% \enskip \exists \hat k \in \mathbb{N}, \enskip \hat r_d = \hat kI_d \enskip \mathrm{and} \enskip \hat k I_d \geq O_d \enskip \mathrm{and} \enskip \forall k < \hat k, \enskip kI_d < O_d
\]
Using $\hat r$ as read shape minimizes seeking in the input blocks, as
in this situation each iteration reads input blocks completely, requiring a
single seek per input block. If there is enough memory to use write blocks
of shape $O$, then this solution also minimizes the seeks in the output
blocks, as output blocks can be written in a single seek too.
Function \texttt{candidateReadShapes} returns the following shapes:
\[
\Big\{ (C_0, C_1, \hat r_2) \in \mathbb{N}^3 \enskip / \enskip
\forall i \in \{0, 1\}, \enskip \exists k_i \in \mathbb{N}, \enskip
A_i = k_iC_i \enskip \mathrm{and} \enskip C_i \leq \hat r_i \Big\}
\]
The third coordinate of all read shapes is set to $\hat r_2$, to guarantee
that no cuts along dimension 2 will be made during reads. This assumes that
$m \geq \hat r_2$, i.e., a block of shape (1, 1, $\hat r_2$) fits in
memory, which seems reasonable. The other two coordinates are set to divisors of $A_0$ and $A_1$ which are
less than or equal to $\hat r_0$ and $\hat r_1$ respectively.
The candidate shapes are sorted first by decreasing $C_1$ values, then by
decreasing $C_0$ values. The first tested read shape is therefore
$\hat r$. While $\hat r$ is the optimal read shape, leading to a single
seek per input or output block, there is no obvious relation between the
other candidate shapes and the number of seeks that they generate since
this number depends on how input, read, write, and output blocks are
arranged. For this reason, the \keep heuristic returns immediately if $\hat r$
respects the memory constraint (lines 20-23), but it otherwise evaluates all
the other read shapes.
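A hedged Python sketch of \texttt{candidateReadShapes}, following the description above (names are illustrative):
\begin{verbatim}
from math import ceil

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def candidate_read_shapes(I, O, A):
    # r_hat_d: smallest multiple of I_d >= O_d
    r_hat = tuple(I[d] * ceil(O[d] / I[d])
                  for d in range(3))
    shapes = [(c0, c1, r_hat[2])
              for c0 in divisors(A[0]) if c0 <= r_hat[0]
              for c1 in divisors(A[1]) if c1 <= r_hat[1]]
    # decreasing C_1, then decreasing C_0
    shapes.sort(key=lambda s: (s[1], s[0]), reverse=True)
    return shapes
\end{verbatim}
When $\hat r_0$ and $\hat r_1$ divide $A_0$ and $A_1$, $\hat r$ is the first candidate returned.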
\subsection{Block order}
Creating uniform blocks from a block shape is
straightforward, but their order impacts the amount of data
stored in cache and therefore the peak memory consumption. For
simplicity, function \texttt{blocks} returns read blocks in the
order used for array elements on disk (C order in this paper).
\subsection{Creation of write blocks}
\tristan{revise and check. Maybe turn table into graph}
Function \texttt{createWriteBlocks} returns write blocks created from read and output blocks
as follows. For each read block,
eight write blocks are first created, defined as the $F_i$ blocks in
Figure~\ref{fig:nomenclature_overlaps}.
The $F_i$ blocks with $i \in [\![1;7]\!]$ are then merged
recursively with sub-blocks of neighboring read blocks, following the rules
in Table~\ref{tab:fusion}.
In practice, the $F_i$, $i \in [\![1;7]\!]$, are fused with the $F_0$ blocks of
neighboring read blocks.
The coordinates of $F_0$ are computed as follows: its upper-left corner is
$(0,0,0)$ in the coordinate system of the read block, and its bottom-right corner
is the end coordinate of the first output block encountered.
\begin{table}
\centering
\caption{Recursive merging of the initial write blocks. The first column
identifies a sub-block of a read block (i.e., an initial write block),
and the second column identifies the sub-block(s) of the neighboring read
block(s) that are merged with the sub-block of the first column. For instance, sub-block $F_1$ is merged with sub-block $F_0$
of the neighboring read block in dimension $X_2$. \tristan{target blocks are ORs}}
\begin{tabular}{c|l}
\rowcolor{black!25}
Sub-block & Merged with \\
\rowcolor{black!25}
& (sub-block, neighbor direction) \\
$F_0$ & -- \\
\rowcolor{black!10}
$F_1$ & ($F_0$, $X_2$) \\
$F_2$ & ($F_0$, $X_1$) \\
\rowcolor{black!10}
$F_3$ & ($F_1$, $X_1$), ($F_2$, $X_2$) \\
$F_4$ & ($F_0$, $X_0$) \\
\rowcolor{black!10}
$F_5$ & ($F_1$, $X_0$), ($F_4$, $X_2$)\\
$F_6$ & ($F_2$, $X_0$), ($F_4$, $X_1$)\\
\rowcolor{black!10}
$F_7$ & ($F_3$, $X_0$), ($F_5$, $X_1$), ($F_6$, $X_2$)\\
\end{tabular}
\label{tab:fusion}
\end{table}
\begin{figure}
\centering
\includegraphics[width=0.5\textwidth]{./figures/figure_3.pdf}
\caption{Division of a read block into write blocks. $F_0$, the write block represented in blue,
is the output block containing the origin of the read block. $F_1$-$F_7$ are the intersections
of the read block with the neighboring output blocks.
Following this division, write blocks are merged using the scheme in Table~\ref{tab:fusion}.}
\label{fig:nomenclature_overlaps}
\end{figure}
\subsection{Peak memory}
\tristan{update}
To determine the peak memory used by a particular combination of read and
write blocks, one needs to determine the amount of data in the cache at
each iteration, which depends on how read and write blocks overlap. Since we were only able to find an analytical
upper bound of the peak memory usage, which in fact largely overestimates
it, we instead simulate each iteration of the \keep heuristic to measure
the cache size.
The simulation requires $\vert \texttt{readBlocks} \vert$ iterations. It
assumes that (1) read blocks are in the order used for array elements on
disk (C order in this paper), consistently with function \texttt{blocks}
and (2) write blocks are defined as described in the previous section.
Under these assumptions, we note that the
maximum number of sub-blocks of each type stored in the cache at any given
iteration is 1 for sub-blocks of type $F_1$, $A_2/R_2$ for sub-blocks of
type $F_2$ or $F_3$, and $(A_1/R_1)(A_2/R_2)$ for sub-blocks of type $F_4$ to
$F_7$. The simulation is therefore implemented with 3 circular buffers
$B_1$, $B_2$, $B_3$ of sizes 1, $A_2/R_2$ and $(A_1/R_1)(A_2/R_2)$, respectively.
At each iteration, the size of $F_1$ is added to $B_1$, the size of $F_2
\cup F_3$ is added to $B_2$, and the size of $F_4 \cup \ldots \cup F_7$ is
added to $B_3$. The sum of elements in $B_1$, $B_2$ and $B_3$ is computed
and its max across all iterations is returned as the peak memory
consumption. \tristan{add read block size}
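For illustration, a hedged Python sketch of this simulation; \texttt{size\_f1}, \texttt{size\_f23} and \texttt{size\_f47} are assumed helpers returning the cache footprint of the corresponding sub-blocks of a read block, and the size of the read block itself is also counted:
\begin{verbatim}
from collections import deque

# Hedged sketch of the peak-memory simulation with
# three circular buffers (deques with a maxlen).
def peak_memory(read_blocks, A, R, read_block_size,
                size_f1, size_f23, size_f47):
    b1 = deque(maxlen=1)
    b2 = deque(maxlen=A[2] // R[2])
    b3 = deque(maxlen=(A[1] // R[1]) * (A[2] // R[2]))
    peak = 0
    for rb in read_blocks:
        b1.append(size_f1(rb))
        b2.append(size_f23(rb))
        b3.append(size_f47(rb))
        current = sum(b1) + sum(b2) + sum(b3)
        # also count the read block held in memory
        peak = max(peak, current + read_block_size)
    return peak
\end{verbatim}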
\subsection{Number of generated seeks}
\tristan{update}
Similar to the
baseline, seeks generated by the \keep heuristic happen during reads (line
13 in Alg.~\ref{algo:generalrepartition}) and during writes (line 16). To count the seeks generated
during reads, we note that the read process is dual to the write process:
writing in-memory blocks with end coordinates $\mathcal{M}$ to disk blocks
with end coordinates $\mathcal{D}$ generates the same number of seeks as
reading disk blocks with end coordinates $\mathcal{D}$ to in-memory blocks
with end coordinates $\mathcal{M}$.
We can determine this number using Equation~\ref{eq:seeks-baseline} without the $n_I$ term:
\begin{eqnarray*}
\textsc{seeks} \left( \mathcal{M}, \mathcal{D} \right) &=& A_0 A_1 c_2\left(\mathcal{M}_2, \mathcal{D}_2\right) + \\
&& A_0 c_1\left(\mathcal{M}_1, \mathcal{D}_1\right) m_2\left(\mathcal{M}_2, \mathcal{D}_2\right) +\\
&& c_0\left(\mathcal{M}_0, \mathcal{D}_0\right) m_1\left(\mathcal{M}_1, \mathcal{D}_1\right) m_2\left(\mathcal{M}_2, \mathcal{D}_2\right) + \\
&& (m_0+c_0)(m_1+c_1)(m_2+c_2)
\end{eqnarray*}
Function
\textsc{seeks} does not assume that $\mathcal{M}$ or $\mathcal{D}$ relate to
blocks of uniform shape. Therefore, it can be applied to the \keep
heuristic, where write blocks are not necessarily of uniform shape.
The number of seeks generated by the \keep algorithm is given by:
\begin{equation}
s_k = \textsc{seeks} \left( \mathcal{R}, \mathcal{I} \right) + \textsc{seeks} \left( \mathcal{W}, \mathcal{O} \right) \label{eq:seeks-keep}
\end{equation}
The first term represents the number of seeks required to read input
blocks of end coordinates $\mathcal{I}$ into read blocks of end
coordinates $\mathcal{R}$ whereas the second term represents the number of
seeks required to write write blocks of end coordinates $\mathcal{W}$ into
output blocks of end coordinates $\mathcal{O}$.
The block-opening products $(m_0+c_0)(m_1+c_1)(m_2+c_2)$ within these two terms count the
number of input block openings and of output block openings, respectively.
Function
\texttt{generatedSeeks} is
a direct implementation of Equation~\ref{eq:seeks-keep}.
\subsection{Implementation}
\tristan{mention tests and coverage;
mention that we check seek model and memory consumption after each run; mention profiling}
\tristan{Mention array copies in Python, slicing doesnt copy data but references are copied
look at https://stackoverflow.com/questions/5131538/slicing-a-list-in-python-without-generating-a-copy}
We implemented the baseline and \keep heuristic in Python 3.7. The
project repository is available at
\url{https://github.com/big-data-lab-team/repartition_experiments}. It includes documentation on
how to install the package and its dependencies. The code used for the
experiments is also available in the same repository.
The implementation supports the HDF5 file format, but the code is modular
enough to accommodate other formats in the future.
To reduce the time requirements of brute-force search, we limit the testing
of candidate read shapes as follows in the implementation: starting from
$\hat r$ = ($\hat r_0$, $\hat r_1$, $\hat r_2$), we evaluate candidate read
shapes sorted by decreasing $C_1$ values, then by decreasing $C_0$ values.
We stop the search if the best number of seeks (\texttt{minSeeks} in
Alg.~\ref{algo:keep}) does not decrease for 10 iterations.
We could have passed data from the read buffers to the cache by reference,
which would have saved the time spent copying the data.
However, keeping references to the read buffers prevented them from being freed,
which drastically increased the peak memory usage.
Copying the data into the cache instead allowed the implementation to stay below
the peak memory consumption predicted at read-shape selection time.
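This behaviour can be illustrated with NumPy (shapes are illustrative): basic slicing returns a view that keeps the whole read buffer alive, whereas an explicit copy lets the buffer be freed.
\begin{verbatim}
import numpy as np

buf = np.zeros((350, 350, 350), dtype=np.float16)
view = buf[:10, :10, :10]   # view keeps buf alive
piece = view.copy()         # independent copy
del buf, view               # buffer can now be freed;
                            # 'piece' remains valid
\end{verbatim}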
%----------------------------------------
\section{Experiments}
%----------------------------------------
\label{sec:experiments}
\subsection{Seek model validation}
We tested our seek models against lightweight implementations of the baseline and \keep
algorithms, on 1,000 randomly generated cases.
To that aim, we first generate a random original array shape $A = (r,r,r)$ with
$r = x_0^{p_0}x_1^{p_1}x_2^{p_2}x_3^{p_3}$, where the $x_i$ are randomly drawn from
$\{2, 3, 5, 7\}$ and the exponents $p_i$ are randomly set to 1 or 2.
We then compute the divisors of $r$ and take two of them for the shapes of $I$
and $O$. We create 5 cases by picking 5 buffer shapes from the possible buffer
shapes given the configuration $(A, I, O)$. We do the same after exchanging
$I$ with $O$. This gives us 10 configurations for a given randomly generated $A$.
We repeat the process until 1,000 cases have been generated.
Then, for each case, we run our model to compute the number of seeks that the algorithm
should perform, and compare it to the actual number of seeks counted by running the lightweight version
of the algorithm.
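A hedged Python sketch of this case generation, assuming the four factors are drawn with replacement (names are illustrative):
\begin{verbatim}
import random

def random_case():
    # random cubic array shape A = (r, r, r)
    r = 1
    for _ in range(4):
        x = random.choice([2, 3, 5, 7])
        p = random.choice([1, 2])
        r *= x ** p
    divs = [d for d in range(1, r + 1) if r % d == 0]
    I = (random.choice(divs),) * 3  # input block shape
    O = (random.choice(divs),) * 3  # output block shape
    return (r, r, r), I, O
\end{verbatim}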
\begin{table*}
\setlength{\tabcolsep}{2pt}
\centering
\caption{Input and output block shapes}
\vspace*{-0.5cm}
\begin{tabular}{c|ccc|ccc}
\multicolumn{7}{c}{$A=(3500,3500,3500)$}\\
\rowcolor{black!25}
& \multicolumn{3}{c|}{\texttt{inBlocks}} & \multicolumn{3}{c}{\texttt{outBlocks}} \\
\rowcolor{black!25}
Config & $I$ & $n_I$ & size & $O$ & $n_O$ & size \\
\rowcolor{black!25}
& & & (MiB)& & & (MiB) \\
\hline
0 & (875,875,875) & 64 & 1,277 & (875,1750,875) & 32 & 2,555 \\
1 & & & & (700,875,700) & 100 & 818 \\
\rowcolor{black!10}
2 & (350,350,350) &1,000 & 82 & (500,500,500) & 343 & 238 \\
\rowcolor{black!10}
3 & & & & (250,250,250) & 2,744 & 29 \\
4 & (175,175,175) &8,000 & 10 & (250,250,250) & 2,744 & 29 \\
\rowcolor{black!10}
5 & (350,875,350) &400 & 204 & (500,875,500) & 196 & 417 \\
\rowcolor{black!10}
6 & & & & (350,500,350) & 700 & 116 \\
\end{tabular}
\hfill
\begin{tabular}{c|ccc|ccc}
\multicolumn{7}{c}{}\\
\multicolumn{7}{c}{$A=(8000,8000,8000)$}\\
\rowcolor{black!25}
& \multicolumn{3}{c|}{\texttt{inBlocks}} & \multicolumn{3}{c}{\texttt{outBlocks}} \\
\rowcolor{black!25}
Config & $I$ & $n_I$ & size & $O$ & $n_O$ & size \\
\rowcolor{black!25}
& & & (MiB)& & & (MiB) \\
\hline
0 & (2000,2000,2000) & 64 & 15,258 & (2000,4000,2000) & 32 & 30,518 \\
1 & & & & (1600,1600,1600) & 125 & 8,766 \\
\rowcolor{black!10}
2 & (800,800,800) & 1,000 & 977 & (1000,1000,1000) & 512 & 1,907 \\
\rowcolor{black!10}
3 & & & & (500,500,500) & 4,096 & 238 \\
4 & (200,200,200) & 64,000 & 15 & (250,250,250) & 32,768 & 30 \\
5 & & & & (160,160,160) & 125,000 & 8 \\
\rowcolor{black!10}
6 & (400,400,400) & 8,000 & 122 & (500,500,500) & 4,096 & 238 \\
\rowcolor{black!10}
7 & & & & (250,250,250) & 32,768 & 30 \\
\end{tabular}
\label{tab:exp}
\end{table*}
\subsection{Experiment conditions}
The experiments were run on a DELL-EMC PowerEdge C6420 server
with CentOS Linux release 8.1.1911 (Core), kernel release
$4.18.0-147.5.1.\textrm{el}8\_1.\textrm{x}86\_64$, 250GB of RAM,
6 $\times$ SSDs Intel SSDSC2KG480G8R (480~GB, xfs),
2 $\times$ Intel Xeon Gold 6130 CPUs @ 2.10GHz (16 cores/CPU, 2 threads/core).
The server was accessed through the SLURM batch manager.
We repartitioned two arrays of 2-byte elements (float16): a ``small'' one (85.75~GiB) of shape $A=(3500,3500,3500)$
and a ``large'' one (1024~GiB) of shape $A=(8000,8000,8000)$.
We used the input and output block shapes in Table~\ref{tab:exp}.
We repartitioned the small array with $m$=275~GiB (256~GB), 8.6~GiB (8~GB)
and 4.3~GiB (4~GB), to benchmark the baseline and \keep algorithms with different
memory constraints. For the large array we used $m$=256~GB only.
The memory was controlled through SLURM's cgroup-based memory requirement,
which means that $m$ was the total amount of RAM allocated to the
repartitioning task, including anonymous memory and cache. Swapping was
disabled.
A single disk was used when processing the small array. However, for the
large array, the 6 disks available on the server were used, as the capacity of a single disk was not
enough to accommodate the input or the output blocks. We used the disks in
a round-robin fashion.
We did 5 repetitions of each configuration for the small array, and 1
repetition for the large array. The page cache was flushed at the beginning
of each repetition.
\subsection{Results}
\begin{figure*}
\centering
\begin{subfigure}{\textwidth}
\caption{Number of seeks (log scale)}
\includegraphics[width=\textwidth]{./figures/seeks_3500.pdf}
\label{fig:seeks_3500}
\end{subfigure}
\begin{subfigure}{\textwidth}
\caption{Processing time}
\includegraphics[width=\textwidth]{./figures/time_3500.pdf}
\label{fig:time_3500}
\end{subfigure}
\caption{Repartitioning results for the small array (85.75~GiB). Averages over 5 repetitions. \textbf{(a)} Compared to the baseline, the \keep heuristic
reduces the number of seeks by four orders of magnitude. \textbf{(b)} The \keep heuristic provides speed-up factors of up to 2.5 (configuration 4).}
\label{fig:3500}
\end{figure*}
\subsubsection{Small array (85.75~GiB)}
Results are shown in Figure~\ref{fig:3500}.
By design, the number of seeks is drastically reduced by the \keep
heuristic compared to baseline (Fig.~\ref{fig:seeks_3500}), by a factor of
90,000 on average. In configurations 1 to 5 in particular, baseline
requires up to $10^8$ seeks while the \keep heuristic only requires up to
$10^4$.
This reduced seeking translates to important reductions of the
repartitioning time (Fig.~\ref{fig:time_3500}). Compared to baseline, the
\keep heuristic provides speed-up factors of up to 1.4 for
$m$=275~GiB, 1.9 for $m$=8.6~GiB, and 2.2 for $m$=4.3~GiB.
Speed-up is only observed when the baseline requires O($10^8$) seeks, which occurs
in configurations 1 to 5: lower numbers of seeks do not seem to affect
I/O time. Speed-up mostly comes from the reduction in write time, since
the baseline reads complete input blocks without seeking.
% Considering run 4 which shows the most important differences in terms of
% processing time, the amount of seeks have been reduced from about 100 millions
% seeks to thousands of seeks.
% We also remark that although the number of seeks have been reduced from millions
% to thousands of seeks for runs 1 to 5, the reduction in processing time does not
% follow linearly the reduction in the number of seeks.
% This is probably due to the fact that some seeks are more costly than others,
% which has not been taken into account in our problem formulation.
The baseline performance improves drastically with the amount of memory
allocated to the repartitioning. For $m$=275~GiB, the repartitioning times of
the baseline and \keep are on par for all configurations except configuration
4. This is most likely due to asynchronous write-back to disk
through the Linux page cache. In our experiments, the ratio of ``dirty''
data, i.e., the cached data waiting to be written to disk, was set to 40\%, a
common value for compute nodes, which means that the small array entirely
fit in the page cache for $m$=275~GiB. The \keep strategy is mostly useful for
arrays that do not fit in the page cache.
The overhead of the \keep heuristic consists of search time
(Alg.~\ref{algo:keep}) and cache management (data copy to/from cache in
Alg.~\ref{algo:generalrepartition}). While search time is negligible, cache
management is substantial except for configuration 0 where the total number
of blocks is limited.
Table~\ref{tab:buffer_shapes} shows the read block shapes selected by the
\keep heuristic. The optimal shape $\hat r$ was used for all configurations with $m$=275~GiB.
\begin{table}
\centering
\caption{Read block shapes and peak memory estimates selected by the
\keep heuristic to repartition the small image.}
\begin{tabular}[t]{cccc}
\hline
\rowcolor{black!25}
Config & $m$=4.3 GiB & $m$=8.6 GiB & $m$=275 GiB \\
0 & (875, 1750, 875)* & (875, 1750, 875)* & (875, 1750, 875)* \\
& 2.8 GiB & 2.8 GiB & 2.8 GiB \\
\rowcolor{black!10}
1 & (700, 875, 875) & (700, 875, 875) & (875, 875, 875)* \\
\rowcolor{black!10}
& 1.8 GiB & 1.8 GiB & 15.3 GiB \\
2 & (500, 700, 700) & (500, 700, 700) & (700, 700, 700)* \\
& 2.1 GiB & 2.1 GiB & 12.2 GiB \\
\rowcolor{black!10}
3 & (250, 350, 350) & (350, 350, 350) & (350, 350, 350)* \\
\rowcolor{black!10}
& 0.45 GiB & 5.5 GiB & 5.5 GiB \\
4 & (250, 350, 350) & (350, 350, 350)* & (350, 350, 350)* \\
& 0.45 GiB & 5.5 GiB & 5.5 GiB \\
\rowcolor{black!10}
5 & (500, 875, 700) & (500, 875, 700) & (700, 875, 700)* \\
\rowcolor{black!10}
& 1.0 GiB & 1.0 GiB & 11.5 GiB \\
6 & (350, 875, 350)* & (350, 875, 350)* & (350, 875, 350)* \\
& 1.2 GiB & 1.2 GiB & 1.2 GiB \\
\hline
\end{tabular}
* selected read block shape was $\hat r$
\label{tab:buffer_shapes}
\end{table}
\subsubsection{Large array}
\tristan{We can't really present these results unless the overheads are fixed,
including cache management and read times.}
Given the 250~GB of RAM available, the read block shapes selected were always equal to
the shapes of the input block aggregates.
Overall, Figure~\ref{fig:time_8000} shows that, apart from configuration 2, the \keep
strategy and the baseline perform similarly, the \keep strategy being slightly faster
in 5 configurations out of 8.
There is no substantial improvement, however, which is mainly due to three bottlenecks:
\begin{itemize}
\item the overhead time,
\item the preprocessing time,
\item the read time.
\end{itemize}
The large overhead time, especially in configuration 2, is expected, as it
was already substantial for the small array. Since it mostly comes from cache manipulations,
a larger array implies more data transferred during these manipulations.
Interestingly, the preprocessing time can also be very large.
It includes the computation of the write blocks and the
construction of a Python dictionary associating each read block with the output blocks
it intersects; the latter operation can be very expensive when there are many
blocks.
The experiment on the large array is valuable as it clearly shows a limit
of the \keep strategy: Figure~\ref{fig:time_8000} shows that for configurations
2 to 4, the write time (in green) is greatly reduced, but at the
cost of an increase in read time.
Finally, Figure~\ref{fig:seeks_8000} shows that in this experiment, too,
the \keep strategy significantly reduces the number of seeks. Once
again, the reduction in processing time is not linear in the reduction in
seeking.
\begin{figure*}[h]
\centering
\includegraphics[scale=0.36]{./figures/time_8000.pdf}
\caption{Results in terms of processing time for the \keep and baseline algorithms. From left to right, the results are presented for 256, 8 and 4~GB of available main memory.}
\label{fig:time_8000}
\end{figure*}
\begin{figure*}[h]
\centering
\includegraphics[scale=0.36]{./figures/seeks_8000.pdf}
\caption{Results in terms of seeks for the \keep and baseline algorithms. From left to right, the results are presented for 256, 8 and 4~GB of available main memory.}
\label{fig:seeks_8000}
\end{figure*}
%----------------------------------------
\section{Discussion}
%----------------------------------------
\tristan{Tristan to refocus the discussion and conclusion}
%
The \keep heuristic can reduce the number of seeks required to repartition an array
from millions to thousands compared to the baseline
algorithm. As a result, write time is drastically reduced and speed-up
factors of up to \mytodo{x} are observed. Speed-up factors are expected to
increase with the size of the repartitioned array; in 3D, speed-up is
proportional to $A_0 A_1$ when enough memory is available to avoid cuts in
dimension 2. Speed-up factors are also expected to increase with the number of dimensions of
the repartitioned array; in dimension $d$, speed-up is proportional
to the product of the first $d-1$ array extents $A_i$ when enough memory is available.
Therefore, the \keep heuristic is adapted to the repartitioning of large
multi-dimensional arrays in the current context of growing data volumes.
\subsection{Relevant extensions}
% Memory
Allocating only reasonable amounts of working memory, 4.3~GiB for an array
of 85.75~GiB in our experiments, seems to be sufficient to remove most of
the seeks required for the repartitioning. This is consistent with the fact
that the \keep heuristic only stores partial output blocks in cache, until
they can be combined into complete output blocks. In the future, it would
be interesting to derive an upper bound of the peak memory consumption, to
generalize this observation beyond the particular input and output block shapes
tested in our experiments.
% Page cache
The memory page cache provided by the Linux kernel plays a significant role
in the repartitioning problem when the amount of memory increases. When
repartitioning a 85.75~GiB array with 275~GiB of working memory, all the
output blocks are written to memory and asynchronously flushed to disk,
bringing the baseline algorithm on par with the \keep heuristic. This is
not surprising given the importance of page cache for I/O intensive
applications~\cite{hayot2019performance}. The Linux kernel provides other
relevant I/O strategies such as readahead and clustered writes that may
interfere with the \keep heuristic. Incorporating awareness of kernel I/O
strategies into the repartitioning algorithm may further enhance
performance.
% Parallelization and clusters
Most applications would process large multi-dimensional arrays in parallel,
on a multi-core computer or on an HPC cluster. The \keep heuristic is
directly applicable to multi-core environments, assuming sequential I/Os.
For HPC clusters, however, it would have to be extended to account for network data
transfers. In addition, HPC clusters would most
likely store input blocks on a parallel file system such as Lustre where
files are striped across multiple disks and nodes. An extended formulation of the
repartitioning problem would be required in this context.
It would be interesting to integrate such an extension in the Dask engine,
a popular parallelization framework of the SciPy ecosystem~\cite{matthew_rocklin-proc-scipy-2015}.
In particular, the \keep heuristic would interface nicely with the \texttt{dask.array} API.
\tristan{Timothée, you wrote this: "there is still margin for improvement in optimizing
the read time.". Could you explain why?}
\timothee{What I mean is that, in the end, the ``buffer extension'' method reduces seeks for writes but not for reads, whereas initially we wanted it to reduce seeks for both writes AND reads}
Other array manipulation problems could benefit from a similar approach.
% Buffer ordering
% Seek types
% The buffer order is important when using the keep strategy, as it defines when
% a piece of data will be loaded and when a write buffer will be complete, hence
% ready to be written down and freed from the cache.
% Representing the read buffers by a complete bidirectional graph in which each
% vertex is a buffer, we can model the problem as kind of a complex shortest path
% problem.
% Indeed, for a given buffer shape $B$, we want to find the read buffers order
% that will keep the amount of memory used by the re-partition algorithm low.
% Solving this problem would incur more infering at runtime and a potentially
% complex algorithm to run.
% We decided to keep things simple using a naive buffer order, as the goal of
% this study was primarily to assess if the keep algorithm works.
% Using the keep strategy, one may order the buffer loadings to reduce the maximum
% quantity of extra data stored in memory, hence, reducing the amount of seeks.
% For example, if the shape mismatches occur only in the $k$ axis, loading the
% buffers in this direction will enable recycling the extra data in cache,
% resulting in a smallest memory consumption over time.