-
Notifications
You must be signed in to change notification settings - Fork 120
Sequence combinators
Currently, streaming in Accelerate, as described here and implemented in the streaming
branch, suffers from a few major drawbacks.
-
We're very limited in what we can express. There is of course the regularity constraint, but even if we accept that restriction for now, there are still other needless restrictions like sequence length, etc.
-
Since adding support for sequence fusion, the AST has become quite complex. A large part of this is because we maintain a second environment for sequences instead of using the array environment.
-
There is a lot of working being done in the CUDA backend to support sequence execution. Ideally, we should be doing as much as possible in the frontend.
-
Communication and computation doesn't overlap even though there is no real reason it shouldn't.
-
Our ability to stream data from host memory is very limited. Most of this is CUDA's fault, but we can do a little better, as described below.
-
We don't have an efficient implementation of
fromSeq
. Currently we rely on(++)
, but that results in O(n^2) complexity when used to concatenate many arrays.
Here are the new proposed set of sequence combinators, with some new helper structures. See here for a complete version of AST.hs with comments.
data Producer index acc aenv a where
Pull :: Arrays a
=> Source a
-> Producer index acc aenv a
Subarrays :: (Shape sh, Elt e, sh :<= DIM3)
=> PreExp acc aenv sh
-> Array sh e
-> Producer index acc aenv (Array sh e)
Produce :: Arrays a
=> Maybe (PreExp acc aenv Int)
-> PreOpenAfun acc aenv (index -> a)
-> Producer index acc aenv a
MapAccumFlat :: (Arrays a, Shape sh, Elt e, Shape sh', Elt e')
=> PreOpenAfun acc aenv (a -> Vector sh -> Vector e -> (a, Vector sh', Vector e'))
-> acc aenv a
-> acc aenv (Array sh e)
-> Producer index acc aenv (Array sh' e')
ProduceAccum :: (Arrays a, Arrays b)
=> Maybe (PreExp acc aenv Int)
-> PreOpenAfun acc aenv (index -> b -> (a, b))
-> acc aenv b
-> Producer index acc aenv a
data Consumer index acc aenv a where
FoldSeqFlatten :: (Arrays a, Shape sh, Elt e)
=> PreOpenAfun acc aenv (a -> Vector sh -> Vector e -> a)
-> acc aenv a
-> acc aenv (Array sh e)
-> Consumer index acc aenv a
Iterate :: Arrays a
=> Maybe (PreExp acc aenv Int)
-> PreOpenAfun acc aenv (index -> a -> a)
-> acc aenv a
-> Consumer index acc aenv a
Stuple :: (Arrays a, IsAtuple a)
=> Atuple (Consumer index acc aenv) (TupleRepr a)
-> Consumer index acc aenv a
data Source a where
List :: Arrays a
=> [a]
-> Source a
RegularList :: (Shape sh, Elt e)
=> sh
-> [Array sh e]
-> Source (Array sh e)
The following table indicates at what stages of the pipeline these AST nodes will actually occur as well as if they can be vectorised. As can be seen, after vectorisation we have an almost entirely different syntax.
Combinator | Pre-vectorisation | Post-vectorisation | Vectorisable |
---|---|---|---|
Pull | ✔️ | ✔️ | ✔️ |
SubArrays | ✔️ | ✔️ | |
Produce | ✔️ | ️✔️ | |
MapAccumFlat | ✔️ | ️✔️* | |
ProduceAccum | ✔️ | ||
FoldSeqFlatten | ✔️ | ️✔️* | |
Iterate | ✔️ | ||
Stuple | ✔️ | ✔️ | ️✔️ |
* These combinators are vectorisable in as much as the functions they contain already take whole chunks as arguments.
Ideally, we'd like to capture the which stage of the pipeline certain operations can be expected in the type, as described in #220.
Conspicuously absent are operations like MapSeq
and ZipWithSeq
. Both of these can be implemented in terms of Produce
, so there's not really any benefit to having them cluttering up the AST definition.
We also introduce this array operation. See below for why.
Subarray :: (Shape sh, Elt e)
=> PreExp acc aenv sh -- index of subarray
-> PreExp acc aenv sh -- extent of subarray
-> Array sh e
-> PreOpenAcc acc aenv (Array sh e)
It's tempting to think that we could get rid of Subarrays
and just use a combination of Produce
and Subarray
. However, Subarray
is not actually vectorisable. We generate SubArray
during vectorisation, but it should not occur before.
Because of our nested array representation, we had to add IndexTrans
to the streaming
branch, which looks like this:
IndexTrans :: Shape sl
=> PreOpenExp acc env aenv sl
-> PreOpenExp acc env aenv sl
Given a shape, it yields the transpose of that shape, essentially reversing the order of the dimensions. The reason we had to do this was we encountered the problem of needing to get the outermost dimension or all dimensions except the outermost (i.e. indexLast
and indexInit
). Rather than just adding these two as primitives, it made more sense to add IndexTrans
as it also has the benefit of letting us write a rank polymorphic array transpose and slice along the outer dimension.
TLM @robeverest: it sounded like you needed something like this?
If we don't have an explicit toSeq
AST node, as is proposed above, we also need to add something along the lines of,
ToSlice :: (Shape sh, Elt slix)
=> SliceIndex (EltRepr slix) sl co (EltRepr sh)
-> PreOpenExp acc env aenv slix
-> PreOpenExp acc env aenv sh
-> PreOpenExp acc env aenv Int
-> PreOpenExp acc env aenv slix
This essentially allows us to go from a the linear index of a slice in an array to the the actual slice index.
We are very limited in the way we can stream data from host memory into the device. If we are performing a natural sequence computation (elements are computed one at a time, not in chunks), it's fairly straightforward. However, if we want to perform a chunked sequence computation, we need to transfer the data in bulk.
Currently, our "use-lazy" optimisation allows us to load many slices of an array in at once. This works, but it has the following disadvantages:
- As CUDA only supports strided copies of either a 2D or 3D array, it's hard to work with arrays of higher dimensions. We do our best to minimise the number of copies necessary, but it is possible to construct pathological cases where every element of a chunk is copied separately.
- We can only load slices of an array, not subarrays. For example, it would be nice to be able split a matrix up into submatrices.
- CUDA has a maximum pitch (the width of the source array) for all its strided copy functions. This seems to be 2GB on most hardware.
The basic idea for fixing this, at least in part, is to introduce a new sequence operation:
subarrays :: (Shape sh, Elt e, sh :<= DIM3)
=> Exp sh
-> Array sh e
-> Seq [Array sh e]
To explain how this works, it's best to view an example.
Supposing we have this array that we want to extract 25 submatrices from, in the order given.
If we're dealing with a natural sequence, then it's just a straightforward 2D memcpy
for each subarray. If the sequence is chunked, however, we need to do some extra work. Supposing the ideal chunk size contained 3 of the the submatrices, the first chunk could be copied with one call to a 2D memcpy
.
However, the 2nd chunk goes over 2 columns. The only way to resolve this is to do 2 memcpy
s like so:
Provided the chunk size is less than the height of the array, we can always copy a chunk with no more than 2 memcpy
s. If the chunk size is more than the height of the array, we have extra problems.
Take this different example. Supposing the chunk size contained 7 subarrays, we would need to transfer the first and second chunks as shown in these diagrams.
In general, our chunks will always contain three components, which we'll call the head, body, and tail. Each of these need to be transferred separately. In addition, the body is not always in the form we want it. It is entirely possible that it will straddle two whole columns, as in the 2nd chunk. Because our nested array representation is all subarrays concatenated along the outer dimension, transferring the body is going to produce an incorrect ordering. This is trivially solved by an extra permutation, which, if fused into the rest of the computation, should not cause any more overhead than needed.
Using this technique we can always do a chunked transfer in at most 3 memcpy
s and with one potential permutation. Slicing up an array is simply a more specific form of subarray extraction, so this subsumes that (at least for arrays of DIM3 or lower).
In the case of 3 dimensional arrays, this technique requires 5 memcpy
s and 3 permutations. Conceptually, it should generalize to any dimension, but I (Rob) don't think there is any benefit in trying to write a rank polymorphic implementation due to CUDA's restrictions.
This technique should also for loading subarrays in row-major order as well. The only difference being that head, body and tail will all need to be permuted.
TODO: Backend is oblivious. Algorithm is implemented in accelerate and injected during vectorisation.
TODO