diff --git a/doc/book/random/python_code_that_suck_less.Rmd b/doc/book/random/python_code_that_suck_less.Rmd
index 899fb80de..4d81825cf 100644
--- a/doc/book/random/python_code_that_suck_less.Rmd
+++ b/doc/book/random/python_code_that_suck_less.Rmd
@@ -326,6 +326,10 @@ const auto& x_ijk = x[{i, j, k}];
 This is aesthetically better and more meaningful than
 ```{cpp}
 const auto& xijk = ...;  // too crammed
+
+// The following is fine too, as typing the underscore symbol can be a hassle.
+// I am not a snake_case zealot either...
+const auto xi = x[i];
 ```
 or the ugly
diff --git a/doc/book/random/vector_intrinsics.Rmd b/doc/book/random/vector_intrinsics.Rmd
index fb3b44c8e..9abcfff02 100644
--- a/doc/book/random/vector_intrinsics.Rmd
+++ b/doc/book/random/vector_intrinsics.Rmd
@@ -3,9 +3,11 @@
 Sounds boring? I promise you this is going to be way more interesting than you
-might think. There's quite a bit interesting things to learn.
+might think. There are quite a few interesting things to learn.
 
-Story time before we dive into an existing example.
+Story time before we dive into the implementation of the Gaussian blur.
 
-I applied for a C++ technical leadership role once at some company. I was
+## Story Time
+
+Once, I applied for a C++ technical leadership role at some company. I was
 considered for the role after a preliminary behavioral screening by the CTO. He
 then told me who I would be interviewing with in the next interviews. A few
 days later, he called me back for a quick chat. He ended up telling me they
 would not
@@ -14,56 +16,85 @@ explaining why, I guessed that probably one of their senior engineers
 disqualified me as he did care more about my ability in understanding and
 manipulating CPU vector instructions.
 
-The CTO profusedly apologized to me by saying that I certainly was gifted but my
-image processing code was very slow because of my naive implementation. It did
-sound unfair to me.
+The CTO profusely apologized to me. He said politely that I certainly was
+gifted but my C++ code was not up to their standards. From what I guessed,
+they probably found that my image processing code was implemented too naively.
+
+I based this on the fact that the CTO told me they were doing lots of
+optimization involving CPU vector intrinsics in order to process data as fast
+as possible. He added that it really was not that difficult and that I could
+get up to speed fairly quickly.
 
-I have mostly an applied math background, so it sounded unfair to me. It did
+I have mostly an applied math background, so it did sound unfair to me. It
 made me feel that you are never enough whatever you achieve. Like you do need
 to know every single aspect of engineering from high level to low level. In
 hindsight I would not have been happy with the job anyways. Still frustrating...
 
 In that moment, I was telling myself: what can you do when you are already being
-shut the door? Going by the same logic, David Lowe showcased his SIFT research
-code, he would not qualify for the job either: I learnt from studying his code.
+shown the door? Going by the same logic, if David Lowe showcased his SIFT
+research code, he would not qualify for the job either: I learnt from studying
+his code.
 
 Right, back to the topic: today how can we achieve that? Nowadays, we have some
 answers with Halide. And we can do it very elegantly without coding in assembly
 directly.
 
-The main beauty with Halide is that you can decouple (1) the algorithm: the
-separable convolution and (2) the scheduling strategy, i.e., how do you make it
-as fast as the baseline: OpenCV's Gaussian blur?
+## What is Halide?
+
+Halide is a domain-specific language, embedded in C++, designed for writing
+fast image processing pipelines. With Halide, you can write an image
+processing filter and optimize the way it is run.
+
+You write a set of arithmetic instructions that operate on image buffers with
+specific parallelism patterns (multicore and vectorization). Then you tell
+Halide, via a C++ method, to compile the optimized image filter into a C++
+static library.
+
+The main beauty with Halide is that you can decouple:
+
+1. the algorithm: here, the separable convolution, and
+2. the scheduling strategy: the parallelism strategy that makes it as fast as
+   the baseline, OpenCV's Gaussian blur, if not faster.
 
 Halide can check and provide guarantees that your algorithm remains correct for
 the schedule you are implementing.
 
-Another beauty with Halide is that it abstracts CPU, GPU intrinsics for you in a
-generic C++ code for a wide range of platforms. You don't have to be an expert
-in CPU vector intrinsics, but you do need to know the scheduling strategies to
-say optimize the speed at which your convolutional operations run. So unless you
-read publications and the presentations, it is hard for the layman or the novice
-to know what works and what doesn't.
+With Halide, you won't hand-write any nested `for` loops, multi-threaded
+processing or bound checking, so you can express ideas at a higher level.
+Halide abstracts these parallelisms for you and supports a wide range of
+platforms. You don't have to be an expert in CPU vector intrinsics, but you do
+need to know the scheduling strategies to, say, optimize the speed at which
+your convolutional operations run. Halide has identified the most common
+schedule patterns that are used to optimize image processing code.
 
-## Naive Implementation of the Gaussian blur in C.
+You still need to skim the publications and the presentations, and practise.
+But in my experience, it remains difficult for the layman or the novice to
+identify the schedule patterns that work and those that don't.
 
-Right, myself more than 15 years ago, let's imagine that I am a fresh graduate
-with a Master's degree looking for an internship. I was a bit aware of CPU
-registers but I can't see how it would fit in the grand scheme of things. I've
-heard of vector instructions but never really. As an avid StackOverflow reader,
-I have been told the big lie: don't prematurely optimize and just let the
-compiler do its job. And just make things work.
-I did learn the Gaussian filter is separable. Don't write the 2D convolution,
-but exploit its separable properties. With the innocence of my youth, I am
-proudly exhibiting a naive implementation in C++ in my Sara repo, something
-along the lines:
+## Naive C++ Implementation of the Gaussian Blur
+
+Right, let's rewind time and picture myself more than 15 years ago: freshly
+graduated with a Master's degree and looking for an internship. I am a bit
+aware of CPU registers but I can't see how they fit in the grand scheme of
+things. I've heard of vector instructions but never understood what they were
+about. As an avid Internet reader, I have been sold the big lie: don't
+prematurely optimize, just let the compiler do its job, and just make things
+work.
+
+I did learn in class that the Gaussian filter is separable: don't write the 2D
+convolution naively, exploit its separable property.
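+
+As a quick reminder (in my own notation), separability means that the 2D
+Gaussian kernel factors into the product of two 1D kernels
+$g_\sigma(x) \propto \exp(-x^2 / (2 \sigma^2))$:
+
+$$
+G_\sigma(x, y) = g_\sigma(x) \, g_\sigma(y)
+\quad \Longrightarrow \quad
+I * G_\sigma = (I *_x g_\sigma) *_y g_\sigma,
+$$
+
+where $*_x$ and $*_y$ denote the convolutions along the rows and along the
+columns. For a truncated kernel of size $k$, each pixel then costs $2k$
+multiply-adds instead of $k^2$ with the naive 2D convolution.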
+
+Icing on the cake: I have just discovered multicore parallelism via OpenMP. I
+am proudly exhibiting a naive implementation in C++ in my Sara repo, something
+along these lines:
+
 ```{Rcpp}
 auto conv_x(const float *in, const float *kernel, float *out,
             const int w, const int h, const int ksz) -> void
 {
   const auto r = ksz / 2;
+
+#ifdef _OPENMP
+#pragma omp parallel for
+#endif
   for (auto y = 0; y < h; ++y)
   {
     for (auto x = 0; x < w; ++x)
@@ -88,11 +119,19 @@
     }
   }
 }
+```
+
+We execute the same dance for the y-convolution:
+
+```{Rcpp}
 auto conv_y(const float *in, const float *kernel, float *out,
             const int w, const int h, const int ksz) -> void
 {
   const auto r = ksz / 2;
+
+#ifdef _OPENMP
+#pragma omp parallel for
+#endif
   for (auto y = 0; y < h; ++y)
   {
     for (auto x = 0; x < w; ++x)
@@ -119,27 +158,32 @@
 }
 ```
 
-I write unit tests, validate on synthetic tests, check the boundary conditions
-and try it on a real image with a Gaussian kernel. I am happy it works
-reasonably fast when compiling in Release mode on Visual Studio. Job done! I am
-proudly showcasing my code on GitHub. Never complained about it as I bothered
-about real-time issues as later I learnt about in CUDA and would write in CUDA
-anyways.
+I diligently write unit tests, validate on synthetic tests, check the boundary
+conditions and try it on a real image with a Gaussian kernel. I am happy it
+works reasonably fast when compiling in Release mode on Visual Studio. Job
+done! I am proudly showcasing my code on GitHub. I never complained about its
+speed: whenever real-time performance mattered, I would learn CUDA later and
+write that code in CUDA anyways.
 
-## 2D separable convolutional operation.
+## Halide Implementation of the Algorithm
 
-Many years later, the CTO and his minion tells you that you are not good enough
-without elaborating what he was expecting. OK, let's make it better!
+Back to the present: the CTO and his minions tell you that you are not good
+enough, without elaborating why or what they expected to see.
+
+Fine! Let's see how we could optimize the code...
+
+OK, how? I vaguely understand that you have to write the code with CPU
+intrinsics. Should I write it in C style or in assembly? And how do you do it
+with limited bandwidth after
 finishing your day job and wanting to learn?
 
-Use Halide! Halide is very elegant language. Here is how we could rewrite the 2D
-separable convolution.
+Use Halide? Yes! It is a very elegant language. As a bonus, it can also
+compile your code directly to GPU backends such as OpenCL, Vulkan or Direct3D
+12 compute.
 
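+For instance, here is a sketch of the ahead-of-time path as I understand it
+(the names below are illustrative, so double-check the Halide documentation):
+we compile a pipeline into a static library for a chosen target.
+
+```{cpp}
+#include <Halide.h>
+
+auto main() -> int
+{
+  auto x = Halide::Var{"x"};
+  auto y = Halide::Var{"y"};
+
+  // A hypothetical, trivial pipeline standing in for a real filter.
+  auto my_filter = Halide::Func{"my_filter"};
+  my_filter(x, y) = Halide::cast<float>(x + y);
+
+  // Compile for the host CPU; enable a feature to target a GPU backend.
+  auto target = Halide::get_host_target();
+  // target = target.with_feature(Halide::Target::OpenCL);
+
+  // Emits a static library and a C header exposing the filter function.
+  my_filter.compile_to_static_library("my_filter", {}, "my_filter", target);
+  return 0;
+}
+```
+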
-```
+Here is how we could rewrite the 2D separable convolution. Let us write the
+implementation first.
+
+```{cpp}
 #include <Halide.h>
 
 auto x = Halide::Var{"x"};
@@ -155,31 +199,94 @@
 auto tile = Halide::Var{"tile"};
 
 auto input = Halide::Func{"input"};
 auto kernel = Halide::Func{"kernel"};
+
+static constexpr auto sigma = 3.f;
+static constexpr auto truncation_factor = 4.f;
+static constexpr auto ksz = static_cast<int>(2 * sigma * truncation_factor);
+static constexpr auto kr = ksz / 2;
+
 input(x, y) = x + y;
-kernel(x) = Halide::exp(-x);
+kernel(x) = Halide::exp(-(x * x) / (2.f * sigma * sigma));
 
 auto conv_x = Halide::Func{"conv_x"};
 auto conv_y = Halide::Func{"conv_y"};
 
-const auto ksz = 20;
 auto k = Halide::RDom(0, ksz, "k");
 
 // The algorithm
-conv_x(x, y) = Halide::sum(input(x + k - ksz / 2, y) * kernel(k), "conv_x");
-conv_y(x, y) = Halide::sum(conv_x(x, y + k - ksz / 2) * kernel(k), "conv_y");
+conv_x(x, y) = Halide::sum(input(x + k - kr, y) * kernel(k));
+conv_y(x, y) = Halide::sum(conv_x(x, y + k - kr) * kernel(k));
 ```
 
-The optimal scheduling is not that obvious in CPU.
+## Schedule Optimization
+
+### Two Types of Parallelism on CPUs
 
-The first obvious strategy to do is to calculate the x-convolution first. Then
-calculate the transpose of the y-convoluved map and transpose it back to obtain
-the final results.
+There are two types of parallelism on the CPU.
 
-We can do a lot better by dividing the final convolved image into tiles of 64
-by 64 pixels. Each image block is like a smaller independent image on which we
-perform the 2D separable convolution.
+1. Multicore processing:
+   This is straightforward to understand: it is about keeping all the CPU
+   cores as busy as possible with minimal data sharing. OpenMP is simple and
+   helps to parallelize an image filter quite easily, once we identify the
+   parts of the algorithm that operate independently.
+
+2. Vector instructions:
+   Until I implemented filters with Halide, I could not understand what CPU
+   vector instructions were really about. I am not going to pretend to be an
+   expert in CPU optimization, but this little paragraph should convince you
+   why it is so interesting to apply vector instructions wherever possible.
+   As a first approximation, a CPU vector instruction enables the programmer
+   to perform an arithmetic operation on small vectors in a single CPU cycle.
+   Typically, arithmetic operations such as addition and multiplication can
+   operate on 4D vectors. That is where we can observe an additional 4x
+   speed-up, or more if your CPU can process those operations on bigger
+   vectors.
+
+For more accurate and more comprehensive information, I will simply encourage
+you to do your own research and share what you have learnt.
+
+Like image filters, BLAS routines make extensive use of both types of
+parallelism on CPU platforms.
+
+### Schedule 1
+
+The obvious strategy is to start from the idea of separable convolution and
+vectorize the convolution wherever possible.
+
+```{cpp}
+kernel.compute_root();
+
+conv_x
+    .compute_root()
+    .parallel(y)
+    .vectorize(x, 32, Halide::TailStrategy::GuardWithIf)
+    ;
+conv_y
+    .parallel(y)
+    .vectorize(x, 32, Halide::TailStrategy::GuardWithIf)
+    ;
+```
+
+Then you decide to compete with OpenCV and you realize that it crushes your
+implementation by being 2x faster.
+
+In fact, as exposed in the Halide presentations, the optimal schedule on CPU
+is really not obvious. Once you dig into the Halide publications, you start to
+understand how much work and expertise it takes to optimize a typical image
+processing filter: they mention a Photoshop filter whose optimization took 3
+months of hard work with hand-written CPU vector intrinsics.
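+
+To see why Schedule 1 leaves performance on the table, here is my
+second-guessing of what it roughly asks the compiler to generate for the
+x-convolution (a mental model, not Halide's actual output; `convolve_row_at`
+is a hypothetical helper and the variables mirror the naive C++ version):
+
+```{cpp}
+// conv_x.compute_root().parallel(y).vectorize(x, 32, GuardWithIf):
+// one thread per row, x swept in 32-wide chunks, and an `if` guard that
+// handles the last chunk when the width is not a multiple of 32.
+#pragma omp parallel for  // .parallel(y)
+for (auto y = 0; y < h; ++y)
+{
+  for (auto x0 = 0; x0 < w; x0 += 32)  // .vectorize(x, 32, ...)
+  {
+    const auto x1 = x0 + 32 <= w ? x0 + 32 : w;  // ... GuardWithIf
+    for (auto x = x0; x < x1; ++x)  // conceptually a single vector operation
+      out[y * w + x] = convolve_row_at(in, kernel, x, y);
+  }
+}
+```
+
+Note also that `compute_root()` materializes the whole intermediate `conv_x`
+image before `conv_y` starts, so that buffer makes a full round trip to main
+memory. This is precisely what the tiled schedule below avoids.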
+
+The first step to achieve this is to divide the final convolved image into
+tiles of 64 by 64 pixels. The set of tiles can be seen as an input batch of
+many smaller images. The output is another batch of image tiles with the same
+sizes (let's just assume that the image width and height are multiples of 64).
+
+Because each output image tile is independent of the others, we can calculate
+a smaller x-convolution and y-convolution for each tile. The intermediate
+x-convolution now fits in the CPU cache, which improves data locality. Then we
+exploit the CPU vector instructions to calculate the convolutions in batches.
+
+### Schedule 2: Parallel Tile Processing, Vectorization and CPU Cache Fitting
 
-Scheduling
-```
+```{cpp}
 // The schedule
 kernel.compute_root();
@@ -188,20 +295,19 @@
 conv_y  //
     .tile(x, y, xo, yo, xi, yi, 64, 64)
     .fuse(xo, yo, tile)
     .parallel(tile)
-    // .parallel(yo)
     .vectorize(xi, 4, Halide::TailStrategy::GuardWithIf)  //
     ;
-conv_x  //
-    // .store_at(conv_y, tile)
+conv_x
     .compute_at(conv_y, xi)  //
     .vectorize(x, 4, Halide::TailStrategy::GuardWithIf)  //
     ;
 ```
 
+#### Second-Guessing What Halide Does
+
 There is a lot to unpack here. Let's break it down bit by bit.
 
 The line
-```
+```{cpp}
 conv_y.tile(x, y, xo, yo, xi, yi, 64, 64, Halide::TailStrategy::GuardWithIf)
       .fuse(xo, yo, tile_index)
       .parallel(tile_index);
@@ -211,7 +317,7 @@ The line translates in C++ as
 
 This is according to my second-guessing:
 
-```
+```{cpp}
-#pragma omp parallel for  // .parallel(y)
-for (auto tile_index = 0; tile_index < T; ++T)
+#pragma omp parallel for  // .parallel(tile_index)
+for (auto tile_index = 0; tile_index < T; ++tile_index)
 {
@@ -281,14 +387,16 @@ spend some time to learn how to actually read assembly code.
-Halide developers has done a very good job to help us understand what assembly
-code is generated mapping the assembly code to the pseudo code.
+Halide developers have done a very good job of helping us understand what
+assembly code is generated, by mapping the assembly code to the pseudo-code.
 
-```
-# Generate code as follows
-conv_y.compile_to_stmt("separable_conv_2d.stmt.html", {}, Halide::HTML);
+```{cpp}
+// Generate the lowered pseudo-code as follows:
+conv_y.compile_to_lowered_stmt("separable_conv_2d.stmt.html", {}, Halide::HTML);
 ```
 
-It becomes clear that convolution operation operates in batch via fma operation.
+Then it becomes clear that the convolution is implemented in batches, where we
+calculate 4, 8 or 16 convolved values at a time by repeating the vectorized
+fused multiply-add instruction `fmla.4s`.
 
 References:
 - https://diveintosystems.org/book/C9-ARM64/common.html
-- https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/FMLA--vector---Floating-point-fused-Multiply-Add-to-accumulator--vector--?lang=en
+- https://developer.arm.com/documentation/