From fdd8855d4c85f135a5045e49226c2e7fec1a3f35 Mon Sep 17 00:00:00 2001 From: "Edward J. Parkinson" Date: Wed, 31 Jul 2024 11:36:02 +0100 Subject: [PATCH 1/3] re-structure parts of episode 7 and episode 8 into optional episodes, at the end of the course --- _episodes/07-derived-data-types.md | 426 +++++ ...-to-mpi.md => 08-porting-serial-to-mpi.md} | 1126 ++++++------ ...optimising-mpi.md => 09-optimising-mpi.md} | 0 ...tterns.md => 10-communication-patterns.md} | 6 +- ...cation.md => 11-advanced-communication.md} | 1624 +++++++---------- 5 files changed, 1605 insertions(+), 1577 deletions(-) create mode 100644 _episodes/07-derived-data-types.md rename _episodes/{09-porting-serial-to-mpi.md => 08-porting-serial-to-mpi.md} (97%) rename _episodes/{10-optimising-mpi.md => 09-optimising-mpi.md} (100%) rename _episodes/{08-communication-patterns.md => 10-communication-patterns.md} (99%) rename _episodes/{07-advanced-communication.md => 11-advanced-communication.md} (56%) diff --git a/_episodes/07-derived-data-types.md b/_episodes/07-derived-data-types.md new file mode 100644 index 0000000..78b1ddc --- /dev/null +++ b/_episodes/07-derived-data-types.md @@ -0,0 +1,426 @@ +--- +title: Derived Data Types +slug: "dirac-intro-to-mpi-derived-types" +teaching: 25 +exercises: 20 +questions: +- How do I use complex data structures in MPI? +- What is contiguous memory, and why does it matter? +objectives: +- Understand the problems of non-contiguous memory in MPI +- Know how to define and use derived data types +keypoints: +- Any data being transferred should be a single contiguous block of memory +- By defining derived data types, we can more easily send data which is not contiguous +--- + +We've so far seen the basic building blocks for splitting work and communicating data between ranks, meaning we're now +dangerous enough to write a simple and successful MPI application. We've worked, so far, with simple data structures, +such as single variables or small 1D arrays. In reality, any useful software we write will use more complex data +structures, such as n-dimensional arrays, structures and other complex types. Working with these in MPI require a bit +more work to communicate them correctly and efficiently. + +To help with this, MPI provides an interface to create new types known as *derived data types*. A derived type acts as a +way to enable the translation of complex data structures into instructions which MPI uses for efficient data access +communication. In this episode, we will learn how to use derived data types to send array vectors and sub-arrays. + +> ## Size limitations for messages +> +> All throughout MPI, the argument which says how many elements of data are being communicated is an integer: `int +> count`. In most 64-bit Linux systems, `int`'s are usually 32-bit and so the biggest number you can pass to `count` is +> `2^31 - 1 = 2,147,483,647`, which is about 2 billion. Arrays which exceed this length can't be communicated easily in +> versions of MPI older than MPI-4.0, when support for "large count" communication was added to the MPI standard. In +> older MPI versions, there are two workarounds to this limitation. The first is to communicate large arrays in +> smaller, more manageable chunks. The other is to use derived types, to re-shape the data. +{: .callout} + +Almost all scientific and computing problems nowadays require us to think in more than one dimension. 
Using +multi-dimensional arrays, such for matrices or tensors, or discretising something onto a 2D or 3D grid of points +are fundamental parts of a lot of software. However, the additional dimensions comes with additional complexity, +not just in the code we write, but also in how data is communicated. + +To create a 2 x 3 matrix, in C, and initialize it with some values, we use the following syntax, + +```c +int matrix[2][3] = { {1, 2, 3}, {4, 5, 6} }; // matrix[rows][cols] +``` + +This creates an array with two rows and three columns. The first row contains the values `{1, 2, 3}` and the second row +contains `{4, 5, 6}`. The number of rows and columns can be any value, as long as there is enough memory available. + +## The importance of memory contiguity + +When a sequence of things is contiguous, it means there are multiple adjacent things without anything in between them. +In the context of MPI, when we talk about something being contiguous we are almost always talking about how arrays, and +other complex data structures, are stored in the computer's memory. The elements in an array are contiguous when the +next and previous elements are stored in adjacent memory locations. + +The memory space of a computer is linear. When we create a multi-dimensional array, the compiler and operating system +decide how to map and store the elements onto that linear space. There are two ways to do this: +[row-major or column-major](https://en.wikipedia.org/wiki/Row-_and_column-major_order). The difference +is which elements of the array are contiguous in memory. Arrays are row-major in C and column-major in Fortran. +In a row-major array, the elements in each column of a row are contiguous, so element `x[i][j]` is +preceded by `x[i][j - 1]` and is followed by `x[i][j +1]`. In Fortran, arrays are column-major so `x(i, j)` is +followed by `x(i + 1, j)` and so on. + +The diagram below shows how a 4 x 4 matrix is mapped onto a linear memory space, for a row-major array. At the top of +the diagram is the representation of the linear memory space, where each number is ID of the element in memory. Below +that are two representations of the array in 2D: the left shows the coordinate of each element and the right shows the +ID of the element. + +Column memory layout in C + +The purple elements (5, 6, 7, 8) which map to the coordinates `[1][0]`, `[1][1]`, `[1][2]` and `[1][3]` are contiguous +in linear memory. The same applies for the orange boxes for the elements in row 2 (elements 9, 10, 11 and 12). Columns +in row-major arrays are contiguous. The next diagram instead shows how elements in adjacent rows are mapped in memory. + +Row memory layout in C + +Looking first at the purple boxes (containing elements 2, 6, 10 and 14) which make up the row elements for column 1, +we can see that the elements are not contiguous. Element `[0][1]` maps to element 2 and element `[1][1]` maps to element +6 and so on. Elements in the same column but in a different row are separated by four other elements, in this example. +In other words, elements in other rows are not contiguous. + +> ## Does memory contiguity affect performance? +> +> Do you think memory contiguity could impact the performance of our software, in a negative way? +> +> > ## Solution +> > +> > Yes, memory contiguity can affect how fast our programs run. When data is stored in a neat and organized way, the +> > computer can find and use it quickly. 
But if the data is scattered around randomly (fragmented), it takes more time +> > to locate and use it, which decreases performance. Keeping our data and data access patterns organized can +> > make our programs faster. But we probably won't notice the difference for small arrays and data structures. +> {: .solution} +{: .challenge} + +> ## What about if I use `malloc()`? +> +> More often than not we will see `malloc()` being used to allocate memory for arrays. Especially if the code is using +> an older standard, such as C90, which does not support [variable length +> arrays](https://en.wikipedia.org/wiki/Variable-length_array). When we use `malloc()`, we get a contiguous array of +> elements. To create a 2D array using `malloc()`, we have to first create an array of pointers (which are contiguous) +> and allocate memory for each pointer, +> +> ```c +> int num_rows = 3, num_cols = 5; +> +> float **matrix = malloc(num_rows * sizeof(float*)); // Each pointer is the start of a row +> for (int i = 0; i < num_rows; ++i) { +> matrix[i] = malloc(num_cols * sizeof(float)); // Here we allocate memory to store the column elements for row i +> } +> +> for (int i = 0; i < num_rows; ++i) { +> for (int j = 0; i < num_cols; ++j) { +> matrix[i][j] = 3.14159; // Indexing is done as matrix[rows][cols] +> } +> } +> ``` +> +> There is one problem though. `malloc()` *does not* guarantee that subsequently allocated memory will be contiguous. +> When `malloc()` requests memory, the operating system will assign whatever memory is free. This is not always next to +> the block of memory from the previous allocation. This makes life tricky, since data *has* to be contiguous for MPI +> communication. But there are workarounds. One is to only use 1D arrays (with the same number of elements as the higher +> dimension array) and to map the n-dimensional coordinates into a linear coordinate system. For example, the element +> `[2][4]` in a 3 x 5 matrix would be accessed as, +> +> ```c +> int index_for_2_4 = matrix1d[5 * 2 + 4]; // num_cols * row + col +> ``` +> +> Another solution is to move memory around so that it is contiguous, such as in [this +> example](code/examples/08-malloc-trick.c) or by using a more sophisticated function such as [`arralloc()` +> function](code/arralloc.c) (not part of the standard library) which can allocate arbitrary n-dimensional arrays into a +> contiguous block. +> +{: .callout} + +For a row-major array, we can send the elements of a single row (for a 4 x 4 matrix) easily, + +```c +MPI_Send(&matrix[1][0], 4, MPI_INT ...); +``` + +The send buffer is `&matrix[1][0]`, which is the memory address of the first element in row 1. As the columns are four +elements long, we have specified to only send four integers. Even though we're working here with a 2D array, sending a +single row of the matrix is the same as sending a 1D array. Instead of using a pointer to the start of the array, an +address to the first element of the row (`&matrix[1][0]`) is used instead. It's not possible to do the same for a column +of the matrix, because the elements down the column are not contiguous. + +## Using vectors to send slices of an array + +To send a column of a matrix or array, we have to use a *vector*. A vector is a derived data type that represents +multiple (or one) contiguous sequences of elements, which have a regular spacing between them. 
By using vectors, we can +create data types for column vectors, row vectors or sub-arrays, similar to how we can [create slices for Numpy arrays +in Python](https://numpy.org/doc/stable/user/basics.indexing.html), all of which can be sent in a single, efficient, +communication. To create a vector, we create a new data type using `MPI_Type_vector()`, + +```c +int MPI_Type_vector( + int count, + int blocklength, + int stride, + MPI_Datatype oldtype, + MPI_Datatype *newtype +); +``` + +| `count`: | The number of "blocks" which make up the vector | +| `blocklength`: | The number of contiguous elements in a block | +| `stride`: | The number of elements between the start of each block | +| `oldtype`: | The data type of the elements of the vector, e.g. MPI_INT, MPI_FLOAT | +| `newtype`: | The newly created data type to represent the vector | + +To understand what the arguments mean, look at the diagram below showing a vector to send two rows of a 4 x 4 matrix +with a row in between (rows 2 and 4), + +How a vector is laid out in memory + +A *block* refers to a sequence of contiguous elements. In the diagrams above, each sequence of contiguous purple or +orange elements represents a block. The *block length* is the number of elements within a block; in the above this is +four. The *stride* is the distance between the start of each block, which is eight in the example. The count is the +number of blocks we want. When we create a vector, we're creating a new derived data type which includes one or more +blocks of contiguous elements. + +> ## Why is this functionality useful? +> +> The advantage of using derived types to send vectors is to streamline and simplify communication of complex and +> non-contiguous data. They are most commonly used where there are boundary regions between MPI ranks, such as in +> simulations using domain decomposition (see the optional Common Communication Patterns episode +> for more detail), irregular meshes or composite data structures (covered in the optional Advanced Data Communication +> episode). +> +{: .callout} + +Before we can use the vector we create to communicate data, it has to be committed using `MPI_Type_commit()`. This +finalises the creation of a derived type. Forgetting to do this step leads to unexpected behaviour, and potentially +disastrous consequences! + +```c +int MPI_Type_commit( + MPI_Datatype *datatype // The data type to commit - note that this is a pointer +); +``` + +When a data type is committed, resources which store information on how to handle it are internally allocated. This +contains data structures such as memory buffers as well as data used for bookkeeping. Failing to free those resources +after finishing with the vector leads to memory leaks, just like when we don't free memory created using `malloc()`. To +free up the resources, we use `MPI_Type_free()`, + +```c +int MPI_Type_free ( + MPI_Datatype *datatype // The data type to clean up -- note this is a pointer +); +``` + +The following example code uses a vector to send two rows from a 4 x 4 matrix, as in the example diagram above. + +```c +// The vector is a MPI_Datatype +MPI_Datatype rows_type; + +// Create the vector type +const int count = 2; +const int blocklength = 4; +const int stride = 8; +MPI_Type_vector(count, blocklength, stride, MPI_INT, &rows_type); + +// Don't forget to commit it +MPI_Type_commit(&rows_type); + +// Send the middle row of our 2d matrix array. Note that we are sending +// &matrix[1][0] and not matrix. 
This is because we are using an offset +// to change the starting point of where we begin sending memory +int matrix[4][4] = { + { 1, 2, 3, 4}, + { 5, 6, 7, 8}, + { 9, 10, 11, 12}, + {13, 14, 15, 16}, +}; + +if (my_rank == 0) { + MPI_Send(&matrix[1][0], 1, rows_type, 1, 0, MPI_COMM_WORLD); +} else { + // The receive function doesn't "work" with vector types, so we have to + // say that we are expecting 8 integers instead + const int num_elements = count * blocklength; + int recv_buffer[num_elements]; + MPI_Recv(recv_buffer, num_elements, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); +} + +// The final thing to do is to free the new data type when we no longer need it +MPI_Type_free(&rows_type); +``` + +There are two things above, which look quite innocent, but are important to understand. First of all, the send buffer +in `MPI_Send()` is not `matrix` but `&matrix[1][0]`. In `MPI_Send()`, the send buffer is a pointer to the memory +location where the start of the data is stored. In the above example, the intention is to only send the second and forth +rows, so the start location of the data to send is the address for element `[1][0]`. If we used `matrix`, the first and +third rows would be sent instead. + +The other thing to notice, which is not immediately clear why it's done this way, is that the receive data type is +`MPI_INT` and the count is `num_elements = count * blocklength` instead of a single element of `rows_type`. This +is because when a rank receives data, the data is contiguous array. We don't need to use a vector to describe the layout +of contiguous memory. We are just receiving a contiguous array of `num_elements = count * blocklength` integers. + +> ## Sending columns from an array +> +> Create a vector type to send a column in the following 2 x 3 array: +> +> ```c +> int matrix[2][3] = { +> {1, 2, 3}, +> {4, 5, 6}, +> }; +>``` +> +> With that vector type, send the middle column of the matrix (elements `matrix[0][1]` and `matrix[1][1]`) from rank 0 +> to rank 1 and print the results. You may want to use [this code](code/solutions/skeleton-example.c) as your starting +> point. +> +> > ## Solution +> > +> > If your solution is correct you should see 2 and 5 printed to the screen. In the solution below, to send a 2 x 1 +> > column of the matrix, we created a vector with `count = 2`, `blocklength = 1` and `stride = 3`. To send the correct +> > column our send buffer was `&matrix[0][1]` which is the address of the first element in column 1. To see why the +> > stride is 3, take a look at the diagram below, +> > +> > Stride example for question +> > +> > You can see that there are *three* contiguous elements between the start of each block of 1. 
+> > +> > ```c +> > #include +> > #include +> > +> > int main(int argc, char **argv) +> > { +> > int my_rank; +> > int num_ranks; +> > MPI_Init(&argc, &argv); +> > MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); +> > MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); +> > +> > int matrix[2][3] = { +> > {1, 2, 3}, +> > {4, 5, 6}, +> > }; +> > +> > if (num_ranks != 2) { +> > if (my_rank == 0) { +> > printf("This example only works with 2 ranks\n"); +> > } +> > MPI_Abort(MPI_COMM_WORLD, 1); +> > } +> > +> > MPI_Datatype col_t; +> > MPI_Type_vector(2, 1, 3, MPI_INT, &col_t); +> > MPI_Type_commit(&col_t); +> > +> > if (my_rank == 0) { +> > MPI_Send(&matrix[0][1], 1, col_t, 1, 0, MPI_COMM_WORLD); +> > } else { +> > int buffer[2]; +> > MPI_Status status; +> > +> > MPI_Recv(buffer, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); +> > +> > printf("Rank %d received the following:", my_rank); +> > for (int i = 0; i < 2; ++i) { +> > printf(" %d", buffer[i]); +> > } +> > printf("\n"); +> > } +> > +> > MPI_Type_free(&col_t); +> > +> > return MPI_Finalize(); +> > } +> > ``` +> > +> {: .solution} +> +{: .challenge} + +> ## Sending sub-arrays of an array +> +> By using a vector type, send the middle four elements (6, 7, 10, 11) in the following 4 x 4 matrix from rank 0 to rank +> 1, +> +> ```c +> int matrix[4][4] = { +> { 1, 2, 3, 4}, +> { 5, 6, 7, 8}, +> { 9, 10, 11, 12}, +> {13, 14, 15, 16} +> }; +> ``` +> +> You can re-use most of your code from the previous exercise as your starting point, replacing the 2 x 3 matrix with +> the 4 x 4 matrix above and modifying the vector type and communication functions as required. +> +> > ## Solution +> > +> > The receiving rank(s) should receive the numbers 6, 7, 10 and 11 if your solution is correct. In the solution below, +> > we have created a vector with a count and block length of 2 and with a stride of 4. The first two +> > arguments means two vectors of block length 2 will be sent. The stride of 4 results from that there are 4 elements +> > between the start of each distinct block as shown in the image below, +> > +> > Stride example for question +> > +> > You must always remember to send the address for the starting point of the *first* block as the send buffer, which +> > is why `&array[1][1]` is the first argument in `MPI_Send()`. 
+> > +> > ```c +> > #include +> > #include +> > +> > int main(int argc, char **argv) +> > { +> > int matrix[4][4] = { +> > { 1, 2, 3, 4}, +> > { 5, 6, 7, 8}, +> > { 9, 10, 11, 12}, +> > {13, 14, 15, 16} +> > }; +> > +> > int my_rank; +> > int num_ranks; +> > MPI_Init(&argc, &argv); +> > MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); +> > MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); +> > +> > if (num_ranks != 2) { +> > if (my_rank == 0) { +> > printf("This example only works with 2 ranks\n"); +> > } +> > MPI_Abort(MPI_COMM_WORLD, 1); +> > } +> > +> > MPI_Datatype sub_array_t; +> > MPI_Type_vector(2, 2, 4, MPI_INT, &sub_array_t); +> > MPI_Type_commit(&sub_array_t); +> > +> > if (my_rank == 0) { +> > MPI_Send(&matrix[1][1], 1, sub_array_t, 1, 0, MPI_COMM_WORLD); +> > } else { +> > int buffer[4]; +> > MPI_Status status; +> > +> > MPI_Recv(buffer, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); +> > +> > printf("Rank %d received the following:", my_rank); +> > for (int i = 0; i < 4; ++i) { +> > printf(" %d", buffer[i]); +> > } +> > printf("\n"); +> > } +> > +> > MPI_Type_free(&sub_array_t); +> > +> > return MPI_Finalize(); +> > } +> > ``` +> > +> {: .solution} +{: .challenge} diff --git a/_episodes/09-porting-serial-to-mpi.md b/_episodes/08-porting-serial-to-mpi.md similarity index 97% rename from _episodes/09-porting-serial-to-mpi.md rename to _episodes/08-porting-serial-to-mpi.md index a7e4b97..a27f740 100644 --- a/_episodes/09-porting-serial-to-mpi.md +++ b/_episodes/08-porting-serial-to-mpi.md @@ -1,563 +1,563 @@ ---- -title: Porting Serial Code to MPI -slug: "dirac-intro-to-mpi-porting-serial-to-mpi" -teaching: 0 -exercises: 0 -questions: -- What is the best way to write a parallel code? -- How do I parallelise my existing serial code? -objectives: -- Practice how to identify which parts of a codebase would benefit from parallelisation, and those that need to be done serially or only once -- Convert a serial code into a parallel code -- Differentiate between choices of communication pattern and algorithm design -keypoints: -- Start from a working serial code -- Write a parallel implementation for each function or parallel region -- Connect the parallel regions with a minimal amount of communication -- Continuously compare the developing parallel code with the working serial code ---- - -In this section we will look at converting a complete code from serial to parallel in a number of steps. - -## An Example Iterative Poisson Solver - -This episode is based on a code that solves the Poisson's equation using an iterative method. -Poisson's equation appears in almost every field in physics, and is frequently used to model many physical phenomena such as heat conduction, and applications of this equation exist for both two and three dimensions. -In this case, the equation is used in a simplified form to describe how heat diffuses in a one dimensional metal stick. - -In the simulation the stick is split into a given number of slices, each with its own temperature. - -![Stick divided into separate slices with touching boundaries at each end](fig/poisson_stick.png) - -The temperature of the stick itself across each slice is initially set to zero, whilst at one boundary of the stick the amount of heat is set to 10. -The code applies steps that simulates heat transfer along it, bringing each slice of the stick closer to a solution until it reaches a desired equilibrium in temperature along the whole stick. 
- -Let's download the code, which can be found [here](code/examples/poisson/poisson.c), and take a look at it now. - -### At a High Level - main() - -We'll begin by looking at the `main()` function at a high level. - -~~~ -#define MAX_ITERATIONS 25000 -#define GRIDSIZE 12 - -... - -int main(int argc, char **argv) { - - // The heat energy in each block - float *u, *unew, *rho; - float h, hsq; - double unorm, residual; - int i; - - u = malloc(sizeof(*u) * (GRIDSIZE+2)); - unew = malloc(sizeof(*unew) * (GRIDSIZE+2)); - rho = malloc(sizeof(*rho) * (GRIDSIZE+2)); -~~~ -{: .language-c} - -It first defines two constants that govern the scale of the simulation: - -- `MAX_ITERATIONS`: determines the maximum number of iterative steps the code will attempt in order to find a solution with sufficiently low equilibrium -- `GRIDSIZE`: the number of slices within our stick that will be simulated. Increasing this will increase the number of stick slices to simulate, which increases the processing required - -Next, it declares some arrays used during the iterative calculations: - -- `u`: each value represents the current temperature of a slice in the stick -- `unew`: during an iterative step, is used to hold the newly calculated temperature of a slice in the stick -- `rho`: holds a separate coefficient for each slice of the stick, used as part of the iterative calculation to represent potentially different boundary conditions for each stick slice. For simplicity, we'll assume completely homogeneous boundary conditions, so these potentials are zero - -Note we are defining each of our array sizes with two additional elements, the first of which represents a touching 'boundary' before the stick, i.e. something with a potentially different temperature touching the stick. The second added element is at the end of the stick, representing a similar boundary at the opposite end. - -The next step is to initialise the initial conditions of the simulation: - -~~~ - // Set up parameters - h = 0.1; - hsq = h * h; - residual = 1e-5; - - // Initialise the u and rho field to 0 - for (i = 0; i <= GRIDSIZE + 1; ++i) { - u[i] = 0.0; - rho[i] = 0.0; - } - - // Create a start configuration with the heat energy - // u=10 at the x=0 boundary for rank 1 - u[0] = 10.0; -~~~ -{: .language-c} - -`residual` here refers to the threshold of temperature equilibrium along the stick we wish to achieve. Once it's within this threshold, the simulation will end. -Note that initially, `u` is set entirely to zero, representing a temperature of zero along the length of the stick. -As noted, `rho` is set to zero here for simplicity. - -Remember that additional first element of `u`? Here we set it to a temperature of `10.0` to represent something with that temperature touching the stick at one end, to initiate the process of heat transfer we wish to simulate. - -Next, the code iteratively calls `poisson_step()` to calculate the next set of results, until either the maximum number of steps is reached, or a particular measure of the difference in temperature along the stick returned from this function (`unorm`) is below a particular threshold. - -~~~ - // Run iterations until the field reaches an equilibrium - // and no longer changes - for (i = 0; i < NUM_ITERATIONS; ++i) { - unorm = poisson_step(u, unew, rho, hsq, GRIDSIZE); - if (sqrt(unorm) < sqrt(residual)) { - break; - } - } -~~~ -{: .language-c} - -Finally, just for show, the code outputs a representation of the result - the end temperature of each slice of the stick. 
- -~~~ - printf("Final result:\n"); - for (int j = 1; j <= GRIDSIZE; ++j) { - printf("%d-", (int) u[j]); - } - printf("\n"); - printf("Run completed in %d iterations with residue %g\n", i, unorm); -} -~~~ -{: .language-c} - -### The Iterative Function - poisson_step() - -The `poisson_step()` progresses the simulation by a single step. After it accepts its arguments, for each slice in the stick it calculates a new value based on the temperatures of its neighbours: - -~~~ - for (int i = 1; i <= points; ++i) { - float difference = u[i-1] + u[i+1]; - unew[i] = 0.5 * (difference - hsq * rho[i]); - } -~~~ -{: .language-c} - -Next, it calculates a value representing the overall cumulative change in temperature along the stick compared to its previous state, which as we saw before, is used to determine if we've reached a stable equilibrium and may exit the simulation: - -~~~ - unorm = 0.0; - for (int i = 1; i <= points; ++i) { - float diff = unew[i] - u[i]; - unorm += diff * diff; - } -~~~ -{: .language-c} - -And finally, the state of the stick is set to the newly calculated values, and `unorm` is returned from the function: - -~~~ - // Overwrite u with the new field - for (int i = 1 ;i <= points; ++i) { - u[i] = unew[i]; - } - - return unorm; -} -~~~ -{: .language-c} - -### Compiling and Running the Poisson Code - -You may compile and run the code as follows: - -~~~ -gcc poisson.c -o poisson -lm -./poisson -~~~ -{: .language-bash} - -And should see the following: - -~~~ -Final result: -9-8-7-6-6-5-4-3-3-2-1-0- -Run completed in 182 iterations with residue 9.60328e-06 -~~~ -{: .output} - -Here, we can see a basic representation of the temperature of each slice of the stick at the end of the simulation, and how the initial `10.0` temperature applied at the beginning of the stick has transferred along it to this final state. -Ordinarily, we might output the full sequence to a file, but we've simplified it for convenience here. - -## Approaching Parallelism - -So how should we make use of an MPI approach to parallelise this code? -A good place to start is to consider the nature of the data within this computation, and what we need to achieve. - -For a number of iterative steps, currently the code computes the next set of values for the entire stick. -So at a high level one approach using MPI would be to split this computation by dividing the stick into sections each with a number of slices, and have a separate rank responsible for computing iterations for those slices within its given section. -Essentially then, for simplicity we may consider each section a stick on its own, with either two neighbours at touching boundaries (for middle sections of the stick), or one touching boundary neighbour (for sections at the beginning and end of the stick, which also have either a start or end stick boundary touching them). For example, considering a `GRIDSIZE` of 12 and three ranks: - -![Stick divisions subdivided across ranks](fig/poisson_stick_subdivided.png) - -We might also consider subdividing the number of iterations, and parallelise across these instead. -However, this is far less compelling since each step is completely dependent on the results of the prior step, -so by its nature steps must be done serially. - -The next step is to consider in more detail this approach to parallelism with our code. 
- -> ## Parallelism and Data Exchange -> -> Looking at the code, which parts would benefit most from parallelisation, -> and are there any regions that require data exchange across its processes in order for -> the simulation to work as we intend? -> ->> ## Solution ->> ->> Potentially, the following regions could be executed in parallel: ->> ->> * The setup, when initialising the fields ->> * The calculation of each time step, `unew` - this is the most computationally intensive of the loops ->> * Calculation of the cumulative temperature difference, `unorm` ->> * Overwriting the field `u` with the result of the new calculation ->> ->> As `GRIDSIZE` is increased, these will take proportionally more time to complete, so may benefit from parallelisation. ->> ->> However, there are a few regions in the code that will require exchange of data across the parallel executions ->> to work correctly: ->> ->> * Calculation of `unorm` is a sum that requires difference data from all sections of the stick, so we'd need to somehow communicate these difference values to a single rank that computes and receives the overall sum ->> * Each section of the stick does not compute a single step in isolation, it needs boundary data from neighbouring sections of the stick to arrive at its computed temperature value for that step, so we'd need to communicate temperature values between neighbours (i.e. using a nearest neighbours communication pattern) ->{: .solution} -{: .challenge} - -We also need to identify any sizeable serial regions. -The sum of the serial regions gives the minimum amount of time it will take to run the program. -If the serial parts are a significant part of the algorithm, it may not be possible to write an efficient parallel version. - -> ## Serial Regions -> -> Examine the code and try to identify any serial regions that can't (or shouldn't) be parallelised. -> ->> ## Solution ->> ->> There aren't any large or time consuming serial regions, which is good from a parallelism perspective. ->> However, there are a couple of small regions that are not amenable to running in parallel: ->> ->> * Setting the `10.0` initial temperature condition at the stick 'starting' boundary. We only need to set this once at the beginning of the stick, and not at the boundary of every section of the stick ->> * Printing a representation of the final result, since this only needs to be done once to represent the whole stick, and not for every section. ->> ->> So we'd need to ensure only one rank deals with these, which in MPI is typically the zeroth rank. ->> This also makes sense in terms of our parallelism approach, since the zeroth rank would be the beginning of the stick, ->> where we'd set the initial boundary temperature. ->{: .solution} -{: .challenge} - - -## Parallelising our Code - -So now let's apply what we've learned about MPI together with our consideration of the code. -First, make a copy of the `poisson.c` code that we will work on (don't modify the original, we'll need this later!). -In the shell, for example: - -~~~ -cp poisson.c poisson_mpi.c -~~~ -{: .language-bash} - -And then we can add our MPI parallelisation modifications to `poisson_mpi.c` - -We'll start at a high level with `main()`, although first add `#include ` at the top of our code so we can make use of MPI. -We'll do this parallelisation in a number of stages. - -In MPI, all ranks execute the same code. -When writing a parallel code with MPI, a good place to start is to think about a single rank. 
-What does this rank need to do, and what information does it need to do it? - -The first goal should be to write a simple code that works correctly. -We can always optimise further later! - -### main(): Adding MPI at a High-level - -Then as we usually do, let's initialise MPI and obtain our rank and total number of ranks. -With the latter information, we can then calculate how many slices of the stick this rank will compute. -Near the top of `main()` add the following: - -~~~ - int rank, n_ranks, rank_gridsize; - - MPI_Init(&argc, &argv); - - // Find the number of slices calculated by each rank - // The simple calculation here assumes that GRIDSIZE is divisible by n_ranks - MPI_Comm_rank(MPI_COMM_WORLD, &rank); - MPI_Comm_size(MPI_COMM_WORLD, &n_ranks); - rank_gridsize = GRIDSIZE / n_ranks; -~~~ -{: .language-c} - -Using `rank_gridsize`, we can now amend our initial array declaration sizes to use this instead: - -~~~ - u = malloc(sizeof(*u) * (rank_gridsize + 2)); - unew = malloc(sizeof(*unew) * (rank_gridsize + 2)); - rho = malloc(sizeof(*rho) * (rank_gridsize + 2)); -~~~ -{: .language-c} - -Then at the very end of `main()` let's complete our use of MPI: - -~~~ - MPI_Finalize(); -} -~~~ -{: .language-c} - -### main(): Initialising the Simulation and Printing the Result - -Since we're not initialising for the entire stick (`GRIDSIZE`) but just the section apportioned to our rank (`rank_gridsize`), we need to amend the loop that initialises `u` and `rho` accordingly, to: - -~~~ - // Initialise the u and rho field to 0 - for (i = 0; i <= rank_gridsize + 1; ++i) { - u[i] = 0.0; - rho[i] = 0.0; - } -~~~ -{: .language-c} - -As we found out in the *Serial Regions* exercise, we need to ensure that only a single rank (rank zero) needs to initiate the starting temperature within it's section, so we need to put a condition on that initialisation: - -~~~ - // Create a start configuration with the heat energy - // u = 10 at the x = 0 boundary for rank 0 - if (rank == 0) { - u[0] = 10.0; - } -~~~ -{: .language-c} - -We also need to collect the overall results from all ranks and output that collected result, but again, only for rank zero. -To collect the results from all ranks (held in `u`) we can use `MPI_Gather()`, to send all `u` results to rank zero to hold in a results array. -Note that this will also include the result from rank zero! - -Add the following to the list of declarations at the start of `main()`: - -~~~ - float *resultbuf; -~~~ -{: .language-c} - -Then before `MPI_Finalize()` let's amend the code to the following: - -~~~ - // Gather results from all ranks - // We need to send data starting from the second element of u, since u[0] is a boundary - resultbuf = malloc(sizeof(*resultbuf) * GRIDSIZE); - MPI_Gather(&u[1], rank_gridsize, MPI_FLOAT, resultbuf, rank_gridsize, MPI_FLOAT, 0, MPI_COMM_WORLD); - - if (rank == 0) { - printf("Final result:\n"); - for (int j = 0; j < GRIDSIZE; j++) { - printf("%d-", (int) resultbuf[j]); - } - printf("\nRun completed in %d iterations with residue %g\n", i, unorm); - } -~~~ -{: .language-c} - -`MPI_Gather()` is ideally suited for our purposes, since results from ranks are arranged within `resultbuf` in rank order, -so we end up with all slices across all ranks representing the entire stick. -However, note that we need to send our data starting from `u[1]`, since `u[0]` is the section's boundary value we don't want to include. - -This has an interesting effect we need to account for - note the change to the `for` loop. 
-Since we are gathering data from each rank (including rank 0) starting from `u[1]`, `resultbuf` will not contain any section boundary values so our loop no longer needs to skip the first value. - -### main(): Invoking the Step Function - -Before we parallelise the `poisson_step()` function, let's amend how we invoke it and pass it some additional parameters it will need. -We need to amend how many slices it will compute, and add the `rank` and `n_ranks` variables, since we know from `Parallelism and Data Exchange` that it will need to perform some data exchange with other ranks: - -~~~ - unorm = poisson_step(u, unew, rho, hsq, rank_gridsize, rank, n_ranks); -~~~ -{: .language-c} - -### poisson_step(): Updating our Function Definition - -Moving into the `poisson_step()` function, we first amend our function to include the changes to parameters: - -~~~ -double poisson_step( - float *u, float *unew, float *rho, - float hsq, int points, - int rank, int n_ranks -) { -~~~ -{: .language-c} - -### poisson_step(): Calculating a Global `unorm` - -We know from `Parallelism and Data Exchange` that we need to calculate `unorm` across all ranks. -We already have it calculated for separate ranks, so need to *reduce* those values in an MPI sense to a single summed value. -For this, we can use `MPI_Allreduce()`. - -Insert the following into the `poisson_step()` function, putting the declarations at the top of the function, -and the `MPI_Allreduce()` after the calculation of `unorm`: - -~~~ - double unorm, global_unorm; - - MPI_Allreduce(&unorm, &global_unorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD); -~~~ -{: .language-c} - -So here, we use this function in an `MPI_SUM` mode, which will sum all instances of `unorm` and place the result in a single (`1`) value global_unorm`. -We must also remember to amend the return value to this global version, since we need to calculate equilibrium across the entire stick: - -~~~ - return global_unorm; -} -~~~ -{: .language-c} - -### poisson_step(): Communicate Boundaries to Neighbours - -In order for our parallel code to work, we know from `Parallelism and Data Exchange` that each section of slices is not computed in isolation. -After we've computed new values we need to send our boundary slice values to our neighbours if those neighbours exist - -the beginning and end of the stick will each only have one neighbour, so we need to account for that. - -We also need to ensure that we don't encounter a deadlock situation when exchanging the data between neighbours. -They can't all send to the rightmost neighbour simultaneously, since none will then be waiting and able to receive. -We need a message exchange strategy here, so let's have all odd-numbered ranks send their data first (to be received by even ranks), -then have our even ranks send their data (to be received by odd ranks). 
-Such an order might look like this (highlighting the odd ranks - only one in this example - with the order of communications indicated numerically): - -![Communication strategy - odd ranks send to potential neighbours first, then receive from them](fig/poisson_comm_strategy_1.png) - -So following the `MPI_Allreduce()` we've just added, let's deal with odd ranks first (again, put the declarations at the top of the function): - -~~~ - float sendbuf, recvbuf; - MPI_Status mpi_status; - - // The u field has been changed, communicate it to neighbours - // With blocking communication, half the ranks should send first - // and the other half should receive first - if ((rank % 2) == 1) { - // Ranks with odd number send first - - // Send data down from rank to rank - 1 - sendbuf = unew[1]; - MPI_Send(&sendbuf, 1, MPI_FLOAT, rank - 1, 1, MPI_COMM_WORLD); - // Receive data from rank - 1 - MPI_Recv(&recvbuf, 1, MPI_FLOAT, rank - 1, 2, MPI_COMM_WORLD, &mpi_status); - u[0] = recvbuf; - - if (rank != (n_ranks - 1)) { - // Send data up to rank + 1 (if I'm not the last rank) - MPI_Send(&u[points], 1, MPI_FLOAT, rank + 1, 1, MPI_COMM_WORLD); - // Receive data from rank + 1 - MPI_Recv(&u[points + 1], 1, MPI_FLOAT, rank + 1, 2, MPI_COMM_WORLD, &mpi_status); - } -~~~ -{: .language-c} - -Here we use C's inbuilt modulus operator (`%`) to determine if the rank is odd. -If so, we exchange some data with the rank preceding us, and the one following. - -We first send our newly computed leftmost value (at position `1` in our array) to the rank preceding us. -Since we're an odd rank, we can always assume a rank preceding us exists, -since the earliest odd rank 1 will exchange with rank 0. -Then, we receive the rightmost boundary value from that rank. - -Then, if the rank following us exists, we do the same, but this time we send the rightmost value at the end of our stick section, -and receive the corresponding value from that rank. - -These exchanges mean that - as an odd rank - we now have effectively exchanged the states of the start and end of our slices with our respective neighbours. - -And now we need to do the same for those neighbours (the even ranks), mirroring the same communication pattern but in the opposite order of receive/send. -From the perspective of evens, it should look like the following (highlighting the two even ranks): - -![Communication strategy - even ranks first receive from odd ranks, then send to them](fig/poisson_comm_strategy_2.png) - -~~~ - } else { - // Ranks with even number receive first - - if (rank != 0) { - // Receive data from rank - 1 (if I'm not the first rank) - MPI_Recv(&u[0], 1, MPI_FLOAT, rank - 1, 1, MPI_COMM_WORLD, &mpi_status); - // Send data down to rank-1 - MPI_Send(&u[1], 1, MPI_FLOAT, rank - 1, 2, MPI_COMM_WORLD); - } - - if (rank != (n_ranks - 1)) { - // Receive data from rank + 1 (if I'm not the last rank) - MPI_Recv(&u[points + 1], 1, MPI_FLOAT, rank + 1, 1, MPI_COMM_WORLD, &mpi_status); - // Send data up to rank+1 - MPI_Send(&u[points], 1, MPI_FLOAT, rank + 1, 2, MPI_COMM_WORLD); - } - } -~~~ -{: .language-c} - -Once complete across all ranks, every rank will then have the slice boundary data from its neighbours ready for the next iteration. - -### Running our Parallel Code - -You can obtain a full version of the parallelised Poisson code from [here](code/examples/poisson/poisson_mpi.c). 
Now we have the parallelised code in place, we can compile and run it, e.g.: - -~~~ -mpicc poisson_mpi.c -o poisson_mpi -lm -mpirun -n 2 poisson_mpi -~~~ -{: .language-bash} - -~~~ -Final result: -9-8-7-6-6-5-4-3-3-2-1-0- -Run completed in 182 iterations with residue 9.60328e-06 -~~~ -{: .output} - -Note that as it stands, the implementation assumes that `GRIDSIZE` is divisible by `n_ranks`. -So to guarantee correct output, we should use only - -### Testing our Parallel Code - -We should always ensure that as our parallel version is developed, that it behaves the same as our serial version. -This may not be possible initially, particularly as large parts of the code need converting to use MPI, but where possible, we should continue to test. So we should test once we have an initial MPI version, and as our code develops, perhaps with new optimisations to improve performance, we should test then too. - -> ## An Initial Test -> -> Test the mpi version of your code against the serial version, using 1, 2, 3, and 4 ranks with the MPI version. -> Are the results as you would expect? -> What happens if you test with 5 ranks, and why? -> -> > ## Solution -> > -> > Using these ranks, the MPI results should be the same as our serial version. -> > Using 5 ranks, our MPI version yields `9-8-7-6-5-4-3-2-1-0-0-0-` which is incorrect. -> > This is because the `rank_gridsize = GRIDSIZE / n_ranks` calculation becomes `rank_gridsize = 12 / 5`, which produces 2.4. -> > This is then converted to the integer 2, which means each of the 5 ranks is only operating on 2 slices each, for a total of 10 slices. -> > This doesn't fill `resultbuf` with results representing an expected `GRIDSIZE` of 12, hence the incorrect answer. -> > -> > This highlights another aspect of complexity we need to take into account when writing such parallel implementations, where we must ensure a problem space is correctly subdivided. -> > In this case, we could implement a more careful way to subdivide the slices across the ranks, with some ranks obtaining more slices to make up the shortfall correctly. ->{: .solution} -{: .challenge} - -> ## Limitations! -> -> You may remember that for the purposes of this episode we've assumed a homogeneous stick, -> by setting the `rho` coefficient to zero for every slice. -> As a thought experiment, if we wanted to address this limitation and model an inhomogeneous stick with different static coefficients for each slice, -> how could we amend our code to allow this correctly for each slice across all ranks? -> ->> ## Solution ->> ->> One way would be to create a static lookup array with a `GRIDSIZE` number of elements. ->> This could be defined in a separate `.h` file and imported using `#include`. ->> Each rank could then read the `rho` values for the specific slices in its section from the array and use those. ->> At initialisation, instead of setting it to zero we could do: ->> ->> ~~~ ->> rho[i] = rho_coefficients[(rank * rank_gridsize) + i] ->> ~~~ ->> {: .language-c} ->{: .solution} -{: .challenge} +--- +title: Porting Serial Code to MPI +slug: "dirac-intro-to-mpi-porting-serial-to-mpi" +teaching: 0 +exercises: 0 +questions: +- What is the best way to write a parallel code? +- How do I parallelise my existing serial code? 
+objectives: +- Practice how to identify which parts of a codebase would benefit from parallelisation, and those that need to be done serially or only once +- Convert a serial code into a parallel code +- Differentiate between choices of communication pattern and algorithm design +keypoints: +- Start from a working serial code +- Write a parallel implementation for each function or parallel region +- Connect the parallel regions with a minimal amount of communication +- Continuously compare the developing parallel code with the working serial code +--- + +In this section we will look at converting a complete code from serial to parallel in a number of steps. + +## An Example Iterative Poisson Solver + +This episode is based on a code that solves the Poisson's equation using an iterative method. +Poisson's equation appears in almost every field in physics, and is frequently used to model many physical phenomena such as heat conduction, and applications of this equation exist for both two and three dimensions. +In this case, the equation is used in a simplified form to describe how heat diffuses in a one dimensional metal stick. + +In the simulation the stick is split into a given number of slices, each with its own temperature. + +![Stick divided into separate slices with touching boundaries at each end](fig/poisson_stick.png) + +The temperature of the stick itself across each slice is initially set to zero, whilst at one boundary of the stick the amount of heat is set to 10. +The code applies steps that simulates heat transfer along it, bringing each slice of the stick closer to a solution until it reaches a desired equilibrium in temperature along the whole stick. + +Let's download the code, which can be found [here](code/examples/poisson/poisson.c), and take a look at it now. + +### At a High Level - main() + +We'll begin by looking at the `main()` function at a high level. + +~~~ +#define MAX_ITERATIONS 25000 +#define GRIDSIZE 12 + +... + +int main(int argc, char **argv) { + + // The heat energy in each block + float *u, *unew, *rho; + float h, hsq; + double unorm, residual; + int i; + + u = malloc(sizeof(*u) * (GRIDSIZE+2)); + unew = malloc(sizeof(*unew) * (GRIDSIZE+2)); + rho = malloc(sizeof(*rho) * (GRIDSIZE+2)); +~~~ +{: .language-c} + +It first defines two constants that govern the scale of the simulation: + +- `MAX_ITERATIONS`: determines the maximum number of iterative steps the code will attempt in order to find a solution with sufficiently low equilibrium +- `GRIDSIZE`: the number of slices within our stick that will be simulated. Increasing this will increase the number of stick slices to simulate, which increases the processing required + +Next, it declares some arrays used during the iterative calculations: + +- `u`: each value represents the current temperature of a slice in the stick +- `unew`: during an iterative step, is used to hold the newly calculated temperature of a slice in the stick +- `rho`: holds a separate coefficient for each slice of the stick, used as part of the iterative calculation to represent potentially different boundary conditions for each stick slice. For simplicity, we'll assume completely homogeneous boundary conditions, so these potentials are zero + +Note we are defining each of our array sizes with two additional elements, the first of which represents a touching 'boundary' before the stick, i.e. something with a potentially different temperature touching the stick. 
The second added element is at the end of the stick, representing a similar boundary at the opposite end. + +The next step is to initialise the initial conditions of the simulation: + +~~~ + // Set up parameters + h = 0.1; + hsq = h * h; + residual = 1e-5; + + // Initialise the u and rho field to 0 + for (i = 0; i <= GRIDSIZE + 1; ++i) { + u[i] = 0.0; + rho[i] = 0.0; + } + + // Create a start configuration with the heat energy + // u=10 at the x=0 boundary for rank 1 + u[0] = 10.0; +~~~ +{: .language-c} + +`residual` here refers to the threshold of temperature equilibrium along the stick we wish to achieve. Once it's within this threshold, the simulation will end. +Note that initially, `u` is set entirely to zero, representing a temperature of zero along the length of the stick. +As noted, `rho` is set to zero here for simplicity. + +Remember that additional first element of `u`? Here we set it to a temperature of `10.0` to represent something with that temperature touching the stick at one end, to initiate the process of heat transfer we wish to simulate. + +Next, the code iteratively calls `poisson_step()` to calculate the next set of results, until either the maximum number of steps is reached, or a particular measure of the difference in temperature along the stick returned from this function (`unorm`) is below a particular threshold. + +~~~ + // Run iterations until the field reaches an equilibrium + // and no longer changes + for (i = 0; i < NUM_ITERATIONS; ++i) { + unorm = poisson_step(u, unew, rho, hsq, GRIDSIZE); + if (sqrt(unorm) < sqrt(residual)) { + break; + } + } +~~~ +{: .language-c} + +Finally, just for show, the code outputs a representation of the result - the end temperature of each slice of the stick. + +~~~ + printf("Final result:\n"); + for (int j = 1; j <= GRIDSIZE; ++j) { + printf("%d-", (int) u[j]); + } + printf("\n"); + printf("Run completed in %d iterations with residue %g\n", i, unorm); +} +~~~ +{: .language-c} + +### The Iterative Function - poisson_step() + +The `poisson_step()` progresses the simulation by a single step. 
After it accepts its arguments, for each slice in the stick it calculates a new value based on the temperatures of its neighbours: + +~~~ + for (int i = 1; i <= points; ++i) { + float difference = u[i-1] + u[i+1]; + unew[i] = 0.5 * (difference - hsq * rho[i]); + } +~~~ +{: .language-c} + +Next, it calculates a value representing the overall cumulative change in temperature along the stick compared to its previous state, which as we saw before, is used to determine if we've reached a stable equilibrium and may exit the simulation: + +~~~ + unorm = 0.0; + for (int i = 1; i <= points; ++i) { + float diff = unew[i] - u[i]; + unorm += diff * diff; + } +~~~ +{: .language-c} + +And finally, the state of the stick is set to the newly calculated values, and `unorm` is returned from the function: + +~~~ + // Overwrite u with the new field + for (int i = 1 ;i <= points; ++i) { + u[i] = unew[i]; + } + + return unorm; +} +~~~ +{: .language-c} + +### Compiling and Running the Poisson Code + +You may compile and run the code as follows: + +~~~ +gcc poisson.c -o poisson -lm +./poisson +~~~ +{: .language-bash} + +And should see the following: + +~~~ +Final result: +9-8-7-6-6-5-4-3-3-2-1-0- +Run completed in 182 iterations with residue 9.60328e-06 +~~~ +{: .output} + +Here, we can see a basic representation of the temperature of each slice of the stick at the end of the simulation, and how the initial `10.0` temperature applied at the beginning of the stick has transferred along it to this final state. +Ordinarily, we might output the full sequence to a file, but we've simplified it for convenience here. + +## Approaching Parallelism + +So how should we make use of an MPI approach to parallelise this code? +A good place to start is to consider the nature of the data within this computation, and what we need to achieve. + +For a number of iterative steps, currently the code computes the next set of values for the entire stick. +So at a high level one approach using MPI would be to split this computation by dividing the stick into sections each with a number of slices, and have a separate rank responsible for computing iterations for those slices within its given section. +Essentially then, for simplicity we may consider each section a stick on its own, with either two neighbours at touching boundaries (for middle sections of the stick), or one touching boundary neighbour (for sections at the beginning and end of the stick, which also have either a start or end stick boundary touching them). For example, considering a `GRIDSIZE` of 12 and three ranks: + +![Stick divisions subdivided across ranks](fig/poisson_stick_subdivided.png) + +We might also consider subdividing the number of iterations, and parallelise across these instead. +However, this is far less compelling since each step is completely dependent on the results of the prior step, +so by its nature steps must be done serially. + +The next step is to consider in more detail this approach to parallelism with our code. + +> ## Parallelism and Data Exchange +> +> Looking at the code, which parts would benefit most from parallelisation, +> and are there any regions that require data exchange across its processes in order for +> the simulation to work as we intend? 
+> +>> ## Solution +>> +>> Potentially, the following regions could be executed in parallel: +>> +>> * The setup, when initialising the fields +>> * The calculation of each time step, `unew` - this is the most computationally intensive of the loops +>> * Calculation of the cumulative temperature difference, `unorm` +>> * Overwriting the field `u` with the result of the new calculation +>> +>> As `GRIDSIZE` is increased, these will take proportionally more time to complete, so may benefit from parallelisation. +>> +>> However, there are a few regions in the code that will require exchange of data across the parallel executions +>> to work correctly: +>> +>> * Calculation of `unorm` is a sum that requires difference data from all sections of the stick, so we'd need to somehow communicate these difference values to a single rank that computes and receives the overall sum +>> * Each section of the stick does not compute a single step in isolation, it needs boundary data from neighbouring sections of the stick to arrive at its computed temperature value for that step, so we'd need to communicate temperature values between neighbours (i.e. using a nearest neighbours communication pattern) +>{: .solution} +{: .challenge} + +We also need to identify any sizeable serial regions. +The sum of the serial regions gives the minimum amount of time it will take to run the program. +If the serial parts are a significant part of the algorithm, it may not be possible to write an efficient parallel version. + +> ## Serial Regions +> +> Examine the code and try to identify any serial regions that can't (or shouldn't) be parallelised. +> +>> ## Solution +>> +>> There aren't any large or time consuming serial regions, which is good from a parallelism perspective. +>> However, there are a couple of small regions that are not amenable to running in parallel: +>> +>> * Setting the `10.0` initial temperature condition at the stick 'starting' boundary. We only need to set this once at the beginning of the stick, and not at the boundary of every section of the stick +>> * Printing a representation of the final result, since this only needs to be done once to represent the whole stick, and not for every section. +>> +>> So we'd need to ensure only one rank deals with these, which in MPI is typically the zeroth rank. +>> This also makes sense in terms of our parallelism approach, since the zeroth rank would be the beginning of the stick, +>> where we'd set the initial boundary temperature. +>{: .solution} +{: .challenge} + + +## Parallelising our Code + +So now let's apply what we've learned about MPI together with our consideration of the code. +First, make a copy of the `poisson.c` code that we will work on (don't modify the original, we'll need this later!). +In the shell, for example: + +~~~ +cp poisson.c poisson_mpi.c +~~~ +{: .language-bash} + +And then we can add our MPI parallelisation modifications to `poisson_mpi.c` + +We'll start at a high level with `main()`, although first add `#include ` at the top of our code so we can make use of MPI. +We'll do this parallelisation in a number of stages. + +In MPI, all ranks execute the same code. +When writing a parallel code with MPI, a good place to start is to think about a single rank. +What does this rank need to do, and what information does it need to do it? + +The first goal should be to write a simple code that works correctly. +We can always optimise further later! 
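+
+Before we start changing `poisson_mpi.c` itself, it can help to sketch that single-rank viewpoint as a tiny standalone program. The sketch below is purely illustrative and is not part of the final code: the variable names `first_slice` and `last_slice`, and the 1-based slice numbering, are just one possible convention. It shows that a rank only needs its own rank number and the total number of ranks to work out which section of the stick it is responsible for.
+
+~~~
+#include <stdio.h>
+#include <mpi.h>
+
+#define GRIDSIZE 12
+
+int main(int argc, char **argv) {
+  int rank, n_ranks;
+
+  MPI_Init(&argc, &argv);
+  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+  MPI_Comm_size(MPI_COMM_WORLD, &n_ranks);
+
+  // Illustrative only: assumes GRIDSIZE is divisible by n_ranks and
+  // numbers the slices of the stick from 1 to GRIDSIZE
+  int rank_gridsize = GRIDSIZE / n_ranks;
+  int first_slice = rank * rank_gridsize + 1;
+  int last_slice = first_slice + rank_gridsize - 1;
+
+  printf("Rank %d of %d owns slices %d to %d\n", rank, n_ranks, first_slice, last_slice);
+
+  MPI_Finalize();
+  return 0;
+}
+~~~
+{: .language-c}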
+ +### main(): Adding MPI at a High-level + +Then as we usually do, let's initialise MPI and obtain our rank and total number of ranks. +With the latter information, we can then calculate how many slices of the stick this rank will compute. +Near the top of `main()` add the following: + +~~~ + int rank, n_ranks, rank_gridsize; + + MPI_Init(&argc, &argv); + + // Find the number of slices calculated by each rank + // The simple calculation here assumes that GRIDSIZE is divisible by n_ranks + MPI_Comm_rank(MPI_COMM_WORLD, &rank); + MPI_Comm_size(MPI_COMM_WORLD, &n_ranks); + rank_gridsize = GRIDSIZE / n_ranks; +~~~ +{: .language-c} + +Using `rank_gridsize`, we can now amend our initial array declaration sizes to use this instead: + +~~~ + u = malloc(sizeof(*u) * (rank_gridsize + 2)); + unew = malloc(sizeof(*unew) * (rank_gridsize + 2)); + rho = malloc(sizeof(*rho) * (rank_gridsize + 2)); +~~~ +{: .language-c} + +Then at the very end of `main()` let's complete our use of MPI: + +~~~ + MPI_Finalize(); +} +~~~ +{: .language-c} + +### main(): Initialising the Simulation and Printing the Result + +Since we're not initialising for the entire stick (`GRIDSIZE`) but just the section apportioned to our rank (`rank_gridsize`), we need to amend the loop that initialises `u` and `rho` accordingly, to: + +~~~ + // Initialise the u and rho field to 0 + for (i = 0; i <= rank_gridsize + 1; ++i) { + u[i] = 0.0; + rho[i] = 0.0; + } +~~~ +{: .language-c} + +As we found out in the *Serial Regions* exercise, we need to ensure that only a single rank (rank zero) needs to initiate the starting temperature within it's section, so we need to put a condition on that initialisation: + +~~~ + // Create a start configuration with the heat energy + // u = 10 at the x = 0 boundary for rank 0 + if (rank == 0) { + u[0] = 10.0; + } +~~~ +{: .language-c} + +We also need to collect the overall results from all ranks and output that collected result, but again, only for rank zero. +To collect the results from all ranks (held in `u`) we can use `MPI_Gather()`, to send all `u` results to rank zero to hold in a results array. +Note that this will also include the result from rank zero! + +Add the following to the list of declarations at the start of `main()`: + +~~~ + float *resultbuf; +~~~ +{: .language-c} + +Then before `MPI_Finalize()` let's amend the code to the following: + +~~~ + // Gather results from all ranks + // We need to send data starting from the second element of u, since u[0] is a boundary + resultbuf = malloc(sizeof(*resultbuf) * GRIDSIZE); + MPI_Gather(&u[1], rank_gridsize, MPI_FLOAT, resultbuf, rank_gridsize, MPI_FLOAT, 0, MPI_COMM_WORLD); + + if (rank == 0) { + printf("Final result:\n"); + for (int j = 0; j < GRIDSIZE; j++) { + printf("%d-", (int) resultbuf[j]); + } + printf("\nRun completed in %d iterations with residue %g\n", i, unorm); + } +~~~ +{: .language-c} + +`MPI_Gather()` is ideally suited for our purposes, since results from ranks are arranged within `resultbuf` in rank order, +so we end up with all slices across all ranks representing the entire stick. +However, note that we need to send our data starting from `u[1]`, since `u[0]` is the section's boundary value we don't want to include. + +This has an interesting effect we need to account for - note the change to the `for` loop. +Since we are gathering data from each rank (including rank 0) starting from `u[1]`, `resultbuf` will not contain any section boundary values so our loop no longer needs to skip the first value. 
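> ## Guarding the Divisibility Assumption
>
> As noted above, the simple `GRIDSIZE / n_ranks` split assumes that `GRIDSIZE` divides evenly between the ranks.
> The episode's solution doesn't include a check for this, but as a defensive sketch you could add something like
> the following near the top of `main()`, after `n_ranks` has been obtained, so the program stops with a clear
> message rather than silently computing the wrong answer:
>
> ~~~
>   // Not part of the original solution: abort early if the stick cannot be
>   // split evenly between the ranks
>   if ((GRIDSIZE % n_ranks) != 0) {
>     if (rank == 0) {
>       printf("GRIDSIZE (%d) must be divisible by the number of ranks (%d)\n", GRIDSIZE, n_ranks);
>     }
>     MPI_Abort(MPI_COMM_WORLD, 1);
>   }
> ~~~
> {: .language-c}
{: .callout}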
+ +### main(): Invoking the Step Function + +Before we parallelise the `poisson_step()` function, let's amend how we invoke it and pass it some additional parameters it will need. +We need to amend how many slices it will compute, and add the `rank` and `n_ranks` variables, since we know from `Parallelism and Data Exchange` that it will need to perform some data exchange with other ranks: + +~~~ + unorm = poisson_step(u, unew, rho, hsq, rank_gridsize, rank, n_ranks); +~~~ +{: .language-c} + +### poisson_step(): Updating our Function Definition + +Moving into the `poisson_step()` function, we first amend our function to include the changes to parameters: + +~~~ +double poisson_step( + float *u, float *unew, float *rho, + float hsq, int points, + int rank, int n_ranks +) { +~~~ +{: .language-c} + +### poisson_step(): Calculating a Global `unorm` + +We know from `Parallelism and Data Exchange` that we need to calculate `unorm` across all ranks. +We already have it calculated for separate ranks, so need to *reduce* those values in an MPI sense to a single summed value. +For this, we can use `MPI_Allreduce()`. + +Insert the following into the `poisson_step()` function, putting the declarations at the top of the function, +and the `MPI_Allreduce()` after the calculation of `unorm`: + +~~~ + double unorm, global_unorm; + + MPI_Allreduce(&unorm, &global_unorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD); +~~~ +{: .language-c} + +So here, we use this function in an `MPI_SUM` mode, which will sum all instances of `unorm` and place the result in a single (`1`) value global_unorm`. +We must also remember to amend the return value to this global version, since we need to calculate equilibrium across the entire stick: + +~~~ + return global_unorm; +} +~~~ +{: .language-c} + +### poisson_step(): Communicate Boundaries to Neighbours + +In order for our parallel code to work, we know from `Parallelism and Data Exchange` that each section of slices is not computed in isolation. +After we've computed new values we need to send our boundary slice values to our neighbours if those neighbours exist - +the beginning and end of the stick will each only have one neighbour, so we need to account for that. + +We also need to ensure that we don't encounter a deadlock situation when exchanging the data between neighbours. +They can't all send to the rightmost neighbour simultaneously, since none will then be waiting and able to receive. +We need a message exchange strategy here, so let's have all odd-numbered ranks send their data first (to be received by even ranks), +then have our even ranks send their data (to be received by odd ranks). 
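In outline, the exchange will therefore have the following shape - just a skeleton for now, with the actual send and receive calls filled in below:

~~~
  if ((rank % 2) == 1) {
    // Odd rank: send to each existing neighbour first, then receive from it
  } else {
    // Even rank: receive from each existing neighbour first, then send to it
  }
~~~
{: .language-c}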
+Such an order might look like this (highlighting the odd ranks - only one in this example - with the order of communications indicated numerically): + +![Communication strategy - odd ranks send to potential neighbours first, then receive from them](fig/poisson_comm_strategy_1.png) + +So following the `MPI_Allreduce()` we've just added, let's deal with odd ranks first (again, put the declarations at the top of the function): + +~~~ + float sendbuf, recvbuf; + MPI_Status mpi_status; + + // The u field has been changed, communicate it to neighbours + // With blocking communication, half the ranks should send first + // and the other half should receive first + if ((rank % 2) == 1) { + // Ranks with odd number send first + + // Send data down from rank to rank - 1 + sendbuf = unew[1]; + MPI_Send(&sendbuf, 1, MPI_FLOAT, rank - 1, 1, MPI_COMM_WORLD); + // Receive data from rank - 1 + MPI_Recv(&recvbuf, 1, MPI_FLOAT, rank - 1, 2, MPI_COMM_WORLD, &mpi_status); + u[0] = recvbuf; + + if (rank != (n_ranks - 1)) { + // Send data up to rank + 1 (if I'm not the last rank) + MPI_Send(&u[points], 1, MPI_FLOAT, rank + 1, 1, MPI_COMM_WORLD); + // Receive data from rank + 1 + MPI_Recv(&u[points + 1], 1, MPI_FLOAT, rank + 1, 2, MPI_COMM_WORLD, &mpi_status); + } +~~~ +{: .language-c} + +Here we use C's inbuilt modulus operator (`%`) to determine if the rank is odd. +If so, we exchange some data with the rank preceding us, and the one following. + +We first send our newly computed leftmost value (at position `1` in our array) to the rank preceding us. +Since we're an odd rank, we can always assume a rank preceding us exists, +since the earliest odd rank 1 will exchange with rank 0. +Then, we receive the rightmost boundary value from that rank. + +Then, if the rank following us exists, we do the same, but this time we send the rightmost value at the end of our stick section, +and receive the corresponding value from that rank. + +These exchanges mean that - as an odd rank - we now have effectively exchanged the states of the start and end of our slices with our respective neighbours. + +And now we need to do the same for those neighbours (the even ranks), mirroring the same communication pattern but in the opposite order of receive/send. +From the perspective of evens, it should look like the following (highlighting the two even ranks): + +![Communication strategy - even ranks first receive from odd ranks, then send to them](fig/poisson_comm_strategy_2.png) + +~~~ + } else { + // Ranks with even number receive first + + if (rank != 0) { + // Receive data from rank - 1 (if I'm not the first rank) + MPI_Recv(&u[0], 1, MPI_FLOAT, rank - 1, 1, MPI_COMM_WORLD, &mpi_status); + // Send data down to rank-1 + MPI_Send(&u[1], 1, MPI_FLOAT, rank - 1, 2, MPI_COMM_WORLD); + } + + if (rank != (n_ranks - 1)) { + // Receive data from rank + 1 (if I'm not the last rank) + MPI_Recv(&u[points + 1], 1, MPI_FLOAT, rank + 1, 1, MPI_COMM_WORLD, &mpi_status); + // Send data up to rank+1 + MPI_Send(&u[points], 1, MPI_FLOAT, rank + 1, 2, MPI_COMM_WORLD); + } + } +~~~ +{: .language-c} + +Once complete across all ranks, every rank will then have the slice boundary data from its neighbours ready for the next iteration. + +### Running our Parallel Code + +You can obtain a full version of the parallelised Poisson code from [here](code/examples/poisson/poisson_mpi.c). 
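> ## An Alternative: Pairing Sends and Receives
>
> The odd/even ordering above is one way to avoid deadlock with blocking calls. MPI also provides
> `MPI_Sendrecv()`, which pairs a send and a receive in a single call and leaves the ordering to the MPI library.
> The snippet below is only a sketch of how the boundary exchange *could* be written this way - it is not the
> approach used in this episode's solution. It assumes the updated field is already in `u`, and uses
> `MPI_PROC_NULL` as the neighbour rank at the ends of the stick, which turns that part of the exchange into a
> no-op:
>
> ~~~
>   int down = (rank > 0) ? rank - 1 : MPI_PROC_NULL;
>   int up = (rank < n_ranks - 1) ? rank + 1 : MPI_PROC_NULL;
>
>   // Send our first slice down a rank, receive the boundary slice from the rank above
>   MPI_Sendrecv(&u[1], 1, MPI_FLOAT, down, 1,
>                &u[points + 1], 1, MPI_FLOAT, up, 1,
>                MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>
>   // Send our last slice up a rank, receive the boundary slice from the rank below
>   MPI_Sendrecv(&u[points], 1, MPI_FLOAT, up, 2,
>                &u[0], 1, MPI_FLOAT, down, 2,
>                MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> ~~~
> {: .language-c}
{: .callout}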
Now we have the parallelised code in place, we can compile and run it, e.g.:

~~~
mpicc poisson_mpi.c -o poisson_mpi -lm
mpirun -n 2 poisson_mpi
~~~
{: .language-bash}

~~~
Final result:
9-8-7-6-6-5-4-3-3-2-1-0-
Run completed in 182 iterations with residue 9.60328e-06
~~~
{: .output}

Note that, as it stands, the implementation assumes that `GRIDSIZE` is divisible by `n_ranks`.
So to guarantee correct output, we should only run it with a number of ranks that divides `GRIDSIZE` exactly
(for our `GRIDSIZE` of 12, that means 1, 2, 3, 4, 6 or 12 ranks).

### Testing our Parallel Code

We should always ensure that, as our parallel version is developed, it behaves the same as our serial version.
This may not be possible initially, particularly as large parts of the code need converting to use MPI, but where possible, we should continue to test.
So we should test once we have an initial MPI version, and as our code develops, perhaps with new optimisations to improve performance, we should test then too.

> ## An Initial Test
>
> Test the MPI version of your code against the serial version, using 1, 2, 3, and 4 ranks with the MPI version.
> Are the results as you would expect?
> What happens if you test with 5 ranks, and why?
>
> > ## Solution
> >
> > Using these rank counts, the MPI results should be the same as our serial version.
> > Using 5 ranks, our MPI version yields `9-8-7-6-5-4-3-2-1-0-0-0-`, which is incorrect.
> > This is because the `rank_gridsize = GRIDSIZE / n_ranks` calculation becomes `rank_gridsize = 12 / 5`, which, as integer division, truncates to 2.
> > This means each of the 5 ranks only operates on 2 slices, for a total of 10 slices.
> > This doesn't fill `resultbuf` with results representing the expected `GRIDSIZE` of 12, hence the incorrect answer.
> >
> > This highlights another aspect of complexity we need to take into account when writing such parallel implementations, where we must ensure a problem space is correctly subdivided.
> > In this case, we could implement a more careful way to subdivide the slices across the ranks, with some ranks obtaining more slices to make up the shortfall correctly.
>{: .solution}
{: .challenge}

> ## Limitations!
>
> You may remember that for the purposes of this episode we've assumed a homogeneous stick,
> by setting the `rho` coefficient to zero for every slice.
> As a thought experiment, if we wanted to address this limitation and model an inhomogeneous stick with different static coefficients for each slice,
> how could we amend our code to allow this correctly for each slice across all ranks?
>
>> ## Solution
>>
>> One way would be to create a static lookup array with `GRIDSIZE` elements.
>> This could be defined in a separate `.h` file and imported using `#include`.
>> Each rank could then read the `rho` values for the specific slices in its section from the array and use those.
+>> At initialisation, instead of setting it to zero we could do: +>> +>> ~~~ +>> rho[i] = rho_coefficients[(rank * rank_gridsize) + i] +>> ~~~ +>> {: .language-c} +>{: .solution} +{: .challenge} diff --git a/_episodes/10-optimising-mpi.md b/_episodes/09-optimising-mpi.md similarity index 100% rename from _episodes/10-optimising-mpi.md rename to _episodes/09-optimising-mpi.md diff --git a/_episodes/08-communication-patterns.md b/_episodes/10-communication-patterns.md similarity index 99% rename from _episodes/08-communication-patterns.md rename to _episodes/10-communication-patterns.md index b70eb88..46b5971 100644 --- a/_episodes/08-communication-patterns.md +++ b/_episodes/10-communication-patterns.md @@ -1,8 +1,8 @@ --- -title: Common Communication Patterns +title: Common Communication Patterns [Optional] slug: "dirac-intro-to-mpi-communication-patterns" -teaching: 0 -exercises: 0 +teaching: 20 +exercises: 30 math: true questions: - What are some common data communication patterns in MPI? diff --git a/_episodes/07-advanced-communication.md b/_episodes/11-advanced-communication.md similarity index 56% rename from _episodes/07-advanced-communication.md rename to _episodes/11-advanced-communication.md index 98b371b..5f44e1b 100644 --- a/_episodes/07-advanced-communication.md +++ b/_episodes/11-advanced-communication.md @@ -1,1011 +1,613 @@ ---- -title: Advanced Communication Techniques -slug: "dirac-intro-to-mpi-advanced-communication" -teaching: 25 -exercises: 20 -questions: -- How do I use complex data structures in MPI? -- What is contiguous memory, and why does it matter? -objectives: -- Understand the problems of non-contiguous memory in MPI -- Know how to define and use derived datatypes -keypoints: -- Communicating complex, heterogeneous or non-contiguous data structures in MPI requires a bit more work -- Any data being transferred should be a single contiguous block of memory -- By defining derived datatypes, we can easily send data which is not contiguous -- The functions `MPI_Pack` and `MPI_Unpack` can be used to manually create a contiguous memory block of data ---- - -We've so far seen the basic building blocks for splitting work and communicating data between ranks, meaning we're now -dangerous enough to write a simple and successful MPI application. We've worked, so far, with simple data structures, -such as single variables or small 1D arrays. In reality, any useful software we write will use more complex data -structures, such as structures, n-dimensional arrays and other complex types. Working with these in MPI require a bit -more work to communicate them correctly and efficiently. - -To help with this, MPI provides an interface to create new types known as *derived datatypes*. A derived type acts as a -way to enable the translation of complex data structures into instructions which MPI uses for efficient data access -communication. - -> ## Size limitations for messages -> -> All throughout MPI, the argument which says how many elements of data are being communicated is an integer: `int -> count`. In most 64-bit Linux systems, `int`'s are usually 32-bit and so the biggest number you can pass to `count` is -> `2^31 - 1 = 2,147,483,647`, which is about 2 billion. Arrays which exceed this length can't be communicated easily in -> versions of MPI older than MPI-4.0, when support for "large count" communication was added to the MPI standard. In -> older MPI versions, there are two workarounds to this limitation. 
The first is to communicate large arrays in -> smaller, more manageable chunks. The other is to use derived types, to re-shape the data. -{: .callout} - -## Multi-dimensional arrays - -Almost all scientific and computing problems nowadays require us to think in more than one dimension. Using -multi-dimensional arrays, such for matrices or tensors, or discretising something onto a 2D or 3D grid of points -are fundamental parts for most scientific software. However, the additional dimensions comes with additional complexity, -not just in the code we write, but also in how data is communicated. - -To create a 2 x 3 matrix, in C, and initialize it with some values, we use the following syntax, - -```c -int matrix[2][3] = { {1, 2, 3}, {4, 5, 6} }; // matrix[rows][cols] -``` - -This creates an array with two rows and three columns. The first row contains the values `{1, 2, 3}` and the second row -contains `{4, 5, 6}`. The number of rows and columns can be any value, as long as there is enough memory available. - -### The importance of memory contiguity - -When a sequence of things is contiguous, it means there are multiple adjacent things without anything in between them. -In the context of MPI, when we talk about something being contiguous we are almost always talking about how arrays, and -other complex data structures, are stored in the computer's memory. The elements in an array are contiguous when the -next, or previous, element are stored in the adjacent memory location. - -The memory space of a computer is linear. When we create a multi-dimensional array, the compiler and operating system -decide how to map and store the elements into that linear space. There are two ways to do this: -[row-major or column-major](https://en.wikipedia.org/wiki/Row-_and_column-major_order). The difference -is which elements of the array are contiguous in memory. Arrays are row-major in C and column-major in Fortran. -In a row-major array, the elements in each column of a row are contiguous, so element `x[i][j]` is -preceded by `x[i][j - 1]` and is followed by `x[i][j +1]`. In Fortran, arrays are column-major so `x(i, j)` is -followed by `x(i + 1, j)` and so on. - -The diagram below shows how a 4 x 4 matrix is mapped onto a linear memory space, for a row-major array. At the top of -the diagram is the representation of the linear memory space, where each number is ID of the element in memory. Below -that are two representations of the array in 2D: the left shows the coordinate of each element and the right shows the -ID of the element. - -Column memory layout in C - -The purple elements (5, 6, 7, 8) which map to the coordinates `[1][0]`, `[1][1]`, `[1][2]` and `[1][3]` are contiguous -in linear memory. The same applies for the orange boxes for the elements in row 2 (elements 9, 10, 11 and 12). Columns -in row-major arrays are contiguous. The next diagram instead shows how elements in adjacent rows are mapped in memory. - -Row memory layout in C - -Looking first at the purple boxes (containing elements 2, 6, 10 and 14) which make up the row elements for column 1, -we can see that the elements are not contiguous. Element `[0][1]` maps to element 2 and element `[1][1]` maps to element -6 and so on. Elements in the same column but in a different row are separated by four other elements, in this example. -In other words, elements in other rows are not contiguous. - -> ## Does memory contiguity affect performance? -> -> Do you think memory contiguity could impact the performance of our software, in a negative way? 
-> -> > ## Solution -> > -> > Yes, memory contiguity can affect how fast our programs run. When data is stored in a neat and organized way, the -> > computer can find and use it quickly. But if the data is scattered around randomly (fragmented), it takes more time -> > to locate and use it, which decreases performance. Keeping our data and data access patterns organized can -> > make our programs faster. But we probably won't notice the difference for small arrays and data structures. -> {: .solution} -{: .challenge} - -> ## What about if I use `malloc()`? -> -> More often than not, we will see `malloc()` being used to allocate memory for arrays. Especially if the code is using -> an older standard, such as C90, which does not support [variable length -> arrays](https://en.wikipedia.org/wiki/Variable-length_array). When we use `malloc()`, we get a contiguous array of -> elements. To create a 2D array using `malloc()`, we have to first create an array of pointers (which are contiguous) -> and allocate memory for each pointer, -> -> ```c -> int num_rows = 3, num_cols = 5; -> -> float **matrix = malloc(num_rows * sizeof(float*)); // Each pointer is the start of a row -> for (int i = 0; i < num_rows; ++i) { -> matrix[i] = malloc(num_cols * sizeof(float)); // Here we allocate memory to store the column elements for row i -> } -> -> for (int i = 0; i < num_rows; ++i) { -> for (int j = 0; i < num_cols; ++j) { -> matrix[i][j] = 3.14159; // Indexing is done as matrix[rows][cols] -> } -> } -> ``` -> -> There is one problem though. `malloc()` *does not* guarantee that subsequently allocated memory will be contiguous. -> When `malloc()` requests memory, the operating system will assign whatever memory is free. This is not always next to -> the block of memory from the previous allocation. This makes life tricky, since data *has* to be contiguous for MPI -> communication. But there are workarounds. One is to only use 1D arrays (with the same number of elements as the higher -> dimension array) and to map the n-dimensional coordinates into a linear coordinate system. For example, the element -> `[2][4]` in a 3 x 5 matrix would be accessed as, -> -> ```c -> int index_for_2_4 = matrix1d[5 * 2 + 4]; // num_cols * row + col -> ``` -> -> Another solution is to move memory around so that it is contiguous, such as in [this -> example](code/examples/08-malloc-trick.c) or by using a more sophisticated function such as [`arralloc()` -> function](code/arralloc.c) (not part of the standard library) which can allocate arbitrary n-dimensional arrays into a -> contiguous block. -> -{: .callout} - -For a row-major array, we can send the elements of a single row (for a 4 x 4 matrix) easily, - -```c -MPI_Send(&matrix[1][0], 4, MPI_INT ...); -``` - -The send buffer is `&matrix[1][0]`, which is the memory address of the first element in row 1. As the columns are four -elements long, we have specified to only send four integers. Even though we're working here with a 2D array, sending a -single row of the matrix is the same as sending a 1D array. Instead of using a pointer to the start of the array, an -address to the first element of the row (`&matrix[1][0]`) is used instead. It's not possible to do the same for a column -of the matrix, because the elements down the column are not contiguous. - -### Using vectors to send slices of an array - -To send a column of a matrix, we have to use a *vector*. 
A vector is a derived datatype that represents multiple (or -one) contiguous sequences of elements, which have a regular spacing between them. By using vectors, we can create data -types for column vectors, row vectors or sub-arrays, similar to how we can [create slices for Numpy arrays in -Python](https://numpy.org/doc/stable/user/basics.indexing.html), all of which can be sent in a single, efficient, -communication. To create a vector, we create a new datatype using `MPI_Type_vector()`, - -```c -int MPI_Type_vector( - int count, - int blocklength, - int stride, - MPI_Datatype oldtype, - MPI_Datatype *newtype -); -``` - -| `count`: | The number of "blocks" which make up the vector | -| `blocklength`: | The number of contiguous elements in a block | -| `stride`: | The number of elements between the start of each block | -| `oldtype`: | The data type of the elements of the vector, e.g. MPI_INT, MPI_FLOAT | -| `newtype`: | The newly created data type to represent the vector | - -To understand what the arguments mean, look at the diagram below showing a vector to send two rows of a 4 x 4 matrix -with a row in between (rows 2 and 4), - -How a vector is laid out in memory - -A *block* refers to a sequence of contiguous elements. In the diagrams above, each sequence of contiguous purple or -orange elements represents a block. The *block length* is the number of elements within a block; in the above this is -four. The *stride* is the distance between the start of each block, which is eight in the example. The count is the -number of blocks we want. When we create a vector, we're creating a new derived datatype which includes one or more -blocks of contiguous elements. - -Before we can use the vector we create to communicate data, it has to be committed using `MPI_Type_commit()`. This -finalises the creation of a derived type. Forgetting to do this step leads to unexpected behaviour, and potentially -disastrous consequences! - -```c -int MPI_Type_commit( - MPI_Datatype *datatype // The datatype to commit - note that this is a pointer -); -``` - -When a datatype is committed, resources which store information on how to handle it are internally allocated. This -contains data structures such as memory buffers as well as data used for bookkeeping. Failing to free those resources -after finishing with the vector leads to memory leaks, just like when we don't free memory created using `malloc()`. To -free up the resources, we use `MPI_Type_free()`, - -```c -int MPI_Type_free ( - MPI_Datatype *datatype // The datatype to clean up -- note this is a pointer -); -``` - -The following example code uses a vector to send two rows from a 4 x 4 matrix, as in the example diagram above. - -```c -// The vector is a MPI_Datatype -MPI_Datatype rows_type; - -// Create the vector type -const int count = 2; -const int blocklength = 4; -const int stride = 8; -MPI_Type_vector(count, blocklength, stride, MPI_INT, &rows_type); - -// Don't forget to commit it -MPI_Type_commit(&rows_type); - -// Send the middle row of our 2d matrix array. Note that we are sending -// &matrix[1][0] and not matrix. 
This is because we are using an offset -// to change the starting point of where we begin sending memory -int matrix[4][4] = { - { 1, 2, 3, 4}, - { 5, 6, 7, 8}, - { 9, 10, 11, 12}, - {13, 14, 15, 16}, -}; - -if (my_rank == 0) { - MPI_Send(&matrix[1][0], 1, rows_type, 1, 0, MPI_COMM_WORLD); -} else { - // The receive function doesn't "work" with vector types, so we have to - // say that we are expecting 8 integers instead - const int num_elements = count * blocklength; - int recv_buffer[num_elements]; - MPI_Recv(recv_buffer, num_elements, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); -} - -// The final thing to do is to free the new datatype when we no longer need it -MPI_Type_free(&rows_type); -``` - -There are two things above, which look quite innocent, but are important to understand. First of all, the send buffer -in `MPI_Send()` is not `matrix` but `&matrix[1][0]`. In `MPI_Send()`, the send buffer is a pointer to the memory -location where the start of the data is stored. In the above example, the intention is to only send the second and forth -rows, so the start location of the data to send is the address for element `[1][0]`. If we used `matrix`, the first and -third rows would be sent instead. - -The other thing to notice, which is not immediately clear why it's done this way, is that the receive datatype is -`MPI_INT` and the count is `num_elements = count * blocklength` instead of a single element of `rows_type`. This -is because when a rank receives data, the data is contiguous array. We don't need to use a vector to describe the layout -of contiguous memory. We are just receiving a contiguous array of `num_elements = count * blocklength` integers. - -> ## Sending columns from an array -> -> Create a vector type to send a column in the following 2 x 3 array: -> -> ```c -> int matrix[2][3] = { -> {1, 2, 3}, -> {4, 5, 6}, -> }; ->``` -> -> With that vector type, send the middle column of the matrix (elements `matrix[0][1]` and `matrix[1][1]`) from rank 0 -> to rank 1 and print the results. You may want to use [this code](code/solutions/skeleton-example.c) as your starting -> point. -> -> > ## Solution -> > -> > If your solution is correct you should see 2 and 5 printed to the screen. In the solution below, to send a 2 x 1 -> > column of the matrix, we created a vector with `count = 2`, `blocklength = 1` and `stride = 3`. To send the correct -> > column our send buffer was `&matrix[0][1]` which is the address of the first element in column 1. To see why the -> > stride is 3, take a look at the diagram below, -> > -> > Stride example for question -> > -> > You can see that there are *three* contiguous elements between the start of each block of 1. 
-> > -> > ```c -> > #include -> > #include -> > -> > int main(int argc, char **argv) -> > { -> > int my_rank; -> > int num_ranks; -> > MPI_Init(&argc, &argv); -> > MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); -> > MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); -> > -> > int matrix[2][3] = { -> > {1, 2, 3}, -> > {4, 5, 6}, -> > }; -> > -> > if (num_ranks != 2) { -> > if (my_rank == 0) { -> > printf("This example only works with 2 ranks\n"); -> > } -> > MPI_Abort(MPI_COMM_WORLD, 1); -> > } -> > -> > MPI_Datatype col_t; -> > MPI_Type_vector(2, 1, 3, MPI_INT, &col_t); -> > MPI_Type_commit(&col_t); -> > -> > if (my_rank == 0) { -> > MPI_Send(&matrix[0][1], 1, col_t, 1, 0, MPI_COMM_WORLD); -> > } else { -> > int buffer[2]; -> > MPI_Status status; -> > -> > MPI_Recv(buffer, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); -> > -> > printf("Rank %d received the following:", my_rank); -> > for (int i = 0; i < 2; ++i) { -> > printf(" %d", buffer[i]); -> > } -> > printf("\n"); -> > } -> > -> > MPI_Type_free(&col_t); -> > -> > return MPI_Finalize(); -> > } -> > ``` -> > -> {: .solution} -> -{: .challenge} - -> ## Sending sub-arrays of an array -> -> By using a vector type, send the middle four elements (6, 7, 10, 11) in the following 4 x 4 matrix from rank 0 to rank -> 1, -> -> ```c -> int matrix[4][4] = { -> { 1, 2, 3, 4}, -> { 5, 6, 7, 8}, -> { 9, 10, 11, 12}, -> {13, 14, 15, 16} -> }; -> ``` -> -> You can re-use most of your code from the previous exercise as your starting point, replacing the 2 x 3 matrix with -> the 4 x 4 matrix above and modifying the vector type and communication functions as required. -> -> > ## Solution -> > -> > The receiving rank(s) should receive the numbers 6, 7, 10 and 11 if your solution is correct. In the solution below, -> > we have created a vector with a count and block length of 2 and with a stride of 4. The first two -> > arguments means two vectors of block length 2 will be sent. The stride of 4 results from that there are 4 elements -> > between the start of each distinct block as shown in the image below, -> > -> > Stride example for question -> > -> > You must always remember to send the address for the starting point of the *first* block as the send buffer, which -> > is why `&array[1][1]` is the first argument in `MPI_Send()`. 
-> > -> > ```c -> > #include -> > #include -> > -> > int main(int argc, char **argv) -> > { -> > int matrix[4][4] = { -> > { 1, 2, 3, 4}, -> > { 5, 6, 7, 8}, -> > { 9, 10, 11, 12}, -> > {13, 14, 15, 16} -> > }; -> > -> > int my_rank; -> > int num_ranks; -> > MPI_Init(&argc, &argv); -> > MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); -> > MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); -> > -> > if (num_ranks != 2) { -> > if (my_rank == 0) { -> > printf("This example only works with 2 ranks\n"); -> > } -> > MPI_Abort(MPI_COMM_WORLD, 1); -> > } -> > -> > MPI_Datatype sub_array_t; -> > MPI_Type_vector(2, 2, 4, MPI_INT, &sub_array_t); -> > MPI_Type_commit(&sub_array_t); -> > -> > if (my_rank == 0) { -> > MPI_Send(&matrix[1][1], 1, sub_array_t, 1, 0, MPI_COMM_WORLD); -> > } else { -> > int buffer[4]; -> > MPI_Status status; -> > -> > MPI_Recv(buffer, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); -> > -> > printf("Rank %d received the following:", my_rank); -> > for (int i = 0; i < 4; ++i) { -> > printf(" %d", buffer[i]); -> > } -> > printf("\n"); -> > } -> > -> > MPI_Type_free(&sub_array_t); -> > -> > return MPI_Finalize(); -> > } -> > ``` -> > -> {: .solution} -{: .challenge} - -## Structures in MPI - -Structures, commonly known as structs, are custom datatypes which contain multiple variables of (usually) different -types. Some common use cases of structs, in scientific code, include grouping together constants or global variables, or -they are used to represent a physical thing, such as a particle, or something more abstract like a cell on a simulation -grid. When we use structs, we can write clearer, more concise and better structured code. - -To communicate a struct, we need to define a derived datatype which tells MPI about the layout of the struct in memory. -Instead of `MPI_Type_create_vector()`, for a struct, we use, -`MPI_Type_create_struct()`, - -```c -int MPI_Type_create_struct( - int count, - int *array_of_blocklengths, - MPI_Aint *array_of_displacements, - MPI_Datatype *array_of_types, - MPI_Datatype *newtype, -); -``` - -| `count`: | The number of fields in the struct | -| `*array_of_blocklengths`: | The length of each field, as you would use to send that field using `MPI_Send` | -| `*array_of_displacements`: | The relative positions of each field in bytes | -| `*array_of_types`: | The MPI type of each field | -| `*newtype`: | The newly created data type for the struct | - -The main difference between vector and struct derived types is that the arguments for structs expect arrays, since -structs are made up of multiple variables. Most of these arguments are straightforward, given what we've just seen for -defining vectors. But `array_of_displacements` is new and unique. - -When a struct is created, it occupies a single contiguous block of memory. But there is a catch. For performance -reasons, compilers insert arbitrary "padding" between each member for performance reasons. This padding, known as [data -structure alignment](https://en.wikipedia.org/wiki/Data_structure_alignment), optimises both the layout of the memory -and the access of it. As a result, the memory layout of a struct may look like this instead: - -Memory layout for a struct - -Although the memory used for padding and the struct's data exists in a contiguous block, the actual data we care about -is not contiguous any more. This is why we need the `array_of_displacements` argument, which specifies the distance, in -bytes, between each struct member relative to the start of the struct. 
In practise, it serves a similar purpose of the -stride in vectors. - -To calculate the byte displacement for each member, we need to know where in memory each member of a struct exists. To -do this, we can use the function `MPI_Get_address()`, - -```c -int MPI_Get_address{ - const void *location, - MPI_Aint *address, -}; -``` - -| `*location`: | A pointer to the variable we want the address of | -| `*address`: | The address of the variable, as an MPI_Aint (address integer) | - -In the following example, we use `MPI_Type_create_struct()` and `MPI_Get_address()` to create a derived type for a -struct with two members, - -```c -// Define and initialize a struct, named foo, with an int and a double -struct MyStruct { - int id; - double value; -} foo = {.id = 0, .value = 3.1459}; - -// Create arrays to describe the length of each member and their type -int count = 2; -int block_lengths[2] = {1, 1}; -MPI_Datatype block_types[2] = {MPI_INT, MPI_DOUBLE}; - -// Now we calculate the displacement of each member, which are stored in an -// MPI_Aint designed for storing memory addresses -MPI_Aint base_address; -MPI_Aint block_offsets[2]; - -MPI_Get_address(&foo, &base_address); // First of all, we find the address of the start of the struct -MPI_Get_address(&foo.id, &block_offsets[0]); // Now the address of the first member "id" -MPI_Get_address(&foo.value, &block_offsets[1]); // And the second member "value" - -// Calculate the offsets, by subtracting the address of each field from the -// base address of the struct -for (int i = 0; i < 2; ++i) { - // MPI_Aint_diff is a macro to calculate the difference between two - // MPI_Aints and is a replacement for: - // (MPI_Aint) ((char *) block_offsets[i] - (char *) base_address) - block_offsets[i] = MPI_Aint_diff(block_offsets[i], base_address); -} - -// We finally can create out struct data type -MPI_Datatype struct_type; -MPI_Type_create_struct(count, block_lengths, block_offsets, block_types, &struct_type); -MPI_Type_commit(&struct_type); - -// Another difference between vector and struct derived types is that in -// MPI_Recv, we use the struct type. We have to do this because we aren't -// receiving a contiguous block of a single type of date. By using the type, we -// tell MPI_Recv how to understand the mix of data types and padding and how to -// assign those back to recv_struct -if (my_rank == 0) { - MPI_Send(&foo, 1, struct_type, 1, 0, MPI_COMM_WORLD); -} else { - struct MyStruct recv_struct; - MPI_Recv(&recv_struct, 1, struct_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); -} - -// Remember to free the derived type -MPI_Type_free(&struct_type); -``` - -> ## Sending a struct -> -> By using a derived data type, write a program to send the following struct `struct Node node` from one rank to -> another, -> -> ```c -> struct Node { -> int id; -> char name[16]; -> double temperature; -> }; -> -> struct Node node = { .id = 0, .name = "Dale Cooper", .temperature = 42}; ->``` -> -> You may wish to use [this skeleton code](code/solutions/skeleton-example.c) as your stating point. -> -> > ## Solution -> > -> > Your solution should look something like the code block below. When sending a *static* array (`name[16]`), we have -> > to use a count of 16 in the `block_lengths` array for that member. 
-> > -> > ```c -> > #include -> > #include -> > -> > struct Node { -> > int id; -> > char name[16]; -> > double temperature; -> > }; -> > -> > int main(int argc, char **argv) -> > { -> > int my_rank; -> > int num_ranks; -> > MPI_Init(&argc, &argv); -> > MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); -> > MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); -> > -> > if (num_ranks != 2) { -> > if (my_rank == 0) { -> > printf("This example only works with 2 ranks\n"); -> > } -> > MPI_Abort(MPI_COMM_WORLD, 1); -> > } -> > -> > struct Node node = {.id = 0, .name = "Dale Cooper", .temperature = 42}; -> > -> > int block_lengths[3] = {1, 16, 1}; -> > MPI_Datatype block_types[3] = {MPI_INT, MPI_CHAR, MPI_DOUBLE}; -> > -> > MPI_Aint base_address; -> > MPI_Aint block_offsets[3]; -> > MPI_Get_address(&node, &base_address); -> > MPI_Get_address(&node.id, &block_offsets[0]); -> > MPI_Get_address(&node.name, &block_offsets[1]); -> > MPI_Get_address(&node.temperature, &block_offsets[2]); -> > for (int i = 0; i < 3; ++i) { -> > block_offsets[i] = MPI_Aint_diff(block_offsets[i], base_address); -> > } -> > -> > MPI_Datatype node_struct; -> > MPI_Type_create_struct(3, block_lengths, block_offsets, block_types, &node_struct); -> > MPI_Type_commit(&node_struct); -> > -> > if (my_rank == 0) { -> > MPI_Send(&node, 1, node_struct, 1, 0, MPI_COMM_WORLD); -> > } else { -> > struct Node recv_node; -> > MPI_Recv(&recv_node, 1, node_struct, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); -> > printf("Received node: id = %d name = %s temperature %f\n", recv_node.id, recv_node.name, -> > recv_node.temperature); -> > } -> > -> > MPI_Type_free(&node_struct); -> > -> > return MPI_Finalize(); -> > } -> > ``` -> > -> {: .solution} -{: .challenge} - -> ## What if I have a pointer in my struct? -> -> Suppose we have the following struct with a pointer named `position` and some other fields, -> -> ```c -> struct Grid { -> double *position; -> int num_cells; -> }; -> grid.position = malloc(3 * sizeof(double)); ->``` -> -> If we use `malloc()` to allocate memory for `position`, how would we send data in the struct and the memory we -> allocated one rank to another? If you are unsure, try writing a short program to create a derived type for the struct. -> -> > ## Solution -> > -> > The short answer is that we can't do it using a derived type, and will have to *manually* communicate the data -> > separately. The reason why can't use a derived type is because the address of `*position` is the address of the -> > pointer. The offset between `num_cells` and `*position` is the size of the pointer and whatever padding the compiler -> > adds. It is not the data which `position` points to. The memory we allocated for `*position` is somewhere else in -> > memory, as shown in the diagram below, and is non-contiguous with respect to the fields in the struct. -> > -> > Memory layout for a struct with a pointer -> > -> > -> {: .solution} -> -{: .challenge} - -> ## A different way to calculate displacements -> -> There are other ways to calculate the displacement, other than using what MPI provides for us. Another common way is -> to use the `offsetof()` macro part of ``. `offsetof()` accepts two arguments, the first being the struct -> type and the second being the member to calculate the offset for. 
-> -> ```c -> #include -> MPI_Aint displacements[2]; -> displacements[0] = (MPI_Aint) offsetof(struct MyStruct, id); // The cast to MPI_Aint is for extra safety -> displacements[1] = (MPI_Aint) offsetof(struct MyStruct, value); ->``` -> -> This method and the other shown in the previous examples both returns the same displacement values. It's mostly a -> personal choice which you choose to use. Some people prefer the "safety" of using `MPI_Get_address()` whilst others -> prefer to write more concise code with `offsetof()`. Of course, if you're a Fortran programmer then you can't use the -> macro! -> -{: .callout} - -## Dealing with other non-contiguous data - -The previous two sections covered how to communicate complex but structured data between ranks using derived datatypes. -However, there are *always* some edge cases which don't fit into a derived types. For example, just in the last exercise -we've seen that pointers and derived types don't mix well. Furthermore, we can sometimes also reach performance -bottlenecks when working with heterogeneous data which doesn't fit, or doesn't make sense to be, in a derived type, as -each data type needs to be communicated in separate communication calls. This can be especially bad if blocking -communication is used! For edge cases situations like this, we can use the `MPI_Pack()` and `MPI_Unpack()` functions to -do things ourselves. - -Both `MPI_Pack()` and `MPI_Unpack()` are methods for manually arranging, packing and unpacking data into a contiguous -buffer, for cases where regular communication methods and derived types don't work well or efficiently. They can also be -used to create self-documenting message, where the packed data contains additional elements which describe the size, -structure and contents of the data. But we have to be careful, as using packed buffers comes with additional overhead, -in the form of increased memory usage and potentially more communication overhead as packing and unpacking data is not -free. - -When we use `MPI_Pack()`, we take non-contiguous data (sometimes of different datatypes) and "pack" it into a -contiguous memory buffer. The diagram below shows how two (non-contiguous) chunks of data may be packed into a contiguous -array using `MPI_Pack()`. - -![Layout of packed memory](fig/packed_buffer_layout.png) - -The coloured boxes in both memory representations (memory and pakced) are the same chunks of data. The green boxes -containing only a single number are used to document the number of elements in the block of elements they are adjacent -to, in the contiguous buffer. This is optional to do, but is generally good practise to include to create a -self-documenting message. From the diagram we can see that we have "packed" non-contiguous blocks of memory into a -single contiguous block. We can do this using `MPI_Pack()`. To reverse this action, and "unpack" the buffer, we use -`MPI_Unpack()`. As you might expect, `MPI_Unpack()` takes a buffer, created by `MPI_Pack()` and unpacks the data back -into various memory address. 
- -To pack data into a contiguous buffer, we have to pack each block of data, one by one, into the contiguous buffer using -the `MPI_Pack()` function, - -```c -int MPI_Pack( - const void *inbuf, - int incount, - MPI_Datatype datatype, - void *outbuf, - int outsize, - int *position, - MPI_Comm comm -); -``` - -| `*inbuf`: | The data to pack into the buffer | -| `incount`: | The number of elements to pack | -| `datatype`: | The data type of the data to pack | -| `*outbuf`: | The out buffer of contiguous data | -| `outsize`: | The size of the out buffer, in bytes | -| `*position`: | A counter for how far into the contiguous buffer to write (records the position, in bytes) | -| `comm`: | The communicator | - -In the above, `inbuf` is the data we want to pack into a contiguous buffer and `incount` and `datatype` define the -number of elements in and the datatype of `inbuf`. The parameter `outbuf` is the contiguous buffer the data is packed -into, with `outsize` being the total size of the buffer in *bytes*. The `position` argument is used to keep track of the -current position, in bytes, where data is being packed into `outbuf`. - -Uniquely, `MPI_Pack()`, and `MPI_Unpack()` as well, measure the size of the contiguous buffer, `outbuf`, in bytes rather than -in number of elements. Given that `MPI_Pack()` is all about manually arranging data, we have to also manage the -allocation of memory for `outbuf`. But how do we allocate memory for it, and how much should we allocate? Allocation is -done by using `malloc()`. Since `MPI_Pack()` works with `outbuf` in terms of bytes, the convention is to declare -`outbuf` as a `char *`. The amount of memory to allocate is simply the amount of space, in bytes, required to store all -of the data we want to pack into it. Just like how we would normally use `malloc()` to create an array. If we had -an integer array and a floating point array which we wanted to pack into the buffer, then the size required is easy to -calculate, - -```c -// The total buffer size is the sum of the bytes required for the int and float array -int size_int_array = num_int_elements * sizeof(int); -int size_float_array = num_float_elements * sizeof(float); -int buffer_size = size_int_array + size_float_array; -// The buffer is a char *, but could also be cast as void * if you prefer -char *buffer = malloc(buffer_size * sizeof(char)); // a char is 1 byte, so sizeof(char) is optional -``` - -If we are also working with derived types, such as vectors or structs, then we need to find the size of those types. By -far the easiest way to handle these is to use `MPI_Pack_size()`, which supports derived datatypes through the -`MPI_Datatype`, - -```c -nt MPI_Pack_size( - int incount, - MPI_Datatype datatype, - MPI_Comm comm, - int *size -); -``` - -| `incount`: | The number of data elements | -| `datatype`: | The data type of the data | -| `comm`: | The communicator | -| `*size`: | The calculated upper size limit for the buffer, in bytes | - -`MPI_Pack_size()` is a helper function to calculate the *upper bound* of memory required. It is, in general, preferable -to calculate the buffer size using this function, as it takes into account any implementation specific MPI detail and -thus is more portable between implementations and systems. 
If we wanted to calculate the memory required for three -elements of some derived struct type and a `double` array, we would do the following, - -```c -int struct_array_size, float_array_size; -MPI_Pack_size(3, STRUCT_DERIVED_TYPE, MPI_COMM_WORLD, &struct_array_size); -MPI_Pack_size(50, MPI_DOUBLE. MPI_COMM_WORLD, &float_array_size); -int buffer_size = struct_array_size + float_array_size; -``` - -When a rank has received a contiguous buffer, it has to be unpacked into its constituent parts, one by one, using -`MPI_Unpack()`, - -```c -int MPI_Unpack( - void *inbuf, - int insize, - int *position, - void *outbuf, - int outcount, - MPI_Datatype datatype, - MPI_Comm comm, -); -``` - -| `*inbuf`: | The contiguous buffer to unpack | -| `insize`: | The total size of the buffer, in bytes | -| `*position`: | The position, in bytes, from where to start unpacking from | -| `*outbuf`: | An array, or variable, to unpack data into -- this is the output | -| `outcount`: | The number of elements of data to unpack | -| `datatype`: | The data type of elements to unpack | -| `comm`: | The communicator | - -The arguments for this function are essentially the reverse of `MPI_Pack()`. Instead of being the buffer to pack into, -`inbuf` is now the packed buffer and `position` is the position, in bytes, in the buffer where to unpacking from. -`outbuf` is then the variable we want to unpack into, and `outcount` is the number of elements of `datatype` to unpack. - -In the example below, `MPI_Pack()`, `MPI_Pack_size()` and `MPI_Unpack()` are used to communicate a (non-contiguous) -3 x 3 matrix. - -```c -// Allocate and initialise a (non-contiguous) 2D matrix that we will pack into -// a buffer -int num_rows = 3, num_cols = 3; -int **matrix = malloc(num_rows * sizeof(int *)); -for (int i = 0; i < num_rows; ++i) { - matrix[i] = malloc(num_cols * sizeof(int)); - for (int j = 0; i < num_cols; ++j) { - matrix[i][j] = num_cols * i + j; - } -} - -// Determine the upper limit for the amount of memory the buffer requires. Since -// this is a simple situation, we could probably have done this manually using -// `num_rows * num_cols * sizeof(int)`. The size `pack_buffer_size` is returned in -// bytes -int pack_buffer_size; -MPI_Pack_size(num_rows * num_cols, MPI_INT, MPI_COMM_WORLD, &pack_buffer_size); - -if (my_rank == 0) { - // Create the pack buffer and pack each row of data into it buffer - // one by one - int position = 0; - char *packed_data = malloc(pack_buffer_size); - for (int i = 0; i < num_rows; ++i) { - MPI_Pack(matrix[i], num_cols, MPI_INT, packed_data, pack_buffer_size, &position, MPI_COMM_WORLD); - } - - // Send the packed data to rank 1 - MPI_Send(packed_data, pack_buffer_size, MPI_PACKED, 1, 0, MPI_COMM_WORLD); -} else { - // Create a receive buffer and get the packed buffer from rank 0 - char *received_data = malloc(pack_buffer_size); - MPI_Recv(received_data, pack_buffer_size + 1, MPI_PACKED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); - - // Allocate a matrix to put the receive buffer into -- this is for demonstration purposes - int **my_matrix = malloc(num_rows * sizeof(int *)); - for (int i = 0; i < num_cols; ++i) { - my_matrix[i] = malloc(num_cols * sizeof(int)); - } - - // Unpack the received data row by row into my_matrix - int position = 0; - for (int i = 0; i < num_rows; ++i) { - MPI_Unpack(received_data, pack_buffer_size, &position, my_matrix[i], num_cols, MPI_INT, MPI_COMM_WORLD); - } -} -``` - -> ## Blocking or non-blocking? 
-> -> The processes of packing data into a contiguous buffer does not happen asynchronously. The same goes for unpacking -> data. But this doesn't restrict the packed data from being only sent synchronously. The packed data can be -> communicated using any communication function, just like the previous derived types. It works just as well to -> communicate the buffer using non-blocking methods, as it does using blocking methods. -{: .callout} - -> ## What if the other rank doesn't know the size of the buffer? -> -> In some cases, the receiving rank may not know the size of the buffer used in `MPI_Pack()`. This could happen if a -> message is sent and received in different functions, if some ranks have different branches through the program -> or if communication happens in a dynamic or non-sequential way. -> -> In these situations, we can use `MPI_Probe()` and `MPI_Get_count` to find the a message being sent and to get the -> number of elements in the message. -> -> ```c -> // First probe for a message, to get the status of it -> MPI_Status status; -> MPI_Probe(0, 0, MPI_COMM_WORLD, &status); -> // Using MPI_Get_count we can get the number of elements of a particular data type -> int message_size; -> MPI_Get_count(&status, MPI_PACKED, &buffer_size); -> // MPI_PACKED represents an element of a "byte stream." So, buffer_size is the size of the buffer to allocate -> char *buffer = malloc(buffer_size); ->``` -> -{: .callout} - -> ## Sending heterogeneous data in a single communication -> -> Suppose we have two arrays below, where one contains integer data and the other floating point data. Normally we would -> use multiple communication calls to send each type of data individually, for a known number of elements. For this -> exercise, communicate both arrays using a packed memory buffer. -> -> ```c -> int int_data_count = 5; -> int float_data_count = 10; -> -> int *int_data = malloc(int_data_count * sizeof(int)); -> float *float_data = malloc(float_data_count * sizeof(float)); -> -> // Initialize the arrays with some values -> for (int i = 0; i < int_data_count; ++i) { -> int_data[i] = i + 1; -> } -> for (int i = 0; i < float_data_count; ++i) { -> float_data[i] = 3.14159 * (i + 1); -> } -> ``` -> -> Since the arrays are dynamically allocated, in rank 0, you should also pack the number of elements in each array. Rank -> 1 may also not know the size of the buffer. How would you deal with that? -> -> You can use this [skeleton code](code/solutions/07-pack-skeleton.c) to begin with. -> -> > ## Solution -> > -> > The additional restrictions for rank 1 not knowing the size of the arrays or packed buffer add some complexity to -> > receiving the packed buffer from rank 0. 
-> > -> > ```c -> > #include -> > #include -> > #include -> > -> > int main(int argc, char **argv) -> > { -> > int my_rank; -> > int num_ranks; -> > MPI_Init(&argc, &argv); -> > MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); -> > MPI_Comm_size(MPI_COMM_WORLD, &num_ranks); -> > -> > if (num_ranks != 2) { -> > if (my_rank == 0) { -> > printf("This example only works with 2 ranks\n"); -> > } -> > MPI_Abort(MPI_COMM_WORLD, 1); -> > } -> > -> > if (my_rank == 0) { -> > int int_data_count = 5, float_data_count = 10; -> > int *int_data = malloc(int_data_count * sizeof(int)); -> > float *float_data = malloc(float_data_count * sizeof(float)); -> > for (int i = 0; i < int_data_count; ++i) { -> > int_data[i] = i + 1; -> > } -> > for (int i = 0; i < float_data_count; ++i) { -> > float_data[i] = 3.14159 * (i + 1); -> > } -> > -> > // use MPI_Pack_size to determine how big the packed buffer needs to be -> > int buffer_size_count, buffer_size_int, buffer_size_float; -> > MPI_Pack_size(2, MPI_INT, MPI_COMM_WORLD, &buffer_size_count); // 2 * INT because we will have 2 counts -> > MPI_Pack_size(int_data_count, MPI_INT, MPI_COMM_WORLD, &buffer_size_int); -> > MPI_Pack_size(float_data_count, MPI_FLOAT, MPI_COMM_WORLD, &buffer_size_float); -> > int total_buffer_size = buffer_size_int + buffer_size_float + buffer_size_count; -> > -> > int position = 0; -> > char *buffer = malloc(total_buffer_size); -> > -> > // Pack the data size, followed by the actually data -> > MPI_Pack(&int_data_count, 1, MPI_INT, buffer, total_buffer_size, &position, MPI_COMM_WORLD); -> > MPI_Pack(int_data, int_data_count, MPI_INT, buffer, total_buffer_size, &position, MPI_COMM_WORLD); -> > MPI_Pack(&float_data_count, 1, MPI_INT, buffer, total_buffer_size, &position, MPI_COMM_WORLD); -> > MPI_Pack(float_data, float_data_count, MPI_FLOAT, buffer, total_buffer_size, &position, MPI_COMM_WORLD); -> > -> > // buffer is sent in one communication using MPI_PACKED -> > MPI_Send(buffer, total_buffer_size, MPI_PACKED, 1, 0, MPI_COMM_WORLD); -> > -> > free(buffer); -> > free(int_data); -> > free(float_data); -> > } else { -> > int buffer_size; -> > MPI_Status status; -> > MPI_Probe(0, 0, MPI_COMM_WORLD, &status); -> > MPI_Get_count(&status, MPI_PACKED, &buffer_size); -> > -> > char *buffer = malloc(buffer_size); -> > MPI_Recv(buffer, buffer_size, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status); -> > -> > int position = 0; -> > int int_data_count, float_data_count; -> > -> > // Unpack an integer why defines the size of the integer array, -> > // then allocate space for an unpack the actual array -> > MPI_Unpack(buffer, buffer_size, &position, &int_data_count, 1, MPI_INT, MPI_COMM_WORLD); -> > int *int_data = malloc(int_data_count * sizeof(int)); -> > MPI_Unpack(buffer, buffer_size, &position, int_data, int_data_count, MPI_INT, MPI_COMM_WORLD); -> > -> > MPI_Unpack(buffer, buffer_size, &position, &float_data_count, 1, MPI_INT, MPI_COMM_WORLD); -> > float *float_data = malloc(float_data_count * sizeof(float)); -> > MPI_Unpack(buffer, buffer_size, &position, float_data, float_data_count, MPI_FLOAT, MPI_COMM_WORLD); -> > -> > printf("int data: ["); -> > for (int i = 0; i < int_data_count; ++i) { -> > printf(" %d", int_data[i]); -> > } -> > printf(" ]\n"); -> > -> > printf("float data: ["); -> > for (int i = 0; i < float_data_count; ++i) { -> > printf(" %f", float_data[i]); -> > } -> > printf(" ]\n"); -> > -> > free(int_data); -> > free(float_data); -> > free(buffer); -> > } -> > -> > return MPI_Finalize(); -> > } -> > ``` -> > -> {: .solution} -{: .challenge} +--- 
+title: Advanced Data Communication [Optional] +slug: "dirac-intro-to-mpi-advanced-data-communication" +teaching: 25 +exercises: 20 +questions: +- How do I communicate structures? +- How do I communicate non-contiguous data which isn't easy to express as a derived data type? +objectives: +- Know how to define and use derived datatypes to communicate structures +- Know how to use `MPI_Pack` and `MPI_Unpack` to communicate complex data structures +keypoints: +- Structures can be communicated easier by using `MPI_Type_create_struct` to create a derived type describing the structure +- The functions `MPI_Pack` and `MPI_Unpack` can be used to manually create a contiguous memory block of data, to communicate complex and/or heterogeneous data structures +--- + +In an earlier episode, we introduced the concept of derived data types to send vectors or a sub-array of a larger array, +which may or may not be contiguous in memory. Other than vectors, there are multiple other types of derived data types +that allow us to handle other complex data structures efficiently. In this episode, we will see how to create structure +derived types. Additionally, we will also learn how to use `MPI_Pack` and `MPI_Unpack` to manually pack complex data +structures and heterogeneous into a single contiguous buffer, when other methods of communication are too complicated or +inefficient. + +## Structures in MPI + +Structures, commonly known as structs, are custom datatypes which contain multiple variables of (usually) different +types. Some common use cases of structs, in scientific code, include grouping together constants or global variables, or +they are used to represent a physical thing, such as a particle, or something more abstract like a cell on a simulation +grid. When we use structs, we can write clearer, more concise and better structured code. + +To communicate a struct, we need to define a derived datatype which tells MPI about the layout of the struct in memory. +Instead of `MPI_Type_create_vector()`, for a struct, we use, +`MPI_Type_create_struct()`, + +```c +int MPI_Type_create_struct( + int count, + int *array_of_blocklengths, + MPI_Aint *array_of_displacements, + MPI_Datatype *array_of_types, + MPI_Datatype *newtype, +); +``` + +| `count`: | The number of fields in the struct | +| `*array_of_blocklengths`: | The length of each field, as you would use to send that field using `MPI_Send` | +| `*array_of_displacements`: | The relative positions of each field in bytes | +| `*array_of_types`: | The MPI type of each field | +| `*newtype`: | The newly created data type for the struct | + +The main difference between vector and struct derived types is that the arguments for structs expect arrays, since +structs are made up of multiple variables. Most of these arguments are straightforward, given what we've just seen for +defining vectors. But `array_of_displacements` is new and unique. + +When a struct is created, it occupies a single contiguous block of memory. But there is a catch. For performance +reasons, compilers insert arbitrary "padding" between each member for performance reasons. This padding, known as [data +structure alignment](https://en.wikipedia.org/wiki/Data_structure_alignment), optimises both the layout of the memory +and the access of it. As a result, the memory layout of a struct may look like this instead: + +Memory layout for a struct + +Although the memory used for padding and the struct's data exists in a contiguous block, the actual data we care about +is not contiguous any more. 
This is why we need the `array_of_displacements` argument, which specifies the distance, in
+bytes, between each struct member relative to the start of the struct. In practice, it serves a similar purpose to the
+stride in vectors.
+
+To calculate the byte displacement for each member, we need to know where in memory each member of a struct exists. To
+do this, we can use the function `MPI_Get_address()`,
+
+```c
+int MPI_Get_address(
+    const void *location,
+    MPI_Aint *address,
+);
+```
+
+| `*location`: | A pointer to the variable we want the address of |
+| `*address`: | The address of the variable, as an MPI_Aint (address integer) |
+
+In the following example, we use `MPI_Type_create_struct()` and `MPI_Get_address()` to create a derived type for a
+struct with two members,
+
+```c
+// Define and initialize a struct, named foo, with an int and a double
+struct MyStruct {
+    int id;
+    double value;
+} foo = {.id = 0, .value = 3.1459};
+
+// Create arrays to describe the length of each member and their type
+int count = 2;
+int block_lengths[2] = {1, 1};
+MPI_Datatype block_types[2] = {MPI_INT, MPI_DOUBLE};
+
+// Now we calculate the displacement of each member, which is stored in an
+// MPI_Aint designed for storing memory addresses
+MPI_Aint base_address;
+MPI_Aint block_offsets[2];
+
+MPI_Get_address(&foo, &base_address);            // First of all, we find the address of the start of the struct
+MPI_Get_address(&foo.id, &block_offsets[0]);     // Now the address of the first member "id"
+MPI_Get_address(&foo.value, &block_offsets[1]);  // And the second member "value"
+
+// Calculate the offsets, by subtracting the address of each field from the
+// base address of the struct
+for (int i = 0; i < 2; ++i) {
+    // MPI_Aint_diff is a macro to calculate the difference between two
+    // MPI_Aints and is a replacement for:
+    // (MPI_Aint) ((char *) block_offsets[i] - (char *) base_address)
+    block_offsets[i] = MPI_Aint_diff(block_offsets[i], base_address);
+}
+
+// We can finally create our struct data type
+MPI_Datatype struct_type;
+MPI_Type_create_struct(count, block_lengths, block_offsets, block_types, &struct_type);
+MPI_Type_commit(&struct_type);
+
+// Another difference between vector and struct derived types is that in
+// MPI_Recv, we use the struct type. We have to do this because we aren't
+// receiving a contiguous block of a single type of data. By using the type, we
+// tell MPI_Recv how to understand the mix of data types and padding and how to
+// assign those back to recv_struct
+if (my_rank == 0) {
+    MPI_Send(&foo, 1, struct_type, 1, 0, MPI_COMM_WORLD);
+} else {
+    struct MyStruct recv_struct;
+    MPI_Recv(&recv_struct, 1, struct_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+}
+
+// Remember to free the derived type
+MPI_Type_free(&struct_type);
+```
+
+> ## Sending a struct
+>
+> By using a derived data type, write a program to send the following struct `struct Node node` from one rank to
+> another,
+>
+> ```c
+> struct Node {
+>     int id;
+>     char name[16];
+>     double temperature;
+> };
+>
+> struct Node node = { .id = 0, .name = "Dale Cooper", .temperature = 42};
+> ```
+>
+> You may wish to use [this skeleton code](code/solutions/skeleton-example.c) as your starting point.
+>
+> > ## Solution
+> >
+> > Your solution should look something like the code block below. When sending a *static* array (`name[16]`), we have
+> > to use a count of 16 in the `block_lengths` array for that member.
+> >
+> > ```c
+> > #include <mpi.h>
+> > #include <stdio.h>
+> >
+> > struct Node {
+> >     int id;
+> >     char name[16];
+> >     double temperature;
+> > };
+> >
+> > int main(int argc, char **argv)
+> > {
+> >     int my_rank;
+> >     int num_ranks;
+> >     MPI_Init(&argc, &argv);
+> >     MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
+> >     MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);
+> >
+> >     if (num_ranks != 2) {
+> >         if (my_rank == 0) {
+> >             printf("This example only works with 2 ranks\n");
+> >         }
+> >         MPI_Abort(MPI_COMM_WORLD, 1);
+> >     }
+> >
+> >     struct Node node = {.id = 0, .name = "Dale Cooper", .temperature = 42};
+> >
+> >     int block_lengths[3] = {1, 16, 1};
+> >     MPI_Datatype block_types[3] = {MPI_INT, MPI_CHAR, MPI_DOUBLE};
+> >
+> >     MPI_Aint base_address;
+> >     MPI_Aint block_offsets[3];
+> >     MPI_Get_address(&node, &base_address);
+> >     MPI_Get_address(&node.id, &block_offsets[0]);
+> >     MPI_Get_address(&node.name, &block_offsets[1]);
+> >     MPI_Get_address(&node.temperature, &block_offsets[2]);
+> >     for (int i = 0; i < 3; ++i) {
+> >         block_offsets[i] = MPI_Aint_diff(block_offsets[i], base_address);
+> >     }
+> >
+> >     MPI_Datatype node_struct;
+> >     MPI_Type_create_struct(3, block_lengths, block_offsets, block_types, &node_struct);
+> >     MPI_Type_commit(&node_struct);
+> >
+> >     if (my_rank == 0) {
+> >         MPI_Send(&node, 1, node_struct, 1, 0, MPI_COMM_WORLD);
+> >     } else {
+> >         struct Node recv_node;
+> >         MPI_Recv(&recv_node, 1, node_struct, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+> >         printf("Received node: id = %d name = %s temperature %f\n", recv_node.id, recv_node.name,
+> >                recv_node.temperature);
+> >     }
+> >
+> >     MPI_Type_free(&node_struct);
+> >
+> >     return MPI_Finalize();
+> > }
+> > ```
+> >
+> {: .solution}
+{: .challenge}
+
+> ## What if I have a pointer in my struct?
+>
+> Suppose we have the following struct with a pointer named `position` and some other fields,
+>
+> ```c
+> struct Grid {
+>     double *position;
+>     int num_cells;
+> };
+>
+> struct Grid grid;
+> grid.position = malloc(3 * sizeof(double));
+> ```
+>
+> If we use `malloc()` to allocate memory for `position`, how would we send data in the struct and the memory we
+> allocated from one rank to another? If you are unsure, try writing a short program to create a derived type for the struct.
+>
+> > ## Solution
+> >
+> > The short answer is that we can't do it using a derived type, and will have to *manually* communicate the data
+> > separately. The reason why we can't use a derived type is that the address of `*position` is the address of the
+> > pointer. The offset between `num_cells` and `*position` is the size of the pointer and whatever padding the compiler
+> > adds. It is not the data which `position` points to. The memory we allocated for `*position` is somewhere else in
+> > memory, as shown in the diagram below, and is non-contiguous with respect to the fields in the struct.
+> >
+> > Memory layout for a struct with a pointer
+> >
+> >
+> {: .solution}
+>
+{: .challenge}
+
+> ## A different way to calculate displacements
+>
+> There are other ways to calculate the displacement, other than using what MPI provides for us. Another common way is
+> to use the `offsetof()` macro, part of `<stddef.h>`. `offsetof()` accepts two arguments, the first being the struct
+> type and the second being the member to calculate the offset for.
+>
+> ```c
+> #include <stddef.h>
+> MPI_Aint displacements[2];
+> displacements[0] = (MPI_Aint) offsetof(struct MyStruct, id);     // The cast to MPI_Aint is for extra safety
+> displacements[1] = (MPI_Aint) offsetof(struct MyStruct, value);
+> ```
+>
+> This method and the other shown in the previous examples both return the same displacement values. It's mostly a
+> personal choice which one you use. Some people prefer the "safety" of using `MPI_Get_address()`, whilst others
+> prefer to write more concise code with `offsetof()`. Of course, if you're a Fortran programmer then you can't use the
+> macro!
+>
+{: .callout}
+
+## Complex non-contiguous and heterogeneous data
+
+The previous section, and the earlier episode on derived data types, covered how to communicate complex but structured
+data between ranks using derived datatypes. However, there are *always* some edge cases which don't fit into a derived
+type. For example, just in the last exercise we've seen that pointers and derived types don't mix well. Furthermore, we
+can sometimes also reach performance bottlenecks when working with heterogeneous data which doesn't fit, or doesn't make
+sense to be, in a derived type, as each data type needs to be communicated in separate communication calls. This can be
+especially bad if blocking communication is used! For edge cases like this, we can use the `MPI_Pack()` and
+`MPI_Unpack()` functions to do things ourselves.
+
+Both `MPI_Pack()` and `MPI_Unpack()` are methods for manually arranging, packing and unpacking data into a contiguous
+buffer, for cases where regular communication methods and derived types don't work well or efficiently. They can also be
+used to create self-documenting messages, where the packed data contains additional elements which describe the size,
+structure and contents of the data. But we have to be careful, as using packed buffers comes with additional overhead,
+in the form of increased memory usage and potentially more communication overhead as packing and unpacking data is not
+free.
+
+When we use `MPI_Pack()`, we take non-contiguous data (sometimes of different datatypes) and "pack" it into a
+contiguous memory buffer. The diagram below shows how two (non-contiguous) chunks of data may be packed into a contiguous
+array using `MPI_Pack()`.
+
+![Layout of packed memory](fig/packed_buffer_layout.png)
+
+The coloured boxes in both memory representations (unpacked and packed) are the same chunks of data. The green boxes
+containing only a single number are used to document the number of elements in the block of elements they are adjacent
+to, in the contiguous buffer. This is optional to do, but is generally good practice to include to create a
+self-documenting message. From the diagram we can see that we have "packed" non-contiguous blocks of memory into a
+single contiguous block. We can do this using `MPI_Pack()`. To reverse this action, and "unpack" the buffer, we use
+`MPI_Unpack()`. As you might expect, `MPI_Unpack()` takes a buffer, created by `MPI_Pack()`, and unpacks the data back
+into various memory addresses.
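+
+Before looking at each function in detail, the sketch below shows the overall pack, send and unpack pattern in one
+place. This is an illustration of our own rather than part of the lesson's exercises: the variable names, the
+fixed-size buffer and the choice of one `int` and one `double` are assumptions, and in real code the buffer size
+should come from `MPI_Pack_size()`, which is introduced below.
+
+```c
+// A rough sketch of the pack -> send -> unpack round trip, assuming exactly
+// two ranks and that my_rank has already been set with MPI_Comm_rank()
+int an_int = 42;
+double a_double = 3.14;
+int position = 0;  // tracks how far, in bytes, we have packed or unpacked
+char buffer[64];   // assumed to be big enough for one int and one double
+
+if (my_rank == 0) {
+    // Pack the int, then the double, into the contiguous buffer
+    MPI_Pack(&an_int, 1, MPI_INT, buffer, sizeof(buffer), &position, MPI_COMM_WORLD);
+    MPI_Pack(&a_double, 1, MPI_DOUBLE, buffer, sizeof(buffer), &position, MPI_COMM_WORLD);
+    // position is now the number of bytes actually packed, so send that many
+    MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
+} else {
+    MPI_Recv(buffer, sizeof(buffer), MPI_PACKED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+    // Unpack in the same order the data was packed (position is still 0 on this rank)
+    MPI_Unpack(buffer, sizeof(buffer), &position, &an_int, 1, MPI_INT, MPI_COMM_WORLD);
+    MPI_Unpack(buffer, sizeof(buffer), &position, &a_double, 1, MPI_DOUBLE, MPI_COMM_WORLD);
+}
+```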
+
+To pack data into a contiguous buffer, we have to pack each block of data, one by one, into the contiguous buffer using
+the `MPI_Pack()` function,
+
+```c
+int MPI_Pack(
+    const void *inbuf,
+    int incount,
+    MPI_Datatype datatype,
+    void *outbuf,
+    int outsize,
+    int *position,
+    MPI_Comm comm
+);
+```
+
+| `*inbuf`: | The data to pack into the buffer |
+| `incount`: | The number of elements to pack |
+| `datatype`: | The data type of the data to pack |
+| `*outbuf`: | The output buffer of contiguous data |
+| `outsize`: | The size of the output buffer, in bytes |
+| `*position`: | A counter for how far into the contiguous buffer to write (records the position, in bytes) |
+| `comm`: | The communicator |
+
+In the above, `inbuf` is the data we want to pack into a contiguous buffer and `incount` and `datatype` define the
+number of elements in and the datatype of `inbuf`. The parameter `outbuf` is the contiguous buffer the data is packed
+into, with `outsize` being the total size of the buffer in *bytes*. The `position` argument is used to keep track of the
+current position, in bytes, where data is being packed into `outbuf`.
+
+Uniquely, `MPI_Pack()` (and `MPI_Unpack()` as well) measures the size of the contiguous buffer in bytes rather than
+in number of elements. Given that `MPI_Pack()` is all about manually arranging data, we also have to manage the
+allocation of memory for `outbuf`. But how do we allocate memory for it, and how much should we allocate? Allocation is
+done by using `malloc()`. Since `MPI_Pack()` works with `outbuf` in terms of bytes, the convention is to declare
+`outbuf` as a `char *`. The amount of memory to allocate is simply the amount of space, in bytes, required to store all
+of the data we want to pack into it, just as we would normally calculate when using `malloc()` to create an array. If we had
+an integer array and a floating point array which we wanted to pack into the buffer, then the size required is easy to
+calculate,
+
+```c
+// The total buffer size is the sum of the bytes required for the int and float array
+int size_int_array = num_int_elements * sizeof(int);
+int size_float_array = num_float_elements * sizeof(float);
+int buffer_size = size_int_array + size_float_array;
+// The buffer is a char *, but could also be cast as void * if you prefer
+char *buffer = malloc(buffer_size * sizeof(char));  // a char is 1 byte, so sizeof(char) is optional
+```
+
+If we are also working with derived types, such as vectors or structs, then we need to find the size of those types. By
+far the easiest way to handle these is to use `MPI_Pack_size()`, which supports derived datatypes through the
+`MPI_Datatype`,
+
+```c
+int MPI_Pack_size(
+    int incount,
+    MPI_Datatype datatype,
+    MPI_Comm comm,
+    int *size
+);
+```
+
+| `incount`: | The number of data elements |
+| `datatype`: | The data type of the data |
+| `comm`: | The communicator |
+| `*size`: | The calculated upper size limit for the buffer, in bytes |
+
+`MPI_Pack_size()` is a helper function to calculate the *upper bound* of memory required. It is, in general, preferable
+to calculate the buffer size using this function, as it takes into account any implementation-specific MPI details and
+thus is more portable between implementations and systems.
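+
+To see why the upper bound matters, the short sketch below (our own illustration, with arbitrary variable names, rather
+than part of the lesson) compares a manual calculation with `MPI_Pack_size()` for a plain `int` array.
+
+```c
+int num_elements = 100;
+
+// Manual calculation: exactly the number of bytes the raw data occupies
+int manual_size = num_elements * sizeof(int);
+
+// MPI_Pack_size(): an upper bound, which may also include any per-message
+// overhead the MPI implementation needs
+int packed_size;
+MPI_Pack_size(num_elements, MPI_INT, MPI_COMM_WORLD, &packed_size);
+
+// packed_size >= manual_size, so allocating packed_size bytes is always safe
+char *buffer = malloc(packed_size);
+```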
+
+If we wanted to calculate the memory required for three elements of some derived struct type and a `double` array, we
+would do the following,
+
+```c
+int struct_array_size, float_array_size;
+MPI_Pack_size(3, STRUCT_DERIVED_TYPE, MPI_COMM_WORLD, &struct_array_size);
+MPI_Pack_size(50, MPI_DOUBLE, MPI_COMM_WORLD, &float_array_size);
+int buffer_size = struct_array_size + float_array_size;
+```
+
+When a rank has received a contiguous buffer, it has to be unpacked into its constituent parts, one by one, using
+`MPI_Unpack()`,
+
+```c
+int MPI_Unpack(
+    void *inbuf,
+    int insize,
+    int *position,
+    void *outbuf,
+    int outcount,
+    MPI_Datatype datatype,
+    MPI_Comm comm
+);
+```
+
+| `*inbuf`: | The contiguous buffer to unpack |
+| `insize`: | The total size of the buffer, in bytes |
+| `*position`: | The position, in bytes, from which to start unpacking |
+| `*outbuf`: | An array, or variable, to unpack data into -- this is the output |
+| `outcount`: | The number of elements of data to unpack |
+| `datatype`: | The data type of elements to unpack |
+| `comm`: | The communicator |
+
+The arguments for this function are essentially the reverse of `MPI_Pack()`. Instead of being the buffer to pack into,
+`inbuf` is now the packed buffer, and `position` is the position, in bytes, in the buffer from which to start unpacking.
+`outbuf` is then the variable we want to unpack into, and `outcount` is the number of elements of `datatype` to unpack.
+
+In the example below, `MPI_Pack()`, `MPI_Pack_size()` and `MPI_Unpack()` are used to communicate a (non-contiguous)
+3 x 3 matrix.
+
+```c
+// Allocate and initialise a (non-contiguous) 2D matrix that we will pack into
+// a buffer
+int num_rows = 3, num_cols = 3;
+int **matrix = malloc(num_rows * sizeof(int *));
+for (int i = 0; i < num_rows; ++i) {
+    matrix[i] = malloc(num_cols * sizeof(int));
+    for (int j = 0; j < num_cols; ++j) {
+        matrix[i][j] = num_cols * i + j;
+    }
+}
+
+// Determine the upper limit for the amount of memory the buffer requires. Since
+// this is a simple situation, we could probably have done this manually using
+// `num_rows * num_cols * sizeof(int)`. The size `pack_buffer_size` is returned in
+// bytes
+int pack_buffer_size;
+MPI_Pack_size(num_rows * num_cols, MPI_INT, MPI_COMM_WORLD, &pack_buffer_size);
+
+if (my_rank == 0) {
+    // Create the pack buffer and pack each row of data into it,
+    // one by one
+    int position = 0;
+    char *packed_data = malloc(pack_buffer_size);
+    for (int i = 0; i < num_rows; ++i) {
+        MPI_Pack(matrix[i], num_cols, MPI_INT, packed_data, pack_buffer_size, &position, MPI_COMM_WORLD);
+    }
+
+    // Send the packed data to rank 1
+    MPI_Send(packed_data, pack_buffer_size, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
+} else {
+    // Create a receive buffer and get the packed buffer from rank 0
+    char *received_data = malloc(pack_buffer_size);
+    MPI_Recv(received_data, pack_buffer_size, MPI_PACKED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+
+    // Allocate a matrix to put the receive buffer into -- this is for demonstration purposes
+    int **my_matrix = malloc(num_rows * sizeof(int *));
+    for (int i = 0; i < num_rows; ++i) {
+        my_matrix[i] = malloc(num_cols * sizeof(int));
+    }
+
+    // Unpack the received data row by row into my_matrix
+    int position = 0;
+    for (int i = 0; i < num_rows; ++i) {
+        MPI_Unpack(received_data, pack_buffer_size, &position, my_matrix[i], num_cols, MPI_INT, MPI_COMM_WORLD);
+    }
+}
+```
+
+> ## Blocking or non-blocking?
+>
+> The process of packing data into a contiguous buffer does not happen asynchronously. The same goes for unpacking
+> data. But this doesn't mean the packed data can only be sent synchronously. The packed data can be
+> communicated using any communication function, just like the previous derived types. It works just as well to
+> communicate the buffer using non-blocking methods, as it does using blocking methods.
+{: .callout}
+
+> ## What if the other rank doesn't know the size of the buffer?
+>
+> In some cases, the receiving rank may not know the size of the buffer used in `MPI_Pack()`. This could happen if a
+> message is sent and received in different functions, if some ranks have different branches through the program
+> or if communication happens in a dynamic or non-sequential way.
+>
+> In these situations, we can use `MPI_Probe()` and `MPI_Get_count()` to check for a message being sent and to get the
+> number of elements in the message.
+>
+> ```c
+> // First probe for a message, to get the status of it
+> MPI_Status status;
+> MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
+> // Using MPI_Get_count we can get the number of elements of a particular data type
+> int buffer_size;
+> MPI_Get_count(&status, MPI_PACKED, &buffer_size);
+> // MPI_PACKED represents an element of a "byte stream," so buffer_size is the size of the buffer to allocate
+> char *buffer = malloc(buffer_size);
+> ```
+>
+{: .callout}
+
+> ## Sending heterogeneous data in a single communication
+>
+> Suppose we have the two arrays below, where one contains integer data and the other floating point data. Normally we would
+> use multiple communication calls to send each type of data individually, for a known number of elements. For this
+> exercise, communicate both arrays using a packed memory buffer.
+>
+> ```c
+> int int_data_count = 5;
+> int float_data_count = 10;
+>
+> int *int_data = malloc(int_data_count * sizeof(int));
+> float *float_data = malloc(float_data_count * sizeof(float));
+>
+> // Initialize the arrays with some values
+> for (int i = 0; i < int_data_count; ++i) {
+>     int_data[i] = i + 1;
+> }
+> for (int i = 0; i < float_data_count; ++i) {
+>     float_data[i] = 3.14159 * (i + 1);
+> }
+> ```
+>
+> Since the arrays are dynamically allocated, in rank 0, you should also pack the number of elements in each array. Rank
+> 1 may also not know the size of the buffer. How would you deal with that?
+>
+> You can use this [skeleton code](code/solutions/07-pack-skeleton.c) to begin with.
+>
+> > ## Solution
+> >
+> > The additional restrictions for rank 1 not knowing the size of the arrays or packed buffer add some complexity to
+> > receiving the packed buffer from rank 0.
+> >
+> > ```c
+> > #include <mpi.h>
+> > #include <stdio.h>
+> > #include <stdlib.h>
+> >
+> > int main(int argc, char **argv)
+> > {
+> >     int my_rank;
+> >     int num_ranks;
+> >     MPI_Init(&argc, &argv);
+> >     MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
+> >     MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);
+> >
+> >     if (num_ranks != 2) {
+> >         if (my_rank == 0) {
+> >             printf("This example only works with 2 ranks\n");
+> >         }
+> >         MPI_Abort(MPI_COMM_WORLD, 1);
+> >     }
+> >
+> >     if (my_rank == 0) {
+> >         int int_data_count = 5, float_data_count = 10;
+> >         int *int_data = malloc(int_data_count * sizeof(int));
+> >         float *float_data = malloc(float_data_count * sizeof(float));
+> >         for (int i = 0; i < int_data_count; ++i) {
+> >             int_data[i] = i + 1;
+> >         }
+> >         for (int i = 0; i < float_data_count; ++i) {
+> >             float_data[i] = 3.14159 * (i + 1);
+> >         }
+> >
+> >         // use MPI_Pack_size to determine how big the packed buffer needs to be
+> >         int buffer_size_count, buffer_size_int, buffer_size_float;
+> >         MPI_Pack_size(2, MPI_INT, MPI_COMM_WORLD, &buffer_size_count);  // 2 * INT because we will have 2 counts
+> >         MPI_Pack_size(int_data_count, MPI_INT, MPI_COMM_WORLD, &buffer_size_int);
+> >         MPI_Pack_size(float_data_count, MPI_FLOAT, MPI_COMM_WORLD, &buffer_size_float);
+> >         int total_buffer_size = buffer_size_int + buffer_size_float + buffer_size_count;
+> >
+> >         int position = 0;
+> >         char *buffer = malloc(total_buffer_size);
+> >
+> >         // Pack each data count, followed by the actual data
+> >         MPI_Pack(&int_data_count, 1, MPI_INT, buffer, total_buffer_size, &position, MPI_COMM_WORLD);
+> >         MPI_Pack(int_data, int_data_count, MPI_INT, buffer, total_buffer_size, &position, MPI_COMM_WORLD);
+> >         MPI_Pack(&float_data_count, 1, MPI_INT, buffer, total_buffer_size, &position, MPI_COMM_WORLD);
+> >         MPI_Pack(float_data, float_data_count, MPI_FLOAT, buffer, total_buffer_size, &position, MPI_COMM_WORLD);
+> >
+> >         // buffer is sent in one communication using MPI_PACKED
+> >         MPI_Send(buffer, total_buffer_size, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
+> >
+> >         free(buffer);
+> >         free(int_data);
+> >         free(float_data);
+> >     } else {
+> >         int buffer_size;
+> >         MPI_Status status;
+> >         MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
+> >         MPI_Get_count(&status, MPI_PACKED, &buffer_size);
+> >
+> >         char *buffer = malloc(buffer_size);
+> >         MPI_Recv(buffer, buffer_size, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
+> >
+> >         int position = 0;
+> >         int int_data_count, float_data_count;
+> >
+> >         // Unpack an integer which defines the size of the integer array,
+> >         // then allocate space for and unpack the actual array
+> >         MPI_Unpack(buffer, buffer_size, &position, &int_data_count, 1, MPI_INT, MPI_COMM_WORLD);
+> >         int *int_data = malloc(int_data_count * sizeof(int));
+> >         MPI_Unpack(buffer, buffer_size, &position, int_data, int_data_count, MPI_INT, MPI_COMM_WORLD);
+> >
+> >         MPI_Unpack(buffer, buffer_size, &position, &float_data_count, 1, MPI_INT, MPI_COMM_WORLD);
+> >         float *float_data = malloc(float_data_count * sizeof(float));
+> >         MPI_Unpack(buffer, buffer_size, &position, float_data, float_data_count, MPI_FLOAT, MPI_COMM_WORLD);
+> >
+> >         printf("int data: [");
+> >         for (int i = 0; i < int_data_count; ++i) {
+> >             printf(" %d", int_data[i]);
+> >         }
+> >         printf(" ]\n");
+> >
+> >         printf("float data: [");
+> >         for (int i = 0; i < float_data_count; ++i) {
+> >             printf(" %f", float_data[i]);
+> >         }
+> >         printf(" ]\n");
+> >
+> >         free(int_data);
+> >         free(float_data);
+> >         free(buffer);
+> >     }
+> >
+> >     return MPI_Finalize();
+> > }
+> > ```
+> >
+> {: .solution}
+{: .challenge}
From 
b3efa3a1d50281db59c0ca0769a3812087acc153 Mon Sep 17 00:00:00 2001 From: "Edward J. Parkinson" Date: Wed, 31 Jul 2024 11:37:43 +0100 Subject: [PATCH 2/3] fixing comment style to match rest of course --- _episodes/03-communicating-data.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_episodes/03-communicating-data.md b/_episodes/03-communicating-data.md index 376c1fb..c600856 100644 --- a/_episodes/03-communicating-data.md +++ b/_episodes/03-communicating-data.md @@ -112,10 +112,10 @@ we can only them as arguments in MPI functions. > need to change the type, you would only have to do it in one place, e.g.: > > ```c -> /* define constants for your data types */ +> // define constants for your data types > #define MPI_INT_TYPE MPI_INT > #define INT_TYPE int -> /* use them as you would normally */ +> // use them as you would normally > INT_TYPE my_int = 1; > ``` > @@ -157,7 +157,7 @@ functions like `MPI_Comm_rank()` to get the rank number, ```c int my_rank; -MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); /* MPI_COMM_WORLD is the communicator the rank belongs to */ +MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); // MPI_COMM_WORLD is the communicator the rank belongs to ``` In addition to `MPI_COMM_WORLD`, we can make sub-communicators and distribute ranks into them. Messages can only be sent From 7c37257e9ffc54a8be92ce14db1c0005858e0a09 Mon Sep 17 00:00:00 2001 From: "Edward J. Parkinson" Date: Wed, 31 Jul 2024 11:46:59 +0100 Subject: [PATCH 3/3] fix missing brackets when naming functions --- _episodes/02-mpi-api.md | 14 +++--- _episodes/04-point-to-point-communication.md | 46 ++++++++++---------- _episodes/09-optimising-mpi.md | 22 +++++----- _episodes/11-advanced-communication.md | 8 ++-- 4 files changed, 45 insertions(+), 45 deletions(-) diff --git a/_episodes/02-mpi-api.md b/_episodes/02-mpi-api.md index 0f046f2..64f056b 100644 --- a/_episodes/02-mpi-api.md +++ b/_episodes/02-mpi-api.md @@ -63,9 +63,9 @@ It's important to note that these functions represent only a subset of the funct In general, an MPI program follows a basic outline that includes the following steps: -1. ***Initialization:*** The MPI environment is initialized using the `MPI_Init` function. This step sets up the necessary communication infrastructure and prepares the program for message passing. +1. ***Initialization:*** The MPI environment is initialized using the `MPI_Init()` function. This step sets up the necessary communication infrastructure and prepares the program for message passing. 2. ***Communication:*** MPI provides functions for sending and receiving messages between processes. The `MPI_Send` function is used to send messages, while the `MPI_Recv` function is used to receive messages. -3. ***Termination:*** Once the necessary communication has taken place, the MPI environment is finalised using the `MPI_Finalize` function. This ensures the proper termination of the program and releases any allocated resources. +3. ***Termination:*** Once the necessary communication has taken place, the MPI environment is finalised using the `MPI_Finalize()` function. This ensures the proper termination of the program and releases any allocated resources. ## Getting Started with MPI: MPI on HPC @@ -186,9 +186,9 @@ mpirun -n 4 ./hello_world However, in the example above, the program does not know it was started by `mpirun`, and each copy just works as if they were the only one. 
For the copies to work together, they need to know about their role in the computation, in order to properly take advantage of parallelisation. This usually also requires knowing the total number of tasks running at the same time. -- The program needs to call the `MPI_Init` function. -- `MPI_Init` sets up the environment for MPI, and assigns a number (called the _rank_) to each process. -- At the end, each process should also cleanup by calling `MPI_Finalize`. +- The program needs to call the `MPI_Init()` function. +- `MPI_Init()` sets up the environment for MPI, and assigns a number (called the _rank_) to each process. +- At the end, each process should also cleanup by calling `MPI_Finalize()`. ~~~ int MPI_Init(&argc, &argv); @@ -196,9 +196,9 @@ int MPI_Finalize(); ~~~ {: .language-c} -Both `MPI_Init` and `MPI_Finalize` return an integer. +Both `MPI_Init()` and `MPI_Finalize()` return an integer. This describes errors that may happen in the function. -Usually we will return the value of `MPI_Finalize` from the main function +Usually we will return the value of `MPI_Finalize()` from the main function > ## I don't use command line arguments > diff --git a/_episodes/04-point-to-point-communication.md b/_episodes/04-point-to-point-communication.md index 854a37e..0042d50 100644 --- a/_episodes/04-point-to-point-communication.md +++ b/_episodes/04-point-to-point-communication.md @@ -9,33 +9,33 @@ objectives: - Describe what is meant by point-to-point communication. - Learn how to send and receive data between ranks. keypoints: -- Use `MPI_Send` and `MPI_Recv` to send and receive data between ranks -- Using `MPI_SSend` will always block the sending rank until the message is received -- Using `MPI_Send` may block the sending rank until the message is received, depending on whether the message is buffered and the buffer is available for reuse -- Using `MPI_Recv` will always block the receiving rank until the message is received +- Use `MPI_Send()` and `MPI_Recv()` to send and receive data between ranks +- Using `MPI_Ssend()` will always block the sending rank until the message is received +- Using `MPI_Send()` may block the sending rank until the message is received, depending on whether the message is buffered and the buffer is available for reuse +- Using `MPI_Recv()` will always block the receiving rank until the message is received --- In the previous episode we introduced the various types of communication in MPI. -In this section we will use the MPI library functions `MPI_Send` and -`MPI_Recv`, which employ point-to-point communication, to send data from one rank to another. +In this section we will use the MPI library functions `MPI_Send()` and +`MPI_Recv()`, which employ point-to-point communication, to send data from one rank to another. Sending data from one rank to another using MPI_SSend and MPI_Recv -Let's look at how `MPI_Send` and `MPI_Recv`are typically used: +Let's look at how `MPI_Send()` and `MPI_Recv()`are typically used: - Rank A decides to send data to rank B. It first packs the data to send into a buffer, from which it will be taken. -- Rank A then calls `MPI_Send` to create a message for rank B. The underlying MPI communication +- Rank A then calls `MPI_Send()` to create a message for rank B. The underlying MPI communication is then given the responsibility of routing the message to the correct destination. - Rank B must know that it is about to receive a message and acknowledge this -by calling `MPI_Recv`. 
This sets up a buffer for writing the incoming data when it arrives +by calling `MPI_Recv()`. This sets up a buffer for writing the incoming data when it arrives and instructs the communication device to listen for the message. -As mentioned in the previous episode, `MPI_Send` and `MPI_Recv` are *synchronous* operations, +As mentioned in the previous episode, `MPI_Send()` and `MPI_Recv()` are *synchronous* operations, and will not return until the communication on both sides is complete. ## Sending a Message: MPI_Send -The `MPI_Send` function is defined as follows: +The `MPI_Send()` function is defined as follows: ~~~c int MPI_Send( @@ -67,11 +67,11 @@ MPI_Send(message, 14, MPI_CHAR, 1, 0, MPI_COMM_WORLD); So we are sending 14 elements of `MPI_CHAR` one time, and specified `0` for our message tag since we don't anticipate having to send more than one type of message. This call is synchronous, and will block until the corresponding -`MPI_Recv` operation receives and acknowledges receipt of the message. +`MPI_Recv()` operation receives and acknowledges receipt of the message. > ## MPI_Ssend: an Alternative to MPI_Send > -> `MPI_Send` represents the "standard mode" of sending messages to other ranks, but some aspects of its behaviour +> `MPI_Send()` represents the "standard mode" of sending messages to other ranks, but some aspects of its behaviour > are dependent on both the implementation of MPI being used, and the circumstances of its use: there are three > scenarios to consider: > @@ -83,7 +83,7 @@ having to send more than one type of message. This call is synchronous, and will > receive. It is dependent on the MPI implementation as to what scenario is selected, based on performance, memory, > and other considerations. > -> A very similar alternative to `MPI_Send` is to use `MPI_Ssend` - synchronous send - which ensures the communication +> A very similar alternative to `MPI_Send()` is to use `MPI_Ssend()` - synchronous send - which ensures the communication > is both synchronous and blocking. This function guarantees that when it returns, the destination has categorically > started receiving the message. {: .callout} @@ -91,7 +91,7 @@ having to send more than one type of message. This call is synchronous, and will ## Receiving a Message: MPI_Recv -Conversely, the `MPI_Recv` function looks like the following: +Conversely, the `MPI_Recv()` function looks like the following: ~~~c int MPI_Recv( @@ -123,13 +123,13 @@ MPI_Recv(message, 14, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status); {: .language-c} Here, we create our buffer to receive the data, as well as a variable to hold the exit status of the receive operation. -We then call `MPI_Recv`, specifying our returned data buffer, the number of elements we will receive (14) which will be +We then call `MPI_Recv()`, specifying our returned data buffer, the number of elements we will receive (14) which will be of type `MPI_CHAR` and sent by rank 0, with a message tag of 0. -As with `MPI_Send`, this call will block - in this case until the message is received and acknowledgement is sent +As with `MPI_Send()`, this call will block - in this case until the message is received and acknowledgement is sent to rank 0, at which case both ranks will proceed. Let's put this together with what we've learned so far. 
-Here's an example program that uses `MPI_Send` and `MPI_Recv` to send the string `"Hello World!"` +Here's an example program that uses `MPI_Send()` and `MPI_Recv()` to send the string `"Hello World!"` from rank 0 to rank 1: ~~~ @@ -212,7 +212,7 @@ int main(int argc, char **argv) { > > ## Solution > > > > 1. The program will hang since it's waiting for a message with a tag that will never be sent (press `Ctrl-C` to kill -> > the hanging process). To resolve this, make the tag in `MPI_Recv` match the tag you specified in `MPI_Send`. +> > the hanging process). To resolve this, make the tag in `MPI_Recv()` match the tag you specified in `MPI_Send()`. > > 2. You will likely see a message like the following: > > ~~~ > > [...:220695] *** An error occurred in MPI_Recv @@ -391,13 +391,13 @@ int main(int argc, char **argv) { > >> ## Solution >> ->> `MPI_Send` will block execution until the receiving process has called ->> `MPI_Recv`. This prevents the sender from unintentionally modifying the message +>> `MPI_Send()` will block execution until the receiving process has called +>> `MPI_Recv()`. This prevents the sender from unintentionally modifying the message >> buffer before the message is actually sent. ->> Above, both ranks call `MPI_Send` and just wait for the other to respond. +>> Above, both ranks call `MPI_Send()` and just wait for the other to respond. >> The solution is to have one of the ranks receive its message before sending. >> ->> Sometimes `MPI_Send` will actually make a copy of the buffer and return immediately. +>> Sometimes `MPI_Send()` will actually make a copy of the buffer and return immediately. >> This generally happens only for short messages. >> Even when this happens, the actual transfer will not start before the receive is posted. >> diff --git a/_episodes/09-optimising-mpi.md b/_episodes/09-optimising-mpi.md index 364aade..d235a8c 100644 --- a/_episodes/09-optimising-mpi.md +++ b/_episodes/09-optimising-mpi.md @@ -103,12 +103,12 @@ communication overhead typically increases with the number of processes used. > ## Testing Code Performance on SLURM -> +> > We also need a way to test our code on our HPC infrastructure of choice. > This will likely vary from system to system depending on your infrastructure configuration, > but for the COSMA site on DiRAC this may look something like (replacing the ``, ``, > `` accordingly): -> +> > ~~~ > #!/usr/bin/env bash > #SBATCH --account= @@ -123,15 +123,15 @@ processes used. > module unload gnu_comp > module load gnu_comp/11.1.0 > module load openmpi/4.1.4 -> +> > time mpirun -n 1 poisson_mpi > ~~~ > {: .language-bash} -> +> > So here, after loading the required compiler and OpenMPI modules, > we use the `time` command to output how long the process took to run for a given number of processors. > and ensure we specify `ntasks` correctly as the required number of cores we wish to use. -> +> > We can then submit this using `sbatch`, e.g. `sbatch poisson-mpi.sh`, > with the output captured by default in a `slurm-....out` file > which will include the time taken to run the program. @@ -246,9 +246,9 @@ to exhibit *linear weak scaling*. The other significant factors in the speed of a parallel program are communication speed and latency. -Communication speed is determined by the amount of data one needs to +Communication speed is determined by the amount of data one needs to send/receive, and the bandwidth of the underlying hardware for the communication. 
-Latency consists of the software latency (how long the +Latency consists of the software latency (how long the operating system needs in order to prepare for a communication), and the hardware latency (how long the hardware takes to send/receive even a small bit of data). @@ -410,11 +410,11 @@ spent in the actual compute sections of the code. > ## Profile Your Poisson Code > > Compile, run and analyse your own MPI version of the poisson code. -> +> > How closely does it match the performance above? What are the main differences? > Try reducing the number of processes used, rerun and investigate the profile. -> Is it still MPI-bound? -> +> Is it still MPI-bound? +> > Increase the problem size, recompile, rerun and investigate the profile. > What has changed now? {: .challenge} @@ -422,7 +422,7 @@ spent in the actual compute sections of the code. > ## Iterative Improvement > > In the Poisson code, try changing the location of the calls -> to `MPI_Send`. How does this affect performance? +> to `MPI_Send()`. How does this affect performance? > {: .challenge} diff --git a/_episodes/11-advanced-communication.md b/_episodes/11-advanced-communication.md index 5f44e1b..c20cf29 100644 --- a/_episodes/11-advanced-communication.md +++ b/_episodes/11-advanced-communication.md @@ -8,16 +8,16 @@ questions: - How do I communicate non-contiguous data which isn't easy to express as a derived data type? objectives: - Know how to define and use derived datatypes to communicate structures -- Know how to use `MPI_Pack` and `MPI_Unpack` to communicate complex data structures +- Know how to use `MPI_Pack()` and `MPI_Unpack()` to communicate complex data structures keypoints: - Structures can be communicated easier by using `MPI_Type_create_struct` to create a derived type describing the structure -- The functions `MPI_Pack` and `MPI_Unpack` can be used to manually create a contiguous memory block of data, to communicate complex and/or heterogeneous data structures +- The functions `MPI_Pack()` and `MPI_Unpack()` can be used to manually create a contiguous memory block of data, to communicate complex and/or heterogeneous data structures --- In an earlier episode, we introduced the concept of derived data types to send vectors or a sub-array of a larger array, which may or may not be contiguous in memory. Other than vectors, there are multiple other types of derived data types that allow us to handle other complex data structures efficiently. In this episode, we will see how to create structure -derived types. Additionally, we will also learn how to use `MPI_Pack` and `MPI_Unpack` to manually pack complex data +derived types. Additionally, we will also learn how to use `MPI_Pack()` and `MPI_Unpack()` to manually pack complex data structures and heterogeneous into a single contiguous buffer, when other methods of communication are too complicated or inefficient. @@ -43,7 +43,7 @@ int MPI_Type_create_struct( ``` | `count`: | The number of fields in the struct | -| `*array_of_blocklengths`: | The length of each field, as you would use to send that field using `MPI_Send` | +| `*array_of_blocklengths`: | The length of each field, as you would use to send that field using `MPI_Send()` | | `*array_of_displacements`: | The relative positions of each field in bytes | | `*array_of_types`: | The MPI type of each field | | `*newtype`: | The newly created data type for the struct |