diff --git a/_episodes/06-non-blocking-communication.md b/_episodes/06-non-blocking-communication.md
index 2fd90ce..f0d3bb5 100644
--- a/_episodes/06-non-blocking-communication.md
+++ b/_episodes/06-non-blocking-communication.md
@@ -18,37 +18,54 @@ keypoints:
 ---
 
 In the previous episodes, we learnt how to send messages between two ranks or collectively to multiple ranks. In both
-cases, we used blocking communication functions which meant our program wouldn't progress until data had been sent and
-received successfully. It takes time, and computing power, to transfer data into buffers, to send that data around (over
-the network) and to receive the data into another rank. But for the most part, the CPU isn't actually doing anything.
+cases, we used blocking communication functions which meant our program wouldn't progress until the communication had
+completed. It takes time and computing power to transfer data into buffers, to send that data around (over the
+network) and to receive the data into another rank. But for the most part, the CPU isn't actually doing much at all
+during communication, when it could still be number crunching.
 
 ## Why bother with non-blocking communication?
 
-Non-blocking communication is a communication mode, which allows ranks to continue working on other tasks, whilst data
-is transferred in the background. When we use blocking communication, like `MPI_Send()`, `MPI_Recv()`, `MPI_Reduce()`
-and etc, execution is passed from our program to MPI and is not passed back until the communication has finished. With
-non-blocking communication, the communication beings and control is passed back immediately. Whilst the data is
-transferred in the background, our application is free to do other work. This ability to *overlap* computation and
-communication is absolutely critical for good performance for many HPC applications. The CPU is used very little when
-communicating data, so we are effectively wasting resources by not using them when we can. With good use of non-blocking
-communication, we can continue to use the CPU whilst communication happens and, at the same time, hide/reduce some of
-the communication overhead by overlapping communication and computation.
-
-Reducing the communication overhead is incredibly important for the scalability of HPC applications, especially when we
-use lots of ranks. As the number of ranks increases, the communication overhead to talk to every rank, naturally, also
-increases. Blocking communication limits the scalability of our MPI applications, as it can, relatively speaking, take a
-long time to talk to lots of ranks. But since with non-blocking communication ranks don't sit around waiting for a
-communication operation to finish, the overhead of talking to lots of reduced. The asynchronous nature of non-blocking
-communication makes it more flexible, allowing us to write more sophisticated and performance communication algorithms.
-
-All of this comes with a price. Non-blocking communication is more difficult to use *effectively*, and oftens results in
-more complex code. Not only does it result in more code, but we also have to think about the structure and flow of our
-code in such a way there there is *other* work to do whilst data is being communicated. Additionally, whilst we usually
-expect non-blocking communication to improve th performance, and scalability, of our parallel algorithms, it's not
-always clear cut or predictable if it can help. If we are not careful, we may end up replacing blocking communication
-overheads with synchronization overheads. For example, if one rank depends on the data of another rank and there is no
-other work to do, that rank will have to wait around until the data it needs is ready, as illustrated in the diagram
-below.
+Non-blocking communication is communication which happens in the background, so we don't have to let any CPU cycles go
+to waste! If MPI is dealing with the data transfer in the background, we can continue to use the CPU in the foreground
+and keep doing tasks whilst the communication completes. By *overlapping* computation with communication, we hide the
+latency/overhead of communication. This is critical for lots of HPC applications, especially when using lots of CPUs,
+because, as the number of CPUs increases, the overhead of communicating with them all also increases. If you use
+blocking synchronous sends, the time spent communicating data may become longer than the time spent creating data to
+send! All non-blocking communications are asynchronous, even when using synchronous sends: the communication still
+happens in the background, even though it cannot complete until the data has been received.
+
+> ## So, how do I use non-blocking communication?
+>
+> Just as with buffered, synchronous, ready and standard sends, MPI has to be programmed to use either blocking or
+> non-blocking communication. For almost every blocking function, there is a non-blocking equivalent. They have the
+> same name as their blocking counterpart, but prefixed with "I". The "I" stands for "immediate", indicating that the
+> function returns immediately and does not block the program. The table below shows some examples of blocking
+> functions and their non-blocking counterparts.
+>
+> | Blocking        | Non-blocking     |
+> | --------------- | ---------------- |
+> | `MPI_Bsend()`   | `MPI_Ibsend()`   |
+> | `MPI_Barrier()` | `MPI_Ibarrier()` |
+> | `MPI_Reduce()`  | `MPI_Ireduce()`  |
+>
+> But this isn't the complete picture. As we'll see later, we need to do some additional bookkeeping to be able to use
+> non-blocking communications.
+>
+{: .callout}
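+
+As a small taster of that bookkeeping, here is a minimal sketch (not one of the lesson's own examples) which sends a
+single integer from rank 0 to rank 1 using the non-blocking `MPI_Isend()`. The `MPI_Request` handle and the
+`MPI_Wait()` call are the extra bookkeeping, and both are covered properly later in this episode.
+
+~~~
+#include <mpi.h>
+
+/* Run with at least two ranks, e.g. mpirun -n 2 */
+int main(int argc, char **argv)
+{
+    int rank, data = 42;
+    MPI_Request request;
+
+    MPI_Init(&argc, &argv);
+    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+
+    if (rank == 0) {
+        /* Returns immediately, giving us a request handle to track progress */
+        MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
+        /* We are free to do other work here, as long as we don't touch `data` */
+        MPI_Wait(&request, MPI_STATUS_IGNORE); /* safe to re-use `data` after this */
+    } else if (rank == 1) {
+        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+    }
+
+    MPI_Finalize();
+    return 0;
+}
+~~~
+{: .language-c}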
+
+By effectively utilising non-blocking communication, we can develop applications that scale significantly better
+during intensive communication. However, this comes with the trade-off of both increased conceptual and code
+complexity. Since non-blocking communication returns control to our program before the communication has finished, we
+don't actually know when a communication has completed unless we check; this is usually referred to as
+synchronisation, as we have to keep ranks in sync to ensure they have the correct data. So whilst our program
+continues to do other work, it also has to keep checking whether the communication has finished, to ensure ranks are
+synchronised. If we check too often, or don't have enough tasks to "fill in the gaps", then there is no advantage to
+using non-blocking communication and we may replace communication overheads with time spent keeping ranks in sync! It
+is not always clear cut or predictable whether non-blocking communication will improve performance. For example, if
+one rank depends on the data of another, and there are no tasks for it to do whilst it waits, that rank will wait
+around until the data is ready, as illustrated in the diagram below. This essentially turns that non-blocking
+communication into a blocking communication. Therefore, unless our code is structured to take advantage of being able
+to overlap communication with computation, non-blocking communication adds complexity to our code for no gain.
 
 ![Non-blocking communication with data dependency](...)
 
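+As a sketch of the structure we are aiming for, the fragment below starts receiving halo data from a neighbouring
+rank in the background, fills the gap with computation that does not depend on that data, and only then synchronises.
+The helper functions `compute_interior_cells()` and `compute_boundary_cells()` are hypothetical names, standing in
+for whatever independent and dependent work your application has.
+
+~~~
+#include <mpi.h>
+
+/* Hypothetical helpers standing in for your application's real work */
+void compute_interior_cells(void);
+void compute_boundary_cells(double *halo_buffer);
+
+/* Overlap a background receive with computation that doesn't need the data */
+void update_grid(double *halo_buffer, int halo_count, int neighbour)
+{
+    MPI_Request request;
+
+    /* Begin receiving the neighbour's halo data; returns immediately */
+    MPI_Irecv(halo_buffer, halo_count, MPI_DOUBLE, neighbour, 0,
+              MPI_COMM_WORLD, &request);
+
+    compute_interior_cells();              /* work that needs no halo data */
+
+    MPI_Wait(&request, MPI_STATUS_IGNORE); /* synchronise before using the data */
+    compute_boundary_cells(halo_buffer);   /* work that needed the halo data */
+}
+~~~
+{: .language-c}
+
+The key design choice is ordering: start the communication as early as possible, and delay the `MPI_Wait()` until the
+moment the data is genuinely needed.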
@@ -99,21 +116,6 @@ The arguments are identical to `MPI_Send()`, other than the addition of the `*re
 as an *handle* (because it "handles" a communication request) which is used to track the progress of a (non-blocking)
 communication.
 
-> ## Naming conventions
->
-> Non-blocking functions have the same name as their blocking counterpart, but prefixed with "I". The "I" stands for
-> "immediate", indicating that the function returns immediately and does not block the program whilst data is being
-> communicated in the background. The table below shows some examples of blocking functions and their non-blocking
-> counterparts.
->
-> | Blocking | Non-blocking|
-> | -------- | ----------- |
-> | `MPI_Bsend()` | `MPI_Ibsend()` |
-> | `MPI_Barrier()` | `MPI_Ibarrier()` |
-> | `MPI_Reduce()` | `MPI_Ireduce()` |
->
-{: .callout}
-
 When we use non-blocking communication, we have to follow it up with `MPI_Wait()` to synchronise the program and make
 sure `*buf` is ready to be re-used. This is incredibly important to do. Suppose we are sending an array of integers,