diff --git a/high_performance_computing/computer_simulations/00_practical.md b/high_performance_computing/computer_simulations/00_practical.md
new file mode 100644
index 00000000..0a33d37d
--- /dev/null
+++ b/high_performance_computing/computer_simulations/00_practical.md
@@ -0,0 +1,270 @@
+---
+name: Traffic Simulation Performance
+dependsOn: [
+  high_performance_computing.parallel_computing.03_parallel_performance
+]
+tags: [foundation]
+attribution:
+  - citation: >
+      "Introduction to HPC" course by EPCC.
+      This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC.
+    url: https://epcced.github.io/Intro-to-HPC/
+    image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg
+    license: CC-BY-4.0
+---
+
+## Part 1: Traffic Simulation - Serial
+
+Let's first revisit the serial and OpenMP implementations of the traffic simulation model, demonstrated in earlier sections, and investigate the basic performance characteristics of these implementations.
+
+:::callout{variant="tip"}
+If on ARCHER2, to find the serial version of the traffic simulation code, first make sure you're on the `/work` partition (i.e. `cd /work/[project code]/[project code]/yourusername`).
+:::
+
+Change directory to where the code is located, and use `make` as before to compile it:
+
+```bash
+cd foundation-exercises/traffic/C-SER
+make
+```
+
+:::callout
+
+## A Reminder
+
+You may wish to reacquaint yourself with *The traffic model* section in the *Parallel Computing* material that describes the simulation model.
+:::
+
+A number of variables are currently fixed in the source code, which you can see by looking at the following lines
+in `traffic.c`:
+
+```c
+  int ncell = 100000;
+  maxiter = 200000000/ncell;
+  ...
+  density = 0.52;
+```
+
+- The number of simulation cells is set to `100000`, so our simulated road is 100,000 * 5 = 500,000 metres long
+- The number of iterations of the simulation is calculated from the number of cells, such that - as coded - fewer cells mean more iterations; in this instance, 200,000,000 / 100,000 = 2,000 total iterations
+- The target traffic density is set to `0.52`, so the simulation aims to occupy just over half of the road cells
+
+You can run the serial program directly on the login nodes:
+
+```bash
+./traffic
+```
+
+You should see:
+
+```output
+Length of road is 100000
+Number of iterations is 2000
+Target density of cars is 0.520000
+Initialising road ...
+...done
+Actual density of cars is 0.517560
+
+At iteration 200 average velocity is 0.919951
+At iteration 400 average velocity is 0.926559
+At iteration 600 average velocity is 0.928743
+At iteration 800 average velocity is 0.930308
+At iteration 1000 average velocity is 0.930849
+At iteration 1200 average velocity is 0.931196
+At iteration 1400 average velocity is 0.931312
+At iteration 1600 average velocity is 0.931506
+At iteration 1800 average velocity is 0.931737
+At iteration 2000 average velocity is 0.931989
+
+Finished
+
+Time taken was 1.293764 seconds
+Update rate was 154.587714 MCOPs
+```
+
+The result we are interested in is the final average velocity reported at iteration 2000 (i.e. the end of the simulation). In this case, the final average velocity of the traffic was 0.93.
+
+## Part 2: Traffic Simulation - OpenMP
+
+You'll find the OpenMP version of this code in `foundation-exercises/traffic/C-OMP`.
+Change to this directory, and compile the code as before.
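+Assuming the same build setup as the serial version (i.e. the directory provides its own `Makefile`), that is:
+
+```bash
+cd foundation-exercises/traffic/C-OMP
+make
+```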
+The simulation is set with the same initial parameters as the serial version of the code
+(if you're interested, take a look at the source code).
+
+What we'd like to do now is measure how long it takes to run the simulation given an increasing number of threads,
+so we can determine an ideal number of threads for running simulations in the future.
+
+::::challenge{id=compsim_pr.1 title="Traffic Simulation: Scripting the Process"}
+We could submit a number of separate jobs running the code with an increasing number of threads,
+or if running this on our own machine, create a Bash script that does this locally,
+but with the simulation's current configuration, each of these jobs would only take a second or so to run
+(although if each run took much longer than this, then separate jobs would likely make sense!).
+
+So instead of creating a number of separate scripts and submitting/running those,
+we'll put all the runs into a single script.
+Create a single script that does the following for 1, 2, 4, 6, 8, 10, 12, 14, 16, 18, and 20 threads:
+
+- Sets the number of threads (i.e. setting the `OMP_NUM_THREADS` variable)
+- Runs the `traffic` code
+
+If you're writing ARCHER2 job submission scripts you'll need to set `--cpus-per-task` to the maximum number of threads you'll use in the script (i.e. 20),
+and set `--time` to a value large enough to encompass all the separate runs.
+
+Then either submit the job script to ARCHER2 using `sbatch`, or run it directly using e.g. `bash script.sh`.
+
+:::solution
+
+(If you're running this on your own machine in a normal Bash script, you can ignore the lines starting `#SBATCH`)
+
+```bash
+#!/bin/bash
+
+#SBATCH --job-name=Traffic-OMP
+#SBATCH --nodes=1
+#SBATCH --tasks-per-node=1
+#SBATCH --cpus-per-task=20
+#SBATCH --time=00:05:00
+
+# Replace [project code] below with your project code (e.g. t01)
+#SBATCH --account=[project code]
+#SBATCH --partition=standard
+#SBATCH --qos=standard
+
+export OMP_NUM_THREADS=1
+./traffic
+
+export OMP_NUM_THREADS=2
+./traffic
+
+export OMP_NUM_THREADS=4
+./traffic
+
+export OMP_NUM_THREADS=6
+./traffic
+
+export OMP_NUM_THREADS=8
+./traffic
+
+export OMP_NUM_THREADS=10
+./traffic
+
+export OMP_NUM_THREADS=12
+./traffic
+
+export OMP_NUM_THREADS=14
+./traffic
+
+export OMP_NUM_THREADS=16
+./traffic
+
+export OMP_NUM_THREADS=18
+./traffic
+
+export OMP_NUM_THREADS=20
+./traffic
+```
+
+Or, if you're familiar with Bash loops:
+
+```bash
+#!/bin/bash
+
+#SBATCH --job-name=Traffic-OMP
+#SBATCH --nodes=1
+#SBATCH --tasks-per-node=1
+#SBATCH --cpus-per-task=20
+#SBATCH --time=00:05:00
+
+# Replace [project code] below with your project code (e.g. t01)
+#SBATCH --account=[project code]
+#SBATCH --partition=standard
+#SBATCH --qos=standard
+
+for THREADS in 1 2 4 6 8 10 12 14 16 18 20
+do
+  export OMP_NUM_THREADS=${THREADS}
+  ./traffic
+done
+```
+
+:::
+::::
+
+::::challenge{id=compsim_pr.2 title="Traffic Simulation: Measuring Multiple Threads Runtimes"}
+
+Next, let's look at the timings together. Examine the output (or the Slurm output files) and enter each time into a table, e.g. using the following columns:
+
+| #Threads | Time(s)
+|----------|--------
+| 1 | ...
+| 2 | ...
+| ... | ...
+
+:::solution
+
+Of course, your timings may differ!
+
+| #Threads | Time(s)
+|----------|--------
+| 1 | 1.744
+| 2 | 0.899
+| 4 | 0.468
+| 6 | 0.316
+| 8 | 0.248
+| 10 | 0.211
+| 12 | 0.185
+| 14 | 0.167
+| 16 | 0.157
+| 18 | 0.146
+| 20 | 0.140
+
+:::
+::::
+
+::::challenge{id=compsim_pr.3 title="Traffic Simulation: Analysing Timings"}
+
+Compare the timing results against the serial version of the code.
+At what number of threads does the OpenMP version yield faster results?
+What does this mean in terms of the overhead of using OpenMP for this simulation code as it stands?
+
+:::solution
+Looking at your results, you may find that the single-threaded OpenMP run is noticeably slower than the serial version (here, 1.744 seconds against 1.294 seconds).
+This tells us that the overhead of OpenMP's thread management has a significant impact when only one thread is used, as one may expect,
+but from 2 threads onwards the OpenMP version is significantly faster than the serial one.
+:::
+
+At what point does there appear to be diminishing returns when increasing the number of threads?
+
+:::solution
+It depends on what you consider a diminishing return,
+but (at least for my runs) beyond about 14 threads the gains are significantly smaller (a 6% speed increase and below).
+
+Of course, for expediency in this exercise we're using small problem spaces to reduce the job's execution time, but for much larger problem spaces and runtimes the time savings we see here would be significant.
+:::
+::::
+
+:::callout
+
+## How to Time Code that doesn't Time Itself?
+
+With the traffic simulation code we're fortunate that it has an in-built ability to time itself.
+What about code that doesn't do this?
+Fortunately, there's a Bash command, `time`, that can be used.
+For example, change directory to where your serial version of hello world is located, and then:
+
+```bash
+time ./hello-SER yourname
+```
+
+```output
+Hello World!
+Hello yourname, this is ln01.
+
+real 0m0.059s
+user 0m0.004s
+sys 0m0.000s
+```
+
+The `real` figure gives us, essentially, the total elapsed run time of 0.059s.
+:::
diff --git a/high_performance_computing/computer_simulations/01_intro.md b/high_performance_computing/computer_simulations/01_intro.md
new file mode 100644
index 00000000..c4601cf5
--- /dev/null
+++ b/high_performance_computing/computer_simulations/01_intro.md
@@ -0,0 +1,407 @@
+---
+name: Introduction to Computer Simulations
+dependsOn: [
+  high_performance_computing.computer_simulations.00_practical
+]
+tags: [foundation]
+attribution:
+  - citation: >
+      "Introduction to HPC" course by EPCC.
+      This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC.
+    url: https://epcced.github.io/Intro-to-HPC/
+    image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg
+    license: CC-BY-4.0
+---
+
+![Simulation of gravity waves](images/Gravitywaves.jpeg)
+
+## Computer simulations
+
+Most people will have heard of computer simulations; after all, the term has become part of popular culture, and some people even suggest we may be living in a Matrix-style computer simulation.
+But understanding what a computer simulation truly is, is a different matter.
+
+In the English language, the word simulate originally meant 'to imitate the appearance or character of', and was derived from the Latin simulat-, meaning 'copied' or 'represented'. A modern dictionary also states that the word simulation may refer to the representation of the behaviour or characteristics of one system through the use of another system. This is precisely what computer simulations are meant to do.
+In essence, a computer simulation is a program that iteratively examines the behaviour of a real-world system modelled mathematically.
+Put simply, simulation involves running a model to study the system of interest.
+However, simulations often encompass much more than just running a program.
+They involve defining a suitable model, implementing it, analysing the results, and interpreting the data - often through some sort of visualisation.
+
+By definition, a mathematical model is a simplification of reality. Typically, it is impossible to perfectly capture all the components of a system and their interactions. In fact, more often than not, it would not be a useful thing to do. Think about it - if you could simulate something well enough in about 5 minutes, would you want to spend 5 days to do it perfectly?
+
+For simulations like weather forecasting or wildfire modelling, the time-to-solution should be as short as possible. In other cases, you may not care about how long it takes to reach the solution, but what about the required computing power and associated costs? Computer simulations are very much about tradeoffs: time-to-solution, performance and computational cost vs. scientific usefulness.
+
+No matter how detailed a model is, the physical system it represents will always involve unaccounted-for phenomena.
+This is acceptable, as long as the model still offers useful insights into the system's behaviour.
+A good model focuses on capturing the dominant factors influencing the system, using a minimal yet effective set of variables and relationships.
+A model's predictive power lies not in its complexity or completeness but in its ability to accurately represent the key drivers of the system with the necessary approximations.
+
+Now, how does one go about creating a model and a simulation?
+
+---
+
+![Circular relationship between reality, conceptual model and computerised model](images/hero_bd3c2838-0873-4170-a3eb-5f53462415c4.png)
+*© 1979 by Simulation Councils, Inc.*
+
+## Errors and approximations
+
+The diagram above was developed in 1979 by the Technical Committee on Model Credibility of the Society for Computer Simulation.
+While technology has advanced considerably since then, the fundamental approach to modelling and simulation remains largely unchanged.
+The relationship between a model, a simulation, and the reality it represents is still the same.
+
+Each step in the simulation process involves approximations and is subject to errors and uncertainties. The first step, analysing the physical system and creating a conceptual model, is qualitative: it focuses on identifying all possible factors within the system and deciding which components and interactions are essential, and which can be reasonably neglected.
+At this stage, no mathematical equations are involved.
+
+### Uncertainties
+
+Special treatment is given to the elements of a system that need to be treated as nondeterministic, which simply means that their behaviour is not precisely predictable.
+In nondeterministic simulations the same input can produce different outputs.
+This unpredictability may be a result of inherent variations in the physical system, of inaccuracies stemming from a lack of knowledge, or of human interaction with the system.
+All of these are sources of uncertainty in a system and should be taken into consideration in the mathematical description of the system.
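+In code, this nondeterminism often enters through the way a random number generator is seeded. Here is a minimal C sketch (ours, not from the course code) of the two options: seeding from the clock makes every run of a stochastic simulation different, while a fixed seed makes runs exactly reproducible.
+
+```c
+#include <stdio.h>
+#include <stdlib.h>
+#include <time.h>
+
+int main(void) {
+    /* Seed from the clock: each run produces a different sequence,
+       i.e. the same program and input give different outputs. */
+    srand((unsigned) time(NULL));
+
+    /* For a reproducible, deterministic run, use a fixed seed instead:
+       srand(12345); */
+
+    for (int i = 0; i < 3; i++)
+        printf("sample %d: %f\n", i, (double) rand() / RAND_MAX);
+    return 0;
+}
+```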
+The transition between the conceptual and mathematical model involves capturing the interactions of the relevant components in a set of equations, and determining the boundary and initial conditions.
+The next step ensures that discrete mathematics is used: continuous equations are approximated by small, distinct steps so that computers can deal with them one by one.
+
+Once an appropriate algorithm (a set of rules and methods used to solve the problem) has been decided on and the model is implemented, errors start creeping in: inaccuracies that are not caused by the difference between the model and reality.
+While it is possible for the programmer to make mistakes, even a perfectly written simulation is subject to the numerical precision limitations inherent in computational methods.
+
+### Rounding Error
+
+Rounding errors, errors that arise due to this limit of numerical precision, occur because the number of digits that can be used to represent a real number on a computer is limited by the finite number of memory bits allocated to store that number.
+Therefore, numbers that require more digits to be expressed (sometimes an infinite number!) end up being rounded - they are just approximations of the numbers they are meant to represent. This difference between the real number and its approximation is referred to as the rounding error. The vast majority of numbers need to be approximated.
+
+When operations are performed on approximated numbers, the resulting values are also approximations.
+Over time, these rounding errors can accumulate, significantly impacting the accuracy of simulation results.
+This issue becomes even more pronounced in parallel computing, where the order in which partial results (computed by individual processes) are combined affects the final outcome.
+Due to rounding errors, combining these partial results in a different sequence can lead to slight variations in the result.
+
+The aim of the verification stage is to ensure that the model is implemented correctly, and that each part of the simulation does exactly what is expected of it.
+Most of the errors should be discovered and fixed at this stage.
+The aim of the validation stage is to ensure that, for its intended purpose, the simulation is sufficiently close to reality.
+Often it is possible to know the accuracy of an algorithm in advance, or even to control the accuracy of your simulation during execution to meet a chosen standard; this is especially true if you are using standard algorithms to perform your calculations.
+
+The key point to remember about models, and hence simulations, is that although they simplify and idealise, they are still able to tell us something about the nature of the system they describe.
+
+Think back to our toy traffic model - how many approximations, uncertainties or potential sources of errors can you think of?
+
+:::callout{variant="tip"}
+Errors introduced by numerical precision cannot be eliminated, but they can be mitigated: appropriate choices of step sizes, appropriate normalisation of your variables to avoid overly large or small numbers (both of which cause larger rounding errors), appropriate algorithm choices, and so on.
+
+It is important not only to be aware of these sources of error, but also to be aware of their size, and to use relevant strategies to reduce them if they are too large.
+:::
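+Both the accumulation problem and one of the mitigation strategies can be seen in a few lines of C. The sketch below (illustrative only, not from the course code) sums 0.1 ten million times in single precision, first naively and then with Kahan compensated summation, a standard algorithm that carries along the low-order bits the naive sum throws away. The exact answer is 1,000,000; the exact size of the error you see will depend on your compiler and hardware, and aggressive floating-point optimisation (e.g. `-ffast-math`) may remove the compensation entirely.
+
+```c
+#include <stdio.h>
+
+int main(void) {
+    const float x = 0.1f;          /* not exactly representable in binary */
+    float naive = 0.0f;
+    float kahan = 0.0f, c = 0.0f;  /* c holds the lost low-order bits */
+
+    for (int i = 0; i < 10000000; i++) {
+        /* Naive accumulation: once the running sum is large, much of
+           each 0.1 is rounded away, and the errors accumulate. */
+        naive += x;
+
+        /* Kahan compensated summation: recover what was rounded off
+           and feed it back into the next addition. */
+        float y = x - c;
+        float t = kahan + y;
+        c = (t - kahan) - y;
+        kahan = t;
+    }
+
+    printf("naive sum: %f\n", naive);  /* drifts well away from 1000000 */
+    printf("kahan sum: %f\n", kahan);  /* stays very close to 1000000   */
+    return 0;
+}
+```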
+
+---
+
+![Herd of sheep](images/andrea-lightfoot-Pj6fYNRzRT0-unsplash.jpg)
+*Image courtesy of [Andrea Lightfoot](https://unsplash.com/@andreaelphotography) from [Unsplash](https://unsplash.com)*
+
+## Wolf-sheep predation simulation
+
+In this step we are going to use the wolf-sheep predation model to illustrate how models and simulations work. Hopefully, this will allow you to better understand the concepts introduced in the previous steps.
+
+There are two main variations of this model.
+
+In the first variation, wolves and sheep wander randomly around the landscape, while the wolves look for sheep to prey on. Each step costs the wolves energy, and they must eat sheep in order to replenish their energy - when they run out of energy they die. To allow the population to continue, each wolf or sheep has a fixed probability of reproducing at each time step. This variation produces interesting population dynamics, but is ultimately unstable, i.e. one or other population tends to die out. (A small code sketch of these rules is given at the end of this step.)
+
+The second variation includes grass (green) in addition to wolves and sheep. The behaviour of the wolves is identical to the first variation, but this time the sheep must eat grass in order to maintain their energy - when they run out of energy they die. Once grass is eaten it will only regrow after a fixed amount of time. This variation is more complex than the first, but it is generally stable.
+
+To play with the model, or to have a look at the code, go to the Modeling Commons page dedicated to the [wolf-sheep model](http://modelingcommons.org/browse/one_model/1390#model_tabs_browse_nlw) and first click on the SETUP button and then on the GO button. For more information on the model see the INFO tab.
+
+We know that the web version of the model works on the Chrome and Safari browsers. If you cannot get it to work on your browser you can try downloading the NetLogo software from [here](https://ccl.northwestern.edu/netlogo/s). It comes with a library full of interesting models, and the wolf-sheep model can be found under the biology section.
+
+![Screenshot of wolf sheep predation model software](images/hero_b12d1403-058b-4971-9417-f188a1440b3a.png)
+
+The wolf-sheep model is very simplistic and would not be useful in studying the actual dynamics between the two populations, but it is good enough to illustrate some of the concepts we covered in the previous step.
+
+### Uncertainties
+
+When we were talking about the nondeterministic characteristics of systems we mentioned uncertainties and other factors that could impact our conceptual model of a system. Clearly, the wolf-sheep predation model ignores all of the nondeterministic aspects of the system, for example:
+
+- What happened to the meteorological seasons? Surely, no one needs convincing that a harsh and longer than usual winter would affect both animal populations negatively, right? The seasons could be considered as one of the inherent variations of the system - they always happen, but with a varying duration and intensity;
+- What about the health state of both populations? If either of the populations was infected with some disease it would have a great impact on the other, but it may not manifest immediately - hence, our lack of knowledge;
+- Or what would happen if annoyed shepherds decided to deal with the wolves? There is no easy way of predicting the extent of human intervention.
+
+Does that give you a better idea of what uncertainties are? Can you think of any other sources of uncertainties in this model?
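+As promised above, here is a minimal, self-contained C sketch of the first variation's rules for a single wolf. The real model is written in the NetLogo language and tracks whole populations on a grid; all the names and parameter values below are illustrative inventions of ours, standing in for the model's sliders.
+
+```c
+#include <stdio.h>
+#include <stdlib.h>
+
+/* Illustrative values only - the real model exposes these as sliders. */
+#define MOVE_COST      1     /* energy a wolf spends wandering each step */
+#define GAIN_FROM_EAT  20    /* energy gained by eating a sheep          */
+#define P_FIND_SHEEP   0.10  /* chance of finding a sheep this step      */
+#define P_REPRODUCE    0.05  /* fixed reproduction probability per step  */
+
+int main(void) {
+    srand(42);        /* fixed seed, so this toy run is reproducible */
+    int energy = 10;  /* one wolf's energy store */
+
+    for (int step = 1; step <= 100; step++) {
+        energy -= MOVE_COST;  /* each step costs the wolf energy */
+
+        if ((double) rand() / RAND_MAX < P_FIND_SHEEP)
+            energy += GAIN_FROM_EAT;  /* ate a sheep: replenish energy */
+
+        if (energy <= 0) {  /* out of energy: the wolf dies */
+            printf("step %d: wolf starves\n", step);
+            break;
+        }
+
+        if ((double) rand() / RAND_MAX < P_REPRODUCE)
+            printf("step %d: wolf reproduces\n", step);
+    }
+    return 0;
+}
+```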
+© Modelling Commons - 1997 Uri Wilensky
+
+---
+
+![Overhead photo of runners on a track](images/steven-lelham-atSaEOeE8Nk-unsplash.jpg)
+*Image courtesy of [Steven Lelham](https://unsplash.com/@slelham) from [Unsplash](https://unsplash.com)*
+
+## Wolf-sheep predation simulation - Initial and Boundary Conditions
+
+Other concepts that we have not talked about yet are boundary and initial conditions. In order to discuss these concepts we would like you to explore how they affect the wolf-sheep predation model.
+
+In this case the boundary conditions define what happens to wolves and sheep that get to the edge of the area covered by the simulation. Try to stop the simulation when one of the animals is about to go out of the frame. Did you notice that it appears immediately on the opposite side of the frame? This behaviour is due to the periodic boundary conditions.
+
+As you know, initial conditions refer to the parameters you choose before running the simulation, e.g. the initial number of wolves and sheep, the grass regrowth period and the reproduction rates of both populations.
+
+To understand the importance of choosing the right input parameters, and to see the difference even the smallest variation can produce, you are encouraged to play with the values of different parameters.
+Try to find values of parameters that:
+
+- Allow both populations to be stable (i.e. neither population is dying out) even after 1000 time steps.
+- Allow the sheep population to die out.
+- Allow the wolf population to die out.
+
+:::callout{variant="discussion"}
+How difficult was it to find the sets of parameters to meet the conditions? Have you noticed anything unexpected?
+:::
+
+---
+
+## Terminology Recap
+
+::::challenge{id=comp_sim_intro.1 title="Computer Simulations Q1"}
+A ____ ____ is a description of a physical system using mathematical concepts and language, and the act of running such a model on a computer is called a
+____ ____ . It is not possible for a model to capture all physical phenomena, so these are ____ as well as possible.
+
+:::solution
+A) mathematical model
+
+B) computer simulation
+
+C) approximated
+:::
+::::
+
+::::challenge{id=comp_sim_intro.2 title="Computer Simulations Q2"}
+The inaccuracies that are not caused by the lack of knowledge are known as
+____ . The difference between the real number and its approximation is referred to as the
+____ ____ .
+
+:::solution
+A) errors
+
+B) rounding error
+:::
+::::
+
+---
+
+![Venn diagram of computer science, mathematics, and applied discipline, with computational science in the overlap](images/hero_e436356c-c306-4ece-bcb6-b2c906973579.png)
+
+## Computational Science
+
+Computational science is a rapidly growing interdisciplinary field. There are many problems in science and technology that cannot be sufficiently studied experimentally or theoretically. It may be too expensive or too dangerous, or simply impossible, due to the space and timescales involved.
+
+In fact, computational science is considered by many to be a third methodology in scientific research, alongside theory and experiment, and working in tandem with them. Computational science can be used to corroborate theories that cannot be confirmed or denied experimentally, for example theories relating to the creation of the universe. On the other hand, advances in experimental techniques and the resulting data explosion allow for data-driven modelling and simulation.
+You should not confuse computational science, which uses computational methods to deal with scientific problems, with computer science, which focuses on the computer itself. Having said that, computational science draws upon computer science, as well as upon mathematics and applied sciences. Computational science typically consists of three main components:
+
+1. algorithms and models,
+1. software developed to solve problems, and
+1. the computer and information infrastructure, e.g. hardware, networking and data management components.
+
+Clearly, computational science is an intersection between mathematics, applied disciplines and computer science.
+
+Some of the disciplines traditionally associated with computational science include: atmospheric sciences (e.g. weather forecasting, climate and ocean modelling, seismology etc.), astrophysics, nuclear engineering, chemistry, fluid dynamics, structural analysis and materials research. It’s easy to see why these disciplines were quick to take up computational science.
+
+Other disciplines, such as medicine (e.g. medical imaging, blood flow simulations, bone structure simulations), biology (e.g. ecosystem and environmental modelling) and economics are embracing computational science as well.
+It has become quite common to come across terms like computational economics or computational biology.
+
+Even more recently, advances in machine learning have led to an entirely new subfield of computational science and its application across a huge range of disciplines.
+
+:::callout{variant="discussion"}
+If you want to know what sorts of applications are run on the UK’s national supercomputer ARCHER2, visit the relevant page on the ARCHER2 website. Is there anything that surprises you there?
+:::
+
+---
+
+![Question marks](images/laurin-steffens-IVGZ6NsmyBI-unsplash.jpg)
+*Image courtesy of [Laurin Steffens](https://unsplash.com/@lausteff) from [Unsplash](https://unsplash.com)*
+
+## Why use supercomputers?
+
+In this exercise we want you to have a look at three examples of computer simulations and answer the following questions:
+
+- In your opinion, what are the three main reasons for using supercomputers in science and industry?
+- Can you think of any problems that are still too difficult to be solved with our current computing technology?
+
+The excerpts below are from three different projects involving computer simulations, and should provide you with enough food for thought.
+
+### Protein Folding
+
+(The Folding@Home Project page can be found [here](https://foldingathome.org/dig-deeper/?lng=en-UK)):
+
+“Proteins are necklaces of amino acids — long chain molecules. They come in many different shapes and sizes, and they are the basis of how biology gets things done. As enzymes, they are the driving force behind all of the biochemical reactions which make biology work. As structural elements, they are the main constituent of our bones, muscles, hair, skin and blood vessels. As antibodies, they recognize invading elements and allow the immune system to get rid of the unwanted invaders. They also help move muscles and process the signals from the sensory system. For these reasons, scientists have sequenced the human genome — the blueprint for all of the proteins in biology — but how can we understand what these proteins do and how they work?
+However, only knowing this amino acid sequence tells us little about what the protein does and how it does it. In order to carry out their function (e.g. as enzymes or antibodies), they must take on a particular shape, also known as a fold. Thus, proteins are truly amazing machines: before they do their work, they assemble themselves! This self-assembly is called folding. Out of an astronomical number of possible ways to fold, a protein can pick one in microseconds to milliseconds (i.e. in a millionth to a thousandth of a second). How a protein does this is an intriguing mystery.”
+
+The Folding@Home project is actually an example of a distributed computing project. The project makes use of idle processing resources of personal computers owned by people who have voluntarily installed the project software on their systems. Nevertheless, supercomputers are used as well to simulate protein folding. If you are interested, you can read about the ANTON supercomputer and what it does in a Nature article.
+
+### Recreating the Big Bang
+
+(The Illustris project page can be found [here](http://www.illustris-project.org/)):
+
+“The Illustris project is a set of large-scale cosmological simulations, including the most ambitious simulation of galaxy formation yet performed. The calculation tracks the expansion of the universe, the gravitational pull of matter onto itself, the motion or “hydrodynamics” of cosmic gas, as well as the formation of stars and black holes. These physical components and processes are all modeled starting from initial conditions resembling the very young universe 300,000 years after the Big Bang until the present day, spanning over 13.8 billion years of cosmic evolution. The simulated volume contains tens of thousands of galaxies captured in high-detail, covering a wide range of masses, rates of star formation, shapes, sizes, and with properties that agree well with the galaxy population observed in the real universe.”
+
+### Weather modelling
+
+(More information about ensemble forecasting done by the UK Met Office can be found [here](https://www.metoffice.gov.uk/research/weather/ensemble-forecasting/what-is-an-ensemble-forecast)):
+
+“A forecast is an estimate of the future state of the atmosphere. It is created by estimating the current state of the atmosphere using observations, and then calculating how this state will evolve in time using a numerical weather prediction computer model. As the atmosphere is a chaotic system, very small errors in its initial state can lead to large errors in the forecast.
+
+This means that we can never create a perfect forecast system because we can never observe every detail of the atmosphere’s initial state. Tiny errors in the initial state will be amplified, so there is always a limit to how far ahead we can predict any detail. To test how these small differences in the initial conditions may affect the outcome of the forecast, an ensemble system can be used to produce many forecasts.”
+
+---
+
+![Scrabble letters spelling 'One step at a time'](images/brett-jordan-FHLGDs4CkY8-unsplash.jpg)
+*Image courtesy of [Brett Jordan](https://unsplash.com/@brett_jordan) from [Unsplash](https://unsplash.com)*
+
+## Simulation steps
+
+In the previous steps we discussed briefly what simulations are and how they are created.
+Most people using scientific simulations in their work do not write them; they simply make use of, or adapt, already existing pieces of software.
+From the user perspective, the simulation process can be thought of as consisting of three linked steps: pre-processing, running of a simulation and post-processing.
+ +### Pre-processing + +The pre-processing stage takes care of the model settings and input data. Different simulations require different inputs - simple models do not require much input (e.g. our traffic and wolf-sheep predator models), but most of the useful simulations deal with a large amount of data. Usually, the input data comes from real-world observations, measurements and experiments. Let us take weather modelling as an example. How do you think numerical weather prediction works? + +To make a forecast it is necessary to have a clear picture of the current state of the atmosphere and the Earth’s surface. Moreover, the quality of the forecast strongly depends on how well the numerical model can deal with all this information. Now, where do all these data points come from? They are gathered by various weather stations, satellite instruments, ships, buoys… and so on. + +It’s not hard to imagine that all of these may record and store their measurements differently. That is why the pre-processing step is necessary - it prepares the data for further procedures, so that they can be easily and effectively used. This may mean simply making sure all the data is in the same format, and there are no invalid data entries, or performing more complicated operations such as removing noise from data, or normalising the data sets. The pre-processing stage ends when a simulation is ready to be launched. + +### Execution + +Quite often, especially on a large machines, once the simulation has been started it runs until the end or until a certain, significant point in a calculation (we call these checkpoints) has been reached, and only then the output is produced. With batch systems, once an user submits their executable along with required input files to the submission queue, the job gets scheduled by a job scheduler, and some time later it runs and generates its output. In other words, you do not really see what is happening in the simulation and cannot interact with it. + +There are a number of reasons why supercomputing facilities use this approach but the main ones are: + +1. A machine is a shared resource but most users want/need an exclusive access to the compute resources. +1. Most of the applications are written in a way that require dedicated resources to scale efficiently. +1. The whole system must be utilised as fully as possible (even during weekends and public holidays!) otherwise its resources are being wasted. + +The point is that real-time visualisations (in situ visualisations), although slowly making their appearance, are not really used in a large scale simulations run on supercomputers. Downloading data to off-site locations (i.e. off the compute nodes of a supercomputer) allows interactive visualisations to be performed, without issues caused by limiting batch-mode workflows necessary on supercomputers. This means that to see what has happened during the simulation, i.e. to create a step-by-step visualisation of the simulation, it is necessary to save a lot of data at each time step. + +### Post-processing + +The post-processing stage extracts the results of the simulation and puts them into a usable form. Initially, the typical output of any kind of simulation was simply a string of numbers, presented in a table or a matrix, and showing how different parameters changed during the simulations. However, humans are not very good at interpreting numbers. It is much easier to understand the results presented using graphs and animations, than to scan and interpret tables of numbers. 
+For example, in weather forecasting it is common to show the movement of rain or clouds over a map showing geographical coordinates and timestamps. Nowadays, it is common for simulation outputs to be presented graphically, condensing large amounts of data into an accessible form.
+
+:::callout{variant="discussion"}
+What do you think may be required to perform the pre- and post-processing steps? Do you think they have to be done on the same machine the simulation is run on? Do you think these steps have different hardware or software requirements than the execution step?
+:::
+
+---
+
+## Questions on Computer Simulations
+
+::::challenge{id=comp_sim_intro.3 title="Computer Simulations Q3"}
+What are the main reasons for running computer simulations on supercomputers?
+
+Select all the answers you think are correct.
+
+A) to solve larger or more complex problems
+
+B) to solve problems faster
+
+C) it’s often cheaper than carrying out experiments
+
+D) computer simulations may be the only way of studying some problems
+
+:::solution
+All are correct!
+
+Think of the time and cost benefits of using computer simulations over other methods and means of doing science.
+
+Correct! - having more computational power means you can tackle larger or more complex problems in a relatively short amount of time.
+
+Correct! - in most simulations directly linked to our everyday lives (e.g. weather forecasting, medical modelling) the time-to-solution matters, so being able to solve problems faster is important.
+
+Correct! - if you need to test hundreds of different problem settings it’s often cheaper to run hundreds of simulations than to carry out hundreds of experiments.
+
+Correct! - problems dealing with extremely small or large time and space scales are often difficult, if not impossible, to study otherwise.
+:::
+::::
+
+::::challenge{id=comp_sim_intro.4 title="Computer Simulations Q4"}
+Which of the following statements about simulations and models are true?
+
+A) a simulation is an act of running a model
+
+B) a model is an exact representation of reality
+
+C) a model should always capture all interactions between the components of the system it models
+
+D) all of the above are true
+
+:::solution
+
+A)
+
+Think about the relation between a model, a simulation and reality.
+
+Correct! - the execution of a mathematical model on a computer is called a simulation.
+
+:::
+::::
+
+::::challenge{id=comp_sim_intro.5 title="Computer Simulations Q5"}
+Which of the following aspects of a computer simulation are never approximated?
+
+A) interactions between the system’s components
+
+B) initial and boundary conditions
+
+C) numerical values of simulation variables during the simulation run
+
+D) none of the above
+
+:::solution
+
+D)
+
+Think about why approximations are necessary.
+
+Correct! - the above approximations are necessary to allow computer simulations to study the behaviour of systems in a time- and cost-effective manner.
+
+:::
+::::
+
+::::challenge{id=comp_sim_intro.6 title="Computer Simulations Q6"}
+Which of the following statements about computational science are true?
+
+Select all the answers you think are correct.
+A) computational science can be considered to be an intersection between mathematics, computer science and applied disciplines
+
+B) computational science is a rigid discipline, which is not evolving in time
+
+C) only classical fields like physics and chemistry are making use of computational science
+
+D) computational science is focused on using computational methods to solve scientific problems
+
+:::solution
+A) and D)
+
+Think about what computational science is and how and when it came about.
+
+Correct! - computational science draws upon these three fields to solve scientific problems.
+
+Correct! - solving scientific problems is the main goal of computational science, and computational methods are the tools used to achieve that.
+:::
+::::
+
+::::challenge{id=comp_sim_intro.7 title="Computer Simulations Q7"}
+Which of the following statements about computer simulations are true?
+
+A) pre- and post-processing steps of computer simulations are not very important
+
+B) computer simulations are usually run interactively on supercomputers
+
+C) the pre-processing step usually prepares initial parameters and data, and the post-processing step puts the results into a more usable format
+
+D) all of the above
+
+:::solution
+C)
+
+Think about what computer simulations are meant to do and how they do it.
+
+Correct! - pre-processing is necessary to ensure that a simulation is run with the correct settings; post-processing makes sure that the simulation’s output is readable or ready for further processing.
+:::
+::::
diff --git a/high_performance_computing/computer_simulations/02_weather_simulations.md b/high_performance_computing/computer_simulations/02_weather_simulations.md
new file mode 100644
index 00000000..d4b08697
--- /dev/null
+++ b/high_performance_computing/computer_simulations/02_weather_simulations.md
@@ -0,0 +1,270 @@
+---
+name: Weather Simulations
+dependsOn: [
+  high_performance_computing.computer_simulations.01_intro
+]
+tags: [foundation]
+attribution:
+  - citation: >
+      "Introduction to HPC" course by EPCC.
+      This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC.
+    url: https://epcced.github.io/Intro-to-HPC/
+    image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg
+    license: CC-BY-4.0
+---
+
+# Weather simulations
+
+## Predicting Weather and Climate
+
+In this short PRACE video, Prof. Pier Luigi Vidale talks about the possibilities and challenges of weather and climate simulations.
+ +::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_ojrnelre&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_riytah2w" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Predicting_weather_Climate_hd"} +:::: + +:::solution{title="Transcript"} +0:12 - Climate science. + +0:15 - Have you ever wondered if natural disasters can be anticipated? Predicting how our climate and physical world changes is one of society’s greatest challenges. Professor Pier Luigi Vidale is a scientist addressing this challenge. Weather and climate simulation require very large supercomputers. We want to be able to simulate the weather and climate processes. We want to simulate things like typhoons, hurricanes, wind storms, major rain storms, flooding events. This is like looking at climate with a very high definition camera. We have been able to simulate some historical damage due to very powerful storms, like hurricanes, in ways that have never been observed before. These are types of simulations that were, even a few years ago, impossible. + +1:06 - And we can do this for actually hundreds to thousands of years of simulation, which is not available in our observations. So in many ways, we are using the computer simulation as a synthesis of observations we’ve never had. It’s also true that natural catastrophes are among the most costly to society. And this is very important for government, but also for industry. And much of my activity deals with your insurance industry and understanding how we can reduce losses to business, to society, and also loss of life. And we do need to understand how these weather and climate phenomena are changing with time. This is crucial to our society +::: + +In the next steps we will use an example of Numerical Weather Modelling to illustrate the key concepts of computer simulations. + +Despite many weather forecasts being widely available to the public, not many people understand how they are created and how they should be interpreted. +There is a huge difference between a 3-day and a 14-day forecast; different forecast ranges focus on different aspects of the atmosphere so, depending on the range, different models are used to predict the weather. + +For example, processes that do not have a clear impact on day-to-day forecasts, such as deep ocean circulation or carbon cycle, are absolutely essential to long range forecasts and climate modelling. That’s why climate projections use coupled ocean-atmosphere models, while short range weather forecasts do not. + +:::callout{variant="discussion"} +How often do you check weather forecasts? 
Usually, how far in advance do you check them? From your own experience, how often are short-term forecasts correct? What about longer-term forecasts? What makes you think this?
+:::
+
+© PRACE
+
+---
+
+![Person holding umbrella in the rain](images/erik-witsoe-mODxn7mOzms-unsplash.jpg)
+*Image courtesy of [Erik Witsoe](https://unsplash.com/@ewitsoe) from [Unsplash](https://unsplash.com)*
+
+## Weather simulation - how does it work?
+
+Meteorology was one of the first disciplines to harness the power of computers, but the idea of using equations to predict the weather predates the computer era. It was first proposed in 1922 by the English mathematician Lewis Fry Richardson.
+
+Not having any computing power at his disposal, he estimated that making a useful, timely forecast would require 64,000 people to perform the calculations. Not very feasible at the time, but his theory formed the basis for weather forecasting.
+
+### Numerical Weather Prediction
+
+The forecast starts with the creation of a three-dimensional grid consisting of many data points representing the current atmospheric conditions over a region of interest, extending from the surface to the upper atmosphere.
+Each data point contains a set of atmospheric variables, e.g. temperature, pressure, wind speed and direction, humidity and so on, taken from the observational data.
+The interaction and evolution of these atmospheric variables is dictated by a set of model equations.
+
+These equations can be divided into two categories - dynamical and physical. The dynamical equations treat the Earth as a rotating sphere and the atmosphere as a fluid, so describing the evolution of the atmospheric flow means solving the equations of motion for a fluid on a rotating sphere. However, this is not enough to capture the complex behaviour of the atmosphere, so a number of physical equations are added to represent other atmospheric processes, such as warming, cooling, drying and moistening of the atmosphere, cloud formation, precipitation and so on.
+
+Now, you already know that computers work in discrete steps, so, to predict a new weather state some time into the future, these equations need to be solved a number of times. The number of time steps and their length depends on the forecast timescale and type - short, medium or long term.
+
+### The Butterfly Effect
+
+Moreover, the atmosphere is a chaotic system, which means it is very susceptible to variations in the initial conditions. A tiny difference in the initial state of the atmosphere at the beginning of the simulation may lead to very different weather forecasts several days later. This concept of small causes having large effects is referred to as the butterfly effect.
+
+You may be familiar both with the term and the associated metaphor (a butterfly flapping its wings influencing a distant hurricane several weeks later). After all, it has been used not only in science but also in popular culture. The term was actually coined by Edward Lorenz, one of the pioneers of chaos theory, who encountered the effect while studying weather modelling. In 1961 he showed that running a weather simulation, stopping it and then restarting it, produced a different weather forecast than a simulation run without stopping!
+
+This behaviour was explained by the way computers work - stopping the simulation meant that the values of all variables had to be written out to storage, and restarting meant reading those numbers back into memory.
+The problem was, the level of precision of the stored numbers was less than the precision the computer had used to compute them.
+The numbers were being rounded, on the assumption that such small differences could have no significant effect.
+Lorenz's program rounded numbers accurate to six decimal places (e.g. 6.174122) to three decimal places (e.g. 6.174) when they were written out.
+When the simulation was restarted, the initially small differences were amplified into completely different weather forecasts!
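+You can reproduce the spirit of Lorenz's accident in a few lines of C. The sketch below (illustrative only - the logistic map is a classic chaotic system, not a weather model) iterates the same equation from two starting values, one at "full" precision and one rounded to three decimal places, just as Lorenz's stored output was. The two trajectories track each other at first, and then diverge completely.
+
+```c
+#include <stdio.h>
+
+int main(void) {
+    double full    = 0.174122;  /* "full precision" initial condition  */
+    double rounded = 0.174;     /* the same value, rounded to 3 d.p.   */
+
+    /* Iterate the chaotic logistic map x -> 3.9 x (1 - x). */
+    for (int step = 1; step <= 50; step++) {
+        full    = 3.9 * full    * (1.0 - full);
+        rounded = 3.9 * rounded * (1.0 - rounded);
+        if (step % 10 == 0)
+            printf("step %2d: full = %.6f  rounded = %.6f\n",
+                   step, full, rounded);
+    }
+    return 0;
+}
+```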
+Typically, to lessen the uncertainty in weather predictions, ensemble forecasting is used. In simple terms, a number of simulations are run with slightly different initial conditions and the results are combined into probabilistic forecasts, showing how likely particular weather conditions are. If the results of the ensemble runs are similar, then the uncertainty is small; if they are significantly different, then the uncertainty is bigger.
+
+:::callout{variant="discussion"}
+Does this explain why public weather forecasts should be taken with a pinch of salt? Does weather forecasting work as you expected? Do you find anything surprising?
+
+Considering the chaotic nature of the weather, how does the range of the forecast affect the differences between forecasts in the ensemble?
+:::
+
+---
+
+![Rendering of planet earth with blue points on its surface](images/shubham-dhage-fmCr42xCLtk-unsplash.jpg)
+*Image courtesy of [Shubham Dhage](https://unsplash.com/@theshubhamdhage) from [Unsplash](https://unsplash.com)*
+
+## Pre-processing in Weather Simulations
+
+You have seen how important initial conditions are; this is even more true for modelling chaotic systems such as the weather. To produce any useful forecast, it is absolutely essential to start with the right set of parameters. However, do we actually know the current state of the atmosphere? How well do we understand the processes governing it?
+
+Well, the short answer is: not well enough! We do not possess enough information to be able to tell what is happening at all points on and above our planet’s surface.
+
+### Data Assimilation
+
+The higher levels of the atmosphere, large areas of ocean, and inaccessible regions on land are examples of places on Earth for which we do not possess sufficient observational data. This certainly poses a problem; after all, we need to re-create the current weather state as closely as possible. Somehow, we need to fill in these gaps in our knowledge.
+
+Weather science does exactly that, using a process called data assimilation, which combines available observational data with a forecast of what we think the current state of the atmosphere is. This is done by comparing a previous forecast with the most recently received observations and then adjusting the model to reflect these observations. This process is repeated until satisfactory results are achieved. This way, the best estimate of the current weather state can be used as an input to the actual simulation.
+
+Depending on the forecast range, the rate of data assimilation is different - shorter and more localised forecasts are fed with observational data more often, and with more data points, than longer ones.
+
+### Parametrisation
+
+Besides the initial conditions derived from the observational data, there are also other parameters that need to be included in weather models. These parameters are introduced to account for processes that are too small or too complex to be explicitly represented.
Examples include the descent rate of raindrops, or a cumulus cloud, which is typically smaller than 1 km across.
+
+Among the processes that are too complex to be directly included in weather models is cloud microphysics - the processes that lead to the formation, growth and precipitation of atmospheric clouds. Due to their complexity and difference in scale it’s too computationally expensive to include these processes directly in weather models. Nevertheless, capturing and describing their effect on weather patterns is important. This is done through parametrisation derived from observational data and from our understanding of these processes.
+
+Do you think we will ever reach a point in the history of weather simulation when the steps of data assimilation and parametrisation will become unnecessary? Why do you think so?
+
+---
+
+![NASA computer weather simulation model depicting aerosol movement across the globe](images/Paint_by_Particle.jpeg)
+*NASA/Goddard Space Flight Center*
+
+## Running Weather Simulations
+
+There are many different models, and each of them is run in different configurations - over different forecast ranges, over different geographical areas and with different resolutions. Do they have anything in common, then?
+
+What they have in common is that we want them to provide us with forecasts containing as much detail as possible, while at the same time being produced in a timely fashion. The problem is that the increased complexity of simulations demands more computing power to issue forecasts on schedule and at a reasonable cost.
+
+For example, at the European Centre for Medium-Range Weather Forecasts ([ECMWF](http://www.ecmwf.int/en/about)), a single 10-day forecast is run in one hour. However, as we mentioned before, ensemble forecasting is commonly used to estimate the effect of uncertainties. The ECMWF typically runs an ensemble consisting of 50 single forecasts. Compared to a single forecast at the same resolution, the ensemble run is 50 times more expensive and produces 50 times as much data.
+
+This is only possible thanks to their [supercomputing system](http://www.ecmwf.int/en/computing/our-facilities/supercomputer) consisting of over 100,000 CPU-cores. The problem is that, at the rate at which the models are being improved, it is estimated that in the future 20 million cores would be needed to do the same job.
+
+### Resolution
+
+One of the ways to improve the forecasts is to increase the model resolution. Constructing a finer grid means providing more details of the surface characteristics (e.g. mountains, seas) and reducing errors in the descriptions of smaller-scale physical processes. Another way to improve a forecast is to add more complexity to the model, for example by adding aerosols containing particles such as dust, volcanic ash, and pollution, or including more atmosphere-ocean interactions.
+
+![ECMWF](images/hero_314aac5e-0e8a-4049-becd-db3d5d99ee30.png)
+*© ECMWF*
+
+These improvements are not easy to implement, and at the same time they increase the computational intensity tremendously. If the model was perfectly scalable it would be enough to increase the number of CPU-cores used in the simulation. Then you could increase the resolution by doubling the number of grid points, run the simulation on twice as many cores, and expect it to be completed in the same time as the original simulation. Unfortunately, this only works if calculations are independent of each other.
+
+Even if there is no coupling between different variables (i.e. they do not affect each other in any way), which is not always the case, at some point in the calculation (sometimes each timestep) the partial values of variables calculated over all grid points need to be summed into one value. In other words, the data scattered among all the CPU-cores involved in the calculation needs to be collected, summed and redistributed again to allow the parallel calculations to continue.
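+In code, this collect-sum-redistribute step is a global reduction. Here is a minimal MPI sketch (illustrative, not taken from a real weather code): each rank holds a partial sum for its share of the grid, and `MPI_Allreduce` combines the partial sums and hands the global result back to every rank.
+
+```c
+#include <mpi.h>
+#include <stdio.h>
+
+int main(int argc, char *argv[]) {
+    MPI_Init(&argc, &argv);
+
+    int rank, size;
+    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+    MPI_Comm_size(MPI_COMM_WORLD, &size);
+
+    /* Stand-in for the partial sum over this rank's share of the grid. */
+    double local_sum = (double) rank;
+
+    /* Collect the partial sums from every rank, add them up, and
+       redistribute the global value so all ranks can continue. */
+    double global_sum;
+    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
+                  MPI_COMM_WORLD);
+
+    if (rank == 0)
+        printf("Global sum over %d ranks: %f\n", size, global_sum);
+
+    MPI_Finalize();
+    return 0;
+}
+```
+
+Note that the MPI standard does not fix the order in which the partial sums are combined, which is one source of the run-to-run rounding differences discussed in the previous part.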
+Global communication across a large number of CPU-cores can have a significant impact on computing performance. The improvements to the model may also result in an increased amount of data processed during the simulation, and more data being communicated between the CPU-cores. This may simply kill scalability!
+
+Do you think running a weather simulation on 20 million CPU-cores is possible? Why? Are there any conditions that weather models would have to meet to make it possible?
+
+© The ECMWF
+
+---
+
+![Visualisation of tracking a superstorm across USA](images/tracking_superstorm.jpg)
+*NASA's Goddard Space Flight Center and NASA Center for Climate Simulation*
+
+## Visualisation and Post-processing in Weather Simulations
+
+As we said, being able to run more complex simulations also means producing more data. However, data by itself is not very useful; it only becomes valuable when we know how to interpret it. This is especially true for complex simulations such as weather forecasting.
+
+Understanding weather data without some sort of post-processing and visualisation is close to impossible.
+
+### Post-processing
+
+The main aim of post-processing is to make the forecasts more useful and usable. This includes tailoring the output to the needs of the intended audience. For example, slightly different forecasts will be produced for media, transport, agriculture or defence services.
+
+The post-processing step is also used to improve weather models by relating their outputs to the observational data. This helps to account for local influences, which are not completely resolved in the representation of the model output, and to choose the parameters representing phenomena that are not captured by the model. Different statistical methods are used to correct the systematic biases (differences between the calculated and observed values) and other inadequacies of the numerical models.
+
+Quite often special software is required to allow post-processing and then visualisation of the simulation results.
+
+### Visualisations
+
+At any given moment in time, a weather state is represented by at least tens of thousands of data points.
+The sheer volume of data is so vast that some visual form is needed to make sense of it.
+
+One of the earliest visualisation techniques used in weather science was the map. Weather maps usually focus on a few variables only (e.g. temperature and cloud/rain cover) and show how they will behave over the next hours or days. If you are interested in seeing how the weather maps used in TV weather forecasts have changed over time, visit the BBC article - [Presenting a warm front: 60 years of the British TV weather forecast](https://www.bbc.co.uk/news/magazine-25665340).
+
+More recently, emphasis has been put on the use of interactive displays, especially on the web, giving users control over the type and form of the displayed information. Animations are also widely used because they are able to effectively condense vast amounts of data into memorable visual sequences.
+
+If you are interested in new cutting-edge visualisation techniques developed for weather and environmental science, we invite you to watch the introduction to the [Informatics Lab](https://www.youtube.com/watch?v=s6ito6QxbH4) run by the Met Office (the UK's official weather service provider). On the [Lab's website](http://www.informaticslab.co.uk/) you will also find different demos that you can play with. Especially interesting is the [Fly Through Model Fields](https://archived.informaticslab.co.uk/projects/three-d-vis.html) project.
+
+Can you imagine any of the Informatics Lab projects being used in real-life situations? Would they be useful? Why do you think so?
+
+---
+
+## Terminology Recap
+
+::::challenge{id=weather_sim.1 title="Weather Simulations Q1"}
+From the user perspective, the simulation process can be thought of as consisting of three linked steps. The first one, taking care of model settings and input data, is referred to as the ____ stage. Then comes the ____ stage, which on large machines is handled by the batch system. Finally, we have the ____ stage, which takes the results of a simulation and puts them into a usable form.
+
+:::solution
+A) pre-processing
+
+B) execution
+
+C) post-processing
+:::
+::::
+
+::::challenge{id=weather_sim.2 title="Weather Simulations Q2"}
+The process by which observations of the actual system are incorporated into the model state of a numerical model of that system is called ____ ____ . Introducing additional parameters into a model to account for the processes that are too small or too complex to be explicitly represented is called ____ .
+
+:::solution
+A) data
+
+B) assimilation
+
+C) parametrisation
+:::
+::::
+
+---
+
+## Scalability of Weather Simulations
+
+The need for greater computing power in weather forecasting is driven by advances in modelling the Earth's physical processes, the use of more observational data and finer model grid resolutions. However, is it enough to simply keep increasing the computational power indefinitely?
+
+In March 2016 ECMWF launched a new model, which reduced the horizontal grid spacing from 16 to 9 km, resulting in 3 times as many prediction points, which now total 904 million. It's estimated that the increased resolution improves the accuracy of forecasts by 2-3% for many parameters. The graph below shows the results of simulations carried out by ECMWF at a range of hypothetical model grid resolutions.
+
+![Result of ECMWF simulations at a range of hypothetical model grid resolutions](images/hero_ef5206cc-5956-4ce2-a77f-054447fce6a9.png)
+
+Using the graph and what you have learnt so far, try to answer the following questions:
+
+- Why does the scalability improve with increased resolution?
+- What are the benefits of increased resolution?
+- Why do ensemble forecasts scale better than single forecasts?
+- What do you think about this scaling behaviour? Is it good or bad?
+- What do you think the existence of a power limit means?
+- What do you think should be changed to realise the goal of 5 km horizontal resolution for ensemble forecasting by 2025?
+
+If you would like to learn more about the upgraded model, visit the dedicated [ECMWF media centre page](http://www.ecmwf.int/en/about/media-centre/news/2016/new-forecast-model-cycle-brings-highest-ever-resolution).
+
+© ECMWF
+
+---
+
+![Bottles on factory production line](images/andrew-seaman-RuudPEDUM3w-unsplash.jpg)
+*Image courtesy of [Andrew Seaman](https://unsplash.com/@amseaman) from [Unsplash](https://unsplash.com)*
+
+## Bottlenecks of Weather Simulations
+
+The limitations of weather simulations can be divided into two categories: theoretical and practical. The theoretical limitation is related to the mathematical description of the model.
+
+Processes governing the atmosphere are very complex and it's difficult to capture them with equations.
+Even if it were possible, the equations would not have exact solutions, because solving them without approximations is not possible.
+Over the years, thanks to the increase in computational power, the effects of these approximations have been increasingly minimised through better parametrisation, increased resolution and more complex models.
+Nowadays, a seven-day forecast is as accurate as a three-day forecast was in 1975.
+
+### Increasing Resolution
+
+Improving the accuracy of weather forecasts even further requires incorporating more interactions between inter-scale phenomena (e.g. atmosphere-ocean coupling) and further increasing the resolution of the models, which, needless to say, would make the simulations significantly more computationally intensive. At the moment, the models are not fine enough to capture smaller-scale details.
+
+To give you a better picture: to represent a feature within a model you need at least 4 grid points across it. In other words, the grid spacing needs to be 1/4 of the size of the modelled feature. Now imagine you want to simulate a weather phenomenon 4 km x 4 km in size - a thunderstorm, maybe. To do that, you need to use a model with a 1 km grid resolution, but most of the models are coarser than that.
+
+Generally, decreasing the spacing between grid points is not easy, not only because the number of grid points grows with the inverse square of the spacing (it doubles in each horizontal direction every time the spacing is halved), but also because of the need to maintain the numerical stability of the equations used to simulate the atmospheric variables.
+
+Numerical stability refers to the behaviour of the equations when solved with slightly erroneous input - in a numerically unstable algorithm, a small error in the input causes a larger error in the results. Therefore, increasing the resolution of the grid may require the equations to be rewritten to maintain their numerical stability.
+Generally, the higher the resolution (the smaller the grid spacing), the shorter the time step that is needed to maintain stability.
+This limitation on stability due to step size is known as the [Courant–Friedrichs–Lewy condition](https://en.wikipedia.org/wiki/Courant%E2%80%93Friedrichs%E2%80%93Lewy_condition).
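+For a simple transport (advection) process, the condition says that information must not cross more than one grid cell per time step. A minimal sketch in C - purely illustrative, with a function name of our own invention, not taken from any forecasting code:
+
+```c
+/* Largest stable timestep under a simple CFL condition:
+   dt <= C * dx / u_max, where C <= 1 is the Courant number,
+   dx is the grid spacing and u_max the fastest wind speed. */
+double cfl_timestep(double dx, double u_max, double courant)
+{
+  return courant * dx / u_max;
+}
+```
+
+For example, with 9 km spacing and winds of up to 100 m/s, the time step must stay below about 90 seconds (taking C = 1). Halving the grid spacing also halves the maximum stable time step, so doubling the horizontal resolution makes the simulation roughly eight times more expensive: four times as many grid points and twice as many time steps.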
+### Observational Data
+
+Another bottleneck is related to the observational data. The atmosphere is a chaotic system - without substantial amounts of data, the chances of creating accurate forecasts decrease significantly, especially at longer forecast ranges. That is why weather forecasting relies on observational data coming from many different sources, both terrestrial and space-based. These sources describe the atmosphere in slightly different ways, so there is no unique way to represent the data uniformly; this makes calibrating and pre-processing the data to be used as input to the simulations an ongoing challenge.
+Despite the difficulty of utilising these disparate data sources, it's absolutely crucial to have as many different sources as possible, otherwise the forecasts will only be able to capture large-scale patterns.
+
+To illustrate the importance of the initial data, consider the example of Superstorm Sandy, one of the costliest hurricanes in US history and the deadliest of 2012. ECMWF successfully approximated its path, predicting its route seven days before it turned left and hit the shores of New Jersey. This almost unprecedented path was attributed to interactions with the large-scale atmospheric flow and highlighted the importance of the data provided by satellite observations.
+
+After the storm, ECMWF ran a number of experiments to determine the role of satellite data, re-running the simulations with the satellite data deliberately withheld. The results showed that without the data gathered by polar-orbiting satellites the model would have failed to predict the hurricane hitting New Jersey!
+
+You may think that, with continuing technological advances, the number of satellites is steadily increasing, so there is no danger of failing to provide enough data to the weather models. In fact, a number of the satellites are ageing, and funding their replacement is not always easy: governments are not always keen on spending money on new satellites. This could prove to be a real problem, degrading not only our ability to predict the weather but also our understanding of Earth's climate and life-support systems.
+
+:::callout{variant="discussion"}
+In your opinion, what is the most limiting factor in our current ability to forecast the weather?
+:::
diff --git a/high_performance_computing/computer_simulations/03_towards_future.md b/high_performance_computing/computer_simulations/03_towards_future.md
new file mode 100644
index 00000000..6e28b0a8
--- /dev/null
+++ b/high_performance_computing/computer_simulations/03_towards_future.md
@@ -0,0 +1,191 @@
+---
+name: Towards the Future
+dependsOn: [
+  high_performance_computing.computer_simulations.02_weather_simulations
+]
+tags: [foundation]
+attribution:
+  - citation: >
+      "Introduction to HPC" course by EPCC.
+      This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC.
+    url: https://epcced.github.io/Intro-to-HPC/
+    image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg
+    license: CC-BY-4.0
+---
+
+![Snapshot of simulated cardiac electrical activity in two bodies](images/hero_4c1c8a04-c056-418c-a617-a2780f86ad05.jpg)
+*Snapshot of simulated cardiac electrical activity. © 2016 ARCHER image competition*
+
+## So why are supercomputers needed?
+
+In the previous steps you have familiarised yourself with three different large-scale computer simulations, so you should have a good understanding of why supercomputers are needed.
+
+In this discussion step we would like you to have a look at [ARCHER2's case studies](https://www.archer2.ac.uk/research/case-studies/). Pick one of them and try to answer the following questions:
+
+- which discipline does the case study belong to?
+- is there any social or economic benefit to it?
+- what are the main reasons for using supercomputers to study this problem?
+- in your opinion, what is the most surprising or interesting aspect of this case study?
+- are there any similarities between the case study you picked and the three case studies we have discussed in this module? Why do you think that is?
+
+---
+
+![Very tall building](images/verne-ho-0LAJfSNa-xQ-unsplash.jpg)
+*Image courtesy of [Verne Ho](https://unsplash.com/@verneho) from [Unsplash](https://unsplash.com)*
+
+## Future of Supercomputing - the Exascale
+
+So what does the future of supercomputing look like? Much of the current research and development in HPC is focused on Exascale computing.
+
+For HPC architectures, this can be taken to mean working towards a system with a floating-point performance of at least 1 Exaflop/s (i.e. 10^18, or a million million million, floating-point calculations per second).
+
+In 2016, the US government announced the Exascale Computing Project, which aimed to have its first supercomputer operating at 1 Exaflop/s or more in production by 2021. If past trends in the Top500 list had been followed, then a 1 Exaflop/s system would have been expected in 2018 - the fact that this date was missed by several years is a measure of the technical challenges involved, both in hardware and software.
+
+2022 saw the first Exascale system, [Frontier](https://www.top500.org/system/180047/), based at Oak Ridge National Laboratory, with two more Exascale systems, Aurora and El Capitan - both also in the US - following in 2024. Although at some level 1 Exaflop/s is just an arbitrary number, it has become a significant technological (and political) milestone, and more machines will follow in the near future as computing needs continue to grow.
+
+Some of the main barriers to building a useful and economically viable Exascale machine are:
+
+### Hardware speeds
+
+Since around 2006, there has been little significant increase in the clock frequency of processors. The only way to extract more performance from computers has been through more parallelism: by having more cores per chip, and by making each core capable of more floating-point operations per second. Without a fundamental move away from the current silicon technology, there is no real prospect of significantly higher clock frequencies in the next 5-10 years, and not much prospect of greatly reduced network latencies either. On the plus side, new memory designs, such as 3D stacked memory, do promise some increases in memory bandwidth.
+
+### Energy consumption
+
+If we were to build a 1 Exaflop/s computer today, using standard Intel Xeon processors, then it would consume around 400 megawatts of power: that's enough electricity for 700,000 households, or about 1% of the UK's entire electricity generating capacity!
+That's not only hugely expensive, but it would require a big investment in engineering infrastructure, and would be politically challenging from a carbon-footprint point of view.
+There are also plans to use waste heat from supercomputers as heating for homes, or for direct energy recovery, which could substantially decrease the cost and environmental impact of running machines at this scale.
+
+The target energy consumption for an Exascale system is 20-30 megawatts.
+Some of the required savings can be made by using special-purpose manycore processors, such as GPUs, instead of standard Xeons, but we are still around a factor of 5 off this target. Closing this gap is one of the big challenges in the short to medium term - some savings can be made by reducing the clock frequency of processors, but this has to be compensated for by a corresponding increase in the number of cores, in order to meet the total computational capacity target.
+
+### Reliability
+
+As the number of cores, and of other components such as memory, network links and disks, increases, so does the likelihood that some component will fail. As a rule of thumb, a supercomputer service becomes unacceptable to users if the rate of visible failures (i.e. failures that cause running applications to crash) is more than about one per week. As we build bigger and bigger supercomputers, with more and more components, the mean time between failures will tend to decrease to a point where the system becomes effectively unusable. While some types of program can be written so as to be able to deal with hardware failures, it turns out to be very hard to do this for most HPC applications without seriously compromising performance.
+
+### Application Scalability
+
+It's all very well to build an Exascale computer, but there isn't much point unless applications can make practical use of it. As the degree of parallelism in the hardware continues to increase, it gets harder and harder to make applications scale without running into unavoidable bottlenecks. Strong scaling (i.e. obeying Amdahl's Law) is very challenging indeed. Weak scaling (as in Gustafson's Law) is easier to achieve, but often doesn't result in solving the problems scientists are actually interested in.
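+As a reminder of why strong scaling is so hard, Amdahl's Law caps the achievable speedup according to the serial fraction of a program. A quick back-of-the-envelope helper - a sketch of the standard formula, not code from any particular application:
+
+```c
+/* Amdahl's Law: speedup on n cores when a fraction s of the
+   runtime is inherently serial. */
+double amdahl_speedup(double s, int n)
+{
+  return 1.0 / (s + (1.0 - s) / n);
+}
+```
+
+Even with a serial fraction of just 0.1%, the speedup on ten million cores works out at just under 1,000 - equivalent to all but a thousand or so of the cores sitting idle.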
+
+It is likely that for the first generation of Exascale systems, there will be only a small number (maybe only in the low tens) of application codes that can usefully exploit them. Even to achieve this will require heroic efforts by application developers and computer scientists, and also some degree of co-design: the hardware itself may be tailored to suit one or a few particular application codes, rather than providing full general-purpose functionality.
+
+So that's where HPC is heading in the next 5 years - a strong push towards Exascale systems. Even though these may only be usable by a small number of applications, the technologies developed along the way (both in hardware and software) will undoubtedly have an impact at the more modest (i.e. Tera- and Peta-) scale, and the performance of real application codes will continue to increase, even if at a somewhat slower rate than in past decades.
+
+In your opinion, which of the above barriers is the hardest to breach? Why do you think so?
+
+---
+
+![Atomic particle](images/physics-3871216_640.jpg)
+*Image courtesy of [geralt](https://pixabay.com/users/geralt-9301/) from [Pixabay](https://pixabay.com)*
+
+## Quantum Computing
+
+There has been a lot of excitement in recent years about the possibilities of quantum computers. These are systems that use the quantum mechanical effects of superposition and entanglement to do calculations at a much faster rate than is possible with classical computers based on binary logic.
+
+In this article, we are not going to talk about how it actually works; you can read about that in an online article published on the Plus Maths website - [How does quantum computing work?](https://plus.maths.org/content/how-does-quantum-commuting-work)
+
+There are many research efforts going on right now into the fundamental technology needed to build such a machine: it is not yet possible to make the quantum states exist for long enough, or in large enough numbers, to do any more than the very simplest computations.
+
+Even if the technological problems can be solved, there is a big problem for quantum computing, in that we can't just take a normal computer program and run it on a quantum computer: completely new quantum algorithms have to be invented to solve scientific problems. Coming up with these algorithms is really hard: there are only a few tens of them known to exist, of which maybe a handful are of more than purely theoretical interest. One of the earliest discovered, and best known, is Shor's algorithm, a way to find the prime factors of large numbers much faster than can be done with classical computers. This algorithm has an application in breaking some commonly used cryptography methods, but this is of limited practical use, as there are other known cryptography methods that are not susceptible to this form of attack.
+
+A much more interesting potential use of quantum computers is to simulate quantum systems, such as molecules and materials at the atomic level. Such simulations are currently done on classical supercomputers, but the algorithms used typically do not scale well to more than a few thousand processors, and so are not likely to benefit much from Exascale architectures with tens of millions of cores. In this case, quantum computers offer a very exciting potential alternative way of doing science that is currently impossible.
+
+The one type of quantum computer that does exist today is built by D-Wave Systems, and relies on rather different quantum effects (so-called quantum annealing) to solve a certain type of optimisation problem. So far, however, it has been difficult to demonstrate that the D-Wave machine is really reliant on quantum behaviour, and it has not been possible to show any meaningful performance advantage over conventional methods for solving the same problem on classical computers.
+
+For sure, quantum computers are an exciting new field, but they are not going to replace classical supercomputers any time soon, and even if they do, they will probably only be usable for solving a few, very specialised problems.
+
+---
+
+![Artist depiction of artificial intelligence](images/artificial-intelligence-3382507_640.jpg)
+*Image courtesy of [geralt](https://pixabay.com/users/geralt-9301/) from [Pixabay](https://pixabay.com)*
+
+## Artificial Intelligence
+
+What we've covered in the course is the use of supercomputers in computational science, for example to simulate the weather.
+
+The basic approaches to tackling problems like this using computer simulation have been known for some time, and parallelising on a supercomputer enables us to run very detailed simulations to greatly improve the accuracy of our predictions. Although the mathematics can be very complicated, the computer implementation is rather mechanical. We apply a set of rules to a large number of grid-points over and over again to advance the simulation through time: the much simpler traffic model is a very good analogy to the overall process.
+
+Although these approaches have been very successful across a wide range of problems, there are other areas where they don't yield such good results. These tend to be tasks that are a bit fuzzier to define - tasks that humans actually find very simple - such as recognising faces, understanding speech or driving a car. Great strides have been made in recent years through advances in Artificial Intelligence (AI).
+
+### Neural Networks
+
+AI takes a rather different approach: rather than writing a separate program for each problem, you write software that can learn how to solve a much more general problem. This can be done by creating an artificial neural network which takes an input (such as an image of a face) and produces an output (such as "that looks like David").
+We first train the network on a set of known results (e.g. a set of passport photos that have already been identified by a human), and then apply the trained network to the unknown input to get useful results (e.g. "this person is David because he looks like David's passport photo").
+All neural networks work on this basic principle of pattern recognition, whether it is recognising subjects in images, spotting trends in data, or predicting the next word(s) in a sequence (as Large Language Models, LLMs, do).
+This is called "Machine Learning".
+
+It's very important to recognise that, although neural networks are inspired by the way the human brain is organised as a set of connected neurons, we are not simulating the brain. The Blue Brain project is building computer models of real brains: this is in order to understand how the human brain works, not to create a synthetic AI human. Neural networks are designed to solve particular tasks that the human brain can do very easily, but using completely synthetic neural networks that look nothing like the real human brain.
+
+### What has this got to do with Supercomputing?
+
+Neural networks have been around for a long time, but only recent advances in computer hardware have enabled them to be large enough and fast enough to begin to tackle problems that were previously impossible to address. The combination of modern Machine Learning software and powerful computers has come to be called Deep Learning. It is also a very parallelisable problem: if I have to train my network on 1000 known images, each image can be processed independently on a separate CPU-core (see the sketch below). We are starting to see people designing and building supercomputers specifically targeted at Deep Learning. GPUs, with their extreme parallel nature, are very well suited to machine learning problems, and Google has even designed its own "TPU" (Tensor Processing Unit) processor specifically targeted at Deep Learning problems.
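+To make the "each image is independent" point concrete, here is a hedged sketch of a data-parallel training-style loop in C with OpenMP. The names `process_image` and `NIMAGES` are invented for illustration and do not come from any real machine-learning library:
+
+```c
+#define NIMAGES 1000
+
+/* Hypothetical per-image work: evaluate the network on one training
+   image and return its contribution to the overall error. */
+double process_image(int i);
+
+double training_pass(void)
+{
+  double total_error = 0.0;
+
+  /* The images are independent, so the iterations can be shared
+     out across CPU-cores; partial errors are combined at the end. */
+  #pragma omp parallel for reduction(+:total_error)
+  for (int i = 0; i < NIMAGES; i++)
+  {
+    total_error += process_image(i);
+  }
+
+  return total_error;
+}
+```
+
+The same pattern scales from the cores of a laptop to the GPU-based supercomputers mentioned above; the hard part in practice is feeding the cores with data fast enough.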
+
+### What about the future?
+
+We are still a long way from creating an artificially intelligent computer. Although supercomputer hardware can always help - we can train a network to recognise a face a thousand times faster - we will need to make tremendous advances in software to approach true AI. Although I'm sure that AI programs will make increasing use of supercomputer hardware in the future, there's no chance that {{ machine_name }} will suddenly become self-aware and refuse to tell us what tomorrow's weather will be!
+
+---
+
+## Terminology Recap
+
+::::challenge{id=towards_future.1 title="Towards the Future Q1"}
+The term ____ programming refers to an approach of combining more than one programming model in the same parallel program. The most common combination is to use the ____ library and ____ together. In this approach, ____ allows communication between different compute nodes over the network. We use ____ to take advantage of the shared memory within each node.
+
+:::solution
+A) hybrid
+
+B) MPI
+
+C) OpenMP
+
+D) MPI
+
+E) OpenMP
+:::
+::::
+
+::::challenge{id=towards_future.2 title="Towards the Future Q2"}
+The term ____ ____ refers to a situation where the computational work is not distributed equally among the CPU-cores.
+
+:::solution
+Load imbalance
+:::
+::::
+
+::::challenge{id=towards_future.3 title="Towards the Future Q3"}
+We have mentioned four main barriers to building a useful and economically viable Exascale machine. These are:
+
+- ____ limitations (e.g. speed) - the clock frequency of processors has stagnated in recent years.
+- ____ consumption - the target for an Exascale system is 20-30 megawatts.
+- ____ - as the number of cores, and other components such as memory, network links and disks, increases, so does the risk that components will fail more often.
+- application ____ - at the moment there is only a small number (maybe only in the low tens) of application codes that could usefully exploit Exascale machines.
+
+:::solution
+A) hardware
+
+B) energy
+
+C) reliability
+
+D) scalability
+:::
+::::
+
+---
+
+![Lightbulb](images/johannes-plenio-voQ97kezCx0-unsplash.jpg)
+*Image courtesy of [Johannes Plenio](https://unsplash.com/@jplenio) from [Unsplash](https://unsplash.com)*
+
+## What do you think the future holds?
+
+One of the defining features of supercomputing over the past two decades is that it has used commodity technology and benefited from the huge investments in consumer computers.
+
+Perhaps the only truly bespoke technology was the interconnect - only computational scientists needed to connect thousands of computers with microsecond latencies. However, the articles in this final activity have talked about going beyond the limits of today's silicon technology.
+
+Things to consider:
+
+- Do you think there is consumer demand for ever-faster computers?
+- What applications could they have?
+- Have we reached a point similar to commercial aircraft - it's too expensive to make them go faster, so investment is focused on saving cost, fuel economy, reliability etc.?
+- Will the possibilities opened up by Exascale supercomputers drive new technologies?
diff --git a/high_performance_computing/computer_simulations/images/Gravitywaves.jpeg b/high_performance_computing/computer_simulations/images/Gravitywaves.jpeg new file mode 100644 index 00000000..4dd301b4 Binary files /dev/null and b/high_performance_computing/computer_simulations/images/Gravitywaves.jpeg differ diff --git a/high_performance_computing/computer_simulations/images/Paint_by_Particle.jpeg b/high_performance_computing/computer_simulations/images/Paint_by_Particle.jpeg new file mode 100644 index 00000000..e11b46cd Binary files /dev/null and b/high_performance_computing/computer_simulations/images/Paint_by_Particle.jpeg differ diff --git a/high_performance_computing/computer_simulations/images/andrea-lightfoot-Pj6fYNRzRT0-unsplash.jpg b/high_performance_computing/computer_simulations/images/andrea-lightfoot-Pj6fYNRzRT0-unsplash.jpg new file mode 100644 index 00000000..f42121e4 Binary files /dev/null and b/high_performance_computing/computer_simulations/images/andrea-lightfoot-Pj6fYNRzRT0-unsplash.jpg differ diff --git a/high_performance_computing/computer_simulations/images/andrew-seaman-RuudPEDUM3w-unsplash.jpg b/high_performance_computing/computer_simulations/images/andrew-seaman-RuudPEDUM3w-unsplash.jpg new file mode 100644 index 00000000..0b710dd2 Binary files /dev/null and b/high_performance_computing/computer_simulations/images/andrew-seaman-RuudPEDUM3w-unsplash.jpg differ diff --git a/high_performance_computing/computer_simulations/images/artificial-intelligence-3382507_640.jpg b/high_performance_computing/computer_simulations/images/artificial-intelligence-3382507_640.jpg new file mode 100644 index 00000000..76bec986 Binary files /dev/null and b/high_performance_computing/computer_simulations/images/artificial-intelligence-3382507_640.jpg differ diff --git a/high_performance_computing/computer_simulations/images/brett-jordan-FHLGDs4CkY8-unsplash.jpg b/high_performance_computing/computer_simulations/images/brett-jordan-FHLGDs4CkY8-unsplash.jpg new file mode 100644 index 00000000..4097df98 Binary files /dev/null and b/high_performance_computing/computer_simulations/images/brett-jordan-FHLGDs4CkY8-unsplash.jpg differ diff --git a/high_performance_computing/computer_simulations/images/erik-witsoe-mODxn7mOzms-unsplash.jpg b/high_performance_computing/computer_simulations/images/erik-witsoe-mODxn7mOzms-unsplash.jpg new file mode 100644 index 00000000..6ead20ab Binary files /dev/null and b/high_performance_computing/computer_simulations/images/erik-witsoe-mODxn7mOzms-unsplash.jpg differ diff --git a/high_performance_computing/computer_simulations/images/hero_314aac5e-0e8a-4049-becd-db3d5d99ee30.png b/high_performance_computing/computer_simulations/images/hero_314aac5e-0e8a-4049-becd-db3d5d99ee30.png new file mode 100644 index 00000000..3e2019ee Binary files /dev/null and b/high_performance_computing/computer_simulations/images/hero_314aac5e-0e8a-4049-becd-db3d5d99ee30.png differ diff --git a/high_performance_computing/computer_simulations/images/hero_4c1c8a04-c056-418c-a617-a2780f86ad05.jpg b/high_performance_computing/computer_simulations/images/hero_4c1c8a04-c056-418c-a617-a2780f86ad05.jpg new file mode 100644 index 00000000..a96fcf4e Binary files /dev/null and b/high_performance_computing/computer_simulations/images/hero_4c1c8a04-c056-418c-a617-a2780f86ad05.jpg differ diff --git a/high_performance_computing/computer_simulations/images/hero_b12d1403-058b-4971-9417-f188a1440b3a.png 
b/high_performance_computing/computer_simulations/images/hero_b12d1403-058b-4971-9417-f188a1440b3a.png new file mode 100644 index 00000000..4deb9cf1 Binary files /dev/null and b/high_performance_computing/computer_simulations/images/hero_b12d1403-058b-4971-9417-f188a1440b3a.png differ diff --git a/high_performance_computing/computer_simulations/images/hero_bd3c2838-0873-4170-a3eb-5f53462415c4.png b/high_performance_computing/computer_simulations/images/hero_bd3c2838-0873-4170-a3eb-5f53462415c4.png new file mode 100644 index 00000000..3af0d68a Binary files /dev/null and b/high_performance_computing/computer_simulations/images/hero_bd3c2838-0873-4170-a3eb-5f53462415c4.png differ diff --git a/high_performance_computing/computer_simulations/images/hero_e436356c-c306-4ece-bcb6-b2c906973579.png b/high_performance_computing/computer_simulations/images/hero_e436356c-c306-4ece-bcb6-b2c906973579.png new file mode 100644 index 00000000..d3f22ae8 Binary files /dev/null and b/high_performance_computing/computer_simulations/images/hero_e436356c-c306-4ece-bcb6-b2c906973579.png differ diff --git a/high_performance_computing/computer_simulations/images/hero_ef5206cc-5956-4ce2-a77f-054447fce6a9.png b/high_performance_computing/computer_simulations/images/hero_ef5206cc-5956-4ce2-a77f-054447fce6a9.png new file mode 100644 index 00000000..e857e427 Binary files /dev/null and b/high_performance_computing/computer_simulations/images/hero_ef5206cc-5956-4ce2-a77f-054447fce6a9.png differ diff --git a/high_performance_computing/computer_simulations/images/johannes-plenio-voQ97kezCx0-unsplash.jpg b/high_performance_computing/computer_simulations/images/johannes-plenio-voQ97kezCx0-unsplash.jpg new file mode 100644 index 00000000..b649b035 Binary files /dev/null and b/high_performance_computing/computer_simulations/images/johannes-plenio-voQ97kezCx0-unsplash.jpg differ diff --git a/high_performance_computing/computer_simulations/images/laurin-steffens-IVGZ6NsmyBI-unsplash.jpg b/high_performance_computing/computer_simulations/images/laurin-steffens-IVGZ6NsmyBI-unsplash.jpg new file mode 100644 index 00000000..7abc5c74 Binary files /dev/null and b/high_performance_computing/computer_simulations/images/laurin-steffens-IVGZ6NsmyBI-unsplash.jpg differ diff --git a/high_performance_computing/computer_simulations/images/physics-3871216_640.jpg b/high_performance_computing/computer_simulations/images/physics-3871216_640.jpg new file mode 100644 index 00000000..6731941b Binary files /dev/null and b/high_performance_computing/computer_simulations/images/physics-3871216_640.jpg differ diff --git a/high_performance_computing/computer_simulations/images/shubham-dhage-fmCr42xCLtk-unsplash.jpg b/high_performance_computing/computer_simulations/images/shubham-dhage-fmCr42xCLtk-unsplash.jpg new file mode 100644 index 00000000..f43a398a Binary files /dev/null and b/high_performance_computing/computer_simulations/images/shubham-dhage-fmCr42xCLtk-unsplash.jpg differ diff --git a/high_performance_computing/computer_simulations/images/steven-lelham-atSaEOeE8Nk-unsplash.jpg b/high_performance_computing/computer_simulations/images/steven-lelham-atSaEOeE8Nk-unsplash.jpg new file mode 100644 index 00000000..bf6e79d3 Binary files /dev/null and b/high_performance_computing/computer_simulations/images/steven-lelham-atSaEOeE8Nk-unsplash.jpg differ diff --git a/high_performance_computing/computer_simulations/images/tracking_superstorm.jpg b/high_performance_computing/computer_simulations/images/tracking_superstorm.jpg new file mode 100644 index 
00000000..93d740c0 Binary files /dev/null and b/high_performance_computing/computer_simulations/images/tracking_superstorm.jpg differ diff --git a/high_performance_computing/computer_simulations/images/verne-ho-0LAJfSNa-xQ-unsplash.jpg b/high_performance_computing/computer_simulations/images/verne-ho-0LAJfSNa-xQ-unsplash.jpg new file mode 100644 index 00000000..5700471b Binary files /dev/null and b/high_performance_computing/computer_simulations/images/verne-ho-0LAJfSNa-xQ-unsplash.jpg differ diff --git a/high_performance_computing/computer_simulations/index.md b/high_performance_computing/computer_simulations/index.md new file mode 100644 index 00000000..2363d304 --- /dev/null +++ b/high_performance_computing/computer_simulations/index.md @@ -0,0 +1,32 @@ +--- +name: Computer Simulations +id: computer_simulations +dependsOn: [ + high_performance_computing.parallel_computing, +] +files: [ + 00_practical.md, + 01_intro.md, + 02_weather_simulations.md, + 03_towards_future.md, +] +summary: | + This module introduces computer simulations, using a number of examples, which are used to explore the behaviour + of a real-world system represented as a mathematical model. + +--- + +In this video David will give a brief description of what awaits you in this module about computer simulations. + +# Welcome to Part 4 + +::::iframe{id="kaltura_player" width="700" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_uo9lyoxr&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_h70phwce" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Welcome_to_Computer_Simulations"} +:::: + +:::solution{title="Transcript"} +0:12 - From the first three weeks, you should now have a good understanding of what supercomputers are, how they’re built, and how they’re programmed. However, other than the traffic simulation, we haven’t covered how computers are used to simulate the real world. This week we’ll cover the basic concepts of computer simulation– the methods, the approximations, and the end-to-end process from inputting the initial data to visualising the final results. We’re going to use weather forecasting as a key example. Now it’s a field that’s always been at the forefront of computer simulation, and it’s an area where supercomputers are absolutely central in letting you know in advance whether you should have an outdoor barbecue tomorrow, or order pizza and eat inside. + +0:54 - We won’t go into the details of parallelisation, but from what you’ve learned so far, you should be able to start to think about how these simulations can be broken down into many separate tasks, and then mapped onto a large parallel supercomputer. 
+:::
+
+In the previous sections we talked about the hardware of supercomputers and how to program them; in this part we will focus on computer simulations. We will use weather simulations to illustrate the key concepts.
diff --git a/high_performance_computing/parallel_computers/01_basics.md b/high_performance_computing/parallel_computers/01_basics.md
new file mode 100644
index 00000000..e6289967
--- /dev/null
+++ b/high_performance_computing/parallel_computers/01_basics.md
@@ -0,0 +1,376 @@
+---
+name: Parallelism in Everyday Computers
+dependsOn: [
+  high_performance_computing.supercomputing.03_supercomputing_world
+]
+tags: [foundation]
+attribution:
+  - citation: >
+      "Introduction to HPC" course by EPCC.
+      This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC.
+    url: https://epcced.github.io/Intro-to-HPC/
+    image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg
+    license: CC-BY-4.0
+---
+
+![Photo of laptop motherboard](images/alexandre-debieve-FO7JIlwjOtU-unsplash.jpg)
+*Image courtesy of [Alexandre Debieve](https://unsplash.com/@alexkixa) from [Unsplash](https://unsplash.com)*
+
+## Computer Basics
+
+Before we look at how supercomputers are built, it's worth recapping what we learned previously about how a standard home computer or laptop works.
+
+Things have become slightly more complicated in the past decade, so for a short while let's pretend we are back in 2005 (notable events from 2005, at least from a UK point of view, include Microsoft founder Bill Gates receiving an honorary knighthood and the BBC relaunching Dr Who after a gap of more than a quarter of a century).
+
+In 2005, a personal computer typically had three main components:
+
+1. A single processor for performing calculations.
+1. Random Access Memory (RAM) for temporary data storage.
+1. A hard disk for long-term storage of programs and files.
+
+![Diagram of relationship between processor, memory and disk](images/hero_9090d93c-0a48-4a33-8ed4-3b8fc6acf6cf.png)
+
+For our purposes, the configuration of memory is the most critical aspect, so we'll set aside the hard disk for now.
+
+### The Rise of Multicore Processors
+
+For three decades leading up to 2005, Moore's Law ensured that processors became exponentially faster, primarily due to increasing CPU clock speeds. However, around 2005, clock speed growth plateaued at around 2 GHz.
+
+The reason was simple: the amount of electrical power required to run processors at these speeds had become so large that they were becoming too hot for the domestic market (could not be cooled by a simple fan) and too expensive to run for the commercial market (large electricity bills and expensive cooling infrastructure). So, around 2005, the application of Moore's Law changed: rather than using twice as many transistors to build a new, more complicated CPU with twice the frequency, manufacturers started to put two of the old CPUs on the same silicon chip - this is called a dual-core CPU.
+
+The trend continued with four CPUs on a chip, then more… Generically, they are called multicore CPUs, although for very large numbers the term manycore CPU is now commonplace.
+
+:::callout{variant="info"}
+With multicore processors, terminology can be confusing. When we refer to a "processor" or "CPU", it's not always clear whether we mean the physical chip (which houses multiple processing units) or the individual processing units within.
+
+To avoid confusion in this course:
+
+- CPU-core refers to each individual processing unit within a chip.
+- CPU or processor refers to the entire multicore chip.
+
+So, a quad-core CPU (or quad-core processor) has four CPU-cores.
+:::
+
+We now have two complementary ways of building a parallel computer:
+
+- Shared-memory architecture: build a single multicore computer using a processor with dozens of CPU-cores.
+- Distributed-memory architecture: connect multiple individual computers, each with its own processor and memory, via a high-speed network.
+
+We will now explore these approaches in detail.
+
+:::callout{variant="discussion"}
+What do you think the main differences between these two approaches are? Can you think of any advantages and/or disadvantages for both of them?
+:::
+
+---
+
+![Photo of two people writing on a small whiteboard](images/kaleidico-7lryofJ0H9s-unsplash.jpg)
+*Image courtesy of [Kaleidico](https://unsplash.com/@kaleidico) from [Unsplash](https://unsplash.com)*
+
+## Shared Memory Architecture
+
+The fundamental feature of a shared-memory computer is that all the CPU-cores are connected to the same piece of memory.
+
+![Diagram depicting multiple CPU cores connected to memory via a memory bus](images/hero_55c8a23e-686f-42a9-b7e9-de0a12208486.jpg)
+
+This is achieved by having a memory bus that takes requests for data from multiple sources (here, each of the four separate CPU-cores) and fetches the data from a single piece of memory. The term bus apparently comes from the Latin *omnibus*, meaning *for all*, indicating that it is a single resource shared by many CPU-cores.
+
+This is the basic architecture of a modern mobile phone, laptop or desktop PC. If you buy a system with a quad-core processor and 4 GBytes of RAM, each of the 4 CPU-cores will be connected to the same 4 GBytes of RAM, and they'll therefore have to play nicely and share the memory fairly between each other.
+
+A good analogy here is to think of four office-mates or workers (the CPU-cores) sharing a single office (the computer) with a single whiteboard (the memory). Each worker has their own set of whiteboard pens and an eraser, but they are not allowed to talk to each other: they can only communicate by writing to and reading from the whiteboard.
+
+Later in this module, we'll explore strategies for leveraging this shared whiteboard to enable efficient cooperation among the workers. However, this analogy already illustrates two key limitations of this approach:
+
+1. **memory capacity**: there is a limit to the size of the whiteboard that you can fit into an office, i.e. there is a limit to the amount of memory you can put into a single shared-memory computer;
+1. **memory access speed**: imagine that there were ten people in the same office - although they can in principle all read and write to the whiteboard, there's simply not enough room for more than around four of them to do so at the same time, as they start to get in each other's way.
+Although you can fill the office with more and more workers, their productivity will stall after about 4 workers: as contention for the shared memory bus increases, a bottleneck is created.
+
+### Limitations
+
+It turns out that memory access speed is a real issue in shared-memory machines.
+If you look at the processor diagram above, you'll see that all the CPU-cores share the same bus: the connection between the bus and the memory becomes a bottleneck, limiting the number of CPU-cores that can efficiently utilise the shared memory.
+Coupled with the fact that the programs we run on supercomputers tend to read and write large quantities of data, memory access speed often becomes the primary factor limiting calculation speed, outweighing the importance of the CPU-cores' floating-point performance.
+
+Several strategies have been developed to mitigate these challenges, but the overcrowded office analogy highlights the inherent difficulties when scaling to hundreds of thousands of CPU-cores.
+
+:::callout{variant="discussion"}
+Despite its limitations, shared-memory architectures are universal in modern processors. What do you think the advantages are?
+
+Think of owning one quad-core laptop compared to two dual-core laptops - which is more useful to you and why?
+:::
+
+---
+
+![Photo of abacus](images/oleksii-piekhov-IflQrze1wMM-unsplash.jpg)
+*Image courtesy of [Oleksii Piekhov](https://unsplash.com/@opiekhov) from [Unsplash](https://unsplash.com)*
+
+## Simple Parallel Calculation
+
+We can investigate a very simple example of how we might use multiple CPU-cores by returning to the calculation we encountered in the first module: computing the average income of the entire world's population.
+
+If we're a bit less ambitious and think about several hundred people rather than several billion, we can imagine that all the individual salaries are already written on the shared whiteboard. Let's imagine that the whiteboard is just large enough to fit 80 individual salaries. Think about the following:
+
+- how could four workers cooperate to add up the salaries faster than a single worker (one possibility is sketched below)?
+- using the estimates of how fast a human is from the previous module, how long would a single worker take to add up all the salaries?
+- how long would 4 workers take for the same number of salaries?
+- how long would 8 workers take (you can ignore the issue of overcrowding)?
+- would you expect to get exactly the same answer as before?
+
+We'll revisit this problem in much more detail later, but you know enough already to start thinking about the fundamental issues.
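+Just to make the idea tangible - and deliberately skipping ahead - here is a minimal sketch in C with OpenMP of how four "workers" might share the sum. This is our own illustration, not code from the course exercises:
+
+```c
+/* Each worker (thread) adds up a share of the salaries; the
+   partial totals are then combined into a single result. */
+double salarysum(double salarylist[], int npeople)
+{
+  double total = 0.0;
+
+  #pragma omp parallel for reduction(+:total)
+  for (int i = 0; i < npeople; i++)
+  {
+    total += salarylist[i];
+  }
+
+  return total;
+}
+```
+
+The last bullet above is not an idle question: floating-point addition is not exactly associative, so adding the salaries in a different order can change the final result ever so slightly.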
+
+---
+
+![Photo of silicon wafer containing many processor chips](images/laura-ockel-qOx9KsvpqcM-unsplash.jpg)
+*Image courtesy of [Laura Ockel](https://unsplash.com/@viazavier) from [Unsplash](https://unsplash.com)*
+
+## Who needs a multicore laptop?
+
+We've motivated the need for many CPU-cores in terms of the need to build more powerful computers in an era when the CPU-cores themselves aren't getting any faster. Although this argument makes sense for the world's largest supercomputers, we now have multicore laptops and mobile phones - why do we need them?
+
+You might think the answer is obvious: surely two CPU-cores will run my computer program twice as fast as a single CPU-core? Well, it may not be apparent until we cover how to parallelise a calculation later, but it turns out that this is not the case. It usually requires manual intervention to enable a computer program to take advantage of multiple CPU-cores. Although this is possible to do, such parallelisation was certainly not the norm back in 2005 when multicore CPUs first became commonplace.
+
+What advantages do multicore processors offer to users running programs that don't utilise parallel computing? Such programs, operating on a single CPU-core, are called serial programs.
+
+### Operating Systems
+
+As a user, you don't directly assign programs to specific CPU-cores.
+The Operating System (OS) acts as an intermediary between you and the hardware, managing access to CPU-cores, memory, and other components.
+There are several common OSs around today - e.g. Windows, macOS, Linux and Android - but they all perform the same basic function: you ask the OS to execute a program, and a component of the OS called the scheduler manages when and on which CPU-core the program is executed.
+
+![Diagram of user in relation to computer containing an operating system, processor and memory](images/hero_6d93ece3-84b2-495f-b5c5-0e0f652196ea.png)
+
+This enables even a single CPU-core machine to appear to be doing more than one thing at once - it will seem to be running dozens of programs at the same time.
+What is actually happening is that the OS runs one program, say, for a hundredth of a second, then stops that program and runs another one for a hundredth of a second, etc.
+Just like an animation made up of many individual frames, this gives the illusion of continuous motion.
+
+### How the OS exploits many CPU-cores
+
+On a shared-memory computer, the important point is that all the CPU-cores are under the control of a single OS (meaning you don't need to buy 4 Windows licences for your quad-core laptop!). This means that your computer can genuinely run more than one program at the same time. It's a bit more complicated for the OS - it has to decide not just which programs to run but also where to run them - but a good OS performs a juggling act to keep all the CPU-cores busy.
+
+![User in relation to computer, containing operating system, multiple cores and memory](images/hero_4a65543e-9635-4624-9811-5da1a0ab431e.png)
+
+This means that you can run a web browser, listen to music, edit a document and run a spreadsheet all at the same time without these different programs slowing each other down.
+With shared memory, the OS can pause a program on CPU-core 1 and resume it later on CPU-core 3, as all CPU-cores can access the same shared memory.
+This allows seamless task switching.
+
+A shared-memory computer looks like a more powerful single-core computer: it operates like a single computer because it has a single OS, which fundamentally relies on all the CPU-cores being able to access the same memory. It is this flexibility that makes multicore shared-memory systems so useful.
+
+So, for home use, the Operating System does everything for us, running many separate programs at the same time.
+In supercomputing, the goal is to accelerate a single program rather than running multiple tasks simultaneously.
+Achieving this requires effort beyond what the OS can provide.
+
+:::callout{variant="discussion"}
+In your opinion, what are the downsides of this more advanced 'single-core computer' approach?
+:::
+
+---
+
+## How does your laptop use multiple CPU-cores?
+ +::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_3g4n1c0n&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_vf0ln82e" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Laptop_Multiple_CPU-cores_hd"} +:::: + +:::solution{title="Transcript"} +0:15 - This short video is just a screencast to capture a session I’m running. And it’s really just to illustrate this diagram, here. We’ve seen here that the way that a shared memory machine works is that you have a single block of memory, and multiple CPU cores connected to that memory. But I’m really interested, here in the role that the operating system plays. And we’ve seen here that the user sits outside this bubble, here, and really just asks the operating system to run programs, run applications, and it’s the operating system that schedules these programs onto the different CPU cores. + +0:47 - So, I thought it would be quite nice just to take a real example, run it on my machine, and we can just see how it works in practice. So, what I’m going to have to do, is I’m going to have to close down a lot of my applications, just to make sure that I have the minimum of activity going on in the background. So, I’ll close my web browser. I’ll even turn off the networking, so we have the minimum of interference. Now, I’m running here on a Linux laptop, running Ubuntu, but you get very similar effects on any system. So, the first thing I’m going to show is, I have a performance monitor here. + +1:22 - And, what we can see here, is that this monitor has different schematics for what’s going on. But we have four CPUs here– CPU one, CPU two, CPU three, and CPU four– and they’re coloured different colours. Now, although there’s nothing going on, you might wonder, why is there so much activity. This is running at about 20%. Well, that’s because the screen recording software, the screen grabbing software I’m running is actually taking up significant amounts of CPU. So, we do have some background rate there. + +1:54 - So, what I’ve done, is, I’ve written a program which adds up various salaries to work out a total income. Now, what I’m doing here is, I’m actually adding up 1,000 salaries. Now, from the back of the envelope calculation we did last week, where we thought each cycle would take about a nanosecond, that means that 1,000 cycles adding up 1,000 salaries, is going to take about a millionth of a second. Now that’s clearly too short a time to measure. So, I’m actually repeating this calculation 10 million times. 
And if we work that out– 10 million times a millionth of a second– we expect this calculation should take about 10 seconds. That will be our back of the envelope estimate. + +2:30 - And, just a quick clarification– although this says CPU one, CPU two, CPU three, CPU four, in our terminology it would be CPU-cores. I would call this a single CPU with four cores here. So, let’s just run the program, which as I said, does this calculation of adding up 1,000 salaries, repeating it 10 million times, and see what our load monitor shows us. So, I’ve run it. Very quickly we should see– yes. The blue CPU, here, is taking up the slack. Oh, but it’s quickly being replaced by the orange CPU. The orange CPU here is running 100%, and, now it’s switched to another CPU and it’s gone down to zero. So, that’s a very, very interesting graph which illustrates two things. + +3:12 - First of all, that this program can only run on one CPU at a time, but– one CPU core at a time– but the operating system has decided to move it. So it started out running this program on the blue CPU, CPU four, and then it moved it to the orange CPU. But, you can see that, overall, the time was about 10 seconds, as we expected. So, that does show that the operating system does actually move these programs around. A single process, which is what this program is, can only run on one CPU core at once, but the operating system can decide to move it around. + +3:47 - Now, let’s see what happens if we run three of these programs at once. So, I’ve got multiple copies of my program, income 1K. I’ll run income 1K number one, number two, and number three, I’ll run them all at once. I can run them all at once. And let’s see what the load monitor does. I’m running them. And, almost immediately, we see that the CPUs become very, very heavily loaded. Now, in fact, almost all the CPUs become heavily loaded, because although I’m only running three copies of the program, my income program, we are also running the screen grabbing software, so it’s kind of shuffled off to the final CPU. + +4:19 - But, we can see there, that what happened was– there were two interesting things to note. One was, that all these CPUs were active at the same time, but also that the calculation still took about 10 seconds. So, what the operating system was able to do, was it was able to run three copies of the same program, at the same time, by putting them on different CPUs. + +4:43 - So, now you might ask, what happens if I run more applications, more programs, or more processes, than there are physical CPU-cores. So, here you can see I’ve actually got six copies of the program, and I’m going to run them all at once. And let’s see what happens. + +4:59 - So, you’ll see almost immediately the CPU load jumps up, and all the four CPU-cores are very, very heavily loaded. So, this looks similar to what we had before. But there’s a subtle difference– that these CPU-cores are running more than one process. Not only is the operating system scheduling processes to different CPU-cores, it’s swapping them in and out, on the same CPU-core. And, the effect of that is, the calculation no longer takes 10 seconds, it takes more than 10 seconds. Because, each of these applications, each of these processes, is having to time share on the CPU-core. And, as we see here, it takes almost twice as long. It takes about 20 seconds, which is what you might have expected. 
+
+5:35 - So, that is quite interesting, that although the CPU, the processor, can do more than one thing at once, if there are four cores, and there are more than four programs to run, it can’t run them all at the same time. It has to time slice them in and out. And the main thing, here, is that we see that processes interact with each other, they affect each other, and it slows the runtime down. So, we’re able to run three of these programs in 10 seconds, which is the same time that one of them took, but six took about 20 seconds.
+
+6:02 - And, the reason I’m talking about three and six and not four and eight, when I have four CPU-cores, is because I’m trying to leave one of the CPU-cores free to run the screen grabbing software, which seems to be taking up about the whole of one CPU-core equivalent.
+:::
+
+This video shows a simple demo to illustrate how modern operating systems take advantage of many CPU-cores.
+
+Watch what happens when David runs multiple copies of a simple income calculation program on his quad-core laptop. Do you find this behaviour surprising?
+
+![User in relation to computer, containing operating system, multiple cores and memory](images/hero_4a65543e-9635-4624-9811-5da1a0ab431e.png)
+
+Note that running multiple instances of our toy program simultaneously does not save time.
+Each instance runs independently, producing identical results in approximately the same duration.
+This demo illustrates how an operating system handles execution on multiple CPU-cores, but is otherwise a waste of resources.
+
+Can you think of a situation in which this kind of execution may be useful?
+
+We haven’t really explained what the concept of minimum interference is about - think of David closing down his browser before running his code - but can you think of a reason why it may be important to isolate your program as much as possible, especially when running on a supercomputer? What are the implications of not doing this?
+
+If you are interested, here is the function that David actually timed.
+The function is written in C and is provided purely for reference.
+It is not intended to be compiled or executed as it is.
+
+```c
+// Add up a given number of salaries to compute total income.
+// Use floating-point numbers to better represent real calculations.
+
+double salarysum(double salarylist[], int npeople)
+{
+  double total;
+  int i;
+
+  total = 0.0;
+
+  for (i=0; i < npeople; i++)
+  {
+    total = total + salarylist[i];
+  }
+
+  return total;
+}
+```
+
+David: I re-ran the same studies covered in the video but with almost all other tasks disabled; for example, I did not run the graphical performance monitor, which allowed me to have access to all four CPU-cores. Here are the results.
+
+| dataset | #copies | runtime (seconds) |
+| ------- | ------- | ----------------- |
+| small   | 1       | 9.7               |
+| small   | 4       | 11.1              |
+| small   | 8       | 22.2              |
+
+---
+
+![Person writing on whiteboard](images/jeswin-thomas-2Q3Ivd-HsaM-unsplash.jpg)
+*Image courtesy of [Jeswin Thomas](https://unsplash.com/@jeswinthomas) from [Unsplash](https://unsplash.com)*
+
+## Memory Caches
+
+We mentioned before that memory access speeds are a real issue in supercomputing, and adding more and more CPU-cores to the same memory bus just makes the contention even worse.
+
+The standard solution is to have a memory cache: a small amount of very fast scratch memory on every CPU-core, which allows the core to access frequently used data much more quickly than from main memory.
However, it is also quite small - well under a megabyte, less than a thousandth of the total memory - so how can it help us?
+Think of the analogy with many workers sharing an office: the obvious solution to avoid always queueing up to access the shared whiteboard is to take a temporary copy of what you are working on.
+When you need to read data from the whiteboard, you copy the necessary data into your notebook and work independently, reducing contention for the shared resource.
+
+This works very well for a single worker: you can work entirely from your personal notebook for long periods, and then transfer any updated results to the whiteboard before moving on to the next piece of work.
+It can also work very well for multiple workers as long as they only ever read data.
+
+### Writing data
+
+Unfortunately, real programs also write data, meaning workers need to update the shared whiteboard. If two people are working on the same data at the same time, we have a problem: if one worker changes some numbers in their notebook then the other worker needs to know about it. Whenever you alter a number, you must inform the other workers, for example:
+
+"I’ve just changed the entry for the 231st salary - if you have a copy of it then you’ll need to get the new value from me!"
+
+Although this could work for a small number of workers, it clearly has problems of scalability.
+Imagine 100 workers: whenever you change a number you have to let 99 other people know about it, which wastes time.
+Even worse, you have to be continually listening for updates from 99 other workers instead of concentrating on doing your own calculation.
+
+This is the fundamental dilemma: memory access is so slow that we need small, fast caches so we can access data as fast as we can process it. However, whenever we write data there is an overhead which grows with the number of CPU-cores and will eventually make everything slow down again.
+
+This process of ensuring consistent and up-to-date data across all CPU-cores is called cache coherency, a critical challenge in multicore processor design.
+It ensures we always have up-to-date values in our notebook (or, at the very least, that we know when our notebook is out of date and we must return to the whiteboard).
+
+![Diagram of processors with memory caches between them and the memory (or memory bus)](images/hero_f158c8fd-2092-4272-a9dc-e4806b44f9cc.png)
+
+Keeping all the caches coherent when we write data is the major challenge.
+
+:::callout{variant="discussion"}
+What do you think is the current state-of-the-art? How many CPU-cores do high-end processors have?
+:::
+
+---
+
+## Resource Contention
+
+::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_s0oh0v7t&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_fhat1vsf" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Resource_Contention_hd"}
+::::
+
+:::solution{title="Transcript"}
+0:12 - Although it’s a very simple program, it can show some very interesting effects. Now, our key observation, was that we could run three copies of this program in the same time as if we ran one copy. In 10 seconds, we could run three copies on three CPU-cores in the same time as we ran one copy on one CPU core. However, this week we’ve been talking about the shared-memory architecture, how one of the critical features of it is that all the CPU-cores share access to the shared memory. There’s a single memory bus that they have to go through. And so, they can affect each other, they can slow each other down. Why didn’t that happen here?
+
+0:48 - Well, it didn’t happen here, because we were summing up a very, very small number of numbers. We were summing up 1,000 incomes. And they were able to all fit into the cache. And, if you remember, each CPU-core has its own cache. And, so, for most of the time, when they were running, these CPU-cores could run completely independently of each other. However, we can look at a different situation, where the CPU cores do have to access main memory, and they will interact with each other. Programs running on different CPU-cores will slow each other down. So, what I’ve done is, I’ve written a version of this program, which doesn’t add up 1,000 numbers, it adds up a million numbers.
+
+1:27 - So income 1M, I’m adding up a million numbers, here. Now, what I’m doing is I’m repeating this fewer times to get a runtime of around 10 seconds. But, what we’ll do is we’ll run this calculation once, and see what happens, see what our baseline time is. So, I run one copy of this program. Remember, it’s now adding up a million numbers, a million salaries each iteration, rather than 1,000. So, there we go. And we see the same effect that suddenly one of the CPU cores is very heavily loaded, in this case, the green CPU, CPU three. And, how long is it going to run for? Well, it’s actually switched onto CPU two, and switched onto another CPU again.
+
+2:07 - But, overall, it’s running for about– just over 10 seconds. A bit more than 10 seconds, there, it was more like, maybe about 15 seconds. I’ll maybe run it again, to try and get a cleaner run, where it doesn’t switch around so much on different CPU-cores.
Crank up again it’s on the– no, it is, it’s switching them all around. We’ll try and get another estimate of how long it takes. This is a bit nicer, it’s mainly on the orange CPU-core, there.
+
+2:35 - So, that looks like it’s taking about 15 seconds. But I said that the big difference between this calculation is because it’s reading and writing large amounts of memory– then processes running on different CPU cores will affect each other, because they will be accessing the same memory bus. So if I now run three copies of this program, I would expect that, although we’ll see the same effect but all three CPU-cores will be 100% loaded, we would expect the total runtime to increase beyond the 15 seconds. So let’s run that, and see what happens.
+
+3:09 - And, now bang, we’re up to 100% CPU. But, unlike the case where we’re accessing small amounts of memory, which sits in cache, we expect these to contend with each other. And, you can see yes, it is taking significantly longer than 15 seconds. Although you might think, in principle, these calculations are completely independent, they’re interacting by accessing the shared memory. And it looks like, in fact, they’re completely blocking each other out.
+
+3:43 - And there, it took about 35 seconds. So, over twice the time to do the same calculation. So, you can see it’s absolutely clear from this simple experimental result, that, even on my laptop, this shared memory bus– which mediates the memory transactions from the CPU-core to the physical memory– is not capable of sustaining all the traffic from three very, very simple programs running at once.
+:::
+
+This video shows a simple demo to illustrate what happens when multiple cores try to use the same resources at the same time.
+
+As mentioned earlier, resource contention occurs when multiple CPU-cores attempt to access the same resources, such as memory, disk storage, or network buses.
+Here we look at memory access.
+
+Watch what happens when three copies of a larger income calculation program are running on three CPU-cores at the same time. Is this what you expected?
+
+Keep in mind that the CPU-cores are affecting each other not by exchanging data, but because they compete for the same data in memory.
+In other words, the CPU-cores do not collaborate with each other, i.e. they do not share the total work amongst themselves.
+
+Please note that the larger calculation processes 100 million salaries, not 1 million as mistakenly mentioned in the video. — David
+
+For Step 2.6, the calculations were re-run with the graphical monitor turned off, allowing access to all 4 CPU-cores.
+Here are the timings for this large dataset, with the small dataset results included for comparison.
+
+| dataset | #copies | runtime (seconds) |
+| ------- | ------- | ----------------- |
+| small   | 1       | 9.7               |
+| small   | 4       | 11.1              |
+| small   | 8       | 22.2              |
+| large   | 1       | 10.7              |
+| large   | 4       | 28.5              |
+| large   | 8       | 57.0              |
+
+---
+
+## Terminology Quiz
+
+::::challenge{id=pc_basics.1 title="Parallel Computers Q1"}
+A system built from a single multicore processor (perhaps with a few tens of CPU-cores) is an example of the ____ ____
+architecture, whereas a system composed of many separate processors connected via a high-speed network is referred to as the
+____ ____ architecture.
+
+:::solution
+
+1) shared memory
+
+2) distributed memory
+
+:::
+::::
+
+::::challenge{id=pc_basics.2 title="Parallel Computers Q2"}
+The two main limitations of the shared-memory architecture are: memory ____
+and memory ____ ____.
A hierarchical memory structure is used to improve memory access speeds.
+The smallest but also the fastest memory is called ____ memory.
+And keeping the data consistent and up-to-date on all the CPU-cores is called ____ ____.
+
+:::solution
+
+1) capacity
+
+2) access speed
+
+3) cache
+
+4) cache coherency
+
+:::
+::::
+
+::::challenge{id=pc_basics.3 title="Parallel Computers Q3"}
+The situation when multiple CPU-cores try to use the same resources, e.g. memory, disk storage or network buses, is called ____ ____.
+
+:::solution
+
+1) resource contention
+
+:::
+::::
diff --git a/high_performance_computing/parallel_computers/02_connecting.md b/high_performance_computing/parallel_computers/02_connecting.md
new file mode 100644
index 00000000..4b22d20f
--- /dev/null
+++ b/high_performance_computing/parallel_computers/02_connecting.md
@@ -0,0 +1,331 @@
+---
+name: Connecting Multiple Computers
+dependsOn: [
+  high_performance_computing.parallel_computers.01_basics
+]
+tags: [foundation]
+attribution:
+  - citation: >
+      "Introduction to HPC" course by EPCC.
+      This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC.
+    url: https://epcced.github.io/Intro-to-HPC/
+    image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg
+    license: CC-BY-4.0
+---
+
+![People on laptops sat around a desk](images/helena-lopes-2MBtXGq4Pfs-unsplash.jpg)
+*Image courtesy of [Helena Lopes](https://unsplash.com/@wildlittlethingsphoto) from [Unsplash](https://unsplash.com)*
+
+## Distributed Memory Architecture
+
+Because of the difficulty of having very large numbers of CPU-cores in a single shared-memory computer, all of today’s supercomputers use the same basic approach to build a very large system: take lots of separate computers and connect them together with a fast network.
+
+![Diagram depicting multiple computers connected by a network](images/hero_91d652a7-98f2-49d1-85ee-62d3ff46bac6.jpg)
+
+The most important points are:
+
+- every separate computer is usually called a node
+- each node has its own memory, totally separate from all the other nodes
+- each node runs a separate copy of the operating system
+- the only way that two nodes can interact with each other is by communication over the network.
+
+For the moment, let’s ignore the complication that each computer is itself a shared-memory computer, and consider one processor per node.
+
+The office analogy can be further extended: a distributed-memory parallel computer has workers all in separate offices, each with their own personal whiteboard, who can only communicate by phoning each other.
+
+| Advantages |
+| --- |
+| The number of whiteboards (i.e. the total memory) grows as we add more offices. |
+| There is no overcrowding so every worker has easy access to a whiteboard. |
+| We can, in principle, add as many workers as we want provided the network can cope. |
+
+| Disadvantages |
+| --- |
+| If we have large amounts of data, we have to decide how to split it up across all the different offices. |
+| We need to have lots of separate copies of the operating system. |
+| It is more difficult to communicate with each other as you cannot see each other’s whiteboards so you have to make a phone call. |
+
+The second disadvantage on this list doesn’t have any direct cost implications, as almost all supercomputers use some version of the Linux OS which is free, but it does mean thousands of copies of the OS, or other installed software, need to be upgraded when updates are required.
+
+Building networks to connect many computers is significantly easier than designing shared-memory computers with a large number of CPU-cores.
+This means it is relatively straightforward to build very large supercomputers - it remains an engineering challenge, one that computer engineers excel at solving.
+
+So, if building a large distributed-memory supercomputer is relatively straightforward, does that mean we’ve cracked the problem?
+
+Well, unfortunately not. The compromises we make (many separate computers each with their own private memory) mean that the difficulties are now transferred to the software side.
+Having built a supercomputer, we now have to write a program that can take advantage of all those thousands of CPU-cores.
+This can be quite challenging in the distributed-memory model.
+
+:::callout{variant="discussion"}
+Why do you think the distributed memory architecture is common in supercomputing but is not used in your laptop?
+:::
+
+---
+
+![Two calculators](images/isawred-Mn4_KuFSpe4-unsplash.jpg)
+*Image courtesy of [iSawRed](https://unsplash.com/@isawred) from [Unsplash](https://unsplash.com)*
+
+## Simple Parallel Calculation
+
+Let’s return to the income calculation example. This time we’ll be a bit more ambitious and try and add up 800 salaries rather than 80.
+The salaries are spread across 8 whiteboards (100 on each), all in separate offices.
+
+Here we are exploiting the fact that distributed-memory architectures allow us to have a large amount of memory.
+
+If we have one worker per office, think about how you could get them all to cooperate to add up all the salaries. Consider two cases:
+
+- only one boss worker needs to know the final result;
+- all the workers need to know the final result.
+
+To minimise the communication-related costs, try to make as few phone calls as possible.
+
+---
+
+![ARCHER2 banner](images/ARCHER2.jpg)
+*© ARCHER2*
+
+## Case study of a real machine
+
+To help you understand the general concepts we have introduced, we’ll now look at a specific supercomputer. I’m going to use the UK National Supercomputer, ARCHER2, as a concrete example. As well as being a machine I’m very familiar with, it has a relatively straightforward construction and is therefore a good illustration of supercomputer hardware in general.
+
+### General
+
+ARCHER2 is an HPE Cray EX machine, built by the American supercomputer company Cray, a Hewlett Packard Enterprise company. It contains 750,080 CPU-cores and has a theoretical performance of 28 Pflop/s. It is operated by EPCC at the University of Edinburgh on behalf of EPSRC and NERC, and is the major HPC resource for UK research in engineering and in physical and environmental science.
+
+### Node design
+
+The basic processor used in ARCHER2 is the AMD Zen2 (Rome) EPYC 7742 CPU, which has a clock speed of 2.25 GHz. The nodes on ARCHER2 have 128 cores across two of the AMD processors. All the cores are under the control of a single operating system. The OS is the HPE Cray Linux Environment, which is a specialised version of SUSE Linux.
+
+### Network
+
+The complete ARCHER2 system contains 5,860 nodes, i.e. ARCHER2 is effectively almost 6,000 separate computers each running their own copy of Linux.
They are connected by the HPE Slingshot interconnect, which has a complicated hierarchical structure specifically designed for supercomputing applications. Each node has two 100 Gb/s network connections; this means each node has a network bandwidth around 2,000 times greater than what is possible over a 100 Mb/s fast broadband connection!
+
+### System performance
+
+ARCHER2 has a total of 750,080 CPU-cores: 5,860 nodes each with 128 CPU-cores. With a clock frequency of 2.25 GHz, the CPU-cores can execute 2.25 billion instructions per second. However, on a modern processor, a single instruction can perform more than one floating-point operation.
+
+For example, on ARCHER2 one instruction can perform up to four separate additions. In fact, the cores have separate units for doing additions and for doing multiplications that run in parallel. With the wind in the right direction and everything going to plan, a core can therefore perform 16 floating-point operations per cycle: eight additions and eight multiplications.
+
+This gives a peak performance of 750,080 \* 2.25 \* 16 Gflop/s = 27,002,880 Gflop/s, or about 27 Pflop/s, which is close to the 25.8 Pflop/s peak figure quoted in the top500 list.
+
+ARCHER2 comprises 23 separate cabinets, each about the height and width of a standard door, with around 32,768 CPU-cores (256 nodes) or about 60,000 virtual cores (using multi-threading) in each cabinet.
+
+![Photo of someone managing ARCHER2 system](images/hero_73afa9aa-74db-4ad2-893e-971956518bdf.jpg)
+*© EPCC*
+
+### Storage
+
+Most of the ARCHER2 nodes have 256 GByte of memory (some have 512 GByte), giving a total memory in excess of 1.5 PByte of RAM.
+
+Disk storage systems are quite complicated, but they follow the same basic approach as supercomputers themselves: connect many standard units together to create a much more powerful parallel system. ARCHER2 has over 15 PBytes of disk storage.
+
+### Power and Cooling
+
+If all the CPU-cores are fully loaded, ARCHER2 requires in excess of 4 Megawatts of power, roughly equivalent to the average consumption of around 4,000 houses.
+This is a significant amount of power; to mitigate the associated environmental impact, ARCHER2 is supplied by a 100% renewable energy contract.
+
+The ARCHER2 cabinets are cooled by water flowing through pipes, with water entering at 18°C and exiting at 29°C.
+The heated water is then cooled and re-circulated.
+When necessary the water is cooled by electrical chillers, but most of the time ARCHER2 can take advantage of the mild Scottish climate and cool the water for free simply by pumping it through external cooling towers, so saving significant amounts of energy.
+
+![Diagram of datacenter cooling](images/hero_87e2018b-86eb-4aa5-a7c4-efd271a505b2.webp)
+*© Mike Brown*
+
+![Photo of ARCHER's cooling towers](images/hero_a887d8cf-e9a0-4810-b7ab-b7a016dfc47f.webp)
+*ARCHER’s cooling towers © Mike Brown*
+
+---
+
+## Wee ARCHIE case study
+
+::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_guapr85q&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_0dan4ubd" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Wee_Archie_case_study_hd"}
+::::
+
+:::solution{title="Transcript"}
+0:11 - So in this video we’re going to talk about Wee ARCHIE. So you’ve already seen a promotional video about Wee ARCHIE, but here we’re going to go into a bit more technical detail. Now Wee ARCHIE is a machine that we’ve built at EPCC specifically for outreach events to illustrate how parallel computing, supercomputing works. And what we do is we take it to conferences and workshops and schools to try and explain the basic concepts. But here I’m going to use it as a way of explaining the kinds of things we’ve been learning this week, which is things like distributed memory computing, shared memory computing, and how they’re put together into a real computer.
+
+0:43 - Wee ARCHIE has been built to mirror the way a real supercomputer like ARCHER, our national supercomputer, is built, but it’s a smaller version and it’s also been designed with a perspex case and designed in a way that we can look inside and show you what’s going on. So it’s very useful to illustrate the kinds of concepts that we’ve been talking about this week. So the ways in which Wee ARCHIE mirrors a real supercomputer like ARCHER are that it is a distributed memory computer. It comprises a whole bunch of nodes that are connected by a network and we’ll look at them in a bit more detail later on.
+
+1:16 - Each of the nodes is a small shared-memory computer, each of which is running its own copy of the operating system and in this, just like in ARCHER, we’re running lots of copies of Linux. On Wee ARCHIE, we don’t use very high spec processors. That’s for reasons of economy and power consumption. So on Wee ARCHIE, each of the nodes is actually one of these Raspberry Pis. Now, a Raspberry Pi is a very low power and quite cheap processor, and these ones are actually small shared-memory machines. So each Raspberry Pi has actually got four cores on it. It’s a small quad-core, multi-processor machine, a small shared-memory system, running a single copy of Linux.
+
+1:57 - Now we can look in more detail at how Wee ARCHIE is constructed.
So if we look down here are the computational nodes, which is a four by four grid here. We have 16 Raspberry Pi boards, each of which, as I said, is running its own copy of Linux, and each of which is a small shared-memory processor with four CPU-cores. See here they are, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16. Again, just like on a real supercomputer, users don’t log onto these computational nodes, they log onto some front-end login system. And up here we have a couple of spare nodes.
+
+2:31 - Hardware-wise, they’re just the same as the main computational nodes, but they’re reserved for the user to log on and do various other activities, compiling code, launching programs, and doing I/O and such like. So again this is a nice analogue of how a real parallel supercomputer works. Just to look in a bit more detail at the nodes of Wee ARCHIE. You will see there are some lights here. And what we’ve done for demonstration purposes is each node of Wee ARCHIE is connected to its own little LED array. You won’t see particularly much going on at the moment, because we aren’t actually running programs on these nodes.
+
+3:01 - Just for information, what they display are things like the network activity on the node, the temperature of the node, and also the activity, how busy each of the four CPU-cores is. The way the networking works on Wee ARCHIE is that each node has a cable coming out of it, a network cable, and here we just use quite standard cheap networking. It’s called ethernet cabling. It’s the kind of thing you might have at home or in an office. And the cables come out and they connect to these switches here.
+
+3:30 - So the way that two nodes on Wee ARCHIE communicate with each other is, if they want to send a message, it goes down the cables into one of the switches and then back up the other cable into the other node. Now we’ve turned Wee ARCHIE around to look at the back to illustrate a bit more how the networking works and a bit more about how it’s all connected together. So if you can see, there are actually two cables coming out of the computational nodes. One of those is actually power, so we’re not particularly concerned about that. The other connection is a network cable.
+
+3:59 - So the way it works, is that the cables for each node come down into one of these switches. And then the switches themselves, are cabled together. And this illustrates, in quite a simple way, how these networks are hierarchical. So for example, two cores on the same node, two CPU-cores in the same shared-memory processor can communicate with each other without going over the network at all. If two nodes are connected to the same switch, they can communicate with each other by sending the signal down and then back out of the same switch.
+
+4:31 - If two nodes are connected to different switches, the message has to go down into one of the switches along to the other switch and then back to the node again. So I mentioned in the articles that the network on ARCHER, the Cray Aries network, is very complicated, has lots of different levels. But even on a very simple machine like Wee ARCHIE, you can see the network has a hierarchical structure. And so the way that two CPU-cores communicate with each other is different depending on where they are in the machine. Although it’s been primarily built as an educational tool, in fact, Wee ARCHIE has a lot of similarities, as I said, with a real supercomputer.
+
+5:06 - And just to reiterate, they’re things like having a distributed-memory architecture of different nodes, each running their own operating system. The nodes are connected by networking. And each node is actually a small shared-memory computer. The main way in which Wee ARCHIE differs from a real supercomputer, like ARCHER, is really in some of the performance characteristics. So for example, the processors aren’t as fast as you’d find on a real supercomputer, the networking isn’t as fast. The ethernet we have here is a lot slower than the dedicated Aries network we have on ARCHER. And also, of course, the sheer scale.
+
+5:41 - Here we only have 16 nodes, each of which has four CPU-cores, as opposed to thousands of nodes with tens of CPU-cores in them. And so Wee ARCHIE mirrors a real supercomputer such as ARCHER in almost every way, except for just the speed and the scale.
+:::
+
+Finally, Wee ARCHIE makes its appearance again! This video uses Wee ARCHIE to explain the parallel computer architecture concepts we've introduced.
+
+:::callout{variant="discussion"}
+It is worth emphasising that the physical distance between the nodes does impact their communication time, i.e. the further apart they are the longer it takes to send a message between them. Can you think of any reason why this behaviour may be problematic on large machines and any possible workarounds?
+
+As usual, share your thoughts with your fellow learners!
+:::
+
+For anyone interested in how Wee ARCHIE has been put together (and possibly wanting to build their own cluster), we invite you to follow the links from this blog article - Setting up your own Raspberry Pi cluster.
+
+![Photo of Wee ARCHIE](images/hero_d827be57-5840-4339-b47c-f70c0d36fcd1.jpg)
+
+![Wee ARCHIE banner](images/hero_5bca5e55-5548-4a13-8f20-f07f498cec7e.jpg)
+
+---
+
+![Photo of overly complex road junction](images/timo-volz-9Psb5Q1TLD4-unsplash.jpg)
+*Image courtesy of [Timo Volz](https://unsplash.com/@magict1911) from [Unsplash](https://unsplash.com)*
+
+## ARCHER2 - it's more complicated
+
+In the last few steps we have glossed over a few details of the processors and the network.
+
+If you look up the specifications of the AMD Zen2 (Rome) EPYC 7742 processor you will see that it has 64 CPU-cores, whereas the ARCHER2 nodes have 128 CPU-cores.
+Each node contains two physical processors, configured to share the same memory. This design makes the system appear as a single 128-core processor to the user.
+This setup, known as Non-Uniform Memory Access (NUMA) architecture, is illustrated below.
+
+![Diagram of NUMA architecture, with two sets of multicore CPUs/memory connected by a shared memory bus](images/hero_9f93cf41-f24d-4ab2-8a7e-d25a78a8089c.png)
+
+Every CPU-core can access all the memory regardless of which processor it is located on, but reading data from another CPU’s memory can involve going through an additional memory bus, making the process slower than reading from its own memory.
+Although this hardware arrangement introduces technical complexities, the key point is that the 128 CPU-cores function as a single shared-memory system, managed by one operating system.
+
+The details of the network are even more complicated, with four separate levels ranging from direct connections between the nodes packaged together on the same blade, up to fibre-optic connections between separate cabinets. If you are interested in the details see the ARCHER2 website.
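+
+If you would like to see a NUMA layout like this for yourself, standard Linux tools will report it. This is only a sketch: `lscpu` and `numactl` are common utilities (the latter may need to be installed separately), and the exact output depends entirely on the machine you run them on - a typical laptop usually reports a single NUMA region, whereas a dual-processor node reports two or more.
+
+```bash
+# Report the number of NUMA regions and which CPU-cores belong to each
+lscpu | grep -i numa
+
+# If the numactl package is installed, also show how much memory is
+# attached to each NUMA region
+numactl --hardware
+```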
+
+---
+
+## ARCHER2: building a real supercomputer
+
+::::iframe{width="100%" height="400" src="https://www.youtube.com/embed/UXHE7ljmhaQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen}
+::::
+
+Wee ARCHIE is very small and was built on someone’s desk. Real supercomputers are very large and require a lot of infrastructure to support them and manpower to build them.
+
+This time-lapse video documents the installation of the ARCHER2 system at EPCC in Edinburgh, UK. We use it to pick out various aspects of supercomputer hardware that are not so well illustrated by Wee ARCHIE.
+
+:::callout{variant="discussion"}
+Is there anything that surprised you? We are curious to know, so feel free to share your impressions by leaving a comment.
+:::
+
+---
+
+## Quiz - Processors, ARCHER2 and Wee ARCHIE
+
+::::challenge{id=pc_connecting.1 title="Connecting Parallel Computers Q1"}
+Which of these are true about a typical processor in a modern supercomputer?
+
+Select all the answers you think are correct.
+
+A) it contains a single CPU-core
+
+B) it contains many separate CPU-cores
+
+C) it is a special processor, custom-designed for supercomputing
+
+D) it is basically the same processor you would find in a high-end PC or compute server
+
+:::solution
+B) and D)
+
+That’s right - today almost all processors have multiple CPU-cores.
+
+Correct - the leading-edge CPU designs of today are those produced for general-purpose computing because the massive market for home and business computing means that billions of dollars can be invested in R&D.
+:::
+::::
+
+::::challenge{id=pc_connecting.2 title="Connecting Parallel Computers Q2"}
+How are the CPU-cores attached to the memory in a modern multicore processor?
+
+Select all the answers you think are correct.
+
+A) the memory is physically sliced up between them
+
+B) the memory is shared between all the CPU-cores
+
+C) cores share access to the memory so they can sometimes slow each other down
+
+D) each core can access the memory completely unaffected by the other cores
+
+:::solution
+B) and C)
+
+The distinction between shared and distributed memory is one of the most fundamental concepts in parallel computer architecture.
+
+Yes - a modern multicore processor is a small shared-memory parallel computer.
+
+Yes - you’ve correctly identified one of the challenges of shared memory: contention for a single shared memory bus.
+:::
+::::
+
+::::challenge{id=pc_connecting.3 title="Connecting Parallel Computers Q3"}
+Like almost all supercomputers, ARCHER2 is constructed as a series of separate cabinets (23 in the case of ARCHER2), each standing about as high and wide as a standard door. Why do you think this size of cabinet is chosen?
+
+A) it is the minimum size that can be cooled effectively
+
+B) it is the maximum size that can be run from a single power supply
+
+C) any larger and the cabinets would not fit through the doors of the computer room
+
+D) freight companies will not ship anything larger than this
+
+:::solution
+C)
+
+Even high-tech supercomputing is influenced by everyday issues!
+
+Spot on! It’s the age-old problem of trying to get a grand piano into a high-rise apartment - if it’s too big or heavy to fit through the door then things get really complicated.
+:::
+::::
+
+::::challenge{id=pc_connecting.4 title="Connecting Parallel Computers Q4"}
+How are ARCHER2’s 750,080 cores arranged?
+
+A) as one large shared-memory system
+
+B) as 750,080 separate nodes
+
+C) as 5,860 nodes each with 128 cores
+
+D) as 11,720 nodes each with 64 cores
+
+:::solution
+C)
+
+It is essential to understand the way shared and distributed memory computing are combined in a single supercomputer - this is now universal across every system.
+
+That’s correct - although the numbers will vary from system to system, a modern supercomputer comprises thousands of nodes each with many CPU-cores.
+:::
+::::
+
+::::challenge{id=pc_connecting.5 title="Connecting Parallel Computers Q5"}
+Which of these features make the UK National Supercomputer ARCHER2 different from the toy system Wee ARCHIE?
+
+Select all the answers you think are correct.
+
+A) it has multicore CPUs
+
+B) it has multiple nodes connected by a network
+
+C) it runs the Linux operating system
+
+D) it has a much faster network
+
+E) it has many more CPU-cores
+
+:::solution
+D) and E)
+
+Wee ARCHIE is dwarfed in practice by a real supercomputer such as ARCHER, but how different are they in principle? Is a formula-1 racing car fundamentally different from a Volkswagen Beetle?
+
+Correct - we spend money on very fast supercomputer networks to ensure the minimum of delay when different nodes communicate with each other.
+
+That’s right - ARCHER2 has almost 12,000 times as many CPU-cores as Wee ARCHIE, so it is built on a much larger scale.
+:::
+::::
diff --git a/high_performance_computing/parallel_computers/03_comparison.md b/high_performance_computing/parallel_computers/03_comparison.md
new file mode 100644
index 00000000..519a791a
--- /dev/null
+++ b/high_performance_computing/parallel_computers/03_comparison.md
@@ -0,0 +1,268 @@
+---
+name: Comparing the Two Approaches
+dependsOn: [
+  high_performance_computing.parallel_computers.02_connecting
+]
+tags: [foundation]
+attribution:
+  - citation: >
+      "Introduction to HPC" course by EPCC.
+      This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC.
+    url: https://epcced.github.io/Intro-to-HPC/
+    image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg
+    license: CC-BY-4.0
+---
+
+# Comparing the two approaches
+
+## Looking inside your laptop
+
+::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_n0hr4o3o&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_jmzabt32" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Inside_your_laptop_hd"}
+::::
+
+:::solution{title="Transcript"}
+0:11 - One of the things we’ve tried to emphasise when discussing supercomputer hardware is the way that supercomputers are very much built from commodity standard components– fairly standard– reasonably high end– but fairly standard processors and memory, but very large numbers of them. And so a supercomputer gets its power from having very large numbers of CPU-cores. However, it’s clear that the way that a supercomputer is packaged and put together must be very different from the way a commodity item, like a laptop, is put together.
+
+0:39 - So what I’m going to do in this video is to deconstruct a laptop and also show you the inside of a board from a real supercomputer, and try to compare them so you can see what the similarities and differences between them are. So what I have here is a fairly standard laptop from the mid to late 2000s. And I’m going to take it apart, deconstruct it, to show you how it’s put together and kind of contrast the pieces that make up a standard laptop with a board from a supercomputer. So I would advise you, don’t do this at home, because it will destroy your laptop. But I’ve done this with a fairly old machine, so hopefully it’s OK!
+
+1:12 - So the first thing is we don’t need the screen, OK? We’re not going to need the screen, so we can discard the screen. Secondly, we would not need the keyboard, OK? We don’t log into each node on a supercomputer and then type away, so we don’t need the keyboard. There’s some other packaging here. And we’re not going to need the battery.
+
+1:33 - So we’re just left with the bare board here which has all the components on it. Now we can see the components we’re going to need to retain for a supercomputer, which basically– the processor in the middle here, this is a Dual-Core Intel Processor. And actually there’s memory here as well. And the memory is actually on the other side if we turn this over– a little card here, which contains the memory. Well we can even see on the back here, there are components we really are not going to need when we have a supercomputer. There’s a wireless card interface here. There’s a disk here.
+
+1:59 - Now the nodes on a supercomputer tend to be diskless, and there’s connectors here for other peripherals. We’re not going to need them when we put this into a supercomputer. So going back to the front again, I’ve pointed out the CPU and the memory, but there’s a whole array of other components here, which are all to do with what you need to do on a laptop. You need to drive the disk. You need to drive the screen. You need to drive the keyboard. There’s a touch pad here, and there will be things like Bluetooth and all the other peripherals, which you need to connect to a laptop. None of that circuitry is going to be required when we put this into a supercomputer.
+
+2:32 - So hopefully, you can see– that although, we have the things we are going to need, the CPU and memory– there’s clearly a whole lot of other stuff going on here, which we’re not going to need when packing this up into a supercomputer.
+:::
+
+We’ve explained that the hardware building blocks of supercomputers, memory and processors, are the same as for general-purpose computers.
+
+But ARCHER looks very different from your laptop!
+In this video David deconstructs a laptop so that we can compare its packaging to the specialist design of a supercomputer.
+
+![Diagram of user in relation to computer containing an operating system, multicore processor and memory](images/hero_4a65543e-9635-4624-9811-5da1a0ab431e.png)
+
+---
+
+## How similar is your laptop to a node of a supercomputer?
+
+::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_6cbiby53&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_izkgvgep" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Laptop_V_node_hd"}
+::::
+
+:::solution{title="Transcript"}
+0:11 - Having looked at the hardware that makes up a standard laptop, we’re now going to look at the equivalent hardware for a supercomputer. So this board comes from HECToR, which was actually the predecessor to ARCHER. HECToR was the UK National Supercomputer Service prior to the ARCHER Service starting. And this actually comes from a machine called a Cray XT4. So again, it’s a very similar architecture to the current ARCHER system. Just to explain some terminology, this is called a blade, which is quite heavy, quite difficult to lift up and down. And this is the unit to be inserted into the rack, into the cabinet. So you’ve seen in a typical supercomputer like ARCHER.
+
+0:45 - We have these large cabinets about the size of a door and they have racks in them. And this is the unit which we insert into the rack which contains the processors.
So I’m going to take the top off and we’re going to look at what it contains inside.
+
+1:00 - So immediately, when we take the lid off, we can see it’s much more stripped down than inside your laptop. Only the essential components have been retained. So what we have here on this single blade is we actually have four nodes, one, two, three, four. And each of these, as we’ve described, will be running its own copy of the operating system. So a single blade effectively contains four separate computers. Of course, they’re linked through a network, but they’re all running their own copy of the operating system. Here you can see the processor. It’s a very standard AMD processor from the late 2000s, and it’s actually a quad-core processor. This actually has four CPU-cores on it.
+
+1:39 - But the reason this node looks different from these other nodes is, we’ve taken off these heat sinks. The nodes are covered with these copper heat sinks to try and dissipate all the heat away. And the way that works is very different from on your laptop. Your laptop will have a small internal fan which will blow air through it. There’s no fan on this board. What happens is, externally there are large fans to blow air through the board to cool it. And these are very densely stacked, so you have to blow a lot of air through to keep them cool. You want the air to flow through very nicely.
+
+2:09 - So you’ll see, for example here, this is a small baffle on the left to try and direct the air towards the core components. And it’s these nodes, these CPUs, which get very hot. And you maybe can’t see, but there are lots and lots of thin blades of copper here, which will get hot. And then the heat would be taken away by the air flowing through. So this blade, this board, is very, very stripped down compared to a laptop. The only additional things on this board, other than the CPUs, are the memory. So you see with each node there’s a little memory card that’s attached here. The memory here is local to each of the nodes.
+
+2:43 - So as we described, the four cores on this quad-core node share all the memory, but the memory on distinct nodes is separate. You also need a network to connect all the nodes together. And that’s actually what these components here are. We haven’t removed the heat sinks, so you can’t really see them. But this is the networking technology here. The only other component of any interest on this board is this little controller here. And this is the master controller, which controls the operation of these other components. We’ve talked about the network. And you might ask, how are these physically connected together? Well, all the connections for the entire blade come in to all these pins down the side here.
+
+3:22 - So all the traffic, all the data going in and out of the board will come in through cables connected to these ports here. So in conclusion, although the core components of this supercomputing blade, which are the processor and the memory, are just the same as we saw on the laptop, there are two main differences. First of all, this is much simpler. It’s been very much stripped back. And all the extra pieces we had in the laptop to do with driving the disk and the mouse and the Bluetooth and the screen have been taken away. So it’s much more simple. Because of that, we can actually pack things together much more densely.
+
+3:58 - And here we have four complete computers all on the one board.
And we’re able to do that because we can blow air through, with the external fans, to keep the whole thing cool. We can pack things much more densely than we could in the laptop. And so this entire setup has been designed specifically for the kind of things which supercomputers do, for doing numerical computations.
+:::
+
+Next, in this video, David takes apart a compute blade from the [HECToR](http://www.hector.ac.uk/) supercomputer.
+
+Do you remember this diagram?
+
+![Diagram depicting multiple computers connected by a network](images/hero_91d652a7-98f2-49d1-85ee-62d3ff46bac6.jpg)
+
+Having watched the above video, how would you modify it to make it more accurate?
+
+![HECToR Artist’s impression of the HECToR XT4 system](images/hero_dcac5759-2efe-4f9f-a6a7-f439ef43840c.jpg)
+*© Cray Inc*
+
+![HECToR's compute blades](images/hero_cbe27959-b81d-41c1-8d00-9c7fc44d34e9.jpg)
+*HECToR's compute blades*
+
+---
+
+![Photo of balancing scales](images/piret-ilver-98MbUldcDJY-unsplash.jpg)
+*Image courtesy of [Piret Ilver](https://unsplash.com/@saltsup) from [Unsplash](https://unsplash.com)*
+
+## Shared memory vs Distributed memory
+
+We’ve seen how individual CPU-cores can be put together to form large parallel machines in two fundamentally different ways: the shared and distributed memory architectures.
+
+In the shared-memory architecture all the CPU-cores can access the same memory and are all controlled by a single operating system. Modern processors are all multicore processors, with many CPU-cores manufactured together on the same physical silicon chip.
+
+There are limitations to the shared-memory approach due to all the CPU-cores competing for access to memory over a shared bus.
+This can be alleviated to some extent by introducing memory caches or putting several processors together in a NUMA architecture, but there is no way to reach the hundreds of thousands of CPU-cores with this approach.
+
+In the distributed-memory architecture, we take many multicore computers and connect them together in a network.
+With a sufficiently fast network we can in principle extend this approach to millions of CPU-cores and beyond.
+
+Shared-memory systems are difficult to build but easy to use, and are ideal for laptops and desktops.
+
+Distributed-memory systems are easier to build but harder to use, comprising many shared-memory computers each with their own operating system and their own separate memory.
+However, this is the only feasible architecture for constructing a modern supercomputer.
+
+:::callout{variant="discussion"}
+These are the two architectures used today. Do you think there is any alternative? Will we keep using them for evermore?
+:::
+
+---
+
+![Photo of car lights at night, long exposure](images/julian-hochgesang-3-y9vq8uoxk-unsplash.jpg)
+*Image courtesy of [Julian Hochgesang](https://unsplash.com/@julianhochgesang) from [Unsplash](https://unsplash.com)*
+
+## What limits the speed of a supercomputer?
+
+When we talked about how fast modern processors are, we concentrated on the clock frequency (nowadays measured in GHz, i.e. in billions of operations per second) which grew exponentially with Moore’s law until around 2005.
+
+However, with modern distributed-memory supercomputers, two additional factors become critical:
+
+- CPU-cores are packaged together into shared-memory multicore nodes, so the performance of memory is important to us;
+- separate nodes communicate over a network, so network performance is also important.
+
+### Latency and bandwidth
+
+Understanding memory and network performance is useful in order to grasp the practical limitations of supercomputing.
+We’ll use the ARCHER2 system to give us some typical values of the two basic measures, latency and bandwidth:
+
+- **latency** is the minimum time required to initiate a data transfer, such as transferring a single byte. This overhead is incurred regardless of the amount of data being handled.
+- **bandwidth** is the rate at which large amounts of data can be transferred.
+
+A helpful analogy is to compare this to an escalator.
+The time it takes a single person to travel from the bottom to the top is its latency - around 10 seconds for one trip.
+However, this does not mean the escalator can only transport one person every ten seconds: the escalator can accommodate multiple people simultaneously, allowing several people to reach the top each second.
+This is its bandwidth.
+
+### Numbers from ARCHER2
+
+For access to memory (not cache! - access to cache is faster), the latency (delay between asking for a byte of data and reading it) is around 80 nanoseconds (80 x 10⁻⁹ or 80 billionths of a second). On ARCHER2, each node has a bandwidth of around 200 GBytes/second.
+
+These figures might sound impressive, but remember that at a clock speed of around 2.7 GHz, each CPU-core is issuing instructions roughly every 0.4 nanoseconds, so waiting for data from memory takes the equivalent of around 200 instructions!
+
+Remember also that, on ARCHER2, 128 CPU-cores are sharing access to memory so this latency can only increase due to congestion on the memory bus.
+Bandwidth is also shared, giving each CPU-core just over 3 GBytes/second on average.
+At a 2.7 GHz clock frequency, this implies that, in the worst-case scenario where all CPU-cores access memory simultaneously, each core can read or write just one byte per cycle.
+
+A simple operation such as a = b + c processes 24 bytes of memory (read b and c, write a, with each floating-point number occupying 8 bytes) so we are a long way off being able to supply the CPU-core with data at the rate it requires.
+
+In practice, cache memory significantly mitigates these issues by providing much lower latency and higher bandwidth, but back-of-the-envelope calculations, such as we have done above, do illustrate an important point about supercomputer design:
+
+**The performance of the processors in a modern supercomputer is limited by the memory bandwidth and not the clock frequency.**
+
+### The ARCHER2 interconnect
+
+ARCHER2 has a very high-performance network with the following characteristics:
+
+- a latency of around 2 microseconds (2 x 10⁻⁶ or 2 millionths of a second);
+- a bandwidth between 2 nodes of around 25 GBytes/second.
+
+With a latency of 2 microseconds corresponding to approximately 5000 instruction cycles, even ARCHER2's high-speed network introduces a significant overhead for communication.
+While the bandwidth is shared among all CPU-cores on a node, ARCHER2's thousands of separate network links collectively enable the transfer of many TBytes/second.
+
+We will see in the next module that if we are careful about how we split our calculation up amongst all the CPU-cores we can accommodate these overheads to a certain extent, enabling real programs to run effectively on tens of thousands of cores.
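+
+To get a feel for how latency and bandwidth combine, it can help to put the simple performance model time = latency + size / bandwidth into a few lines of code. This is only a back-of-the-envelope sketch using the approximate ARCHER2 network figures quoted above - it ignores contention and every other real-world effect:
+
+```c
+#include <stdio.h>
+
+// Estimate the time to send a single message of a given size,
+// using the simple model: time = latency + size / bandwidth.
+int main(void)
+{
+    double latency   = 2.0e-6;  // network latency: around 2 microseconds
+    double bandwidth = 25.0e9;  // node-to-node bandwidth: around 25 GBytes/s
+
+    // Message sizes in bytes: 1 byte, 1 KByte, 1 MByte and 1 GByte
+    double sizes[] = { 1.0, 1.0e3, 1.0e6, 1.0e9 };
+
+    for (int i = 0; i < 4; i++)
+    {
+        double time = latency + sizes[i] / bandwidth;
+        printf("%12.0f bytes: %g seconds\n", sizes[i], time);
+    }
+
+    return 0;
+}
+```
+
+For small messages the 2 microsecond latency completely dominates, while for very large messages the cost is almost entirely bandwidth. This is one reason parallel programs try to send a few large messages rather than many small ones.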
+Despite these communication overheads, it is still true that:
+
+**The maximum useful size of a modern supercomputer is limited by the performance of the network.**
+
+:::callout{variant="discussion"}
+Large internet companies like Amazon and Google also use distributed memory architectures for their computational needs. They also offer access to their machines via something known as cloud computing. Do you think Amazon and Google services have the same requirements as we do in supercomputing? What limits the performance of their computers? Are they interested in Pflops?
+:::
+
+---
+
+![Photo of someone playing a modern computer game](images/florian-olivo-Mf23RF8xArY-unsplash.jpg)
+*Image courtesy of [Florian Olivo](https://unsplash.com/@florianolv) from [Unsplash](https://unsplash.com)*
+
+## Graphics Processors
+
+When looking at the top500 list, you may have noticed that many of the world’s largest supercomputers use some kind of accelerator in addition to standard CPUs.
+A popular accelerator is a General Purpose Graphics Processing Unit, or GPGPU.
+Since we have seen how a modern multicore CPU works, we can also begin to understand the design of a GPGPU.
+
+Supercomputers have traditionally relied on general-purpose components, primarily multicore CPUs, driven by commercial demand for desktop and business computing.
+However, computer gaming is also a significant market where processor performance is critical.
+
+The massive demand for computer games hardware has driven the development of specialised processors - Graphics Processing Units (GPUs) - designed to produce high-quality 3D graphics.
+Although complex in design, a GPU can be thought of as a specialised multicore processor with a vast number of simplified cores.
+The cores can be simplified because they have been designed for a single purpose: 3D graphics.
+To render high-quality graphics at dozens of frames per second, GPUs require the ability to process massive amounts of data.
+To achieve this, they utilise specialised memory with significantly higher bandwidth than the memory typically used by CPUs.
+
+The simplified nature of each core, the much higher number of cores, and the high memory bandwidth mean that the performance, in terms of pure number crunching, of a single GPU can easily outstrip that of a CPU, at the expense of it being less adaptable.
+
+### Accelerated supercomputers
+
+Despite being developed for a different purpose, GPUs are highly suited for supercomputing: the calculations required for 3D graphics are very similar to those required for scientific simulations - large numbers of simple operations on huge quantities of floating-point numbers. GPUs are therefore:
+
+- designed for very fast floating-point calculation;
+- power-efficient due to the simple core design;
+- equipped with high memory bandwidth to keep the computational cores supplied with data.
+
+The inherently parallel architecture of GPUs, with thousands of computational cores, aligns well with the decades-long focus on parallel processing in supercomputing.
+
+Using GPUs for applications other than graphics is called General Purpose or GPGPU computing. With a relatively small amount of additional development effort, GPU manufacturers produce versions of their processors for the general purpose market.
+The supercomputing community directly benefits from the multi-billion pound research and development investments in the games market.
+

Programming a GPGPU isn’t quite as straightforward as programming a CPU, and not all applications are suitable for its specialised architecture, but one of the main areas of research in supercomputing at the moment is making GPGPUs easier to program for supercomputing applications.

:::callout{variant="discussion"}
Earlier we asked you to look at Piz Daint, which is accelerated compared to ARCHER2 by the addition of Nvidia’s GPGPUs. Use the sublist generator on the top500 page to check how many top500 systems use Nvidia accelerators. Do you see what you expected to see?
:::

---

## Terminology Recap

::::challenge{id=pc_comparison.1 title="Comparing the Two Approaches Q1"}
One of the differences between the shared and distributed memory architectures is that shared-memory systems are managed by only one
____ ____, whereas distributed memory systems have many of them (typically one per node).

:::solution
operating system
:::
::::

::::challenge{id=pc_comparison.2 title="Comparing the Two Approaches Q2"}
The two basic measures characterising the memory and network performance are: ____
and ____ .
____ is the rate at which you transfer large amounts of data.
____ is the minimum time taken to do anything at all, i.e. the time taken to transfer a single byte.

:::solution
A) bandwidth
B) latency
C) bandwidth
D) latency
:::
::::

::::challenge{id=pc_comparison.3 title="Comparing the Two Approaches Q3"}
In the distributed memory architecture, the only way that two nodes can interact with each other is by communicating over the
____. In the shared memory architecture, different CPU-cores can communicate with each other by updating the same ____ location.

:::solution
A) network
B) memory
:::
::::

---

![Photo of child playing with building blocks](images/kelly-sikkema-JRVxgAkzIsM-unsplash.jpg)
*Image courtesy of [Kelly Sikkema](https://unsplash.com/@kellysikkema) from [Unsplash](https://unsplash.com)*

## Game: Build your own supercomputer

In this game you are in charge of a supercomputer centre and you have to run lots of users’ jobs on your machine. Use your budget wisely to buy the best hardware for the job!

The main idea behind the game is to create a design of a supercomputer, balancing its components against budget and power efficiency.

Creating an efficient design may be difficult, but worry not! It’s a game after all. You are welcome to play it again at any time during the course. It may be interesting to see how your understanding of supercomputers has improved.

![Screenshot of build your own supercomputer game](images/hero_187042a6-7e25-46dd-a3f3-810c2b184e79.png)

Your design must handle jobs, and completing them earns money which can be further invested. As you progress through the levels the jobs become more complex (and lucrative), and additional components become available to include within the machine. Besides passing through the levels, you can also obtain badges that are awarded for specific achievements, such as running a green supercomputer, turning a profit, and completing a high overall number of jobs.

Follow the link to the [game](http://supercomputing.epcc.ed.ac.uk/outreach/archer_challenge/) and start playing. We recommend doing a quick walk-through first - click the ? icon on the landing page. You can summon the help menu at any point in the game by clicking on the Info icon, located in the upper right corner of the screen.

We hope you will enjoy your experience as a supercomputing facility manager. Good luck!
diff --git a/high_performance_computing/parallel_computers/04_practical.md b/high_performance_computing/parallel_computers/04_practical.md new file mode 100644 index 00000000..9a146625 --- /dev/null +++ b/high_performance_computing/parallel_computers/04_practical.md @@ -0,0 +1,352 @@ +--- +name: Shared vs Distributed Hello World +dependsOn: [ + high_performance_computing.parallel_computers.03_comparison +] +tags: [foundation] +attribution: + - citation: > + "Introduction to HPC" course by EPCC. + This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC. + url: https://epcced.github.io/Intro-to-HPC/ + image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg + license: CC-BY-4.0 +---

So far the code examples we've run have been limited to serial computation.
Building on what we've learned so far, this lesson will look at parallel computations using both shared and distributed memory approaches.
To get started, let's look at how we can compile and run parallel versions of our "Hello world" example using both shared and distributed memory frameworks,
OpenMP (Open Multi-Processing) and MPI (Message Passing Interface),
which are both heavily used in HPC applications and are covered in detail later on.

## Part 1: Shared Memory Parallelism Using OpenMP

OpenMP uses a shared memory approach to parallelism, allowing simultaneous computations to be spread over multiple threads.
These threads can be run on any number of CPU-cores.

You'll notice the code below is more complex than the original Hello world example, with the addition of compiler directives (`#pragma`) which OpenMP uses to inform the compiler how to parallelise sections of the code when it builds the executable.

Add this code to a new file `helloWorldThreaded.c`:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <limits.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    // Check input argument
    if(argc != 2)
    {
        printf("Required one argument `name`.\n");
        return 1;
    }

    // Receive argument (+1 leaves room for the terminating null character)
    char* iname = (char *)malloc(strlen(argv[1]) + 1);
    strcpy(iname,argv[1]);

    // Get the name of the node we are running on
    char hostname[HOST_NAME_MAX];
    gethostname(hostname, HOST_NAME_MAX);

    // Message from each thread on the node to the user
    #pragma omp parallel
    {
        printf("Hello %s, this is node %s responding from thread %d\n", iname, hostname,
            omp_get_thread_num());
    }

    // Release memory holding command line argument
    free(iname);

    return 0;
}
```

The code block indicated by the `#pragma omp parallel` statement will be executed by multiple threads.
By default, OpenMP creates one thread per hardware thread (logical core), which typically corresponds to one or two threads per physical core, depending on whether hyper-threading is enabled.
OpenMP also allows users to manually define how many threads should be created; a short sketch of this follows the compilation step below.

Let's compile this code now.
On ARCHER2, this looks like the following:

```bash
cc helloWorldThreaded.c -fopenmp -o hello-THRD
```

Again, on a local machine, depending on your compiler setup, you may need to use `gcc` instead of `cc`.

Here, we inform the C compiler that this is an OpenMP program using the `-fopenmp` flag.
Without it, the `#pragma` statements won't be interpreted and our program will just run within a single thread.
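As an aside, the number of threads can also be set from within the code itself rather than from the environment. The following is a minimal, self-contained sketch (compiled with `-fopenmp` as above) using two standard OpenMP mechanisms, the `num_threads` clause and the `omp_set_num_threads()` routine:

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Request exactly four threads for this region, overriding any
       OMP_NUM_THREADS setting */
    #pragma omp parallel num_threads(4)
    {
        printf("Thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }

    /* Alternatively, set the default for all subsequent parallel regions */
    omp_set_num_threads(2);

    #pragma omp parallel
    {
        printf("Now thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }

    return 0;
}
```

In the exercises that follow we'll use the `OMP_NUM_THREADS` environment variable instead, since it lets us change the thread count without editing or recompiling the code.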
+

If you run this now using `./hello-THRD yourname` you should see something like:

```output
Hello yourname, this is node ln01 responding from thread 151
Hello yourname, this is node ln01 responding from thread 157
Hello yourname, this is node ln01 responding from thread 106
Hello yourname, this is node ln01 responding from thread 65
Hello yourname, this is node ln01 responding from thread 144
Hello yourname, this is node ln01 responding from thread 116
Hello yourname, this is node ln01 responding from thread 199
Hello yourname, this is node ln01 responding from thread 239
Hello yourname, this is node ln01 responding from thread 47
Hello yourname, this is node ln01 responding from thread 63
Hello yourname, this is node ln01 responding from thread 254
Hello yourname, this is node ln01 responding from thread 173
Hello yourname, this is node ln01 responding from thread 169
Hello yourname, this is node ln01 responding from thread 44
Hello yourname, this is node ln01 responding from thread 243
Hello yourname, this is node ln01 responding from thread 244
Hello yourname, this is node ln01 responding from thread 245
Hello yourname, this is node ln01 responding from thread 242
...
```

When running on an ARCHER2 login node, this will likely make use of 256 threads.
If on your own machine, this is probably more like 4, 8 or perhaps 16 threads.

::::challenge{id=parallel_comp_pr.1 title="How many threads?"}
We can change the number of threads used by an OpenMP program by setting the `OMP_NUM_THREADS` environment variable.
Try this now, and check the output.

:::solution

```bash
export OMP_NUM_THREADS=4
./hello-THRD yourname
```

```output
Hello yourname, this is node ln01 responding from thread 0
Hello yourname, this is node ln01 responding from thread 1
Hello yourname, this is node ln01 responding from thread 3
Hello yourname, this is node ln01 responding from thread 2
```

:::

::::

::::challenge{id=parallel_comp_pr.2 title="Why the random order?"}
You likely noticed that the output from each thread does not (necessarily) appear in order.
Why do you think this is?

:::solution
Since the threads are running in parallel, they are not guaranteed to run their code statements in any particular order.
:::

::::

::::challenge{id=parallel_comp_pr.3 title="Submitting an OpenMP job"}

**To be able to run the job submission examples in this segment, you'll need access to a Slurm job scheduler, for example on an HPC infrastructure such as ARCHER2 or DiRAC.**

Write a job submission script that runs this OpenMP code.

You'll need to specify the number of CPU cores to use via the `--cpus-per-task` `#SBATCH` parameter.

:::solution

```bash
#!/bin/bash

#SBATCH --job-name=Hello-OMP
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:01:00

# Replace [project code] below with your project code (e.g. t01)
#SBATCH --account=[project code]
#SBATCH --partition=standard
#SBATCH --qos=standard

# Set the number of threads to the CPUs per task
export OMP_NUM_THREADS=4

./hello-THRD yourname
```

:::
::::

## Part 2: Distributed Memory Parallelism Using MPI

MPI is a message-passing interface that uses a distributed memory approach to parallelism, allowing multiple instances of a program, running as separate processes, to send messages to each other.
+

In this MPI example, which we'll put in a file called `helloWorldMPI.c`, each process prints out a hello message which states which node it is running on and which process in the group it is, and includes a string (the command line argument) passed to it from process (or *rank*) 0.
Rank 0, on the other hand, prints out a slightly different message.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    // Check input argument
    if(argc != 2)
    {
        printf("Required one argument `name`.\n");
        return 1;
    }

    // Receive arguments; leave room for the node name, the "@" separator
    // and the terminating null character
    char* iname = (char *)malloc(strlen(argv[1]) + MPI_MAX_PROCESSOR_NAME + 2);
    char* iname2 = (char *)malloc(strlen(argv[1]) + 1);

    strcpy(iname, argv[1]);
    strcpy(iname2, iname);

    // MPI Setup
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Get_processor_name(name, &len);

    // Create message from rank 0 to broadcast to all processes.
    strcat(iname, "@");
    strcat(iname, name);

    // Create buffer for message; use a fixed maximum size so that every
    // rank passes the same count to MPI_Bcast
    int buffSize = strlen(argv[1]) + MPI_MAX_PROCESSOR_NAME + 2;
    char* buff = (char *)malloc(buffSize);

    // Sending process fills the buffer
    if (rank == 0)
    {
        strcpy(buff, iname);
    }

    // Send the message
    MPI_Bcast(buff, buffSize, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);

    // Send different messages from different ranks
    // Send hello from rank 0
    if (rank == 0)
    {
        printf("Hello world, my name is %s, I am sending this message from process %d of %d total processes executing, which is running on node %s. \n", iname2, rank, size, name);
    }

    // Send response from the other ranks
    if (rank != 0)
    {
        printf("Hello, %s I am process %d of %d total processes executing and I am running on node %s.\n", buff, rank, size, name);
    }

    free(buff);
    free(iname2);
    free(iname);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();

    return 0;
}
```

You’ll notice that the program is a fair bit more complex, since here we need to handle explicitly how we send messages.
MPI is covered in detail later but essentially, after initialising MPI and working out how many separate processes we have available to use (known as `ranks`),
rank 0 sends the command line string using `MPI_Bcast` (broadcast) to all other processes.

On ARCHER2, you compile this code using:

```bash
cc helloWorldMPI.c -o hello-MPI
```

:::callout{variant="tip"}
If you encounter compilation errors, you may need to load an MPI module before compiling; consult the documentation for your cluster to find out how.
:::

::::callout

## On your own machine

If you're compiling and running this on your own machine, you'll very likely need to use a dedicated MPI compiler wrapper called `mpicc` instead, which is typically bundled as part of an MPI installation:

```bash
mpicc helloWorldMPI.c -o hello-MPI
```

Then, to run this locally on your own machine, you typically use the `mpiexec` command.
For example, to run our code over 4 processes, or ranks:

```bash
mpiexec -n 4 ./hello-MPI yourname
```

::::

::::challenge{id=parallel_comp_pr.4 title="Submitting an MPI job"}

**To be able to run the job submission examples in this segment, you'll need access either to ARCHER2 or to an HPC infrastructure running the Slurm job scheduler, along with knowledge of how to configure job scripts for submission.**

Write a Slurm submission script for our MPI job, so that it runs across 4 processes.
Note that you'll need to: + +- Specify the number of processes to use as an `#SBATCH` parameter. Which one should you use? (*Hint:* look back at the material that introduced the first job we submitted via Slurm) +- Use the Slurm `srun` command to run our MPI job, e.g. `srun ./hello-MPI yourname` + +:::solution +We need to use the `tasks-per-node` parameter to specify the number of processes to run. + +```bash +#!/bin/bash + +#SBATCH --job-name=Hello-MPI +#SBATCH --nodes=1 +#SBATCH --tasks-per-node=4 +#SBATCH --cpus-per-task=1 +#SBATCH --time=00:01:00 + +# Replace [project code] below with your project code (e.g. t01) +#SBATCH --account=[project code] +#SBATCH --partition=standard +#SBATCH --qos=standard + +srun ./hello-MPI yourname +``` + +::: +:::: + +After you've submitted the job (or run it locally) and it's completed, you should see something like: + +```output +Hello, yourname@nid001686 I am process 1 of 4 total processes executing and I am running on node nid001686. +Hello, yourname@nid001686 I am process 2 of 4 total processes executing and I am running on node nid001686. +Hello, yourname@nid001686 I am process 3 of 4 total processes executing and I am running on node nid001686. +Hello world, my name is yourname, I am sending this message from process 0 of 4 total processes executing, which is running on node nid001686. +``` + +::::challenge{id=parallel_comp_pr.5 title="Increasing the number of nodes"} +What happens if you increase the number of nodes to 2? +Why do you think this happens? + +:::solution +You'll see something like: + +```output +Hello, yourname@nid003165 I am process 4 of 8 total processes executing and I am running on node nid003174. +Hello, yourname@nid003165 I am process 5 of 8 total processes executing and I am running on node nid003174. +Hello, yourname@nid003165 I am process 6 of 8 total processes executing and I am running on node nid003174. +Hello, yourname@nid003165 I am process 7 of 8 total processes executing and I am running on node nid003174. +Hello world, my name is yourname, I am sending this message from process 0 of 8 total processes executing, which is running on node nid003165. +Hello, yourname@nid003165 I am process 2 of 8 total processes executing and I am running on node nid003165. +Hello, yourname@nid003165 I am process 3 of 8 total processes executing and I am running on node nid003165. +Hello, yourname@nid003165 I am process 1 of 8 total processes executing and I am running on node nid003165. +``` + +Increasing the number of nodes to 2, with 4 tasks (or processes) per node means we have a total of 8 processes running our code. 
+::: +:::: diff --git a/high_performance_computing/parallel_computers/images/ARCHER2.jpg b/high_performance_computing/parallel_computers/images/ARCHER2.jpg new file mode 100644 index 00000000..37d2ab58 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/ARCHER2.jpg differ diff --git a/high_performance_computing/parallel_computers/images/alexandre-debieve-FO7JIlwjOtU-unsplash.jpg b/high_performance_computing/parallel_computers/images/alexandre-debieve-FO7JIlwjOtU-unsplash.jpg new file mode 100644 index 00000000..462eadaa Binary files /dev/null and b/high_performance_computing/parallel_computers/images/alexandre-debieve-FO7JIlwjOtU-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computers/images/florian-olivo-Mf23RF8xArY-unsplash.jpg b/high_performance_computing/parallel_computers/images/florian-olivo-Mf23RF8xArY-unsplash.jpg new file mode 100644 index 00000000..be387f36 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/florian-olivo-Mf23RF8xArY-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computers/images/helena-lopes-2MBtXGq4Pfs-unsplash.jpg b/high_performance_computing/parallel_computers/images/helena-lopes-2MBtXGq4Pfs-unsplash.jpg new file mode 100644 index 00000000..d2eef215 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/helena-lopes-2MBtXGq4Pfs-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computers/images/hero_187042a6-7e25-46dd-a3f3-810c2b184e79.png b/high_performance_computing/parallel_computers/images/hero_187042a6-7e25-46dd-a3f3-810c2b184e79.png new file mode 100644 index 00000000..5605482c Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_187042a6-7e25-46dd-a3f3-810c2b184e79.png differ diff --git a/high_performance_computing/parallel_computers/images/hero_4a65543e-9635-4624-9811-5da1a0ab431e.png b/high_performance_computing/parallel_computers/images/hero_4a65543e-9635-4624-9811-5da1a0ab431e.png new file mode 100644 index 00000000..da069e87 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_4a65543e-9635-4624-9811-5da1a0ab431e.png differ diff --git a/high_performance_computing/parallel_computers/images/hero_55c8a23e-686f-42a9-b7e9-de0a12208486.jpg b/high_performance_computing/parallel_computers/images/hero_55c8a23e-686f-42a9-b7e9-de0a12208486.jpg new file mode 100644 index 00000000..1a17ce92 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_55c8a23e-686f-42a9-b7e9-de0a12208486.jpg differ diff --git a/high_performance_computing/parallel_computers/images/hero_5bca5e55-5548-4a13-8f20-f07f498cec7e.jpg b/high_performance_computing/parallel_computers/images/hero_5bca5e55-5548-4a13-8f20-f07f498cec7e.jpg new file mode 100644 index 00000000..3f3091ee Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_5bca5e55-5548-4a13-8f20-f07f498cec7e.jpg differ diff --git a/high_performance_computing/parallel_computers/images/hero_6d93ece3-84b2-495f-b5c5-0e0f652196ea.png b/high_performance_computing/parallel_computers/images/hero_6d93ece3-84b2-495f-b5c5-0e0f652196ea.png new file mode 100644 index 00000000..f33ef1ab Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_6d93ece3-84b2-495f-b5c5-0e0f652196ea.png differ diff --git a/high_performance_computing/parallel_computers/images/hero_73afa9aa-74db-4ad2-893e-971956518bdf.jpg 
b/high_performance_computing/parallel_computers/images/hero_73afa9aa-74db-4ad2-893e-971956518bdf.jpg new file mode 100644 index 00000000..21015029 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_73afa9aa-74db-4ad2-893e-971956518bdf.jpg differ diff --git a/high_performance_computing/parallel_computers/images/hero_86aabe5a-fd59-4b39-bb87-88cb856fddea.jpg b/high_performance_computing/parallel_computers/images/hero_86aabe5a-fd59-4b39-bb87-88cb856fddea.jpg new file mode 100644 index 00000000..f9200451 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_86aabe5a-fd59-4b39-bb87-88cb856fddea.jpg differ diff --git a/high_performance_computing/parallel_computers/images/hero_87e2018b-86eb-4aa5-a7c4-efd271a505b2.webp b/high_performance_computing/parallel_computers/images/hero_87e2018b-86eb-4aa5-a7c4-efd271a505b2.webp new file mode 100644 index 00000000..d0075a66 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_87e2018b-86eb-4aa5-a7c4-efd271a505b2.webp differ diff --git a/high_performance_computing/parallel_computers/images/hero_9090d93c-0a48-4a33-8ed4-3b8fc6acf6cf.png b/high_performance_computing/parallel_computers/images/hero_9090d93c-0a48-4a33-8ed4-3b8fc6acf6cf.png new file mode 100644 index 00000000..7bde9a70 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_9090d93c-0a48-4a33-8ed4-3b8fc6acf6cf.png differ diff --git a/high_performance_computing/parallel_computers/images/hero_91d652a7-98f2-49d1-85ee-62d3ff46bac6.jpg b/high_performance_computing/parallel_computers/images/hero_91d652a7-98f2-49d1-85ee-62d3ff46bac6.jpg new file mode 100644 index 00000000..a6f7a02a Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_91d652a7-98f2-49d1-85ee-62d3ff46bac6.jpg differ diff --git a/high_performance_computing/parallel_computers/images/hero_9f93cf41-f24d-4ab2-8a7e-d25a78a8089c.png b/high_performance_computing/parallel_computers/images/hero_9f93cf41-f24d-4ab2-8a7e-d25a78a8089c.png new file mode 100644 index 00000000..644e3788 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_9f93cf41-f24d-4ab2-8a7e-d25a78a8089c.png differ diff --git a/high_performance_computing/parallel_computers/images/hero_a887d8cf-e9a0-4810-b7ab-b7a016dfc47f.webp b/high_performance_computing/parallel_computers/images/hero_a887d8cf-e9a0-4810-b7ab-b7a016dfc47f.webp new file mode 100644 index 00000000..c01be9b0 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_a887d8cf-e9a0-4810-b7ab-b7a016dfc47f.webp differ diff --git a/high_performance_computing/parallel_computers/images/hero_cbe27959-b81d-41c1-8d00-9c7fc44d34e9.jpg b/high_performance_computing/parallel_computers/images/hero_cbe27959-b81d-41c1-8d00-9c7fc44d34e9.jpg new file mode 100644 index 00000000..9f775057 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_cbe27959-b81d-41c1-8d00-9c7fc44d34e9.jpg differ diff --git a/high_performance_computing/parallel_computers/images/hero_d827be57-5840-4339-b47c-f70c0d36fcd1.jpg b/high_performance_computing/parallel_computers/images/hero_d827be57-5840-4339-b47c-f70c0d36fcd1.jpg new file mode 100644 index 00000000..fee90217 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_d827be57-5840-4339-b47c-f70c0d36fcd1.jpg differ diff --git a/high_performance_computing/parallel_computers/images/hero_dcac5759-2efe-4f9f-a6a7-f439ef43840c.jpg 
b/high_performance_computing/parallel_computers/images/hero_dcac5759-2efe-4f9f-a6a7-f439ef43840c.jpg new file mode 100644 index 00000000..d81ad293 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_dcac5759-2efe-4f9f-a6a7-f439ef43840c.jpg differ diff --git a/high_performance_computing/parallel_computers/images/hero_f158c8fd-2092-4272-a9dc-e4806b44f9cc.png b/high_performance_computing/parallel_computers/images/hero_f158c8fd-2092-4272-a9dc-e4806b44f9cc.png new file mode 100644 index 00000000..daa232d2 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/hero_f158c8fd-2092-4272-a9dc-e4806b44f9cc.png differ diff --git a/high_performance_computing/parallel_computers/images/isawred-Mn4_KuFSpe4-unsplash.jpg b/high_performance_computing/parallel_computers/images/isawred-Mn4_KuFSpe4-unsplash.jpg new file mode 100644 index 00000000..5ed5ef83 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/isawred-Mn4_KuFSpe4-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computers/images/jeswin-thomas-2Q3Ivd-HsaM-unsplash.jpg b/high_performance_computing/parallel_computers/images/jeswin-thomas-2Q3Ivd-HsaM-unsplash.jpg new file mode 100644 index 00000000..1708f3ec Binary files /dev/null and b/high_performance_computing/parallel_computers/images/jeswin-thomas-2Q3Ivd-HsaM-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computers/images/julian-hochgesang-3-y9vq8uoxk-unsplash.jpg b/high_performance_computing/parallel_computers/images/julian-hochgesang-3-y9vq8uoxk-unsplash.jpg new file mode 100644 index 00000000..686dadaa Binary files /dev/null and b/high_performance_computing/parallel_computers/images/julian-hochgesang-3-y9vq8uoxk-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computers/images/kaleidico-7lryofJ0H9s-unsplash.jpg b/high_performance_computing/parallel_computers/images/kaleidico-7lryofJ0H9s-unsplash.jpg new file mode 100644 index 00000000..c529878d Binary files /dev/null and b/high_performance_computing/parallel_computers/images/kaleidico-7lryofJ0H9s-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computers/images/kelly-sikkema-JRVxgAkzIsM-unsplash.jpg b/high_performance_computing/parallel_computers/images/kelly-sikkema-JRVxgAkzIsM-unsplash.jpg new file mode 100644 index 00000000..d2b632d6 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/kelly-sikkema-JRVxgAkzIsM-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computers/images/laura-ockel-qOx9KsvpqcM-unsplash.jpg b/high_performance_computing/parallel_computers/images/laura-ockel-qOx9KsvpqcM-unsplash.jpg new file mode 100644 index 00000000..9b2707b2 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/laura-ockel-qOx9KsvpqcM-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computers/images/oleksii-piekhov-IflQrze1wMM-unsplash.jpg b/high_performance_computing/parallel_computers/images/oleksii-piekhov-IflQrze1wMM-unsplash.jpg new file mode 100644 index 00000000..0c89a9fd Binary files /dev/null and b/high_performance_computing/parallel_computers/images/oleksii-piekhov-IflQrze1wMM-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computers/images/piret-ilver-98MbUldcDJY-unsplash.jpg b/high_performance_computing/parallel_computers/images/piret-ilver-98MbUldcDJY-unsplash.jpg new file mode 100644 index 00000000..2431818f Binary files /dev/null and 
b/high_performance_computing/parallel_computers/images/piret-ilver-98MbUldcDJY-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computers/images/timo-volz-9Psb5Q1TLD4-unsplash.jpg b/high_performance_computing/parallel_computers/images/timo-volz-9Psb5Q1TLD4-unsplash.jpg new file mode 100644 index 00000000..414d4870 Binary files /dev/null and b/high_performance_computing/parallel_computers/images/timo-volz-9Psb5Q1TLD4-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computers/index.md b/high_performance_computing/parallel_computers/index.md new file mode 100644 index 00000000..5e62dc88 --- /dev/null +++ b/high_performance_computing/parallel_computers/index.md @@ -0,0 +1,31 @@ +--- +name: Parallel and Distributed Computers +id: parallel_computers +dependsOn: [ + high_performance_computing.supercomputing, +] +files: [ + 01_basics.md, + 02_connecting.md, + 03_comparison.md, + 04_practical.md, +] +summary: | + This module introduces parallel and distributed computation, looking at how our own computers work, and ways of + combining and making use of the computing power of many computational resources such as computer processors and + other machines. + +--- + +In this module we look at supercomputing hardware and architectures. + +::::iframe{id="kaltura_player" width="700" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_la39s2xl&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_0bmxhco2" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Welcome_to_Parallel_Computers"} +:::: + +:::solution{title="Transcript"} +0:11 - This week, we’ll start looking in more detail at how the many CPU-cores that make up a parallel supercomputer are put together. We’ll see that the way that the CPU-cores are connected to the memory is actually the key issue. And this leads to two distinct types of parallel computer. One approach leads to relatively small scale, everyday, parallel systems, such as your laptop, mobile phone, or the graphics card in a games console. The other approach is more unique to supercomputing, allowing us to scale up to the hundreds of thousands of cores we need to tackle the world’s largest computer simulations. +::: + +We will talk about the main building blocks of supercomputers and introduce the concepts of shared and distributed memory architectures. +You will also learn about the main differences between your laptop and a supercomputer node. +You will also have an opportunity to build and manage your own supercomputer through the Supercomputing App game. 
diff --git a/high_performance_computing/parallel_computing/01_intro.md b/high_performance_computing/parallel_computing/01_intro.md new file mode 100644 index 00000000..44fd67e2 --- /dev/null +++ b/high_performance_computing/parallel_computing/01_intro.md @@ -0,0 +1,267 @@ +--- +name: Introduction to Parallel Computing +dependsOn: [ + high_performance_computing.parallel_computers.03_comparison +] +tags: [foundation] +attribution: + - citation: > + "Introduction to HPC" course by EPCC. + This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC. + url: https://epcced.github.io/Intro-to-HPC/ + image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg + license: CC-BY-4.0 +---

![Photo of conductor conducting an orchestra](images/andrea-zanenga-yUJVHiYZCGQ-unsplash.jpg)
*Image courtesy of [Andrea Zanenga](https://unsplash.com/@andreazanenga) from [Unsplash](https://unsplash.com)*

## Parallel Computing

We have seen over the first two parts of the course that almost all modern computers are parallel computers, consisting of many individual CPU-cores that are connected together working simultaneously on one or more computer programs.

A single CPU-core acts as a serial computer, running only a single computer program at any one time.
The Oxford English Dictionary defines serial computing as ‘the performance of operations … in a set order one at a time’.
To take advantage of a parallel computer, we need to perform many operations at the same time so that we can make use of many CPU-cores.
Parallel computing is defined as ‘Involving the concurrent or simultaneous performance of certain operations’.

It is quite clear that a supercomputer has the potential for doing calculations very quickly.
However, it may not immediately be obvious how to take advantage of this potential power for any particular problem.
We next look at techniques we can use in our programs in order to take advantage of parallel computers.
This requires a problem, calculation or serial computer program to be parallelised.

The process of parallelising a calculation has a number of important steps:

- splitting the calculation into a number of smaller tasks that can be done independently (and therefore performed simultaneously by different CPU-cores), which is also called decomposing the calculation;
- identifying when and where tasks need to be coordinated (meaning that the CPU-cores must talk to each other);
- implementing these two operations using standard approaches;
- executing the parallel program on a parallel computer.

The first two steps typically depend only on the problem you are trying to solve, and not on the architecture of the particular parallel computer it is to be run on.
However, we will see that the last two steps are quite different depending on whether you are targeting a shared or distributed-memory computer.
In these cases, we use two distinct programming models: the shared-variables model and the message-passing model, each executed in fundamentally different ways.

There are many existing software packages and tools to help you actually write the implementation, but the first two steps (the design of the program) are still done by hand.
This requires someone to sit down and think things through using pencil-and-paper, maybe experimenting with a number of ideas to find the best approach.
+

Parallel computing has been around for several decades, so there is a wealth of experience to draw upon, and the best parallelisation strategies have been identified for a wide range of standard problems.
Despite this, it is not currently possible to completely automate these design steps for anything but the simplest problems – perhaps a disappointing state of affairs given the fact that almost every phone, tablet or laptop is now a parallel computer, but good news if you are employed as a parallel programmer!

---

![Photo of busy road intersection](images/chuttersnap-4YdbwhmTMn0-unsplash.jpg)
*Image courtesy of [CHUTTERSNAP](https://unsplash.com/@chuttersnap) from [Unsplash](https://unsplash.com)*

## Traffic Simulation

Now we are going to consider how to parallelise a more interesting example than just adding up a list of numbers. First, we will describe how the example works and get familiar with it.

We are going to look at a very simple way of trying to simulate the way that traffic flows. This better illustrates how parallel computing is used in practice compared to the previous salaries example:

- We process the same data over and over again rather than just reading it once;
- The way the model works (where the behaviour of each car depends on the road conditions immediately in front and behind) is a surprisingly good analogy for much more complicated computations, such as weather modelling, which we will encounter later on.

### The traffic model

In this example we only talk about a straight, one-way road without any intersections.
We think of the road as being divided into a number of sections or cells; each cell either contains a car (is occupied) or is empty (unoccupied).
The car behaviour is very simple: if the cell in front is occupied then the car cannot move and remains where it is; if the cell in front is unoccupied then it moves forward.
A complete sweep of the road, where we consider every car once, is called an iteration; in each iteration, a car can only move once.
Having updated the entire road, we proceed to the next iteration and check again which cars can move and which cars stand still.

If you want, you can think of each cell representing about 5 metres of road and each iteration representing around half a second of elapsed time.

Despite being such a simple simulation, it represents some of the basic features of real traffic flow quite well. We will talk you through some examples in the next few videos.

The simplest way to characterise different simulations is by the density and average velocity of the cars.
The density is the number of cars divided by the number of cells: if all cells are occupied then the density is 1.0 (i.e. 100%), and if half of them are occupied then the density is 0.5 (50%).
If we define the top speed of a car as 1.0 (i.e. it moves one cell in an iteration) then the average speed of the cars at each iteration is the number of cars that move divided by the total number of cars.
For example, if we have 10 cars and, in a given iteration, all of them move then the average speed is 1.0.
If only four of them move then the average speed is 0.4 and, if none of them move, then the average speed is zero (total gridlock).
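In code, these two measurements are just simple counting loops. Here is a minimal sketch (the names `road`, `ncell` and `nmove` are illustrative, not taken from the course's `traffic.c`):

```c
/* road[i] holds 1 for an occupied cell and 0 for an empty one;
   nmove is the number of cars that moved this iteration */

double average_velocity(const int *road, int ncell, int nmove)
{
    int ncars = 0;
    for (int i = 0; i < ncell; i++)
    {
        ncars += road[i];          /* count the cars on the road */
    }

    if (ncars == 0)
    {
        return 0.0;                /* empty road: avoid dividing by zero */
    }

    return (double) nmove / (double) ncars;  /* 1.0 means free-flowing traffic */
}
```

The density is computed the same way: `(double) ncars / (double) ncell`.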
+

:::callout{variant="discussion"}
Why do you think this problem is a good analogy for other, more complex simulations? If the traffic flow problem was to be parallelised, how do you think it could be decomposed into parallel tasks?
:::

---

## How the traffic model works

::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_bw098xol&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_8e9q3inw" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Traffic_Model_works_hd"}
::::

:::solution{title="Transcript"}
0:11 - So now we’re going to spend some time looking at the traffic model. Here I’m illustrating it being run as pawns on a chessboard. And the way it works is that the squares on the chessboard represent the road and the pawns represent the cars. And obviously, a car here, this piece of road is occupied. There’s a pawn there. This piece of road is not occupied. And just to remind you, the way the model works. The rules are very simple. At each iteration a car moves forward if it can, if there’s a gap in front. It doesn’t move forward if there is a car in front of it. Now the right way to run this simulation is to look

0:42 - at the cars and say: that one can move, that one can’t, and that one can and then to move them. That’s slightly awkward. It turns out that we actually get the right answer if we update them in order from left to right as you look at it. So let’s just go ahead and run the model. So first of all, we have these three cars. This car can move. This car can’t move. And this car can move. So that iteration, two cars moved. There were three cars. So we’re interested in the average speed. The average speed is 2/3 or 0.67. Here I’m saying the maximum speed of the car is one.

1:14 - If you thought of the maximum speed of the car as being 60 miles an hour, you could say the average speed here is 40. But for simplicity, I’m just calling the maximum speed one. On the next iteration, again this car can’t move, this one can and this one can. So again the average speed of the cars there was 2/3 or 0.67. But you’ll see that actually now the cars arranged themselves with gaps in between them. So from now on, we’re in free flowing traffic. So from now on every iteration, all the cars move. And the average speed is one. And this would just carry on happily, all the cars moving. And we have completely free flowing traffic.

1:50 - Now it turns out that in any situation where we have 50% or fewer cars, they will eventually arrange themselves into this car, gap, car, gap arrangement and move off with an average speed of one. However, what’s interesting is actually what happens when there are more than 50% density of cars and we get congestion.
:::

In this short video David uses a chessboard to explain how the traffic model works.
+ +We will be using the chessboard to illustrate a few examples of the traffic conditions in the next steps. + +:::callout{variant="discussion"} +Do you think this simplified model is actually useful? Why? +::: + +--- + +## Example 1: Traffic lights + +::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_hurz7ea5&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_pbpr0tln" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Traffic_lights_hd"} +:::: + +:::solution{title="Transcript"} +0:12 - An interesting situation to look at is when we have traffic stopped at a set of traffic lights. So here I have this piece here representing some traffic lights and here I have four cars queued up at the traffic lights. Let’s imagine what happens when the traffic lights go green. So what would be really nice, would make driving much easier, and make congestion a lot less, is if all the cars moved off at once. So the traffic lights go green and all the cars move off in a single block like that. But we know that’s not what happens. We know you have to wait a long time when traffic lights change. So let’s see how this works in this model. + +0:45 - So we’ve got the cars lined up here. The traffic lights go green. On the first iteration, only one car can move. Then on the next iteration, two cars can move. But this car and this car are still static. The third iteration, three cars can move. And only on the fourth iteration, after four time steps, is this car able to move and they all move off freely. + +1:10 - And after that, we have free-flowing traffic. These cars will carry on moving at speed one and move away freely. So even with this very simple model, this very simple cellular automaton model of how traffic flows, we see that in a situation of cars lined up at traffic lights, it predicts the correct behaviour. The cars don’t move off in a block, they move off and end up spread apart. And they move off one by one. Meaning the cars further back in the line have to wait a certain amount of time to get through and perhaps even miss the traffic lights if they go red again. +::: + +In this first example David uses the traffic model on a chessboard to simulate a situation created by traffic lights. + +:::callout{variant="discussion"} +In your opinion, is our toy model capable of capturing the effect of traffic lights? Can you outline the most essential assumptions? 
+

:::

---

## Example 2: Traffic congestion

::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_8tn3hmn6&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_t5dr00g1" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Traffic_congestion_hd"}
::::

:::solution{title="Transcript"}
0:11 - So one of the interesting things about the simulation is to look at congestion when traffic jams arise. So if we start with a situation where the road is completely full, we have cars everywhere, obviously none of the cars can move. It means complete gridlock. What we’re going to do is consider a situation where there is a small amount of space, so there’s a gap there. So what happens? Well each iteration, you’ll see that only one car can move. The first iteration, this car can move. The second iteration, this car can move. The third iteration, this car can move. And so on. And obviously there, the average speed is very low.

0:43 - Only one car is moving per iteration out of all these cars, so the average speed is not very high. But another way to look at this is quite interesting. If you watch this, you’ll see that what’s actually happening is one car is moving to the right every iteration. You could also think of it as the gap moving to the left. So you could think of this situation as the gap moving to the left with speed one. That’s to be contrasted with a single car on its own on an otherwise empty road where the car moves to the right with speed one. Every iteration the car moves one space to the right.

1:18 - And that’s just quite an interesting observation, but it might help you, in fact, to try to think about how the model works.
:::

In this second example, we again use the chessboard to look at traffic congestion.

Do you understand why only one car can move at each iteration?

This is actually quite important not only in our toy traffic model but also in any kind of computer simulation. The process of transferring continuous functions, equations and models into small distinct bits to be executed in a sequence of steps is called discretisation. This process is one of the initial steps in creating computer simulations.
+ +--- + +## Example 3: Rubbernecking + +::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_8a540tg5&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_uwprlc3q" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Traffic_Rubbernecking_hd"} +:::: + +:::solution{title="Transcript"} +0:11 - So I’m sure you’ll have been in the situation where you’re driving along the motorway, and suddenly you have to brake because the car in front has slowed down. But there’s no apparent reason for it. And then maybe a minute or so later you see there’s been an accident or an incident by the side of the road. And people are slowing down to see what’s been going on. Well, surprisingly this very, very simple simulation predicts that kind of behaviour. So I’m going to have a larger road here. So here, I’ve put two chess boards together so I can now have a larger road. And we’ve set up with free-flowing traffic. So at the moment, everything’s fine. + +0:41 - Everyone is moving forward at speed one. + +0:47 - Now we’re going to imagine that perhaps someone coming from a side street has a slight accident. And there’s a small incident here, and this car is broken down. And having seen it, this car decides to slow down just to see what’s going on. So on the next iteration this car isn’t going to move. That’s the only difference. Everyone moves except for this car. Now you might have wondered beforehand what happens at the edges of the road? These are the boundary conditions, what happens off the edges of your simulation are called boundaries. And the boundary conditions say what happens there. So what we’re going to imagine here is that this is embedded in a run of free flowing traffic. + +1:25 - So cars on the right here just move off into free space. But on the left here, the gaps are replaced with cars coming in. So there is another car coming in here. So now we just carry on running the model. + +1:42 - And again. + +1:47 - And again, we add another car here. This would keep going. + +1:56 - You’ll see that although the original incident was here, the cars are starting to brake further down the line. It becomes more obvious when you run it once more. + +2:08 - But on this iteration, you can see that it’s this car here way down line that can’t move. + +2:16 - Despite the fact that the accident is far upstream. So what we see is the braking behaviour where a car has to slow down because the car in front has slowed down moves backwards to the left in a wave well away from where the original incident happened. 
Now to come back to boundary conditions, what we’re going to look at in the future is we’re going to look at a situation where we’re not imagining a straight road, we’re actually imagining a roundabout where the road is closed in on itself. And the way to do that is to make sure that when cars move off one side of the road, they reappear on the other.

2:46 - What we’ve done is we’ve taken the roundabout, we’ve kind of unwrapped it into a straight line, so we have to make sure that it’s joined in to a roundabout, cars that leave here come back here. So if we are doing a roundabout we’d make sure that if ever a car moved off the end here, it would reappear back here on the side.

3:05 - Cars disappearing off the end here, appear back here on the edge. And we’ll come back to that. That would be the situation, that would be the boundary conditions we’ll want to use to simulate in the parallel simulation.
:::

In this video David talks about another example of traffic conditions - rubbernecking.

The term rubbernecking refers to the activity of motorists slowing down in order to see something on the other side of a road or highway, which creates backwards waves in traffic.

:::callout{variant="discussion"}
Having watched the video, is it clear to you what boundary conditions are? Do you understand why they are needed?
:::

---

![Program code](images/chris-ried-ieic5Tq8YMk-unsplash.jpg)
*Image courtesy of [Chris Ried](https://unsplash.com/@cdr6934) from [Unsplash](https://unsplash.com)*

## Traffic model: predictions, implementation and cellular automata

Using the measurements mentioned in the previous steps, we can see that the simulation predicts the onset of congestion (traffic jams) whenever the density of cars exceeds 50%.

If we run many simulations at different densities, using a very long road, then we get the following graph:

![Chart of density of cars against average speed](images/hero_a14e4034-f6a0-44d2-b238-330b8c9aaed5.png)

Note that we only take measurements of the average speed after we have run for many iterations, to let the simulation settle down to a steady state.

This is quite easy to understand in general terms. When the car density is less than 50%, the cars will eventually arrange themselves so that there is a gap between every car and they can all move once each iteration. This is not possible when more than half the cells are occupied: some of the cars cannot move and the average speed drops below 1.0. In fact, in the congested regime each empty cell allows exactly one car to move per iteration, so the steady-state average speed is roughly the number of empty cells divided by the number of cars, i.e. (1 - density)/density; for example, a density of 0.52 gives an average speed of about 0.92. At 100% density, where every cell is occupied, none of the cars can move and the speed is 0.

What perhaps isn’t obvious (at least it wasn’t obvious to me) is how rapidly congestion happens in the model - an effect well known to every driver, where only a few extra cars can turn a previously clear road into a traffic jam. This is an example of emergent behaviour, where repeated application of very simple rules can lead to surprisingly complex behaviour when applied to a large enough system (the complexity increases with size).

### Boundary Conditions

You might be concerned about what happens at the extreme ends of the road. If I have a road containing 100 cells, what happens if the car in cell 100 wants to move forward? Which cell is to the right of cell 100? Which cell is to the left of cell 1? In computer simulations, these are called the boundary conditions: we have to decide what to do at the boundaries (i.e. the extreme ends) of the simulation.
+

For simplicity, we will choose to simulate traffic driving on a very large roundabout rather than a straight piece of road.
If you imagine wrapping the road into a circle, this means that when a car moves away from cell 100 it reappears back on cell 1.
So, the cell forward from cell 100 is cell 1; the cell backward from cell 1 is cell 100.
These are called periodic boundary conditions.
There are plenty of other strategies used to handle boundaries in computational science; different problems require different solutions.

For the traffic model, this has the nice side-effect that the total number of cars stays the same on every iteration, as there is no way for cars to enter or leave the roundabout.
If the number of cars changes from iteration to iteration then we must have made a mistake!

![Barges from Conway's Game of Life](images/Conways_game_of_life.png)

### Cellular Automata

A simulation such as this, where we have a number of cells and update them depending on the values of their neighbours, is generally called a Cellular Automaton. One of the most famous cellular automata is the Game of Life, which models the growth and death of biological cells. The Game of Life takes place on a two-dimensional grid rather than a line, which makes it a bit too complicated for our purposes, but the principles are very similar.

For the one-dimensional case like the Traffic Model, it turns out that there are 256 different possible models:

- each cell depends on the values of three cells (itself and its two immediate neighbours);
- this means there are 8 rules (i.e. 111, 110, 101, 100, 011, 010, 001 and 000), each with two possible outcomes (occupied = 1, unoccupied = 0). The first number corresponds to the preceding cell, the second to the current cell and the third to the next cell, giving the following outcomes for the new state of the central cell: 1, 0, 1, 1, 1, 0, 0 and 0, respectively;
- that makes a total of 2⁸ = 256 possible models (every possible combination of 2 outcomes for 8 rules: 2 \* 2 \* 2 \* 2 \* 2 \* 2 \* 2 \* 2 = 256).

Most of these models will be very boring, but with our particular choice of rules the model can be seen as a simulation of traffic moving from left to right. The traffic model is sometimes called “Rule 184” - there is a (surprisingly detailed) discussion of it on [Wikipedia](https://en.wikipedia.org/wiki/Rule_184). A sketch of a single update sweep, written in C, is shown below.
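To make the update rule concrete, here is a minimal sketch of one iteration of the traffic model with periodic (roundabout) boundary conditions. The function and array names are illustrative rather than taken from the actual course code, and for clarity it writes the new state into a separate copy of the road rather than updating in place:

```c
/* One iteration of the traffic model (Rule 184): newroad[] is computed
   from oldroad[], using periodic boundary conditions so that the road
   wraps around like a roundabout. Returns the number of cars that
   moved, from which the average velocity can be computed. */

int update_road(const int *oldroad, int *newroad, int ncell)
{
    int nmove = 0;

    for (int i = 0; i < ncell; i++)
    {
        int behind = (i - 1 + ncell) % ncell;  /* wraps cell 0 back to the last cell */
        int ahead  = (i + 1) % ncell;          /* wraps the last cell back to cell 0 */

        if (oldroad[i] == 1)
        {
            /* A car moves forward only if the cell ahead is empty */
            if (oldroad[ahead] == 0)
            {
                newroad[i] = 0;
                nmove++;
            }
            else
            {
                newroad[i] = 1;
            }
        }
        else
        {
            /* An empty cell becomes occupied if there is a car behind it */
            newroad[i] = (oldroad[behind] == 1) ? 1 : 0;
        }
    }

    return nmove;
}
```

Using two copies of the road means every cell is updated from the same consistent snapshot of the previous iteration, which will matter later when we split the road between workers.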
We call these approaches programming models.

| Parallel Architecture | Programming Model |
| --------------------- | ----------------- |
| Shared Memory | Shared Variables |
| Distributed Memory | Message Passing |

What we mean by a programming model is that we take a high-level view of the way we are going to use the parallel computer, only concerned with the fundamental features of the computer and not bothering about the details (here, the fundamental feature is how the memory is arranged).

To illustrate what we mean by a model in this context, let’s take another example from traffic. Imagine that you want to travel between two cities. In the private transport model you have your own vehicle and are responsible for planning the route, driving safely and ensuring the vehicle has enough fuel; you can choose who you share the journey with. In the public transport model, you pay someone else (e.g. a bus company) to supply the vehicle and a trained driver, and the vehicle follows a fixed route to a given timetable; you will share the vehicle with fellow travellers.

Although both models achieve the same aim of transporting you between the two cities, they do so in fundamentally different ways with their own pros and cons. At this level, we are not concerned about the details of how each model is implemented: whether you drive an electric or petrol car, or travel by bus or by train, the fundamental distinction between the private and public models remains the same despite differing in the details.

In terms of the two programming models, we are not concerned about the details of how the computer is built, whether it has a fast communications network or a slow one, whether each processor has a few cores or dozens of them. All we care about is whether or not different CPU-cores are all directly attached to the same memory: are the workers sharing the same office or in different offices?

diff --git a/high_performance_computing/parallel_computing/02_programming.md b/high_performance_computing/parallel_computing/02_programming.md
new file mode 100644
index 00000000..1c7d536b
--- /dev/null
+++ b/high_performance_computing/parallel_computing/02_programming.md
@@ -0,0 +1,485 @@
---
name: Parallel Computing Programming
dependsOn: [
  high_performance_computing.parallel_computing.01_intro
]
tags: [foundation]
attribution:
  - citation: >
      "Introduction to HPC" course by EPCC.
      This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC.
    url: https://epcced.github.io/Intro-to-HPC/
    image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg
    license: CC-BY-4.0
---

![Blueprint](images/sigmund-_dJCBtdUu74-unsplash.jpg)
*Image courtesy of [Sigmund](https://unsplash.com/@sigmund) from [Unsplash](https://unsplash.com)*

## Shared-Variables Model

On a computer, a variable corresponds to some piece of information that we need to store in memory. In the traffic model, for example, we need to store all the cells in the old road (containing the state from the previous step) and the new road (containing the current state of the road), plus other quantities that we calculate such as the number of cars that move or the density of the cars. All of these are variables — they take different values throughout the calculation.

Remember that the shared memory architecture (many CPU-cores connected to the same piece of memory) is like several office mates sharing a whiteboard.
In this model, we have two choices as to where we store any variables:

- shared variables: accessible by everyone in the office
- private variables: can only be accessed by the person who owns them

A shared variable corresponds to writing the value on the whiteboard so that everyone in the office can read or modify it. You can think of private variables as being stored on a personal notepad that can only be seen by the owner.

Although writing everything on the whiteboard for all to see might seem like a good idea, it is important to ensure that the officemates do not interfere with each other’s calculations. If you are working on the cells for a section of road, you do not want someone else changing the values without you knowing about it. It is crucial to divide up the work so that the individual tasks are independent of each other (if possible) and to make sure that workers coordinate whenever there is a chance that they might interfere with each other.

In the shared-variables model, the workers are often referred to as threads.

### Things to consider

When parallelising a calculation in the shared-variables model, the most important questions are:

- which variables are shared (stored on the whiteboard) and which are private (written in your own notepad);
- how to divide up the calculation between workers;
- how to ensure that, when workers need to coordinate with each other, they do so correctly;
- how to minimise the number of times workers must coordinate with each other.

The most basic methods of coordination are:

- master region: certain calculations are only carried out by one of the workers - a nominated boss worker;
- barrier: everybody waits until all workers have reached a certain point in the calculation; when everyone has reached that point, workers can then proceed;
- locking: if you are working with a variable and don’t want anyone else to touch it, you can lock it. This means that only one worker can access the variable at a time - if the variable is locked by someone else, you have to wait until they unlock it. On a shared whiteboard you could imagine circling a variable to show to everyone else that you have it locked, then erasing the circle when you are finished.

Clearly, all of these have the potential to slow things down as they can lead to workers waiting around for others to finish, so you should try to do as little coordination as possible (while still ensuring that you get the correct result!).

### Adding to a Variable

One of our basic operations is to increment a variable, for example to add up the total number of cars that move each iteration. It may not be obvious but, on a computer, adding one to a variable is not a single operation. Using the whiteboard analogy, it has the following stages:

- take a copy of the value on the whiteboard and write it in your notepad (load a value from memory into a register);
- add one to the value on your notepad (issue an increment instruction on the register);
- copy the new value back to the whiteboard (store the new value from the register to memory).

In the shared-variables model, the problem occurs if two or more workers try to do this at the same time: if one worker takes a copy of the variable while another worker is modifying it on their notepad, then you will not get the correct answer. Sometimes you might be lucky and no-one else modifies the variable while you are working on your notepad, but there is no guarantee.
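To make this concrete, here is a minimal OpenMP sketch in C of the hazard just described (illustrative only; the variable name `nmoves` is made up for this example). Many threads increment the same shared counter:

```c
#include <stdio.h>

int main(void)
{
    int nmoves = 0;    /* shared variable: the value "on the whiteboard" */

    /* Many threads execute iterations of this loop at the same time. */
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        /* Each unprotected increment is really three steps: load the
           value into a register, increment it, store it back. The
           "atomic" directive forces the three steps to happen as one;
           remove it and updates from different threads can be lost. */
        #pragma omp atomic
        nmoves++;
    }

    printf("nmoves = %d (expected 1000000)\n", nmoves);
    return 0;
}
```

Compiled with OpenMP enabled (e.g. `gcc -fopenmp`), this prints the expected total; with the `atomic` line removed, the result is usually too small and varies from run to run.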
This situation is called a race condition and is a disaster for parallel programming: sometimes you get the right answer, but sometimes the wrong answer. To fix this you need to coordinate the actions of the workers, for example using locking as described above.

---

![Someone using a calculator](images/towfiqu-barbhuiya-JhevWHCbVyw-unsplash.jpg)
*Image courtesy of [Towfiqu Barbhuiya](https://unsplash.com/@towfiqu999999) from [Unsplash](https://unsplash.com)*

## How to parallelise the Income Calculation example?

Consider how to parallelise the salaries example using the shared-variables model, i.e. how could 4 office mates add up the numbers correctly on a shared whiteboard?

Remember that you are not allowed to talk directly to your office mates - all communications must take place via the whiteboard. Things to consider include:

- how do you decompose the calculation evenly between workers?
- how do you minimise communication between workers?
- how do you ensure that you do not have any race conditions?

Can you think of any other aspects that should be taken into account?

---

## Solution to Income calculation in Shared-Variables

::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_zqowj328&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_jaksyv3n" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Solution_income_shared_variables_hd"}
::::

:::solution{title="Transcript"}
0:11 - Here we’re going to revisit the very simple calculation which I introduced in the first week, where we were adding up lots and lots of numbers to compute the total income of the world by adding up everyone’s salaries. And we’re going to now revisit that, not with a serial program, but look at how we do it in parallel in the shared variables model on a shared memory computer. And what I’ll do is I’ll illustrate how we’re going to split the calculation up between the different CPU-cores. But more importantly, the subtleties of synchronisation which are introduced. Just to re-iterate the analogy, this white board represents the shared memory which is in an office.

0:46 - And there is more than one person in the office. Let’s imagine that there is myself and three other people– four workers sharing this office. And the analogy with shared memory is that all of us can freely read and write anywhere we want to on this white board. So what we’re going to do is we can add up these salaries– and here I just have 12. In reality we would be interested in a much larger number, but I could only fit 12 here on the board.
We saw before that we could just add these up, obviously, in order, one by one. And a single CPU-core with a simple program could add these up and get the answer. + +1:16 - But what we want to do is we want to take advantage of the fact there are four of us in this office to do this calculation faster. The nice thing about addition is that we can do the calculation in any order. And that allows us to easily split it up between the workers, between the different CPU-cores in this office. So because there are four of us, we’re each going to take three numbers. And I’m going to do something very simple. I’m going to say that the first worker is going to take the first three numbers. The second the next three, then the next three, then the next three. And let’s imagine I’m the second worker. + +1:43 - So I’m taking numbers four, five, and six from the list. Now this isn’t the only way you could split the calculation up, because any worker can read or write to any part of the shared memory, we could split it up so for example, I could do the first, the fourth, the seventh and the tenth numbers, for example. But we’ll just do something simple and split it up into regular blocks. So I need to add up the first three numbers here. And that immediately introduces the concept that I need a private memory. I need somewhere where I can do my calculation on my own without interfering with someone else before we all get back together and add our subtotals together. + +2:15 - And I’m going to use this notebook to represent private memory. So it’s fairly straightforward. I just read the numbers off the board, the fourth, the fifth, and the sixth numbers and add them together in my private memory to come up with the answer. And my answer turns out to be 108,750. And at the same time, my colleagues, my coworkers are also doing the same in their own private memory with their different parts of the list. But the subtlety comes at the point when we try and combine all the subtotals together. We want to end up with the correct result being available to everybody. So we want the result to be in shared memory. + +2:47 - So I’ve written the total here in the space for the total on the shared memory. But all of the subtotals are all in our private memory. So we need to add them back together. So I come along to add my subtotal back into the running total and I immediately see a problem. Somebody has already started working on the total. This turns out to be the result from the third worker. And my worry is, can I alter this number or is somebody else working on it? + +3:10 - So what I’m going to have to do, what I want to do, is to read this number, take it away, add it to my subtotal, and then write the new updated total back in the same place. But the worry is, somebody else is currently doing the same. Somebody else might be working on their own in their other corner of the office adding their subtotal to this running total. So how do I make sure that two people aren’t working on this piece of shared memory at the same time? The way we solve this problem is by having a lock. We only allow somebody to alter this shared memory if they have the red pen. + +3:42 - So this allows me to make sure that only one worker updates this memory at once. So I come along with my subtotal wanting to add it to the running total and I see the pen is there. That means, I’m safe. There is nobody else working on this memory. So I can grab the pen, I can acquire the lock, and then carry on and do my calculation. 
So I add this 65,500 to my 108,750 and I can update the number. And the new subtotal turns out to be 174,250. And now I return the pen and release the lock, which means that my coworkers can now come in and safely update this number. + +4:24 - So this now represents the sum of the second and the third portions of the list. Eventually in the future, my other co-workers will come together and they’ll finish. And then we will have the total here. Now you might ask, how do I know that everyone is finished? How do I know that this is the final total? Well, to make sure that we know that, we would all execute a barrier. We would put a barrier at the end of the calculation so everybody waits when they have finished. And then when we move on we know that everyone has completed and we know that this total is updated. + +4:51 - This also illustrates the key role that private memory plays in the shared variables model for efficiency. We could get the right answer by every time that I wanted to add a number to the total not having private memory, just adding this number to the total, then this one, then this one. But each time I did that, I’d have to acquire the lock. And so I’d block everyone else many, many times. We’d be locking this number 12 times. Or in reality, thousands and thousands of times for a list of lots of numbers. + +5:19 - However, if we all create our own subtotals in our private memory, and only update the shared memory when we’re completed, we only need to use four locks– one lock for each worker– as opposed to thousands of locks, one lock for each number. There are, of course, other solutions to this problem. Rather than having a single total, we could all reserve a space in shared memory for our subtotals. So we could have four slots, where when each worker finishes, they would put their subtotal in the correct slot. So when I finished with my calculation, rather than adding it directly into the running total, I just put it in the correct slot. + +5:58 - Everyone does that and when everyone is finished, we have all the subtotals here in shared memory. But to get the right answer, we need to add them together. And then we need a master region. We nominate one thread to be the master, who would then add these numbers up together and update the total. And because we only have one thread updating this total at once through the master region rather than through a lock, we don’t have any problems with multiple people updating shared memory at the same time. +::: + +In this video David outlines how to parallelise the income calculation on a shared whiteboard. + +Making sure that the workers cooperate correctly is the main issue - ensuring correct synchronisation can be surprisingly subtle. + +Compare your answers from the last step with this solution. How did you do? Have you learned anything surprising? We are curious to know! + +--- + +![Photo of many envelopes](images/joanna-kosinska-uGcDWKN91Fs-unsplash.jpg) +*Image courtesy of [Joanna Kosinska](https://unsplash.com/@joannakosinska) from [Unsplash](https://unsplash.com)* + +## Message-Passing Model + +The Message-Passing Model is closely associated with the distributed-memory architecture of parallel computers. + +Remember that a distributed-memory computer is effectively a collection of separate computers, each called a node, connected by some network cables. It is not possible for one node to directly read or write to the memory of another node, so there is no concept of shared memory. 
Using the office analogy, each worker is in a separate office with their own personal whiteboard that no-one else can see. In this sense, all variables in the Message-Passing Model are private - there are no shared variables.

In this model, the only way to communicate information with another worker is to send data over the network. We say that workers are passing messages between each other, where the message contains the data that is to be transferred (e.g. the values of some variables). A very good analogy is making a phone call.

The fundamental points of message passing are:

- the sender decides what data to communicate and sends it to a specific destination (i.e. you make a phone call to another office);
- the data is only fully communicated after the destination worker decides to receive the data (i.e. the worker in the other office picks up the phone);
- there are no time-outs: if a worker decides they need to receive data, they wait by the phone for it to ring; if it never rings, they wait forever!

The message-passing model requires participation at both ends: for it to be successful, the sender has to actively send the message and the receiver has to actively receive it. It is very easy to get this wrong and write a program where the sends and receives do not match up properly, resulting in someone waiting for a phone call that never arrives. This situation is called deadlock and typically results in your program grinding to a halt.

In this model, each worker is called a process rather than a thread as it is in shared-variables, and each worker is given a number to uniquely identify it.

### Things to consider

When parallelising a calculation in the message-passing model, the most important questions are:

- how are the variables (e.g. the old and new roads) divided up among workers?
- when do workers need to send messages to each other?
- how do we minimise the number of messages that are sent?

Because there are no shared variables (i.e. no shared whiteboard), you do not usually have to consider how the work is divided up. Since workers can only see the data on their own whiteboards, the distribution of the work is normally determined automatically from the distribution of the data: you work on the variables you have in front of you on your whiteboard.

To communicate a lot of data, we can send one big message or lots of small ones - which do you think is more efficient? Why?

---

![Overhead photo of traffic](images/chuttersnap-d271d_SOGR8-unsplash.jpg)
*Image courtesy of [CHUTTERSNAP](https://unsplash.com/@chuttersnap) from [Unsplash](https://unsplash.com)*

## How to parallelise the traffic simulation?

Consider how you could parallelise the traffic model among 4 workers, each with their own whiteboards in separate offices, communicating by making phone calls to each other.

Remember that the cars are on a roundabout (we are using periodic boundary conditions) so cars leaving the end of the road reappear at the start.

To get you started:

- think carefully about how the traffic model works; what are its basic rules?
- think about the characteristics of the message-passing model;
- how can you combine them?
- which workers need to phone each other, when and how often?

You do not need to provide a clear-cut answer. Instead, list the things that you think need to be considered and why.
### Extra Exercises

In fact, sending a message can be implemented in two different ways:

- like making a phone call (synchronously) or
- like sending an email (asynchronously).

The difference is whether the sender waits until the receiver is actively taking part (a phone call) or carries on with their own work regardless (sending an email).

Do you think that solving the traffic model in parallel is simpler using synchronous or asynchronous messages? Which do you think might be faster? Do you think the boundary conditions are important here?

Imagine that you want all workers to know the average speed of the cars at every iteration. How could you achieve this using as few phone calls as possible?

---

## Solution to Traffic simulation in Message-Passing

::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_cy6y9hac&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_db2dyi95" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Solution_Traffic_Message_passing_hd"}
::::

:::solution{title="Transcript"}
0:11 - Now we’re going to look at the traffic simulation but imagine how we could operate it in parallel. So I’ve got an even bigger board here, a bigger road. I’ve got three chess boards stuck together. And so I have a road of length 24. And I have cars all over the road. So if we were operating on a shared memory computer in the shared-variables model, it wouldn’t really be a problem running this simulation in parallel. This would be our very large, shared whiteboard, and there would be lots of workers together in the same office– all able to read and write to the whiteboard.

0:41 - So for example, we could just decide, if there were two workers, that one worker updated all of these pawns, and the other worker updated all of these pawns. There might have to be some interaction, some collaboration between us to make sure that we are always on the same iteration. That we didn’t run ahead of each other. But in principle, there’s no real problem here. However, what’s much more interesting for this simulation is to look at how you parallelise it in distributed memory using the message-passing model. So what happens there is we’re going to imagine this is split up over three people, three workers, but they’re in three separate offices.

1:13 - So these are three separate whiteboards, each in a different office, and I’m operating on this small whiteboard here. All I can see is my own whiteboard. I can’t see what’s going on in these other two offices.
1:33 - Now in the message-passing parallelisation I only have access to my own small whiteboard. And so now we can see there’s immediately a problem. For example, I can update this pawn. I know he can’t move. That one can’t move. That one can move. And that one can. But I have a problem at the edges. I have two problems. One is, I don’t know if this pawn can move, because I don’t know what’s happening over here. This is the piece of board which is owned by the person upstream from me.

2:00 - And also, I don’t know if I should place a new pawn in this cell here, because that depends on the state of the board owned by the person who’s downstream from me. So the only way to solve this is to introduce communication. And in the message-passing model, communication is done through passing messages. And one analogy is making phone calls. So what I need to do, I need to pick up the phone and I need to phone my neighbours both to the left and to the right. I need to phone them up and say, OK, what’s going on here? What pawns do you have in your cell here? And I’ll tell my neighbour what’s going on here.

2:37 - And then I need to make another phone call upstream to ask the person, what’s going on in your cells there? And to tell them what’s going on in my edge cells. Having communicated with my fellow workers, I’m now in a situation where I can update the simulation on my piece of board, because I know what’s going on on the edges. So for example, I know if there’s a pawn who needs to move into this square here. Or I know if there’s a gap here and this pawn can move off. So I can then update all my pawns on my board. And then on the next iteration, I have to again communicate with my fellow workers.

3:08 - I need to communicate with my neighbours to the left and to the right to find out what the new state of the pawns on the edges of their boards are. So the whole simulation continues in this process of communication and then calculation. Communication to find out what’s going on with your fellow workers. And then calculation, when you locally update your own chess board. There’s one extra thing we need to do in this simulation, which is work out the average speed of the cars. So let’s take a situation, for example, where this pawn can move and there is no pawn coming in here. So we’ll say, OK, this pawn can’t move. This pawn moves, that’s one move. That makes two moves.

3:44 - And this one, that’s three moves. So I know that three pawns have moved. But to calculate the average speed of the cars, I need to know how many pawns have moved on the entire road when in fact, I can only see a small section of the road. So not only do we need to communicate with our fellow workers to do a single update, to perform a single iteration, to find out what’s going on at the edges of our board, we also need to communicate with them to work out what the average speed is, to work out how many pawns have moved on their board.

4:10 - So whenever I want to work out what the average speed is, I have to pick up the phone and phone all of my fellow workers– that’s the simple way of doing it. Asking them how many of their pawns have moved. And then I get the totals and I can add them all together. So you can see that not only does updating the simulation require communication, even simple calculations like how many pawns have moved requires communication.
Because I only know how many pawns have moved on my piece of road, but not what’s happening on the other pieces of road which are on other people’s whiteboards in other offices.
:::

In this video David describes the basics of how you can parallelise the traffic model using message passing, i.e. on a distributed-memory machine.

Try to list the most important points of this parallelisation. Was there anything that you failed to consider when coming up with your answer? For example, how does each worker know whether it’s supposed to call or wait for a call? Can you think of any other rules that need to be established for this to work?

Hopefully, you now have a better understanding of how both programming models work and how they differ from each other. In the next two steps we will talk about the actual implementations of both models.

---

![Photo of pipes overlapping](images/t-k-9AxFJaNySB8-unsplash.jpg)
*Image courtesy of [T K](https://unsplash.com/@realaxer) from [Unsplash](https://unsplash.com)*

## MPI and processes

So far we have discussed things at a conceptual level, but it’s worth going into some of the details, particularly so you are familiar with certain terminology such as process, thread, MPI and OpenMP.

### Message Passing

The way that message-passing typically works is that you write one computer program, but run many copies of it at the same time. Any supercomputer will have ways to let you spread these programs across the nodes, and we normally ensure we run exactly one copy of the program for every CPU-core so that they can all run at full speed without being swapped out by the operating system. From the OS’s point of view, each program is a separate process and by default they all run completely independently from each other.

For example, if you run a Word Processor and a Spreadsheet Application at the same time, each of these becomes a separate process that can only run on a single CPU-core at a time. In the message-passing model, we exploit the parallel nature of our distributed-memory computer by running many processes. The only real differences from the word processor and spreadsheet example are that every process is a copy of the same program, and that we want our parallel processes to work together and not independently.

Each process only has access to its own memory, totally separate from the others (this is strictly enforced by the OS to ensure that, for example, your Word Processor cannot accidentally overwrite the memory belonging to your Spreadsheet Application!). This way of implementing message-passing is called the Single Program Multiple Data or SPMD approach.

When they want to send messages to each other, the processes call special functions to exchange data. For example, there will be a function to send data and a function to receive data. These functions are not directly part of the program but are stored in a separate library which will be pre-installed on the supercomputer (if you are familiar with the way that applications are created from source code, this means that the compiler is not directly involved in the parallelisation).

Almost all modern message-passing programs use the Message-Passing Interface library - MPI. Essentially, MPI is a collection of communication functions that can be called from any user process.
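As a concrete illustration, here is a minimal sketch in C of the SPMD approach (illustrative only, not part of the course's exercise code; the variable `nmoves` and its value are made up). Every process runs this same program, and the rank returned by the MPI library is what distinguishes the sender from the receiver:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                  /* every copy of the program starts here  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* my unique process number: 0, 1, 2, ... */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* the total number of processes          */

    if (rank == 0) {
        int nmoves = 42;                     /* some local data to communicate */
        /* "Make a phone call" to process 1 */
        MPI_Send(&nmoves, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int nmoves;
        /* "Wait by the phone" for a message from process 0 */
        MPI_Recv(&nmoves, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d of %d received nmoves = %d\n", rank, size, nmoves);
    }

    MPI_Finalize();
    return 0;
}
```

Launched with, say, 4 processes, all of them execute this same code but only ranks 0 and 1 take part in the exchange; if the `MPI_Recv` were not matched by a send, process 1 would wait forever - the deadlock described earlier.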
So, to summarise:

- the message-passing model is implemented by running many processes at the same time;
- each process can only run on a single CPU-core and is allocated its own private memory;
- inter-process communication is enabled by using the MPI library and so does not require a special compiler;
- this is also called the SPMD approach.

![Diagram depicting difference between message passing and shared variables](images/hero_4177e963-f697-4b49-bcce-01940d651fd3.png)

Can you see any problems with the Message-Passing approach if one of the nodes has a hardware failure and crashes? As supercomputers are getting larger, does this become more or less of an issue?

---

## Practical 5: Parallel Execution via MPI

We now have an MPI version of the image sharpening code, so let's compile it and submit it to Slurm.

```bash
cd foundation-exercises/sharpen/C-MPI
make
```

```output
mpicc -cc=cc -O3 -DC_MPI_PRACTICAL -c sharpen.c
mpicc -cc=cc -O3 -DC_MPI_PRACTICAL -c dosharpen.c
mpicc -cc=cc -O3 -DC_MPI_PRACTICAL -c filter.c
mpicc -cc=cc -O3 -DC_MPI_PRACTICAL -c cio.c
mpicc -cc=cc -O3 -DC_MPI_PRACTICAL -c utilities.c
mpicc -cc=cc -O3 -DC_MPI_PRACTICAL -o sharpen sharpen.o dosharpen.o filter.o cio.o utilities.o -lm
```

::::challenge{id=parallel_prog_pr.1 title="Submitting a Sharpen MPI job"}
Write a job submission script that runs this sharpen MPI code.

Remember that we can't just run the MPI code using `./sharpen`. How should we run it in our submission script?

:::solution

```bash
#!/bin/bash

#SBATCH --job-name=Sharpen-MPI
#SBATCH --nodes=4
#SBATCH --tasks-per-node=1
#SBATCH --time=00:01:00

# Replace [project code] below with your project code (e.g. t01)
#SBATCH --account=[project code]
#SBATCH --partition=standard
#SBATCH --qos=standard

srun ./sharpen
```

:::
::::

---

![Photo of sewing threads](images/stephane-gagnon-NLgqFA9Lg_E-unsplash.jpg)
*Image courtesy of [Stephane Gagnon](https://unsplash.com/@metriics) from [Unsplash](https://unsplash.com)*

## OpenMP and threads

### Shared Variables

Shared variables are implemented in quite a different way from message passing. For shared variables, our CPU-cores need to be able to share the same memory (i.e. read and write to the same whiteboard). However, we said above that different processes cannot access each other’s memory, so what can we do?

The shared-variables approach is implemented using threads. Threads are just like normal programs except they are created by processes while they are running, not explicitly launched by the user. So, every thread belongs to a parent process; unlike processes, threads can share memory.

The sequence is:

1) we run a single program which starts out running as a single process on a single CPU-core with its own block of memory;
2) while it is running, the process creates many threads which act like separate programs except they can all share the memory belonging to their parent process;
3) the operating system will notice that there are lots of threads running at the same time and ensure that, if possible, they are assigned to different CPU-cores.

So, in the shared-variables model, we exploit the parallel nature of our shared-memory computer by running many threads, all created from a single program (the parent process). We rely on the operating system to do a good job of spreading the threads across the CPU-cores.

In supercomputing, we normally use something called OpenMP to create and manage all our threads.
Unlike the MPI library, OpenMP is something that needs to be built into the compiler. There are actually many ways of creating parallel threads, but OpenMP was designed to be suited to large-scale numerical computations, which is why it is so popular in the field.

To summarise:

- the shared-variables model is implemented by running many threads at the same time;
- each thread can only run on a single CPU-core, but they can all share memory belonging to their parent process;
- in supercomputing, we usually create threads using a special compiler that understands OpenMP.

![Diagram depicting difference between message passing and shared variables](images/hero_4177e963-f697-4b49-bcce-01940d651fd3.png)

When we create threads we rely on the OS to assign them to different CPU-cores. How do you think the OS makes that decision? What does it need to take into account, when there may be many more threads than CPU-cores?

::::callout

## Would you like to know more?

If you're interested in a more detailed introduction to OpenMP that covers the technical concepts and its history,
you can watch [this video](https://media.ed.ac.uk/media/1_xyz5en6s).
::::

---

## Practical 6: Parallel Execution via OpenMP

Let's now take a look at a version of the sharpening code that uses OpenMP.

```bash
cd foundation-exercises/sharpen/C-OMP
make
```

```output
cc -O3 -fopenmp -DC_OPENMP_PRACTICAL -c sharpen.c
cc -O3 -fopenmp -DC_OPENMP_PRACTICAL -c dosharpen.c
cc -O3 -fopenmp -DC_OPENMP_PRACTICAL -c filter.c
cc -O3 -fopenmp -DC_OPENMP_PRACTICAL -c cio.c
cc -O3 -fopenmp -DC_OPENMP_PRACTICAL -c utilities.c
cc -O3 -fopenmp -DC_OPENMP_PRACTICAL -o sharpen sharpen.o dosharpen.o filter.o cio.o utilities.o -lm
```

::::challenge{id=parallel_prog_pr.2 title="Submitting a Sharpen OpenMP job"}
Write a job submission script that runs this sharpen OpenMP code.

:::solution

```bash
#!/bin/bash

#SBATCH --job-name=Sharpen-OMP
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:01:00

# Replace [project code] below with your project code (e.g. t01)
#SBATCH --account=[project code]
#SBATCH --partition=standard
#SBATCH --qos=standard

# Set the number of threads to the CPUs per task
export OMP_NUM_THREADS=4

./sharpen
```

:::
::::

---

![Photo of apples and oranges](images/anastasiya-romanova-vGY31qO4518-unsplash.jpg)
*Image courtesy of [Anastasiya Romanova](https://unsplash.com/@nanichkar) from [Unsplash](https://unsplash.com)*

## Comparing the Message-passing and Shared-Variables models

In your opinion, what are the pros and cons of the two models of parallel programming?

Things to consider include:

- how difficult it is to parallelise a calculation in the two models
- how do they use memory? Can they share it? Why? How?
- how many CPU-cores can you use in each model?
- what happens if you do it incorrectly - will the program ever complete? will it get the right answer?
- how does the speed of the two models compare - what are the overheads of each?

---

## Terminology Quiz

::::challenge{id=pcing_performance.1 title="Parallel Computing Performance Q1"}
What does the term programming model describe?
+ +A) a particular kind of computer simulation + +B) a specific computer programming language + +C) a high-level view of how to solve a problem using a computer + +D) the low-level details of how a particular computer is constructed + +:::solution +C) - it is concerned with the high-level methods we use to solve a problem, not the low-level details. +::: +:::: + +::::challenge{id=pcing_performance.2 title="Parallel Computing Performance Q2"} +What is a race condition in the shared-variables model? + +A) when a CPU-core is expecting to receive a message but it never arrives + +B) when two CPU-cores have different values for some private variable + +C) when lack of synchronisation leads to one CPU-core running ahead of the others + +D) when two CPU-cores try to modify the same shared variable at the same time + +:::solution +D) - this can cause erratic results and we need some form of synchronisation to fix it. +::: +:::: + +::::challenge{id=pcing_performance.3 title="Parallel Computing Performance Q3"} +Which of these could cause deadlock in the message-passing model? + +A) a CPU-core asks to receive a message but no message is ever sent to it + +B) a CPU-core asks to receive a message from another a few seconds before it is sent + +C) a CPU-core sends a message to another a few seconds before it is ready to receive it + +D) two CPU-cores try to modify the same shared variable at the same time + +:::solution +A) - this is like waiting forever for someone to phone you. +::: +:::: + +::::challenge{id=pcing_performance.4 title="Parallel Computing Performance Q4"} +Which of the following best describes the process of decomposing a calculation? + +A) deciding which variables should be private and which should be shared + +B) deciding which parts of a calculation can be done independently by different CPU-cores + +C) choosing between the shared-variables and message-passing models + +D) running a parallel program on a parallel computer + +:::solution +B) - Decomposing a problem means deciding how to do it in parallel on multiple CPU-cores +::: +:::: + +::::challenge{id=pcing_performance.5 title="Parallel Computing Performance Q5"} +What is a cellular automaton? + +A) a type of computer specifically built to run simple computer simulations + +B) a modular system for building parallel computers from simple cellular components + +C) a computer simulation technique based on repeated application of simple rules to a grid of cells + +D) a parallel programming model + +:::solution +C) - the traffic model is a good example of a simple cellular automaton +::: +:::: diff --git a/high_performance_computing/parallel_computing/03_parallel_performance.md b/high_performance_computing/parallel_computing/03_parallel_performance.md new file mode 100644 index 00000000..029b52e2 --- /dev/null +++ b/high_performance_computing/parallel_computing/03_parallel_performance.md @@ -0,0 +1,211 @@ +--- +name: Parallel Computing Performance +dependsOn: [ + high_performance_computing.parallel_computing.02_programming +] +tags: [foundation] +attribution: + - citation: > + "Introduction to HPC" course by EPCC. + This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC. 
    url: https://epcced.github.io/Intro-to-HPC/
    image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg
    license: CC-BY-4.0
---

![Photo of someone holding a stopwatch](images/veri-ivanova-p3Pj7jOYvnM-unsplash.jpg)
*Image courtesy of [Veri Ivanova](https://unsplash.com/@veri_ivanova) from [Unsplash](https://unsplash.com)*

## Parallel Performance

We have seen that parallel supercomputers have enormous potential computing power - the largest machine in the Top500 list has a peak performance in excess of 1000 Petaflops. This is achieved by having hundreds of thousands of CPU-cores and GPGPUs in the same distributed-memory computer, connected with a fast network.

When considering how to parallelise even the simplest calculations (such as the traffic model) using the message-passing model, we have seen that this introduces overheads: making a phone call to someone in another office takes time, and this is time when you are not doing any useful calculations.

We therefore need some way of measuring how well our parallel computation is performing: is it making the best use of all the CPU-cores? The input to all these metrics is the time taken for the program to run on $P$ CPU-cores, which we will call $T_P$.

The standard measure of parallel performance is called the parallel speedup. We measure the time taken to do the calculation on a single CPU-core, and the time taken on $P$ CPU-cores, and compute the parallel speedup $S_P$:

$$
S_P = \frac{T_1}{T_P}
$$

For example, if the program took 200 seconds on 1 CPU-core (i.e. running in serial) and 25 seconds on 10 CPU-cores then the parallel speed-up is

$$
S_{10} = \frac{T_1}{T_{10}} = \frac{200}{25} = 8
$$

Ideally, we would like our parallel program to run 10 times faster on 10 CPU-cores, but this is not normally possible due to the inevitable overheads.
These overheads ensure $S_P$ is generally less than $P$.

Another way of quantifying this is to compute the parallel efficiency:

$$
E_P = \frac{S_P}{P}
$$

This gives us an idea of how efficiently we are using the CPU-cores. For the example given above, the parallel efficiency is

$$
E_{10} = \frac{S_{10}}{10} = \frac{8}{10} = 0.80 = 80 \%
$$

As before, our parallelisation will not be 100% efficient and $E_P$ will be less than 1.0 (i.e. less than 100%).

When considering the way a parallel program behaves, the standard approach is to measure the performance for increasing values of $P$ and to plot a graph of either parallel speedup or parallel efficiency against the number of CPU-cores.

![Performance chart of number of cpu cores against speedup for perfect speedup and example speedup](images/hero_afc04aae-ee23-4f71-8df0-80c3bf10d38e.png)

We call this a scaling curve - we are trying to understand how well the performance scales with increasing numbers of CPU-cores; determining whether the program scales well or scales poorly (has good or bad scalability).

For some problems a parallel efficiency of 80% would be considered to be very good and for others not so good. Can you think of a reason why that is?

---

![Photo of very tall building](images/veri-ivanova-p3Pj7jOYvnM-unsplash.jpg)
*Image courtesy of [Patrick Tomasso](https://unsplash.com/@impatrickt) from [Unsplash](https://unsplash.com)*

## Scaling and Parallel Overheads

It is always useful to have some simple examples to help us understand why a parallel program scales the way it does. If we double the number of CPU-cores, does the performance double?
If not, why not?

Consider the following example of flying from a hotel in central London to a holiday destination. In this example:

- the top speed of the aeroplane represents the computing power available: doubling the speed of the plane is equivalent to doubling the number of CPU-cores in our parallel computer;
- the distance we travel is equivalent to the size of the problem we are tackling; travelling twice as far is equivalent to doubling the number of cells in our traffic model.

We will consider two journeys: from Central London to the Empire State Building in New York (5,600 km), and Central London to Sydney Opera House (16,800 km: 3 times as far).

We will consider two possible aeroplanes: a Jumbo Jet (top speed 700 kph) and Concorde (2100 kph: 3 times as fast).

The important observation is that the total journey time is the flight time plus the additional overheads of travelling between the city centre and the airport, waiting at check-in, clearing security or passport control, collecting your luggage etc. For simplicity, let’s assume that travel to the airport takes an hour by bus each way, and that you spend an hour in the airport at each end.

| Plane | Destination | Flight Time | Overhead | Total Journey | Speed-up $S_3$ | Efficiency $E_3$ |
| --- | --- | --- | --- | --- | --- | --- |
| Jumbo Jet | New York | 8:00 | 4:00 | 12:00 | | |
| Concorde | New York | 2:40 | 4:00 | 6:40 | 1.8 | 60% |
| Jumbo Jet | Sydney | 24:00 | 4:00 | 28:00 | | |
| Concorde | Sydney | 8:00 | 4:00 | 12:00 | 2.3 | 78% |

Try to answer the following questions:

- does the journey overhead depend on the distance flown?
- what is the speedup for the first journey for a plane 10 or 100 times faster than Concorde?
- what does it tell you about the limits of parallel computing?
- what is the speedup for the second journey for a plane 10 or 100 times faster than Concorde?
- why do you think it’s different from the first journey’s speedup?
- what does it tell you about the possibilities of parallel computing?

---

![Photo of judge's gavel](images/wesley-tingey-TdNLjGXVH3s-unsplash.jpg)
*Image courtesy of [Wesley Tingey](https://unsplash.com/@wesleyphotography) from [Unsplash](https://unsplash.com)*

## Parallel Performance Laws

The two most common ways to understand parallel scaling are called Amdahl’s Law and Gustafson’s Law. However, like many laws they boil down to simple common sense and are easily understandable using everyday examples.

Let’s have another look at the example from the previous step.

The key point here is that the overhead does not depend on the distance flown by the aircraft. In all cases you spend two hours travelling to and from the airports, and two hours waiting around in the airports: a total of four hours.

The journey to New York illustrates Amdahl’s Law - the speedup is less than 3.0 because of the overheads. In a parallel program, these overheads can be characterised as serial parts of the calculation, i.e. parts of the calculation that do not go any faster when you use more CPU-cores. Typical examples can include reading and writing data from and to disk (which is often done via a single CPU-core), or the time spent in communications.

In fact, Gene Amdahl used essentially this argument, way back in 1967, to claim that parallel computing was not very useful in practice - no matter how fast your parallel computer is, you cannot eliminate the serial overheads.
Even if we had a starship travelling at almost the speed of light, the journey time to New York would never be less than 4 hours, so the speedup would never exceed 3.0.
Inherently, your program is always going to be limited by the portions of the program that cannot be parallelised, even if these portions only make up a small fraction of the runtime of the serial program.
The fact that we may only be able to make a calculation three times faster even if we use thousands of CPU-cores is a little dispiriting…

However, the journey to Sydney illustrates Gustafson’s Law - things get better if you tackle larger problems because the serial overheads stay the same but the parallel part gets larger. For the large problem, our maximum speedup would be 7.0.

![Chart depicting diminishing returns from increasingly faster aircraft](images/hero_38f91e44-13fa-4d56-bd8f-1ff3b86002ff.png)
*Scaling plot for the New York and Sydney journeys*

Amdahl’s law can be expressed as an equation:

$$
S_P = \frac{P}{\alpha P + (1-\alpha)}
$$

where $\alpha$ is the fraction of the calculation that is entirely serial. For example, for the New York journey

$$
\alpha = \frac{4\,\text{hours}}{12\,\text{hours}} = 0.33
$$

Amdahl’s law predicts that, although the speedup always increases with $P$, it never exceeds $1/\alpha$. For the New York trip, this means the speedup is limited to 3.0, which is what we already observed.

If you are interested in more details (and more maths!) then you can visit Wikipedia ([Amdahl’s Law](https://en.wikipedia.org/wiki/Amdahl%27s_law) and [Gustafson’s Law](https://en.wikipedia.org/wiki/Gustafson%27s_law)), but all the equations can sometimes obscure what is really a common-sense result.

---

![Long exposure of many cars on highway at night](images/jake-givens-iR8m2RRo-z4-unsplash.jpg)
*Image courtesy of [Jake Givens](https://unsplash.com/@jakegivens) from [Unsplash](https://unsplash.com)*

## Scaling Behaviour of the Traffic Simulation

Let’s consider the traffic model parallelised using message-passing – you and three colleagues are working together on the traffic model in different offices. You each have your own chess set with pawns representing the cars, but you need to phone each other if you want to exchange information.

For the sake of argument, let’s also assume that you can update your chess board at the rate of two cells per second, i.e. if you had 8 cells then each timestep would take you 4 seconds.

For a road of 20 cells, work out how long it takes to do a single iteration in serial (one person) and in parallel (all four people). You will need to count how many phone calls you need to make, and make a guess as to how long each phone call will take.

For simplicity, we will not bother to compute the average speed of the cars and you can ignore any startup costs associated with telling everyone the initial distribution of cars.

What is the speedup using four people, $S_4$, for this calculation?

Is parallel computing worthwhile here?

Compute the speedup $S_4$ for roads of length 200, 2000 and 20000 cells - what do you observe? You could also compute $S_2$ and $S_3$ to see how each problem size scales with increasing $P$.

To what extent do these figures agree or disagree with Amdahl’s law and Gustafson’s Law?

:::callout{variant="discussion"}
These calculations require you to make a number of assumptions, so when comparing answers with your fellow learners you should not focus on the numerical results so much.
Looking at the reasoning behind each of the assumptions and comparing your overall conclusions should be more interesting.
:::

---

## Terminology Recap

::::challenge{id=pcing_programming.1 title="Parallel Computing Programming Q1"}
The standard measure of parallel performance is called the parallel ____ .
For P CPU-cores it is calculated as the time taken to run a program on ____
CPU-core divided by the time taken to run it on ____ CPU-cores.

:::solution

A) speedup

B) one

C) P

:::
::::

::::challenge{id=pcing_programming.2 title="Parallel Computing Programming Q2"}
In parallel computing, the parallel ____ is used to measure how efficiently the CPU-cores are utilised. Although we would like this to be as high as possible, it is typically less than ____.

:::solution

A) efficiency

B) 1.0

:::
::::

::::challenge{id=pcing_programming.3 title="Parallel Computing Programming Q3"}
The plot showing the performance of a parallel program with increasing number of CPU-cores is referred to as a ____ ____ . The fact that parallel programs do not scale perfectly (i.e. the speedup is not equal to the number of CPU-cores) is explained by an equation called ____ ____ .

:::solution

A) scaling curve

B) Amdahl's Law

:::
::::

diff --git a/high_performance_computing/parallel_computing/04_practical.md b/high_performance_computing/parallel_computing/04_practical.md
new file mode 100644
index 00000000..d6a5eb6a
--- /dev/null
+++ b/high_performance_computing/parallel_computing/04_practical.md
@@ -0,0 +1,361 @@
---
name: Parallelising Mandelbrot Set Generation
dependsOn: [
  high_performance_computing.parallel_computing.03_parallel_performance
]
tags: [foundation]
attribution:
  - citation: >
      "Introduction to HPC" course by EPCC.
      This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC.
    url: https://epcced.github.io/Intro-to-HPC/
    image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg
    license: CC-BY-4.0
---

## Part 1: Introduction & Theory

### Mandelbrot Set

The Mandelbrot set is a famous example of a fractal in mathematics. It is a set of complex numbers $c$ for which the function

$f_c(z) = z^2 + c$

does not diverge to infinity when iterated from $z=0$, i.e. the values of $c$ for which the sequence

$[ c,\ c^2+c,\ (c^2+c)^2+c,\ ((c^2+c)^2+c)^2+c,\ ...]$

remains bounded.

The complex numbers can be thought of as 2d coordinates, that is a complex number $z$ with real part $a$ and imaginary part $b$ ($z = a + ib$) can be written as $(a, b)$.
The coordinates can be plotted as an image, where the color corresponds to the number of iterations required before the escape condition is reached. The escape condition is met when we have confirmed that the sequence is not bounded, which is when the magnitude of $z$, the current value in the iteration, is greater than 2.
The pseudo code for this is:

```python nolint
for each x,y coordinate
    x0, y0 = x, y
    x = 0
    y = 0
    iteration = 0
    while (iteration < max_iterations and x*x + y*y <= 4)
        x_next = x*x - y*y + x0
        y_next = 2*x*y + y0

        iteration = iteration + 1

        x = x_next
        y = y_next

    return color_map(iteration)
```

Note that for points within the Mandelbrot set
the condition will never be met, hence the need to set the upper bound `max_iterations`.

The Julia set is another example of a complex number set.

From the parallel programming point of view, the useful feature of the Mandelbrot and Julia sets is that the calculation for each point is independent, i.e. whether one point lies within the set or not is not affected by other points.

### Parallel Programming Concepts

#### Task farm

Task farming is one of the common approaches used to parallelise applications. Its main idea is to automatically create pools of calculations (called tasks), dispatch them to the processes and then to collect the results.

The process responsible for creating this pool of jobs is known as a **source**; sometimes it is also called a *master* or *controller process*.

The process collecting the results is called a **sink**. Quite often one process plays both roles – it creates and distributes tasks, and collects results. It is also possible to have a team of source and sink processes. A ‘farm’ of one or more workers claims jobs from the source, executes them and returns results to the sink. The workers continually claim jobs (usually completing one task then asking for another) until the pool is exhausted.

Figure 1 shows the basic concept of how a task farm is structured.

![Schematic representation of a simple task farm](images/task_farm.png)
*Figure 1: Schematic representation of a simple task farm*

In summary, processes can assume the following roles:

- **Source** - creates and distributes tasks
- **Worker processes** - complete tasks received from the source process and then send results to the sink process
- **Sink** - gathers results from worker processes.

Having learned what a task farm is, consider the following questions:

- What types of problems could be parallelised using the task farm approach? What types of problems would not benefit from it? Why?
- What kind of computer architecture could fully utilise the task farm benefits?

#### Using a task farm

A task farm is commonly used in large computations composed of many independent calculations.
Only when calculations are independent is it possible to assign tasks in the most effective way, and thus speed up the overall calculation most efficiently.
If the tasks are independent of each other, the processors can request them as they become available, i.e. usually after they complete their current
task, without worrying about the order in which tasks are completed.

The dynamic allocation of tasks is an effective method for getting more use out of the compute resources.
It is inevitable that some calculations will take longer to complete than others, so using methods such as a lock-step calculation (waiting on the whole set of processors to finish a current job) or pre-distributing all tasks at the beginning would lead to wasted compute cycles.
Of course, not all problems can be parallelised using a task farm approach.

#### Not always a task farm

While many problems can be broken down into individual parts, there are a sizeable number of problems where this approach will not work.
+#### Not always a task farm
+
+Of course, not all problems can be parallelised using a task farm approach. While many problems can be broken down into individual parts, there are a sizeable number of problems where this approach will not work.
+Problems which involve lots of inter-process communication are often not suitable for task farms, as they require the master to track which worker has which element, and to tell workers which other workers they are required to communicate with.
+Additionally, the sink process may need to track this as well in cases of output order dependency.
+It is still possible to use task farms to parallelise problems that require a lot of communication; however, in such cases additional complexity or performance overheads can be incurred.
+As mentioned before, to determine the points lying within the Mandelbrot set there is no need for communication between the worker tasks, which makes it an embarrassingly parallel problem that is suitable for task-farming.
+Even knowing a task farm is viable for a given job, we still need to consider how to use it most effectively.
+
+:::callout{variant="discussion"}
+- How do you think the performance would be affected if you were to use
+more tasks than workers, an equal number, or fewer?
+- In your opinion what would be the optimal combination of the number
+of workers and tasks? What would it depend on the most? Task size?
+Problem size? Computer architecture?
+:::
+
+### Load Balancing
+
+The factor that decides the effectiveness of a task farm is the task distribution, and determining how the tasks are distributed across the workers is called load balancing.
+
+Successful load balancing avoids overloading a single worker, maximising the throughput of the system and making best use of the resources available.
+Poor load balancing will cause some workers of the system to be idle and, consequently, other workers to be ‘overworked’, leading to increased
+computation time and significantly reduced performance.
+
+#### Poor load balancing
+
+The figure below shows how careless task distribution can affect the completion time.
+Clearly, CPU2 needs more time to complete its assigned tasks, particularly compared to CPU3.
+The total runtime is equivalent to the longest runtime on any of the CPUs, so the calculation time will be longer than it would be if the resources
+were used optimally.
+This can occur when load balancing is not considered, random scheduling is used (although this is not always bad), or poor decisions are made about the job sizes.
+
+![Poor load balance](images/load_imbalance.png)
+*Poor load balance*
+
+#### Good Load Balancing
+
+The next figure shows how, by scheduling jobs carefully, the best use of the resources can be made.
+When a distribution strategy is chosen which optimises the use of resources, the CPUs in the diagram complete their tasks at roughly the same time.
+This means that no one worker has been overloaded with tasks and dominated the running time of the overall calculation.
+This can be achieved by many different means.
+
+For example, if the task sizes and running times are known in advance, the
+jobs can be scheduled to allow best resource usage. The most common
+distribution is to distribute large jobs first and then distribute progressively
+smaller jobs to even out the workload.
+
+If the job sizes can change or the running times are unknown, then an
+adaptive system could be used which tries to infer future task lengths based
+upon observed runtimes.
+
+![Good load balance](images/load_balance.png)
+*Good load balance*
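+
+As a tiny worked example (with made-up task lengths): suppose four tasks taking 4, 3, 3 and 2 seconds are shared between two workers. Pre-assigning tasks (4, 3) to one worker and (3, 2) to the other finishes at max(7, 5) = 7 seconds, whereas handing the largest remaining task to whichever worker is free finishes at max(6, 6) = 6 seconds - the best possible, since the total work is 12 seconds split across 2 workers.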
The fractal program you will be using employs a queue strategy – tasks wait in a queue, and each worker claims a new task from the top of the queue as soon as it has completed its previous one. This ensures that workers that happen to get shorter tasks will complete more tasks, so that they finish roughly at the same time as workers with longer tasks.
+
+#### Quantifying the load imbalance
+
+We can try to quantify how well balanced a task farm is by computing the load imbalance factor, which we define as:
+
+$\text{load imbalance factor} = \frac{\text{workload of the most loaded worker}}{\text{average workload of the workers}}$
+
+For a perfectly load-balanced calculation this will be equal to 1.0, which is equivalent to all workers having exactly the same amount of work. In general, it will be greater than 1.0.
+It is a useful measure because it allows you to predict what the runtime would be for a perfectly balanced load on the same number of workers, assuming that no additional overheads are introduced due to load balancing. For example, if the load imbalance factor is 2.0 then this implies that, in principle, we could halve the runtime (reduce it by a factor of 2) if the load were perfectly balanced.
+
+---
+
+## Part 2: Compile and Run
+
+Let's compile and run the example fractal code which makes use of MPI.
+
+### Compiling the source code
+
+```bash
+cd foundation-exercises/fractal/C-MPI
+ls
+```
+
+Similarly to the previous examples, we can compile the source code by doing:
+
+```bash
+make
+```
+
+Again, if running this on your own machine locally, you may need to edit the `Makefile` to change the compiler used. You can then run the code directly with `mpiexec`, e.g. `mpiexec -n 4 ./fractal` to run it with 4 processes.
+
+If you're running this on ARCHER2 you'll note that the created executable `fractal` cannot be run directly, since if you try you get:
+
+```output
+ERROR: need at least two processes for the task farm!
+```
+
+::::challenge{id=parallel_prog_pr.1 title="Submitting a Fractal MPI job"}
+**To be able to run the job submission examples in this segment, you'll need to either have access to ARCHER2, or an HPC infrastructure running the Slurm job scheduler and knowledge of how to configure job scripts for submission.**
+
+So on an HPC infrastructure, we'll need to (and should!) submit this as a job via Slurm.
+Write a script that executes the fractal MPI code using 16 worker processes on a single node.
+
+:::solution
+So in order to have this generated using 16 worker processes, we need to set `tasks-per-node` to 17,
+to accommodate the extra master (source/sink) process. e.g. on ARCHER2:
+
+```bash
+#!/bin/bash
+
+#SBATCH --job-name=Fractal-MPI
+#SBATCH --nodes=1
+#SBATCH --tasks-per-node=17
+#SBATCH --cpus-per-task=1
+#SBATCH --time=00:01:00
+
+# Replace [project code] below with your project code (e.g.
t01)
+#SBATCH --account=[project code]
+#SBATCH --partition=standard
+#SBATCH --qos=standard
+
+srun ./fractal
+```
+
+:::
+::::
+
+Once complete, you should find the log file contains something like the following:
+
+```output
+--------- CONFIGURATION OF THE TASKFARM RUN ---------
+
+Number of processes: 17
+Image size: 768 x 768
+Task size: 192 x 192 (pixels)
+Number of iterations: 5000
+Coordinates in X dimension: -2.000000 to 1.000000
+Coordinates in Y dimension: -1.500000 to 1.500000
+
+-----Workload Summary (number of iterations)---------
+
+Total Number of Workers: 16
+Total Number of Tasks: 16
+
+Total Worker Load: 498023053
+Average Worker Load: 31126440
+Maximum Worker Load: 156694685
+Minimum Worker Load: 62822
+
+Time taken by 16 workers was 0.772049 (secs)
+Load Imbalance Factor: 5.034134
+```
+
+The ``fractal`` executable will take a number of parameters and produce a fractal image in a file called ``output.ppm``. By default the image will be
+overlaid with blocks in different shades, which correspond to the work done by different processors. This way we can see how the tasks were allocated. An example of this is presented in figure 1 – the image is divided into 16 tasks (squares) and a different shade of red corresponds to each of the workers. For example, running this on ARCHER2 with 16 workers will therefore yield 16 shades of red, and running this on your own machine with 4 workers will yield 4 shades instead.
+
+![Fractal output.ppm](./images/fractal_output.png)
+*Example output image created using 16 workers and 16 tasks.*
+
+So in our example script, the program created a task farm with one master process and 16 workers. The master divides the image up into tasks, where each task is a square of 192 by 192 pixels (the default task size). The default image size is 768 x 768 pixels, which means there are 16 tasks - exactly 1 task per worker.
+
+The load of a worker is estimated as the total number of iterations of the Mandelbrot calculation summed over all the pixels considered by that worker. The assumption is that the time taken is proportional to this. The only time that is actually measured is the total time taken to complete the calculation.
+
+If on ARCHER2, use `scp` to copy the `output.ppm` image file back to your local machine to view it, otherwise if on your own machine open the file directly. In any event, your "pattern" of workers for each segment will likely differ from what's depicted here, depending on which workers were assigned which task and how many workers you used.
+
+::::challenge{id=parallel_prog_pr.2 title="Removing Diagnostic Output"}
+Try adding `-n` to `fractal`'s arguments in the submission script (e.g. `srun ./fractal -n`). What happens?
+
+:::solution
+You'll notice that the program no longer shades the output image depending on the worker that created each block.
+This is a good example of how to set up a parallel program that shows how it's making use of parallel resources in its output, which is useful for debugging across multiple processes,
+while also allowing you to generate a "proper" image without this information.
+It can be really helpful, particularly in more complex parallel programs, to have such optional diagnostic output!
+::: +:::: + +## Fractal Program Parameters + +The following options are recognised by the fractal program: + +- ``-S`` number of pixels in the x-axis of image +- ``-I`` maximum number of iterations +- ``-x`` the x-minimum coordinate +- ``-y`` the y-minimum coordinate +- ``-f `` set to J for Julia set +- ``-c`` the real part of the parameter c+iC for the Julia set +- ``-C`` the imaginary part of the parameter c+iC for the julia set +- ``-t`` task size (pixels x pixels) +- ``-n`` do not shade output image based on task allocation to workers + +--- + +## Part 3: Investigation + +**For this segment, if you're running this locally on your own machine assume 3 workers (and 1 master process) instead of 16 workers (and 1 master process), since it's quite possible your own machine will not be able to handle 17 parallel MPI processes over 17 separate cores.** + +To explore the effect of load balancing run the code with different parameters and try to answer the following questions. + +::::challenge{id=parallel_prog_pr.3 title="Predict Runtime"} +From the default run with 16 workers and 16 tasks, what is your predicted best runtime based on the load imbalance factor? + +:::solution +The load balance factor for this run is 5.034134 (although this may vary slightly between runs). +Therefore, we would expect a speedup of approximately 5x if perfect load balance is achieved. +The runtime for this run was 0.772049s, so we should get an optimal runtime of something like 0.154s using 16 workers +::: +:::: + +::::challenge{id=parallel_prog_pr.4 title="Load Distribution"} +Look at the output for 16 tasks – can you understand how the load was distributed across workers by looking at the colours of the bands and the structure of the Mandelbrot set? + +:::solution +The size of each task is set to a grid consisting of $192^2$ pixels, which means that for the default $768^2$ grid each worker gets allocated exactly one task to work on. +::: +:::: + +::::challenge{id=parallel_prog_pr.5 title="Exploring Load Imbalance"} +For 16 workers, run the program with ever smaller task sizes (i.e. more tasks) and create a table with each row containing grid size, runtime, and load imbalance factor. You should ensure you measure all the way up to the maximum number of tasks, i.e. a task size of a single pixel. +You can use `-t` as an argument to the `fractal` program to set the task/grid size. + +:::solution + +Running this on ARCHER2 with 16 workers, you may find your answers look something like: + +| Grid size | Runtime(s) | Load imbalance factor +|-----------|------------|---------------------- +| 192 | 0.772049 | 5.034134 +| 96 | 0.237510 | 1.545613 +| 48 | 0.170243 | 1.107278 +| 24 | 0.160345 | 1.041810 +| 12 | 0.155502 | 1.006157 +| 6 | 0.159350 | 1.003959 +| 3 | 0.176114 | 1.005451 +| 2 | 0.211402 | 1.004163 +| 1 | 0.417041 | 1.003397 + +Running this on a local machine with 3 workers, you may find your answers look more like: + +| Grid size | Runtime(s) | Load imbalance factor +|-----------|------------|---------------------- +| 192 | 1.298491 | 1.337907 +| 96 | 1.003863 | 1.015820 +| 48 | 0.955120 | 1.006782 +| 24 | 0.970588 | 1.007317 +| 12 | 0.983153 | 1.013827 +| 6 | 0.981687 | 0.990886 +| 3 | 1.024623 | 1.032792 +| 2 | 1.019592 | 1.003972 +| 1 | 1.132916 | 1.015426 + +Of course, your figures may differ somewhat! +::: +:::: + +::::challenge{id=parallel_prog_pr.6 title="Analysis"} +Can you explain the form of the table/graph? +Does the minimum runtime approach what you predicted from the load imbalance factor? 
+
+:::solution
+Based on the distribution of the tasks in our generated image, the workers responsible for the bottom and top rows of the grid, as well as the left column, had very little work to do, whereas the middle 4 cells had the most work, based on their overlap with the Mandelbrot set (the black areas require the most cycles to compute).
+
+Clearly the load imbalance is quite high, because once a worker has finished processing its task there is no additional work it can do. In order to remedy this we can reduce the size of the tasks that are assigned to the workers. Looking at our table, we see that the time to solution and the load imbalance factor drop rapidly.
+
+If on ARCHER2, for example, the fastest runtime is achieved for tasks with a grid size of 12, by which point the load imbalance factor is only 1.0062. The load imbalance factor continues to decrease somewhat, but the runtimes increase for smaller grid sizes due to the cost of having to do more communication with the controller process to be assigned a new task. Effectively the master process has become a bottleneck because it cannot assign work to the workers fast enough. This point may be reached at different task sizes depending on how many workers there are. Depending on the problem size and complexity, a large number of workers will require a relatively larger task size to prevent the controller process becoming the bottleneck.
+
+If running this on your own machine with a lower number of workers, you may see a similar "shape" to the numbers or graph, with an initially high imbalance factor that decreases to around 1.0 even faster. Correspondingly, the runtime also initially decreases and then increases again as the master process becomes a bottleneck, although with fewer workers this variance is less pronounced.
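+
+As a rough sketch of why the runtime curve is U-shaped (a simple cost model, not something the program reports): the runtime behaves roughly like $T \approx (W_{total} / N_{workers}) \times L + n_{tasks} \times t_{task}$, where $W_{total}$ is the total work, $L$ is the load imbalance factor and $t_{task}$ is the per-task cost of communicating with the controller. Making tasks smaller drives $L$ towards 1.0 but increases $n_{tasks}$, so the first term falls while the second grows, and the runtime passes through a minimum at some intermediate task size.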
+::: +:::: diff --git a/high_performance_computing/parallel_computing/images/Conways_game_of_life.png b/high_performance_computing/parallel_computing/images/Conways_game_of_life.png new file mode 100644 index 00000000..7c628df1 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/Conways_game_of_life.png differ diff --git a/high_performance_computing/parallel_computing/images/anastasiya-romanova-vGY31qO4518-unsplash.jpg b/high_performance_computing/parallel_computing/images/anastasiya-romanova-vGY31qO4518-unsplash.jpg new file mode 100644 index 00000000..1a91d355 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/anastasiya-romanova-vGY31qO4518-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/images/andrea-zanenga-yUJVHiYZCGQ-unsplash.jpg b/high_performance_computing/parallel_computing/images/andrea-zanenga-yUJVHiYZCGQ-unsplash.jpg new file mode 100644 index 00000000..0054fcc9 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/andrea-zanenga-yUJVHiYZCGQ-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/images/chris-ried-ieic5Tq8YMk-unsplash.jpg b/high_performance_computing/parallel_computing/images/chris-ried-ieic5Tq8YMk-unsplash.jpg new file mode 100644 index 00000000..b1b1e4a8 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/chris-ried-ieic5Tq8YMk-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/images/chuttersnap-4YdbwhmTMn0-unsplash.jpg b/high_performance_computing/parallel_computing/images/chuttersnap-4YdbwhmTMn0-unsplash.jpg new file mode 100644 index 00000000..10336144 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/chuttersnap-4YdbwhmTMn0-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/images/chuttersnap-d271d_SOGR8-unsplash.jpg b/high_performance_computing/parallel_computing/images/chuttersnap-d271d_SOGR8-unsplash.jpg new file mode 100644 index 00000000..effe4349 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/chuttersnap-d271d_SOGR8-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/images/fractal_output.png b/high_performance_computing/parallel_computing/images/fractal_output.png new file mode 100644 index 00000000..7df54d50 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/fractal_output.png differ diff --git a/high_performance_computing/parallel_computing/images/hero_38f91e44-13fa-4d56-bd8f-1ff3b86002ff.png b/high_performance_computing/parallel_computing/images/hero_38f91e44-13fa-4d56-bd8f-1ff3b86002ff.png new file mode 100644 index 00000000..2560b672 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/hero_38f91e44-13fa-4d56-bd8f-1ff3b86002ff.png differ diff --git a/high_performance_computing/parallel_computing/images/hero_4177e963-f697-4b49-bcce-01940d651fd3.png b/high_performance_computing/parallel_computing/images/hero_4177e963-f697-4b49-bcce-01940d651fd3.png new file mode 100644 index 00000000..2cb980a5 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/hero_4177e963-f697-4b49-bcce-01940d651fd3.png differ diff --git a/high_performance_computing/parallel_computing/images/hero_a14e4034-f6a0-44d2-b238-330b8c9aaed5.png b/high_performance_computing/parallel_computing/images/hero_a14e4034-f6a0-44d2-b238-330b8c9aaed5.png new file mode 100644 index 
00000000..f8b26429 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/hero_a14e4034-f6a0-44d2-b238-330b8c9aaed5.png differ diff --git a/high_performance_computing/parallel_computing/images/hero_afc04aae-ee23-4f71-8df0-80c3bf10d38e.png b/high_performance_computing/parallel_computing/images/hero_afc04aae-ee23-4f71-8df0-80c3bf10d38e.png new file mode 100644 index 00000000..7ac15b84 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/hero_afc04aae-ee23-4f71-8df0-80c3bf10d38e.png differ diff --git a/high_performance_computing/parallel_computing/images/jake-givens-iR8m2RRo-z4-unsplash.jpg b/high_performance_computing/parallel_computing/images/jake-givens-iR8m2RRo-z4-unsplash.jpg new file mode 100644 index 00000000..69a80558 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/jake-givens-iR8m2RRo-z4-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/images/joanna-kosinska-uGcDWKN91Fs-unsplash.jpg b/high_performance_computing/parallel_computing/images/joanna-kosinska-uGcDWKN91Fs-unsplash.jpg new file mode 100644 index 00000000..fa58da99 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/joanna-kosinska-uGcDWKN91Fs-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/images/load_balance.png b/high_performance_computing/parallel_computing/images/load_balance.png new file mode 100644 index 00000000..4c4cb475 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/load_balance.png differ diff --git a/high_performance_computing/parallel_computing/images/load_imbalance.png b/high_performance_computing/parallel_computing/images/load_imbalance.png new file mode 100644 index 00000000..1fe6cd8a Binary files /dev/null and b/high_performance_computing/parallel_computing/images/load_imbalance.png differ diff --git a/high_performance_computing/parallel_computing/images/luca-bravo-XJXWbfSo2f0-unsplash.jpg b/high_performance_computing/parallel_computing/images/luca-bravo-XJXWbfSo2f0-unsplash.jpg new file mode 100644 index 00000000..7a97651c Binary files /dev/null and b/high_performance_computing/parallel_computing/images/luca-bravo-XJXWbfSo2f0-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/images/patrick-tomasso-gMes5dNykus-unsplash.jpg b/high_performance_computing/parallel_computing/images/patrick-tomasso-gMes5dNykus-unsplash.jpg new file mode 100644 index 00000000..43d07d7d Binary files /dev/null and b/high_performance_computing/parallel_computing/images/patrick-tomasso-gMes5dNykus-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/images/sigmund-_dJCBtdUu74-unsplash.jpg b/high_performance_computing/parallel_computing/images/sigmund-_dJCBtdUu74-unsplash.jpg new file mode 100644 index 00000000..902c54ed Binary files /dev/null and b/high_performance_computing/parallel_computing/images/sigmund-_dJCBtdUu74-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/images/stephane-gagnon-NLgqFA9Lg_E-unsplash.jpg b/high_performance_computing/parallel_computing/images/stephane-gagnon-NLgqFA9Lg_E-unsplash.jpg new file mode 100644 index 00000000..d07f98cc Binary files /dev/null and b/high_performance_computing/parallel_computing/images/stephane-gagnon-NLgqFA9Lg_E-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/images/t-k-9AxFJaNySB8-unsplash.jpg 
b/high_performance_computing/parallel_computing/images/t-k-9AxFJaNySB8-unsplash.jpg new file mode 100644 index 00000000..f98f432f Binary files /dev/null and b/high_performance_computing/parallel_computing/images/t-k-9AxFJaNySB8-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/images/task_farm.png b/high_performance_computing/parallel_computing/images/task_farm.png new file mode 100644 index 00000000..7349445f Binary files /dev/null and b/high_performance_computing/parallel_computing/images/task_farm.png differ diff --git a/high_performance_computing/parallel_computing/images/towfiqu-barbhuiya-JhevWHCbVyw-unsplash.jpg b/high_performance_computing/parallel_computing/images/towfiqu-barbhuiya-JhevWHCbVyw-unsplash.jpg new file mode 100644 index 00000000..0a4ccaca Binary files /dev/null and b/high_performance_computing/parallel_computing/images/towfiqu-barbhuiya-JhevWHCbVyw-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/images/veri-ivanova-p3Pj7jOYvnM-unsplash.jpg b/high_performance_computing/parallel_computing/images/veri-ivanova-p3Pj7jOYvnM-unsplash.jpg new file mode 100644 index 00000000..650d5240 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/veri-ivanova-p3Pj7jOYvnM-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/images/wesley-tingey-TdNLjGXVH3s-unsplash.jpg b/high_performance_computing/parallel_computing/images/wesley-tingey-TdNLjGXVH3s-unsplash.jpg new file mode 100644 index 00000000..fdc29e18 Binary files /dev/null and b/high_performance_computing/parallel_computing/images/wesley-tingey-TdNLjGXVH3s-unsplash.jpg differ diff --git a/high_performance_computing/parallel_computing/index.md b/high_performance_computing/parallel_computing/index.md new file mode 100644 index 00000000..3e68938e --- /dev/null +++ b/high_performance_computing/parallel_computing/index.md @@ -0,0 +1,35 @@ +--- +name: Parallel Computing +id: parallel_computing +dependsOn: [ + high_performance_computing.parallel_computers, +] +files: [ + 01_intro.md, + 02_programming.md, + 03_parallel_performance.md, + 04_practical.md, +] +summary: | + This module covers how supercomputers are programmed to make use of computational resources in parallel to perform + calculations more quickly. + +--- + +In this video David will give a brief description of what awaits you in this module about parallel computing. 
+
+::::iframe{id="kaltura_player" width="700" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_j7i8ueqz&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_54nq03w8" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Welcome_to_Parallel_Computing"}
+::::
+
+:::solution{title="Transcript"}
+0:11 - We’ve mainly talked about hardware in the first two weeks. Now we’re going to focus on software. We’ll take a couple of simple examples and consider how you could split up each calculation to take advantage of a shared or distributed memory parallel computer. We’ll cover some of the key issues of parallel computing this week at a conceptual level. Again, the analogy of different people working on whiteboards is very useful to illustrate the core concepts. One of the examples we’ll use which is a very simple way of simulating the way the traffic flows on a road is something I hope you’ll find interesting to illustrate how we can use computer simulation to make predictions about the real world.
+
+0:48 - Now the way the traffic simulation is parallelised will allow us to look at what overheads are introduced by parallelisation and enable us to start to quantify when running on a parallel computer is worthwhile and when it isn’t.
+:::
+
+We have previously discussed supercomputers mostly from a hardware perspective; in this part we will focus more on the software side, i.e. how supercomputers are programmed.
+
+We begin by using the traffic simulation to illustrate the core concepts of parallel computing.
+From there, we delve into different programming models, their relationship to machine architectures, and how they can be applied to our case study.
+Finally, we examine performance, learning how to evaluate whether our simulations utilise computing resources efficiently.
diff --git a/high_performance_computing/supercomputing/01_intro.md b/high_performance_computing/supercomputing/01_intro.md
new file mode 100644
index 00000000..ca98d1e9
--- /dev/null
+++ b/high_performance_computing/supercomputing/01_intro.md
@@ -0,0 +1,177 @@
+---
+name: Introduction to Supercomputing
+dependsOn: [
+]
+tags: [foundation]
+attribution: 
+    - citation: >
+        "Introduction to HPC" course by EPCC.
+        This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC.
+      url: https://epcced.github.io/Intro-to-HPC/
+      image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg
+      license: CC-BY-4.0
+---
+
+![Photo of a supercomputer](images/taylor-vick-M5tzZtFCOfs-unsplash.jpg)
+*Image courtesy of [Taylor Vick](https://unsplash.com/@tvick) from [Unsplash](https://unsplash.com)*
+
+## What are supercomputers?
+
+A supercomputer is a computer with very high computational capacity, significantly surpassing a general-purpose computer such as a personal desktop or laptop.
+
+Supercomputers were first introduced in the 1960s by Seymour Roger Cray at Control Data Corporation (CDC), and have been used intensively in science and engineering ever since.
+Clearly the technology has improved since then - today’s laptop would have been a supercomputer only a couple of decades ago - but no matter how fast today’s general-purpose computers are, there will always be a need for much more powerful machines.
+To keep track of the state-of-the-art, the supercomputing community looks to the Top500 list, which ranks the fastest 500 supercomputers in the world every six months.
+
+### Number crunching
+
+The main application of supercomputers is in large-scale numerical computations, also called number-crunching. For simple calculations such as:
+
+~~~math
+123 + 765 = 888
+~~~
+
+or
+
+~~~math
+1542.38 x 2643.56 = 4077374.07
+~~~
+
+you don’t need a supercomputer. In fact, you don’t even need a personal computer, as pencil-and-paper or a simple calculator can do the job. If you want to calculate something more complex, such as the total of the salaries of every employee in a big company, you probably just need a general-purpose computer.
+
+The types of large-scale computations that are done by supercomputers, such as weather-forecasting or simulating new materials at the atomic scale, are fundamentally based on simple numerical calculations that could each be done on a calculator. However, the sheer scale of these computations and the levels of accuracy these applications require mean that almost unimaginably large numbers of individual calculations are needed to do the job. To produce an accurate weather forecast, the total number of calculations required is measured in the quintillions, where a quintillion is one with 18 zeroes after it: 1 000 000 000 000 000 000!
+
+Imagine running a computation that takes several days or weeks to complete, one that you may need to repeat many times with different input parameters. Such a task could monopolize your computer's resources, leaving you unable to use it for anything else. This is particularly problematic for time-sensitive applications, like predicting tomorrow's weather, where a delay would make the results irrelevant.
+
+This is where supercomputers excel. By leveraging thousands of processors working in parallel, they can finish jobs in hours or days that would take general-purpose computers many years to complete. Furthermore, they tackle problems that are too large or complex for everyday machines to store in their memory, such as modeling the Earth's climate, simulating molecular interactions, or processing massive datasets in astrophysics.
+
+Supercomputers are indispensable tools for solving the most computationally demanding challenges.
+
+### Parallelism: the key to performance
+
+Supercomputers achieve this using parallel computing, carrying out many calculations simultaneously. Imagine thousands of general-purpose computers all working for you on the same problem at the same time.
This analogy reflects how modern supercomputers work - you will learn more about the details of their architecture and operation later in the course.
+
+Also keep in mind that although supercomputers provide enormous computational capacities, they are also very expensive to develop, purchase and even just to operate. For example, the typical power consumption of a supercomputer is of the order of several megawatts, where a megawatt (MW) is enough to power a small town of around 1000 people. That is why it’s important to use them as efficiently as possible.
+
+:::callout{variant="discussion"}
+Can you think of other examples of parallelism in everyday life, where many hands make light work?
+:::
+
+---
+
+## Supercomputers - why do we need them?
+
+This UKRI video gives you an overview of why high performance computing is an important aspect of modern scientific research.
+
+::::iframe{width="100%" height="400" src="https://www.youtube.com/embed/NEgbVNIo560" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen}
+::::
+
+In the video Prof. Mark Parsons uses the term High Performance Computing (HPC). When reading about supercomputing, you will encounter this term quite often. HPC is a general term that includes all the activities and components associated with supercomputers, including aspects such as software and data storage as well as the bare supercomputer hardware.
+
+![Computer simulation example image](images/large_hero_e0df48e4-9b4d-422c-a18f-d7898b9578d8.jpg)
+*Computer simulations covering multiple physical phenomena and machine-learning algorithms can predict how dinosaurs might have moved. The laws of physics apply for extinct animals exactly as they do for living ones. © 2016 ARCHER image competition*
+
+## Supercomputers - how are they used?
+
+To do science in the real world we have to build complicated instruments for each experiment: vast telescopes for astronomers to look deep into space, powerful particle accelerators so physicists can smash atoms together at almost the speed of light, or enormous wind tunnels where engineers can study how an aeroplane wing will operate in flight.
+
+### Computer simulation
+
+However, some problems are actually too large, too distant or too dangerous to study directly: we cannot experiment on the earth’s weather to study climate change, we cannot travel thousands of light years into space to watch two galaxies collide, and we cannot dive into the centre of the sun to measure the nuclear reactions that generate its enormous heat. Yet one supercomputer can run different computer programs that reproduce all of these experiments inside its own memory.
+
+:::callout{variant="discussion"}
+We'll be covering more examples of how supercomputers are used in science later in this course.
+Are there any applications of supercomputers you know of that made you interested in this course?
+:::
+
+This gives the modern scientist a powerful new tool to study the real world in a virtual environment. The process of running a virtual experiment is called computer simulation, and compared to disciplines such as chemistry, biology and physics, it is a relatively new area of research which has been around for a matter of decades rather than centuries.
This new area, which many view as a third pillar of science that extends the two traditional approaches of theory and experiment, is called computational science and its practitioners are computational scientists.
+
+### Computational Science
+
+It’s important to distinguish between computational science and computer science. Computer science is the scientific study of computers, and it is through the work of computer scientists that we have the hardware and software required to build and operate today’s supercomputers. Computational science, on the other hand, leverages these supercomputers to run simulations and make predictions about the real world. Adding to the potential confusion, when computer scientists refer to a "computer simulation," they may also mean simulating a computer itself - such as testing the design of a new microprocessor through simulation before production - rather than simulating real-world phenomena.
+
+Large-scale computer simulation has applications in industry, engineering, commerce, and academia. For instance, modern cars and airplanes are designed and tested virtually long before physical prototypes are built. A new car must pass crash safety tests before going to market, and virtual crash simulations enable engineers to identify and resolve potential issues early in the design process. This significantly reduces the costs of physical destructive testing and minimizes the risk of expensive redesigns. Such simulations ensure that new products are far more likely to work correctly on the first attempt, saving time and resources while driving innovation.
+
+### Breaking the world land speed record
+
+This point was well made by Andy Green, the driver of the Bloodhound LSR supersonic car which was aiming to break the world land speed record. In an interview on BBC’s 5 live Drive, broadcast on 4th January 2017, Adrian Goldberg asked Andy about the risks involved compared to the records set by the famous British driver Malcolm Campbell over 80 years ago:
+
+Adrian: “But if you’re travelling at supersonic speeds and you’re breaking records, so by definition doing something that hasn’t been done before, there must be a risk?”
+
+Andy: “You’re missing the point between something that hasn’t been done before and something that is not fully understood. Back in the 1930s, if you were doing something that hadn’t been done before, there was no other way of doing it apart from to go out and find out, to see what happens.
+
+Nowadays you can actually produce a computer model in a supercomputer and spend literally years researching a programme down to an extraordinarily fine level of detail so that when you actually go out to push back the boundary of human endeavour, to achieve something absolutely remarkable that will make everybody look round and go ‘wow, that was impressive!’, you can actually do it in a safe, step-by-step controlled way. You can actually understand the problem in advance and that’s all the difference.”
+
+---
+
+## Introducing Wee Archie
+
+::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_vrq8zch9&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_jh4xeojf" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Introducing_Wee_Archie_hd"}
+::::
+
+:::solution{title="Transcript"}
+0:31 - So ARCHER is the UK National Supercomputing Service that we house here in Edinburgh as part of the University. And it’s funded by the UK Research Councils. And it can do many, many calculations per second. Actually, if you took all the people on the planet, then it would be the equivalent of all these people doing many, many, many calculations per second.
+
+1:17 - It’s absolutely crucially important for simulation, things like simulation of weather, simulation of the cosmology, things like cancer analysis, cancer research, all sorts of different applications that maybe you wouldn’t have foreseen.
+
+1:47 - There’s a real keen push to encourage the next generation of scientists to get into science, and to get into computing in general.
+:::
+
+In your mind, you probably already have an image of a supercomputer as a massive black box. Well, they usually are just that - dull-looking cabinets connected by a multitude of cables. To make things more interesting, we introduce Wee ARCHIE!
+
+Wee ARCHIE is a suitcase-sized supercomputer designed and built to explain what a supercomputer is.
+
+![Photo of Wee ARCHIE](images/181107_ARCHER_30.jpg)
+
+We will return to Wee ARCHIE, and its big brother ARCHER, later in the course to explain the hardware details of supercomputers.
+
+You can find instructions on how to configure your very own Raspberry Pi cluster [here](https://epcced.github.io/wee_archlet/).
+
+---
+
+## Terminology Recap
+
+::::challenge{id=sc_intro.1 title="Supercomputing intro Q1"}
+Performing computations in _____
+means carrying out many calculations simultaneously.
+
+:::solution
+Parallel
+:::
+::::
+
+::::challenge{id=sc_intro.2 title="Supercomputing intro Q2"}
+The term HPC stands for ____.
+
+:::solution
+High Performance Computing
+:::
+::::
+
+::::challenge{id=sc_intro.3 title="Supercomputing intro Q3"}
+The process of running a virtual experiment is called ____.
+
+:::solution
+Computer simulation
+:::
+::::
+
+::::challenge{id=sc_intro.4 title="Supercomputing intro Q4"}
+The term number-crunching refers to large-scale ____ ____.
+
+:::solution
+Numerical computations
+:::
+::::
+
+::::challenge{id=sc_intro.5 title="Supercomputing intro Q5"}
+The typical power consumption of a supercomputer is in the order of several
+____.
+
+:::solution
+Megawatts
+:::
+::::
diff --git a/high_performance_computing/supercomputing/02_understanding_supercomputing.md b/high_performance_computing/supercomputing/02_understanding_supercomputing.md
new file mode 100644
index 00000000..d09eedc3
--- /dev/null
+++ b/high_performance_computing/supercomputing/02_understanding_supercomputing.md
@@ -0,0 +1,545 @@
+---
+name: Understanding Supercomputing
+dependsOn: [
+  high_performance_computing.supercomputing.01_intro
+]
+tags: [foundation]
+attribution: 
+    - citation: >
+        "Introduction to HPC" course by EPCC.
+        This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC.
+      url: https://epcced.github.io/Intro-to-HPC/
+      image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg
+      license: CC-BY-4.0
+---
+
+![Graphical image of a computer processor](images/processor-2217771_640.jpg)
+*Image courtesy of [ColiN00B](https://pixabay.com/users/colin00b-346653/) from [Pixabay](https://pixabay.com)*
+
+## Understanding Supercomputing - Processors
+
+So what are supercomputers made of? Are the building components really so different from personal computers? And what determines how fast a supercomputer is?
+
+In this step, we start to outline the answers to these questions. We will go into a lot more detail in a future module but for now we will cover enough of the basics for you to be able to understand the characteristics of the supercomputers in the [Top500](https://www.top500.org/lists/top500/2023/11/) list (the linked list is from November 2023).
+
+When we talk about a processor, we mean the central processing unit (CPU) in a computer, which is sometimes considered to be the computer’s brain. The CPU carries out the instructions of computer programs; the terms CPU and processor are generally used interchangeably.
+A modern CPU is composed of a collection of several separate processing units; we call each independent processing unit a CPU-core - some people just use the term core.
+
+A modern domestic device (e.g. a laptop, mobile phone or iPad) will usually have a few CPU-cores (perhaps two or four), while a supercomputer has tens or hundreds of thousands of CPU-cores. As mentioned before, a supercomputer gets its power from all these CPU-cores working together at the same time - working in parallel. Conversely, the mode of operation you are familiar with from everyday computing, in which a single CPU-core is doing a single computation, is called serial computing.
+
+It may surprise you to learn that supercomputers are built using the same basic elements that you normally find in your desktop, such as processors, memory and disk. The difference is largely a matter of scale.
The reason is quite simple: the cost of developing new hardware is measured in billions of euros, and the market for consumer products is vastly larger than that for supercomputing, so the most advanced technology you can find is actually what you find in general-purpose computers.
+
+Interestingly, the same approach is used for computer graphics - the graphics processor (or GPU) in a home games console will have hundreds of cores. Special-purpose processors like GPUs are now being used to increase the power of supercomputers - in this context they are called accelerators.
+
+![Image denoting the more powerful, fewer cores of a CPU versus the smaller, more numerous cores of a GPU](images/large_hero_8408f33c-87f5-4061-aec7-42ef976e83fd.webp)
+*A typical CPU has a small number of powerful, general-purpose cores; a GPU has many more specialised cores. © NVIDIA*
+
+To use all of these CPU-cores together means they must be able to talk to each other. In a supercomputer, connecting very large numbers of CPU-cores together requires a communications network, which is called the interconnect in the jargon of the field. A large parallel supercomputer may also be called a Massively Parallel Processor or MPP.
+
+:::callout{variant="discussion"}
+Does it surprise you to learn that games console components and other general-purpose hardware are also used in supercomputers?
+:::
+
+---
+
+![Photo of tape measures with varying units of measurement](images/william-warby-WahfNoqbYnM-unsplash.jpg)
+*Image courtesy of [Willian Warby](https://unsplash.com/@wwarby) from [Unsplash](https://unsplash.com)*
+
+## Understanding Supercomputing - Performance
+
+In supercomputing, we are normally interested in numerical computations: what is the answer to 0.234 + 3.456, or 1.4567 x 2.6734? Computers store numbers like these in floating-point format, so they are called floating-point numbers. A single instruction like addition or multiplication is called an operation, so we measure the speed of supercomputers in terms of floating-point operations per second or Flop/s, which is sometimes written and said more simply as Flops.
+
+So how many Flops can a modern CPU-core perform? Let’s take a high-end processor like the AMD EPYC Zen2 (Rome) 7F32 CPU (which happens to be the processor used in the Snellius system at [SURFsara](https://www.surf.nl/en/services/snellius-the-national-supercomputer), the Netherlands National Supercomputer). The way a processor is normally marketed is to quote its clock frequency, which here is 3.7 GHz. This is the rate at which each CPU-core operates. Clock speed is expressed in cycles per second (Hertz), and the prefix Giga means a billion (a thousand million), so this CPU-core is working at the almost mind-blowing rate of 3.7 billion cycles per second. Under favourable circumstances, an AMD EPYC CPU-core can perform 16 floating-point operations per cycle, which means each CPU-core can perform 16 x 3.7 billion = 59.2 billion floating-point operations per second.
+
+So, the peak performance of one of our CPU-cores is 59.2 GFlops.
+
+We say peak performance because this is the absolute maximum, never-to-be-exceeded figure, which is very hard to achieve in practice. However, it’s a very useful figure to know for reference.
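+
+Putting this calculation into a single formula:
+
+~~~math
+peak performance per CPU-core = operations per cycle x clock frequency
+                              = 16 x 3.7 GHz
+                              = 59.2 GFlops
+~~~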
+
+Clearly, with many thousands of CPU-cores we’re going to encounter some big numbers, so here is a table summarising the standard abbreviations you’ll come across:
+
+|Ops per second | Scientific Notation | Prefix | Unit |
+|---------------------------|---------------------|---------|--------|
+| 1 000 | 10^3 | Kilo | Kflops |
+| 1 000 000 | 10^6 | Mega | Mflops |
+| 1 000 000 000 | 10^9 | Giga | Gflops |
+| 1 000 000 000 000 | 10^12 | Tera | Tflops |
+| 1 000 000 000 000 000 | 10^15 | Peta | Pflops |
+| 1 000 000 000 000 000 000 | 10^18 | Exa | Eflops |
+
+:::callout{variant="info"}
+A quick word of warning here: when talking about performance measures such as Gflops, we are talking about powers of ten. For other aspects such as memory, it is more natural to work in powers of 2 - computers are binary machines after all.
+:::
+
+It is something of a coincidence that 2^10 = 1024 is very close to 10^3 = 1000, so we are often sloppy in the terminology. However, we should really be clear if a kiloByte (KByte) is 1000 Bytes or 1024 Bytes. By KByte, people usually mean 1024 Bytes but, strictly speaking, a KByte is actually 1000 Bytes. The technically correct terminology for 1024 Bytes is KibiByte, written as KiByte.
+
+This might seem like an academic point since, for a KByte, the difference is only about 2%. However, the difference between a PByte and a PiByte is more than 12% (since 2^50 / 10^15 ≈ 1.126). If your supercomputer salesman quotes you a price for a PetaByte of disk, make sure you know exactly how much storage you’re getting!
+
+---
+
+![Photo of someone holding a stopwatch](images/veri-ivanova-p3Pj7jOYvnM-unsplash.jpg)
+*Image courtesy of [Veri Ivanova](https://unsplash.com/@veri_ivanova) from [Unsplash](https://unsplash.com)*
+
+## Understanding Supercomputing - Benchmarking
+
+If we are going to compare performance we need some standard measure. When buying a car, we don’t just take the manufacturer’s word for how good the car is - we take it for a test drive and see how well it performs in practice. For cars we might be interested in top speed or fuel economy, and it turns out that we are interested in the equivalent quantities for supercomputers: maximum floating-point performance and power consumption.
+
+In the supercomputer world, two parameters are frequently used to measure the performance of a system, Rpeak and Rmax: Rpeak is the theoretical peak performance, which is just the peak performance of a CPU-core multiplied by the number of CPU-cores; Rmax is the measured maximum performance. To measure supercomputer performance, the equivalent of a test drive is how fast it can run a standard program; the process of measuring performance is called benchmarking. Both are expressed in units of Flops.
+
+The standard benchmark for supercomputing is the High Performance LINPACK (HPL) benchmark.
+LINPACK involves running a standard mathematical computation called an LU factorisation on a very large square matrix of size Nmax by Nmax.
+The matrix is just a table of floating-point numbers, and Nmax is chosen in order to fill the machine's available memory and to maximise performance.
+
+[The Top500](https://top500.org/lists/top500/2024/11/) uses these values to compare the performance of the world's fastest machines.
:::callout{variant="info"}
+The Nmax used to achieve the measured performance is not typically disclosed, but in June 2018 the world’s fastest supercomputer (the Summit system at the Oak Ridge National Laboratory in the USA, since decommissioned) used an Nmax of over 16 million - imagine working with a spreadsheet with over 16 million rows and 16 million columns!
+:::
+
+Supercomputers are not only expensive to purchase, but they are also extremely expensive to run due to their extreme power consumption. A typical supercomputer consumes multiple megawatts, and this power is turned into heat which we have to get rid of via external cooling.
+
+For example, the Tianhe-2 system has a peak power consumption of 18.5 megawatts and, including external cooling, the system drew an aggregate of 24 megawatts when running the LINPACK benchmark. If a kilowatt of power costs 10 cents per hour, running Tianhe-2 at this level costs 2400 euros per hour, which is in excess of 21 million euros per year.
+
+Rpeak and Rmax are what the Top500 uses to rank supercomputer performance, but the electrical power consumption is also quoted, which leads to the creation of another list - the [Green 500](https://www.top500.org/lists/green500/2022/06/) (June 2022) - which ranks supercomputers on their fuel economy in terms of Flops per Watt. Despite its massive power bill, the top-ranked system on the Top500 (Frontier) is quite power-efficient, holding 2nd position on the Green 500.
+
+:::callout{variant="discussion"}
+Take a look at the [Top500](https://www.top500.org/lists/top500/2022/06/) list for June 2022 - does the fact that the top supercomputer for performance is also at the top for power efficiency surprise you? What could be the reason for this?
+:::
+
+---
+
+![Photo of magnifying glass used on laptop keyboard](images/agence-olloweb-d9ILr-dbEdg-unsplash.jpg)
+*Image courtesy of [Agence Olloweb](https://unsplash.com/@olloweb) from [Unsplash](https://unsplash.com/)*
+
+## HPC System Design
+
+Now that you understand the basic hardware of supercomputers, you might be wondering what a complete system looks like. Let’s have a look at the high-level architecture of a supercomputer, with emphasis on how it differs from a desktop machine.
+
+The figure below shows the building blocks of a complete supercomputer system and how they are connected together. Most systems in the world will look like this at an abstract level, so understanding this will give you a good model for how all supercomputers are put together.
+
+![Diagram of general supercomputer architecture and how its components relate to a user's own computer](images/large_hero_a3db6ae7-8a0e-4fe4-b2da-302380de963a.png)
+
+Let’s go through the figure step by step.
+
+### Interactive Nodes
+
+As a user of a supercomputer, you will get some login credentials, for example a username and password. Using these you can access one of the interactive nodes (sometimes called login nodes). You don’t have to travel to the supercomputer centre where these interactive nodes are located - you just connect from your desktop machine over the internet.
+
+Since a supercomputer system typically has many hundreds of users, there are normally several interactive nodes which share the workloads, i.e. to make sure that all the users are not trying to access one single machine at the same time. This is where you do all your everyday tasks such as developing computer programs or visualising results.
+
+### Batch System
+
+Once logged into an interactive node, you can now run large computations on the supercomputer. It is very important to understand that you do not directly access the CPU-cores that do the hard work. Supercomputers operate in batch mode - you submit a job (including everything needed to run your simulation) to a queue and it is run some time in the future. This is done to ensure that the whole system is utilised as fully as possible.
+
+The user creates a small file, referred to as a job script, which specifies all the parameters of the computation such as which program is to be run, the number of CPU-cores required, the expected duration of the job, etc. This is then submitted to the batch system. Resources will be allocated when available and a user will be guaranteed exclusive access to all the CPU-cores they are assigned. This prevents other processes from interfering with a job and allows it to achieve the best performance.
+
+Individual users will not have access to the full resources of the supercomputer. Instead, they will be allocated resources according to the specifications in their job script and the limits defined by their project's funding or allocation agreements.
+
+It is the job of the batch scheduler to look at all the jobs in the queue and decide which jobs to run based on, for example, their expected execution time and how many CPU-cores they require. At any one time, a single supercomputer could be running several parallel jobs with hundreds waiting in the queue. Each job will be allocated a separate portion of the whole supercomputer. A good batch system will keep the supercomputer full of jobs all the time, but not leave individual jobs in the queue for too long.
+
+### Compute Nodes
+
+The compute nodes are at the core of the system and the part that we’ve concentrated on for most of this module. They contain the resources to execute user jobs - the thousands of CPU-cores operating in parallel that give a supercomputer its power. They are connected by fast interconnect, so that the communication time between CPU-cores impacts program run times as little as possible.
+
+### Storage
+
+Although the compute nodes may have disks attached to them, they are only used for temporary storage while a job is running. There will be some large external storage, comprising thousands of disks, to store the input and output files for each computation. This is connected to the compute nodes using fast interconnect so that computations which have large amounts of data as input or output don’t spend too much time accessing their files.
The main storage area will also be accessible from the interactive nodes, e.g. so you can visualise your results.

---

## Practical 1: Setting up Prerequisites

To undertake the practical sessions in this course you'll need one of the following:

- A machine with OpenMP and MPI installed (see links to instructions below), although you won't be able to run the Slurm job scheduler examples unless you have access to ARCHER2, which these examples assume.
- Access to [ARCHER2](https://www.archer2.ac.uk/), which has OpenMP and MPI preinstalled and which the Slurm job submission examples assume. These examples can be made to work on other HPC infrastructures, such as [DiRAC](https://dirac.ac.uk/), but due to differences in how these systems are configured, prior knowledge of job scripts and the correct parameters to use for those systems will be required.

### Local machine installation

#### Installing OpenMP on your machine

In order to make use of OpenMP, it's usually a matter of ensuring you have the [right compiler installed on your system](https://www.openmp.org/resources/openmp-compilers-tools/), such as gcc.

#### Installing MPI on your machine

To install a popular implementation of MPI called [OpenMPI](https://www.open-mpi.org/) on a desktop or laptop:

- **Linux:** Most distributions have OpenMPI available in their package manager, e.g.

  ```bash
  sudo apt install openmpi-bin libopenmpi-dev
  ```

- **Mac:** The MacPorts and Homebrew package managers both have OpenMPI available:

  ```bash
  brew install openmpi
  # or
  port install openmpi
  ```

- **Windows:** Whilst you *can* build OpenMPI yourself on Windows, it's generally easier to use the [**Windows Subsystem for Linux**](https://learn.microsoft.com/en-us/windows/wsl/install).

A local installation is useful when you're writing code or testing it on a smaller scale, but you will need to check that you're installing a version of OpenMPI that's also available on whichever HPC cluster you're likely to scale up to.
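Once installed, you can do a quick sanity check that OpenMPI is working - this assumes the OpenMPI compiler wrapper and launcher are on your `PATH`:

```bash
mpicc --version    # reports the underlying C compiler the MPI wrapper will use
mpirun --version   # reports the version of the OpenMPI runtime
```

If both commands print version information, your installation is ready to use.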

### Using ARCHER2

The other option, if you already have an account on it, is to use ARCHER2, which has all the software pre-installed.

### Installing an SSH client

To connect to ARCHER2 from our local laptop or PC you'll need an SSH client, which allows us to connect to and use a command line interface on a remote computer as if it were our own.
Please follow the directions below to install an SSH client for your system if you do not already have one.

#### Windows

Modern versions of Windows have SSH available in PowerShell. First run PowerShell, then test whether SSH is available by typing `ssh --help`. If it is installed, you should see some useful output; if it is not, you will get an error. If SSH is not available in PowerShell, you should install MobaXterm from [http://mobaxterm.mobatek.net](http://mobaxterm.mobatek.net). You will want to get the Home edition (Installer edition). However, if PowerShell works, you do not need this.

#### MacOS

macOS comes with SSH pre-installed, so you should not need to install anything. Use your “Terminal” app.

#### Linux

Linux users do not need to install anything, you should be set! Use your terminal application.

### Using SSH to connect to ARCHER2

You should now be able to log into ARCHER2 by following the login instructions in the [ARCHER2 documentation](https://docs.archer2.ac.uk/user-guide/connecting/#ssh-clients), e.g.

```bash
ssh username@login.archer2.ac.uk
```

You will also need to use a means of secondary authentication in order to gain access, e.g. the authenticator app you used during ARCHER2 registration.
Then you should see a welcome message followed by a Bash prompt, e.g.:

```bash
username@ln01:~>
```

::::callout
When using ARCHER2, be sure to `cd` to the `/work` filesystem, i.e.:

```bash
cd /work/[project code]/[group code]/[username]
```

You should have been given `[project code]` and `[group code]` at the start of this course.

The `/work` filesystem is a high performance parallel file system that can be accessed by both the frontend login nodes and the compute nodes. All jobs on ARCHER2 should be run from the `/work` filesystem, since ARCHER2 compute nodes cannot access the `/home` filesystem at all and jobs run from there will fail with an error.

For more information, see the ARCHER2 documentation: [https://docs.archer2.ac.uk/user-guide/io/#using-the-archer2-file-systems](https://docs.archer2.ac.uk/user-guide/io/#using-the-archer2-file-systems).
::::

## Practical 2: Compiling and running our first program

This example aims to get you used to the command line environment of a high performance computer, by compiling example code and submitting jobs to the batch system while learning about the hardware of an HPC system.

### Compiling an Example Code

First, we'll need to create an example code to compile.

::::callout

## Recap: Using an Editor from within the Shell

When working on an HPC system we will frequently need to create or edit text files.

Some of the more common editors are:

- `vi`: a very basic text editor developed during the 1970s/80s. It differs from most editors - and is commonly found confusing because of it - in that it has two modes of operation: command and insert. In command mode, you are able to pass instructions to the editor, such as dealing with files (save, load, or insert a file), and editing (cut, copy, and paste text). However, you can't insert new characters. For that the editor needs to be in insert mode, which allows you to type into a text document. You can enter insert mode by typing `i`. To return to command mode, press `Escape`.
- `vim`: built on `vi`, `vim` goes much further, adding features like undo/redo, autocompletion, search and replace, and syntax highlighting (which uses different coloured text to distinguish different elements of a programming language). It mainly uses the same command/insert modes as `vi`, which can take some getting used to, but is developed as a power user's editing tool that is highly configurable.
- `emacs`: also highly configurable and extensible, `emacs` has a less steep learning curve than `vim` but offers features common to many modern code editors. It readily integrates with debuggers, which is great if you need to find problems in your code as it runs.
- `nano`: a lightweight editor that allows you to type text directly by default, and gives you access to extra functionality such as search/replace or saving files by using `Ctrl` with other keys.

These are all text-based editors, in that they do not use a graphical user interface. They simply appear in the terminal, which has a key advantage, particularly for HPC systems like ARCHER2 or DiRAC: they can be used everywhere there is a terminal, such as over an SSH connection.

One of the common pitfalls of using Linux is that the `vi` editor is commonly set as the default editor.
If you find yourself in `vi`, you can exit by pressing `Escape` to return to command mode, then typing `:q!` followed by `Enter`, which quits `vi` without saving the file.

We'll use `nano`, a lightweight editor that's accessible from practically any installation of Linux.

If following this on your own machine (e.g. not via ARCHER2), feel free to use any editor you like.

::::

Whilst in your account directory within the `/work` filesystem, create a new file called `helloWorldSerial.c` using an editor, e.g.

```bash
nano helloWorldSerial.c
```

And enter the following contents:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <limits.h>

int main(int argc, char* argv[])
{
    // Check input argument
    if(argc != 2)
    {
        printf("Required one argument `name`.\n");
        return 1;
    }

    // Receive argument (+1 to leave room for the terminating null character)
    char* iname = (char *)malloc(strlen(argv[1]) + 1);
    strcpy(iname, argv[1]);

    // Get the name of the node we are running on
    char hostname[HOST_NAME_MAX];
    gethostname(hostname, HOST_NAME_MAX);

    // Hello World message
    printf("Hello World!\n");

    // Message from the node to the user
    printf("Hello %s, this is %s.\n", iname, hostname);

    // Release memory holding command line argument
    free(iname);
}
```

This C code will accept a single argument (for example, your name), and report which node it is running from.
To try this example yourself you will first need to compile the example code.

If the file that contains the above code is called `helloWorldSerial.c`, then to compile and run this directly on the ARCHER2 login node use:

```bash
cc helloWorldSerial.c -o hello-SER
./hello-SER yourname
```

If you're running this on your own machine you may need to replace `cc` with `gcc` to use the right compiler.

And you should see:

```output
Hello World!
Hello yourname, this is ln01.
```

::::callout{variant="tip"}

## Be Kind to the Login Nodes

It’s worth remembering that if you're using an HPC infrastructure the login node is often very busy managing lots of users logged in, creating and editing files, compiling software, and submitting jobs. As such, although running quick jobs directly on a login node is ok, for example to compile and quickly test some code, it’s not intended for running computationally intensive jobs, and these should always be submitted for execution on a compute node, which we'll look at shortly.

The login node is shared with all other users and your actions could cause issues for other people, so think carefully about the potential implications of issuing commands that may use large amounts of resource.
::::

### Submitting our First Job

**To be able to run the job submission examples in this segment, you'll need to either have access to ARCHER2, or an HPC infrastructure running the Slurm job scheduler and knowledge of how to configure job scripts for submission.**

To take advantage of the compute nodes, we need the batch scheduler to queue our code to run on a compute node. The scheduler used in this lesson is Slurm. Although Slurm is not used everywhere, it's very popular, and the process of specifying and running jobs is quite similar regardless of what scheduling software is being used.

Schedulers such as Slurm tend to make use of submission scripts, typically written in Bash, which define not only what to run but also, critically, *what* the job is and *how* to run it.

Place this Bash code into a file called `Hello_Serial_Slurm.sh`, replacing `YOUR_NAME_HERE` with your own input and `[project code]` with your supplied project code.

```bash
#!/bin/bash

#SBATCH --job-name=Hello-SER
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:01:00

# Replace [project code] below with your project code (e.g. t01)
#SBATCH --account=[project code]
#SBATCH --partition=standard
#SBATCH --qos=standard

./hello-SER YOUR_NAME_HERE
```

If you run this script directly (e.g. using `bash Hello_Serial_Slurm.sh`) you should see the output as before.
But we have also defined some scheduler directives as comments (prefixed by `#SBATCH`) in our script, which are interpreted by the job scheduler and indicate:

- `--job-name` - a name for the job, which can be anything
- `--nodes`, `--tasks-per-node`, `--cpus-per-task` - the number of compute nodes we wish to request for the job, the number of tasks (or processes) we wish to run per node, and the number of CPUs we wish to use per task (in this case, a single process on 1 CPU on 1 node)
- `--time` - the expected overall run time (or *wall time*) for the job, in `hours:minutes:seconds`. If our job goes over this, the scheduler may terminate the job!
- `--account`, `--partition` - the account we wish to charge for this job, and the partition, or queue, we wish to submit the job to. These vary from Slurm system to system, depending on how they are configured
- `--qos` - the requested Quality of Service, or priority, for this job. Again, this may vary between different Slurm HPC systems

To submit this job, run:

```bash
sbatch Hello_Serial_Slurm.sh
```

A unique job identifier is returned:

```output
Submitted batch job 5843243
```

Using this identifier, we can check the status of the job, e.g.:

```bash
squeue --job 5843243
```

```output
   JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 5843243  standard Hello-SE username PD   0:00      1 (Priority)
```

Eventually, we should see the job's state (`ST`) change to `R` to indicate it's running, along with the node it's running on indicated under `NODELIST`, and the time it's been running so far:

```output
   JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 5843243  standard Hello-SE username  R   0:01      1 nid003218
```

And we may even see it enter the completing (`CG`) state as the job finishes. Once complete, the job will disappear from this list.

We should now see one file returned as output, named `slurm-[job id].out`, containing the name of the node it ran on.

::::challenge{id=understanding_sc_pr.1 title="Time's Up"}
Resource requests are typically binding, and if you exceed them, your job will be killed.
Let’s see this in action, using wall time as an example.

Add a `sleep 240` at the end of the submission script, which will cause the script (and hence the job) to wait for 4 minutes, exceeding the requested 1 minute. Resubmit the job, and continue to monitor it using `squeue`. What happens?

:::solution

You should see the following in the job's Slurm output log file, indicating it was terminated:

```output
Hello World!
slurmstepd: error: *** JOB 5851929 ON nid001099 CANCELLED AT 2024-03-07T09:15:27 DUE TO TIME LIMIT ***
```

You may notice that the job is cancelled perhaps around 30 seconds *after* its requested time of 1 minute, so there is some leeway, but not much!
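On systems where Slurm accounting is enabled (as it is on ARCHER2), you can also confirm exactly how long the job ran before being cancelled by querying the completed job with `sacct` - the job ID here is the one from the example output above, so substitute your own:

```bash
sacct -j 5851929 --format=JobID,JobName,Elapsed,Timelimit,State
```

The `Elapsed` column should show a little over the one minute requested in `Timelimit`, with a `State` of `TIMEOUT`.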
:::

::::

---

## What Supercomputing is not

One of the main aims of this course is to de-mystify the whole area of supercomputing.

Although supercomputers have computational power far outstripping your desktop PC, they are built from the same basic hardware. Although we use special software techniques to enable the many CPU-cores in a supercomputer to work together in parallel, each individual processor is basically operating in exactly the same way as the processor in your laptop.

However, you may have heard of ongoing developments that take more unconventional approaches:

- Quantum Computers are built from hardware that is radically different from the mainstream.
- Artificial Intelligence tackles problems in a completely different way from the computer software we run in traditional computational science.

We will touch on these alternative approaches towards the end of the foundational modules. In the meantime, feel free to raise any questions you have about how they relate to supercomputing by commenting in any of the discussion steps.

---

## Supercomputing Terminology

::::challenge{id=understanding_sc.1 title="Understanding Supercomputing Q1"}
If someone quotes the performance of a computer in terms of “Flops”, what do they mean?

A) the total number of floating-point operations needed for a computation

B) the number of floating-point operations performed per second

C) the clock frequency of the CPU-cores

D) the total memory

:::solution
B) - it’s our basic measure of supercomputer speed.
:::
::::

::::challenge{id=understanding_sc.2 title="Understanding Supercomputing Q2"}
What is a benchmark in computing?

A) a hardware component of a computer system

B) a computer program that produces scientific results

C) a computer program specifically designed to assess performance

D) the peak performance of a computer

:::solution
C) - it’s the equivalent of a standard consumer test for a product, like “how many litres per second can my electric shower deliver water at 40 degrees centigrade?”
:::
::::

::::challenge{id=understanding_sc.3 title="Understanding Supercomputing Q3"}
Which one below represents the right order of number of Flops (from small to large)?

A) Kflops Gflops Mflops Tflops Eflops Pflops

B) Mflops Gflops Kflops Pflops Tflops

C) Kflops Mflops Gflops Tflops Pflops Eflops

D) Gflops Kflops Tflops Mflops Pflops Eflops

:::solution
C) - computers started in the Kiloflops range and we are now approaching Exaflops.
:::
::::

::::challenge{id=understanding_sc.4 title="Understanding Supercomputing Q4"}
What does clock speed typically refer to?

A) the speed at which a CPU-core executes instructions

B) memory access speed

C) the I/O speed of hard disks

D) the performance achieved using the LINPACK benchmark

:::solution
A) - it’s basically the heartbeat of the processor.
:::
::::

::::challenge{id=understanding_sc.5 title="Understanding Supercomputing Q5"}
What processor technologies are used to build supercomputers (compared to, for example, a desktop PC)?

A) a special CPU operating at a superfast clock speed

B) a large number of special CPUs operating at very fast speeds

C) a large number of standard CPUs with specially boosted clock speeds

D) a very large number of standard CPUs at standard clock speeds

:::solution
D) - we take standard desktop technology but use vast numbers of CPUs to increase the computational power.
:::
::::

::::challenge{id=understanding_sc.6 title="Understanding Supercomputing Q6"}
A parallel computer has more than one CPU-core. Which of the following are examples of parallel computers?

A) a modern laptop

B) a modern mobile phone

C) a supercomputer

D) a home games console

E) all of the above

:::solution
E) - That’s right - parallelism is ubiquitous across almost all modern computer devices and is not restricted to just a few high-end supercomputers.
:::
::::

---
diff --git a/high_performance_computing/supercomputing/03_supercomputing_world.md b/high_performance_computing/supercomputing/03_supercomputing_world.md
new file mode 100644
index 00000000..67e3d96e
--- /dev/null
+++ b/high_performance_computing/supercomputing/03_supercomputing_world.md
@@ -0,0 +1,187 @@
---
name: Supercomputing World
dependsOn: [
    high_performance_computing.supercomputing.02_understanding_supercomputing
]
tags: [foundation]
attribution:
  - citation: >
      "Introduction to HPC" course by EPCC.
      This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC.
    url: https://epcced.github.io/Intro-to-HPC/
    image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg
    license: CC-BY-4.0
---

![Computer circuit board looking like a city](images/bert-b-rhNff6hB41s-unsplash.jpg)
*Image courtesy of [bert b](https://unsplash.com/@bertsz) from [Unsplash](https://unsplash.com)*

## Current Trends and Moore's Law

Over recent decades, computers have become more and more powerful. New computing devices appear on the market at a tremendous rate, and if you like to always have the fastest available personal computer, you need to visit your local computer store very frequently! But how did computers become so powerful, and are there any fundamental limits to how fast a computer can be? To answer these questions, we need to understand what CPUs are made up of.

As mentioned previously, our measure of the performance of a CPU-core is based on the number of floating-point operations it can carry out per second, which in turn depends on the clock speed. CPUs are built from Integrated Circuits that consist of very large numbers of transistors. These transistors are connected by extremely small conducting wires which can carry an electric current. By controlling whether or not an electric current goes through certain conducting lines, we are able to encode information and perform calculations.

Most transistors nowadays are created with silicon, a type of semiconductor. A semiconductor is a material that can act as both a conductor (a material that permits the flow of electrons) and an insulator (one that inhibits electron flow), which are exactly the characteristics we want a transistor to have. The maximum physical size of a processor chip is limited to a few square centimetres, so to get a more complicated and powerful processor we need to make the transistors smaller.

In 1965, the co-founder of Fairchild Semiconductor and Intel, Gordon E. Moore, made an observation and forecast. He noticed that manufacturing processes were continually improving to the extent that:

> "The number of transistors that could be placed on an integrated circuit was doubling approximately every two years."

He predicted that this would continue into the future. This observation is named after him, and is called Moore’s law.
Although it is really a forecast and not a fundamental law of nature, the prediction has been remarkably accurate for over 50 years.

The first CPU from Intel (the i4004) introduced in 1971 had 2,000 transistors, and Intel’s Core i7 CPU introduced in 2012 had 3 billion transistors. This is in excess of a million times more transistors, but is actually in line with what you would expect from the exponential growth of Moore’s law over around 40 years.

It turns out that, as we pack our transistors closer and closer together, every time we double the density of transistors we can double the frequency. So, although Moore’s law is actually a prediction about the density of transistors, for the first four decades it also meant that:

> "Every two years the CPU clock frequency doubled."

We saw clock speeds steadily increasing, finally breaking the GHz barrier (a billion cycles per second) in the early twenty-first century. But then this growth stopped, and clock speeds have remained at a few GHz for more than a decade. So did Moore’s law stop?

The problem is that increasing clock frequency comes at a cost: it takes more power. Above a few GHz, our processors become too power hungry and too hot to use in everyday devices. But Moore’s law continues, so rather than increasing the frequency we put more processors on the same physical chip. We call these CPU-cores, and we now have multicore processors. The image below shows a schematic of a modern processor (Intel’s Core i7) with four CPU-cores (four pinkish rectangles).

![Rendering of Intel Core i7 CPU](images/large_hero_cafacb0d-898b-44b4-9290-5c25c211fc03.jpg)
*A modern quad-core CPU - Intel’s Core i7 © Intel*

So for the past decade, Moore’s law has meant:

> "Every two years, the number of CPU-cores in a processor now doubles."

In the last few years the process of doubling transistors in integrated circuits has shown signs of slowing down. It’s no longer every two years but perhaps every three years, yet the overall trend still continues.

The current trend in the supercomputing world is that supercomputers are getting bigger, not faster. Since the speed of a single CPU-core cannot be increased any more, having more and more cores working together is the only way to meet our computational requirements.

![Graph of transistor count over time](images/Transistor-Count-over-time.png)
*Image courtesy of Max Roser, Hannah Ritchie [OurWorldinData](https://ourworldindata.org/uploads/2020/11/Transistor-Count-over-time.png) ([CC-BY](https://creativecommons.org/licenses/by/4.0/deed.en))*

:::callout{variant="discussion"}
Your next mobile phone will probably have more CPU-cores than your current one. Do you think this is more useful than a faster CPU? Can you see any problems in making use of all these CPU-cores?
:::

---

## How to calculate the world's yearly income?
+ +::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_tpqo25kw&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_1ms2y7b4" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="worlds_yearly_income_hd"} +:::: + +:::solution{title="Transcript"} +0:11 - Here, we’re going to introduce a very simple example, of calculating the world’s yearly income. It’s a bit of a toy example. But, we’re going to imagine that we have a list of the incomes, the salaries, of everybody in the entire world, and we’re going to add them all up to work out what the total income of the world is. Now this is obviously a very simple example, and slightly artificial. But, we’ll actually use it, and come back to it, in a number of contexts. First of all, it is very useful as a specific example of a real calculation, where we can illustrate how much faster calculations have got, through the developments in processor technology over the years. + +0:46 - We talk about megahertz, gigahertz, and all kinds of things like this, but if we focus on a specific calculation, it will maybe become more obvious what that actually translates into. Also we’ll come back to this example later on in the course, to see how you might implement it in parallel. How would you use multiple CPU cores to do the calculation faster? So, let’s imagine we have a list. It’s going to be a long list of the salaries of everybody in the world, ordered alphabetically by country and person. So, right at the top of list we have a couple of people, Aadel and Aamir Abdali, who live in Afghanistan. Unfortunately, people in Afghanistan, the average wage is quite low. + +1:23 - But they earn just under 1,000 pounds a year each. We carry on down the list, we get a couple of representative people from the UK, Mark and Mary Hensen, a couple who live in the north of England, earning 20,000 or 30,000 pounds each. And there’s me, that’s my salary– oh there seems to be a small error there, you can’t quite see what my salary is, but anyway, I’m on the list as you’d expect. And way down at the bottom– we’re assuming there are seven billion people in the world, which is a reasonable estimate– a couple of brothers from Zimbabwe, Zojj and Zuka Zinyama, who run a successful garage and motor repair business, earning 3,500 and 1,000 pounds each. + +1:58 - So what are we going to do to add up all these numbers? What we’re going to do is, we’re going to write a computer programme. Now, as I’ve said before, this isn’t a programming course, we don’t expect you to be a computer programmer. 
However, this is very simple, and it will serve to illustrate the way that computers work, and allow us to translate these megahertz and gigahertz frequencies we’ve been talking about into actual elapsed time in seconds. So how do we add it up? Well, first of all, we have a running total which we set to zero. And, we start at the top of the list, and we go through the numbers in order.

2:29 - So, we add the income to the total. We go to the next item in the list– the second, the third, the fourth, the fifth. And then we repeat, if we’re not at the end of the list. So if we’re not at the end of the list, but not at the seven billionth entry, we have to go back, add the next income to the total, go to the next item, and then keep repeating, repeating. So we repeat these three steps, three individual steps, we repeat them seven billion times. And once we’ve finished at the end, we can print the total out. So, it’s a very simple prescription, in some kind of pseudo-language of the computer program to add up these incomes.

3:04 - But, the most important point is the core loop, the one that’s executed seven billion times, has three distinct steps in it. And, we’re going to assume that each one of these corresponds to a single instruction issued by the CPU, by the processor. Now, it’s quite a naive assumption, but it’s perfectly OK for our purposes here.
:::

How would you go about calculating the world’s yearly income? Well, it’s simply adding up numbers, but there are many of them… so the real question is: how long would it take?

In this video David describes how to tackle the calculation in serial on his laptop, and in the next step we will discuss how long it might take.

We will use this example in other steps on this course to better illustrate some of the key concepts, so make sure you understand how it works.

---

## Moore's Law in practice

::::iframe{id="kaltura_player" width="100%" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_4zab4d0l&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_1mhy9m0z" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Moores_Law_hd"}
::::

:::solution{title="Transcript"}
0:12 - So the question is, how long does this calculation take? So, what I’m going to do is I’m going to do a bit of history. I’m going to look at the history of processors over about the past five decades to see how long it would have taken to do this calculation on a particular processor from that time. I’m going to focus on Intel as a manufacturer.
Intel is a very successful manufacturer today of processors– they are very prevalent in desktops and laptops– but there are other designers or manufacturers of processors you may have heard of. ARM, for example, who design a lot of the processors which go into mobile devices like mobile phones, and Nvidia who produce graphics processors. + +0:48 - There’s also IBM, International Business Machines, who actually design their own processors, and AMD, who also make their own processors. But I’m going to look at Intel, the main reason is because they have a long history and we can look way back many decades to see how these things have gone on. Now, I’m starting in 1966, which might seem like a strange starting point, but that when I was born. So there I am in 1966. How long does it take me to do this calculation. Now I have 100 billion neurons– that’s a lot of neurons in my brain– but unfortunately these neurons aren’t particularly good at doing mechanical floating-point operations. + +1:21 - So I reckon I could issue operations at the rate of one Hertz. So that’s the frequency one Hertz, which is one operation per second, or one second per operation. Now remember, the core loop had three steps in it. So at one operation per second, or one second per operation, it’s going to take me three seconds to do that loop. And it turns out it would take me 650 years to add up the salaries of all 7 billion people in the world. And that’s really, for one person, it’s just not going to even complete in a lifetime. So it’s a completely untenable calculation. + +1:54 - Now back in 1971, Intel introduced what is now considered to be the first modern integrated microprocessor, the Intel 4004, and it had 2000 transistors. Not many transistors compared to my 100 billion neurons, but these transistors are very good at doing floating-point calculations. The frequency of this machine with about 100 kilohertz– 100 kilohertz is 100,000 operations per second, or one operation every 10 microseconds– a microsecond is a millionth of a second. So every millionth of a second, this chip could do one operation. Now, remember, there are three steps, so it’s going to take 30 microseconds to do each loop. So, the total time to do that seven billion times is two and a half days. + +2:35 - So even over four decades ago, microprocessors were able to translate what would have been a completely unfeasible calculation for one person to do into something which can be done in a few days. Which is a major step forward. But if we fast forward another two decades, we see the impact of Moore’s law. Moore’s law is this exponential increase, this regular doubling of the number of transistors you can get on a microprocessor, and in 1993 Intel released the Pentium, and in the intervening decades the manufacturing technology had increased to such an extent that they could now put 3 million transistors on the Pentium. This extra density of transistors translates into the ability to run these at a faster frequency. + +3:15 - The Pentium had a frequency of 60 megahertz. That’s 60 million operations per second, or the time per operation is 17 nanoseconds, where a nanosecond is a billionth of a second. Now it’s worth thinking about that for a while. A nanosecond is an extremely short period of time. Let’s imagine a ray of light. Light is the fastest thing there can be in the universe and every nanosecond, light only travels about 30 centimetres. So, each time that the Pentium issues an operation, it issues an operation every 17 nanoseconds. 
In those 17 nanoseconds, light has only travelled about five metres, about the width of a room. Now it takes three steps per loop.

3:56 - So the time per loop is 50 nanoseconds, which means, in 1993, it would have taken six minutes to add up the salaries of everybody in the whole world. So, you can see, in a couple of decades a calculation which would have taken several days, has gone to something which you could just set the computer going, and go off and have a cup of coffee and come back, and it would be finished. So, let’s fast forward to 2012, another two decades. And Intel then released the core i7 processor, which had three billion transistors– not three million but three billion. It could operate at a frequency of three gigahertz, which is three billion operations per second.

4:29 - The time per operation is a third of a nanosecond. So each time that the core i7 could issue an operation, every third of a nanosecond, light could only travel 10 centimetres. So, you can see we’re approaching some fairly fundamental physical limits here, in how fast processors can go. The time per loop was one nanosecond. That meant that to add up all seven billion salaries would have taken seven seconds. So, again, in the intervening two decades from 1993 to 2012, we’ve gone from a calculation where you would have to go away and wait to have a cup of coffee for it to finish, to something you could just sit and it would be ready almost instantaneously.

5:05 - So, hopefully that illustrates the impact of Moore’s law. How, from 1971 to 2012, over the period of four decades, this relentless increase in the speed of CPUs has gone from a calculation taking two and a half days, to taking seven seconds.
:::

So how long does it take to add up 7 billion numbers? Well, it depends on what you are using to add them…

We’ve talked about Moore’s Law and what it has meant in terms of processor hardware:

- Every two years the number of transistors that can fit onto an integrated circuit doubles (until 2005);
- Every two years the number of CPU-cores in a processor now doubles.

In this video David uses the income calculation example to illustrate the impact of the first point in practice.

---

![Podium with top three winners](images/winner-1019835_640.jpg)
*Image courtesy of [Peggy_Marco](https://pixabay.com/users/peggy_marco-1553824/) from [Pixabay](https://pixabay.com)*

## Top500 list: Supercomputing hit parade

On the top500.org site, you can find the ranks and details of the 500 most powerful supercomputers in the world.

The first Top500 list was published in June 1993, and since then the Top500 project has published an updated list of supercomputers twice a year:

- in June at the ISC High Performance conference in Germany,
- and in November at the Supercomputing conference in the US.

The site provides news about supercomputers and HPC, and it also has a handy statistics tool which people can use to gain more insight into the Top500 systems.

:::callout{variant="discussion"}
Have a look at the most recent list and briefly comment on the following questions:

- What manufacturers produce the world’s largest supercomputers?
- What types of processors do they use?
- What fraction of peak performance is typically achieved for the LINPACK benchmark?
- Play with the statistics tool on top500.org and think about the trends in current HPC systems. For example, how many supercomputers in the Top500 are classed as being for use by industry?
:::

---

## Terminology recap

::::challenge{id=sc_world.1 title="Supercomputing World Q1"}
Historically, a ____ contained a single "brain" but nowadays it contains multiple ____.

:::solution

1) Processor

2) Cores

:::
::::

::::challenge{id=sc_world.2 title="Supercomputing World Q2"}
The mode of computing in which a single CPU-core is doing a single computation is called ____ computing, as opposed to ____ computing, where all CPU-cores work together at the same time.

:::solution

1) Serial

2) Parallel

:::
::::

::::challenge{id=sc_world.3 title="Supercomputing World Q3"}
The process of evaluating the performance of a supercomputer by running a standard program is called ____. The standard calculation used to compile the top500 list is called ____.

:::solution

1) Benchmarking

2) LINPACK

:::
::::
diff --git a/high_performance_computing/supercomputing/04_practical.md b/high_performance_computing/supercomputing/04_practical.md
new file mode 100644
index 00000000..32f3f862
--- /dev/null
+++ b/high_performance_computing/supercomputing/04_practical.md
@@ -0,0 +1,280 @@
---
name: Image Sharpening using HPC
dependsOn: [
    high_performance_computing.supercomputing.03_supercomputing_world
]
tags: [foundation]
attribution:
  - citation: >
      "Introduction to HPC" course by EPCC.
      This material was originally developed by David Henty, Manos Farsarakis, Weronika Filinger, James Richings, and Stephen Farr at EPCC under funding from EuroCC.
    url: https://epcced.github.io/Intro-to-HPC/
    image: https://epcced.github.io/Intro-to-HPC/_static/epcc_logo.svg
    license: CC-BY-4.0
---

## Part 1: Introduction & Theory

Images can be fuzzy from random noise and blurring.

An image can be sharpened by:

 1. detecting the edges
 2. combining the edges with the original image

These steps are shown in the figure below.

![Image sharpening steps](images/sharpening_diagram.png)
*Image sharpening steps*

---

### Edge detection

Edges can be detected using a Laplacian filter. The Laplacian $L(x,y)$ is the second spatial derivative of the image intensity $I(x,y)$. This means it highlights regions of rapid intensity change, i.e. the edges.

$$
L(x,y) = \frac{\partial^2 I }{\partial x^2} + \frac{\partial^2 I }{\partial y^2}
$$

In practice, the Laplacian also highlights the noise in the image and, therefore, it is sensible to apply smoothing to the image as a first step. Here we apply a Gaussian filter $G(x,y)$, which approximates each pixel as the weighted average of its neighbours.

$$
G(x,y) = \frac{1}{2 \pi \sigma^2} e^{- (x^2+y^2)/(2 \sigma^2)}
$$

The two operations can be combined to give the Laplacian of Gaussian filter $L \circ G(x,y)$.

$$
L \circ G(x,y) = -\frac{1}{\pi \sigma^4} \left( 1 - \frac{x^2+y^2}{2 \sigma^2}\right) e^{- (x^2+y^2)/(2 \sigma^2)}
$$

These two functions $G(x,y)$ and $L \circ G(x,y)$ are graphed below.

!["Gaussian" and "Laplacian of Gaussian" filters](images/Laplacian_of_Gaussian.png)
*"Gaussian" and "Laplacian of Gaussian" filters*

---

### Implementation

To apply the $L \circ G$ filter to an image, it must be turned into a discrete mask: a matrix of size $(2d+1) \times (2d+1)$, where $d$ is an integer. We use $d=8$, so the $L \circ G$ filter is a 17x17 square, which looks like this:

![Laplacian of Gaussian filter as a discrete mask](images/mask.png)
*$L \circ G$ filter as a discrete mask*

To perform the convolution of this filter with the original image, the following operation is performed on each pixel,

$$
\text{edges}(i,j) = \sum_{k=-d}^d \sum_{l=-d}^d \text{image}(i + k, j + l) \times \text{filter}(k,l),
$$

and the sharpened image is then created by adding the edges to the original image, with a scaling factor (see the source code for the full details).
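As a concrete illustration of these two steps, here is a minimal, self-contained sketch in C. It is not the actual implementation from the exercise code (which handles image I/O, boundary pixels, and proper normalisation): a toy 5x5 image and a 3x3 discrete Laplacian stand in for the real image and the 17x17 $L \circ G$ mask, and the scaling factor is purely illustrative.

```c
#include <stdio.h>

#define D  1   /* mask half-width; the exercise code uses d = 8 */
#define NX 5   /* toy image width */
#define NY 5   /* toy image height */

int main(void)
{
    /* A toy image with a bright blob in the centre. */
    double image[NY][NX] = {
        {0, 0, 0, 0, 0},
        {0, 0, 1, 0, 0},
        {0, 1, 2, 1, 0},
        {0, 0, 1, 0, 0},
        {0, 0, 0, 0, 0}
    };

    /* A 3x3 discrete Laplacian, standing in for the 17x17 LoG mask. */
    double filter[2*D+1][2*D+1] = {
        { 0, -1,  0},
        {-1,  4, -1},
        { 0, -1,  0}
    };

    double edges[NY][NX] = {{0}};
    double scale = 0.5;   /* illustrative scaling factor */

    /* Convolution: skip a border of width D so i+k and j+l stay in range;
       the filter array is indexed from 0, hence the shift by D. */
    for (int i = D; i < NY - D; i++) {
        for (int j = D; j < NX - D; j++) {
            for (int k = -D; k <= D; k++) {
                for (int l = -D; l <= D; l++) {
                    edges[i][j] += image[i + k][j + l] * filter[D + k][D + l];
                }
            }
        }
    }

    /* Sharpened image = original + scale * edges. */
    for (int i = 0; i < NY; i++) {
        for (int j = 0; j < NX; j++) {
            printf("%6.2f", image[i][j] + scale * edges[i][j]);
        }
        printf("\n");
    }
    return 0;
}
```

Running it prints the sharpened 5x5 grid, in which the bright centre pixel is accentuated relative to its surroundings - the "add the edges back on" effect described above.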

---

## Part 2: Download, Compile, Run

### Downloading the source code

In this exercise we will be using an image sharpening program which is available in a GitHub repository.
To download the code you will need to clone the repository, e.g.:

```bash
git clone https://github.com/UNIVERSE-HPC/foundation-exercises
```

Alternatively, you might have already cloned the repository as part of another exercise, in which case you can skip this step.

When you clone the repository the output will look similar to this:

```output
Cloning into 'foundation-exercises'...
remote: Enumerating objects: 131, done.
remote: Counting objects: 100% (131/131), done.
remote: Compressing objects: 100% (69/69), done.
remote: Total 131 (delta 56), reused 127 (delta 55), pack-reused 0
Receiving objects: 100% (131/131), 366.62 KiB | 3.05 MiB/s, done.
Resolving deltas: 100% (56/56), done.
```

You will now have a folder called `foundation-exercises`. Change directory into where the sharpening code is located and list the contents:

```bash
cd foundation-exercises/sharpen
ls
```

Output:

```output
C-MPI C-OMP C-SER F-MPI F-OMP F-SER
```

There are several versions of the code: a serial version and a number of parallel versions, both for the C language and for Fortran. Initially we will be looking at the serial version located in the `C-SER` folder.

### Compiling the source code

We will compile the serial version of the source code using a Makefile.

Move into the `C-SER` directory and list the contents.

```bash
cd C-SER
ls
```

Output:

```output
cio.c dosharpen.c filter.c fuzzy.pgm Makefile sharpen.c sharpen.h sharpen.slurm utilities.c utilities.h
```

You will see various code files. The Makefile includes the commands which compile them into an executable program. To use the Makefile, type the `make` command.

```bash
make
```

::::callout

## Which Compiler?

If you're running this on your own machine or another HPC infrastructure,
you may find you get an error message about not being able to find the compiler, e.g.:

```output
cc -O3 -DC_SERIAL_PRACTICAL -c sharpen.c
make: cc: No such file or directory
make: *** [Makefile:32: sharpen.o] Error 127
```

In which case, you can edit the file `Makefile`, which contains the instructions to compile the code, and ensure it uses `gcc` instead.
Edit `Makefile` in an editor of your choice and replace the following line:

```bash
CC= cc
```

with this one:

```bash
CC= gcc
```

and then save the file, and re-run `make`.
In future practical sections you can perform the same change to makefiles if you encounter the same error.
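Alternatively, if you'd rather not edit the file at all, `make` lets you override a Makefile variable from the command line for a single build, since command-line assignments take precedence over assignments inside the Makefile:

```bash
make CC=gcc
```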

::::

Output:

```output
cc -O3 -DC_SERIAL_PRACTICAL -c sharpen.c
cc -O3 -DC_SERIAL_PRACTICAL -c dosharpen.c
cc -O3 -DC_SERIAL_PRACTICAL -c filter.c
cc -O3 -DC_SERIAL_PRACTICAL -c cio.c
cc -O3 -DC_SERIAL_PRACTICAL -c utilities.c
cc -O3 -DC_SERIAL_PRACTICAL -o sharpen sharpen.o dosharpen.o filter.o cio.o utilities.o -lm
```

This should produce an executable file called `sharpen`.

### Running the serial program

We can run the serial program directly on the login node:

```bash
./sharpen
```

You should see output similar to:

```output
Image sharpening code running in serial

Input file is: fuzzy.pgm
Image size is 564 x 770

Using a filter of size 17 x 17

Reading image file: fuzzy.pgm
... done

Starting calculation ...
... finished

Writing output file: sharpened.pgm

... done

Calculation time was 1.378783 seconds
Overall run time was 1.498794 seconds
```

You should find an output file `sharpened.pgm` which contains the sharpened image.

::::callout

## Would you like to know more?

If you're interested in the implementation itself, take a look at the code - particularly `dosharpen.c`.
::::

### Viewing the images

If you're running this on your own machine, take a look at the produced image file named `sharpened.pgm`.

If, however, you're following this on an HPC infrastructure like ARCHER2, to view the sharpened image you'll need to copy the file to your local machine.
Fortunately, we can use the `scp` (secure copy) command to do this over SSH.
On ARCHER2, use the `pwd` command to output your current directory location, then from a terminal on your local machine or laptop use that path with `scp` to copy the original and sharpened files over, e.g.:

```bash
scp username@login.archer2.ac.uk:/work/[project code]/[group code]/[username]/foundation-exercises/sharpen/C-SER/fuzzy.pgm .
scp username@login.archer2.ac.uk:/work/[project code]/[group code]/[username]/foundation-exercises/sharpen/C-SER/sharpened.pgm .
```

Then you should be able to open and view the image files on your local machine.

::::callout

## What about viewing the file *without* copying?

Another way to view this file directly on an HPC resource, without copying it, is by installing an X Window client on your local machine and then logging into the remote machine with X forwarding enabled.
Covering this in detail is beyond the scope of this course, although the ARCHER2 [documentation on connecting](https://docs.archer2.ac.uk/user-guide/connecting/#logging-in) has some information.
::::

::::challenge{id=sc_practical_pr.1 title="Submit sharpen to a compute node"}

## Submitting to a compute node

**To be able to run the job submission examples in this segment, you'll need to either have access to ARCHER2, or an HPC infrastructure running the Slurm job scheduler and knowledge of how to configure job scripts for submission.**

Write a Slurm script to run sharpen on a compute node, and submit it.

:::solution

```bash
#!/bin/bash

#SBATCH --job-name=Sharpen
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:01:00

# Replace [project code] below with your project code (e.g. t01)
#SBATCH --account=[project code]
#SBATCH --partition=standard
#SBATCH --qos=standard

./sharpen
```

When you submit this using `sbatch`, the output of the completed job can be seen in the `slurm-[job id].out` log file.
This will look very similar to the terminal output when it is run directly.
+ +::: + +:::: diff --git a/high_performance_computing/supercomputing/images/181107_ARCHER_30.jpg b/high_performance_computing/supercomputing/images/181107_ARCHER_30.jpg new file mode 100644 index 00000000..0a86e37c Binary files /dev/null and b/high_performance_computing/supercomputing/images/181107_ARCHER_30.jpg differ diff --git a/high_performance_computing/supercomputing/images/Laplacian_of_Gaussian.png b/high_performance_computing/supercomputing/images/Laplacian_of_Gaussian.png new file mode 100644 index 00000000..6645a93e Binary files /dev/null and b/high_performance_computing/supercomputing/images/Laplacian_of_Gaussian.png differ diff --git a/high_performance_computing/supercomputing/images/Transistor-Count-over-time.png b/high_performance_computing/supercomputing/images/Transistor-Count-over-time.png new file mode 100644 index 00000000..8b54f846 Binary files /dev/null and b/high_performance_computing/supercomputing/images/Transistor-Count-over-time.png differ diff --git a/high_performance_computing/supercomputing/images/agence-olloweb-d9ILr-dbEdg-unsplash.jpg b/high_performance_computing/supercomputing/images/agence-olloweb-d9ILr-dbEdg-unsplash.jpg new file mode 100644 index 00000000..133aa773 Binary files /dev/null and b/high_performance_computing/supercomputing/images/agence-olloweb-d9ILr-dbEdg-unsplash.jpg differ diff --git a/high_performance_computing/supercomputing/images/bert-b-rhNff6hB41s-unsplash.jpg b/high_performance_computing/supercomputing/images/bert-b-rhNff6hB41s-unsplash.jpg new file mode 100644 index 00000000..4b1da200 Binary files /dev/null and b/high_performance_computing/supercomputing/images/bert-b-rhNff6hB41s-unsplash.jpg differ diff --git a/high_performance_computing/supercomputing/images/both_images.png b/high_performance_computing/supercomputing/images/both_images.png new file mode 100644 index 00000000..9de76814 Binary files /dev/null and b/high_performance_computing/supercomputing/images/both_images.png differ diff --git a/high_performance_computing/supercomputing/images/large_hero_8408f33c-87f5-4061-aec7-42ef976e83fd.webp b/high_performance_computing/supercomputing/images/large_hero_8408f33c-87f5-4061-aec7-42ef976e83fd.webp new file mode 100644 index 00000000..5eb907d0 Binary files /dev/null and b/high_performance_computing/supercomputing/images/large_hero_8408f33c-87f5-4061-aec7-42ef976e83fd.webp differ diff --git a/high_performance_computing/supercomputing/images/large_hero_9748869f-e962-4c23-a6b6-8216e757920c.png b/high_performance_computing/supercomputing/images/large_hero_9748869f-e962-4c23-a6b6-8216e757920c.png new file mode 100644 index 00000000..451407a5 Binary files /dev/null and b/high_performance_computing/supercomputing/images/large_hero_9748869f-e962-4c23-a6b6-8216e757920c.png differ diff --git a/high_performance_computing/supercomputing/images/large_hero_a3db6ae7-8a0e-4fe4-b2da-302380de963a.png b/high_performance_computing/supercomputing/images/large_hero_a3db6ae7-8a0e-4fe4-b2da-302380de963a.png new file mode 100644 index 00000000..821b04f0 Binary files /dev/null and b/high_performance_computing/supercomputing/images/large_hero_a3db6ae7-8a0e-4fe4-b2da-302380de963a.png differ diff --git a/high_performance_computing/supercomputing/images/large_hero_cafacb0d-898b-44b4-9290-5c25c211fc03.jpg b/high_performance_computing/supercomputing/images/large_hero_cafacb0d-898b-44b4-9290-5c25c211fc03.jpg new file mode 100644 index 00000000..2def6753 Binary files /dev/null and 
b/high_performance_computing/supercomputing/images/large_hero_cafacb0d-898b-44b4-9290-5c25c211fc03.jpg differ
diff --git a/high_performance_computing/supercomputing/images/large_hero_e0df48e4-9b4d-422c-a18f-d7898b9578d8.jpg b/high_performance_computing/supercomputing/images/large_hero_e0df48e4-9b4d-422c-a18f-d7898b9578d8.jpg
new file mode 100644
index 00000000..a7cece6e
Binary files /dev/null and b/high_performance_computing/supercomputing/images/large_hero_e0df48e4-9b4d-422c-a18f-d7898b9578d8.jpg differ
diff --git a/high_performance_computing/supercomputing/images/mask.png b/high_performance_computing/supercomputing/images/mask.png
new file mode 100644
index 00000000..959e50a7
Binary files /dev/null and b/high_performance_computing/supercomputing/images/mask.png differ
diff --git a/high_performance_computing/supercomputing/images/pierre-bamin-5B0IXL2wAQ0-unsplash.jpg b/high_performance_computing/supercomputing/images/pierre-bamin-5B0IXL2wAQ0-unsplash.jpg
new file mode 100644
index 00000000..9d04e16f
Binary files /dev/null and b/high_performance_computing/supercomputing/images/pierre-bamin-5B0IXL2wAQ0-unsplash.jpg differ
diff --git a/high_performance_computing/supercomputing/images/processor-2217771_640.jpg b/high_performance_computing/supercomputing/images/processor-2217771_640.jpg
new file mode 100644
index 00000000..4c7263ba
Binary files /dev/null and b/high_performance_computing/supercomputing/images/processor-2217771_640.jpg differ
diff --git a/high_performance_computing/supercomputing/images/sharpen_speedup.svg b/high_performance_computing/supercomputing/images/sharpen_speedup.svg
new file mode 100644
index 00000000..decbb8b2
--- /dev/null
+++ b/high_performance_computing/supercomputing/images/sharpen_speedup.svg
@@ -0,0 +1 @@
[SVG figure: "Calculation speedup", "Overall speedup", and "Perfect speedup" plotted against number of cores, axes 0-140]
diff --git a/high_performance_computing/supercomputing/images/sharpening_diagram.png b/high_performance_computing/supercomputing/images/sharpening_diagram.png
new file mode 100644
index 00000000..3bcb76a3
Binary files /dev/null and b/high_performance_computing/supercomputing/images/sharpening_diagram.png differ
diff --git a/high_performance_computing/supercomputing/images/taylor-vick-M5tzZtFCOfs-unsplash.jpg b/high_performance_computing/supercomputing/images/taylor-vick-M5tzZtFCOfs-unsplash.jpg
new file mode 100644
index 00000000..c6ae0005
Binary files /dev/null and b/high_performance_computing/supercomputing/images/taylor-vick-M5tzZtFCOfs-unsplash.jpg differ
diff --git a/high_performance_computing/supercomputing/images/veri-ivanova-p3Pj7jOYvnM-unsplash.jpg b/high_performance_computing/supercomputing/images/veri-ivanova-p3Pj7jOYvnM-unsplash.jpg
new file mode 100644
index 00000000..650d5240
Binary files /dev/null and b/high_performance_computing/supercomputing/images/veri-ivanova-p3Pj7jOYvnM-unsplash.jpg differ
diff --git a/high_performance_computing/supercomputing/images/william-warby-WahfNoqbYnM-unsplash.jpg b/high_performance_computing/supercomputing/images/william-warby-WahfNoqbYnM-unsplash.jpg
new file mode 100644
index 00000000..9a040fb3
Binary files /dev/null and b/high_performance_computing/supercomputing/images/william-warby-WahfNoqbYnM-unsplash.jpg differ
diff --git a/high_performance_computing/supercomputing/images/winner-1019835_640.jpg b/high_performance_computing/supercomputing/images/winner-1019835_640.jpg
new file mode 100644
index 00000000..86eca24b
Binary files /dev/null and
b/high_performance_computing/supercomputing/images/winner-1019835_640.jpg differ diff --git a/high_performance_computing/supercomputing/index.md b/high_performance_computing/supercomputing/index.md new file mode 100644 index 00000000..69349623 --- /dev/null +++ b/high_performance_computing/supercomputing/index.md @@ -0,0 +1,28 @@ +--- +name: Introduction to Supercomputing +id: supercomputing +dependsOn: [ + technology_and_tooling.bash_shell, +] +files: [ + 01_intro.md, + 02_understanding_supercomputing.md, + 03_supercomputing_world.md, + 04_practical.md, +] +summary: | + An introduction to supercomputing, including why we need them and how they are used. + +--- + +In this very short video Dr. David Henty introduces this module on supercomputing. + +::::iframe{id="kaltura_player" width="700" height="400" src="https://cdnapisec.kaltura.com/p/2010292/sp/201029200/embedIframeJs/uiconf_id/32599141/partner_id/2010292?iframeembed=true&playerId=kaltura_player&entry_id=1_lwezg5oi&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=1_1mgenjg0" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Welcome_to_Supercomputing"} +:::: + +:::solution{title="Transcript"} +0:12 - In this first week after a brief introduction to the kinds of applications supercomputing has, we’ll largely concentrate on supercomputer hardware. Starting from how a modern computer processor works, we’ll explain where supercomputers get their enormous computing power from, why they’re also called parallel computers, and how we quantify and measure their speed. We’ll also cover some history to illustrate how far we’ve come since the birth of modern supercomputing in the early 90s, and show you some examples of current state-of-the-art machines. We’ll also introduce some key terminology that you’ll need to understand the rest of the course. So enjoy. +::: + +The primary aim of this module is to provide a general understanding of supercomputers and their importance. +It will also introduce key terminology to help you understand the fundamentals of supercomputing.