
Low-Level Interface Graphs #13

Open · benbenolson opened this issue Sep 27, 2018 · 2 comments

@benbenolson (Contributor)

Currently, SICM does not model complex memory hierarchies: it has no knowledge of the bandwidth and latency characteristics of the machine it runs on, and no way of expressing connectivity between NUMA nodes or other devices.

The solution we've come up with is to represent the system architecture as a graph, with nodes representing NUMA nodes and other devices, and edges representing the connectivity between them. That way, a user could traverse a graph of their machine's architecture and select a place to allocate based on a richer set of properties: bandwidth, latency, and number of hops from any memory node in the graph; proximity to, for example, a NIC; and proximity to a CPU or GPU.
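
For concreteness, here's a minimal sketch of what such a graph could look like in C. All of these names (`sicm_graph`, `sicm_node`, `sicm_edge`, and their fields) are hypothetical and not necessarily what ends up in sicm_graph.h:

```c
/* Hypothetical sketch of a topology graph; names and fields are illustrative only. */
#include <stddef.h>

typedef enum {
  SICM_NODE_CPU,
  SICM_NODE_GPU,
  SICM_NODE_MEM,   /* DRAM, MCDRAM, HBM, ... */
  SICM_NODE_NIC
} sicm_node_type;

typedef struct sicm_edge {
  size_t dst;           /* index of the destination node                  */
  double bandwidth;     /* GB/s, 0 if unknown or guessed                  */
  double latency;       /* ns, 0 if unknown or guessed                    */
  unsigned hops;        /* number of physical links on this path          */
  int derived;          /* 1 if computed during post-processing           */
} sicm_edge;

typedef struct sicm_node {
  sicm_node_type type;
  int numa_id;          /* NUMA node backing this device, if any          */
  sicm_edge *edges;     /* outgoing edges                                 */
  size_t num_edges;
} sicm_node;

typedef struct sicm_graph {
  sicm_node *nodes;
  size_t num_nodes;
} sicm_graph;
```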

Because this is intended especially for HPC systems with very complicated hardware and users who know that hardware well, SICM will first attempt to read the system topology from a graph file. It will do some post-processing on this graph (for example, computing derived edges between each compute and memory node to make traversal easier) and use it, instead of a flat device list, to inform applications about the architecture.
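
A rough sketch of that initialization order, using hypothetical helper names that may not match the actual API:

```c
/* Hypothetical initialization flow; these helpers are illustrative only and
 * may not match the names used in sicm_low.c / sicm_graph.h. */
sicm_graph *sicm_graph_read_file(const char *path);      /* parse a user-supplied topology file */
sicm_graph *sicm_graph_from_hwloc(void);                 /* fall back to hwloc detection        */
void        sicm_graph_add_derived_edges(sicm_graph *g); /* compute -> memory shortcut edges    */

sicm_graph *sicm_graph_init(const char *path) {
  sicm_graph *g = sicm_graph_read_file(path);
  if (!g) {
    g = sicm_graph_from_hwloc();
  }
  if (g) {
    sicm_graph_add_derived_edges(g);
  }
  return g;
}
```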

If this graph file does not exist, SICM will fall back to hwloc to construct the graph automatically. Because hwloc cannot detect the connections between memory and compute nodes, nor their speeds and other properties, those values will have to be either guessed or left blank. When they are guessed, the API will make it clear to the user that the values may not be accurate: the only way to guess is to identify the architecture from the CPU model ("Intel Sandy Bridge with two NUMA nodes" is something we could reasonably determine automatically, for example) and then assume typical speeds (we generally know the QPI speed of a Sandy Bridge system, so we can fall back to a generic two-node Sandy Bridge graph).
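
To show what that fallback might look like, here's a sketch that enumerates NUMA nodes with the standard hwloc API and leaves the edge properties unset, since hwloc doesn't report them. It builds on the hypothetical `sicm_graph` types sketched above, not the real sicm_graph.h:

```c
/* Hypothetical hwloc fallback: only NUMA nodes are enumerated, and edge
 * bandwidth/latency are left blank because hwloc does not report them. */
#include <hwloc.h>
#include <stdlib.h>

sicm_graph *sicm_graph_from_hwloc(void) {
  hwloc_topology_t topo;
  if (hwloc_topology_init(&topo) < 0) return NULL;
  if (hwloc_topology_load(topo) < 0) { hwloc_topology_destroy(topo); return NULL; }

  int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);
  if (n <= 0) { hwloc_topology_destroy(topo); return NULL; }

  sicm_graph *g = calloc(1, sizeof(*g));
  g->nodes     = calloc((size_t)n, sizeof(*g->nodes));
  g->num_nodes = (size_t)n;

  for (int i = 0; i < n; i++) {
    hwloc_obj_t obj = hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, (unsigned)i);
    g->nodes[i].type    = SICM_NODE_MEM;
    g->nodes[i].numa_id = obj ? (int)obj->os_index : -1;
    /* No edges here: connectivity, bandwidth, and latency are unknown,
     * so they are guessed later (e.g. from the CPU model) or left blank. */
  }

  hwloc_topology_destroy(topo);
  return g;
}
```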

@calccrypto @gvallee Please let me know what you guys think. We've been designing this for a little while, trying to come up with some way to express a greater level of detail (i.e. including compute nodes, as well as connections between all devices) without sacrificing usability.

This issue is for completing this overhaul of SICM's low-level interface in sicm_low.c. Other parts of SICM will only need minor changes once the API changes.

These changes have already begun in the hwloc_test branch. They start with the file sicm_graph.h; sicm_low.c and sicm_low.h have not been modified yet.

@calccrypto (Contributor)

Do you mean something like this?

@benbenolson (Contributor, Author)

Yeah, that's very close to what I was thinking, but not exactly. What we're really after is the edges: the bandwidth and latency between individual nodes, which hwloc does not provide. If we only wanted the device list, we could simply call hwloc and get that, but it's not enough information to fully inform allocations: we know where we can allocate, but there's no way to know how fast that memory is, or any of its other properties. hwloc's goal, after all, isn't high-performance memory allocation; it's detecting your whole system topology, so it provides more detail than we need in some areas and too little information in others.

Here are some graphs that I made to display what I mean. You can look at arch/ecuador.hdf (I know, I'm going to change the file extension) to see the kind of syntax that I'm getting at when I say that a user can supply their own graph file. So, Ecuador looks something like this:

[figure: Ecuador topology graph]

Per the above design (it's already implemented in sicm_graph.h), SICM reads in arch/ecuador.hdf (we could ship generic versions of common architectures to assist the user), which looks like the above figure. Next, it creates derived edges from every compute node to every memory node, so that users can easily iterate over the edges out of a particular compute node and determine exactly how much bandwidth and latency they can get from each memory node. In the above example, that means additional edges between the CPU on node 0 and the memory on node 1, and between the CPU on node 1 and the memory on node 0. A sketch of this post-processing step follows.
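
Here's one way that post-processing could go, again using the hypothetical `sicm_graph` types sketched above rather than the real sicm_graph.h: for every compute node, do a breadth-first walk to each reachable memory node and record a derived edge whose bandwidth is the minimum along the path and whose latency and hop count accumulate along it.

```c
/* Hypothetical post-processing: add a derived edge from every compute node
 * to every reachable memory node. */
#include <stdlib.h>

static void add_edge(sicm_node *n, sicm_edge e) {
  n->edges = realloc(n->edges, (n->num_edges + 1) * sizeof(*n->edges));
  n->edges[n->num_edges++] = e;
}

void sicm_graph_add_derived_edges(sicm_graph *g) {
  for (size_t src = 0; src < g->num_nodes; src++) {
    if (g->nodes[src].type != SICM_NODE_CPU && g->nodes[src].type != SICM_NODE_GPU)
      continue;

    /* Breadth-first walk outward from this compute node, accumulating
     * path properties as we go. */
    sicm_edge *best  = calloc(g->num_nodes, sizeof(*best));
    char      *seen  = calloc(g->num_nodes, 1);
    size_t    *queue = malloc(g->num_nodes * sizeof(*queue));
    size_t head = 0, tail = 0;

    seen[src] = 1;
    queue[tail++] = src;

    while (head < tail) {
      size_t cur = queue[head++];
      sicm_node *n = &g->nodes[cur];
      for (size_t i = 0; i < n->num_edges; i++) {
        const sicm_edge *e = &n->edges[i];
        if (e->derived || seen[e->dst]) continue;
        seen[e->dst] = 1;

        sicm_edge path = best[cur];        /* path found to the current node */
        path.dst       = e->dst;
        path.hops     += e->hops;
        path.latency  += e->latency;       /* latencies add up along the path */
        path.bandwidth = (cur == src || e->bandwidth < path.bandwidth)
                           ? e->bandwidth : path.bandwidth;  /* bottleneck link */
        path.derived   = 1;
        best[e->dst]   = path;
        queue[tail++]  = e->dst;
      }
    }

    /* Record a shortcut edge from this compute node to each memory node. */
    for (size_t dst = 0; dst < g->num_nodes; dst++)
      if (seen[dst] && dst != src && g->nodes[dst].type == SICM_NODE_MEM)
        add_edge(&g->nodes[src], best[dst]);

    free(best);
    free(seen);
    free(queue);
  }
}
```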

Summit is where things start to get interesting. Since it's a much more complicated architecture, the graph becomes larger, but necessarily so. Using this graph, the user could easily allocate memory close to the NIC, opt for the higher bandwidth of the DRAM, or place memory only on the GPUs (possibly because they have some CUDA code that requires it):

[figure: Summit topology graph]
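
Once the derived edges exist, selection on the user side could look roughly like this (still the hypothetical types from above; the real sicm_low API may expose this differently). This one just picks the reachable memory node with the highest bandwidth from a given compute node, but filtering for proximity to a NIC or GPU would be the same kind of loop:

```c
/* Hypothetical selection routine: from a given compute node, pick the memory
 * node reachable with the highest bandwidth, using derived edges only. */
#include <stddef.h>

static long pick_highest_bandwidth_mem(const sicm_graph *g, size_t compute) {
  const sicm_node *src = &g->nodes[compute];
  long best = -1;
  double best_bw = -1.0;

  for (size_t i = 0; i < src->num_edges; i++) {
    const sicm_edge *e = &src->edges[i];
    if (!e->derived || g->nodes[e->dst].type != SICM_NODE_MEM) continue;
    if (e->bandwidth > best_bw) {
      best_bw = e->bandwidth;
      best = (long)e->dst;
    }
  }
  return best;   /* index of the chosen memory node, or -1 if none found */
}
```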
