
Parallel Architecture for the upcoming alpaka integration #95

Open
erikzenker opened this issue Sep 4, 2015 · 11 comments

@erikzenker (Member)

Alpaka provides the possibility to describe algorithms (kernels) in an abstract form, such that these algorithms are executable on several hardware architectures, e.g. CPUs, multi-core CPUs, NVIDIA accelerators, or Xeon Phis.

The clear goal is to run HASEonGPU on hardware other than NVIDIA accelerators and, I think, also to run HASEonGPU on varying accelerators/devices at the same time. To achieve that, we need to think about how to distribute the workload locally to varying devices and globally to compute nodes.

Every device corresponds to a peer

This design would be more or less equal to the current design, where each peer manages one NVIDIA accelerator (except the master):

Each peer...

  1. Grabs a free device
  2. Requests a sample point
  3. Runs the kernel on this device
  4. Sends the result back
  5. Requests a new sample point

Cons:

  • The number of total devices has to be known in advance because this is also the number of
    peers that need to be spawned.

Pros:

  • Equal to the current design --> no big changes necessary
  • Usually you know the system you are running your simulation on
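The per-device peer loop above can be sketched in plain C++. This is a hypothetical illustration, not HASEonGPU code: `Master`, `runPeer`, and the doubled-value "kernel" are stand-ins for the real graybat master, peer loop, and alpaka ASE kernel.

```cpp
#include <optional>
#include <queue>
#include <vector>

// Stand-in for the master that distributes sample points (hypothetical).
struct Master {
    std::queue<int> samplePoints;  // sample points still to be computed
    std::optional<int> requestSamplePoint() {
        if (samplePoints.empty()) return std::nullopt;
        int p = samplePoints.front();
        samplePoints.pop();
        return p;
    }
};

// Strategy 1: each peer owns exactly one device for its whole lifetime.
std::vector<int> runPeer(Master& master) {
    // 1. grab a free device (an alpaka device in the real code; omitted here)
    std::vector<int> results;  // stands in for "send the result back"
    // 2.-5. request a sample point, run the kernel, send, repeat
    while (auto point = master.requestSamplePoint())
        results.push_back(*point * 2);  // stands in for the ASE kernel
    return results;
}
```

Because the device is grabbed once up front, the peer's main loop contains only the request/compute/send cycle, which keeps the design identical for every backend alpaka supports.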

One peer per compute node with multiple devices

In this design a single peer could request sample points for all available
devices on its node and use the alpaka async streams to start multiple
kernels in parallel.

Each peer:

  1. Grabs all devices it can get
  2. Requests as many sample points as there are devices
  3. Starts a kernel on each device
  4. Sends a result back when a device has finished
  5. Requests sample points for finished devices

Pros:

  • Only one peer per compute node needs to be spawned
  • Could really use "all" devices on a peer
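The node-level loop above can be sketched with `std::async` standing in for alpaka's asynchronous queues/streams. Again hypothetical: `Master`, `runNodePeer`, and the doubled-value "kernel" are illustrative names, not project code.

```cpp
#include <future>
#include <optional>
#include <queue>
#include <vector>

// Stand-in for the master that distributes sample points (hypothetical).
struct Master {
    std::queue<int> samplePoints;
    std::optional<int> requestSamplePoint() {
        if (samplePoints.empty()) return std::nullopt;
        int p = samplePoints.front();
        samplePoints.pop();
        return p;
    }
};

// Strategy 2: one peer per node keeps every device on the node busy.
std::vector<int> runNodePeer(Master& master, int numDevices) {
    auto launch = [](int p) {
        // stand-in for enqueueing the ASE kernel on one device's async stream
        return std::async(std::launch::async, [p] { return p * 2; });
    };

    // 1.-3. grab all devices and start one kernel per device
    std::vector<std::future<int>> inFlight;
    for (int d = 0; d < numDevices; ++d)
        if (auto p = master.requestSamplePoint())
            inFlight.push_back(launch(*p));

    // 4.-5. as kernels finish, send results back and refill the device
    std::vector<int> results;
    while (!inFlight.empty()) {
        results.push_back(inFlight.back().get());
        inFlight.pop_back();
        if (auto p = master.requestSamplePoint())
            inFlight.push_back(launch(*p));
    }
    return results;
}
```

The refill step is what distinguishes this design: a single peer acts as a small per-node scheduler, so the number of spawned peers no longer has to match the number of devices.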

Discuss!

@bussmann (Member) commented Sep 5, 2015

2, since the Con isn't one.

@erikzenker erikzenker added this to the 1.6 HASEonANY milestone Sep 6, 2015
@erikzenker (Member, Author)

Okay, it's a Con for my unconscious mind, which does not want to break up the current design. Update!

@slizzered (Contributor)

In strategy 1, does the peer release the device after it returns the sample point? (My question is: why does it first look for the sample point and only then grab a device?)

I like the first strategy, since hierarchies are kept flat and simple, but I can see the benefits of auto-adjusting the number of devices per node by using only a single peer.

My idea about strategy 2: use the one-peer-per-node approach, but spawn an additional thread for each accelerator and CPU that takes part in the computation. They can use the original thread for communication and create some form of hierarchy. That way, we can keep a clear separation of parallel computation and communication.
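This thread hierarchy might be sketched as follows. It is a hypothetical illustration: one lightweight communication thread per node (standing in for the graybat/MPI side) feeds a shared queue, and one compute thread per device (standing in for alpaka kernels) drains it. `SampleQueue` and `runNode` are made-up names.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A small thread-safe queue decoupling communication from computation.
struct SampleQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<int> points;
    bool done = false;

    void push(int p) { { std::lock_guard<std::mutex> l(m); points.push(p); } cv.notify_one(); }
    void close()     { { std::lock_guard<std::mutex> l(m); done = true; }     cv.notify_all(); }
    bool pop(int& p) {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return !points.empty() || done; });
        if (points.empty()) return false;  // closed and drained
        p = points.front();
        points.pop();
        return true;
    }
};

std::vector<int> runNode(const std::vector<int>& samplePoints, int numDevices) {
    SampleQueue q;
    std::mutex resultsMutex;
    std::vector<int> results;

    // Compute threads: one per device, fully separated from communication.
    std::vector<std::thread> devices;
    for (int d = 0; d < numDevices; ++d)
        devices.emplace_back([&] {
            int p;
            while (q.pop(p)) {
                int r = p * 2;  // stand-in for the ASE kernel on this device
                std::lock_guard<std::mutex> l(resultsMutex);
                results.push_back(r);
            }
        });

    // Communication thread: the only place that talks to the outside world.
    std::thread comm([&] {
        for (int p : samplePoints) q.push(p);
        q.close();
    });

    comm.join();
    for (auto& t : devices) t.join();
    return results;
}
```

The point of the sketch is the separation: only the communication thread would need to know about graybat, and only the compute threads about alpaka.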

@erikzenker (Member, Author)

Okay, approach 1 also works when the device is grabbed first. It is also more efficient not to grab a device again and again. Update!

Does your idea about strategy 2 look like this?

[diagram: communication hierarchy]

Thus, there are two hierarchies of communication?

@slizzered (Contributor)

Yes, that is about what I thought of. The communication thread would be very lightweight, acting only as an abstraction layer, so the compute threads don't have to change too much (basically, only replace main.cc, adapt calc_phi_ase_graybat.cc, and keep most of the underlying compute things).

I'm not sure about the mesh, but if we can put mesh creation in a deeper layer (inside the compute thread), the whole communication will also be separated from alpaka.

@ax3l (Member) commented Sep 7, 2015

I think strategy 2 is way more complicated to implement, and strategy 1 ("Every device corresponds to a peer") does not require building yet another scheduler that takes care of the devices in the rank.

@slizzered (Contributor)

Yes, strategy 1 is very easy in comparison, and so far we have had a lot of success with the KISS principle behind it.

I see the most interesting use of strategy 2 when using very heterogeneous clusters where it is difficult to start the correct number of peers for each node.

@erikzenker (Member, Author)

I would prefer strategy 1 because it's simple. And I think it's not a big thing to go from strategy 1 to strategy 2 later.

@ax3l (Member) commented Sep 7, 2015

Totally agree; also, connecting various backends over the "same" abstract communication layer is already a nice task.

@slizzered wrote:

I see the most interesting use of strategy 2 when using very heterogeneous clusters where it is difficult to start the correct number of peers for each node.

I actually think that might still be possible in 1; one just needs a communication layer that can asynchronously create communicators (MPI) or add new global "ranks" (ZeroMQ sockets). Strategy 2 will naturally grow from that (in case new ranks are not globally announced).

@bussmann (Member) commented Sep 8, 2015

Then let's do 1 and see how it works out. Concentrate on alpaka, not HASEonGPU redesigns.

@slizzered (Contributor)

👍

@slizzered slizzered added alpaka and removed alpaka labels Oct 16, 2015