
Parallel Architecture for the upcoming alpaka integration #95

Open
erikzenker opened this issue Sep 4, 2015 · 11 comments

@erikzenker (Member)

Alpaka provides the possibility to describe algorithms (kernels) in an abstract form, such that these algorithms are executable on several hardware architectures, e.g. CPUs, multi-core CPUs, NVIDIA accelerators, or Xeon Phis.

The clear goal is to run HASEonGPU on hardware other than NVIDIA accelerators and, I think, also to run HASEonGPU on varying accelerators/devices at the same time. To achieve that, we need to think about how to distribute the workload locally to varying devices and globally to compute nodes.

Every device corresponds to a peer

This design would be more or less equal to the current design, where each peer manages one NVIDIA accelerator (except the master):

Each peer...

  1. Grabs a free device
  2. Requests a sample point
  3. Runs the kernel on this device
  4. Sends the result back
  5. Requests a new sample point

Cons:

  • The number of total devices has to be known in advance because this is also the number of
    peers that need to be spawned.

Pros:

  • Equal to the current design --> no big changes necessary
  • Usually you know the system you are running your simulation on
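The per-device peer loop above can be sketched in plain C++. This is a hypothetical illustration, not HASEonGPU code: `Master`, `runPeer`, and the doubled-value "kernel" are stand-ins for the real graybat master, peer loop, and alpaka ASE kernel.

```cpp
#include <optional>
#include <queue>
#include <vector>

// Stand-in for the master that distributes sample points (hypothetical).
struct Master {
    std::queue<int> samplePoints;  // sample points still to be computed
    std::optional<int> requestSamplePoint() {
        if (samplePoints.empty()) return std::nullopt;
        int p = samplePoints.front();
        samplePoints.pop();
        return p;
    }
};

// Strategy 1: each peer owns exactly one device for its whole lifetime.
std::vector<int> runPeer(Master& master) {
    // 1. grab a free device (an alpaka device in the real code; omitted here)
    std::vector<int> results;  // stands in for "send the result back"
    // 2.-5. request a sample point, run the kernel, send, repeat
    while (auto point = master.requestSamplePoint())
        results.push_back(*point * 2);  // stands in for the ASE kernel
    return results;
}
```

Because the device is grabbed once up front, the peer's main loop contains only the request/compute/send cycle, which keeps the design identical for every backend alpaka supports.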

One peer per compute node with multiple devices

In this design a single peer could request sample points for all available
devices on its node and use the alpaka async streams to start multiple
kernels in parallel.

Each peer:

  1. Grabs all devices it can get
  2. Requests as many sample points as there are devices
  3. Starts a kernel on each device
  4. Sends a result back when a device has finished
  5. Requests sample points for finished devices

Pros:

  • Only one peer per compute node needs to be spawned
  • Could really use "all" devices on a peer
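The node-level loop above can be sketched with `std::async` standing in for alpaka's asynchronous queues/streams. Again hypothetical: `Master`, `runNodePeer`, and the doubled-value "kernel" are illustrative names, not project code.

```cpp
#include <future>
#include <optional>
#include <queue>
#include <vector>

// Stand-in for the master that distributes sample points (hypothetical).
struct Master {
    std::queue<int> samplePoints;
    std::optional<int> requestSamplePoint() {
        if (samplePoints.empty()) return std::nullopt;
        int p = samplePoints.front();
        samplePoints.pop();
        return p;
    }
};

// Strategy 2: one peer per node keeps every device on the node busy.
std::vector<int> runNodePeer(Master& master, int numDevices) {
    auto launch = [](int p) {
        // stand-in for enqueueing the ASE kernel on one device's async stream
        return std::async(std::launch::async, [p] { return p * 2; });
    };

    // 1.-3. grab all devices and start one kernel per device
    std::vector<std::future<int>> inFlight;
    for (int d = 0; d < numDevices; ++d)
        if (auto p = master.requestSamplePoint())
            inFlight.push_back(launch(*p));

    // 4.-5. as kernels finish, send results back and refill the device
    std::vector<int> results;
    while (!inFlight.empty()) {
        results.push_back(inFlight.back().get());
        inFlight.pop_back();
        if (auto p = master.requestSamplePoint())
            inFlight.push_back(launch(*p));
    }
    return results;
}
```

The refill step is what distinguishes this design: a single peer acts as a small per-node scheduler, so the number of spawned peers no longer has to match the number of devices.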

Discuss!

@bussmann (Member) commented Sep 5, 2015

2, since the Con isn't one.

@erikzenker erikzenker added this to the 1.6 HASEonANY milestone Sep 6, 2015
@erikzenker (Member, Author)

Okay, it's a Con for my unconscious mind, which does not want to break up the current design. Update!

@slizzered (Contributor)

In strategy 1, does the peer release the device after it returns the sample point? (My question is: why does it first look for the sample point and only then grab a device?)

I like the first strategy, since hierarchies are kept flat and simple, but I can see the benefits of auto-adjusting the number of devices per node by using only a single peer.

My idea about strategy 2: use the one-peer-per-node approach, but spawn an additional thread for each accelerator and CPU that takes part in the computation. They can use the original thread for communication and create some form of hierarchy. That way, we can keep a clear separation of parallel computation and communication.
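This thread hierarchy might be sketched as follows. It is a hypothetical illustration: one lightweight communication thread per node (standing in for the graybat/MPI side) feeds a shared queue, and one compute thread per device (standing in for alpaka kernels) drains it. `SampleQueue` and `runNode` are made-up names.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A small thread-safe queue decoupling communication from computation.
struct SampleQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<int> points;
    bool done = false;

    void push(int p) { { std::lock_guard<std::mutex> l(m); points.push(p); } cv.notify_one(); }
    void close()     { { std::lock_guard<std::mutex> l(m); done = true; }     cv.notify_all(); }
    bool pop(int& p) {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return !points.empty() || done; });
        if (points.empty()) return false;  // closed and drained
        p = points.front();
        points.pop();
        return true;
    }
};

std::vector<int> runNode(const std::vector<int>& samplePoints, int numDevices) {
    SampleQueue q;
    std::mutex resultsMutex;
    std::vector<int> results;

    // Compute threads: one per device, fully separated from communication.
    std::vector<std::thread> devices;
    for (int d = 0; d < numDevices; ++d)
        devices.emplace_back([&] {
            int p;
            while (q.pop(p)) {
                int r = p * 2;  // stand-in for the ASE kernel on this device
                std::lock_guard<std::mutex> l(resultsMutex);
                results.push_back(r);
            }
        });

    // Communication thread: the only place that talks to the outside world.
    std::thread comm([&] {
        for (int p : samplePoints) q.push(p);
        q.close();
    });

    comm.join();
    for (auto& t : devices) t.join();
    return results;
}
```

The point of the sketch is the separation: only the communication thread would need to know about graybat, and only the compute threads about alpaka.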

@erikzenker (Member, Author)

Okay, approach 1 also works when the device is grabbed first. It is also more efficient not to grab a device again and again. Update!

Does your idea about strategy 2 look like this?

[diagram: communication hierarchy]

Thus, there are two hierarchies of communication?

@slizzered (Contributor)

Yes, that is about what I thought of. The communication thread would be very lightweight, acting only as an abstraction layer, so the compute threads don't have to change too much (basically, only replace main.cc, adapt calc_phi_ase_graybat.cc, and keep most of the underlying compute things).

I'm not sure about the mesh, but if we can put mesh creation in a deeper layer (inside the compute thread), the whole communication will also be separated from alpaka.

@ax3l (Member) commented Sep 7, 2015

I think strategy 2 is way more complicated to implement, and strategy 1 ("Every device corresponds to a peer") does not require building yet another scheduler that takes care of the devices in the rank.

@slizzered (Contributor)

Yes, strategy 1 is very easy in comparison, and so far we have had a lot of success with the KISS principle behind it.

I see the most interesting use of strategy 2 when using very heterogeneous clusters where it is difficult to start the correct number of peers for each node.

@erikzenker (Member, Author)

I would prefer strategy 1 because it's simple. And I think it's not a big thing to go from strategy 1 to strategy 2 later.

@ax3l (Member) commented Sep 7, 2015

Totally agree; also, connecting various backends over the "same" abstract communication layer is already a nice task.

@slizzered wrote:

I see the most interesting use of strategy 2 when using very heterogeneous clusters where it is difficult to start the correct number of peers for each node.

I actually think that might still be possible in 1; one just needs a communication layer that can asynchronously create communicators (MPI) or add new global "ranks" (ZeroMQ sockets). Strategy 2 will naturally grow from that (in case new ranks are not globally announced).

@bussmann (Member) commented Sep 8, 2015

Then let's do 1 and see how it works out. Concentrate on alpaka, not HASEonGPU redesigns.

@slizzered (Contributor)

👍

@slizzered slizzered added alpaka and removed alpaka labels Oct 16, 2015