Parallel Architecture for the upcoming alpaka integration #95
Comments
Strategy 2, since the Con isn't one.
Okay, it's a Con for my unconscious mind, which does not want to break up the current design. Update!
In strategy 1, does the peer release the device after it returns the sample point? (My question is: why does it first look for the sample point and only then grab a device?) I like the first strategy, since hierarchies are kept flat and simple, but I can see the benefits of auto-adjusting the number of devices per node by using only a single peer. My idea for strategy 2: use the one-peer-per-node approach, but spawn an additional thread for each accelerator and CPU that takes part in the computation. These threads can use the original thread for communication and create some form of hierarchy. That way, we can keep a clear separation of parallel computation and communication.
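A minimal sketch of that threading layout, assuming one std::thread per device plus a single communication thread that hands out work over a shared queue (all names here are illustrative, not existing HASEonGPU code):

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical work item: one sample point to be computed on one device.
struct SamplePoint { int id; };

// Small thread-safe queue: the communication thread fills it,
// the per-device compute threads drain it.
class WorkQueue {
public:
    void push(SamplePoint p) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(p); }
        cv_.notify_one();
    }
    bool pop(SamplePoint& out) {           // blocks until work or shutdown
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || done_; });
        if (q_.empty()) return false;
        out = q_.front(); q_.pop();
        return true;
    }
    void shutdown() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
    }
private:
    std::queue<SamplePoint> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

int main() {
    const int nDevices = 3;                // accelerators + CPUs on this node
    WorkQueue queue;

    // One compute thread per device; only these threads would touch the
    // device API (alpaka later), communication stays in the main thread.
    std::vector<std::thread> workers;
    for (int dev = 0; dev < nDevices; ++dev) {
        workers.emplace_back([dev, &queue] {
            SamplePoint p;
            while (queue.pop(p)) {
                // placeholder for the actual kernel launch on device `dev`
                std::cout << "device " << dev << " computes sample "
                          << p.id << "\n";
            }
        });
    }

    // The "communication thread" (here simply main) would receive sample
    // points from the master peer and forward them to the local queue.
    for (int i = 0; i < 10; ++i) queue.push(SamplePoint{i});
    queue.shutdown();

    for (auto& w : workers) w.join();
}
```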
Yes, that is about what I thought of. The communication thread would be only very lightweight, acting as an abstraction layer, so the compute threads don't have to change too much (basically, only the communication calls get replaced). I'm not sure about the mesh, but if we can put mesh creation in a deeper layer (inside the compute thread), the whole communication will also be separated from alpaka.
I think strategy 2 is way more complicated to implement, and strategy 1 ("Every device corresponds to a peer") does not require building yet another scheduler that takes care of the devices within the rank.
Yes, strategy 1 is very easy in comparison, and so far we have had a lot of success with the KISS principle behind it. I see the most interesting use of strategy 2 on very heterogeneous clusters, where it is difficult to start the correct number of peers for each node.
I would prefer strategy 1, because it's simple. And I think it's not a big thing to go from strategy 1 to strategy 2 later.
Totally agree; also, connecting various backends over the "same" abstract communication layer is already a nice task.
I actually think that might still be possible in 1; one just needs a communication layer that can asynchronously create communicators (MPI) / add new global "ranks" (ZeroMQ sockets). Strategy 2 will naturally grow from that (in case new ranks are not globally announced).
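One possible shape for such a layer, sketched as a plain C++ interface; the names and the MPI/ZeroMQ mapping are assumptions, not an existing design:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Identifies one peer in the computation, independent of whether it is
// backed by an MPI rank or a ZeroMQ socket underneath.
using PeerId = std::uint32_t;

// Hypothetical abstract communication layer. A concrete MPI backend could
// create communicators asynchronously; a ZeroMQ backend could simply
// connect/bind additional sockets when a new peer appears.
class CommunicationLayer {
public:
    virtual ~CommunicationLayer() = default;

    // Announce this process; returns its global id.
    virtual PeerId join() = 0;

    // Asynchronously add a peer (e.g. a new device/thread) to the
    // computation. In strategy 2 the new "rank" might only be known
    // node-locally instead of being globally announced.
    virtual PeerId addPeer() = 0;

    // Point-to-point messaging; the payload is an opaque byte buffer so the
    // compute code never sees MPI or ZeroMQ types directly.
    virtual void send(PeerId to, const std::vector<std::byte>& data) = 0;
    virtual std::vector<std::byte> receive(PeerId from) = 0;
};
```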
Then let's do 1 and see how it works out. Concentrate on alpaka, not HASEonGPU redesigns.
👍
Alpaka provides the possibility to describe algorithms (kernels) in an abstract form, such that these algorithms are executable on several hardware architectures, e.g. CPUs, multi-core CPUs, NVIDIA accelerators, or Xeon Phis.
The clear goal is to run HASEonGPU on hardware other than NVIDIA accelerators and, I think, also to run HASEonGPU on varying accelerators/devices at the same time. To achieve that, we need to think about how to distribute the workload locally to varying devices and globally to compute nodes.
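To illustrate the idea (this is not alpaka's actual API, just the general pattern of a kernel written once as a templated functor that gets specialised per accelerator at compile time):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// A kernel written once as a templated functor. The accelerator type TAcc
// carries the index/thread information; a trivial serial stand-in is used
// here so the example stays self-contained.
struct AxpyKernel {
    template <typename TAcc>
    void operator()(TAcc const& acc, float a, const float* x, float* y,
                    std::size_t n) const {
        for (std::size_t i = acc.first(n); i < acc.last(n); ++i)
            y[i] = a * x[i] + y[i];
    }
};

// Stand-in "accelerator" that simply owns the whole index range; a real
// backend (CUDA, OpenMP, ...) would map threads/blocks onto [first, last).
struct SerialAcc {
    std::size_t first(std::size_t) const { return 0; }
    std::size_t last(std::size_t n) const { return n; }
};

int main() {
    std::vector<float> x(8, 1.0f), y(8, 2.0f);
    AxpyKernel{}(SerialAcc{}, 3.0f, x.data(), y.data(), x.size());
    std::cout << "y[0] = " << y[0] << "\n";   // prints 5
}
```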
Every device corresponds to a peer
This design would be more or less equal to the current design, where each peer manages one NVIDIA accelerator (except the master):
Each peer requests a sample point from the master, computes it on its own device, and returns the result.
Cons:
the correct number of peers (one per device) has to be spawned on every node.
Pros:
simple, flat hierarchy that stays very close to the current design.
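For concreteness, a rough sketch of the per-peer loop this design implies, using a simple MPI tag protocol (the tags and message layout are invented for the example, not taken from HASEonGPU):

```cpp
#include <mpi.h>
#include <cstdio>

// Invented message tags; a real peer would also send a result message back.
enum Tag { TAG_REQUEST = 1, TAG_SAMPLE = 2, TAG_DONE = 3 };

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        // Master: hand out sample point indices until none are left.
        const int nSamples = 100;
        int next = 0, finished = 0;
        while (finished < size - 1) {
            int dummy = 0;
            MPI_Status st;
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            if (next < nSamples) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_SAMPLE,
                         MPI_COMM_WORLD);
                ++next;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE,
                         MPI_COMM_WORLD);
                ++finished;
            }
        }
    } else {
        // Peer: owns exactly one device and asks the master for work.
        while (true) {
            int request = 0, sample = 0;
            MPI_Send(&request, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Status st;
            MPI_Recv(&sample, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            // Placeholder for the kernel launch on this peer's device.
            std::printf("rank %d computes sample %d\n", rank, sample);
        }
    }
    MPI_Finalize();
}
```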
One peer per compute node with multiple devices
In this design a single peer could request sample points for all available
devices on its node and use the alpaka async streams to start multiple
kernels in parallel (a small sketch of this follows below).
Each peer requests one sample point per local device, launches the kernels asynchronously, and returns the results.
Pros:
the number of used devices per node adjusts automatically, and only one peer per node has to be started.
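A minimal sketch of the per-node dispatch described above; std::async stands in for alpaka's async queues/streams, and computeSample is a made-up placeholder for the kernel launch:

```cpp
#include <future>
#include <iostream>
#include <vector>

// Invented placeholder for "run the kernel for one sample point on one
// device"; with alpaka this would enqueue work on that device's queue.
float computeSample(int device, int samplePoint) {
    return static_cast<float>(device + samplePoint);  // dummy result
}

int main() {
    const int nDevices = 4;                    // all devices on this node
    std::vector<int> samples = {0, 1, 2, 3};   // one sample point per device

    // Launch one asynchronous task per device so the peer is not blocked
    // while the kernels run.
    std::vector<std::future<float>> results;
    for (int dev = 0; dev < nDevices; ++dev)
        results.push_back(std::async(std::launch::async,
                                     computeSample, dev, samples[dev]));

    // The peer collects all results and would then report them back to the
    // master and request the next batch of sample points.
    for (int dev = 0; dev < nDevices; ++dev)
        std::cout << "device " << dev << ": " << results[dev].get() << "\n";
}
```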
Discuss !