Reductions benchmark #3
Or it's the termination detection. Profiling shows that the workers are spending their time idling:
Further investigation (and fixes to the benchmark) shows that it's actually faster than serial, except when the number of threads equals the number of logical cores. Repeated multiple times.
I probably forgot to mention that this is nothing more than a first attempt at splitting tasks with return values. I realized that splittable futures could be used to support reductions, so I tried that, without worrying about algorithmic inefficiencies. To be honest, I'm not sure if this approach is salvageable – unless you manage to transform it into a tree-like reduction.
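For illustration, here is a minimal sketch of what a tree-like reduction could look like, written with plain OpenMP tasks rather than the runtime's splittable futures; the function name, grain size, and cutoff are assumptions for the example, not part of tasking-2.0:

```c
#include <stddef.h>
#include <stdint.h>

/* Divide-and-conquer sum: each level forks one half of the range as a
 * task and recurses on the other, so partial results are combined in a
 * tree of depth O(log n) instead of a serial chain of length O(n). */
static int64_t tree_sum(const int64_t *a, size_t lo, size_t hi, size_t grain) {
    if (hi - lo <= grain) {                    /* small range: reduce serially */
        int64_t s = 0;
        for (size_t i = lo; i < hi; i++) s += a[i];
        return s;
    }
    size_t mid = lo + (hi - lo) / 2;
    int64_t left = 0, right = 0;
    #pragma omp task shared(left)              /* fork the left half */
    left = tree_sum(a, lo, mid, grain);
    right = tree_sum(a, mid, hi, grain);       /* keep the right half ourselves */
    #pragma omp taskwait                       /* join before combining */
    return left + right;
}

/* Typical invocation (grain size 4096 is an arbitrary example value):
 *   #pragma omp parallel
 *   #pragma omp single
 *   total = tree_sum(a, 0, n, 4096);
 */
```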
I see, then I will scrap it for now; however, I may reuse that notion of data dependencies: mratsim/weave#31. For matrix multiplication and nested for loops, I need a barrier that also works for worker threads, or even better, a way to specify data dependencies like "I need the task with range ...". So I was thinking that I should implement reductions first, then see how they manage data dependencies, and then use a similar technique. Back to the drawing board, and I guess I'll start with a worker barrier.
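As a rough sketch of the synchronization those nested loops need, the pattern below uses a plain POSIX barrier between phases. The point above is that a work-stealing runtime would need an equivalent where a waiting worker can still execute tasks instead of idling; the struct and kernel names here are placeholders, not runtime APIs:

```c
#include <pthread.h>

/* Hypothetical per-worker kernel for one phase of a tiled matmul;
 * a stub here, only the synchronization pattern matters. */
static void compute_my_slice(int phase, int worker_id) {
    (void)phase; (void)worker_id;
}

/* Every worker computes its slice of a phase, then all workers must
 * meet before the next phase starts. A plain pthread barrier blocks
 * the thread outright, which is exactly what a work-stealing worker
 * should not do while other tasks are still runnable. */
typedef struct {
    pthread_barrier_t barrier;   /* initialized with the worker count */
    int num_phases;
} phased_work_t;

static void worker_loop(phased_work_t *w, int worker_id) {
    for (int phase = 0; phase < w->num_phases; phase++) {
        compute_my_slice(phase, worker_id);
        pthread_barrier_wait(&w->barrier);   /* no worker enters phase+1 early */
    }
}
```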
So it turned out that I didn't split tasks in "loadBalance"/RT_Poll, which meant that in a parallel for loop that didn't generate any tasks I was basically measuring my sequential performance with runtime overhead sprinkled on top. The reduce technique is actually not bad on eager futures, though I'm worried about lazy futures, as they need to allocate 3 times instead of 2 times for the eager ones, see mratsim/weave#83.
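A hedged sketch of what splitting at the polling point means for a parallel for loop: if the poll never splits the remaining iterations, the "parallel" loop is just the serial loop plus polling overhead. The helper names below are placeholders, not the actual loadBalance/RT_Poll API:

```c
#include <stddef.h>

/* Placeholder stand-ins for the runtime's real facilities (names assumed). */
static int  steal_request_pending(void) { return 0; }          /* would check for waiting thieves */
static void send_remaining_as_task(size_t from, size_t to) {   /* would hand off [from, to) */
    (void)from; (void)to;
}

/* Split-on-demand parallel for: the body itself never creates tasks,
 * so parallelism only appears if the polling point actually splits
 * the remaining range when a steal request arrives. */
static void parallel_for_chunk(size_t lo, size_t hi, void (*body)(size_t)) {
    for (size_t i = lo; i < hi; i++) {
        body(i);
        if (steal_request_pending()) {                 /* the step that was missing */
            size_t mid = i + 1 + (hi - (i + 1)) / 2;
            if (mid < hi) {
                send_remaining_as_task(mid, hi);       /* give away the upper half */
                hi = mid;                              /* keep only the lower half */
            }
        }
    }
}
```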
I wonder if you benchmarked the reduction implementation, or if you had a benchmark to recommend?
I have implemented 2 benchmarks, but it seems like reductions are as slow as serial. Actually, lazy futures are even slower than eager futures.
I suppose this is either:
The 2-3 allocations: tasking-2.0/src/runtime.c, lines 1525 to 1545 at commit 303ce3e
Benchmark: