So the fact that `loadBalance` didn't split tasks was the main culprit behind #35, whoops. Fixes #35 and addresses the main reason behind aprell/tasking-2.0#3.
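For context, `loadBalance(Weave)` is the cooperative hook that long-running tasks call to let the scheduler distribute outstanding work; with this fix it can actually split tasks. A minimal usage sketch (the proc and numbers here are illustrative, not from the benchmarks):

```nim
import weave

proc longRunningTask(n: int) =
  ## Illustrative coarse-grained task, not from the actual benchmarks.
  for i in 0 ..< n:
    # ... expensive per-iteration work would go here ...
    # Cooperatively yield to the scheduler; after this fix,
    # loadBalance can split queued work so idle workers can steal it.
    loadBalance(Weave)

init(Weave)
spawn longRunningTask(1_000_000)
syncRoot(Weave)  # global barrier: wait for all spawned tasks
exit(Weave)
```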
Histogram pathological bench
After the fix we are consistently faster than OpenMP on the histogram bench:
So we can take it out of the pathological cases and complete @HadrienG2's challenge in HadrienG2/parallel-histograms-bench#2 (comment).
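For reference, histogram filling is pathological because each iteration does almost no work, so scheduling overhead dominates. A minimal sketch of the pattern with atomic bins (illustrative only; `fillHistogram` is a made-up name and the actual kernels live in the linked benchmark repos):

```nim
import weave, std/atomics

proc fillHistogram(data: ptr UncheckedArray[float32], n: int,
                   bins: ptr UncheckedArray[Atomic[int64]], numBins: int) =
  ## Illustrative sketch, not the benchmark kernel.
  ## Assumes init(Weave) was called and samples lie in [0, 1).
  parallelFor i in 0 ..< n:
    captures: {data, bins, numBins}
    let b = min(int(data[i] * numBins.float32), numBins - 1)
    discard bins[b].fetchAdd(1, moRelaxed)
```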
Log-Sum-Exp pathological bench
LogSumExp is now also faster than the sequential version.
The relevant kernels are in weave/benchmarks/xxx_pathological/logsumexp/weave_logsumexp.nim at commit a393c62:
- lines 183 to 202
- lines 223 to 240
- lines 266 to 295
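For readers unfamiliar with the kernel: log-sum-exp is evaluated in its numerically stable form, `max(x) + ln(Σ exp(xᵢ - max(x)))`, i.e. two reductions over the data (a max pass, then a sum pass), which makes it a good stress test for scheduling overhead. A sequential reference sketch (not the benchmark code itself):

```nim
import std/math

proc logsumexp(x: openArray[float32]): float32 =
  ## Numerically stable log-sum-exp:
  ## max(x) + ln(sum(exp(xi - max(x)))).
  ## Sequential reference; the benchmark parallelizes both passes.
  let xmax = max(x)
  var acc = 0'f32
  for xi in x:
    acc += exp(xi - xmax)
  result = xmax + ln(acc)
```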
GEneral Matrix Multiplication (GEMM, BLAS)
It doesn't help reach Laser or MKL speed on GEMM (perf is unchanged), though it makes targeting a random victim instead of the last victim more competitive (reaching over 2.3 TFlops).
Combining that with WV_MaxConcurrentStealPerWorker=2 in LastVictim mode triggers abysmal performance of 0.5 TFlops, so it might make sense to switch the default back from LastVictim (introduced in #74) to Random. This would also regain the ~10% perf lost on tasks shorter than roughly 20 µs.
Matrix transposition
Matrix transposition uses nested for loops, so the nesting gives the scheduler opportunities for full load balancing; we are on par with OpenMP collapse.
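The nested pattern looks roughly like this (a sketch of an out-of-place transposition of a row-major M×N matrix, close to but not verbatim the benchmark kernel):

```nim
import weave

proc transpose(M, N: int,
               bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Both loop levels are parallel, so the scheduler can split
  ## work along either dimension, similar to OpenMP collapse(2).
  parallelFor j in 0 ..< N:
    captures: {M, N, bufIn, bufOut}
    parallelFor i in 0 ..< M:
      captures: {j, M, N, bufIn, bufOut}
      bufOut[j*M + i] = bufIn[i*N + j]
```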
Meaning
I think this means that Weave no longer has a parallelism domain where it is weak; it is actually as fast/efficient as the top industry contenders in each category: