BLAS matmul vs Laser, Intel MKL, MKL-DNN, OpenBLAS and co #68
See #69: Laser can now reach 2.7 TFlops with GCC (static, guided and dynamic schedules); Clang is at 2.1~2.2 GHz. A nimsuggest process was probably hogging my CPU.
Some consolation: an Intel MKL build with TBB misses the 1 TFlops mark as well.
Change mkl_gemm.nim.cfg to:
but it reaches 3.2 TFlops with Intel OpenMP.
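The exact cfg contents are not reproduced above. As an illustration only, switching MKL's threading layer between TBB and Intel OpenMP is typically just a link-line change; the library names below come from the standard MKL link advisor, and the real mkl_gemm.nim.cfg in the repository may differ:

```
# Hypothetical sketch of mkl_gemm.nim.cfg, not the repository's actual file.
# TBB threading layer (the configuration that misses the 1 TFlops mark):
#   --passL:"-lmkl_intel_lp64 -lmkl_tbb_thread -lmkl_core -ltbb -lstdc++ -lpthread -lm -ldl"
# Intel OpenMP threading layer (the configuration that reaches 3.2 TFlops):
--passL:"-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl"
```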
The main issue was fixed in 36507ec by exposing more parallelism: parallelizing 3 nested loops instead of just 2.
Stats before:

Stats after:
Very interesting and thorough analysis of TBB overhead vs static scheduling: http://www.cs.yale.edu/homes/abhishek/abhishek-taco11.pdf
After #74 and changing work-stealing victim selection, we can now reach 2.3 TFlops. Interesting readings:
The idea was that workers would get a huge chunk of work upfront (similar to eager binary splitting or OpenMP scheduling) and wouldn't have to find more later. After implementation, the following configuration is stable at 2.28~2.30 TFlops:
Some ideas:

Rejected:
This would give up the main benefit of Weave, which is being purely channel-based: easier to extend with distributed computing, easier to formally verify, and more future-hardware-proof, assuming we will have more cluster-on-chips in the future and that hardware message-passing primitives will be available but memory coherency will be too expensive to maintain (i.e. very costly atomics), for example Mellanox/Tilera 100+ ARM cores, Intel SCC, Adapteva 1024 ARM cores.
Interrupts might be tricky for embedding in other programs, say Python. They would also flush the cache at an inopportune time (when the core is running carefully tuned AVX-512 instructions to use 100% of the CPU cycles).

Ideas:
Similar to latency-hiding work-stealing (http://reports-archive.adm.cs.cmu.edu/anon/2016/CMU-CS-16-112R.pdf), Weave might want another private queue for statically scheduled tasks with a static parallelFor. Tasks sent to that queue cannot be stolen and there would be no load balancing. A "schedule: static" field could be added to the parallelFor construct; a usage sketch follows below. This would allow:
This can be extended to:
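Purely as a hypothetical sketch of what the proposed syntax might look like (the schedule field does not exist in Weave today, and the loop body is made up for illustration):

```nim
import ../weave

init(Weave)

# Hypothetical sketch: `schedule: static` is the proposed field, it does
# not exist in Weave at the time of writing. The intent is that this
# loop's tasks stay on the submitting worker's private queue, can never
# be stolen, and get no load balancing.
parallelFor i in 0 ..< 1000:
  schedule: static
  echo "iteration ", i, " runs on its statically assigned worker"

exit(Weave)
```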
Both would help for IO tasks (#22): long-running event loops would live on the static queue and could spawn multithreaded computation on the work-stealing queue. Properly designing this for composition would avoid having to hardcode an event loop in Weave, meaning it could be extended to anything that requires a long-running polling thread.

Interesting logs

When scheduling the following nested loops, and trying to see whether Worker 0 distributes the i loop first (outer loop, more work sent per i iteration) or the j loop first (inner loop, less work sent per j iteration):

```nim
import
  ../weave,
  ../weave/instrumentation/loggers,
  ../weave/contexts

init(Weave)

# Outer loop: 8 strided iterations of i; inner loop: 16 strided iterations of j.
parallelForStrided i in 0 ..< 2000, stride = 256:
  parallelForStrided j in 0 ..< 2000, stride = 128:
    captures: {i}
    log("Matrix[%d, %d] (thread %d)\n", i, j, myID())

exit(Weave)
```

Logs are:
The first thing we see in the logs is that, when scheduling, the whole task is sent even if it is splittable (lines 266 to 289 in 2cebb3d).
The second thing we see is that somehow there was no task splitting at all, i.e. the following log entries are missing (lines 226 to 245 in 2cebb3d):
Further analysis of the logs shows a promising way to close the gap with OpenMP on GEMM (and hopefully on coarse-grained non-nested parallel loops as well) with a minimal change, one which does not undermine the potential value of pinned tasks.
Worker 0 starts working on the outer loop 0x121e1159 and generates multiple inner-loop tasks 0x121e0a6c. When it receives a steal request, it distributes those first. However, for non-nested parallelism where work comes as a single loop, like the pathological histogram and logsumexp cases (#35), this changes nothing unless we also change the initial task enqueue for parallel loops, which is currently winner-takes-all (i.e. there is no splitting).
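For illustration only (this is not Weave's actual implementation), splitting a strided iteration range in half when a steal request arrives, while always keeping at least one iteration for the victim, could look like this:

```nim
# Illustrative sketch, not Weave's internals: split a strided loop task in
# half on a steal request, keeping at least one iteration for the victim.
type LoopTask = object
  start, stop, stride: int      # iterates start, start+stride, ... while < stop

proc splitHalf(victim: var LoopTask): LoopTask =
  let remaining = (victim.stop - victim.start + victim.stride - 1) div victim.stride
  let keep = max(1, remaining div 2)          # the splitter always keeps an iteration
  result = LoopTask(start: victim.start + keep * victim.stride,
                    stop: victim.stop,
                    stride: victim.stride)    # upper half goes to the thief
  victim.stop = result.start                  # victim keeps the lower half

when isMainModule:
  var t = LoopTask(start: 0, stop: 2000, stride: 256)  # the outer loop above
  let stolen = t.splitHalf()
  echo t       # (start: 0, stop: 1024, stride: 256)   -> 4 iterations kept
  echo stolen  # (start: 1024, stop: 2000, stride: 256) -> 4 iterations given away
```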
Another interesting log. When a child backs off, the parent will wake it and send it work when it finds some (i.e. a lifeline, see Saraswat et al., Lifeline-based Global Load Balancing, http://www.cs.columbia.edu/~martha/courses/4130/au12/p201-saraswat.pdf). This is what happens in this log: the worker even tries to split tasks for its grandchildren but ultimately only sends work to its direct child.
Very in-depth analysis of MKL, MKL-DNN, Eigen, parallelism and threadpools in the context of deep-learning workloads: https://arxiv.org/pdf/1908.04705.pdf
* loadBalance didn't split tasks: hopefully addresses #68 and #35
* Introduce a new debugSplit and extra assertions to ensure a worker keeps at least an iteration
* Ensure a splitter always keeps an iteration
* Re-enable all logsumexp benchmarks
* 2.5 TFlops!! Go back to random victim policy, only parallelize 2 loops in Weave
* Move 2D histogram and logsumexp out of the pathological cases
* Add a README on why GEMM is a key workload
* Add some reference numbers
After #83, Weave achieves 2.53 TFlops. Some figures to keep a realistic performance target:
Perfect multithreaded scaling on 18 cores would be 2.88 TFlops; reaching over 3 TFlops would require significantly improving the single-threaded GEMM kernel, which is out of scope for Weave and should be tackled in Laser. The theoretical maximum is computed by multiplying:
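The factors are not listed above. As a sketch of the usual peak-FLOPS formula, assuming this is an 18-core CPU with two AVX-512 FMA units per core, running single-precision GEMM at the 3.5 GHz AVX-512 clock mentioned below:

```nim
# Rough peak-FLOPS arithmetic (assumptions: 18 cores, 3.5 GHz AVX-512 clock,
# 2 FMA units per core, 16 fp32 lanes per 512-bit register, 2 flops per FMA).
let
  cores       = 18
  ghz         = 3.5                # AVX-512 all-core turbo from the BIOS settings below
  fmaUnits    = 2
  lanes       = 16                 # fp32 elements per 512-bit register
  flopsPerFma = 2                  # one multiply + one add
  peakGFlops  = cores.float * ghz * fmaUnits.float * lanes.float * flopsPerFma.float

echo peakGFlops                    # ~4032 GFlops, i.e. ~4.0 TFlops fp32 peak
```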
It seems like without Backoff, we actually do reach the 2.7 TFlops
Closed by #94. We can reach over 2.8 TFlops from a 160~170 GFlops single-threaded baseline (17.5x speedup), while MKL reaches 3.4 TFlops from a 210 GFlops baseline (16.2x), OpenBLAS 2.8 TFlops from a 193 GFlops baseline (14.5x), and MKL-DNN 3.0 TFlops from a 200 GFlops baseline (15x speedup). So the issue on the multithreading front and performance vs Laser/OpenMP is solved, and even with the slower kernel, performance is very good on high-core-count machines. There is still an issue with nesting, though: the new code still uses syncRoot, which prevents nesting in other parallel regions (like batched matmul). A simple nestable kernel would await/sync the outer and inner loops, but it seems that this leads to load imbalance, as the performance reached is at most 1.8 TFlops. Maybe the kernel could be redesigned to take full advantage of the dataflow approach, but that is out of scope for this issue. weave/benchmarks/matmul_gemm_blas/gemm_pure_nim/gemm_weave.nim, lines 154 to 188 in 85b90b3
Somewhat related to #31 and very probably related to #35
By giving Laser (OpenMP-based) the time to get the CPU to full speed, it can reach 2.23 TFlops on my CPU (see below; nimsuggest was running at 100%).
Intel MKL is at 2.85 TFlops
OpenBLAS is at 1.52 TFlops
Unfortunately Weave is stuck at 0.6 TFlops.
First look
On the right, more than half of the instructions for Weave are not floating-point instructions.
Also it seems like the CPU frequency is at 2.7GHz in the case of Weave while Laser reaches 3.6GHz. My BIOS settings are 4.1GHz all core turbo for normal code, 4.0GHz for AVX2 and 3.5GHz for AVX512 code.
Call stack
Looking into the call stack, it seems like we are paying the price of dynamic scheduling :/ as the time spent in the GEMM kernel is similar.
Conclusion
We have very high overhead for coarse-grained parallelism where OpenMP shines. I suspect this is also the reason behind #35.
Suspicions
Additionally, I suspect the runtime is polluting the L1 cache with runtime data, while GEMM has been carefully tuned to take advantage of the L1 and L2 caches. Using smaller tiles might help (similarly, avoiding hyperthreading, which causes the same kind of pollution, would probably help as well).