We investigate how the topology of the computational hardware influences throughput, measured in double-precision floating-point operations per second (FLOPS). Two commercially available GPU-accelerated compute nodes are compared using dense matrix multiplication as a compute-bound problem. We find that the higher host-device memory bandwidth provided by NVLink-enabled CPUs (compared to PCIe connections) significantly improves overall performance. We also compare our own implementation with an NVIDIA benchmark and report substantial speedups, especially for very large matrices.
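As a rough illustration of the two quantities discussed above, the following sketch computes the achieved throughput of a dense matrix multiplication from a measured runtime and estimates the host-device transfer time for a given link bandwidth. It is not the code used in this repository: the function names are hypothetical, the operation count uses the conventional 2·m·n·k approximation, and the runtime and bandwidth values are placeholders that would need to be replaced with measured ones.

```python
def dgemm_flop_count(m: int, n: int, k: int) -> int:
    """Operation count for C = A @ B with A of shape (m, k) and B of shape (k, n).

    Uses the conventional 2*m*n*k approximation (one multiply and one add
    per inner-product term); the exact count is m*n*(2*k - 1).
    """
    return 2 * m * n * k


def achieved_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in GFLOPS for a multiplication that took `seconds`."""
    return dgemm_flop_count(m, n, k) / seconds / 1e9


def transfer_seconds(m: int, n: int, k: int,
                     bandwidth_gb_per_s: float,
                     bytes_per_element: int = 8) -> float:
    """Estimated time to copy A and B to the device and C back to the host.

    Assumes a single effective bandwidth for both directions and
    double precision (8 bytes per element) by default.
    """
    elements = m * k + k * n + m * n
    return elements * bytes_per_element / (bandwidth_gb_per_s * 1e9)


if __name__ == "__main__":
    size = 1000             # placeholder matrix dimension
    runtime = 0.5           # placeholder measured runtime in seconds
    link_bandwidth = 24.0   # placeholder bandwidth in GB/s, not a measured value
    print(f"throughput: {achieved_gflops(size, size, size, runtime):.2f} GFLOPS")
    print(f"transfer:   {transfer_seconds(size, size, size, link_bandwidth):.6f} s")
```

Since the operation count grows as n³ while the data moved grows only as n², the transfer term matters most when matrices must be streamed between host and device, which is where the link bandwidth is expected to show up in the overall figure.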
simonaertssen/large_scale_gpus
About
This repository contains the code and framework for the special course in 'Large Scale Computation of ... GPU Architecture'.