large_scale_gpus

We investigate how the topology of the computational hardware influences throughput, measured in double-precision floating-point operations per second (FLOPS). Two commercially available GPU-accelerated compute nodes are compared using dense matrix multiplication as a compute-bound problem. We find that the higher host-device memory bandwidth provided by NVLink-enabled CPUs (compared with PCIe connections) significantly improves overall performance. We also compare our own implementation with an NVIDIA benchmark and report substantial speedups, especially for very large matrices.
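
To make the FLOPS metric concrete, the following is a minimal sketch in plain C of how throughput for a dense double-precision matrix product can be derived from a measured run time. It is not taken from this repository: the function names `dgemm_naive`, `dgemm_flop_count`, and `dgemm_gflops` are hypothetical, and the sketch assumes the common convention of counting 2n^3 operations for an n x n product, which may differ from the counting used in the report.

```c
#include <assert.h>
#include <stddef.h>

/* Naive reference product C = A * B for square n x n matrices stored
 * row-major. Hypothetical helper for illustration only; it is not the
 * tiled multi-GPU implementation benchmarked in this repository. */
void dgemm_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Assumed operation count for an n x n product: n^3 multiplications
 * plus n^3 additions, i.e. 2 * n^3 floating-point operations. */
double dgemm_flop_count(size_t n)
{
    double d = (double)n;
    return 2.0 * d * d * d;
}

/* Throughput in GFLOPS given the measured wall-clock time in seconds. */
double dgemm_gflops(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return dgemm_flop_count(n) / seconds / 1e9;
}
```

Under these assumptions, a 1000 x 1000 product that takes 2 seconds corresponds to 1 GFLOPS. For a fair comparison between nodes, the measured time would need to state whether host-device transfers are included, since that is where the interconnect (NVLink or PCIe) enters.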

.vscode
Benchmarks
ClassicKernels
Cuda_DGEMM_tiled_JochenKreuz
DIEKUHDA
First experiments
MultiplyExperiments
TimeBandwith
matmult_experiments
.gitignore
Commands cheatsheet.txt
Multi-GPU-Topologies_special_course_report.pdf
README.md

simonaertssen/large_scale_gpus
