TransferBench is a simple utility capable of benchmarking simultaneous copies between user-specified CPU and GPU devices.
Documentation for TransferBench is available at https://rocm.docs.amd.com/projects/TransferBench/en/latest/index.html.
- You must have a ROCm stack installed on your system (HIP runtime)
- You must have libnuma installed on your system
- AMD IOMMU must be enabled and set to passthrough for AMD Instinct cards
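The IOMMU requirement can be spot-checked before building. The following is a minimal sketch, assuming a Linux host where passthrough is requested with the `iommu=pt` kernel parameter; the helper name `has_iommu_pt` is hypothetical and not part of TransferBench, and a missing parameter does not by itself prove that passthrough is disabled.

```shell
# Hypothetical helper: succeeds if the given kernel command line
# contains the iommu=pt parameter as a whole word.
has_iommu_pt() {
  case " $1 " in
    *" iommu=pt "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Check the running kernel's command line, if it is readable.
if [ -r /proc/cmdline ] && has_iommu_pt "$(cat /proc/cmdline)"; then
  echo "iommu=pt is present on the kernel command line"
else
  echo "iommu=pt not found on the kernel command line"
fi
```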
To build documentation locally, use the following code:
cd docs
pip3 install -r .sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
You can build TransferBench using Makefile or CMake.
- Makefile:
  make
- CMake:
  mkdir build
  cd build
  CXX=/opt/rocm/bin/hipcc cmake ..
  make
If ROCm is not installed in /opt/rocm/, you must set ROCM_PATH to the correct location.
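As an illustration of the ROCM_PATH note, here is a minimal sketch that resolves the ROCm prefix and checks for hipcc before building. The helper name `resolve_rocm_path` is hypothetical, and the assumption that hipcc sits under `bin/` of the prefix mirrors the default `/opt/rocm/bin/hipcc` path used in the CMake command.

```shell
# Hypothetical helper: an explicit ROCM_PATH wins, otherwise fall back
# to the default install location.
resolve_rocm_path() {
  echo "${ROCM_PATH:-/opt/rocm}"
}

prefix="$(resolve_rocm_path)"
if [ -x "$prefix/bin/hipcc" ]; then
  echo "Using hipcc from $prefix"
else
  echo "No hipcc under $prefix; set ROCM_PATH before building" >&2
fi
```

With a non-default install, the Makefile build would then be invoked as `ROCM_PATH=<path_to_ROCm> make`, where `<path_to_ROCm>` is a placeholder for your install location.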
You can build TransferBench to run on NVIDIA platforms via HIP or native NVCC.
Use the following code to build with HIP for NVIDIA (note that you must have a HIP-compatible CUDA version installed, e.g., CUDA 11.5):
CUDA_PATH=<path_to_CUDA> HIP_PLATFORM=nvidia make
Use the following code to build with native NVCC (builds TransferBenchCuda):
make
- Running TransferBench with no arguments displays usage instructions and detected topology information
- You can use several preset configurations instead of a configuration file:
  - a2a: All-to-all benchmark test
  - cmdline: Take in Transfers to run from command-line instead of via file
  - p2p: Peer-to-peer benchmark test
  - pcopy: Benchmark parallel copies from a single GPU to other GPUs
  - rsweep: Random sweep across possible sets of transfers
  - rwrite: Benchmarks parallel remote writes from a single GPU to other GPUs
  - scaling: GPU subexecutor scaling tests
  - schmoo: Local/Remote read/write/copy between two GPUs
  - sweep: Sweep across possible sets of transfers
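Because a preset takes the place of the configuration file, selecting one should amount to passing its name as the first argument. The dry-run sketch below only prints the command it would run. `TB_BIN` and `preset_cmd` are hypothetical names introduced for illustration, and the binary location `./TransferBench` is an assumption about where the build placed it.

```shell
# Hypothetical variable: path to the built binary (assumed location).
TB_BIN="${TB_BIN:-./TransferBench}"

# Preset names as listed above.
presets="a2a cmdline p2p pcopy rsweep rwrite scaling schmoo sweep"

# Hypothetical helper: print the command for a known preset,
# or fail with a message for an unknown name.
preset_cmd() {
  for p in $presets; do
    if [ "$p" = "$1" ]; then
      echo "$TB_BIN $1"
      return 0
    fi
  done
  echo "unknown preset: $1" >&2
  return 1
}

preset_cmd p2p
```

Running the binary with no arguments remains the authoritative way to see which presets and extra arguments your build accepts.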
- When using the same GPU executor in multiple simultaneous transfers on separate streams (USE_SINGLE_STREAM=0), performance may be serialized due to the maximum number of hardware queues available
  - The maximum number of hardware queues can be adjusted via GPU_MAX_HW_QUEUES
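One way to apply the hint above is to set both variables for a single run only. This is a minimal sketch: the wrapper name `run_with_queues` is hypothetical, and the queue count used in the usage comment is an illustrative value, not a recommendation.

```shell
# Hypothetical wrapper: run a command with GPU_MAX_HW_QUEUES set to the
# first argument and separate streams enabled (USE_SINGLE_STREAM=0).
# env applies the variables to the launched command only.
run_with_queues() {
  queues="$1"
  shift
  env GPU_MAX_HW_QUEUES="$queues" USE_SINGLE_STREAM=0 "$@"
}

# Usage, assuming the binary is in the current directory:
#   run_with_queues 8 ./TransferBench p2p
```

Scoping the variables to one invocation keeps the rest of the shell session unchanged, which makes it easier to compare runs with different queue counts.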