Releases: ROCm/TransferBench
TransferBench v1.53
Added
- Added the ability to specify NULL as the source or destination memory type for the sweep preset
TransferBench v1.52
Added
- Added USE_HSA_DMA env var to switch to using hsa_amd_memory_async_copy instead of hipMemcpyAsync for DMA execution
- Added ability to set USE_GPU_DMA env var for a2a benchmark
- Adding check for large BAR enablement for GPU devices during topology check
Fixed
- Potential memory leak if HSA reports 0 hops between GPUs and CPUs
TransferBench v1.51
Modified
- CSV output has been modified slightly to match normal terminal output
- Output for non-single-stream mode has been changed to match single-stream mode (results per Executor)
Added
- Support for sub-iterations via NUM_SUBITERATIONS. This allows for additional looping during an iteration.
If set to 0, this should loop infinitely (which may be useful for some debug purposes)
- Support for a variable number of subexecutors (currently for the GPU-GFX executor only). Setting subExecutors to
0 will run over a range of CU counts and report only the results of the best one found. This can be tuned
for performance by setting the MIN_VAR_SUBEXEC and MAX_VAR_SUBEXEC environment variables to narrow the
search space. The number of CUs used will be identical for all variable subExecutor transfers
- Experimental new "healthcheck" preset config, which currently supports only the MI300 series. This preset runs
through CPU to GPU bandwidth tests and all-to-all XGMI bandwidth tests and compares against expected values.
Pass criteria limits can be modified (due to platform differences) via the environment variables
LIMIT_UDIR (unidirectional), LIMIT_BDIR (bidirectional), and LIMIT_A2A (per GPU-GPU link bandwidth)
Fixed
- Fixed out-of-bounds memory access during topology detection that can happen if the number of
CPUs is less than the number of NUMA domains
- Fixed CU masking functionality on multi-XCD architectures (e.g. MI300)
TransferBench v1.50
Added
- Adding new parallel copy preset benchmark (pcopy)
- Usage: ./TransferBench pcopy <numBytes=64M> <#CUs=8> <srcGpu=0> <minGpus=1> <maxGpus=#GPU-1>
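A minimal sketch of a pcopy invocation, following the argument order in the usage line above. The first two values are the stated defaults; `MAX_GPUS=3` is an assumption for a machine with at least 4 GPUs, and the variable names are only illustrative. The command is printed first and run only if the binary exists in the current directory.

```shell
# Illustrative values; argument order follows the usage line above.
NUM_BYTES=64M   # transfer size (the stated default)
NUM_CUS=8       # CUs to use (the stated default)
SRC_GPU=0       # source GPU index (the stated default)
MIN_GPUS=1      # smallest number of GPUs to use
MAX_GPUS=3      # assumption: system has at least 4 GPUs (default is #GPU-1)

PCOPY_CMD="./TransferBench pcopy $NUM_BYTES $NUM_CUS $SRC_GPU $MIN_GPUS $MAX_GPUS"
echo "$PCOPY_CMD"

# Run only if the binary is present in the current directory.
if [ -x ./TransferBench ]; then
  ./TransferBench pcopy "$NUM_BYTES" "$NUM_CUS" "$SRC_GPU" "$MIN_GPUS" "$MAX_GPUS"
fi
```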
Fixed
- Removed non-copy DMA Transfers (these had previously been using hipMemset)
- Fixed CPU executor when operating on null destination
TransferBench v1.49
Fixes
- Enumerating previously missed DMA engines used only for CPU traffic in topology display
TransferBench v1.48
Fixes
- Various fixes for TransferBenchCuda
Additions
- Support for targeting specific DMA engines via executor subindex (e.g. D0.1)
- Printing warnings when executors are overcommitted
Modifications
- USE_REMOTE_READ supported for rwrite preset benchmark
TransferBench v1.47
Fixes
- Fixing CUDA compilation
TransferBench v1.46
Fixes
- Fixing GFX_UNROLL being set to 13 (past 8) on gfx906 cards
Modifications
- GFX_SINGLE_TEAM=1 by default
- Adding field showing summation of individual Transfer bandwidths for Executors
TransferBench v1.45
Additions
- Adding A2A_MODE to a2a preset (0 = copy, 1 = read-only, 2 = write-only)
- Adding GFX_UNROLL to modify GFX kernel's unroll factor
- Adding GFX_WAVE_ORDER to modify order in which wavefronts process data
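As a rough illustration of A2A_MODE, the loop below runs the a2a preset once per mode. It assumes the preset is invoked as `./TransferBench a2a`, by analogy with the other preset usage lines in these notes, and that all other settings are left at their defaults.

```shell
# A2A_MODE values as listed above: 0 = copy, 1 = read-only, 2 = write-only.
for mode in 0 1 2; do
  echo "A2A_MODE=$mode ./TransferBench a2a"
  # Run only if the binary is present in the current directory.
  # Assumption: the a2a preset takes no required arguments.
  if [ -x ./TransferBench ]; then
    A2A_MODE=$mode ./TransferBench a2a
  fi
done
```

Setting the variable as a command prefix keeps it scoped to that single run, so the modes do not leak into later invocations.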
Modifications
- Rewrote the GFX reduction kernel to support new wave ordering
TransferBench v1.44
Additions
- Adding rwrite preset to benchmark remote parallel writes
- Usage: ./TransferBench rwrite <numBytes=64M> <#CUs=8> <srcGpu=0> <minGpus=1> <maxGpus=3>
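A small sketch of how the rwrite preset might be run across several transfer sizes. The argument order follows the usage line above; the sizes other than 64M are assumptions (that the `M` suffix is accepted for other values in the same way as the default), and the variable names are only illustrative.

```shell
# Defaults taken from the usage line above.
NUM_CUS=8
SRC_GPU=0
MIN_GPUS=1
MAX_GPUS=3

# Assumption: sizes other than the 64M default accept the same "M" suffix.
for size in 1M 16M 64M; do
  echo "./TransferBench rwrite $size $NUM_CUS $SRC_GPU $MIN_GPUS $MAX_GPUS"
  # Run only if the binary is present in the current directory.
  if [ -x ./TransferBench ]; then
    ./TransferBench rwrite "$size" "$NUM_CUS" "$SRC_GPU" "$MIN_GPUS" "$MAX_GPUS"
  fi
done
```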