Documentation for TransferBench is available at https://rocm.docs.amd.com/projects/TransferBench.
- Refactored TransferBench into a header-only library combined with a thin client to facilitate the use of TransferBench as the backend for other applications
- Optimized how data validation is handled - this should speed up Tests with many parallel transfers as data is only generated once
- Preset benchmarks now no longer take in any extra command line arguments. Preset settings are only accessed via environment variables. Details for each preset are printed
- The a2a preset benchmark now defaults to using fine-grained memory and GFX unroll of 2
- Refactored how Transfers are launched in parallel which has reduced some CPU-side overheads
- CPU and DMA executor timing now use CPU wall clock timing instead of slowest Transfer time
- New one2all preset which sweeps over all subsets of parallel transfers from one GPU to others
- Adding new warnings for DMA execution relating to how HIP will default to using agents from the source memory
- CU scaling preset has been removed. Similar functionality already exists in the schmoo preset benchmark
- Preparation of source data via GFX kernel has been removed (USE_PREP_KERNEL)
- Removed GFX block-reordering (BLOCK_ORDER)
- Moved NUM_CPU_DEVICES and NUM_GPU_DEVICES out of the common env vars and into only the presets they apply to
- Removed SHARED_MEM_BYTES option for GFX executor
- Removed USE_PCIE_INDEX and SHARED_MEM_BYTES
- Fixed a potential timing reporting issue when DMA executed Transfers end up getting serialized.
- Added ability to specify NULL for sweep preset as source or destination memory type
- Added USE_HSA_DMA env var to switch to using hsa_amd_memory_async_copy instead of hipMemcpyAsync for DMA execution
- Added ability to set USE_GPU_DMA env var for a2a benchmark
- Adding check for large BAR enablement for GPU devices during topology check
- Potential memory leak if HSA reports 0 hops between GPUs and CPUs
- CSV output has been modified slightly to match normal terminal output
- Output for non single stream mode has been changed to match single stream mode (results per Executor)
- Support for sub-iterations via NUM_SUBITERATIONS. This allows for additional looping during an iteration. If set to 0, this should loop infinitely (which may be useful for some debug purposes)
- Support for variable number of subexecutors (currently for GPU-GFX executor only). Setting subExecutors to 0 will run over a range of CUs to use, and report only the results of the best one found. This can be tuned for performance by setting the MIN_VAR_SUBEXEC and MAX_VAR_SUBEXEC environment variables to narrow the search space. The number of CUs used will be identical for all variable subExecutor transfers
- Experimental new "healthcheck" preset config which currently only supports MI300 series. This preset runs through CPU to GPU bandwidth tests and all-to-all XGMI bandwidth tests and compares against expected values. Pass criteria limits can be modified (due to platform differences) via the environment variables LIMIT_UDIR (unidirectional), LIMIT_BDIR (bidirectional), and LIMIT_A2A (per GPU-GPU link bandwidth)
- Fixed out-of-bounds memory access during topology detection that can happen if the number of CPUs is less than the number of NUMA domains
- Fixed CU masking functionality on multi-XCD architectures (e.g. MI300)
- Adding new parallel copy preset benchmark (pcopy)
- Usage: ./TransferBench pcopy <numBytes=64M> <#CUs=8> <srcGpu=0> <minGpus=1> <maxGpus=#GPU-1>
- Removed non-copies DMA Transfers (this had previously been using hipMemset)
- Fixed CPU executor when operating on null destination
- Enumerating previously missed DMA engines used only for CPU traffic in topology display
- Various fixes for TransferBenchCuda
- Support for targeting specific DMA engines via executor subindex (e.g. D0.1)
- Printing warnings when executors are overcommitted
- USE_REMOTE_READ supported for rwrite preset benchmark
- Fixing CUDA support
- Fixing GFX_UNROLL set to 13 (past 8) on gfx906 cards
- GFX_SINGLE_TEAM=1 by default
- Adding field showing summation of individual Transfer bandwidths for Executors
- Adding A2A_MODE to a2a preset (0 = copy, 1 = read-only, 2 = write-only)
- Adding GFX_UNROLL to modify GFX kernel's unroll factor
- Adding GFX_WAVE_ORDER to modify order in which wavefronts process data
- Rewrote the GFX reduction kernel to support new wave ordering
- Adding rwrite preset to benchmark remote parallel writes
- Usage: ./TransferBench rwrite <numBytes=64M> <#CUs=8> <srcGpu=0> <minGpus=1> <maxGpus=3>
- Modifying a2a to show executor timing, as well as executor min/max bandwidth
- Fixing schmoo maxNumCus optional arg parsing
- Schmoo output modified to be easier to copy
- Adding schmoo preset config benchmarks local/remote reads/writes/copies
- Usage: ./TransferBench schmoo <numBytes=64M> <localIdx=0> <remoteIdx=1> <maxNumCUs=32>
- Fixing some misreported timings when running with non-fixed number of iterations
- Fixing XCC defaulting to 0 instead of random for preset configs, ignoring XCC_PREF_TABLE
- (Experimental) Adding support for Executor sub-index
- Remove deprecated gcnArch code. ROCm version must include support for hipDeviceMallocUncached
- Adding missing threadfence which could cause non-fine-grained Transfers to report higher speeds
- USE_SINGLE_STREAM is enabled by default now. (Disable via USE_SINGLE_STREAM=0)
- Fix unrecognized token error when XCC_PREF_TABLE is unspecified
- (Experimental) Adding XCC filtering - combined with XCC_PREF_TABLE, this tries to select specific XCCs to use for specific (SRC->DST) Transfers
- USE_FINE_GRAIN also applies to a2a preset
- Set GPU_KERNEL=3 as default for gfx942
- Added the ALWAYS_VALIDATE environment variable to allow for validation after every iteration, instead of only once at the end of all iterations
- Increased the line limit from 2048 to 32768
- SHOW_ITERATIONS now shows XCC:CU instead of just CU ID
- SHOW_ITERATIONS is printed when USE_SINGLE_STREAM=1
- BLOCK_SIZE has been added to control the threadblock size (must be a multiple of 64, up to 512)
- BLOCK_ORDER has been added to control how work is ordered for GFX-executors running USE_SINGLE_STREAM=1
  - 0 - Threadblocks for transfers are ordered sequentially (default)
  - 1 - Threadblocks for transfers are interleaved
  - 2 - Threadblocks for transfers are ordered randomly
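
The BLOCK_SIZE constraint and the three BLOCK_ORDER modes above can be pictured with a small Python sketch. It is not TransferBench code: the function names are hypothetical, and the exact launch order TransferBench uses for the "interleaved" and "random" modes is not stated in this changelog, so round-robin and a seeded shuffle are assumptions made purely for illustration.

```python
import random
from itertools import zip_longest


def is_valid_block_size(block_size):
    """Return True if block_size is a positive multiple of 64 no larger than 512."""
    return 0 < block_size <= 512 and block_size % 64 == 0


def order_blocks(blocks_per_transfer, block_order, seed=0):
    """Return (transfer, block) pairs in an illustrative launch order.

    blocks_per_transfer[t] is the number of threadblocks of transfer t.
    The interleaved and random orders below are assumptions, not the
    documented TransferBench behavior.
    """
    per_transfer = [[(t, b) for b in range(n)]
                    for t, n in enumerate(blocks_per_transfer)]
    if block_order == 0:
        # Sequential: all blocks of transfer 0, then transfer 1, and so on.
        return [tb for blocks in per_transfer for tb in blocks]
    if block_order == 1:
        # Interleaved (assumed round-robin across transfers).
        return [tb for group in zip_longest(*per_transfer)
                for tb in group if tb is not None]
    if block_order == 2:
        # Random (assumed uniform shuffle; seeded here for repeatability).
        flat = [tb for blocks in per_transfer for tb in blocks]
        random.Random(seed).shuffle(flat)
        return flat
    raise ValueError("block_order must be 0, 1, or 2")
```

For example, `order_blocks([2, 1], 1)` places the first block of each transfer before the second block of transfer 0.
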
- A2A preset config now responds to USE_REMOTE_READ
- Race-condition during wall-clock initialization caused "inf" during single-stream runs
- CU numbering output after CU masking
- The default number of warmups has been reverted to 3
- The default unroll factor for gfx940/941 has been set to 6
- Added A2A_DIRECT, which only runs all-to-all on directly connected GPUs (now on by default)
- Added average statistics for P2P and A2A benchmarks
- Added USE_FINE_GRAIN for P2P benchmark
  - With older devices, P2P performance with default coarse-grain device memory stops timing as soon as a request is sent to data fabric, and not actually when it arrives remotely. This can artificially inflate bandwidth numbers, especially when sending small amounts of data.
- Modified P2P output to help distinguish between CPU and GPU devices
- Fixed Makefile target to prevent unnecessary re-compilation
- Added cmdline preset to allow specification of simple tests on command line (e.g., ./TransferBench cmdline 64M "1 4 G0->G0->G1")
- Adding the HIDE_ENV environment variable, which stops environment variable values from printing
- Adding the CU_MASK environment variable, which allows you to select the CUs to run on
  - CU_MASK is specified in CU indices (0-#CUs-1), where '-' can be used to denote ranges of values (e.g., CU_MASK=3-8,16 requests that transfer be run only on CUs 3,4,5,6,7,8,16)
  - Note that this is somewhat experimental and may not work on all hardware
- SHOW_ITERATIONS now shows CU usage for that iteration (experimental)
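
To make the CU_MASK syntax concrete, here is a minimal Python sketch that expands a mask string into the list of CU indices it selects. The function name `parse_cu_mask` is hypothetical and the parsing rules (comma-separated entries, inclusive ranges) are inferred from the example above; TransferBench's own parser may handle edge cases differently.

```python
from typing import List


def parse_cu_mask(spec: str) -> List[int]:
    """Expand a CU_MASK-style string such as "3-8,16" into sorted CU indices.

    Assumes comma-separated entries where "a-b" is an inclusive range.
    """
    cus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            lo_text, hi_text = part.split("-", 1)
            lo, hi = int(lo_text), int(hi_text)
            if lo > hi:
                raise ValueError("range start exceeds range end: " + part)
            cus.update(range(lo, hi + 1))
        else:
            cus.add(int(part))
    return sorted(cus)
```

Calling `parse_cu_mask("3-8,16")` yields `[3, 4, 5, 6, 7, 8, 16]`, matching the example in the entry above.
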
- Added extra comments on commonly missing includes with details on how to install them
- CUDA compilation works again (the wall_clock64 CUDA alias was not defined)
- Setting SHOW_ITERATIONS=1 provides additional information about per-iteration timing for file and P2P configs
- For file configs, iterations are sorted from min to max bandwidth and displayed with standard deviation
- For P2P, min/max/standard deviation is shown for each direction
- P2P benchmark formatting now reports bidirectional bandwidth in each direction (as well as sum) for clarity
- Fixed a bug in the P2P bidirectional benchmark that used the incorrect number of subExecutors for CPU<->GPU tests
- New All-To-All GPU benchmark accessed by preset "A2A"
- Added gfx941 wall clock frequency
- New GPU subexec scaling benchmark accessed by preset "scaling"
- Tests GPU-GFX copy performance based on # of CUs used
- Switched the kernel timing function to wall_clock64
- Fixed a bug with SAMPLING_FACTOR
- VALIDATE_DIRECT can now be used with USE_PREP_KERNEL
- Switched to local GPU for validating GPU memory
- VALIDATE_DIRECT now also applies to source memory array checking
- Added null memory pointer check prior to deallocation
- Adding the ability to validate GPU destination memory directly without going through the CPU staging buffer (VALIDATE_DIRECT)
  - Note that this only works on AMD devices with large-bar access enabled, and may slow things down considerably
- Refactored how environment variables are displayed
- Mismatch stops after the first detected error within an array instead of listing all mismatched elements
- Allowed switch to GFX kernel for source array initialization (USE_PREP_KERNEL)
  - Note that USE_PREP_KERNEL can't be used with FILL_PATTERN
- Added the ability to compile with nvcc only (TransferBenchCuda)
- The default pattern was set to [Element i = ((i * 517) modulo 383 + 31) * (srcBufferIdx + 1)]
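
The formula above can be evaluated directly. The following Python sketch computes the pattern values for the first few elements; the function name `default_pattern` is hypothetical, and the sketch returns plain integers, leaving aside how TransferBench stores the values in memory.

```python
def default_pattern(num_elements, src_buffer_idx=0):
    """Return [((i * 517) % 383 + 31) * (src_buffer_idx + 1) for each element i]."""
    return [((i * 517) % 383 + 31) * (src_buffer_idx + 1)
            for i in range(num_elements)]


first_source = default_pattern(4)        # [31, 165, 299, 50]
second_source = default_pattern(4, 1)    # [62, 330, 598, 100]
```

Because of the `(srcBufferIdx + 1)` factor, each source buffer holds a scaled copy of the base sequence, so the buffers differ from one another.
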
- Added the example.cfg file
- Additional src array validation during preparation
- Added a new environment variable (CONTINUE_ON_ERROR) to resume tests after a mis-match detection
- Initialized GPU memory to 0 during allocation
- Fixed a bug that prevented single transfers greater than 8 GB
- Removed "check for latest ROCm" warning when allocating too much memory
- Off-source memory value is now printed when a mis-match is detected
- Added documentation
- Added pthread linking in src/Makefile and CMakeLists.txt
- Added printing off the hex value of the floats for output and reference
- Added support for cmake
- Converted to the Pitchfork layout standard
- Added support for TransferBench on NVIDIA platforms (via HIP_PLATFORM=nvidia)
  - Note that CPU executors on NVIDIA platform cannot access GPU memory (no large-bar access)
- Added multi-input/multi-output (MIMO) support: transfers now can reduce (element-wise summation) multiple input memory arrays and write sums to multiple outputs
- Added GPU-DMA executor 'D', which uses hipMemcpy for SDMA copies
  - Previously, this was done using USE_HIP_CALL, but now GPU-GFX kernel can run in parallel with GPU-DMA, instead of applying to all GPU executors globally
  - GPU-DMA executor can only be used for single-input/single-output transfers
  - GPU-DMA executor can only be associated with one SubExecutor
- Added new "Null" memory type 'N', which represents empty memory. This allows for read-only or write-only transfers
- Added new GPU_KERNEL environment variable that allows switching between various GPU-GFX reduction kernels
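
As a rough illustration of the multi-input/multi-output semantics described above, the Python sketch below sums several source arrays element-wise and writes the result to each destination. It is a CPU-side model only: `mimo_transfer` is a hypothetical name, and the sketch says nothing about how the GPU-GFX reduction kernels are implemented.

```python
def mimo_transfer(sources, num_destinations):
    """Element-wise sum of all source arrays, copied to each destination.

    sources is a non-empty list of equal-length lists of numbers.
    Returns one independent output list per destination.
    """
    if not sources:
        raise ValueError("at least one source array is required")
    length = len(sources[0])
    if any(len(src) != length for src in sources):
        raise ValueError("all source arrays must have the same length")
    summed = [sum(values) for values in zip(*sources)]
    return [list(summed) for _ in range(num_destinations)]
```

With a single source and a single destination this reduces to a plain copy, which is the case the GPU-DMA executor is limited to.
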
- Improved GPU-GFX kernel performance based on hardware architecture when running with fewer CUs
- Updated the example.cfg file to cover new features
- Updated output to support MIMO
- Changed CU and CPU thread naming to SubExecutors for consistency
- Sweep Preset: default sweep preset executors now includes DMA
- P2P benchmarks:
  - Removed p2p_rr, g2g and g2g_rr (now only works via P2P)
    - Setting NUM_CPU_DEVICES=0 can only be used to benchmark GPU devices (like g2g)
    - The new USE_REMOTE_READ environment variable replaces _rr presets
  - New environment variable USE_GPU_DMA=1 replaces USE_HIP_CALL=1 for benchmarking with GPU-DMA Executor
  - Number of GPU SubExecutors for benchmark can be specified via NUM_GPU_SE
    - Defaults to all CUs for GPU-GFX, 1 for GPU-DMA
  - Number of CPU SubExecutors for benchmark can be specified via NUM_CPU_SE
- Pseudo-random input pattern has been slightly adjusted to have different patterns for each input array within same transfer
- USE_HIP_CALL: use GPU-DMA executor 'D' or set USE_GPU_DMA=1 for P2P benchmark presets
  - Currently, a warning will be issued if USE_HIP_CALL is set to 1 and the program will stop
- NUM_CPU_PER_TRANSFER: the number of CPU SubExecutors will be whatever is specified for the transfer
- USE_MEMSET: this function can now be done via a transfer using the null memory type
- Fixed incorrect bandwidth calculation when using single stream mode and per-transfer data sizes
- Printing off src/dst memory addresses during interactive mode
- Switching to numa_set_preferred instead of set_mempolicy
- Fixed handling of non-configured NUMA nodes
- Topology detection now shows actual NUMA node indices
- Fixed 'for' issue with NUM_GPU_DEVICES
- Fixed bug with allocations involving non-default CPU memory types
- Unpinned CPU memory type ('U'), which may require HSA_XNACK=1 in order to access via GPU executors
- Added sweep configuration logging to lastSweep.cfg
- Ability to specify the number of CUs to use for sweep-based presets
- Modified advanced configuration file format to accept bytes-per-transfer
- Fixed random sweep repeatability
- Fixed bug with CPU NUMA node memory allocation
- Topology output now includes NUMA node information
- Support for NUMA nodes with no CPU cores (e.g., CXL memory)
- The SWEEP_SRC_IS_EXE environment variable was removed
- There are new environment variables for sweep based presets:
  - SWEEP_XGMI_MIN: The minimum number of XGMI hops for transfers
  - SWEEP_XGMI_MAX: The maximum number of XGMI hops for transfers
  - SWEEP_SEED: Uses a random seed
  - SWEEP_RAND_BYTES: Uses a random amount of bytes (up to pre-specified N) for each transfer
- CSV output for sweep now includes an environment variables section followed by output
- CSV output no longer lists environment variable parameters in columns
- We changed the default number of warmup iterations from 3 to 1
- Split CSV output of link type to ExeToSrcLinkType and ExeToDstLinkType
- There are new preset modes stress-test benchmarks: sweep and randomsweep
  - sweep iterates over all possible sets of transfers to test
  - randomsweep iterates over random sets of transfers
  - New sweep-only environment variables can modify sweep
    - SWEEP_SRC: String containing only "B","C","F", or "G" that defines possible source memory types
    - SWEEP_EXE: String containing only "C" or "G" that defines possible executors
    - SWEEP_DST: String containing only "B","C","F", or "G" that defines possible destination memory types
    - SWEEP_SRC_IS_EXE: Restrict the executor to be the same as the source, if non-zero
    - SWEEP_MIN: Minimum number of parallel transfers to test
    - SWEEP_MAX: Maximum number of parallel transfers to test
    - SWEEP_COUNT: Maximum number of tests to run
    - SWEEP_TIME_LIMIT: Maximum number of seconds to run tests
- New environment variables to restrict number of available devices to test on (primarily for sweep runs)
  - NUM_CPU_DEVICES: Number of CPU devices
  - NUM_GPU_DEVICES: Number of GPU devices
- Fixed timing display for CPU executors when using single-stream mode
- Setting NUM_ITERATIONS to a negative number indicates a run of -NUM_ITERATIONS seconds per test
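
The sign convention for NUM_ITERATIONS can be sketched in Python as follows. This is an illustrative model, not TransferBench's loop: `run_test` and its `clock` parameter are hypothetical, and whether the real tool checks the deadline before or after each iteration is an assumption here.

```python
import time


def run_test(transfer, num_iterations, clock=time.monotonic):
    """Run transfer() a fixed number of times, or for a fixed duration.

    A non-negative num_iterations runs exactly that many iterations.
    A negative value runs for -num_iterations seconds (deadline checked
    before each iteration, which is an assumption of this sketch).
    Returns the number of iterations completed.
    """
    completed = 0
    if num_iterations < 0:
        deadline = clock() + (-num_iterations)
        while clock() < deadline:
            transfer()
            completed += 1
    else:
        for _ in range(num_iterations):
            transfer()
            completed += 1
    return completed
```

For instance, `run_test(do_transfer, -5)` would keep calling a user-supplied `do_transfer` for roughly five seconds, while `run_test(do_transfer, 10)` would call it exactly ten times.
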
- Copies are now referred to as 'transfers' instead of 'links'
- Reordered how environment variables are displayed (alphabetically now)
- Combined timing is now always on for kernel-based GPU copies; the COMBINED_TIMING environment variable has been removed
- Single sync is no longer supported for facility variable iterations; the USE_SINGLE_SYNC environment variable has been removed
- Added the USE_SINGLE_STREAM feature
  - All Links that run on the same GPU device are run with a single kernel launch on a single stream
  - This doesn't work with USE_HIP_CALL, and it forces USE_SINGLE_SYNC to collect timings
- Added the ability to request coherent or fine-grained host memory ('B')
- Separated the TransferBench repository from the RCCL repository
- Peer-to-peer benchmark mode now works with OUTPUT_TO_CSV
- Topology display now works with OUTPUT_TO_CSV
- Moved the documentation about the config file into example.cfg
- Removed config file generation
- Removed the 'show pointer address' (SHOW_ADDR) environment variable