Releases: ROCm/TransferBench

TransferBench v1.53

11 Nov 06:59
b56d481

Added

  • Added ability to specify NULL as the source or destination memory type for the sweep preset

TransferBench v1.52

09 Oct 16:49
600cf13

Added

  • Added USE_HSA_DMA env var to switch to using hsa_amd_memory_async_copy instead of hipMemcpyAsync for DMA execution
  • Added ability to set USE_GPU_DMA env var for a2a benchmark
  • Added check for large BAR enablement for GPU devices during topology check
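
The environment variables above could be set from a small launcher script. The following is a minimal sketch and not part of TransferBench: the helper name `build_launch` is hypothetical, the value 1 is assumed to enable each variable, combining both variables in one run is an assumption, and the `./TransferBench` path follows the usage lines elsewhere in these notes. It only assembles the command and environment; it does not launch anything.

```python
import os


def build_launch(preset, overrides, base_env=None):
    """Return (argv, env) for a TransferBench preset run.

    Hypothetical helper: it only assembles the pieces and leaves the
    actual launch (for example via subprocess.run) to the caller.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Environment variable values must be strings.
    env.update({name: str(value) for name, value in overrides.items()})
    argv = ["./TransferBench", preset]
    return argv, env


# Assumption: a value of 1 enables each variable.
argv, env = build_launch("a2a", {"USE_GPU_DMA": 1, "USE_HSA_DMA": 1}, base_env={})
print(" ".join(f"{k}={v}" for k, v in sorted(env.items())), " ".join(argv))
```

Passing `base_env=None` would start from the current process environment instead of an empty one.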

Fixed

  • Potential memory leak if HSA reports 0 hops between GPUs and CPUs

TransferBench v1.51

15 Aug 17:46
b30aefb

Modified

  • CSV output has been modified slightly to match normal terminal output
  • Output for non-single-stream mode has been changed to match single-stream mode (results per Executor)

Added

  • Support for sub-iterations via NUM_SUBITERATIONS. This allows for additional looping during an iteration.
    If set to 0, this should loop infinitely (which may be useful for some debug purposes)
  • Support for variable number of subexecutors (currently for GPU-GFX executor only). Setting subExecutors to
    0 will run over a range of CUs to use, and report only the results of the best one found. This can be tuned
    for performance by setting the MIN_VAR_SUBEXEC and MAX_VAR_SUBEXEC environment variables to narrow the
    search space. The number of CUs used will be identical for all variable subExecutor transfers
  • Experimental new "healthcheck" preset config which currently only supports MI300 series. This preset runs
    through CPU to GPU bandwidth tests and all-to-all XGMI bandwidth tests and compares against expected values.
    Pass criteria limits can be modified (due to platform differences) via the environment variables
    LIMIT_UDIR (unidirectional), LIMIT_BDIR (bidirectional), and LIMIT_A2A (per GPU-GPU link bandwidth)
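
The variable subexecutor behavior described above (sweep a range of counts, report only the best) can be illustrated with a short sketch. This is not TransferBench's implementation: `best_subexecutor_count` and the `measure` callback are hypothetical names, the bounds stand in for MIN_VAR_SUBEXEC and MAX_VAR_SUBEXEC, and the tie-breaking rule (keep the smaller count) is an assumption.

```python
def best_subexecutor_count(measure, min_count, max_count):
    """Sweep subexecutor counts and return (best_count, best_bandwidth).

    `measure` is a hypothetical callback that takes a count and returns
    the bandwidth measured with that many subexecutors.
    """
    if min_count < 1 or max_count < min_count:
        raise ValueError("expected 1 <= min_count <= max_count")
    best_count, best_bw = None, float("-inf")
    for count in range(min_count, max_count + 1):
        bw = measure(count)
        if bw > best_bw:  # strict comparison: ties keep the smaller count
            best_count, best_bw = count, bw
    return best_count, best_bw


# Toy model for illustration only: bandwidth saturates at 4 subexecutors.
def fake_measure(count):
    return min(count, 4) * 10.0


print(best_subexecutor_count(fake_measure, 1, 8))
```

Narrowing the range (for example 2 to 3 instead of 1 to 8) reduces the number of measurements, which is the tuning effect the note attributes to the two environment variables.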

Fixed

  • Fixed out-of-bounds memory access during topology detection that can happen if the number of
    CPUs is less than the number of NUMA domains
  • Fixed CU masking functionality on multi-XCD architectures (e.g. MI300)

TransferBench v1.50

03 Apr 16:27
eaf32b4

Added

  • Adding new parallel copy preset benchmark (pcopy)
    • Usage: ./TransferBench pcopy <numBytes=64M> <#CUs=8> <srcGpu=0> <minGpus=1> <maxGpus=#GPU-1>
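
The defaults in the pcopy usage line can be made explicit with a small sketch that builds the argument list. The helper `pcopy_argv` is hypothetical, and it assumes that the arguments are positional in the order shown and that "#GPU-1" means the number of GPUs minus one.

```python
def pcopy_argv(num_gpus, num_bytes="64M", num_cus=8, src_gpu=0,
               min_gpus=1, max_gpus=None):
    """Build an argument list for the pcopy preset (hypothetical helper)."""
    if max_gpus is None:
        # Assumed reading of "#GPU-1" in the usage line.
        max_gpus = num_gpus - 1
    return ["./TransferBench", "pcopy", str(num_bytes), str(num_cus),
            str(src_gpu), str(min_gpus), str(max_gpus)]


# With 8 GPUs and all defaults, maxGpus resolves to 7.
print(" ".join(pcopy_argv(8)))
```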

Fixed

  • Removed non-copy DMA Transfers (these had previously been using hipMemset)
  • Fixed CPU executor when operating on null destination

TransferBench v1.49

02 Apr 22:38
97fbbbb

Fixes

  • Enumerating previously missed DMA engines used only for CPU traffic in topology display

TransferBench v1.48

02 Feb 22:46
aa801b9

Fixes

  • Various fixes for TransferBenchCuda

Additions

  • Support for targeting specific DMA engines via executor subindex (e.g. D0.1)
  • Printing warnings when executors are overcommitted
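
To illustrate the executor subindex notation, the sketch below splits a token such as D0.1 into its parts. It is not TransferBench's parser: `parse_executor` is a hypothetical name, and reading D0.1 as executor type D, index 0, subindex 1 is an assumption based only on the example above. The type letter is not validated against the tool's actual executor types.

```python
import re

# One letter, an index, and an optional ".subindex" suffix.
_EXECUTOR = re.compile(r"([A-Za-z])(\d+)(?:\.(\d+))?")


def parse_executor(token):
    """Split a token like "D0.1" into (type_letter, index, subindex).

    subindex is None when the token has no ".N" suffix.
    """
    match = _EXECUTOR.fullmatch(token)
    if match is None:
        raise ValueError(f"unrecognized executor token: {token!r}")
    letter, index, subindex = match.groups()
    return letter, int(index), None if subindex is None else int(subindex)


print(parse_executor("D0.1"))
print(parse_executor("D0"))
```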

Modifications

  • USE_REMOTE_READ supported for rwrite preset benchmark

TransferBench v1.47

09 Jan 20:52
ceeab46

Fixes

  • Fixing CUDA compilation

TransferBench v1.46

14 Dec 03:54
d5445b9

Fixes

  • Fixing GFX_UNROLL set to 13 (past 8) on gfx906 cards

Modifications

  • GFX_SINGLE_TEAM=1 by default
  • Adding field showing summation of individual Transfer bandwidths for Executors

TransferBench v1.45

05 Dec 06:41
f33c7fd

Additions

  • Adding A2A_MODE to a2a preset (0 = copy, 1 = read-only, 2 = write-only)
  • Adding GFX_UNROLL to modify GFX kernel's unroll factor
  • Adding GFX_WAVE_ORDER to modify order in which wavefronts process data

Modifications

  • Rewrote the GFX reduction kernel to support new wave ordering

TransferBench v1.44

01 Dec 21:00
33a5435

Additions

  • Adding rwrite preset to benchmark remote parallel writes
    • Usage: ./TransferBench rwrite <numBytes=64M> <#CUs=8> <srcGpu=0> <minGpus=1> <maxGpus=3>