Fixing topology detection memory access and CU masking for multi XCD GPUs (#116)

* Fixing potential out-of-bounds write during topology detection
* Fixing CU_MASK for multi-XCD GPUs
* Adding sub-iterations via NUM_SUBITERATIONS
* Adding support for variable subexecutor Transfers
* Adding healthcheck preset
gilbertlee-amd authored Aug 15, 2024
1 parent ae843a6 commit b30aefb
Showing 6 changed files with 639 additions and 379 deletions.
23 changes: 23 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,29 @@
Documentation for TransferBench is available at
[https://rocm.docs.amd.com/projects/TransferBench](https://rocm.docs.amd.com/projects/TransferBench).

## v1.51

### Modified
- CSV output has been modified slightly to match normal terminal output
- Output for non single stream mode has been changed to match single stream mode (results per Executor)

### Added
- Support for sub-iterations via NUM_SUBITERATIONS. This allows for additional looping during an iteration.
  If set to 0, it should loop indefinitely (which may be useful for debugging)
- Support for variable number of subexecutors (currently for GPU-GFX executor only). Setting subExecutors to
0 will run over a range of CUs to use, and report only the results of the best one found. This can be tuned
for performance by setting the MIN_VAR_SUBEXEC and MAX_VAR_SUBEXEC environment variables to narrow the
search space. The number of CUs used will be identical for all variable subExecutor transfers
- Experimental new "healthcheck" preset config, which currently supports only the MI300 series. This preset runs
  CPU-to-GPU bandwidth tests and all-to-all XGMI bandwidth tests and compares the results against expected values.
  Pass criteria limits can be modified (due to platform differences) via the environment variables
  LIMIT_UDIR (unidirectional), LIMIT_BDIR (bidirectional), and LIMIT_A2A (per GPU-GPU link bandwidth)
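
The variable-subexecutor search described above can be sketched roughly as follows. This is a minimal illustration, not TransferBench's implementation: `EnvOrDefault`, `SweepResult`, and `FindBestSubExecCount` are hypothetical names, and `measure` is a stand-in for actually running the Transfers with a given CU count.

```cpp
#include <cstdlib>
#include <functional>

// Hypothetical helper: read an integer environment variable,
// falling back to a caller-supplied default when it is unset.
int EnvOrDefault(const char* name, int fallback) {
  const char* value = std::getenv(name);
  return value ? std::atoi(value) : fallback;
}

// Hypothetical result type for the sweep.
struct SweepResult {
  int    numSubExecs;   // CU count that gave the best result (-1 if the range was empty)
  double bandwidthGbs;  // bandwidth measured at that CU count
};

// Try every CU count in [minSubExecs, maxSubExecs] and keep only the best one.
// `measure` is an assumed callback that returns the bandwidth for a CU count.
SweepResult FindBestSubExecCount(int minSubExecs, int maxSubExecs,
                                 const std::function<double(int)>& measure) {
  SweepResult best{-1, 0.0};
  for (int n = minSubExecs; n <= maxSubExecs; ++n) {
    const double bw = measure(n);
    if (best.numSubExecs < 0 || bw > best.bandwidthGbs) {
      best = {n, bw};
    }
  }
  return best;
}
```

A caller might take the bounds from `EnvOrDefault("MIN_VAR_SUBEXEC", lo)` and `EnvOrDefault("MAX_VAR_SUBEXEC", hi)`, where `lo` and `hi` are placeholder defaults chosen by the caller; the defaults TransferBench actually uses are not stated here. Narrowing the range shortens the sweep, since each candidate CU count costs one measurement.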

### Fixed
- Fixed out-of-bounds memory access during topology detection that can happen if the number of
CPUs is less than the number of NUMA domains
- Fixed CU masking functionality on multi-XCD architectures (e.g. MI300)
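
The code change behind the out-of-bounds fix is not shown in this excerpt. As a generic illustration of the failure mode only, the sketch below counts CPUs per NUMA node using a table sized by the NUMA node count rather than the CPU count, and skips any node ID that falls outside the table. `CountCpusPerNode` and its `cpuToNode` input are hypothetical and not taken from TransferBench.

```cpp
#include <cstddef>
#include <vector>

// Count how many CPUs belong to each NUMA node.
// cpuToNode[i] is the NUMA node ID reported for CPU i (hypothetical input).
// The table is sized by the NUMA node count, not the CPU count, so a system
// with fewer CPUs than NUMA nodes cannot index past the end of it.
std::vector<int> CountCpusPerNode(const std::vector<int>& cpuToNode, int numNodes) {
  const std::size_t tableSize = numNodes > 0 ? static_cast<std::size_t>(numNodes) : 0;
  std::vector<int> counts(tableSize, 0);
  for (int node : cpuToNode) {
    // Skip IDs outside [0, numNodes) instead of writing out of bounds.
    if (node >= 0 && node < numNodes) {
      ++counts[static_cast<std::size_t>(node)];
    }
  }
  return counts;
}
```

With two CPUs on nodes 0 and 3 of a four-node system, a table sized by the CPU count would have only two slots and the write for node 3 would land outside it; sizing by node count avoids that.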

## v1.50

### Added
5 changes: 3 additions & 2 deletions README.md
@@ -67,8 +67,9 @@ make
* Running TransferBench with no arguments displays usage instructions and detected topology
information
* You can use several preset configurations instead of a configuration file:
- * `a2a` : All-to-all benchmark test
- * `cmdline`: Take in Transfers to run from command-line instead of via file
+ * `a2a` : All-to-all benchmark test
+ * `cmdline` : Take in Transfers to run from command-line instead of via file
+ * `healthcheck` : Simple health check (supported on MI300 series only)
* `p2p` : Peer-to-peer benchmark test
* `pcopy` : Benchmark parallel copies from a single GPU to other GPUs
* `rsweep` : Random sweep across possible sets of transfers