From bf8540f482f453f4fa04ac750381135e99815bb1 Mon Sep 17 00:00:00 2001
From: Kyle Wang
Date: Mon, 25 Nov 2024 23:14:27 +0800
Subject: [PATCH] Add a DPP intro in amdgpu_kernel_optimization_guide.md (#598)

---
 docs/amdgpu_kernel_optimization_guide.md | 123 ++++++++++++++++++++++-
 1 file changed, 122 insertions(+), 1 deletion(-)

diff --git a/docs/amdgpu_kernel_optimization_guide.md b/docs/amdgpu_kernel_optimization_guide.md
index 09c5b59f9..91b7f1385 100644
--- a/docs/amdgpu_kernel_optimization_guide.md
+++ b/docs/amdgpu_kernel_optimization_guide.md
@@ -4,7 +4,7 @@

 Author: Jakub Kuderski @kuhar

 Date: 2024-06-24

-Last Update: 2024-08-22
+Last Update: 2024-11-22

 ## Introduction

@@ -293,3 +293,124 @@ forms a *clause* that translates to a single data fabric transaction.

> [!TIP]
> For allocations of 4 GB or less, you can implement predicated loads using the
> `buffer` instructions.

## Data-Parallel Primitives and Warp-level Reduction

The most straightforward way to share data across lanes is through LDS: some
lanes write data to an LDS location and other lanes read it back. In addition,
there are several instructions that can share data across lanes within a
wavefront/warp without a round trip through memory.

Here's a brief introduction to these instructions. Please check out [this
blog](https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/) for
details.

### ds_permute/ds_bpermute

The `ds_permute`/`ds_bpermute` instructions use the LDS hardware for data
sharing but don't actually write to an LDS location. They still need an
`s_waitcnt` instruction to determine when data is returned to the `dest` VGPR.

Example:
```nasm
ds_bpermute_b32 dest, addr, src [offset:addr_offset]
```

### ds_swizzle

Compared to `ds_bpermute`, the `ds_swizzle` instruction doesn't require an
additional VGPR for the offset since the pattern is encoded in the instruction
itself.

`ds_swizzle` is likely to require fewer address-generation instructions than
`ds_bpermute`.

The downsides are:
1. It supports only a limited set of patterns.
2. Like `ds_bpermute`, it requires an `s_waitcnt` to wait for the `dest` VGPR.

Example:
```nasm
ds_swizzle_b32 dest, src offset:ds_pattern
```

### Data-Parallel Primitives (DPP)

DPP is a 32-bit instruction modifier appended to normal VALU instructions. It
allows a VALU instruction to access data in neighboring lanes directly, without
going through the LDS hardware, so `s_waitcnt` instructions are
**not required**.

Unfortunately, like `ds_swizzle`, it supports only a limited set of patterns,
and some instructions can't be modified by DPP.

Example:
```nasm
; Normal VALU instruction.
v_add_f32

; Instruction modified by DPP.
v_add_f32_dpp
```

It's worth mentioning that DPP has different names and syntaxes on different
architectures:
* CDNA: DPP
* RDNA: DPP8/DPP16

For details, please check the [MI300 ISA Reference
Guide](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf)
and the [RDNA3 ISA Reference
Guide](https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf).
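As a concrete illustration of what DPP enables, the blog linked above builds a
full wavefront (wave64) sum reduction out of nothing but DPP-modified adds: no
LDS traffic and no `s_waitcnt`. The sketch below is adapted from that blog; the
register choices are illustrative:

```nasm
; Wave64 sum reduction using only DPP-modified VALU adds (CDNA/GCN patterns).
v_add_f32 v1, v0, v0 row_shr:1                 ; lane i gets v0[i-1] + v0[i]
v_add_f32 v1, v1, v1 row_shr:2                 ; lane i += lane i-2
v_add_f32 v1, v1, v1 row_shr:4 bank_mask:0xe   ; lane i += lane i-4
v_add_f32 v1, v1, v1 row_shr:8 bank_mask:0xc   ; lane i += lane i-8
v_add_f32 v1, v1, v1 row_bcast:15 row_mask:0xa ; fold row 0 into row 1, row 2 into row 3
v_add_f32 v1, v1, v1 row_bcast:31 row_mask:0xc ; fold the lower 32 lanes into the upper 32
v_readlane_b32 s0, v1, 63                      ; lane 63 holds the wavefront total
```

Note that `row_bcast` is a CDNA/GCN pattern; on RDNA, the cross-row steps are
typically expressed with permlane instructions instead.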
### How to use them in MLIR

Each instruction has a corresponding op in MLIR (except for `ds_permute`, which
is not implemented at the time of writing):
* `ds_bpermute`: `rocdl.ds_bpermute`
* `ds_swizzle`: `rocdl.ds_swizzle`
* DPP: `rocdl.update.dpp`, `amdgpu.dpp` (a thin wrapper around
  `rocdl.update.dpp` with a more user-friendly interface, e.g., enums instead
  of magic numbers)

The first two are straightforward, while DPP works differently.

Since DPP is an instruction modifier rather than an instruction itself, there
is a tremendous number of possible combinations of VALU instructions and DPP
modifiers. To keep this manageable, `rocdl.update.dpp` and `amdgpu.dpp` are
designed as wrappers around the `v_mov_b32_dpp` instruction, relying on the
LLVM compiler to fuse the move into the subsequent VALU instruction **on a
best-effort basis**.

For example, `v_mov_b32_dpp` + `v_add_f32_e32` might be fused into
`v_add_f32_dpp` (see the sketch at the end of this guide).

There are plenty of constraints that can stop an instruction from being fused.
For example, if either the `bank_mask` or the `row_mask` is not `0xf`, the
instructions can't be fused. You can check the
[GCNDPPCombine::combineDPPMov](https://github.com/llvm/llvm-project/blob/ab51eccf88f5321e7c60591c5546b254b6afab99/llvm/lib/Target/AMDGPU/GCNDPPCombine.cpp#L522)
function to see how it works.

### Comparison

To summarize, there is no free lunch: an instruction's expressivity comes at
the expense of performance.

The relative performance of the cross-lane instructions is as follows:

DPP > `ds_swizzle` >= `ds_permute` > `ds_bpermute`

while the generality ranking is the reverse:

DPP < `ds_swizzle` < `ds_permute` < `ds_bpermute`

The table below presents approximate instruction latencies, collected
experimentally on a fused softmax kernel with
[rocprofv2](https://github.com/ROCm/rocprofiler?tab=readme-ov-file#plugin-support)
on an MI300 GPU:

| Instructions           | MLIR Op                      | Hardware     | Latency (cycles) |
| ---------------------- | ---------------------------- | ------------ | ---------------- |
| ds_permute/ds_bpermute | rocdl.ds_bpermute            | LDS hardware | ~50*             |
| ds_swizzle             | rocdl.ds_swizzle             | LDS hardware | ~50*             |
| DPP                    | rocdl.update.dpp, amdgpu.dpp | VALU         | 4-12             |

*: For `ds_permute`/`ds_bpermute` and `ds_swizzle`, the latency includes both
the instruction itself and its corresponding `s_waitcnt` instruction.
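To make the fusion mentioned above concrete, here is a hedged before/after
sketch. The registers and the exact modifier set are illustrative, and the real
pass operates on machine IR rather than textual assembly:

```nasm
; Before fusion: what rocdl.update.dpp / amdgpu.dpp lower to, i.e., a DPP
; move followed by a plain VALU add.
v_mov_b32_dpp v1, v0 row_shr:1 row_mask:0xf bank_mask:0xf
v_add_f32_e32 v2, v1, v2

; After GCNDPPCombine: both masks are 0xf, so the move can be folded into
; its consumer, yielding a single DPP-modified add.
v_add_f32_dpp v2, v0, v2 row_shr:1 row_mask:0xf bank_mask:0xf
```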