Add device bridge support for HIP and CUDA. #3

Merged
stellaraccident merged 7 commits into main from eager_hip_cuda on Apr 25, 2024
Conversation

stellaraccident (Collaborator) commented on Apr 23, 2024

  • Adds a custom IREE "_test_add" op which exercises actual code generation.
  • Reworks the device management layer to:
    • Initializes our Device wrapper to maintain the correspondence between the IREE HalDevice and the PyTorch device.
    • Annotates Device with a dlpack_device_type_code.
    • Detects whether the Torch device is real CUDA or AMDGPU/HIP presenting as CUDA, and interfaces with the outside world accordingly (see the sketch after this list).
    • Auto-detects the AMDGPU chip or the CUDA SM version and arranges for the JIT compiler to target it.
    • Dynamically enables custom kernel registration for CUDA if it is available.
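To make the detection concrete, here is a minimal sketch of the idea, not the code in this PR: the `DeviceInfo` dataclass and its field names are illustrative stand-ins, and only public torch APIs are used.

```python
# Sketch only: classify a torch "cuda" device as real CUDA vs. AMDGPU/HIP,
# annotate it with a DLPack device type code, and pick a JIT compile target.
from dataclasses import dataclass
import torch

# DLPack device type codes from dlpack.h.
K_DL_CUDA = 2
K_DL_ROCM = 10

@dataclass
class DeviceInfo:  # hypothetical stand-in for the Device wrapper in this PR
    torch_device: torch.device
    dlpack_device_type_code: int
    compile_target: str  # e.g. "gfx942" or "sm_80"

def classify(torch_device: torch.device) -> DeviceInfo:
    props = torch.cuda.get_device_properties(torch_device.index or 0)
    if torch.version.hip is not None:
        # ROCm build of torch: the "cuda" device is actually an AMDGPU/HIP device.
        # Recent ROCm builds expose the gfx target via gcnArchName (e.g. "gfx90a:sramecc+:xnack-").
        target = getattr(props, "gcnArchName", props.name).split(":")[0]
        return DeviceInfo(torch_device, K_DL_ROCM, target)
    # Real CUDA build: derive the SM version from the compute capability.
    return DeviceInfo(torch_device, K_DL_CUDA, f"sm_{props.major}{props.minor}")

if torch.cuda.is_available():
    print(classify(torch.device("cuda", 0)))
```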

Limitations (for now):

  • IREE's dlpack interop only supports contiguous tensors. Lifting this requires further interfacing with the compiler to specialize on different strided layouts.
  • Device synchronization relies on dlpack's implicit default-stream synchronization. Since we are currently still JIT compiling in synchronous mode, we are skating by on this. Some additional hooks and APIs are needed to place stream events properly and do it for real (a caller-side sketch of working within these two limitations follows this list).
  • ROCm builds of torch do not seem to have an easy way to map a device back to its UUID, and I didn't have a CUDA device handy to test how this is done on CUDA, so both currently rely on enumeration order on multi-device systems. I kept the correspondence logic in a single place in the code and think I know how to fix this, but it will require some poking.
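As a caller-side illustration of the first two limitations (the `run_jit_kernel` callable is hypothetical): force inputs contiguous before the dlpack hand-off and synchronize the current stream while the JIT path is still synchronous.

```python
# Sketch only: prepare tensors for a DLPack-consuming kernel under the
# contiguity and default-stream-synchronization limitations described above.
import torch

def call_via_dlpack(run_jit_kernel, *tensors):
    # DLPack interop here only supports contiguous tensors, so materialize a
    # contiguous copy for any strided/transposed input.
    prepared = [t if t.is_contiguous() else t.contiguous() for t in tensors]
    if prepared and prepared[0].is_cuda:
        # Lean on implicit default-stream synchronization: make sure producer
        # work on the current stream finishes before the kernel reads the buffers.
        torch.cuda.current_stream(prepared[0].device).synchronize()
    return run_jit_kernel(*prepared)
```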

Depends on an IREE bump that includes iree-org/iree#17131 (but this can land without it, since that change only unblocks HIP/CUDA, which were not supported yet anyway).

stellaraccident added a commit to nod-ai/shark-ai that referenced this pull request Apr 25, 2024
* Threads explicit device through models.
* Implements functional InferenceTensor, Theta and Dataset transformations and uses them to implement `to(device=)` (a generic sketch of the idea follows below).
* Adds `--device foo` to the example runner.
* With iree-org/iree-turbine#3 and supporting patches, this allows custom ops and kernels to be used transparently on CUDA/ROCM devices (instead of just CPU).
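A generic sketch of that functional-transformation idea (not the actual InferenceTensor/Theta/Dataset API; the helper names are illustrative): walk a nested mapping of tensors and return a new tree with every leaf moved via `to(device=...)`.

```python
# Sketch only: a functional tree transformation used to implement to(device=).
import torch

def transform_tree(tree, fn):
    # Return a new tree; never mutate the input in place.
    if isinstance(tree, torch.Tensor):
        return fn(tree)
    if isinstance(tree, dict):
        return {k: transform_tree(v, fn) for k, v in tree.items()}
    return tree

def to_device(tree, device):
    return transform_tree(tree, lambda t: t.to(device=device))

theta = {"attn": {"wq": torch.randn(4, 4)}, "norm": {"weight": torch.ones(4)}}
device = "cuda:0" if torch.cuda.is_available() else "cpu"
theta_moved = to_device(theta, device)
```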
stellaraccident merged commit cafc812 into main on Apr 25, 2024
3 checks passed
stellaraccident deleted the eager_hip_cuda branch on April 25, 2024 at 02:36
harsh-nod added a commit to harsh-nod/iree-turbine that referenced this pull request Oct 30, 2024
Signed-off-by: Harsh Menon <[email protected]>