
[webgpu] Use subgroup for matmulnbits #23224

Open
wants to merge 17 commits into base: main
Conversation


@qjia7 qjia7 commented Dec 30, 2024

Description

This PR uses subgroups to implement matmulnbits when tile_m > 1 on Intel devices.
With this PR, prefill for a 500-token prompt for Phi-3 drops from 8.5 s to 3.5 s on Intel Meteor Lake.
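For context, the computation this shader implements can be sketched in plain Python. This is a hedged reference model of a 4-bit block-quantized matmul (the kind of computation matmulnbits performs), not the actual WebGPU shader; the packing order (low nibble first), the implicit zero point of 8, and the helper names are assumptions for illustration.

```python
# Hedged sketch: pure-Python reference for a 4-bit block-quantized dot product.
# Assumptions (not taken from this PR): block_size = 32, implicit zero point
# of 8, two 4-bit weights packed per byte with the low nibble first.

def dequantize_block(packed_bytes, scale, zero_point=8):
    """Unpack one block of 4-bit weights and dequantize them to floats."""
    out = []
    for byte in packed_bytes:
        out.append(((byte & 0x0F) - zero_point) * scale)  # low nibble
        out.append(((byte >> 4) - zero_point) * scale)    # high nibble
    return out

def matmul_nbits(a_row, packed_col, scales, block_size=32):
    """Dot product of a float row of A with one 4-bit-quantized column of B."""
    acc = 0.0
    k = 0
    for block, scale in zip(packed_col, scales):
        for w in dequantize_block(block, scale):
            acc += a_row[k] * w
            k += 1
    return acc
```

The GPU kernel parallelizes this loop across threads and tiles; the per-block scale is why `block_size == 32` appears in the dispatch condition below.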

qjia7 added 17 commits December 27, 2024 11:42
In this version, local_id.x = 8 and local_id.y = 4.
To load the A data, each thread needs to access memory twice, so the
A tile size is 64 x 4.

To get the correctly shuffled A when accumulating the inter_results,
we have to unconditionally fetch both a_data low and a_data high, then
use select to decide which one the current thread uses.

So this method doubles the shuffle commands.
Use local_id.x = 4, local_id.y = 8 so that we only need to shuffle once
compared with the previous method.
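The "fetch low and high, then select" pattern described above can be simulated on the CPU. This is a hedged illustration of subgroup shuffle semantics (every lane reads another lane's register value) and of why the 8x4 layout doubles the shuffle count; the function names and lane layout are assumptions for illustration, not the PR's WGSL.

```python
# Hedged CPU simulation of WGSL subgroupShuffle with a uniform source lane:
# every lane receives the value held by lane `src_lane`.

def subgroup_shuffle(lane_values, src_lane):
    """All lanes read the register of lane `src_lane`."""
    return [lane_values[src_lane] for _ in lane_values]

# 8x4 layout: each lane's A tile is split into a low and a high half, so one
# accumulation step costs two shuffles plus a per-lane select.
def fetch_a_8x4(a_low, a_high, src_lane, use_high):
    low = subgroup_shuffle(a_low, src_lane)
    high = subgroup_shuffle(a_high, src_lane)
    return [h if use_high[i] else l for i, (l, h) in enumerate(zip(low, high))]

# 4x8 layout: each lane already holds the whole element it needs, so a single
# shuffle suffices.
def fetch_a_4x8(a_data, src_lane):
    return subgroup_shuffle(a_data, src_lane)
```

Counting calls to `subgroup_shuffle` makes the tradeoff concrete: two per step in the 8x4 layout versus one per step in the 4x8 layout.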
This reverts commit 128cd7d.
@qjia7 qjia7 changed the title [Not for Review] [webgpu] Test subgroup for matmulnbits [webgpu] Use subgroup for matmulnbits Jan 2, 2025
@qjia7 qjia7 marked this pull request as ready for review January 2, 2025 04:47

qjia7 commented Jan 2, 2025

@guschmue @fs-eire Please take a look. Currently, this PR is only enabled on Intel devices, since I don't see a perf improvement on the NV device I have at hand; I need some time to investigate the reason.

Please also let me know whether it works on your Xe devices.
If you want to check other GPUs, such as Mac, you can just change

const bool use_subgroup = context.Device().HasFeature(wgpu::FeatureName::Subgroups) && context.AdapterInfo().vendor == std::string_view{"intel"} && components_a == 4 && block_size == 32;

to

const bool use_subgroup = context.Device().HasFeature(wgpu::FeatureName::Subgroups) && components_a == 4 && block_size == 32;

cc @sushraja-msft


guschmue commented Jan 2, 2025

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline


guschmue commented Jan 2, 2025

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline


Azure Pipelines successfully started running 2 pipeline(s).


guschmue commented Jan 2, 2025

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models


guschmue commented Jan 2, 2025

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline


Azure Pipelines successfully started running 4 pipeline(s).


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jan 2, 2025