RotaryEmbedding Contrib OP #3695

TedThemistokleous · 2024-12-09T15:06:09Z

Add the Contrib OP for RotaryEmbedding which is a Microsoft Contrib OP

Able to reuse the GPU kernel we have in GroupQueryAttention, then use a new parser to handle this correctly.

codecov · 2024-12-09T15:20:53Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.23%. Comparing base (4b15b6c) to head (fa58c0d).
Report is 4 commits behind head on develop.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #3695   +/-   ##
========================================
  Coverage    92.23%   92.23%           
========================================
  Files          514      514           
  Lines        21746    21746           
========================================
  Hits         20057    20057           
  Misses        1689     1689


pfultz2 · 2024-12-09T18:49:17Z

We really should remove this GPU kernel. It looks like this can be implemented with the operators we already have.

TedThemistokleous · 2024-12-10T15:49:53Z

So don't reuse what we've done here?

pfultz2 · 2024-12-12T18:46:48Z

> So don't reuse what we've done here?

How difficult is it to just implement this for the RotaryEmbedding ONNX parser (and leave GQA alone for now)?

TedThemistokleous · 2024-12-12T19:13:35Z

I mean the parser is the easy part; it's how we want to compute it. Hold on, let me push.

TedThemistokleous · 2024-12-12T19:17:31Z

Parser isn't hard. I've just been going over GQA and cutting it down to reuse some of the initial work for an op.

pfultz2 · 2024-12-12T19:28:07Z

> Parser isn't hard. I've just been going over GQA and cutting it down to reuse some of the initial work for an op.

I meant how hard is it to implement using the operators we already have. This way we don't need to create a ref operator (which requires a lot more work to verify, test, and lower).

TedThemistokleous · 2024-12-12T19:37:07Z

> > Parser isn't hard. I've just been going over GQA and cutting it down to reuse some of the initial work for an op.

> I meant how hard is it to implement using the operators we already have. This way we don't need to create a ref operator (which requires a lot more work to verify, test, and lower).

Not too difficult, I believe. There's a bunch of indexing and mods done. I can take this approach instead, since I've just been reading/cutting out things from GQA.

Still need to generate the rotation matrix for the operator based on input args.
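For readers following the thread, here is a minimal sketch of what "implement it with the operators we already have" could look like. It is plain Python, not MIGraphX code, and it is not taken from this PR. It assumes the non-interleaved ("rotate half") layout and a frequency base of 10000; the function names, argument names, and shapes are hypothetical and chosen only for illustration. Each step maps onto a generic primitive (gather, slice, mul, add/sub, concat), which is the point of the suggestion: no dedicated kernel is needed.

```python
import math


def build_cos_sin_cache(max_positions, head_dim, base=10000.0):
    """Precompute cos/sin tables of shape [max_positions][head_dim // 2].

    The base of 10000 is an assumption, not something stated in the PR.
    """
    half = head_dim // 2
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(half)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in range(max_positions)]
    sin = [[math.sin(p * f) for f in inv_freq] for p in range(max_positions)]
    return cos, sin


def rotary_embedding(x, position_ids, cos_cache, sin_cache):
    """Rotate each row of x ([seq][head_dim]) by the angle for its position.

    Non-interleaved layout: element i is paired with element i + head_dim // 2.
    """
    out = []
    for row, pos in zip(x, position_ids):
        half = len(row) // 2
        x1, x2 = row[:half], row[half:]             # slice
        cos, sin = cos_cache[pos], sin_cache[pos]   # gather by position id
        # Elementwise mul and add/sub only: a 2-D rotation of each (x1, x2) pair.
        first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
        second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
        out.append(first + second)                  # concat
    return out


cos_cache, sin_cache = build_cos_sin_cache(max_positions=8, head_dim=4)
rotated = rotary_embedding([[1.0, 2.0, 3.0, 4.0]], [0], cos_cache, sin_cache)
# Position 0 has cos = 1 and sin = 0, so the row comes back unchanged.
```

The "rotation matrix" mentioned above corresponds here to the two cached tables: rather than materializing a full matrix per position, the sketch stores only the cos and sin values and applies them elementwise.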

migraphx-bot · 2024-12-20T23:37:56Z

Test	Batch	Rate new 35c78f	Rate old f56b1b	Diff	Compare
torchvision-resnet50	64	3,253.12	3,255.67	-0.08%	✅
torchvision-resnet50_fp16	64	6,989.78	6,983.58	0.09%	✅
torchvision-densenet121	32	2,433.52	2,431.65	0.08%	✅
torchvision-densenet121_fp16	32	4,048.27	4,074.03	-0.63%	✅
torchvision-inceptionv3	32	1,626.28	1,628.91	-0.16%	✅
torchvision-inceptionv3_fp16	32	2,744.28	2,746.14	-0.07%	✅
cadene-inceptionv4	16	764.67	764.54	0.02%	✅
cadene-resnext64x4	16	812.21	813.45	-0.15%	✅
slim-mobilenet	64	7,460.21	7,469.86	-0.13%	✅
slim-nasnetalarge	64	208.96	209.05	-0.04%	✅
slim-resnet50v2	64	3,438.88	3,440.80	-0.06%	✅
bert-mrpc-onnx	8	1,146.01	1,145.22	0.07%	✅
bert-mrpc-tf	1	469.92	476.55	-1.39%	✅
pytorch-examples-wlang-gru	1	423.72	422.09	0.39%	✅
pytorch-examples-wlang-lstm	1	399.50	394.94	1.15%	✅
torchvision-resnet50_1	1	776.22	769.40	0.89%	✅
cadene-dpn92_1	1	405.06	398.97	1.53%	✅
cadene-resnext101_1	1	382.37	383.77	-0.37%	✅
onnx-taau-downsample	1	346.26	345.22	0.30%	✅
dlrm-criteoterabyte	1	33.32	33.32	-0.02%	✅
dlrm-criteoterabyte_fp16	1	52.73	52.72	0.03%	✅
agentmodel	1	8,357.78	8,109.71	3.06%	🔆
unet_fp16	2	58.70	58.87	-0.28%	✅
resnet50v1_fp16	1	941.11	930.54	1.13%	✅
resnet50v1_int8	1	1,034.47	1,002.61	3.18%	🔆
bert_base_cased_fp16	64	1,170.71	1,168.63	0.18%	✅
bert_large_uncased_fp16	32	362.86	363.25	-0.11%	✅
bert_large_fp16	1	200.07	198.22	0.93%	✅
distilgpt2_fp16	16	2,200.66	2,197.80	0.13%	✅
yolov5s	1	525.69	532.80	-1.33%	✅
tinyllama	1	43.64	43.43	0.50%	✅
vicuna-fastchat	1	172.53	174.20	-0.96%	✅
whisper-tiny-encoder	1	417.30	418.04	-0.18%	✅
whisper-tiny-decoder	1	427.44	433.15	-1.32%	✅

Check results before merge 🔆

migraphx-bot · 2024-12-20T23:37:58Z

✅ bert-mrpc-onnx: PASSED: MIGraphX meets tolerance

✅ bert-mrpc-tf: PASSED: MIGraphX meets tolerance

✅ pytorch-examples-wlang-gru: PASSED: MIGraphX meets tolerance

✅ pytorch-examples-wlang-lstm: PASSED: MIGraphX meets tolerance

✅ torchvision-resnet50_1: PASSED: MIGraphX meets tolerance

✅ cadene-dpn92_1: PASSED: MIGraphX meets tolerance

✅ cadene-resnext101_1: PASSED: MIGraphX meets tolerance

✅ dlrm-criteoterabyte: PASSED: MIGraphX meets tolerance

✅ agentmodel: PASSED: MIGraphX meets tolerance

✅ unet: PASSED: MIGraphX meets tolerance

✅ resnet50v1: PASSED: MIGraphX meets tolerance

✅ bert_base_cased_fp16: PASSED: MIGraphX meets tolerance

🔴 bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output

✅ bert_large: PASSED: MIGraphX meets tolerance

✅ yolov5s: PASSED: MIGraphX meets tolerance

✅ tinyllama: PASSED: MIGraphX meets tolerance

✅ vicuna-fastchat: PASSED: MIGraphX meets tolerance

✅ whisper-tiny-encoder: PASSED: MIGraphX meets tolerance

✅ whisper-tiny-decoder: PASSED: MIGraphX meets tolerance

✅ distilgpt2_fp16: PASSED: MIGraphX meets tolerance

initial changes to lowering to reuse rotary embedding kernel for op

fa58c0d

TedThemistokleous added the roadmap (Tasks to finish for a release) and Onnx Operators (Adding or modifying an Onnx Operator in the MIGraphX codebase) labels Dec 9, 2024

TedThemistokleous self-assigned this Dec 9, 2024

TedThemistokleous changed the title ~~initial changes to lowering to reuse rotary embedding kernel for op~~ RotaryEmbedding Contrib OP Dec 9, 2024

Add parser for rotary embedding and cutdown GQA op as base for rotary…

58e0cb3

… op.

TedThemistokleous added 4 commits December 13, 2024 00:14

Add parsing of input args and params before handling compute site

d2ca1c8

Remove changes for lowering custom kernel for now

598086a

Cleanup before adding compute in migraphx ops

d2e6799

Backup for now on rotary embedding.

35c78f2

Still need to generate rotation matrix for operator based on input args

TedThemistokleous requested a review from pfultz2 December 20, 2024 21:17
