-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RotaryEmbedding Contrib OP #3695
base: develop
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #3695 +/- ##
========================================
Coverage 92.23% 92.23%
========================================
Files 514 514
Lines 21746 21746
========================================
Hits 20057 20057
Misses 1689 1689 ☔ View full report in Codecov by Sentry. |
We really should remove this GPU kernel. It looks like this can already be implemented with the operators we have already. |
So don't reuse what we've done here? |
How difficult is it to just implement this for RotaryEmbedding onnx parser(and leave GQA alone for now)? |
I mean the parser is the easy part, its how we want to compute it. Hold on let me push |
Parser isn't hard I've just been going over GQA and cutting it down to reuse some of the initial work for an op. |
I meant how hard it is to implement using the operators we already have. This way we dont need to create a ref operator(which requires a lot more work to verify and test and lower). |
Not too difficult I believe. There' a bunch of indexing and mods done. I can take this approach instead since I've just been reading/cutting out things from GQA |
Still need to genreate rotation matrix for operator based on input args
Check results before merge 🔆 |
🔴bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output |
Add the Contrib OP for RotaryEmbedding which is a Microsoft Contrib OP
Able to reuse the GPU kernel we have in GroupQuerryAttention and then use a new parser to handle this correctly