[ Tensor ] Apply SIMD in matrix transpose fp32 #2832
base: main
Conversation
- Implement a NEON SIMD kernel for fp32 matrix transpose
- Connect the kernel to the current function template
- This expands SIMD coverage for matrix transpose from fp16 only to fp16 and fp32

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

- With NEON support (arm), apply SIMD matrix transpose at the BLAS level.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

…ber function
- For the fp32, channel-first, 0:2:1 transpose case, use the transpose from the BLAS interface instead of the current loop implementation.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

- Verify the transpose function with A.T.T = A

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
```cpp
#else
  transpose_fallback<float>(M, N, src, ld_src, dst, ld_dst);
#endif
}
```
Someday, all of these `#ifdef` blocks (NEON, FP16, ...) should be migrated to header files (preferably centralized in a single header), and the code depending on them should be separated into a class or a file (e.g., defined in `blas_neon.cpp`), with the choice of which functions/sub-classes to use determined by a header and the build script/options.
But for today, let's move on.
This seems related to #2549.
I am aware of it. I think we urgently need to discuss when to apply this.
In fact, one of the biggest problems with applying #2549 is that every Android make file / Tizen spec file in the current nntrainer includes `blas_interface.h`. Since that PR replaces `blas_interface` with a single `cpu_backend.h`, it affects quite a lot of files...
LGTM.
Combining transpose with dot product, e.g., (
Nice approach! Gotta try that one too
LGTM!
This PR proposes a NEON SIMD kernel for fp32 matrix transpose.
The table below was measured on a Galaxy S24U with TC = 100.
Note that this kernel is most effective for sufficiently big matrices, but it is still faster than before in all cases.
I added an additional unit test TC that validates this function with a simple idea (A.T.T = A), but please suggest a more efficient way if there is one.
This will instantly impact fp32 BCQ Tensor usage.
Self evaluation: