[ Tensor ] Apply SIMD in matrix transpose fp32 @open sesame 12/18 10:16 #2832

skykongkong8 · 2024-12-18T00:22:45Z

This PR adds a NEON SIMD kernel for fp32 matrix transpose.
The table below was measured on a Galaxy S24U, TC = 100.
Note that this kernel is more effective for sufficiently large matrices, but it was still faster than the previous implementation for every size measured.
I added a unit test that validates this function with a simple idea, (A.T.T = A); please suggest a more efficient approach if there is one.
This will immediately affect fp32 BCQ Tensor usage.

| dim | prev | neon |
|---|---|---|
| 768x768 | 1.9 ms | 1.6 ~ 1.0 ms |
| 1440x1440 | 2.9 ms | 2.29 ms |
| 1920x1560 | 4.2 ~ 3.6 ms | 3.36 ~ 2.67 ms |
| 1560x2048 | 7.13 ~ 6.97 ms | 3.57 ~ 3.2 ms |
| 512x2048 | 2.75 ms | 1.80 ~ 1.7 ms |

Self evaluation:

Build test: [x]Passed [ ]Failed [ ]Skipped
Run test: [x]Passed [ ]Failed [ ]Skipped

- Implement NEON SIMD kernel for matrix transpose for fp32 datatype
- Connect such kernel with current function template
- This expands SIMD coverage for matrix transpose datatype : fp16 only -> fp16, fp32

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
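For readers unfamiliar with this kind of kernel, here is a minimal sketch of how a tiled fp32 transpose with a NEON 4x4 tile and a scalar fallback could look. It is not the code from this PR: all function names (`transpose_scalar`, `transpose_tile_4x4`, `transpose_f32_sketch`) are hypothetical, the matrices are assumed to be row-major, and the 4x4 tile size is an assumption chosen because a 128-bit NEON register holds four floats.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical names throughout; this is not the nntrainer implementation.

// Scalar transpose of an M x N row-major block: dst[j][i] = src[i][j].
static void transpose_scalar(std::size_t M, std::size_t N, const float *src,
                             std::size_t ld_src, float *dst,
                             std::size_t ld_dst) {
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j)
      dst[j * ld_dst + i] = src[i * ld_src + j];
}

#if defined(__ARM_NEON)
// Transpose one 4x4 fp32 tile held in four NEON registers.
static void transpose_tile_4x4(const float *src, std::size_t ld_src,
                               float *dst, std::size_t ld_dst) {
  float32x4_t r0 = vld1q_f32(src + 0 * ld_src); // a0 a1 a2 a3
  float32x4_t r1 = vld1q_f32(src + 1 * ld_src); // b0 b1 b2 b3
  float32x4_t r2 = vld1q_f32(src + 2 * ld_src); // c0 c1 c2 c3
  float32x4_t r3 = vld1q_f32(src + 3 * ld_src); // d0 d1 d2 d3
  float32x4x2_t t01 = vtrnq_f32(r0, r1); // {a0 b0 a2 b2}, {a1 b1 a3 b3}
  float32x4x2_t t23 = vtrnq_f32(r2, r3); // {c0 d0 c2 d2}, {c1 d1 c3 d3}
  vst1q_f32(dst + 0 * ld_dst,
            vcombine_f32(vget_low_f32(t01.val[0]), vget_low_f32(t23.val[0])));
  vst1q_f32(dst + 1 * ld_dst,
            vcombine_f32(vget_low_f32(t01.val[1]), vget_low_f32(t23.val[1])));
  vst1q_f32(dst + 2 * ld_dst,
            vcombine_f32(vget_high_f32(t01.val[0]), vget_high_f32(t23.val[0])));
  vst1q_f32(dst + 3 * ld_dst,
            vcombine_f32(vget_high_f32(t01.val[1]), vget_high_f32(t23.val[1])));
}
#endif

// Tiled transpose: full 4x4 tiles go through NEON when it is available,
// the right and bottom edges are handled by the scalar loop.
void transpose_f32_sketch(std::size_t M, std::size_t N, const float *src,
                          std::size_t ld_src, float *dst, std::size_t ld_dst) {
  assert(ld_src >= N && ld_dst >= M);
  const std::size_t M4 = M - M % 4, N4 = N - N % 4;
  for (std::size_t i = 0; i < M4; i += 4) {
    for (std::size_t j = 0; j < N4; j += 4) {
#if defined(__ARM_NEON)
      transpose_tile_4x4(src + i * ld_src + j, ld_src, dst + j * ld_dst + i,
                         ld_dst);
#else
      transpose_scalar(4, 4, src + i * ld_src + j, ld_src,
                       dst + j * ld_dst + i, ld_dst);
#endif
    }
  }
  // Right edge: columns N4..N of the first M4 rows.
  if (N4 < N)
    transpose_scalar(M4, N - N4, src + N4, ld_src, dst + N4 * ld_dst, ld_dst);
  // Bottom edge: rows M4..M across all columns.
  if (M4 < M)
    transpose_scalar(M - M4, N, src + M4 * ld_src, ld_src, dst + M4, ld_dst);
}
```

On a non-ARM build the same entry point compiles to the scalar path only, which mirrors the `#else` fallback branch discussed in the review below.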

- With NEON support (arm), apply matrix transpose with SIMD at BLAS level.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

…ber function

- For fp32, channel-first, 0:2:1 transpose case, use transpose from blas interface instead of current loop implementation.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>

- Verify transpose function with A.T.T = A

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
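The A.T.T = A check can be sketched roughly as follows. This is a standalone illustration, not the unit test added in the PR: `transpose_ref`, `roundtrip_is_identity` and `matches_index_definition` are hypothetical helpers working on row-major `std::vector<float>` buffers. Since a round trip alone would also pass for a routine that merely copies its input, the sketch adds a direct comparison against the index definition `out[j][i] = in[i][j]`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference transpose (row-major): out[j][i] = in[i][j].
std::vector<float> transpose_ref(const std::vector<float> &in,
                                 std::size_t rows, std::size_t cols) {
  assert(in.size() == rows * cols);
  std::vector<float> out(in.size());
  for (std::size_t i = 0; i < rows; ++i)
    for (std::size_t j = 0; j < cols; ++j)
      out[j * rows + i] = in[i * cols + j];
  return out;
}

// Round-trip property: transposing twice must reproduce the input exactly,
// because a transpose only moves elements and performs no arithmetic.
bool roundtrip_is_identity(std::size_t rows, std::size_t cols) {
  std::vector<float> a(rows * cols);
  for (std::size_t k = 0; k < a.size(); ++k)
    a[k] = 0.5f * static_cast<float>(k);
  std::vector<float> at = transpose_ref(a, rows, cols);   // cols x rows
  std::vector<float> att = transpose_ref(at, cols, rows); // rows x cols
  return att == a;
}

// Direct check against the definition, which also catches a routine that
// copies the data without reordering it.
bool matches_index_definition(std::size_t rows, std::size_t cols) {
  std::vector<float> a(rows * cols);
  for (std::size_t k = 0; k < a.size(); ++k)
    a[k] = static_cast<float>(k);
  std::vector<float> at = transpose_ref(a, rows, cols);
  for (std::size_t i = 0; i < rows; ++i)
    for (std::size_t j = 0; j < cols; ++j)
      if (at[j * rows + i] != a[i * cols + j])
        return false;
  return true;
}
```

In a real test the reference routine would be replaced by the tensor transpose under test, and non-square shapes are worth including because they exercise the edge handling of a tiled kernel.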

myungjoo · 2024-12-20T08:56:11Z

nntrainer/tensor/blas_interface.cpp

+#else
+  transpose_fallback<float>(M, N, src, ld_src, dst, ld_dst);
+#endif
+}


Someday, all such #ifdef blocks (NEON, FP16, ...) need to be migrated to header files (preferably centralized into a single header), and the code that depends on them needs to be separated into a class or a file (e.g., defined in blas_neon.cpp), with the decision whether to use the functions/sub-classes in that file made in a header and the build script/options.

But for today, let's move on.
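One possible shape of that separation, shown as a single-file sketch so it can be compiled on its own. Everything here is an assumption for illustration: the macro `SKETCH_USE_NEON`, the namespaces `fallback`, `neon` and `backend`, and the file names in the comments are hypothetical and are not taken from nntrainer or from #2549.

```cpp
#include <cassert>
#include <cstddef>

// --- would live in one central header (hypothetical: backend_config.h) ---
// The only place where the platform macro is inspected.
#if defined(__ARM_NEON)
#define SKETCH_USE_NEON 1
#else
#define SKETCH_USE_NEON 0
#endif

// --- would live in a fallback source file, always compiled ---
namespace fallback {
inline void transpose(std::size_t M, std::size_t N, const float *src,
                      std::size_t ld_src, float *dst, std::size_t ld_dst) {
  assert(ld_src >= N && ld_dst >= M);
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j)
      dst[j * ld_dst + i] = src[i * ld_src + j];
}
} // namespace fallback

// --- would live in blas_neon.cpp, added to the build only for NEON targets ---
#if SKETCH_USE_NEON
namespace neon {
inline void transpose(std::size_t M, std::size_t N, const float *src,
                      std::size_t ld_src, float *dst, std::size_t ld_dst) {
  // Placeholder body: a real file would hold the NEON kernel here.
  fallback::transpose(M, N, src, ld_src, dst, ld_dst);
}
} // namespace neon
#endif

// --- selection happens once, in a header; callers never write #ifdef ---
namespace backend {
#if SKETCH_USE_NEON
using neon::transpose;
#else
using fallback::transpose;
#endif
} // namespace backend
```

Call sites would then use `backend::transpose(M, N, src, ld_src, dst, ld_dst)` unconditionally, and the build script decides which source files are compiled.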

This seems to be related to #2549.
I am aware of it. I think we urgently need to discuss when to apply it.

In fact, one of the biggest problems with applying #2549 is that every Android makefile and Tizen spec file in the current nntr includes blas_interface.h. Since that PR replaces blas_interface with a single cpu_backend.h, it affects quite a lot of files...

EunjuYang

LGTM.
Combining transpose with dot product, e.g., ($$(W \cdot I)^T = I^T \cdot W^T$$) can be another option for transpose unittest. Anyway transpose of transpose seems good :)
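A rough sketch of such a check, assuming row-major matrices and naive reference routines. The helpers `transpose_rm`, `matmul_rm` and `product_transpose_identity_holds` are hypothetical names for illustration and do not correspond to nntrainer APIs; a tolerance is used because an optimized dot product may sum in a different order than the reference.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Mat = std::vector<float>; // row-major storage

// Hypothetical reference transpose of a rows x cols matrix.
Mat transpose_rm(const Mat &a, std::size_t rows, std::size_t cols) {
  assert(a.size() == rows * cols);
  Mat out(a.size());
  for (std::size_t i = 0; i < rows; ++i)
    for (std::size_t j = 0; j < cols; ++j)
      out[j * rows + i] = a[i * cols + j];
  return out;
}

// Hypothetical naive product: (m x k) * (k x n) -> (m x n).
Mat matmul_rm(const Mat &a, const Mat &b, std::size_t m, std::size_t k,
              std::size_t n) {
  assert(a.size() == m * k && b.size() == k * n);
  Mat c(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t p = 0; p < k; ++p)
      for (std::size_t j = 0; j < n; ++j)
        c[i * n + j] += a[i * k + p] * b[p * n + j];
  return c;
}

// Checks (W * I)^T == I^T * W^T element-wise within a tolerance.
bool product_transpose_identity_holds(std::size_t m, std::size_t k,
                                      std::size_t n, float tol) {
  Mat W(m * k), I(k * n);
  for (std::size_t x = 0; x < W.size(); ++x)
    W[x] = static_cast<float>(x % 7) - 3.0f;
  for (std::size_t x = 0; x < I.size(); ++x)
    I[x] = static_cast<float>(x % 5) - 2.0f;
  Mat lhs = transpose_rm(matmul_rm(W, I, m, k, n), m, n); // n x m
  Mat rhs = matmul_rm(transpose_rm(I, k, n), transpose_rm(W, m, k), n, k, m);
  for (std::size_t x = 0; x < lhs.size(); ++x)
    if (std::fabs(lhs[x] - rhs[x]) > tol)
      return false;
  return true;
}
```

Unlike the round-trip test, this identity fails if the transpose leaves the data in its original order, so the two checks complement each other.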

skykongkong8 · 2024-12-24T07:16:27Z

> LGTM. Combining transpose with dot product, e.g., ($$(W \cdot I)^T = I^T \cdot W^T$$) can be another option for transpose unittest. Anyway transpose of transpose seems good :)

Nice approach! Gotta try that one too

DonghakPark

LGTM!

skykongkong8 added 4 commits December 18, 2024 09:05

[ unittest ] Add bigger test case for tensor matrix transpose case

062906a


skykongkong8 requested review from myungjoo, jijoongmoon, again4you, jaeyun-jung, leemgs, wooksong, gichan-jang, anyj0527, lhs8928, songgot, jihochu, DonghakPark, SeoHyungjun, baek2sm, djeong20 and EunjuYang as code owners December 18, 2024 00:22

skykongkong8 changed the title ~~Pr/transpose/simd/fp32~~ [ Tensor ] Apply SIMD in matrix transpose fp32 Dec 18, 2024

github-actions bot added the Need Review label Dec 18, 2024

skykongkong8 changed the title ~~[ Tensor ] Apply SIMD in matrix transpose fp32~~ [ Tensor ] Apply SIMD in matrix transpose fp32 @open sesame 12/18 10:16 Dec 18, 2024

myungjoo reviewed Dec 20, 2024


myungjoo approved these changes Dec 20, 2024


EunjuYang approved these changes Dec 24, 2024


DonghakPark approved these changes Jan 7, 2025


github-actions bot added PR/READY2MERGE and removed Need Review labels Jan 7, 2025

skykongkong8 self-assigned this Jan 8, 2025
