[ Tensor ] Apply SIMD in matrix transpose fp32 @open sesame 12/18 10:16 #2832

Open: wants to merge 4 commits into base: main
Member

@skykongkong8 skykongkong8 commented Dec 18, 2024

This PR proposes a NEON SIMD kernel for fp32 matrix transpose.
The table below was measured on a Galaxy S24U, TC = 100.
Note that this kernel is most effective for sufficiently large matrices, but it is still faster than the previous implementation for every size measured.
I added a unit test TC that validates this function with a simple idea (A.T.T = A); please suggest a better approach if there is a more efficient way.
This immediately affects fp32 BCQ Tensor usage.

| dim | prev | neon |
| --- | --- | --- |
| 768x768 | 1.9 ms | 1.6 ~ 1.0 ms |
| 1440x1440 | 2.9 ms | 2.29 ms |
| 1920x1560 | 4.2 ~ 3.6 ms | 3.36 ~ 2.67 ms |
| 1560x2048 | 7.13 ~ 6.97 ms | 3.57 ~ 3.2 ms |
| 512x2048 | 2.75 ms | 1.80 ~ 1.7 ms |
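
For readers unfamiliar with the technique, here is a minimal sketch of how a tiled fp32 transpose with a 4x4 NEON micro-kernel can be structured. It is not the kernel from this PR: the names `transpose_f32`, `transpose_4x4_neon`, and `transpose_scalar` are hypothetical, and the parameter order only mirrors the `(M, N, src, ld_src, dst, ld_dst)` call visible in the diff. On non-ARM targets the sketch falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Scalar reference: dst[j][i] = src[i][j] for a row-major M x N block.
static void transpose_scalar(std::size_t M, std::size_t N, const float *src,
                             std::size_t ld_src, float *dst,
                             std::size_t ld_dst) {
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j)
      dst[j * ld_dst + i] = src[i * ld_src + j];
}

#if defined(__ARM_NEON)
// Transpose one 4x4 fp32 tile held in four NEON registers (hypothetical helper).
static void transpose_4x4_neon(const float *src, std::size_t ld_src,
                               float *dst, std::size_t ld_dst) {
  float32x4_t r0 = vld1q_f32(src + 0 * ld_src);
  float32x4_t r1 = vld1q_f32(src + 1 * ld_src);
  float32x4_t r2 = vld1q_f32(src + 2 * ld_src);
  float32x4_t r3 = vld1q_f32(src + 3 * ld_src);
  // vtrnq_f32(a, b): val[0] = {a0, b0, a2, b2}, val[1] = {a1, b1, a3, b3}
  float32x4x2_t t01 = vtrnq_f32(r0, r1);
  float32x4x2_t t23 = vtrnq_f32(r2, r3);
  // Stitch 64-bit halves together so each store writes one source column.
  vst1q_f32(dst + 0 * ld_dst,
            vcombine_f32(vget_low_f32(t01.val[0]), vget_low_f32(t23.val[0])));
  vst1q_f32(dst + 1 * ld_dst,
            vcombine_f32(vget_low_f32(t01.val[1]), vget_low_f32(t23.val[1])));
  vst1q_f32(dst + 2 * ld_dst,
            vcombine_f32(vget_high_f32(t01.val[0]), vget_high_f32(t23.val[0])));
  vst1q_f32(dst + 3 * ld_dst,
            vcombine_f32(vget_high_f32(t01.val[1]), vget_high_f32(t23.val[1])));
}
#endif

// Tiled driver: 4x4 NEON tiles where possible, scalar loop for the edges.
void transpose_f32(std::size_t M, std::size_t N, const float *src,
                   std::size_t ld_src, float *dst, std::size_t ld_dst) {
  assert(ld_src >= N && ld_dst >= M);
  std::size_t i = 0;
#if defined(__ARM_NEON)
  for (; i + 4 <= M; i += 4) {
    std::size_t j = 0;
    for (; j + 4 <= N; j += 4)
      transpose_4x4_neon(src + i * ld_src + j, ld_src, dst + j * ld_dst + i,
                         ld_dst);
    if (j < N) // leftover columns of this 4-row band
      transpose_scalar(4, N - j, src + i * ld_src + j, ld_src,
                       dst + j * ld_dst + i, ld_dst);
  }
#endif
  if (i < M) // leftover rows, or the whole matrix when NEON is unavailable
    transpose_scalar(M - i, N, src + i * ld_src, ld_src, dst + i, ld_dst);
}
```

A 4x4 tile keeps the working set in four vector registers; larger tiles or cache blocking may matter for the big matrices in the table, but that is outside what this sketch shows.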

Self evaluation:

  1. Build test: [x]Passed [ ]Failed [ ]Skipped
  2. Run test: [x]Passed [ ]Failed [ ]Skipped

- Implement a NEON SIMD kernel for matrix transpose for the fp32 datatype
- Connect this kernel to the current function template
- This expands SIMD coverage for matrix transpose datatypes: fp16 only -> fp16, fp32

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- With NEON support (arm), apply matrix transpose with SIMD at BLAS level.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
…ber function

- For fp32, channel-first, 0:2:1 transpose case, use transpose from blas interface instead of current loop implementation.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Verify transpose function with A.T.T = A

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
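
The A.T.T = A check from the last commit can be sketched as follows. This is only an illustration of the idea, not the unit test added in the PR: `transpose_under_test` is a hypothetical stand-in (a plain scalar loop here) for whichever transpose function is being validated, and `roundtrip_is_identity` is an assumed helper name.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the function under test (row-major M x N input).
static void transpose_under_test(std::size_t M, std::size_t N,
                                 const float *src, float *dst) {
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j)
      dst[j * M + i] = src[i * N + j];
}

// Returns true when transposing twice reproduces the input exactly.
// Exact comparison is safe because a transpose only moves values.
bool roundtrip_is_identity(std::size_t M, std::size_t N) {
  std::vector<float> a(M * N), at(N * M), att(M * N);
  for (std::size_t k = 0; k < a.size(); ++k)
    a[k] = 0.5f * static_cast<float>(k) - 3.0f; // distinct values per element
  transpose_under_test(M, N, a.data(), at.data());   // A   -> A.T  (N x M)
  transpose_under_test(N, M, at.data(), att.data()); // A.T -> A.T.T (M x N)
  return a == att;
}
```

One caveat: for square inputs, a kernel that merely copies its input also satisfies A.T.T = A, so using non-square shapes, or comparing a few elements of A.T against A directly, makes the check stricter.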
@skykongkong8 skykongkong8 changed the title Pr/transpose/simd/fp32 [ Tensor ] Apply SIMD in matrix transpose fp32 Dec 18, 2024
@skykongkong8 skykongkong8 changed the title [ Tensor ] Apply SIMD in matrix transpose fp32 [ Tensor ] Apply SIMD in matrix transpose fp32 @open sesame 12/18 10:16 Dec 18, 2024
#else
transpose_fallback<float>(M, N, src, ld_src, dst, ld_dst);
#endif
}
Member

Someday, all such #ifdef blocks (NEON, FP16, ...) need to be migrated to header files (preferably centralized into a single header), and the code that depends on them needs to be separated into a class or a file (e.g., defined in blas_neon.cpp, with the decision of whether to use the functions/sub-classes in that file made in a header and in the build script/options).

But for today, let's move on.
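
One way to read this suggestion, as a rough sketch only: a single header owns the feature detection and the dispatch, so call sites never see an `#ifdef`. Everything named here (`BACKEND_HAS_NEON`, the `backend`, `generic`, and `neon` namespaces) is hypothetical and not taken from the project; the NEON branch is a placeholder that forwards to the generic loop so that the sketch builds on any target.

```cpp
#include <cstddef>

// ---- would live in one central header (hypothetical) ----
#if defined(__ARM_NEON)
#define BACKEND_HAS_NEON 1
#else
#define BACKEND_HAS_NEON 0
#endif

namespace backend {

namespace generic {
// Portable implementation, always available.
inline void transpose(std::size_t M, std::size_t N, const float *src,
                      std::size_t ld_src, float *dst, std::size_t ld_dst) {
  for (std::size_t i = 0; i < M; ++i)
    for (std::size_t j = 0; j < N; ++j)
      dst[j * ld_dst + i] = src[i * ld_src + j];
}
} // namespace generic

#if BACKEND_HAS_NEON
namespace neon {
// Placeholder: a real NEON kernel would be defined in a NEON-only source
// file (e.g. blas_neon.cpp) that the build script compiles only for ARM.
inline void transpose(std::size_t M, std::size_t N, const float *src,
                      std::size_t ld_src, float *dst, std::size_t ld_dst) {
  generic::transpose(M, N, src, ld_src, dst, ld_dst);
}
} // namespace neon
#endif

// The only place where the feature macro selects an implementation.
inline void transpose(std::size_t M, std::size_t N, const float *src,
                      std::size_t ld_src, float *dst, std::size_t ld_dst) {
#if BACKEND_HAS_NEON
  neon::transpose(M, N, src, ld_src, dst, ld_dst);
#else
  generic::transpose(M, N, src, ld_src, dst, ld_dst);
#endif
}

} // namespace backend
```

Callers would then write `backend::transpose(M, N, src, ld_src, dst, ld_dst);` with no preprocessor conditionals of their own.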

Member Author

This issue seems to be related to #2549.
I am aware of it. I think we urgently need to discuss when to apply it.

Member Author

In fact, one of the biggest problems with applying #2549 is that every Android makefile / Tizen spec file in the current nntr includes blas_interface.h. Since that PR replaces blas_interface with a single cpu_backend.h, it affects quite a lot of files...

Contributor

@EunjuYang EunjuYang left a comment

LGTM.
Combining transpose with dot product, e.g., ($$(W \cdot I)^T = I^T \cdot W^T$$) can be another option for transpose unittest. Anyway transpose of transpose seems good :)
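
The identity-based check suggested here could look roughly like the sketch below. It is an illustration under assumptions, not project code: `matmul`, `transpose`, and `product_transpose_identity_holds` are hypothetical helpers using naive loops, and `w` and `in` play the roles of $W$ and $I$.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Mat = std::vector<float>; // row-major storage

// Naive transpose of a rows x cols matrix (stand-in for the kernel under test).
static Mat transpose(const Mat &a, std::size_t rows, std::size_t cols) {
  Mat t(a.size());
  for (std::size_t i = 0; i < rows; ++i)
    for (std::size_t j = 0; j < cols; ++j)
      t[j * rows + i] = a[i * cols + j];
  return t;
}

// Naive product of an m x k matrix with a k x n matrix.
static Mat matmul(const Mat &a, const Mat &b, std::size_t m, std::size_t k,
                  std::size_t n) {
  Mat c(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t p = 0; p < k; ++p)
      for (std::size_t j = 0; j < n; ++j)
        c[i * n + j] += a[i * k + p] * b[p * n + j];
  return c;
}

// Checks (W . I)^T == I^T . W^T for W of shape m x k and I of shape k x n.
bool product_transpose_identity_holds(std::size_t m, std::size_t k,
                                      std::size_t n) {
  Mat w(m * k), in(k * n);
  for (std::size_t x = 0; x < w.size(); ++x)
    w[x] = static_cast<float>(x % 7) - 3.0f;
  for (std::size_t x = 0; x < in.size(); ++x)
    in[x] = static_cast<float>(x % 5) + 1.0f;
  Mat lhs = transpose(matmul(w, in, m, k, n), m, n); // n x m
  Mat rhs = matmul(transpose(in, k, n), transpose(w, m, k), n, k, m); // n x m
  for (std::size_t x = 0; x < lhs.size(); ++x)
    if (std::fabs(lhs[x] - rhs[x]) > 1e-5f) // tolerance in case of reordering
      return false;
  return true;
}
```

Because the operands are non-square, this check also exercises shapes where a wrong leading dimension would show up, which the square round-trip case can miss.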

Member Author

> LGTM. Combining transpose with dot product, e.g., ($$(W \cdot I)^T = I^T \cdot W^T$$) can be another option for transpose unittest. Anyway transpose of transpose seems good :)

Nice approach! Gotta try that one too

Member

@DonghakPark DonghakPark left a comment


LGTM!

@skykongkong8 skykongkong8 self-assigned this Jan 8, 2025