
[ neon ] Implement int8 mul neon simd kernel @open sesame 01/09 12:38 #2857

Open · wants to merge 1 commit into main

Conversation

skykongkong8 (Member)

This PR proposes an accelerated function, following the discussion in #2850.
It will not affect existing behavior immediately; a follow-up PR will apply the function in the near future.

  • Note that the current int8 Tensor does not consider a zero point as a qParam.
  • Multiplies with 16-to-32-bit widening, 32-bit scaling, and min-max saturation.
  • Uses round-to-nearest for the result.
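
A rough per-element sketch of the computation (illustrative only; qmul_one is a made-up helper name, and std::lroundf rounds ties away from zero while the NEON kernel's vcvtnq_s32_f32 rounds ties to even):

#include <algorithm>
#include <cmath>
#include <cstdint>

// Per-element reference of the kernel's math: widen, scale, round, saturate.
inline int8_t qmul_one(int8_t lhs, int8_t rhs, float lhs_scale, float rhs_scale,
                       float res_scale) {
  // int8 x int8 widens to a 32-bit product (|product| <= 16384, so no overflow).
  int32_t prod = static_cast<int32_t>(lhs) * static_cast<int32_t>(rhs);
  // Rescale in fp32; the zero point is ignored, matching the current int8 Tensor.
  float scaled = prod * lhs_scale * rhs_scale / res_scale;
  // Round to nearest, then saturate to the int8 range [-128, 127].
  return static_cast<int8_t>(
    std::clamp(static_cast<int32_t>(std::lroundf(scaled)), -128, 127));
}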

Self evaluation:

  1. Build test: [X]Passed [ ]Failed [ ]Skipped
  2. Run test: [X]Passed [ ]Failed [ ]Skipped

@EunjuYang (Contributor) left a comment:

Happy to read the PR. Here are some comments from my side.

nntrainer/tensor/blas_neon.cpp (resolved)
Comment on lines 199 to 201
void ele_qmul(int8_t *lhs, int8_t *rhs, int8_t *res, unsigned int data_len,
float *lhs_scale, float *rhs_scale, float *res_scale,
unsigned int scale_len);
Contributor:

What about declaring lhs_scale, rhs_scale, and res_scale as const pointers?

Suggested change
void ele_qmul(int8_t *lhs, int8_t *rhs, int8_t *res, unsigned int data_len,
float *lhs_scale, float *rhs_scale, float *res_scale,
unsigned int scale_len);
void ele_qmul(int8_t *lhs, int8_t *rhs, int8_t *res, unsigned int data_len,
const float *lhs_scale, const float *rhs_scale, const float *res_scale,
unsigned int scale_len);

Member Author:

Sounds better!
AFAIK, the current char tensor scale factor is stored as std::vector<float> const &scales, so the signature should be fixed that way. Thanks!
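
For reference, a hypothetical call-site sketch (the wrapper name run_qmul and its parameter names are illustrative, not from the PR) showing that the const-pointer signature works directly with std::vector<float> const &:

#include <cstdint>
#include <vector>

// Declaration with const scale pointers, as in the suggested change above.
void ele_qmul(int8_t *lhs, int8_t *rhs, int8_t *res, unsigned int data_len,
              const float *lhs_scale, const float *rhs_scale,
              const float *res_scale, unsigned int scale_len);

// Hypothetical call site: .data() on a const vector yields const float *,
// so no const_cast is needed when the scales live in std::vector<float> const &.
void run_qmul(int8_t *lhs, int8_t *rhs, int8_t *res, unsigned int data_len,
              std::vector<float> const &lhs_scales,
              std::vector<float> const &rhs_scales,
              std::vector<float> const &res_scales) {
  ele_qmul(lhs, rhs, res, data_len, lhs_scales.data(), rhs_scales.data(),
           res_scales.data(), static_cast<unsigned int>(res_scales.size()));
}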

Comment on lines +632 to +638
if (scale_len == 1) {
return __ele_qmul_kernel(lhs, rhs, res, data_len, lhs_scale[0],
rhs_scale[0], res_scale[0]);
} else {
return __ele_qmul_kernel(lhs, rhs, res, data_len, lhs_scale, rhs_scale,
res_scale, scale_len);
}
Contributor:

Simple question: do you think there is more to gain by defining a separate kernel for the scalar case? If not, we might consider removing this condition after implementing the vector-scale case.

Member Author:

Do you mean whether there is room for a unified kernel that handles both the scalar and vector cases?
It is not impossible, but it would hurt the kernel's efficiency. For example, the same scale-factor vector would be loaded redundantly on every iteration, whereas in a scalar-only kernel the scale factor can be pre-loaded once, like: float32x4_t multiplier = vdupq_n_f32(lhs_scale * rhs_scale / res_scale);
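
For illustration, a rough sketch of such a scalar-scale-only inner loop (not the PR's actual kernel; the function name and 8-elements-per-iteration layout are assumptions) where the multiplier is broadcast once before the loop:

#include <arm_neon.h>
#include <algorithm>
#include <cmath>
#include <cstdint>

// Sketch of a scalar-scale int8 multiply kernel (illustrative; not the PR's code).
static void ele_qmul_scalar_scale_sketch(const int8_t *lhs, const int8_t *rhs,
                                         int8_t *res, unsigned int data_len,
                                         float lhs_scale, float rhs_scale,
                                         float res_scale) {
  // The combined scale is broadcast once, outside the loop.
  const float32x4_t multiplier = vdupq_n_f32(lhs_scale * rhs_scale / res_scale);

  unsigned int i = 0;
  for (; i + 8 <= data_len; i += 8) {
    // int8 x int8 -> int16 widening multiply, then widen the halves to int32.
    int16x8_t prod16 = vmull_s8(vld1_s8(lhs + i), vld1_s8(rhs + i));
    int32x4_t lo32 = vmovl_s16(vget_low_s16(prod16));
    int32x4_t hi32 = vmovl_s16(vget_high_s16(prod16));

    // Scale in fp32 and round to nearest.
    int32x4_t lo_q = vcvtnq_s32_f32(vmulq_f32(vcvtq_f32_s32(lo32), multiplier));
    int32x4_t hi_q = vcvtnq_s32_f32(vmulq_f32(vcvtq_f32_s32(hi32), multiplier));

    // Saturating narrow: int32 -> int16 -> int8.
    int16x8_t narrow16 = vcombine_s16(vqmovn_s32(lo_q), vqmovn_s32(hi_q));
    vst1_s8(res + i, vqmovn_s16(narrow16));
  }
  // Scalar tail for leftover elements.
  for (; i < data_len; ++i) {
    float v = lhs[i] * rhs[i] * lhs_scale * rhs_scale / res_scale;
    int32_t q = static_cast<int32_t>(std::lroundf(v));
    res[i] = static_cast<int8_t>(std::clamp(q, -128, 127));
  }
}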

Contributor:

I see, so you're saying that if we know the scale is a scalar, using a scalar-specific kernel is better? In that case, it might be wiser to proceed as proposed in this PR. Thanks for your response!

@EunjuYang (Contributor) left a comment:

LGTM!

@skykongkong8 changed the title from "[ neon ] Implement int8 mul neon simd kernel" to "[ neon ] Implement int8 mul neon simd kernel @open sesame 01/09 12:38" on Jan 9, 2025
@djeong20 (Contributor) left a comment:

Awesome 👍

res_f32_2 = vmulq_f32(res_f32_2, multiplier);
res_f32_3 = vmulq_f32(res_f32_3, multiplier);

/// @note: currently we use vcvtnq_s32_f32 instead of vcvtq_s32_f32
Contributor:

Could you explain this part?

@skykongkong8 (Member Author) commented on Jan 9, 2025:

Both vcvtq_s32_f32 and vcvtnq_s32_f32 are intrinsics that convert fp32 to signed 32-bit integers, but:

vcvtq_s32_f32 : rounds toward zero
vcvtnq_s32_f32 : rounds to nearest

As you might already know, there are many rounding rules when dealing with quantized values (stochastic rounding, ...), but AFAIK round-to-nearest is the most typical one.
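
A tiny standalone demo (hypothetical, just to show the rounding difference; not part of the PR) comparing the two intrinsics:

#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

int main() {
  const float in[4] = {1.5f, 2.5f, -1.5f, -2.7f};
  float32x4_t v = vld1q_f32(in);

  int32_t toward_zero[4], to_nearest[4];
  vst1q_s32(toward_zero, vcvtq_s32_f32(v));  // truncates: 1, 2, -1, -2
  vst1q_s32(to_nearest, vcvtnq_s32_f32(v));  // nearest, ties to even: 2, 2, -2, -3

  for (int i = 0; i < 4; ++i)
    printf("%g -> toward zero: %d, to nearest: %d\n", in[i], toward_zero[i],
           to_nearest[i]);
  return 0;
}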
