
[ neon ] Implement int8 mul neon simd kernel @open sesame 01/09 12:38 #2857

Open · wants to merge 1 commit into main

Conversation

skykongkong8 (Member)

This PR proposes an accelerated function, following the discussion in #2850.
It will not affect existing behavior immediately; a follow-up PR will apply the function in the near future.

  • Note that the current int8 Tensor does not consider a zero point as a qParam.
  • Multiplies with 16-to-32-bit widening, 32-bit scaling, and min-max saturation.
  • Uses round-to-nearest for the result.
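
A rough per-element sketch of the computation (illustrative only; qmul_one is a made-up helper name, and std::lroundf rounds ties away from zero while the NEON kernel's vcvtnq_s32_f32 rounds ties to even):

#include <algorithm>
#include <cmath>
#include <cstdint>

// Per-element reference of the kernel's math: widen, scale, round, saturate.
inline int8_t qmul_one(int8_t lhs, int8_t rhs, float lhs_scale, float rhs_scale,
                       float res_scale) {
  // int8 x int8 widens to a 32-bit product (|product| <= 16384, so no overflow).
  int32_t prod = static_cast<int32_t>(lhs) * static_cast<int32_t>(rhs);
  // Rescale in fp32; the zero point is ignored, matching the current int8 Tensor.
  float scaled = prod * lhs_scale * rhs_scale / res_scale;
  // Round to nearest, then saturate to the int8 range [-128, 127].
  return static_cast<int8_t>(
    std::clamp(static_cast<int32_t>(std::lroundf(scaled)), -128, 127));
}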

Self evaluation:

  1. Build test: [X]Passed [ ]Failed [ ]Skipped
  2. Run test: [X]Passed [ ]Failed [ ]Skipped

@EunjuYang (Contributor) left a comment:

Happy to read the PR. Here are some comments from my side.

nntrainer/tensor/blas_neon.cpp (resolved)
Comment on lines 199 to 201
void ele_qmul(int8_t *lhs, int8_t *rhs, int8_t *res, unsigned int data_len,
float *lhs_scale, float *rhs_scale, float *res_scale,
unsigned int scale_len);
Contributor:

What about declaring lhs_scale, rhs_scale, and res_scale as const pointers?

Suggested change
void ele_qmul(int8_t *lhs, int8_t *rhs, int8_t *res, unsigned int data_len,
float *lhs_scale, float *rhs_scale, float *res_scale,
unsigned int scale_len);
void ele_qmul(int8_t *lhs, int8_t *rhs, int8_t *res, unsigned int data_len,
const float *lhs_scale, const float *rhs_scale, const float *res_scale,
unsigned int scale_len);

Member Author:

Sounds better!
AFAIK, the current char tensor scale factor is stored as std::vector<float> const &scales, so the signature should be fixed that way. Thanks!
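
For reference, a hypothetical call-site sketch (the wrapper name run_qmul and its parameter names are illustrative, not from the PR) showing that the const-pointer signature works directly with std::vector<float> const &:

#include <cstdint>
#include <vector>

// Declaration with const scale pointers, as in the suggested change above.
void ele_qmul(int8_t *lhs, int8_t *rhs, int8_t *res, unsigned int data_len,
              const float *lhs_scale, const float *rhs_scale,
              const float *res_scale, unsigned int scale_len);

// Hypothetical call site: .data() on a const vector yields const float *,
// so no const_cast is needed when the scales live in std::vector<float> const &.
void run_qmul(int8_t *lhs, int8_t *rhs, int8_t *res, unsigned int data_len,
              std::vector<float> const &lhs_scales,
              std::vector<float> const &rhs_scales,
              std::vector<float> const &res_scales) {
  ele_qmul(lhs, rhs, res, data_len, lhs_scales.data(), rhs_scales.data(),
           res_scales.data(), static_cast<unsigned int>(res_scales.size()));
}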

Comment on lines +632 to +638
if (scale_len == 1) {
return __ele_qmul_kernel(lhs, rhs, res, data_len, lhs_scale[0],
rhs_scale[0], res_scale[0]);
} else {
return __ele_qmul_kernel(lhs, rhs, res, data_len, lhs_scale, rhs_scale,
res_scale, scale_len);
}
Contributor:

Simple question: do you think there is more to gain by defining a separate kernel for the scalar case? If not, we might consider removing this condition after implementing the vector-scale case.

Member Author:

Do you mean whether there is room for a unified kernel that handles both the scalar and vector cases?
It is not impossible, but it would hurt the kernel's efficiency. For example, the same scale-factor vector would be loaded redundantly on every iteration, whereas in a scalar-only kernel the scale factor can be pre-loaded once, like: float32x4_t multiplier = vdupq_n_f32(lhs_scale * rhs_scale / res_scale);
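
For illustration, a rough sketch of such a scalar-scale-only inner loop (not the PR's actual kernel; the function name and 8-elements-per-iteration layout are assumptions) where the multiplier is broadcast once before the loop:

#include <arm_neon.h>
#include <algorithm>
#include <cmath>
#include <cstdint>

// Sketch of a scalar-scale int8 multiply kernel (illustrative; not the PR's code).
static void ele_qmul_scalar_scale_sketch(const int8_t *lhs, const int8_t *rhs,
                                         int8_t *res, unsigned int data_len,
                                         float lhs_scale, float rhs_scale,
                                         float res_scale) {
  // The combined scale is broadcast once, outside the loop.
  const float32x4_t multiplier = vdupq_n_f32(lhs_scale * rhs_scale / res_scale);

  unsigned int i = 0;
  for (; i + 8 <= data_len; i += 8) {
    // int8 x int8 -> int16 widening multiply, then widen the halves to int32.
    int16x8_t prod16 = vmull_s8(vld1_s8(lhs + i), vld1_s8(rhs + i));
    int32x4_t lo32 = vmovl_s16(vget_low_s16(prod16));
    int32x4_t hi32 = vmovl_s16(vget_high_s16(prod16));

    // Scale in fp32 and round to nearest.
    int32x4_t lo_q = vcvtnq_s32_f32(vmulq_f32(vcvtq_f32_s32(lo32), multiplier));
    int32x4_t hi_q = vcvtnq_s32_f32(vmulq_f32(vcvtq_f32_s32(hi32), multiplier));

    // Saturating narrow: int32 -> int16 -> int8.
    int16x8_t narrow16 = vcombine_s16(vqmovn_s32(lo_q), vqmovn_s32(hi_q));
    vst1_s8(res + i, vqmovn_s16(narrow16));
  }
  // Scalar tail for leftover elements.
  for (; i < data_len; ++i) {
    float v = lhs[i] * rhs[i] * lhs_scale * rhs_scale / res_scale;
    int32_t q = static_cast<int32_t>(std::lroundf(v));
    res[i] = static_cast<int8_t>(std::clamp(q, -128, 127));
  }
}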

Contributor:

I see, so you're saying that if we know the scale is a scalar, using a scalar-specific kernel is better? In that case, it might be wiser to proceed as proposed in this PR. Thanks for your response!

@EunjuYang (Contributor) left a comment:

LGTM!

@skykongkong8 changed the title from "[ neon ] Implement int8 mul neon simd kernel" to "[ neon ] Implement int8 mul neon simd kernel @open sesame 01/09 12:38" on Jan 9, 2025
@djeong20 (Contributor) left a comment:

Awesome 👍

res_f32_2 = vmulq_f32(res_f32_2, multiplier);
res_f32_3 = vmulq_f32(res_f32_3, multiplier);

/// @note: currently we use vcvtnq_s32_f32 instead of vcvtq_s32_f32
Contributor:

Could you explain this part?

@skykongkong8 (Member Author) commented on Jan 9, 2025:

Both vcvtq_s32_f32 and vcvtnq_s32_f32 are intrinsics that convert fp32 to signed 32-bit integers, but:

vcvtq_s32_f32 : rounds toward zero
vcvtnq_s32_f32 : rounds to nearest

As you might already know, there are many rounding rules when dealing with quantized values (stochastic rounding, ...), but AFAIK round-to-nearest is the most typical one.
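
A tiny standalone demo (hypothetical, just to show the rounding difference; not part of the PR) comparing the two intrinsics:

#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

int main() {
  const float in[4] = {1.5f, 2.5f, -1.5f, -2.7f};
  float32x4_t v = vld1q_f32(in);

  int32_t toward_zero[4], to_nearest[4];
  vst1q_s32(toward_zero, vcvtq_s32_f32(v));  // truncates: 1, 2, -1, -2
  vst1q_s32(to_nearest, vcvtnq_s32_f32(v));  // nearest, ties to even: 2, 2, -2, -3

  for (int i = 0; i < 4; ++i)
    printf("%g -> toward zero: %d, to nearest: %d\n", in[i], toward_zero[i],
           to_nearest[i]);
  return 0;
}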
