[ neon ] Implement int8 mul neon simd kernel @open sesame 01/09 12:38 #2857
base: main
Conversation
Happy to read the PR. Here are some comments from my side.
nntrainer/tensor/blas_neon.h
void ele_qmul(int8_t *lhs, int8_t *rhs, int8_t *res, unsigned int data_len,
              float *lhs_scale, float *rhs_scale, float *res_scale,
              unsigned int scale_len);
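To make the contract concrete, here is a minimal scalar reference of what a kernel with this signature computes for the per-tensor (scale_len == 1) case, assuming the usual symmetric-quantization identity real = scale * q with no zero point (as this PR notes). The helper name and clamping style are illustrative, not the PR's code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative scalar reference (not the PR's code). With real = scale * q
// and no zero point, the quantized product is
//   q_res = round((lhs_scale * rhs_scale / res_scale) * q_lhs * q_rhs),
// saturated to the int8 range.
void ele_qmul_ref(const int8_t *lhs, const int8_t *rhs, int8_t *res,
                  unsigned int data_len, float lhs_scale, float rhs_scale,
                  float res_scale) {
  const float multiplier = lhs_scale * rhs_scale / res_scale;
  for (unsigned int i = 0; i < data_len; ++i) {
    // Widen before multiplying so int8 * int8 cannot overflow.
    const int32_t prod =
      static_cast<int32_t>(lhs[i]) * static_cast<int32_t>(rhs[i]);
    // Default rounding mode is round-to-nearest-even, matching vcvtnq.
    const int32_t q = static_cast<int32_t>(std::nearbyint(prod * multiplier));
    res[i] = static_cast<int8_t>(std::clamp(q, -128, 127));
  }
}
```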
What about declaring lhs_scale, rhs_scale, and res_scale as const pointers?
Suggested change:
-void ele_qmul(int8_t *lhs, int8_t *rhs, int8_t *res, unsigned int data_len,
-              float *lhs_scale, float *rhs_scale, float *res_scale,
-              unsigned int scale_len);
+void ele_qmul(int8_t *lhs, int8_t *rhs, int8_t *res, unsigned int data_len,
+              const float *lhs_scale, const float *rhs_scale,
+              const float *res_scale, unsigned int scale_len);
Sounds better! AFAIK, the current char tensor scale factor is stored as std::vector<float> const &scales, so it should be fixed that way. Thanks!
if (scale_len == 1) {
  return __ele_qmul_kernel(lhs, rhs, res, data_len, lhs_scale[0],
                           rhs_scale[0], res_scale[0]);
} else {
  return __ele_qmul_kernel(lhs, rhs, res, data_len, lhs_scale, rhs_scale,
                           res_scale, scale_len);
}
Simple question: do you think there is more room for optimization if we define the kernel separately for the scalar case? If not, we might consider removing this condition after implementing the vector-scale case.
Do you mean whether there is room for a unified kernel that can handle both the scalar and vector cases?
It is not impossible, but it would deteriorate the kernel's performance. For example, the same scale factor vector would be loaded redundantly, whereas in the scalar-only kernel the scale factor can be pre-loaded once, like: float32x4_t multiplier = vdupq_n_f32(lhs_scale * rhs_scale / res_scale);
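For contrast, in the vector-scale case each block would have to rebuild the multiplier from memory. A sketch of what that per-block step could look like on AArch64 (the helper name and indexing are illustrative, not the PR's code):

```cpp
#include <arm_neon.h>

// Sketch: per-block multiplier for the vector-scale case (AArch64 only,
// since vdivq_f32 is an A64 intrinsic). Three loads and a divide per
// block, versus one vdupq_n_f32 before the loop in the scalar-scale case.
float32x4_t block_multiplier(const float *lhs_scale, const float *rhs_scale,
                             const float *res_scale, unsigned int i) {
  return vdivq_f32(
    vmulq_f32(vld1q_f32(lhs_scale + i), vld1q_f32(rhs_scale + i)),
    vld1q_f32(res_scale + i));
}
```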
I see, so you're saying that if we know it's a scalar, using a scalar-specific kernel would be better? In that case, it might be wiser to proceed as PRed. Thanks for your response!
- Note that the current int8 Tensor does not consider a zero point as a qParam.
- Multiply with 16 to 32 bit widening, 32 bit scaling, and min-max saturation.
- Use nearest rounding for the result.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
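For illustration, the steps listed above (widening multiply, fp32 scaling, round-to-nearest, saturating narrow) could be sketched with NEON intrinsics roughly as follows. This is a simplified 8-elements-per-iteration sketch of the scalar-scale path with remainder handling omitted, not the PR's actual kernel:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Simplified sketch (not the PR's actual code): widen int8 products,
// scale in fp32, round to nearest, saturate back down to int8.
void ele_qmul_sketch(const int8_t *lhs, const int8_t *rhs, int8_t *res,
                     unsigned int data_len, float lhs_scale, float rhs_scale,
                     float res_scale) {
  const float32x4_t multiplier =
    vdupq_n_f32(lhs_scale * rhs_scale / res_scale);
  for (unsigned int i = 0; i + 8 <= data_len; i += 8) {
    // int8 x int8 -> int16 widening multiply
    const int16x8_t prod16 = vmull_s8(vld1_s8(lhs + i), vld1_s8(rhs + i));
    // int16 -> int32 -> fp32, then apply the fused scale
    const float32x4_t lo = vmulq_f32(
      vcvtq_f32_s32(vmovl_s16(vget_low_s16(prod16))), multiplier);
    const float32x4_t hi = vmulq_f32(
      vcvtq_f32_s32(vmovl_s16(vget_high_s16(prod16))), multiplier);
    // round to nearest, then saturating narrow: int32 -> int16 -> int8
    const int16x8_t narrowed = vcombine_s16(vqmovn_s32(vcvtnq_s32_f32(lo)),
                                            vqmovn_s32(vcvtnq_s32_f32(hi)));
    vst1_s8(res + i, vqmovn_s16(narrowed));
  }
}
```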
Force-pushed from f11fa81 to 7b15a9d
LGTM!
Awesome 👍
res_f32_2 = vmulq_f32(res_f32_2, multiplier);
res_f32_3 = vmulq_f32(res_f32_3, multiplier);

/// @note: currently we use vcvtnq_s32_f32 instead of vcvtq_s32_f32
Could you explain this part?
Both vcvtq_s32_f32 and vcvtnq_s32_f32 are intrinsics that convert fp32 to signed int32, but:
vcvtq_s32_f32 : round toward zero
vcvtnq_s32_f32 : round to nearest
As you might already know, there are many rounding rules when dealing with quantized values (stochastic rounding, ...), but AFAIK rounding to the nearest is the most typical one.
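A small self-contained check (AArch64) that shows the difference between the two intrinsics; the expected lane values, assuming vcvtnq's ties-to-even behavior, are noted in the comments:

```cpp
#include <arm_neon.h>
#include <cstdio>

int main() {
  const float vals[4] = {2.5f, -2.5f, 2.7f, -2.7f};
  const float32x4_t v = vld1q_f32(vals);
  const int32x4_t toward_zero = vcvtq_s32_f32(v); // {2, -2, 2, -2}
  const int32x4_t nearest = vcvtnq_s32_f32(v);    // {2, -2, 3, -3}, ties to even
  std::printf("toward zero: %d %d %d %d\n",
              vgetq_lane_s32(toward_zero, 0), vgetq_lane_s32(toward_zero, 1),
              vgetq_lane_s32(toward_zero, 2), vgetq_lane_s32(toward_zero, 3));
  std::printf("to nearest : %d %d %d %d\n",
              vgetq_lane_s32(nearest, 0), vgetq_lane_s32(nearest, 1),
              vgetq_lane_s32(nearest, 2), vgetq_lane_s32(nearest, 3));
  return 0;
}
```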
This PR proposes an accelerated function for the feature proposed in #2850.
This will not affect the function immediately; a follow-up PR for that will come in the near future.

**Self evaluation:**
1. Build test: [X]Passed [ ]Failed [ ]Skipped
2. Run test: [X]Passed [ ]Failed [ ]Skipped