[CharTensor] Enable QINT8 multiplication feature #2850
base: main
Conversation
Force-pushed from 81b92c9 to 5df2b81
LGTM except for the comment below.
float lhs_scale = *(float *)getScale();
float rhs_scale = *input.getScale<float>();
Are they always scalars? Based on #2844, it seems the scale can be an array.
Is this related to the condition in the note, i.e., "4. only per-tensor quantization qscheme is supported"?
If so, we need to add a condition to verify whether the case is supported or not.
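For instance, a guard along these lines could work (a rough sketch only; the QScheme enum and the scheme arguments are hypothetical, since the current tensor API does not expose the quantization scheme yet):

#include <stdexcept>

// Hypothetical sketch: reject unsupported quantization schemes up front.
// QScheme and the way it would be obtained are placeholders, not existing API.
enum class QScheme { PER_TENSOR_AFFINE, PER_CHANNEL_AFFINE };

void check_per_tensor(QScheme lhs_scheme, QScheme rhs_scheme) {
  if (lhs_scheme != QScheme::PER_TENSOR_AFFINE ||
      rhs_scheme != QScheme::PER_TENSOR_AFFINE) {
    throw std::invalid_argument(
      "QINT8 multiply currently supports per-tensor quantization only");
  }
}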
Thank you for the detailed review! Yes, you're correct. Currently, there is no way to check the quantization scheme of a tensor. I'll create a new PR to add one and then apply it here.
Otherwise, great work!
float multiplier = lhs_scale * rhs_scale / scale;

int8_t *lhs = (int8_t *)getData();
int8_t *rhs = input.getData<int8_t>();
int8_t *result = output.getData<int8_t>();

for (unsigned int i = 0; i < size(); ++i) {
  int32_t accum_val =
    static_cast<int32_t>(lhs[i]) * static_cast<int32_t>(rhs[i]);

  result[i] =
    std::max(-128, std::min((int)std::lround(multiplier * accum_val), 127));
}

*output.getScale<float>() = scale;
As you might already know, in order to add SIMD-accelerated code and maintain it alongside the vanilla code, this should reside at the blas_interface.cpp level. I can do that later on, but I would like to discuss one thing:
// scalar scale
void ele_mul(int8_t* lhs, int8_t* rhs, int8_t* res, float lhs_scale, float rhs_scale, float scale, unsigned int N);
// vector scale
void ele_mul(int8_t* lhs, int8_t* rhs, int8_t* res, float* lhs_scale, float* rhs_scale, float* scale, unsigned int N);
Would a function design like the above be valid for your intention?
Thank you for sharing your opinion! To answer: yes, the function design you suggested would be valid (although for the vector scale, the number of output channels should also be provided).
Maybe we could use a single kernel to support both scalar and vector scales:
void qmul_kernel(int8_t* lhs, int8_t* rhs, int8_t* res, unsigned int data_len,
float* lhs_scale, float* rhs_scale, float* res_scale, unsigned int scale_len)
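For reference, a naive (non-SIMD) sketch of what such a single kernel might look like; it reuses the names from the signature above and assumes that scale_len == 1 means per-tensor scales, while per-channel scales apply to contiguous blocks of data_len / scale_len elements:

#include <algorithm>
#include <cmath>
#include <cstdint>

// Naive sketch of the proposed unified kernel (scalar code only).
// scale_len == 1 -> per-tensor scales; otherwise one scale per channel,
// assuming each channel owns a contiguous block of data_len / scale_len
// elements.
void qmul_kernel(int8_t *lhs, int8_t *rhs, int8_t *res, unsigned int data_len,
                 float *lhs_scale, float *rhs_scale, float *res_scale,
                 unsigned int scale_len) {
  unsigned int block = data_len / scale_len;
  for (unsigned int i = 0; i < data_len; ++i) {
    unsigned int c = i / block; // channel index (always 0 for per-tensor)
    float multiplier = lhs_scale[c] * rhs_scale[c] / res_scale[c];
    int32_t accum =
      static_cast<int32_t>(lhs[i]) * static_cast<int32_t>(rhs[i]);
    res[i] = static_cast<int8_t>(
      std::max(-128, std::min((int)std::lround(multiplier * accum), 127)));
  }
}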
Sounds even better. I will refer to that.
float lhs_scale = *(float *)getScale();
float rhs_scale = *input.getScale<float>();
One more thing to check...
Could I confirm whether there are no plans to incorporate a zero point as a qParam for QInt8 tensors at the moment?
I found some computational kernels that use zero points for both int8 and uint8, so I just want to make sure where we are heading.
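For context, if zero points were introduced later, the affine mapping real = scale * (q - zero_point) would change the accumulation roughly as in the following illustrative sketch (not part of this PR):

#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative only: element-wise multiply under asymmetric quantization,
// where real = scale * (q - zero_point). Not part of this PR.
int8_t qmul_with_zero_point(int8_t a, int8_t b, float a_scale, int32_t a_zp,
                            float b_scale, int32_t b_zp, float out_scale,
                            int32_t out_zp) {
  int32_t accum = (static_cast<int32_t>(a) - a_zp) *
                  (static_cast<int32_t>(b) - b_zp);
  float multiplier = a_scale * b_scale / out_scale;
  int32_t q = static_cast<int32_t>(std::lround(multiplier * accum)) + out_zp;
  return static_cast<int8_t>(
    std::max(-128, std::min(127, static_cast<int>(q))));
}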
This pull request aims to enable the QINT8 element-wise multiplication feature in CharTensor. This takes two tensors of the same dimensions and returns a matrix of the multiplied corresponding elements. Please note that automatically determining the new scale factor will be added in a future update.

**Self-evaluation:**
1. Build test: [X] Passed [ ] Failed [ ] Skipped
2. Run test: [X] Passed [ ] Failed [ ] Skipped

Signed-off-by: Donghyeon Jeong <[email protected]>
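To illustrate the computation with concrete, made-up numbers (scales 0.1 and 0.2 for the inputs, 0.05 for the output), the multiply, rescale, and clamp steps work out as in this standalone sketch:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
  // Made-up per-tensor scales and quantized values, purely for illustration.
  float lhs_scale = 0.1f, rhs_scale = 0.2f, out_scale = 0.05f;
  int8_t lhs = 25; // represents 25 * 0.1 = 2.5
  int8_t rhs = 10; // represents 10 * 0.2 = 2.0

  float multiplier = lhs_scale * rhs_scale / out_scale;                   // 0.4
  int32_t accum = static_cast<int32_t>(lhs) * static_cast<int32_t>(rhs);  // 250
  int8_t result = static_cast<int8_t>(
    std::max(-128, std::min((int)std::lround(multiplier * accum), 127)));

  std::printf("q = %d, dequantized = %.2f\n", result, result * out_scale);
  // prints: q = 100, dequantized = 5.00 (matches 2.5 * 2.0)
  return 0;
}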
Force-pushed from 5df2b81 to 1e63cc4