-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[OpenCL/GPU] Optimized Blas and Attention kernels with the latest GPU Pipeline. #2859
base: main
Are you sure you want to change the base?
Conversation
… Pipeline changes Upated the kernels as per the latest buffer generalized changes. Added unittest for Addition FP16 in unittest_blas_kernels_cl.cpp Signed-off-by: Yash Singh <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, LGTM
// unsigned int offsetFeqsSin, | ||
// unsigned int offsetSin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// unsigned int offsetFeqsSin, | |
// unsigned int offsetSin |
let's remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update in the latest commit.
@@ -50,7 +52,7 @@ __kernel void rotary_emb_cl(__global float *input, | |||
unsigned idx = (from + h)*dim; | |||
for(unsigned int i = idx; i < idx + dim; i++){ | |||
cos_ptr[i - idx] = freqs_cos[i]; | |||
sin_ptr[i - idx] = freqs_sin[i]; | |||
sin_ptr[i - idx + offsetSin] = freqs_sin[i + offsetFreqsSin]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you explain this part?
- why offsetSin and offsetFreqsSin (
cos_.size()
andfreqs_cos.size() * dim
) are added - what is the intended behavior in this change?
Also, wouldn't this result in accessing invalid memory space for freqs_sin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, so as per the latest GPU pipeline changes we are using a genralized set if buffers instead of creating buffers everytime whenever we a kernel is called. As of now there are only 5 generalized buffers, 3 for input buffers and 2 for output buffers.
As i need 5 input buffers so I am using an offset for both freqs_sin
and freqs_sin_flat
. Thats why in the code as well I've used the offset for both.
Example, lets say there is bufferA of size 500, so from my 1-100, I am storing freqs_cos and from 100-200 I am storing freqs_sin, so when using the data of freqs_sin, I'll have to use an offset of 100 and that is what I am doing here.
Please refer this PR for more understanding: #2816
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the clarification!
// nntrainer::TensorDim::TensorType t_type_nchw_fp32 = { | ||
// nntrainer::Tformat::NCHW, nntrainer::Tdatatype::FP32}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// nntrainer::TensorDim::TensorType t_type_nchw_fp32 = { | |
// nntrainer::Tformat::NCHW, nntrainer::Tdatatype::FP32}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update in the next commit. Thanks for pointing it out.
FP32
andFP16
.unittest_blas_kernels_cl.cpp
.add_i
androtary_emb
ops are updated.Signed-off-by: Yash Singh [email protected]