[OpenCL/GPU] Optimized Blas and Attention kernels with the latest GPU Pipeline. #2859

yashSingh0723 · 2025-01-07T06:17:15Z

Updated the kernels to match the latest generalized-buffer changes for both FP32 and FP16.
Added a unit test for the Addition kernel (FP16) in unittest_blas_kernels_cl.cpp.
Updated the add_i and rotary_emb ops.

Signed-off-by: Yash Singh [email protected]

djeong20

Overall, LGTM

djeong20 · 2025-01-13T03:30:42Z

nntrainer/tensor/cl_operations/attention_kernel_strings.h

+// unsigned int offsetFeqsSin,
+//                                       unsigned int offsetSin


Suggested change

-// unsigned int offsetFeqsSin,
-//                                       unsigned int offsetSin

Let's remove these commented-out lines.

I'll update this in the next commit.

djeong20 · 2025-01-13T03:41:00Z

nntrainer/tensor/cl_operations/attention_kernel_strings.h

@@ -50,7 +52,7 @@ __kernel void rotary_emb_cl(__global float *input,
          unsigned idx = (from + h)*dim;
          for(unsigned int i = idx; i < idx + dim; i++){
            cos_ptr[i - idx] = freqs_cos[i];
-            sin_ptr[i - idx] = freqs_sin[i];
+            sin_ptr[i - idx + offsetSin] = freqs_sin[i + offsetFreqsSin];


Could you explain this part?

Why are offsetSin and offsetFreqsSin (cos_.size() and freqs_cos.size() * dim) added?

What is the intended behavior of this change?

Also, wouldn't this result in accessing invalid memory for freqs_sin?

Hello. As per the latest GPU pipeline changes, we now use a generalized set of buffers instead of creating new buffers every time a kernel is called. As of now there are only 5 generalized buffers: 3 input buffers and 2 output buffers.
Since I need 5 input buffers, I am using an offset for both freqs_sin and freqs_sin_flat. That is why the code uses an offset for both.
For example, say bufferA has size 500. The first 100 elements store freqs_cos and the next 100 store freqs_sin, so when reading freqs_sin I have to use an offset of 100, and that is what I am doing here.

Please refer to this PR for more details: #2816
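
The offset scheme in the example above can be sketched on the host side as follows. This is only an illustration under stated assumptions, not nntrainer's actual implementation: `SharedBuffer` and `pack_cos_sin` are hypothetical names, and a `std::vector` stands in for one shared device buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for one shared ("generalized") buffer.
// A real pipeline would hold a device buffer here instead of a std::vector.
struct SharedBuffer {
  std::vector<float> data;

  explicit SharedBuffer(std::size_t capacity) : data(capacity, 0.0f) {}

  // Copy `src` into the buffer starting at element index `offset`.
  // Throws if the region would run past the end of the buffer.
  void write(const std::vector<float> &src, std::size_t offset) {
    if (offset + src.size() > data.size())
      throw std::out_of_range("region does not fit in shared buffer");
    std::copy(src.begin(), src.end(), data.begin() + offset);
  }
};

// Pack the cos table at the start of the buffer and the sin table right
// after it. Returns the element offset at which the sin table begins.
std::size_t pack_cos_sin(SharedBuffer &buf, const std::vector<float> &cos_tbl,
                         const std::vector<float> &sin_tbl) {
  buf.write(cos_tbl, 0);
  const std::size_t sin_offset = cos_tbl.size();
  buf.write(sin_tbl, sin_offset);
  return sin_offset;
}
```

With a 500-element buffer and two 100-element tables, `pack_cos_sin` returns 100, and element `i` of the sin table is read as `buf.data[100 + i]`. The bounds check in `write` speaks to the invalid-access concern: an offset is only safe when the offset plus the region size stays within the buffer.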

Thank you for the clarification!

djeong20 · 2025-01-13T03:44:54Z

test/unittest/unittest_blas_kernels_cl.cpp

+  // nntrainer::TensorDim::TensorType t_type_nchw_fp32 = {
+  //   nntrainer::Tformat::NCHW, nntrainer::Tdatatype::FP32};


Suggested change

-  // nntrainer::TensorDim::TensorType t_type_nchw_fp32 = {
-  //   nntrainer::Tformat::NCHW, nntrainer::Tdatatype::FP32};

I'll update in the next commit. Thanks for pointing it out.

[OpenCL/GPU] Optimized Blas and Attention kernels with the latest GPU…

493b26c

… Pipeline changes Upated the kernels as per the latest buffer generalized changes. Added unittest for Addition FP16 in unittest_blas_kernels_cl.cpp Signed-off-by: Yash Singh <[email protected]>

yashSingh0723 requested review from myungjoo, jijoongmoon, again4you, jaeyun-jung, leemgs, wooksong, gichan-jang, anyj0527, lhs8928, songgot, jihochu, DonghakPark, SeoHyungjun, baek2sm, skykongkong8, djeong20 and EunjuYang as code owners January 7, 2025 06:17

github-actions bot added the Need Review label Jan 7, 2025

djeong20 approved these changes Jan 13, 2025
