About a cache-friendly data layout #8432
luoyu-intel started this conversation in General
Hi @ggerganov @slaren!

I'm trying to improve next-token latency on Intel GPUs. I think the current data layout is not cache-friendly, which makes it very difficult for the SYCL kernels to fully use the memory bandwidth. For example, in Q4_0 the 2-byte delta value breaks the cache alignment of the quantized data.
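For reference, the Q4_0 block looks roughly like this (a standalone rendering of the definition in ggml's headers, with `uint16_t` standing in for `ggml_half`):

```cpp
#include <cstdint>

#define QK4_0 32
// One Q4_0 block: a 2-byte fp16 delta followed by 32 4-bit quants.
typedef struct {
    uint16_t d;             // fp16 delta -- the 2 bytes that shift everything after it
    uint8_t  qs[QK4_0 / 2]; // 32 quants packed two per byte = 16 bytes
} block_q4_0;
// 18 bytes per block, which does not divide a 64-byte cache line,
// so consecutive blocks' qs arrays straddle cache-line boundaries.
static_assert(sizeof(block_q4_0) == 2 + QK4_0 / 2, "unexpected q4_0 block size");
```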
With a continuous layout, each GPU core can always read a full cache line of quantized data from memory.
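A minimal sketch of what I mean by "continuous" (the names here are mine, not from the actual code): store all quants contiguously and keep the deltas in a separate contiguous array, so the quantized data stays cache-line aligned:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical struct-of-arrays layout for n_blocks Q4_0 blocks: with the
// 2-byte deltas split out, a work-group can stream full 64-byte cache lines
// of quants with no gaps in between.
struct q4_0_continuous {
    std::vector<uint8_t>  qs; // n_blocks * 16 bytes of quants, contiguous
    std::vector<uint16_t> d;  // n_blocks fp16 deltas, contiguous
};
```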
I tested both layouts and profiled llama-cli with Intel VTune. I wrote a new gemv kernel for the continuous weights and added a kernel that converts a block_q4_0 weight to the continuous layout before the gemv kernel runs (code). The native block_q4_0 kernel took ~56us for a 4096x4096 weight, while the continuous layout took ~33us, a ~70% speedup. With it, an Intel A770M GPU can run llama2-q4_0 at 18.5ms/token (excluding the latency of the conversion kernels).

My question is: is it possible to convert a block_q4_0 weight to a continuous weight layout in llama.cpp? Or can it be done for the Intel SYCL backend only?
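For concreteness, here is a CPU-side sketch of what the conversion amounts to, reusing the block_q4_0 and q4_0_continuous definitions above (my illustration only; the actual conversion runs as a SYCL kernel on the device):

```cpp
#include <cstddef>

// Scatter an array of interleaved block_q4_0 into the split layout.
// This sequential version only shows the data movement; the real kernel
// would do the same scatter in parallel on the GPU.
static q4_0_continuous convert_q4_0(const block_q4_0 * src, size_t n_blocks) {
    q4_0_continuous dst;
    dst.qs.resize(n_blocks * (QK4_0 / 2));
    dst.d.resize(n_blocks);
    for (size_t i = 0; i < n_blocks; ++i) {
        dst.d[i] = src[i].d;
        for (size_t j = 0; j < QK4_0 / 2; ++j) {
            dst.qs[i * (QK4_0 / 2) + j] = src[i].qs[j];
        }
    }
    return dst;
}
```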
Reply:

It would be possible to write a ggml-backend buffer type that uses a different layout for some types. To do so, you would need to modify the …
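A rough standalone sketch of that idea, with hypothetical names (the real hooks are the buffer-type and buffer callbacks in ggml-backend-impl.h, e.g. the set_tensor callback that uploads tensor data to the device):

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins for the real ggml types, for illustration only.
enum my_type { MY_TYPE_F32, MY_TYPE_F16, MY_TYPE_Q4_0 };
struct my_tensor { my_type type; size_t nbytes; void * device_ptr; };

// Sketch of a set_tensor-style upload hook for a custom buffer type:
// Q4_0 weights get repacked into the continuous layout on their way to the
// device, while every other type keeps its native layout. The rest of
// llama.cpp would still see an ordinary Q4_0 tensor.
static void my_buffer_set_tensor(my_tensor * t, const void * host_data, size_t size) {
    if (t->type == MY_TYPE_Q4_0) {
        // e.g. run convert_q4_0() from the sketch above (or the SYCL
        // conversion kernel) while copying into t->device_ptr.
    } else {
        std::memcpy(t->device_ptr, host_data, size);
    }
}
```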