About a cache-friendly data layout #8432
luoyu-intel started this conversation in General
Hi @ggerganov @slaren!

I'm trying to improve next-token latency on Intel GPUs. I think the current data layout is not cache-friendly, which makes it very difficult for the SYCL kernels to fully use the memory bandwidth. For example, in Q4_0 the 2-byte delta value breaks the cache alignment of the quantized data.
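For reference, the Q4_0 block looks roughly like this (a standalone rendering of the definition in ggml's headers, with `uint16_t` standing in for `ggml_half`):

```cpp
#include <cstdint>

#define QK4_0 32
// One Q4_0 block: a 2-byte fp16 delta followed by 32 4-bit quants.
typedef struct {
    uint16_t d;             // fp16 delta -- the 2 bytes that shift everything after it
    uint8_t  qs[QK4_0 / 2]; // 32 quants packed two per byte = 16 bytes
} block_q4_0;
// 18 bytes per block, which does not divide a 64-byte cache line,
// so consecutive blocks' qs arrays straddle cache-line boundaries.
static_assert(sizeof(block_q4_0) == 2 + QK4_0 / 2, "unexpected q4_0 block size");
```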
With a continuous layout, each GPU core can always read a full cache line of quantized data from memory.
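A minimal sketch of what I mean by "continuous" (the names here are mine, not from the actual code): store all quants contiguously and keep the deltas in a separate contiguous array, so the quantized data stays cache-line aligned:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical struct-of-arrays layout for n_blocks Q4_0 blocks: with the
// 2-byte deltas split out, a work-group can stream full 64-byte cache lines
// of quants with no gaps in between.
struct q4_0_continuous {
    std::vector<uint8_t>  qs; // n_blocks * 16 bytes of quants, contiguous
    std::vector<uint16_t> d;  // n_blocks fp16 deltas, contiguous
};
```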
I tested both layouts and profiled llama-cli with Intel VTune. I wrote a new gemv kernel for the continuous weights and added a kernel that converts a block_q4_0 weight to the continuous layout before the gemv kernel runs (code). The native block_q4_0 kernel took ~56us for a 4096x4096 weight, while the continuous layout took ~33us, a ~70% speedup. With it, an Intel A770M GPU can run llama2-q4_0 at 18.5ms/token (excluding the latency of the conversion kernels).

My question is: is it possible to convert a block_q4_0 weight to a continuous weight layout in llama.cpp? Or can it be done for the Intel SYCL backend only?
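For concreteness, here is a CPU-side sketch of what the conversion amounts to, reusing the block_q4_0 and q4_0_continuous definitions above (my illustration only; the actual conversion runs as a SYCL kernel on the device):

```cpp
#include <cstddef>

// Scatter an array of interleaved block_q4_0 into the split layout.
// This sequential version only shows the data movement; the real kernel
// would do the same scatter in parallel on the GPU.
static q4_0_continuous convert_q4_0(const block_q4_0 * src, size_t n_blocks) {
    q4_0_continuous dst;
    dst.qs.resize(n_blocks * (QK4_0 / 2));
    dst.d.resize(n_blocks);
    for (size_t i = 0; i < n_blocks; ++i) {
        dst.d[i] = src[i].d;
        for (size_t j = 0; j < QK4_0 / 2; ++j) {
            dst.qs[i * (QK4_0 / 2) + j] = src[i].qs[j];
        }
    }
    return dst;
}
```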
Reply:

It would be possible to write a ggml-backend buffer type that uses a different layout for some types. To do so, you would need to modify the …
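A rough standalone sketch of that idea, with hypothetical names (the real hooks are the buffer-type and buffer callbacks in ggml-backend-impl.h, e.g. the set_tensor callback that uploads tensor data to the device):

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins for the real ggml types, for illustration only.
enum my_type { MY_TYPE_F32, MY_TYPE_F16, MY_TYPE_Q4_0 };
struct my_tensor { my_type type; size_t nbytes; void * device_ptr; };

// Sketch of a set_tensor-style upload hook for a custom buffer type:
// Q4_0 weights get repacked into the continuous layout on their way to the
// device, while every other type keeps its native layout. The rest of
// llama.cpp would still see an ordinary Q4_0 tensor.
static void my_buffer_set_tensor(my_tensor * t, const void * host_data, size_t size) {
    if (t->type == MY_TYPE_Q4_0) {
        // e.g. run convert_q4_0() from the sketch above (or the SYCL
        // conversion kernel) while copying into t->device_ptr.
    } else {
        std::memcpy(t->device_ptr, host_data, size);
    }
}
```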