The original definition of `MultiHeadAttention` can be found here.
The original definition of `KeyValueCache` can be found here.

`MultiHeadCacheAttention` simply fuses `MultiHeadAttention` and `KeyValueCache` together.
Allows the model to jointly attend to information from different representation subspaces as described in the paper: Attention Is All You Need.
Multi-Head Attention takes three inputs: `query`, `key` and `value`.

In `MultiHeadCacheAttention`, `key = cat(past_key, current_key)` and `value = cat(past_value, current_value)`. In this operator, however, only the current key and value are passed in as inputs; the past key and value are read from `cache`.
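A minimal sketch of the fused semantics in PyTorch (an illustration, not the operator's implementation), assuming `(batch, seqlen, num_heads, head_dim)` tensors; `past_key`/`past_value` stand for the entries already held in `cache`:

```python
import torch
import torch.nn.functional as F

def mhca_reference(query, current_key, current_value, past_key, past_value):
    # the attention sees the past states followed by the current ones
    key = torch.cat([past_key, current_key], dim=1)      # (batch, kvlen, num_heads, head_dim)
    value = torch.cat([past_value, current_value], dim=1)

    q = query.transpose(1, 2)                            # (batch, num_heads, seqlen_q, head_dim)
    k = key.transpose(1, 2)
    v = value.transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)        # ordinary multi-head attention
    return out.transpose(1, 2)                           # (batch, seqlen_q, num_heads, head_dim)
```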
Number of heads
Dimension of each head, where $num\_heads * head\_dim$ equals the model's hidden dimension.
Whether to apply a causal mask when the sequence length is greater than 1.
Whether to apply the ALiBi mask within the operator. There is no need to set the ALiBi mask in `attn_mask` when this is `True`.
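As a sketch of the `is_causal` behaviour (when `is_alibi` is `True` the ALiBi bias is added by the operator itself, so it is omitted here), a causal mask is only needed when the query covers more than one position:

```python
import torch

def causal_mask(seqlen_q: int, kvlen: int, dtype=torch.float32):
    # single-token decoding attends to everything already cached, so no mask is needed
    if seqlen_q <= 1:
        return None
    # queries are assumed to be aligned to the end of the key/value sequence
    offset = kvlen - seqlen_q
    banned = torch.ones(seqlen_q, kvlen, dtype=torch.bool).triu(offset + 1)
    return torch.zeros(seqlen_q, kvlen, dtype=dtype).masked_fill(banned, float("-inf"))
```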
For Grouped-Query Attention. If `num_kv_heads` and `num_heads` are not equal, the key and value are repeated `num_heads/num_kv_heads` times before attention is applied. `num_heads` must be divisible by `num_kv_heads`. The default is 0, in which case `num_heads` is used as `num_kv_heads`.
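A sketch of the key/value repetition described above, assuming a `(batch, seqlen, num_kv_heads, head_dim)` layout (this is the usual Grouped-Query Attention expansion, not necessarily how the operator implements it internally):

```python
import torch

def repeat_kv(x: torch.Tensor, num_heads: int, num_kv_heads: int) -> torch.Tensor:
    # x: (batch, seqlen, num_kv_heads, head_dim)
    assert num_heads % num_kv_heads == 0, "num_heads must be divisible by num_kv_heads"
    n_rep = num_heads // num_kv_heads
    if n_rep == 1:
        return x
    b, s, h_kv, d = x.shape
    # repeat each kv head n_rep times so key/value match the query head count
    return x[:, :, :, None, :].expand(b, s, h_kv, n_rep, d).reshape(b, s, h_kv * n_rep, d)
```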
Number of attention layers.
Attention layer index for cache and scale.
Quantization bit width for cache compression. For example, 8 means int8 compression; 0 means compression is disabled.
Number of consecutive cache elements that share one quantization scale.
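A sketch of what `quant_bit = 8` with one scale per `quant_group` elements could look like (illustrative only; the operator's exact rounding and storage scheme is not specified here):

```python
import torch

def quantize_group_int8(x: torch.Tensor, quant_group: int = 8):
    # x: (..., head_dim); every `quant_group` consecutive elements share one scale
    *lead, dim = x.shape
    groups = x.reshape(*lead, dim // quant_group, quant_group)
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 127.0  # symmetric int8 range
    q = (groups / scale).round().clamp(-127, 127).to(torch.int8)
    return q.reshape(*lead, dim), scale.squeeze(-1)   # int8 data, (..., head_dim/quant_group) scales
```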
Defines the data layout of `cache` and `scale`. The default is zero.
Meaning of numbers:

- `0`: $cache(MaxB, L, 2, MaxS, H, Dh)$ and $scale(MaxB, L, 2, MaxS, H, Dh/quant\_group)$
- `1`: $cache(L, MaxB, 2, H, MaxS, Dh)$ and $scale(L, MaxB, 2, H, MaxS, Dh/quant\_group)$
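A sketch of how the two layouts could be indexed to reach one layer's key cache (a hypothetical helper; it assumes slot `0` of the size-2 dimension holds the key cache and slot `1` the value cache):

```python
import torch

def key_cache_view(cache: torch.Tensor, layer_idx: int, cache_layout: int) -> torch.Tensor:
    if cache_layout == 0:
        # cache(MaxB, L, 2, MaxS, H, Dh): pick the layer, then the assumed key slot 0
        return cache[:, layer_idx, 0]      # (MaxB, MaxS, H, Dh)
    if cache_layout == 1:
        # cache(L, MaxB, 2, H, MaxS, Dh)
        return cache[layer_idx, :, 0]      # (MaxB, H, MaxS, Dh)
    raise ValueError(f"unsupported cache_layout: {cache_layout}")
```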
Input Query tensor
Shape:
Input Key tensor
Shape:
Input Value tensor
Shape:
Sequence position at which `current_key` and `current_value` begin to be stored.
Shape: Determined by `cache_layout`.
Contains the key and value caches of the attention layers. When `cache_layout` is `0`, a layer's cache is the subspace at index `layer_idx` along the layer dimension `L`, with the key and value caches stored separately along the dimension of size 2.
Shape: Determined by `cache_layout`.
Contains the key and value cache quantization scales of the attention layers. When `cache_layout` is `0`, the scales follow the same subspace layout as `cache`, with the last dimension divided by `quant_group`. Required when `quant_bit` is not zero. Data in this tensor will be modified.
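Tying `start_pos`, `layer_idx`, and `cache_layout = 0` together, a sketch of how an unquantized cache could be updated (again assuming key in slot `0` and value in slot `1`, and `(batch, seqlen, num_kv_heads, head_dim)` inputs):

```python
import torch

def store_kv(cache, current_key, current_value, layer_idx, start_pos):
    # cache: (MaxB, L, 2, MaxS, H, Dh); modified in place
    batch, seqlen = current_key.shape[0], current_key.shape[1]
    end_pos = start_pos + seqlen
    cache[:batch, layer_idx, 0, start_pos:end_pos] = current_key    # assumed key slot
    cache[:batch, layer_idx, 1, start_pos:end_pos] = current_value  # assumed value slot
    # the attention then reads key/value as cache[:batch, layer_idx, 0 or 1, :end_pos]
    return end_pos
```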
Optional custom mask. If a lower-rank shape is given, `attn_mask` will be broadcast.

Note: the last dimension of the mask may be larger than the key/value sequence length.
Shape:
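A sketch of the broadcasting note above, assuming additive masks and attention scores of shape `(batch, num_heads, seqlen_q, kvlen)`; whether the first or last `kvlen` columns of an oversized mask are used is not specified here, so this sketch simply takes the first `kvlen`:

```python
import torch

def apply_mask(scores: torch.Tensor, attn_mask: torch.Tensor) -> torch.Tensor:
    # scores: (batch, num_heads, seqlen_q, kvlen)
    # a (seqlen_q, kvlen) mask broadcasts over batch and heads automatically
    kvlen = scores.shape[-1]
    mask = attn_mask[..., :kvlen]      # tolerate a mask whose last dim exceeds kvlen
    return scores + mask               # additive mask: 0 keeps a position, -inf removes it
```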
Output feature of attention result
Shape: