
llama: Add support for RWKV v7 architecture #11452

Open: wants to merge 20 commits into master
Conversation

Collaborator

@MollySophia MollySophia commented Jan 27, 2025

@BlinkDL's explanation of RWKV v7:
RWKV-7 as a meta-in-context learner
There are also plenty of test results for trained models (currently 0.1B and 0.4B) posted on his X account. Larger models are coming in the next several days.

Currently available RWKV v7 model repos in HF format:
https://huggingface.co/SmerkyG/RWKV7-Goose-0.1B-World2.8-HF (not an officially published one; tensor names are expected to change in the future)
https://huggingface.co/mollysama/rwkv-7-world-0b4-hf
https://huggingface.co/mollysama/rwkv-7-world-1b5-hf
https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1 (a distilled model with RWKV v7 "attn" and Qwen2.5 7B's MLP, distilled from Qwen2.5; calling these "hybrid" models isn't really accurate, since they don't actually have transformer attention)

Distilled DeepSeek-R1 models:
https://huggingface.co/RWKV-Red-Team/ARWKV-R1-7B
https://huggingface.co/RWKV-Red-Team/ARWKV-R1-1B5

This PR contains:

  • GGML_OP_L2_NORM, which applies PyTorch-style L2 normalization along the rows. Tested with the CPU, CUDA, SYCL, Vulkan, and Metal backends.
  • GGML_OP_RWKV_WKV7, the core of the RWKV v7 architecture. The naive recurrent wkv7 kernel is implemented for CPU, CUDA, SYCL, Vulkan, and Metal.
  • Support for inference of RWKV7 and ARWKV7 models.
  • A simple Metal kernel for the old WKV6.
  • Skipping unused tokens in the last layer's FFN computation for RWKV models (8000 t/s -> 8100 t/s prefill for the 7B v7 model).
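For reference, the row-wise L2 normalization that GGML_OP_L2_NORM provides can be sketched in NumPy. This mirrors PyTorch's `F.normalize(x, p=2, dim=-1)`; the exact epsilon handling below is an assumption for the sketch, not necessarily what the ggml kernel does:

```python
import numpy as np

def l2_norm_rows(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Normalize each row of x to unit L2 norm, F.normalize-style.

    eps guards against division by zero for all-zero rows
    (assumed default, matching PyTorch's eps=1e-12).
    """
    norms = np.linalg.norm(x, ord=2, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

y = l2_norm_rows(np.array([[3.0, 4.0], [0.0, 0.0]]))
# first row becomes [0.6, 0.8]; the zero row stays zero thanks to eps
```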

TODO:
- [ ] (within this PR or in the future) Implement chunkwise wkv7 (and possibly wkv6 as well), following flash-linear-attention's implementation.
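For intuition, here is a naive per-head reading of the wkv7 recurrence, written as a generalized delta-rule update. This is a sketch of one plausible formulation, not the ggml kernel's actual memory layout; the roles assigned to `a` and `b` and the state orientation (value dim × key dim) are my assumptions:

```python
import numpy as np

def wkv7_naive(r, w, k, v, a, b):
    """Naive recurrent wkv7 over T tokens for a single head.

    All inputs are (T, head_size). Assumed per-token update:
      S <- S * diag(w_t) + (S @ a_t) b_t^T + v_t k_t^T
      y_t = S @ r_t
    where S is the (head_size x head_size) recurrent state.
    """
    T, n = r.shape
    S = np.zeros((n, n))
    out = np.zeros((T, n))
    for t in range(T):
        # per-channel decay, in-context "delta" correction, and kv write
        S = S * w[t][None, :] + np.outer(S @ a[t], b[t]) + np.outer(v[t], k[t])
        out[t] = S @ r[t]
    return out
```

A chunkwise kernel would process blocks of tokens with matrix products instead of this token-by-token loop, which is where the flash-linear-attention-style speedup comes from.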

Note: Current benchmark of ARWKV7-7B F16:

```
# molly @ molly-workstation in ~/llama.cpp on git:rwkv-v7 x [9:49:42] 
$ ./build-test/bin/llama-bench -m ../ARWKV-7B-Preview-0_1-NoG/ARWKV-7B-Preview-0_1-NoG-F16.gguf -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| arwkv7 7B F16                  |  15.42 GiB |     8.27 B | CUDA       |  99 |         pp512 |      8105.20 ± 15.34 |
| arwkv7 7B F16                  |  15.42 GiB |     8.27 B | CUDA       |  99 |         tg128 |         50.62 ± 0.01 |

build: 76219859 (4579)
```

This is much faster than RWKV v6 7B at prefill (though still a bit slower than Qwen2.5 7B).
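Part of the prefill gain comes from skipping unused tokens in the last layer's FFN: during prompt processing only the last token's logits are needed, so the final computations can be restricted to that row. A hedged sketch of the idea (the real llama.cpp code selects rows via ggml tensor views, not NumPy slicing; `last_layer_ffn` and its arguments are illustrative names):

```python
import numpy as np

def last_layer_ffn(hidden, w_ffn, last_token_only=True):
    """Illustrative: restrict the final matmul to the last token's row.

    hidden: (T, d) hidden states; w_ffn: (d, d_out) weight.
    During prefill only the last token's output is consumed, so
    computing one row instead of T rows saves almost all the work.
    """
    if last_token_only:
        hidden = hidden[-1:]          # (1, d) instead of (T, d)
    return hidden @ w_ffn

rng = np.random.default_rng(0)
h = rng.standard_normal((16, 4))
w = rng.standard_normal((4, 4))
y = last_layer_ffn(h, w)              # matches the full result's last row
```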

@MollySophia MollySophia marked this pull request as ready for review January 27, 2025 13:33
@github-actions github-actions bot added labels: testing, Nvidia GPU, Vulkan, python, ggml, SYCL, Apple Metal (Jan 27, 2025)
@MollySophia MollySophia marked this pull request as draft January 27, 2025 14:09
@MollySophia MollySophia marked this pull request as ready for review January 28, 2025 09:10
@MollySophia
Collaborator Author

Update: added support for fla-hub's rwkv7 hf model format. (https://huggingface.co/fla-hub/rwkv7-1.5B-world)

@ggerganov
Owner

Just a heads up, this will likely take some time to merge - I want to finish #11213 first and then figure out how to fit RWKV into the new code, likely with its own implementation of llama_context.

@MollySophia
Collaborator Author

> Just a heads up, this will likely take some time to merge - I want to finish #11213 first and then figure out how to fit RWKV into the new code, likely with its own implementation of llama_context.

That’s great! I can help with that too

@ggerganov
Owner

Great, keep an eye on the #11213 PR. It's still very messy, but I hope it will soon start to make sense.

@MollySophia
Collaborator Author

> Great, keep an eye on the #11213 PR. It's still very messy, but I hope it will soon start to make sense.

I think maybe we can get this PR done first? I'll help with #11213 for the future changes, too.

@MollySophia MollySophia force-pushed the rwkv-v7 branch 2 times, most recently from 97c31bb to e6ee7e9 Compare February 10, 2025 05:01
MollySophia and others added 10 commits February 10, 2025 13:02
Signed-off-by: Molly Sophia <[email protected]>
Signed-off-by: Molly Sophia <[email protected]>
Signed-off-by: Molly Sophia <[email protected]>
Signed-off-by: Molly Sophia <[email protected]>
Signed-off-by: Molly Sophia <[email protected]>
There isn't much performance gain, though. Just for more op coverage.

Signed-off-by: Molly Sophia <[email protected]>
Signed-off-by: Molly Sophia <[email protected]>
Signed-off-by: Molly Sophia <[email protected]>