
llama: Add support for RWKV v7 architecture #11452

Open: wants to merge 20 commits into master
Conversation

Collaborator

@MollySophia MollySophia commented Jan 27, 2025

@BlinkDL's explanation of RWKV v7:
RWKV-7 as a meta-in-context learner
There are also plenty of test results for trained models (currently 0.1B and 0.4B) posted on his X account. Larger models are coming in the next several days.

Currently available RWKV v7 model repos in HF format:
https://huggingface.co/SmerkyG/RWKV7-Goose-0.1B-World2.8-HF (not an officially published one; tensor names are expected to change in the future)
https://huggingface.co/mollysama/rwkv-7-world-0b4-hf
https://huggingface.co/mollysama/rwkv-7-world-1b5-hf
https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1 (a distilled model with RWKV v7 "attn" and Qwen2.5 7B's MLP, distilled from Qwen2.5; calling these "hybrid" models isn't really accurate, since they don't actually have transformer attention)

Distilled DeepSeek-R1 models:
https://huggingface.co/RWKV-Red-Team/ARWKV-R1-7B
https://huggingface.co/RWKV-Red-Team/ARWKV-R1-1B5

This PR contains:

  • GGML_OP_L2_NORM, which applies PyTorch-style L2 normalization along the rows. Tested with the CPU, CUDA, SYCL, Vulkan, and Metal backends.
  • GGML_OP_RWKV_WKV7, the core of the RWKV v7 architecture. The naive recurrent wkv7 kernel is implemented for CPU, CUDA, SYCL, Vulkan, and Metal.
  • Support for inference of RWKV7 and ARWKV7 models.
  • A simple Metal kernel for the old WKV6.
  • Skipping unused tokens in the last layer's FFN computation for RWKV models (8000 t/s -> 8100 t/s prefill for the 7B v7 model).
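For reference, the row-wise L2 normalization that GGML_OP_L2_NORM provides can be sketched in NumPy. This mirrors PyTorch's `F.normalize(x, p=2, dim=-1)`; the exact epsilon handling below is an assumption for the sketch, not necessarily what the ggml kernel does:

```python
import numpy as np

def l2_norm_rows(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Normalize each row of x to unit L2 norm, F.normalize-style.

    eps guards against division by zero for all-zero rows
    (assumed default, matching PyTorch's eps=1e-12).
    """
    norms = np.linalg.norm(x, ord=2, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

y = l2_norm_rows(np.array([[3.0, 4.0], [0.0, 0.0]]))
# first row becomes [0.6, 0.8]; the zero row stays zero thanks to eps
```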

TODO:
- [ ] (within this PR or in the future) Implement chunkwise wkv7 (and possibly wkv6 as well), following flash-linear-attention's implementation.
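For intuition, here is a naive per-head reading of the wkv7 recurrence, written as a generalized delta-rule update. This is a sketch of one plausible formulation, not the ggml kernel's actual memory layout; the roles assigned to `a` and `b` and the state orientation (value dim × key dim) are my assumptions:

```python
import numpy as np

def wkv7_naive(r, w, k, v, a, b):
    """Naive recurrent wkv7 over T tokens for a single head.

    All inputs are (T, head_size). Assumed per-token update:
      S <- S * diag(w_t) + (S @ a_t) b_t^T + v_t k_t^T
      y_t = S @ r_t
    where S is the (head_size x head_size) recurrent state.
    """
    T, n = r.shape
    S = np.zeros((n, n))
    out = np.zeros((T, n))
    for t in range(T):
        # per-channel decay, in-context "delta" correction, and kv write
        S = S * w[t][None, :] + np.outer(S @ a[t], b[t]) + np.outer(v[t], k[t])
        out[t] = S @ r[t]
    return out
```

A chunkwise kernel would process blocks of tokens with matrix products instead of this token-by-token loop, which is where the flash-linear-attention-style speedup comes from.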

Note: Current benchmark of ARWKV7-7B F16:

```
# molly @ molly-workstation in ~/llama.cpp on git:rwkv-v7 x [9:49:42] 
$ ./build-test/bin/llama-bench -m ../ARWKV-7B-Preview-0_1-NoG/ARWKV-7B-Preview-0_1-NoG-F16.gguf -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| arwkv7 7B F16                  |  15.42 GiB |     8.27 B | CUDA       |  99 |         pp512 |      8105.20 ± 15.34 |
| arwkv7 7B F16                  |  15.42 GiB |     8.27 B | CUDA       |  99 |         tg128 |         50.62 ± 0.01 |

build: 76219859 (4579)
```

This is much faster than RWKV v6 7B at prefill (though still a bit slower than Qwen2.5 7B).
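Part of the prefill gain comes from skipping unused tokens in the last layer's FFN: during prompt processing only the last token's logits are needed, so the final computations can be restricted to that row. A hedged sketch of the idea (the real llama.cpp code selects rows via ggml tensor views, not NumPy slicing; `last_layer_ffn` and its arguments are illustrative names):

```python
import numpy as np

def last_layer_ffn(hidden, w_ffn, last_token_only=True):
    """Illustrative: restrict the final matmul to the last token's row.

    hidden: (T, d) hidden states; w_ffn: (d, d_out) weight.
    During prefill only the last token's output is consumed, so
    computing one row instead of T rows saves almost all the work.
    """
    if last_token_only:
        hidden = hidden[-1:]          # (1, d) instead of (T, d)
    return hidden @ w_ffn

rng = np.random.default_rng(0)
h = rng.standard_normal((16, 4))
w = rng.standard_normal((4, 4))
y = last_layer_ffn(h, w)              # matches the full result's last row
```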

@MollySophia MollySophia marked this pull request as ready for review January 27, 2025 13:33
@github-actions github-actions bot added labels: testing, Nvidia GPU, Vulkan, python, ggml, SYCL, Apple Metal (Jan 27, 2025)
@MollySophia MollySophia marked this pull request as draft January 27, 2025 14:09
@MollySophia MollySophia marked this pull request as ready for review January 28, 2025 09:10
@MollySophia
Collaborator Author

Update: added support for fla-hub's rwkv7 hf model format. (https://huggingface.co/fla-hub/rwkv7-1.5B-world)

@ggerganov
Owner

Just a heads up, this will likely take some time to merge - I want to finish #11213 first and then figure out how to fit RWKV into the new code, likely with its own implementation of llama_context.

@MollySophia
Collaborator Author

> Just a heads up, this will likely take some time to merge - I want to finish #11213 first and then figure out how to fit RWKV into the new code, likely with its own implementation of llama_context.

That’s great! I can help with that too

@ggerganov
Owner

Great, keep an eye on the #11213 PR. It's still very messy, but I hope it will soon start to make sense.

@MollySophia
Collaborator Author

> Great, keep an eye on the #11213 PR. It's still very messy, but I hope it will soon start to make sense.

I think maybe we can get this PR done first? I'll help with #11213 for the future changes, too.

@MollySophia MollySophia force-pushed the rwkv-v7 branch 2 times, most recently from 97c31bb to e6ee7e9 Compare February 10, 2025 05:01
MollySophia and others added 10 commits February 10, 2025 13:02
Signed-off-by: Molly Sophia <[email protected]>
Signed-off-by: Molly Sophia <[email protected]>
Signed-off-by: Molly Sophia <[email protected]>
Signed-off-by: Molly Sophia <[email protected]>
Signed-off-by: Molly Sophia <[email protected]>
There isn't much performance gain, though. Just for more op coverage.

Signed-off-by: Molly Sophia <[email protected]>
Signed-off-by: Molly Sophia <[email protected]>
Signed-off-by: Molly Sophia <[email protected]>