K230-Llama3 #3

Open · wants to merge 1 commit into base: main
71 changes: 71 additions & 0 deletions README.md
@@ -1,3 +1,74 @@
# rvspoc-S2422-Llama3
## Optimized Llama 3 on the K230
The Llama 3 model is large relative to the K230's memory, so we run it from the SD card with swap enabled. With this setup, Llama 3 (Q2, Q4, and Q8 quantization) runs successfully on the CanMV-K230.

## Environment Setup

### Flash the image
Use the image provided by the Canaan developer community. Download link: https://developer.canaan-creative.com/resource, image version: canmv_debian_sdcard_sdk_1.3.img.gz
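A typical way to write the image from a Linux host is sketched below; this is not from the original steps, and `/dev/sdX` is a placeholder for your card-reader device (check it with `lsblk` first, since `dd` overwrites the target).

```bash
# Decompress the image on the fly and write it to the SD card (destructive!).
gunzip -c canmv_debian_sdcard_sdk_1.3.img.gz | dd of=/dev/sdX bs=4M conv=fsync status=progress
sync   # make sure all buffers are flushed before removing the card
```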

### Install software
To simplify file transfer and repartitioning, install ssh and parted with apt. Note that the system time must be set before running apt update (an optional check that the ssh daemon is running follows the list).
Reference commands:
1) Set the time (replace with the current UTC time):
date --set="2024-08-03 03:15:20"
2) Update the package index:
apt update
3) Install ssh:
apt install openssh-server -y
4) Install parted:
apt install parted -y
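As an optional check, you can confirm the ssh daemon is running and enabled at boot. This is a sketch assuming the stock Debian service name `ssh` used by the openssh-server package.

```bash
# Enable sshd now and on every boot, then show its status.
systemctl enable --now ssh
systemctl status ssh --no-pager
```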

### Repartition the SD card

Reference commands (a quick verification is sketched after the list):
1) Check the card layout and available space:
parted -l /dev/mmcblk1
2) Resize partition 3 according to the SD-card size reported by the previous command, then grow the filesystem:
parted /dev/mmcblk1 resizepart 3 62.5G
resize2fs /dev/mmcblk1p3
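A quick sanity check after resizing (this assumes, as in the commands above, that the root filesystem lives on /dev/mmcblk1p3):

```bash
lsblk /dev/mmcblk1   # partition 3 should now span the remaining space on the card
df -h /              # the root filesystem should report the enlarged size
```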
### Set up swap
Create a 2 GB swap file (a verification step and an optional fstab entry are sketched below):
1) dd if=/dev/zero of=/mnt/swap bs=256M count=8
2) mkswap /mnt/swap
3) swapon /mnt/swap
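To confirm the swap is active, and optionally make it survive reboots, the following can be used. The fstab entry is our addition, not part of the original steps.

```bash
free -h         # should show roughly 2.0Gi of swap
swapon --show   # lists active swap areas, including /mnt/swap
# Optional: activate the swap file automatically on every boot.
echo '/mnt/swap none swap sw 0 0' >> /etc/fstab
```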
## Running the Programs

We tried two Llama 3 implementations: llama3.c with Q8 quantization, and llama.cpp with Q2 and Q4 quantization. Details below.

1) Based on https://github.com/jameswdelancey/llama3.c; usage is the same as upstream. A Q8-quantized weight file must be generated first.

### Submitted files (llama3.c)
Three files are needed to run (a reference invocation is sketched below):
- llama3_8b_instruct_q80.bin
  Download: https://huggingface.co/Sophia957/llama3_8b_instruct_q80/resolve/main/llama3_8b_instruct_q80.bin
- runq3-k230 (uploaded in this PR)
- tokenizer.bin (uploaded in this PR)
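A reference invocation, as a sketch only: we assume the fork keeps the llama2.c-style command line (`-z` tokenizer path, `-n` tokens to generate, `-i` prompt); check the binary's usage output if it differs.

```bash
# Run the Q8 model with the bundled tokenizer and a short prompt.
./runq3-k230 llama3_8b_instruct_q80.bin -z tokenizer.bin -n 32 -i "Once upon a time"
```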

Other notes:
a. During porting we referred to the previous round's Llama 2 task and compared Llama 2 on the K230 with and without RVV; the speed-up was clear. Adding RVV to llama3.c, however, gave no noticeable speed-up and produced slightly wrong output, so we looked for other ways to accelerate.
b. Llama 3 spends much of its time in matmul, so we tried the Matrix extension for acceleration. Using the toolchain Xuantie-900-gcc-linux-6.6.0-glibc-x86_64-V2.10.1 we built the Matrix demo successfully, but running it on the K230 failed with an unsupported-instruction error; the C908 documentation we later checked contains no mention of Matrix support (a quick way to check which ISA extensions the kernel reports is sketched below).
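A simple check of what the kernel advertises for the C908 cores; a matrix extension would have to appear in this string for the Matrix demo to be usable.

```bash
# Print the ISA string reported for the first core, e.g. "isa : rv64imafdcv...".
grep -m1 isa /proc/cpuinfo
```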


2) Based on https://github.com/ggerganov/llama.cpp, we tried Q2_K and Q4_0.

### Submitted files (llama.cpp)
Files needed to run:
- llama-cli-tune (uploaded in this PR)
- Model file: Meta-Llama-3-8B.Q2_K.gguf or Meta-Llama-3-8B.Q4_0.gguf
  Download links:
  https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF/resolve/main/Meta-Llama-3-8B.Q2_K.gguf
  https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF/resolve/main/Meta-Llama-3-8B.Q4_0.gguf

Reference run commands (-m model path, -p prompt, -n number of tokens to generate, -s random seed):
./llama-cli-tune -m ./Meta-Llama-3-8B.Q2_K.gguf -p "Once upon a time, " -n 10 -s 123
./llama-cli-tune -m ./Meta-Llama-3-8B.Q4_0.gguf -p "Once upon a time, " -n 10 -s 123

Other notes:
a. llama.cpp contains RVV intrinsics, but with the Xuantie-900-gcc-linux-6.6.0-glibc toolchain we are currently using, the RVV intrinsic paths did not get compiled in (a quick way to confirm this from the binary is sketched below).
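A sanity check of our own (not part of llama.cpp): disassemble the produced binary and count vector instructions, using the binutils objdump shipped with the Xuantie toolchain.

```bash
# A count of 0 suggests no RVV instructions were emitted into the binary.
riscv64-unknown-linux-gnu-objdump -d llama-cli-tune | grep -cE 'vsetvli|vle|vse'
```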


If the network fails to come up after rebooting the device, the following commands can be used (a sketch for making the interface rename persistent follows):
networkctl
ip link set enu1 name eth0
systemctl restart networking.service
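To avoid renaming the interface by hand after every reboot, a standard systemd .link file can do the rename at boot. This is a sketch; the MAC address below is a placeholder and must be replaced with the one shown by `ip link show enu1` on your board.

```bash
# udev reads *.link files at boot and renames the matching interface to eth0.
cat > /etc/systemd/network/10-eth0.link <<'EOF'
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=eth0
EOF
```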


Submission repo for S2422. ref: rvspoc.org
124 changes: 124 additions & 0 deletions cpp-Q2-log.txt
@@ -0,0 +1,124 @@
root@v:/home/s# ./llama-cli-tune -m ./Meta-Llama-3-8B.Q2_K.gguf -p "Once upon a time" -n 10 -s 123
Log start
main: build = 3507 (4b77ea95)
main: built with riscv64-unknown-linux-gnu-gcc (Xuantie-900 linux-6.6.0 glibc gcc Toolchain V2.10.1 B-20240712) 10.4.0 for riscv64-unknown-linux-gnu
main: seed = 123
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from ./Meta-Llama-3-8B.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = .
llama_model_loader: - kv 2: llama.vocab_size u32 = 128256
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.block_count u32 = 32
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 10
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q2_K: 129 tensors
llama_model_loader: - type q3_K: 64 tensors
llama_model_loader: - type q4_K: 32 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q2_K - Medium
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 2.95 GiB (3.16 BPW)
llm_load_print_meta: general.name = .
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 3024.38 MiB
...................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 560.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 1 / 1 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8192, n_batch = 2048, n_predict = 10, n_keep = 0


Once upon a time there was a big rock on the island of K
llama_print_timings: load time = 497741.76 ms
llama_print_timings: sample time = 11.44 ms / 10 runs ( 1.14 ms per token, 874.36 tokens per second)
llama_print_timings: prompt eval time = 237352.11 ms / 4 tokens (59338.03 ms per token, 0.02 tokens per second)
llama_print_timings: eval time = 1170584.69 ms / 9 runs (130064.97 ms per token, 0.01 tokens per second)
llama_print_timings: total time = 1408539.24 ms / 13 tokens
Log end
Binary file added llama-cli-tune
Binary file not shown.
Binary file added llama2-rvv.png
Binary file added llama3-q8.jpg
Binary file added matmul-Matrix-failed.jpg
Binary file added runq3-k230
Binary file not shown.
Binary file added tokenizer.bin
Binary file not shown.