
llama : fix defrag logic #11707

Open: ggerganov wants to merge 2 commits into master from gg/llama-fix-defrag

Conversation

ggerganov (Owner):
While working on #11213 I realized that we are currently running many unnecessary defrag graphs because of incorrect KV cache fragmentation logic: the cache padding triggers the fragmentation threshold for small contexts even when there is no fragmentation at all.
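The problem is visible in how the fragmentation estimate is computed. Below is a minimal sketch of the before/after check, with `used`, `n`, and `padding` standing in for the actual `kv_self` fields, and the specific constants chosen for illustration; this is a sketch of the idea, not the exact patch:

```cpp
#include <algorithm>
#include <cstdint>

// Sketch of the defrag trigger. `used` is the number of occupied KV cells,
// `n` is the current cache extent, which is rounded up to a multiple of
// `padding` (e.g. 32, or 256 with flash attention enabled).
static bool should_defrag(uint32_t used, uint32_t n, uint32_t padding, float thold) {
    // Old logic: the padding cells at the end of the cache count as "holes",
    // so a small context (say used = 40, n = 256) looks ~84% fragmented
    // even though it contains no holes at all:
    //
    //   const float fragmentation = n >= 128 ? 1.0f - float(used)/float(n) : 0.0f;

    // Fixed logic: count the padding towards the used cells and skip small
    // contexts entirely, so only real holes contribute to the estimate:
    const float fragmentation = n >= 2048
        ? std::max(0.0f, 1.0f - float(used + padding)/float(n))
        : 0.0f;

    return fragmentation > thold;
}
```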

```sh
./scripts/compare-commits.sh master gg/llama-fix-defrag \
    -m models/llama-3.1-8b-instruct/ggml-model-q4_0.gguf \
    -m models/llama-3.1-8b-instruct/ggml-model-q8_0.gguf \
    -m models/llama-3.1-8b-instruct/ggml-model-f16.gguf \
    -m models/qwen2.5-3b-coder/ggml-model-q4_0.gguf \
    -m models/qwen2.5-3b-coder/ggml-model-q8_0.gguf \
    -m models/qwen2.5-3b-coder/ggml-model-f16.gguf \
    -fa 1
```
| Model         | Test  | t/s master | t/s gg/llama-fix-defrag | Speedup |
| ------------- | ----- | ---------- | ----------------------- | ------- |
| llama 8B F16  | pp512 | 1458.51    | 1458.18                 | 1.00    |
| llama 8B F16  | tg128 | 38.82      | 39.19                   | 1.01    |
| llama 8B Q4_0 | pp512 | 1324.28    | 1323.85                 | 1.00    |
| llama 8B Q4_0 | tg128 | 99.55      | 101.37                  | 1.02    |
| llama 8B Q8_0 | pp512 | 1298.42    | 1298.34                 | 1.00    |
| llama 8B Q8_0 | tg128 | 66.23      | 66.99                   | 1.01    |
| qwen2 3B F16  | pp512 | 3226.49    | 3226.91                 | 1.00    |
| qwen2 3B F16  | tg128 | 71.26      | 72.44                   | 1.02    |
| qwen2 3B Q4_0 | pp512 | 2927.50    | 2925.14                 | 1.00    |
| qwen2 3B Q4_0 | tg128 | 138.02     | 142.55                  | 1.03    |
| qwen2 3B Q8_0 | pp512 | 2880.21    | 2878.93                 | 1.00    |
| qwen2 3B Q8_0 | tg128 | 108.89     | 112.35                  | 1.03    |

master has the following patch applied, so that the benchmarks exercise the defrag path on both branches:

```diff
diff --git a/examples/llama-bench/llama-bench.cpp b/examples/llama-bench/llama-bench.cpp
index 4ac19ca86..8e9f90f27 100644
--- a/examples/llama-bench/llama-bench.cpp
+++ b/examples/llama-bench/llama-bench.cpp
@@ -753,6 +753,7 @@ struct cmd_params_instance {
         cparams.offload_kqv = !no_kv_offload;
         cparams.flash_attn  = flash_attn;
         cparams.embeddings  = embeddings;
+        cparams.defrag_thold = 0.1f;

         return cparams;
     }
```
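For context, `defrag_thold` is the context parameter that enables the automatic defrag pass (a negative value, the default, disables it), which is why the patch above has to force it on for llama-bench. A minimal sketch of how an application would set it when creating a context, assuming the pre-deprecation llama.h API of this era and a placeholder model path:

```cpp
#include "llama.h"

int main() {
    // load a model (path is a placeholder)
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (!model) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    // defrag_thold < 0 disables automatic defrag (the default);
    // 0.1f means: defrag when the estimated KV cache fragmentation exceeds 10%
    cparams.defrag_thold = 0.1f;

    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... run decoding as usual; defrag is now triggered automatically ...

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```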
