llama-server cannot host DeepSeek-R1-Distill-Qwen-1.5B on CUDA #11673
I carefully followed this document to build the project. After the build, I ran llama-server with my command, tracked the process, and checked the GPU's VRAM usage. I don't know if I have misunderstood something, but judging from the VRAM usage, the model does not seem to be loaded onto the GPU when I run the server, and I got the same behavior when I ran the command again.
So my question is: is there any way I can host my model and run inference on CUDA?
You need to add the `-ngl` parameter to the command line. Try `-ngl 99`.
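For illustration, a minimal sketch of what the command could look like, assuming a CUDA-enabled build of llama-server and a GGUF file of the distilled model; the model filename and port below are placeholders, not taken from the original post:

```bash
# Offload up to 99 layers to the GPU; for a 1.5B model this covers every layer,
# so the whole model is placed in VRAM instead of running on the CPU.
./llama-server \
  -m ./DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf \
  -ngl 99 \
  --port 8080
```

If the binary was built with CUDA support, the GPU's VRAM usage (for example as reported by nvidia-smi) should rise once the model finishes loading.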