Feature Request: Free up VRAM when llama-server not in use #11703

99991 · 2025-02-06T08:25:44Z

Prerequisites

I am running the latest code. Mention the version if possible as well.
I carefully followed the README.md.
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Unload the model from VRAM after it has been idle for a configurable timeout (for example, --unload-timeout 300 for 300 seconds), and reload it into VRAM automatically when a new request arrives.

Motivation

Freeing VRAM while llama-server is idle lets other GPU workloads run on the same machine without stopping the server.

Possible Implementation

It is implemented in ollama: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately
Old issue closed by bot: Power save mode for server --unload-timeout 120 #4598
PR with shutdown after timeout, but not restarting: server: Add timeout to stop the server automatically when idling for too long. #10742
Workaround with proxy: https://github.com/mostlygeek/llama-swap
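
For illustration, here is a rough Python sketch of the requested behavior: unload after an idle timeout, reload on the next request. It is not llama.cpp code. The class name IdleUnloader and the load/unload callbacks are hypothetical placeholders for whatever actually allocates and frees the model, and loading under the lock is a simplification that serializes concurrent requests during a reload.

```python
import threading


class IdleUnloader:
    """Keep a model loaded while requests are in flight and unload it
    after `timeout` seconds with no activity (hypothetical helper)."""

    def __init__(self, load, unload, timeout):
        self._load = load        # assumed callback: loads and returns the model
        self._unload = unload    # assumed callback: frees the model's VRAM
        self._timeout = timeout
        self._lock = threading.Lock()
        self._model = None
        self._active = 0         # number of requests currently in flight
        self._generation = 0     # bumped on every state change to invalidate old timers
        self._timer = None

    def __enter__(self):
        with self._lock:
            self._generation += 1            # any pending idle timer is now stale
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
            if self._model is None:
                self._model = self._load()   # reload on demand
            self._active += 1
            return self._model

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self._active -= 1
            if self._active == 0:
                # Last request finished: start the idle countdown.
                self._generation += 1
                self._timer = threading.Timer(
                    self._timeout, self._expire, args=(self._generation,)
                )
                self._timer.daemon = True
                self._timer.start()
        return False

    def _expire(self, generation):
        with self._lock:
            # Ignore timers that were superseded by a newer request.
            if generation != self._generation or self._active > 0:
                return
            if self._model is not None:
                self._unload(self._model)
                self._model = None
            self._timer = None
```

A request handler would wrap each inference call in "with unloader as model: ...", so the first request after an idle period pays the reload cost and later ones do not.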


99991 added the enhancement New feature or request label Feb 6, 2025
