
ollama sometimes moves models off GPU, making them so slow they time out #887

Closed
jeffbl opened this issue Sep 26, 2024 · 3 comments
jeffbl commented Sep 26, 2024

When ollama is launched fresh and loads (for example) llava:7b so that content-categoriser can run, it is nice and quick. However, after some time, a request goes in and the model is running on CPU. This takes so long that the preprocessor times out, causing lots of problems and a very slow overall response (since the preprocessor has to wait for the timeout).

@shahdyousefak ping due to implications for health checks #802.

Need to find some way to manage ollama and specify what runs on CPU vs GPU, so that model runs have consistent performance.
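One way to check where a model is actually running is Ollama's `/api/ps` endpoint, which reports how much of each loaded model is resident in VRAM. A minimal sketch, assuming a default local Ollama at `localhost:11434` (our deployment sits behind an auth proxy, so the URL and headers would differ):

```python
import requests

# Ask Ollama which models are loaded and how much of each is in VRAM.
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    size = model["size"]            # total bytes the model occupies
    size_vram = model["size_vram"]  # bytes resident on the GPU
    pct_gpu = 100 * size_vram / size if size else 0
    print(f"{model['name']}: {pct_gpu:.0f}% on GPU")
```

If `size_vram` is well below `size`, part of the model has been pushed to system RAM and inference will be slow.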

jeffbl self-assigned this Sep 26, 2024
jeffbl mentioned this issue Sep 26, 2024
jeffbl commented Sep 27, 2024

Seems it fell off the GPU overnight since working on it yesterday: a llava:7b request through open-webui is now maxing out all physical cores on unicorn.

`docker compose stop ; docker compose up -d` resets it, and it then uses much less CPU (and results come back much, much faster).

jeffbl commented Oct 1, 2024

Trying to get models to load on GPU on unicorn is very difficult, even initially, since the whole GPU only has 12GB of RAM and about 5GB of it is taken up by our "normal" preprocessor stack. Nonetheless, the content-categoriser preprocessor's ollama requests now include "keep_alive": -1, which should keep the model in memory indefinitely.
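For reference, a sketch of what such a request looks like with the flag set (model, prompt, and URL are placeholders; the real categoriser builds its own prompt, attaches the image, and goes through our proxy):

```python
import requests

payload = {
    "model": "llava:7b",
    "prompt": "Categorise this image.",  # placeholder; the real categoriser prompt differs
    # a real request to llava would also include "images": [<base64-encoded image>]
    "stream": False,
    "keep_alive": -1,                    # keep the model loaded indefinitely
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["response"])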

Note: you can force a model to unload with a request like:

```
curl -X POST -H 'Authorization: Bearer SECRETKEY' -H 'Content-Type: application/json' -L https://ollama.pegasus.cim.mcgill.ca/ollama/api/generate -d '{"model": "llava:latest", "prompt":"", "stream":false, "keep_alive": 0}'
```

TODO: run an empty query when the preprocessor starts, to force the model into memory so that the first real query doesn't time out waiting for the model to load. Right now, the very first query will be slow.
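A possible warm-up sketch for that TODO: at preprocessor startup, send a generate request with an empty prompt, which makes Ollama load the model without producing output (the base URL and auth handling here are assumptions; ours would come from the ollama env config):

```python
import requests

def warm_up(model: str, base_url: str = "http://localhost:11434") -> None:
    """Force `model` into memory by sending an empty prompt."""
    payload = {
        "model": model,
        "prompt": "",        # empty prompt: Ollama just loads the model
        "stream": False,
        "keep_alive": -1,    # and keeps it resident
    }
    requests.post(f"{base_url}/api/generate", json=payload, timeout=300).raise_for_status()

warm_up("llava:7b")
```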

Note that config/ollama.env on unicorn currently points requests to pegasus, since there is plenty of GPU memory there to keep models loaded. If we try to do it on unicorn and the model unloads, requests take so long that they time out, causing end-user delays, errors in the logs, and no category being set.

jeffbl commented Oct 4, 2024

After monitoring for several days, the keep_alive flag implemented in PR #874 has kept the model in memory on pegasus the entire time, and response time has been excellent. I don't see a more clever implementation, but this does leave the problem that the first request is very slow, since the model is only loaded on first use. Logged #890 as a separate issue, and am closing this as resolved.

jeffbl closed this as completed Oct 4, 2024