
ollama sometimes moves models off GPU, making them so slow they time out #887

Closed
jeffbl opened this issue Sep 26, 2024 · 3 comments
jeffbl commented Sep 26, 2024

When ollama is launched fresh and loads (for example) llava:7b so that content-categoriser can run, it is nice and quick. However, after some time, a request goes in and the model is running on CPU. This takes so long that the preprocessor times out, causing lots of problems and a very slow overall response (since the preprocessor has to wait for the timeout).

@shahdyousefak ping due to implications for health checks #802.

Need to find some way to manage ollama and specify what runs on CPU vs GPU, so that model runs have consistent performance.
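One way to check where a model is actually running is Ollama's `/api/ps` endpoint, which reports how much of each loaded model is resident in VRAM. A minimal sketch, assuming a default local Ollama at `localhost:11434` (our deployment sits behind an auth proxy, so the URL and headers would differ):

```python
import requests

# Ask Ollama which models are loaded and how much of each is in VRAM.
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    size = model["size"]            # total bytes the model occupies
    size_vram = model["size_vram"]  # bytes resident on the GPU
    pct_gpu = 100 * size_vram / size if size else 0
    print(f"{model['name']}: {pct_gpu:.0f}% on GPU")
```

If `size_vram` is well below `size`, part of the model has been pushed to system RAM and inference will be slow.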

jeffbl self-assigned this Sep 26, 2024
jeffbl mentioned this issue Sep 26, 2024
jeffbl commented Sep 27, 2024

Seems it fell off the GPU overnight since working on it yesterday: a llava:7b request through open-webui is now maxing out all physical cores on unicorn.

`docker compose stop ; docker compose up -d` resets it, and it then uses much less CPU (and results come back much, much faster).

jeffbl commented Oct 1, 2024

Trying to get models to load on GPU on unicorn is very difficult, even initially, since the whole GPU only has 12GB of RAM and about 5GB of it is taken up by our "normal" preprocessor stack. Nonetheless, the content-categoriser preprocessor's ollama requests now include "keep_alive": -1, which should keep the model in memory indefinitely.
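For reference, a sketch of what such a request looks like with the flag set (model, prompt, and URL are placeholders; the real categoriser builds its own prompt, attaches the image, and goes through our proxy):

```python
import requests

payload = {
    "model": "llava:7b",
    "prompt": "Categorise this image.",  # placeholder; the real categoriser prompt differs
    # a real request to llava would also include "images": [<base64-encoded image>]
    "stream": False,
    "keep_alive": -1,                    # keep the model loaded indefinitely
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["response"])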

Note: you can force a model to unload with a request like:

```
curl -X POST -H 'Authorization: Bearer SECRETKEY' -H 'Content-Type: application/json' -L https://ollama.pegasus.cim.mcgill.ca/ollama/api/generate -d '{"model": "llava:latest", "prompt":"", "stream":false, "keep_alive": 0}'
```

TODO: run an empty query when the preprocessor starts, to force the model into memory so that the first real query doesn't time out waiting for the model to load. Right now, the very first query will be slow.
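A possible warm-up sketch for that TODO: at preprocessor startup, send a generate request with an empty prompt, which makes Ollama load the model without producing output (the base URL and auth handling here are assumptions; ours would come from the ollama env config):

```python
import requests

def warm_up(model: str, base_url: str = "http://localhost:11434") -> None:
    """Force `model` into memory by sending an empty prompt."""
    payload = {
        "model": model,
        "prompt": "",        # empty prompt: Ollama just loads the model
        "stream": False,
        "keep_alive": -1,    # and keeps it resident
    }
    requests.post(f"{base_url}/api/generate", json=payload, timeout=300).raise_for_status()

warm_up("llava:7b")
```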

Note that config/ollama.env on unicorn currently points requests to pegasus, since there is plenty of GPU memory there to keep models loaded. If we try to do it on unicorn and the model unloads, requests take so long that they time out, causing end-user delays, errors in the logs, and no category being set.

jeffbl commented Oct 4, 2024

After monitoring for several days, the keep_alive flag implemented in PR #874 has kept the model in memory on pegasus the entire time, and response time has been excellent. I don't see a more clever implementation, but this does leave the problem that the first request is very slow, since the model is only loaded on first use. Logged #890 as a separate issue, and am closing this as resolved.

jeffbl closed this as completed Oct 4, 2024