ollama sometimes moves models off GPU, making them so slow they time out #887
Comments
Trying to get models to load on the GPU on unicorn is very difficult, even initially, since the whole GPU only has 12GB of RAM, and about 5GB of it is taken up by our "normal" preprocessor stack. Nonetheless, the content-categoriser preprocessor's ollama requests now include a keep_alive parameter. Note, you can force a model to unload with a request like:
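A minimal sketch of such an unload request, assuming ollama is on its default `localhost:11434` endpoint and using the `llava:7b` model from this issue as the example:

```python
# Sketch: force a model out of memory via the ollama HTTP API.
# An /api/generate request with no prompt and keep_alive set to 0
# asks ollama to unload the named model immediately.
import requests

requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llava:7b", "keep_alive": 0},
)
```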
TODO: run an empty query when the preprocessor is started, to force the model into memory, so that the first query doesn't time out trying to load models. Right now, the very first query will be slow.
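A rough sketch of what that warm-up could look like (an assumption about the approach, not something implemented yet), again assuming the default endpoint; an `/api/generate` request with no prompt just loads the model, and a negative keep_alive keeps it resident indefinitely:

```python
# Sketch: warm-up call at preprocessor startup so the first real query
# doesn't pay the model-load cost. keep_alive=-1 keeps the model loaded
# until it is explicitly unloaded or ollama restarts.
import requests

requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llava:7b", "keep_alive": -1},
)
```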
Monitoring for several days, the keep_alive flag implemented in PR #874 has kept the model in memory on pegasus the entire time, and response time has been excellent. I don't see a more clever implementation, but this does leave the problem that first requests are very slow, since the model is only loaded on the first request. Logged #890 as a separate issue, and am closing this as resolved.
When ollama is launched fresh and loads (for example) llava:7b so that content-categoriser can run, it is nice and quick. However, after some time, a request goes in and it is running on the CPU. This takes so long that the preprocessor times out, causing lots of problems and a very slow response time (since the preprocessor has to wait to time out).
@shahdyousefak ping due to implications for health checks #802.
Need to find some way to manage ollama and specify what is running on CPU vs GPU, to get consistent results when running models.
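One knob worth trying (an assumption on my part, not something settled in this issue) is the `num_gpu` option on a request, which tells ollama how many layers to offload to the GPU rather than leaving the split to its own estimate. A minimal sketch, with an illustrative prompt and value:

```python
# Sketch: pin the GPU/CPU split per request via the num_gpu option
# (number of layers to offload to the GPU). Values are illustrative.
import requests

requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava:7b",
        "prompt": "describe this image",
        "options": {"num_gpu": 99},  # ask for as many layers on the GPU as possible
    },
)
```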