The implementation of #887 now loads the llava LLM into memory on pegasus and pins it there. However, the first request is still slow because the model is not loaded until that request arrives, and the same likely applies to several other models we currently use. The work item is to figure out how to load everything into memory at startup so that even first requests are fast.
Note that on a slower machine like unicorn, the initial load of an LLM can take long enough that the request times out and the orchestrator moves on. This effectively means that the first request fails completely.
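One possible shape for the fix is an explicit warm-up pass at service startup: eagerly load every model before the service accepts traffic, with retries and no orchestrator-style timeout, so the cold-start cost is never paid by a real request. A minimal sketch, where the model list and `load_fn` (whatever actually loads and pins a model in memory) are hypothetical stand-ins:

```python
import time


def warm_up(models, load_fn, retries=3, delay=1.0):
    """Eagerly load each model before serving traffic.

    `load_fn(name)` is a placeholder for whatever call actually loads
    and pins a model in memory; it is slow on the first invocation.
    Retries cover slow machines where a single load attempt may fail.
    """
    loaded = []
    for name in models:
        for attempt in range(retries):
            try:
                load_fn(name)  # slow on first call; pins the model in RAM
                loaded.append(name)
                break
            except Exception:
                if attempt == retries - 1:
                    raise  # give up after the last attempt
                time.sleep(delay)  # back off before retrying
    return loaded
```

Run during startup (before the orchestrator can route requests), this keeps the request-timeout budget out of the load path entirely; the first real request then hits an already-resident model.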