The implementation of #887 now loads the llava LLM into memory on pegasus and pins it there. However, the first request is still slow because the model is not loaded until that request arrives, and the same likely applies to several other models we currently use. The work item is to figure out how to load everything into memory at startup so that even first requests are fast.
Note that on a slower machine like unicorn, the initial load of an LLM can take long enough that the request times out and the orchestrator moves on. This effectively means that the first request fails completely.
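One possible shape for the fix is an explicit warm-up pass at service startup: eagerly load every model before the service accepts traffic, with retries and no orchestrator-style timeout, so the cold-start cost is never paid by a real request. A minimal sketch, where the model list and `load_fn` (whatever actually loads and pins a model in memory) are hypothetical stand-ins:

```python
import time


def warm_up(models, load_fn, retries=3, delay=1.0):
    """Eagerly load each model before serving traffic.

    `load_fn(name)` is a placeholder for whatever call actually loads
    and pins a model in memory; it is slow on the first invocation.
    Retries cover slow machines where a single load attempt may fail.
    """
    loaded = []
    for name in models:
        for attempt in range(retries):
            try:
                load_fn(name)  # slow on first call; pins the model in RAM
                loaded.append(name)
                break
            except Exception:
                if attempt == retries - 1:
                    raise  # give up after the last attempt
                time.sleep(delay)  # back off before retrying
    return loaded
```

Run during startup (before the orchestrator can route requests), this keeps the request-timeout budget out of the load path entirely; the first real request then hits an already-resident model.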