Single model mode #223
Conversation
- Use a thread lock around the model in single model mode
@peldszus Thanks for putting this together.
It probably depends on your intended audience, but I agree. For private users with no GPU, or just one, single model mode is probably the best option. In production environments with higher throughput, maybe with multiple GPUs on one node, there are different scaling/optimization routes to take, but they could involve single model mode as well.
@makaveli10 If this is fine for you, I can adjust the argument parser and the readme.
Force-pushed from 2933189 to ab17c4d
Force-pushed from b7e68ab to 1407731
@makaveli10 Have a look now; I updated the option's default and the readme accordingly.
Looks good to me. I would also add an option to use a single model when not using a custom model, but I guess that is for a future release, because it is a bit more complicated: it would mean maintaining a dict of instantiated models keyed by model size and clearing an entry once no client is using that model size anymore (see the sketch below).
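A rough sketch of what such a reference-counted cache could look like; the `ModelCache` class, the `model_factory` callable, and the method names are hypothetical illustrations, not part of this PR:

```python
import threading

class ModelCache:
    """Hypothetical cache of instantiated models keyed by model size.

    Each entry is reference-counted; the model for a given size is
    released once no client connection is using that size anymore.
    """

    def __init__(self, model_factory):
        # model_factory(model_size) -> model instance (assumed callable)
        self.model_factory = model_factory
        self.lock = threading.Lock()
        self.models = {}     # model_size -> model instance
        self.refcounts = {}  # model_size -> number of active clients

    def acquire(self, model_size):
        # Called when a client connects and requests a given model size.
        with self.lock:
            if model_size not in self.models:
                self.models[model_size] = self.model_factory(model_size)
                self.refcounts[model_size] = 0
            self.refcounts[model_size] += 1
            return self.models[model_size]

    def release(self, model_size):
        # Called when a client disconnects; frees the model once unused.
        with self.lock:
            self.refcounts[model_size] -= 1
            if self.refcounts[model_size] <= 0:
                del self.models[model_size]
                del self.refcounts[model_size]
```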
I added a mode in which all client connections use the same single model, instead of instantiating a new model for each connection. This only applies if a custom model has been specified at server start (i.e. a TensorRT model or a custom faster-whisper model).
For this, a new option has been added, defaulting to false, so that the current behaviour is not changed.
This partially resolves #109, but only for custom models. It does not apply to the faster-whisper backend, which dynamically loads standard models based on the client request.
A thread lock is used to make model prediction thread-safe. This also means that a connection has to wait if another connection is currently predicting (see the sketch below).
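A minimal, self-contained illustration of this pattern, assuming a single shared model guarded by a lock; `DummyModel`, `handle_client_audio`, and the method names are placeholders, not the actual WhisperLive code:

```python
import threading

class DummyModel:
    """Placeholder standing in for the real transcription model (hypothetical)."""
    def transcribe(self, audio_chunk):
        return f"transcription of {len(audio_chunk)} samples"

# One shared model instance, loaded once at server start, guarded by a lock
# so that only one client connection runs inference at a time.
shared_model = DummyModel()
model_lock = threading.Lock()

def handle_client_audio(audio_chunk):
    # Every connection uses the same `shared_model`; the lock serializes
    # access, so a connection may block while another one is predicting.
    with model_lock:
        return shared_model.transcribe(audio_chunk)

# Example usage from two "connections" sharing the same model:
print(handle_client_audio([0.0] * 16000))
print(handle_client_audio([0.0] * 32000))
```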
Motivation
I use a large-v3 TensorRT model. It takes about 5 seconds to load for every new client connection; with the single model option, this is reduced to under 1 second. Also, I only want to have the model in VRAM once.