[DRAFT] Vllm integration #1628
base: main
Conversation
I'm really looking forward to this integration! Just out of curiosity, do you think using optimum or torch.compile as a generation backend is possible? @vwxyzjn
Yes, I think torch.compile would be an option, but with the caveat that currently only a few model architectures are supported.
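For reference, a minimal sketch of what torch.compile as a generation path could look like (illustrative only, not code from this PR; the checkpoint name is just an example):

```python
# Minimal sketch of torch.compile as a generation path (illustrative only;
# "gpt2" is just an example checkpoint).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Compile the forward pass; generate() then runs the compiled forward at each
# decoding step (it may retrace as the sequence length grows).
model.forward = torch.compile(model.forward)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```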
Hi, are there any updates? Thanks!
UPDATE 7/7/2024: After chatting with @lewtun, we'd like to see if vLLM is willing to support vllm-project/vllm#6189 officially before merging this PR, as it may cause confusion for users.
This PR adds a vLLM backend for generation. Preliminary testing shows it is ~8x faster: given 80 minutes of training, the run with HF generation completed 2,650 episodes, whereas the run with vLLM generation completed 16k episodes.
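For context, generation with the vLLM backend boils down to something like the following (a minimal sketch; the checkpoint, prompt, and sampling settings are illustrative, not the exact values used in this PR):

```python
# Minimal sketch of vLLM-based generation (illustrative; checkpoint and sampling
# settings are examples, not the exact values used in this PR).
from vllm import LLM, SamplingParams

llm = LLM(model="EleutherAI/pythia-1b-deduped")
sampling_params = SamplingParams(temperature=0.7, top_p=1.0, max_tokens=53)

prompts = ["SUBREDDIT: r/AskReddit\nTITLE: ...\nPOST: ...\nTL;DR:"]
outputs = llm.generate(prompts, sampling_params)
completions = [output.outputs[0].text for output in outputs]
```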
Note that your mileage may vary with different hardware / generation lengths. For example, with 1B models on the TL;DR task, vLLM does not seem to provide much of a speed benefit, likely due to the short generation length.
Note that we have to use our custom vLLM build to achieve precise device placement (so that we can place the vLLM instance on the 8th GPU). See vwxyzjn/vllm#1
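For illustration, the intended topology looks roughly like this; the `device` keyword below is an assumption about what the custom build enables, since stock vLLM does not expose precise per-GPU placement:

```python
# Hypothetical sketch: the trainer occupies GPUs 0-6 and the vLLM engine is pinned
# to GPU 7. The `device` argument is an assumption about what the custom build
# (vwxyzjn/vllm#1) provides; stock vLLM does not let you pick an exact GPU this way.
from vllm import LLM

llm = LLM(
    model="EleutherAI/pythia-1b-deduped",  # example checkpoint, for illustration
    device="cuda:7",                       # hypothetical: put the engine on the 8th GPU
    gpu_memory_utilization=0.9,
)
```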