
[DRAFT] Vllm integration #1628

Draft · wants to merge 55 commits into main
Conversation

vwxyzjn (Contributor) commented May 7, 2024

-- UPDATE 7/7/2024: after chatting with @lewtun, we'd like to see if vLLM is willing to support vllm-project/vllm#6189 officially before merging this PR, as it may otherwise cause confusion for users.

This PR adds a vLLM backend for generation. Preliminary testing shows it is roughly 8x faster: over 80 minutes of training, the run with HF generation completed 2,650 episodes, whereas the run with vLLM generation completed 16k episodes.

[Figure: episodes completed over 80 minutes of training, HF generation vs. vLLM generation]

Note that your mileage may vary with different hardware and generation lengths. For example, on the TL;DR task with 1B models, vLLM does not seem to provide much speed benefit, likely because the generations are short.
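For readers unfamiliar with vLLM, here is a minimal sketch of an offline vLLM generation call; the model name, prompt, and sampling parameters are placeholders, not the values used in this PR, which wires generation into the trainer's rollout loop:

```python
from vllm import LLM, SamplingParams

# Placeholder model and prompt, for illustration only.
llm = LLM(model="EleutherAI/pythia-1b-deduped")
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

# Generate completions for a batch of prompts in one call.
outputs = llm.generate(["The capital of France is"], sampling_params)
print(outputs[0].outputs[0].text)
```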

Note that we have to use our custom vLLM build to achieve precise device placement (so that we can place the vLLM instance on the 8th GPU); see vwxyzjn/vllm#1.
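A rough sketch of the intended placement, assuming a hypothetical per-device argument exposed by the custom build (stock vLLM at the time did not accept a specific GPU index):

```python
from vllm import LLM, SamplingParams

# Hypothetical sketch: assumes the custom build (vwxyzjn/vllm#1) lets the engine be
# pinned to a specific GPU index, so generation lives on cuda:7 while the training
# processes occupy GPUs 0-6.
llm = LLM(
    model="EleutherAI/pythia-1b-deduped",  # placeholder policy checkpoint
    device="cuda:7",                       # hypothetical argument from the custom build
)
outputs = llm.generate(["<placeholder prompt>"], SamplingParams(max_tokens=64))
```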

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

vwxyzjn marked this pull request as ready for review on July 3, 2024, 18:27
scottsuk0306 (Contributor) commented:
I'm really looking forward to this integration! Just out of curiosity, do you think using optimum or torch.compile as a generation backend is possible? @vwxyzjn


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

lewtun (Member) commented Aug 13, 2024

> I'm really looking forward to this integration! Just out of curiosity, do you think using optimum or torch.compile as a generation backend is possible? @vwxyzjn

Yes, I think torch.compile would be an option, but with the caveat that currently only a few model architectures are supported.
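For reference, a minimal sketch of what a torch.compile-based generation path could look like with transformers; the model name and settings are placeholders, and whether compilation actually helps depends on the architecture and library versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-1b-deduped"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# Compile the forward pass; generate() then runs through the compiled graph.
# Recompilation behavior depends on the cache implementation and torch/transformers versions.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```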

fzyzcjy (Contributor) commented Dec 18, 2024

Hi, are there any updates? Thanks!
