[V1] Optimize handling of sampling metadata and req_ids list #13244
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger a full CI run by default; instead, it would only run … Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
This pull request has merge conflicts that must be resolved before it can be merged.
- Move SamplingMetadata to a field in the persistent batch, updated only when the batch changes rather than constructed every step.
- Keep input_batch.req_ids sized to the number of requests in the batch, so that anywhere that iterates over it doesn't need to slice (copy) the list or keep track of a separate request count. It is still updated in-place.

Signed-off-by: Nick Hill <[email protected]>
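For illustration, a minimal sketch of the caching pattern described above; the class and field names here are hypothetical and do not mirror the actual vLLM implementation:

```python
# Minimal sketch: cache sampling metadata on the persistent batch and rebuild it
# only when the batch composition changes. Names are illustrative, not vLLM's.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplingMetadata:
    temperatures: list  # per-request sampling parameters, one entry per request


class PersistentBatch:
    def __init__(self) -> None:
        self.req_ids: list = []  # kept sized to the live requests, updated in place
        self.temperatures: list = []
        self._sampling_metadata: Optional[SamplingMetadata] = None

    def add_request(self, req_id: str, temperature: float) -> None:
        self.req_ids.append(req_id)
        self.temperatures.append(temperature)
        self._sampling_metadata = None  # invalidate the cached metadata

    def remove_request(self, req_id: str) -> None:
        idx = self.req_ids.index(req_id)
        del self.req_ids[idx]
        del self.temperatures[idx]
        self._sampling_metadata = None  # invalidate the cached metadata

    @property
    def sampling_metadata(self) -> SamplingMetadata:
        # Rebuilt lazily, only after the batch changed, not on every model step.
        if self._sampling_metadata is None:
            self._sampling_metadata = SamplingMetadata(
                temperatures=list(self.temperatures))
        return self._sampling_metadata
```

Steady-state decode steps then reuse the cached object, and req_ids never needs to be sliced because it only ever contains live requests.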
Force-pushed from 2bcf20f to 7d6ee8f.
@WoosukKwon this is the first step; I am working on follow-on simplification for the penalty parameters, etc.
@WoosukKwon apologies, I am looking into the test failure.
Signed-off-by: Nick Hill <[email protected]>
@WoosukKwon the test failure should be fixed now... the shared apply-penalties code was doing in-place unsqueezes on the sampling penalty tensors, which I think is a bad thing to do but didn't cause a problem before because we were passing new slices every step.
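As background, a standalone sketch of the underlying tensor behavior (plain PyTorch, not the vLLM apply-penalties code): an in-place `unsqueeze_` mutates the shared tensor's shape for every later user of that tensor, while the out-of-place `unsqueeze` returns a view and leaves the original untouched.

```python
# Illustration of in-place vs. out-of-place unsqueeze in PyTorch; this is not
# the vLLM penalties code, just the tensor behavior being discussed.
import torch

penalties = torch.ones(4)            # pretend this is a persistent per-batch tensor

# Out-of-place: returns a new view; the original shape is unchanged.
view = penalties.unsqueeze(1)
print(view.shape, penalties.shape)   # torch.Size([4, 1]) torch.Size([4])

# In-place: mutates the shared tensor, so every later step sees the new shape.
penalties.unsqueeze_(1)
print(penalties.shape)               # torch.Size([4, 1])

# Calling it again keeps stacking dimensions onto the same tensor.
penalties.unsqueeze_(1)
print(penalties.shape)               # torch.Size([4, 1, 1])
```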
This pull request has merge conflicts that must be resolved before it can be merged.
# Conflicts:
#   vllm/v1/worker/gpu_input_batch.py
@WoosukKwon that's fine with me.
Signed-off-by: Nick Hill <[email protected]>
…streamline
Signed-off-by: Nick Hill <[email protected]>
# Conflicts:
#   tests/v1/worker/test_gpu_input_batch.py
#   vllm/v1/sample/sampler.py
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
@njhill Sorry for the delay. I will review this PR once it's rebased.
Signed-off-by: Nick Hill <[email protected]>
# Conflicts:
#   tests/v1/sample/test_sampler.py
#   tests/v1/worker/test_gpu_input_batch.py
#   vllm/v1/worker/gpu_input_batch.py
#   vllm/v1/worker/gpu_model_runner.py
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
@WoosukKwon I have now rebased. #13360 partially overlaps with this (e.g. I simplified some of the min_tokens handling in this one but have refactored it completely in the other one based on the new abstraction). But I think it would be fine to get this in first, and I can rebase the other one if you're ok with that.
Signed-off-by: Nick Hill <[email protected]>
@njhill I'm not sure it's worthwhile to change from list to tuple here. A quick micro-benchmark:

```python
import time

N = 1024

# List
x = []
start = time.perf_counter()
for i in range(N):
    x.append([])
end = time.perf_counter()
print(f"list: {(end - start) * 1000:.3f} ms")

# Tuple
y = []
start = time.perf_counter()
for i in range(N):
    y.append(())
end = time.perf_counter()
print(f"tuple: {(end - start) * 1000:.3f} ms")
```

I find that adding 1024 (the maximum number of requests in the batch) empty lists only takes 80-90 us. While using tuples reduces this to 30-40 us, I don't think the 50 us gap (in the worst case) justifies the extra complexity here. When the batch size is 32, the gap becomes even smaller (7 us vs 2 us). WDYT?
Signed-off-by: Nick Hill <[email protected]>
@WoosukKwon I agree it's not worth any extra complexity. Just might as well use …
@njhill I think changing …
Signed-off-by: Nick Hill <[email protected]>
@WoosukKwon sure, let me revert those too. I think mostly we don't need to consider the tuple/list difference because these are args or fields that would be considered read-only.
Signed-off-by: Nick Hill <[email protected]>
@WoosukKwon I need to fix up some of the gpu_model_runner tests, but I'll wait for your first review to make sure you are good with the changes overall before spending time on that.
Amazing. Looks much cleaner! 😄
```diff
+    del request.spec_token_ids[num_scheduled_spec_tokens:]
     scheduled_spec_decode_tokens[request.request_id] = (
-        request.spec_token_ids[:num_scheduled_spec_tokens])
+        request.spec_token_ids)
```
What is this change for?
It avoids creating a new list, just trims the existing one down to num_scheduled_spec_tokens, since any later spec token ids are essentially discarded anyhow.
Got it! Maybe worth a comment.
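As a side note on the mechanics here, `del lst[k:]` truncates the existing list object in place, whereas `lst[:k]` allocates a new list; a minimal plain-Python sketch with hypothetical values:

```python
# `del` trims the same list object in place; slicing creates a copy.
spec_token_ids = [11, 22, 33, 44, 55]
num_scheduled_spec_tokens = 3

sliced = spec_token_ids[:num_scheduled_spec_tokens]   # new list (extra allocation)
assert sliced == [11, 22, 33]
assert len(spec_token_ids) == 5                       # original list unchanged

del spec_token_ids[num_scheduled_spec_tokens:]        # trims in place, no new list
assert spec_token_ids == [11, 22, 33]
```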
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
LGTM! Very nice simplification!
Signed-off-by: Nick Hill <[email protected]>