Restore the ability to draw multiple samples with Open Source models #416

Closed
rlouf opened this issue Dec 8, 2023 · 5 comments · Fixed by #533
Comments

@rlouf
Member

rlouf commented Dec 8, 2023

This was removed in #366 to simplify the PR, and should be added back. This will require care with shapes: an extra dimension will need to be added to account for the sample shape. The mechanism implemented there can be re-used when implementing beam search (#258).

@lapp0
Contributor

lapp0 commented Jan 3, 2024

I'm looking to implement beam search.

I'm wondering whether we could simply use the sequence tokens as the ID of the sequence?

8b1ff9a#diff-f65ffb5f52b2e358c713ccb8f32a700769426c6c8b655f689e3cdccae07d22ac

For 1,000-token sequences, I can generate 25,000 keys per second on my machine, so it shouldn't be a substantial bottleneck.
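
A rough sketch of what I have in mind (purely illustrative, not the actual outlines API; the names SequenceKey and sequence_cache are made up):

from typing import Dict, Tuple

# A sequence is identified by the tuple of its token ids; tuples are hashable,
# so they can be used directly as dictionary keys.
SequenceKey = Tuple[int, ...]

# e.g. a per-sequence cache keyed by the tokens generated so far
sequence_cache: Dict[SequenceKey, int] = {}

token_ids = [2, 415, 1342, 9768]
sequence_cache[tuple(token_ids)] = 7  # store some per-sequence state under that key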

@rlouf
Member Author

rlouf commented Jan 3, 2024

My understanding is that vLLM needs sequence ids because they're doing continuous batching, and we wouldn't need to assign an id to sequences here.

I'm still hesitating between using one big tensor of shape n_samples x n_batch x n_token_ids, like we did before the refactor, and breaking the sequences down like this:

from dataclasses import dataclass, field
from typing import List


@dataclass
class Sequence:
    prompt_token_ids: List[int]
    generated_token_ids: List[int] = field(default_factory=list)
    logprob: float = 0.0

    @property
    def token_ids(self):
        return self.prompt_token_ids + self.generated_token_ids

    def add_token_id(self, token_id, token_logprob):
        self.logprob += token_logprob
        self.generated_token_ids.append(token_id)


@dataclass
class Generation:
    sequences: List[Sequence]

For beam search, a Sequence would be a beam and a Generation would correspond to an input prompt. We have plans for better KV cache management in the future, and this would definitely simplify things.
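
For illustration, one beam search step with these classes could look roughly like this (just a sketch with placeholder token scores, not actual outlines code):

# One Generation per input prompt; each Sequence is a live beam.
prompt = [2, 415, 1342]
generation = Generation(
    sequences=[Sequence(prompt_token_ids=list(prompt)) for _ in range(3)]
)

# At each step, extend every beam with candidate tokens and keep the best beams.
candidates = []
for beam in generation.sequences:
    for token_id, token_logprob in [(10, -0.1), (11, -1.2)]:  # placeholder model output
        new_beam = Sequence(
            prompt_token_ids=beam.prompt_token_ids,
            generated_token_ids=list(beam.generated_token_ids),
            logprob=beam.logprob,
        )
        new_beam.add_token_id(token_id, token_logprob)
        candidates.append(new_beam)

# Keep the 3 highest-scoring beams for the next step.
generation.sequences = sorted(candidates, key=lambda s: s.logprob, reverse=True)[:3]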

This is the way vLLM does it: they create new tensors at each step. We would need to measure the overhead of creating new tensors at each step before moving forward.

What do you think?

@rlouf
Member Author

rlouf commented Jan 3, 2024

Would you mind copy/pasting your comment into a discussion and I'll answer there? Just so we stay on topic here and your comment is easier to find for future readers.

@lapp0
Contributor

lapp0 commented Jan 3, 2024

#501

@rlouf
Member Author

rlouf commented Jan 11, 2024

Here are the changes that need to be implemented in order to restore the ability to generate several samples for each sequence:

  1. Update sequence_generator so it expands the prompt ids from (n_batches, n_tokens) to (n_samples, n_batches, n_tokens) by duplicating the prompts. A single prompt with 10 tokens would lead to token_ids with shape (1, 1, 10), a batch of 3 prompts to shape (1, 3, 10) (with padding), and 7 samples for a batch of 3 prompts to shape (7, 3, 10). We keep singleton dimensions to simplify the code. Same with fsm_states. (See the sketch after this list.)
  2. Reshape the token_ids array to (n_batches * n_samples, n_tokens) before calling the token generator
  3. Update update_token_ids, expand_attention_masks and update_fsm_states
  4. Decode the result.
  5. Reshape the decoded sequences to (n_samples, n_batches, n_tokens) and remove singleton dimensions. Note that when we return a Sequence instance instead of just text in the future, we will only remove singleton dimensions when printing or extracting the text, but will keep them for the token_ids and logprobs.
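
A rough sketch of the shape flow above (illustrative only, using torch with toy values; the real code also has to carry attention masks and fsm_states through the same expansion):

import torch

n_samples, n_batches, n_tokens = 7, 3, 10

# Step 1: expand the prompt ids from (n_batches, n_tokens) to
# (n_samples, n_batches, n_tokens) by duplicating the prompts.
prompt_token_ids = torch.randint(0, 50_000, (n_batches, n_tokens))
token_ids = prompt_token_ids.unsqueeze(0).expand(n_samples, -1, -1)

# Step 2: flatten to (n_samples * n_batches, n_tokens) before calling the
# token generator (reshape materializes the expanded view here).
flat_token_ids = token_ids.reshape(n_samples * n_batches, n_tokens)

# ... the token generator appends newly sampled tokens to flat_token_ids ...

# Step 5: reshape back to (n_samples, n_batches, n_tokens) and drop singleton
# dimensions, e.g. (1, 1, n_tokens) -> (n_tokens,) for a single prompt with a
# single sample.
token_ids = flat_token_ids.reshape(n_samples, n_batches, -1)
final = token_ids.squeeze()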

We will need to add tests for init_generator_state, sequence_generator and the sampling algorithms.
