Handle `max_tokens` more gracefully #433
Comments
A note here concerning the new …
Makes sense. Let's merge #391 first.
An alternate solution that I implemented in the vLLM integration example was to initialize every sequence-dependent variable as a defaultdict and pass the sequence id. The solution of copying the FSM is much more elegant, and there is indeed some work planned with the vLLM team to generate FSMs dynamically for each new sequence.
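For illustration, here is a minimal sketch of the defaultdict approach described in this comment. The `CountingFSM` class, its `next_state(state, token_id, seq_id)` signature, and the `-1` final state are hypothetical stand-ins for the idea, not Outlines' or vLLM's actual API:

```python
from collections import defaultdict


class CountingFSM:
    """Toy FSM that keeps one token counter per sequence id.

    Because each sequence updates only its own counter, batched
    generation no longer divides the token budget across sequences.
    """

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        # Hypothetical per-sequence counters, created on first access.
        self.num_tokens: defaultdict[int, int] = defaultdict(int)

    def next_state(self, state: int, token_id: int, seq_id: int) -> int:
        self.num_tokens[seq_id] += 1
        if self.num_tokens[seq_id] >= self.max_tokens:
            return -1  # hypothetical final state: stop this sequence
        return state  # a real FSM would transition on token_id here
```

Copying the FSM per sequence achieves the same isolation without threading a sequence id through the interface, which is what makes it the more elegant option.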
A bug was found in the way `max_tokens` is managed. Since the number of tokens generated is updated every time `FSMState` is called, when we generate tokens in batch the sequence generator returns `max_tokens / batch_size` tokens per sequence. The solution implemented in #391 consists in adding an `idx` argument to the `next_state` method of the FSMs, but this does not respect the a priori interface between the generator and the FSMs. There are two solutions to this problem:
1. Copy the FSM for each sequence in the batch, so that each copy keeps its own token count (the approach described as more elegant in the comment above);
2. Keep track of the number of generated tokens in the `SequenceGenerator`.

Either of these allows us to remove the `idx` argument. #417 suggests that the `stop_at` constraint should be implemented at the `SequenceGenerator` level (since we need to decode the token ids), and the fact that we should let users control the maximum number of tokens generated manually makes me lean towards (2); a sketch of that approach follows.
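To make option (2) concrete, here is a rough sketch of a generator that owns the token budget so the FSM never needs to count tokens. The `model.sample` and `fsm.next_state` signatures, the `0` initial state, and the `-1` final-state sentinel are all assumptions for illustration, not the actual Outlines API:

```python
class SequenceGenerator:
    """Simplified stand-in for a generator that owns the token budget.

    The FSM only maps (state, token_id) -> state; it never counts
    tokens, so no `idx` argument is needed and the generator/FSM
    interface is preserved.
    """

    def __init__(self, model, fsm, max_tokens: int):
        self.model = model
        self.fsm = fsm
        self.max_tokens = max_tokens

    def generate(self, prompts: list[str]) -> list[list[int]]:
        batch_size = len(prompts)
        states = [0] * batch_size  # assumed initial FSM state
        token_ids: list[list[int]] = [[] for _ in range(batch_size)]
        # One budget counter per sequence, owned by the generator:
        # batching no longer splits `max_tokens` across sequences.
        num_tokens = [0] * batch_size

        while True:
            # A sequence is active until its FSM reaches the assumed
            # final state (-1) or it exhausts its own token budget.
            active = [
                i
                for i in range(batch_size)
                if states[i] != -1 and num_tokens[i] < self.max_tokens
            ]
            if not active:
                return token_ids
            for i in active:
                # Assumed model interface: sample one token for sequence i.
                next_token = self.model.sample(prompts[i], token_ids[i])
                token_ids[i].append(next_token)
                num_tokens[i] += 1
                # The FSM call carries no sequence index.
                states[i] = self.fsm.next_state(states[i], next_token)
```

This layering would also give a natural home to the `stop_at` check from #417, since the generator is the layer that decodes token ids.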