Parallel sampling support #106
Conversation
This reverts commit 5382004.
```python
def get_prompt_sequence_id(request_id: RequestId) -> SequenceId:
    return SequenceId(request_id, PROMPT_SEQEUNCE_INDEX)
```
Should we also update the assert at https://github.com/masahi/mlc-llm/blob/parallel-sampling-dev/serve/mlc_serve/engine/base.py#L148 accordingly?
I don't think we support that. Besides, is there a point in allowing greedy sampling when `n > 1`?
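For readers following along, here is a minimal sketch of how a reserved prompt-sequence index can let the `n` generated sequences share a single prompt entry. The `RequestId` alias, the concrete value of `PROMPT_SEQEUNCE_INDEX`, and the helper for decode sequence ids are assumptions for illustration, not the PR's actual definitions:

```python
from typing import List, NamedTuple

RequestId = str  # assumed alias; the engine may use a different type

class SequenceId(NamedTuple):
    request_id: RequestId
    sequence_index: int

# Hypothetical reserved index marking the shared prompt sequence.
PROMPT_SEQEUNCE_INDEX = -1

def get_prompt_sequence_id(request_id: RequestId) -> SequenceId:
    return SequenceId(request_id, PROMPT_SEQEUNCE_INDEX)

def get_decode_sequence_ids(request_id: RequestId, num_sequences: int) -> List[SequenceId]:
    # Each of the n parallel samples gets its own decode sequence id,
    # while all of them map back to the single shared prompt sequence above.
    return [SequenceId(request_id, i) for i in range(num_sequences)]
```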
```diff
     )

     if args.use_staging_engine:
         engine.stop()

     total_num_tokens = sum(
-        prompt_len + output_len for _, prompt_len, output_len in requests
+        prompt_len + output_len * args.num_sequences_to_sample for _, prompt_len, output_len in requests
```
Not a request to change anything, just some reflections: the metric we calculate, tokens per second, might not reflect the real number of processed tokens, because with cache eviction we can process the same request several times. That is, we may process more tokens than this `total_num_tokens` accounts for. Not sure whether it affects any conclusions or the overall picture.
imo, we should not include re-computed tokens in this kind of benchmarking, to avoid a situation where throughput looks better than it should because of re-computation. If many tokens have to be re-computed to generate a valid output token, that is a problem of the engine, and the benchmark script should reflect it.
I agree with @sunggg here
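To make the agreed counting rule concrete, here is a small sketch of the throughput formula under the convention discussed above: count each prompt once, count the output tokens of every one of the `n` samples, and do not count re-computed tokens. The function names are illustrative, not the benchmark script's own:

```python
def compute_total_num_tokens(requests, num_sequences_to_sample: int) -> int:
    # requests: list of (prompt, prompt_len, output_len) tuples, as in the benchmark script.
    # Each request contributes its prompt once, plus output_len tokens per sample.
    return sum(
        prompt_len + output_len * num_sequences_to_sample
        for _, prompt_len, output_len in requests
    )

def throughput_tokens_per_sec(requests, num_sequences_to_sample: int, elapsed_s: float) -> float:
    # Tokens re-computed after cache eviction are deliberately *not* counted, so an
    # eviction-heavy run shows up as lower measured throughput rather than inflating it.
    return compute_total_num_tokens(requests, num_sequences_to_sample) / elapsed_s

# e.g. one request with a 100-token prompt, 50 output tokens, n = 4 samples:
assert compute_total_num_tokens([("p", 100, 50)], 4) == 300
```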
```python
            sampling_params=state.sampling_params,
        )
    )
    cache_manager.extend(
        gen_seq.seq_id,
        len(token_ids) - gen_seq.next_start_position,
        prompt_counts
```
Comparing the previous calculation of new tokens with the new implementation, I don't understand why we need such a complex formula here. Why do we derive the number of new tokens from the previous tokens? I would understand if this were simply 1, but I don't see why we go through the total token count and the next position.
Can you point to decode cases where this is not 1?
Previously, `len(token_ids)` was the combined prompt and decode token count. This diff doesn't change anything.
I think we have future support for speculative decoding in mind here; there we may generate multiple tokens in one decoding step.
> This diff doesn't change anything

I absolutely agree with this.

> I think we have future support for speculative decoding in mind here

I can hardly imagine decoding several tokens per step without a significant modification of the logic. Again, I'm not proposing a change right now, but as it stands this could just be `extend(gen_seq.seq_id, 1)`.
Yes, it's going to be the next big project. It will be very cool.
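A tiny sketch of the counting logic under discussion, with hypothetical names mirroring the snippet above (`token_ids` holds prompt plus generated tokens, `next_start_position` is the index of the first token not yet covered by the cache):

```python
def num_new_tokens(token_ids: list, next_start_position: int) -> int:
    # In an ordinary decode step exactly one token was appended, so this evaluates to 1.
    # Deriving the count from positions (rather than hard-coding 1) leaves room for
    # speculative decoding, where several tokens may be accepted in a single step.
    return len(token_ids) - next_start_position

# Ordinary decode: 10 prompt tokens + 3 generated so far, cache already covers the first 12.
assert num_new_tokens(list(range(13)), 12) == 1

# Hypothetical speculative step accepting 3 tokens at once.
assert num_new_tokens(list(range(16)), 13) == 3
```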
```python
    self.current_batch[request_to_remove.request_id].num_sequences == 1
), "Evicting a multi-sequence request is not supported."

# TODO(masahi): Properly support evicting a multi-sequence request
```
A disappointing solution. Let's at least change the algorithm for choosing the request to evict: select from the requests with `n == 1`, and only if no such request can be found, evict a request with `n > 1`.
I think this is a temporary solution until we land eviction support. I'm okay with it as a temporary measure if we can follow up quickly.
I'll follow up on this today. This is indeed a temporary solution, but even after we have a proper one, it's still good to make a best effort to avoid evicting a parallel-sampling request.
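A sketch of the candidate-selection heuristic suggested in this thread: prefer evicting single-sequence requests and fall back to a parallel-sampling request only when nothing else is left. The `RequestState` shape and the recency ordering are assumptions for illustration, not the engine's data structures:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class RequestState:
    num_sequences: int  # n == 1 for ordinary requests, n > 1 for parallel sampling

def pick_request_to_evict(current_batch: Dict[str, RequestState]) -> Optional[str]:
    # Prefer a request with a single sequence, since evicting it only requires
    # re-computing one sequence's KV cache later.
    for request_id, state in reversed(list(current_batch.items())):
        if state.num_sequences == 1:
            return request_id
    # Only if every request in the batch is a parallel-sampling request do we
    # (reluctantly) pick one of those.
    return next(reversed(list(current_batch)), None)
```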
```python
):
    # This sequence is trying to overwrite a prompt block shared with other sequences.

    # TODO(masahi): The engine should take into account this additional
```
The comment is not clear. The verification and handling seem correct. What additionally should the engine take into account, and in which case?
Thinking about it more, we probably don't need to worry about this since the engine conservatively assumes that each decode step can allocate one block for all sequences (correct me if I'm wrong about this). Whenever we hit this code path, we don't do allocation at L226. So we only allocate up to one block for all sequences per decode step in all code paths.
Won't we run into this assert if the request has …?
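For intuition, here is a simplified check for the situation flagged in this code path: the last prompt block is only partially filled when the prompt length is not divisible by the block size, so a decode sequence about to write into it must not do so in place while the block is still shared by all `n` samples. The names and the exact condition are illustrative, not the PR's implementation:

```python
def writes_into_shared_prompt_block(prompt_len: int, block_size: int, write_pos: int) -> bool:
    # The last prompt block is partially filled iff prompt_len % block_size != 0.
    # A decode token written at write_pos lands in that block when both positions
    # share the same block index; that block is shared by all samples, so the
    # sequence needs a private copy (or separate handling) before writing.
    if prompt_len % block_size == 0:
        return False
    last_prompt_block = (prompt_len - 1) // block_size
    return write_pos // block_size == last_prompt_block

# Example: block size 16, 20-token prompt -> block 1 holds tokens 16..19 and is only
# partially filled; the first decode token (position 20) would land in that block.
assert writes_into_shared_prompt_block(20, 16, 20) is True
assert writes_into_shared_prompt_block(32, 16, 32) is False
```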
Thank you @masahi for another great piece of work! Overall, LGTM. I'm also okay with the temporary solutions you mentioned for the sake of fast iteration, so I'd like us to merge this quickly and follow up quickly.
```python
    num_tokens % self.block_size != 0
)

if self.sliding_window:
```
Orthogonal to this PR, but what happens if we have more prompt tokens than the sliding window size?
That's a very common case with Mistral and long prompts. The code paths in
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/model/paged_cache_model.py#L65-L73
https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/model/paged_cache_model.py#L110-L114
are specifically for that case. Prompt blocks already wrap around before any decoding happens due to circular buffering, so at each decoding step we need to carefully determine which portion of the prompt blocks is still shared among the `n` samples.
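A rough sketch of that bookkeeping, under a simplified model in which the KV cache is a circular buffer of `sliding_window` token slots split into fixed-size blocks. The model and all names are assumptions for illustration, not the logic in paged_cache_model.py:

```python
def shared_prompt_blocks_remaining(prompt_len: int, tokens_generated: int,
                                   sliding_window: int, block_size: int) -> int:
    # At most `sliding_window` prompt tokens ever fit in the circular buffer.
    cached_prompt_tokens = min(prompt_len, sliding_window)
    # Once the window is full, each newly generated token overwrites the slot of the
    # oldest cached token, eating into the prompt region shared by the n samples.
    overwritten = max(0, cached_prompt_tokens + tokens_generated - sliding_window)
    still_shared = max(0, cached_prompt_tokens - overwritten)
    # Only blocks that are still fully intact remain safely shared.
    return still_shared // block_size

# 8k prompt, 4k window, 16-token blocks: the window is already full before decoding
# starts, and every decode step shrinks the shared prompt region by one token.
assert shared_prompt_blocks_remaining(8192, 0, 4096, 16) == 256
assert shared_prompt_blocks_remaining(8192, 16, 4096, 16) == 255
```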
The main idea is to separately manage prompt and decode tokens in the cache manager, so that the former can be shared among the `n` samples in a parallel-sampling request. We need to be careful not to overwrite a shared block, which can arise, for example, when the prompt token count is not divisible by the block size.

My implementation is very different from the one in vLLM, but section 4.4 of their paper https://arxiv.org/pdf/2309.06180.pdf provides good background for parallel sampling. For example, the case above, where the prompt token count is not divisible by the block size, is solved by copying partially-shared prompt blocks to each decode sequence, as described in the paper.
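To illustrate the sharing scheme, here is a minimal sketch in which all `n` decode sequences reference the same prompt block ids and only append private decode blocks. The data structures are purely illustrative, not the cache manager's actual ones:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SequenceBlockTable:
    prompt_blocks: List[int]                                # block ids shared by all n samples
    decode_blocks: List[int] = field(default_factory=list)  # private per-sample block ids

def allocate_parallel_samples(prompt_blocks: List[int], n: int) -> List[SequenceBlockTable]:
    # Every sample points at the *same* prompt block list, so the prompt KV cache is
    # stored only once; decode blocks are allocated separately for each sample.
    return [SequenceBlockTable(prompt_blocks=prompt_blocks) for _ in range(n)]

tables = allocate_parallel_samples(prompt_blocks=[0, 1, 2], n=4)
assert all(t.prompt_blocks is tables[0].prompt_blocks for t in tables)
```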
The SWA + parallel sampling case is difficult and I'm not happy with my current solution. It is not clean and not optimal in terms of free-block usage (I don't use the "reference count" of blocks that the paper describes and vLLM implements). This case is also difficult to test, so there could be more bugs. But I think this is in a reasonable state for a first cut.
Also importantly, evicting a parallel-sampling request is still not supported, since restoring its KV cache entries by recomputation is very challenging (vLLM doesn't support that either). I'll work on that later.
An example output from parallel sampling with SWA on an 8k prompt