
Add vLLM and MII Deepspeed for Throughput Benchmarking #117

Conversation

@sunggg (Member) commented Dec 14, 2023

This PR integrates vLLM and DeepSpeed MII for convenient throughput benchmarking.
Usage:

python3 serve/benchmarks/benchmark_throughput.py --backend mlc-serve --local-id mixtral-8x7b-instruct-v0.1-q0f16-presharded-2gpu --dataset /opt/models/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 

python3 serve/benchmarks/benchmark_throughput.py --backend vllm --model /opt/models/mistral/mixtral-8x7b-instruct-v0.1 --num-shards 2 --dataset /opt/models/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 

python3 serve/benchmarks/benchmark_throughput.py --backend mii --model /opt/models/mistral/mixtral-8x7b-instruct-v0.1 --num-shards 2 --dataset /opt/models/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000
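For reference, here is a minimal, hypothetical sketch of how the vLLM backend path of such a throughput script can be wired; the function name, argument wiring, and token accounting are illustrative assumptions, not the PR's actual code:

# Hypothetical sketch of a vLLM backend runner; illustrative only.
import time

from vllm import LLM, SamplingParams

def run_vllm(model: str, num_shards: int, prompts: list, max_output_tokens: int) -> None:
    # Load the model with tensor parallelism matching --num-shards.
    llm = LLM(model=model, tensor_parallel_size=num_shards)
    # Greedy decoding with a fixed output length, mirroring --max-output-tokens.
    params = SamplingParams(temperature=0.0, max_tokens=max_output_tokens)
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    # Counts only generated tokens; the real script may also count prompt tokens.
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"Engine Throughput: {len(prompts) / elapsed:.2f} requests/s, "
          f"{generated / elapsed:.2f} tokens/s")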

Some performance results with the new benchmark script are below.
Currently, I have problems running DeepSpeed MII with multiple GPUs and running vLLM with Llama-family models.

Mixtral 8x7B fp16 on 2xH100
* MLC-Serve, Engine Throughput: 23.52 requests/s, 8997.83 tokens/s
* vLLM, Engine Throughput: 13.35 requests/s, 5109.44 tokens/s
* MII, AssertionError: Attempting to run with TP > 1 and not using a distributed launcher like deepspeed or torch.distributed

Llama 13B fp16 on 1xH100
* MLC-Serve, Engine Throughput: 18.92 requests/s, 6830.19 tokens/s
* vLLM, cublas error
* MII, Engine Throughput: 14.79 requests/s, 5339.59 tokens/s

With this PR, the numbers are not directly comparable to the previous ones because:

  • This PR introduces --greedy-sampling-ratio to account for the performance of random sampling, which is a more expensive path than greedy decoding (see the sketch after this list). Prior to this PR, the --greedy-sampling-ratio was effectively 0.0.
  • This PR introduces --max-output-tokens to force the generation length globally, since MII does not seem to support a per-request max-output-tokens.
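To make the first point concrete, here is a hypothetical helper showing what a --greedy-sampling-ratio can mean per request; it uses vLLM's SamplingParams purely as a concrete example and is not the PR's mlc-serve implementation:

# Hypothetical illustration of --greedy-sampling-ratio; not the PR's code.
import random

from vllm import SamplingParams

def pick_sampling_params(greedy_sampling_ratio: float, max_output_tokens: int) -> SamplingParams:
    # With probability greedy_sampling_ratio, use greedy decoding (temperature 0).
    if random.random() < greedy_sampling_ratio:
        return SamplingParams(temperature=0.0, max_tokens=max_output_tokens)
    # Otherwise use random sampling, the more expensive path being benchmarked.
    return SamplingParams(temperature=1.0, top_p=1.0, max_tokens=max_output_tokens)

With a ratio of 0.0, every request takes the random-sampling path, which matches the pre-PR behavior described above.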

A follow-up PR will add a latency benchmark.

@sunggg sunggg force-pushed the feature/2023-Dec/add-backends-for-benchmarking branch from c0c390f to fb29bef Compare January 8, 2024 16:33
@sunggg sunggg changed the title [Draft - Do Not Merge] Add VLLM and MII Deepspeed for Throughput Benchmarking Add VLLM and MII Deepspeed for Throughput Benchmarking Jan 8, 2024
@sunggg sunggg marked this pull request as ready for review January 8, 2024 16:34
@sunggg sunggg changed the title Add VLLM and MII Deepspeed for Throughput Benchmarking Add vLLM and MII Deepspeed for Throughput Benchmarking Jan 8, 2024
@sunggg sunggg force-pushed the feature/2023-Dec/add-backends-for-benchmarking branch from b45fda6 to 4801127 Compare January 8, 2024 20:07
@masahi (Member) commented Jan 8, 2024

Prior to this PR, the --greedy-sampling-ratio was 1.0.

I think it was 0 (no greedy sampling)

@masahi masahi merged commit 800e76d into octoml:batch-serving Jan 8, 2024
1 check passed
@sunggg (Member, Author) commented Jan 8, 2024

Oops, you are right. Let me fix my description. Thank you for the merge!

@sunggg sunggg mentioned this pull request Jan 8, 2024