
Add vLLM and MII Deepspeed for Throughput Benchmarking #117

Conversation

@sunggg (Member) commented Dec 14, 2023

This PR integrates vLLM and DeepSpeed MII for convenient throughput benchmarking.
Usage:

python3 serve/benchmarks/benchmark_throughput.py --backend mlc-serve --local-id mixtral-8x7b-instruct-v0.1-q0f16-presharded-2gpu --dataset /opt/models/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 

python3 serve/benchmarks/benchmark_throughput.py --backend vllm --model /opt/models/mistral/mixtral-8x7b-instruct-v0.1 --num-shards 2 --dataset /opt/models/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 

python3 serve/benchmarks/benchmark_throughput.py --backend mii --model /opt/models/mistral/mixtral-8x7b-instruct-v0.1 --num-shards 2 --dataset /opt/models/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000
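For reference, here is a minimal, hypothetical sketch of how the vLLM backend path of such a throughput script can be wired; the function name, argument wiring, and token accounting are illustrative assumptions, not the PR's actual code:

# Hypothetical sketch of a vLLM backend runner; illustrative only.
import time

from vllm import LLM, SamplingParams

def run_vllm(model: str, num_shards: int, prompts: list, max_output_tokens: int) -> None:
    # Load the model with tensor parallelism matching --num-shards.
    llm = LLM(model=model, tensor_parallel_size=num_shards)
    # Greedy decoding with a fixed output length, mirroring --max-output-tokens.
    params = SamplingParams(temperature=0.0, max_tokens=max_output_tokens)
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    # Counts only generated tokens; the real script may also count prompt tokens.
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"Engine Throughput: {len(prompts) / elapsed:.2f} requests/s, "
          f"{generated / elapsed:.2f} tokens/s")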

Some performance results with the new benchmark script are below.
Currently, I have problems running DeepSpeed MII with multiple GPUs and running vLLM with Llama-family models.

Mixtral 8x7B fp16 on 2xH100
* MLC-Serve, Engine Throughput: 23.52 requests/s, 8997.83 tokens/s
* vLLM, Engine Throughput: 13.35 requests/s, 5109.44 tokens/s
* MII, AssertionError: Attempting to run with TP > 1 and not using a distributed launcher like deepspeed or torch.distributed

Llama 13B fp16 on 1xH100
* MLC-Serve, Engine Throughput: 18.92 requests/s, 6830.19 tokens/s
* vLLM, cublas error
* MII, Engine Throughput: 14.79 requests/s, 5339.59 tokens/s

With this PR, the numbers are not directly comparable to the previous ones because:

  • This PR introduces --greedy-sampling-ratio to account for the performance of random sampling, which is a more expensive path than greedy decoding (see the sketch after this list). Prior to this PR, the --greedy-sampling-ratio was effectively 0.0.
  • This PR introduces --max-output-tokens to force the generation length globally, since MII does not seem to support a per-request max-output-tokens.
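To make the first point concrete, here is a hypothetical helper showing what a --greedy-sampling-ratio can mean per request; it uses vLLM's SamplingParams purely as a concrete example and is not the PR's mlc-serve implementation:

# Hypothetical illustration of --greedy-sampling-ratio; not the PR's code.
import random

from vllm import SamplingParams

def pick_sampling_params(greedy_sampling_ratio: float, max_output_tokens: int) -> SamplingParams:
    # With probability greedy_sampling_ratio, use greedy decoding (temperature 0).
    if random.random() < greedy_sampling_ratio:
        return SamplingParams(temperature=0.0, max_tokens=max_output_tokens)
    # Otherwise use random sampling, the more expensive path being benchmarked.
    return SamplingParams(temperature=1.0, top_p=1.0, max_tokens=max_output_tokens)

With a ratio of 0.0, every request takes the random-sampling path, which matches the pre-PR behavior described above.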

A follow-up PR will add a latency benchmark.

@sunggg sunggg force-pushed the feature/2023-Dec/add-backends-for-benchmarking branch from c0c390f to fb29bef Compare January 8, 2024 16:33
@sunggg sunggg changed the title [Draft - Do Not Merge] Add VLLM and MII Deepspeed for Throughput Benchmarking Add VLLM and MII Deepspeed for Throughput Benchmarking Jan 8, 2024
@sunggg sunggg marked this pull request as ready for review January 8, 2024 16:34
@sunggg sunggg changed the title Add VLLM and MII Deepspeed for Throughput Benchmarking Add vLLM and MII Deepspeed for Throughput Benchmarking Jan 8, 2024
@sunggg sunggg force-pushed the feature/2023-Dec/add-backends-for-benchmarking branch from b45fda6 to 4801127 Compare January 8, 2024 20:07
@masahi (Member) commented Jan 8, 2024

Prior to this PR, the --greedy-sampling-ratio was 1.0.

I think it was 0 (no greedy sampling)

@masahi masahi merged commit 800e76d into octoml:batch-serving Jan 8, 2024
1 check passed
@sunggg (Member, Author) commented Jan 8, 2024

Oops, you are right. Let me fix my description. Thank you for the merge!

@sunggg sunggg mentioned this pull request Jan 8, 2024