Asynchronous worker communication and vllm integration (#3146)
* Added dummy async comm worker thread
* First version of async worker in frontend running
* [WIP] Running async worker but requests get corrupted if parallel
* First version running with thread feeding + async predict
* shorten vllm test time
* Added AsyncVLLMEngine
* Extend vllm test with multiple possible prompts
* Batch size = 1 and remove stream in test
* Switched vllm examples to async comm and added llama3 example
* Fix typo
* Corrected java file formatting
* Cleanup and silent chatty debug message
* Added multi-gpu support to vllm examples
* fix java format
* Remove debugging messages
* Fix async comm worker test
* Added cl_socket to fixture
* Added multi worker note to vllm example readme
* Disable tests
* Enable async worker comm test
* Debug CI
* Fix python version <= 3.9 issue in async worker
* Renamed async worker test
* Update frontend/server/src/main/java/org/pytorch/serve/wlm/AsyncBatchAggregator.java (remove job from jobs_in_backend on error; co-authored by Naman Nandan <[email protected]>)
* Unskip vllm example test
* Clean up async worker code
* Safely remove jobs from jobs_in_backend
* Let worker die if one of the threads in async service dies
* Add description of parallelLevel and parallelType=custom to docs/large_model_inference.md
* Added description of parallelLevel to model-archiver readme.md
* fix typo + added words
* Fix skip condition for vllm example test

Co-authored-by: Naman Nandan <[email protected]>
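Several items above (the async comm worker thread, AsyncVLLMEngine, and thread feeding + async predict) revolve around one idea: drive vLLM through its asynchronous engine so that many in-flight requests share a single continuously batched model. The sketch below is a minimal, hypothetical illustration of that pattern using vLLM's public `AsyncLLMEngine`; it is not the code added in this commit, and `handle_request` is an invented name.

```python
# Minimal, hypothetical sketch of the async-engine pattern (not the commit's
# actual AsyncVLLMEngine): each request becomes an asyncio task, and vLLM's
# continuous batching interleaves all in-flight requests on the same model.
import asyncio
import uuid

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="meta-llama/Meta-Llama-3-8B-Instruct")
)

async def handle_request(prompt: str) -> str:
    text = ""
    # generate() yields a stream of partial RequestOutputs, so tokens can be
    # forwarded to the client as soon as they are produced.
    async for output in engine.generate(
        prompt, SamplingParams(max_tokens=128), request_id=str(uuid.uuid4())
    ):
        text = output.outputs[0].text
    return text

print(asyncio.run(handle_request("A robot may not injure a human being")))
```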
1 parent: 4c96e6f
Commit: 5f3df71

Showing 28 changed files with 1,267 additions and 186 deletions.
@@ -0,0 +1,45 @@

# Example showing inference with vLLM on LoRA model

This is an example showing how to integrate [vLLM](https://github.com/vllm-project/vllm) with TorchServe and run inference on the model `meta-llama/Meta-Llama-3-8B-Instruct` with continuous batching.
This example supports distributed inference by following [these instructions](../Readme.md#distributed-inference).

### Step 1: Download Model from HuggingFace

Login with a HuggingFace account:

```bash
huggingface-cli login
# or using an environment variable
huggingface-cli login --token $HUGGINGFACE_TOKEN
```

```bash
python ../../utils/Download_model.py --model_path model --model_name meta-llama/Meta-Llama-3-8B-Instruct --use_auth_token True
```
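If the download succeeds, the weights land under `./model` in the HuggingFace hub cache layout; that snapshot directory is what `model_path` in `model-config.yaml` points at (your snapshot hash may differ). A quick way to check:

```bash
# List the downloaded snapshot(s); the path mirrors the HF hub cache layout.
ls model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/
```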

### Step 2: Generate model artifacts

Add the downloaded path to "model_path:" in `model-config.yaml`, then run the following. The `mv` places the downloaded weights inside the generated archive folder.

```bash
torch-model-archiver --model-name llama3-8b --version 1.0 --handler ../base_vllm_handler.py --config-file model-config.yaml -r ../requirements.txt --archive-format no-archive
mv model llama3-8b
```

### Step 3: Add the model artifacts to model store

```bash
mkdir model_store
mv llama3-8b model_store
```

### Step 4: Start torchserve

```bash
torchserve --start --ncs --ts-config ../config.properties --model-store model_store --models llama3-8b
```
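Before sending traffic, it can help to confirm the model registered and a worker is up. A hedged check, assuming the default management port 8081 is not overridden in `../config.properties`:

```bash
# Query TorchServe's management API for the model's worker status.
curl http://localhost:8081/models/llama3-8b
```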

### Step 5: Run inference

```bash
python ../../utils/test_llm_streaming_response.py -o 50 -t 2 -n 4 --prompt-text "@prompt.json" --prompt-json
```
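The test script above fans out several concurrent streaming requests. For a single ad-hoc request, TorchServe's inference API can also be called directly; a hedged example, assuming the default inference port 8080:

```bash
# Send prompt.json to the model; --no-buffer prints streamed chunks as they arrive.
curl --no-buffer -H "Content-Type: application/json" \
     -X POST http://localhost:8080/predictions/llama3-8b \
     -d @prompt.json
```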
@@ -0,0 +1,13 @@

# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 1200
deviceType: "gpu"
asyncCommunication: true

handler:
  model_path: "model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/"
  vllm_engine_config:
    max_num_seqs: 16
    max_model_len: 250
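The commit message above also mentions multi-GPU support and new documentation for `parallelLevel` and `parallelType=custom`. Purely as a hedged sketch of how that might combine with this config (the two frontend keys are described in docs/large_model_inference.md; `tensor_parallel_size` is vLLM's own engine argument; the exact combination here is an assumption, not taken from this commit):

```yaml
# Hedged sketch only: a multi-GPU variant of the config above.
# parallelType/parallelLevel are the TorchServe keys documented in this PR;
# tensor_parallel_size is vLLM's engine argument. Values are illustrative.
minWorkers: 1
maxWorkers: 1
parallelType: "custom"
parallelLevel: 4
asyncCommunication: true

handler:
  vllm_engine_config:
    tensor_parallel_size: 4
```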
@@ -0,0 +1,9 @@

{
  "prompt": "A robot may not injure a human being",
  "max_new_tokens": 50,
  "temperature": 0.8,
  "logprobs": 1,
  "prompt_logprobs": 1,
  "max_tokens": 128,
  "adapter": "adapter_1"
}
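Hedged illustration, not the code from this commit: a handler such as base_vllm_handler.py might translate the JSON fields above into vLLM's `SamplingParams` along these lines. The parameter names passed to `SamplingParams` are real vLLM arguments; the mapping itself (and the handling of "adapter", which would select a LoRA adapter) is an assumption.

```python
# Assumed mapping from request JSON to vLLM sampling parameters.
from vllm import SamplingParams

request = {
    "prompt": "A robot may not injure a human being",
    "temperature": 0.8,
    "logprobs": 1,
    "prompt_logprobs": 1,
    "max_tokens": 128,
}

sampling_params = SamplingParams(
    temperature=request["temperature"],
    max_tokens=request["max_tokens"],
    logprobs=request["logprobs"],
    prompt_logprobs=request["prompt_logprobs"],
)
```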