forked from intel/llm-on-ray
* add vllm_predictor
* add tests skeleton
* add tests skeleton
* add pytest.ini
* wip
* complete, debug wip
* nit
* nit
* nit
* complete generate supporting str and List[str]
* add model
* add streaming
* remove tests
* Add install-vllm-cpu script
* nit

  Signed-off-by: Wu, Xiaochang <[email protected]>

* nit

  Signed-off-by: Wu, Xiaochang <[email protected]>

* nit
* fix package inference
* update install script and add doc
* nit
* nit
* nit
* add dtype support
* nit
* nit
* nit
* add ci
* nit
* nit
* add libpthread-stubs0-dev
* fix install-vllm-cpu
* fix
* revert inference.inference_config
* debug ci
* debug ci
* debug ci
* debug ci
* debug ci
* debug ci
* debug ci
* debug ci
* update

---------

Signed-off-by: Wu, Xiaochang <[email protected]>
Showing 15 changed files with 309 additions and 28 deletions.
27 changes: 27 additions & 0 deletions
.github/workflows/config/llama-2-7b-chat-hf-vllm-fp32.yaml
@@ -0,0 +1,27 @@

```yaml
port: 8000
name: llama-2-7b-chat-hf-vllm
route_prefix: /llama-2-7b-chat-hf-vllm
cpus_per_worker: 24
gpus_per_worker: 0
deepspeed: false
vllm:
  enabled: true
  precision: fp32
workers_per_group: 2
device: "cpu"
ipex:
  enabled: false
  precision: bf16
model_description:
  model_id_or_path: meta-llama/Llama-2-7b-chat-hf
  tokenizer_name_or_path: meta-llama/Llama-2-7b-chat-hf
  chat_processor: ChatModelLLama
  prompt:
    intro: ''
    human_id: '[INST] {msg} [/INST]

'
    bot_id: ''
    stop_words: []
  config:
    use_auth_token: ''
```
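This FP32 variant is the config used by the new CI workflow. As a sketch (not part of this diff), it would presumably be served the same way as the bf16 config documented later in this commit, only with a different `--config_file`:

```bash
# Hypothetical invocation, mirroring the serve command from the vLLM doc below;
# only the config path differs.
python serve.py --config_file .github/workflows/config/llama-2-7b-chat-hf-vllm-fp32.yaml \
    --simple --keep_serve_terminal
```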
@@ -0,0 +1,42 @@

```dockerfile
# syntax=docker/dockerfile:1
FROM ubuntu:22.04

ENV LANG C.UTF-8

WORKDIR /root/llm-on-ray

RUN --mount=type=cache,target=/var/cache/apt apt-get update -y \
    && apt-get install -y build-essential cmake wget curl git vim htop ssh net-tools \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

ENV CONDA_DIR /opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
    /bin/bash ~/miniconda.sh -b -p /opt/conda
ENV PATH $CONDA_DIR/bin:$PATH

# setup env
SHELL ["/bin/bash", "--login", "-c"]

RUN --mount=type=cache,target=/opt/conda/pkgs conda init bash && \
    unset -f conda && \
    export PATH=$CONDA_DIR/bin/:${PATH} && \
    conda config --add channels intel && \
    conda install -y -c conda-forge python==3.9 gxx=12.3 gxx_linux-64=12.3

COPY ./pyproject.toml .
COPY ./dev/scripts/install-vllm-cpu.sh .

RUN mkdir ./finetune && mkdir ./inference

RUN --mount=type=cache,target=/root/.cache/pip pip install -e .[cpu] -f https://developer.intel.com/ipex-whl-stable-cpu \
    -f https://download.pytorch.org/whl/torch_stable.html

# Install vllm-cpu
# Activate base first for loading g++ envs ($CONDA_PREFIX/etc/conda/activate.d/*)
RUN --mount=type=cache,target=/root/.cache/pip \
    source /opt/conda/bin/activate base && ./install-vllm-cpu.sh

# TODO: workaround, remove this when fixed in vllm-cpu upstream
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install xformers
```
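This Dockerfile relies on BuildKit cache mounts (`RUN --mount=type=cache`), so it has to be built with BuildKit enabled. A minimal build sketch follows; the file's actual name and path in the repo are not shown in this diff, so `Dockerfile.vllm` and the image tag are placeholders:

```bash
# Hypothetical build command; "Dockerfile.vllm" and "llm-on-ray:vllm-cpu" are
# placeholders, not names taken from the repo.
DOCKER_BUILDKIT=1 docker build -f Dockerfile.vllm -t llm-on-ray:vllm-cpu .
```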
dev/scripts/install-vllm-cpu.sh

@@ -0,0 +1,20 @@

```bash
#!/usr/bin/env bash

# Check tools
[[ -n $(which g++) ]] || { echo "GNU C++ Compiler (g++) is not found!"; exit 1; }
[[ -n $(which pip) ]] || { echo "pip command is not found!"; exit 1; }

# g++ version should be >=12.3
version_greater_equal()
{
    printf '%s\n%s\n' "$2" "$1" | sort --check=quiet --version-sort
}
gcc_version=$(g++ -dumpversion)
echo
echo Current GNU C++ Compiler version: $gcc_version
echo
version_greater_equal "${gcc_version}" 12.3.0 || { echo "GNU C++ Compiler 12.3.0 or above is required!"; exit 1; }

# Install from source
MAX_JOBS=8 pip install -v git+https://github.com/bigPYJ1151/vllm@PR_Branch \
    -f https://download.pytorch.org/whl/torch_stable.html
```
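The `version_greater_equal` helper pipes `$2` then `$1` through `sort --version-sort --check`, which succeeds only when the two values are already in ascending order, i.e. when the detected g++ version is at least 12.3.0. A minimal sketch of running the installer outside Docker, assuming g++ >= 12.3 and pip are already on your PATH (for example via the conda toolchain installed in the Dockerfile above):

```bash
# The script performs these checks itself; shown here only to make the
# prerequisites explicit before kicking off the source build.
g++ -dumpversion                      # should print 12.3 or newer
bash dev/scripts/install-vllm-cpu.sh  # builds vLLM for CPU from source
```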
@@ -0,0 +1,43 @@

# Setting up vLLM For Intel CPU

__NOTICE: The support for vLLM is experimental and subject to change.__

## Install vLLM for Intel CPU

vLLM for CPU currently achieves its best performance on 4th Gen Intel® Xeon® Scalable processors (formerly codenamed Sapphire Rapids), and can run with FP32 precision on other Xeon processors.

Please run the following script to install vLLM for CPU into your current environment. A GNU C++ compiler version 12.3 or above is currently required to build and install it.

```bash
$ dev/scripts/install-vllm-cpu.sh
```

## Setup

Please follow the [Deploying and Serving LLMs on Intel CPU/GPU/Gaudi](serve.md) document to set up the rest of the environment.

## Run

### Serving

To serve a model with vLLM, run the following:

```bash
$ python serve.py --config_file inference/models/vllm/llama-2-7b-chat-hf-vllm.yaml --simple --keep_serve_terminal
```

In the above example, the `enabled` property under `vllm` is set to `true` in the config file to enable vLLM.

### Querying

To start a non-streaming query, run the following:

```bash
$ python examples/inference/api_server_simple/query_single.py --model_endpoint http://127.0.0.1:8000/llama-2-7b-chat-hf
```

To start a streaming query, run the following:

```bash
$ python examples/inference/api_server_simple/query_single.py --model_endpoint http://127.0.0.1:8000/llama-2-7b-chat-hf --streaming_response
```
@@ -0,0 +1,27 @@

```yaml
port: 8000
name: llama-2-7b-chat-hf
route_prefix: /llama-2-7b-chat-hf
cpus_per_worker: 24
gpus_per_worker: 0
deepspeed: false
vllm:
  enabled: true
  precision: bf16
workers_per_group: 2
device: "cpu"
ipex:
  enabled: false
  precision: bf16
model_description:
  model_id_or_path: meta-llama/Llama-2-7b-chat-hf
  tokenizer_name_or_path: meta-llama/Llama-2-7b-chat-hf
  chat_processor: ChatModelLLama
  prompt:
    intro: ''
    human_id: '[INST] {msg} [/INST]

'
    bot_id: ''
    stop_words: []
  config:
    use_auth_token: ''
```
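This appears to be the `inference/models/vllm/llama-2-7b-chat-hf-vllm.yaml` config referenced by the serve command in the doc above, although the file's path is not shown in this diff. Its `port` and `route_prefix` are what produce the `--model_endpoint` URL used in the query examples; a purely illustrative way to derive it, assuming that path guess is right and PyYAML is installed:

```bash
# Hypothetical sanity check: reconstruct the endpoint URL from the config.
python -c 'import yaml; c = yaml.safe_load(open("inference/models/vllm/llama-2-7b-chat-hf-vllm.yaml")); print("http://127.0.0.1:%s%s" % (c["port"], c["route_prefix"]))'
# -> http://127.0.0.1:8000/llama-2-7b-chat-hf
```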