Deploying and Serving LLMs on Intel CPU/GPU/Gaudi

This guide provides detailed steps for deploying and serving LLMs on Intel CPU/GPU/Gaudi.

Setup

Please follow setup.md to set up the environment first.

Configure Serving Parameters

We provide preconfigured YAML files under inference/models for popular open-source models. You can customize a few settings, such as the resources used for serving.

To deploy on CPU, please make sure device is set to CPU and cpus_per_worker is set to the number of CPU cores each worker should use.

cpus_per_worker: 24
device: CPU

To deploy on GPU, please make sure device is set to GPU and gpus_per_worker is set to 1.

gpus_per_worker: 1
device: GPU

To deploy on Gaudi, please make sure device is set to HPU and hpus_per_worker is set to 1.

hpus_per_worker: 1
device: HPU

LLM-on-Ray also supports serving with DeepSpeed for automatic tensor parallelism (AutoTP) and with BigDL-LLM for INT4/FP4/INT8/FP8 quantization to reduce latency. You can follow the corresponding documents to enable them; a rough sketch is shown below.
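As a rough illustration only, enabling these options typically amounts to flipping flags in the model's conf file. The key names below (deepspeed, workers_per_group, bigdl) are assumptions based on the preconfigured files; treat the DeepSpeed and BigDL-LLM documents as authoritative.

# illustrative sketch; key names are assumptions, check the linked documents
deepspeed: true          # enable DeepSpeed AutoTP (assumed key)
workers_per_group: 2     # tensor-parallel degree per replica (assumed key)
bigdl: true              # enable BigDL-LLM low-bit inference (assumed key)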

Serving

OpenAI-compatible API

To deploy your model, execute the following command with the model's configuration file. This will create an OpenAI-compatible API (OpenAI API Reference) for serving.

python inference/serve.py --config_file <path to the conf file>

To deploy and serve multiple models concurrently, place all models' configuration files under inference/models and run python inference/serve.py without passing any conf file, as shown below.
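For example (the gpt2.yaml path is illustrative; substitute the conf file for your model):

# serve a single model from its conf file
python inference/serve.py --config_file inference/models/gpt2.yaml

# serve every model that has a conf file under inference/models
python inference/serve.py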

After deploying the model, you can access and test it in several ways:

# using curl
export ENDPOINT_URL=http://localhost:8000/v1
export MODEL_NAME=<model name from the conf file, e.g. gpt2>
curl $ENDPOINT_URL/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "'"$MODEL_NAME"'",
    "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}],
    "temperature": 0.7
    }'

# using requests library
python examples/inference/api_server_openai/query_http_requests.py

# using OpenAI SDK
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=$your_openai_api_key
python examples/inference/api_server_openai/query_openai_sdk.py
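
For reference, a minimal query in the spirit of query_openai_sdk.py might look like the sketch below (this is not the actual script). It assumes the pre-1.0 openai Python package, which picks up OPENAI_API_BASE and OPENAI_API_KEY from the environment variables exported above; the model name is illustrative and should match your conf file.

import openai  # pre-1.0 SDK; reads OPENAI_API_BASE and OPENAI_API_KEY from the environment

# minimal sketch, not the actual contents of query_openai_sdk.py
response = openai.ChatCompletion.create(
    model="gpt2",  # assumption: replace with the model name from your conf file
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])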

Serving a Model to a Simple Endpoint

The following command creates a simple endpoint for serving according to the port and route_prefix parameters in the conf file, for example http://127.0.0.1:8000/gpt2.

python inference/serve.py --config_file <path to the conf file> --serve_simple
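
The endpoint address comes from the conf file: the example URL above corresponds to settings along these lines.

port: 8000
route_prefix: /gpt2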

After deploying the model endpoint, you can access and test it by using the script below:

python inference/query_single.py --model_endpoint <the model endpoint URL>
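
For example, using the illustrative endpoint above:

python inference/query_single.py --model_endpoint http://127.0.0.1:8000/gpt2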