This guide provides detailed steps for deploying and serving LLMs on Intel CPU/GPU/Gaudi.
Please follow setup.md to set up the environment first.
We provide preconfigured yaml files in inference/models for popular open source models. You can customize a few configurations, such as the resources used for serving.
To deploy on CPU, please make sure `device` is set to CPU and `cpus_per_worker` is set to an appropriate number of CPU cores per worker.

```yaml
cpus_per_worker: 24
device: CPU
```
To deploy on GPU, please make sure `device` is set to GPU and `gpus_per_worker` is set to 1.

```yaml
gpus_per_worker: 1
device: GPU
```
To deploy on Gaudi, please make sure `device` is set to HPU and `hpus_per_worker` is set to 1.

```yaml
hpus_per_worker: 1
device: HPU
```
LLM-on-Ray also supports serving with DeepSpeed for AutoTP and with BigDL-LLM for INT4/FP4/INT8/FP8 to reduce latency. You can follow the corresponding documents to enable them.
To deploy your model, execute the following command with the model's configuration file. This will create an OpenAI-compatible API (OpenAI API Reference) for serving.

```bash
python inference/serve.py --config_file <path to the conf file>
```
To deploy and serve multiple models concurrently, place all models' configuration files under inference/models and directly run `python inference/serve.py` without passing any conf file.
After deploying the model, you can access and test it in many ways:
```bash
# using curl
export ENDPOINT_URL=http://localhost:8000/v1
export MODEL_NAME=<the model name in your conf file, e.g. gpt2>
curl $ENDPOINT_URL/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "'"$MODEL_NAME"'",
        "messages": [{"role": "assistant", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}],
        "temperature": 0.7
    }'
```
```bash
# using requests library
python examples/inference/api_server_openai/query_http_requests.py
```
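For a quick look at what such a query involves, here is a minimal sketch of the same chat request written directly against the requests library. The model name gpt2 is only a placeholder; use the model name from your conf file.

```python
# Minimal sketch: query the OpenAI-compatible chat/completions endpoint with requests.
import requests

endpoint_url = "http://localhost:8000/v1"
payload = {
    "model": "gpt2",  # placeholder; use the model name from your conf file
    "messages": [
        {"role": "assistant", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,
}
response = requests.post(f"{endpoint_url}/chat/completions", json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```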
```bash
# using OpenAI SDK
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=$your_openai_api_key
python examples/inference/api_server_openai/query_openai_sdk.py
```
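As a rough illustration, the sketch below sends the same chat request through the openai package. It assumes the pre-1.0 SDK, which reads OPENAI_API_BASE and OPENAI_API_KEY from the environment variables set above; gpt2 is again a placeholder model name.

```python
# Sketch assuming the pre-1.0 openai package, which picks up
# OPENAI_API_BASE and OPENAI_API_KEY from the environment.
import openai

response = openai.ChatCompletion.create(
    model="gpt2",  # placeholder; use the model name from your conf file
    messages=[
        {"role": "assistant", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
```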
You can also pass --serve_simple to create a simple endpoint for serving according to the `port` and `route_prefix` parameters in the conf file, for example: http://127.0.0.1:8000/gpt2.

```bash
python inference/serve.py --config_file <path to the conf file> --serve_simple
```
After deploying the model endpoint, you can access and test it by using the script below:

```bash
python inference/query_single.py --model_endpoint <the model endpoint URL>
```
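For reference, the script boils down to a plain HTTP POST against the simple endpoint. The sketch below is only an approximation: the `text` and `config` field names, and the keys inside `config`, are assumptions, so check inference/query_single.py for the exact request schema your version expects.

```python
# Hedged sketch of querying the simple endpoint directly.
import requests

# The endpoint URL comes from the port and route_prefix parameters in the conf file.
model_endpoint = "http://127.0.0.1:8000/gpt2"

# Assumed request schema: a "text" prompt plus an optional "config" dict.
# Verify the field names against inference/query_single.py before relying on them.
payload = {
    "text": "What is Ray Serve?",
    "config": {"max_new_tokens": 128},
}
response = requests.post(model_endpoint, json=payload, timeout=60)
response.raise_for_status()
print(response.text)
```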