This tutorial guides you through a minimal setup of the vLLM Production Stack using a single vLLM instance with the facebook/opt-125m
model. By the end of this tutorial, you will have a working vLLM deployment running in a GPU-enabled Kubernetes environment.
- A Kubernetes environment with GPU support. If not set up, follow the 00-install-kubernetes-env guide.
- Helm installed. Refer to the install-helm.sh script for instructions.
- kubectl installed. Refer to the install-kubectl.sh script for instructions.
- The vLLM Production Stack repository cloned locally.
- Basic familiarity with Kubernetes and Helm.
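If you want to confirm these prerequisites before proceeding, the commands below are a quick sanity check (a sketch; the nvidia.com/gpu resource name assumes the NVIDIA device plugin is installed on your cluster):

# Confirm the CLI tools are installed and the cluster is reachable
kubectl version --client
helm version
kubectl get nodes
# Confirm at least one node advertises a GPU resource (assumes the NVIDIA device plugin)
kubectl describe nodes | grep -i "nvidia.com/gpu"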
The vLLM Production Stack repository provides a predefined configuration file, values-01-minimal-example.yaml, located at tutorials/assets/values-01-minimal-example.yaml. This file contains the following content:
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"
    replicaCount: 1
    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1
    pvcStorage: "10Gi"
Explanation of the key fields:
- modelSpec: Defines the model configuration, including:
  - name: A name for the model deployment.
  - repository: Docker repository hosting the model image.
  - tag: Docker image tag.
  - modelURL: Specifies the LLM model to use.
- replicaCount: Sets the number of replicas to deploy.
- requestCPU and requestMemory: Specify the CPU and memory resource requests for the pod.
- requestGPU: Specifies the number of GPUs required.
- pvcStorage: Allocates persistent storage for the model.
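If you want to experiment with different settings, one approach is to copy the predefined file and edit the fields described above. The sketch below writes an illustrative variant to custom-values.yaml (the file name and the reduced CPU/memory requests are assumptions for demonstration, not values from the repository):

# Write a custom values file based on the structure above (illustrative values only)
cat > custom-values.yaml <<'EOF'
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"
    replicaCount: 1
    requestCPU: 4          # reduced CPU request (illustrative)
    requestMemory: "8Gi"   # reduced memory request (illustrative)
    requestGPU: 1
    pvcStorage: "10Gi"
EOF
# Pass the edited file to helm install later with: -f custom-values.yaml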
Note: If you intend to set up TWO vLLM pods, please refer to tutorials/assets/values-01-2pods-minimal-example.yaml.
Deploy the Helm chart using the predefined configuration file:
sudo helm repo add vllm https://vllm-project.github.io/production-stack
sudo helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml
Explanation of the command:
- vllm in the first command: The Helm repository.
- vllm in the second command: The name of the Helm release.
- -f tutorials/assets/values-01-minimal-example.yaml: Specifies the predefined configuration file.
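Once the install command returns, you can confirm that Helm registered the release (a quick check; the exact output depends on your chart version):

# List Helm releases; the vllm release should show a "deployed" status
sudo helm list
# Show the release status and any notes printed by the chart
sudo helm status vllm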
Monitor the deployment status using:
sudo kubectl get pods
Expected output:
- Pods for the vllm deployment should transition to the Ready and Running state.
NAME READY STATUS RESTARTS AGE
vllm-deployment-router-859d8fb668-2x2b7 1/1 Running 0 2m38s
vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs 1/1 Running 0 2m38s
Note: It may take some time for the containers to download the Docker images and LLM weights.
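If the pods stay in a non-Running state for a long time, the commands below can help you inspect progress (a sketch; replace the pod name with the one reported by kubectl get pods):

# Show pod events, e.g. image pull progress or scheduling problems
sudo kubectl describe pod vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs
# Stream the container logs to watch the model weights being downloaded
sudo kubectl logs -f vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs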
Expose the vllm-router-service port to the host machine:
sudo kubectl port-forward svc/vllm-router-service 30080:80
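If the port-forward command fails, you can first confirm that the router service exists and exposes port 80 (a quick check, assuming the default namespace):

sudo kubectl get svc vllm-router-service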
Test the stack's OpenAI-compatible API by querying the available models:
curl -o- http://localhost:30080/models
Expected output:
{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model",
      "created": 1737428424,
      "owned_by": "vllm",
      "root": null
    }
  ]
}
Send a query to the OpenAI-compatible /completions endpoint to generate a completion for a prompt:
curl -X POST http://localhost:30080/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'
Expected output:
{
  "id": "completion-id",
  "object": "text_completion",
  "created": 1737428424,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "text": " there was a brave knight who...",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}
This demonstrates the model generating a continuation for the provided prompt.
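You can vary the request body to explore the endpoint further. The sketch below adds the standard OpenAI temperature field; the prompt and values shown are illustrative:

curl -X POST http://localhost:30080/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "The capital of France is",
    "max_tokens": 5,
    "temperature": 0
  }'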
To remove the deployment, run:
sudo helm uninstall vllm
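Depending on the chart, the persistent volume claim created for model storage may remain after the release is removed. A quick check, and manual cleanup if needed, could look like this (the PVC name below is a placeholder to replace with the one actually listed):

# List persistent volume claims left behind by the release, if any
sudo kubectl get pvc
# Delete a leftover claim manually (replace <pvc-name> with the name listed above)
sudo kubectl delete pvc <pvc-name>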