Commit: add llama-3.2B

michaelfeil committed Feb 14, 2025
1 parent 1d0215c commit 0f861e8

Showing 6 changed files with 230 additions and 7 deletions.

11-embeddings-reranker-classification-tensorrt/Briton-meta-llama-llama-3.2-3b-instruct/README.md
@@ -0,0 +1,177 @@
# TensorRT-LLM Briton with meta-llama/Llama-3.2-3B-Instruct

This is a deployment for TensorRT-LLM Briton with meta-llama/Llama-3.2-3B-Instruct. Briton is Baseten's solution for production-grade deployments of causal language models (e.g. Llama, Qwen, Mistral) via TensorRT-LLM.

With Briton you get the following benefits by default:
- *Lowest-latency* inference, beating frameworks such as vLLM
- *Highest-throughput* inference, automatically using XQA kernels, paged KV caching and in-flight batching
- *Distributed inference*: run large models (such as Llama-405B) tensor-parallel
- *JSON-schema-based structured output* for any model
- *Chunked prefill* for long-context tasks

Optionally, you can also enable:
- *Speculative decoding*, using an external draft model or self-speculative decoding
- *FP8 quantization* for deployments on H100, H200 and L4 GPUs


## Examples
This deployment is specifically designed for the Hugging Face model [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).
Suitable models can be identified by the `ForCausalLM` suffix in their architecture name. We currently support Llama, Qwen and Mistral models, among others.
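
If you are unsure whether a particular checkpoint qualifies, you can inspect its configuration. The snippet below is a minimal sketch using `transformers` (an illustration, not part of the deployment; gated repositories such as this one may also require a Hugging Face access token):

```python
from transformers import AutoConfig

# Sketch: check whether a Hugging Face checkpoint exposes a *ForCausalLM architecture.
config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
print(config.architectures)  # e.g. ['LlamaForCausalLM']
print(any(arch.endswith("ForCausalLM") for arch in (config.architectures or [])))
```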

meta-llama/Llama-3.2-3B-Instruct is a text-generation model: given a prompt, it generates a continuation. It is frequently used for chatbots, text completion, structured output and more.


## Deployment with Truss

Before deployment:

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`


First, clone this repository:
```sh
git clone https://github.com/basetenlabs/truss-examples.git
cd 11-embeddings-reranker-classification-tensorrt/Briton-meta-llama-llama-3.2-3b-instruct
```

With `11-embeddings-reranker-classification-tensorrt/Briton-meta-llama-llama-3.2-3b-instruct` as your working directory, you can deploy the model with the following command. Paste your Baseten API key if prompted.

```sh
truss push --publish
# prints:
# ✨ Model Briton-meta-llama-llama-3.2-3b-instruct-truss-example was successfully pushed ✨
# 🪵 View logs for your deployment at https://app.baseten.co/models/yyyyyy/logs/xxxxxx
```

## Call your model

### OpenAI-compatible inference
Briton is OpenAI-compatible, which means you can use the OpenAI client library to interact with the model.

```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1",
)

# Default completion
response_completion = client.completions.create(
    model="not_required",
    prompt="Q: Tell me everything about Baseten.co! A:",
    temperature=0.3,
    max_tokens=100,
)

# Chat completion
response_chat = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": "Tell me everything about Baseten.co!"}
    ],
    temperature=0.3,
    max_tokens=100,
)

# Structured output
from pydantic import BaseModel

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="not_required",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed

# If your model supports tool calling, you can use the following example:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Bogotá, Colombia"
                }
            },
            "required": [
                "location"
            ],
            "additionalProperties": False
        },
        "strict": True
    }
}]

completion = client.chat.completions.create(
    model="not_required",
    messages=[{"role": "user", "content": "What is the weather like in Paris today?"}],
    tools=tools,
)

print(completion.choices[0].message.tool_calls)
```
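
### Streaming
Streaming is supported through the same OpenAI-compatible interface (the default `example_model_input` in `config.yaml` below uses `stream: true`). The snippet is a minimal sketch that reuses the `client` from the example above; the prompt and sampling values are placeholders:

```python
# Stream a chat completion token by token.
stream = client.chat.completions.create(
    model="not_required",
    messages=[{"role": "user", "content": "Tell me everything about Baseten.co!"}],
    temperature=0.5,
    max_tokens=512,
    stream=True,
)

for chunk in stream:
    # Each chunk carries a delta with the next piece of generated text.
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
```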


## Config.yaml
By default, the following configuration is used for this deployment. This config uses `quantization_type: fp8_kv`. Quantization is optional: remove the `quantization_type` field or set it to `no_quant` to run in float16/bfloat16.

```yaml
build_commands: []
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input:
    max_tokens: 512
    messages:
    - content: Tell me everything you know about optimized inference.
      role: user
    stream: true
    temperature: 0.5
  tags:
  - openai-compatible
model_name: Briton-meta-llama-llama-3.2-3b-instruct-truss-example
python_version: py39
requirements: []
resources:
  accelerator: L4
  cpu: '1'
  memory: 10Gi
  use_gpu: true
secrets: {}
system_packages: []
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: meta-llama/Llama-3.2-3B-Instruct
      revision: main
      source: HF
    max_seq_len: 131072
    num_builder_gpus: 4
    plugin_configuration:
      use_fp8_context_fmha: true
    quantization_type: fp8_kv
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true

```

## Support
If you have any questions or need assistance, please open an issue in this repository or contact our support team.

11-embeddings-reranker-classification-tensorrt/Briton-meta-llama-llama-3.2-3b-instruct/config.yaml
@@ -0,0 +1,38 @@
build_commands: []
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input:
    max_tokens: 512
    messages:
    - content: Tell me everything you know about optimized inference.
      role: user
    stream: true
    temperature: 0.5
  tags:
  - openai-compatible
model_name: Briton-meta-llama-llama-3.2-3b-instruct-truss-example
python_version: py39
requirements: []
resources:
  accelerator: L4
  cpu: '1'
  memory: 10Gi
  use_gpu: true
secrets: {}
system_packages: []
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: meta-llama/Llama-3.2-3B-Instruct
      revision: main
      source: HF
    max_seq_len: 131072
    num_builder_gpus: 4
    plugin_configuration:
      use_fp8_context_fmha: true
    quantization_type: fp8_kv
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true
@@ -149,7 +149,7 @@ model_name: Briton-tiiuae-falcon3-10b-instruct-truss-example
 python_version: py39
 requirements: []
 resources:
-  accelerator: L4
+  accelerator: L4:4
   cpu: '1'
   memory: 10Gi
   use_gpu: true
@@ -163,11 +163,10 @@ trt_llm:
       revision: main
       source: HF
     max_seq_len: 32768
-    num_builder_gpus: 4
     plugin_configuration:
       use_fp8_context_fmha: true
     quantization_type: fp8_kv
-    tensor_parallel_count: 1
+    tensor_parallel_count: 4
   runtime:
     enable_chunked_context: true

@@ -15,7 +15,7 @@ model_name: Briton-tiiuae-falcon3-10b-instruct-truss-example
 python_version: py39
 requirements: []
 resources:
-  accelerator: L4
+  accelerator: L4:4
   cpu: '1'
   memory: 10Gi
   use_gpu: true
@@ -29,10 +29,9 @@ trt_llm:
       revision: main
       source: HF
     max_seq_len: 32768
-    num_builder_gpus: 4
     plugin_configuration:
       use_fp8_context_fmha: true
     quantization_type: fp8_kv
-    tensor_parallel_count: 1
+    tensor_parallel_count: 4
   runtime:
     enable_chunked_context: true
1 change: 1 addition & 0 deletions 11-embeddings-reranker-classification-tensorrt/README.md
@@ -70,6 +70,7 @@ Examples:
 - [deepseek-ai/DeepSeek-R1-Distill-Llama-70B-Briton](https://github.com/basetenlabs/truss-examples/tree/main/11-embeddings-reranker-classification-tensorrt/Briton-deepseek-ai-deepseek-r1-distill-llama-70b)
 - [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B-Briton](https://github.com/basetenlabs/truss-examples/tree/main/11-embeddings-reranker-classification-tensorrt/Briton-deepseek-ai-deepseek-r1-distill-qwen-32b)
 - [meta-llama/Llama-3.1-405B-Briton](https://github.com/basetenlabs/truss-examples/tree/main/11-embeddings-reranker-classification-tensorrt/Briton-meta-llama-llama-3.1-405b)
+- [meta-llama/Llama-3.2-3B-Instruct-Briton](https://github.com/basetenlabs/truss-examples/tree/main/11-embeddings-reranker-classification-tensorrt/Briton-meta-llama-llama-3.2-3b-instruct)
 - [meta-llama/Llama-3.3-70B-Instruct-Briton](https://github.com/basetenlabs/truss-examples/tree/main/11-embeddings-reranker-classification-tensorrt/Briton-meta-llama-llama-3.3-70b-instruct)
 - [meta-llama/Llama-3.3-70B-Instruct-tp2-Briton](https://github.com/basetenlabs/truss-examples/tree/main/11-embeddings-reranker-classification-tensorrt/Briton-meta-llama-llama-3.3-70b-instruct-tp2)
 - [microsoft/phi-4-Briton](https://github.com/basetenlabs/truss-examples/tree/main/11-embeddings-reranker-classification-tensorrt/Briton-microsoft-phi-4)
@@ -906,6 +906,15 @@ def llamalike_config(
     ),
     is_gated=True,
 ), # meta-llama/Llama-3.1-405B tp8
+Deployment(
+    "meta-llama/Llama-3.2-3B-Instruct",
+    "meta-llama/Llama-3.2-3B-Instruct",
+    Accelerator.L4,
+    TextGen(),
+    solution=Briton(
+        trt_config=llamalike_config(repoid="meta-llama/Llama-3.2-3B-Instruct")
+    ),
+),
 Deployment(
     "meta-llama/Llama-3.1-405B",
     "meta-llama/Llama-3.1-405B",
@@ -977,7 +986,7 @@ def llamalike_config(
     Accelerator.L4,
     TextGen(),
     solution=Briton(
-        trt_config=llamalike_config(repoid="tiiuae/Falcon3-10B-Instruct")
+        trt_config=llamalike_config(repoid="tiiuae/Falcon3-10B-Instruct", tp=4)
     ),
     is_gated=True,
 ),