# TensorRT-LLM Briton with meta-llama/Llama-3.2-3B-Instruct

This is a deployment of TensorRT-LLM Briton with meta-llama/Llama-3.2-3B-Instruct. Briton is Baseten's solution for production-grade TensorRT-LLM deployments of causal language models (e.g. Llama, Qwen, Mistral).

With Briton you get the following benefits by default:
- *Lowest-latency* inference, beating frameworks such as vLLM
- *Highest-throughput* inference, automatically using XQA kernels, paged KV caching, and in-flight batching
- *Distributed inference* for running large models (such as Llama-405B) tensor-parallel
- *JSON-schema-based structured output* for any model
- *Chunked prefill* for long generation tasks

Optionally, you can also enable:
- *Speculative decoding* using an external draft model or self-speculative decoding
- *FP8 quantization* for deployments on H100, H200, and L4 GPUs

## Examples

This deployment is specifically designed for the Hugging Face model [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).
Suitable models can be identified by the `ForCausalLM` suffix in the model name; currently supported families include Llama, Qwen, and Mistral. A quick way to check is sketched below.
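As a quick sanity check (a sketch, not part of this deployment), you can inspect a checkpoint's declared architectures with `transformers`. This assumes the package is installed; gated repos such as this one may require a Hugging Face login first.

```python
# Sketch: confirm a checkpoint is a *ForCausalLM model before deploying.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
print(config.architectures)  # expected: ['LlamaForCausalLM']
```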

meta-llama/Llama-3.2-3B-Instruct is a text-generation model, used to generate text given a prompt.
It is frequently used for chatbots, text completion, structured output, and more.
## Deployment with Truss

Before deployment:

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`

First, clone this repository:

```sh
git clone https://github.com/basetenlabs/truss-examples.git
cd 11-embeddings-reranker-classification-tensorrt/Briton-meta-llama-llama-3.2-3b-instruct
```

With `11-embeddings-reranker-classification-tensorrt/Briton-meta-llama-llama-3.2-3b-instruct` as your working directory, you can deploy the model with the following command. Paste your Baseten API key if prompted.

```sh
truss push --publish
# prints:
# ✨ Model Briton-meta-llama-llama-3.2-3b-instruct-truss-example was successfully pushed ✨
# 🪵 View logs for your deployment at https://app.baseten.co/models/yyyyyy/logs/xxxxxx
```
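Once the deployment is live, you can sanity-check the endpoint with a raw HTTP request. A minimal sketch, assuming the placeholder `model-xxxxxx` is replaced with your model ID and `BASETEN_API_KEY` is set in your environment:

```sh
# Sketch: call the OpenAI-compatible chat endpoint directly.
curl -s "https://model-xxxxxx.api.baseten.co/environments/production/sync/v1/chat/completions" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'
```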

## Call your model

### OpenAI-compatible inference

Briton is OpenAI-compatible, which means you can use the OpenAI client library to interact with the model.
```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# Default completion
response_completion = client.completions.create(
    model="not_required",
    prompt="Q: Tell me everything about Baseten.co! A:",
    temperature=0.3,
    max_tokens=100,
)

# Chat completion
response_chat = client.chat.completions.create(
    model="not_required",
    messages=[
        {"role": "user", "content": "Tell me everything about Baseten.co!"}
    ],
    temperature=0.3,
    max_tokens=100,
)

# Structured output
from pydantic import BaseModel

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="not_required",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed

# If your model supports tool calling, you can use the following example:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Bogotá, Colombia"
                }
            },
            "required": [
                "location"
            ],
            "additionalProperties": False
        },
        "strict": True
    }
}]

completion = client.chat.completions.create(
    model="not_required",
    messages=[{"role": "user", "content": "What is the weather like in Paris today?"}],
    tools=tools
)

print(completion.choices[0].message.tool_calls)
```
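Briton also supports token streaming through the same client (the example input in the config below sets `stream: true`). A minimal sketch, reusing the `client` defined above:

```python
# Sketch: stream a chat completion token by token.
stream = client.chat.completions.create(
    model="not_required",
    messages=[{"role": "user", "content": "Write a haiku about low-latency inference."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta with the next slice of generated text.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```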

## Config.yaml

By default, the following configuration is used for this deployment. It uses `quantization_type: fp8_kv`. This is optional: remove the `quantization_type` field or set it to `no_quant` to run in float16/bfloat16 (a sketch of that variant follows the config below).
```yaml
build_commands: []
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input:
    max_tokens: 512
    messages:
    - content: Tell me everything you know about optimized inference.
      role: user
    stream: true
    temperature: 0.5
  tags:
  - openai-compatible
model_name: Briton-meta-llama-llama-3.2-3b-instruct-truss-example
python_version: py39
requirements: []
resources:
  accelerator: L4
  cpu: '1'
  memory: 10Gi
  use_gpu: true
secrets: {}
system_packages: []
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: meta-llama/Llama-3.2-3B-Instruct
      revision: main
      source: HF
    max_seq_len: 131072
    num_builder_gpus: 4
    plugin_configuration:
      use_fp8_context_fmha: true
    quantization_type: fp8_kv
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true
```
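For an unquantized (float16/bfloat16) build, here is a sketch of the `trt_llm` section with `no_quant`. It assumes the fp8-specific `plugin_configuration` entry and the `num_builder_gpus` override can be dropped along with the quantization:

```yaml
# Sketch: unquantized variant of the trt_llm section (float16/bfloat16).
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: meta-llama/Llama-3.2-3B-Instruct
      revision: main
      source: HF
    max_seq_len: 131072
    quantization_type: no_quant
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true
```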

## Support

If you have any questions or need assistance, please open an issue in this repository or contact our support team.