# TensorRT-LLM Briton with meta-llama/Llama-3.2-3B-Instruct

This is a deployment of TensorRT-LLM Briton with meta-llama/Llama-3.2-3B-Instruct. Briton is Baseten's solution for production-grade TensorRT-LLM deployments of causal language models (e.g. Llama, Qwen, Mistral).

With Briton you get the following benefits by default:
- *Lowest-latency* inference, beating frameworks such as vLLM
- *Highest-throughput* inference, automatically using XQA kernels, paged KV caching, and in-flight batching
- *Distributed inference* for running large models (such as Llama-405B) tensor-parallel
- *JSON-schema-based structured output* for any model
- *Chunked prefill* for long generation tasks

Optionally, you can also enable:
- *Speculative decoding* using an external draft model or self-speculative decoding
- *FP8 quantization* for deployments on H100, H200, and L4 GPUs

## Examples

This deployment is specifically designed for the Hugging Face model [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).
Suitable models can be identified by the `ForCausalLM` suffix in the model name; currently supported families include Llama, Qwen, and Mistral. A quick way to check is sketched below.
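As a quick sanity check (a sketch, not part of this deployment), you can inspect a checkpoint's declared architectures with `transformers`. This assumes the package is installed; gated repos such as this one may require a Hugging Face login first.

```python
# Sketch: confirm a checkpoint is a *ForCausalLM model before deploying.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
print(config.architectures)  # expected: ['LlamaForCausalLM']
```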

meta-llama/Llama-3.2-3B-Instruct is a text-generation model, used to generate text given a prompt.
It is frequently used for chatbots, text completion, structured output, and more.
## Deployment with Truss

Before deployment:

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`

First, clone this repository:

```sh
git clone https://github.com/basetenlabs/truss-examples.git
cd 11-embeddings-reranker-classification-tensorrt/Briton-meta-llama-llama-3.2-3b-instruct
```

With `11-embeddings-reranker-classification-tensorrt/Briton-meta-llama-llama-3.2-3b-instruct` as your working directory, you can deploy the model with the following command. Paste your Baseten API key if prompted.

```sh
truss push --publish
# prints:
# ✨ Model Briton-meta-llama-llama-3.2-3b-instruct-truss-example was successfully pushed ✨
# 🪵 View logs for your deployment at https://app.baseten.co/models/yyyyyy/logs/xxxxxx
```
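Once the deployment is live, you can sanity-check the endpoint with a raw HTTP request. A minimal sketch, assuming the placeholder `model-xxxxxx` is replaced with your model ID and `BASETEN_API_KEY` is set in your environment:

```sh
# Sketch: call the OpenAI-compatible chat endpoint directly.
curl -s "https://model-xxxxxx.api.baseten.co/environments/production/sync/v1/chat/completions" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'
```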

## Call your model

### OpenAI-compatible inference

Briton is OpenAI-compatible, which means you can use the OpenAI client library to interact with the model.
```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1"
)

# Default completion
response_completion = client.completions.create(
    model="not_required",
    prompt="Q: Tell me everything about Baseten.co! A:",
    temperature=0.3,
    max_tokens=100,
)

# Chat completion
response_chat = client.chat.completions.create(
    model="not_required",
    messages=[
        {"role": "user", "content": "Tell me everything about Baseten.co!"}
    ],
    temperature=0.3,
    max_tokens=100,
)

# Structured output
from pydantic import BaseModel

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="not_required",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed

# If your model supports tool calling, you can use the following example:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Bogotá, Colombia"
                }
            },
            "required": [
                "location"
            ],
            "additionalProperties": False
        },
        "strict": True
    }
}]

completion = client.chat.completions.create(
    model="not_required",
    messages=[{"role": "user", "content": "What is the weather like in Paris today?"}],
    tools=tools
)

print(completion.choices[0].message.tool_calls)
```
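Briton also supports token streaming through the same client (the example input in the config below sets `stream: true`). A minimal sketch, reusing the `client` defined above:

```python
# Sketch: stream a chat completion token by token.
stream = client.chat.completions.create(
    model="not_required",
    messages=[{"role": "user", "content": "Write a haiku about low-latency inference."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta with the next slice of generated text.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```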

## Config.yaml

By default, the following configuration is used for this deployment. It uses `quantization_type: fp8_kv`. This is optional: remove the `quantization_type` field or set it to `no_quant` to run in float16/bfloat16 (a sketch of that variant follows the config below).
```yaml
build_commands: []
environment_variables: {}
external_package_dirs: []
model_metadata:
  example_model_input:
    max_tokens: 512
    messages:
    - content: Tell me everything you know about optimized inference.
      role: user
    stream: true
    temperature: 0.5
  tags:
  - openai-compatible
model_name: Briton-meta-llama-llama-3.2-3b-instruct-truss-example
python_version: py39
requirements: []
resources:
  accelerator: L4
  cpu: '1'
  memory: 10Gi
  use_gpu: true
secrets: {}
system_packages: []
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: meta-llama/Llama-3.2-3B-Instruct
      revision: main
      source: HF
    max_seq_len: 131072
    num_builder_gpus: 4
    plugin_configuration:
      use_fp8_context_fmha: true
    quantization_type: fp8_kv
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true
```
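For an unquantized (float16/bfloat16) build, here is a sketch of the `trt_llm` section with `no_quant`. It assumes the fp8-specific `plugin_configuration` entry and the `num_builder_gpus` override can be dropped along with the quantization:

```yaml
# Sketch: unquantized variant of the trt_llm section (float16/bfloat16).
trt_llm:
  build:
    base_model: llama
    checkpoint_repository:
      repo: meta-llama/Llama-3.2-3B-Instruct
      revision: main
      source: HF
    max_seq_len: 131072
    quantization_type: no_quant
    tensor_parallel_count: 1
  runtime:
    enable_chunked_context: true
```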

## Support

If you have any questions or need assistance, please open an issue in this repository or contact our support team.