
Triton Distributed v0.1.0

@nnshah1 released this 18 Jan 07:22 · 2b3ec74

Release Notes

We are pleased to introduce Triton Distributed, our next-generation inference serving framework designed to deliver exceptional performance for deploying AI workloads at data center scale. Triton Distributed features a new, modular, multi-node architecture with LLM extensions, including Disaggregated Serving, which intelligently separates LLM inference phases across distinct GPU devices and types. In early testing, Triton Distributed increased inference throughput by up to 2x when serving the popular open-source Llama 3.1 70B model on 2 nodes of 8xH100s (3K ISL / 150 OSL, FP8, and > 40 tokens/s/user).
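
The idea behind disaggregated serving is that the compute-bound prefill phase and the memory-bound decode phase have different hardware profiles, so they can run on separate GPU pools with the KV cache handed off between them. The sketch below illustrates that split only; all names (`PrefillWorker`, `DecodeWorker`, `Request`) are hypothetical and are not the Triton Distributed API.

```python
# Hypothetical sketch of disaggregated serving: prefill and decode run on
# separate worker pools, with the KV cache handed off between them.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: list   # input sequence (ISL)
    max_new_tokens: int   # output length (OSL)

class PrefillWorker:
    """Runs the compute-bound prefill phase on one GPU pool."""
    def prefill(self, req: Request):
        kv_cache = [f"kv({t})" for t in req.prompt_tokens]  # stand-in for real KV tensors
        first_token = req.prompt_tokens[-1] + 1             # stand-in for a sampled token
        return kv_cache, first_token

class DecodeWorker:
    """Runs the memory-bound decode phase on a different GPU pool."""
    def decode(self, kv_cache, first_token, max_new_tokens):
        tokens = [first_token]
        for _ in range(max_new_tokens - 1):
            kv_cache.append(f"kv({tokens[-1]})")            # cache grows one entry per step
            tokens.append(tokens[-1] + 1)                   # stand-in for a sampled token
        return tokens

# The two phases can now be scaled and scheduled independently.
req = Request(prompt_tokens=list(range(8)), max_new_tokens=4)
kv, tok = PrefillWorker().prefill(req)
print(DecodeWorker().decode(kv, tok, req.max_new_tokens))
```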

Triton Distributed will modularize the current Triton architecture into multiple composable components for improved scaling:

  • A modular and interchangeable API server to fit users’ needs
  • A Smart Router that routes requests intelligently based on capacity, KV cache, and user sessions (see the sketch after this list)
  • Independently scalable workers to resolve bottlenecks in the production pipeline
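
To make the Smart Router idea concrete, here is a minimal sketch of KV-cache- and capacity-aware routing. The scoring rule and the `Worker` fields are assumptions for illustration only, not the Smart Router's actual policy or API.

```python
# Illustrative routing policy: prefer the worker that already holds the
# request's prefix in its KV cache, then break ties by spare capacity.
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    in_flight: int        # current load (lower is better)
    cached_prefixes: set  # prefix hashes already resident in the KV cache

def route(request_prefix_hashes, workers):
    def score(w):
        overlap = len(request_prefix_hashes & w.cached_prefixes)
        return (overlap, -w.in_flight)
    return max(workers, key=score)

workers = [
    Worker("gpu-0", in_flight=3, cached_prefixes={"sys-prompt", "doc-123"}),
    Worker("gpu-1", in_flight=1, cached_prefixes=set()),
]
print(route({"sys-prompt"}, workers).name)  # -> gpu-0 (cache hit outweighs load)
```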

These modularized components are built on top of the Triton Distributed communication planes, which facilitate efficient communication of data, requests, and events. More details can be found in our design docs: 1) overview, 2) data, 3) request, and 4) events. The new components of Triton Distributed will be available incrementally as we develop the framework.
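
As a rough intuition for the plane separation (small control messages on the request plane, bulk payloads such as KV-cache blocks on the data plane, and state changes on the event plane), the following sketch is illustrative only; the class and method names are hypothetical and do not reflect Triton Distributed's API. See the design docs for the actual interfaces.

```python
# Illustrative-only separation of the three planes: requests carry small
# control messages, the data plane moves bulk payloads by reference, and
# the event plane publishes state changes to subscribers.
class RequestPlane:
    def __init__(self): self.queues = {}
    def send(self, component, msg): self.queues.setdefault(component, []).append(msg)
    def recv(self, component): return self.queues.get(component, []).pop(0)

class DataPlane:
    def __init__(self): self.store = {}
    def put(self, key, payload): self.store[key] = payload  # bulk payload stored by key
    def get(self, key): return self.store[key]

class EventPlane:
    def __init__(self): self.subscribers = {}
    def subscribe(self, topic, fn): self.subscribers.setdefault(topic, []).append(fn)
    def publish(self, topic, event):
        for fn in self.subscribers.get(topic, []):
            fn(event)

# A prefill worker publishes a "kv_ready" event and passes the KV cache by
# reference, so only a small handle travels on the request plane.
data, events, requests = DataPlane(), EventPlane(), RequestPlane()
events.subscribe("kv_ready", lambda e: requests.send("decode", e))
data.put("kv:req-42", [0.1, 0.2, 0.3])
events.publish("kv_ready", {"request": "req-42", "kv_ref": "kv:req-42"})
print(data.get(requests.recv("decode")["kv_ref"]))
```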

There will be two more 0.1 patch releases before 0.2. The timelines of the upcoming releases and their associated features are shown below:

Upcoming 0.1 Features

  • (v0.1.0) Public GitHub repository with README
  • (v0.1.0) Getting started tutorial
  • (v0.1.0) Initial Docker container build
  • (v0.1.0) Functional request and data planes
  • Functional disaggregated serving with vLLM (up to 2 nodes)
  • Functional KV cache aware routing
  • Functional serving with TRT-LLM
  • Performant KV cache aware routing
  • Performant disaggregated serving with vLLM

We would love your feedback on this release, as well as your contributions. This project is under the Apache 2.0 license; if you are interested in contributing, please don’t hesitate to create a PR or file an issue.