
Triton Distributed v0.1.0

@nnshah1 released this 18 Jan 07:22 · 2b3ec74

Release Notes

We are pleased to introduce Triton Distributed, our next-generation inference serving framework designed to deliver exceptional performance for deploying AI workloads at data center scale. Triton Distributed features a new, modular, multi-node architecture with LLM extensions, including Disaggregated Serving, which intelligently separates LLM inference phases across distinct GPU devices and types. In early testing, Triton Distributed increased inference throughput by up to 2x when serving the popular open-source Llama 3.1 70B model on 2 nodes of 8xH100s (3K ISL / 150 OSL, FP8, and > 40 tokens/s/user).
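
The idea behind disaggregated serving is that the compute-bound prefill phase and the memory-bound decode phase have different hardware profiles, so they can run on separate GPU pools with the KV cache handed off between them. The sketch below illustrates that split only; all names (`PrefillWorker`, `DecodeWorker`, `Request`) are hypothetical and are not the Triton Distributed API.

```python
# Hypothetical sketch of disaggregated serving: prefill and decode run on
# separate worker pools, with the KV cache handed off between them.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: list   # input sequence (ISL)
    max_new_tokens: int   # output length (OSL)

class PrefillWorker:
    """Runs the compute-bound prefill phase on one GPU pool."""
    def prefill(self, req: Request):
        kv_cache = [f"kv({t})" for t in req.prompt_tokens]  # stand-in for real KV tensors
        first_token = req.prompt_tokens[-1] + 1             # stand-in for a sampled token
        return kv_cache, first_token

class DecodeWorker:
    """Runs the memory-bound decode phase on a different GPU pool."""
    def decode(self, kv_cache, first_token, max_new_tokens):
        tokens = [first_token]
        for _ in range(max_new_tokens - 1):
            kv_cache.append(f"kv({tokens[-1]})")            # cache grows one entry per step
            tokens.append(tokens[-1] + 1)                   # stand-in for a sampled token
        return tokens

# The two phases can now be scaled and scheduled independently.
req = Request(prompt_tokens=list(range(8)), max_new_tokens=4)
kv, tok = PrefillWorker().prefill(req)
print(DecodeWorker().decode(kv, tok, req.max_new_tokens))
```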

Triton Distributed will modularize the current Triton architecture into multiple composable components for improved scaling:

  • A modular and interchangeable API server to fit users’ needs
  • A Smart Router that routes requests intelligently based on capacity, KV cache, and user sessions (see the sketch after this list)
  • Independently scalable workers to resolve bottlenecks in the production pipeline
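
To make the Smart Router idea concrete, here is a minimal sketch of KV-cache- and capacity-aware routing. The scoring rule and the `Worker` fields are assumptions for illustration only, not the Smart Router's actual policy or API.

```python
# Illustrative routing policy: prefer the worker that already holds the
# request's prefix in its KV cache, then break ties by spare capacity.
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    in_flight: int        # current load (lower is better)
    cached_prefixes: set  # prefix hashes already resident in the KV cache

def route(request_prefix_hashes, workers):
    def score(w):
        overlap = len(request_prefix_hashes & w.cached_prefixes)
        return (overlap, -w.in_flight)
    return max(workers, key=score)

workers = [
    Worker("gpu-0", in_flight=3, cached_prefixes={"sys-prompt", "doc-123"}),
    Worker("gpu-1", in_flight=1, cached_prefixes=set()),
]
print(route({"sys-prompt"}, workers).name)  # -> gpu-0 (cache hit outweighs load)
```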

These modularized components are built on top of the Triton Distributed communication planes, which facilitate efficient communication of data, requests, and events. More details can be found in our design docs: 1) overview, 2) data, 3) request, and 4) events. The new components of Triton Distributed will be available incrementally as we develop the framework.
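
As a rough intuition for the plane separation (small control messages on the request plane, bulk payloads such as KV-cache blocks on the data plane, and state changes on the event plane), the following sketch is illustrative only; the class and method names are hypothetical and do not reflect Triton Distributed's API. See the design docs for the actual interfaces.

```python
# Illustrative-only separation of the three planes: requests carry small
# control messages, the data plane moves bulk payloads by reference, and
# the event plane publishes state changes to subscribers.
class RequestPlane:
    def __init__(self): self.queues = {}
    def send(self, component, msg): self.queues.setdefault(component, []).append(msg)
    def recv(self, component): return self.queues.get(component, []).pop(0)

class DataPlane:
    def __init__(self): self.store = {}
    def put(self, key, payload): self.store[key] = payload  # bulk payload stored by key
    def get(self, key): return self.store[key]

class EventPlane:
    def __init__(self): self.subscribers = {}
    def subscribe(self, topic, fn): self.subscribers.setdefault(topic, []).append(fn)
    def publish(self, topic, event):
        for fn in self.subscribers.get(topic, []):
            fn(event)

# A prefill worker publishes a "kv_ready" event and passes the KV cache by
# reference, so only a small handle travels on the request plane.
data, events, requests = DataPlane(), EventPlane(), RequestPlane()
events.subscribe("kv_ready", lambda e: requests.send("decode", e))
data.put("kv:req-42", [0.1, 0.2, 0.3])
events.publish("kv_ready", {"request": "req-42", "kv_ref": "kv:req-42"})
print(data.get(requests.recv("decode")["kv_ref"]))
```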

There will be two more 0.1 patch releases before 0.2. The timelines of the upcoming releases and their associated features are shown below:

Upcoming 0.1 Features

  • (v0.1.0) Public GitHub repository with README
  • (v0.1.0) Getting started tutorial
  • (v0.1.0) Initial Docker container build
  • (v0.1.0) Functional request and data planes
  • Functional disaggregated serving with vLLM (up to 2 nodes)
  • Functional KV cache aware routing
  • Functional serving with TRT-LLM
  • Performant KV cache aware routing
  • Performant disaggregated serving with vLLM

We would love your feedback on this release, as well as your contributions. This project is under the Apache 2.0 license; if you are interested in contributing, please don’t hesitate to create a PR or file an issue.