This release brings support for Large Language Model (LLM) serving, starting with the Llama 3.1 family of models from Meta on AMD Instinct™ MI300X Accelerators.
The full vertically-integrated SHARK AI stack is now available for deploying machine learning models:
- The `sharktank` package builds bridges from popular machine learning models, coming from existing model repositories like Hugging Face and frameworks like llama.cpp, to the IREE compiler. This model export and compilation pipeline features whole-program optimization and efficient cross-target code generation without depending on operator libraries.
- The `shortfin` package provides serving applications built on top of the IREE runtime, with integration points to other ecosystem projects like the SGLang frontend. These applications are lightweight, portable, and packed with optimizations to improve serving efficiency.
Together, these packages simplify model deployment by eliminating the need for complex Docker containers or vendor-specific libraries while continuing to provide competitive performance and flexibility. Here are some metrics:
- The native `shortfin` serving library, including a GPU runtime, fits in less than 2MB.
- The self-contained compiler fits within 70MB. Once a model is compiled, it can be deployed using `shortfin` with no additional dependencies.
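To illustrate that lightweight deployment story, here is a hedged sketch of a from-scratch install. Everything comes from PyPI into a plain virtual environment, with no container image or vendor SDK in the loop; the `apps` extra shown follows the project's install docs and may differ between releases:

```bash
# Install the SHARK AI packages from PyPI into a fresh virtual environment.
# No Docker image or vendor-specific SDK is required.
python -m venv .venv && source .venv/bin/activate

# The `apps` extra (per the install docs; verify for your release) pulls in
# the shortfin serving applications alongside sharktank and the compiler.
pip install "shark-ai[apps]"
```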
## Highlights in this release
### Llama 3.1 serving
Guides for serving Llama 3.1 models are available here:
This release focuses on the 8B and 70B model sizes running on a single GPU. Support for the 405B model size and for multi-GPU serving is currently experimental.
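As a quick smoke test once a server is running, a generation request can be sent over HTTP. This sketch assumes the server is listening on `localhost:8000`; the `/generate` endpoint and the `text`/`sampling_params` payload fields follow the serving guides and may change between releases:

```bash
# Send a simple completion request to a running shortfin LLM server.
# Endpoint path and JSON fields are taken from the serving guide;
# verify them against the guide for your release.
curl http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
        "text": "Name the capital of the United States.",
        "sampling_params": {"max_completion_tokens": 50}
      }'
```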
### Stable Diffusion XL (SDXL) enhancements
- The previous release added initial support for serving SDXL through `shortfin`. This release contains several performance improvements for the SDXL model and for `shortfin` serving.
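For reference, starting the SDXL server looks roughly like the following; the flags shown (`--device`, `--build_preference`) are taken from the SDXL serving guide and should be treated as illustrative:

```bash
# Start the shortfin SDXL server on an AMD GPU, preferring precompiled
# model artifacts. Flags are illustrative; consult the SDXL guide.
python -m shortfin_apps.sd.server \
  --device=amdgpu \
  --build_preference=precompiled
```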
### sharktank

The `sharktank` sub-project is SHARK's model development toolkit, which is now available as part of the `shark-ai` Python package.
- Models in the Llama model family can be exported for compilation with `sharktank.examples.export_paged_llm_v1`, using the model implementation in `sharktank/models/llama/`. The model export and compilation process will be streamlined in future releases (see the sketch after this list).
- A preliminary SHARK Tank Programming Guide is available for developers interested in understanding system architecture and implementation details.
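As a rough sketch of that export-and-compile flow, assuming a GGUF checkpoint and an MI300X (gfx942) target; the exact paths and flag names come from the export guide and may change as the pipeline is streamlined:

```bash
# Export a Llama-family model from a GGUF checkpoint to MLIR, along with
# a JSON config consumed at serving time. Paths and flags are
# illustrative; see the export guide for exact usage.
python -m sharktank.examples.export_paged_llm_v1 \
  --gguf-file=/path/to/llama3.1-8b.gguf \
  --output-mlir=model.mlir \
  --output-config=config.json

# Compile the exported program with IREE for an MI300X (gfx942) GPU.
iree-compile model.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  -o model.vmfb
```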
### shortfin
- The `shortfin_apps.sd.server` application used for serving the SDXL diffusion model is now joined by `shortfin_apps.llm.server`, which serves Large Language Models like Llama (see the launch sketch after this list).
- The LLM server can be used as a backend for the SGLang frontend by following the Using `shortfin` with `sglang` guide.
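Launching the LLM server on the artifacts produced by the `sharktank` export step looks roughly like this; flag names follow the Llama serving guide and may differ between releases:

```bash
# Serve a compiled Llama model with shortfin on an AMD GPU (HIP device).
# All paths and flags are illustrative; see the Llama serving guide.
python -m shortfin_apps.llm.server \
  --tokenizer_json=/path/to/tokenizer.json \
  --model_config=config.json \
  --vmfb=model.vmfb \
  --parameters=/path/to/llama3.1-8b.gguf \
  --device=hip
```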
## Changelog
Full list of changes: v3.0.0...v3.1.0
## What's up next?
As always, SHARK AI is fully open source, including import pipelines, compiler tools, runtime libraries, and serving layers. Future releases will continue to build on these foundational components: expanding model architecture support, improving performance, connecting to a broader set of ecosystem services, and streamlining deployment workflows.