Stars
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
📚 200+ Tensor/CUDA Core kernels, ⚡️ flash-attn-mma, ⚡️ hgemm with WMMA, MMA, and CuTe (reaching 98%~100% of cuBLAS/FlashAttention-2 TFLOPS 🎉🎉).
FlashInfer: Kernel Library for LLM Serving
SGLang is a fast serving framework for large language models and vision language models.
RouteLLM: a framework for serving and evaluating LLM routers, reducing LLM costs without compromising response quality.
FlashAttention: fast and memory-efficient exact attention.
vLLM: a high-throughput and memory-efficient inference and serving engine for LLMs (see the minimal usage sketch after this list).
📖 A curated list of awesome LLM/VLM inference papers with code: WINT8/4, Flash-Attention, Paged-Attention, parallelism, etc. 🎉🎉
Awesome-LLM: a curated list of Large Language Model resources.
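
Several of the inference engines above (vLLM, SGLang, MII) expose similar Python APIs for offline batch generation. Below is a minimal sketch using vLLM, assuming a small placeholder model; it illustrates the shape of the API, not a tuned deployment:

    # Minimal vLLM offline-inference sketch.
    # The model name below is a placeholder; any supported HF causal LM works.
    from vllm import LLM, SamplingParams

    prompts = ["Explain paged attention in one sentence."]
    sampling = SamplingParams(temperature=0.8, max_tokens=64)

    llm = LLM(model="facebook/opt-125m")  # placeholder model
    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)

For online serving, these engines also ship OpenAI-compatible HTTP servers (e.g. vLLM's `vllm serve <model>`), which is where the throughput and latency claims in the descriptions above apply.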