To address the memory constraints of LLM training, various memory-efficient techniques have been proposed. These include activation recomputation strategies, which trade increased computation for reduced memory usage; redundancy-reduction methods that minimize the duplication of model states across training processes; defragmentation techniques that optimize memory allocation and deallocation to reduce fragmentation and improve memory utilization; and swap-and-offload approaches that leverage CPU memory and NVMe SSDs to supplement GPU memory.
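The sketch below is illustrative only and is not taken from any of the papers listed here: it uses PyTorch's built-in `torch.utils.checkpoint` utility to show the basic activation-recomputation mechanism (keep the layer input, discard intermediate activations, recompute them in the backward pass), which is the mechanism the systems below build policies on top of. The `Block`/`Model` classes, dimensions, and the `recompute` flag are hypothetical names chosen for the example.

```python
# Minimal sketch of activation recomputation with torch.utils.checkpoint.
# Assumption: a toy residual feed-forward "Block" stands in for a transformer layer.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """Toy feed-forward block standing in for a transformer layer."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)


class Model(nn.Module):
    def __init__(self, dim: int = 1024, depth: int = 8, recompute: bool = True):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))
        self.recompute = recompute

    def forward(self, x):
        for blk in self.blocks:
            if self.recompute and self.training:
                # Only the block input is saved; activations inside `blk`
                # are discarded and rebuilt during the backward pass.
                x = checkpoint(blk, x, use_reentrant=False)
            else:
                x = blk(x)
        return x


if __name__ == "__main__":
    model = Model()
    x = torch.randn(4, 512, 1024, requires_grad=True)
    model(x).sum().backward()  # backward triggers the recomputation
```

The papers listed below differ mainly in the policy layer, e.g. which tensors to keep versus rematerialize and how recomputation interacts with parallelism, rather than in this basic mechanism.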
- Dynamic tensor rematerialization [Paper] [Code]
- M. Kirisame et al.
- ICLR 2021
- MegTaiChi: Dynamic tensor-based memory management optimization for DNN training [Paper]
- ICS 2022
- Coop: Memory is not a commodity [Paper]
- J. Zhang et al.
- NeurIPS 2023
- Checkmate: Breaking the memory wall with optimal tensor rematerialization [Paper] [Code]
- P. Jain et al.
- MLSys 2020
- LoongTrain: Efficient training of long-sequence LLMs with head-context parallelism [Paper]
- D. Gu et al.
- Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism [Paper]
- T. Yuan et al.
- USENIX ATC 2024
- Reducing activation recomputation in large transformer models [Paper] [Code]
- V. A. Korthikanti et al.
- MLSys 2023
- DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [Paper] [Code]
- D. Li et al.
- ZeRO: Memory optimizations Toward Training Trillion Parameter Models [Paper] [Code]
- S. Rajbhandari et al.
- SC 2020
- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel [Paper] [Code]
- Y. Zhao et al.
- VLDB 2023
- ZeRO++: Extremely Efficient Collective Communication for Giant Model Training [Paper]
- G. Wang et al.
- ICLR 2024
- MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud [Paper]
- Z. Zhang et al.
- VLDB 2022
- Rethinking Memory and Communication Cost for Efficient Large Language Model Training [Paper]
- C. Wu et al.
- RTP: Rethinking Tensor Parallelism with Memory Deduplication [Paper]
- C. Luo et al.
- AMSP: Reducing Communication Overhead of ZeRO for Efficient LLM Training [Paper]
- Q. Chen et al.
- ROAM: memory-efficient large DNN training via optimized operator ordering and memory layout [Paper]
- H. Shu et al.
- A Heuristic for Periodic Memory Allocation with Little Fragmentation to Train Neural Networks [Paper]
- A. Imanishi et al.
- ISMM 2024
- MegTaiChi: Dynamic tensor-based memory management optimization for DNN training [Paper]
- ICS 2022
- Coop: Memory is not a commodity [Paper]
- J. Zhang et al.
- NeurIPS 2023
- GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching [Paper] [Code]
- C. Guo et al.
- ASPLOS 2024
- Expandable Segments [Code]
- Static Offloading
- Training Large Neural Networks with Constant Memory using a New Execution Algorithm [Paper]
- B. Pudipeddi et al.
- ZeRO-Offload: Democratizing Billion-Scale Model Training [Paper]
- J. Ren et al.
- USENIX ATC 2021
- Elixir: Train a Large Language Model on a Small GPU Cluster [Paper] [Code]
- H. Huang et al.
- Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism [Paper]
- T. Yuan et al.
- USENIX ATC 2024
- Dynamic Offloading
- TSPLIT: Fine-grained GPU Memory Management for Efficient DNN Training via Tensor Splitting [Paper]
- X. Nie et al.
- ICDE 2022
- PatrickStar: Parallel Training of Large Language Models via a Chunk-based Memory Management [Paper] [Code]
- J. Fang et al.
- TPDS 2023
- Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers [Paper]
- Y. Feng et al.
- ASPLOS 2023
- Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers [Paper]
- Y. Li et al.
- VLDB 2022
- Tensor Movement Orchestration in Multi-GPU Training Systems [Paper]
- S. Lin et al.
- HPCA 2023
- STRONGHOLD: Fast and Affordable Billion-Scale Deep Learning Model Training [Paper]
- X. Sun et al.
- SC 2022
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning [Paper] [Code]
- S. Rajbhandari et al.
- SC 2021
- Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent [Paper]
- X. Nie et al.
- VLDB 2023
- Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System [Paper]
- H. Jang et al.
- HPCA 2024
- Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU [Paper]
- C. Liao et al.
- MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services [Paper]
- D. Yu et al.
- ICS 2024