Memory Optimizations

To address the memory constraints of LLM training, a range of memory-efficient techniques has been proposed. These include activation recomputation strategies, which trade extra computation for reduced activation memory; redundancy reduction methods, which minimize data duplication across training processes; defragmentation techniques, which optimize memory allocation and deallocation to reduce fragmentation and improve utilization; and swap/offload approaches, which supplement GPU memory with CPU memory and NVMe SSDs.

Activation Recomputation

Dynamic Eviction

  • Dynamic Tensor Rematerialization [Paper] [Code]
    • M. Kirisame et al.
    • ICLR 2021
  • MegTaiChi: Dynamic Tensor-based Memory Management Optimization for DNN Training [Paper]
    • ICS 2022
  • Coop: Memory Is Not a Commodity [Paper]
    • J. Zhang et al.
    • NeurIPS 2023
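
Dynamic eviction frees tensors on the fly when memory pressure hits and rematerializes them lazily on the next access. The sketch below is a minimal, framework-agnostic illustration of that idea with a DTR-style eviction heuristic (cheap-to-recompute, large, and stale tensors go first); the `Cell` class and cost model are hypothetical, not any paper's actual implementation.

```python
import time
from typing import Any, Callable, List

class Cell:
    """A rematerializable tensor: it stores the recipe to recompute itself."""

    def __init__(self, compute: Callable[[], Any], size_bytes: int, cost: float):
        self.compute = compute        # closure that can rebuild the payload
        self.size_bytes = size_bytes  # memory footprint while resident
        self.cost = cost              # estimated recomputation cost
        self.value: Any = compute()   # materialized payload, or None if evicted
        self.last_access = time.monotonic()

    def get(self) -> Any:
        if self.value is None:        # evicted earlier: rematerialize on demand
            self.value = self.compute()
        self.last_access = time.monotonic()
        return self.value

def evict_until(cells: List[Cell], budget_bytes: int) -> None:
    """Evict resident cells until total residency fits the budget.

    Victims minimize a DTR-style heuristic: prefer evicting tensors that
    are cheap to recompute, large, and stale.
    """
    def score(c: Cell) -> float:
        staleness = time.monotonic() - c.last_access + 1e-9
        return c.cost / (c.size_bytes * staleness)

    resident = [c for c in cells if c.value is not None]
    while resident and sum(c.size_bytes for c in resident) > budget_bytes:
        victim = min(resident, key=score)
        victim.value = None           # free the memory, keep the recipe
        resident.remove(victim)
```

Real systems add what is elided here: dependency tracking between cells so rematerialization can recurse, and, as Coop argues, eviction of contiguous blocks so the freed bytes are actually usable by the allocator.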

Static Eviction

  • Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization [Paper] [Code]
    • P. Jain et al.
    • MLSys 2020
  • LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism [Paper]
    • D. Gu et al.
  • Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism [Paper]
    • T. Yuan et al.
    • USENIX ATC 2024
  • Reducing Activation Recomputation in Large Transformer Models [Paper] [Code]
    • V. A. Korthikanti et al.
    • MLSys 2023
  • DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [Paper] [Code]
    • D. Li et al.
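
Static eviction fixes the recomputation plan ahead of time; in PyTorch this is the stock activation (gradient) checkpointing API, which drops the activations inside a chosen region during forward and recomputes them in backward. A minimal sketch (the MLP block is illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """Toy block whose inner activations are recomputed in backward."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activations inside `self.block` are evicted after forward and
        # recomputed during backward, trading FLOPs for memory.
        return checkpoint(self.block, x, use_reentrant=False)

x = torch.randn(8, 1024, requires_grad=True)
loss = CheckpointedMLP()(x).sum()
loss.backward()  # triggers recomputation of the checkpointed block
```

Checkmate-style work replaces the "checkpoint every block" rule of thumb with a solver that picks an optimal set of tensors to evict for a given memory budget.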

Redundancy Reduction

Full Sharding

  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [Paper] [Code]
    • S. Rajbhandari et al.
    • SC 2020
  • PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel [Paper] [Code]
    • Y. Zhao et al.
    • VLDB 2023
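
Full sharding partitions parameters, gradients, and optimizer state across all data-parallel ranks and gathers parameters on demand. A minimal PyTorch FSDP sketch, assuming a `torchrun` launch on CUDA devices (the model and shapes are illustrative):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # torchrun provides rank/world-size env vars
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Transformer(d_model=512, num_encoder_layers=6)  # illustrative
model = FSDP(model.cuda())       # params, grads, and opt state are sharded

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
src = torch.randn(10, 4, 512, device="cuda")
tgt = torch.randn(10, 4, 512, device="cuda")
loss = model(src, tgt).sum()
loss.backward()                  # gradients are reduce-scattered per shard
optim.step()                     # each rank updates only its own shard
```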

Partial Sharding

  • ZeRO++: Extremely Efficient Collective Communication for Giant Model Training [Paper]
    • G. Wang et al.
    • ICLR 2024
  • MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud [Paper]
    • Z. Zhang et al.
    • VLDB 2022
  • Rethinking Memory and Communication Cost for Efficient Large Language Model Training [Paper]
    • C. Wu et al.
  • RTP: Rethinking Tensor Parallelism with Memory Deduplication [Paper]
    • C. Luo et al.
  • AMSP: Reducing Communication Overhead of ZeRO for Efficient LLM Training [Paper]
    • Q. Chen et al.
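
Partial sharding trades some of full sharding's memory savings for cheaper communication, typically by sharding within a node and replicating across nodes (the idea behind MiCS and ZeRO++'s hierarchical partitioning). PyTorch exposes one such policy as FSDP's `HYBRID_SHARD` strategy; a hedged sketch, under the same `torchrun` assumption as above:

```python
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
)

# HYBRID_SHARD: full sharding inside each node, replication across nodes.
# Parameter all-gathers stay on fast intra-node links; only gradient
# reduction crosses the slower inter-node network.
model = FSDP(
    nn.Transformer(d_model=512).cuda(),  # illustrative model
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```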

Defragmentation

Tensor-based Defragmentation

  • ROAM: Memory-efficient Large DNN Training via Optimized Operator Ordering and Memory Layout [Paper]
    • H. Shu et al.
  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [Paper] [Code]
    • S. Rajbhandari et al.
    • SC 2020
  • A Heuristic for Periodic Memory Allocation with Little Fragmentation to Train Neural Networks [Paper]
    • A. Imanishi et al.
    • ISMM 2024
  • MegTaiChi: Dynamic Tensor-based Memory Management Optimization for DNN Training [Paper]
    • ICS 2022
  • Coop: Memory Is Not a Commodity [Paper]
    • J. Zhang et al.
    • NeurIPS 2023
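
Tensor-based approaches plan tensor lifetimes, allocation order, and addresses so that live tensors pack tightly. In PyTorch, fragmentation is visible as the gap between what the caching allocator has reserved from CUDA and what live tensors actually use; a small sketch for observing it (requires a CUDA device; the allocation pattern below is just one way to provoke churn):

```python
import torch

def fragmentation_report(device: int = 0) -> None:
    """Print the reserved-vs-allocated gap of PyTorch's caching allocator.

    Memory reserved from CUDA but not backing any live tensor is (mostly)
    fragmentation, which tensor-placement planners try to minimize.
    """
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    print(f"allocated={allocated / 2**20:.1f} MiB  "
          f"reserved={reserved / 2**20:.1f} MiB  "
          f"fragmented~{(reserved - allocated) / 2**20:.1f} MiB")

# Mixed-size allocate/free churn, an easy way to leave holes behind.
xs = [torch.empty(2**i, device="cuda") for i in range(10, 24)]
del xs[::2]
fragmentation_report()
```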

VMM-based Defragmentation

  • GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching [Paper] [Code]
    • C. Guo et al.
    • ASPLOS 2024
  • Expandable Segments [Code]
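
VMM-based approaches such as GMLake and expandable segments use CUDA's virtual memory management APIs to stitch or grow physical pages under one contiguous virtual range, so holes no longer force fresh reservations. PyTorch's expandable segments are switched on with a documented allocator flag that must be set before the first CUDA allocation:

```python
import os

# Must be set before CUDA is initialized (i.e., before any GPU allocation).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.empty(1 << 20, device="cuda")  # served from an expandable segment
```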

Offloading

CPU Offloading

  • Static Offloading
    • Training Large Neural Networks with Constant Memory using a New Execution Algorithm [Paper]
      • B. Pudipeddi et al.
    • ZeRO-Offload: Democratizing Billion-Scale Model Training [Paper]
      • J. Ren et al.
      • USENIX ATC 2021
    • Elixir: Train a Large Language Model on a Small GPU Cluster [Paper] [Code]
      • H. Huang et al.
    • Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism [Paper]
      • T. Yuan et al.
      • USENIX ATC 2024
  • Dynamic Offloading
    • TSPLIT: Fine-grained GPU Memory Management for Efficient DNN Training via Tensor Splitting [Paper]
      • X. Nie et al.
      • ICDE 2022
    • PatrickStar: Parallel Training of Large Language Models via a Chunk-based Memory Management [Paper] [Code]
      • J. Fang et al.
      • TPDS 2023
    • Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers [Paper]
      • Y. Feng et al.
      • ASPLOS 2023
    • Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers [Paper]
      • Y. Li et al.
      • VLDB 2022
    • Tensor Movement Orchestration in Multi-GPU Training Systems [Paper]
      • S. Lin et al.
      • HPCA 2023
    • STRONGHOLD: Fast and Affordable Billion-Scale Deep Learning Model Training [Paper]
      • X. Sun et al.
      • SC 2022
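
Two concrete entry points, sketched under stated assumptions: ZeRO-Offload is enabled in DeepSpeed purely through configuration (the keys below follow DeepSpeed's documented schema), and PyTorch ships a simple activation-offloading context manager, `torch.autograd.graph.save_on_cpu`. The model and sizes are illustrative.

```python
import torch
import torch.nn as nn

# -- Static offloading of optimizer state (ZeRO-Offload style) --------------
# With ZeRO stage 2, DeepSpeed keeps optimizer states in (pinned) CPU memory
# and runs the optimizer step on the CPU.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
# engine, optim, _, _ = deepspeed.initialize(model=model, config=ds_config)

# -- Offloading activations saved for backward to pinned host memory --------
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
).cuda()
x = torch.randn(64, 1024, device="cuda", requires_grad=True)

with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()
loss.backward()  # saved activations stream back to the GPU on demand
```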

SSD Offloading

  • ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning [Paper] [Code]
    • S. Rajbhandari et al.
    • SC 2021
  • Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent [Paper]
    • X. Nie et al.
    • VLDB 2023
  • Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System [Paper]
    • H. Jang et al.
    • HPCA 2024
  • Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU [Paper]
    • C. Liao et al.
  • MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services [Paper]
    • D. Yu et al.
    • ICS 2024
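
ZeRO-Infinity extends the offload targets from CPU memory to NVMe. In DeepSpeed this is again a configuration matter; the keys follow the documented ZeRO-Infinity schema, while the NVMe path and AIO numbers below are illustrative placeholders to tune per system.

```python
# Requires ZeRO stage 3 (fully sharded parameters) plus DeepSpeed's
# async-I/O (aio) support for reading/writing tensor shards on NVMe.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
    # Tuning knobs for the NVMe read/write path.
    "aio": {"block_size": 1048576, "queue_depth": 8, "thread_count": 1},
}
```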