Memory Optimizations

To address the memory constraints of LLM training, a range of memory-efficient techniques has been proposed. These include activation recomputation strategies, which trade extra computation for reduced activation memory; redundancy reduction methods, which minimize data duplication across training processes; defragmentation techniques, which optimize memory allocation and deallocation to reduce fragmentation and improve utilization; and swap/offload approaches, which supplement GPU memory with CPU memory and NVMe SSDs.

Activation Recomputation

Dynamic Eviction

  • Dynamic Tensor Rematerialization [Paper] [Code]
    • M. Kirisame et al.
    • ICLR 2021
  • MegTaiChi: Dynamic Tensor-based Memory Management Optimization for DNN Training [Paper]
    • ICS 2022
  • Coop: Memory Is Not a Commodity [Paper]
    • J. Zhang et al.
    • NeurIPS 2023
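
Dynamic eviction frees tensors on the fly when memory pressure hits and rematerializes them lazily on the next access. The sketch below is a minimal, framework-agnostic illustration of that idea with a DTR-style eviction heuristic (cheap-to-recompute, large, and stale tensors go first); the `Cell` class and cost model are hypothetical, not any paper's actual implementation.

```python
import time
from typing import Any, Callable, List

class Cell:
    """A rematerializable tensor: it stores the recipe to recompute itself."""

    def __init__(self, compute: Callable[[], Any], size_bytes: int, cost: float):
        self.compute = compute        # closure that can rebuild the payload
        self.size_bytes = size_bytes  # memory footprint while resident
        self.cost = cost              # estimated recomputation cost
        self.value: Any = compute()   # materialized payload, or None if evicted
        self.last_access = time.monotonic()

    def get(self) -> Any:
        if self.value is None:        # evicted earlier: rematerialize on demand
            self.value = self.compute()
        self.last_access = time.monotonic()
        return self.value

def evict_until(cells: List[Cell], budget_bytes: int) -> None:
    """Evict resident cells until total residency fits the budget.

    Victims minimize a DTR-style heuristic: prefer evicting tensors that
    are cheap to recompute, large, and stale.
    """
    def score(c: Cell) -> float:
        staleness = time.monotonic() - c.last_access + 1e-9
        return c.cost / (c.size_bytes * staleness)

    resident = [c for c in cells if c.value is not None]
    while resident and sum(c.size_bytes for c in resident) > budget_bytes:
        victim = min(resident, key=score)
        victim.value = None           # free the memory, keep the recipe
        resident.remove(victim)
```

Real systems add what is elided here: dependency tracking between cells so rematerialization can recurse, and, as Coop argues, eviction of contiguous blocks so the freed bytes are actually usable by the allocator.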

Static Eviction

  • Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization [Paper] [Code]
    • P. Jain et al.
    • MLSys 2020
  • LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism [Paper]
    • D. Gu et al.
  • Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism [Paper]
    • T. Yuan et al.
    • USENIX ATC 2024
  • Reducing Activation Recomputation in Large Transformer Models [Paper] [Code]
    • V. A. Korthikanti et al.
    • MLSys 2023
  • DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [Paper] [Code]
    • D. Li et al.
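
Static eviction fixes the recomputation plan ahead of time; in PyTorch this is the stock activation (gradient) checkpointing API, which drops the activations inside a chosen region during forward and recomputes them in backward. A minimal sketch (the MLP block is illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """Toy block whose inner activations are recomputed in backward."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activations inside `self.block` are evicted after forward and
        # recomputed during backward, trading FLOPs for memory.
        return checkpoint(self.block, x, use_reentrant=False)

x = torch.randn(8, 1024, requires_grad=True)
loss = CheckpointedMLP()(x).sum()
loss.backward()  # triggers recomputation of the checkpointed block
```

Checkmate-style work replaces the "checkpoint every block" rule of thumb with a solver that picks an optimal set of tensors to evict for a given memory budget.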

Redundancy Reduction

Full Sharding

  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [Paper] [Code]
    • S. Rajbhandari et al.
    • SC 2020
  • PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel [Paper] [Code]
    • Y. Zhao et al.
    • VLDB 2023
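
Full sharding partitions parameters, gradients, and optimizer state across all data-parallel ranks and gathers parameters on demand. A minimal PyTorch FSDP sketch, assuming a `torchrun` launch on CUDA devices (the model and shapes are illustrative):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")  # torchrun provides rank/world-size env vars
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Transformer(d_model=512, num_encoder_layers=6)  # illustrative
model = FSDP(model.cuda())       # params, grads, and opt state are sharded

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
src = torch.randn(10, 4, 512, device="cuda")
tgt = torch.randn(10, 4, 512, device="cuda")
loss = model(src, tgt).sum()
loss.backward()                  # gradients are reduce-scattered per shard
optim.step()                     # each rank updates only its own shard
```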

Partial Sharding

  • ZeRO++: Extremely Efficient Collective Communication for Giant Model Training [Paper]
    • G. Wang et al.
    • ICLR 2024
  • MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud [Paper]
    • Z. Zhang et al.
    • VLDB 2022
  • Rethinking Memory and Communication Cost for Efficient Large Language Model Training [Paper]
    • C. Wu et al.
  • RTP: Rethinking Tensor Parallelism with Memory Deduplication [Paper]
    • C. Luo et al.
  • AMSP: Reducing Communication Overhead of ZeRO for Efficient LLM Training [Paper]
    • Q. Chen et al.
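
Partial sharding trades some of full sharding's memory savings for cheaper communication, typically by sharding within a node and replicating across nodes (the idea behind MiCS and ZeRO++'s hierarchical partitioning). PyTorch exposes one such policy as FSDP's `HYBRID_SHARD` strategy; a hedged sketch, under the same `torchrun` assumption as above:

```python
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
)

# HYBRID_SHARD: full sharding inside each node, replication across nodes.
# Parameter all-gathers stay on fast intra-node links; only gradient
# reduction crosses the slower inter-node network.
model = FSDP(
    nn.Transformer(d_model=512).cuda(),  # illustrative model
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```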

Defragmentation

Tensor-based Defragmentation

  • ROAM: Memory-efficient Large DNN Training via Optimized Operator Ordering and Memory Layout [Paper]
    • H. Shu et al.
  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models [Paper] [Code]
    • S. Rajbhandari et al.
    • SC 2020
  • A Heuristic for Periodic Memory Allocation with Little Fragmentation to Train Neural Networks [Paper]
    • A. Imanishi et al.
    • ISMM 2024
  • MegTaiChi: Dynamic Tensor-based Memory Management Optimization for DNN Training [Paper]
    • ICS 2022
  • Coop: Memory Is Not a Commodity [Paper]
    • J. Zhang et al.
    • NeurIPS 2023
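
Tensor-based approaches plan tensor lifetimes, allocation order, and addresses so that live tensors pack tightly. In PyTorch, fragmentation is visible as the gap between what the caching allocator has reserved from CUDA and what live tensors actually use; a small sketch for observing it (requires a CUDA device; the allocation pattern below is just one way to provoke churn):

```python
import torch

def fragmentation_report(device: int = 0) -> None:
    """Print the reserved-vs-allocated gap of PyTorch's caching allocator.

    Memory reserved from CUDA but not backing any live tensor is (mostly)
    fragmentation, which tensor-placement planners try to minimize.
    """
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    print(f"allocated={allocated / 2**20:.1f} MiB  "
          f"reserved={reserved / 2**20:.1f} MiB  "
          f"fragmented~{(reserved - allocated) / 2**20:.1f} MiB")

# Mixed-size allocate/free churn, an easy way to leave holes behind.
xs = [torch.empty(2**i, device="cuda") for i in range(10, 24)]
del xs[::2]
fragmentation_report()
```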

VMM-based Defragmentation

  • GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching [Paper] [Code]
    • C. Guo et al.
    • ASPLOS 2024
  • Expandable Segments [Code]
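
VMM-based approaches such as GMLake and expandable segments use CUDA's virtual memory management APIs to stitch or grow physical pages under one contiguous virtual range, so holes no longer force fresh reservations. PyTorch's expandable segments are switched on with a documented allocator flag that must be set before the first CUDA allocation:

```python
import os

# Must be set before CUDA is initialized (i.e., before any GPU allocation).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.empty(1 << 20, device="cuda")  # served from an expandable segment
```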

Offloading

CPU Offloading

  • Static Offloading
    • Training Large Neural Networks with Constant Memory using a New Execution Algorithm [Paper]
      • B. Pudipeddi et al.
    • ZeRO-Offload: Democratizing Billion-Scale Model Training [Paper]
      • J. Ren et al.
      • USENIX ATC 2021
    • Elixir: Train a Large Language Model on a Small GPU Cluster [Paper] [Code]
      • H. Huang et al.
    • Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism [Paper]
      • T. Yuan et al.
      • USENIX ATC 2024
  • Dynamic Offloading
    • TSPLIT: Fine-grained GPU Memory Management for Efficient DNN Training via Tensor Splitting [Paper]
      • X. Nie et al.
      • ICDE 2022
    • PatrickStar: Parallel Training of Large Language Models via a Chunk-based Memory Management [Paper] [Code]
      • J. Fang et al.
      • TPDS 2023
    • Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers [Paper]
      • Y. Feng et al.
      • ASPLOS 2023
    • Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers [Paper]
      • Y. Li et al.
      • VLDB 2022
    • Tensor Movement Orchestration in Multi-GPU Training Systems [Paper]
      • S. Lin et al.
      • HPCA 2023
    • STRONGHOLD: Fast and Affordable Billion-Scale Deep Learning Model Training [Paper]
      • X. Sun et al.
      • SC 2022
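
Two concrete entry points, sketched under stated assumptions: ZeRO-Offload is enabled in DeepSpeed purely through configuration (the keys below follow DeepSpeed's documented schema), and PyTorch ships a simple activation-offloading context manager, `torch.autograd.graph.save_on_cpu`. The model and sizes are illustrative.

```python
import torch
import torch.nn as nn

# -- Static offloading of optimizer state (ZeRO-Offload style) --------------
# With ZeRO stage 2, DeepSpeed keeps optimizer states in (pinned) CPU memory
# and runs the optimizer step on the CPU.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
# engine, optim, _, _ = deepspeed.initialize(model=model, config=ds_config)

# -- Offloading activations saved for backward to pinned host memory --------
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
).cuda()
x = torch.randn(64, 1024, device="cuda", requires_grad=True)

with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).sum()
loss.backward()  # saved activations stream back to the GPU on demand
```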

SSD Offloading

  • ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning [Paper] [Code]
    • S. Rajbhandari et al.
    • SC 2021
  • Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent [Paper]
    • X. Nie et al.
    • VLDB 2023
  • Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System [Paper]
    • H. Jang et al.
    • HPCA 2024
  • Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU [Paper]
    • C. Liao et al.
  • MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services [Paper]
    • D. Yu et al.
    • ICS 2024
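
ZeRO-Infinity extends the offload targets from CPU memory to NVMe. In DeepSpeed this is again a configuration matter; the keys follow the documented ZeRO-Infinity schema, while the NVMe path and AIO numbers below are illustrative placeholders to tune per system.

```python
# Requires ZeRO stage 3 (fully sharded parameters) plus DeepSpeed's
# async-I/O (aio) support for reading/writing tensor shards on NVMe.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
    # Tuning knobs for the NVMe read/write path.
    "aio": {"block_size": 1048576, "queue_depth": 8, "thread_count": 1},
}
```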