INFRASTRUCTURE FOR LLM TRAINING

Introduction

We explore the infrastructure design for training LLMs, encompassing accelerators, networks, storage, and scheduling systems.

Content

  • AI Accelerators
    • NVIDIA GPUs
    • Other AI Accelerators
  • Network Infrastructure
    • Chip-to-Chip
    • Node-to-Node
    • Network Topology
    • Load Balancing & Congestion Control (CC)
  • Storage Systems
    • Checkpoint Storage
    • Training Data Storage
  • Scheduling Systems
    • Workload Scheduling
    • Resource Scheduling

AI Accelerators

NVIDIA GPUs

  • NVIDIA Ampere Architecture. [Website]

  • NVIDIA Hopper Architecture. [Website]

  • NVIDIA Blackwell Architecture. [pdf]

Other AI Accelerators

  • AMD Instinct™ MI250X accelerator enabled by elevated fanout bridge advanced packaging architecture [pdf]

    • 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits).
    • R. Swaminathan, M. J. Schulte, B. Wilkerson, G. H. Loh, A. Smith, N. James
  • Gaudi training platform white paper [White paper]

    • Habana
  • TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings [pdf]

    • Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023
    • N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles et al.
  • A comprehensive performance study of large language models on novel AI accelerators [pdf]

    • arXiv preprint arXiv:2310.04607, 2023
    • M. Emani, S. Foreman, V. Sastry, Z. Xie, S. Raskar, W. Arnold, R. Thakur, V. Vishwanath, and M. E. Papka
  • The Cerebras CS-2: Designing an AI accelerator around the world’s largest 2.6 trillion transistor chip [pdf]

    • Proceedings of the 2022 International Symposium on Physical Design, 2022
    • J.-P. Fricker

Network Infrastructure

Chip-to-Chip

  • NVIDIA DGX-1 system architecture white paper [White paper]

  • 3.2 The A100 datacenter GPU and Ampere architecture [pdf]

    • 2021 IEEE International Solid-State Circuits Conference (ISSCC)
    • J. Choquette, E. Lee, R. Krashinsky, V. Balan, and B. Khailany
  • 2.2 AMD chiplet architecture for high-performance server and desktop products [pdf]

    • 2020 IEEE International Solid-State Circuits Conference (ISSCC)
    • S. Naffziger, K. Lepak, M. Paraschou, and M. Subramony
  • A domain-specific supercomputer for training deep neural networks [pdf]

    • Communications of the ACM, vol. 63, no. 7, pp. 67–78, 2020.
    • N. P. Jouppi, D. H. Yoon, G. Kurian, S. Li, N. Patil, J. Laudon, C. Young, and D. Patterson
  • Google’s training chips revealed: TPUv2 and TPUv3 [pdf]

    • Hot Chips Symposium, 2020
    • T. Norrie, N. Patil, D. H. Yoon, G. Kurian, S. Li, J. Laudon, C. Young, N. P. Jouppi, and D. A. Patterson

Node-to-Node

  • The development of Mellanox/NVIDIA GPUDirect over InfiniBand: a new model for GPU to GPU communications [pdf]

    • Computer Science-Research and Development, vol. 26, pp. 267–273, 2011
    • G. Shainer, A. Ayoub, P. Lui, T. Liu, M. Kagan, C. R. Trott, G. Scantlen, and P. S. Crozier
  • An introduction to the InfiniBand architecture [pdf]

    • High performance mass storage and parallel I/O, vol. 42, no. 617-632, p. 10, 2001.
    • G. F. Pfister
  • Supplement to InfiniBand architecture specification volume 1 release 1.2.2 annex A16 [pdf]

    • InfiniBand Trade Association
  • Architectural Specifications for RDMA over TCP/IP [Website]

    • RDMA Consortium

Network Topology

  • A study of non-blocking switching networks [pdf]

    • Bell System Technical Journal
    • C. Clos
  • BCube: a high performance, server-centric network architecture for modular data centers [pdf]

    • Proceedings of the ACM SIGCOMM 2009 conference on Data communication, 2009
    • Chuanxiong Guo, Guohan Lu, Dan Li, Haitao Wu, Xuan Zhang, Yunfeng Shi, Chen Tian, Yongguang Zhang, Songwu Lu
  • DCell: a scalable and fault-tolerant network structure for data centers [pdf]

    • Proceedings of the ACM SIGCOMM 2008 conference on Data communication, 2008
    • Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, Songwu Lu
  • Jellyfish: Networking data centers randomly [pdf]

    • 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12)
    • A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey
  • Technology-driven, highly-scalable dragonfly topology [pdf]

    • ACM SIGARCH Computer Architecture News
    • J. Kim, W. J. Dally, S. Scott, and D. Abts
  • Dragonfly+: Low cost topology for scaling datacenters [pdf]

    • 2017 IEEE 3rd International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB)
    • A. Shpiner, Z. Haramaty, S. Eliad, V. Zdornov, B. Gafni, and E. Zahavi
  • Doubling all2all performance with NVIDIA Collective Communication Library 2.12 [pdf]

    • K. Mandakolathur and S. Jeaugey
  • Alibaba HPN: A data center network for large language model training [pdf]

    • Proceedings of the ACM SIGCOMM 2024 Conference, 2024
    • Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, Chao Wang, Peng Wang, Pengcheng Zhang, Xianlong Zeng, Eddie Ruan, Zhiping Yao, Ennan Zhai, Dennis Cai
  • Optimized network architectures for large language model training with billions of parameters [pdf]

    • arXiv preprint arXiv:2307.12169
    • Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, Naader Hasani
  • HammingMesh: a network topology for large-scale deep learning [pdf]

    • SC22: International Conference for High Performance Computing, Networking, Storage and Analysis.
    • T. Hoefler, T. Bonato, D. De Sensi, S. Di Girolamo, S. Li, M. Heddes, J. Belk, D. Goel, M. Castro, and S. Scott
  • SiP-ML: high-bandwidth optical network interconnects for machine learning training [pdf]

    • Proceedings of the 2021 ACM SIGCOMM Conference, 2021
    • M. Khani, M. Ghobadi, M. Alizadeh, Z. Zhu, M. Glick, K. Bergman, A. Vahdat, B. Klenk, and E. Ebrahimi
  • TopoOpt: Co-optimizing network topology and parallelization strategy for distributed training jobs [pdf]

    • 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)
    • Weiyang Wang, Moein Khazraee, Zhizhen Zhong, Manya Ghobadi, Zhihao Jia, Dheevatsa Mudigere, Ying Zhang, Anthony Kewitsch

Load Balancing & Congestion Control (CC)

  • Analysis of an equal-cost multi-path algorithm [pdf]

    • C. Hopps
  • On the impact of packet spraying in data center networks [pdf]

    • 2013 Proceedings IEEE INFOCOM
    • A. Dixit, P. Prakash, Y. C. Hu, and R. R. Kompella
  • Challenging the need for packet spraying in large-scale distributed training [pdf]

    • arXiv preprint arXiv:2407.00550, 2024.
    • V. Addanki, P. Goyal, and I. Marinos
  • MegaScale: Scaling large language model training to more than 10,000 GPUs [pdf]

    • 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024
    • Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
  • IEEE 802.1Qbb: Priority-based flow control [pdf]

  • TIMELY: RTT-based congestion control for the datacenter [pdf]

    • ACM SIGCOMM Computer Communication Review
    • R. Mittal, V. T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, and D. Zats
  • Swift: Delay is simple and effective for congestion control in the datacenter [pdf]

    • Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, 2020
    • G. Kumar, N. Dukkipati, K. Jang, H. M. Wassel, X. Wu, B. Montazeri, Y. Wang, K. Springborn, C. Alfeld, M. Ryan et al.
  • Congestion control for large-scale RDMA deployments [pdf]

    • ACM SIGCOMM Computer Communication Review
    • Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron, J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang
  • ECN or delay: Lessons learnt from analysis of DCQCN and TIMELY [pdf]

    • Proceedings of the 12th International Conference on emerging Networking EXperiments and Technologies
    • Y. Zhu, M. Ghobadi, V. Misra, and J. Padhye
  • HPCC: High precision congestion control [pdf]

    • Proceedings of the ACM special interest group on data communication, 2019
    • Y. Li, R. Miao, H. H. Liu, Y. Zhuang, F. Feng, L. Tang, Z. Cao, M. Zhang, F. Kelly, M. Alizadeh et al.
  • An edge-queued datagram service for all datacenter traffic [pdf]

    • 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)
    • V. Olteanu, H. Eran, D. Dumitrescu, A. Popa, C. Baciu, M. Silberstein, G. Nikolaidis, M. Handley, and C. Raiciu
  • RoCC: robust congestion control for RDMA [pdf]

    • Proceedings of the 16th International conference on emerging networking experiments and technologies, 2020
    • P. Taheri, D. Menikkumbura, E. Vanini, S. Fahmy, P. Eugster, and T. Edsall
  • MLTCP: Congestion control for DNN training [pdf]

    • arXiv preprint arXiv:2402.09589, 2024
    • Sudarsanan Rajasekaran, Sanjoli Narang, Anton A. Zabreyko, Manya Ghobadi
  • CASSINI: Network-aware job scheduling in machine learning clusters [pdf]

    • 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)
    • S. Rajasekaran, M. Ghobadi, and A. Akella
  • Towards domain-specific network transport for distributed DNN training [pdf]

    • 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)
    • H. Wang, H. Tian, J. Chen, X. Wan, J. Xia, G. Zeng, W. Bai, J. Jiang, Y. Wang, and K. Chen

Storage Systems

Checkpoint Storage

  • Facebook’s Tectonic filesystem: Efficiency from exascale [pdf]

    • 19th USENIX Conference on File and Storage Technologies (FAST 21)
    • S. Pan, T. Stavrinos, Y. Zhang, A. Sikaria, P. Zakharov, A. Sharma, M. Shuey, R. Wareing, M. Gangapuram, G. Cao et al.
  • The Hadoop distributed file system [pdf]

    • 2010 IEEE 26th symposium on mass storage systems and technologies (MSST).
    • K. Shvachko, H. Kuang, S. Radia, and R. Chansler
  • Ceph: A scalable, high-performance distributed file system [pdf]

    • Proceedings of the 7th Conference on Operating Systems Design and Implementation (OSDI’06)
    • S. Weil, S. A. Brandt, E. L. Miller, D. D. Long, and C. Maltzahn

Training Data Storage

  • Lustre: Building a file system for 1000-node clusters [pdf]

    • Proceedings of the 2003 Linux symposium
    • P. Schwan et al.
  • GPFS: A shared-disk file system for large computing clusters [pdf]

    • Conference on file and storage technologies (FAST 02)
    • F. Schmuck and R. Haskin
  • I/O characterization and performance evaluation of BeeGFS for deep learning [pdf]

    • Proceedings of the 48th International Conference on Parallel Processing, 2019
    • F. Chowdhury, Y. Zhu, T. Heer, S. Paredes, A. Moody, R. Goldstone, K. Mohror, and W. Yu
  • Tachyon: Reliable, memory speed storage for cluster computing frameworks [pdf]

    • Proceedings of the ACM Symposium on Cloud Computing, 2014
    • H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica
  • JuiceFS: A High-Performance, Cloud-Native, Distributed File System [GitHub]

    • JuiceFS
  • Quiver: An informed storage cache for deep learning [pdf]

    • 18th USENIX Conference on File and Storage Technologies (FAST 20), 2020
    • A. V. Kumar and M. Sivathanu
  • Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs [pdf]

    • 2022 IEEE 38th International Conference on Data Engineering (ICDE).
    • Rong Gu, Kai Zhang, Zhihao Xu, et al.

Scheduling Systems

Workload Scheduling

  • Tiresias: A GPU cluster manager for distributed deep learning [pdf]

    • 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), 2019
    • J. Gu, M. Chowdhury, K. G. Shin, Y. Zhu, M. Jeon, J. Qian, H. Liu, and C. Guo
  • Themis: Fair and efficient GPU cluster scheduling [pdf]

    • 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), 2020
    • K. Mahajan, A. Balasubramanian, A. Singhvi, S. Venkataraman, A. Akella, A. Phanishayee, and S. Chawla
  • ElasticFlow: An elastic serverless training platform for distributed deep learning [pdf]

    • Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023
    • D. Gu, Y. Zhao, Y. Zhong, Y. Xiong, Z. Han, P. Cheng, F. Yang, G. Huang, X. Jin, and X. Liu
  • Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads [pdf]

    • 14th USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI ’20. USENIX Association, 2020
    • D. Narayanan, K. Santhanam, F. Kazhamiaka, A. Phanishayee, and M. Zaharia
  • Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning [pdf]

    • Proceedings of the Fifteenth European Conference on Computer Systems, ser. EuroSys ’20
    • S. Chaudhary, R. Ramjee, M. Sivathanu, N. Kwatra, and S. Viswanatha
  • Beware of fragmentation: Scheduling GPU-sharing workloads with fragmentation gradient descent [pdf]

    • 2023 USENIX Annual Technical Conference, ser. USENIX ATC ’23
    • Q. Weng, L. Yang, Y. Yu, W. Wang, X. Tang, G. Yang, and L. Zhang
  • Lucid: A nonintrusive, scalable and interpretable scheduler for deep learning training jobs [pdf]

    • Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
    • Q. Hu, M. Zhang, P. Sun, Y. Wen, and T. Zhang
  • Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning [pdf]

    • 15th USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI ’21
    • A. Qiao, S. K. Choe, S. J. Subramanya, W. Neiswanger, Q. Ho, H. Zhang, G. R. Ganger, and E. P. Xing
  • Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling [pdf]

    • Proceedings of the 29th Symposium on Operating Systems Principles, 2023
    • S. Jayaram Subramanya, D. Arfeen, S. Lin, A. Qiao, Z. Jia, and G. R. Ganger
  • A codesign of scheduling and parallelization for large model training in heterogeneous clusters [pdf]

    • arXiv preprint arXiv:2403.16125, 2024
    • Chunyu Xue, Weihao Cui, Han Zhao, Quan Chen, Shulai Zhang, Pengyu Yang, Jing Yang, Shaobo Li, Minyi Guo
  • Hydro: Surrogate-Based hyperparameter tuning service in datacenter [pdf]

    • 17th USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI ’23
    • Q. Hu, Z. Ye, M. Zhang, Q. Chen, P. Sun, Y. Wen, and T. Zhang
  • Characterization of large language model development in the datacenter [pdf]

    • 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)
    • Q. Hu, Z. Ye, Z. Wang, G. Wang, M. Zhang, Q. Chen, P. Sun, D. Lin, X. Wang, Y. Luo et al.

Resource Scheduling

  • Switches for hire: Resource scheduling for data center in-network computing [pdf]

    • Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021
    • M. Blocher, L. Wang, P. Eugster, and M. Schmidt
  • SiloD: A co-design of caching and scheduling for deep learning clusters [pdf]

    • Proceedings of the Eighteenth European Conference on Computer Systems, 2023
    • H. Zhao, Z. Han, Z. Yang, Q. Zhang, M. Li, F. Yang, Q. Zhang, B. Li, Y. Yang, L. Qiu et al.
  • Looking beyond GPUs for DNN scheduling on multi-tenant clusters [pdf]

    • 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022
    • J. Mohan, A. Phanishayee, J. Kulkarni, and V. Chidambaram
  • EnvPipe: Performance-preserving DNN training framework for saving energy [pdf]

    • 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023
    • S. Choi, I. Koo, J. Ahn, M. Jeon, and Y. Kwon
  • Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training [pdf]

    • USENIX NSDI, 2023.
    • J. You, J.-W. Chung, and M. Chowdhury
  • Perseus: Removing energy bloat from large model training [pdf]

    • arXiv preprint arXiv:2312.06902, 2023.
    • Jae-Won Chung, Yile Gu, Insu Jang, Luoxi Meng, Nikhil Bansal, Mosharaf Chowdhury