INFRASTRUCTURE FOR LLM TRAINING

Introduction

We explore the infrastructure design for training LLMs, encompassing accelerators, networks, storage, and scheduling systems.

Content

AI Accelerators
- NVIDIA GPUs
- Other AI Accelerators
Network Infrastructure
- Chip-to-Chip
- Node-to-Node
- Network Topology
- Load Balancing & Congestion Control (CC)
Storage Systems
- Checkpoint Storage
- Training Data Storage
Scheduling Systems
- Workload Scheduling
- Resource Scheduling

AI Accelerators

NVIDIA GPUs

NVIDIA Ampere Architecture. [Website]
NVIDIA Hopper Architecture. [Website]
NVIDIA Blackwell Architecture. [pdf]

Other AI Accelerators

AMD Instinct™ MI250X accelerator enabled by elevated fanout bridge advanced packaging architecture [pdf]
- 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits).
- R. Swaminathan, M. J. Schulte, B. Wilkerson, G. H. Loh, A. Smith, N. James
Gaudi training platform white paper [White paper]
- Habana
TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings [pdf]
- Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023
- N. Jouppi, G. Kurian, S. Li, P. Ma, R. Nagarajan, L. Nai, N. Patil, S. Subramanian, A. Swing, B. Towles et al.
A comprehensive performance study of large language models on novel AI accelerators [pdf]
- arXiv preprint arXiv:2310.04607, 2023
- M. Emani, S. Foreman, V. Sastry, Z. Xie, S. Raskar, W. Arnold, R. Thakur, V. Vishwanath, and M. E. Papka
The Cerebras CS-2: Designing an AI accelerator around the world’s largest 2.6 trillion transistor chip [pdf]
- Proceedings of the 2022 International Symposium on Physical Design, 2022
- J.-P. Fricker

Network Infrastructure

Chip-to-Chip

NVIDIA DGX-1 system architecture white paper [White paper]
3.2 The A100 datacenter GPU and Ampere architecture [pdf]
- 2021 IEEE International Solid-State Circuits Conference (ISSCC)
- J. Choquette, E. Lee, R. Krashinsky, V. Balan, and B. Khailany
2.2 AMD chiplet architecture for high-performance server and desktop products [pdf]
- 2020 IEEE International Solid-State Circuits Conference (ISSCC)
- S. Naffziger, K. Lepak, M. Paraschou, and M. Subramony
A domain-specific supercomputer for training deep neural networks [pdf]
- Communications of the ACM, vol. 63, no. 7, pp. 67–78, 2020.
- N. P. Jouppi, D. H. Yoon, G. Kurian, S. Li, N. Patil, J. Laudon, C. Young, and D. Patterson
Google’s training chips revealed: TPUv2 and TPUv3 [pdf]
- Hot Chips Symposium, 2020
- T. Norrie, N. Patil, D. H. Yoon, G. Kurian, S. Li, J. Laudon, C. Young, N. P. Jouppi, and D. A. Patterson

Node-to-Node

The development of Mellanox/NVIDIA GPUDirect over InfiniBand—a new model for GPU to GPU communications [pdf]
- Computer Science-Research and Development, vol. 26, pp. 267–273, 2011
- G. Shainer, A. Ayoub, P. Lui, T. Liu, M. Kagan, C. R. Trott, G. Scantlen, and P. S. Crozier
An introduction to the InfiniBand architecture [pdf]
- High performance mass storage and parallel I/O, vol. 42, no. 617-632, p. 10, 2001.
- G. F. Pfister
Supplement to InfiniBand architecture specification volume 1 release 1.2.2 annex A16 [pdf]
- InfiniBand Trade Association
Architectural Specifications for RDMA over TCP/IP [Website]
- RDMA Consortium

Network Topology

A study of non-blocking switching networks [pdf]
- Bell System Technical Journal
- C. Clos
BCube: a high performance, server-centric network architecture for modular data centers [pdf]
- Proceedings of the ACM SIGCOMM 2009 conference on Data communication, 2009
- Chuanxiong Guo, Guohan Lu, Dan Li, Haitao Wu, Xuan Zhang, Yunfeng Shi, Chen Tian, Yongguang Zhang, Songwu Lu
DCell: a scalable and fault-tolerant network structure for data centers [pdf]
- Proceedings of the ACM SIGCOMM 2008 conference on Data communication, 2008
- Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, Songwu Lu
Jellyfish: Networking data centers randomly [pdf]
- 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12)
- A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey
Technology-driven, highly-scalable dragonfly topology [pdf]
- ACM SIGARCH Computer Architecture News
- J. Kim, W. J. Dally, S. Scott, and D. Abts
Dragonfly+: Low cost topology for scaling datacenters [pdf]
- 2017 IEEE 3rd International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB)
- A. Shpiner, Z. Haramaty, S. Eliad, V. Zdornov, B. Gafni, and E. Zahavi
Doubling all2all performance with NVIDIA Collective Communication Library 2.12 [pdf]
- K. Mandakolathur and S. Jeaugey
Alibaba HPN: A data center network for large language model training [pdf]
- Proceedings of the ACM SIGCOMM 2024 Conference, 2024
- Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, Chao Wang, Peng Wang, Pengcheng Zhang, Xianlong Zeng, Eddie Ruan, Zhiping Yao, Ennan Zhai, Dennis Cai
Optimized network architectures for large language model training with billions of parameters [pdf]
- arXiv preprint arXiv:2307.12169
- Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, Naader Hasani
HammingMesh: a network topology for large-scale deep learning [pdf]
- SC22: International Conference for High Performance Computing, Networking, Storage and Analysis.
- T. Hoefler, T. Bonato, D. De Sensi, S. Di Girolamo, S. Li, M. Heddes, J. Belk, D. Goel, M. Castro, and S. Scott
SiP-ML: high-bandwidth optical network interconnects for machine learning training [pdf]
- Proceedings of the 2021 ACM SIGCOMM 2021 Conference, 2021
- M. Khani, M. Ghobadi, M. Alizadeh, Z. Zhu, M. Glick, K. Bergman, A. Vahdat, B. Klenk, and E. Ebrahimi
TopoOpt: Co-optimizing network topology and parallelization strategy for distributed training jobs [pdf]
- 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)
- Weiyang Wang, Moein Khazraee, Zhizhen Zhong, Manya Ghobadi, Zhihao Jia, Dheevatsa Mudigere, Ying Zhang, Anthony Kewitsch

Load Balancing & Congestion Control (CC)

Analysis of an equal-cost multi-path algorithm [pdf]
- C. Hopps
On the impact of packet spraying in data center networks [pdf]
- 2013 Proceedings IEEE INFOCOM
- A. Dixit, P. Prakash, Y. C. Hu, and R. R. Kompella
Challenging the need for packet spraying in large-scale distributed training [pdf]
- arXiv preprint arXiv:2407.00550, 2024.
- V. Addanki, P. Goyal, and I. Marinos
MegaScale: Scaling large language model training to more than 10,000 GPUs [pdf]
- 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024
- Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
IEEE 802.1Qbb – Priority-based flow control [pdf]
TIMELY: RTT-based congestion control for the datacenter [pdf]
- ACM SIGCOMM Computer Communication Review
- R. Mittal, V. T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, and D. Zats
Swift: Delay is simple and effective for congestion control in the datacenter [pdf]
- Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, 2020
- G. Kumar, N. Dukkipati, K. Jang, H. M. Wassel, X. Wu, B. Montazeri, Y. Wang, K. Springborn, C. Alfeld, M. Ryan et al.
Congestion control for large-scale RDMA deployments [pdf]
- ACM SIGCOMM Computer Communication Review
- Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron, J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang
ECN or delay: Lessons learnt from analysis of DCQCN and TIMELY [pdf]
- Proceedings of the 12th International on Conference on emerging Networking EXperiments and Technologies
- Y. Zhu, M. Ghobadi, V. Misra, and J. Padhye
HPCC: High precision congestion control [pdf]
- Proceedings of the ACM special interest group on data communication, 2019
- Y. Li, R. Miao, H. H. Liu, Y. Zhuang, F. Feng, L. Tang, Z. Cao, M. Zhang, F. Kelly, M. Alizadeh et al.
An edge-queued datagram service for all datacenter traffic [pdf]
- 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)
- V. Olteanu, H. Eran, D. Dumitrescu, A. Popa, C. Baciu, M. Silberstein, G. Nikolaidis, M. Handley, and C. Raiciu
RoCC: robust congestion control for RDMA [pdf]
- Proceedings of the 16th International conference on emerging networking experiments and technologies, 2020
- P. Taheri, D. Menikkumbura, E. Vanini, S. Fahmy, P. Eugster, and T. Edsall
MLTCP: Congestion control for DNN training [pdf]
- arXiv preprint arXiv:2402.09589, 2024
- Sudarsanan Rajasekaran, Sanjoli Narang, Anton A. Zabreyko, Manya Ghobadi
CASSINI: Network-aware job scheduling in machine learning clusters [pdf]
- 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)
- S. Rajasekaran, M. Ghobadi, and A. Akella
Towards domain-specific network transport for distributed DNN training [pdf]
- 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)
- H. Wang, H. Tian, J. Chen, X. Wan, J. Xia, G. Zeng, W. Bai, J. Jiang, Y. Wang, and K. Chen

Storage Systems

Checkpoint Storage

Facebook’s Tectonic filesystem: Efficiency from exascale [pdf]
- 19th USENIX Conference on File and Storage Technologies (FAST 21)
- S. Pan, T. Stavrinos, Y. Zhang, A. Sikaria, P. Zakharov, A. Sharma, M. Shuey, R. Wareing, M. Gangapuram, G. Cao et al.
The Hadoop distributed file system [pdf]
- 2010 IEEE 26th symposium on mass storage systems and technologies (MSST).
- K. Shvachko, H. Kuang, S. Radia, and R. Chansler
Ceph: A scalable, high-performance distributed file system [pdf]
- Proceedings of the 7th Conference on Operating Systems Design and Implementation (OSDI’06)
- S. Weil, S. A. Brandt, E. L. Miller, D. D. Long, and C. Maltzahn

Training Data Storage

Lustre: Building a file system for 1000-node clusters [pdf]
- Proceedings of the 2003 Linux symposium
- P. Schwan et al.
GPFS: A shared-disk file system for large computing clusters [pdf]
- Conference on file and storage technologies (FAST 02)
- F. Schmuck and R. Haskin
I/O characterization and performance evaluation of BeeGFS for deep learning [pdf]
- Proceedings of the 48th International Conference on Parallel Processing, 2019
- F. Chowdhury, Y. Zhu, T. Heer, S. Paredes, A. Moody, R. Goldstone, K. Mohror, and W. Yu
Tachyon: Reliable, memory speed storage for cluster computing frameworks [pdf]
- Proceedings of the ACM Symposium on Cloud Computing, 2014
- H. Li, A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica
JuiceFS: A High-Performance, Cloud-Native, Distributed File System [Github]
- JuiceFS
Quiver: An informed storage cache for deep learning [pdf]
- 18th USENIX Conference on File and Storage Technologies (FAST 20), 2020
- A. V. Kumar and M. Sivathanu
Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs [pdf]
- 2022 IEEE 38th International Conference on Data Engineering (ICDE).
- Rong Gu, Kai Zhang, Zhihao Xu, et al.

Scheduling Systems

Workload Scheduling

Tiresias: A GPU cluster manager for distributed deep learning [pdf]
- 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), 2019
- J. Gu, M. Chowdhury, K. G. Shin, Y. Zhu, M. Jeon, J. Qian, H. Liu, and C. Guo
Themis: Fair and efficient GPU cluster scheduling [pdf]
- 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), 2020
- K. Mahajan, A. Balasubramanian, A. Singhvi, S. Venkataraman, A. Akella, A. Phanishayee, and S. Chawla
ElasticFlow: An elastic serverless training platform for distributed deep learning [pdf]
- Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023
- D. Gu, Y. Zhao, Y. Zhong, Y. Xiong, Z. Han, P. Cheng, F. Yang, G. Huang, X. Jin, and X. Liu
Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads [pdf]
- 14th USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI ’20. USENIX Association, 2020
- D. Narayanan, K. Santhanam, F. Kazhamiaka, A. Phanishayee, and M. Zaharia
Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning [pdf]
- Proceedings of the Fifteenth European Conference on Computer Systems, ser. EuroSys ’20
- S. Chaudhary, R. Ramjee, M. Sivathanu, N. Kwatra, and S. Viswanatha
Beware of fragmentation: Scheduling GPU-sharing workloads with fragmentation gradient descent [pdf]
- 2023 USENIX Annual Technical Conference, ser. USENIX ATC ’23
- Q. Weng, L. Yang, Y. Yu, W. Wang, X. Tang, G. Yang, and L. Zhang
Lucid: A nonintrusive, scalable and interpretable scheduler for deep learning training jobs [pdf]
- Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
- Q. Hu, M. Zhang, P. Sun, Y. Wen, and T. Zhang
Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning [pdf]
- 15th USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI ’21
- A. Qiao, S. K. Choe, S. J. Subramanya, W. Neiswanger, Q. Ho, H. Zhang, G. R. Ganger, and E. P. Xing
Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling [pdf]
- Proceedings of the 29th Symposium on Operating Systems Principles, 2023
- S. Jayaram Subramanya, D. Arfeen, S. Lin, A. Qiao, Z. Jia, and G. R. Ganger
A codesign of scheduling and parallelization for large model training in heterogeneous clusters [pdf]
- arXiv preprint arXiv:2403.16125, 2024
- Chunyu Xue, Weihao Cui, Han Zhao, Quan Chen, Shulai Zhang, Pengyu Yang, Jing Yang, Shaobo Li, Minyi Guo
Hydro: Surrogate-Based hyperparameter tuning service in datacenter [pdf]
- 17th USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI ’23
- Q. Hu, Z. Ye, M. Zhang, Q. Chen, P. Sun, Y. Wen, and T. Zhang
Characterization of large language model development in the datacenter [pdf]
- 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)
- Q. Hu, Z. Ye, Z. Wang, G. Wang, M. Zhang, Q. Chen, P. Sun, D. Lin, X. Wang, Y. Luo et al.

Resource Scheduling

Switches for hire: Resource scheduling for data center in-network computing [pdf]
- Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021
- M. Blocher, L. Wang, P. Eugster, and M. Schmidt
SiloD: A co-design of caching and scheduling for deep learning clusters [pdf]
- Proceedings of the Eighteenth European Conference on Computer Systems, 2023
- H. Zhao, Z. Han, Z. Yang, Q. Zhang, M. Li, F. Yang, Q. Zhang, B. Li, Y. Yang, L. Qiu et al.
Looking beyond GPUs for DNN scheduling on multi-tenant clusters [pdf]
- 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022
- J. Mohan, A. Phanishayee, J. Kulkarni, and V. Chidambaram
EnvPipe: Performance-preserving DNN training framework for saving energy [pdf]
- 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023
- S. Choi, I. Koo, J. Ahn, M. Jeon, and Y. Kwon
Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training [pdf]
- USENIX NSDI, 2023.
- J. You, J.-W. Chung, and M. Chowdhury
Perseus: Removing energy bloat from large model training [pdf]
- arXiv preprint arXiv:2312.06902, 2023.
- Jae-Won Chung, Yile Gu, Insu Jang, Luoxi Meng, Nikhil Bansal, Mosharaf Chowdhury
