RDMA over Ethernet for Distributed Training at Meta Scale
2024, pp. 57–70
Adithya Gangidi, Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashidhar Gandham, Hongyi Zeng
Abstract
The rapid growth in both computational density and scale in AI models in recent years motivates the construction of an efficient and reliable dedicated network infrastructure. This paper presents the design, implementation, and operation of Meta's Remote Direct Memory Access over Converged Ethernet (RoCE) networks for distributed AI training.