GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training
Abstract
Data-parallelism has become an established paradigm for training DNNs that fit inside GPU memory on large-scale HPC systems. However, model-parallelism is required to train out-of-core DNNs. In this paper, we deal with emerging requirements brought forward by very large DNNs trained on high-resolution images common in digital pathology. To address these, we propose, design, and implement GEMS: a GPU-Enabled Memory-Aware Model-Parallelism System. We present several design schemes, including GEMS-MAST, GEMS-MASTER, and GEMS-Hybrid, that offer excellent speedups over state-of-the-art systems like Mesh-TensorFlow and FlexFlow. Furthermore, we combine model-parallelism and data-parallelism to train a 1,000-layer ResNet-1k model on 1,024 Volta V100 GPUs with 97.32% scaling efficiency. For a real-world histopathology whole-slide image (WSI) of 100,000 × 100,000 pixels, we train a custom ResNet-110-v2 on image tiles of size 1,024 × 1,024 and reduce the training time from seven hours to 28 minutes.
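To make the model-parallelism idea behind the abstract concrete, the following is a minimal, hypothetical sketch (not the GEMS implementation, which the abstract does not detail) of naive layer-wise model-parallelism in PyTorch: the first part of a network is placed on GPU 0 and the rest on GPU 1, so a model whose layers and activations exceed a single GPU's memory can still be trained. The layer sizes, device IDs, and optimizer settings here are illustrative assumptions only; GEMS-Hybrid would additionally replicate such a model-parallel group across data-parallel ranks.

```python
# Hypothetical illustration of layer-wise model-parallelism across two GPUs.
# Not the GEMS code; layer shapes and hyperparameters are made up.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First partition of layers lives on GPU 0, second on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        # Activations cross the device boundary between partitions.
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(8, 1024)                    # dummy mini-batch
labels = torch.randint(0, 10, (8,)).to("cuda:1")  # targets on the last device
loss = criterion(model(inputs), labels)
loss.backward()   # gradients flow back across both GPUs
optimizer.step()
```

In a hybrid setup such a model-parallel replica would be duplicated across groups of GPUs, with gradients averaged across replicas (data-parallelism) after each backward pass.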