Optimizing distributed training deployment in heterogeneous GPU clusters
Top 14% of 2020 papers
Abstract
This paper proposes HeteroG, an automatic module to accelerate deep neural network training in heterogeneous GPU clusters. To train a deep learning model with large amounts of data, distributed training using data or model parallelism has been widely adopted, mostly over homogeneous devices (same GPU model, uniform network bandwidth). Heterogeneous training environments often arise in shared clusters, with GPUs of different models purchased in different batches and network connections of varying available bandwidth (e.g., due to contention). Classic data parallelism does not work well in a heterogeneous cluster, while model-parallel training is hard to plan. HeteroG enables highly efficient distributed training over heterogeneous devices by automatically converting a single-GPU training model to a distributed one according to the deep learning graph and available resources. HeteroG embraces operation-level hybrid parallelism, communication architecture selection and execution scheduling, based on a carefully designed strategy framework exploiting both graph neural network (GNN)-based learning and combinatorial optimization. We compare HeteroG with existing parallelism schemes and show that it achieves up to 222% training speed-up. HeteroG also enables efficient training of large models over a set of heterogeneous devices where simple parallelism is infeasible.
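To make concrete why classic data parallelism underperforms on heterogeneous devices, here is a minimal sketch of one ingredient of heterogeneity-aware planning: splitting a global batch across GPUs in proportion to their measured throughput, so no fast device idles waiting for a slow one. This is a simplified stand-in for illustration only; the device names and throughput numbers are made up, and HeteroG's actual strategy framework (GNN-based learning plus combinatorial optimization over per-operation strategies) is far more general than this heuristic.

```python
from typing import Dict


def split_batch(global_batch: int, throughput: Dict[str, float]) -> Dict[str, int]:
    """Assign each device a per-replica batch size proportional to its
    measured samples/sec, so all replicas finish a training step at
    roughly the same time. Uniform data parallelism instead gives every
    device an equal share, leaving fast GPUs stalled at synchronization.
    """
    total = sum(throughput.values())
    shares = {d: int(round(global_batch * t / total)) for d, t in throughput.items()}
    # Fix rounding drift so the shares sum exactly to the global batch.
    drift = global_batch - sum(shares.values())
    fastest = max(throughput, key=throughput.get)
    shares[fastest] += drift
    return shares


if __name__ == "__main__":
    # Assumed throughputs (samples/sec) for a hypothetical mixed V100/P100 cluster.
    rates = {"v100_0": 900.0, "v100_1": 900.0, "p100_0": 450.0}
    print(split_batch(1024, rates))
    # -> {'v100_0': 409, 'v100_1': 410, 'p100_0': 205}
```

Under uniform splitting, each GPU here would receive ~341 samples and the step time would be dictated by the P100; proportional splitting roughly equalizes per-step compute time across devices, which is the intuition behind the larger search over parallelization strategies that HeteroG automates.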
Related Papers
- Exploiting task and data parallelism on a multicomputer (1993), 60 citations
- On the duality between Or-parallelism and And-parallelism in logic programming (1995), 14 citations
- Relating data-parallelism and (and-) parallelism in logic programs (1996), 11 citations
- Extending OpenMP for Task Parallelism (2003), 6 citations
- Intra-function Parallelism (2001)