Optimus
Top 1% of 2018 papers
Abstract
Deep learning workloads are common in today's production clusters due to the proliferation of deep-learning-driven AI services (e.g., speech recognition, machine translation). A deep learning training job is resource-intensive and time-consuming, so efficient resource scheduling is key to maximizing the performance of a deep learning cluster. Existing cluster schedulers are largely not tailored to deep learning jobs and typically assign a fixed amount of resources to each job, which limits both resource efficiency and job performance. This paper proposes Optimus, a customized job scheduler for deep learning clusters that minimizes job training time based on online resource-performance models. Optimus uses online fitting to predict model convergence during training, and builds performance models that accurately estimate training speed as a function of the resources allocated to each job. Based on these models, a simple yet effective method dynamically allocates resources and places deep learning tasks to minimize job completion time. We implement Optimus on top of Kubernetes, a cluster manager for container orchestration, and experiment on a deep learning cluster with 7 CPU servers and 6 GPU servers, running 9 training jobs using the MXNet framework. Results show that Optimus outperforms representative cluster schedulers by about 139% and 63% in terms of job completion time and makespan, respectively.
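The abstract's "online fitting to predict model convergence" can be illustrated with a minimal sketch. This is not Optimus's actual code: it assumes a simplified convergence model in which training loss decays as l(k) ≈ 1/(a·k + b) over training steps k, which linearizes to 1/l = a·k + b and admits an ordinary least-squares fit from observed (step, loss) pairs; the fitted curve then predicts how many more steps a job needs to reach a target loss.

```python
def fit_convergence(steps, losses):
    """Fit the linearized model 1/loss = a*step + b by least squares.

    Assumption (not from the paper's text): loss follows 1/(a*k + b),
    so the reciprocal of the loss is linear in the step count k.
    Returns the fitted coefficients (a, b).
    """
    ys = [1.0 / l for l in losses]          # linearize: y = 1/loss
    n = len(steps)
    mx = sum(steps) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(steps, ys)) \
        / sum((x - mx) ** 2 for x in steps)
    b = my - a * mx
    return a, b


def predict_steps_to_loss(a, b, target_loss):
    """Predicted step count at which loss first reaches target_loss."""
    return (1.0 / target_loss - b) / a


# Usage: fit on losses observed during the first few steps of a job,
# then estimate the remaining work (synthetic data matching the model).
steps = [1, 2, 3, 4, 5]
losses = [1.0 / (0.5 * k + 1.0) for k in steps]
a, b = fit_convergence(steps, losses)
remaining = predict_steps_to_loss(a, b, target_loss=0.25)
```

A scheduler that can estimate remaining steps per job, combined with a model of training speed versus allocated resources, has what it needs to allocate resources so as to minimize predicted completion times, which is the idea the abstract describes.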
Related Papers
- → A Decomposition Method for Makespan Minimization in Job-Shop Scheduling Problem Using Ising Machine (2021), 8 citations
- Capacity Planning of Enterprise Information System through Simulation (2012)
- → Energy Consumption Characterization of Heterogeneous Servers (2013), 2 citations
- Study on Substituting Localized Servers for Imported Servers on ETL Application (2015)
- → System Optimization of Contents Delivery Network with Information-Zooming Function (2005)