VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale
Top 10% of 2019 papers
Abstract
Global checkpointing to external storage (e.g., a parallel file system) is a common I/O pattern in many HPC applications. However, given the limited I/O throughput of external storage, global checkpointing often leads to I/O bottlenecks. To address this issue, a shift from synchronous checkpointing (i.e., blocking until writes have finished) to asynchronous checkpointing (i.e., writing to faster local storage and flushing to external storage in the background) is increasingly being adopted. However, with rising core counts per node and the heterogeneity of both local and external storage, it is non-trivial to design efficient asynchronous checkpointing mechanisms, due to the complex interplay between high concurrency and I/O performance variability at both the node-local and global levels. This problem is not well understood, yet it is highly important for modern supercomputing infrastructures. This paper proposes a versatile asynchronous checkpointing solution that addresses this problem. To this end, we introduce a concurrency-optimized technique that combines performance modeling with lightweight monitoring to make informed decisions about which local storage devices to use, in order to dynamically adapt to background flushes and reduce the checkpointing overhead. We illustrate this technique using the VeloC prototype. Extensive experiments on a pre-Exascale supercomputing system show significant benefits.
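The difference between synchronous and asynchronous checkpointing described above can be illustrated with a minimal Python sketch. It is not the VeloC API or the paper's method: the class `AsyncCheckpointer`, its methods, and the directory arguments are hypothetical names chosen for illustration, and plain directories stand in for node-local and external storage. The application blocks only for the local write, while a background thread flushes each file to the external location.

```python
import os
import queue
import shutil
import threading


class AsyncCheckpointer:
    """Toy asynchronous checkpointer (hypothetical, not the VeloC API)."""

    def __init__(self, local_dir, external_dir):
        # Plain directories stand in for node-local and external storage.
        self.local_dir = local_dir
        self.external_dir = external_dir
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._flush_loop, daemon=True)
        self._worker.start()

    def checkpoint(self, name, data):
        """Block only for the local write, then queue the file for flushing."""
        local_path = os.path.join(self.local_dir, name)
        with open(local_path, "wb") as f:
            f.write(data)
        self._pending.put(local_path)
        return local_path

    def _flush_loop(self):
        # Background flush: copy each queued file to external storage.
        while True:
            local_path = self._pending.get()
            try:
                if local_path is None:  # sentinel: stop the worker
                    return
                shutil.copy(local_path, self.external_dir)
            finally:
                self._pending.task_done()

    def wait(self):
        """Block until every queued checkpoint has been flushed."""
        self._pending.join()

    def close(self):
        """Flush outstanding checkpoints and stop the background thread."""
        self.wait()
        self._pending.put(None)
        self._worker.join()
```

A caller would invoke `checkpoint()` at each checkpoint step and `close()` at the end of the run. A synchronous variant would instead write directly to the external directory inside `checkpoint()`. The sketch uses a single local directory and does no monitoring or modeling, so it does not capture the adaptive choice among local storage devices that the abstract describes.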