Enhancing Checkpoint Performance with Staging IO and SSD
Citations over time: top 10% of 2010 papers
Abstract
With the ever-growing size of computer clusters and applications, system failures are becoming inevitable. Checkpointing, a strategy for ensuring fault tolerance, has become imperative in such an environment. However, the existing mechanism for writing checkpoints to parallel file systems does not perform well as job size increases. Solid-state disks (SSDs) are attracting more and more attention because of technical merits such as good random-access performance, low power consumption, and shock resistance. How to apply SSDs in a parallel storage system to improve checkpoint writing, however, remains an open question. In this paper we propose a new strategy that enhances checkpoint writing performance by aggregating checkpoint writes on the client side and using staging IO on the data servers. We also explore the potential of replacing traditional hard disks with SSDs on the data servers to achieve better write bandwidth. With 8 client nodes and 4 data servers, our strategy achieves up to 6.3 times higher write bandwidth than PVFS2, a popular parallel file system. In experiments with real applications using 64 application processes and 4 data servers, our strategy accelerates checkpoint writing by up to 9.9 times compared to PVFS2.
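The client-side aggregation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `WriteAggregator`, the `chunk_size` threshold, and the in-memory sink are all assumptions made for illustration. The sketch only shows the general idea of coalescing many small checkpoint writes into fewer large ones before they reach storage; it does not model staging IO on the data servers.

```python
import io


class WriteAggregator:
    """Hypothetical client-side buffer that coalesces small checkpoint writes."""

    def __init__(self, sink, chunk_size=4 * 1024 * 1024):
        self.sink = sink              # any object with a write(bytes) method
        self.chunk_size = chunk_size  # assumed flush threshold, not from the paper
        self.buffer = bytearray()
        self.flushes = 0              # number of large writes issued to the sink

    def write(self, data):
        """Buffer a small write and emit full chunks once enough data is held."""
        self.buffer.extend(data)
        while len(self.buffer) >= self.chunk_size:
            self._emit(self.chunk_size)

    def _emit(self, n):
        # Send the first n buffered bytes to the sink as one large write.
        self.sink.write(bytes(self.buffer[:n]))
        del self.buffer[:n]
        self.flushes += 1

    def close(self):
        """Flush whatever remains in the buffer."""
        if self.buffer:
            self._emit(len(self.buffer))


def demo():
    sink = io.BytesIO()               # stand-in for a data server connection
    agg = WriteAggregator(sink, chunk_size=1024)
    for i in range(100):
        agg.write(bytes([i]) * 100)   # 100 small writes of 100 bytes each
    agg.close()
    return agg.flushes, sink.getvalue()
```

In this toy run, 100 writes of 100 bytes are delivered to the sink as 10 larger writes, with the byte order unchanged. A real system would pick the threshold to match the storage back end and would have to handle concurrent writers, which this sketch ignores.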