Optimizing tertiary storage organization and access for spatio-temporal datasets
Abstract
This paper addresses data management techniques for efficiently retrieving requested subsets of large datasets stored on mass storage devices. This problem is a major bottleneck that can negate the benefits of fast networks, because the time to access a subset of a large dataset stored on a mass storage system is much greater than the time to transmit that subset over a network. The paper focuses on very large spatial and temporal datasets generated by simulation programs in the area of climate modeling, but the techniques developed can be applied to other applications that deal with large multidimensional datasets. The main requirement we address is efficient access to subsets of information contained within much larger datasets, for the purpose of analysis and interactive visualization. We have developed data partitioning techniques that partition datasets into "clusters" based on analysis of data access patterns and storage device characteristics. The goal is to minimize the number of clusters read from mass storage systems when subsets are requested. In this paper we emphasize proposed enhancements to current storage server protocols that permit control over the physical placement of data on storage devices. We also discuss in some detail the interface between application programs and the mass storage system, as well as a workbench that helps scientists design the best reorganization of a dataset for anticipated access patterns.
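To make the stated goal concrete, the sketch below counts how many clusters a subset request touches under two different partitionings. It is an illustration only, not the partitioning algorithm developed in the paper: it assumes clusters form a regular grid over the dataset, and the function name `clusters_touched`, the dataset dimensions, and both cluster shapes are hypothetical.

```python
from math import prod


def clusters_touched(query, cluster_shape):
    """Count the clusters that a hyper-rectangular query intersects.

    query: (start, stop) half-open index ranges, one per dimension.
    cluster_shape: cluster extent along each dimension (regular grid assumed).
    """
    counts = []
    for (start, stop), size in zip(query, cluster_shape):
        first = start // size        # index of the first cluster touched
        last = (stop - 1) // size    # index of the last cluster touched
        counts.append(last - first + 1)
    return prod(counts)


# Hypothetical dataset: 1200 time steps x 180 latitudes x 360 longitudes.
time_series_query = [(0, 1200), (40, 50), (100, 110)]  # all times, small region
snapshot_query = [(7, 8), (0, 180), (0, 360)]          # one time step, whole globe

time_major = (1, 180, 360)    # one cluster per time step
space_major = (1200, 10, 10)  # one cluster per 10 x 10 region, all time steps

for name, shape in [("time_major", time_major), ("space_major", space_major)]:
    print(name,
          clusters_touched(time_series_query, shape),
          clusters_touched(snapshot_query, shape))
```

Under these assumptions the time-series request reads 1200 clusters with the time-major layout but only 1 with the space-major layout, while the snapshot request shows the reverse (1 versus 648). This is why the choice of partitioning has to follow from the anticipated access patterns.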
Related Papers
- Efficient organization and access of multi-dimensional datasets on tertiary storage systems (1995), cited 54 times
- Constructing collaborative desktop storage caches for large scientific datasets (2006), cited 25 times
- Multi-level data layout optimization for heterogeneous access patterns (2013)
- Solid-State Storage and Work Sharing for Efficient Scaleup Data Analytics (2014), cited 1 time
- Mapping Datasets to Object Storage System (2020)