Techniques for Warehousing of Sample Data
Citation impact: top 10% of 2006 papers.
Abstract
We consider the problem of maintaining a warehouse of sampled data that "shadows" a full-scale data warehouse, in order to support quick approximate analytics and metadata discovery. The full-scale warehouse comprises many "data sets," where a data set is a bag of values; the data sets can vary enormously in size. The values constituting a data set can arrive in batch or stream form. We provide and compare several new algorithms for independent and parallel uniform random sampling of data-set partitions, where the partitions are created by dividing the batch or splitting the stream. We also provide novel methods for merging samples to create a uniform sample from an arbitrary union of data-set partitions. Our sampling/merge methods are the first to simultaneously support statistical uniformity, a priori bounds on the sample footprint, and concise sample storage. As partitions are rolled in and out of the warehouse, the corresponding samples are rolled in and out of the sample warehouse. In this manner our sampling methods approximate the behavior of more sophisticated stream-sampling methods, while also supporting parallel processing. Experiments indicate that our methods are efficient and scalable, and provide guidance for their application.
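The two core operations described above — drawing a bounded-size uniform sample from each partition of a split stream, and merging per-partition samples into a uniform sample of an arbitrary union of partitions — can be illustrated with a small sketch. This is a hypothetical illustration, not the paper's exact algorithms: it uses classical reservoir sampling for the per-partition step and a hypergeometric split (simulated here with `random.sample` over index ranges) for the merge, where `reservoir_sample` and `merge_uniform_samples` are names chosen for this example.

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of at most k items from a stream
    partition, using classical reservoir sampling.  Returns the sample
    and the partition's total size (needed later for merging)."""
    sample, n = [], 0
    for item in stream:
        if n < k:
            sample.append(item)
        else:
            # Replace a reservoir slot with probability k / (n + 1).
            j = random.randrange(n + 1)
            if j < k:
                sample[j] = item
        n += 1
    return sample, n

def merge_uniform_samples(s1, n1, s2, n2, k):
    """Merge uniform samples s1 (from a partition of size n1) and s2
    (from a partition of size n2) into a uniform sample of size k from
    the union of the two partitions.  Requires k <= len(s1), len(s2).

    The number L of union slots falling in the first partition follows
    a Hypergeometric(n1 + n2, n1, k) distribution; we realize L by
    drawing k distinct indices from range(n1 + n2) and counting those
    that land below n1."""
    assert k <= len(s1) and k <= len(s2)
    L = sum(1 for i in random.sample(range(n1 + n2), k) if i < n1)
    return random.sample(s1, L) + random.sample(s2, k - L)
```

Usage: sample each partition independently (in batch or as its sub-stream arrives), then merge on demand; because the merge only needs the samples and the partition sizes, samples can be rolled in and out alongside their partitions.

```python
p1, p2 = list(range(1000)), list(range(1000, 1500))
s1, size1 = reservoir_sample(iter(p1), 50)
s2, size2 = reservoir_sample(iter(p2), 50)
merged = merge_uniform_samples(s1, size1, s2, size2, 50)
# merged is a uniform random sample of size 50 from p1 + p2
```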