Towards estimation error guarantees for distinct values
Citations Over TimeTop 10% of 2000 papers
Abstract
We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We establish a powerful negative result stating that no estimator can guarantee small error across all input distributions, unless it examines a large fraction of the input data. In fact, any estimator must incur a significant error on at least some of a natural class of distributions. We then provide a new estimator which is provably optimal, in that its error is guaranteed to essentially match our negative result. A drawback of this estimator is that while its worst-case error is reasonable, it does not necessarily give the best possible error bound on any given distribution. Therefore, we develop heuristic estimators that are optimized for a class of typical input distributions. While these estimators lack strong guarantees on distribution-independent worst-case error, our extensive empirical comparison indicate their effectiveness both on real data sets and on synthetic data sets.
Related Papers
- → Multiple importance sampling revisited: breaking the bounds(2018)32 cited
- → Difference-type estimators for estimation of mean in the presence of measurement error(2014)3 cited
- → Difference-type estimators for estimation of mean in the presence of\n measurement error(2014)2 cited
- → The Influence of Measurement Errors on Generalized Estimator of Population Mean(2022)
- → Estimators for the Product of Two Population Means Using Auxiliary Attribute in the Presence of Non-response(2022)