0 citations0 references

Towards estimation error guarantees for distinct values

2000pp. 268–279

Citations Over TimeTop 10% of 2000 papers

Moses Charikar, Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya

Abstract

We consider the problem of estimating the number of distinct values in a column of a table. For large tables without an index on the column, random sampling appears to be the only scalable approach for estimating the number of distinct values. We establish a powerful negative result stating that no estimator can guarantee small error across all input distributions, unless it examines a large fraction of the input data. In fact, any estimator must incur a significant error on at least some of a natural class of distributions. We then provide a new estimator which is provably optimal, in that its error is guaranteed to essentially match our negative result. A drawback of this estimator is that while its worst-case error is reasonable, it does not necessarily give the best possible error bound on any given distribution. Therefore, we develop heuristic estimators that are optimized for a class of typical input distributions. While these estimators lack strong guarantees on distribution-independent worst-case error, our extensive empirical comparison indicate their effectiveness both on real data sets and on synthetic data sets.

Citations Over TimeTop 10% of 2000 papers

Abstract

Related Papers