Why inverse document frequency?
Citations over time: top 10% of 2001 papers
Abstract
Inverse Document Frequency (IDF) is a popular measure of a word's importance. It appears in a host of heuristic measures used in information retrieval. So far, however, the IDF has itself been a heuristic. In this paper, we show that IDF is optimal in a principled sense: it is the optimal weight of a word with respect to minimization of a Kullback-Leibler distance, suitably generalized to nonnegative functions that need not be probability distributions. This optimization problem is closely related to the maximum entropy problem. We show that the IDF is the optimal weight associated with a word-feature in an information retrieval setting where we treat each document as the query that retrieves itself. That is, IDF is optimal for document self-retrieval.
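For readers who want something concrete, the sketch below computes the textbook IDF weight, log(N / df), for each word in a toy corpus, along with one common generalization of the Kullback-Leibler distance to nonnegative vectors that need not sum to one. Both are assumptions made for illustration: the abstract does not state the exact formulas the paper uses, and the function names `idf_weights` and `generalized_kl` and the whitespace tokenization are hypothetical choices.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Return {word: log(N / df)} for a list of document strings.

    Assumption: the textbook IDF form, where N is the number of documents
    and df is the number of documents containing the word. Tokenization
    is a plain lowercase whitespace split, chosen only for illustration.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document, however often it occurs.
        doc_freq.update(set(doc.lower().split()))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def generalized_kl(p, q):
    """Generalized KL distance: sum of p*log(p/q) - p + q.

    Assumption: this is one standard extension of KL to nonnegative p and
    strictly positive q; it reduces to ordinary KL when both sum to 1.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            total += p_i * math.log(p_i / q_i)
        total += q_i - p_i
    return total


corpus = ["the cat sat", "the dog barked", "the cat purred"]
weights = idf_weights(corpus)
```

In this toy corpus, a word that occurs in every document ("the") receives weight zero, while rarer words receive larger weights, which matches the intuition that IDF measures how well a word discriminates between documents.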
Related Papers
- An Improved TFIDF Algorithm in Text Classification (2014), cited 18 times
- An Improved TFIDF Algorithm in Electronic Information Feature Extraction Based on Document Position (2012), cited 4 times
- Research and Improvement of TFIDF Text Feature Weighting Method (2014)
- On Improvement of Feature Weight Algorithm in Hierarchical Text Classification (2011)
- A text feature selection algorithm based on class discrimination (2013)