Generalized Inverse Document Frequency
Source:
ACM Conference on Information and Knowledge Management (CIKM) (2008)
Abstract:
Inverse document frequency (IDF) is one of the most useful and widely used concepts in information retrieval. There have been various attempts to provide theoretical justifications for IDF. One of the most appealing derivations follows from the Robertson-Sparck Jones relevance weight. However, this derivation, and others related to it, typically make a number of strong assumptions that are often glossed over. In this paper, we re-examine these assumptions from a Bayesian perspective, discuss possible alternatives, and derive a new, more generalized form of IDF that we call generalized inverse document frequency. In addition to providing theoretical insights into IDF, we also undertake a rigorous empirical evaluation that shows generalized IDF outperforms classical versions of IDF on a number of ad hoc retrieval tasks.