Similarity join size estimation using locality sensitive hashing
Citations Over TimeTop 10% of 2011 papers
Abstract
Similarity joins are important operations with a broad range of applications. In this paper, we study the problem of vector similarity join size estimation (VSJ). It is a generalization of the previously studied set similarity join size estimation (SSJ) problem and can handle more interesting cases such as TF-IDF vectors. One of the key challenges in similarity join size estimation is that the join size can change dramatically depending on the input similarity threshold. We propose a sampling based algorithm that uses Locality-Sensitive-Hashing (LSH). The proposed algorithm LSH-SS uses an LSH index to enable effective sampling even at high thresholds. We compare the proposed technique with random sampling and the state-of-the-art technique for SSJ (adapted to VSJ) and demonstrate LSH-SS offers more accurate estimates throughout the similarity threshold range and small variance using real-world data sets.
Related Papers
- → Two MRJs for Multi-way Theta-Join in MapReduce(2013)6 cited
- → Towards multi-way join aware optimizer in SAP HANA(2020)3 cited
- USING SEMI-JOINS TO REALIZE JOINS AND MIXED-SEMI(?)JOINS(1987)
- → Storing Join Relationships for Fast Join Query Processing(2017)
- → The Copies, Joins, Join Sketches and Plates(2021)