Fast Pattern-Matching via k-bit Filtering Based Text Decomposition
Citations Over Time
Abstract
This study explores an alternative way of storing text files to answer exact match queries faster. We decompose the original file into two parts as filter and payload. The filter part contains the most informative k bits of each byte, and the remaining bits of the bytes are concatenated in the order of appearance to generate the payload. We refer to this structure as k-bit filtered format. When an input pattern is to be searched on the k-bit filtered structure, the same decomposition is performed on the pattern. The k bits from each byte of the pattern form the pattern filter bit sequence, and the rest is the payload. The pattern filter is first scanned on the filter part of the file. At each match position detected in the filter part, the pattern payload is verified against the corresponding location in the payload part of the text. Thus, instead of searching an m-byte pattern on an n-byte text, first k·m bits are scanned on k·n bits, followed by a verification of (8 − k)·m bits on the respective locations of the matching positions. Experiments conducted on natural language texts, plain ASCII DNA sequences and random byte sequences showed that the search performance with the proposed scheme is on average two times faster than the tested best exact pattern-matching algorithms. The highest gain is obtained on plain ASCII DNA sequences. We also developed an effective bitwise pattern-matching algorithm of possible independent interest within this study.
Related Papers
- → Using Parallel Bloom Filters for Multiattribute Representation on Network Services(2009)48 cited
- → TinySet—An Access Efficient Self Adjusting Bloom Filter Construction(2017)39 cited
- → TinySet - An Access Efficient Self Adjusting Bloom Filter Construction(2015)6 cited
- → Efficient false positive free set synchronization using an extended bloom filter approach(2013)3 cited
- → ASCII Text Numbers(1995)