An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance
Abstract
The Normalized Compression Distance (NCD) has been used in a number of domains to compare objects with varying feature types. This flexibility comes from the use of general purpose compression algorithms as the means of computing distances between byte sequences. Such flexibility makes NCD particularly attractive for cases where the right features to use are not obvious, such as malware classification. However, NCD can be computationally demanding, thereby restricting the scale at which it can be applied. We introduce an alternative metric also inspired by compression, the Lempel-Ziv Jaccard Distance (LZJD). We show that this new distance has desirable theoretical properties, as well as comparable or superior performance for malware classification, while being easy to implement and orders of magnitude faster in practice.
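To make the comparison concrete, here is a minimal sketch of both distances in Python. It is an illustration under stated assumptions, not the authors' implementation: `ncd` uses the standard NCD formula with `zlib` as a stand-in for the general-purpose compressor, `lz_set` builds a set of substrings with a simple Lempel-Ziv style parse, and `lzjd` takes the Jaccard distance between two such sets. The function names are hypothetical, and the sketch computes the exact set-based distance without any hashing or sampling speed-ups.

```python
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, with zlib as the compressor (an assumption)."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def lz_set(data: bytes) -> set:
    """Collect substrings with a simple Lempel-Ziv style parse.

    The current substring grows one byte at a time until it has not been
    seen before; it is then recorded and a new substring starts after it.
    """
    seen = set()
    start = 0
    end = 1
    while end <= len(data):
        piece = data[start:end]
        if piece not in seen:
            seen.add(piece)
            start = end
        end += 1
    return seen


def lzjd(x: bytes, y: bytes) -> float:
    """Jaccard distance between the Lempel-Ziv sets of x and y."""
    sx, sy = lz_set(x), lz_set(y)
    union = len(sx | sy)
    if union == 0:
        # Two empty inputs: treat them as identical.
        return 0.0
    return 1.0 - len(sx & sy) / union


if __name__ == "__main__":
    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = b"the quick brown fox leaps over the lazy cat " * 20
    print("NCD :", ncd(a, b))
    print("LZJD:", lzjd(a, b))
```

Note the difference in cost: `ncd` compresses the concatenation of every pair it compares, whereas `lz_set` can be computed once per input and reused across comparisons, leaving only set operations for each pair.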