Complexity curve: a graphical measure of data complexity and classifier performance
Top 11% of 2016 papers by citation count.
Abstract
We describe a method for assessing data set complexity based on the estimation of the underlying probability distribution and the Hellinger distance. In contrast to some popular complexity measures, it focuses not on the shape of the decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed as a graphical plot, which we call the complexity curve. It shows the relative increase of available information as the sample size grows. We examine the properties of the introduced complexity measure theoretically and experimentally, and show its relation to the variance component of classification error. We then compare it with popular data complexity measures on 81 diverse data sets and show that it can help explain the performance of specific classifiers on these sets. We also apply our methodology to a panel of simple benchmark data sets, demonstrating how it can be used in practice to gain insights into data characteristics. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing the learning process to be sped up significantly without compromising classification accuracy. The associated code (an open-source Python implementation) is available at: https://github.com/zubekj/complexity_curve
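The abstract's core idea can be illustrated with a minimal sketch: for each subsample size, estimate the per-attribute distributions of a random subsample and of the full data set, compute the Hellinger distance between them, and plot that distance against sample size. This is not the authors' implementation (see the linked repository for that); it is a simplified illustration that assumes attribute independence and uses plain histograms as density estimates, with all function names (`hellinger`, `complexity_curve`) chosen here for illustration.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions p and q.

    H(P, Q) = (1/sqrt(2)) * sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2),
    which lies in [0, 1].
    """
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

def attribute_histogram(column, edges):
    """Normalized histogram of one attribute over fixed bin edges."""
    counts, _ = np.histogram(column, bins=edges)
    return counts / counts.sum()

def complexity_curve(X, sample_sizes, bins=10, seed=0):
    """For each sample size n, draw a random subsample of X without
    replacement and return the Hellinger distance between the subsample's
    attribute-wise distributions and those of the full data, averaged
    over attributes (a simplifying independence assumption)."""
    rng = np.random.default_rng(seed)
    n_attrs = X.shape[1]
    # Fix bin edges on the full data so histograms are comparable.
    edges = [np.histogram_bin_edges(X[:, j], bins=bins) for j in range(n_attrs)]
    full = [attribute_histogram(X[:, j], edges[j]) for j in range(n_attrs)]
    curve = []
    for n in sample_sizes:
        idx = rng.choice(len(X), size=n, replace=False)
        sub = X[idx]
        dists = [hellinger(attribute_histogram(sub[:, j], edges[j]), full[j])
                 for j in range(n_attrs)]
        curve.append(float(np.mean(dists)))
    return curve
```

A decreasing curve indicates that modest subsamples already capture most of the information in the attribute distributions, which is the intuition behind using the curve for data pruning: training can stop growing the sample once the distance flattens out.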
Related Papers
- A multi-class classifier based on support vector hyper-spheres for steel plate surface defects (2019), 29 citations
- Techniques for improving precision and construction efficiency of a pattern classifier in composite system reliability assessment (2012), 7 citations
- Hellinger distance as a penalized log likelihood (1994), 21 citations
- Ensemble of Classifiers Based on Hard Instances (2011), 1 citation
- Semi-supervised Learning using Adversarial Training with Good and Bad Samples (2019), 3 citations