Clustering Based Undersampling for Handling Class Imbalance in C4.5 Classification Algorithm
Abstract
It is difficult to build an effective machine learning model when the class distribution in the training data set is imbalanced. Class imbalance is common in real-world classification problems, where one class has very few instances (the minority class) while the other classes have many (the majority classes). A learning model built without accounting for class imbalance is flooded by majority class instances and tends to ignore minority class predictions. Random undersampling and oversampling techniques have been widely used in previous studies to overcome class imbalance. This study uses an undersampling strategy based on clustering, with C4.5 as the classification model. Clustering is used to group the data, and undersampling is performed on each group, so that useful samples are not eliminated. Statistical tests on experiments with 10 imbalanced datasets of various sample sizes from the KEEL repository and Kaggle indicate that clustering-based undersampling produces satisfactory performance. The improvement is visible in the sensitivity and AUC values, which increased significantly.
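The idea of clustering the data and undersampling within each group can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm or the per-cluster selection rule, so everything below is an assumption made for illustration: plain k-means is applied to the majority class only, and each cluster contributes a random subset proportional to its size. The function names `kmeans` and `cluster_undersample` and the parameters `n_keep` and `k` are hypothetical, and the C4.5 training step is not shown.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels


def cluster_undersample(majority, n_keep, k=3, seed=0):
    """Keep roughly n_keep majority samples, drawing from every cluster."""
    rng = random.Random(seed)
    labels = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        if not members:
            continue
        # Assumption: quota proportional to cluster size, at least one sample.
        quota = max(1, round(n_keep * len(members) / len(majority)))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


# Toy data: 300 majority points in three groups, 30 minority points.
rng = random.Random(42)
majority = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(100)]
minority = [(rng.gauss(2.5, 0.5), rng.gauss(2.5, 0.5)) for _ in range(30)]

reduced_majority = cluster_undersample(majority, n_keep=len(minority), k=3)
training_set = ([(p, 0) for p in reduced_majority]
                + [(p, 1) for p in minority])
```

Because every cluster contributes samples, each region of the majority class remains represented after reduction, which is the motivation given above for not discarding useful samples. The resulting `training_set` would then be passed to a decision tree learner such as C4.5.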