Abstract—Data mining is the automatic process of finding interesting and useful patterns in data for specific tasks such as predicting future data or classifying the label or group of new data items. Many data mining algorithms that have been successfully applied to real-life data are tree-based. Among the tree-based algorithms, the decision tree is the most popular, renowned for its high classification accuracy in general cases where the data in each class are roughly equally distributed. However, many datasets in real applications are imbalanced: the data in some classes outnumber those in other classes. Such uneven distribution among classes is a main reason why classification accuracy suffers, even when a decision tree algorithm is used. The weakness arises because, in the tree-growing phase, the algorithm tends to favor the majority data and allows the minority data to be misclassified. Researchers have tried to solve this data imbalance problem in many ways, such as over-sampling, under-sampling, cost-sensitive classification, and ensembles of cost-sensitive decision trees. In this paper, we introduce a simplified method of learning a classification and regression tree (CART) with a resampling technique for classifying imbalanced datasets. We compare our proposed method with other methods on several metrics, including the precision on the minority class as opposed to the majority class, the overall accuracy across both minority and majority classes, and the Matthews correlation coefficient (MCC). MCC is suitable for imbalanced data because it takes into account all four cells of the confusion matrix: true positives, true negatives, false positives, and false negatives. The performance of our proposed method, which combines resampling with CART, is satisfactory according to the MCC metric. On all five experimental imbalanced datasets, our method performs the best.
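To illustrate why MCC is preferred over plain accuracy on imbalanced data, the following minimal sketch computes both from confusion-matrix counts using the standard MCC definition. The function names and the two sets of counts (95 majority items, 5 minority items) are hypothetical and chosen only for illustration; they are not results from this paper. Returning 0 when the denominator is zero is a common convention and is assumed here.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all items that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts.

    Assumption: when any marginal sum is zero the coefficient is undefined,
    and this sketch returns 0.0 by convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: the minority class is treated as "positive".
# Classifier A labels every item as majority and never finds a minority item.
always_majority = dict(tp=0, tn=95, fp=0, fn=5)
# Classifier B finds 4 of the 5 minority items at the cost of 5 false alarms.
finds_minority = dict(tp=4, tn=90, fp=5, fn=1)

for name, counts in [("always_majority", always_majority),
                     ("finds_minority", finds_minority)]:
    print(f"{name}: accuracy={accuracy(**counts):.2f}, MCC={mcc(**counts):.2f}")
```

In this hypothetical case the classifier that ignores the minority class has the higher accuracy (0.95 versus 0.94) but an MCC of 0, while the classifier that recovers most minority items has an MCC of about 0.57.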
Index Terms—Classification and regression tree, CART, resampling technique, imbalanced data, Matthews correlation coefficient.
The authors are with the School of Computer Engineering, Suranaree University of Technology (SUT), Thailand (corresponding author: Supajittree Boonamnuay; tel.: +66892865318; e-mail: firstname.lastname@example.org, email@example.com, firstname.lastname@example.org).
Cite: Supajittree Boonamnuay, Nittaya Kerdprasop, and Kittisak Kerdprasop, "Classification and Regression Tree with Resampling for Classifying Imbalanced Data," International Journal of Machine Learning and Computing, vol. 8, no. 4, pp. 336-340, 2018.