Abstract—A well-balanced dataset is very important for building a good prediction model, yet medical datasets are often imbalanced in their class labels. Most existing classification methods tend to perform poorly on minority-class examples when the dataset is extremely imbalanced, because they aim to optimize overall accuracy without considering the relative distribution of each class. In this paper, we examine the performance of over-sampling and under-sampling techniques for balancing cardiovascular data. The well-known over-sampling technique SMOTE is used, and some under-sampling techniques are also explored. An improved under-sampling technique is proposed. Experimental results show that the proposed method performs significantly better than the existing methods.
Index Terms—Class imbalance, under-sampling, over-sampling, clustering, SMOTE.
The authors are with the Department of Computer Science, University of Hull, UK (e-mail: M.M.Rahman@2009.hull.ac.uk, D.N.Davis@hull.ac.uk).
Cite: M. Mostafizur Rahman and D. N. Davis, "Addressing the Class Imbalance Problem in Medical Datasets," International Journal of Machine Learning and Computing, vol. 3, no. 2, pp. 224-228, 2013.