AB-SMOTE: An Affinitive Borderline SMOTE Approach for Imbalanced Data Binary Classification

Abstract—SMOTE is an oversampling approach previously proposed to solve the imbalanced data binary classification problem. SMOTE managed to improve classification accuracy; however, it needs to generate a large number of synthetic instances, which is not efficient in terms of memory and time. To overcome such drawbacks, Borderline-SMOTE (BSMOTE) was proposed to minimize the number of generated synthetic instances by generating them along the borderline between the majority and minority classes. Unfortunately, BSMOTE could not provide big savings in the number of generated instances without trading off classification accuracy. To improve BSMOTE's accuracy, this paper proposes an Affinitive Borderline SMOTE (AB-SMOTE) that leverages BSMOTE and improves the quality of the generated synthetic data by taking into consideration the affinity of the borderline instances. Experimental results show that AB-SMOTE, when compared with BSMOTE, produced the most accurate results in the majority of the test cases adopted in our study.


I. INTRODUCTION
Imbalanced data binary classification is a well-known problem in which a dataset has two classes: one, called the majority or negative class, has many more instances than the other, called the minority or positive class. Imbalanced data leads to a bias towards the majority class during the classification process, which in turn leads to inaccurate classification results. This is a common problem found in many different fields, such as categorization [1], medicine [2], customer churn prediction [3], wine quality [4], and others, which have a highly imbalanced distribution of instances within the classes. In most cases, the minority class is the intended class to be predicted, meaning that we need the classifier to generate a model that can correctly classify new data belonging to the minority class.
Different approaches have been proposed to solve this problem by minimizing the imbalance ratio, such as the works in [5]-[10]. Such approaches can be classified as under-sampling and oversampling approaches. In under-sampling approaches, such as the works in [1] and [10], some of the majority class instances are randomly deleted to balance them against the minority class. However, such approaches may lead to inaccurate results, as they may delete important information needed to generate the classification model. On the other hand, oversampling approaches, such as the works in [6]-[9], generate new instances in the minority class to balance the data. Such oversampling approaches may improve accuracy; however, they also have drawbacks, such as overfitting or duplicating the data, where duplicates give no crucial new information for model building [5]. Hence, ensuring the generation of high-quality, non-duplicate synthetic data is crucial for the success of any oversampling approach. This is done via many heuristic strategies for data generation.
Synthetic Minority Over-sampling Technique (SMOTE) [6] is one of the most popular oversampling techniques; it randomly generates new synthetic instances between the minority instances without replicating them, thus eliminating the data overfitting side effect. SMOTE managed to improve classification accuracy; however, it needs to generate a large number of synthetic instances (e.g., up to 500% of the minority class) [10], which is not efficient in terms of memory and time. To overcome such drawbacks, BSMOTE [9] was proposed to minimize the number of generated synthetic instances by generating them along the borderline between the majority and minority classes. It generates synthetic instances using borderline instances and minority class instances, as shown in Section II. Unfortunately, BSMOTE could not provide big savings in the number of generated instances without trading off classification accuracy, as shown in Section IV.
The reason for such a loss in accuracy is that BSMOTE generates instances around the nearest neighbors of the borderline, not focused inside it. This might confuse the classifier by widening the borderline's boundaries and thus increasing its vagueness, as shown in Section IV. Hence, we argue in this paper that we could obtain better results by focusing the oversampling inside the boundaries of the borderline and increasing its density. This motivated us to further investigate the oversampling process.
To improve BSMOTE accuracy, this paper proposes an Affinitive Borderline SMOTE (AB-SMOTE) that takes into consideration the affinity of the borderline instances to help the classifier to be more accurate in differentiating between the classes.
Experimental results show that AB-SMOTE, when compared with BSMOTE, produced the most accurate results in the majority of the test cases adopted in the study. However, the savings in the number of generated instances were still small. Hence, we will focus on reducing the number of generated instances in future work.
The rest of this paper is organized as follows. Section II gives a brief introduction to SMOTE and Borderline-SMOTE. Section III discusses the proposed AB-SMOTE. Section IV presents the datasets used and the raw results from our experiments. In Section V, we analyze the results of our new method against BSMOTE. Finally, in Section VI, we discuss the conclusion and future work of this research.

II. BACKGROUND
A. SMOTE: Synthetic Minority Over-Sampling Technique
The best-known oversampling technique used in machine learning is SMOTE [6]. It checks the nearest neighbors among the instances within the minority class. The algorithm takes as input from the user the percentage of new instances to be created and the number K of nearest neighbors (Knn) to base its calculation on; the number of nearest neighbors is 5 by default. SMOTE chooses one minority instance, then calculates its Knn. After that, it randomly chooses one of the Knn and calculates a distance vector, i.e., the difference between the initial minority instance and the other selected instance from the same class. The distance vector is then multiplied by a random number called the gap, with a value between 0 and 1, to generate a new instance falling on the line segment between the two selected instances. SMOTE then repeats the same process with other minority instances until it grows the minority class according to the percentage given by the user (e.g., 100% is the default value). In this way, SMOTE increases the number of minority instances randomly, without focusing on specific instances of the minority class, which can lead to overfitting the minority class. For this reason, other variations of the algorithm, such as Safe-Level-SMOTE [7], SMOTEBoost [8], Borderline-SMOTE [9], and others, were proposed to overcome this problem. BSMOTE, proposed in [9], excludes some instances from the generation process, considering them noise or safe instances. It focuses its computation on the borderline instances that fall between the two classes, shown in blue in Fig. 3, to generate new instances.
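To make the generation step above concrete, the following is a minimal sketch in Python (with NumPy and scikit-learn), not the WEKA implementation evaluated later; the function and parameter names are our own assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, k=5, percentage=100, rng=np.random.default_rng(0)):
    # Number of synthetic instances implied by the user-given percentage.
    n_new = int(len(X_min) * percentage / 100)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is each point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # pick a minority instance
        j = idx[i, rng.integers(1, k + 1)]   # pick one of its k minority neighbors
        gap = rng.random()                   # random number in [0, 1)
        diff = X_min[j] - X_min[i]           # distance (difference) vector
        synthetic.append(X_min[i] + gap * diff)
    return np.array(synthetic)
```

Because the gap lies in [0, 1), every synthetic point falls on the line segment connecting the chosen minority instance and its neighbor, exactly as described above.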

B. BSMOTE: Borderline-SMOTE
The borderline instances are chosen by calculating the number of majority instances (M') found within the M nearest neighbors (Mnn) of each minority instance, computed over all other instances in the dataset. If the value of M' lies between M/2 and M (i.e., M/2 ≤ M' < M), the minority instance (Pi) is considered a borderline instance (P'i). After creating the new subset that holds all the minority borderline instances, BSMOTE measures the Knn between the borderline instances and the other minority instances, then generates a new instance using the following function:

$s = P'_i + \text{gap} \times (P_j - P'_i)$

where $P'_i$ is the borderline minority instance, $P_j$ is the randomly chosen Knn minority instance, and gap is a random number with a value between 0 and 1. Fig. 4 demonstrates how BSMOTE works.
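The borderline test above can be sketched as follows (our own Python rendering of the M/2 ≤ M' < M rule, with hypothetical names; it is not the code from [9]):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def find_borderline(X, y, m=5, minority_label=1):
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=m + 1).fit(X)
    _, idx = nn.kneighbors(X_min)                  # Mnn over the whole dataset
    danger = []
    for i in range(len(X_min)):
        # idx[i, 0] is the minority instance itself, so skip it; M' counts
        # how many of the remaining m neighbors belong to the majority class.
        m_prime = np.sum(y[idx[i, 1:]] != minority_label)
        if m / 2 <= m_prime < m:                   # M/2 <= M' < M -> borderline
            danger.append(X_min[i])
    return np.array(danger)
```

Instances with M' < M/2 are treated as safe and those with M' = M as noise, as elaborated in Section III.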
There is also another version of BSMOTE, called BSMOTE2, in which the algorithm can take points from the majority class. When using an instance from the majority class, the gap value is drawn between 0 and 0.5, so that the newly created instance stays adjacent to the borderline. We can see that in BSMOTE the creation of new instances is somewhat focused around the nearest neighbors of the borderline, but not on the borderline itself. We would obtain better results if we focused on the existing boundaries of the borderline to increase its density. We denote such an oversampling strategy as "borderline affinity". Hence, we propose AB-SMOTE to adopt the borderline affinity oversampling strategy.
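Under our reading of [9], the BSMOTE2 adjustment amounts to halving the gap range for majority-class neighbors; the sketch below uses hypothetical names:

```python
import numpy as np

def bsmote2_generate(p_borderline, p_neighbor, neighbor_is_majority,
                     rng=np.random.default_rng(0)):
    # For a majority-class neighbor the gap is drawn from [0, 0.5) so the
    # synthetic point stays on the minority side, adjacent to the borderline.
    gap = rng.random() * (0.5 if neighbor_is_majority else 1.0)
    return p_borderline + gap * (p_neighbor - p_borderline)
```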
III. AB-SMOTE: AFFINITIVE BORDERLINE SMOTE
After studying SMOTE and BSMOTE, we noticed that we could extend BSMOTE so that the generation of new instances is computed only between borderline instances, thus increasing the minority instances within that area. We named our approach Affinitive Borderline SMOTE (AB-SMOTE); it works very similarly to BSMOTE, but instead of checking the Knn between the borderline instances and all minority instances, it only checks the Knn within the borderline instances, thus excluding the noise and/or safe instances that were previously used to generate new instances. Fig. 5 shows in a diagram that AB-SMOTE uses only instances within the borderline area to generate new instances.
Fig. 6 describes how AB-SMOTE works. AB-SMOTE identifies the minority and majority class instances, then computes, for every minority instance, its M nearest neighbors within the whole training data. At each iteration, it counts the number of majority instances M' found within the M nearest neighbors; if M' falls between M/2 and M, the minority instance is considered a borderline instance and is copied to a new subset named borderline (danger). If M' is less than M/2, the minority instance is treated as a safe one, since most of its surroundings are from the minority class; and if M' equals M, it is considered noise, because all of its M nearest neighbors are from the majority class. The algorithm then computes the Knn within the borderline instances, randomly chooses one of the Knn, calculates the distance between the two instances, and applies the following function to compute the new instance:

$s = P'_i + \text{gap} \times (P'_j - P'_i)$

where both $P'_i$ and $P'_j$ are borderline instances. This keeps the creation of new instances focused on the borderline.
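Putting the pieces together, the generation step can be sketched as follows (our own Python rendering, reusing the output of find_borderline from the earlier sketch; parameter names are assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ab_smote(danger, k=5, n_new=100, rng=np.random.default_rng(0)):
    # Knn is computed only within the borderline (danger) subset, so every
    # synthetic instance lies between two borderline instances.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(danger)
    _, idx = nn.kneighbors(danger)                # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(danger))             # borderline instance P'_i
        j = idx[i, rng.integers(1, k + 1)]        # borderline neighbor P'_j
        gap = rng.random()                        # random number in [0, 1)
        synthetic.append(danger[i] + gap * (danger[j] - danger[i]))
    return np.array(synthetic)
```

The only difference from BSMOTE is the neighbor pool: both endpoints of each interpolation come from the borderline subset, which concentrates the new instances inside the borderline area.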

IV. EVALUATION
To evaluate our new approach, we used the WEKA [11] machine learning software, an open-source Java program, which allowed us to modify its code. We used different supervised datasets, shown in Table I, obtained from several online repositories: Crowd Analytics, the UC Irvine Machine Learning Repository, Keel, and IBM. These real-life datasets are widely used in machine learning research. We adopted the imbalance threshold considered in [12] when choosing the datasets: if the number of minority instances is less than 40% of the number of majority instances, the dataset is considered imbalanced and is selected for our evaluation.

We classified the datasets using the decision tree classification algorithm called J48 in WEKA. The classification stage was conducted using 5-fold cross-validation, so that every fold contains enough positive minority class instances to minimize data distribution problems [20], [21]. Because we are working with supervised datasets, every instance has a known class value; 5-fold cross-validation splits the data into 5 folds, 4 for training and one for testing, repeated 5 times with a different testing portion each time. When a model is created from a training subset, it is applied to the corresponding test subset, and the predicted class values are compared against the actual ones. WEKA collects the results from each fold and outputs their average.

Since we are dealing with an imbalanced dataset, evaluating the classifier with classification accuracy alone does not give a good overview of its accuracy in predicting the minority class; instead, a confusion matrix, such as Table II, is used to derive different metrics that give a good overview of the classifier's prediction power [22]. The majority class is considered Negative, while the minority class is Positive. Using the confusion matrix, the true positive (TP) count should be as high as possible, within the limit of the total number of real positives, which implies that the model correctly predicts the instances with an actual positive class as positive. We evaluated the classifiers with the recall and F-measure values output by WEKA after applying the decision tree classifier with 5-fold cross-validation; the best values for recall and F-measure are those approaching one.

The results on the original datasets were obtained and recorded first; we then applied the different oversampling techniques, SMOTE, BSMOTE, and AB-SMOTE, using the different tuning options shown in Table III, where Mnn stands for the number M of nearest neighbors between the minority instances and the whole dataset used to create the borderline subset, Knn stands for the number K of nearest neighbors, G stands for the gap, i.e., the number multiplied by the distance to create a new instance, percentage determines the number of newly generated instances, P' denotes the borderline instances, and P denotes the minority instances. Table IV to Table XI show the obtained recall and F-measure values; Gen denotes the number of newly generated instances, and the best obtained values are shown in bold.
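For reference, recall and F-measure follow directly from the confusion matrix entries; the small sketch below shows the standard definitions (our own code, not WEKA's):

```python
def recall_fmeasure(tp, fp, fn):
    recall = tp / (tp + fn)           # fraction of real positives found
    precision = tp / (tp + fp)        # fraction of positive predictions that are correct
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, f_measure

# Example: 40 true positives, 10 false positives, 20 false negatives.
print(recall_fmeasure(tp=40, fp=10, fn=20))   # (0.666..., 0.727...)
```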

V. RESULTS ANALYSIS
As we can see from the previous results, BSMOTE and AB-SMOTE managed to provide better results than the original SMOTE approach, except in one case in Table V. This means that focusing on the boundary to generate the synthetic data is a very promising strategy. Hence, we can deduce that the borderline affinity strategy is better than randomly generating new instances from safe and/or noise instances.
For every borderline-oriented approach, each dataset requires its own parameter tuning to find the best-fitting configuration. However, to see the overall performance of the borderline-oriented approaches regardless of their parameter tuning, we marked the approach with the best recall and F-measure values in the previous results tables and constructed Table XII. From Table XII, we can notice that AB-SMOTE outperformed BSMOTE by obtaining the largest number of best values in terms of recall and F-measure. This means that focusing on increasing the density of the borderline area is a more effective oversampling strategy than widening the boundaries of the borderline, as BSMOTE does. Hence, we can say that the borderline affinity oversampling strategy worked very well with the decision tree classifier. In future work, we will study whether the borderline affinity strategy holds for other classifiers. Table IV to Table XI also show that both BSMOTE and AB-SMOTE could not significantly reduce the number of generated instances; we will investigate this issue in more depth in future work. This will be done by adopting a selective strategy that chooses certain instances in the borderline to be used for new instance generation, rather than choosing instances randomly. This will require examining the density distribution of the borderline instances and then carefully generating the new instances in a way that does not disrupt the calculated borderline density distribution.

VI. CONCLUSION
In this paper, we proposed AB-SMOTE to improve the accuracy of the existing SMOTE and BSMOTE. This is done by taking the affinity of the borderline instances into consideration when generating the synthetic data. We compared the accuracy of the three approaches on different datasets.
Experimental results show that AB-SMOTE produced the most accurate results in the majority of the cases. This means the proposed borderline affinity oversampling strategy is very promising, and could be leveraged further by selecting fewer instances from the borderline to reduce the amount of generated data.
We believe this is a very important direction for future work, as both BSMOTE and AB-SMOTE could not provide big savings in the amount of generated data. By selecting fewer key borderline instances to generate new instances, we could greatly reduce the amount of generated synthetic data.
However, this should be carefully done without disrupting the borderline density distribution.