A Credit Risk Predicting Hybrid Model Based on Deep Learning Technology

Credit risk evaluation (CRE) is a challenging and important management science problem in the domain of financial analysis. Many popular methods have been applied to this problem in recent years. However, feature extraction and the imbalanced data problem, both of which play a significant role in CRE, have not received enough attention in current research. In this paper, we employ a deep learning approach to extract effective features and an under-sampling technique to balance the dataset. Our model combines under-sampling, a Deep Boltzmann Machine (DBM) and a Discriminative Restricted Boltzmann Machine (DRBM). To examine its performance, the proposed model is applied to real-world credit data from Lending Club. The stable and improved results show that the proposed hybrid classifier is more effective than the classifiers it is compared with.


I. INTRODUCTION
With the development of the financial market, especially the advent of Internet Fintech, the credit risk evaluation of individual borrowers is becoming more and more crucial to the competition between financial institutions. In fact, credit risk in many financial intermediaries can account for 60% of their business activities [1]. Therefore, CRE is of great significance to the development of the financial market. Nowadays, with the collection of large amounts of user data and the development of data mining algorithms, it has become much more practical to address credit risk issues such as borrower default prediction.
In this paper, we propose a hybrid classifier model that uses two deep learning techniques, DBM [2] and DRBM [3], to predict default borrowers. Deep learning methods have proven to be powerful feature extraction techniques and have been widely employed in many fields, such as face recognition [4] and emotion recognition [5]. Unlike traditional shallow classifiers with few hidden layers, deep learning methods use many hidden layers, as developed in [2] on the basis of traditional neural networks. As one kind of deep learning technology, DBM has many advantages in mining complex information but has rarely been used in the credit risk field. Therefore, this paper introduces a hybrid model with DBM for CRE to find a relationship between the borrowers' characteristics and the possibility of default.
The Lending Club dataset (http://www.lendingclub.com), which provides complete loan data for all loans issued, is selected to examine the proposed model. Additional features include credit scores, number of finance inquiries, address including zip codes and state, and collections, among others. We compared the proposed model with a traditional classification technique on the Lending Club dataset, and the results show that our model is more effective.
The rest of the paper is structured as follows. Previous research on credit risk evaluation is presented in Section II. Section III introduces the details of the proposed model and demonstrates its effectiveness through experimental studies that compare it with SVM. Section IV concludes the paper.

II. PREVIOUS RESEARCH OF CREDIT RISK EVALUATION
Machine learning (ML) techniques for credit risk evaluation were proposed as early as the 1990s [6], [7]. Credit risk evaluation is usually formulated as a classification problem that determines the default probability from various information about the borrowers. With it, banks can provide loans to users with good credit and increase revenue, while avoiding potentially bad users and reducing losses. From the perspective of classification technique, CRE can be categorized as a binary classification problem [8]-[10]. Since then, an increasing number of classification techniques for credit risk evaluation have been proposed. The existing studies have developed from single classifier methods to hybrid methods and ensemble methods. These three kinds of classification techniques are reviewed in the following.
The single classifier, as one of the classical pattern recognition techniques, was adopted early for CRE and is still active in this field. Single classifier methods generally refer to the application of traditional classifiers, such as decision tree [11], back propagation [12], SVM [13] and so on. Some research applied single classifiers together with other methods to evaluate credit risk. For example, [11] compared five classifiers, namely J4.5 decision tree, AdaBoost, random forest, naïve Bayes and PART, and tested them on the German credit dataset with filter and wrapper feature selection. [14] applied a back propagation (BP) neural network to credit risk evaluation on the Lending Club dataset. [15] made an experimental comparison of the performance of four single classifiers, BP, Extreme Learning Machine (ELM), incremental extreme learning machine (I-ELM), and support vector machine (SVM), on a CRE dataset and then discussed their advantages and disadvantages. [16] compared three single classifiers with their proposed model on the Lending Club dataset to analyze the influence of supervised classification models and unbalanced data processing techniques on credit prediction rates. [17] proposed Mahalanobis distance induced kernels in support vector machines with application in CRE and compared them with traditional SVM kernels. Their results show superior performance on real-world credit datasets. Numerous other studies, although not mentioned above, still constitute the important academic background of this paper.
With the development of single classifier methods, ensemble methods have come to be widely considered superior to many single classifiers for evaluating credit risk in terms of accuracy [18]. Ensemble methods apply ensemble techniques, such as bagging [19] and boosting [20], to combine single classifiers to evaluate credit risk. Florez-Lopez and Ramon-Jeronimo [21] introduced an ensemble approach based on merged decision trees, the correlated-adjusted decision forest (CADF), to produce both accurate and comprehensible models. [22] proposed a new decision tree ensemble (DTE) model for imbalanced enterprise credit evaluation based on the synthetic minority over-sampling technique (SMOTE) and the Bagging ensemble learning algorithm with differentiated sampling rates (DSR), named DTE-SBD. In [23], a novel ensemble model based on SMOTE and a classifier optimization technique is proposed for personal credit risk evaluation. [24] proposed a personal credit risk assessment model based on Stacking ensemble learning, which uses different training subsets together with feature sampling and parameter perturbation methods to train multiple differentiated XGBoost classifiers.
Besides single classifier and ensemble methods, hybrid classifier methods are also widely suggested in many studies [25]. A hybrid method combines single classifiers or ensemble methods with other techniques, which makes it flexible and diverse. For example, [26] presented an empirical comparison of various combinations of classifiers to solve the imbalanced problem in the Lending Club dataset. [27] proposed a method that combines Random Forests (RF) and a neural network to predict borrower default. [28] found that SVM misclassifies samples near the optimal hyperplane, provided an SVM-KNN hybrid model with K-Nearest Neighbor (KNN) to cope with this defect, and validated the improved method on CRE. [29] proposed a three-phase hybrid credit prediction model, which contains preprocessing, ensemble feature selection and a multilayer classifier framework. [30] presented a feature selection-based hybrid-bagging algorithm (FS-HB) to assess credit risk, and obtained better performance than a feature selection-based classifier and bagging. In addition, many other hybrid classifiers have been developed and applied in CRE [31].
Many credit risk evaluation methods have been proposed at present; however, feature selection and the imbalanced data problem have not been paid enough attention. As a deep learning method, DBM is a promising method for extracting features from complex data [4], [5], yet it is rarely applied in credit risk evaluation. In this paper, we construct a hybrid framework for credit risk evaluation with under-sampling, DBM feature selection and DRBM classification. Fig. 1 shows the framework of the hybrid model, which contains three main parts: re-sampling, DBM feature selection and DRBM classification. Random selection belongs to data preprocessing, which also includes data normalization and quantification.
In this paper, we applied under-sampling to the training data to produce several subsets and combined them into a re-sampling pool that mixes good data (fully paid) and bad data (charged off). Under-sampling is a technique for adjusting the class distribution of the original data. It randomly selects from the larger class a number of samples equal to the size of the smaller class. The selected data are then combined with the smaller class to form a new training dataset. In this article, we apply under-sampling to solve the imbalance problem of the candidates' lending data.

A. DBM-DRBM Hybrid Model
After the re-sampling part, DBM was utilized to select effective feature information from the lending dataset. As a deep learning method, DBM has powerful feature extraction ability [2]. Using a deep structure, DBM can extract effective feature information from complex and diverse attribute values. In this paper, we applied DBM to mine key representations hidden in the credit dataset. A Deep Boltzmann Machine is a network of symmetrically coupled stochastic binary units. Connections exist only between units in adjacent layers, with no connections within a layer. A DBM with two hidden layers serves as the illustrative case.
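For reference, the energy of a DBM with two hidden layers is commonly written in the form below. The symbols $\mathbf{v}$ (visible units), $\mathbf{h}^{1}$ and $\mathbf{h}^{2}$ (hidden layers), $W^{1}$ and $W^{2}$ (weight matrices) and $Z(\theta)$ (partition function) are introduced here for illustration, and bias terms are omitted, so this is a sketch of the usual formulation rather than necessarily the exact expression used in the experiments.

```latex
E(\mathbf{v}, \mathbf{h}^{1}, \mathbf{h}^{2}; \theta)
  = -\mathbf{v}^{\top} W^{1} \mathbf{h}^{1}
    - \mathbf{h}^{1\top} W^{2} \mathbf{h}^{2},
\qquad
P(\mathbf{v}; \theta)
  = \frac{1}{Z(\theta)} \sum_{\mathbf{h}^{1}, \mathbf{h}^{2}}
    \exp\bigl(-E(\mathbf{v}, \mathbf{h}^{1}, \mathbf{h}^{2}; \theta)\bigr)
```

The absence of any term coupling two units of the same layer reflects the restriction that connections exist only between adjacent layers.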
The last part of the proposed model is DRBM classification, which takes the feature output of the DBM as its classification input.
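The text does not state how the feature output of the DBM is read out. One common choice, assumed here, is to take the mean-field activations of the top hidden layer. The sketch below does this for a bias-free DBM with two hidden layers; the function name `dbm_mean_field`, the weight matrices `W1` and `W2`, the hidden-layer sizes and the random weights are all hypothetical and stand in for a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dbm_mean_field(v, W1, W2, n_steps=10):
    """Mean-field hidden activations of a two-hidden-layer DBM (biases omitted)."""
    mu2 = np.full(W2.shape[1], 0.5)        # uninformative start for the top layer
    for _ in range(n_steps):
        mu1 = sigmoid(v @ W1 + W2 @ mu2)   # layer 1 receives input from v and from layer 2
        mu2 = sigmoid(mu1 @ W2)            # layer 2 receives input from layer 1 only
    return mu1, mu2

rng = np.random.default_rng(0)
n_visible, n_h1, n_h2 = 78, 20, 10         # 78 attributes as in the text; hidden sizes are made up
W1 = rng.normal(scale=0.1, size=(n_visible, n_h1))   # random stand-in for trained weights
W2 = rng.normal(scale=0.1, size=(n_h1, n_h2))
v = rng.integers(0, 2, size=n_visible).astype(float) # one binarised borrower record (toy data)

mu1, features = dbm_mean_field(v, W1, W2)
print(features.shape)                      # feature vector that a classifier could consume
```

Under this assumption, `features` is what would be handed to the DRBM classifier in place of the raw attributes.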
DRBM [3] is a two-layer, bipartite, undirected graphical model with binary units in the input and hidden layers. In a DRBM, there are connections between the hidden and input units but no connections between two units within the same layer. The architecture of DRBM also has a deep physical motivation: [32] demonstrated that the discriminative restricted Boltzmann machine is statistically equivalent to the well-known physical model of the Hopfield network [33].
Fig. 3. The illustration of the discriminative restricted Boltzmann machine (input units i, hidden units h, label units e_y, weights W and U).
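The energy function of the DRBM is not reproduced in this text. A minimal sketch, assuming the usual formulation E(y, x, h) = -h'Wx - b'x - c'h - d'e_y - h'U e_y, shows how the class posterior p(y | x) follows from it. Here `x` plays the role of the input i in Fig. 3, the bias vectors `b`, `c`, `d` are assumed symbols, and all sizes and weights are toy values. The closed form is checked against a brute-force sum over every hidden configuration.

```python
import itertools
import numpy as np

def energy(x, h, y, W, U, b, c, d):
    """Assumed DRBM energy for input x, binary hidden vector h and class index y."""
    return -(h @ W @ x + b @ x + c @ h + d[y] + h @ U[:, y])

def class_posterior(x, W, U, c, d):
    """p(y | x) with the hidden units summed out analytically."""
    pre = (c + W @ x)[:, None] + U                    # pre[j, y] = c_j + U[j, y] + W[j, :] @ x
    scores = d + np.logaddexp(0.0, pre).sum(axis=0)   # softplus summed over hidden units
    scores -= scores.max()                            # stabilise the softmax
    p = np.exp(scores)
    return p / p.sum()

def class_posterior_bruteforce(x, W, U, b, c, d):
    """Same posterior by enumerating all 2**n_hidden hidden states (toy sizes only)."""
    n_hidden, n_classes = U.shape
    unnorm = np.zeros(n_classes)
    for y in range(n_classes):
        for bits in itertools.product([0.0, 1.0], repeat=n_hidden):
            unnorm[y] += np.exp(-energy(x, np.array(bits), y, W, U, b, c, d))
    return unnorm / unnorm.sum()

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 6, 4, 2                   # toy sizes, not the paper's settings
W = rng.normal(scale=0.5, size=(n_hidden, n_in))
U = rng.normal(scale=0.5, size=(n_hidden, n_classes))
b = rng.normal(size=n_in)
c = rng.normal(size=n_hidden)
d = rng.normal(size=n_classes)
x = rng.integers(0, 2, size=n_in).astype(float)

p_fast = class_posterior(x, W, U, c, d)
p_slow = class_posterior_bruteforce(x, W, U, b, c, d)
print(np.allclose(p_fast, p_slow))
```

The input bias term b'x is the same for every class, so it cancels in the normalisation and does not appear in `class_posterior`.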
In Fig. 3, i represents the state value of the selected features and e_y the state value of the categories. DRBM works by utilizing a hidden layer of binary stochastic units h to model the joint distribution of the input data and its label. This is done by defining an energy function over the input, hidden and label units [3].

B. Lending Club Dataset
In the experiment, we used the Lending Club dataset. The official loan status of the dataset contains six categories: 'current', 'fully paid', 'late (16-30)', 'in grace period', 'late (31-120)' and 'charged off'. For the purpose of this study, we considered loans issued during the whole year of 2018, filtering out loans that are not yet fully paid or charged off.
International Journal of Machine Learning and Computing, Vol. 11, No. 3, May 2021
The dataset contains 110 attributes, including some qualitative attributes, such as 'purpose' and 'verification_status', as well as null data. Preprocessing, which includes normalization and quantification, is necessary before the random selection phase. In this study, we chose 78 attributes and the lending data of 6,300 candidates, of which 1,000 are charged off and 5,300 are fully paid. Table I shows some important variables of Lending Club. Fig. 1 shows the framework of the hybrid classifier model. Different phases of the model have different operations with different parameters. In this section, parameter details and experiment results are presented.

C. Experiment Design and Result Analysis
Before the DBM feature selection, the credit data was processed with random selection and under-sampling. We made a random selection within every class with a 3:7 ratio of test data to training data. Here, a high proportion of the data is taken as training data to guarantee the validity of the proposed model. After that, we applied under-sampling to construct a new training dataset (the re-sampling pool) with a 1:1 ratio of charged-off to fully-paid loans. The details of the experimental data are shown in Table II. In this phase, we perform under-sampling 20 times to construct the re-sampling pool. In every under-sampling subset, the size of the good data and of the bad data is half of the bad training data. The purpose of the sampling pool setting is to prevent over-fitting to the bad data caused by a large amount of repetition, and to ensure data balance while letting more bad data participate in training. After the under-sampling phase, DBM takes the re-sampled data as input and extracts features from it. Then, the feature output of the DBM is taken as the input of DRBM classification. The parameters of these two phases are given in Table III. 'Node number of hidden layers' has four values: the first three correspond to the numbers of hidden nodes in the DBM, and '50*' is the number of hidden nodes in the DRBM. 'Iteration time of single DBM' and 'Fine-tune times' are parameters of the DBM, and 'iteration number of classification' is the adjustment parameter of the whole DBM+DRBM model.
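The sampling procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names `stratified_split` and `build_resampling_pool`, the label coding (1 = charged off, 0 = fully paid) and the random seed are hypothetical, and each subset is assumed to hold half of the bad training data from each class, drawn without replacement.

```python
import numpy as np

def stratified_split(y, test_ratio=0.3, rng=None):
    """Per-class random split so that test:train is 3:7 inside every class."""
    rng = np.random.default_rng() if rng is None else rng
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(test_ratio * len(idx)))
        test_idx.append(idx[:n_test])
        train_idx.append(idx[n_test:])
    return np.concatenate(train_idx), np.concatenate(test_idx)

def build_resampling_pool(y_train, n_rounds=20, rng=None):
    """Concatenate n_rounds balanced subsets; returns indices into the training set."""
    rng = np.random.default_rng() if rng is None else rng
    bad = np.flatnonzero(y_train == 1)     # charged off (assumed coding)
    good = np.flatnonzero(y_train == 0)    # fully paid (assumed coding)
    per_class = len(bad) // 2              # half of the bad training data per class
    subsets = []
    for _ in range(n_rounds):
        bad_pick = rng.choice(bad, size=per_class, replace=False)
        good_pick = rng.choice(good, size=per_class, replace=False)
        subsets.append(np.concatenate([bad_pick, good_pick]))
    return np.concatenate(subsets)

rng = np.random.default_rng(42)
y = np.array([1] * 1000 + [0] * 5300)      # class counts taken from the text
train_idx, test_idx = stratified_split(y, 0.3, rng)
y_train = y[train_idx]
pool = build_resampling_pool(y_train, 20, rng)
pool_labels = y_train[pool]
print(len(train_idx), len(test_idx), pool.size, int(pool_labels.sum()))
```

Because each round draws only half of the bad training samples, individual bad records recur across rounds but no single subset repeats the full minority class, which matches the stated aim of limiting over-fitting from repetition.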
After the parameter design, we compared the hybrid model with the traditional classifier SVM. The comparison of the two models is shown in Table IV.
Table V shows the experimental classification accuracy of DBM+DRBM compared with other studies. In Table V, the first column and the third column indicate, respectively, how feature selection and the imbalanced data problem are handled. The results show that the proposed model performs more effectively.
TABLE V: Classification accuracy of DBM+DRBM and other studies (method and accuracy columns only)
Method                        Accuracy
[14]                          0.7860
DBM+DRBM                      0.8858
RF and neural network [27]    0.7350
Logistic Regression [26]      0.8173

IV. CONCLUSION
Credit risk evaluation mainly focuses on the possibility that borrowers default, which requires effective features to discriminate bad credit from good credit. Therefore, finding an efficient method to extract critical features from a large number of attributes is significant. Deep learning techniques are utilized here to select features and evaluate the default of borrowers.
In this paper, we propose a new hybrid technique that combines DBM and DRBM, and we expect the new model to achieve better generalization while keeping DBM's merit of finding complex patterns. As one kind of deep learning technology, the hybrid model inherits the advantages of DBM, which has the potential to learn internal representations hidden in CRE data. The new hybrid model constructed in this paper makes multiple adjustments to the weights to ensure both the accuracy and the generalization ability of the model. To evaluate the applicability and performance of the proposed hybrid classifier in the real credit world, the Lending Club dataset is used to compare its classification rate, which is reformulated into an error minimization problem. We preprocess the original data and balance it with under-sampling. We compared the proposed model with other published models from the perspective of feature selection and imbalanced data. The experimental results in Table IV show that the proposed hybrid classifier has more stable and effective performance.
In addition, some interesting topics are worth further investigation. Firstly, the empirical tests on the Lending Club credit dataset show that the new hybrid classification technique is promising, but the finding applies to this dataset and may not generalize to other datasets. A future study with more complex datasets would enhance the external validity of this finding. Secondly, apart from single classifiers and hybrid models, there also exist ensemble models, which would be an interesting topic on the Lending Club dataset; we will look into these issues in the near future.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

AUTHOR CONTRIBUTIONS
Chong Wu and Dekun Gao analyzed the data and constructed the model; Dekun Gao carried out the programming experiments; Siyuan Xu and Chong Wu wrote and proofread the paper; all authors approved the final version.