A Comparative Study of Learning Techniques with Convolutional Neural Network Based on HPC-Workload Dataset

High-Performance Computing (HPC) refers to computer systems with high computing power that support a wide range of computational domains. A large group of users submits a huge number of jobs to this kind of system. Therefore, job management, or job scheduling, is very important because it affects the overall system efficiency. Analyzing the HPC-workload log file is one way to improve system efficiency, because the log file may contain information that helps the system scheduler make appropriate job scheduling decisions. This research proposes predictive models for predicting the job status at the finishing state in the HPC system. The models can be used as a tool for monitoring jobs in the HPC system. We build three models, HPC-CNN, HPC-AlexNet, and HPC-VGG16, using two different learning techniques of the Convolutional Neural Network, initial learning and transfer learning, based on the HPC-workload dataset. Moreover, three state-of-the-art Machine Learning methods, Classification and Regression Tree (CART), Artificial Neural Network (ANN), and Support Vector Machine (SVM), are used as baseline models for performance comparison. The results show that the proposed HPC-CNN model has the best predictive performance. It achieves 76.48% prediction accuracy, followed by the CART model (75.60%), while the SVM model has the lowest accuracy at 66.80%.


I. INTRODUCTION
HPC systems provide computing power in many computation domains [1]-[5]. The jobs computed on an HPC system are diverse because they come from different users working in various computing domains. Therefore, job management, or job scheduling, which is performed by the so-called job scheduler [6]-[8], is very important for the HPC system. The performance of the scheduler affects the overall efficiency of the HPC system. The efficiency of the system can be evaluated from the power that the system consumes and from the job success rate. A high job success rate indicates high system efficiency, whereas a low job success rate indicates poor system efficiency.
The job scheduler is like the brain of the HPC system. It is middleware that receives jobs from users and then sends each job to appropriate computing resources using the best strategy. During job scheduling, the scheduler records all events that occur in the system, as numeric or string values, in a file called the HPC-workload log file. A system administrator can use this file as a source of information to investigate or track problems when they occur in the system. Moreover, the HPC-workload log file may contain hidden information that the administrator can use to improve the efficiency of the system. In traditional HPC-workload log analysis, an administrator manually analyzes the log using basic statistical methods based on their own knowledge. Analyzing the data in this manner may be inefficient because it takes a long time, and the result is not flexible enough to serve as a generic model.
In the last decade, data mining techniques have been widely applied in the log analysis domain [9]. This research proposes classifier models using Deep Learning with different learning techniques of CNN based on the HPC-workload dataset. The proposed models predict the job status at the finishing state in the HPC system. In addition, three state-of-the-art Machine Learning models, CART, ANN, and SVM, are created from the same dataset and used as baseline models. This research uses the HPC-workload log file as a dataset. The raw dataset contains 421,459 records, and each record consists of 27 attributes. The dataset was collected from the production HPC system named "Atom Computer Cluster" of the National Electronics and Computer Technology Center (NECTEC) in Thailand. This HPC system is operated under the National e-Science Infrastructure Consortium project. It has provided computing resources to support various computational projects in Thailand since the middle of 2012.
The main objectives of this research are (i) to develop classifier models using Deep Learning with different learning techniques of the CNN based on the HPC-workload dataset, (ii) to compare the performance of the models based on Deep Learning techniques with that of models based on Machine Learning techniques including CART, ANN, and SVM, and (iii) to demonstrate the advantages as well as the disadvantages of the proposed models on a real-world dataset.
In the next section, we review the existing works related to our research. Section III explains the methodology and the dataset. Section IV describes the experimentation. Section V presents the results. Sections VI and VII present the discussion and conclusion of this research, respectively.

II. LITERATURE REVIEW
A log file is a time-series, event-based record of a system or application that is written while the process is online. The contents of a log file consist of many types of messages, such as text only, numerals only, or a combination of text and numerals. Log file analysis extracts useful information that can be used to investigate the root cause of a problem, to find a suitable system configuration, to characterize user behavior, and so on. Currently, machine learning techniques play an important role in the log analysis domain. In this section, we describe the existing works related to the use of machine learning techniques in log analysis.

A. Log Analysis using Machine Learning
In network log analysis, Arndt and Zincir-Heywood [10] conduct a comparative study of three classifier models for a binary-class problem on encrypted network traffic (SSH encrypted or non-SSH encrypted). The models are built with machine learning methods, which include C4.5, K-means, and K-means with Multi-Objective Genetic Algorithm (MOGA). The study shows that the C4.5 classifier achieves the best overall performance. Meanwhile, K-means with MOGA gives the highest accuracy in some test cases and reduces the time complexity of K-means. Bujlow et al. [11] propose a classifier model based on a decision tree method. The C5.0 method is applied to create a model for classifying seven types of network traffic (Skype, FTP, P2P, Web, Web radio, Game, and SSH). The dataset is a real-world dataset collected by their Volunteer-Based System. The result shows that their classifier performs better than the previous work, with an accuracy of approximately 99.3-99.9%.
Cao and Qiao [12] develop an Abnormal Detection System (ADS) for predicting cyber-attacks on the web (normal access or abnormal access) through two levels of machine learning techniques. At the first level, they create three classifiers, logistic regression, decision tree, and support vector machine, to label the data. At the second level, they choose the dataset labeled by the best model from the first level. Then, a classifier model is built with the Hidden Markov Model (HMM) based on the chosen dataset. The performance comparison shows that the proposed model achieves the highest accuracy at 93.54%. The dataset is collected from industry.
Ertam and Kaya [13] conduct a comparative study of classifiers for classifying the package permission, which consists of Allow, Deny, Drop, and Reset-Both. The SVM algorithm is applied to build the model with different kernel functions, including the Linear, Polynomial, Radial Basis Function (RBF), and Sigmoid functions. The dataset is a firewall log taken from the firewall device of Firat University. The result shows that the SVM classifier using the RBF function outperforms the other kernel functions with the best F1 score at 76.4%.

B. High-Performance Computing Log Analysis
Hsu and Feng [14] propose a prototype for power awareness in the HPC system. The main objective of the research is to help the HPC system reduce power consumption. The research uses the -adaption Algorithm with Dynamic Voltage and Frequency Scaling technology to control the CPU workload in the system. Computer profiling (real-time log) is used as the dataset. In the experiments, the model runs benchmark applications. The result demonstrates that the proposed method reduces the power consumption of the HPC system by around 20% for the sequential benchmark test cases and 25% for the parallel benchmark test case.
Taerat et al. [15] use descriptive analysis to explain the characteristics of the HPC system based on system failures. The HPC log file of the IBM Blue Gene/L system of Louisiana Tech University is used as the dataset. The results are reported as enumerated information, such as the severity level of failures, time to repair (TTR), and mean time to failure (MTTF). The analysis assumes a TTR of 10 minutes. Under this assumption, the results suggest that the system has an MTTF of 5.89 hours, or around 4 failures a day.
Pelaez et al. [16] develop a system failure detection method by improving a clustering algorithm. The proposed method is called Decentralized Online Clustering (DOC). The system is built on a case study of the Ranger supercomputer at the Texas Advanced Computing Center. The result shows that the failure detection performance does not differ from the baseline. Meanwhile, the proposed model reduces approximately 2% of the overall overhead (CPU, memory, and network bandwidth).
Klinkenberg et al. [17] propose a monitoring system for predicting system failures in the HPC system of RWTH Aachen University. The first phase of the research uses descriptive statistics to identify events through their characteristics. The second phase is a comparative study of classification methods, namely logistic regression, decision tree, random forest, SVM, and multilayer perceptron, based on the data preprocessed in the first phase. The performance evaluation using 10-fold cross-validation demonstrates that the proposed model achieves 98% precision and 91% recall.
Yoo, Sim, and Wu [18] conduct a comparative study using six machine learning methods, including decision tree, random forest, logistic regression, and naïve Bayes, to build classifier models for predicting unsuccessful jobs at the running state in the HPC system. The dataset is an HPC-workload log file of the Genepool Scientific Cluster Computer of NERSC. The result shows that the model based on the random forest method outperforms the other models. The classifier achieves 99.8% accuracy, 83.6% recall, and 94.8% precision.
In conclusion, the literature reviewed above demonstrates that machine learning techniques are widely applied in many log analysis domains, especially HPC log analysis. In contrast, this research proposes classifier models that use deep learning with different learning techniques of the convolutional neural network.

III. RESEARCH METHODOLOGY
Classification is a supervised machine learning technique for predicting the class in binary or multi-label classification problems. Deep learning is a subset of machine learning that has become a popular technique in the artificial intelligence domain. This research applies deep learning with the CNN method to develop classifier models for predicting the job status at the finishing state in the HPC system.

A. Convolutional Neural Network
The Convolutional Neural Network (CNN), also known as ConvNet, is an algorithm based on the neural network. The architecture and process of neural networks mimic the process of the human brain. The ConvNet is a popular deep learning algorithm that has been applied in many domains, such as self-driving car systems [19], medical science [20], [21], and environmental science [22]. The architecture of the ConvNet consists of the input layer (receives the input data), the hidden layers (computational process), and the output layer (classifies or predicts the result). The hidden part of the ConvNet differs from that of a normal neural network and can be separated into two main procedures. The first procedure is convolution followed by the Rectified Linear Unit (ReLU), which keeps positive values and sets negative values to zero. The second procedure reduces the features using a pooling technique, which selects one feature in each region. The network repeats the two procedures until the convolution loop finishes. Fig. 1 shows an example of the max pooling technique.
There are two techniques to build a classifier model with the ConvNet algorithm. The first is the initial learning technique [23], [24]. This technique creates an all-new architecture and learns from the data from scratch in the model learning state. The second is the transfer learning technique [25], [26]. This technique takes a pre-trained network, modifies it according to the dataset, and fine-tunes it. Transfer learning reduces the model learning time and is generally suited to general images. Fig. 2 shows a comparison of the initial learning and transfer learning techniques. In this research, we use AlexNet and VGG16 as the pre-trained networks. Table I shows the characteristics of the pre-trained networks.
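The ReLU and max pooling procedures described above can be illustrated with a minimal sketch. The 4×4 feature map, the 2×2 pooling region, and the stride of 2 are assumptions chosen for illustration; they are not the configuration used in this research.

```python
def relu(feature_map):
    """Apply ReLU element-wise: negative values become 0."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size region."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            region = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(region))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 feature map (illustrative values only).
feature_map = [
    [1.0, -3.0, 2.0, 4.0],
    [5.0, 6.0, -7.0, 8.0],
    [-3.0, 2.0, 1.0, 0.0],
    [1.0, -2.0, 3.0, -4.0],
]
activated = relu(feature_map)
pooled = max_pool(activated)
print(pooled)  # [[6.0, 8.0], [2.0, 3.0]]
```

Each 2×2 region of the activated map is reduced to its largest value, so the 4×4 map becomes a 2×2 map.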

B. The State-of-The-Art Machine Learning
Machine learning algorithms are divided into two groups according to the learning process: supervised and unsupervised learning. In supervised learning, the target variable must be defined in the learning state, while in unsupervised learning, the target variable does not have to be defined. Most algorithms for classification and regression tasks are grouped as supervised learning techniques, while algorithms for clustering tasks are grouped as unsupervised learning techniques. In this paper, we use three state-of-the-art machine learning algorithms, Classification and Regression Tree (CART), Artificial Neural Network (ANN), and Support Vector Machine (SVM), to build the baseline models.
CART is a tree-based algorithm [27]. It supports both classification and regression tasks. In other words, the algorithm can be used with a dataset whose target variable is categorical or continuous. The splitting criteria that the CART algorithm uses to create the tree-structured rules are the Gini index for classification tasks and variance reduction for regression tasks.
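As a rough illustration of the Gini index criterion, the sketch below computes the impurity of a set of labels and the weighted impurity of one candidate binary split. The function names and the example labels are hypothetical; a full CART implementation would also search over all attributes and thresholds.

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def split_gini(left, right):
    """Weighted Gini impurity of a candidate binary split."""
    total = len(left) + len(right)
    return ((len(left) / total) * gini_index(left)
            + (len(right) / total) * gini_index(right))


# Hypothetical job labels on each side of a candidate split.
left = ["success", "success", "success", "error"]
right = ["error", "error", "success", "error"]
print(gini_index(left + right))  # 0.5
print(split_gini(left, right))   # 0.375
```

The split lowers the impurity from 0.5 to 0.375, so a tree builder using this criterion would prefer it over not splitting.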
Artificial Neural Network (ANN) [28] is an algorithm motivated by how the human brain works. Typically, the ANN architecture consists of three parts: the input layer, the hidden layer, and the output layer. The multilayer perceptron is the basic ANN architecture (one input layer, one hidden layer, and one output layer). The ANN algorithm sends the data into the input layer and then propagates the data into the hidden layer. In each node, the input values are multiplied by the weights and summed together with the bias value. The result is called the "net input". Then, the activation function is applied to the net input. Finally, the result is processed in the output layer to classify the data.
Support Vector Machine (SVM) is an algorithm that finds an appropriate boundary (hyperplane) that separates the data. The boundary is defined with a mathematical function called the kernel function. Popular SVM kernel functions are Linear, Radial Basis Function, Polynomial, and Sigmoid. Originally, the SVM supported only binary-label classification problems. Modern SVMs can handle multi-label classification problems and are more robust to outliers. However, finding the appropriate kernel function for the SVM algorithm is a difficult task.
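The net input of a single ANN node and the Radial Basis Function kernel of the SVM can be sketched as follows. The sigmoid activation, the weights, and the input vectors are assumptions made for illustration, since the text does not specify them; the RBF form exp(-gamma * ||x - y||^2) is the commonly used definition.

```python
import math


def net_input(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias of one node."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def sigmoid(z):
    """Sigmoid activation (an assumed choice of activation function)."""
    return 1.0 / (1.0 + math.exp(-z))


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical inputs, weights, and bias for one hidden node.
z = net_input([0.5, -1.0, 2.0], [0.4, 0.3, -0.1], bias=0.1)
activation = sigmoid(z)

# Kernel value between two hypothetical points with gamma = 0.1.
similarity = rbf_kernel([0.0, 0.0], [1.0, 2.0], gamma=0.1)
print(z, activation, similarity)
```

The kernel value is 1 for identical points and decays toward 0 as the points move apart; gamma controls how fast it decays.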

C. Assessments
To evaluate the performance of the models, we select four evaluators: accuracy, recall, precision, and F-measure. All evaluators are computed from the confusion matrix. Fig. 3 illustrates an example of the confusion matrix of a two-class problem.
The true positive (TP) is the number of cases in which the predicted value is "True" and the actual value is "True". The false negative (FN) is the number of cases in which the predicted value is "False" while the actual value is "True". The false positive (FP) is the number of cases in which the predicted value is "True" while the actual value is "False". The true negative (TN) is the number of cases in which the predicted value is "False" and the actual value is "False".
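The accuracy (1) is an evaluator that assesses the overall performance of the model. The recall (2) is the proportion of actual "True" cases that the model predicts as "True". The precision (3) is the proportion of cases predicted as "True" that are actually "True". The F-measure (4) is the harmonic mean of precision and recall.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$
$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$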
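The four evaluators can be computed from the confusion-matrix counts with the standard definitions, as in the sketch below. The counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the text.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F-measure from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (assumed convention: report 0.0).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts for a two-class problem.
metrics = classification_metrics(tp=60, fn=40, fp=20, tn=80)
print(metrics)
```

With these counts the accuracy is 0.70, the recall is 0.60, and the precision is 0.75. A model that never predicts the "True" class has TP + FP = 0, so its precision is undefined; the sketch reports 0.0 in that case.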

D. Dataset and Tools
This research uses a dataset collected from the National Electronics and Computer Technology Center (NECTEC), Thailand. The dataset is an HPC-workload log from the PBS/Torque scheduler of a production computer cluster called "Atom computer cluster". The Atom computer cluster is a medium-size HPC system in Thailand that consists of 580 computing elements, 2.7 TBytes of memory, and 50 TBytes of storage capacity. The system has provided free computing resources for research in Thailand since mid-2012. The raw HPC-workload log file contains 421,659 records. Each record consists of 27 attributes. Fig. 4 illustrates an example of the raw data of the HPC-workload log file.
In this research, we use MATLAB R2018b to build the models through the different learning techniques of the CNN. In addition, we use IBM SPSS Modeler 18.0 to create the baseline classifier models with the machine learning methods. Moreover, we use Python 3.4 to handle the raw HPC-workload log in the data preprocessing state. All experiments are run on a workstation computer (Intel Xeon Silver 4116 CPU, 2.10 GHz, 24 GB of memory, without GPU).

IV. EXPERIMENTATION
In this research, we divide the experimental process into four main parts: data collection, data preprocessing, experimentation, and analysis of the results. The data collection process has already been described in section III. The experimental results and discussions are presented in sections V and VI, respectively. In this section, we describe the data preprocessing and experimentation parts.

A. Data Preprocess
International Journal of Machine Learning and Computing, Vol. 10, No. 1, January 2020
After the raw data is collected from the system, we prepare the dataset through data preprocessing. This process produces a dataset suitable for the experimental process. The dataset is of good quality because, in the year 2016, the HPC system had little downtime (around 6%). We clean the data by eliminating outliers and missing values. Next, we select 11 out of the 27 attributes using expert knowledge. There are 10 predictor variables: "Queue Type", "Execute Host", "CPU Usage", "Memory Usage", "VMemory Usage", "Queueing Time", "Execute Time", "CPU Time", "Limit Wall Time", and "Wall Time". "Finish Status" is the target variable in this research. The target variable defines a binary-class problem with the states "success" and "error". Table II shows the details of all attributes in the dataset.
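A minimal sketch of the attribute selection and missing-value removal is given below. It assumes each record is a dictionary keyed by attribute name; the record layout, the extra "Job Name" attribute, and the example values are hypothetical. Outlier elimination is omitted because the criterion is not stated in the text.

```python
PREDICTORS = [
    "Queue Type", "Execute Host", "CPU Usage", "Memory Usage",
    "VMemory Usage", "Queueing Time", "Execute Time", "CPU Time",
    "Limit Wall Time", "Wall Time",
]
TARGET = "Finish Status"


def select_and_clean(records):
    """Keep the 11 selected attributes and drop records with missing values."""
    selected = PREDICTORS + [TARGET]
    cleaned = []
    for record in records:
        row = {name: record.get(name) for name in selected}
        if any(value is None or value == "" for value in row.values()):
            continue  # skip records with a missing selected attribute
        cleaned.append(row)
    return cleaned


# Hypothetical records: one complete, one with a missing value.
complete = {name: 1.0 for name in PREDICTORS}
complete.update({TARGET: "success", "Job Name": "demo"})  # "Job Name" is a made-up extra attribute
incomplete = dict(complete)
incomplete["Memory Usage"] = None

cleaned = select_and_clean([complete, incomplete])
print(len(cleaned), sorted(cleaned[0]))
```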

B. Classifiers Modelling
In this process, we use the HPC-workload dataset prepared in the previous process. We separate the experimentation into two phases. The first phase follows objective (i) of this research, which is to build classifier models for predicting the job status at the finishing state of the HPC system. The models are built through the different learning techniques of the CNN. The second phase follows objective (ii) of this research and builds the three baseline models with the machine learning methods CART, ANN, and SVM.
In the first phase of the experiment, we propose three classifier models to predict the job status at the finishing state of the HPC system based on the HPC-workload dataset. A deep learning technique is used to build the models. HPC-CNN is the proposed model built through the initial learning technique of the CNN. The network architecture and configuration of HPC-CNN are defined using expert knowledge, as shown in Fig. 6. The other two proposed models are HPC-AlexNet and HPC-VGG16. These models are built using the transfer learning technique of the CNN. AlexNet and VGG16 are used as the pre-trained networks for HPC-AlexNet and HPC-VGG16, respectively. In the transfer learning process, we fine-tune the three layers at the output end of the pre-trained networks.
Because the input of the proposed models must be an image, we perform an extra process that transforms the HPC-workload dataset into an image dataset. In this process, the categorical values of the predictor variables are converted to nominal values. Then, the numeric values in the dataset are normalized to the range 0-255. The image data is created row by row from the HPC-workload dataset: each row is duplicated nine times to form a 10×10 pixel image (Fig. 7). Fig. 8 shows an example of the image data after the transformation. The image data for the HPC-CNN model has 1 color channel (grayscale), while the image data for the models built from the pre-trained networks (HPC-AlexNet and HPC-VGG16) has 3 channels (RGB).
After the HPC-workload image dataset is created, we randomly split it into an 80% train-set and a 20% test-set. The three proposed models (HPC-CNN, HPC-AlexNet, and HPC-VGG16) are built from the train-set with the same configuration, as shown in Table III. The accuracy, recall, precision, and F-measure scores are used to evaluate the performance of the proposed models. Then, the model with the best performance is selected for comparison with the baseline models.
In the second phase, we create the baseline models for our comparative study with the proposed model. The three machine learning methods CART, ANN, and SVM are used to build the baseline classifier models. We also randomly split the dataset into an 80% train-set and a 20% test-set. The ANN consists of one input layer with 10 neural nodes, one hidden layer with 7 neural nodes, and an output layer with 1 neural node. For the SVM, the RBF kernel function is applied with Gamma = 0.1 and C = 3.
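The row-to-image transformation and the random 80/20 split described above can be sketched as follows. The sketch assumes per-attribute min-max scaling to 0-255, a fixed random seed, and synthetic rows of 10 predictor values; the text does not state how the normalization or the split was implemented, so these choices are illustrative only.

```python
import random


def scale_to_pixels(rows):
    """Min-max scale each column (attribute) to integers in 0..255."""
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        span = high - low
        if span == 0:
            scaled_columns.append([0] * len(column))  # constant column
        else:
            scaled_columns.append(
                [round(255 * (v - low) / span) for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def row_to_image(pixel_row, height=10):
    """Stack copies of one row to form a height x len(row) grayscale image."""
    return [list(pixel_row) for _ in range(height)]


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Randomly split samples into a train-set and a test-set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Synthetic dataset: 10 rows, each with 10 numeric predictor values.
rows = [[float(i * j) for j in range(1, 11)] for i in range(10)]
pixels = scale_to_pixels(rows)
images = [row_to_image(pixel_row) for pixel_row in pixels]
train_set, test_set = split_dataset(images)
print(len(images[0]), len(images[0][0]), len(train_set), len(test_set))
```

Each image is the original row plus nine copies of it, which gives 10 rows of 10 pixels. For the 3-channel (RGB) inputs of the pre-trained networks, the same grayscale image could be repeated across the three channels, although the text does not say how the channels were filled.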

V. RESULTS
In the experimentation, the three CNN models HPC-CNN, HPC-AlexNet, and HPC-VGG16 are the classifier models used to predict the job status at the finishing state in the HPC system. The performance evaluation shows that the HPC-CNN model achieves the highest accuracy at 73.55%. The two models built using the transfer learning technique, HPC-AlexNet and HPC-VGG16, provide prediction accuracies of 57.35% and 42.65%, respectively. In terms of recall, precision, and F-measure, only HPC-CNN returns results: 59.69% recall, 73.35% precision, and a 65.79% F-measure score, as shown in Table IV. The model building times show that HPC-CNN takes the least time at 5 minutes and 13 seconds, while HPC-VGG16 takes the longest at 11 hours (Table V). Fig. 9 shows the confusion matrix of the HPC-CNN model. We increase the number of epochs in the learning state of HPC-CNN up to 18 epochs (Fig. 10). As a result, the accuracy of the model increases to 76.49%.

VI. DISCUSSION
The results show that the performance of the HPC-AlexNet and HPC-VGG16 models is very poor, as they cannot return results for recall, precision, and F-measure. Based on this result, it could be concluded that models built using the transfer learning technique of CNN from the pre-trained networks (AlexNet and VGG16) are unsuitable for the HPC-workload dataset. This could be because the network architecture of the pre-trained networks is inconsistent with the input data. In other words, some of the hidden layers (convolutional part) are unnecessary because the HPC-workload dataset is low dimensional. This conclusion seems to be supported by the result that the HPC-CNN network achieves higher accuracy than HPC-AlexNet and HPC-VGG16, possibly because it has only three hidden layers.

VII. CONCLUSION
This research proposed classifier models to predict the job status at the finishing state of the HPC system based on the HPC-workload dataset. Two learning techniques of the CNN, initial learning and transfer learning, are used to build the proposed models. The HPC-CNN network uses the initial learning technique, while HPC-AlexNet and HPC-VGG16 use the transfer learning technique with AlexNet and VGG16 as the pre-trained networks. The performance comparison of the three proposed models demonstrates that the HPC-CNN model achieves the highest accuracy at 76.5%. Moreover, this research is a comparative study of the proposed model with three state-of-the-art machine learning methods: CART, ANN, and SVM. The results show that the proposed HPC-CNN network outperforms the others with 76.49% accuracy, while the baseline models CART, ANN, and SVM provide prediction accuracies of 75.4%, 72.9%, and 66.8%, respectively. In future work, we will apply a grid search or random search to find the best CNN configurations for the HPC-workload dataset and increase the scale of the dataset in order to enhance the performance of the model.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

AUTHOR CONTRIBUTIONS
The first author designed the research framework, organized the experimentation steps, and prepared the manuscript. The second author helped to validate the manuscript. The third author approved the final version. The last author took part in the experimentation design.