Runtime Estimation and Scheduling on Parallel Processing Supercomputers via Instance-Based Learning and Swarm Intelligence

Abstract—Supercomputing has been indispensable in the unstoppable trend of high-speed computing evolution. This work aims at improving its running efficiency by introducing a new two-step scheduling approach. Based on the analysis of large historical data, we first provide an accurate runtime estimation scheme using Instance-Based Learning (IBL). Then a Swarm Intelligence Based Scheduling (SIBS) method is proposed to optimize scheduling performance in terms of total runtime makespan and fair resource allocation. A method comparison on a dataset from the ALPS supercomputer, consisting of 804,697 workload records from 2016, shows that the proposed method outperforms the most commonly used strategy, the Extensible Argonne Scheduling System (EASY).


I. INTRODUCTION
With the advancement of technology, many high-speed computing techniques have emerged. Applications unimaginable a few years ago have now become achievable. Owing to the development of Internet applications [1] and new computing schemes, scenarios such as cloud computing [2] and parallel computing [3] have come into play and resulted in dramatic improvements in high-speed computation [4]-[6]. Consequently, new fields of study such as big data analysis and artificial intelligence have started to thrive. A supercomputer, a high-performance computer consisting of tens of thousands of processors capable of performing billions to trillions of calculations per second, plays an indispensable role in this evolution. Efficiently improving the performance of a supercomputer is therefore a vital issue. Many institutions have kept adding computing cores to achieve higher computational performance. However, [7] shows that simply expanding the number of processing nodes and leveraging technology scaling is not an efficient way to improve the processing power of supercomputers, as power demand would increase unsustainably. To improve a supercomputer's running efficiency, many researchers have devoted considerable effort to supercomputer scheduling [8]-[11], proposing various scheduling schemes that enhance the overall performance of the supercomputer without additional hardware.
An important factor in scheduling performance is runtime estimation, an attribute used by schedulers in various scenarios. Its accuracy has been shown to be highly correlated with scheduling performance [12]. Researchers have studied this topic extensively [13]-[16], proposing different solutions for providing accurate runtime estimates. Good domain knowledge of, and insight into, one's own runtime data is important for improving estimation accuracy. This work uses data from the Advanced Large-scale Parallel Supercluster (ALPS) supercomputer at the National Center for High-Performance Computing (NCHC) in Taiwan.
As the need for large-scale computation keeps increasing, heavy traffic workloads have gradually become a burden for ALPS. To handle this issue, this work provides two major contributions. First, an accurate runtime estimation scheme, based on the analysis of a large historical dataset from ALPS, is proposed using Instance-Based Learning (IBL) [17]. Second, a new scheduling scheme for supercomputers under heavy traffic load is designed using Swarm Intelligence.
A scheduling scheme is a critical factor in the performance of a supercomputer. Many researchers have also concentrated on the design of supercomputer scheduling, seeking suitable approaches for optimizing various goals. Owing to its simplicity, effectiveness, and fairness, the most commonly used strategy in supercomputer scheduling is First-Come First-Served (FCFS) with backfilling, also known as EASY (Extensible Argonne Scheduling System). Although EASY is easy to implement, job scheduling on supercomputers can be complicated by the diverse demands of system administrators and may not be handled effectively by EASY alone. In fact, runtime efficiency and fairness are usually conflicting goals. The inefficiency becomes especially evident when the workload is heavy. Therefore, to consider both runtime efficiency and user fairness comprehensively while preserving the simple implementation of EASY, a heterogeneous non-preemptive scheduling scheme for real-time scheduling under heavy traffic is proposed. This work designs a Swarm Intelligence Based Scheduling (SIBS) method to optimize performance and achieve efficiency in both total runtime makespan and fair resource allocation.
By combining IBL and SIBS, this work designs a new two-step approach that performs a runtime prediction scheme and conducts a novel scheduling algorithm for the supercomputer scheduling problem. Both steps require little computational effort compared to classic neural network learning and convex optimization. The rest of the paper is organized as follows. Section II provides background knowledge on IBL and the original SIB optimization. Section III presents the design of the IBL runtime estimation, based on data from ALPS, and the modified SIB for resource-constrained job scheduling. Section IV describes the simulation setup and evaluates the results of the proposal. Finally, Section V concludes the paper and outlines the contributions of this work.

II. BACKGROUND
To provide an efficient approach to improving the performance of a supercomputer, both runtime estimation and job scheduling should be not only operative but also computationally efficient. For runtime estimation, global parametric learning algorithms, such as neural networks, attempt to establish an input-output mapping via a single function with a global view. However, this neglects important properties of data partitions when the input is highly correlated with local data, which is often the case for runtime estimation. This work finds IBL most suitable, and it yields good estimation results. For scheduling, classic optimization approaches such as nonlinear programming or dynamic programming can compute exact solutions with better accuracy but are computationally time-consuming for large-scale problems. Therefore, this research designs a metaheuristic SIB approach that gives near-optimal answers while remaining computationally efficient.

A. Instance-Based Learning
Runtime prediction for new input data is formed from related past experiences in a historical database. An experience consists of several input features and one output result. The input features depict the characteristics of the data, while the output describes the runtime observed under these feature conditions. New input data consist of input features only, and the runtime prediction is formed from these features. Generally speaking, instead of querying all experiences in the database to form a prediction, only past experiences with highly correlated input features are used as the training set to provide a runtime estimate through similarity calculation. This allows an estimate to preserve useful local information and to filter out unrelated information that would degrade accuracy.
IBL comprises two major parts: similarity calculation and kernel regression. In similarity calculation, a distance function is defined as an indicator of the similarity between two data records according to their feature attributes. In kernel regression, a distance-weighted average of the outputs is produced as the final runtime prediction. The weights given to different runtimes are defined by the kernel function, which determines the weight of a given runtime record according to the measured similarity between the input and the historical experiences.
In summary, the preprocessing procedure of IBL starts with a search for relevant historical records (past experiences) according to the values obtained from a similarity metric such as the distance function. It then selects the k most relevant records (those with the k lowest distance values) for runtime estimation and filters out the rest.

1) Distance function
The distance function for similarity measurement between a new input $x$ and a past experience $x_j$ is defined as
$$ d(x, x_j) = \sqrt{\sum_{f} w_f \left(x_f - x_{j,f}\right)^2}, $$
where $x_f$ is the $f$-th feature and $w_f$ is its weight.

2) Kernel function
The kernel function provides the predicted runtime estimate $R_E$ from the similarities obtained by the distance function and is formulated as
$$ R_E = \frac{\sum_j K(d_j)\, R_j}{\sum_j K(d_j)}, $$
where $R_j$ is the actual runtime of related experience $j$ and $K(d)$ is the exponential kernel function used to derive the weight for runtime $R_j$:
$$ K(d) = e^{-d}. $$
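For concreteness, a minimal R sketch of this weighted-distance kernel regression is given below; the helper names and the use of a plain feature-weight vector w are our own illustration rather than the exact implementation behind the paper.

```r
# Weighted Euclidean distance between a new job x and one past
# experience xj; w holds the per-feature weights (illustrative names).
ibl_distance <- function(x, xj, w) sqrt(sum(w * (x - xj)^2))

# Exponential kernel: closer experiences receive larger weights.
ibl_kernel <- function(d) exp(-d)

# Kernel-regression runtime estimate R_E from past experiences.
# X: matrix of past feature vectors (one row per experience)
# R: vector of their actual runtimes
ibl_estimate <- function(x, X, R, w) {
  d <- apply(X, 1, ibl_distance, x = x, w = w)
  k <- ibl_kernel(d)
  sum(k * R) / sum(k)
}
```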

B. Swarm Intelligence Optimization
Swarm intelligence has been a popular nature-inspired metaheuristic optimization method for more than 20 years. Phoa et al. [18] introduced the Swarm Intelligence Based (SIB) method with two new operations, MIX and MOVE, to tackle optimization problems in discrete spaces, which are common in mathematical and statistical optimization. The method has since been widely used in many applications; see [19]-[22]. The general idea of the SIB algorithm is depicted in Fig. 1.
In the initialization step, possible solutions are generated as initial particles and their objective values are evaluated. Through this evaluation, each particle records its own best location in the search space, called its Local Best (LB) particle. All particles share information by comparing their LBs to obtain the overall optimum, called the Global Best (GB) particle. To collectively move toward the perceived overall optimum, the particles iterate through the MIX and MOVE operations after initialization. In the MIX operation, each particle exchanges elements with its LB and the GB to form two new particles, the mixed LB and the mixed GB, respectively. In the MOVE operation, the objective values of the mixed GB, the mixed LB, and the original particle are compared, and the one with the best objective value replaces the original particle. If neither mixed particle moves the particle toward a better location in the search space, some of its elements are replaced at random to prevent it from being trapped in a locally optimal solution. The LB particles and the GB particle are updated in every iteration whenever better solutions are found, until a stopping criterion is fulfilled. The stopping criterion can be either reaching a pre-specified maximum number of iterations or attaining a known optimal value of the GB particle. The former is related to computational capacity and expert experience: when all GB information must be recorded, an exceptionally large number of iterations may exceed the computer's memory, and experts may suggest a number of iterations beyond which the GB rarely changes location. The latter is related to existing theoretical results; in some cases the optimal value can be determined theoretically, so it can serve as a termination criterion for the SIB method.
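As a schematic illustration of this loop (not the authors' code), the following R sketch wires the pieces together; evaluate, mix_particles, and random_restart are assumed problem-specific helpers supplied by the user, and minimization is assumed.

```r
# Schematic SIB loop. particles: list of candidate solutions.
# evaluate(p): objective value; mix_particles(p, good): MIX operation;
# random_restart(p): replacement to escape local optima.
sib_optimize <- function(particles, evaluate, mix_particles,
                         random_restart, max_iter = 100) {
  obj    <- sapply(particles, evaluate)
  lb     <- particles                    # each particle's Local Best
  lb_obj <- obj
  gb     <- particles[[which.min(obj)]]  # Global Best
  gb_obj <- min(obj)

  for (iter in seq_len(max_iter)) {
    for (i in seq_along(particles)) {
      # MIX: form mixed LB and mixed GB candidates.
      cand <- list(particles[[i]],
                   mix_particles(particles[[i]], lb[[i]]),
                   mix_particles(particles[[i]], gb))
      cand_obj <- sapply(cand, evaluate)
      best <- which.min(cand_obj)
      if (best == 1) {
        # MOVE: neither mix improves, so restart randomly to avoid
        # being trapped in a locally optimal solution.
        particles[[i]] <- random_restart(particles[[i]])
      } else {
        particles[[i]] <- cand[[best]]
      }
      p_obj <- evaluate(particles[[i]])
      if (p_obj < lb_obj[i]) { lb[[i]] <- particles[[i]]; lb_obj[i] <- p_obj }
      if (p_obj < gb_obj)    { gb <- particles[[i]];      gb_obj   <- p_obj }
    }
  }
  list(best = gb, objective = gb_obj)
}
```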

III. RUNTIME ESTIMATION AND JOB SCHEDULING ON SUPERCOMPUTERS
This section introduces the method of runtime estimation on user workloads using IBL and describes a newly designed SIB scheduling algorithm for supercomputers.

A. Job Runtime Estimation
This work evaluates the prediction technique using data from the ALPS supercomputer system. Characteristics of executed jobs in ALPS are shown in Table I. Correlation analysis reveals a strong dependency between the submitting user and the job runtime. As a result, the search space for every new input is partitioned by user. For instance, if user 1 submits a new job to the system, the IBL prediction considers only user 1's historical experiences as the relevant dataset for runtime estimation. This not only preserves data locality but also shrinks the search space for the similarity computation, which would otherwise impose a huge computational burden when the entire dataset is large.
After determining the relevant dataset of user 1, the distance function is computed between the input data and every experience in the dataset. All distance metrics are then available for the next step. Finally, the nearest neighbors, those with the lowest distance values, are chosen, and the runtime prediction of the newly submitted job is determined from these nearest neighbors using the kernel function.
The estimation procedure can be generalized into four major steps upon receiving a new job request:

1) Dataset determination
The identity of job submitter is first determined. Afterward, the submitter's past experiences are chosen as the relevant dataset to perform IBL.

2) Similarity computation
The similarity metric between features from the new input and its corresponding experiences in the relevant dataset is computed with the distance function.

3) K-Nearest neighbors
After acquiring all similarity metrics, the experiences with the lowest distance values (i.e., the highest similarity) are selected as the final dataset for runtime prediction. Simulations show that the estimation provides good accuracy when only the three nearest experiences (K = 3) are selected as the final dataset. This choice of K narrows down the estimation complexity without compromising overall accuracy. Note that the selection of K = 3 is based mainly on empirical evaluation and experience. It provides accurate estimations in terms of stable RMSE with low variance, and the results are no worse than those with the value of K suggested in [23]. However, instead of adopting our choice directly, we suggest a detailed investigation of a suitable value of K when other problems are handled.

4) Runtime estimate
The kernel function takes the experiences in the final dataset as input and produces the runtime estimate for the new input data.
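Putting the four steps together, a hedged R sketch of the whole estimation procedure might look as follows; the data frame layout (a user column, feature columns, and a runtime column) and the function names are assumptions for illustration, while K = 3 follows the text.

```r
# Four-step IBL runtime estimate for a new job (a sketch; column names
# are hypothetical, the formulas follow Section II).
predict_runtime <- function(job, history, feature_cols, w, k = 3) {
  # 1) Dataset determination: restrict to the submitter's own history.
  rel <- history[history$user == job$user, ]

  # 2) Similarity computation via the weighted distance function.
  X <- as.matrix(rel[, feature_cols])
  x <- unlist(job[feature_cols])
  d <- apply(X, 1, function(xj) sqrt(sum(w * (x - xj)^2)))

  # 3) K-nearest neighbors: keep the k lowest-distance experiences.
  nn <- order(d)[seq_len(min(k, length(d)))]

  # 4) Runtime estimate via the exponential kernel.
  kw <- exp(-d[nn])
  sum(kw * rel$runtime[nn]) / sum(kw)
}
```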

B. Swarm Intelligence Based Scheduling
In previous research, like most evolutionary algorithms, SIB optimization focused on unconstrained, non-ordering problems. In supercomputer scheduling, however, resource constraints on the remaining supercomputer cores need consideration. Every scheduling decision must account for the availability of cores in every time slot in order to make full use of the resources and produce an efficient job schedule that reduces the total operating time.
Under busy traffic conditions, jobs arrive in large numbers and must wait in the queue for processing. The scheduler reschedules the queued jobs every tupdate time units. During tupdate, the scheduler waits for new arrivals while the supercomputer simultaneously processes previously assigned jobs. For scheduling efficiency, the setting of tupdate has to satisfy two conditions. First, it must not exceed the total makespan tmakespan of the previously assigned jobs in the system, so that the supercomputer is never waiting without serving any jobs. Second, it should be large enough for a sufficient number of new jobs to arrive, so that the rescheduling result is influential enough to improve the system's performance.
The makespan and update times are related. By definition, tupdate is the time the scheduler waits for incoming jobs before the next scheduling session, and tmakespan is the total time for the computer to finish all assigned jobs. Their relationship guarantees that the supercomputer is always working: the scheduler waits for the arrival of incoming job requests while the computer is never idle.
Let $\lambda$ denote the average arrival rate of jobs and $\mu$ the average amount of workload finished in one time unit. Then $\lambda/\mu$ corresponds to the expected makespan generated by the jobs arriving in each time unit, and $t_{update} \cdot \lambda/\mu$ is the expected makespan accumulated while the scheduler waits. To satisfy the two criteria, the relationship between tupdate and tmakespan leads to
$$ \mathrm{E}[t_{makespan}] = t_{update} \cdot \lambda/\mu \ge t_{update}. \quad (4) $$
Simple algebra leads to $\lambda/\mu \ge 1$. This implies that when the traffic is heavy enough, the computer is not idle while the scheduler waits for the arrival of incoming job requests.
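As a hypothetical numerical check of (4): if arriving jobs carry $\lambda = 120$ core-minutes of work per minute while the machine completes $\mu = 100$ core-minutes per minute, then $\lambda/\mu = 1.2 \ge 1$, so a waiting period of $t_{update} = 10$ minutes accumulates an expected makespan of $12$ minutes and the machine stays busy through the next scheduling session.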
In this research, the objective of a scheduling decision is to minimize the total waiting time of all user-submitted jobs. This objective enforces fairness in the waiting times of different jobs while improving the system's performance in total operation time. Job $j$ requires $c_j$ processing cores and $p_j$ units of processing time, arrives at time $a_j$, and is assigned a starting time $s_j$. The total number of available cores in the supercomputer is denoted $C$. The optimization problem can be formulated as
$$ \min \sum_{j \in S(t)} (s_j - a_j) \quad \text{subject to} \quad \sum_{j:\, s_j \le \tau < s_j + p_j} c_j \le C \ \text{for every time point } \tau, $$
where $S(t)$ is the set of all jobs submitted during the time interval $t$.
Jobs assigned to start after tupdate would only be scheduled on the supercomputer in the next scheduling session. Therefore, to enhance the fairness of the waiting times of all jobs, a penalty of tupdate is added to the objective value for every job assigned after tupdate.
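A minimal R sketch of this penalized objective, assuming vectors of assigned start times and arrival times (names hypothetical), could be:

```r
# Penalized total waiting time of a candidate schedule, following the
# formulation above; each job starting after t_update incurs an extra
# t_update penalty, as described in the text.
schedule_objective <- function(start, arrival, t_update) {
  sum(start - arrival) + t_update * sum(start > t_update)
}
```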
To solve this constrained optimization problem, this work designs a new particle formation and a novel MIX operation for SIB scheduling. In SIB scheduling, every particle represents a scheduling priority list, and every element of a particle is a job ID. Each particle's performance is evaluated by decoding its priority list into a job schedule through the parallel scheduling generation scheme. The idea of particle transformation is applied because of its merit that the schedule decoded from a particle after MIX and MOVE remains feasible, that is, it still constructs a resource-feasible project schedule.

1) Parallel scheduling generation scheme
A parallel scheduling generation scheme iterates over the time stamps of the project and adds eligible jobs to the schedule. Scheduling starts at the current time point, and jobs are scheduled before the time pointer is advanced. At each decision point, jobs are selected from the eligible job set and their scheduling sequence is assigned according to the priority list in each particle. In the scheme, the completed set contains all jobs that have been scheduled and finished before the current time point, and the active set contains all scheduled but unfinished jobs. A job is eligible at the current time point only if the cores remaining available at that time suffice for its core requirement, and the time pointer advances to the next finishing time of an active job. A sketch of the scheme follows.
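Since the pseudo-code itself did not survive extraction, the following R sketch reconstructs a standard parallel schedule generation scheme consistent with the description above; argument names are hypothetical, and job IDs are assumed to be 1..n.

```r
# Parallel schedule generation scheme (a reconstruction, not the
# authors' exact pseudo-code).
# priority: job IDs ordered by scheduling priority (one SIB particle)
# cores, runtime: per-job core demand and (estimated) processing time
# C: total number of cores of the supercomputer
parallel_sgs <- function(priority, cores, runtime, C) {
  start  <- rep(NA_real_, length(priority))
  finish <- rep(NA_real_, length(priority))
  t <- 0
  pending <- priority                     # jobs not yet scheduled
  while (length(pending) > 0) {
    active <- which(!is.na(start) & finish > t)
    free   <- C - sum(cores[active])      # cores remaining at time t
    # Schedule eligible pending jobs in priority order.
    for (j in pending) {
      if (cores[j] <= free) {
        start[j]  <- t
        finish[j] <- t + runtime[j]
        free      <- free - cores[j]
        pending   <- setdiff(pending, j)
      }
    }
    # Advance the time pointer to the next job completion.
    running <- finish[!is.na(start) & finish > t]
    if (length(running) > 0) t <- min(running) else break
  }
  data.frame(job = seq_along(priority), start = start, finish = finish)
}
```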
2) MIX operation for SIB scheduling
Instead of exchanging particle elements with the LB and GB directly, new particles are formed through imitation. During the MIX operation, a particle imitates the priority order of a good particle (its LB or the GB) starting from the first element. At step i, the job ID v in position i of the good particle is read; the particle then searches its own elements for the index k holding the value v and swaps the job IDs at positions k and i. The pseudo-code of MIX is shown below, where Find(P, v) returns the index of particle P that contains the value v and q is the number of elements to imitate.

Algorithm 2 MIX Operation in SIB Scheduling
Initialize: i ← 1
while i ≤ q do
    v ← G[i]; k ← Find(P, v); swap P[i] and P[k]; i ← i + 1
end while
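In R, the imitation-based MIX might be sketched as below; the default number of imitated elements q is our assumption, since a full pass would copy the good particle exactly.

```r
# MIX by imitation: particle p adopts the first q priority positions of
# a good particle g (its LB or the GB). match() plays the role of Find.
mix_particles <- function(p, g, q = length(g) %/% 2) {
  for (i in seq_len(q)) {
    k <- match(g[i], p)              # index of job ID g[i] inside p
    tmp <- p[i]; p[i] <- p[k]; p[k] <- tmp
  }
  p
}
```

Because each swap only moves job IDs around, the result is still a permutation of all job IDs, so the decoded schedule remains resource-feasible, which is exactly the merit of particle transformation noted above.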

IV. SIMULATION
The simulation in this section is carried out by a SIB program written in R. This section evaluates the performance of the proposed runtime prediction technique and the efficiency of SIB supercomputer scheduling using data from the ALPS supercomputer. Workloads from January to December 2016, with a total of 804,697 records, are used to evaluate the designs in this work.

A. Runtime Prediction Evaluation
In the simulation, the training workload consists of 563,287 experiences and the testing workload contains 241,410 records. The performance of the proposed runtime prediction is evaluated using the Root-Mean-Squared Error (RMSE) indicator, as shown in Table II.
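For reference, the RMSE here is the usual measure over the $N$ test jobs,
$$ \mathrm{RMSE} = \sqrt{\tfrac{1}{N}\textstyle\sum_{i=1}^{N} \left(R_{E,i} - R_i\right)^2}, $$
where $R_{E,i}$ is the IBL estimate and $R_i$ the actual runtime of test job $i$.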
The runtime prediction scheme yields an average estimation error of 20.73 minutes with a standard deviation of 33.82 minutes. The best-case estimation error is below 1 minute, while the worst-case error is bounded by 151.58 minutes. These results show that the prediction provides good accuracy for runtime estimation.

B. SIB Scheduling Evaluation
The performance of the SIB scheduling design is compared with the EASY scheduling scheme. New incoming jobs are sampled from the 241,410 testing records. Both scheduling schemes use the estimated times from the IBL runtime prediction to determine their schedules. The total makespan of the actual execution times under the two designs is evaluated as the indicator of scheduling performance. To examine the efficacy of the scheduling design under every busy traffic condition, the worst case, tupdate = tmakespan, is considered in the simulation.
The relationship between the number of sampled workloads and the total makespan under the two scheduling schemes is demonstrated in Fig. 2, which also shows the impact of the number of initial particles on SIB scheduling performance.

Fig. 2(a). SIB versus EASY at 50 seeds. Fig. 2(b). SIB versus EASY at 100 seeds. Fig. 2(c). SIB versus EASY at 150 seeds.

As shown in Fig. 2(a) to Fig. 2(c), SIB scheduling outperforms the EASY scheme in the makespan of job arrivals. Moreover, as the number of initial particles (seeds) increases, the scheduling performance of SIB improves further. The performance of SIB scheduling on a supercomputer thus proves efficient.

V. CONCLUSIONS
To improve a supercomputer's running efficiency, it is important both to improve the accuracy of runtime prediction and to enhance the efficiency of the scheduling algorithm, while keeping the system below a certain complexity level. This work proposes a new two-step approach that performs a runtime prediction scheme via instance-based learning and conducts a novel scheduling algorithm via swarm intelligence. Both steps require little computational effort compared to classic neural network learning and convex optimization. The instance-based learning runtime estimation scheme is built on the characteristics of the data in the ALPS supercomputer to improve prediction accuracy, while the new swarm intelligence scheduling algorithm is designed to optimize performance and achieve efficiency in both runtime makespan and fair resource allocation on supercomputers under busy traffic conditions.
There are several potential improvements and extensions of this work. First, the CPU runtime estimation in the first step could be significantly improved if additional information on incoming jobs were available. The dataset analyzed in this work contains only four features, which are not enough for good CPU runtime estimation no matter which statistical tools are used. Additional information on the properties of jobs and users would greatly help the learning procedure. Nevertheless, accurate CPU runtime estimation is the first step toward the success of the whole method.
Another possible extension is an adaptive approach in which the scheduler switches its operating mode according to the current job traffic. Given a threshold, the scheduler would ideally activate SIB when the traffic is above the threshold and operate simpler schemes otherwise.
In addition to extra features, one may consider other learning approaches instead of instance-based learning for estimating CPU runtime. For example, deep learning could be a good tool to classify whether a job is a test job (requiring few CPU cores and a short runtime) or a real job (requiring many CPU cores and a long runtime). Bayesian models can also be applied, using historical data as a prior for the current estimation. When more features are available, many different classes of regression or parametric estimation can lead to more accurate results.