Homogeneous Ensemble Instance Intervals Determination Method of Time Series Data Based on Granular Computing

 Abstract—Determining the size of an instance is very important, since it has a large impact on the recognition performance of devices. In this paper, we propose a novel method to determine the intervals of time-series data using granular computing. Unlike traditional methods, which use a fixed size or domain knowledge, our method is data-driven. Based on the concept of granular computing, we classify the operation data of devices into three levels and propose a multi-SVM-based machine learning method that can automatically classify each granule. We demonstrate the effectiveness of our method by conducting and evaluating experiments from two perspectives.


I. INTRODUCTION
In general, the operation data of devices are recognized from time-series data in which the size of an instance is not known. However, since the instance size has a large impact on recognition performance, determining it is very important. To address this, some researchers have proposed amplifying the volume and variety of collected data by increasing the number of sensors instead of determining the exact size of the instances [1]. More recently, methods have been proposed that extract the exact motion data of interest using deep learning [2]. The former approach, however, is becoming increasingly difficult to apply because of the growing complexity of data patterns and the resulting computational cost, even though the range of recognizable activities has diversified. The latter, by contrast, achieves very high activity-recognition accuracy, but it incurs a heavy computational burden in finding the exact interval, which varies depending on the type of activity.
To overcome these limitations, we adopt granular computing [3] to recognize the activities. Based on the concept of granular computing, we classify the operation data of devices into three levels and propose a multi-SVM-based machine learning method that can automatically classify each granule using time-series data. We also propose a feature selection method that maximizes classification performance according to the level of each granularity. By applying granular computing to the recognition of operation data, it becomes possible to subdivide the predefined activities and to extract features well suited to finding the subdivided activities. As a result, the start and the end of the instances can be located more clearly, and better features can be extracted. This makes it possible to achieve high performance with basic machine learning methods rather than deep learning.
The paper is organized as follows. Section II reviews related research. Section III offers detailed descriptions of the overall architecture and the components of the framework. In Section IV, experimental results are presented to demonstrate the effectiveness of the framework. Finally, Section V presents the conclusions and further research.

II. RELATED WORKS
Multimodal sensor data are data collected from various types of sensors, such as auditory, visual, and state sensors, for specific purposes. The characteristic of such data is that they carry a great deal of information owing to their large volume compared with data collected from a single sensor [4]. As shown in Table I, multimodal sensor data have been used in many fields that need a large volume of information, such as medicine, robotics, and activity recognition. Nonetheless, the heavy volume of multimodal sensor data may cause a computational burden [5]. To reduce this burden, granular computing is emerging as a solution.
Granular computing is defined as an umbrella term covering any theories, methodologies, techniques, and tools that make use of granules in complex problem solving [6]. A granule, the basic element of granular computing, is a small particle; in particular, one of numerous particles forming a larger unit [6]. A granule is formed from elements that are bundled by similarities, differences, or functions. Furthermore, data granularity is a measure that represents the scale of a granule. In granular computing, the key features may vary depending on the data granularity, so different learning models are required that can reflect this characteristic; as a result, prediction performance can be improved [7]. Research exploiting granular computing is being conducted in various fields, as summarized in Table I. Granular computing makes it possible to segment and analyze time-series sensor data that cannot be defined in advance, and it can reduce the preprocessing cost for each granule. Using these advantages, we propose a framework for determining the intervals of instances in multimodal sensor data.

TABLE I. APPLICATIONS OF GRANULAR COMPUTING

Medical: Granules are chunks of medical image pixels; they take detailed information into account and reflect the inherent spatial relations of the image [10], [11].
Data Processing: Make data processing easier by dividing a big data set with higher dimensions into relatively smaller data subsets [12], [13], [14].
Situation Awareness: Adopted to solve open issues in the three levels of the Situation Awareness model [15].
Learning: Make information granules by collecting detailed data of the same type; address the data complexity problem by learning each type of granule separately.

III. PROPOSED FRAMEWORK

A. Preliminaries
To divide time-series data collected by a multimodal sensor into instances, we must first convert them into structured tabular data and then add the target-label column required for classification. However, it is very hard to determine the intervals of the transformed tabular data, because their decision boundaries are difficult to find. To reduce this difficulty, we apply granular computing to the transformed tabular data. To do so, it is necessary to define the data granule and the data granularity. Let G = {G_1, G_2, ..., G_L} be a collection of layers composing the hierarchical granularities (in our case, L = 3). Each layer G_l is composed of data granules, each of which is defined as a class or a target label to be classified. If the concepts of data granule and data granularity are applied to the operating data of a water purifier, they can be schematized as depicted in Fig. 1.
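The three-level hierarchy can be pictured as a small data structure. The granule names below are illustrative assumptions drawn from the modules described later, not labels taken from Fig. 1:

```python
# A minimal sketch of the three-level granularity hierarchy G = {G1, G2, G3}.
# The granule labels are illustrative assumptions, not the paper's exact names.
granularity = {
    "G1": ["idle-time", "busy-time"],         # Level 1: idle vs. busy granules
    "G2": ["normal", "abnormal"],             # Level 2: within the busy-time granule
    "G3": ["start", "mid1", "mid2", "end"],   # Level 3: within a normal granule
}

def granules_at(level: int) -> list:
    """Return the data granules (target labels) composing layer G_level."""
    return granularity[f"G{level}"]

print(granules_at(3))
```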
From now on, all discussions focus on the operating data of the water purifier.

B. Overall Framework
As depicted in Fig. 2, the framework for determining the interval of an instance consists of three modules: Idle-time Granule Deletion Module, Abnormal Granules Deletion Module, and Level 3 Granularity Classifiers Learning Module.

1) Idle-time granule deletion module
This module finds a model that can distinguish between the busy-time data granule and the idle-time data granule in the multimodal sensor data. To do this, we analyzed the characteristics of the two granules and found that the variance of the busy-time granule is larger than that of the idle-time granule. However, it is difficult to show that this characteristic applies to all sensors and all data fields, so we select the sensors and data fields to which it can be applied. The overall procedure is as follows.
Step 1. Data normalization. We normalize each data field so that the variances of sensor data measured on different scales can be compared.
Step 2. Instance segmentation. We divide the normalized tabular data according to the appended target label. Here, an instance I_i in granularity G_1 (1 ≤ i ≤ n) is defined as the tabular data included in an interval in which the same target-label value is repeated.
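Steps 1 and 2 can be sketched with the pandas and numpy packages used in the paper's experiments; the `flow` field and the idle/busy labels below are hypothetical toy data:

```python
import numpy as np
import pandas as pd

def normalize(df: pd.DataFrame, fields: list) -> pd.DataFrame:
    """Step 1: z-score normalization of each data field so that variances
    measured on different scales become comparable."""
    out = df.copy()
    for f in fields:
        out[f] = (df[f] - df[f].mean()) / df[f].std(ddof=0)
    return out

def segment_instances(df: pd.DataFrame, label_col: str = "label") -> list:
    """Step 2: split the tabular data into instances, one per maximal run
    of identical target-label values."""
    run_id = (df[label_col] != df[label_col].shift()).cumsum()
    return [g for _, g in df.groupby(run_id)]

# Toy example with a hypothetical flow-rate field and idle/busy labels.
df = pd.DataFrame({
    "flow": [0.0, 0.1, 3.2, 3.5, 3.1, 0.0, 0.1],
    "label": ["idle", "idle", "busy", "busy", "busy", "idle", "idle"],
})
instances = segment_instances(normalize(df, ["flow"]))
print(len(instances))  # three label runs: idle, busy, idle
```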
Step 3. AoV calculation. For each data field, we compute the Average of Variances (AoV): the per-instance variances of the field, averaged over all instances of a granule.
Step 4. Top-k method for data-field selection. We apply Hadamard (element-wise) division to the AoV vector of the busy-time granule and that of the idle-time granule [8], and select the top-k data fields based on the resulting ratios.
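A minimal numpy sketch of the AoV computation and the Hadamard-division ranking; the toy sensor data, the field count, and the epsilon guard are assumptions:

```python
import numpy as np

def aov(instances: list) -> np.ndarray:
    """Average of Variances: for each data field (column), average the
    per-instance variances across all instances of one granule."""
    return np.mean([np.var(inst, axis=0) for inst in instances], axis=0)

def top_k_fields(busy: list, idle: list, k: int) -> np.ndarray:
    """Hadamard (element-wise) division of the busy-time AoV by the
    idle-time AoV; fields where busy variance dominates rank highest."""
    eps = 1e-12                          # guard against division by zero
    ratio = aov(busy) / (aov(idle) + eps)
    return np.argsort(ratio)[::-1][:k]   # indices of the top-k data fields

# Toy data: 3 fields; field 0 varies strongly only during busy time.
rng = np.random.default_rng(0)
busy = [np.c_[rng.normal(0, 5, 20), rng.normal(0, 1, 20), rng.normal(0, 1, 20)]
        for _ in range(4)]
idle = [np.c_[rng.normal(0, 0.1, 20), rng.normal(0, 1, 20), rng.normal(0, 1, 20)]
        for _ in range(4)]
print(top_k_fields(busy, idle, k=1))
```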
Step 5. Feature matrix for input-vector generation. An α × k feature matrix is generated from the top-k data fields, where α is the number of features to be generated. Next, the feature matrix is converted into a 1 × αk input vector, so that one input vector is produced for each of the n instances of G_1.
Step 6. Support Vector Machine learning. The appropriateness of the instance interval determined by applying the concept of granularity is verified in the last module, so at Level 1 granularity we prioritize fast learning of high-order data over classification performance. This is why we select the SVM as the Level 1 learner.
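Steps 5 and 6 might look as follows with scikit-learn. The paper does not list its α features, so the three statistics used here (mean, variance, range) and the toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def input_vector(instance: np.ndarray, field_idx: list) -> np.ndarray:
    """Build an (alpha x k) feature matrix from the top-k data fields
    (alpha = 3 illustrative features: mean, variance, range) and flatten
    it into a single 1 x (alpha*k) input vector."""
    cols = instance[:, field_idx]
    feats = np.vstack([cols.mean(axis=0),
                       cols.var(axis=0),
                       cols.max(axis=0) - cols.min(axis=0)])
    return feats.ravel()

# Hypothetical training set: one input vector per instance, labelled
# 1 = busy-time, 0 = idle-time (busy instances have higher variance).
rng = np.random.default_rng(1)
X = [input_vector(rng.normal(0, s, (20, 2)), [0, 1]) for s in (5, 5, 0.1, 0.1)]
y = [1, 1, 0, 0]
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([X[0]]))
```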

2) Abnormal granules deletion module
The purpose of this module is to classify normal and abnormal granules using only busy-time granule data and, finally, to remove the abnormal granules. The procedure is similar to that of the previous module, except that the data fields are selected using Mutual Information (MI) instead of AoV; MI is used because the normal and abnormal granule data have similar distributions. We generate the feature matrix and input vectors using the two granules with the highest information gain among all Level 2 granules and the data fields contained in them. Next, we obtain a model that removes the abnormal granules by learning one-class SVM models on these features and the sizes of each granule. As in the previous step, the learned one-class SVM model for Level 2 granularity is stored.
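A sketch of the MI-based field ranking and the one-class SVM filtering with scikit-learn; all data, the normal/abnormal labels, and the choice of the top two fields are hypothetical:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.svm import OneClassSVM

# Hypothetical per-instance feature rows for busy-time granules, with
# 0 = normal / 1 = abnormal labels used only to rank fields by MI.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
y = np.r_[np.zeros(50, dtype=int), np.ones(10, dtype=int)]
X[y == 1, 0] += 4.0                  # field 0 carries the anomaly signal

mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:2]       # top-2 data fields by information gain
print(top)

# One-class SVM learned on normal granules only; granules flagged as
# outliers (-1) are removed as abnormal.
ocsvm = OneClassSVM(nu=0.1, gamma="scale").fit(X[y == 0][:, top])
flags = ocsvm.predict(X[:, top])     # +1 = keep (normal), -1 = remove
```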

3) Level 3 granules discovery module
This module finds a model that can distinguish among the Level 3 granules within a normal granule. To do this, we perform clustering on a normal granule to find the start-granule, several mid-granules, and the end-granule, which are the minimum chunks of the data. The processing steps are as follows.
Step 1. K-means clustering of the normal granules. Unlike Level 1 and 2 granules, the target labels of Level 3 granules are not defined in advance. To determine the number of target labels, we apply the k-means clustering method to the instances of the normal granule; the resulting clusters are mapped to the start-granule, mid1-granule, mid2-granule, …, and end-granule.
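Step 1 could be sketched with scikit-learn's KMeans as below. Using three clusters and naming them by their mean temporal position are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster the rows of one normal-granule instance into temporal
# sub-granules, then name the clusters by their mean position in time
# (start-granule first, end-granule last).
rng = np.random.default_rng(3)
instance = np.r_[rng.normal(0, 0.1, (10, 2)),   # start phase
                 rng.normal(3, 0.1, (10, 2)),   # mid phase
                 rng.normal(6, 0.1, (10, 2))]   # end phase
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(instance)

# Order clusters by the average row index of their members.
order = np.argsort([np.mean(np.where(km.labels_ == c)[0]) for c in range(3)])
names = ["start-granule", "mid1-granule", "end-granule"]
granule_of = {int(c): names[i] for i, c in enumerate(order)}
print(granule_of[int(km.labels_[0])])
```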
Step 2. Determination of granule size. To determine the length of an actual instance from the union of Level 3 granules, the lengths of the Level 3 granules had to be determined.
Step 3. Top-k method for data-field selection. We calculate the AoV for all granules in Level 3. However, Hadamard division is not performed, because we assume the Level 3 data has been decomposed into sufficiently small pieces.
Step 4. Feature matrix for input vector generation Feature matrices were created in a similar manner as before.
As a result, an input vector of size 1 × αk was generated for each granule c of Level 3.
Step 5. Support Vector Machine learning. Finally, one-class SVMs are learned using the input vectors and the target labels corresponding to each Level 3 granule.
Finally, we store the length l_c and the feature set of the c-th granule along with the learned SVM models.

C. Instance Intervals Determination
When new data arrive, the learned SVM models from each module are applied sequentially. The intervals are determined by finding the starting points and end points of the instances: the process judges whether the Level 3 granules are detected in sequence, and the detected sequences verify the intervals of the instances. The process is shown in Fig. 3.
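The sequential detection pass can be sketched as follows. The window representation, the `Band` stand-in models (with a `predict()` like OneClassSVM's), and the fixed start → mid → end order are all assumptions for illustration:

```python
# Slide over new data, ask each stored level-3 one-class model whether its
# granule is present, and accept an instance once the granules appear in
# start -> mid -> end order.
def detect_intervals(windows, models, order=("start", "mid1", "end")):
    intervals, seq, begin = [], 0, None
    for t, w in enumerate(windows):
        name = order[seq]
        if models[name].predict([w])[0] == 1:   # granule detected
            if seq == 0:
                begin = t                       # starting point found
            seq += 1
            if seq == len(order):               # full sequence observed
                intervals.append((begin, t))    # end point found
                seq = 0
    return intervals

# Toy stand-in models: each "detects" its granule when the window mean
# falls in a characteristic band.
class Band:
    def __init__(self, lo, hi): self.lo, self.hi = lo, hi
    def predict(self, W):
        return [1 if self.lo <= sum(W[0]) / len(W[0]) < self.hi else -1]

models = {"start": Band(0, 1), "mid1": Band(1, 2), "end": Band(2, 3)}
windows = [[0.2], [1.5], [2.5], [0.1], [1.1], [2.2]]
print(detect_intervals(windows, models))  # -> [(0, 2), (3, 5)]
```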

IV. PERFORMANCE EVALUATION
In order to demonstrate the superiority of the proposed method, we performed the following experiments using the sensors and features listed in Table II. The multimodal sensor integrates the sensors shown in Table II and was installed in a water dispenser to collect watering data. The metadata of the collected data are as follows: the sampling frequency of the multimodal sensor was 0.5 Hz, the total operation time was 6,930 seconds, with 304 watering operations and 307 waiting operations. The target labels consist of 280 normal-watering states, 24 over-watering states, and 307 idle states. We mapped normal watering and over-watering to the busy-time granule and the idle state to the idle-time granule. The programming language used in the experiments was Python 3.7, with the numpy, scikit-learn, and pandas packages. The experiments are as follows.

Experiment 1: Comparison of data fields (features of sensors) between AoV method and filter method
To verify the validity of the top-k data fields selected using the proposed AoV method, we compared the top-k data fields obtained from the AoV method and the filter method. There was no significant difference between the two, meaning that the AoV method can match the performance of the filter method.

Experiment 2: Comparison of complexity between AoV method and filter method
We compared the complexity of the filter method with that of the AoV method in the second experiment. In general, the complexity of the filter method is higher than that of the AoV method, since it must score every candidate data field against the target labels.

Experiment 3: Accuracy of granular computing-based predicted instance intervals
We performed an experiment to check the accuracy of the predicted instance intervals obtained by the granular-computing-based method. Our method found 258 of the 280 normal-watering instances without the target labels appended. We then calculated 258 Jaccard distances [9] between the predicted intervals and the actual intervals. The result is shown in Fig. 5.
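For intervals on a time axis, the Jaccard distance reduces to one minus the overlap-over-union of the two ranges; a minimal sketch:

```python
def jaccard_distance(pred, actual):
    """Jaccard distance between two time intervals given as (start, end)
    pairs: 1 - |intersection| / |union| of the covered ranges."""
    inter = max(0.0, min(pred[1], actual[1]) - max(pred[0], actual[0]))
    union = (pred[1] - pred[0]) + (actual[1] - actual[0]) - inter
    return 1.0 - inter / union

# A predicted watering interval overlapping the true one by half its length.
print(jaccard_distance((0, 10), (5, 15)))  # -> 0.666...
```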
We checked the distribution of the accuracy values with a normality test (p-value 0.006); the distribution was modeled as normal with mean 0.7294 and standard deviation 0.1552. Since 94.98% of the data fell within the mean ± 1.96σ range [0.4252, 1.03], we take the predicted interval accuracy to be about 73%.
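The reported check can be reproduced in style with scipy on synthetic scores drawn from the stated distribution (the actual 258 Jaccard-based values are not available here):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the 258 interval-accuracy scores, drawn from the
# normal distribution reported in the text (mean 0.7294, sd 0.1552).
rng = np.random.default_rng(4)
scores = rng.normal(0.7294, 0.1552, 258)

stat, p = stats.normaltest(scores)            # D'Agostino-Pearson normality test
mean, sd = scores.mean(), scores.std(ddof=1)
lo, hi = mean - 1.96 * sd, mean + 1.96 * sd   # range covering ~95% of a normal
share = np.mean((scores >= lo) & (scores <= hi))
print(round(mean, 4), round(share, 4))
```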

V. CONCLUSION
We proposed a method to determine the intervals of instances that can be classified from the normal granules of time-series data by using granular computing. We also reduced the computational burden by using different feature selection methods according to the granularity, so as to handle multimodal sensor data of various kinds and volumes.
The contributions of our paper are as follows. First, we applied granular computing to time-series data and determined instance intervals that were difficult to find with conventional methods; this is a major advance over existing research. Second, we found the deterministic instance length by identifying the starting point and the finishing point of each instance. The limitations are as follows. Because our method uses supervised learning, it cannot be applied if the target values of normal, abnormal, and idle are not attached. It also cannot classify in real time, because the appropriate collection period for determining the type and length of an instance is unknown. Therefore, in future studies, we will investigate semi-supervised learning methods, which can be used when target values are only partially attached, and we will find the optimal collection period for real-time classification.