On Modeling Smoke-Haze Incidence with Cluster and Regression Analyses

Thailand and many other countries in Southeast Asia have long suffered from regional smoke-haze incidents. Smoke-haze is an air pollution event that frequently results from forest fires set intentionally to manage vegetation. It can cause serious health problems because high concentrations of small particulate matter can be retained in the lungs or even spread through the whole body and obstruct major organs. In the northern part of Thailand, smoke-haze normally occurs during the dry season from late January to early April, with air pollution peaking around March. Controlling burning activity is an obvious solution, but it is impractical when the burning areas are in remote high mountains that are hard for ground officers to reach. Monitoring incidents and estimating pollution levels are more practical ways to handle the smoke-haze situation. We thus propose applying machine learning to learn smoke-haze patterns from historical events. The specific approach used in our work is cluster analysis with the k-means and Kohonen self-organizing map algorithms. The cluster with the most serious pollution effect is then analysed further to induce a predictive regression model from meteorological factors. The resulting model can serve as a predictive pattern for triggering early warnings about air pollution.


I. INTRODUCTION
Important air pollution incidents in Asia relate to haze. The World Meteorological Organization defines the natural haze phenomenon as extremely small, dry particles suspended in the atmosphere in numbers large enough to make the sky appear dusky [1]. Winter haze events in many areas of China cause health problems because of toxic air pollutants [2]-[5]. Many forecasting models and simulation methods have been applied to study the transport characteristics of haze events [6]-[11].
Unlike haze, which can occur naturally, smoke-haze is caused by humans setting fires over large areas, mainly for agricultural purposes. Countries in Southeast Asia, including Thailand, have long suffered from smoke-haze incidents [12]-[16]. Smoke-haze has also recently been a problem in Russia [17]. High concentrations of hazardous heavy metals and small particulate matter (PM) are the dangerous constituents of haze and smoke-haze. PM10 and PM2.5 are particulate matter with aerodynamic diameters less than or equal to 10 and 2.5 micrometers, respectively. The smaller the PM, the more harmful it is, because these extremely tiny particles can be retained in the human lung and spread through the whole body. PM also causes airway damage, cardiovascular impairment, and adverse effects in infants [18].
Manuscript received July 27, 2019; revised June 1, 2020. This work was supported in part by grants from the National Research Council of Thailand (NRCT) and Suranaree University of Technology (SUT) through the funding of Data and Knowledge Engineering Research Unit in which the third and fourth authors are principal researchers.
N. Kerdprasop is with the School of Computer Engineering and the Data and Knowledge Engineering Research Unit, SUT, Thailand (e-mail: nittaya@sut.ac.th).
In Southeast Asian countries, smoke-haze is a periodic and transboundary event that occurs during the dry season [19]-[22]. The man-made forest fire season in Thailand can start in late January and extend to the last week of April or even early May, depending on the arrival of the rains. In northern Thailand, the extent and intensity of fire-related air pollution have been increasing recently. Controlling burning is obviously an effective solution, but it is quite impractical because so many large areas in valleys and high mountains are almost unreachable by officers. Monitoring burning incidents as a warning sign is a more feasible way of handling the smoke-haze situation in advance [23].
We thus propose in this paper data-driven modeling, also known as data mining, as an alternative to the traditional simulation method. The advantage of the data mining method is that it automatically derives, or learns, discriminative patterns from the historical events provided as training data. The model derived by our approach is a pattern of smoke-haze occurrence that can be transported among mountain regions. Our study area and details of the data records are presented in Section II. Our model building strategy is explained in Section III. Model evaluation results are presented in Section IV. We finally conclude the paper in Section V.

II. STUDY AREA AND DATA

A. Study Area
Northern Thailand is the focus area of our study because of the intensity of its smoke-haze from the burning of land. The terrain in the north is forested and mountainous, as shown in Fig. 1. The smoke-haze problem normally starts in the border provinces, such as Chiang Rai in the far north and Mae Hong Son on the west side. These two provinces have many high mountains along their borders with neighboring countries: Laos to the north and Myanmar to the west. Therefore, they are vulnerable to smoke-haze transported from neighboring areas. Forest fires in Chiang Rai and Mae Hong Son also affect other valley provinces such as Lamphun and Lampang. The main areas of our study are thus the mountainous regions along the Laos and Myanmar borders and the provinces in the lower river valleys.

B. Data Attributes
The start of the dry season, around mid to late February, is the onset of slash-and-burn farming. The lingering smoke from the fires causes serious pollution problems for people living in the valley and lowland areas during March and April every year. We thus collect air pollution data, with specific attention to PM10, for nine provinces in the mountainous and valley areas. Pollution data are collected from the Air Quality and Noise Management Bureau of Thailand [24]. The PM10 data from these nine provinces cover January to April 2016, totaling 121 days. Each data record contains 24-hour average PM10 values from the ground stations in each province. Missing PM10 records are imputed with the nearest neighbor technique.
We also include data collected from the local airport stations [25] as meteorological factors. The data attributes used in our modeling process are summarized in Table I.

III. MODELING METHODOLOGY
The steps in our method for clustering geographically polluted areas and building a predictive pollution model are illustrated in Fig. 2. The first step is data collection. The PM10 data [24] from the ground stations in nine provinces are collected from the first of January to the last day of April 2016. We average the PM10 values from all stations in each province and use this mean as the pollution value for that province. To gain a better understanding of the PM10 episodes in the nine provinces, we explore the characteristics of PM10 occurrences and levels using a graph plot. The next step is checking for data completeness. There are some missing values in the PM10 records from some stations. We therefore impute the missing values with the nearest neighbor method, which estimates the missing PM10 value from the closest station.
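The imputation and averaging steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the station names, readings, and inter-station distances are hypothetical, and "nearest neighbor" is interpreted as copying the same-day value from the closest station that reported one.

```python
from statistics import mean


def impute_nearest(readings, distances):
    """Fill each missing (None) value with the same-day reading from the
    closest station that has one. A value stays None if no station reported."""
    filled = {name: list(series) for name, series in readings.items()}
    for name, series in readings.items():
        # Other stations, ordered from nearest to farthest.
        neighbours = sorted(
            (other for other in readings if other != name),
            key=lambda other: distances[frozenset((name, other))],
        )
        for day, value in enumerate(series):
            if value is not None:
                continue
            for other in neighbours:
                if readings[other][day] is not None:
                    filled[name][day] = readings[other][day]
                    break
    return filled


def provincial_mean(filled):
    """Average all stations day by day (assumes every gap was filled)."""
    n_days = len(next(iter(filled.values())))
    return [mean(series[day] for series in filled.values()) for day in range(n_days)]


# Hypothetical three-day PM10 readings for three stations in one province.
readings = {
    "S1": [40.0, None, 60.0],
    "S2": [50.0, 70.0, None],
    "S3": [30.0, 90.0, 80.0],
}
# Hypothetical distances between stations, in kilometres.
distances = {
    frozenset(("S1", "S2")): 5.0,
    frozenset(("S1", "S3")): 12.0,
    frozenset(("S2", "S3")): 8.0,
}

filled = impute_nearest(readings, distances)
daily_pm10 = provincial_mean(filled)
print(filled)
print(daily_pm10)
```

Copying the raw value from the closest station is the simplest reading of the method; a distance-weighted average of several neighbours would be an alternative if stations are unevenly spaced.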
International Journal of Machine Learning and Computing, Vol. 10, No. 6, November 2020
After the missing PM10 values are filled in, all 121 data records are clustered to form groups of provinces experiencing the same level of air pollution due to smoke-haze. We apply two automatic data clustering algorithms in this step: k-means clustering [26] and the Kohonen self-organizing map [27]. The main reason for applying two clustering methods is to select the approach that yields the more reasonable result for the subsequent step of predictive model building.
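For readers unfamiliar with k-means, the sketch below shows the basic assign-and-update loop on a toy example. It is an illustration under stated assumptions rather than the software used in this study: each province is assumed to be represented by a vector of daily PM10 values, the province labels and numbers are hypothetical, and the centroids are seeded with the first k points so that the run is deterministic.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's k-means with Euclidean distance.
    Seeding with the first k points is a simplification for reproducibility."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical provinces, each described by three daily PM10 averages.
names = ["P1", "P2", "P3", "P4", "P5"]
points = [
    [30.0, 35.0, 32.0],
    [31.0, 33.0, 34.0],
    [110.0, 120.0, 115.0],
    [108.0, 125.0, 118.0],
    [60.0, 65.0, 70.0],
]

labels, centroids = kmeans(points, k=3)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

In this toy run the two low-pollution provinces end up together, the two high-pollution provinces end up together, and the intermediate province forms its own cluster.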
Our criteria for selecting a clustering result are that the number of groups should be much less than nine, which is the number of provinces to be clustered, and that the silhouette coefficient should be higher than 0.5 to reflect well-formed clusters.
We then use the clustering results to extract the group of provinces that experience the strongest impact from air pollution due to smoke-haze. The meteorological data (attributes 11-27 in Table I) for that specific group are then used to build a predictive model for estimating the PM10 level from the meteorological proxy.
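The silhouette criterion can be made concrete with a short sketch. For each item, a is its mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette is (b - a) / max(a, b); the coefficient is the mean over all items. The data below are hypothetical, and the `max_clusters` threshold is an assumption standing in for "much less than nine".

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points, using Euclidean distance.
    Points in singleton clusters score 0 by the usual convention."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster.
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for label, members in clusters.items()
            if label != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def meets_criteria(n_clusters, score, max_clusters=5, min_silhouette=0.5):
    """Selection rule from the text; max_clusters is an assumed threshold."""
    return n_clusters <= max_clusters and score > min_silhouette


# Hypothetical one-dimensional example with two well-separated groups.
toy_points = [[1.0], [2.0], [10.0], [11.0]]
toy_labels = [0, 0, 1, 1]

score = silhouette_coefficient(toy_points, toy_labels)
print(round(score, 4), meets_criteria(len(set(toy_labels)), score))
```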

IV. EXPERIMENTATION AND RESULTS
Following the design of our analysis methodology, our work has three main phases: preliminary data exploration, automatic data clustering based on the PM10 concentrations in each province, and the creation of a predictive model that estimates the amount of PM10 using meteorological factors, such as humidity and wind speed, as pollution predictors. The results from each phase are presented and explained in the following subsections.

A. Data Exploration
To plan the analysis correctly, we first explore the pollution situation based on the observed PM10 concentrations from the first day of January to the last day of May 2016. The levels of average PM10 from the ground stations in each province are displayed in Fig. 3.
We can observe from the pollution level distribution that the peak period of air pollution, mainly caused by smoke-haze, falls between days 61 and 121, which correspond to March and April. During the peak period the average PM10 concentration among the nine provinces is around 105, whereas before this peak period the pollution level is 57. A PM10 concentration higher than 101 is considered harmful according to the UK standard [28] (as shown in Table II). We therefore partition the dataset into two subsets: data from days 1-60 and data from days 61-121. Pollution data from May are excluded from further experimentation because our focus is on the peak period of smoke-haze events. The two prepared subsets of average PM10 for each of the nine provinces are used in the next step of automatic clustering.
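The day numbering can be checked against the calendar with a few lines of code. The sketch below maps day numbers to dates in 2016, a leap year, and splits a daily series into the two subsets. The series itself is a synthetic placeholder that only echoes the two period averages quoted above; the May value is invented, and none of it is observed data.

```python
from datetime import date, timedelta
from statistics import mean

START = date(2016, 1, 1)  # day 1 of the study period


def day_to_date(day_number):
    """Convert a 1-based day number into a calendar date in 2016."""
    return START + timedelta(days=day_number - 1)


def split_periods(daily_values, last_pre_peak_day=60, last_peak_day=121):
    """Return (days 1-60, days 61-121); later days (May) are dropped."""
    pre_peak = daily_values[:last_pre_peak_day]
    peak = daily_values[last_pre_peak_day:last_peak_day]
    return pre_peak, peak


# Synthetic placeholder covering January to May 2016 (152 days).
series = [57.0] * 60 + [105.0] * 61 + [40.0] * 31

pre_peak, peak = split_periods(series)
print(day_to_date(60), day_to_date(61), day_to_date(121))
print(len(pre_peak), mean(pre_peak), len(peak), mean(peak))
```

Day 60 falls on 29 February because 2016 is a leap year, so days 61-121 cover exactly March and April.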

B. Cluster Analysis Results
We perform cluster analysis twice, on two different data subsets: PM10 concentrations from the nine provinces during days 1-60 and PM10 concentrations during days 61-121. The two clustering algorithms adopted in our experimentation are k-means and the Kohonen self-organizing map (SOM). The result (shown in Fig. 4) is that k-means clustering yields a smaller number of clusters than the Kohonen SOM. Even though the silhouette coefficient of k-means is lower than that of the Kohonen SOM, we choose the k-means method because of its reasonable grouping results. Both experiments on the two data subsets yield the same number of clusters, namely five, but the provinces in each group differ. Details of the clustering results are summarized in Table III and shown geographically in Fig. 5.
From the clustering of provincial locations based on PM10 concentrations, we can see (in Table III) that in the non-peak period during January-February the air quality of the four provinces in cluster 5 (Chiang Rai, Mae Hong Son, Phayao, Nan) is good, with a low level of PM10. The air quality in these areas deteriorates drastically during the peak period of burning (March-April). The pollution level in the four provinces worsens rapidly from the low level (index = 3) to the harmful level (index = 10) in less than a month. Among the four most affected provinces, Chiang Rai is the worst, with the highest PM10 concentration of 309 µg/m³ in late March. We are therefore interested in deriving a predictive model to estimate the pollution level of Chiang Rai province using meteorological factors.
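The index values mentioned above can be illustrated with a small lookup. The breakpoints in this sketch follow the UK Daily Air Quality Index for 24-hour mean PM10 as commonly published; they are assumed to match Table II, which is not reproduced here, so they should be checked against that table before use.

```python
from bisect import bisect_left

# Upper bounds (µg/m³) of index bands 1-9; anything above 100 is index 10.
# Assumed from the UK Daily Air Quality Index, not taken from Table II.
UPPER_BOUNDS = [16, 33, 50, 58, 66, 75, 83, 91, 100]
BAND_NAMES = ["Low"] * 3 + ["Moderate"] * 3 + ["High"] * 3 + ["Very High"]


def pm10_index(concentration):
    """Map a 24-hour mean PM10 concentration to an index from 1 to 10."""
    return bisect_left(UPPER_BOUNDS, round(concentration)) + 1


def pm10_band(concentration):
    """Return the descriptive band for a PM10 concentration."""
    return BAND_NAMES[pm10_index(concentration) - 1]


# 45 is a hypothetical low-season value; 105 and 309 appear in the text.
for value in (45, 105, 309):
    print(value, pm10_index(value), pm10_band(value))
```

Under these assumed breakpoints, a value in the 34-50 range gives index 3 and any value of 101 or more gives index 10, which is consistent with the jump from index 3 to index 10 described above.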

C. Multivariate Linear Regression Model to Predict PM10
Six meteorological variables are used in creating the predictive model: temperature (°C), dew point (°C), humidity (%), sea level pressure (hPa), visibility (km), and wind speed (km/h). Each variable except wind speed is further divided into three attributes: low, high, and average. Wind speed is divided into high and average only. Therefore, there are 17 factors (5 × 3 + 2) used in the model building process.
We apply the multivariate linear regression technique to derive a predictive model that estimates the level of PM10 in Chiang Rai province. The obtained model is shown in Fig. 6. Among the 17 predictors, low humidity and average visibility distance are the two most important factors for estimating the pollution level. The importance of the ten best predictors is shown graphically in Fig. 7. From the multivariate linear regression relationship and the importance of the predictors, we can expect a high pollution level when the humidity is low, the average visibility distance is short, and the sea level pressure is low. Wind speed is negatively correlated with the PM10 concentration, which means that wind can reduce the level of air pollution.
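The count of 17 factors can be verified by enumerating the combinations. The attribute names generated below are hypothetical labels for illustration and are not the names used in Table I.

```python
from itertools import product

# Variables recorded at three levels (low, high, average).
three_level = ["temperature", "dew_point", "humidity",
               "sea_level_pressure", "visibility"]

# Five variables times three levels gives 15 predictors.
predictors = [
    f"{level}_{variable}"
    for variable, level in product(three_level, ["low", "high", "average"])
]
# Wind speed is recorded at two levels only, adding 2 more.
predictors += [f"{level}_wind_speed" for level in ["high", "average"]]

print(len(predictors))
print(predictors)
```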
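As an illustration of how a multivariate linear regression model is fitted, the sketch below solves the least-squares normal equations with Gauss-Jordan elimination. It is not the model of Fig. 6: it uses only two hypothetical predictors, and the data are generated from invented coefficients (intercept 200, humidity -1.5, wind speed -4) so that the fit can be checked by eye.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.
    Returns [intercept, coef_1, ..., coef_p]."""
    rows = [[1.0] + list(r) for r in X]  # prepend a column of ones
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * target for r, target in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


def predict(coefs, row):
    """Apply the fitted linear model to one row of predictor values."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))


# Hypothetical rows of (humidity %, wind speed km/h).
X = [[40.0, 5.0], [55.0, 10.0], [70.0, 3.0], [85.0, 12.0], [60.0, 8.0]]
# Synthetic PM10 values generated as 200 - 1.5 * humidity - 4 * wind speed.
y = [120.0, 77.5, 83.0, 24.5, 78.0]

coefs = fit_linear(X, y)
print([round(c, 6) for c in coefs])
print(round(predict(coefs, [50.0, 6.0]), 6))
```

The negative signs of the invented coefficients mirror the direction of the relationships described above (lower humidity and weaker wind go with higher PM10), but their magnitudes carry no meaning. With 17 predictors and real data, a numerical library would be a more robust choice than hand-written elimination.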

V. CONCLUSION
Smoke-haze is a post-harvest event that recurs in almost every region of Thailand from late January to early April. In some specific areas in the north that are surrounded by high mountains, the small particulate matter from farm burning causes extremely severe air pollution. This pollution event is preventable because it is man-made.
We thus study the occurrence patterns of smoke-haze with cluster analysis and induce a predictive model from the historical data. Our intention is to gain some insight into smoke-haze occurrence patterns and to apply the predictive model as a warning tool for officers and local people.
From our experimentation, we found that k-means clustering yielded a reasonable grouping result. The k-means clustering results were then analyzed to select the group showing the highest level of pollution for further analysis with predictive modeling.
The best modeling technique for this specific data is multivariate linear regression. The correlation coefficient of the linear model is as high as 0.837, and the mean absolute error of the model prediction is 30.513. We thus conclude from this evaluation result that the predictive model is accurate enough to be applied to future situations.
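The two evaluation measures quoted above can be computed as in the following sketch. The observed and predicted values are hypothetical and serve only to show the calculation; they do not reproduce the reported figures of 0.837 and 30.513.

```python
import math


def pearson_r(actual, predicted):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical observed and predicted PM10 values.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]

r = pearson_r(actual, predicted)
mae = mean_absolute_error(actual, predicted)
print(round(r, 4), mae)
```

The mean absolute error is expressed in the same units as PM10, so it is best read against the concentration range of the peak period rather than on its own.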

CONFLICT OF INTEREST
The authors declare no conflict of interest.

AUTHOR CONTRIBUTIONS
The first author was responsible for designing the research framework, organizing the experimentation steps, and preparing the manuscript. The second author helped conduct the experimentation and confirm the results. The third author took part in the programming work. The last author helped edit the manuscript and discuss the results.