Abstract—Machine learning algorithms for data containing histogram variables have not been explored to any major extent. In this paper, an adapted version of the random forest algorithm is proposed to handle variables of this type, assuming identical structure of the histograms across observations, i.e., the histograms for a variable all use the same number and width of bins. The standard approach of representing bins as separate variables may cause the learning algorithm to overlook the underlying dependencies between bins. In contrast, the proposed algorithm handles each histogram as a unit. When evaluating splits on a histogram variable during tree growth, the proposed algorithm employs a sliding window of fixed size to constrain the sets of bins that are considered together. A small number of all possible sets of bins are randomly selected, and principal component analysis (PCA) is applied locally to all examples in a node. Split evaluation is then performed on each principal component. Results from applying the algorithm to both synthetic and real-world data are presented, showing that the proposed algorithm outperforms the standard approach of using random forests with bins represented as separate variables, with respect to both AUC and accuracy. In addition to introducing the new algorithm, we elaborate on how real-world data for predicting NOx sensor failure in heavy-duty trucks was prepared, demonstrating that predictive performance can be further improved by adding variables that represent changes of the histograms over time.
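The split evaluation outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of contiguous windows, the Gini impurity criterion, and the default values of `window` and `n_windows` are all assumptions made for illustration. `H` is assumed to hold one histogram variable, with one row per example in the node and one column per bin.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector (0.0 for an empty vector)."""
    if y.size == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / y.size
    return 1.0 - np.sum(p ** 2)

def best_threshold(values, y):
    """Return (threshold, weighted impurity) of the best cut on one feature.

    Returns (None, inf) when all values are identical and no cut exists.
    """
    best_t, best_imp = None, np.inf
    uniq = np.unique(values)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for t in (uniq[:-1] + uniq[1:]) / 2.0:
        left = values <= t
        imp = (left.sum() * gini(y[left])
               + (~left).sum() * gini(y[~left])) / y.size
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp

def evaluate_histogram_split(H, y, window=3, n_windows=2, rng=None):
    """Evaluate splits on one histogram variable for the examples in a node.

    Assumed procedure: draw a few sliding-window positions at random,
    run PCA on the bins inside each window using only the node's examples,
    and search for the best threshold along every principal component.
    """
    rng = np.random.default_rng(rng)
    n_bins = H.shape[1]
    starts = np.arange(n_bins - window + 1)
    chosen = rng.choice(starts, size=min(n_windows, starts.size), replace=False)
    best = None
    for s in chosen:
        X = H[:, s:s + window]
        mean = X.mean(axis=0)
        Xc = X - mean
        # PCA via SVD of the centered data; rows of Vt are the principal axes.
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        for axis in Vt:
            scores = Xc @ axis  # projection onto one principal component
            t, imp = best_threshold(scores, y)
            if t is not None and (best is None or imp < best["impurity"]):
                best = {"start": int(s), "axis": axis, "mean": mean,
                        "threshold": float(t), "impurity": float(imp)}
    return best
```

Under these assumptions, a new example would be routed by centering its window of bins with the stored `mean`, projecting onto `axis`, and comparing the result with `threshold`. Because the rotation is recomputed from the examples in each node, the split axes differ from node to node.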
Index Terms—Histogram random forest, histogram data, random forest PCA, histogram features.
Ram B. Gurung and Tony Lindgren are with the Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden (e-mail: firstname.lastname@example.org, email@example.com).
Henrik Boström was with the Department of Computer and Systems Sciences, Stockholm University, Stockholm, Sweden. He is now at KTH Royal Institute of Technology, School of Information and Communication Technology, Kista, Sweden (e-mail: firstname.lastname@example.org).
Cite: Ram B. Gurung, Tony Lindgren, and Henrik Boström, "Learning Random Forest from Histogram Data Using Split Specific Axis Rotation," International Journal of Machine Learning and Computing, vol. 8, no. 1, pp. 74-79, 2018.