Abstract—Extracting information from a training data set for predictive inference is a fundamental task in data mining and machine learning. With the exponential growth in the volume of data generated in recent years, there is an urgent need to develop new learning algorithms, or adapt existing ones, to learn efficiently from large data sets. This paper describes three scaling techniques that enable machine learning algorithms to learn from large distributed data sets. First, a general single-pass formula for computing the covariance matrix of large data sets using the MapReduce framework is derived. Second, two new efficient and accurate sampling schemes for scaling down large data sets for local processing are presented. The first scheme uses the single-pass covariance formula to select the most informative data points based on uncertainties in the linear discriminant score, while the second selects informative points based on uncertainties in the logistic regression model. A series of numerical experiments demonstrates that the formula is numerically stable and that the sampling schemes are fast, efficient, accurate, and cost effective.
Index Terms—Linear discriminant analysis, logistic regression, classification, sampling, MapReduce, single-pass.
Che Ngufor is with Computational Sciences and Informatics at George Mason University, Fairfax, VA (e-mail:firstname.lastname@example.org).
Janusz Wojtusiak is with the George Mason University Department of Health Administration and Policy (e-mail:email@example.com).
Cite: Che Ngufor and Janusz Wojtusiak, "Learning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing," International Journal of Machine Learning and Computing, vol. 4, no. 3, pp. 216-224, 2014.
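The single-pass, MapReduce-style covariance computation mentioned in the abstract can be illustrated with a minimal sketch: each mapper emits additive partial statistics (count, feature sums, and sums of outer products) for its partition, a reducer merges them, and the covariance is recovered at the end as E[xx^T] − E[x]E[x]^T. This is an assumed illustration of the general idea, not the paper's exact derivation, and the function names are hypothetical; the paper's own formula addresses numerical stability beyond this textbook form.

```python
import numpy as np

def partial_stats(chunk):
    # Map step: per-partition count, feature sums, and sum of outer products.
    # Each quantity is additive, so partitions can be processed independently.
    n = chunk.shape[0]
    s = chunk.sum(axis=0)
    ss = chunk.T @ chunk
    return n, s, ss

def merge_stats(a, b):
    # Reduce step: combine two sets of partial statistics by simple addition.
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

def covariance(stats):
    # Finalize: population covariance E[xx^T] - E[x]E[x]^T from merged sums.
    n, s, ss = stats
    mean = s / n
    return ss / n - np.outer(mean, mean)
```

Because the partial statistics merge associatively, the same three functions work whether the partitions live on one machine or are distributed across a cluster; the result matches the biased (denominator-n) covariance of the full data set.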