IJMLC 2014 Vol.4(3): 216-224 ISSN: 2010-3700
DOI: 10.7763/IJMLC.2014.V4.415

Learning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing

Che Ngufor and Janusz Wojtusiak
Abstract—Extracting information from a training data set for predictive inference is a fundamental task in data mining and machine learning. With the exponential growth in the amount of data generated in the past few years, there is an urgent need to develop new learning algorithms, or adapt existing ones, to learn efficiently from large data sets. This paper describes three scaling techniques that enable machine learning algorithms to learn from large distributed data sets. First, a general single-pass formula for computing the covariance matrix of large data sets using the MapReduce framework is derived. Second, two new efficient and accurate sampling schemes for scaling down large data sets for local processing are presented. The first sampling scheme uses the single-pass covariance formula to select the most informative data points based on uncertainties in the linear discriminant score. The second scheme, by contrast, selects informative points based on uncertainties in the logistic regression model. A series of numerical experiments demonstrates that the formula gives numerically stable results and that the sampling schemes are fast, efficient, accurate, and cost-effective.

Index Terms—Linear discriminant analysis, logistic regression, classification, sampling, MapReduce, single-pass.
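
The single-pass covariance idea mentioned in the abstract can be illustrated with a small sketch. The paper's own formula is not reproduced on this page, so the code below instead uses the standard pairwise update for means and co-moments, which is one common way to make covariance statistics mergeable across partitions. The function names (`row_stats`, `merge`, `map_partition`, `covariance`) and the toy data are hypothetical, and plain Python lists stand in for a real MapReduce job.

```python
from functools import reduce


def row_stats(row):
    """Summarize a single row as (count, mean vector, co-moment matrix)."""
    d = len(row)
    return 1, list(row), [[0.0] * d for _ in range(d)]


def merge(a, b):
    """Combine two summaries without revisiting the underlying rows."""
    na, mean_a, m2_a = a
    nb, mean_b, m2_b = b
    n = na + nb
    d = len(mean_a)
    delta = [mean_b[j] - mean_a[j] for j in range(d)]
    mean = [mean_a[j] + delta[j] * nb / n for j in range(d)]
    w = na * nb / n  # weight of the between-group correction term
    m2 = [
        [m2_a[i][j] + m2_b[i][j] + w * delta[i] * delta[j] for j in range(d)]
        for i in range(d)
    ]
    return n, mean, m2


def map_partition(rows):
    """Map step (assumed): one pass over a partition, folding rows into a summary."""
    return reduce(merge, (row_stats(r) for r in rows))


def covariance(stats):
    """Turn a merged summary into the sample covariance matrix."""
    n, _, m2 = stats
    return [[v / (n - 1) for v in row] for row in m2]


# Toy data split into two partitions, as if stored on two nodes.
partitions = [
    [[1.0, 2.0], [2.0, 4.0]],
    [[3.0, 7.0], [4.0, 9.0], [5.0, 8.0]],
]

# Reduce step (assumed): merge the per-partition summaries.
total = reduce(merge, map(map_partition, partitions))
cov = covariance(total)
print(cov)
```

Because `merge` is associative, the per-partition summaries can be combined in any order, which is what makes this kind of statistic a natural fit for a map and reduce split. Each row is read exactly once.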

Che Ngufor is with Computational Sciences and Informatics at George Mason University, Fairfax, VA (e-mail: cngufor@masonlive.gmu.edu).
Janusz Wojtusiak is with the George Mason University Department of Health Administration and Policy (e-mail: jwojt@mli.gmu.edu).

Cite: Che Ngufor and Janusz Wojtusiak, "Learning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing," International Journal of Machine Learning and Computing, vol. 4, no. 3, pp. 216-224, 2014.
