Learning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing

IJMLC 2014 Vol.4(3): 216-224 ISSN: 2010-3700
DOI: 10.7763/IJMLC.2014.V4.415

Che Ngufor and Janusz Wojtusiak

Abstract—Extracting information from a training data set for predictive inference is a fundamental task in data mining and machine learning. With the exponential growth in the amount of data generated over the past few years, there is an urgent need to develop new learning algorithms, or adapt existing ones, to learn efficiently from large data sets. This paper describes three scaling techniques that enable machine learning algorithms to learn from large distributed data sets. First, a general single-pass formula for computing the covariance matrix of large data sets using the MapReduce framework is derived. Second, two new efficient and accurate sampling schemes for scaling down large data sets for local processing are presented. The first sampling scheme uses the single-pass covariance formula to select the most informative data points based on uncertainties in the linear discriminant score. The second scheme, by contrast, selects informative points based on uncertainties in the logistic regression model. A series of numerical experiments demonstrates that the formula gives numerically stable results and that the sampling schemes are fast, efficient, accurate, and cost-effective.

Index Terms—Linear discriminant analysis, logistic regression, classification, sampling, MapReduce, single-pass.
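
The abstract does not reproduce the single-pass covariance formula itself. As a rough illustration of the general idea only, the sketch below uses the well-known pairwise update for merging (count, mean, scatter matrix) summaries, which may differ from the formula derived in the paper. The function names `single`, `merge`, `map_partition`, and `covariance` are hypothetical, and plain Python lists stand in for the partitions that a MapReduce job would distribute across nodes.

```python
from functools import reduce


def single(row):
    """Summary of one data point: count 1, mean = the point, zero scatter."""
    d = len(row)
    return 1, list(row), [[0.0] * d for _ in range(d)]


def merge(a, b):
    """Combine two (count, mean, scatter) summaries without revisiting the data."""
    na, mean_a, sa = a
    nb, mean_b, sb = b
    n = na + nb
    d = len(mean_a)
    delta = [mean_b[j] - mean_a[j] for j in range(d)]
    mean = [mean_a[j] + delta[j] * nb / n for j in range(d)]
    w = na * nb / n  # weight of the between-summary correction term
    scatter = [[sa[i][j] + sb[i][j] + w * delta[i] * delta[j]
                for j in range(d)] for i in range(d)]
    return n, mean, scatter


def map_partition(rows):
    """Map step (assumed): fold each row of a partition into one summary."""
    return reduce(merge, (single(r) for r in rows))


def covariance(partitions):
    """Reduce step (assumed): merge partition summaries, then normalize."""
    n, _, scatter = reduce(merge, map(map_partition, partitions))
    return [[v / (n - 1) for v in row] for row in scatter]


# Two hypothetical partitions of a four-point, two-feature data set.
partitions = [
    [[1.0, 2.0], [2.0, 4.0]],
    [[3.0, 6.0], [4.0, 9.0]],
]
cov = covariance(partitions)
```

Each data point is read exactly once, and because `merge` is associative, partition summaries can be combined in any order, which is what makes this kind of update a natural fit for a reduce step.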

Che Ngufor is with Computational Sciences and Informatics at George Mason University, Fairfax, VA (e-mail: cngufor@masonlive.gmu.edu).
Janusz Wojtusiak is with the Department of Health Administration and Policy at George Mason University (e-mail: jwojt@mli.gmu.edu).

Cite: Che Ngufor and Janusz Wojtusiak, "Learning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing," International Journal of Machine Learning and Computing, vol. 4, no. 3, pp. 216-224, 2014.