IJMLC 2020 Vol.10(1): 1-9 (Jan. 2020) ISSN: 2010-3700
DOI: 10.18178/ijmlc.2020.10.1.890

Automated Techniques for Creating Speech Corpora from Public Data Sources for ML Training

Lawrence Drabeck, Buvana Ramanan, Thomas Woo, and Troy Cauble

Abstract—For machine learning (ML) to work well, large amounts of good-quality training data are needed. Obtaining such data is often the key bottleneck in the entire ML development process. Explicit collection by humans has been the main approach, but it tends to be expensive and time-consuming, so there is significant interest in alternative data collection techniques. In this paper, we explore such techniques in the context of speech data. We were initially motivated by the problem of wake word engine training, which requires a large number of utterances of specific wake words. Given that large public repositories of media data already exist (e.g., YouTube, DailyMotion), we were curious how feasible it is to find the utterances we need there. Our results are encouraging: many different types of words can readily be found and downloaded in the quantity and quality needed to create training corpora for deep learning (DL) training. Usually, more than 30% of the found words are suitable for corpus creation. More than 80% of the top 10,000 ranked words and more than 50% of the top 20,000 words we selected easily produced more than 5,000 found words, which is sufficient to train a high-quality wake word engine. Besides general words, we specifically looked for the kinds of words used in wake word engine construction, such as names, places, and product names. Here, again, we find that most common names, places, and products return more than enough words for corpus creation. Only uncommon names and places (like Atticus or Maximus) are difficult to find in sufficient quantities. We demonstrate that a wake word engine trained on words we found on YouTube performs equivalently to one trained on traditional human-collected words. Although we focused on wake words, our approach is general and can be applied to create speech corpora for various purposes.

Index Terms—Corpus, found data, training data, wake word engine, machine learning, deep learning.

L. Drabeck and T. Cauble are with Nokia Bell Laboratories, Holmdel, NJ 07733, USA (e-mail: lawrence.drabeck@nokia-bell-labs.com, troy.cauble@nokia-bell-labs.com).
B. Ramanan and T. Woo are with Nokia Bell Laboratories, Murray Hill, NJ 07974, USA (e-mail: buvana.ramanan@nokia-bell-labs.com, thomas.woo@nokia-bell-labs.com).


Cite: Lawrence Drabeck, Buvana Ramanan, Thomas Woo, and Troy Cauble, "Automated Techniques for Creating Speech Corpora from Public Data Sources for ML Training," International Journal of Machine Learning and Computing, vol. 10, no. 1, pp. 1-9, 2020.

Copyright © 2020 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

 


