IJMLC 2019 Vol.9(2): 143-148 ISSN: 2010-3700
DOI: 10.18178/ijmlc.2019.9.2.778

Text-Independent Speaker Identification Using Deep Learning Model of Convolution Neural Network

Supaporn Bunrit, Thuttaphol Inkian, Nittaya Kerdprasop, and Kittisak Kerdprasop

Abstract—Speaker recognition can be categorized into speaker identification and speaker verification. The definitions of these two subfields vary slightly with the application domain. Given a voice input, the goal of speaker verification is authentication, which answers the question “Is this voice a given person’s voice?” Speaker identification instead tries to answer the question “Whose voice is this?” Verification can be regarded as a special case of open-set identification. In this work, a deep learning model using a convolution neural network (CNN) is proposed for speaker identification. The voice input to the method is not constrained in the words the speaker speaks; that is, the system is text-independent, which is more difficult than a text-dependent system. In the method, each 2-second segment of the speaker’s voice is transformed into a spectrogram image and fed to a CNN model trained from scratch. The proposed CNN-based method is compared with the classic feature extraction method based on Mel-frequency cepstral coefficients (MFCCs), classified by a support vector machine (SVM); to date, MFCC is the most popular feature extraction method for audio and speech signals. The proposed method, which uses the spectrogram image as input, is also compared with the case in which an image of the raw signal wave is fed to the CNN model. Experiments are conducted on speech from five speakers speaking Thai, whose voices were extracted from YouTube. The results show that the proposed CNN-based method trained on spectrogram images of the voice performs best of the three methods. The average classification results on the testing set are 95.83% for the proposed method, 91.26% for the MFCC-based method, and only 49.77% for the CNN model trained on images of the raw signal wave. The proposed method is very efficient when only a short utterance is used as input.

Index Terms—Convolution neural network (CNN), deep learning, speaker recognition, speaker identification, text-independent.
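
The preprocessing step described in the abstract (cutting the voice into 2-second pieces and turning each piece into a spectrogram) can be illustrated with a minimal NumPy sketch. This is not the authors' code. The sample rate (16 kHz), frame length, hop size, Hann window, log scaling, and the function names `split_into_segments` and `log_spectrogram` are all assumptions made for illustration, and a synthetic tone stands in for real speech.

```python
import numpy as np


def split_into_segments(signal, sample_rate, seconds=2.0):
    """Cut a 1-D signal into non-overlapping fixed-length segments.

    Any trailing samples that do not fill a whole segment are dropped.
    """
    seg_len = int(round(seconds * sample_rate))
    n_segments = len(signal) // seg_len
    return signal[: n_segments * seg_len].reshape(n_segments, seg_len)


def log_spectrogram(segment, frame_len=400, hop=160):
    """Log-magnitude short-time Fourier transform of one segment.

    frame_len and hop are assumed values (25 ms and 10 ms at 16 kHz).
    Returns an array of shape (frame_len // 2 + 1, n_frames).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(segment) - frame_len) // hop
    frames = np.stack(
        [segment[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(magnitude).T


# Assumed sample rate; the abstract does not state one.
sample_rate = 16000

# Synthetic 5-second, 440 Hz tone used as a stand-in for a voice recording.
t = np.arange(5 * sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)

segments = split_into_segments(signal, sample_rate)  # two full 2-second pieces
spectrograms = np.stack([log_spectrogram(s) for s in segments])
print(segments.shape, spectrograms.shape)
```

Each 2-D array in `spectrograms` plays the role of one spectrogram image; in a setup like the one described, such arrays (or rendered images of them) would be the inputs to the CNN classifier.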

The authors are with the School of Computer Engineering, SUT, 111 University Avenue, Muang, Nakhon Ratchasima 30000, Thailand. (corresponding author: S. Bunrit; tel.: +66944961244; e-mail: sbunrit@sut.ac.th, thuttapholti@gmail.com, nittaya@sut.ac.th, kerdpras@sut.ac.th).


Cite: Supaporn Bunrit, Thuttaphol Inkian, Nittaya Kerdprasop, and Kittisak Kerdprasop, "Text-Independent Speaker Identification Using Deep Learning Model of Convolution Neural Network," International Journal of Machine Learning and Computing, vol. 9, no. 2, pp. 143-148, 2019.
