Text-Independent Speaker Identification Using Deep Learning Model of Convolution Neural Network


IJMLC 2019 Vol.9(2): 143-148 ISSN: 2010-3700
DOI: 10.18178/ijmlc.2019.9.2.778

Supaporn Bunrit, Thuttaphol Inkian, Nittaya Kerdprasop, and Kittisak Kerdprasop

Abstract—Speaker recognition can be categorized into speaker identification and speaker verification. The definitions of these two subfields vary slightly with the application domain. Given a voice input, the goal of speaker verification is authentication: it answers the question “is this the voice of a given person?” Speaker identification instead answers the question “whose voice is this?” Verification can be regarded as a special case of open-set identification. In this work, a deep learning model based on a convolution neural network (CNN) is proposed for speaker identification. The voice input to the method is not constrained to particular words spoken by the speaker; that is, the system is text-independent, which is more difficult than the text-dependent case. In the method, each 2-second segment of a speaker's voice is transformed into a spectrogram image and fed to a CNN model that is trained from scratch. The proposed CNN-based method is compared with classic Mel-frequency cepstral coefficient (MFCC) feature extraction classified by a support vector machine (SVM); to date, MFCCs are the most popular features extracted from audio and speech signals. The proposed method, which takes the spectrogram image as input, is also compared with a CNN model that takes an image of the raw signal waveform as input. Experiments are conducted on speech from five speakers of the Thai language whose voices were extracted from YouTube. The results show that the proposed CNN-based method trained on spectrogram images performs best of the three methods. The average classification result on the testing set is 95.83% for the proposed method, 91.26% for the MFCC-based method, and only 49.77% for the CNN model trained on images of the raw signal waveform. The proposed method is very efficient when only a short utterance of voice is used as input.
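The preprocessing step described in the abstract (cutting the voice into 2-second segments and turning each into a spectrogram) can be illustrated with a minimal sketch. This is not the authors' implementation: the sample rate, frame length, hop size, Hann window, and log scaling are all assumptions chosen for illustration, the helper names are hypothetical, and a synthetic tone stands in for real speech.

```python
import numpy as np


def split_into_segments(signal, sample_rate, seconds=2.0):
    """Cut a 1-D signal into non-overlapping fixed-length segments.

    Samples left over at the end that do not fill a whole segment are dropped.
    """
    seg_len = int(sample_rate * seconds)
    n_segments = len(signal) // seg_len
    return signal[: n_segments * seg_len].reshape(n_segments, seg_len)


def log_spectrogram(segment, frame_len=512, hop=256):
    """Log-magnitude short-time Fourier spectrogram.

    Returns an array of shape (frame_len // 2 + 1, n_frames).
    frame_len and hop are illustrative assumptions, not values from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(segment) - frame_len) // hop
    frames = np.stack(
        [segment[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Small constant avoids log(0) for silent frames.
    return 20.0 * np.log10(magnitude + 1e-10).T


# Synthetic stand-in for a speech recording: 5 s of a 440 Hz tone at 16 kHz (assumed rate).
sample_rate = 16000
t = np.arange(5 * sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)

segments = split_into_segments(signal, sample_rate)  # one row per 2-second segment
spec = log_spectrogram(segments[0])                  # 2-D array that could be saved as an image
print(segments.shape, spec.shape)
```

Each resulting 2-D array could then be rendered as an image and passed to a CNN classifier; the network architecture itself is not specified in the abstract and is therefore not sketched here.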

Index Terms—Convolution neural network (CNN), deep learning, speaker recognition, speaker identification, text-independent.

The authors are with the School of Computer Engineering, SUT, 111 University Avenue, Muang, Nakhon Ratchasima 30000, Thailand. (corresponding author: S. Bunrit; tel.: +66944961244; e-mail: sbunrit@sut.ac.th, thuttapholti@gmail.com, nittaya@sut.ac.th, kerdpras@sut.ac.th).


Cite: Supaporn Bunrit, Thuttaphol Inkian, Nittaya Kerdprasop, and Kittisak Kerdprasop, "Text-Independent Speaker Identification Using Deep Learning Model of Convolution Neural Network," International Journal of Machine Learning and Computing vol. 9, no. 2, pp. 143-148, 2019.


General Information

E-ISSN: 2972-368X
Abbreviated Title: Int. J. Mach. Learn.
Frequency: Quarterly
DOI: 10.18178/IJML
Editor-in-Chief: Dr. Lin Huang
Executive Editor: Ms. Cherry L. Chen
Abstracting/Indexing: Inspec (IET), Google Scholar, Crossref, ProQuest, Electronic Journals Library, CNKI.
E-mail: ijml@ejournal.net
