Improved Language-Independent Speaker Identification in a Non-contemporaneous Setup

doi: 10.18178/ijmlc.2020.10.5.984

Abstract—One of the most effective approaches available in the literature for Automatic Speaker Identification is based on Gaussian Mixture Models (GMMs) with Mel Frequency Cepstral Coefficients (MFCCs) as features (Reynolds, 1995). The use of GMMs for modeling speaker identity is motivated by the interpretation that the Gaussian components represent general speaker-dependent spectral shapes, and by the capability of mixtures to model arbitrary densities. In an earlier work, the authors presented and demonstrated empirically (using the benchmark speech corpus NTIMIT) how combining two well-known feature sets (MFCCs and Perceptual Linear Predictive Coefficients (PLPCs)) and using ensemble classifiers in conjunction with the Principal Component Transformation (PCT) and some robust statistical estimation techniques significantly enhances the performance of the baseline MFCC-GMM speaker recognition system. In this work, the authors demonstrate that this approach, besides being statistically robust, is also significantly more robust than the baseline system to language mismatch in a non-contemporaneous setup. This has been done with the help of ISIS/NISIS, a bilingual dual-channel speech corpus with multi-session speech recordings.

This approach is still one of the best available in the literature.
In an earlier work [5], the authors proposed an extension of the baseline MFCC-GMM speaker recognition system of Reynolds, which shows significantly enhanced recognition accuracy on NTIMIT by 1) augmenting the original feature set of MFCCs with a set of Perceptual Linear Predictive Coefficients (PLPCs) [6]; and 2) implementing robust statistical estimation techniques, such as the trimmed mean, to eliminate the effect of outliers, that is, observations that differ markedly from the majority and may arise from the inherent variability of the data set or from measurement error.
The following two ideas from a previous work [7] were also used: 1) Incorporation of the individual correlation structures of the feature sets of each speaker into the corresponding speaker models: This correlation structure is a significant aspect of the speaker models which was completely ignored by Reynolds by assuming the MFCCs to be independent. This is achieved through the well-known Principal Component Transformation (PCT) [8].
2) Using ensemble classification: It is well known that an ensemble of classifiers can improve accuracy substantially over a single classifier. A clever device has been employed to design an ensemble classifier which further improved the classification accuracy.

This paper is organized as follows. The features used, namely MFCCs and PLPCs, are briefly described in the following section. GMMs are covered in Section III, which also outlines the baseline speaker recognition system with MFCC features and GMM speaker models. The extension of the baseline MFCC-GMM system proposed by the authors in [5] is described in Section IV. Section V presents the salient features of the ISIS/NISIS bilingual dual-channel multi-session speech corpus used to demonstrate empirically that the algorithm of Section IV, apart from being statistically robust, is also significantly more robust than the baseline system to language mismatch in a non-contemporaneous setup. Supporting results are provided in Section VI. Concluding remarks are made in Section VII.

A. Mel Frequency Cepstral Coefficients (MFCCs)
The Mel Frequency Cepstrum (MFC) is a representation of the short-term power spectrum of a speech signal, based on a linear cosine transform of a log-energy spectrum on a nonlinear mel scale of frequency. It exploits auditory principles, as well as the decorrelating property of the cepstrum, and is amenable to compensation for convolution distortion. As such, it has turned out to be one of the most effective feature representations in speech-related recognition tasks [9].
Mel-frequency Cepstral Coefficients (MFCCs) [10] are coefficients that collectively make up an MFC. A given speech signal is partitioned into overlapping segments or frames, and MFCCs are computed for each such frame. Based on a bank of K filters, a set of M MFCCs is computed from each frame as follows. First, the windowed frame x_n(m) is transformed by the discrete Fourier transform,

Y(n, k) = Σ_{m=0}^{N-1} x_n(m) e^{-j2πkm/N},  k = 0, 1, ..., N-1,

N being the length of the discrete Fourier transform. The magnitude of Y(n, k) is then weighted by a series of filter frequency responses whose center frequencies and bandwidths roughly match those of the auditory critical band filters, that is, the so-called mel-scale filters, collectively referred to as a mel-scale filter bank (see below). If the frequency response of the ℓ-th mel-scale filter is denoted by H_ℓ(k), the ℓ-th log filter-bank energy is

e(n, ℓ) = log Σ_k |Y(n, k)|² H_ℓ(k),  ℓ = 1, 2, ..., K,

and the MFCCs are obtained as the discrete cosine transform of these log energies:

c(n, m) = Σ_{ℓ=1}^{K} e(n, ℓ) cos(πm(ℓ - 1/2)/K),  m = 1, 2, ..., M.

Mel-scale Filter Banks
A mel-scale filter bank (Fig. 1) is a set of filters spaced uniformly on the mel scale (described below), each having a triangular bandpass frequency response, with spacing and bandwidth determined by a constant mel frequency interval.

The Mel Scale
Psychophysical studies show that human perception of the frequency contents of sounds for speech signals does not follow a linear scale. For each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on the so-called 'mel' scale. The mel scale is a scale of pitches judged by listeners to be equal in distance from one another. The word mel comes from the word melody to reflect this. This scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz (Fig. 2). A popular formula to convert f hertz into m mel is:

m = 2595 log10(1 + f / 700).
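This standard conversion (the constants 2595 and 700 are the commonly used ones) and its inverse can be written directly:

```python
import math

def hz_to_mel(f):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping: mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Note that 1000 Hz maps to (approximately) 1000 mel, the conventional anchor point of the scale.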

Computation of MFCCs
This involves the following steps:
1. Partitioning the speech signal into overlapping segments or frames.
2. Taking the Fourier transform of the signal in each frame.
3. Mapping the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
4. Taking the logs of the powers at each of the mel frequencies.
5. Taking the discrete cosine transform of the list of mel log powers, as if it were a signal.
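The five steps above can be sketched in NumPy. This is a minimal illustration, not the exact implementation used in the experiments; the function names and the parameter choices (16 kHz sampling, 25 ms frames with 10 ms hop, 26 filters, 13 coefficients) are assumptions for the sketch.

```python
import numpy as np

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced uniformly on the mel scale (step 3)."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz2mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fb[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[i - 1, k] = (hi - k) / max(hi - c, 1)
    return fb

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_coeffs=13):
    """Steps 1-5: frame, power spectrum, mel warping, log, DCT."""
    n_frames = 1 + (len(signal) - frame_len) // hop            # step 1
    win = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft  # step 2
    fb = mel_filter_bank(n_filters, n_fft, sample_rate)        # step 3
    log_e = np.log(power @ fb.T + 1e-12)                       # step 4
    ell = np.arange(n_filters) + 0.5
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), ell) / n_filters)
    return log_e @ basis.T                                     # step 5 (DCT-II)
```

Applied to one second of 16 kHz speech, this yields a (98, 13) matrix: one 13-dimensional MFCC vector per frame.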

B. Perceptual Linear Predictive Coefficients (PLPCs)
Perceptual Linear Prediction is a method of spectral estimation proposed by Hermansky [6]. Psycho-acoustic experiments have shown that human frequency resolution varies across frequency ranges, that low frequencies mask higher ones, and that hearing is most sensitive at mid-frequencies.
While listening, people generally integrate about 1 Bark of spectrum, whereas for discrimination purposes they appear to integrate about 3.5 Barks. These observations inspired the development of Perceptual Linear Predictive Coefficients (PLPCs), which turned out to be superior in many ways to Linear Predictive (LP) coefficients in the task of speaker identification.
In this technique of speech analysis, three psycho-acoustic concepts are mainly used to estimate the auditory spectrum: critical-band spectral analysis, the equal-loudness curve, and the intensity power law. The PLP algorithm can be described in the following steps. First, in the spectral analysis phase, the speech signal is partitioned into overlapping segments, and each segment is weighted by a Hamming window.
The short-term power spectrum P(ω) is computed for each of these segments. In the next stage, P(ω) is warped along the frequency axis onto the Bark scale and convolved with the power spectrum of a simulated critical-band masking curve, yielding samples of the critical-band power spectrum; in this step, spectral resolution is significantly reduced. The sampled power spectrum is then pre-emphasized by an equal-loudness curve, and cube-root amplitude compression is performed to simulate the power law of hearing. Finally, in the autoregressive modeling phase, the resulting spectrum is modeled by a 5th-order model using the autocorrelation method of all-pole spectral modeling. The block diagram below shows the steps of the PLP algorithm.
PLP has the advantage of approximating the speaker-independent effective second formant, and it reduces the disparity between voiced and unvoiced speech. It has been shown in different experiments that there exists a strong correlation between the perceptually estimated second formant and that estimated by the PLP method.
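The PLP chain described above can be sketched in NumPy for a single frame's power spectrum. This is a hedged simplification: critical-band integration is approximated here by averaging within 1-Bark-wide bands rather than convolving with the exact masking curve, and the function names (`bark`, `levinson`, `plp_lpc`) are illustrative, not from the paper.

```python
import numpy as np

def bark(f):
    """Hz -> Bark frequency warping (Hermansky's formula)."""
    return 6.0 * np.arcsinh(f / 600.0)

def levinson(r, order):
    """Levinson-Durbin recursion: autocorrelation -> all-pole (LP) coefficients."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] += k * a[:i][::-1]
        err *= (1.0 - k * k)
    return a

def plp_lpc(power, freqs, order=5):
    """Simplified PLP: Bark warping with crude 1-Bark integration,
    equal-loudness pre-emphasis, cube-root compression, 5th-order
    all-pole fit via the autocorrelation method."""
    b = bark(freqs)
    n_bands = int(np.floor(b.max()))
    cb = np.array([power[(b >= i) & (b < i + 1)].mean() for i in range(n_bands)])
    centers = 600.0 * np.sinh((np.arange(n_bands) + 0.5) / 6.0)  # band centers, Hz
    w = 2.0 * np.pi * centers
    eql = ((w**2 + 56.8e6) * w**4) / ((w**2 + 6.3e6)**2 * (w**2 + 0.38e9))
    compressed = (cb * eql) ** (1.0 / 3.0)       # intensity-loudness power law
    # autocorrelation via inverse DFT of the (symmetrized) auditory spectrum
    spec = np.concatenate([compressed, compressed[-2:0:-1]])
    r = np.fft.ifft(spec).real[:order + 1]
    return levinson(r, order)
```

The low model order (5) is deliberate: it is what lets the all-pole fit smooth away speaker-independent spectral detail while retaining the gross formant structure.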

A. Gaussian Mixture Models (GMMs)
If x is a d-dimensional feature vector, then for an S-speaker problem, the probability distribution of the MFCCs obtained from speaker i, i = 1, 2, ..., S, is modeled as a mixture of C component probability densities:

p(x | λ_i) = Σ_{c=1}^{C} w_{ic} b_{ic}(x),

where the b_{ic}(x) are d-variate Gaussian component densities with mixture weights w_{ic} satisfying Σ_{c=1}^{C} w_{ic} = 1, and λ_i denotes the collection of all parameters of the i-th model. GMM models for all speakers are trained by the Expectation-Maximization algorithm [10]. An unknown speech sample is split into a number of overlapping segments, with MFCCs computed from each segment. The likelihood function for the sample is computed from all the MFCC vectors obtained from it, and the sample is classified by the principle of maximum likelihood, described below.
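As a concrete illustration, per-speaker GMMs can be fitted with scikit-learn's GaussianMixture, whose fit method runs the EM algorithm internally. The helper name, the dictionary interface, and the diagonal-covariance choice are assumptions for this sketch; the paper's experiments use 32-component GMMs, reflected in the default below.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_per_speaker, n_components=32, seed=0):
    """Fit one GMM per speaker via EM on that speaker's feature vectors.

    features_per_speaker: dict mapping speaker id -> (n_frames, d) array
    of feature vectors extracted from the training utterances.
    """
    models = {}
    for spk, feats in features_per_speaker.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",
                              max_iter=200, random_state=seed)
        models[spk] = gmm.fit(feats)
    return models
```

Each fitted model exposes per-frame log-likelihoods via score_samples, which is exactly what the maximum-likelihood decision rule of the next subsection needs.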

B. Speaker Recognition by Maximum Likelihood
Consider a speaker database consisting of S speakers, where the i-th speaker is represented by a GMM p(x | λ_i) as defined above. If a speech utterance of unknown origin is presented, and it is known that its speaker is represented in the database, the objective of speaker recognition is to identify which of the S speakers could have uttered it.
Suppose the unknown utterance is split into P overlapping frames using the same procedure as for the training samples, yielding feature vectors x_1, x_2, ..., x_P. Then, under the usual assumption of independence between frames, the identified speaker is

ŝ = argmax_{1 ≤ i ≤ S} Σ_{p=1}^{P} log p(x_p | λ_i).
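The maximum-likelihood decision can be sketched as follows, assuming (as an illustration, not a prescription) that each speaker model exposes a scikit-learn-style score_samples method returning per-frame log-likelihoods:

```python
import numpy as np

def identify_speaker(models, frames):
    """Maximum-likelihood rule: sum per-frame log-likelihoods
    log p(x_p | lambda_i) over all P frames of the unknown utterance
    and return the speaker whose model maximizes the sum.

    models: dict of speaker id -> model with score_samples(X).
    frames: (P, d) array of feature vectors from the unknown utterance.
    """
    best_spk, best_ll = None, -np.inf
    for spk, model in models.items():
        total = float(np.sum(model.score_samples(frames)))
        if total > best_ll:
            best_spk, best_ll = spk, total
    return best_spk
```

Summing log-likelihoods rather than multiplying likelihoods avoids numerical underflow when P is large.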

IV. ROBUST SPEAKER RECOGNITION BY FUSION OF FEATURES AND CLASSIFIERS
A novel approach for robust speaker recognition by fusion of features and classifiers was presented by the authors in an earlier work [5]. In this approach, a significant enhancement in the classification accuracy of the baseline MFCC-GMM speaker recognition system was achieved by a combination of the following:
1. Augmentation of the MFCC-based feature set with PLPCs: Experiments were conducted separately with both feature sets. Further investigation revealed that classifiers built by fusing both feature sets could identify speakers even more accurately.
2. Robust statistical estimation: To eliminate the effect of outliers, we make use of the well-known trimmed mean procedure [12], which is described below.
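The trimmed mean can be illustrated in a few lines; this minimal sketch mirrors what scipy.stats.trim_mean computes.

```python
import numpy as np

def trimmed_mean(x, trim=0.1):
    """Discard the smallest and largest trim-fraction of values, then
    average what remains, so extreme outliers cannot drag the estimate."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(trim * len(x))
    return float(x[k:len(x) - k].mean())
```

For example, with trim=0.2 the sample [1, 2, 3, 4, 100] yields 3.0, whereas the ordinary mean (22.0) is dominated by the single outlier.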

A. Principal Component Transformation (PCT)
This is a widely-used linear orthogonal transformation for converting a set of observations on possibly correlated variables into a set of observations on linearly uncorrelated variables called principal components [8]. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to (i.e., uncorrelated with) the preceding components. Principal components are guaranteed to be independent only if the data set is jointly normally distributed. PCT is sensitive to the relative scaling of the original variables. Depending on the field of application, it is also called the Karhunen-Loève transform (KLT).
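The transformation just described can be sketched in NumPy, following the data-matrix convention used below (columns as observations); the function name `pct` is illustrative.

```python
import numpy as np

def pct(X):
    """Principal Component Transformation of an m x n data matrix whose
    columns are observations: center the columns, eigendecompose the
    empirical covariance, and project onto the eigenvectors sorted by
    decreasing eigenvalue (variance)."""
    Xc = X - X.mean(axis=1, keepdims=True)      # zero empirical column mean
    cov = Xc @ Xc.T / (X.shape[1] - 1)          # m x m covariance estimate
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # largest variance first
    W = eigvecs[:, order]
    return W.T @ Xc, eigvals[order]             # uncorrelated components
```

By construction, the sample covariance of the returned components is diagonal: the correlation structure of the original variables has been absorbed into the rotation W.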
Let X be an m × n data matrix, each of whose n columns represents an observation on an m-variate random variable U. It is assumed that the columns have zero empirical mean (that is, the arithmetic mean of the n observations has been subtracted from each of them). If the m × m matrix Σ is the empirical covariance matrix of X, the principal components are obtained by projecting the centered observations onto the eigenvectors of Σ, taken in decreasing order of the corresponding eigenvalues.

V. THE ISIS/NISIS SPEECH CORPUS

ISIS (an acronym for Indian Statistical Institute Speech) and NISIS (Noisy ISIS) are speech corpora that respectively contain simultaneously recorded microphone and telephone speech (spontaneous as well as read), over multiple sessions, in two languages (Bangla and English), in a typical office environment with moderate background noise. They were created at the Indian Statistical Institute, Kolkata, as part of a project funded by the Department of Information Technology, Ministry of Communications and Information Technology, Government of India, during 2004-07. The speakers generally had Bangla or another Indian language as their mother tongue, and so were non-native English speakers. Details of the methodology of collection are given in [13].
Particulars of both corpora are given below:
- Number of speakers: 105 (53 male + 52 female)
- Recording environment: moderately quiet computer room
- Sessions per speaker: 4 (numbered I, II, III and IV)
- Interval between sessions: 1 week to about 2 months
- Types of utterances in Bangla and English per session:
  - 10 isolated words (randomly drawn from a specific text corpus, and generally different for all speakers and sessions)
  - answers to 8 questions (these answers included dates, phone numbers, alphabetic sequences, and a few words spoken spontaneously)
  - 12 sentences (first two sentences common to all speakers, the remaining randomly drawn from the text corpus, duration ranging from 3-10 seconds)

Thus, for each session, there are two sets of recordings per speaker, one each in Bangla and English, containing 21 files each.
To conduct the experiments whose results are reported in the following section, ten utterances from the "sentences" folder were used for 100 speakers enrolled in the NISIS corpus. Two separate series of experiments were conducted by splitting the 10 recordings for each speaker at each session as follows:
- The first 6 were used for training and the rest for testing
- The first 8 were used for training and the rest for testing

For brevity, these experiments will subsequently be referred to as 6:4 and 8:2 respectively.
For brevity, we shall subsequently refer to the set of all Bangla sentence utterances recorded in session no. i as BS-i, i=I, II, III, IV, for each speaker. The corresponding notation for the set of all English sentence utterances recorded in session no. i is ES-i.

VI. RESULTS
In [13], the performance on NISIS in the 6:4 setup of the baseline MFCC-GMM system, based on 32-component GMMs built with 38 MFCCs, was reported. The corresponding results, with training and test data mismatched in respect of session and/or language, are given in the light grey cells of Table I. For further experiments, a 39-dimensional feature set (FS-I), formed by combining 13 MFCCs, 13 delta MFCCs and 13 PLPCs, has been used. For building ensemble classifiers, we have also made use of a 39-dimensional feature set (FS-II) consisting only of MFCCs. Both feature sets have been used in conjunction with different values of the underlying parameters to construct a number of classifiers for use in the ensembles. A GMM with 32 components based on one or both of these 39-dimensional feature vectors has been used for each speaker in all cases. For comparison, the darker grey cells of Table I provide the results of the MFCC-GMM system based on FS-I, which will henceforth be referred to as the MFCC-PLPC-GMM system. It is evident from this table that the MFCC-PLPC-GMM system substantially improves recognition accuracy relative to the baseline MFCC-GMM system; the same pattern has also been observed in the 8:2 case. In other words, combining the PLPC features with the MFCC features yields a significant improvement in accuracy across all the experiments. This system has therefore been used as the baseline in further experiments.
To investigate the effect of language and session mismatch on the speaker identification accuracy of the approach proposed in [5] and described in Section IV, extensive experiments were conducted in which the training and test data were taken from different sessions and/or languages. For brevity, this approach will henceforth be referred to as PREF (an acronym for Principal component-transformed Robust Ensemble of Features).
As will be amply evident from the following discussion, PREF based on FS-I (PREF-I, in short) leads to even greater improvement in recognition accuracy. Overall, the degradation due to language and temporal mismatch is seen to be mitigated to a great extent by PREF-I in most of the experiments.
The performance of the MFCC-PLPC-GMM system (as baseline) as well as of PREF-I, with training and test data varying across languages and sessions, for both the 6:4 and 8:2 setups, is reported in Tables II and III respectively.