Deep Convolutional Neural Networks for Emotion Recognition of Vietnamese

Human emotions play a very important role in communication. Research on emotional speech recognition brings human–machine communication closer to human-to-human communication. This paper presents an evaluation of a Vietnamese emotional corpus using ANOVA and the T-test, and the use of deep convolutional neural networks to recognize four basic emotions of Vietnamese on this corpus: neutrality, sadness, anger, and happiness. Five sets of characteristic parameters were used as inputs to the deep convolutional neural network, in which mel spectral images were taken as the base input and particular attention was paid to the fundamental frequency F0 and its variants. Experiments were conducted for these five parameter sets and for four cases, depending on whether content and speakers were dependent or independent. On average, the maximum recognition accuracy achieved was 97.86% under speaker-dependent and content-dependent conditions. The experimental results also show that F0 and its variants contribute significantly to increasing the accuracy of Vietnamese emotion recognition.


I. INTRODUCTION
Emotional expression through speech can be regarded as sophisticated human behavior. In addition to speech, people can express emotions through gestures and facial expressions. Understanding the message that a communicator conveys through his or her emotional expression plays a very important role in everyday communication between people. Likewise, for human–machine interaction to achieve an effect comparable to human-to-human communication, correctly identifying the emotional state of the communication partner during conversation is critical to successful communication. If Reference [1] is taken as early research on speech emotion based on signal processing, this field of study has spanned some 20 years and has achieved encouraging results [2]. However, the expression of human emotions through speech is very diverse, rich, and sophisticated; hence, research in this area still has several difficulties to overcome.
According to statistics [3], there can be up to 300 emotional forms. A difficulty for an emotion recognition system is that the same form of emotion, sadness for example, expressed through the same sentence, may differ according to the context: the reason for sadness, the level of sadness, and so on. In Vietnamese, for example, the same sad state can be expressed in various ways: buồn (sad), buồn rầu (sad and melancholy), buồn rũ rượi (sad and very tired), buồn mênh mang (stretching sadness), buồn da diết (sad and tormenting), buồn cười (sad and funny), buồn chán (sad and bored), buồn bã (sad and tired), buồn tênh (sad and empty feeling), buồn phiền (sadness and sorrow), buồn bực (sad and angry), buồn đau (desolate), or buồn tuyệt vọng (sad and depressed). A slight variation in the intonation, intensity, pronunciation, or duration of a word can be interpreted in terms of various emotional states.
The majority of models, classifiers, and speech features are recapitulated in Reference [59]. For speech emotion corpora, there are three ways to build a corpus: a corpus of emotional speech based on simulation, a corpus of emotional speech based on elicitation, and a corpus of speech with natural emotion. For a corpus of emotional speech based on simulation, experienced and well-trained artists express various emotions on demand for linguistically neutral sentences. This is an easy way to build an emotional corpus in terms of volume and quality. However, this method tends to produce emotions that are expressed more strongly than real emotions. A corpus of speech emotions based on elicitation simulates artificial expression in which, unlike in the first method, the speaker is unaware of the emotion being elicited: the speaker naturally expresses emotions in the context of a constructed dialog. In this case, the quality of the emotional expression suffers if the speaker knows that the dialog is being recorded. In the third method, the corpus is collected from natural dialog in everyday life. In this case, it is difficult to obtain sufficient corpus volume and to cover the manifold expressions of emotion.
Up to now, research on emotions in Vietnamese has mainly been conducted at the linguistic level [60], and little research has been conducted at the signal-processing level. It can be said that the first corpus on Vietnamese emotions was that of Le Thi Xuyen [61]. In her Ph.D. thesis, the Vietnamese emotional corpus consists of five sentences and two speakers (one male and one female). The corresponding sentences in French were also spoken by two French speakers. Speakers had to familiarize themselves with the sentences, study them, and rehearse their performance before the final recording. Among the five sentences, four are expressed with 12 emotions: neutral*, deception, surprise*, joy*, anger*, contentment (satisfaction), confirmation, boredom*, advice, doubt*, irony*, and regret. The remaining sentence is expressed in the seven emotions marked with *. On the basis of this corpus, Le Thi Xuyen studied the acoustic signals representing psychological and expressive attitudes, the relationship between acoustic facts and the results of perception tests, and cross-language experience in both languages. Khoa M. D. et al. [62]-[64] tested the modeling of Vietnamese prosody with a multimodal corpus to synthesize expressive Vietnamese; the corpus of this research was built with one male voice. Duyen N. T. and Duy B. T. [65] proposed a modified Vietnamese-speaking model to create emotional expression in the voice channel of a Vietnamese-speaking virtual person. In this study, the emotional vocabulary comprised one male artist and one female artist pronouncing 19 sentences in five emotions: natural, happy, sad, angry, and very angry.

For emotion recognition in Vietnamese, the study in Reference [66] used an SVM to classify emotions with EEG signals as input. The results show that five emotional states can be identified in real time with an average accuracy of 70.5%. For the research in Reference [67], the corpus contains two male voices and two female voices with six sentences for six emotions: happiness, neutrality, sadness, surprise, anger, and fear. The characteristic parameters were MFCC, short-term energy, pitch, and formants, used with a GMM model. The highest recognition score was 96.5% for the neutral emotion, and the lowest was 76.5% for sadness. The corpus for Reference [68] consists of six voices with 20 sentences and the same emotions as in Reference [67]. In Reference [68], with Im-SFLA SVM, the recognition score for Vietnamese reached 96.5% for neutrality and dropped to 84.1% for surprise. BKEmo, the first emotional corpus for Vietnamese built with the simulation method, in which recorded artists' voices express the desired emotions, was created at Hanoi University of Science and Technology; see References [69], [70] for more details on this corpus. With the BKEmo corpus, we performed the recognition of four emotions using the GMM model: happiness, neutrality, sadness, and anger [70]. The parameters for our GMM model were MFCC and its first and second derivatives, fundamental frequency, energy, formants and the corresponding bandwidths, spectral characteristics, and F0 variants. The average recognition score was 77.21%.
In this paper, a deep CNN model is used to recognize the four emotions mentioned above on the same BKEmo corpus. However, as described below, the input characteristic parameters used for the deep CNN differ from those of the GMM model in Reference [70].
The rest of the paper is organized as follows. Section II describes the additional evaluation of the corpus for the emotional recognition of Vietnamese. Section III describes the deep CNN configuration for training experiments. Section IV details the parameter sets used for the experiments. The corpus used for the experiments is described in Section V, and the experimental results are presented and discussed in Section VI. Section VII concludes the paper.

II. ADDITIONAL EVALUATION OF THE CORPUS

A. ANOVA Analysis
In Reference [69], using ANOVA and the T-test, we evaluated the BKEmo corpus with nine characteristic parameters of the speech signal spectrum: harmonicity, center of gravity, standard deviation, skewness, kurtosis, central spectral moment, and the mean, slope, and standard deviation of the LTAS (long-term average spectrum). The results show that these parameters affect the differentiation of the four emotions. In this paper, we perform an additional evaluation of the remaining 18 characteristic parameters, including intensity, four formants and the corresponding bandwidths, and F0 and its variants (dF0, F0NormAver, F0NormMinMax, F0NormAverStd, LogF0NormAver, dLogF0, LogF0NormMinMax, and LogF0NormAverStd). See Reference [70] for details on the F0 variants. All the parameters of the speech signal used in our experiments were calculated using Praat [71].
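As an illustration, the following is a minimal Python sketch of how F0 and some of its variants could be computed, using the parselmouth bindings to Praat. The file name is hypothetical, and the exact normalization formulas of the F0 variants are defined in Reference [70]; the mean, min-max, and mean/std normalizations below are our assumptions.

```python
# Hedged sketch: extracting F0 via Praat (parselmouth bindings) and computing
# plausible definitions of the F0 variants named above. The exact formulas are
# given in Reference [70]; the normalizations below are our assumptions.
import numpy as np
import parselmouth  # pip install praat-parselmouth

snd = parselmouth.Sound("angry_female_01.wav")  # hypothetical file name
pitch = snd.to_pitch(time_step=0.01)            # 10 ms step, matching our framing
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                                 # keep voiced frames only

dF0 = np.diff(f0)                               # frame-to-frame F0 delta
F0NormAver = f0 / f0.mean()                     # assumed: mean-normalized F0
F0NormMinMax = (f0 - f0.min()) / (f0.max() - f0.min())  # assumed: min-max
F0NormAverStd = (f0 - f0.mean()) / f0.std()     # assumed: z-score
log_f0 = np.log(f0)                             # the Log* variants repeat the
dLogF0 = np.diff(log_f0)                        # same normalizations on log F0
```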
Next, this section presents the results of using one-way ANOVA to evaluate the influence of the 18 characteristic parameters mentioned above. Table I presents the ANOVA results, with the F statistic and P-value for each of the 18 parameters. Table I shows that the P-values are very small (effectively 0), which means that the probability of the expected values being the same across emotions is very low. Therefore, the hypothesis of no distinction between emotions is rejected, and one can assert that emotions can be differentiated based on these parameters. The P-values in the tables of this paper were calculated with MATLAB using double precision and are retained in the format rendered by MATLAB; very small values such as xxxE−243 or xxxE−248 can in practice be considered as 0.
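For illustration, the sketch below runs a one-way ANOVA for a single characteristic parameter across the four emotions. The paper's values were computed with MATLAB; scipy.stats.f_oneway is an equivalent, and the synthetic per-emotion samples here merely stand in for real per-utterance measurements.

```python
# Hedged sketch: one-way ANOVA for one characteristic parameter across the
# four emotions. Synthetic data stands in for per-utterance Praat measurements.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
values_by_emotion = {
    "neutral": rng.normal(60, 3, 100),  # e.g., mean intensity in dB (synthetic)
    "sad":     rng.normal(55, 3, 100),
    "angry":   rng.normal(70, 4, 100),
    "happy":   rng.normal(66, 4, 100),
}
F, p = f_oneway(*values_by_emotion.values())
print(f"F = {F:.2f}, P-value = {p:.3e}")  # a P-value << 0.05 rejects the null
```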

B. T-Test Results
Using the T-test, we can identify which pairs of emotions can be distinguished from each other based on the above parameters. Table II shows that the P-values are very small in the evaluation of each parameter for each pair of emotions. In the majority of cases, a P-value of 0.05 is used as the cutoff for significance [72]: if the P-value is less than 0.05, the null hypothesis that there is no difference between the emotions of a pair is rejected, and a significant difference does exist. The emotion pairs sad–happy and sad–angry are best distinguished for most of the 18 characteristic parameters. This distinction is in line with the fact that these two pairs of emotions are easy to distinguish by ear. The remaining emotion pairs are also distinctive; however, the mean and standard deviation of the LTAS and the intensity parameter have less effect on the neutral–angry pair, and the parameters dF0 and dLogF0 have less effect on the angry–happy pair, with dLogF0 in particular offering little discrimination for this pair.
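A corresponding sketch of the pairwise T-tests behind Table II, again on synthetic stand-in data; scipy.stats.ttest_ind replaces the MATLAB computation used in the paper.

```python
# Hedged sketch: two-sample T-tests over all emotion pairs for one parameter,
# mirroring the structure of Table II. Data is synthetic, as in the ANOVA sketch.
import numpy as np
from itertools import combinations
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
values_by_emotion = {  # same synthetic per-emotion samples as above
    "neutral": rng.normal(60, 3, 100),
    "sad":     rng.normal(55, 3, 100),
    "angry":   rng.normal(70, 4, 100),
    "happy":   rng.normal(66, 4, 100),
}
for emo_a, emo_b in combinations(values_by_emotion, 2):
    t, p = ttest_ind(values_by_emotion[emo_a], values_by_emotion[emo_b])
    verdict = "distinguishable" if p < 0.05 else "not distinguishable"
    print(f"{emo_a} vs. {emo_b}: t = {t:6.2f}, P = {p:.3e} -> {verdict}")
```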

III. DEEP CNN CONFIGURATION FOR EXPERIMENTS
For the emotion recognition of Vietnamese, the full configuration of the deep CNN used for training is described in Table III for the baseline model with 260 parameters. For models with more than 260 parameters, the network configurations can be deduced in a similar way.
For layer 1, the input image is 260 × 260 (260 mel spectrum coefficients × 260 frames). The frame number 260 was chosen because, with a frame width of 0.025 s and a frame shift of 0.01 s, the total number of frames is around 260 for most of the files used in the experiments. An example of the mel spectrum of the speech signal as the input image of layer 1 in the case of the baseline model is given in Fig. 1. After convolution using a 3 × 3 moving filter with padding, there are 64 feature maps with a size of 260 × 260. The next operations of layer 1 are batch normalization, a nonlinear ELU activation function, max pooling with 2 × 2 windows, and finally dropout with a coefficient of 0.5.

The convolution is carried out in turn from left to right and top to bottom. The purpose of the convolution operation is to determine the probability of occurrence of patterns at certain locations in the image. Formula (1) describes the convolution operation at a particular location:

$$y = \sum_{i} w_i x_i + b, \qquad (1)$$

where $x_i$ are the spectral pixels in the scanned window (i.e., 3 × 3 spectral pixels), $w_i$ are the weights to learn, and $b$ is the bias. If $N_{in}$ is the number of input pictures and $N_{out}$ is the number of feature maps, the number of parameters for a convolution operation is $N_{out} \times (N_{in} \times (\text{moving filter size}) + 1)$. For layer 1, for example, this gives 64 × (1 × 9 + 1) = 640 parameters.

For each layer, the goal of batch normalization is to achieve a stable distribution of activation values throughout training and thereby yield a substantial speedup in training [73]. ELU speeds up learning in deep neural networks and leads to higher classification accuracies [74]. Max pooling, also known as downsampling or subsampling, reduces the number of model parameters while making the detection of features invariant to orientation or scale changes [75]. Finally, dropout is a means of preventing neural networks from overfitting [76].

Layers 2, 3, and 4 perform the same operations with convolution using a 3 × 3 moving filter with padding, and the outputs of these convolutions are 128 feature maps with dimensions 130 × 130, 65 × 65, and 32 × 32, respectively. After the convolution, these layers also apply batch normalization, ELU, max pooling, and dropout. Finally, for layer 5, after batch normalization, ELU, max pooling, and dropout, there is a fully connected layer with 256 inputs and four outputs corresponding to the four emotions. The transfer function of the fully connected layer is Softmax, representing the probability distribution over the emotions.
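The following is a minimal PyTorch sketch of the baseline configuration as described above; Table III remains authoritative. The channel count of layer 5 and the reduction to the 256-dimensional input of the fully connected layer are not fully specified here, so the 256-channel block with global average pooling is our assumption.

```python
# Minimal sketch of the baseline deep CNN described above. Layer widths follow
# the text (64 maps in layer 1, 128 in layers 2-4); the 256-channel layer 5 and
# the global average pooling down to 256 features are our assumptions.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # 3x3 filter, padded
        nn.BatchNorm2d(c_out),   # stabilize activation statistics [73]
        nn.ELU(),                # faster learning, higher accuracy [74]
        nn.MaxPool2d(2),         # 2x2 downsampling [75]
        nn.Dropout(0.5),         # overfitting control [76]
    )

class EmotionCNN(nn.Module):
    def __init__(self, n_emotions=4):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 64),        # 260x260 -> 130x130
            conv_block(64, 128),      # 130x130 -> 65x65
            conv_block(128, 128),     # 65x65   -> 32x32
            conv_block(128, 128),     # 32x32   -> 16x16
            conv_block(128, 256),     # assumed layer-5 channels
            nn.AdaptiveAvgPool2d(1),  # assumed reduction to 256 features
            nn.Flatten(),
        )
        # Softmax is applied implicitly by the cross-entropy loss in PyTorch.
        self.classifier = nn.Linear(256, n_emotions)

    def forward(self, x):
        return self.classifier(self.features(x))

model = EmotionCNN()
out = model(torch.randn(1, 1, 260, 260))  # one 260-mel x 260-frame "image"
print(out.shape)  # torch.Size([1, 4])
```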
Usually, neural networks with three or more hidden layers are considered deep neural networks. Reference [77] gives an example of a CNN architecture that also has five layers, although that deep CNN is used for image classification.
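For completeness, a hedged sketch of one training step for the model above follows. The paper does not state the optimizer, learning rate, or batch size in this section; Adam at 1e-3 with a batch of eight is a placeholder.

```python
# Hedged sketch of a single training step for the EmotionCNN sketched above.
# Optimizer and hyperparameters are placeholders, not the paper's settings.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()          # applies log-softmax internally
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

batch = torch.randn(8, 1, 260, 260)        # stand-in for mel-spectrum images
labels = torch.randint(0, 4, (8,))         # 0..3: neutral/sad/angry/happy
optimizer.zero_grad()
loss = criterion(model(batch), labels)
loss.backward()
optimizer.step()
```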

IV. PARAMETER SETS USED FOR EXPERIMENTS
The parameters are divided into five sets, containing 260, 264, 267, 294, and 296 parameters, respectively, as detailed in Table IV.
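As an illustration of how the baseline 260 × 260 mel-spectrum input could be produced, the sketch below uses librosa with the framing stated in Section III (25 ms window, 10 ms shift, 260 mel bands). The file name and sampling rate are hypothetical, and padding or truncating to exactly 260 frames is our assumption; the paper only notes that most files yield about 260 frames.

```python
# Hedged sketch: building a 260x260 mel-spectrum "image" from a wav file.
# File name and sampling rate are hypothetical.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=44100)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=int(0.025 * sr),       # 25 ms analysis window
    hop_length=int(0.010 * sr),  # 10 ms frame shift
    n_mels=260,                  # 260 mel bands
)
mel_db = librosa.power_to_db(mel, ref=np.max)

n_frames = 260
if mel_db.shape[1] < n_frames:   # assumed: pad short files, truncate long ones
    pad = n_frames - mel_db.shape[1]
    mel_db = np.pad(mel_db, ((0, 0), (0, pad)), mode="constant")
else:
    mel_db = mel_db[:, :n_frames]
print(mel_db.shape)  # (260, 260)
```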

V. TESTS FOR EXPERIMENTS
The method of dividing the corpus for the tests in this paper is similar to that used in Reference [70]: there are four corpus sets for four tests, denoted Ti (i = 1, 2, 3, 4), covering the four combinations of speaker-dependent or speaker-independent and content-dependent or content-independent conditions. Test 1 (T1) uses the speaker-dependent and content-dependent corpus.
For Test 1, speaker-dependent and content-dependent mean that the same speaker expresses the same content, but at different times. This still has practical meaning because of the variability of the speech signal: the same person may speak the same utterance at different times, yet the speech signals are not quite the same [78]. In this paper, the number of voice files per test is divided in a 2-1-1 proportion into training, validation, and test subsets, as sketched below.
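A minimal sketch of such a 2-1-1 split over a hypothetical file list; random shuffling is our assumption.

```python
# Hedged sketch: 2-1-1 train/validation/test split of a hypothetical file list.
import random

wav_files = [f"utt_{i:03d}.wav" for i in range(400)]  # hypothetical corpus files
random.Random(0).shuffle(wav_files)                   # assumed: random shuffle

n = len(wav_files)
train = wav_files[: n // 2]              # 2 parts
valid = wav_files[n // 2 : 3 * n // 4]   # 1 part
test  = wav_files[3 * n // 4 :]          # 1 part
print(len(train), len(valid), len(test)) # 200 100 100
```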
VI. EXPERIMENTAL RESULTS AND DISCUSSION

When we increased the number of parameters from 260 to 264, the T4 and T2 tests did not increase in recognition accuracy, but the T1 and T3 tests did. For the Ti − 267 tests (i = 1, 2, 3, 4), the T1, T2, and T4 tests all increased in recognition accuracy. The largest difference in recognition accuracy was 15.85% (T1 − 267 vs. T4 − 267), and the smallest was 1.21% (T3 − 267 vs. T4 − 267). In the case of T2 − 267, the recognition accuracy decreased by 0.79% compared with T2 − 264.
In the case of 294 parameters, the T1, T2, and T4 tests all increased in recognition accuracy. In particular, the recognition accuracy of test T2 − 294 increased significantly (by 3.08%) compared with that of T2 − 267. In contrast, the recognition accuracy of test T3 − 294 was reduced (by 2.71%) from that of test T3 − 267. When the number of parameters increased to 296, the T1 and T2 tests achieved their highest recognition accuracy compared with the other cases: 98.93% for test T1 and 89.76% for test T2.
Thus, over the four tests with five parameter sets, the T1 and T2 tests achieved their highest recognition accuracy with 296 parameters, the T3 test with 267 parameters, and the T4 test with 294 parameters. The average recognition accuracy over the four tests and the four emotions for each set of parameters is given in Fig. 3, which shows that the average recognition accuracy was highest when using 296 parameters and lowest when using 260 parameters. This result also shows that the addition of energy, spectral characteristics, the fundamental frequency F0, the variants of F0, the four formants, and the corresponding bandwidths increased the recognition accuracy.
These characteristics are parameters that strongly influence the ability to differentiate emotions and were analyzed with one-way ANOVA and the T-test in Section II. The influence of the fundamental frequency through the dF0 and LogF0NormAver parameters, as used in the 296-parameter set, improved the recognition accuracy (from 87.26% to 88.01%). The highest recognition accuracy of each emotion for each experiment is given in Fig. 4, which shows that, in general, sadness has the highest recognition accuracy among the emotions in the T1, T2, and T4 tests but a low one in the T3 test; this is also visible in Fig. 5. For the T1 test, the highest recognition accuracy was achieved, and the percentages did not differ greatly across emotions (95.42%–100%). In the T4 test, the neutral emotion shows a greater disparity in recognition accuracy than sadness.
In terms of the average recognition accuracy over all tests for each set of parameters, the mean percentage for sadness was highest (90.54%), followed by anger (89.03%), happiness (87%), and the neutral emotion (80.77%). The average recognition accuracy of each test for each emotion is shown in Fig. 5.
The 260 parameters of the baseline set are universal parameters across different languages. F0 is an important parameter for both tonal and non-tonal languages. Furthermore, depending on the language, language-specific parameters such as voice quality can be exploited.
Among the 296 parameters, the 14 coefficients of the inverse filter of the vocal tract account for a significant share. These coefficients carry information about the vocal tract. However, this significant increase in coefficients does not increase the recognition accuracy as much as increasing the number of parameters related to F0 does. This shows the importance of F0 for emotion recognition and is worth noting when considering the trade-off between the number of parameters and the average accuracy, in terms of training time and cost.
The experiments were performed on two machines with an Intel Core i7-4790 CPU @ 3.60 GHz × 8 and 16 GB of RAM; training took about 6 days for one of the four tests (Ti, i = 1, 2, 3, 4) with one of the five parameter sets. Among the characteristic parameters used, the effect of the spectral parameters on emotion recognition accuracy for the GMM model was studied in Reference [69], and the influence of the parameters directly related to F0 was presented in Reference [70]. These results can inform the choice of the optimum number of parameters.

VII. CONCLUSION
The results of the ANOVA and T-test analyses showed that emotions in the BKEmo corpus can be distinguished using the characteristic parameters of the speech signal. The recognition results for the four emotions using the deep CNN are consistent with these evaluation results. The feature parameter set for our deep CNN includes characteristic parameters for both the sound source and the vocal tract. Among these, the fundamental frequency F0 is a sound-source parameter and is important for Vietnamese because it is a tonal language, and the variation of the fundamental frequency also depends on the emotion to be expressed. Overall, the accuracy of Vietnamese emotion recognition achieved with the deep CNN is very positive, and the experimental results show that information on the fundamental frequency F0 improved the accuracy of emotion recognition.

The BKEmo corpus is shared by the deep CNN model of this paper and the GMM model in Reference [70]. However, it would be unreasonable to compare the recognition scores of these two models in order to conclude which model is better, because the characteristic parameters used by the two models differ in our case. Nevertheless, both the deep CNN and GMM models improved their recognition scores when more information on F0 and the F0 variants was added. As a preliminary comparison between the GMM and ANN approaches, it can be said that, besides its learning ability, the ANN model has promising prospects for building recognition systems in general because of the diversity of neural network architectures.

Our next work is to extend the corpus to other forms of emotion in Vietnamese and to perform recognition for these forms. One problem for recognition systems, including emotion recognition, is that data in a real environment may belong to neither the training nor the testing set; in such cases, the recognition accuracy is usually reduced. To approach this issue in emotion recognition, studies have used transfer learning [79]-[82], and this will also be one of our upcoming research directions.