Effect of Named Entity Recognition on English-Vietnamese Neural Machine Translation

Abstract: Machine translation systems have become increasingly popular and have achieved reliable results since the advent of deep learning. English-Vietnamese machine translation (MT) still has limitations because Vietnamese contains many words with several different meanings, which lowers the accuracy of automatic MT systems. Our study applied a Named Entity Recognition (NER) tool to Vietnamese sentences to determine the category of words in an English-Vietnamese parallel corpus of over 900K sentence pairs. We then performed experiments to assess the effect of NER on English-Vietnamese MT systems. The results showed that NER had a positive effect on MT, improving the Bi-Lingual Evaluation Understudy (BLEU) score by 1.24 points and the Translation Error Rate (TER) score by 1.8 points on average compared with data without NER.


I. INTRODUCTION
In recent years, driven by the need to exchange large amounts of information between countries and by the strong development of deep learning, automatic machine translation has grown rapidly. Translation systems between popular languages such as English, German, French, Chinese and Spanish have achieved remarkable results. However, translation systems involving Vietnamese still face many difficulties because of the lack of Vietnamese language resources and of research on the Vietnamese language. A Vietnamese word may have different meanings depending on the context, which makes it difficult for MT to select the right word in context. NER classifies words in a text into predefined categories such as person name, organization, location, time, quantity, percent, etc. It indicates which vocabulary corresponds to which context, so it makes it easier for the machine to select the word in the target language that corresponds to the word in the source language.
There have been many studies on NER, especially for popular languages [1]-[3]. Most recent NER research is based on deep learning algorithms, and the named entities covered are mainly person, location, organization, etc.
NER for Vietnamese has been addressed in several papers. Le [4] described an efficient approach to improving the accuracy of a Vietnamese NER system that combines regular expressions over tokens with a bidirectional inference method in a sequence labelling model. This approach achieved an overall F1 score of 89.66%. Dong and Nguyen [5] proposed an attentive neural network for Vietnamese NER. Character-based language models and word embeddings were used to encode words as vector representations, while decoder layers were used to decode the knowledge of input sentences and to assign labels to entities. In another study, Nguyen et al. [6] proposed a neural architecture for Vietnamese sequence labelling tasks including NER and part-of-speech (POS) tagging. Their paper showed that combining bidirectional Long Short-Term Memory (LSTM) networks with Conditional Random Fields leads to positive results in NER and POS labelling.
A series of papers on MT has been published in recent years. Koehn et al. [7] presented MOSES, an open-source toolkit for statistical MT (SMT). Bahdanau et al. [8] established that encoding the source sentence into a variable-length representation, rather than a single fixed-length vector, could significantly improve the quality of neural MT (NMT). Johnson et al. [9] designed a simple approach that uses a single NMT model to build multiple translation systems. Matusov et al. [10] introduced an NMT model for translating subtitles in the entertainment domain. Based on the NMT model, Graça et al. [11] built a German-English MT system using the back-translation method.
Several research papers address English-Vietnamese MT. Nguyen et al. [12] built a bidirectional English-Vietnamese statistical MT (SMT) system using the MOSES toolkit. They collected a parallel corpus of over 880,000 English-Vietnamese sentence pairs and over 11,000,000 Vietnamese monolingual sentences to train the statistical translation model, and used the BLEU score to evaluate the performance of their SMT system. Luong et al. [13] published NMT systems for spoken language domains, translating from English to German and to Vietnamese. For the English-Vietnamese MT system, they collected over 133,000 English-Vietnamese sentence pairs and used the BLEU score to evaluate accuracy. On the test sentences available in their corpus, their MT system has high accuracy, but it gives poor results on random sentences that are not in their data. In another study, Ngo et al. [14] built an English-Vietnamese bilingual corpus for MT in which over 800,000 sentence pairs and 10,000,000 English as well as Vietnamese words were collected and aligned at the sentence level. For their translation experiments, they used example-based MT models and SMT models.
The [...] with a small number of sentences in a parallel corpus. In this paper, we present the basic knowledge of a neural architecture for NER and the state-of-the-art technique in MT. Moreover, we evaluate the effectiveness of NER on English-Vietnamese MT with a large training dataset.

II. NEURAL ARCHITECTURES FOR NAMED ENTITY RECOGNITION
In this paper, we present an up-to-date neural architecture for NER that is based on bidirectional LSTMs and a conditional random field (CRF) [15].

A. Long Short-Term Memory Networks
LSTMs are a special form of recurrent neural network (RNN) that learn to store information over time through recurrent backpropagation. LSTMs were introduced by Hochreiter and Schmidhuber [16], and were then improved and applied in many areas [17]-[19]. LSTMs are designed to avoid the long-term dependency problem without any intervention. They take a sequence of vectors $(x_1, x_2, \ldots, x_n)$ as input and return another sequence of vectors $(h_1, h_2, \ldots, h_n)$ as output. Every recurrent network has the form of a chain of repeating neural network modules. In standard RNNs, these modules have a very simple structure consisting of a single tanh layer, as shown in Fig. 1. The LSTM, shown in Fig. 2, has a similar chain structure, but its repeating module has four layers that interact in a special way. An LSTM can remove information from or add information to the cell state, and these changes are carefully regulated by structures called gates. A gate filters the information that passes through it and is composed of a sigmoid layer and an element-wise multiplication. The sigmoid layer outputs a number in the range [0, 1] that describes how much information is allowed to pass. An output of 0 means that no information is passed, and an output of 1 means that all information goes through. An LSTM has three such gates to maintain and regulate the cell state.
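At step $t$, the components of the LSTM are calculated, in the standard formulation, by

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) \\
C_t &= f_t \otimes C_{t-1} + i_t \otimes \tilde{C}_t \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \otimes \tanh(C_t)
\end{aligned}
$$

where $\sigma$ is the element-wise sigmoid function, $\otimes$ is the element-wise product, $f_t$ is the forget gate, $\tilde{C}_t$ is the output of the main layer, $i_t$ is the input gate, $o_t$ is the output gate, $C_t$ is the cell vector, $h_t$ is the hidden layer, and $W$ and $b$ denote the weight matrices and bias vectors of each layer.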
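The gate computations at step t can be sketched in plain Python as follows. This is a minimal illustration that assumes the standard LSTM formulation, with one weight matrix and bias per layer acting on the concatenation of the previous hidden state and the current input. The names `lstm_step`, `affine` and `params` are hypothetical and are not taken from the original system.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Return W v + b for a matrix W (list of rows), bias b and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step.

    params maps 'f', 'i', 'c', 'o' to (W, b) pairs (an assumed layout):
    forget gate, input gate, candidate ("main") layer and output gate.
    """
    v = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(*params["f"], v)]
    i_t = [sigmoid(z) for z in affine(*params["i"], v)]
    c_tilde = [math.tanh(z) for z in affine(*params["c"], v)]
    o_t = [sigmoid(z) for z in affine(*params["o"], v)]
    # New cell state: keep part of the old state, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # New hidden state: gated, squashed view of the cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every gate outputs 0.5 and the candidate layer outputs 0, so the cell state is simply halved at each step; this makes the role of the forget gate easy to see in isolation.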
In neural architectures for NER, two LSTM networks running in opposite directions form a model called Bi-LSTM. In the Bi-LSTM model, the output at each step is based on both the preceding and the following elements. For instance, to predict a missing word in a sentence, it is necessary to consider both the preceding and the following parts of the sentence.

B. Conditional Random Field Tagging
In the field of Natural Language Processing, most basic models are built on the Bag of Words method [20], [21]. However, these models cannot identify syntactic relationships between words. For example, a sentiment analysis model built on the Bag of Words method cannot tell the following two uses apart: in the sentence "I like you", "like" is a verb that expresses a positive emotion, whereas in the sentence "I am like you", the word "like" expresses the similarity between two objects rather than affection. In this article, we look at the CRF algorithm [22]-[24].
CRFs are conditional probability models that are very useful for NER and POS labelling. A CRF is an undirected graphical model that defines the probability distribution of an entire sequence of states given the observations. In the CRF algorithm, the input is a set of features extracted from the input data according to a rule. The weights associated with the input features and with the previously assigned tags are used to predict the current label.
The objective function of the algorithm determines the label for each word in the sentence. In a CRF, the label of the current word is determined with reference to the labels of the previous words, and the model weights are chosen so that the resulting label sequence is the most plausible one. Fig. 3 shows the architecture of the Bi-LSTM-CRF for NER tagging. An input sentence is embedded into vectors, and these word embeddings are then given to the forward and backward LSTM networks. $l_i$ represents word $i$ and its left context, $r_i$ represents word $i$ and its right context, and $c_i$ stands for the concatenation of the two vectors $l_i$ and $r_i$. In the next step, the CRF layer predicts the named entity output sequence that best corresponds to the input sentence.
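The limitation can be seen in a few lines of Python. The sketch below assumes the simplest possible Bag of Words representation (lowercased whitespace tokens and their counts); the helper name `bag_of_words` is hypothetical.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(sentence.lower().split())

a = bag_of_words("I like you")
b = bag_of_words("I am like you")

# The feature for "like" is identical in both sentences, so a model that
# only sees these counts cannot distinguish the verb from the comparison.
print(a["like"], b["like"])  # prints: 1 1
```

Because only counts survive, any reordering of the same words yields the same representation, which is why sequence models such as CRFs are preferred for labelling tasks.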
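As an illustration, one common way to write a linear-chain CRF over a sentence $X$ with a candidate label sequence $y = (y_1, \ldots, y_n)$ is given below. The symbols $P$ (per-word label scores) and $A$ (label transition scores) are notation assumed for this sketch and are not taken from the original text.

```latex
s(X, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=2}^{n} A_{y_{i-1}, y_i},
\qquad
p(y \mid X) = \frac{\exp\big(s(X, y)\big)}{\sum_{y'} \exp\big(s(X, y')\big)}
```

The denominator sums over every possible label sequence $y'$, which is why the distribution is defined over the entire sequence rather than over individual labels.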
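Choosing the best label sequence in a CRF layer is typically done with Viterbi decoding. The following is a minimal sketch under the assumption that each word has a score per tag (for example, produced by the Bi-LSTM) and that each pair of consecutive tags has a transition score. The names `viterbi`, `emissions` and `transitions` are hypothetical, and the code is not the implementation used in the cited systems.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence as a list of tag indices.

    emissions[i][k]:   score of tag k at position i.
    transitions[j][k]: score of moving from tag j to tag k.
    """
    n_tags = len(transitions)
    score = list(emissions[0])      # best score of any path ending in each tag
    backptr = []                    # backptr[i][k]: best previous tag for tag k
    for em in emissions[1:]:
        prev_best, new_score = [], []
        for k in range(n_tags):
            j = max(range(n_tags), key=lambda j: score[j] + transitions[j][k])
            prev_best.append(j)
            new_score.append(score[j] + transitions[j][k] + em[k])
        backptr.append(prev_best)
        score = new_score
    # Follow the back-pointers from the best final tag.
    path = [max(range(n_tags), key=lambda k: score[k])]
    for prev_best in reversed(backptr):
        path.append(prev_best[path[-1]])
    return path[::-1]
```

For example, with tags indexed as 0 = O, 1 = B-PER, 2 = I-PER, a positive transition score from B-PER to I-PER can make the decoder prefer a consistent two-word person name even when the second word, taken alone, would score slightly higher as O.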
To tag Vietnamese sentences with NER labels, our research is based on the system of Nguyen et al. [6], which showed remarkable improvements, as shown in Table I.

III. NEURAL MACHINE TRANSLATION
NMT uses a neural network to train a statistical model for MT that generates a target sentence by maximizing its conditional probability given the source sentence. Numerous studies have shown that NMT is more effective than previous MT methods [8], [13], [25], [26]. Our research is based on the encoder-decoder model with attention introduced by Bahdanau et al. [8], [27], [28].
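The conditional probability being maximized is usually factorized word by word. A sketch in standard notation follows, where $x$ is the source sentence, $y = (y_1, \ldots, y_m)$ is the target sentence, and $\theta$ denotes the model parameters (the symbol $\theta$ is introduced here only for illustration):

```latex
p(y \mid x; \theta) = \prod_{i=1}^{m} p(y_i \mid y_1, \ldots, y_{i-1}, x; \theta),
\qquad
\hat{y} = \arg\max_{y} \, p(y \mid x; \theta)
```

Training adjusts $\theta$ to raise the probability of the reference translations, and translation searches for the sentence $\hat{y}$ with the highest probability under the trained model.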

Fig. 4. The encoder of the NMT model (figure labels: input word embeddings, right-to-left RNN).

A. Encoder
The encoder is a bidirectional RNN consisting of a forward RNN and a backward RNN. The input sentence is encoded into a matrix, and the basic language model then processes this matrix with an RNN, as shown in Fig. 4. The encoder states are the concatenation of the hidden state of the left-to-right RNN, $\overrightarrow{h_j}$, and the hidden state of the right-to-left RNN, $\overleftarrow{h_j}$. Mathematically, the input sentence $x = (x_1, x_2, \ldots, x_n)$, where $n$ is the sentence length, is encoded by the following equations:

$$
\overleftarrow{h_j} = f(\overleftarrow{h_{j+1}}, \bar{E} x_j), \qquad \overrightarrow{h_j} = f(\overrightarrow{h_{j-1}}, \bar{E} x_j)
$$

where $f$ is a typical feed-forward neural network function such as tanh, and $\bar{E}$ is the word embedding matrix for the source language.
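The bidirectional encoding can be sketched in plain Python. The sketch assumes a simple tanh cell for the function f and an embedding table indexed by token id; the names `encode`, `rnn_cell`, `E_src`, `fwd` and `bwd` are hypothetical, and a real system would use LSTM cells and a tensor library instead.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_cell(h, e, W, U):
    """f(h, e) = tanh(W h + U e), one simple choice for the function f."""
    return [math.tanh(a + b) for a, b in zip(matvec(W, h), matvec(U, e))]

def encode(token_ids, E_src, fwd, bwd, hidden_size):
    """Return one state per source word: forward state followed by backward state.

    E_src is the source embedding table; fwd and bwd are (W, U) weight pairs
    for the left-to-right and right-to-left RNNs.
    """
    embeds = [E_src[i] for i in token_ids]       # embedding lookup per word
    h, forward = [0.0] * hidden_size, []
    for e in embeds:                             # left-to-right pass
        h = rnn_cell(h, e, *fwd)
        forward.append(h)
    h, backward = [0.0] * hidden_size, []
    for e in reversed(embeds):                   # right-to-left pass
        h = rnn_cell(h, e, *bwd)
        backward.append(h)
    backward.reverse()                           # realign with word positions
    return [f + b for f, b in zip(forward, backward)]
```

Each returned state has twice the hidden size, since it joins the summary of the words to the left of position j with the summary of the words to its right.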

B. Decoder
The decoder is also an RNN. It takes the representation of the input context, the previous hidden state and the previous output word prediction, and generates a new hidden decoder state and a new output word prediction. The output sentence is $y = (y_1, y_2, \ldots, y_m)$, where $m$ is the sentence length. The hidden state of the decoder is computed as

$$
s_i = f(W s_{i-1} + U E y_{i-1} + C c_i) \qquad (13)
$$

where $W$, $U$ and $C$ are weight matrices, $E$ is the word embedding matrix for the target language, and $c_i$ is the input context vector.
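A single decoder state update can be sketched as follows. The sketch assumes that tanh plays the role of f and that the three inputs are combined additively through separate weight matrices; the names `decoder_step`, `W`, `U` and `C` are chosen for illustration only.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def decoder_step(s_prev, y_prev_embed, context, W, U, C):
    """New decoder state from the previous state, the embedding of the
    previously predicted word, and the input context vector.

    Computes tanh(W s_prev + U y_prev_embed + C context) element-wise.
    """
    terms = zip(matvec(W, s_prev), matvec(U, y_prev_embed), matvec(C, context))
    return [math.tanh(a + b + c) for a, b, c in terms]
```

The output word prediction would then be obtained from the new state, typically through a softmax over the target vocabulary, which is omitted here.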

C. Attention
When the source sentence is encoded into a single fixed vector, as in Fig. 5, the model has difficulty with long sentences in particular, even though LSTMs are used to overcome the vanishing gradient problem of traditional RNNs [29], [30]. Hayashi et al. [31] introduced the attention mechanism applied in a sequence-to-sequence model and showed that attention is effective for MT. Bahdanau et al. [8] proposed a mechanism that allows the model to focus on the important parts of the input (the source words associated with each target word). Instead of using only the context vector produced by the last step of the encoder, they used the outputs and hidden states of all encoder cells at every time step to synthesize a context vector (attention vector), as shown in Fig. 6.
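How a context vector can be synthesized from all encoder states is sketched below. The sketch assumes additive scoring (a small feed-forward layer comparing the previous decoder state with each encoder state) followed by a softmax; the names `attention_context`, `W_a`, `U_a` and `v_a` are hypothetical.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(s_prev, enc_states, W_a, U_a, v_a):
    """Score each encoder state against the decoder state, normalise the
    scores, and return (weights, weighted sum of encoder states)."""
    q = matvec(W_a, s_prev)
    scores = []
    for h in enc_states:
        k = matvec(U_a, h)
        scores.append(sum(v * math.tanh(a + b) for v, a, b in zip(v_a, q, k)))
    weights = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_states)) for d in range(dim)]
    return weights, context
```

The weights sum to 1, so the context vector is a weighted average of the encoder states, recomputed at every decoding step so that different target words can attend to different source words.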

IV. EXPERIMENTS

A. Datasets
In this paper, we combined the dataset of Nguyen et al. [12] and the dataset of Luong et al. [13] into one parallel corpus. We then removed the sentences that were too short or extremely long. The English-Vietnamese parallel corpus used in our system is described in Table II. After combining the two datasets and removing some sentences, the dataset has over 900K sentence pairs. The average sentence length is 7.9 words in English and 10.7 words in Vietnamese. The Vietnamese side also contains more tokens than the English side, with approximately 9.8 million tokens compared with over 7.2 million.
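The length-based filtering step can be sketched as follows. The thresholds `min_len` and `max_len` are illustrative assumptions, since the exact bounds are not stated, and the function names are hypothetical.

```python
def filter_by_length(pairs, min_len=1, max_len=50):
    """Keep (english, vietnamese) pairs whose whitespace token counts on
    both sides fall within [min_len, max_len]. Bounds are illustrative only."""
    kept = []
    for en, vi in pairs:
        n_en, n_vi = len(en.split()), len(vi.split())
        if min_len <= n_en <= max_len and min_len <= n_vi <= max_len:
            kept.append((en, vi))
    return kept

def average_length(sentences):
    """Average number of whitespace-separated tokens per sentence."""
    return sum(len(s.split()) for s in sentences) / len(sentences)
```

Applying `average_length` separately to the English and Vietnamese sides of the filtered corpus gives per-language statistics of the kind reported above.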

B. Implementations
We performed several preprocessing steps on the data before feeding it into the NMT system [32], including NER tagging of the Vietnamese sentences. Our bidirectional English-Vietnamese NMT systems are:
Baseline: uses the original English sentences and the Vietnamese sentences with Word Segmentation (Word Seg.).
NER: uses the English sentences and the Vietnamese sentences after applying both Word Seg. and NER.
For Vietnamese texts, we used the work of Nguyen et al. [33] to address the problem of unclear word boundaries in Vietnamese, and then applied the work of Nguyen et al. [6] to tag the sentences with NER labels. In the sentence in Table III (Nhiên đang chơi ở công viên Cầu Giấy, "Nhien is playing in Cau Giay park"), the words "Nhien" and "Cau Giay" are transformed into "Nhien|B-PER" and "Cau_Giay|B-LOC", respectively. In the above NMT systems, we used deep LSTMs with 4 layers, 1000 cells in each layer, and 1000-dimensional word embeddings. We used BLEU [34] and TER [35] scores to evaluate the performance of each NMT system.
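The token|TAG format can be produced with a small helper once word segmentation and NER labels are available. The sketch below assumes that tokens outside any entity are left unchanged, which is not stated explicitly in the text; the function name `attach_ner_tags` and the segmented form of the example sentence are hypothetical.

```python
def attach_ner_tags(tokens, tags, outside="O"):
    """Join each token with its tag as 'token|TAG'.

    Tokens whose tag equals `outside` are emitted unchanged (an assumption).
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    out = []
    for tok, tag in zip(tokens, tags):
        out.append(tok if tag == outside else f"{tok}|{tag}")
    return " ".join(out)

# Word-segmented tokens (multi-syllable words joined by "_") and their labels.
tokens = ["Nhien", "đang", "chơi", "ở", "công_viên", "Cau_Giay"]
tags = ["B-PER", "O", "O", "O", "O", "B-LOC"]
tagged = attach_ner_tags(tokens, tags)
```

Because the tag becomes part of the token string, the NMT vocabulary holds a separate entry for each word and label combination, which is what lets the translation model treat an ambiguous word differently depending on its entity type.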

C. Results
The results of our English-Vietnamese NMT systems are shown in Table IV. Generally, among the systems using NER, the Vietnamese-to-English system obtained better results than the system in the reverse direction. After applying NER to the Vietnamese sentences, the English-to-Vietnamese MT system improved by 1.34 BLEU points and 2.28 TER points, while the system in the reverse direction improved by 1.13 BLEU points and 1.32 TER points. In Vietnamese, many words have different meanings depending on their context. For instance, the word "Huế" can be the name of a person or the name of a street in Hanoi, so it can be tagged as either "Huế|B-PER" or "Huế|B-LOC". The toolkit of Nguyen et al. [6], which we used for Vietnamese NER labelling, classifies the words in a text into Location, Person and Organization. After NER is applied, the Vietnamese sentences become clearer, which increases the quality of the NMT systems.
In another example, the word "nhà" can mean the wife, the husband or the house depending on the context. When NER tags the word "nhà" as "nhà|PER" or "nhà|LOC", the NMT system treats these as two different words. After NER is applied, the translated sentences are more accurate than those of the systems without NER. Therefore, NER has a positive effect on English-Vietnamese MT systems.
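The average gains quoted in the abstract appear to be the means of the two per-direction improvements reported in this section; the arithmetic can be checked as follows.

```latex
\frac{1.34 + 1.13}{2} = 1.235 \approx 1.24 \ \text{BLEU},
\qquad
\frac{2.28 + 1.32}{2} = 1.80 \ \text{TER}
```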

V. CONCLUSION
In this paper, we applied NER to Vietnamese sentences. We then built bidirectional English-Vietnamese NMT systems and evaluated the effectiveness of NER for them. Our results showed that NER has a positive effect on NMT systems. In the future, we intend to build a word sense disambiguation system for English and Vietnamese sentences to improve the quality of our NMT systems.

CONFLICT OF INTEREST
The authors declare no conflict of interest.