Using Word Embeddings in Turkish Part of Speech Tagging

The close relation between the stem of a word (and hence its meaning) and its part of speech tag makes part of speech tagging an important preprocessing task in natural language processing and understanding. For example, if the Turkish word “gelecek” is labeled as a noun, its stem is “gelecek”, meaning “future”. If it is labeled as a verb, the stem is “gel”, which means “come” in English. In many languages, including Turkish, the part of speech tagging problem is generally solved by rule-based approaches. In this paper, a setup is proposed in which the neural network architecture SENNA is employed together with word embeddings. The combination of the Wikipedia 2016 and METU corpora is used to train the word embeddings; PARDER is used for part of speech training and testing. The word embeddings obtained by different methods and with different vector sizes are evaluated intrinsically, by considering analogy and semantic similarity distances, and extrinsically, based on the performance on the part of speech tagging task.


I. INTRODUCTION
In natural language processing, part of speech (POS) tagging is defined as the mapping of the words in a text to their corresponding part of speech tags. POS tagging is important in many different fields such as information retrieval, natural language generation, and automatic translation. Though different resources use different categorizations, part of speech tags in Turkish may be classified into eight main categories: noun, verb, adjective, adverb, pronoun, conjunction, question and preposition. Table I lists these tags together with some Turkish examples.
The main difficulty in POS tagging is that a single word may have different part of speech tags in different sentences, so the POS tag of a word must be determined according to its context. For example, the word "gelecek" should be tagged as a noun, an adjective or a verb according to three different meanings (Eng. "future", "next" and "will come", respectively). In a sentence translated as "Our neighbors will come to visit us this evening.", it is a verb.
When the POS tagging studies in the field are examined, it is observed that POS tagging methods fall into two main groups: rule-based and statistical approaches. Rule-based approaches typically employ contextual information to assign tags to unknown or ambiguous words. Put simply, disambiguation is performed by analyzing the linguistic features of the word, its preceding word, its following word, and other aspects. For example, in Turkish, if the preceding word is an adjective, then the word in question must be a noun or an adjective. In rule-based methods, this information must be coded in the form of rules.
Statistical methods, on the other hand, commonly use corpora to obtain the required statistical information. The simplest statistical POS taggers label words based solely on the probability that a word occurs with a particular tag. In other words, the tag encountered most frequently with the word in the training set is the one assigned to an ambiguous instance of that word. An alternative to this approach is to calculate the probability that a given sequence of tags occurs. This is sometimes referred to as the n-gram approach. Since the n-gram approach uses the tags of the words in a sequence of n words, it is regarded as taking the context into account during POS tagging. In the literature, there also exist methods (e.g. hidden Markov models) that consider both the context and the word-tag probabilities.
In POS tagging, multiple factors influence the performance of the tagger. One of the main factors is the corpus used in the experiments to model the language. The size of the corpus, the reliability of the labeled data and the variety of the corpus all influence POS tagger performance. For example, the Brown corpus was one of the first corpora used for POS tagging studies in English. The second factor is the set of preprocessing tasks, such as tokenization. The last, but not least, factor is the language. It is known that POS tagging approaches may perform differently for different languages. Though the accuracies of POS tagging studies that employ statistical or rule-based methods reach acceptable values (96%-98%) in English [1]-[8], they drop to 80%-92% in Turkish studies [9]-[14]. This is due to the agglutinative structure of Turkish and/or the theoretically infinite size of the Turkish vocabulary. Table II provides an example set of POS taggers and the methods they employ; the methods listed include averaged perceptron and bidirectional perceptron learning (LTAG-SPINAL [30]).
We propose the use of word embeddings with deep learning methods to identify POS tags in Turkish. A word embedding is a type of word representation in which text is turned into numeric vectors so that machine learning algorithms can treat words with similar meanings similarly. It is also called a distributed semantic model, distributed representation, semantic vector space or vector space model. The word embedding approach assumes that words convey their meaning through the words that occur in the same context. As a result, fruits such as apple and orange should be placed close to each other, whereas sports will be far away from these words. In a broader sense, word embedding places the vectors of fruits far away from the vector representations of sports.
This makes it possible to detect semantic relations between words with simple mathematical operations. In a typical example, the embedding of "king" can be approximated by subtracting the embedding of "woman" from the embedding of "queen" and adding the embedding of "man".
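The vector arithmetic behind such analogies can be illustrated with a minimal sketch. The three-dimensional vectors below are invented purely for illustration (real embeddings are learned from a corpus and have far more dimensions), and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Toy 3-dimensional embeddings invented for illustration only.
EMB = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, emb):
    """Return the word closest to emb[a] - emb[b] + emb[c], excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, emb[w]))

# queen - woman + man should land nearest to "king" in this toy space.
result = analogy("queen", "woman", "man", EMB)
print(result)
```

Excluding the three query words from the candidates is a common convention in analogy evaluation; without it, the nearest vector is often one of the inputs.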
In this study, the experiments are performed on the SENNA neural network structure developed by Collobert et al. [31]. First, the performance of the approach is measured on the English Brown corpus to verify that the set-up is implemented correctly. Then, the experiments are repeated on the Turkish corpus.
In the following sections, the method is presented, the experimental results are given, and the paper is concluded, respectively.

II. THE PROPOSED METHOD: WORD EMBEDDINGS IN POS TAGGING
In this study, Turkish word embeddings are generated by the word2vec method and given as input to the SENNA tool, which assigns a part of speech label to each word.
SENNA (Semantic/Syntactic Extraction using a Neural Network Architecture), built by Collobert et al. [31], is a neural network architecture for learning natural language processing tasks. The tool may be used for several tasks in the field (e.g. semantic role labeling, named entity recognition). The main goal of SENNA is to handle several tasks without feature engineering, learning the semantic relations between the words in a text based on their occurrence frequencies. SENNA is proposed in two set-ups to be used for different tasks. The set-ups are similar in terms of neural network structure; they differ in how the input to the network is generated:
i. Window-based: This approach requires determining the words neighboring the target word and employing their word embeddings. It is commonly used for natural language processing problems such as named entity recognition (NER) and POS tagging, where the target word is related to its context words.
ii. Sentence-based: In this approach, all the words residing in the same sentence as the target word are considered. The word embeddings of all these words are needed to generate the sentence embedding, which is given as input to the architecture. It is commonly used in problems such as semantic role labeling, where the solution depends on the sentence structure.
The window-based approach is employed in our study, assuming that each word is related to the neighboring words within a given window. In Fig. 1, the window-based approach is exemplified for the sentence "Ayşe okula geç geldi" (Ayşe came to school late). In this example, the target word is "geç" (late), the window size is set to 2, and "Ayşe", "okula" (to the school) and "geldi" (came) are the context words of the target word.
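A minimal sketch of how such a context window might be extracted is given below. The padding token `<PAD>` and the function name `context_window` are assumptions of this sketch; the paper does not state how positions beyond the sentence boundary are handled.

```python
PAD = "<PAD>"  # hypothetical padding token for positions outside the sentence

def context_window(tokens, index, size=2):
    """Return `size` tokens to the left, the target, and `size` tokens to the right."""
    padded = [PAD] * size + list(tokens) + [PAD] * size
    center = index + size  # position of the target inside the padded list
    return padded[center - size : center + size + 1]

sentence = ["Ayşe", "okula", "geç", "geldi"]
window = context_window(sentence, sentence.index("geç"), size=2)
print(window)  # ['Ayşe', 'okula', 'geç', 'geldi', '<PAD>']
```

With a window size of 2, the target "geç" has two real neighbors on the left and only one on the right, so the last slot is filled with the padding token.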
The steps followed to build the set-up in Fig. 1 are:
i. The neighboring words of the target word within the given window size (2) are determined as context words and accepted as inputs to the system.
ii. The word embeddings of the context words are retrieved from the word embeddings data set.
iii. A merged matrix is built by appending the word embeddings.
iv. The matrix is mapped to a vector by an affine (linear) transformation.
v. A hyperbolic tangent (tanh) activation function is applied to the result in the hidden layer of the neural network.
vi. The probability values of the possible POS tags for the target word are determined by a softmax classifier using the transformed values.
vii. Finally, the target word is marked with the tag that holds the highest probability value. Based on the probability values, the target word "geç" is labeled as "adverb" in the given example.
A similar procedure may be applied for the sentence-based approach. The difference between the two approaches is that in the sentence-based approach all the words that reside in the same sentence as the target word are considered its neighboring words, and their word embeddings are given as input to the system. In our experiments, we did not run tests with the sentence-based approach.
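Steps ii to vii of the window-based procedure can be sketched as a single forward pass. This is an illustration under stated assumptions, not SENNA itself: the weights are random and untrained (so the predicted tag is arbitrary), the tag set is reduced to three tags, the sizes are toy values, and the second affine layer `W2` is assumed here because a softmax classifier needs one score per tag.

```python
import math
import random

random.seed(0)
TAGS = ["noun", "verb", "adverb"]   # reduced tag set, for illustration only
D, WINDOW, HIDDEN = 4, 5, 6         # toy sizes; the paper uses D=100 or D=200

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def affine(W, b, x):
    """Compute W x + b for a matrix W (list of rows) and vectors b, x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embedding lookup table with random vectors.
vocab = ["<PAD>", "Ayşe", "okula", "geç", "geldi"]
E = {w: [random.uniform(-1, 1) for _ in range(D)] for w in vocab}

W1, b1 = rand_matrix(HIDDEN, D * WINDOW), [0.0] * HIDDEN
W2, b2 = rand_matrix(len(TAGS), HIDDEN), [0.0] * len(TAGS)

def tag_probabilities(window_tokens):
    x = [v for tok in window_tokens for v in E[tok]]  # steps ii-iii: look up and append
    h = [math.tanh(v) for v in affine(W1, b1, x)]     # steps iv-v: affine map, then tanh
    return softmax(affine(W2, b2, h))                 # step vi: one probability per tag

probs = tag_probabilities(["Ayşe", "okula", "geç", "geldi", "<PAD>"])
predicted = TAGS[probs.index(max(probs))]             # step vii: most probable tag
print(dict(zip(TAGS, probs)), predicted)
```

In a trained model, `W1`, `b1`, `W2`, `b2` (and optionally the embeddings) would be fitted on the tagged training sentences; only then would "geç" be expected to receive the "adverb" tag.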

III. EXPERIMENTAL RESULTS
The performance of the proposed approach is first measured on the English Brown corpus. The Brown corpus in the NLTK library is used to build the training (50545 sentences), validation (2505 sentences) and testing (4134 sentences) data sets. Word embeddings pre-trained by the GloVe [32] and EDBSG (Extended Dependency Based Skip-Gram) [33] methods are used.
The skip-gram model predicts the surrounding context words given a target word. In Fig. 2, the architecture of the basic skip-gram model is depicted. Here, w(t) is the target word, and there is one hidden layer, which performs the dot product between the weight matrix and the input vector of w(t). No activation function is used in the hidden layer (depicted as the projection layer), and the result of the dot product at the hidden layer is passed to the output layer. The output layer computes the dot product between the output vector of the hidden layer and the weight matrix of the output layer. Then the softmax activation function is applied to compute the probability of each word appearing in the context of w(t) at the given context location. As the number of words to be predicted increases, the problem gets more complex.
GloVe (Global Vectors for Word Representation) [32] was developed at Stanford University to generate vector representations for words. The aim of GloVe is to produce word vectors that capture "meaning in vector space" by using global count statistics. Unlike the continuous bag of words and skip-gram models, GloVe learns from a co-occurrence matrix and trains the vectors so that their differences estimate co-occurrence ratios. The GloVe model combines global matrix factorization with local context window methods. The local context window methods are the well-known continuous bag of words and skip-gram methods. Global matrix factorization is the technique used in latent semantic analysis to reduce large term-frequency matrices; GloVe uses it to include global frequency information when building the word vectors. Instead of the co-occurrence probabilities themselves, the GloVe model uses the ratios of co-occurrence probabilities.
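The skip-gram forward pass described earlier in this section (projection without activation, output dot product, softmax) can be sketched as follows. The vocabulary, the sizes and the random weights are assumptions made for illustration; training, which would adjust `W_in` and `W_out`, is omitted.

```python
import math
import random

random.seed(1)
vocab = ["ayşe", "okula", "geç", "geldi"]
V, D = len(vocab), 3  # toy vocabulary size and embedding size

# W_in holds one D-dimensional row per word (the embeddings being learned);
# W_out maps the hidden vector back to one score per vocabulary word.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(D)]

def skipgram_context_probs(target):
    """Probability of each vocabulary word appearing in the context of `target`."""
    t = vocab.index(target)
    # Multiplying a one-hot input by W_in just selects row t; no activation is applied.
    hidden = W_in[t]
    scores = [sum(hidden[d] * W_out[d][j] for d in range(D)) for j in range(V)]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return {w: e / total for w, e in zip(vocab, exps)}

probs = skipgram_context_probs("geç")
print(probs)
```

After training, the rows of `W_in` are the vectors normally exported as the word embeddings.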
In this experiment, the vector size of the word embeddings (D) is set to 300 and the window size (W) to 5. Table III gives the performance results for the English corpus, where the accuracy values on the test (Test_Accuracy) and validation (Val_Accuracy) sets are obtained by running the system ten times (run=10). Accuracy is a statistical measure that gives the ratio of correctly classified samples. It is formulated as follows:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]
where TP refers to true positives, TN to true negatives, FP to false positives and FN to false negatives. In Table III, SPB is the window size used in the SENNA tool to label part of speech tags. ES represents the proportion of the words that reside in the testing set but do not have a valid word embedding.
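For a multi-class task such as POS tagging, accuracy is usually computed per token as the fraction of tokens whose predicted tag equals the gold tag. The sketch below assumes this reading of Eq. (1); the function name and the example tags are illustrative.

```python
def accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

gold = ["noun", "noun", "adverb", "verb"]
pred = ["noun", "adjective", "adverb", "verb"]
score = accuracy(gold, pred)
print(score)  # 0.75: three of the four tokens are tagged correctly
```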
As given in Table III, the accuracy values for the testing set reach 93.19% and 95.98% with the GloVe and EDBSG embeddings, respectively. The performance results for the English corpus are similar to those of previous studies on POS tagging, showing that the proposed set-up is suitable for marking part of speech tags.
A similar experiment is performed on a Turkish corpus as the second step of the study. Wikipedia articles (March 2016 dump, https://dumps.wikimedia.org) (Wikipedia2016) and the METU corpus [34] are merged to build the Turkish corpus (VO). Turkish word embeddings are obtained by applying the word2vec skip-gram method to the VO corpus with the Gensim tool [35], using the surface forms of words. In order to decrease the number of words with missing embeddings, punctuation marks are removed from the VO corpus, all numerical entities are labeled as NUM, and the words that occur fewer than two times in the corpus are not included in training. Training is repeated for two different window sizes (W=2 and W=5) and two different vector (word embedding) sizes (D=100 and D=200). For example, (W, 2, 100) represents the setting where the embedding vector size is 100 and the window size is 2.
The semantic similarities of the embedding vectors for the target words "Türkiye" (Turkey), "Apple" and "ağustos" (August) are given in Table V as examples. Cosine similarity is used to measure the semantic similarity between a given pair of embedding vectors. In Table V, the first row includes the target words, and each column lists the words most similar to the corresponding target word. For example, the target word "Apple" is most similar to "google", "iphone", "ios", "ipod" and "app", in that order, when W=2 and D=100. The sorted lists of words similar to a group of target words (such as the ones in Table V) show that the word embeddings are quite successful in representing words in Turkish.
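The corpus preparation described above (punctuation removal, mapping numbers to NUM, and dropping words seen fewer than two times) might be sketched as follows. The whitespace tokenization, the regular expression and the function name `preprocess` are assumptions of this sketch, not the authors' actual pipeline.

```python
import re
from collections import Counter

PUNCT = re.compile(r"[^\w\s]")  # any character that is neither a word character nor whitespace

def preprocess(sentences, min_count=2):
    """Strip punctuation, map numbers to NUM, and drop words rarer than `min_count`."""
    cleaned = []
    for sentence in sentences:
        tokens = []
        for raw in sentence.split():
            token = PUNCT.sub("", raw)
            if not token:
                continue  # the token consisted only of punctuation
            tokens.append("NUM" if token.isdigit() else token)
        cleaned.append(tokens)
    counts = Counter(t for sent in cleaned for t in sent)
    return [[t for t in sent if counts[t] >= min_count] for sent in cleaned]

corpus = ["Ayşe 2016 yılında okula geldi.", "Ayşe okula 5 dakika geç geldi!"]
result = preprocess(corpus)
print(result)  # [['Ayşe', 'NUM', 'okula', 'geldi'], ['Ayşe', 'okula', 'NUM', 'geldi']]
```

With Gensim 4.x, token lists of this form could then be passed to `gensim.models.Word2Vec` with `sg=1` (skip-gram), `window=W` and `vector_size=D`; these parameter names are those of Gensim 4 and differ in older releases. Gensim's own `min_count` parameter can also perform the frequency filtering, in which case the last step above would be redundant.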
After the word embeddings are obtained, the Turkish part of speech training and testing tasks are performed on the PARDER [36] Turkish corpus. The corpus is split into three parts: a training set of 12397 sentences, a testing set of 1535 sentences and a validation set of 1104 sentences. The words in the PARDER corpus are labeled with 17 different POS tags (adjective, adverb, conjunction, determinant, duplication, interjection, Ndet, Ndot, Ntime, Nnum, noun, number, post-pronoun, pronoun, punctuation, question, verb). Table VI gives the experimental results obtained on the Turkish corpus. In Table VI, W is the window size, D is the vector size, and ES represents the proportion of the words that do not have valid embeddings. While the experiments are repeated with different W and D values, all units referring to numbers are treated as a single word. The number of iterations in each experiment is set to 10 and the POS window size is set to 3.
In the initial experiments, it is observed that most of the words that do not have valid embeddings are punctuation marks. As a result, word embeddings are generated for all punctuation marks and the experiments are repeated. For example, for the punctuation mark ".", a word embedding (vector) of the given size is built. After this correction, the proportion of such words decreases from ES=13.50% to ES≈3.75%. The experimental results before the word embedding correction for punctuation marks (ES=13.50%) are given in the first two rows of Table VI. The accuracy values obtained when ES is lowered to 3.75% are presented in the third and fourth rows of Table VI.
Table VI shows that the highest accuracy value (83.09%) is obtained when the window size (W) is set to 5 and the vector size (D) to 200.
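The ES measure and the punctuation correction can be illustrated with a small sketch. How the punctuation vectors were actually built is not stated in the text; the small random initialization below is an assumption, as are the function names and the toy data.

```python
import random
import string

def missing_ratio(tokens, embeddings):
    """ES: proportion of tokens that have no entry in the embedding table."""
    missing = sum(1 for t in tokens if t not in embeddings)
    return missing / len(tokens)

def add_punctuation_vectors(embeddings, dim, seed=0):
    """Give every ASCII punctuation mark a small random vector (assumed initialization)."""
    rng = random.Random(seed)
    for mark in string.punctuation:
        if mark not in embeddings:
            embeddings[mark] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return embeddings

# Toy embedding table and test tokens, invented for illustration.
embeddings = {"Ayşe": [0.1, 0.2], "okula": [0.3, 0.1], "geldi": [0.0, 0.4]}
test_tokens = ["Ayşe", "okula", "geç", "geldi", "."]

before = missing_ratio(test_tokens, embeddings)  # "geç" and "." are missing
add_punctuation_vectors(embeddings, dim=2)
after = missing_ratio(test_tokens, embeddings)   # only "geç" is still missing
print(before, after)
```

In this toy example ES falls from 2/5 to 1/5 once punctuation marks receive vectors, mirroring the direction of the drop reported above.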

IV. CONCLUSION
In this study, a set-up that uses the SENNA tool with word embeddings is proposed to label Turkish words with the proper part of speech tags. Though the experimental results show that the proposed set-up is quite successful, there is room for improvement, since the accuracy is still below the upper end of the 80%-92% range reported in previous Turkish studies. We believe that the performance for Turkish may be improved by increasing the number of samples in the training set. To test how performance changes as the size of the training set increases, we repeated the tests with training sets of different sizes. The tests are performed on the English corpus, since there are still not enough data samples in Turkish. In Fig. 3, the horizontal axis represents the number of sentences in the training corpus and the vertical axis holds the accuracy values. Fig. 3 shows that as the size of the training set increases from 10,000 to 50,000 samples, the accuracy increases continuously, supporting our claim. As further work, we plan to increase the size of the Turkish training data set and to change the number of layers in the neural network structure in order to improve POS tagging performance in Turkish.

CONFLICT OF INTEREST
The authors declare no conflict of interest.