Classification of Smartphone Application Reviews Using Small Corpus Based on Bidirectional LSTM Transformer

This paper addresses the classification of review texts about smartphone applications posted on social media. We propose a high-performance binary (positive/negative) classification method for review texts. The method uses a bidirectional long short-term memory (biLSTM) self-attentional Transformer and is based on distributed representations created by unsupervised learning from a small, manually labelled review corpus, a dictionary, and a large unlabelled review corpus. The proposed method obtained higher accuracy than existing methods such as StarSpace and Bidirectional Encoder Representations from Transformers (BERT).


I. INTRODUCTION
With the recent increase in the number of mobile terminal devices, especially smartphones, communication using social media has been increasing among the youth. In the past, e-mails, Internet bulletin boards, chats, etc. were the main tools for communication. At present, however, social media platforms such as Twitter [1], Facebook [2], and Instagram [3] have become mainstream.
Twitter is a social media platform where users can post short messages called tweets, together with videos, images, or web links. The platform is compatible with smartphones and its users can freely post messages, which gives it a strong real-time character. Twitter differs from Facebook or LINE [4] in its anonymity, and some Twitter users maintain different accounts for different purposes, such as business, private use, or hobbies. On Twitter, users can learn the reputation of various products or services by referring to other users' tweets. Among such products and services, smartphone applications are considered one of the best model systems, because users can easily post reviews when both the applications and the Twitter platform are commonly used on a smartphone.
The latent application users can save time in collecting information if they can see reputations of the applications from the word-of-mouth tweets regarding the corresponding smartphone application. The applications that can be used on a smartphone can be categorized as games, health, information collecting, etc. The applications are also different in terms of pricing (free or paid) and the developing agency (developed by an individual or developed by a company).
Manuscript received November 9, 2020, revised January 2, 2020. K. Matsumoto, T. Kojima, M. Yoshida, K. Kita are with Tokushima University, Minamijosanjima-cho 2-1, 7708506, Japan (e-mail: matumoto@is.tokushima-u.ac.jp, mino@is.tokushima-u.ac.jp, kita@is.tokushima-u.ac.jp).
H. Kondo is with Sharp Corporation, 5908522, Osakafu, Sakai-shi, Takumicho-1, Japan.
Most of the applications are registered on App Store [5] and Google Play [6], which also show reviews posted for each application. However, many of these reviews are complaints or reports of issues with the applications, and positive reviews sometimes lack detailed descriptions. Therefore, latent users might form an incorrect preconception of an application if they take the review comments at face value.
In this study, we aim to analyze the reviews, for the benefit of the latent application users, by targeting the word-of-mouth tweets that include users' real opinions. We collected the word-of-mouth data regarding an application, annotated polarity labels on the words that can be judged as evaluation expressions, and annotated polarity labels on the word-of-mouth tweets. Based on them, we created training data and an evaluation expression dictionary. In this process, we constructed a foundation for classifying application reviews that is small in size but includes all the important elements. Moreover, to classify the kind or category of the application, we trained the distributed representations on an existing large review corpus. This process also helped improve the robustness to unknown expressions.
In this study, by combining a Transformer with a self-attention mechanism and long short-term memory (LSTM) networks, a model with a higher performance than those of the existing methods has been developed.
Section II describes the existing studies on review classification, and Section III provides an overview and flow of the proposed method. Section IV describes the dataset used in this study and the experimental conditions. Section V presents the results and discussion. Finally, Section VI concludes the paper.

A. Studies on Review Text Classification
This section describes previous studies on review text classification. Generally, studies on the classification of evaluation sentences that are written based on the writers' subjective views, such as review sentences, fall under sentiment analysis. The field of sentiment analysis includes studies that classify emotional polarity as positive or negative based on word or sentence structure obtained from the document [7]-[31], studies that analyze the evaluation score, which indicates the degree of emotion, and studies that judge emotion or non-emotion.
These studies either use large training datasets or analyze the documents based on a dictionary. However, even when the task is similar, a different target domain requires a dictionary or model that corresponds to that domain, and the creation cost increases significantly.
Kobayashi et al. [32] constructed a dictionary [33] that includes evaluation expressions annotated with evaluation polarity, and made it public. This dictionary provides references to commonly used evaluation terms. However, smartphone applications are a recently developed field, and their review texts include many new words and slang expressions. Hence, the general evaluation expression dictionary is not sufficient for the classification of the review texts of smartphone applications.
Asakura et al. [34] applied the neural attention mechanism to aspect-based sentiment analysis (ABSA). ABSA estimates the emotional polarity of a review text with respect to multiple aspects [35]-[37]. They used the dataset of SemEval2016 Task5 Subtask 1 (SE16T5S1) [38], and applied semi-supervised learning to the neural attention model by using word vectors pre-trained on the Google News Corpus as initial parameters. Because our study treats short texts on Twitter, we basically do not assume that a sentence includes multiple views. As an exception, however, even if a tweet included multiple opinions, we targeted it when the sentence could be judged as having a positive/negative polarity.

B. Review Dataset
The Internet Movie Database (IMDb) [39] is available as review data for sentiment analysis. This dataset includes 25,000 movie review texts each for training and for testing, which are classified as positive/negative depending on the review contents (positive: 12,500, negative: 12,500). The dataset is often used as a benchmark to evaluate a model for sentiment analysis. However, because IMDb is written in English, it cannot be used for classifying application reviews written in Japanese.
The Amazon Review data [40] also includes English language texts and does not include review texts in Japanese.
The UMass Amherst Linguistics Sentiment Corpora [41] is a corpus that provides word n-gram counts and the total rating for each n-gram in the review texts for products advertised on the Japanese, English, German, and Chinese websites of Amazon. This corpus, however, does not include the review texts themselves.
The Rakuten dataset [42], which includes reviews of Rakuten's products or accommodations, and the "Intage dataset (Minrepo)" [43] are examples of databases corresponding to a Japanese review corpus.
However, in these corpora, the review scores do not always match with the contents. All such review data whose review scores do not match with the contents can be considered as noise. In this study, because we focused on word-of-mouth reviews about applications that are posted on social media, we created an original dataset by collecting and labelling the word-of-mouth review tweets from Twitter.

C. Prediction of Review Document Quality
Recently, several studies have classified review texts by using methods based on neural networks [44]- [48]. This review classification is relatively simple, as it is basically a binary classification. However, the text data on social media platforms such as Twitter often include many hashtags, images, links to other websites, and Retweets that could be considered as noise. Therefore, using these raw data to create a classification model will reduce accuracy.
Ezaki et al. [49] proposed a method to assess the usefulness of a review document by using a classifier that decides whether a sentence is a review sentence. Their method calculates the ratio of review sentences in a document from the results of the review classifier, and identifies the document as useful if the ratio exceeds a certain percentage.
To classify a sentence as a review sentence, the authors used bag of words as the feature and a support vector machine (SVM) as the machine learning algorithm. Their proposed method, based on the percentage of review content in a document, obtained 10 % higher accuracy than simple bag of words document classification. Although Ezaki et al. targeted reviews in documents (weblog articles), it would be difficult to apply the same method to the reviews in tweets, which are the target of our study, as a tweet contains very few sentences.
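The ratio rule described above can be illustrated with a minimal sketch. It is not the implementation of Ezaki et al.: the keyword-based `toy_classifier` merely stands in for their SVM sentence classifier, and the threshold of 0.5 is a hypothetical value, since only "a certain percentage" is stated.

```python
def review_ratio(sentences, is_review_sentence):
    """Fraction of sentences that the classifier flags as review sentences."""
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if is_review_sentence(s))
    return hits / len(sentences)


def is_useful_document(sentences, is_review_sentence, threshold=0.5):
    # threshold is hypothetical; the source only says "a certain percentage"
    return review_ratio(sentences, is_review_sentence) > threshold


def toy_classifier(sentence):
    # Stand-in for the SVM sentence classifier: a plain keyword match.
    return any(w in sentence for w in ("good", "bad", "fun", "boring"))


doc = ["I bought it yesterday.", "The game is fun.", "Controls are bad."]
print(review_ratio(doc, toy_classifier))
print(is_useful_document(doc, toy_classifier))
```

With a tweet of one or two sentences, the ratio can take only a few distinct values, which illustrates why the rule transfers poorly to Twitter.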
To assess the usefulness of a review sentence, Kurahashi et al. [50] studied the reviews posted on Amazon. The authors combined information obtained from the sentence with other sources, such as the number of characters in the review text, the appearance frequency of each part-of-speech tag, the rating of the post, the posting date, the polarity score of the review sentence, and the appearance frequency of links. To evaluate usefulness, they used the reviews that obtained more than a certain number of answers to the question "it served as a useful reference or not"; in addition, they labeled the reviews that obtained a high approval rate (over 0.7) as high quality and those that obtained a low approval rate (under 0.3) as low quality. Their method obtained an accuracy of 73 % in a classification experiment using SVM, approximately 10 % higher than that of the method using only bag of words. The analysis of the results showed that the difference between the score and the average review score and the difference between the polarity score and the average polarity score were very important.
Word-of-mouth reviews on Twitter carry no review scores, and in our study the polarity score is the same as the classification score of the review texts. Therefore, to evaluate the usefulness of a sentence, "the number of characters" or "the appearance frequency of nouns" was initially considered. However, word-of-mouth reviews on Twitter are basically short texts. Hence, it is difficult to assume that the number of characters or the appearance frequency of a part of speech would be a significant feature. Therefore, as training data, we used only the high quality review tweets that were judged as "useful" by other people.
Several studies have tried to predict the quality of reviews [51]-[54]. Liu et al. [53] detected spam reviews. Compared with other kinds of spam text, spam reviews are difficult to extract, as their content quality is not always low. The better a sentence is written, the more likely it is to escape detection as spam. In our study, the negative effects of spam were avoided by manually removing the spam reviews in advance.
International Journal of Machine Learning and Computing, Vol. 10, No. 1, January 2020

A. Overview of the Proposed Method
Our proposed classification algorithm combines small data and large data in a complementary manner to achieve high accuracy. First, as small data, we used a manually created evaluation expression dictionary.
For the evaluation dictionary, we used the Japanese appraisal evaluation expression dictionary [55] and the Japanese sentiment polarity dictionary [33]. These dictionaries have collected evaluation expressions that often appear in Japanese texts and have classified them in terms of positive/negative polarity. These dictionaries have high quality and wide utility.
Meanwhile, our study targets smartphone application review texts posted on Twitter, which may require considering distinctive expressions that often appear in such review texts.
For review texts regarding a specific application, we manually annotated a polarity label (positive/negative) to individual words and tweets, which helped create high quality data.
To extract features from a review text, we used distributed word representation by CBoW and skip-gram models. These distributed word representation models were trained in advance based on a large scale Wikipedia article corpus. Because these models are not specialized in review text, these might include some redundant knowledge as well.
Therefore, we used a large sized review text corpus in addition to the existing pre-trained distributed word representation models. We constructed a more suitable distributed word representation model by learning distributed word representations with this corpus. The flow of classification model creation is shown in Fig. 1.

B. Transformer Classifier
This subsection describes review text classification by neural networks using the Transformer and the attention mechanism. The Transformer is an encoder-decoder model for neural machine translation (NMT) [56]. Other NMT models include seq2seq (sequence to sequence) [57], which is based on recurrent neural networks (RNN).
By self-attention, Transformer calculates feature similarity among the words in the input sentence, and encodes position information of each word. By doing so, it can consider a more global relationship than models such as convolutional neural networks (CNN) [58] or RNN. Besides, this model enables parallel computation, enabling it to operate at a higher speed than RNN.
The self-attention in the encoder-decoder can be expressed as Eq. (1). Q indicates the query, K the key, and V the value. The similarity (attention) between query and key is calculated by the inner product of Q and K. The attention weights are normalized by Softmax, and their inner product with V gives the values corresponding to the keys as a weighted sum.
This mechanism helps pick up the word string that includes similar distributed word expressions by inputting a sentence as a string of distributed word expressions. In other words, a string of distributed word expressions can be decoded into similar sentences.
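The query-key-value computation described above can be sketched in plain Python as follows. This is an illustrative sketch, not the network used in the experiments: the division by the square root of the key dimension follows the commonly used scaled dot-product form and is an assumption here, and the input vectors stand in for distributed word expressions without any learned projection matrices.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(Q, K, V):
    """Q, K, V: lists of equal-length vectors, one vector per word."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key (sqrt(d_k) scaling is an assumption).
        scores = [dot(q, k) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w_j * v[i] for w_j, v in zip(w, V)) for i in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Three toy 2-dimensional word vectors used as query, key, and value alike.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(X, X, X)
print(attn[0])
```

For the first word, the weights on the first and third words are equal and larger than the weight on the second word, because the first query has the same inner product with those two keys.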
Eq. (2) and (3) show the positional encoding of Transformer. "posw" indicates word position and "i" indicates the index of the vectorized word. "d" indicates the number of word embedding dimensions. These equations calculate positional information tensor that uniquely decides the position of the word and the dimension of the distributed word expression.
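A minimal sketch of such a positional encoding is given below. It assumes the sinusoidal form widely used with the Transformer, in which even dimensions take the sine and odd dimensions the cosine of pos_w / 10000^(2i/d); whether Eq. (2) and (3) have exactly this form is an assumption, and the function name is hypothetical.

```python
import math


def positional_encoding(num_positions, d):
    """Return a num_positions x d table of sinusoidal position codes."""
    table = []
    for pos_w in range(num_positions):
        row = []
        for j in range(d):
            i = j // 2  # index of the sine/cosine pair
            angle = pos_w / (10000 ** (2 * i / d))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, d=8)
print(pe[0])
```

Each row is added to the embedding of the word at that position, so that identical words at different positions receive different inputs.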
In this study, we use the structure of the Transformer encoder and add a Softmax output layer that classifies the text as positive/negative. Fig. 2 shows the structure of the Transformer networks used in this study. We used three Transformer blocks, because a preliminary experiment showed that three Transformer blocks perform better than one or two blocks.

C. LSTM Transformer
By using the self-attention mechanism and the Transformer, a classification model can be adapted to review text classification. However, in many cases, if the amount of training data is small, the model cannot work accurately. When the data are insufficient, the model tends to be affected by the pre-training accuracy. In addition, it is necessary to consider the order of words in the review texts.
For example, it is often observed in review texts that positive opinions are expressed in the first half and negative opinions in the latter half. The correct order of opinions can be identified if LSTM [59] or bidirectional LSTM [60] is used. In this study, we tried to improve the review classification accuracy by combining LSTM-RNN or bidirectional LSTM-RNN with the Transformer. The network structures of the LSTM Transformer and the bidirectional LSTM Transformer are presented in Fig. 3 and Fig. 4. Both the LSTM block and the biLSTM block include one LSTM/biLSTM layer.
We used two Transformer blocks, because a preliminary experiment showed that two Transformer blocks perform better than one or three blocks.

D. Convolutional Neural Networks
In this study, we used CNN as a comparison method. Because CNN can consider context, it is expected to demonstrate a higher accuracy than the method using bag of words. We used two-dimensional convolution in the CNN. The structure has 3 convolution layers and 3 max pooling layers. Fig. 5 shows the structure of the CNN.

E. Word Emotion Polarity Encoder Networks
We propose a method to extract word features based on a model that classifies words into emotion polarities.
Based on the existing evaluation polarity dictionary and the dictionary manually made using data from the reviews of certain applications, we trained a positive/negative classifier with deep neural networks.
The distributed word representation vector can be converted into the feature quantity that considers emotion polarity (positive/negative) by using the output (64 dimensional vector) of the fully connected layer directly before the output layer as a feature.
This feature vector was referred to as PN. We fine-tuned this vector based on LSTM and biLSTM with a small review corpus, and constructed the review classification models. Fig. 6 shows the network structure when the PN vector encoder was trained.
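The idea of taking the activation directly before the output layer as the PN feature can be sketched as follows. The sketch makes several assumptions that are not stated in the text: a single hidden layer, a ReLU activation, a 200-dimensional input, and random untrained weights. The class name `PolarityEncoder` is hypothetical, and the actual structure is the one shown in Fig. 6.

```python
import math
import random


def linear(x, W, b):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


class PolarityEncoder:
    """Word vector -> 64-dim hidden feature -> positive/negative probabilities."""

    def __init__(self, n_in=200, n_hidden=64, n_out=2, seed=0):
        rng = random.Random(seed)
        self.W1, self.b1 = init_layer(n_in, n_hidden, rng)   # untrained weights
        self.W2, self.b2 = init_layer(n_hidden, n_out, rng)  # untrained weights

    def encode(self, word_vec):
        # Activation directly before the output layer, used as the PN feature.
        return relu(linear(word_vec, self.W1, self.b1))

    def classify(self, word_vec):
        return softmax(linear(self.encode(word_vec), self.W2, self.b2))


encoder = PolarityEncoder()
word_vec = [0.01 * k for k in range(200)]  # stand-in for a word2vec vector
pn_feature = encoder.encode(word_vec)
print(len(pn_feature))
```

After the classifier has been trained on the dictionary entries, only `encode` is needed to convert each word vector into its PN feature.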

IV. DATA
This section describes the application review data, the pre-training review corpus, the evaluation expression dictionary, and noise removal.

A. Word-of-Mouth Tweets Collection
Because the target sentences are not limited to those that include evaluation expressions, positive/negative classification must be conducted even if a sentence does not include any evaluation expression. Table I shows the breakdown of the target review corpus. Sp and Sn show the number of positive and negative tweets, and Wp and Wn show the total number of words in the positive and negative tweets. The number of positive tweets was 228, while the number of negative tweets was 172. In total, 400 tweets were carefully selected from a set of tweets obtained from the Twitter API [61] using the application name as the query.
In this study, we manually evaluated the usefulness of the review tweets. The evaluation criteria have been presented below.
1. The reasons of the evaluation of the application or the problems with the application are concretely and clearly mentioned
2. The final evaluation can be clearly judged as positive or negative
If both of the above requirements were met, the review text was judged as a useful review tweet. The examples of review tweets and noise tweets are shown in Table II.
Among the collected review tweets, the maximum number of tweets was obtained for application ID:13 "Hole.io". Therefore, we created an evaluation expression dictionary based on the review texts of this application. Table III shows a few words from the dictionary.

B. Word Embedding
The large review corpus used to pre-train the distributed representations consisted of application ID:13's review texts in addition to approximately 196,000 review sentences posted under "work databases (sakuhin database)" [62]. We tokenized the corpus with the Japanese morphological analyzer MeCab [63] and used the result as training data. To correctly tokenize the words in the evaluation expression dictionary, which was created based on ID:13's review texts, we pre-processed the text by joining the corresponding character strings. The distributed word representation model (word embedding model) was trained with the CBoW model by using the word2vec [64] module in the gensim [65] Python library. As training parameters, we set the window size to 5 and the dimension to 200. Other parameters were left at their default values. The resulting distributed word representation vector model was named RC. Table IV presents an overview of the review corpus that was used to train the word embedding model. For comparison, we also used the Wikipedia Entity Vector (200 dimensions) as "WE," word embeddings pre-trained by fastText [66] (300 dimensions) [67] as "FT," and word embeddings learned by StarSpace [68] through supervised classification with the emotion polarity labels as "SS."
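The training settings stated above could be passed to gensim roughly as in the following sketch. The parameter names follow gensim 4.x, where the dimension is called `vector_size` (older releases used `size`); the function name `train_rc` is hypothetical, and the input is assumed to be a list of token lists already produced by MeCab.

```python
# Hyperparameters stated in the text; everything else stays at gensim defaults.
W2V_PARAMS = {
    "vector_size": 200,  # embedding dimension (named `size` before gensim 4.0)
    "window": 5,         # context window size
    "sg": 0,             # 0 selects CBoW; 1 would select skip-gram
}


def train_rc(tokenized_sentences, params=W2V_PARAMS):
    """Train a CBoW model on tokenized sentences (lists of token strings)."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim

    return Word2Vec(sentences=tokenized_sentences, **params)
```

Calling `train_rc(corpus)` returns a model whose `wv` attribute maps each token to its 200-dimensional vector.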

C. Sentence Embedding
We converted the review tweet sentences into distributed sentence expressions by using a pre-trained model based on Bidirectional Encoder Representations from Transformers (BERT) [69].
We used the pre-trained BERT model [70] based on the Japanese Wikipedia article corpus. This model is distributed by the Kyoto University Kurohashi Laboratory. The 768-dimensional distributed sentence expressions thus obtained were used to train the review classifier with machine learning algorithms such as SVM, AdaBoost, random forests, and LightGBM [71]. The hyperparameters of the SVM were optimized by grid search.

D. Noise Filtering and Stop Word Removal
Generally, word-of-mouth review texts can include several noise strings specific to Twitter. In this study, by referring to the review texts of app ID:13, we defined the stop words and the character string patterns that could be noise and can be removed by regular expressions. Examples of the stop words and the regular expressions for noise removal are shown in Table V.
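A minimal sketch of this kind of filtering is shown below. The patterns and the stop word list are hypothetical examples of typical Twitter noise (retweet markers, links, mentions, hashtags) and are not the entries of Table V; the English example tweet is used only for readability.

```python
import re

# Hypothetical patterns and stop words; the actual lists are given in Table V.
NOISE_PATTERNS = [
    re.compile(r"^RT\s+@\w+:?\s*"),                             # retweet marker
    re.compile(r"https?://[\w/:%#$&?()~.=+-]+", re.ASCII),      # web links
    re.compile(r"[@#]\w+"),                                     # mentions, hashtags
]
STOP_WORDS = {"the", "a", "is"}


def remove_noise(text):
    """Replace every noise pattern with a space and squeeze whitespace."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tweet = "RT @user: Hole.io is fun #game https://example.com/x"
clean = remove_noise(tweet)
tokens = remove_stop_words(clean.split())
print(clean)
print(tokens)
```

For Japanese tweets, the whitespace split in the last step would be replaced by the output of the morphological analyzer.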

A. Experimental Setup
This section describes the experimental setup. In the experiment, except for the application (ID:13), which was used for the creation of the evaluation expression dictionary and the pre-training of word embedding, the remaining 21 applications were used as the evaluation targets. The accuracy, recall, precision, and the F1-score were calculated by a cross validation test, which splits data by application. Equations (4), (5), (6), and (7) show the calculation of accuracy, recall, precision, and F1-Score.
TP indicates a true positive, i.e., the number of texts whose true label is positive and that were predicted as positive. TN indicates a true negative, i.e., the number of texts whose true label is negative and that were predicted as negative. FP indicates a false positive, i.e., the number of texts whose true label is negative but that were predicted as positive. FN indicates a false negative, i.e., the number of texts whose true label is positive but that were predicted as negative.
To avoid overfitting, we tuned the number of training epochs based on the validation loss. Consequently, the optimized epoch numbers for each method were in the range of 5 to 30. Therefore, we compared the experimental results for epoch numbers between 5 and 30.
We used PyTorch [72] as the deep learning framework, Ubuntu 18.04 LTS as the operating system, and a GeForce GTX 980 as the graphics processing unit. The SVM and AdaBoost algorithms were run using the implementations in the scikit-learn [73] Python library. To train on the bag of words feature as the baseline method, we used the L-BFGS logistic regression classifier of Classias [74].
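The four measures can be computed from these counts as in the following sketch, which uses the standard definitions: accuracy = (TP + TN) / (TP + TN + FP + FN), recall = TP / (TP + FN), precision = TP / (TP + FP), and F1 = 2 · precision · recall / (precision + recall). The function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and F1-score for the positive class."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Hypothetical counts, not taken from the experiments.
m = classification_metrics(tp=40, tn=30, fp=10, fn=20)
print(m)
```

The F1-score for the negative class is obtained by swapping the roles of the two labels, i.e., calling the function with TP and TN exchanged and FP and FN exchanged.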
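The application-wise cross validation described at the start of this subsection could be organized as in the sketch below. It assumes that each application is held out once as the test set while all other applications form the training set; whether several applications were grouped into one fold is not stated, so this is one possible reading. The toy samples are hypothetical.

```python
def split_by_application(samples):
    """Yield (held_out_app, train, test), holding out each application once.

    samples: list of (app_id, text, label) tuples.
    """
    app_ids = sorted({app_id for app_id, _, _ in samples})
    for held_out in app_ids:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


data = [
    (1, "great app", "positive"),
    (1, "keeps crashing", "negative"),
    (2, "fun game", "positive"),
    (3, "too many ads", "negative"),
]
folds = list(split_by_application(data))
print([(app, len(train), len(test)) for app, train, test in folds])
```

Splitting in this way keeps application-specific vocabulary, such as the application name, from appearing in both the training and the test data.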

B. Results
Tables V and VI show the accuracy and F1-Score of before noise removal and after noise removal scenarios, respectively, in descending order according to accuracy. "Vector" indicates the kind of vector used as feature in the experiment. The abbreviations are explained as follows.
• RC: 200-dimensional embedding pre-trained by word2vec on the unlabeled review corpus
• FT: 300-dimensional embedding pre-trained by fastText on unlabeled Wikipedia articles
• WE: Wikipedia Entity Vector (300-dimensional embedding) [75]
• BERT: 768-dimensional sentence embedding pre-trained by BERT on the Wikipedia article corpus
• BoW: bag of words vector based on word appearance frequency, made using labelled training data
• SS: word embedding trained by StarSpace using labelled training data
• PN: 64-dimensional word embedding trained by FFNN using word embedding pre-trained by word2vec
These results show that noise removal significantly improved the accuracy. With respect to word embedding, the results rank in the order RC > FT > WE, with RC the best. Among all the training algorithms, the biLSTM Transformer obtained high accuracy and was the most stable.
Meanwhile, when the feature vector was extracted by using BERT, noise removal reduced the number of words in the review texts. Because valid feature vectors could then not be extracted, the accuracy decreased to 73.5 %. Fig. 7 shows the comparison of accuracy and F1-score between the methods without noise removal. It was found that the F1-score (Positive and Negative) and accuracy were low for the baseline method of lbfgs.logistic that uses the bag of words feature.
It is thought that the frequencies of specific words affect the accuracy. In the small review corpus, proper nouns related to the application as a review target tend to be repeatedly used. Though noise removal can improve this partially, if a word has the same notation as the general noun, it becomes difficult to distinguish between noise and data.
Meanwhile, the method using CNN did not obtain high accuracy. CNN could capture the relationships between neighbouring words. However, it could not accurately classify texts in which the review polarity changes between the first half and the latter half. The classifier using pre-trained word embedding based on SS could obtain a balanced classification. However, because this method cannot treat unknown expressions, which do not appear in the training review corpus, it obtained a lower accuracy than the method using word embedding based on the large review corpus (RC).
Meanwhile, the lower accuracy (68.9%) obtained by the method based on the Transformer classifier using word embedding RC could be due to the poor performance of the Transformer. Fig. 8 shows the accuracy for each application with respect to the four methods: biLSTM Transformer (RC30), StarSpace (2-gram, dim:200), SVM+BERT, and lbfgs.logistic + BoW.
From this graph, we can see that there is a small variation in accuracy between the various types of applications.
Among the four methods, BERT+SVM achieved the highest micro accuracy (average accuracy), whereas biLSTM Transformer (RC30) obtained the lowest macro accuracy, 76.2 %. Therefore, it was found that high accuracy can be achieved irrespective of the application type by using the BERT feature.
The biLSTM Transformer (RC30) achieved extremely low accuracy (ID:17, 16.7%). The number of review texts for this application is only six. A possible reason for this low accuracy could be that this is the only application that belongs to the category "health." Thus, this application is different from the applications that belong to the major category of "game," and the review content of this application's review texts was very different from the other review texts.
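The effect of an application with very few reviews on the two kinds of average can be illustrated with a short sketch. It assumes the conventional definitions, in which micro accuracy pools all reviews before dividing and macro accuracy is the unweighted mean of the per-application accuracies; the counts are hypothetical and are not the experimental results.

```python
def micro_macro_accuracy(per_app):
    """per_app: dict mapping app_id -> (num_correct, num_reviews)."""
    total_correct = sum(c for c, _ in per_app.values())
    total_reviews = sum(n for _, n in per_app.values())
    micro = total_correct / total_reviews                       # pooled over reviews
    macro = sum(c / n for c, n in per_app.values()) / len(per_app)  # mean over apps
    return micro, macro


# Hypothetical counts: one large application and one with only six reviews.
counts = {"A": (90, 100), "B": (1, 6)}
micro, macro = micro_macro_accuracy(counts)
print(round(micro, 3), round(macro, 3))
```

A single application with six reviews barely changes the pooled figure but pulls the per-application mean down sharply, so the two averages should be read together.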
Overall, the word embedding pre-trained on the unlabeled review corpus performed better than that pre-trained on the large Wikipedia corpus. The accuracy is expected to improve further if the labelled corpus is enlarged.
Because the accuracy achieved by StarSpace, which did not use pre-training, was 77 %, it was considered that the performance could be improved by integrating the pre-training word embedding used in our proposed method.

C. Error Analysis
We analyzed the errors by visualizing the attention for the review texts that were misclassified by the biLSTM Transformer + RC.
Table VII presents a visualization of the attention at each step for each review text. The darker the background color of a word, the higher its attention weight. In this study, low frequency words were removed during pre-processing of the review texts. Therefore, some words from the original review texts do not appear in this table.
In Example-1, in the first step, higher weights were assigned to "Useful," "te," and "sou." In the second step, a higher weight was assigned to "sugi." As "-sugiru" in Japanese means excess, it was considered appropriate as a feature expression of review texts. However, as the step-2 weight of "useful (Benri)" was lower than its step-1 weight, the label was predicted as "negative." The small number of words that can serve as keys for the judgment is considered a factor in this misjudgment.
In Example-2, the negative words such as "Kusoge-" (crappy game) and "kuso" (damn) were assigned low weights in step-1. However, these words were assigned high weights in step-2. The other non-distinctive words were also assigned high weights. The final evaluation was "positive," which is a misclassification. For this example, the Transformer+FT method predicted the label and assigned weights more accurately than the other methods. Because attention weights were accumulated by the biLSTM method, the assignment of weights was not well-modulated.

D. Discussions
The experiments showed that the review classifier can classify with high accuracy even if the training corpus is small, provided the high quality feature can be obtained beforehand.
The maximum accuracy of 84.0% was obtained by the bidirectional LSTM with Transformer, using as feature the word embedding obtained through unsupervised learning from the large review corpus, with noise removal. This accuracy is approximately 6 % higher than that of the baseline method, which uses the simple word frequency based bag of words feature. The methods that classify using SVM, based on features obtained by a highly versatile feature extraction method such as BERT, could achieve relatively high accuracy (79.9%). Therefore, the quality and quantity of the pre-training corpus and the applied training algorithm were found to be important in developing a classifier based on a small labelled corpus.
The error analysis revealed many examples in which attention could not focus well on characteristic words. For example, there were cases where the self-attention mechanism could not use the similarity between words and the unknown words that could not be pre-trained.
The accuracy can be improved by preferentially focusing on evaluation expressions by using the manually constructed evaluation expression dictionary. However, low accuracy was achieved while using the polarity score of the PN by the word polarity classifier based on the evaluation expression dictionary. The method adds the polarity score vector to the words that do not have sentiment polarity in reality, which potentially reduces the accuracy.
A similar accuracy was achieved using StarSpace algorithm where the classifier was based on Transformer using word embedding that was trained based on labelled training corpus. However, the accuracy was lower than that of the method using the other embedding method and the LSTM Transformer. Hence, we believe that the size of the pre-training data affects the accuracy.

VI. CONCLUSIONS
This study aimed to create a high precision review classification model based on a small set of labelled review texts. In addition to the manually created small review corpus and dictionary, we proposed a method that trains the classifier model by also using a large unlabelled review text corpus.
The proposed method, that uses word embedding dedicated to review texts as a feature based on self-attention Transformer and bidirectional LSTM networks, could achieve a higher accuracy than the algorithms such as the recently developed StarSpace or BERT, which use embedding based classification.
An improvement in accuracy could be observed by removing noise based on the stop word list, which was made manually.
In future, we would like to develop a method that would automatically detect noise words from review text and prepare a more sophisticated and flexible review classifier by adding the noise removal process as a pre-processing step.