Machine Learning Versus Deep Learning Performances on the Sentiment Analysis of Product Reviews

In the current digital era, business platforms have shifted drastically toward online stores. With an internet-based platform, customers can easily order goods from their smartphones and have them delivered without going to a shopping mall. The drawback of this platform is that customers cannot judge the quality of a product before ordering it. Such platforms therefore often provide a review section where previous customers can leave a review of the product they received. These reviews are a good source for analyzing customer satisfaction. Business owners can assess whether the review trend is positive or negative from the feedback scores that customers give, but analyzing these data manually takes too much time. In this research, we develop computational models using machine learning techniques to classify product reviews as positive or negative through sentiment analysis. In our experiments, we use book review data from amazon.com to develop the models. For the machine learning strategy, the data are transformed with the bag of words technique before models are developed with the logistic regression, naive Bayes, support vector machine, and neural network algorithms. For the deep learning strategy, we use word embedding to transform the data before applying the long short-term memory and gated recurrent unit techniques. We compare the performance of the machine learning and deep learning models on both the preprocessed and the non-preprocessed dataset. The result is that bag of words with a neural network outperforms the other techniques on both datasets.

For online retail, a platform such as Amazon is very popular among customers because of its easy-to-use front end and its efficient, smart back end. Customers can easily browse the website and order the products they find interesting, and the products are later delivered to their homes. With this manner of shopping, customers do not need to go to a shopping mall. One flaw of online shopping, however, is that customers do not know what the real product looks like, which makes some customers reluctant to buy online. Hence, platform owners create product review systems in which customers who have ordered or used a product can leave feedback about it [2]. The review system helps new customers make their decisions wisely and confidently.
Most product review systems have a scoring part and an open-ended part in which customers can freely leave any comment about the product. Customers may leave negative comments as well as positive ones. Moreover, a customer may leave a review or comment that conflicts with the score they give. Hence, platform owners want to classify such reviews into the correct group: positive, negative, or other. This kind of classification of feeling is called sentiment analysis.
Sentiment analysis is a type of analysis used in natural language processing that mostly focuses on binary text classification, such as positive versus negative opinion. The analysis can be automated with machine learning, a method that learns patterns from data as a human does but can search for patterns faster and analyze more data than a human could. Machine learning is thus a good tool for developing sentiment analysis models and analyzing text.
In this research, we develop sentiment analysis models from online retail product reviews, based on machine learning and deep learning algorithms, to classify positive and negative reviews automatically. For evaluation, we combine several preprocessing methods with various machine learning and deep learning algorithms. We compare the models in terms of their classification accuracy and the time needed to train them.

II. LITERATURE REVIEW
From a review of the literature on sentiment analysis of comments posted in product review systems, we found that most researchers develop their models from internet data, especially data from online platforms. Most current work focuses on analyzing text from online business platforms, such as product reviews from Amazon [3], tweets on Twitter [4], movie reviews from IMDB, and travel recommendations on Yelp [5], [6]. These data are generated continuously, and their size grows rapidly because a large number of users post feedback and comments every day. Platform owners collect these data and analyze them for insights that support new product design or improve product quality to better fit their customers. Before the main analysis, the analyst needs to preprocess the data, because the data are in a text form that a computer cannot process directly. The techniques normally applied at this stage are bag of words and n-grams. The analyst then uses the preprocessed data to develop a machine learning model for classifying text. Machine learning techniques applied to sentiment analysis of product reviews or customer comments include the support vector machine (SVM) [7], [8] and naive Bayes models built on data preprocessed with term frequency-inverse document frequency (TF-IDF) [9].
At present, deep learning is a very popular technique for developing models because it yields classification models with high accuracy and does not need a separate feature selection step. Deep learning has been applied to sentiment analysis in various works, such as the use of word embedding to preprocess data before developing a model based on a convolutional neural network (CNN) [10], the application of a transfer learning strategy based on a pre-trained CNN to reduce additional model training time [4], and the use of word embedding to preprocess data before applying sequence analysis algorithms such as the recurrent neural network (RNN) and long short-term memory (LSTM) [8].
Given the high accuracy of deep learning methods for sentiment analysis reported in the literature, we are interested in studying the performance of deep learning empirically. We use several machine learning techniques as a benchmark against which to compare deep learning performance. The framework of our comparative study is presented in the next section.

III. RESEARCH FRAMEWORK
Fig. 1 shows the framework of our comparative study, which evaluates which preprocessing and learning techniques yield the most satisfying result for sentiment analysis of product comments and reviews. First, we collect the data and preprocess them with several natural language processing steps: transforming text to lowercase, removing stop words and punctuation, and removing prefixes and suffixes from words. We then develop models based on bag of words with machine learning and on word embedding with deep learning. Finally, we compare the results in terms of accuracy and training time.

IV. DATASET
In the experiment, we use book review data from Amazon.com collected by Blitzer et al. [11] as the case study for developing the models. The dataset contains review texts in English; some examples are shown in Table I. Statistics on the number of records and the classes are given in Table II. The dataset has 2000 records in two groups: 1000 positive reviews labeled 1 and 1000 negative reviews labeled 0, as shown in the rating column. We split the data into two subsets, 70% for the training set and the remaining 30% for the test set, with positive and negative reviews split equally. After the train-test split, the training set contains 1400 records and the test set contains 600 records.
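The class-balanced 70/30 split described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `stratified_split`, the random seed, and the synthetic label list are assumptions made for the example.

```python
import random

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split record indices into train/test, class by class.

    Splitting each class separately keeps the positive/negative
    ratio the same in both subsets.
    """
    rng = random.Random(seed)  # the seed is an arbitrary illustrative choice
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic labels with the same class sizes as the dataset:
# 1000 positive (1) and 1000 negative (0) reviews.
labels = [1] * 1000 + [0] * 1000
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With these class sizes, each subset keeps an equal number of positive and negative reviews, which matches the record counts reported above.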

A. Natural Language Processing
Natural language processing (NLP) is the subfield of artificial intelligence (AI) that tries to make a computer understand natural language as a human does. To this end, a text document needs to be transformed into a format that a computer can process in the necessary computational steps [12]. In this research, we use NLP methods in the feature extraction workflow to preprocess raw text into useful text. We process the text data in the following steps:
1) Lowercase all characters in the dataset.
2) Remove punctuation and stop words.
3) Remove prefixes and suffixes from each word to keep only the base word. For this step, we use the model from WordNet [13].
In this research, we focus on text classification for sentiment analysis, for which several methods are available for developing predictive models. Choosing the method that fits the dataset, however, is a difficult task. Before developing models with machine learning techniques, we need to transform the data into a format that machine learning algorithms can use. We choose the bag of words method because it is simple, easy to use, and performs well with the subsequent machine learning process that builds the predictive model.
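The three preprocessing steps listed above can be sketched in plain Python. The stop-word set and the suffix list below are small hypothetical stand-ins chosen for illustration; the experiments rely on NLTK and WordNet instead, which behave differently from this crude suffix stripper.

```python
import string

# Hypothetical, deliberately tiny stop-word list (a real list is much larger).
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "this", "it", "i"}
# Crude suffix list standing in for the WordNet-based base-word step.
SUFFIXES = ("ing", "ed", "ly", "s")

def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Apply the three steps: lowercase, drop punctuation/stop words, strip suffixes."""
    text = text.lower()                                               # step 1
    text = text.translate(str.maketrans("", "", string.punctuation))  # step 2a
    tokens = [t for t in text.split() if t not in STOP_WORDS]         # step 2b
    return [strip_suffix(t) for t in tokens]                          # step 3

print(preprocess("This book was AMAZING, and I loved the characters!"))
```

A rule-based stripper like this produces truncated stems rather than dictionary words, which is one reason a lexical resource such as WordNet is preferable in practice.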

B. Bag of words
Bag of words (BoW) is a natural language processing technique for transforming text documents into a table of features. The process counts the words in the dataset and transforms each document into a row of a matrix. To apply the technique, users must choose the number of words of interest to use as the features of the matrix. BoW then selects the most frequent words as the features, counts the frequency of each feature word in each document, and normalizes the values, as in the example in Fig. 2 [14]. BoW is a simple and fast technique for transforming text documents into a form from which a machine learning predictive model can be developed.
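As a minimal sketch of the BoW procedure, the code below selects the most frequent words as features, counts them per document, and normalizes each row. The text does not state which normalization is used, so dividing by the row total is an assumption of this example, as are the function names and the toy documents.

```python
from collections import Counter

def build_vocabulary(docs, max_features):
    """Keep the max_features most frequent tokens across all documents."""
    counts = Counter(token for doc in docs for token in doc)
    return [word for word, _ in counts.most_common(max_features)]

def bag_of_words(docs, vocabulary):
    """Return one row per document with normalized counts of vocabulary words."""
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for doc in docs:
        row = [0.0] * len(vocabulary)
        for token in doc:
            if token in column:  # words outside the vocabulary are ignored
                row[column[token]] += 1.0
        total = sum(row)
        if total > 0:
            # Assumed normalization: divide by the number of counted words.
            row = [value / total for value in row]
        matrix.append(row)
    return matrix

# Toy tokenized documents, invented for illustration.
docs = [["good", "book", "good", "story"], ["bad", "book", "story"], ["good", "read"]]
vocab = build_vocabulary(docs, max_features=3)
X = bag_of_words(docs, vocab)
for row in X:
    print(row)
```

In the experiments the vocabulary size plays the role of `max_features`; here it is set to 3 only to keep the matrix readable.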

C. Machine Learning
Machine learning is another subfield of AI, one that tries to make a computer learn things (mostly in the form of data patterns) from historical data, as humans learn new knowledge from past experience. The learning process can be either supervised or unsupervised. In supervised learning, the computer learns from historical data that carry a labeled target value, which guides it toward the correct pattern for predicting the target value of a new data instance that has no target field. Unsupervised learning, in contrast, has no specific target to learn; it searches for natural groups or interesting patterns in the data [15]. In this work, we use four well-established machine learning algorithms: logistic regression, naive Bayes, support vector machine, and neural network.

1) Logistic regression
Logistic regression is a machine learning algorithm that extends linear regression to classification tasks. The algorithm combines the concept of linear regression with the sigmoid function [16]; the main idea is shown in equation (1). The logistic regression function outputs a value between 0 and 1. Logistic regression works well on binary classification tasks.
$$\sigma(z) = \frac{1}{1 + e^{-z}} \qquad (1)$$
where $\sigma(\cdot)$ is the sigmoid function and $z$ is the input variable of the sigmoid function.
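The way logistic regression combines a linear model with the sigmoid function can be sketched as follows, using the standard sigmoid 1 / (1 + e^(-z)). The weights, bias, and the 0.5 decision threshold are hypothetical values chosen for the example, not fitted parameters from the experiments.

```python
import math

def sigmoid(z):
    """Standard sigmoid, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def predict_proba(weights, bias, features):
    """Linear combination of the features passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(weights, bias, features, threshold=0.5):
    """Binary decision: 1 (positive) if the probability reaches the threshold."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical weights for two features.
weights, bias = [1.0, -1.0], 0.0
print(predict_proba(weights, bias, [2.0, 0.5]))
print(predict(weights, bias, [2.0, 0.5]))
```

Because the sigmoid maps any real input into the interval from 0 to 1, its output can be read as the estimated probability of the positive class.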

2) Naive bayes
Naive Bayes is a simple probabilistic classifier based on Bayes' theorem. The algorithm focuses on the target event and treats the other observed events as features. It then calculates the probability of the target given the features based on equations (2) and (3) [17]. All features are assumed to be independent of each other. Naive Bayes is very popular and widely used in classification tasks.

3) Support vector machine
Support vector machine (SVM) is a machine learning algorithm that classifies target objects by creating a hyperplane that splits a mixture of data, represented as multi-dimensional vectors, into homogeneous groups (as shown in Fig. 3). The vectors are compared through a kernel function (e.g., radial basis, polynomial, sigmoid), and users need to choose the kernel that best fits the specific dataset to obtain the best classification performance [18]. SVM works well on binary classification tasks, and extensions of the original algorithm can classify multiclass data.
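The three kernel functions named above can be written in their common textbook forms, as in the sketch below. The parameter names (`gamma`, `degree`, `coef0`) and their default values are illustrative assumptions, not the settings used in the experiments.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=0.5):
    """Radial basis kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def polynomial_kernel(u, v, degree=3, coef0=1.0):
    """Polynomial kernel: (u . v + coef0) ** degree."""
    return (dot(u, v) + coef0) ** degree

def sigmoid_kernel(u, v, gamma=0.5, coef0=0.0):
    """Sigmoid kernel: tanh(gamma * u . v + coef0)."""
    return math.tanh(gamma * dot(u, v) + coef0)

u, v = [1.0, 2.0], [3.0, 4.0]
print(rbf_kernel(u, v), polynomial_kernel(u, v, degree=2), sigmoid_kernel(u, v))
```

Each kernel scores how similar two vectors are; the choice among them changes the shape of the decision boundary that the SVM can learn.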

4) Neural network
Neural network (NN) is a machine learning algorithm that imitates the learning ability of humans through a network of computation units designed to resemble the human nervous system. A neural network contains nodes that are similar to neurons. Each node computes an output from its weighted inputs. The nodes are organized into multiple connected layers, similar to the nervous system, as in the example in Fig. 4.
The weights of the nodes change during the iterative training process so as to yield the most accurate prediction of the target attribute value. Each round of training is called an epoch [19]. Because its architecture has many computational nodes, a neural network takes a long time to train. This cost in time is offset by its high predictive performance.
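How weights change from epoch to epoch can be illustrated with a single sigmoid node trained by gradient descent. This is a deliberately reduced sketch: the toy data, the learning rate of 0.5, and the single-node architecture are assumptions for the example and do not reproduce the network used in the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, learning_rate=0.5, epochs=30):
    """Train one sigmoid node; return weights, bias, and mean loss per epoch."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    history = []
    for _ in range(epochs):  # one pass over all samples is one epoch
        total_loss = 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            total_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            error = p - y  # gradient of the cross-entropy loss w.r.t. z
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
        history.append(total_loss / len(samples))
    return weights, bias, history

# Toy data: the first feature indicates class 1, the second class 0.
samples = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]]
labels = [0, 1, 0, 1]
weights, bias, history = train(samples, labels)
print(history[0], history[-1])
```

The recorded loss falls over the epochs as the weights move toward values that separate the two classes; a full network repeats the same idea across many nodes and layers, which is why training takes longer.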

D. Deep Learning
Deep learning is a state-of-the-art learning approach that developed from the success of neural networks. Learning capacity is increased by stacking more layers in the network. Hence, deep learning can analyze data more deeply and extract useful features that increase the performance of the predictive model [20]. At present, deep learning is applied extensively in many domains, such as image processing, image segmentation, and object detection.
In NLP, deep learning offers a specific kind of model, called a language model, that can analyze data in sequence form. This kind of sequence learning rarely appears in traditional machine learning algorithms. The language model can outperform other learning algorithms because it analyzes text as sentences, that is, as sequences of words. For language models, deep learning uses a text preprocessing method called word embedding.

1) Word embedding
Word embedding is a preprocessing method that transforms a text document into a format that deep learning algorithms can analyze. The method converts each word into a vector, which makes it easy to calculate the similarity between words. An example of such vectors is shown in Fig. 5 [21]. Word embedding is commonly used with recurrent neural network (RNN) algorithms such as long short-term memory and gated recurrent units, because these algorithms can analyze sequence data. Pre-trained word embedding models such as GloVe [22], Word2Vec [23], and FastText [24] are also available for use by learning algorithms.
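Word similarity on embedding vectors is commonly measured with cosine similarity, sketched below. The three-dimensional vectors are invented for illustration only; real pre-trained embeddings have far more dimensions, and their values come from training on large corpora.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, not taken from any pre-trained model.
embeddings = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "bad": [-0.7, 0.3, 0.2],
}

print(cosine_similarity(embeddings["good"], embeddings["great"]))
print(cosine_similarity(embeddings["good"], embeddings["bad"]))
```

In this toy example the vectors for "good" and "great" point in nearly the same direction, while "good" and "bad" do not, which is the kind of relationship an embedding is meant to capture.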

2) Long short-term memory
Long short-term memory (LSTM) is a deep learning algorithm developed from the RNN to address the vanishing gradient problem and to capture long-term dependencies in the data. An LSTM network is built from cells, whose fundamental structure is shown in Fig. 6. Each cell contains four subsystems: an input gate that takes in the input data, a forget gate that weighs the significance of the memory cell state from the previously computed step, a memory-cell state gate that computes the new memory cell state, and an output gate that computes the new hidden state [25], [26].
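One time step of an LSTM cell can be sketched with the commonly used textbook equations. To keep the example short, the input, hidden state, and cell state are single numbers rather than vectors, and the parameter values are arbitrary assumptions; the names `lstm_step` and `params` are introduced here for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with scalar input, hidden state, and cell state.

    params maps each gate name to (input weight, recurrent weight, bias).
    """
    def gate(name, activation):
        w, u, b = params[name]
        return activation(w * x + u * h_prev + b)

    i = gate("input", sigmoid)         # how much new information to admit
    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    g = gate("candidate", math.tanh)   # proposed new cell content
    o = gate("output", sigmoid)        # how much of the cell state to expose
    c = f * c_prev + i * g             # new memory cell state
    h = o * math.tanh(c)               # new hidden state
    return h, c

# Arbitrary illustrative parameters, identical for every gate.
params = {name: (0.5, 0.5, 0.0) for name in ("input", "forget", "candidate", "output")}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:  # a short input sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The additive update of the cell state (`f * c_prev + i * g`) is what lets information persist across many time steps.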

3) Gated recurrent unit
Gated recurrent unit (GRU) is a deep learning algorithm developed to reduce the complexity of LSTM networks by reducing the number of gates in the cell. The reduction leaves fewer parameters to compute in the network and hence speeds up computation. A GRU cell takes an input and produces an output through two gates, the update gate and the reset gate. The update gate controls how much the hidden state changes, while the reset gate resets the value of the hidden state, as shown in Fig. 7. This cell structure makes GRU faster than LSTM because it has fewer parameters and a simpler structure [26], [27].
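The reduction in parameters can be made concrete with the textbook counting rule: every gate or candidate block has one input weight matrix, one recurrent weight matrix, and one bias vector. A textbook LSTM has four such blocks and a textbook GRU has three. The sketch below applies this rule to a single layer with a 200-dimensional input and 500 units; some library implementations add extra bias terms, so counts reported by a framework can differ slightly.

```python
def recurrent_layer_params(input_dim, units, n_blocks):
    """Parameters of one recurrent layer under the textbook counting rule.

    Each block holds input weights (units x input_dim), recurrent weights
    (units x units), and a bias vector (units).
    """
    per_block = units * input_dim + units * units + units
    return n_blocks * per_block

def lstm_params(input_dim, units):
    # Input, forget, and output gates plus the candidate cell state.
    return recurrent_layer_params(input_dim, units, n_blocks=4)

def gru_params(input_dim, units):
    # Update and reset gates plus the candidate hidden state.
    return recurrent_layer_params(input_dim, units, n_blocks=3)

print(lstm_params(200, 500))  # 1402000
print(gru_params(200, 500))   # 1051500
```

Under this rule a GRU layer has three quarters of the parameters of an LSTM layer of the same size.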

VI. EXPERIMENTS
Our experiments assess the review sentiment classification performance of machine learning algorithms with the bag of words (BoW) transformation against deep learning algorithms with the word embedding transformation. We first split the data into two subsets, the training set and the test set. We evaluate performance in two scenarios: developing classification models from non-preprocessed data and developing them from preprocessed data. For text preprocessing, we convert all text to lowercase, remove punctuation and stop words, and remove prefixes and suffixes from words. After these steps, the dataset is ready for model development.
The machine learning models are based on logistic regression, naive Bayes, SVM, and neural network with the BoW transformation. The deep learning models are based on LSTM and GRU with the word embedding transformation. We compare performance with two metrics: accuracy and the time to train the model.
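The two evaluation metrics can be sketched as follows. The helper names `accuracy` and `timed` are introduced for this example, and wall-clock timing with `time.perf_counter` is an assumption about how training time might be measured, not a description of the measurement used in the experiments.

```python
import time

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Toy labels and predictions, invented for illustration.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))

# Any training function can be wrapped; sum() stands in for one here.
result, seconds = timed(sum, [1, 2, 3])
print(result, seconds)
```

Because the two classes are balanced in this dataset, plain accuracy is a reasonable summary of classification performance.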
We use Python as the primary language for developing the models. We use NLTK for preprocessing the text dataset [28], Scikit-learn for developing models based on machine learning algorithms [29], and Keras [30] for developing models based on deep learning algorithms. We set the BoW vocabulary to 5000 words, then train the logistic regression, naive Bayes, and SVM models with the default configuration in Scikit-learn. The neural network is a backpropagation network with 500 nodes in its hidden layers, a learning rate of 0.001, and 30 training epochs.
For the deep learning methods, we use GloVe as the pre-trained word embedding model for both LSTM and GRU. Each word vector has 200 dimensions, and each input sequence contains 500 values. LSTM and GRU share the same architecture of 2 layers. Each layer has 500 units and is followed by a dropout layer with a dropout rate of 0.2 to prevent overfitting. Finally, a fully connected (dense) layer with 1 node and a sigmoid activation function forms the last layer of the architecture and performs the classification. For training, the network is configured with a learning rate of 0.000001 and 30 epochs.
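A fixed sequence length of 500 values implies that shorter reviews are padded and longer ones are truncated before they reach the embedding layer. The sketch below shows one way to do this; padding and truncating at the end of the sequence and using 0 as the padding index are assumptions of this example, and library defaults may pad or truncate at the front instead.

```python
def pad_sequence(token_ids, max_len=500, pad_id=0):
    """Return a list of exactly max_len ids.

    Longer sequences keep their first max_len ids; shorter ones are
    filled with pad_id at the end (an assumed convention).
    """
    truncated = token_ids[:max_len]
    return truncated + [pad_id] * (max_len - len(truncated))

# Toy word-index sequences, invented for illustration.
short_review = [12, 7, 431]
long_review = list(range(1, 801))

print(len(pad_sequence(short_review)), len(pad_sequence(long_review)))
print(pad_sequence(short_review, max_len=5))
```

After this step every review has the same length, so a batch of reviews can be stacked into one array for the recurrent layers.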
The deep learning experiments use Google Colaboratory, a platform created by Google [31], as the working environment. The platform provides an Nvidia Tesla T4 graphics processing unit (GPU) [32] for fast training of deep learning models.

VII. RESULTS AND DISCUSSION
Our experimentation is based on two scenarios for comparing the performance of machine learning and deep learning algorithms in the specific application of sentiment analysis of text-based product reviews. The first scenario learns models from non-preprocessed data, whereas the second learns from preprocessed data. Preprocessing here means applying NLP techniques to remove unnecessary words, which reduces excessive features during the model learning phase.
From the comparative results shown in Table III and Table IV, we can see that the NLP techniques we applied to remove unnecessary features from the text data improve the performance of some machine learning algorithms, namely logistic regression, naive Bayes, and SVM, both by improving classification accuracy on the test set and by reducing model building time. For the neural network, LSTM, and GRU algorithms, however, our preprocessing technique did not improve classification performance. The cause may be that these algorithms have feature extraction built into the network.
In terms of training time, we found that training models with machine learning algorithms is significantly faster than training models with deep learning algorithms. The results also show that LSTM and GRU take more time to train, yet their classification performance does not exceed that of the models built with traditional machine learning algorithms.
The results also show that a statistical machine learning algorithm such as logistic regression can improve its sentiment classification performance when natural language preprocessing techniques are applied before the model building phase. The same strategy slightly improves the performance of the naive Bayes and SVM algorithms.
Finally, the combination of the BoW text processing technique with the neural network learning algorithm outperforms the other models, with the highest accuracy of 0.82 and a training time of 55.8 seconds on the non-preprocessed dataset. The training time of the neural network (NN) is significantly lower than that of the deep learning algorithms (LSTM and GRU), which take as long as 2.44 minutes yet yield lower classification accuracy than NN on the test dataset. Moreover, LSTM and GRU reach 100% accuracy on the training data but only 74% on the test data, which suggests that the deep learning models overfit.

The aim of this research is to develop a sentiment analysis model that classifies product reviews as either positive or negative. The classification techniques combine text processing methods (bag of words, or BoW, and word embedding) with learning algorithms from machine learning (logistic regression, naive Bayes, SVM, and NN) and deep learning (LSTM and GRU). This yields six combinations: BoW with logistic regression, BoW with naive Bayes, BoW with SVM, BoW with NN, word embedding with LSTM, and word embedding with GRU.
To reduce computation time during model building, we propose applying natural language processing (NLP) techniques as preprocessing steps for dimensionality reduction. The applied NLP techniques are converting all text to lowercase and removing punctuation, stop words, prefixes, and suffixes. We therefore assess model performance in two settings: models built from NLP-preprocessed data and models built from data without NLP preprocessing.
From the experimental results, we found that NLP preprocessing is important for machine learning algorithms such as logistic regression, naive Bayes, and SVM, but makes no significant difference for the neural network and deep learning algorithms.
In evaluating the combinations of text processing technique and learning algorithm, we found that BoW combined with a neural network yields the most accurate model for classifying review sentiment as positive or negative (82% accuracy, with a training time of around one minute). BoW combined with logistic regression, trained on the NLP-preprocessed data, gives the second-best model at 76% accuracy, with a training time shorter than one second. The combination of word embedding and a deep learning algorithm comes in third place, with 74% accuracy and a long training time of around two minutes.
This research is a case study of applying machine learning and deep learning to sentiment analysis, with a focus on the preprocessing method. In future work, we plan to improve the accuracy of the model and to adopt other methods, such as n-grams and TF-IDF, to reduce the number of features and speed up model training.