A Joint Multi-task Architecture for Document-level Aspect-based Sentiment Analysis in Vietnamese

Abstract: The growth of e-commerce websites has produced a vast number of user opinions about products and services. As a result, aspect-based sentiment analysis (ABSA) has attracted much attention from academia and industry in recent years because of its application to many real-life problems. ABSA is a complex task that aims to extract both the aspects and the sentiments expressed in a text. Aspect Category Detection and Sentiment Polarity Classification are two challenging subtasks of ABSA, which detect a set of pre-defined categories and the corresponding sentiment polarities for a given review. This paper presents an effective joint multi-task architecture based on neural network models to solve these two tasks on document-level ABSA datasets. Our model is designed to predict all mentioned aspect categories and their sentiment polarities in a document-level review. We train the model on the two tasks simultaneously and use the additional information from the aspect category detection task when predicting the aspect categories and their sentiments for a specific domain. Our architecture can exploit the hidden correlations between categories and polarities in a review. Experiments on two Vietnamese datasets in the restaurant and hotel domains demonstrate that our model outperforms previous state-of-the-art methods on two benchmark document-level datasets.


I. INTRODUCTION
With the development of information technology, online shopping has become a new social trend. Users can easily order an item via e-commerce websites such as Amazon, eBay, and Alibaba, or book a hotel or restaurant through websites like TripAdvisor or Agoda. However, users often wonder whether the quality and effectiveness of a product match the descriptive information on the website. Therefore, these websites often have sections where users can express their opinions about products and services. Analyzing these comments has many benefits. First, other users can consult the reviews of a product before making a decision [1]. Second, businesses can rely on the reviews to analyze users' experiences with their products. In business terms, users' comments are a precious free resource for discovering the advantages and disadvantages of a product, and such analyses help manufacturers improve product quality [2]. Given the vast amount of customer-generated data online, analyzing it manually is challenging. For that reason, Sentiment Analysis has attracted the attention of a wide variety of researchers in academia and industry. Sentiment Analysis (SA) is the task of extracting opinions, subjectivity, and sentiment information from online text documents. The primary purpose of this task is to identify the overall sentiment polarity of user-generated documents [3].
However, users' reviews often cover many different aspects, for example, "The staff serves good, but the food is bad". Two aspects are mentioned in this example, and their sentiment polarities differ: the polarity of "Service" is positive, whereas the polarity of the "Food#Quality" category is negative. The branch of SA that identifies sentiments associated with specific aspects in the input text is called Aspect-based Sentiment Analysis (ABSA). This task helps us understand users' opinions because it focuses directly on the sentiment of each aspect rather than on the overall review. The first official international workshop on the ABSA task was SemEval-2014 Task 4 [4], and it continued over the next two years with SemEval-2015 Task 12 [5] and SemEval-2016 Task 5 [6]. These workshops introduced three main tasks for ABSA: Aspect Category Detection (ACD), Opinion Target Expressions (OTE), and Sentiment Polarity Classification (SPC), at two levels of text, the sentence level and the document level. The ACD task assigns aspect categories from a pre-defined set. The OTE task detects the expressions in the review that mention an aspect, while the SPC task assigns a sentiment polarity to the outputs of the ACD and OTE tasks. For example, given the restaurant review "The staff serves good but the food is bad", the output of the ACD task is "Service#General" and "Food#Quality", the output of the OTE task is "staff" and "food", and the output of SPC is "positive" for the "Service#General" category and the "staff" term, and "negative" for the "Food#Quality" category and the "food" term. This is a challenging task, especially when a review contains many aspects with different sentiment polarities [7]. In addition, the SemEval workshops have published high-quality, manually annotated sentence-level datasets in various languages, e.g., English, Arabic, and Chinese, for the research community.
For the Vietnamese language, Vietnamese Language and Speech Processing (VLSP) organized an ABSA workshop in 2018 and provided two benchmark datasets for the restaurant and hotel domains [8]. Each dataset is annotated for the aspect category detection and sentiment polarity classification tasks. The participating teams were required to solve both tasks for the final output; that is, a participating system must detect the aspect categories mentioned in the reviews together with their sentiment polarities, a task called Aspect Category with Polarity (ACP). The set of aspect categories follows SemEval-2016 Task 5 [6], and there are three sentiment polarities: "positive", "negative", and "neutral". These datasets are demanding for the ABSA task in terms of memory and training time because each review can be assigned many aspect categories with different polarities. The length of the reviews is not uniform across the three sets (training, validation, and testing). Another challenge is the imbalance of aspect categories and polarities in these datasets.
More recently, the pre-trained Bidirectional Encoder Representations from Transformers (BERT) language model [9] has established new state-of-the-art results on aspect-based sentiment analysis datasets (Hoang et al. [10], Sun et al. [11], Li et al. [12], Li et al. [13]). However, a limitation of BERT [9] is that the input sequence length is capped at 512 tokens, so reviews longer than 512 tokens must be pre-processed. To address this problem, Sun et al. [14] proposed truncation and hierarchical methods to fine-tune BERT for text classification tasks and demonstrated the effectiveness of truncation for long texts. The truncation methods are head-only (keep the first 510 tokens of the input), tail-only (keep the last 510 tokens), and head+tail (keep the first 128 tokens and the last 382 tokens). However, we cannot apply BERT with truncation to the document-level datasets for two reasons: (1) the maximum input sequence length of pre-trained BERT is 512, and document-level reviews are often longer than that in our case; (2) aspect categories may appear anywhere in a long review, so truncating the review would discard information about the aspects it mentions. Therefore, it is difficult to apply the BERT model to document-level datasets in the ABSA problem.
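To make the three truncation strategies concrete, the following minimal Python sketch applies them to a list of tokens and shows which tokens are lost. The function name `truncate_tokens` and the synthetic 600-token review are illustrative assumptions; they are not taken from Sun et al. [14] or from our implementation.

```python
def truncate_tokens(tokens, method="head+tail", max_len=510, head_len=128):
    """Truncate a token list to at most max_len tokens.

    max_len=510 leaves room for the two special tokens of a 512-token input.
    """
    if not 0 < head_len < max_len:
        raise ValueError("head_len must be between 1 and max_len - 1")
    if len(tokens) <= max_len:
        return list(tokens)
    if method == "head-only":
        return list(tokens[:max_len])
    if method == "tail-only":
        return list(tokens[-max_len:])
    if method == "head+tail":
        tail_len = max_len - head_len  # 382 with the default values
        return list(tokens[:head_len]) + list(tokens[-tail_len:])
    raise ValueError(f"unknown method: {method}")


# Hypothetical 600-token review; every token is unique so losses are easy to count.
review = [f"tok{i}" for i in range(600)]
head_tail = truncate_tokens(review, "head+tail")
dropped = set(review) - set(head_tail)
print(len(head_tail), len(dropped))
```

Any aspect mention that falls inside the dropped span never reaches the model, which illustrates the second reason given above.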
In this paper, we present a joint multi-task architecture based on neural networks that efficiently predicts the aspect categories and their corresponding sentiments for document-level data. Our architecture predicts all aspect categories and their sentiments for a specific domain. It trains the two tasks at the same time and combines the features of the ACD task with the features of the ACP task as the final representation of the long input review. This representation is fed into a fully connected layer followed by a number of softmax layers to predict each aspect category with its sentiment. Experimental results on two document-level datasets for the Vietnamese language demonstrate that our framework is more effective than previous methods, including the BERT architecture [15].
The rest of this paper is organized as follows. Related work is introduced in Section II. Section III presents our architecture, while Section IV describes the experiments on two benchmark datasets. Results and discussion are provided in Section V. Finally, Section VI summarizes the paper and outlines our future research directions.

II. RELATED WORKS
In recent years, there have been a tremendous number of studies on aspect-based sentiment analysis, including many survey papers (Schouten and Frasincar [16], Zhang et al. [1], Do et al. [17], Nazir et al. [3]). These papers summarize various approaches, including knowledge-based, machine learning, and hybrid directions, for each task in ABSA. In the latest survey, Nazir et al. [3] discussed the issues and challenges, presented the progress of recent studies, and gave future research directions. Most recent studies address the problem through multi-task learning based on deep learning architectures. Xue et al. [18] proposed the MTNA framework, based on BiLSTM and CNN, for the Aspect Category Detection and Opinion Target Expressions tasks because they noted a close relationship between the two tasks. Their framework shares mutual information between the two tasks and achieved positive results on various SemEval datasets. Wang et al. [19] then proposed a multi-task neural learning framework for the opinion target expressions and sentiment polarity classification tasks that leverages attention mechanisms. Similarly, Schmitt et al. [20] presented a joint model based on a deep neural network for the Aspect Category Detection and Sentiment Polarity Classification tasks. They experimented with CNN and LSTM architectures combined with fastText embeddings on the GermEval 2017 dataset. Their model is end-to-end trainable and predicts the aspects and their sentiment polarities at the same time. They formatted the model output as a set of 4-class vectors (none, positive, negative, and neutral), one for each aspect category in the specific domain. Experimental results showed that the end-to-end CNN outperformed the best system on the GermEval datasets.
Dai et al. [21] presented a Multi-task Multi-head Attention Memory Network (MMAM), which uses shared document and category features for aspect category detection and sentiment polarity classification on Chinese datasets. They compared their framework with other multi-task architectures and achieved comparable results on two datasets. He et al. [22] introduced an interactive multi-task learning network (IMN) and showed superior performance on three benchmark datasets. The IMN trains multiple related tasks simultaneously by using a message-passing mechanism for interaction between tasks. Useful information is sent back to a latent representation shared between tasks, and it is repeatedly updated and propagated across multiple links for all tasks. The performance of the IMN framework is optimized through iterative training. The IMN framework uses a CNN model combined with word embeddings. Their results showed that the IMN model is better than multiple baselines on the Opinion Target Expressions and Sentiment Polarity Classification tasks. Building on the proposal of He et al., Liang et al. [23] presented an Iterative Knowledge Transfer Network (IKTN) model for the end-to-end ABSA task, which exploits the semantic relationships between tasks. Their approach achieved new state-of-the-art results on three benchmark datasets for the OTE and SPC tasks. Most recently, following the development of the pre-trained language model BERT [9], many studies have combined deep contextualized word embedding layers with neural models and achieved new state-of-the-art results on various ABSA tasks (Hoang et al. [10], Sun et al. [11], Li et al. [12], Li et al. [13]). Most of the above studies experimented on the sentence-level benchmark datasets from the SemEval workshops. However, in this paper, we focus on document-level datasets, where each sample is usually longer than the maximum input length of the BERT architecture.
For the Vietnamese language, there have been a few studies on ABSA in recent years, including public benchmark datasets (Nguyen et al. [8], Thuy et al. [24], Nguyen et al. [25], Thuy et al. [26]) and methodologies (Thin et al. [27], Thin et al. [28], Tran and Phan [29], Le et al. [15]). To the best of our knowledge, Nguyen et al. [8] were the first to publish benchmark datasets for the research community on the ABSA problem; these datasets have the same format as the SemEval 2016 shared task [6]. They are annotated on document-level user reviews and split into training, validation, and testing sets for the hotel and restaurant domains. These datasets are very challenging because the training and testing sets differ in the number of samples and in review length. Thuy et al. [24] presented a manually annotated sentence-level dataset for the ACD task with 6 472 sentences (3 796 in Vietnamese and 2 676 in English) for the restaurant domain. Thuy et al. [26] subsequently annotated this dataset for the SPC task and combined it with a translated English dataset [6] to form the final dataset. Similarly, Nguyen et al. [25] presented a document-level ABSA dataset for restaurant reviews with two tasks, ACD and SPC. Unlike the other datasets, theirs was annotated with 7 aspect categories and 5 sentiment polarities. Most of the dataset studies presented different baseline methods, mostly Support Vector Machine models combined with various handcrafted features [24]-[26]. The works of Thin et al. [27], Thuy et al. [24], and Thuy et al. [26] treat the two tasks (ACD and SPC) as two separate components, where each component consists of N binary classifiers corresponding to N aspect categories. In contrast, Nguyen et al. [25] combined the outputs of the two tasks in one classifier for each aspect category. In our opinion, these approaches resolve aspect categories individually without using the relevant information shared between them. Recently, Thin et al. [30] proposed a joint architecture based on deep learning methods to address the two tasks on the two VLSP datasets [8]. Next, Le et al. [15] presented experimental results using multilingual BERT on the VLSP datasets [8]. Because the VLSP datasets are document-level, each review consists of many sentences, and the input length of BERT is limited to 512 tokens, they split each review into sentences before feeding them into BERT. They also solved the two tasks in the VLSP datasets separately, treating each as a multi-label classification problem. However, their results are not as good as those of the joint architecture of Thin et al. [30].
Accordingly, in this paper, we present a joint architecture based on neural networks for document-level datasets, which trains on all aspect categories and predicts both tasks within a single architecture. In our architecture, we take advantage of the ACD task features by combining them with the ACP task features for each aspect category. Our model is trained on all categories of each domain to exploit the correlations between them. The experimental results on two document-level benchmark datasets show that our architecture is more effective than the others.

III. PROPOSED ARCHITECTURE
This section describes the details of our proposed architecture for two tasks of the ABSA problem in the Vietnamese language. We define the two tasks as follows: given an input text of length L, the objective is to identify the "Entity" and "Attribute" combinations (E#A), or the "Entity" alone, defined for each domain. Each E#A pair mentioned in the review is then analyzed for sentiment polarity (simply called sentiment) on 3 levels (positive, neutral, and negative) for the VLSP 2018 datasets or 5 levels (very positive, positive, neutral, negative, and very negative) for the UIT_ABSA 2019 dataset. Fig. 1 shows an overview of our neural model. The model consists of two components corresponding to the two tasks addressed in this paper: Aspect Category Detection (ACD) and Aspect Category with Polarity (ACP). We now describe each component of our architecture in detail.

A. Component 1: Aspect Category Detection
This component aims to learn features to predict the entities, the attributes, and the aspect categories (combinations of entities and attributes) of the input. The input text consists of a sequence of L words $w_1, w_2, \ldots, w_L$, where L is the length of the input after padding. Every word is represented as a d-dimensional word vector retrieved from the pre-trained embedding. The embedding layer E, of dimension L × d, is generated by stacking all word vectors. Then, we use a bidirectional Gated Recurrent Unit (GRU) [31] to encode both the forward and backward sequences from the embedding layer. The Bi-GRU layer produces the output vectors $H = \{h_1, h_2, \ldots, h_L\}$, where $h_i$ is the concatenation of the forward and backward hidden states of the $i$-th word. Attention is then applied to the output of the Bi-GRU layer to capture the important words in the text. A document-level review contains many words, but not all of them convey the meaning of an aspect category; therefore, we need a word attention layer to extract the informative words. Our attention layer follows Yang et al. [32]. In addition, we concatenate the max-pooling and mean-pooling representations with the attention vector $a$ to produce the input representation: $r_1 = [a; \mathrm{MaxPool}(H); \mathrm{MeanPool}(H)]$. This representation is then fed into a fully connected layer before predicting the Entity, the Attribute, and the Entity#Attribute pairs, as in Fig. 1. The fully connected layer is composed of a hidden layer and three sigmoid output layers. The sizes of the output layers are the numbers of entity, attribute, and aspect category classes for each domain. For the UIT_ABSA dataset [25], we only use the Entity as the output of this component because that dataset treats the Entity as the aspect category.
The representation $r_1$ of the input review is fed into a fully connected layer for classification. This component has three outputs corresponding to the entities, attributes, and aspect categories (Entity#Attribute). The main output is the aspect categories; however, we also include the entity and attribute outputs for their regularization effect. Therefore, we have three output layers, each of which is a fully connected layer with sigmoid activation. The size of each output is the number of entities, attributes, and aspect categories, respectively.
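The construction of the Component 1 representation can be illustrated with a small, dependency-free Python sketch. It assumes word attention of the form used by Yang et al. [32] (a tanh projection scored against a context vector), random stand-ins for the Bi-GRU outputs, and an arbitrary concatenation order; the names `attention_pool`, `component1_representation`, `W`, `b`, and `u_w` are hypothetical.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(H, W, b, u_w):
    """u_i = tanh(W h_i + b); alpha = softmax(u_i . u_w); output = sum_i alpha_i h_i."""
    scores = []
    for h in H:
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_j)
             for row, b_j in zip(W, b)]
        scores.append(sum(u_j * c_j for u_j, c_j in zip(u, u_w)))
    alpha = softmax(scores)
    dim = len(H[0])
    pooled = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(dim)]
    return pooled, alpha


def component1_representation(H, W, b, u_w):
    """Concatenate the attention vector with max pooling and mean pooling over time."""
    att, alpha = attention_pool(H, W, b, u_w)
    dim = len(H[0])
    max_pool = [max(h[k] for h in H) for k in range(dim)]
    mean_pool = [sum(h[k] for h in H) / len(H) for k in range(dim)]
    return att + max_pool + mean_pool, alpha


# Random stand-ins for the Bi-GRU outputs and the attention parameters (assumed sizes).
rng = random.Random(0)
L, dim, att_dim = 6, 4, 3
H = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(L)]
W = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(att_dim)]
b = [0.0] * att_dim
u_w = [rng.uniform(-1, 1) for _ in range(att_dim)]
r1, alpha = component1_representation(H, W, b, u_w)
print(len(r1), round(sum(alpha), 6))
```

The resulting vector is three times as wide as one Bi-GRU output, and the attention weights sum to one, so the attention part is a weighted average of the word states.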

B. Component 2: Aspect Category with Polarity
Similar to Component 1, this component aims to predict the sentiment polarity of the aspects. The embedding matrix of the input with L words $w_1, w_2, \ldots, w_L$, extracted from the embedding layer, is fed into the Bi-GRU layer to extract contextual semantic information from the text. The output for the $i$-th word is given by $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$. Here, concatenation combines the forward and backward hidden states to obtain comprehensive information about the input. This layer produces the contextual input representation $H = [h_1, \ldots, h_L]$, $H \in \mathbb{R}^{L \times d_h}$, where $d_h$ is the size of the concatenated hidden state. Next, the convolution operation is applied to the representation $H$ generated by the Bi-GRU layer. Following Kim [33], we use a CNN architecture with three convolution layers on the Bi-GRU output to extract higher-level features. Specifically, we use three parallel convolution layers with kernel sizes 3, 4, and 5 to capture n-gram features from $H$. A filter $F$ with a window of $k$ words produces a feature map vector $c$, where each element is calculated as $c_i = f(F \cdot H_{i:i+k-1} + b_0)$, where $f$ is a non-linear activation function and $b_0$ is a bias value. Each convolution layer uses the same number of filters to generate a feature map matrix $C$. The three convolution operations produce three feature map matrices $C_1$, $C_2$, $C_3$, corresponding to kernel sizes 3, 4, and 5, respectively. We then apply a max-pooling operation and a global pooling operation over each matrix to capture the most important features of each filter. Finally, we obtain three vectors $p_1$, $p_2$, $p_3$, corresponding to the feature map matrices $C_1$, $C_2$, $C_3$, respectively, and concatenate them into a final vector.
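The parallel convolutions and pooling step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not our implementation: ReLU is assumed as the non-linear activation, pooling is taken as max-over-time in the style of Kim [33], only two filters per kernel size are used, and the Bi-GRU outputs are random stand-ins. The names `conv1d_feature_map` and `cnn_features` are hypothetical.

```python
import random


def conv1d_feature_map(H, F, bias):
    """Slide one filter F (k x dim) over H (L x dim) and apply ReLU."""
    k = len(F)
    feature_map = []
    for i in range(len(H) - k + 1):
        s = bias
        for j in range(k):
            s += sum(f * h for f, h in zip(F[j], H[i + j]))
        feature_map.append(max(0.0, s))
    return feature_map


def cnn_features(H, filters_by_kernel):
    """Max-over-time pooling for every filter, concatenated across kernel sizes."""
    features = []
    for k in sorted(filters_by_kernel):
        for F, bias in filters_by_kernel[k]:
            features.append(max(conv1d_feature_map(H, F, bias)))
    return features


rng = random.Random(1)
L, dim, n_filters = 10, 4, 2  # toy sizes; assumed for illustration only
H = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(L)]
filters_by_kernel = {
    k: [([[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(k)], 0.0)
        for _ in range(n_filters)]
    for k in (3, 4, 5)
}
features = cnn_features(H, filters_by_kernel)
print(len(features))
```

Each filter contributes one pooled value, so the final vector has (number of kernel sizes) × (filters per kernel size) entries regardless of the review length.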
Because the output of Component 2 is designed as a 4-dimensional vector (not-mentioned, positive, neutral, negative) or a 6-dimensional vector (not-mentioned, very positive, positive, neutral, negative, very negative) for each aspect category, we concatenate this vector with the representation from Component 1 for the aspect category detection task to obtain the final representation of this component. This gives the component additional useful information from the aspect category detection task when predicting the sentiment polarity of each aspect.
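The final representation $r_2$ is fed into a fully connected layer followed by n softmax layers, where n is the number of aspect categories for each domain: $\hat{y}^{(i)} = \mathrm{softmax}(W_i o_2 + b_i)$ for $i = 1, \ldots, n$, where $o_2$ is the output of the fully connected layer, $W_i$ and $b_i$ are the parameters of the $i$-th softmax layer, and $\hat{y}^{(i)}$ is the 4- or 6-dimensional output of the $i$-th category for the VLSP 2018 and UIT_ABSA 2019 datasets, respectively.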
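The following short Python sketch illustrates how such per-category outputs can be read as aspect categories with polarities for the 4-class case. The argmax decoding rule, the function name `decode_predictions`, and the probability values are assumptions made for illustration; the label order follows the 4-dimensional vector described above.

```python
POLARITIES = ["not-mentioned", "positive", "neutral", "negative"]


def decode_predictions(probabilities, categories, labels=POLARITIES):
    """Keep every category whose most probable class is not 'not-mentioned'."""
    result = {}
    for category, probs in zip(categories, probabilities):
        best = max(range(len(probs)), key=probs.__getitem__)
        if labels[best] != "not-mentioned":
            result[category] = labels[best]
    return result


# Hypothetical softmax outputs for three categories of the restaurant domain.
categories = ["Service#General", "Food#Quality", "Food#Prices"]
probabilities = [
    [0.05, 0.80, 0.10, 0.05],
    [0.10, 0.05, 0.05, 0.80],
    [0.90, 0.04, 0.03, 0.03],
]
predicted = decode_predictions(probabilities, categories)
print(predicted)
```

With this encoding, a single softmax per category solves both tasks at once: the "not-mentioned" class answers the detection question, and the remaining classes give the polarity.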

C. Training Objective
As shown in Fig. 1, our architecture has three output layers for the aspect category detection task (the upper half of Fig. 1) and $n$ output layers for detecting aspects and classifying their polarity (the lower half of Fig. 1), where $n$ is the number of aspect categories for each domain. For Component 1, the prediction outputs on both datasets are multi-label; therefore, we use the Binary Cross-Entropy (BCE) loss function for each output layer. This function and the loss function of Component 1 can be formalized as follows: $\mathrm{BCE}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$ and $\mathcal{L}_1 = \mathcal{L}_{ac} + \mathcal{L}_{e} + \mathcal{L}_{a}$, where N is the output size, $\hat{y}_i$ is the $i$-th value in the model prediction, and $y_i$ is the corresponding target value. The sigmoid activation is used in the output layer. $\mathcal{L}_{ac}$ is the Binary Cross-Entropy loss on the aspect category labels, and $\mathcal{L}_{e}$ and $\mathcal{L}_{a}$ are the losses on the entity and attribute labels.
International Journal of Machine Learning and Computing, Vol. 12, No. 4, July 2022
For Component 2, the loss function is the sum of the categorical cross-entropy losses over the aspect categories: $\mathcal{L}_2 = -\sum_{t=1}^{T} \sum_{j=1}^{P} y_{t,j} \log \hat{y}_{t,j}$, where T is the number of aspect categories for each domain, $y_t$ is the one-hot vector over the sentiment classes of the $t$-th category, $\hat{y}_t$ is the corresponding softmax output, and P is the number of polarities + 1. Finally, we define the loss function of our architecture as $\mathcal{L} = \alpha \mathcal{L}_1 + \beta \mathcal{L}_2$, where $\alpha$ and $\beta$ are weighting coefficients for the two objectives. Component 2 is the primary output while Component 1 is the auxiliary output; therefore, we set $\alpha = 0.1$ and $\beta = 1$ after a grid search.
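As a worked illustration of the training objective, the sketch below computes a joint loss for one review in plain Python. It assumes that the three Binary Cross-Entropy terms of Component 1 are summed without extra weights and that the weights 0.1 and 1 apply to Component 1 and Component 2, respectively; the function names and the toy targets and predictions are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)


def binary_cross_entropy(y_true, y_pred):
    """Mean BCE over the units of one sigmoid output layer."""
    n = len(y_true)
    return -sum(t * math.log(p + EPS) + (1 - t) * math.log(1 - p + EPS)
                for t, p in zip(y_true, y_pred)) / n


def categorical_cross_entropy(one_hot, probs):
    """Cross-entropy between a one-hot target and one softmax output."""
    return -sum(t * math.log(p + EPS) for t, p in zip(one_hot, probs))


def joint_loss(acd_targets, acd_preds, acp_targets, acp_preds,
               w_acd=0.1, w_acp=1.0):
    """Weighted sum of the auxiliary (ACD) and primary (ACP) losses."""
    loss_1 = sum(binary_cross_entropy(t, p)
                 for t, p in zip(acd_targets, acd_preds))
    loss_2 = sum(categorical_cross_entropy(t, p)
                 for t, p in zip(acp_targets, acp_preds))
    return w_acd * loss_1 + w_acp * loss_2, loss_1, loss_2


# Toy example: three sigmoid heads (category, entity, attribute) and three categories.
acd_targets = [[1, 0, 1], [1, 0], [1, 1]]
acd_preds = [[0.9, 0.2, 0.7], [0.8, 0.1], [0.6, 0.9]]
acp_targets = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
acp_preds = [[0.1, 0.7, 0.1, 0.1], [0.8, 0.1, 0.05, 0.05], [0.1, 0.1, 0.2, 0.6]]
total, loss_1, loss_2 = joint_loss(acd_targets, acd_preds, acp_targets, acp_preds)
print(round(total, 4), round(loss_1, 4), round(loss_2, 4))
```

Because the auxiliary weight is small, the ACD heads mainly shape the shared representation, while the gradient signal is dominated by the ACP objective.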

IV. EXPERIMENTS

A. Datasets
In this paper, we use two standard document-level ABSA datasets in Vietnamese: VLSP 2018 [8] and UIT_ABSA 2019 [25]. The VLSP 2018 [8] datasets are annotated for the restaurant and hotel domains. For the restaurant domain, there are 12 different aspect categories (Entity#Attribute), each assigned one of three sentiments: "positive", "negative", or "neutral". There are 34 different aspect categories for the hotel domain. In contrast, the UIT_ABSA 2019 dataset [25] is annotated only for the restaurant domain, with 7 aspect categories and 5 sentiment polarities. A brief summary of the two datasets follows.
VLSP 2018: This dataset was developed by the research team of Nguyen et al. [8] to organize the ABSA shared task for the Vietnamese language in 2018. It is built at the document level and consists of two datasets, each including 4 100 reviews divided into training, validation, and testing sets. Table I and Table II show the number of reviews, the vocabulary size, and other statistics. The average review length varies across the three sets. For example, it is 54 tokens per review in the training set of the restaurant domain but 163 tokens per review in the testing set. The number of aspect categories per review also differs and is usually higher in the testing set. Moreover, because this dataset is annotated at the document level, samples often contain several aspect categories with different sentiment polarities. For those reasons, this is a challenging dataset for the Vietnamese language.
UIT_ABSA 2019: This dataset was crawled directly from the Foody website for the restaurant domain. It is the first official dataset with 5 sentiment polarities for the ABSA problem in Vietnamese. There are seven aspect categories, each assigned one of five polarity levels: "very positive", "positive", "neutral", "negative", and "very negative". Unlike VLSP 2018, its aspect categories are not combinations of entities and attributes. In total, this dataset includes 7 828 reviews divided into three sets with a ratio of 7/1/2. The authors used the iterative stratification technique to split the overall dataset so that the number of aspect categories and corresponding sentiment polarities is balanced across the three sets. However, there is still an imbalance among the aspect categories and among the sentiment polarities in the training, validation, and testing sets. For example, "Quality" and "Service" are the most frequently annotated aspect categories in the entire dataset. Among the sentiment labels, "very positive" is the most frequently annotated for all aspects, with a share of 50%.
A common point of the two datasets is that they comprise user comments crawled directly from well-known restaurant and hotel websites. Hence, they still contain many grammatical errors, broken sentence structures, teen code, and icons. The two datasets also reflect the typical challenges of ABSA datasets: a very limited amount of annotated data and high class imbalance.

B. Baselines
In this section, we present the architectures used as baseline methods against which our framework was compared.
SVM + handcraft features [
Because the reviews contain many errors in text structure, it is difficult to apply sentence segmentation. Hence, we split each review into a sequence of phrases as the input of the Sentence Encoder component.

C. Experimental Setup
Model Configuration: In the following, we describe the model hyper-parameters used in our experiments. To initialize the word embeddings, we trained 100-dimensional Skip-gram vectors on the domain corpus. Specifically, our domain corpus includes 227K and 300K sentences for the hotel and restaurant domains, respectively. We used the Gensim library [34] to train our domain embeddings.
Because each Vietnamese word can consist of one or more syllables, it is necessary to segment the input text into words. However, Vietnamese word segmentation tools are currently trained on news corpora, so it is very challenging to segment user comments, which contain many abbreviations and grammatical errors. As a result, many words are segmented incorrectly and do not exist in the pre-trained word embedding. To address this problem, we used a technique that takes advantage of the largest prefix and suffix of segmented words: the average of their vectors is used as the word vector. This technique reduces the number of out-of-vocabulary (OOV) words in the training set. In our architecture, the number of hidden units of each GRU layer was set to 256; we employed 3 kernel sizes (3, 4, 5), and the number of filters was set to 300 for the CNN architecture. The first fully connected layer was also set to 300 neurons. We trained our model for 100 epochs with a batch size of 50. For optimization, we used the RAdam [35] optimizer with default settings. We also experimented with other optimization algorithms such as Adam and Stochastic Gradient Descent (SGD); however, they were less effective for our architecture. For regularization, we applied dropout with a default rate of 0.5 to the word embeddings and the fully connected layer. To process the text inputs, we applied the same series of pre-processing steps as previous work [30] before feeding them to our model, so that the comparison reflects the effectiveness of our architecture. All models were run 5 times with different random seeds, and we report the average scores.
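One possible reading of the prefix/suffix technique for OOV words is sketched below in plain Python. It assumes that the syllables of a segmented word are joined with underscores, that the longest in-vocabulary prefix and suffix are searched at syllable boundaries, and that a zero vector is used when neither is found; these details, the function names, and the toy embedding table are assumptions rather than a description of our exact implementation.

```python
def longest_known_prefix(syllables, vocab):
    """Longest proper prefix (in syllables) that exists in the vocabulary."""
    for end in range(len(syllables) - 1, 0, -1):
        candidate = "_".join(syllables[:end])
        if candidate in vocab:
            return candidate
    return None


def longest_known_suffix(syllables, vocab):
    """Longest proper suffix (in syllables) that exists in the vocabulary."""
    for start in range(1, len(syllables)):
        candidate = "_".join(syllables[start:])
        if candidate in vocab:
            return candidate
    return None


def lookup_vector(word, embeddings, dim):
    """Return the word vector, backing off to the prefix/suffix average for OOV words."""
    if word in embeddings:
        return list(embeddings[word])
    syllables = word.split("_")
    parts = [p for p in (longest_known_prefix(syllables, embeddings),
                         longest_known_suffix(syllables, embeddings))
             if p is not None]
    if not parts:
        return [0.0] * dim  # assumed fallback when nothing matches
    return [sum(embeddings[p][k] for p in parts) / len(parts) for k in range(dim)]


# Toy 2-dimensional embedding table (hypothetical values).
embeddings = {"nhân_viên": [1.0, 0.0], "phục_vụ": [0.0, 1.0], "quán": [0.5, 0.5]}
oov_vector = lookup_vector("nhân_viên_phục_vụ", embeddings, dim=2)
unknown_vector = lookup_vector("xyz", embeddings, dim=2)
print(oov_vector, unknown_vector)
```

In this toy case, the wrongly merged word "nhân_viên_phục_vụ" is not in the table, but its prefix "nhân_viên" and suffix "phục_vụ" are, so it receives the average of their two vectors instead of being treated as unknown.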

V. RESULTS AND DISCUSSION

A. Results
In this section, we compare our experimental results with previous approaches on two datasets: VLSP 2018 [8] and UIT_ABSA 2019 [25]. Note that the Aspect Category Detection (ACD) task is to identify the aspect categories (e.g., Service#General, Food#Quality) mentioned in user comments, while the Aspect Category with Polarity (ACP) task is evaluated on the pairs of Entity#Attribute and corresponding polarity. Therefore, the effectiveness of the ACP task depends on precisely identifying the aspect categories mentioned in a given review. In this paper, we consider the ACP task as the main task, following previous studies. Below, we show the experimental results for the two tasks on both datasets. To evaluate the performance of our architecture against other methods, we use Precision, Recall, and F1-score for each task.
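The relation between the two evaluations can be illustrated with a minimal Python sketch. It assumes micro-averaged Precision, Recall, and F1 computed over label sets, where an ACP label is a (category, polarity) pair and an ACD label is the category alone; the function name `precision_recall_f1` and the two toy reviews are hypothetical, and the official evaluation scripts of the shared tasks may differ in detail.

```python
def precision_recall_f1(gold, predicted):
    """Micro-averaged scores; gold and predicted are lists of label sets, one per review."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Two toy reviews: the second category of the first review gets the wrong polarity.
gold = [
    {("Service#General", "positive"), ("Food#Quality", "negative")},
    {("Food#Prices", "neutral")},
]
pred = [
    {("Service#General", "positive"), ("Food#Quality", "positive")},
    {("Food#Prices", "neutral")},
]
acp_scores = precision_recall_f1(gold, pred)
acd_scores = precision_recall_f1(
    [{category for category, _ in labels} for labels in gold],
    [{category for category, _ in labels} for labels in pred],
)
print(acd_scores, acp_scores)
```

In this example all categories are detected correctly, so the ACD scores are perfect, while one wrong polarity lowers the ACP scores; an ACP prediction can only be correct if its category is detected first.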
Next, we show the results of our architecture for all aspect categories in both datasets. Fig. 2 and Fig. 3 present the F1-score of each aspect category for the two domains of VLSP 2018 and for the UIT_ABSA 2019 dataset, respectively. For the restaurant domain, many aspect categories achieve a high score, such as "Food#Style&Options", "Food#Quality", and "Food#Prices". However, a few aspect categories still have a low score, such as "Restaurant#Miscellaneous" and "Restaurant#Prices", because these categories are rarely annotated in both the training and testing sets. For the hotel domain, we show the results for 12 of the 34 aspect categories on the testing set. However, many aspect categories cannot be predicted, for example, "Room_Amenities#Prices" (0 samples), "Room_Amenities#Miscellaneous" (1 sample), "Rooms#Miscellaneous" (3 samples), "Food&Drink#Miscellaneous" (13 samples), and "Facilities#Miscellaneous" (33 samples).
As shown in Fig. 3, our model achieves satisfactory F1 results in the ACD task for all categories except "Miscellaneous". The results of the ACP task are relatively low for all aspect categories. One reason for the poor performance is the imbalance between the polarities of each aspect. Another reason might be that sentiment polarities in this dataset are assigned on 5 levels, which is more complicated than 3 levels and may yield worse results for the ACP task. In conclusion, these results demonstrate that our architecture can use the aspect category information from the ACD task to generate the final representation for predicting aspect categories with sentiment.
International Journal of Machine Learning and Computing, Vol. 12, No. 4, July 2022
B. Analysis
B.1 Sensitivity Analysis.
This section shows the effects of two important parameters in our architecture: the dropout rate and the word embedding size. When we analyze the effect of one parameter, the other parameters are held at their default values.
+ Dropout rate: Our architecture is a type of ensemble of neural networks with different model configurations; therefore, it can easily overfit when trying to achieve good accuracy on a small dataset. Dropout is a regularization method that effectively reduces overfitting in deep neural networks. Fig. 4 and Fig. 5 show the F1-score for different dropout values on the validation set for the hotel and restaurant domains, respectively. The dropout rate has a significant impact on the performance of the model. As shown in the three charts (Fig. 4 and Fig. 5), a dropout value between 0.5 and 0.7 helps our model achieve the best performance on the validation set.
+ Embedding dimension: In this paper, we trained two word embeddings, one on each domain corpus. We conduct a comparative experiment to examine the effect of different dimensions of the pre-trained word embeddings. We choose the vector sizes dim ∈ {50, 100, 150, 200, 300, 400, 500}. As the results in Fig. 6 show, the embedding dimension affects the two data domains differently: a high dimension helps our model achieve better results for the hotel domain, but the difference between these dimension values is not significant. As shown in Fig. 7, the best result on the UIT_ABSA 2019 dataset is achieved with an embedding dimension of 100.
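The procedure of varying one parameter while holding the others at their defaults can be sketched as follows. The default values, the dropout grid, and the scoring function `toy_validation_f1` are hypothetical placeholders introduced for this illustration; only the list of embedding dimensions is taken from the text, and the synthetic score surface is arbitrary and unrelated to the reported results.

```python
from typing import Callable, Dict, List, Tuple

# Assumed defaults and dropout grid (not stated in the paper).
DEFAULTS = {"dropout": 0.5, "embedding_dim": 300}
GRID = {
    "dropout": [0.1, 0.3, 0.5, 0.7, 0.9],
    # Embedding dimensions as listed in the text.
    "embedding_dim": [50, 100, 150, 200, 300, 400, 500],
}


def one_at_a_time(defaults: Dict[str, float],
                  grid: Dict[str, List[float]],
                  evaluate: Callable[[Dict[str, float]], float]
                  ) -> Dict[str, List[Tuple[float, float]]]:
    """Vary each parameter over its grid while all others stay at default."""
    results = {}
    for name, values in grid.items():
        scores = []
        for value in values:
            config = dict(defaults)  # copy so the defaults are never mutated
            config[name] = value
            scores.append((value, evaluate(config)))
        results[name] = scores
    return results


def best_value(scores: List[Tuple[float, float]]) -> float:
    """Return the parameter value with the highest score."""
    return max(scores, key=lambda pair: pair[1])[0]


def toy_validation_f1(config: Dict[str, float]) -> float:
    """Stand-in for 'train the model, return validation F1' (synthetic)."""
    return (1.0
            - abs(config["dropout"] - 0.3)
            - abs(config["embedding_dim"] - 200) / 1000)


sweep = one_at_a_time(DEFAULTS, GRID, toy_validation_f1)
for name, scores in sweep.items():
    print(name, "best value:", best_value(scores))
```

In practice `evaluate` would train the model with the given configuration and return the validation F1. A one-at-a-time sweep costs the sum of the grid sizes rather than their product, at the price of ignoring interactions between the two parameters.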

B.2 Case study.
In this part, we present a case study on the testing-set results of the two datasets. Table VII shows two examples with the ground-truth labels and our predicted labels for the VLSP 2018 datasets. In the first example in Table VII, our model predicts the aspect categories of the review almost correctly, except for the "Service#General" category. The user writes the sentence "Mỏi lần đợi vk đi chợ là phải làm ở đây vài ly mới chịu dc" ("Every time" I wait for my wife to go to the market, I have to eat a few things at this store), which leads our model to predict this category incorrectly. However, this sentence gives supplementary information about the writer rather than referring to the "Service#General" category. We verified this error by removing the sentence from the review, after which our model no longer predicts this category. Regarding the sentiment of each aspect category, our predictions differ from the ground-truth labels in the "Location#General" and "Food#Prices" categories. The ground-truth sentiment of "Location#General" is "neutral" while our prediction is "positive", because our model interprets the phrase "quán chè ngay phía đầu đường sau khi vừa vượt qua nhã 3" (the tea shop, which is right at the beginning of the road after crossing the "three-way crossroads") as "positive" for this category.
In contrast, our model assigned the "neutral" sentiment to "Food#Prices". According to the annotation guideline, if the user only mentions specific prices and does not express an opinion about the item, the category is assigned "neutral". In this example, the user mentions many specific food prices, such as "Đậu nành 4k/bịch, chè 10k bịch đem về nè" (soybeans at 4k a bag and tea at 10k a bag to take away), "bánh cam 4k/cái" (orange cakes at 4k each), etc. Therefore, our model predicts the "neutral" sentiment for the "Food#Prices" category. After listing the food names with their prices, however, the user concludes that the price of the food is quite acceptable in the phrase "Giá thành chợ thì khá ok rồi" (The price at this market is quite ok). Besides, the reviews contain a number of spelling mistakes. Where we put words of a translation in quotation marks, we translated the corrected form of the word. In the first example, the word "nhã 3" should be "ngã 3". In the second example, our model fails to detect "Food&Drink#Style&Option" because the phrase "cắt miếng bánh mì đen thui" (cutting pieces of bread and making it black) expresses this category only implicitly; in addition, "Room_Amenities#Design&Features" is wrongly predicted as "Room_Amenities#Quality". For the sentiment prediction of each aspect category, our model captures the correct sentiment for each category detected in the review. In this paper, we focus on predicting the pairs of aspect category and corresponding sentiment; therefore, the overall scores of our architecture depend on the ACD task.
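The check used in the first example, removing one sentence and observing whether a predicted category disappears, can be applied to every sentence of a review. The sketch below is a hypothetical illustration: `toy_predict` is a keyword-based stand-in for the trained model, and the sentence splitter is a naive regular expression that would need adapting for real Vietnamese reviews.

```python
import re
from typing import Callable, List, Set, Tuple


def split_sentences(review: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", review) if s.strip()]


def sentence_ablation(review: str,
                      predict_categories: Callable[[str], Set[str]]
                      ) -> List[Tuple[str, Set[str]]]:
    """For each sentence, list the categories that vanish when it is removed."""
    sentences = split_sentences(review)
    baseline = predict_categories(" ".join(sentences))
    effects = []
    for i, sentence in enumerate(sentences):
        reduced = " ".join(sentences[:i] + sentences[i + 1:])
        lost = baseline - predict_categories(reduced)
        if lost:
            effects.append((sentence, lost))
    return effects


def toy_predict(text: str) -> Set[str]:
    """Keyword stand-in for the trained model (hypothetical)."""
    keywords = {"wait": "Service#General", "price": "Food#Prices"}
    lowered = text.lower()
    return {category for keyword, category in keywords.items()
            if keyword in lowered}


review = "Every time I wait for my wife, I stop here. The price is quite ok."
effects = sentence_ablation(review, toy_predict)
for sentence, lost in effects:
    print(f"Removing {sentence!r} drops {sorted(lost)}")
```

Such a loop attributes each predicted category to the sentences that trigger it, which helps separate genuine aspect mentions from incidental wording like the waiting remark in the first example.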

VI. CONCLUSION AND FUTURE WORK
This paper presented an effective joint multi-task architecture based on neural networks for the Aspect Category Detection and Aspect Category with Polarity Classification tasks on Vietnamese document-level ABSA datasets. Our model trains the aspect category detection and corresponding polarity tasks jointly and combines the feature information of the ACD task with the ACP task for the final representation. Our model can predict all aspect categories with their corresponding sentiment polarities for each domain. We conducted various experiments and compared our model against several previous methods to show its effectiveness. Experimental results on two benchmark document-level datasets demonstrated that our model performs well on document-level input. To date, our model establishes new state-of-the-art results for the VLSP 2018 and UIT_ABSA 2019 datasets.
For future work, we intend to explore other neural networks to solve this problem in the Vietnamese language. Our framework can also be applied to document-level ABSA datasets in other languages. Besides, we want to combine domain knowledge and sentiment features to provide additional features for the sentiment polarity task. Moreover, using rich-resource languages as additional data through translation approaches, together with multilingual embeddings, is a potential direction for these tasks. Recently, Beltagy et al. [36] presented a new model for long documents; however, this model is not yet available for Vietnamese. Combining that architecture with our approach for the document-level ABSA problem is another potential research direction.
Ngan Luu-Thuy Nguyen is a scientist at the University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam. She received her PhD degree in information science and technology from the University of Tokyo, Japan. She was a postdoctoral researcher at the National Institute of Informatics, Japan from 2012 to 2013. Her research interests include natural language processing and data analysis.