Ensemble of Feature Ranking Methods Using Hesitant Fuzzy Sets for Sentiment Classification

Abstract—The increase in the volume of opinions posted on social media sites has led to a tremendous increase in the dimensionality of the data used for sentiment analysis. The selection of informative features from textual data can improve the performance of supervised learning methods. In this article, we propose a novel and efficient method for integrating different filter-based feature selection methods for sentiment classification. The ensemble method utilizes hesitant fuzzy sets to represent the opinions of different filter-based feature selection methods in order to optimize the relevancy score between features and class labels. Based on this relevancy score, the top-k ranked features are selected for sentiment classification. The proposed feature selection method, with Naïve Bayes and Support Vector Machine classifiers, was evaluated on three of the most widely used datasets for sentiment analysis using Unigram and Parts-of-Speech based text representation schemes. The performance is evaluated using the five-fold cross validation technique, and the results show that the proposed method can achieve higher accuracy with only 10-25% of the total extracted features. The outcomes of the comparison carried out via statistical tests confirm that aggregation using hesitant fuzzy sets is more effective than baseline feature selection methods on Parts-of-Speech features in terms of performance metrics.


I. INTRODUCTION
The views and opinions posted by people on the World Wide Web are growing exponentially. The posted content needs to be analyzed in order to help organizations, companies, and individuals make better decisions. Sentiment analysis (SA) is a research area of text mining which involves the computational identification of people's opinions, attitudes and emotions towards an entity [1]. Sentiment classification is a major task of the SA process which assigns positive or negative polarity to a given opinion document. From a machine learning perspective, it can be viewed as a supervised learning problem. The authors of [2] showed that supervised learning techniques such as Naïve Bayes (NB), Support Vector Machine (SVM) and Maximum Entropy (ME) outperform human-produced baseline approaches. Sentiment classification involves feature extraction and selection from textual opinions before the opinions given in the samples are classified into the positive or negative class.
Feature selection methods aim to improve classifier performance by reducing the dimensionality of the extracted feature set in order to overcome the overfitting problem.
The time complexity of the model can also be reduced by selecting a more informative feature subset. Feature selection techniques can be classified into three types [3]: filter methods, wrapper methods and embedded methods. The filter approach is also categorized as a univariate approach, as it selects informative features on the basis of the relationship between an individual feature and the class. These methods use statistical characteristics of the features in the selection process; the filter-based selection criteria discussed in [3] were correlation criteria and mutual information. Filter methods have low computational cost and are therefore fast and suitable for high-dimensional datasets. Their drawback is that they do not consider feature interaction, which may lead to the selection of redundant features. Wrapper approaches are multivariate: they find an optimal feature set by evaluating the goodness of a subset based on classifier performance. Though these approaches are computationally expensive, they select more relevant features than filter methods.
To overcome the problem of classifier dependency on an individual feature selection method, we propose a univariate method to integrate filter-based feature selection methods using hesitant fuzzy sets (HFS). The HFS is used in our work to aggregate the opinions of different feature rankers on a high-dimensional feature set through a function for final decision making. The aggregated values are then used to rank the features for final feature selection.
The main contributions of the paper are as follows:
• A univariate approach to ensemble filter-based feature selection methods using hesitant fuzzy sets (HFS-FS) is proposed for sentiment classification.
• Performance analysis of the proposed approach using NB and SVM classifiers on both POS-based and unigram-based feature representation schemes.
• Experimental results are obtained on three widely used datasets from the sentiment analysis domain.
• A paired t-test is conducted to establish the statistical significance of HFS-FS as compared to baseline feature selection algorithms.
The paper is organized as follows: Section II briefly presents related work in the field of sentiment analysis and feature selection methods. Section III gives a brief introduction to HFS, filter-based feature selection methods and the classifiers used in our work. Section IV describes the methodology used to integrate feature selection methods using the hesitant fuzzy set approach. Experimental results are presented in Section V. Finally, Section VI concludes the paper and outlines the scope for further enhancement of the work.
II. RELATED WORK
Sentiment classification using the machine learning approach was popularized in the early 2000s [2]. Three popular machine learning algorithms, Naïve Bayes, Support Vector Machine and Maximum Entropy, were employed by the authors for sentiment analysis of movie data. In the unsupervised learning approach proposed in [4], three steps were performed to classify a review as positive or negative. In the first step, phrases containing an adjective or adverb were extracted from the review text. In the next step, the semantic orientation of these phrases was computed using PMI-IR, and in the last phase, the overall sentiment of the review was computed by averaging the semantic orientations of the phrases. The sentiment detection of reviews [5] faced two major problems: subjectivity and sentiment classification. The authors discussed machine learning approaches and the major issues encountered while applying them to word- and document-level sentiment classification problems. In the sentiment analysis survey by [1], feature extraction/selection and sentiment classification are the two main steps in the process of sentiment analysis. The selection of appropriate features before classification can improve the performance of the classifier.
Feature engineering has become one of the most important tasks in any machine learning pipeline. The authors of [6], [7] used unigrams, bigrams, trigrams and combinations of these as features, with Naïve Bayes, Support Vector Machine, Maximum Entropy and Stochastic Gradient Descent as classifiers, for the task of polarity detection of IMDB movie reviews. The work in [8] utilized information gain as the heuristic of a genetic algorithm for selecting an optimal feature set. They evaluated their approach on the benchmark dataset of movie reviews and achieved good accuracy with the SVM classifier. New feature selectors that used SentiWordNet with Proportionate Difference (SWNPD) and SentiWordNet with Subjectivity Scores (SWNSS) were introduced in [9]. The authors also proposed a new feature weighting method based on SentiWordNet word scoring groups and polarity groups. The significant difference of their work from conventional feature selection and weighting methods was demonstrated using NB and SVM classifiers.
Rogati M. et al. [10] presented a comparison of different filter-based feature selection methods, such as document frequency, information gain, mutual information, Chi-square and term frequency, for the text classification problem. They achieved the best results using the Chi-square test and information gain on four classifiers. Chen J. et al. [11] presented two feature evaluation metrics, multi-class odds ratio and class discriminating measure, for the Naïve Bayesian classifier applied to multiclass text datasets. A new feature selection method based on the query expansion ranking approach used in information retrieval was proposed in [12]. The authors compared their approach with baseline feature selection methods on Turkish and English product and movie reviews using Naïve Bayes, SVM, Maximum Entropy and Decision Tree classifiers. In the study [13], a hybrid of feature extraction and selection for dimensionality reduction was proposed to utilize the strengths of both. Hybrid approaches take the benefits of both filter and wrapper methods to choose features with minimum redundancy and thus increase classifier performance [14].
It was found in [15] that more robust feature subsets can be generated by using an ensemble of feature selection methods rather than a single feature selection method. Ensemble methods can also solve the problem of classifier dependency on only one feature selection method. An ensemble of different feature sets and classifiers using fixed, weighted and meta-classifier combinations as integration strategies was utilized in [16] to improve classification performance. The effectiveness of the work was tested with SVM, ME and NB classifiers on part-of-speech-based and word-relation-based features. A robust and efficient feature set can be selected by aggregating the individual feature sets generated by various feature selection methods. A genetic approach was utilized in [17] to aggregate filter-based feature selection methods for optimal feature selection; the optimized feature list is the one that minimizes the Spearman footrule distance to the existing feature lists. NB and the K-nearest neighbor algorithm were used as base learners in the experimental study. A hybrid of filter and wrapper approaches is proposed in [18] in order to obtain an optimal feature set. They obtained the final feature set using two approaches: ordinal-based integration of feature vectors (OIFV) and frequency-based integration of feature vectors (FIFV). In OIFV, feature subsets were first obtained using different filter-based feature selection methods; an ordinal-based integration of these subsets then generated new feature subsets, which were evaluated by four different classifiers to obtain the optimal feature subset with the best accuracy. In FIFV, a wrapper method evaluated the feature subsets obtained by different feature rankers to generate feature vectors, and frequency-based integration was then applied to these vectors to get the final feature set.

III. PRELIMINARIES
The proposed feature selection ensemble uses hesitant fuzzy sets, filter algorithms and classifiers. These concepts are reviewed in the following subsections.

Hesitant Fuzzy Sets (HFS)
The concept of hesitant fuzzy sets and their definition was introduced in [19], [20]. A hesitant fuzzy set is defined as follows:
Definition 1: Let X be a reference set. A hesitant fuzzy set on X is defined in terms of a function h that, when applied to X, returns a subset of [0, 1].
Definition 2: Let M = {µ1, µ2, ..., µN} be a set of N membership functions. The hesitant fuzzy set associated with M is h_M(x) = ∪_{µ∈M} {µ(x)}.
HFS were introduced as an extension of fuzzy sets. The modelling of decisions using HFS was further discussed in [21]. When a decision is represented by different fuzzy sets, these can be used to aggregate the decision information through some function for final decision making. The aggregated values are used to rank the alternatives or to select a few of them.
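Definition 2 can be illustrated with a minimal sketch; the three membership functions below are hypothetical examples, not taken from the paper.

```python
# Sketch of a hesitant fuzzy element: applying a set of membership
# functions M = {mu_1, ..., mu_N} to a reference element x yields the
# subset of [0, 1] that models the hesitant opinions about x.
def hesitant_membership(x, membership_functions):
    """Return h_M(x) = {mu(x) : mu in M} as a set of values in [0, 1]."""
    return {mu(x) for mu in membership_functions}

# Three illustrative (hypothetical) membership functions.
mu1 = lambda x: min(1.0, x / 10.0)
mu2 = lambda x: (x / 10.0) ** 2
mu3 = lambda x: 0.5

# Two of the three functions agree at x = 5, so the element has
# only two distinct membership values.
print(sorted(hesitant_membership(5.0, [mu1, mu2, mu3])))  # [0.25, 0.5]
```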
International Journal of Machine Learning and Computing, Vol. 9, No. 5, October 2019

Feature Representation Methods
For sentiment analysis problems, feature engineering consists of two phases: feature identification and feature selection. The main challenge is the extraction of appropriate features and then the selection of the most valuable features before representing a piece of text as a feature vector. The main feature representation methods, as discussed in [22], are as follows:
N-gram features: These features are sequences of n objects from a given sample of text at the token/word or phoneme level. They can be classified as unigrams or N-grams (sequences of two or more words).
Parts-of-speech (POS): POS information is commonly exploited in the area of sentiment analysis. POS is the linguistic category of a word, such as noun, pronoun, adjective or adverb. POS tagging of documents is utilized to extract the adjectives in a piece of text, as this feature is highly correlated with sentence subjectivity.
Negation: These features play a major role in opinion mining as they can change the semantic meaning of a given token. The bag-of-words representations of "I like this movie" and "I don't like this movie" are the same, but the negation used in the latter changes the sentiment of the sentence.
Syntactic and semantic dependency: Documents can be parsed to extract dependencies between words, which helps in selecting a relevant feature set for sentiment classification. A parse or dependency tree can be utilized to model valence shifters [23] such as negation, diminishers and intensifiers in text classification.
In our experimental work, we used two types of feature representation methods: fixed and variable n-gram features [18]. In the fixed n-gram scheme, unigrams, bigrams and trigrams of fixed size are extracted; we used only unigram features in our work. Variable n-grams are extracted from the raw document using POS tags. First, POS tags are assigned to the whole sentence. Then, based on linguistic filters, POS patterns of variable length (1-3) are extracted. Twenty-three POS filters are utilized to detect noun, adjective, verb and adverb features.
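The variable n-gram extraction step can be sketched as follows. The sketch assumes the sentence has already been POS-tagged (the paper uses the NLTK tagger for this), and the two patterns shown are illustrative stand-ins for the paper's twenty-three filters.

```python
# Sketch of variable-length POS-pattern extraction (pattern lengths 1-3).
# `tagged` is a list of (word, POS-tag) pairs; `patterns` is the set of
# tag sequences accepted by the linguistic filter.
def extract_pos_features(tagged, patterns):
    feats = []
    for n in (1, 2, 3):                       # variable pattern length
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            tags = tuple(tag for _, tag in window)
            if tags in patterns:              # filter on the POS pattern
                feats.append(" ".join(word for word, _ in window))
    return feats

patterns = {("NN",), ("JJ", "NN")}            # bare noun, adjective+noun
tagged = [("great", "JJ"), ("movie", "NN"), ("overall", "RB")]
print(extract_pos_features(tagged, patterns))  # ['movie', 'great movie']
```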

Feature Selection Methods
The five global filter-based feature selection methods utilized in the proposed approach are discussed as follows:
Chi-square Test [24]: This is a global feature selection method used to calculate the degree of association between a feature and a class. A higher feature-class score implies that the class is more dependent on that feature and thus that the feature is more informative. The Chi-square score of a feature is calculated as
χ²(f, Ck) = n(AB − CD)² / [(A+C)(B+D)(A+D)(B+C)]
where f is the feature, Ck is the k-th class with k = 1, 2, ..., m, m is the number of classes, n is the number of documents in the dataset, A is the number of documents in which f and Ck co-occur, B is the number of documents that contain neither f nor Ck, C is the number of documents in which f occurs without Ck, and D is the number of documents in which Ck occurs without f.
Document Frequency [24]: Document Frequency (DF) is one of the simplest methods for feature ranking in text classification. It is based on the idea that informative features occur more often in the documents of the corpus. The DF of a feature f is computed by counting the number of documents in the corpus that contain f.
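Both scores can be sketched directly from document counts; the contingency values and toy documents below are hypothetical.

```python
# Sketch of the Chi-square score using the paper's contingency counts:
# a = docs containing f in class Ck, b = docs with neither f nor Ck,
# c = docs with f but not Ck, d = docs with Ck but not f.
def chi_square_score(a, b, c, d):
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + d) * (b + c)
    return n * (a * b - c * d) ** 2 / denom if denom else 0.0

# Document frequency: count the documents (token sets here) containing
# the feature.
def document_frequency(docs, feature):
    return sum(feature in doc for doc in docs)

print(round(chi_square_score(40, 30, 10, 20), 3))  # 16.667
```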
Information Gain [24]: Information gain (IG) is a commonly utilized feature selection method for text categorization problems. It is a two-sided global feature selection metric: the IG score measures how much the presence or absence of a term in a document helps in predicting the correct class of the document. The IG score is calculated as
IG(f) = −Σk P(Ck) log P(Ck) + P(f) Σk P(Ck|f) log P(Ck|f) + P(f̄) Σk P(Ck|f̄) log P(Ck|f̄)
where 1 <= k <= m and m is the number of classes.
Standard Deviation [25]: Standard deviation (SD) is a statistical tool used to measure the deviation of a value from its mean. In feature selection, Sdev calculates the amount of dispersion of a feature around its average in the feature space. A higher standard deviation shows that the feature is distributed over a large range of values and will therefore be useful in discriminating between classes. For the two-class sentiment classification problem, this value is calculated as
Sdev(fi, Ck) = sqrt( (1/nk) Σ_{j∈Ck} (xij − meank)² )
where k = 1 or 2 for the binary classification problem, xij is the weight of the i-th feature in the j-th sample, nk is the number of samples in class Ck, and meank and Sdev(fi, Ck) are the mean and standard deviation of the i-th feature with respect to the k-th class.
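The information-gain score can be sketched from document counts. This is a minimal illustration, assuming `n_cf[k]` is the number of documents of class k that contain f and `n_c[k]` is the total number of documents of class k; the counts used below are hypothetical.

```python
import math

# Sketch of the two-sided information-gain score IG(f) computed as the
# reduction of class entropy when conditioning on feature presence.
def info_gain(n_cf, n_c):
    n, n_f = sum(n_c), sum(n_cf)
    n_nf = n - n_f
    def entropy(counts, total):
        if not total:
            return 0.0
        return -sum((c / total) * math.log2(c / total)
                    for c in counts if c > 0)
    h_c = entropy(n_c, n)                                   # H(C)
    h_c_f = entropy(n_cf, n_f)                              # H(C | f present)
    h_c_nf = entropy([c - cf for c, cf in zip(n_c, n_cf)],  # H(C | f absent)
                     n_nf)
    return h_c - (n_f / n) * h_c_f - (n_nf / n) * h_c_nf

# A perfectly discriminative feature in a balanced two-class corpus
# gains one full bit.
print(info_gain([50, 0], [50, 50]))  # 1.0
```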
Gini Index [26], [27]: Gini Index (GI) is a modified version of an attribute-based selection measure, used as a feature selector for text categorization problems. It is a global feature selection method and assigns a positive score to each feature. The higher the score of a feature, the better its rank.
The GI score is calculated as
GI(f) = Σk P(f|Ck)² P(Ck|f)²
where 1 <= k <= m, m is the number of classes, P(f|Ck) is the conditional probability of feature f given class Ck, and P(Ck|f) is the conditional probability of class Ck given the presence of feature f.
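The GI score reduces to a short sketch once the per-class probabilities are known; the probability values below are hypothetical.

```python
# Sketch of the Gini-index score GI(f) = sum_k P(f|Ck)^2 * P(Ck|f)^2.
# p_f_given_c[k] = P(f|Ck), p_c_given_f[k] = P(Ck|f).
def gini_index(p_f_given_c, p_c_given_f):
    return sum((pf * pc) ** 2 for pf, pc in zip(p_f_given_c, p_c_given_f))

# A feature occurring only in the positive class gets the maximum score.
print(gini_index([1.0, 0.0], [1.0, 0.0]))  # 1.0
```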

Sentiment Classification Using Supervised Learning
We utilized the following two machine learning algorithms in our experimental study, as they achieve among the best accuracies in text classification problems.

1) Naïve Bayes classifier (NB)
Naïve Bayes is widely used in text classification as it is computationally efficient and has good classification accuracy. In recent years, the Naïve Bayes classifier has been applied to various text classification problems [28], [29]. The probabilistic model uses Bayes' theorem to predict the probability that a given feature set belongs to a particular class:
P(Ci|features) = P(features|Ci) P(Ci) / P(features)
where P(Ci) is the prior probability of class i, P(Ci|features) is the posterior probability that the feature set belongs to class i, and P(features) is the prior probability that the feature set has occurred; since P(features) is constant for all classes, it is ignored. The class Ci for which P(Ci|features) is maximized determines the class of the feature set.
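The decision rule can be sketched with log-probabilities (the usual trick to avoid underflow on long documents). The priors and word likelihoods below are illustrative values, not probabilities learned from the paper's datasets.

```python
import math

# Minimal sketch of the Naive Bayes decision rule: pick the class that
# maximises log P(Ci) + sum over features of log P(feature | Ci).
def nb_predict(features, priors, likelihoods):
    best, best_score = None, -math.inf
    for c, prior in priors.items():
        # 1e-9 acts as a crude floor for unseen words (illustrative only).
        score = math.log(prior) + sum(
            math.log(likelihoods[c].get(f, 1e-9)) for f in features)
        if score > best_score:
            best, best_score = c, score
    return best

priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {"pos": {"great": 0.20, "boring": 0.01},
               "neg": {"great": 0.02, "boring": 0.15}}
print(nb_predict(["great", "great", "boring"], priors, likelihoods))  # pos
```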

2) Support Vector Machine (SVM)
SVM was introduced as a new learning method in [30]. It is a fast and accurate classification method suitable for both linear and non-linear data. SVM is popular in text classification [31], [32] as the classifier has the ability to deal with the sparse document vectors created in the case of textual data. The method is also less prone to the overfitting problem and can deal with high-dimensional spaces. It searches for the maximal-margin hyperplane that separates the two classes using support vectors. These support vectors are the most difficult tuples to classify and carry maximum information; they also provide a compact description of the learned model. A separating hyperplane can be written as
W · X + b = 0
where W = {w1, w2, ..., wn} is a weight vector, n is the number of features and b is a scalar value known as the bias.
For a binary classification problem, any tuple that falls above the hyperplane belongs to class +1 and any tuple that falls below or on the hyperplane belongs to class −1.
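This classification rule can be sketched directly; the weight vector and bias below are illustrative, not trained values.

```python
# Sketch of the SVM decision rule for a learned hyperplane W.X + b = 0:
# tuples with W.X + b > 0 are labelled +1, the rest -1.
def svm_decision(w, x, b):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

print(svm_decision([1.0, -2.0], [3.0, 1.0], -0.5))  # 1
```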

IV. ENSEMBLE OF FEATURE SELECTION METHODS USING HFS (HFS-FS)
In this section, the ensemble of feature selection methods using hesitant fuzzy sets (HFS-FS) is presented. The method consists of two phases: in the first phase, feature relevancy is measured according to five ranking algorithms; in the second phase, the feature importance measured by the individual feature rankers is integrated using hesitant fuzzy sets. According to the existing literature, decision-making problems that involve the opinions of different experts on the same entity can be modeled using HFS. According to the basic principle of HFS, a vector of membership values is needed to generate an HFS. In our work, the HFS is generated by considering five feature selection methods: Chi-square test, Information gain, Document frequency, Standard deviation and Gini index. Each method gives a different relevancy score to every extracted feature in the dataset. The process of feature subset selection is explained below.
Phase 1: Generate the hesitant fuzzy set for each feature.
HFSi = {H1(Fi), H2(Fi), H3(Fi), H4(Fi), H5(Fi)}
where Fi represents the i-th feature and Hk(Fi) represents its membership value according to the relevancy score of the k-th feature selection method, where k varies from 1 to 5. The relevancy score of each feature is normalized to the range [0, 1] using min-max normalization, computed as
µk'(Fi) = (µk(Fi) − min(µk)) / (max(µk) − min(µk))
where max(µk) and min(µk) represent the maximum and minimum scores obtained by the k-th feature selection method.
The hesitant fuzzy set for feature Fi thus collects the five normalized scores: µ1'(Fi) is the normalized relevancy score from the Chi-square test, µ2'(Fi) is generated from Information gain, µ3'(Fi) from Document frequency, µ4'(Fi) from Standard deviation and µ5'(Fi) from the Gini index.
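The per-ranker normalization step can be sketched as follows; the score list is a hypothetical example.

```python
# Sketch of the min-max normalisation applied to each ranker's relevancy
# scores before they are combined into hesitant fuzzy sets.
def min_max_normalise(scores):
    lo, hi = min(scores), max(scores)
    if hi == lo:                  # degenerate ranker: all scores equal
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

print(min_max_normalise([2.0, 4.0, 10.0]))  # [0.0, 0.25, 1.0]
```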

Phase 2: Compute the overall information energy for HFS.
To calculate the overall information energy E_HFS of the i-th feature using the hesitant fuzzy set HFSi, the following formula is used:
E_HFS(i) = (1/k) Σ_{j=1}^{k} (µj'(Fi))²
where k is the number of feature selection methods, in our case k = 5. The information energy of the i-th feature is stored at the i-th position of a relevancy feature vector RFV of size n:
RFV(i) = E_HFS(i)
where 1 <= i <= n and n is the number of features extracted from the dataset. The features Fi are ranked according to their weight in RFV(i): the higher the value of a feature in RFV, the better its rank.
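Phase 2 can be sketched end-to-end on a toy feature set. The mean-of-squares form of the information energy follows the description above, and the normalised scores are hypothetical.

```python
# Sketch of Phase 2: the information energy of each feature's hesitant
# fuzzy set (mean of the squared normalised scores) fills the relevancy
# feature vector RFV, which then ranks the features.
def information_energy(memberships):
    return sum(m * m for m in memberships) / len(memberships)

hfs = {"good": [0.9, 0.8, 0.7, 0.9, 0.8],   # hypothetical normalised scores
       "the":  [0.2, 0.1, 0.9, 0.1, 0.2],   # from the five rankers
       "plot": [0.6, 0.5, 0.6, 0.7, 0.5]}
rfv = {f: information_energy(h) for f, h in hfs.items()}
ranked = sorted(rfv, key=rfv.get, reverse=True)
print(ranked[:2])  # the two best ranked features: ['good', 'plot']
```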
Subsets of the top-k ranked features are selected and evaluated by two different classifiers to obtain the best feature subset with maximum relevancy. The ensemble of feature rankers using hesitant fuzzy sets is explained in Algorithms 1 and 2. In Algorithm 1, the dataset is first preprocessed and Unigram or POS features are extracted from the m samples of dataset D. In the second step, the term-document matrix is generated using the term frequency-inverse document frequency (tf-idf) of the extracted features. The hesitant fuzzy set is then created for each feature using the five feature selection methods discussed in the previous section. The relevancy score of each feature is obtained and stored in the relevancy feature vector (RFV). The top-k features are the features that have the best relevancy scores in the RFV. The best accuracy score and subset size are obtained on the different datasets using both NB and SVM classifiers, as shown in Algorithm 2. To predict the real performance of the classifier in terms of accuracy, 5-fold cross validation is employed on the dataset considering the top-k features returned by Algorithm 1, where k ranges from 5-30% of the total extracted features. The value of k that results in the best accuracy score is returned as the final subset size for that classifier and dataset.
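The 5-fold splitting used inside Algorithm 2 can be sketched as follows; the round-robin fold construction is an illustrative choice, not necessarily the exact split used in the experiments.

```python
# Sketch of a five-fold split over n sample indices: each index appears
# in exactly one test fold, and the reported score is the mean over the
# five train/test runs.
def five_fold_indices(n):
    folds = [list(range(i, n, 5)) for i in range(5)]   # round-robin folds
    for i in range(5):
        test = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

covered = []
for train, test in five_fold_indices(10):
    assert len(train) + len(test) == 10
    covered.extend(test)
print(sorted(covered))  # every sample is tested exactly once
```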

V. EXPERIMENTS AND RESULTS
The performance of the proposed algorithm is evaluated with the Naïve Bayes and Support Vector Machine classification algorithms using five-fold cross validation. In this section, we discuss the performance metrics, datasets, validation technique, experimental settings and, finally, the result analysis.

Performance Metrics
To evaluate classifier performance, we used Accuracy, Precision and Recall in our work [33]. Accuracy is the percentage of test tuples that are correctly labelled by the classifier. Accuracy cannot be the only evaluation measure, as a system showing 90% accuracy may still be a poor system if it identifies only one class correctly in binary classification. For an accurate system, negative as well as positive tuples should be correctly classified. Precision and recall are used to check accuracy with respect to both positive and negative tuples. These measures are calculated as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Here TP is the number of positive items correctly classified as positive, TN is the number of negative items correctly classified as negative, FP is the number of negative tuples incorrectly classified as positive and FN is the number of positive items incorrectly classified as negative.
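The three measures can be sketched from confusion-matrix counts; the counts below are hypothetical.

```python
# Sketch of the three evaluation measures from confusion-matrix counts.
def evaluation_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

print(evaluation_metrics(40, 40, 10, 10))  # (0.8, 0.8, 0.8)
```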

Cross Validation Technique
We used the five-fold cross validation technique to get reliable and unbiased results from the classifier. In this technique, the dataset is divided into 5 folds, of which four folds are kept for training and one fold for testing. In each run, the test set is varied and the performance of the classifier is evaluated. This process is repeated 5 times, and the mean of the predicted results is computed to get the final performance of the classifier.

Datasets
The experiments are conducted on the most widely used datasets for sentiment analysis: the Movie Review dataset [34] and Amazon product reviews [35]. The Amazon product reviews dataset consists of reviews of 25 different products; we used reviews of only two products, Book and Music, from this dataset. The datasets used in our study are balanced, containing equal numbers of positive and negative samples. The dataset statistics are shown in Table I.

Preprocessing
To extract unigram and POS features from the text reviews, as shown in [18], we adopted two different preprocessing methods. In the first step, each review or document is converted to lower case and then tokenized into sentences. In the next step, each sentence is tokenized into words to extract unigram features. In this process, all stop words and words with a document frequency of less than 5 are eliminated. After feature extraction from the reviews, the term-document matrix is constructed using the popular TF-IDF feature weighting scheme. To extract POS features from text, each word in a tokenized sentence is further tagged by the NLTK POS tagger. POS patterns of length 1-3 are used as linguistic filters to annotate the reviews using POS tags. Four classes of patterns can represent a feature in sentiment analysis problems; the class of feature and the linguistic filters used to extract it are shown in Table II. The last column of Table II shows a few POS features extracted from the Music Review dataset using POS patterns. Table III shows the feature names and relevancy scores of some of the best ranked top-100 POS features from the Book dataset. The top ranked feature 'book' has a relevancy score of 0.601. POS features are ranked on the basis of their relevancy score, and the top-k features are then selected for evaluating the performance of the classifiers.
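The TF-IDF weighting used to build the term-document matrix can be sketched as below. This is a minimal raw-count variant for illustration (the paper uses the standard scheme via its toolchain, which applies its own smoothing); the two toy documents are hypothetical.

```python
import math

# Sketch of tf-idf weighting: tf is the term's relative frequency in a
# document, idf = log(N / df) penalises terms that occur in many docs.
def tf_idf(docs):
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    matrix = []
    for doc in docs:
        total = len(doc)
        counts = {t: doc.count(t) for t in set(doc)}
        matrix.append({t: (c / total) * math.log(n / df[t])
                       for t, c in counts.items()})
    return matrix

m = tf_idf([["good", "movie"], ["bad", "movie"]])
print(m[0]["movie"])  # 0.0 -- 'movie' occurs in every document
```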

Experimental Settings
We conducted the experiments on a Lenovo Ideapad 310 with a 2.5 GHz Intel Core i5 processor and 8 GB RAM. Anaconda with Python 3.4 was used to implement our project. For the initial steps of preprocessing and feature extraction, NLTK was utilized; it is a Python toolkit with a set of text processing libraries for classification, tokenization, stemming, tagging, parsing, etc. Scikit-learn (sklearn), a simple and efficient Python tool for data mining and data analysis built on NumPy, SciPy and Matplotlib, was used for sentiment classification. For data analysis and graph plotting, the Matplotlib library was utilized.

Performance Evaluation
We evaluated the performance of our proposed system on three review datasets, Movie, Book and Music, using two feature representation schemes: Unigram and POS features. We employed Naïve Bayes and Support Vector Machine for classification as they are the most popular algorithms in the area of text classification. The accuracy, precision and recall of the NB and SVM classifiers on the three datasets using the 5-fold cross validation technique are presented in this section. Table IV shows the classifier performance in terms of accuracy, precision and recall on all three datasets without feature selection, for both feature representation schemes. We can observe from Table IV that SVM outperforms NB in terms of accuracy and precision on the Movie and Book review datasets represented by both unigram and POS features, whereas NB outperforms SVM in terms of recall for the same. Figs. 1-3 show the detailed analysis in terms of precision, recall and accuracy on all three datasets with varying sizes of the feature subset. The size of the feature subset is chosen based on the number of total extracted features using both types of feature representation. The analysis is done by choosing the 5-30% best ranked features, termed 'k', from the proposed HFS-FS. Fig. 1(a) and Fig. 1(b) depict the classifiers' performance on unigram features of Movie Reviews. The value of k is chosen in the range of 1000-5000 with a step of 200. As shown in the figure, both classifiers give a best accuracy value of around 90% at k=2400, but the precision value is best at 2400 for the SVM classifier and 2000 for the NB classifier. The performance graphs in Fig. 1(c) and Fig. 1(d) depict the best performance of the classifiers on selecting around 20000 features, with an average accuracy score of 90%. The value of k is chosen in the range of 5000-25000 with a step of 1000. Fig. 2(c) and Fig. 2(d) depict the classifier performance on POS features, where the value of k is 1000-10000 with a step of 500. In this case also, NB outperforms SVM and gives an accuracy of 87.5% on selecting only 6500 informative features (around 23%) from the full feature set of size 28033. The graphs also show that a small feature size of only 1000 from the extracted unigrams can give good results on Book reviews using our proposed method. Fig. 3(a) and Fig. 3(b) depict the classifiers' performance on unigram features of Music Reviews. The value of k is chosen in the range of 1000-5000 with a step of 200. The NB classifier outperforms SVM and gives a best accuracy value of around 87.6% and a precision value of 88.5% at k=1800. The performance graphs in Fig. 3(c) and Fig. 3(d) depict the classifier performance on POS features, where the value of k is in the range of 1000-6000 with a step of 200. In this case also, NB outperforms SVM and gives an accuracy of 86.39% on selecting only 2800 informative features (around 11%) from the full feature set of size 25798. The graphs also show that only a small feature size from the extracted POS features can give good results on Music reviews using our proposed method. The accuracy of both classifiers on Unigram and POS features achieves a significant rise of around 6% using HFS-FS on Movie Reviews. The accuracy scores of both classifiers on Unigram and POS features using our proposed feature selector improved by around 10% on Book reviews. The Music reviews dataset also shows a tremendous improvement of around 10% in accuracy on both types of features using HFS-FS.
Based on the results obtained, Naïve Bayes performs better than SVM on all datasets except Movie Reviews. Our proposed method also gives better accuracy when HFS-FS selects from Unigram features in the case of the Book and Music datasets, whereas the Movie reviews dataset achieves better performance with POS features. The results also indicate a reduction in the dimension of the unigram and POS features to 14% on average across all three datasets.

Discussions
The proposed ensemble of filter-based feature selectors is independent of the individual classifier. The selection of features is based on the relevancy score of the features, which is computed by the integration of different filter-based feature selection methods using hesitant fuzzy sets. The low computational cost of the algorithm makes it suitable for large datasets. The ensemble of feature selectors benefits the weak filter-based feature selection methods without compromising the cost. To show that there is a significant difference between the proposed approach and the baseline feature selection methods, the paired t-test [36] is conducted. Table VI shows the results of the statistical paired t-test based on the accuracy scores obtained from the 5-fold cross validation method using HFS-FS or the baseline algorithms for feature selection. We chose Chi-square, Information gain and Gini index as baseline feature rankers and evaluated their performance using each of the classifiers. The t-value indicates the paired t-test value for comparing the means of the two methods. Based on this value and the 5% significance level, the hypothesis H is rejected or accepted: H=0 indicates no statistically significant difference and H=1 indicates a statistically significant difference between the baseline and the proposed method. The results on all three datasets shown in Table VI depict that HFS-FS outperforms all baseline feature selection methods except the Chi-square test when classifier performance is predicted on unigram features. The POS-based features, when selected using HFS-FS, show a greater improvement in accuracy than all baseline algorithms, and the statistical test indicates this difference in accuracy to be statistically significant on all datasets.
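The paired t statistic over per-fold accuracies can be sketched as below; the two score lists are hypothetical, not the paper's results.

```python
import math

# Sketch of the paired t statistic: compare two methods by the mean and
# variance of their per-fold accuracy differences (df = n - 1 = 4 here).
def paired_t(scores_a, scores_b):
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)

hfs_fs   = [0.88, 0.90, 0.87, 0.89, 0.91]   # per-fold accuracy, method A
baseline = [0.84, 0.86, 0.85, 0.85, 0.86]   # per-fold accuracy, method B
t = paired_t(hfs_fs, baseline)
print(round(t, 2))
```

With df = 4, a |t| above the critical value 2.776 rejects the null hypothesis at the 5% level (H=1).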
Finally, our proposed approach is compared with the works [16], [18] in terms of accuracy on all three datasets using both types of feature representation methods. The comparison is done to determine whether our approach outperforms the integration methods proposed in the past. Table VII shows the comparison between our approach and the baseline integration approaches used in the area of sentiment analysis. It may be noted that results on the Music dataset are not available in [16].
The results indicate that HFS-FS outperforms the baseline integration algorithms in terms of accuracy on the Book and Music datasets. It shows around 2% improvement in the average accuracy score when the best ranked k features are selected using our proposed ensemble of feature selection methods. On the Movie review dataset, however, the difference is not significant.