Sentiment Analysis based on Soft Clustering through Dimensionality Reduction Technique

Clustering-based sentiment analysis opens new directions for analyzing real-world opinions without human participation or the overhead of pre-tagged training data. Clustering-based techniques do not rely on linguistic information and are more convenient than other traditional machine learning techniques. Combining dimensionality reduction techniques with clustering algorithms strongly reduces the computational cost and improves the performance of sentiment analysis. In this research, we applied the Principal Component Analysis technique to reduce the size of the feature set. This reduced feature set improves the results of binary K-means clustering for sentiment analysis. In our experiments, we demonstrate that the clustering system with a reduced feature set provides high-quality sentiment analysis. However, K-means clustering has its own limitations, such as hard assignment and instability of results. To overcome the limitations of the traditional K-means algorithm, we applied a soft clustering approach (the Expectation Maximization algorithm), which stabilizes clustering accuracy by allowing a soft assignment of documents to clusters. Consequently, our experimental accuracy is 95% with a standard deviation of 0.1%, which is sufficient to apply the clustering technique in real-world applications.


INTRODUCTION
In this advanced era of technology, due to the wide use of the World Wide Web, a large amount of data is available over the internet. The volume of online sources is increasing beyond storage capacity [1]. Most of the data is available in the form of text, video, and images. Textual information is categorized into two types: facts and opinions. Facts are objective statements about an event or property. Opinions are subjective statements in which people describe their feelings and expressions. Researchers are actively investigating automatic text-processing techniques to extract useful information and improve e-commerce, science, society, and national security. Consumers commonly consult online reviews at least once before purchasing a product. More than 81% of analysis reports [8] illustrate that product reviews or sentiments have a great influence on purchasing [9] and merchandising [10]. Before the World Wide Web, people had limited choices for making a decision: themselves, their friends, and their families. However, analyzing opinions and monitoring user-generated content is still not a simple task due to the diverse nature of practical applications, different languages, and the huge volume of textual data. Most blogs and tweets contain hidden opinions. Thus, automated opinion or sentiment analysis is a crucial task of natural language processing. Currently, Sentiment Analysis (SA), computational linguistics, and text mining are growing disciplines of Information Retrieval (IR) and natural language processing.
The bulk of unstructured and subjective information in online reviews requires a statistical approach to analyze the textual data. Researchers apply automatic techniques such as supervised machine learning and symbolic techniques for opinion processing [11]. The supervised machine learning technique achieves very high accuracy but requires human involvement, in particular a large benchmark dataset, and training the classification model on this dataset consumes a lot of time. On the other hand, the symbolic technique gives limited accuracy but does not involve humans; its performance depends on the scoring method.
A novel clustering-based technique for sentiment analysis was proposed by Li and Liu [12] to overcome major issues in traditional supervised machine learning and symbolic techniques. The clustering method neither requires human involvement or prior knowledge to explore important features in unstructured data, nor rejects linguistic information [13]. However, the clustering method has not yet proved comparable to traditional supervised learning techniques. Li and Liu [12] used the k-means algorithm to cluster sentiments into two groups (positive and negative). K-means clustering results are unstable due to the random selection of initial centroids. Moreover, sentiments do not only belong to a true positive or negative class; some opinions are neutral in context [14]. The aim of our research is to further enhance the clustering approach to improve the accuracy of unsupervised learning models in the domain of sentiment analysis, and to study the impact of dimensionality reduction techniques combined with soft clustering approaches.
The rest of the paper is organized as follows. Section 2 presents a literature review of sentiment analysis. Performance analysis of the existing clustering approach, together with the problem statement, is described in Section 3. Our proposed approach for producing more efficient and accurate results is described in Section 4. Step-by-step experimental analysis and results are described in Section 5. In Section 6, discussion and evaluation of the results are presented, and finally Section 7 presents the conclusion and further research directions.

LITERATURE REVIEW
Semantic orientation is the main research objective of sentiment analysis: identifying the polarity of an opinion (positive or negative) at the phrase, sentence, or document level [15]. It determines the reviewer's attitude towards the discussed topic. A sentiment contains opinionated words that express opinion polarity. For example, good, gorgeous, and amazing are positive opinionated words, while bad, ugly, and poor are negative words. In natural language, adjectives and adverbs are usually considered opinionated words. These conjoined adjectives or adverbs help to identify the opinion class of a sentiment through classification and other log-linear regression techniques [16].
The mainstream approaches to analyzing sentiments are symbolic techniques and supervised and unsupervised machine learning techniques [12]. Supervised machine learning and symbolic techniques are the traditional approaches for sentiment analysis and have been used by many researchers in the text mining domain.

Supervised Learning Techniques
The first sentiment analysis was conducted by Pang et al. [17] on a movie-review dataset. They developed classification models using supervised machine learning to classify reviews into negative and positive classes. They applied three classification models, namely Support Vector Machine (SVM), Naive Bayes (NB), and Maximum Entropy (ME), with different classification features. Their experiments achieved accuracies of 77% and 77.7% for NB and ME respectively, while SVM gave a best accuracy of 75.1%. SVM with NB features (NBSVM) produced fine results for sentiment analysis, but it used an interpolation parameter, which was the main drawback of NBSVM [18]. A few recent studies showed that NB produced results with more than 80% accuracy when sentiment analysis was applied to the movie-review dataset [19]. Multivariate Bernoulli Naive Bayes is another variant of the NB classifier, but it only performs better with unigram features [19].
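A Naive Bayes sentiment classifier of the kind discussed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the four-review training corpus is hypothetical, standing in for the labeled movie-review data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; the experiments in the literature use thousands of labeled reviews.
train = [
    ("a gorgeous and amazing film", "pos"),
    ("good acting and a great plot", "pos"),
    ("an ugly poor script", "neg"),
    ("bad movie with poor pacing", "neg"),
]

def train_nb(docs):
    """Multinomial Naive Bayes with Laplace smoothing over unigram features."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)        # class -> word frequencies
    vocab = set()
    for text, label in docs:
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify_nb(text, class_counts, word_counts, vocab):
    n_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c in class_counts:
        lp = math.log(class_counts[c] / n_docs)            # log prior
        total = sum(word_counts[c].values())
        for w in text.split():
            if w in vocab:                                  # ignore unseen words
                lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

model = train_nb(train)
print(classify_nb("an amazing plot", *model))   # prints "pos"
```

The same skeleton extends to term-presence features by deduplicating tokens per document, which is the Bernoulli variant mentioned in the text.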
K-nearest neighbor (K-NN) is a nonparametric, instance-based learning model. It assigns an output class membership to a given test sample by voting among its k nearest neighbors in the training data [20]. SA using the K-NN classifier yielded an accuracy of 69.81% on the movie-review dataset [19]. Kataria and Singh [20] combined K-NN with a Genetic Algorithm (GA), which led to higher classification accuracy, up to 90%, with a reduced training size.
Subjectivity and objectivity analysis of a sentiment is another step towards accuracy improvement: it categorizes opinion polarity by identifying subjective sentences in a review and removes extra and misleading text. Pang et al. [17] used a cut-based technique to determine clauses of subjective terms in a sentiment. They trained a simple Naive Bayes classifier on 5000 subjective and 5000 objective sentences and extracted subjectivity (PSij) and objectivity (POij) probabilities for review sentences. An association graph was built to capture the relatedness of one sentence to another, and a minimum-cut technique was applied to the graph to remove extra and misleading (objective) sentences from a review and enhance the sentiment classification results [17]. This cut-based subjectivity detection increased accuracy to 86.4% and 86.15% for NB and SVM respectively. In 2005, Pang and Lee [21] further extended their research to multi-class classification (positive, negative, and neutral) based on the author's rating. Lightweight discourse analysis [22] is another direction, which analyzes the twists and turns in a sentiment [23]. A summary of the most important articles in opinion mining is presented in Table 1.

Symbolic Techniques
The symbolic technique is a process of applying a polarity score to each term of a sentiment. This score indicates the intensity of the sentiment term in an opinion, and the average of all term scores gives the whole-document score. The scoring method therefore directly influences the performance of the symbolic technique. The simplest scoring method first used for the symbolic technique assigns a score to each term through human participation [24] using a word bank, but this approach is highly dependent on human understanding and background domain knowledge.
A WordNet-based score is calculated for only the adjective terms of reviews [25]. A graph of adjective terms is generated, with terms as nodes and edges expressing the synonym relations between words. Two baseline words, 'good' and 'bad', are used as positive and negative reference terms respectively. The distance of a word from these reference terms expresses the term's tendency towards positive or negative intensity. Based on the reference distances d('good', w) and d('bad', w), an evaluation function is used to calculate the final word score:

EVA(w) = (d('bad', w) - d('good', w)) / d('good', 'bad')

This evaluation function produces a word score in the interval [-1, 1]. Turney [27] achieved 70% accuracy based on this WordNet scoring method. Semantic relatedness in SentiWordNet for a large number of synsets of sentiment words can further increase the scoring accuracy [26]. Turney [27] also used Pointwise Mutual Information (PMI) for term scores, based on the assumption that if two words co-occur more frequently, then they are more similar [28]. The PMI of two words w1 and w2 can be calculated as

PMI(w1, w2) = log2( p(w1, w2) / (p(w1) p(w2)) )

Turney applied this scoring technique to reviews on different topics and achieved 65.83% accuracy for movie reviews and 84% for automobile data. Studies show that the accuracy is reduced when the PMI technique is applied to unbiased data as compared to a biased dataset. These traditional techniques require a lot of human understanding of Natural Language Processing (NLP) and background domain knowledge; consequently, the learning models are language- and domain-specific [29].
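The PMI-based semantic orientation above can be sketched directly from co-occurrence counts. This is a toy illustration under simplifying assumptions: the five-document corpus is hypothetical, and co-occurrence is approximated as appearance in the same document (Turney's original method used search-engine hit counts with reference words such as 'excellent' and 'poor').

```python
import math
from collections import Counter

# Hypothetical document collection standing in for web hit counts
docs = [
    "excellent plot excellent acting",
    "superb excellent direction",
    "poor script poor pacing",
    "superb but poor ending",
    "dull poor dialogue",
]

tokens = [d.split() for d in docs]
word_freq = Counter(w for t in tokens for w in t)
n = sum(word_freq.values())

def pmi(w1, w2):
    """PMI(w1, w2) = log2( p(w1, w2) / (p(w1) p(w2)) ); co-occurrence is
    approximated as appearance in the same document."""
    co = sum(1 for t in tokens if w1 in t and w2 in t)
    if co == 0:
        return float("-inf")
    p12 = co / len(tokens)
    return math.log2(p12 / ((word_freq[w1] / n) * (word_freq[w2] / n)))

def orientation(word):
    """Semantic orientation relative to the two reference words."""
    return pmi(word, "excellent") - pmi(word, "poor")

print(orientation("superb"))   # > 0: oriented towards 'excellent'
```

A positive orientation marks the word as a positive opinion indicator; a negative value marks it as negative.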

Unsupervised Learning Techniques
Instead of the two traditional techniques discussed above, Li and Liu [12] applied a clustering-based technique in the domain of sentiment analysis. They applied the basic k-means clustering algorithm for opinion analysis on movie-review data (IMDB). They randomly selected 300 positive and 300 negative reviews to obtain a fair distribution and applied the Porter Stemmer and the Stanford Part-Of-Speech (POS) tagger. Since adjectives and adverbs are good opinion indicators [30], they selected more than 6000 adjectives and adverbs for further analysis. They built term-presence and term-frequency matrices of adjectives and adverbs as feature sets from the movie reviews, and achieved 55% accuracy from both the term-frequency and term-presence matrices. These clustering results were very unstable, with standard deviations ranging from 2% to 4%. In the same publication, Li and Liu applied Term Frequency-Inverse Document Frequency (TF-IDF) weighting to the term-document matrix:

tf-idf_i = tf_i * log(|D| / df_i)

where tf_i is the frequency of term t_i in document d_i in D and df_i is the number of documents in the whole dataset D containing the term. The TF-IDF weights dramatically increased the average accuracy, up to 72% and 73% for the frequency and presence data respectively, but the accuracy fluctuation also increased sharply: the standard deviation reached 4.02% for the term-frequency data and 6.2% for the term-presence data. In order to overcome the instability issue, they used a voting mechanism for final sentiment-class identification: they ran the clustering algorithm N times and identified the two classes (positive and negative) with a voting formula (their equation (4)). Since term scores strongly affect the identification of document polarity, they further applied a score-based symbolic technique, the previously discussed WordNet scoring, to enhance the accuracy results further.
They [12] calculated the score of an adjective using the geodesic distance (equation (5)) from the reference words 'good' and 'bad', and finally obtained the polarity weight towards the reference words using an experimentally determined threshold.
By combining the term score w with the clustering method already discussed, they built a hybrid system for sentiment identification, which significantly increased the accuracy, up to 77% on average, with a low standard deviation. Consequently, the use of adjective terms decreased the dimensionality overhead, while the term score reduced the accuracy fluctuation rate.
The clustering-based sentiment analysis approach was further improved over the results discussed above [14]. The Opposite Opinion Content Processing (OOCP) technique was applied to identify opposite-opinion sentences in reviews. Negation words (not, nor, neither), together with discourse relations (conjunctions like but, although, unlike), were used to identify opposite opinions in a sentiment. The polarity of a sentence was inverted if it contained any negation words, in order to extract opposite opinions in the sentiment. Although the OOCP technique did not strongly affect the fluctuation rate, it increased accuracy by 1%.
In natural language, not every sentence in a review document contains opinionated words; extra and misleading text may be present in documents where reviewers describe their complete attitude.
The authors of reference [31] applied a cut-based subjectivity-extraction technique to remove non-opinion content from review documents. The cut-based classifier calculated the subjectivity probability Pi(S) and objectivity probability Pi(O) using a Naive Bayes model trained on 5000 subjective and 5000 objective sentences. These probabilities were used to build an association graph G[V, T], where V is the set of vertices (the sentences, together with an objective node (Obj) and a subjective node (Sub)) and T is the set of edges (association probabilities) between the nodes of V. A minimum cut was applied to G to partition the review sentences into objective and subjective. These two enhancements improved the accuracy of the k-means clustering technique by up to 4% over the previous publication, finally achieving 88.7% and 88.9% for term frequency and term presence respectively.
Apart from improvements in the binary clustering method, they further applied a three-class clustering technique to find neutral opinions in the movie-review dataset. A modified voting mechanism was used to identify positive, negative, and neutral opinions (their equation (6)). Li and Liu achieved a maximum of 60% accuracy with a balanced review-class dataset, which is well above the 33% baseline of three-class sentiment classification techniques.
Current techniques focus on binary class identification, i.e., positive and negative opinions. Neutral opinion analysis is a growing research area; identification of opinions as strongly negative, positive, or neutral was conducted by Nakov et al. [32].
Recently, Ma et al. [33] investigated different initialization procedures and weighting methods to improve the clustering performance of k-means on different polarity datasets. They achieved their best accuracy on the IMDB review dataset with the DPH Divergence From Randomness (DPH-DFR) weighting method, which assigns a weight w_ij to term t_i in a document of length dl_j. For document clustering, they applied the k-means algorithm to the DPH-DFR matrix and achieved an average accuracy of 68.2%. In their experiments, the k-means algorithm performed better than other clustering approaches, with a best accuracy of 78%.
The most recent research articles on sentiment analysis in the movie-review domain are summarized in Table 1, which compares the different techniques adopted, their efficiency, and their drawbacks.
Recent research does not apply dimensionality reduction techniques to reduce the computational cost and improve the performance of sentiment analysis. In real-world problems, the performance and effectiveness of the algorithm are the goals of the analysis. In addition, the randomness of the k-means algorithm makes it a mathematically unattractive model, and initialization is another issue of the k-means clustering algorithm. Due to the instability of k-means results, multiple iterations of the clustering algorithm [12] are required for stable outcomes, which is not suitable for clustering large real-world datasets. All these issues point towards a new research trend in unsupervised sentiment analysis.
The existing clustering approach is applicable in real-world analyses where there is no previous knowledge about the patterns inside the samples and their grouping. However, this approach is not as mature as classification techniques and requires further improvement to enhance the performance of clustering-based sentiment analysis. In this research, we first applied a dimensionality reduction technique to improve the performance of the traditional k-means clustering approach. Second, we applied a probabilistic (soft) clustering technique to overcome the issues of the k-means algorithm and extract more stable clustering results.

PROPOSED SOLUTION
This section describes the organization and methodology of our approach. Our workflow is described in Fig. 1. We used the Pang and Lee Internet Movie Review Database (IMDB) (http://www.cs.cornell.edu/people/pabo/movie-review-data/) because of its extensive use in recent research in the sentiment analysis domain [8,12,14,21,31]. The IMDB review dataset consists of 2000 review documents, including 1000 positive and 1000 negative movie reviews. Due to computational cost and space complexity, we randomly selected 600 documents, 300 positive and 300 negative, from the IMDB review documents. This selection greatly reduces the data-dimension issue in the experiments.

Pre-Processing
To transform the textual data into statistical form for ML algorithms, we applied basic NLP tasks and obtained a well-organized bag of words for term-document matrix generation. We used C# libraries for text pre-processing. We parsed the review documents and generated a list of tokens for further processing. We then used the maximum entropy Stanford Part-of-Speech (POS) tagger [34] to tag the bag of words; the Stanford tagger applies POS tags with an accuracy of up to 78%. Since adjectives play a key role in sentiment polarity identification, we selected only adjectives for further analysis. A Term-Document Matrix (TDM) is a suitable approach to convert textual data into a statistical layout.

TF-IDF Weighting
We also applied the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme to highlight term importance within a document and across the whole corpus:

TF-IDF weight_i = tf_i * log(|D| / df_i)
We applied this weighting to our TDM and generated new document matrices: TF-IDF for the term-frequency version of the data and TP-IDF for the term-presence version.
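The weighting step can be sketched with NumPy on a toy term-document matrix. The counts below are hypothetical (the matrix in our experiments is 600 documents by 1223 adjectives); a term that occurs in every document receives an IDF of zero and is thus suppressed.

```python
import numpy as np

# Toy term-frequency matrix: rows = documents, columns = terms (hypothetical counts)
tf = np.array([
    [2, 0, 1],
    [0, 3, 1],
    [1, 1, 1],
], dtype=float)

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)        # document frequency df_i of each term
idf = np.log(n_docs / df)        # idf_i = log(|D| / df_i), as in the formula above
tfidf = tf * idf                 # w_ij = tf_ij * idf_i

print(np.round(tfidf, 3))
```

The third term occurs in all three documents, so its column of `tfidf` is all zeros; replacing `tf` with a 0/1 presence matrix yields the TP-IDF variant.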

Dimensionality Reduction
In real-world applications, the dataset is large and contains a huge number of adjectives, and further processing of these adjectives is a demanding task. We can reduce the data-matrix overhead by applying a dimensionality reduction technique to the TF-IDF or TP-IDF matrix. We applied Principal Component Analysis (PCA), which reduces a high-dimensional dataset to lower dimensions without losing the semantic relatedness between correlated variables, by re-expressing these variables in a new space [35]. PCA helps to express the data in terms of similarities and differences through an orthogonal transformation of the original data. PCA is based on the Singular Value Decomposition (SVD) factorization (Equation (9)) to obtain a low-rank approximation of the original dataset:

X = S Σ V^T  (9)
where S is the d × d matrix of orthonormalized eigenvectors of XX^T (corresponding to the largest eigenvalues), V is the e × e matrix of orthonormalized eigenvectors of X^T X, and Σ is a diagonal d × e matrix whose diagonal values are the square roots of the non-negative real eigenvalues of XX^T; V^T is the transpose of V. From this factorization it follows that

X^T X = V Σ S^T S Σ V^T = V Σ² V^T  (10)
X X^T = S Σ V^T V Σ S^T = S Σ² S^T  (11)

so, for mean-centred data, the covariance matrix (proportional to X^T X) has exactly the columns of V as its eigenvectors, which is the basis of PCA. After applying PCA to the TF-IDF and TP-IDF matrices, we constructed new transformed matrices, called TF-IDF-PCA and TP-IDF-PCA. This technique reduces our matrix to very few dimensions (fewer than the number of documents used in the experiments). We randomly selected different numbers of principal components and examined the clustering process to find the number of PCs that gives the most accurate and stable results.
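The SVD-based reduction can be sketched as follows. A random matrix stands in for the TF-IDF matrix here; the document scores in PC space equal the left singular vectors scaled by the singular values, consistent with equations (10) and (11).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a TF-IDF matrix: 8 documents x 20 terms
X = rng.random((8, 20))

Xc = X - X.mean(axis=0)                                  # mean-centre before PCA
S, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)    # X_c = S * diag(sigma) * V^T

k = 2                                                    # keep the first k principal components
scores = Xc @ Vt[:k].T                                   # document coordinates in PC space

# Fraction of total variance captured by the retained components
var_explained = (sigma[:k] ** 2).sum() / (sigma ** 2).sum()
print(scores.shape, round(var_explained, 3))
```

The reduced `scores` matrix (documents by k components) is what the clustering step then operates on, in place of the full term-document matrix.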

K-Means Clustering
The k-means algorithm is the most popular clustering algorithm for finding K natural groups in a dataset, owing to its easy implementation. Following Li and Liu's experiments [12], we used the cosine distance function in the k-means algorithm to diminish the variable-length issue among documents. We applied the voting method of equation (6), previously used to stabilize the k-means results.
The k-means algorithm is mathematically not an ideal clustering model. It is a hard clustering approach in which each sample is associated with only one cluster, which is often not acceptable. Due to the random selection of initial centroids, it produces unstable results. Unlike this traditional approach, a soft clustering model associates each sample with probabilities of belonging to each cluster (soft assignment).
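The cosine-distance k-means step can be sketched in pure NumPy. Two well-separated synthetic groups stand in for the document vectors, and for a reproducible sketch the initialization is made deterministic (first document plus the document least similar to it); the random initialization actually used in the experiments is precisely the source of the instability discussed above.

```python
import numpy as np

def kmeans_cosine(X, k=2, n_iter=50):
    """Minimal k-means sketch using cosine distance: L2-normalising the rows makes
    Euclidean nearest-centroid equivalent to highest cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Deterministic initialisation for this sketch (k = 2): the first document
    # and the document least similar to it.
    centroids = np.vstack([Xn[0], Xn[int((Xn @ Xn[0]).argmin())]])
    for _ in range(n_iter):
        sims = Xn @ centroids.T                      # cosine similarity to each centroid
        labels = sims.argmax(axis=1)
        new = np.vstack([
            Xn[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        new /= np.linalg.norm(new, axis=1, keepdims=True)
        if np.allclose(new, centroids):              # converged
            break
        centroids = new
    return labels

# Two hypothetical, well-separated document groups in feature space
rng = np.random.default_rng(1)
A = rng.normal(loc=[5, 0, 0], scale=0.1, size=(10, 3))
B = rng.normal(loc=[0, 5, 0], scale=0.1, size=(10, 3))
labels = kmeans_cosine(np.vstack([A, B]))
print(labels)
```

With random initialization instead, different runs can converge to different partitions, which is why the voting mechanism over repeated runs is needed to stabilize the output.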

Expectation Maximization
The Expectation Maximization (EM) algorithm is based on a Gaussian mixture model of probability distributions. For a k-component mixture over a random variable x, the estimated Probability Density Function (PDF) is

p(x) = Σ_{i=1}^{k} w_i N(x | μ_i, Σ_i)

where N(x | μ_i, Σ_i) is a Gaussian density with mean μ_i and covariance Σ_i, and w_i is the mixture-component weight, subject to the constraint Σ_{i=1}^{k} w_i = 1. The EM algorithm (a soft version of the k-means algorithm) is an iterative algorithm that finds a maximum-likelihood estimate of these parameters via posterior probabilities, starting from a few initial data points. The algorithm consists of two steps, the Expectation step and the Maximization step. The Expectation step computes the posterior responsibilities given the current latent variables (mean (μ), covariance (Σ), and weight (w)), and the Maximization step updates the learning model through the values computed in the Expectation step. For d-dimensional data with N samples x_1, …, x_N, the Expectation step computes

γ_ij = w_j N(x_i | μ_j, Σ_j) / Σ_{l=1}^{k} w_l N(x_i | μ_l, Σ_l)

and the latent variables, mean (μ), covariance (Σ), and mixture-component weight (w), are recomputed in the Maximization step as

N_j = Σ_{i=1}^{N} γ_ij,   w_j = N_j / N,   μ_j = (1/N_j) Σ_{i=1}^{N} γ_ij x_i,   Σ_j = (1/N_j) Σ_{i=1}^{N} γ_ij (x_i − μ_j)(x_i − μ_j)^T

These updated latent variables are then fed back into the Expectation step, iteratively. In our experiments, we used MATLAB functions for the EM algorithm. Since the EM algorithm requires an initial start for the Expectation step, we used the k-means clustering output as the initial start for EM.
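The two EM steps above can be sketched in NumPy for a spherical (scalar-variance) Gaussian mixture. This is a simplified illustration of the standard algorithm, not our MATLAB pipeline: the two synthetic blobs are hypothetical data, and a crude two-point initialization stands in for the k-means output used as EM's starting point.

```python
import numpy as np

def em_gmm(X, k=2, n_iter=100):
    """Minimal EM sketch for a k=2 spherical Gaussian mixture (soft k-means)."""
    n, d = X.shape
    # Crude initialisation (k = 2): the first point and the point farthest from it.
    mu = X[[0, int(np.linalg.norm(X - X[0], axis=1).argmax())]].copy()
    var = np.full(k, X.var())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities gamma_ij, proportional to w_j * N(x_i | mu_j, var_j I)
        dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)      # (n, k)
        logp = np.log(w) - 0.5 * d * np.log(2 * np.pi * var) - dist2 / (2 * var)
        logp -= logp.max(axis=1, keepdims=True)                           # numerical stability
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and spherical variances
        nk = gamma.sum(axis=0)
        w = nk / n
        mu = (gamma.T @ X) / nk[:, None]
        var = np.array([(gamma[:, j] * ((X - mu[j]) ** 2).sum(axis=1)).sum()
                        / (d * nk[j]) for j in range(k)])
    return gamma, mu

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (15, 2)), rng.normal(5, 0.5, (15, 2))])
gamma, mu = em_gmm(X)
labels = gamma.argmax(axis=1)   # soft posteriors -> hard labels when needed
```

The final line is exactly the label-identification rule used later: each document goes to the cluster with the highest posterior probability, while `gamma` itself retains the soft membership.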

Label Identification from EM Probabilities
Unlike traditional k-means clustering, EM computes probabilities for the predicted labels associated with each data sample, thereby providing soft membership for the sample documents. We associate a document with the cluster for which it has the highest association probability.

EXPERIMENTAL RESULTS
In this section, we describe the experiments performed for clustering-based sentiment analysis. We performed step-by-step experiments to analyze the performance impact of the dimensionality reduction technique and of the soft clustering algorithm on sentiment analysis.

Data Acquisition
In our experiments, we used two processed versions of the raw IMDB reviews after lightweight pre-processing. IMDB provides a processed version of the raw reviews, with noise and extra HTML tags removed (by four different authors), that is available online. The authors processed this version, V1, of the dataset and labeled the documents according to the rating scale. They further applied a subjectivity-extraction technique [31] to the scaled V1 dataset to produce version V1'. This dataset, extracted by subjectivity summarization, is available on the same site.

Experiment 1: Preliminary Investigation
We applied the TF-IDF weighting mechanism to a subset of the IMDB dataset (the scaled version V1 and the subjectivity-extracted version V1' provided by Pang and Lee). We computed the TF-IDF and TP-IDF matrices for term frequency and term presence respectively; the matrix dimensions are 600-by-1223 for the 600 randomly selected movie-review documents. We applied the K-means algorithm with the voting method (used in [14]) to obtain natural groups of sentiments. We applied the clustering step repeatedly to acquire a set of results and computed the average accuracy. The standard deviation of this accuracy set (Table 3) describes the efficiency and consistency of our clustering models, shown in Fig. 2.

Experiment 2: Accuracy Enhancement of the K-means Algorithm through PCA
After applying PCA to the TF-IDF and TP-IDF matrices, we constructed new transformed matrices called TF-IDF-PCA and TP-IDF-PCA. This technique reduces the matrices to very few dimensions (fewer than the number of documents used in the experiments).
We randomly selected different numbers of principal components and examined the clustering process to find the number of PCs that gives the most accurate and stable results. We then applied the K-means algorithm to these reduced-dimension PC matrices. After applying the K-means approach 30 times, we obtained the results of the clustering procedure (denoted TFP-K and TPP-K for the scaled V1 data, and TF'P-K and TP'P-K for the subjectivity-extracted version) displayed in Table 4.
We obtained a significant improvement in the results denoted TF'P-K and TP'P-K; thus, the processed data produced more accurate and stable results than the baseline approach.

Experiment 3: Performance Enhancement of Sentiment Analysis through the EM Algorithm
In our third experiment, we applied the EM algorithm to the reduced matrices. The results of the EM algorithm applied to the PC matrices for both versions of the data (V1 and V1') are described in Table 5.
The result of applying the k-means algorithm to the principal components of term frequency and term presence is elaborated in Fig. 4.

Impact of PCA on Performance of K-means Algorithm
According to Ding and He [36], PCA provides a continuous solution to binary clustering: by their Theorem 2.2, the objective function of k-means clustering, minimizing the sum of squared errors between clustered points and centroids, is relaxed by the principal components. PCA rotates the original data into a new space and concentrates the variance in the first few dimensions. In addition, Napoleon and Pavalakodi [37] give evidence that the first principal components provide good initial centroids for k-means clustering. PCA-based initialization is a deterministic method for k-means that performs well on large datasets (survey by Celebi et al. [38]). Our experiments confirm these theoretical results.
A comparison of the baseline results with PCA-K-means in Fig. 5 and Fig. 6 shows the improvement in clustering accuracy through dimensionality reduction: PCA proportionally improves the clustering accuracy for the TF-IDF and TP-IDF matrices (Fig. 5). For the subjectivity-extracted scaled version V1' data, the maximum accuracy achieved after dimensionality reduction is 96.83% for TP'P-K, a significant improvement over the 89.66% baseline result in Fig. 6. Furthermore, the low standard deviation demonstrates the value of applying PCA before the K-means clustering technique.

Performance comparison of EM and K-means Algorithm
The EM algorithm is not based on a cluster distance function. Instead, it computes, for each observation, probabilities based on Gaussian distributions and assigns the observation to the corresponding cluster group.
The objective function of the EM algorithm is to maximize the overall likelihood of the dataset for the final cluster solution. Due to its probabilistic nature, the EM algorithm provides review labels as probabilities associated with each cluster. Deciding cluster membership through these probabilities is more accurate than the voting method previously devised for k-means clustering.
The EM algorithm (a soft clustering approach) is a probabilistic framework that provides a mathematically principled way to understand and address the limitations of K-means. Consequently, we apply the EM algorithm for a fine-grained opinion analysis.
According to the standard deviation rates (shown in Fig. 7), the accuracy results of the EM algorithm are more stable (a relatively straight line) than those of the k-means clustering technique. Furthermore, since this technique does not require the voting mechanism, it executes faster than the k-means approach.
In our experiments, documents that have equal posterior probability for both clusters (negative and positive) are denoted as multi-class documents. There are four such documents among the 600 reviews; they are classified as negative as well as positive simultaneously, because the ratio of positive and negative opinions in these reviews is almost balanced.

Significance Testing
The experimental results, examined through statistical inference, allow us to assess the evidence in favor of our proposed technique. To verify the impact of the dimensionality reduction technique on clustering performance, we applied significance testing to our experimental results. Since a pre-tagged dataset is available in our case, we are able to apply McNemar's test [39] to support our proposed clustering approach.

Performance Analysis of K-means Clustering with PCA: Hypotheses
We validate the effectiveness of the k-means clustering results after applying the PCA technique through the following hypotheses:
H0: PCA-K-means is no more accurate than the baseline approach.
H1: PCA-K-means performs better (higher accuracy) than the baseline approach.

Evaluation
Let MBaseline denote the baseline results and MPCA-Kmeans our proposed results. McNemar's statistics highlighting the significance of MPCA-Kmeans are displayed in Table 6.
h = 1 indicates rejection of the null hypothesis (that MPCA-Kmeans is no more accurate than MBaseline) at the 0.05 significance level. The asymptotic p-value < 0.05 provides strong evidence to reject the null hypothesis. Consequently, the alternative hypothesis is accepted, which indicates that the improvement of MPCA-Kmeans is statistically significant. Furthermore, the clustering error rates confirm that MPCA-Kmeans is more accurate than MBaseline.
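The McNemar statistic can be computed directly from the per-document correctness of the two systems. The 0/1 vectors below are hypothetical illustrations, not our experimental outputs; the test uses only the discordant pairs, i.e., documents on which exactly one of the two systems is correct.

```python
# Hypothetical per-document correctness (1 = cluster label matches the gold
# sentiment label) for the baseline and the PCA-enhanced clustering.
baseline   = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
pca_kmeans = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]

# Discordant counts: b = baseline right / proposed wrong,
#                    c = baseline wrong / proposed right.
b = sum(1 for x, y in zip(baseline, pca_kmeans) if x == 1 and y == 0)
c = sum(1 for x, y in zip(baseline, pca_kmeans) if x == 0 and y == 1)

# Continuity-corrected McNemar statistic, chi-squared with 1 degree of freedom;
# values above 3.84 reject the null hypothesis at the 0.05 level.
chi2 = (abs(b - c) - 1) ** 2 / (b + c)
print(b, c, round(chi2, 3))   # prints: 0 11 9.091
```

Here the statistic exceeds the 3.84 critical value, so on this toy data the null hypothesis of equal accuracy would be rejected, mirroring the reasoning applied to Table 6.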
Based on the above experiments, we compared our clustering results (Table 7) with previous k-means clustering approaches. We observed that TF-IDF weighting alone is not enough to achieve stable results, and hybrid approaches are needed [12,14]. The hybrid technique of Li and Liu [12] is effective in achieving stable results, but the term-score calculation adds computational cost, albeit as a one-off task. Subjectivity extraction has a positive impact on binary clustering accuracy [14]. Bisecting k-means [33] does not produce the most stable and accurate results. Although our k-means clustering results are close to, and in some cases better than, supervised machine learning results, the voting method for k-means clustering imposes an overhead for large-dataset processing. We therefore applied the EM algorithm (a soft clustering technique), which produces more stable and accurate results than our k-means clustering results. In addition, it speeds up the clustering process, because EM does not require any voting method to stabilize the clustering results.

CONCLUSION AND FUTURE DIRECTION
The contribution of this paper is to apply a dimensionality reduction technique within the clustering-based sentiment analysis approach to enhance its performance. By applying PCA, we are able to produce higher accuracy in binary sentiment analysis.
Dimensionality reduction through PCA reduces the size of the feature set and the computational cost. Applying PCA before the clustering process provides more accurate and stable results, because its first principal components help to select initial centroids for the K-means algorithm. We have also verified that the voting-mechanism overhead can be removed by applying a Gaussian mixture model, a soft clustering technique. In addition, the EM algorithm greatly improves the stability of the clustering results.
This clustering-based approach is language-independent and more efficient than classification approaches. The accuracy and stability of the results are good enough to apply this clustering technique in real-world applications. In the future, different clustering techniques can be applied to multi-class sentiment analysis for fine-grained opinion mining and prediction of review ratings.