Predicting Collective Synchronous State of Sentiments for Users in Social Media

The increasing use of social media offers researchers an opportunity to apply sentiment analysis techniques to data collected from social media websites. These techniques promise to provide insight into users’ perspectives in many areas. In this research, a sentiment analysis model is proposed based on HMC (Hidden Markov Chains) and the K-Means algorithm to predict the collective synchronous state of sentiments for users on social media. HMC are used to find the converged state, while K-Means is used to find the representative group of users. For this purpose, we have used data from a well-known social media site, Twitter, consisting of tweets about a famous political party in Pakistan. The time series of sentiments of each user is passed to the system to perform temporal analysis. Clustering with three and with four clusters is found to be significant, giving the representative groups. With three clusters, the representative group comprises 82% of users; with four clusters, two representative groups are found, containing 45% and 36% of users. Analyzing these groups helps in finding the most popular behavior of users towards the concerned political party. Moreover, these groups may influence the opinions of other users in the network, causing changes in their sentiments towards this party. The experimental results show that the proposed model can distinguish the behavior patterns of different individuals in a network.


INTRODUCTION
With the rapid rise in the use of social media, sentiment analysis of users has become of increasing interest to researchers. Sentiment analysis has often been defined as categorizing text as positive or negative [1]. It can be performed to identify the perspectives of social media users, such as their religious and political preferences and the associated issues [2]. Similarly, predictions can be made about a variety of real-life and marketing problems, such as movie ratings [3] and the perceptions of parents about getting their children vaccinated in a particular locality [4]. Various machine learning algorithms have been used successfully in this field, for example: KNN (K-Nearest Neighbor) and Naïve Bayes on reviews of movies and hotels [5], a hybrid model of Naïve Bayes and SVM (Support Vector Machine) for Twitter data [6], a combination of classification and clustering [7], and Deep Learning for movie data [8], among many others.
HMM (Hidden Markov Models) are widely accepted as a statistical tool among the research community for modeling a wide range of time series data [9]. An intuitive picture of complex systems can be drawn using the HMM technique. They are often used for modeling generative sequences, with applications in signal processing, speech processing, and natural language processing. Linear sequence labeling problems have also been solved efficiently by probabilistic models based on HMM.
The novel micro blogging social media service, Twitter, was launched in 2006. According to an estimate, the number of monthly unique visitors to the site is 20 million [10]. Tweets are short messages (up to 140 characters) published by the users. These tweets are visible through the public message board provided by the website and can also be accessed by several third-party applications.
Twitter has been supporting worldwide communication of more than one million messages per hour. Twitter users not only post their personal statuses on the website but also cover a wide range of topics, including politics, product information and reviews. The format of tweets varies: some comprise short sentences, some have links to websites, and some are direct messages to other users. Due to the increasing scope of tweets, researchers find it useful to develop techniques for sentiment analysis of this social media platform.
Many studies [11][12] have been presented that model synchronization phenomena in a large network of interacting elements to simulate collective behavior under certain assumptions. This research is based on a model proposed by Liang and Ng in [13], a probabilistic model (HMM) to discover the collective synchronous behavior in a network of users. The researchers in [13] took the inspiration for their research from the theory of synergetics [14] in Physics, which deals with the formation of structures that are self-organized and spontaneous. In our research work, we have used this probabilistic approach to propose a model for sentiment analysis. This analysis is performed on the time series data of interactive users in a network. The HMM model is used to find the collective synchronous state of the system and the groups with the most popular behavior. The dataset that has been chosen is based on the tweets about a political party in Pakistan. There have been various studies in which the role of social media in politics has been discussed, especially in analyzing public opinion about political parties. For example, in [15], researchers have proposed a framework for analyzing public opinion, measuring sentiments and information discourse before elections. In [16][17]
Although we have used the model proposed by the researchers in [13], our research makes two significant contributions over their work. First, we use a real-world dataset taken from Twitter, rather than synthesized data as done in [13]. A synthesized dataset is one created under controllable parameters. In contrast, a real-world dataset is taken from a real-world problem domain with many complexities, which requires extra techniques and care. The real contribution of any theoretical concept can only be seen by applying it to a real-world problem.
Our second contribution is that we analyze the opinions of users to find the collective synchronous state of users in terms of sentiments, which is a more useful and complex scenario than the one discussed in [13]. HMM has also been used extensively in the field of image processing, especially to identify and segment objects in images by making use of temporal data [29][30].
The authors of [29] claimed to propose a new approach that constructs a 2D (Two-Dimensional) HMM for recognizing 2D facial images, whereas [30] presented a comparison by the same author of 1D and 2D data models. Another study [31] used discrete HMM for the recognition of 3D gestures, obtaining an accuracy of 80% for simple gestures and 60% for complex ones. HMM also has applications in the analysis and prediction of human activities and their states by utilizing temporal sequence patterns [34]. In recent literature [35], future prediction of the PU (Primary User) channel state based on time series and HMM has been widely investigated. A brief overview of other research work has been presented in this section. In the next section, the model and approach are described in detail.

METHODOLOGY
The methodology of this research is comprised of the

System Description
This research study is based on the technique proposed in [13].
(1) M is the total number of individuals/users.
(3) X_p(t) is a discrete random variable representing an individual p at time t.
(4) R_t is a random variable representing the statistical aggregate state of the system for all users.
(5) S is the finite state space with N total states, i.e. s_n ∈ S, n = 0, 1, 2, …, N.
(6) π is the stationary probability distribution over the N states.
The behavior of users was modeled by many MMC with the assumption that the state of any user at time t depends upon its own state, and on the state of other users in that network at time t-1.
These assumptions give the model multiple coupled MMC, as shown in Fig. 1, where X_p(t) is a random variable for an individual p at time t.
Since the network may have many users, inferring all random variables (users) is very time-consuming. For simplification, a variable R_t is introduced in [13] to represent the aggregate state of the system (4).
The collective synchronous behavior emerges when sufficient time for evolution is given and the stationary probability distribution of a user is coherent with the probability distribution of the complete system, i.e. Pr(X_p(t)) = Pr(R_{t-1}). The equation of evolution of the MMC is given below, for which the state probability distributions (5) are exactly the same.
FIG. 2. COUPLING HMM

The parameters of the HMM, i.e. the transition probability matrix and the emission probability matrix, are learned using the Baum-Welch algorithm [36], an iterative algorithm that estimates these parameters from observation data sequences. The proposed macro variable R was assumed to follow a random walk process with a transition probability and a steady-state probability distribution giving the chances of being in each state after a long time. This stationary probability vector π can be found as an eigenvector of the transition probability matrix A, as shown in Equation (6).
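For orientation, the eigenvector relation mentioned above can be written in its standard textbook form. This is a sketch under the assumption that A is row-stochastic (each row sums to 1) and that n indexes the states of S; it is not a reproduction of Equation (6), which may use a different convention.

```latex
\pi A = \pi, \qquad \sum_{n} \pi_n = 1, \qquad \pi_n \ge 0
```

Under that assumption, π is a left eigenvector of A with eigenvalue 1, normalized so that its entries form a probability distribution.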
The condition of irreducibility in Equation (7) is satisfied since it was allowed that each state is accessible by other states without restrictions.
The are calculated using C*'. This process is repeated iteratively to improve A* and B until π* stabilizes, i.e. until the error, calculated as a Euclidean distance, falls below a threshold υ. The mechanism of evolution inside the system is represented by A*, and the degree of reactive influence with respect to each system state is represented by B. The synchronous state S_syn of the system is then predicted as the state with the maximum probability in π*. The detailed algorithm is given in Table 1.
The reactive factor RF of a user measures the degree of dependence of that user on the other users in the network. The reactive factor can be positive (RF_p) or negative (RF_n): the positive reactive factor measures the chance that a state acquired by the system also appears on an individual, as shown in Equation (8), whereas the negative reactive factor measures the chance that a state appears on the system while the individual acquires a different observation state.
Our research work is based on the theories presented in the aforementioned research [13]. We have used the coupling MMC to model the state of sentiments of each user. The state of sentiments of a user at time t depends not only on his own state, but also on other users' states at time t-1. The macro variable R_t is used to represent the aggregate state of the system at time t so that the system can be modeled by a coupling HMM. The state of a user at time t then depends only on the state of the system at time t-1. It is known that the sentiments of a user may also affect the sentiments of his friends in a social network.
The collective synchronous behavior is caused by the preference probabilities of a group of people, which tend to become similar and stable if the system is left to evolve for a long time. Rather than applying the model to synthesized data, we have used a real-world dataset to measure its true usefulness. The dataset used for this purpose is taken from a well-known social media site, Twitter.
The proposed approach consists of the following three phases: (1) Data Acquisition, (2) Data Preprocessing, (3) Model Building (Learning Parameters) and synchronous state prediction.

Data Acquisition
To apply this analysis data is acquired by using a tool

Data Preprocessing
In the phase of Data preprocessing following steps are taken to prepare the data for model building:

Model Building and Prediction
Collective synchronous behavior is a commonly observed Equation (10): where π* is the initial guess for the stationary probability and A* is the average transition table of the largest cluster. The state having the maximum probability is the synchronous state of the system, i.e. the state in which the system is most likely to be. By analyzing the clusters, we can also comment on the most popular behavior of the system. The complete algorithm to learn the HMM of the system is provided in Table 1. These probabilities allow the model to predict the synchronous state of the system, which gives the converging point of sentiments. The complete process model of this study can be seen in Fig. 4.
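The prediction step described above can be illustrated with a minimal sketch. It assumes that the update has the simple form π ← π A* with a row-stochastic A*, that convergence is judged by the Euclidean distance between successive vectors, and that the three states are labeled negative, neutral and positive. The matrices, the labels, the threshold and all function names are hypothetical and are not taken from Table 1.

```python
import math

def average_matrix(matrices):
    """Element-wise mean of equally sized square matrices (lists of rows)."""
    count = len(matrices)
    size = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / count for j in range(size)]
            for i in range(size)]

def step(pi, A):
    """One update pi <- pi A for a row-stochastic matrix A."""
    n = len(pi)
    return [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]

def converge(pi0, A, threshold=1e-9, max_iter=10_000):
    """Repeat pi <- pi A until the Euclidean change falls below threshold."""
    pi = list(pi0)
    for _ in range(max_iter):
        nxt = step(pi, A)
        if math.dist(pi, nxt) < threshold:
            return nxt
        pi = nxt
    return pi

# Assumed state labels and made-up transition matrices for two users
# of the largest cluster (illustration only, not data from the study).
STATES = ["negative", "neutral", "positive"]
cluster_matrices = [
    [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]],
    [[0.4, 0.3, 0.3], [0.2, 0.4, 0.4], [0.1, 0.3, 0.6]],
]

A_star = average_matrix(cluster_matrices)           # average transition table
pi_star = converge([1 / 3, 1 / 3, 1 / 3], A_star)   # uniform initial guess
best = max(range(len(pi_star)), key=pi_star.__getitem__)
synchronous_state = STATES[best]
print(pi_star, synchronous_state)
```

The uniform starting vector mirrors the 1/3 prior mentioned in the text; for a transition matrix with all entries positive, the limit does not depend on that starting point.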

EXPERIMENTS AND RESULTS
The data was downloaded from Twitter using the

TABLE 1. SYNCHSENTIMENTS ALGORITHM TO BUILD THE MODEL BY HMM AND K-MEANS CLUSTERING FOR PREDICTING THE COLLECTIVE SYNCHRONOUS STATE
End for
π*' π*
A* All A_M A*
Return A*, π*', W_M
transition matrix, initial emission matrix and the sequence of observations. Since we did not have the initial parameters, we estimated them. Each entry of the initial transition matrix was assigned a value of 1/3 as a prior probability, and the initial emission matrix was learned from the observation sequence. The stationary probability vector for each user can be found as an eigenvector of the state transition matrix of that user, as shown in Equation (11).
where π is the initial stationary probability vector, with each entry assigned a prior value of 1/3, and A is the state transition matrix learned by Baum-Welch. K-Means clustering is applied to the stationary probability matrices of each user. The clustering schemes are analyzed for different numbers of clusters (K), where K was set from 1 to 6. To evaluate and find the optimum clustering scheme, the SSE (Sum of Squared Error) is calculated. A graph of SSE vs. K (number of clusters) is plotted to find the best K for our data distribution, as shown in Fig. 5. The ideal number of clusters should be picked so that adding another cluster does not give noticeably better modeling of the dataset.
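The SSE-versus-K analysis described above can be sketched as follows. This is an illustrative, standard-library implementation of plain Lloyd's K-Means with random initialization, not the tool used in the study. The per-user stationary vectors are made-up values, and the function name `kmeans` and its parameters are assumptions.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels, sse)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each non-empty cluster moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, sse

# Hypothetical stationary probability vectors, one per user.
user_vectors = [
    [0.20, 0.20, 0.60], [0.22, 0.18, 0.60], [0.18, 0.22, 0.60],
    [0.21, 0.21, 0.58], [0.19, 0.19, 0.62], [0.20, 0.23, 0.57],
    [0.60, 0.20, 0.20], [0.62, 0.18, 0.20], [0.58, 0.22, 0.20],
    [0.20, 0.60, 0.20], [0.18, 0.62, 0.20], [0.22, 0.58, 0.20],
]

# SSE for K = 1..6, the range examined in the text.
sse_by_k = {k: kmeans(user_vectors, k)[2] for k in range(1, 7)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 5))
```

Plotting `sse_by_k` against K gives the kind of curve used to look for the point beyond which extra clusters bring little improvement. Because the result of Lloyd's algorithm depends on the initial centroids, a practical run would typically repeat it with several seeds.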

From the graph in Fig. 5, it can be seen that K=3 and K=4 are the two most appropriate numbers of clusters for the data under consideration. The details of each cluster are given in Table 2. We analyze the data with both K=3 and K=4. Both reveal interesting insights into the data, which are discussed next in this section. The clustering performed on the stationary probability matrices of each user is shown in Table 3.

CLUSTERING SCHEME (K = 3)
When the number of clusters is set to three, the clusters have the following proportions of data points: C1 has almost 82% of users, and C2 and C3 have 9% of users each, as shown in Fig. 6. C1 is therefore the largest cluster. The average transition table of this cluster is shown in Table 4 and the stationary probability of this group is given in Table 5.
The clusters are given in Tables 6-7 respectively. The stationary probabilities of these clusters are given in Tables 8-9 respectively.
In this clustering scheme, the clusters C2 and C3 are the same as in the clustering scheme with K = 3. As already stated, the number of users in cluster C1 with K=3 is equal to the sum of the users in C4 and C1 with K = 4.
The graph in Fig. 8

FINDINGS
To get the maximum advantage from the coupling HMM model, we need a large number of users and, for each user, a sufficiently long sequence of observations, so that the algorithm can understand the behavior of each user and anticipate the influence of other users. For this research, we have used real-world data downloaded from a social media site, unlike the work in [13], where coupling HMM was applied to a synthesized dataset created by the authors themselves. In our dataset, we faced some critical issues. For example, some time stamps had multiple Tweets, so we had to find a way to represent the polarity of such a time stamp, which we did by using the mode. Similarly, some time stamps had no Tweets at all, so we assigned to such a time stamp the polarity of its previous time stamp. The real contribution of theoretical concepts can only be seen by applying them to a real-world problem. In this study, we have analyzed the synchronous and the most popular behaviors in a network in terms of sentiments. Our analysis reveals that the political party under consideration will have a Positive state after the representative group converges with higher probability. This results in the most popular behavior among people of this network. It is to be noted that the findings of this research will assist not only the political party under consideration but also other parties.
This is because users with negative sentiments can be targeted by other parties to draw the attention of such users towards their own party. Therefore, this analysis could help all parties to devise their political campaigns so as to increase their vote bank while decreasing that of their opponents.
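The two preprocessing choices mentioned in this section (the mode for time stamps with several Tweets, and the previous polarity for time stamps with none) can be sketched as below. The function name, the data layout and the polarity labels are hypothetical. Two further assumptions are made that the text does not specify: ties in the mode are broken by first occurrence, and time stamps before a user's first Tweet are left as `None`.

```python
from collections import Counter

def polarity_series(tweets, timestamps):
    """Build one polarity per time stamp for a single user.

    tweets: iterable of (timestamp, polarity) pairs.
    timestamps: ordered list of all time stamps to cover.
    """
    by_time = {}
    for ts, polarity in tweets:
        by_time.setdefault(ts, []).append(polarity)

    series = []
    previous = None  # assumption: no value before the first observed Tweet
    for ts in timestamps:
        if ts in by_time:
            # Mode of the polarities; ties go to the first one encountered.
            previous = Counter(by_time[ts]).most_common(1)[0][0]
        # A missing time stamp reuses the polarity of the previous one.
        series.append(previous)
    return series

# Made-up example: three Tweets at time 1, none at times 2 and 5.
example = [
    (1, "positive"), (1, "positive"), (1, "negative"),
    (3, "neutral"),
    (4, "negative"), (4, "negative"),
]
print(polarity_series(example, [1, 2, 3, 4, 5]))
```

The resulting list is the kind of per-user observation sequence that a sequence model such as an HMM takes as input.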

CONCLUSION
This work attempted to develop a model based on HMC for analyzing the sentiments of social media users. The data for this research was collected from Twitter, which was selected because of the recent increase in the popularity and usage of social media in general and