Sentiment Analysis for Roman Urdu

The majority of online comments and opinions are written as free text. Sentiment analysis can be used to measure the polarity (positive/negative) of such comments and opinions. These comments/opinions can be in different languages i.e

share their thoughts, opinions and feelings to the public [1].
The main goal of this paper is to help non-native Urdu users determine the polarity of other people's opinions about a product or service from comments posted in Roman Urdu.
In this paper, we attempt to perform sentiment analysis on comments/opinions written in Roman Urdu. First, opinion/comment data is collected from various online sources. It is then preprocessed by removing stop words, punctuation marks and numerical characters. Three supervised machine learning techniques, NB, LRSGD and SVM, are applied to the data, and several performance measures are studied. Each algorithm has its own advantages and disadvantages in terms of model complexity and accuracy.
The contributions of the paper are: (1) There is no publicly available Roman Urdu opinion dataset; therefore, we have prepared one by collecting comments/opinions written in Roman Urdu from different websites. We have also tagged them as positive, negative or neutral.
(2) We have achieved an accuracy of 87.22% when SVM is used with Unigram + Bigram + TF-IDF as the feature set.
The rest of the paper is organized as follows. Related work is reviewed in Section II. Section III presents the methodology for sentiment analysis. The experimental setup, results and discussion are presented in Section IV. Finally, the paper is concluded in Section V.

RELATED WORK
A lot of work has been done on the sentiment analysis of content in English and other well-resourced languages. Very little research has been done on the sentiment analysis of content in Roman Urdu/Hindi, the third largest language in the world [5].
In this section, we first present some relevant work on English-language content. At the end of the section, we also discuss some work on the sentiment analysis of content in Urdu.
Tripathy et al. [6] worked on sentiment analysis of movie reviews in English language using Machine Learning

The Dataset
As

Preprocessing
First, we manually annotate the data as negative, positive or neutral; this is done before any other processing. In the preprocessing step, we remove unnecessary tokens such as punctuation marks, numerical characters and stop words. Stop words are common, high-frequency words that carry little information for predicting sentiment. Luhn [13] was the first to introduce the concept of stop words. In our work, we have chosen the stop words manually. Data preprocessing also helps reduce the computation time and the dimensionality of the data.
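The preprocessing step described above can be sketched in a few lines of Python. This is only a minimal illustration: the stop-word list is hypothetical (the manually chosen list is not reproduced here), the function name `preprocess` is our own, and lowercasing and whitespace tokenization are added assumptions.

```python
import string

# Hypothetical stop words for illustration only; not the paper's manual list.
STOP_WORDS = {"ka", "ki", "ke", "hai", "ko", "se", "aur", "ye"}

# Translation table mapping every ASCII punctuation mark and digit to a space.
_REMOVE = string.punctuation + string.digits
_TABLE = str.maketrans(_REMOVE, " " * len(_REMOVE))


def preprocess(comment, stop_words=STOP_WORDS):
    """Return the tokens of a comment without punctuation, digits or stop words."""
    # Lowercasing is an assumption; the text does not state it.
    text = comment.lower().translate(_TABLE)
    # Whitespace tokenization, then stop-word filtering.
    return [token for token in text.split() if token not in stop_words]


tokens = preprocess("Ye phone bohat acha hai!!! 10/10")
```

Dropping these tokens before feature extraction shrinks the vocabulary, which is the source of the reduction in computation time and dimensionality mentioned above.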

Feature Set Preparation
Preprocessing alone is not enough to achieve good sentiment analysis results. The preprocessed text is processed further to extract features that may improve the results. In the literature, researchers have extracted many features for successful sentiment analysis. Here, we have chosen eight different feature sets, each a combination of simple text features. We first define the simple features and then present our eight feature sets. These eight feature sets are also commonly used in the sentiment analysis of English text [14].
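As one illustration, the Unigram + Bigram + TF-IDF combination named in the contributions could be computed as in the sketch below. The weighting used here (term frequency normalized by document length, multiplied by log(N/df)) is one common TF-IDF variant and is an assumption, since the exact formula is not given in this section; the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_values=(1, 2)):
    """Return the unigrams and bigrams of a token list as space-joined strings."""
    grams = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(documents):
    """Map each tokenized document to a {term: TF-IDF weight} dictionary."""
    term_lists = [ngrams(doc) for doc in documents]
    n_docs = len(term_lists)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [["acha", "phone"], ["bura", "phone"]]
vectors = tfidf_vectors(docs)
```

In this toy example, "phone" occurs in both documents and therefore receives zero weight, while "acha" and the bigram "acha phone" are specific to the first document and receive positive weights.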

Correct Predictions
Results and Discussion: The experimental results of the three machine learning algorithms applied to the eight feature sets are shown in Table 2. The bold number in each row is the maximum accuracy among the three machine learning algorithms.
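For reference, accuracy can be computed as in the following sketch, assuming the usual definition: the number of correct predictions divided by the total number of predictions, expressed as a percentage. The function name and the example labels are illustrative only and are not taken from the dataset.

```python
def accuracy(true_labels, predicted_labels):
    """Return the percentage of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Illustrative labels only: three of the four predictions are correct.
score = accuracy(
    ["positive", "negative", "neutral", "positive"],
    ["positive", "negative", "positive", "positive"],
)
```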

CONCLUSION
In this paper, we perform sentiment analysis of Roman Urdu

FUTURE WORK
In the future, we will extend our dataset to cover more domains. We also plan to apply deep learning approaches to this problem.