Recognition and Effective Handling of Negations in Enhancing the Accuracy of Urdu Sentiment Analyzer

Although researchers have worked on Urdu Sentiment Analysis, there is still considerable room for improving its accuracy. In this research, the accuracy of Urdu Sentiment Analysis in multiple domains is enhanced by handling negations with the Lexicon-based approach, one of the most widely used approaches to Sentiment Analysis. Negations are the particular focus of this research because of their strong effect on sentiment. Both local and long-distance negations are considered. To this end, a corpus of 6025 Urdu sentences, drawn from 151 blogs belonging to 14 different genres, is taken and the use of negations in it is carefully observed. Two major steps are taken. First, morphological negations are handled by including them in the negative word file of the Urdu Sentiment Lexicon developed for performing Sentiment Analysis of Urdu blogs. Second, a rule-based approach is used for handling implicit and explicit negations, with rules designed to deal with both effectively. Implementing these rules increased the accuracy of the Sentiment Analyzer from 73.88% to 78.32%, with Precision, Recall and F-measure of 0.745, 0.788 and 0.745 respectively, which is a statistically significant improvement.


INTRODUCTION
Negations, also called negation particles, vary from language to language and are used for negating statements or parts of statements. It is essential to handle these negations carefully when Sentiment Analysis (SA) is performed by computer. Thus, identifying the scope of negations becomes very important.
The polarity of a complete sentence or part of a sentence is normally reversed by the negation of words. A complete summary report on the performance of a company, the 10-K report that is submitted annually to the Securities and Exchange Commission, shows that companies usually use positive words when framing negative news and seldom use negative words for positive news [7]. The results for positive words are mixed because negative phrases are wrapped in positive words [8].
Negations can change the sentiment orientation of the other terms occurring in a sentence [9]. A sentence that contains negations may therefore have its polarity inverted [10]. Negations are important linguistic features that affect the polarity of other associated words. To improve the performance of sentence-level SA, negations should be handled effectively [11].
Because of this importance, negations are specifically identified and handled in this research in order to improve the performance of the Urdu Sentiment Analyzer, achieving higher accuracy than was obtained when negations were not handled.
The existing approaches for detecting the scope of negation can be divided broadly into two categories: (1) rule-based algorithms (the Lexicon-based approach, where no prior training data is required but lexicons are used) and (2) machine learning algorithms (e.g. Supervised Machine Learning, where an annotated corpus is required for training and testing the classifiers).
If SA is to be performed in multiple domains, the Supervised Machine Learning approach requires a large amount of annotated data for training the classifiers. Producing this data is laborious, time-consuming and costly, and the approach may not perform well if applied to a domain for which it was not trained. The Lexicon-based approach, on the other hand, requires no training data. It needs a wide-coverage lexicon and an efficient algorithm, which are less time-consuming and costly to produce.

Explicit Research Objectives
• To identify different types and forms of negations in Urdu.
• To handle morphological, implicit and explicit negations while performing Urdu Sentiment Analysis.
• To enhance the accuracy, compared to the previous work done in Urdu Sentiment Analysis, by successfully handling negations.

Types of Negations in Urdu
Negations are frequently used in Urdu. Some commonly used negation particles in Urdu are ‫ﻧہ‬ (nah i.e. not), ‫ﻧﺎ‬ (naa i.e. no), ‫ﻧﮩﻴﮟ‬ (nahi i.e. no), ‫ﻣﺖ‬ (mat i.e. do not, stop), ‫ﺑﻐﻴﺮ‬ (baghair i.e. without/except) and ‫ﺑﻨﺎ‬ (bena i.e. without/except). Different types of negations can appear in text [12]. These include sentential negation, constituent negation, multiple negation and negation in coordinate structures, each of which is discussed below.

Constituent Negation
This type of negation is used in order to negate particular constituent(s). Usually, the negated constituents are followed by negative particles [12].

Multiple (Consecutive) Negation Particles
Double negations also occur in Urdu, where they are used to place more emphasis on a certain point. In (i) and (ii), sentences are negated with more emphasis by using multiple consecutive negations i.e. ‫ﻧہ‬ ‫ﻧﮩﻴﮟ‬ and ‫ﻧﮩﻴﮟ‬ ‫ﻧﮩﻴﮟ‬ respectively.

Negations Found in Coordinate Structures
In constructions like "neither-nor" in English, negations in Urdu usually appear at the beginning but may sometimes occur in the middle as well.

Is ka jawab naa to mojooda hukmraanon ke paas hai naa hi opposition ke paas hai aur naa hi awam ke paas hai.
The answer to this is neither with the government nor the opposition nor the public at large.
In (i), the first ‫ﻧﺎ‬ (naa i.e. no) appears in the beginning and the second one in the middle. In (ii), ‫ﻧﺎ‬ (naa i.e. no) does not appear in the beginning, although it occurs three times.

Forms of Negation in Urdu
Negations, whether sentential, constituent or multiple, can occur in three main forms in Urdu. These three forms are identified by Syed et al. [12], and are discussed one by one.
In (i) and (ii), morphological negations are formed by using the prefixes ‫ﺑﮯ‬ (bay i.e. without) and ‫ﻧﺎ‬ (naa i.e. no) respectively. To handle this type of negation, words starting with ‫ﺑﮯ‬ (bay i.e. without) and ‫ﻧﺎ‬ (naa i.e. no), which usually act as negative words, are included in the negative word list.

Implicit Negation
Negation can be implicit, as indicated by e.g. ‫ﮐﻢ‬ (kam i.e. low/less). Consider the following example:
In (i), the implicit negation ‫ﮐﻢ‬ (kam i.e. low/less) is used. This type of negation is handled in two steps. First, implicit negations are kept in separate files, and then a rule is formulated to handle them.

Explicit Negation
Negation can be explicit, e.g. ‫ﻧﮩﻴﮟ‬. Consider the following examples:
Taqleed burii nahi balkay andhi taqleed halakat-khaiz hai.
Following someone is not bad but following someone blindly is deadly.
In (i) and (ii), the two most commonly used negations appear, i.e. ‫ﻧﮩﻴﮟ‬ (nahi i.e. no) and ‫ﻧہ‬ (nah i.e. not) respectively. Explicit negation is the most common form of negation. This type of negation is handled in two steps. First, all such negations are included in the negation file and, second, a rule is formulated.
The rest of the paper is organized as follows. Related work is presented in Section 2. Materials and methods are discussed in Section 3. Results and discussion are presented in Section 4. The paper is concluded in Section 5.

RELATED WORK
Several studies address the detection of negations in sentiment analysis [1,3,13]. An n-gram approach to negation detection is used by [14]. The impact of negation tagging in French, English and Dutch opinion mining is studied, with the conclusion that language-specific negation detection is helpful [15]. Sentiment Analysis is also performed by handling intensifiers and negations together [16].
The sentiment of each word that occurred after a negation is inverted till the next punctuation token [17]. Positive scores are assigned to expressions that are positive and negative scores to negative expressions. The polarity score is simply inverted in case of the presence of negation [18]. Negation is modeled on the basis of three features on polar expressions [3]. These three features are "shifter feature" (the occurrence of different polarity shifters e.g. "little"), "polarity modification features" (the words that are not negations explicitly but they modify the polarity e.g. "lack") and "negation features" (if there is a negation before polar expression).
For detecting negations, a list of negation cues is used [19]. The scope of negation is then determined by utilizing the syntax tree of the sentence containing the negation. Negations are thus analyzed in different ways, e.g. Dependency Tree, Parts-of-Speech and Bag-of-Words. A careful combination of these approaches can lead to much better results than their use in isolation [20].
Most researchers have worked on SA for languages other than Urdu, and few have worked on Urdu SA [21]. Urdu corpora and lexicons are developed by researchers [22][23][24][25]. A sentiment-annotated lexicon is developed manually [26]. The developed Urdu lexicons are either publicly unavailable or cannot be used for Urdu Sentiment Analysis. Urdu SA is performed by different researchers using the Lexicon-based approach [27][28][29]. In [29], subjective expressions called Sentiunits are extracted that automatically cater for the effect of negations, and negations are handled within the phrase. The performance of the Urdu Sentiment Analyzer developed by these authors is domain specific, and the annotated lexicon needs to be extended with words from multiple domains. Further, the authors of [29] have used three sets of data and have achieved satisfactory results, but further enhancement in performance is still needed.

MATERIALS AND METHODS
Different steps taken are discussed one by one:

Urdu Annotated Corpus
A corpus of 6025 sentences had already been collected in another study conducted by some of the authors of this research. These sentences were taken from 151 online blogs belonging to 14 different genres. Two human annotators were hired to label these sentences as positive, negative or neutral, and a majority vote strategy was adopted. If the two annotators agreed on the same label, the sentence was labeled accordingly. If they disagreed on a particular sentence, the sentence was considered a tied sentence. All such sentences were collected and annotated by a third annotator, and each was labeled according to whether the third annotator agreed with the first or the second annotator. A sentence was discarded when the third annotator agreed with neither the first nor the second annotator. After this step, 1876 sentences were labeled as positive, 2753 as negative and 1388 as neutral, and 8 sentences were discarded due to disagreement between the annotators [30].
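The labeling procedure described above can be sketched in a few lines of Python. This is only an illustration of the described voting scheme, not the tooling used in the original annotation study; the function name `resolve_label` and the string labels are assumptions made for the example.

```python
from typing import Optional


def resolve_label(first: str, second: str, third: Optional[str] = None) -> Optional[str]:
    """Return the final label for a sentence, or None if it must be discarded.

    first, second: labels from the two primary annotators.
    third: label from the third annotator, consulted only for tied sentences.
    """
    if first == second:
        # Both primary annotators agree: accept their label.
        return first
    if third is not None and third in (first, second):
        # Tie broken: the third annotator sides with one of the two.
        return third
    # The third annotator agrees with neither (or gave no label): discard.
    return None
```

For instance, `resolve_label("positive", "negative", "neutral")` returns `None`, which corresponds to one of the discarded sentences.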
To observe the use of the different negations discussed above, negations and their frequencies are carefully observed in the collected corpus. Table 1 shows commonly used negations and the number of times they are used in the corpus.  [31].
In order to handle morphological negations for this study, a large number of such words are included in the negative word file of this lexicon.

Urdu Sentiment Analyzer
The algorithm, already developed by the authors (Ph.D. thesis), is based on rules for handling different issues (e.g. negations, intensifiers and context-dependent words) that arise while classifying sentences. A number of rules are incorporated, using Java JDK 6 in the NetBeans environment, to develop an Urdu Sentiment Analyzer (this research focuses only on the rules related to negations). The Urdu Sentiment Analyzer takes an Urdu blog with several sentences as input. Words in each sentence are searched for in different files (e.g. the positive, negative and negation files) and are assigned polarities according to the rules. At the end of the processing of a sentence, all polarities in that sentence are added. If the sum is greater than 0, the sentence is declared positive. If the sum is less than 0, the sentence is declared negative. If the sum is equal to 0, the sentence is considered neutral, i.e.

$$X = \sum_{i=1}^{m} W_{pol_i}$$
where $m$ is the total number of positive and negative words surrounded by negations in each sentence and $W_{pol_i}$ (discussed in the next section) is the polarity value assigned to each positive and negative word with negations according to rules.
If X > 0, the sentence is declared positive. If X < 0, the sentence is declared negative. If X = 0, the sentence is considered neutral. At the end, a final conclusion is displayed which shows whether the majority of sentences in the blog are positive, negative or neutral.
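The decision step above can be illustrated with a short Python sketch. It is not the authors' Java implementation; the function names are hypothetical, the per-word polarities are assumed to have been assigned already by the rules, and the handling of ties between label counts at blog level is an assumption, since the text does not specify it.

```python
from collections import Counter


def classify_sentence(polarities):
    """Classify one sentence from the polarities of its words (X = sum of W_pol_i)."""
    x = sum(polarities)
    if x > 0:
        return "positive"
    if x < 0:
        return "negative"
    return "neutral"


def summarize_blog(sentences):
    """Return (majority label, label counts) for a blog.

    sentences: a non-empty list with one list of word polarities per sentence.
    Assumption: if two labels are equally frequent, the one seen first wins.
    """
    if not sentences:
        raise ValueError("a blog must contain at least one sentence")
    counts = Counter(classify_sentence(p) for p in sentences)
    majority_label = counts.most_common(1)[0][0]
    return majority_label, counts
```

A sentence with no polar words yields an empty polarity list, whose sum is 0, so it is reported as neutral.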

Rules for Handling Implicit and Explicit Negations
After a sentence containing a negation is provided to the system for processing, each word in the sentence is checked in the developed Urdu lexicon, which has the following files:
In (i), ‫ﺷﮏ‬ (shak i.e. doubt) is a negative word, but it is followed by the negation ‫ﻧﮩﻴﮟ‬, so the combination of both will be assigned +1.
(2) The positive or negative word will keep its own polarity, i.e. +1 for a positive word and -1 for a negative word, if there is ‫ﺻﺮﻑ‬ (sirf i.e. only) between the negation and the positive or negative word, i.e. the negation is followed by ‫ﺻﺮﻑ‬, e.g. ‫ﺻﺮﻑ‬ ‫ﻧہ‬ (nah sirf i.e. not only).
where W1, W2 and W3 are words one after another (at times, there may be distance between the words).
In the above example, ‫ﺑﮩﺘﺮ‬ (behtar i.e. better) is a positive word and it remains positive, in spite of being preceded by the negation ‫ﻧہ‬, due to the presence of ‫ﺻﺮﻑ‬ (sirf i.e. only). So, collectively, these three words, i.e. ‫ﺑﮩﺘﺮ‬ ‫ﺻﺮﻑ‬ ‫ﻧہ‬ (nah sirf behtar i.e. not only better), will be assigned +1. In other words, ‫ﺻﺮﻑ‬ (sirf i.e. only) cancels the effect of ‫ﻧہ‬.
(3) A positive word will have negative polarity and a negative word will have positive polarity, if there is a negation word after it, even in the presence of an intensifier.
In this example, take the string of words ‫ﻧﮩﻴﮟ‬ ‫ﺳﻤﺠﻬ‬ ‫ﺯﻳﺎﺩﻩ‬. Here, the word ‫ﺯﻳﺎﺩﻩ‬ (ziyadah i.e. more) is an intensifier, ‫ﺳﻤﺠﻬ‬ (samajh i.e. understand) is a positive word and ‫ﻧﮩﻴﮟ‬ (nahi i.e. no) is a negation. The three words will collectively be assigned the polarity of -1. The intensifier has no effect here to increase or decrease the polarity.
(4) If the implicit negation e.g. ‫ﮐﻢ‬ (kam i.e. less) is preceded or followed by a positive word then its polarity from +1 will be changed to -1. However, if this negation is preceded or followed by a negative word then the polarity -1 will be changed to +1.
Consider example (i) in Section "Implicit Negation". In this example, the first occurrence of ‫ﻅﺮﻑ‬, which is a positive word, is followed by ‫ﮐﻢ‬ (i.e. an implicit negation), so collectively these words (i.e. ‫ﮐﻢ‬ ‫ﻣﻴﮟ‬ ‫ﻅﺮﻑ‬) will be assigned -1.
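The negation rules in this section can be approximated by the following Python sketch. It is a simplified illustration rather than the authors' Java implementation. Several things are assumptions made for the example: the words are romanized, the tiny word sets stand in for the lexicon files, the `WINDOW` of two tokens models the remark that there may be some distance between the words, and a word affected by either an explicit or an implicit negation is inverted exactly once.

```python
# Hypothetical romanized stand-ins for the lexicon files (the real files use Urdu script).
POSITIVE = {"behtar", "samajh"}
NEGATIVE = {"shak"}
EXPLICIT_NEGATIONS = {"nahi", "nah", "naa", "mat"}
IMPLICIT_NEGATIONS = {"kam"}
CANCELLER = "sirf"  # "only": cancels the negation directly before it (rule 2)
WINDOW = 2          # assumed maximum distance between a polar word and a negation


def is_active_negation(tokens, j):
    """True if tokens[j] is an explicit negation that is not followed by 'sirf'."""
    if tokens[j] not in EXPLICIT_NEGATIONS:
        return False
    return not (j + 1 < len(tokens) and tokens[j + 1] == CANCELLER)


def word_polarities(tokens):
    """Return one polarity (+1 or -1) per polar word after applying the rules."""
    scores = []
    for i, token in enumerate(tokens):
        if token in POSITIVE:
            base = 1
        elif token in NEGATIVE:
            base = -1
        else:
            continue  # intensifiers and other words contribute nothing (rule 3)
        after = range(i + 1, min(i + 1 + WINDOW, len(tokens)))
        before = range(max(0, i - WINDOW), i)
        # Rules 1 and 3: an explicit negation after the word inverts its polarity.
        explicit = any(is_active_negation(tokens, j) for j in after)
        # Rule 4: an implicit negation before or after the word inverts its polarity.
        implicit = any(tokens[j] in IMPLICIT_NEGATIONS for j in (*before, *after))
        scores.append(-base if (explicit or implicit) else base)
    return scores
```

With these assumptions, `word_polarities(["shak", "nahi"])` gives `[1]`, `word_polarities(["nah", "sirf", "behtar"])` gives `[1]`, and `word_polarities(["ziyadah", "samajh", "nahi"])` gives `[-1]`, matching the worked examples in the text.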

RESULTS AND DISCUSSION
First, the algorithm is implemented by dealing with positive and negative words, including nouns, verbs and adjectives, without handling negations. The input contains sentences with both simple and compound words. All sentences in the corpus (with and without negations) are given as input to the software system and the accuracy is calculated. Later, the algorithm is improved to handle negations as well. To check that the rules are working properly, all sentences are again provided to the system and the accuracy is recalculated. Table 2 shows a comparison of the accuracy of the system before and after handling negations, where accuracy measures how close the estimated classification is to the actual classification:
$$\text{Accuracy} = \frac{\text{correctly classified sentences}}{\text{total number of sentences}} \times 100$$
Accuracy alone is not a sufficient performance metric for evaluating effectiveness and efficiency. Therefore, three other standard metrics, i.e. Precision, Recall and F-measure, are also computed.
where Precision = Tp / (Tp + Fp), with Tp the number of true positives and Fp the number of false positives. This value ranges from 0 to 1; the closer it is to 1, the better the result. Table 3 shows the Precision, Recall and F-measure of the second phase (i.e. after handling negations), where 78.32% accuracy is achieved. The results are also compared with those of [29]; Table 4 shows this comparison.
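The evaluation metrics named above can be computed as in the following Python sketch. Accuracy and Precision follow the formulas given in the text; Recall and F-measure use their standard textbook definitions, which is an assumption here because the paper's own formulas for them are not present in this section. How the per-class values were averaged over the three classes is likewise not stated, so the sketch works on counts for a single class, and all function names are illustrative.

```python
def accuracy(correct, total):
    """Percentage of sentences whose predicted label matches the annotated label."""
    return correct / total * 100


def precision(tp, fp):
    """Tp / (Tp + Fp); defined as 0.0 when nothing was predicted for the class."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Tp / (Tp + Fn), the standard definition; 0.0 when the class never occurs."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

For example, with hypothetical counts of 8 true positives, 2 false positives and 8 false negatives, `precision(8, 2)` is 0.8 and `recall(8, 8)` is 0.5.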

‫ﺑﻬﺌﯽ۔‬ ‫ﻧﺎ‬ ‫ﺩﻳﮑﻬﻴﮟ‬ (Neutral Sentence)
Dekhen naa bhae
Just see.
In Fig. 1, the output provided by the Urdu Sentiment Analyzer is displayed, where the sentences are highlighted in different colors. Red highlighting is used for negative sentences, green highlighting for positive sentences and white highlighting for neutral sentences. It can be observed that all the sentences with negations are correctly identified. Fig. 2 shows the same output by means of a pie chart and Fig. 3 shows the descriptive statistics generated by the system.
In McNemar's test, n00 represents the number of instances correctly classified by both the classifiers, n11 represents the number of instances misclassified by both the classifiers, n01 represents the number of instances correctly classified by classifier 1 but misclassified by classifier 2, and n10 represents the number of instances correctly classified by classifier 2 but misclassified by classifier 1.
The null hypothesis (H0): The two classifiers are equal.
The critical value for the McNemar statistic = 3.84.
The null hypothesis is not rejected if the McNemar statistic is less than 3.84 or the p-value is greater than 0.05. The null hypothesis is rejected if the McNemar statistic is greater than 3.84 or the p-value is less than 0.05. As a rule of thumb, n01 + n10 should be greater than or equal to 20.
For applying McNemar's Test, 720 sentences are randomly selected from the collected corpus and are tested by both the classifiers. Table 5 shows the classification detail by the two classifiers.
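The decision procedure for McNemar's test can be sketched in Python as follows. The exact formula used in the paper is not present in this section, so the commonly used form with continuity correction, (|n01 - n10| - 1)^2 / (n01 + n10), is an assumption, as are the function name and the example counts. The p-value uses the fact that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)).

```python
import math

CRITICAL_VALUE = 3.84  # chi-square critical value, 1 degree of freedom, alpha = 0.05


def mcnemar(n01, n10):
    """Return (statistic, p_value) from the two discordant counts.

    n01: correctly classified by classifier 1 but misclassified by classifier 2.
    n10: correctly classified by classifier 2 but misclassified by classifier 1.
    Assumption: the continuity-corrected form of the statistic is used.
    """
    discordant = n01 + n10
    if discordant == 0:
        raise ValueError("the test is undefined when n01 + n10 is 0")
    statistic = (abs(n01 - n10) - 1) ** 2 / discordant
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts, not the values from Table 5.
statistic, p_value = mcnemar(n01=10, n10=30)
reject_null = statistic > CRITICAL_VALUE
```

With these hypothetical counts the statistic is 9.025, which exceeds 3.84, so the null hypothesis that the two classifiers are equal would be rejected.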

CONCLUSION
The paper shows that negations play an important role in the correct classification of sentences. Negation handling for the Urdu language is the main focus of this research work. Rules are formulated for the three forms of negation, i.e. morphological, implicit and explicit negations, and software is developed to handle them effectively. Sentences from the corpus are tested for computing the accuracy of the system, which classifies each sentence as positive, negative or neutral. A comparison in terms of accuracy is presented before and after handling negations. After dealing with negations, the accuracy of the system increased from 73.88% to 78.32%, i.e. by 4.44 percentage points over its previous version, which is a statistically significant improvement in the correct classification of sentences.

ACKNOWLEDGMENT
The authors acknowledge the Department of Computer Science, University of Peshawar, Pakistan for the motivation and support in the successful completion of this research work. The authors are particularly grateful to Mr. Al-Gaili, MS Computer Science, University of Peshawar, for his valuable suggestions, discussions and help throughout this research.