An Analysis of Sindhi Annotated Corpus using Supervised Machine Learning Methods

A linguistic corpus of the Sindhi language is significant for computational linguistics, machine learning, identification and analysis of language features, semantic and sentiment analysis, information retrieval and so on. Little computational linguistics work has been done on Sindhi text, whereas English, Arabic, Urdu and some other languages are well resourced computationally. The grammar and morphemes of these languages have been analyzed properly using different machine learning methods. Development and research work on computational linguistics for the Sindhi language is currently in progress. This study develops a Sindhi annotated corpus using the universal POS (Part of Speech) tag set and a Sindhi POS tag set for the purpose of analyzing language features and variation. The features are extracted using the TF-IDF (Term Frequency-Inverse Document Frequency) technique. A supervised machine learning model is developed to assess the annotated corpus and examine the grammatical annotation of the Sindhi language. The model is trained on 80% of the annotated corpus and tested on the remaining 20%. Cross-validation with 10 folds is used to evaluate and validate the model. The results show good performance of the model and confirm that the Sindhi corpus is properly annotated. This study also identifies a number of research gaps for further work on topic modeling, language variation, and sentiment and semantic analysis of the Sindhi language.


INTRODUCTION
noteworthy technique of machine learning and data mining [1], which predicts results on the basis of a training data set [2]. Nowadays, numerous websites, blogs and social media sources produce a large amount of data.
This study developed a plain corpus by collecting data from internet resources and annotated it with the universal POS tag set and a Sindhi POS tag set using an online NLP (Natural Language Processing) resource (www.sindhinlp.com). We analyzed the Sindhi annotated corpus with a machine learning model consisting of two supervised machine learning methods: (1) non-linear SVM (Support Vector Machine) and (2) RF (Random Forest). The purpose of this research is to present the grammatical and morphological variation of Sindhi. Sindhi is a less-resourced language [3,4] in comparison with English. Nevertheless, some work has been done on tokenization and POS tagging of Sindhi text [5][6][7], and NLP tools are accessible online for solving Sindhi linguistic problems [7]. In this connection, the POS tagging system for Sindhi Devanagari script [8] is not helpful for Sindhi text written in the right-to-left script. NLTK (https://www.nltk.org/) is one of the prominent resources for solving linguistic problems of human languages, but it does not fully support Sindhi corpus analysis because libraries for Sindhi stemming and morphemes, which differ from English stems and morphemes, are unavailable. Therefore, this study developed its own program to analyze the Sindhi annotated corpus.

Sindhi Language
Sindhi is a complete language with its own culture, land, civilization, history, proper grammar and rich morphological structure, and an alphabet of fifty-two letters [9].
Sindhi is an indigenous language having all the properties of a native and complete language [10]. پَٽ تي سمھجي ٿو (pat te sumhjay tho). Thus, differences of this type make Sindhi a unique and significant language of the world.

Corpus Annotation Process
The Sindhi annotated corpus is tagged with Universal POS  Fig. 1 and annotates them with the universal part of speech, described in Fig. 2, and with the Sindhi part of speech, described in Fig. 3.
The tokenization and annotation processes treat each Sindhi token as a complete token for understanding and analysis. Table 1  Mostly, it is observed that Penn Treebank tags [13] or UPOS tags are used to annotate a corpus of any language, but this study maps the Sindhi POS tag set along with the Universal POS tag set to the Sindhi corpus, which shows the significance of the Sindhi POS tag set. Table 2 shows The most common word, used frequently in the Sindhi corpus, is جي (of). This word is a preposition (حرفِ جر) in the Sindhi language. The lexical item جي (jay) describes relation, correspondence or dependency in Sindhi text. The unique or frequently used words are the text vectors in the document-term matrix. Table 3 shows the details of total words, top words and the most frequently used word in the Sindhi corpus.
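The word-frequency statistics described above can be illustrated with a minimal sketch. It assumes the annotated corpus is available as a list of (token, UPOS tag) pairs; the function name `token_frequencies`, the transliterated tokens and their tags are hypothetical placeholders, not data from the study.

```python
from collections import Counter

def token_frequencies(tagged_tokens):
    """Count how often each token occurs in a list of (token, tag) pairs."""
    return Counter(token for token, _tag in tagged_tokens)

# Hypothetical toy sample using transliterated tokens (not the real corpus).
sample = [("jay", "ADP"), ("kitab", "NOUN"), ("jay", "ADP"), ("aahay", "AUX")]

freqs = token_frequencies(sample)
total_words = sum(freqs.values())          # total number of tokens
top_word, top_count = freqs.most_common(1)[0]  # most frequent token and its count
```

On a real corpus, `freqs.most_common(n)` would give the kind of top-word list that a frequency table summarizes.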
TF-IDF identifies significant words (feature names) in the Sindhi corpus, which play an important role in the documents.
Feature names are distinctive terms that differ from each other and are significant for the corpus documents. Table 4 shows feature names extracted from the Sindhi corpus using an n-gram model where n=1.
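A minimal sketch of unigram TF-IDF weighting may help make the feature extraction step concrete. It assumes the plain variant, with term frequency as count divided by document length and inverse document frequency as log(N/df); the study does not state which TF-IDF variant or library it used, and the function name `tf_idf` and the transliterated toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document for unigram terms.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized documents (transliterated placeholders).
docs = [["jay", "kitab", "jay"], ["jay", "ghar"], ["jay", "kitab", "sindh"]]
weights = tf_idf(docs)
```

Under this variant, a term that occurs in every document receives a weight of zero, so very frequent function words contribute little as features, while terms confined to few documents score higher.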

Sindhi Annotated Corpus
The Sindhi annotated corpus is multi-class and multi-   Table 5 shows the records of data set.
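The 80%/20% train-test split and the 10-fold cross-validation mentioned in the abstract can be sketched as follows. This is an illustration under assumptions: the helper names `train_test_split` and `k_fold_indices` are hypothetical, the records are stand-in integers, and applying the folds to the training portion (rather than to the whole corpus) is a choice made here, not something the study specifies.

```python
import random

def train_test_split(records, test_ratio=0.2, seed=0):
    """Shuffle a list of records and split it into (train, test) parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_items, k=10):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, validation
        start += size

# Stand-in records; a real run would use the annotated corpus rows.
records = list(range(100))
train, test = train_test_split(records)
folds = list(k_fold_indices(len(train), k=10))
```

Each record appears in exactly one validation fold, so every training example is used for validation once across the ten rounds.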

RESULTS EVALUATION AND ANALYSIS
Results are reported on the basis of confusion matrices, accuracy rates, precision, recall and F-scores. All of these measures are important for evaluating and analyzing the performance of the supervised model, which performs machine learning operations on the Sindhi annotated corpus.

Confusion Matrix Analysis
There are two classes, UPOS and SPOS; therefore, the performance of the two classifiers differs between them.  Fig. 7 and the matrix derived through RF is shown in Fig. 8.
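For readers unfamiliar with confusion matrices, a minimal sketch of how one is built from true and predicted tags is given here. The function name `confusion_matrix` and the toy UPOS labels are hypothetical; they do not reproduce the matrices reported in the figures.

```python
def confusion_matrix(true_labels, predicted_labels):
    """Build a confusion matrix with rows = true labels, columns = predictions."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[index[true]][index[predicted]] += 1
    return labels, matrix

# Hypothetical toy tags: one VERB token is misclassified as NOUN.
y_true = ["NOUN", "ADP", "VERB", "NOUN", "ADP", "VERB"]
y_pred = ["NOUN", "ADP", "NOUN", "NOUN", "ADP", "VERB"]

labels, matrix = confusion_matrix(y_true, y_pred)
# Accuracy is the sum of the diagonal divided by the number of instances.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
```

Diagonal cells count correct predictions; off-diagonal cells show which tags are confused with each other.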

Accuracy Analysis of Supervised Model
The

Precision, Recall and F-Measure Analysis
Precision and recall measure how well the predicted labels of the Sindhi annotated corpus match the true labels. For a given class, precision is the proportion of instances predicted as that class that actually belong to it (true positives divided by the sum of true positives and false positives), whereas recall, or sensitivity, is the proportion of instances that actually belong to the class that are correctly predicted (true positives divided by the sum of true positives and false negatives). The F-score is a measure of test accuracy defined as the weighted harmonic mean of precision and recall; with equal weights it is the F1 score. Table 7 shows the precision, recall and F-score for the Sindhi corpus labelling class UPOS, and Table 8 shows the precision, recall and F-score for the labelling class SPOS.
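The per-class measures can be computed as in the following minimal sketch. It assumes the standard definitions, precision = TP/(TP+FP), recall = TP/(TP+FN) and F1 = 2PR/(P+R), and uses an unweighted (macro) average across classes; the study does not state which averaging scheme it used, and the function names and toy labels are hypothetical.

```python
def per_class_scores(true_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} for every label seen."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores

def macro_average(scores):
    """Unweighted mean of precision, recall and F1 over all classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))

# Hypothetical toy tags: one VERB token is misclassified as NOUN.
y_true = ["NOUN", "ADP", "VERB", "NOUN", "ADP", "VERB"]
y_pred = ["NOUN", "ADP", "NOUN", "NOUN", "ADP", "VERB"]

scores = per_class_scores(y_true, y_pred)
avg_precision, avg_recall, avg_f1 = macro_average(scores)
```

In this toy case the misclassified VERB lowers the recall of VERB and the precision of NOUN, which shows why the two measures are reported separately for each tag.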
The precision, recall and F1 scores for UPOS and SPOS obtained by RF are better than those obtained by the non-linear SVM.
The average precision, recall and F-score results for class UPOS are shown in Table 9, which indicate the better performance of the RF supervised method on the Sindhi annotated corpus.
Average precision, recall and F-score results of class SPOS are shown in   Sindhi NLP resources may also be used for further research on Sindhi language feature identifications and variations, topic modeling, sentiment and semantic analysis and information retrieval.