Romanized Sindhi Rules for Text Communication

Sindhi is a historical language that is used around the world, especially in the Sindh province of Pakistan. It has its own script, which is written from right to left. The use of Sindhi on digital platforms, particularly for communication, is increasing. Most people in Sindh province read, write and speak the language well, but they face problems in text communication on these platforms: users of computers and mobile phones find it difficult to type text messages, tweets and comments in the Sindhi script. Natural Language Processing (NLP) is one promising option for addressing these problems. As a practical solution, Romanized Sindhi text is used instead of Sindhi script. Romanized text is easier to write because the Sindhi script requires a special keyboard, whereas Romanized text does not. This paper defines rules for writing Romanized Sindhi text that make the text easier to write and to understand. The Romanized Sindhi Rules (RSR) are simple, make the meaning of the text easy to understand and support fast text communication. This study should also help further research on Romanized Sindhi text using different approaches.


INTRODUCTION
Sindhi is an Indo-Aryan language of the historical Sindh region in the northern part of the Indian sub-continent. It is one of the oldest languages in the world and is mostly spoken in the Sindh province of Pakistan. According to a survey published by the Government of Pakistan, 47.893 million people living in Sindh province read, write and speak Sindhi, which is also an official language of the province [1]. The majority of the province's population uses Sindhi in written communication such as applications and letters, as well as in text messages in today's mobile applications [2,3]. Mobile phones, computers and laptops have become part of everyone's life. Communication through computers and mobile phones, using the Short Messaging Service (SMS), Twitter, WhatsApp, Facebook and similar applications, has increased enormously. English is commonly used in these services; in many countries, however, people prefer to communicate in their mother tongue rather than in English. The mother tongue is naturally a powerful medium of communication. Researchers in Pakistan and India therefore pay increasing attention to issues related to local languages (e.g. Urdu, Sindhi, Punjabi, Hindi).
Information Retrieval (IR) and Data Mining (DM) deal with the associations between words in sentences. Subject classification, result analysis and sentiment classification belong to the NLP knowledge base. NLP features are described in different ways; stems, lemmas, tokens, Part-of-Speech (POS) tags, stop words, shallow parsing and Named Entity Recognition (NER) play a major role in all NLP systems [4]. A lot of work has been done for English, Urdu and Arabic, but there is still a huge gap in research on the Sindhi language. For the survival of the language and for efficient communication in the mother tongue, applications in that language are urgently needed. This paper therefore focuses on the Sindhi language.

RELATED WORK
Bhatti et al. [5][6] proposed algorithms for building a Sindhi spell checker that checks spellings in Sindhi text and offers suggestions for misspelt words. An effective spell-checking system requires phonetic-based Sindhi language rules and patterns. The system uses the SoundEx algorithm for phonetic support and the ShapeEx algorithm for pattern matching, which together produce accurate text and suggestions for Sindhi text. The authors reported how suggestions for the correct expression are generated and developed software for checking the spelling of text. The main task of the application was to correct spellings and offer suggestions for misspelt words in a text document; it also checks for similar words in the text. At the initial stage, the authors established how the system should work and respond, and a basic architecture was implemented. The research depends on three main algorithms, which the authors of [6] used to check misspelt Sindhi text documents: the first computes edit distance, the second is phonetic (SoundEx) and the third finds patterns (ShapeEx).
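The first of these three algorithms, edit distance, can be illustrated with a minimal sketch. The code below is not the implementation of [5][6]; it is a standard Levenshtein distance with a hypothetical suggest helper, applied to a toy lexicon built from the Romanized example sentence in this paper. SoundEx and ShapeEx are not shown.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word: str, lexicon: list[str], max_distance: int = 2) -> list[str]:
    """Hypothetical helper: lexicon entries within max_distance, closest first."""
    scored = sorted((edit_distance(word, entry), entry) for entry in lexicon)
    return [entry for dist, entry in scored if dist <= max_distance]


# Toy lexicon (an assumption for illustration only).
lexicon = ["hin", "sal", "asan", "penn", "ghuman", "wendason"]
print(suggest("ghumn", lexicon))  # ['ghuman']
```

A real spell checker would combine such a distance score with phonetic and shape-based matching, as the cited work describes.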
Computers are used to read and write different languages [7]. They need instructions for handling these languages, and various analysis methods are available, one of which is Optical Character Recognition (OCR) [8]. Hakro et al. [8] reported that much work is available on OCR for Japanese, Chinese and Arabic scripts. In that research, the authors worked on an OCR system for Sindhi and Arabic scripts and noted that building a Sindhi OCR system needs more effort. Other studies describe OCR as a technology that lets a machine read handwritten text or images of text and reproduce them as machine-written text.
Hakro et al. [11] worked on OCR and reported the issues and challenges of applying it to Sindhi script. Identifying printed Sindhi text in an OCR system was very difficult. The task was challenging because Sindhi has 52 characters, compared with 39 in Urdu, 32 in Persian, 28 in Arabic and 26 in English. The writing style and letter forms of Sindhi are the same as those of Arabic. Sindhi was found to be more complex than the other languages because its script uses dots (one, two, three and four), which may be placed below, above, inside or between characters. Dootio and Wagan [12,13] worked on NLP and reported the many classes of Sindhi script and the characteristics of a Sindhi corpus. Much work has been done for English, and NLP tools are available that perform all tasks for English text, but no powerful application is available for feature extraction and corpus building in Sindhi. Sindhi is written from right to left, like Arabic and Urdu [9]. The use of Sindhi script is nevertheless increasing on every platform, especially social media, and Sindhi text also appears in various sources such as online magazines, newspapers, poetry and Sindhi learning websites. A huge amount of data is therefore available across different website domains. At present, however, no NLP tools are available online for tasks such as tokenizing documents into paragraphs, sentences, words and characters. In this research, the authors identified parts of speech in Sindhi text, and the most important task was sentiment analysis of Sindhi text in different domains.
The use of Roman script has increased along with the use of English in computer science and information technology. Most people use it on social media for communication because Roman script is easier to write than scripts such as Sindhi, Urdu and Arabic. Researchers have developed new applications for the use of this script. For the Arabic language, authors faced a more complex situation when they compared Roman script with Arabic script. The writing features of Arabic are also similar to those of Sindhi [8].
Kosurru et al. [14] proposed a system named RoLI to identify Romanized text for a variety of Indian languages. They developed detailed rules for identifying Romanized text in these languages. Their method can be applied to any language and needs few resources. RoLI achieved a high accuracy of 98.3% in experiments conducted over web pages in five Indian languages containing a mix of those languages. The authors also describe a variety of methods for Romanizing text.
Ali and Ijaz [15] worked on NLP and used a large corpus for a classification task. The documents contained 19.3 million words and were classified into six categories: finance, culture, sports, news, personal and consumer information. They pre-processed the contents of the six domains by performing tokenization, diacritics elimination, normalization, stop-word removal and stemming, and also used some statistical techniques.
Mukund et al. [16] reported that no suitable dataset for Urdu Language Processing (ULP) was available at the Centre for Research in ULP (CRULP) or the Computing Research Laboratory (CRL), so the authors took data from different resources to build a tool for research. Urdu and Hindi share a similar spoken form but are distinct languages with different scripts. Urdu is the national language of Pakistan and is written from right to left. Pre-processing is an important step before applying any NLP technique to any language; its sub-tasks include removing periods, removing words and stemming. Ayesha et al. [17] also worked with Roman Urdu and determined the polarity of the text.
Sodhar et al. [18] studied the issues and challenges in Romanized Sindhi text. They reported that computer technology has become very advanced and that an increasing number of communication applications have been developed for users. Users nevertheless face problems when typing Sindhi text for communication, including difficulties with punctuation, symbols, dots, noise, line breaks and font styles.
Bhatti et al. [19] worked on an academic informatics portal for the Sindhi community. The authors developed software for sharing ideas, pictures and other related materials; it is based on PHP, JavaScript and MySQL for compatible operating systems. Bhatti et al. [20] also worked on word segmentation of Sindhi text. They applied various NLP techniques to the segmentation problem and selected tokenization of Sindhi text.
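The pre-processing steps named above (tokenization, diacritics elimination, normalization and stop-word removal) can be sketched roughly as follows. This is not the pipeline of [15]: the stop-word list is a hypothetical placeholder, normalization is reduced to lower-casing, and stemming is left out because it is language-specific.

```python
import re
import unicodedata

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "is", "a", "of"}


def strip_diacritics(text: str) -> str:
    """Drop combining marks (Unicode category Mn), e.g. Arabic-script vowel signs."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def preprocess(text: str, stop_words: set[str] = STOP_WORDS) -> list[str]:
    """Strip diacritics, lower-case, tokenize on word characters, remove stop words."""
    # Diacritics are removed first so that they do not split words during tokenization.
    text = strip_diacritics(text).lower()
    tokens = re.findall(r"\w+", text)
    return [tok for tok in tokens if tok not in stop_words]


print(preprocess("The price of the ticket is high."))  # ['price', 'ticket', 'high']
```

The same function accepts Arabic-script input, since Python's \w matches Unicode letters; only the stop-word list would need to change.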

SINDHI DATASET
A key requirement for starting research is the availability of an appropriate dataset. NLP techniques are applied to the dataset, and the analysis of the results depends entirely on it. Sindhi training datasets are not yet available for research. There are large gaps in language resources for Sindhi, and the resources that are available are minimal.

Sindhi Alphabet
Sindhi is more difficult than other languages in both speaking and writing because its alphabet has fifty-two (52) letters, as shown in Fig. 1. Mobile phones and computer applications usually provide an English keyboard, but all 52 letters of the Sindhi alphabet are written in different ways and styles and have different sounds. For this reason it is very difficult to write and read the Sindhi script [8,10].
Datasets for Arabic-script languages can be constructed in two ways: (a) using the Unicode character set or (b) using an Extensible Markup Language (XML) file. Sindhi, like other Arabic-script languages, uses Unicode for storage.
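As a small illustration of Unicode storage, the sketch below inspects two letters with Python's standard unicodedata module. The helper name describe is an assumption introduced here; the code points are those of the Unicode Arabic block, which also holds the additional Sindhi letters.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


beh = "\u0628"   # ب, shared with Arabic
beeh = "\u067B"  # ٻ, one of the additional letters used in Sindhi

print(describe(beh))   # U+0628 ARABIC LETTER BEH
print(describe(beeh))  # U+067B ARABIC LETTER BEEH

# Extracted text sometimes holds presentation forms such as U+FE8F instead of
# the base letter; NFKC normalization folds them back to U+0628.
print(unicodedata.normalize("NFKC", "\uFE8F") == beh)  # True

# Stored as UTF-8, each of these letters occupies two bytes.
print(len(beeh.encode("utf-8")))  # 2
```

Normalizing text in this way before building a dataset keeps visually identical letters from being stored under different code points.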

Romanized Sindhi Text
Romanized Sindhi text is widely used by Sindhi people in different regions of Sindh and in other parts of the world to communicate with each other in various ways, for example in text messages and on social networks. It is used mainly on cell phones (text messages and social networks such as WhatsApp and Facebook), on websites, in translators and so on.
Sindhi script uses many symbols, but Romanized Sindhi Text uses none, as described in Rule 1. Rules 2-15 may be used to write Romanized Sindhi Text instead of Sindhi script. The vowels in Romanized Sindhi Text are the same as in English, as described in Rule 16. Single letters of Sindhi script (shown in Rule 17) have one meaning inside words, but when they are used separately in sentences they create different meanings and connect the paragraphs.
Many words in Sindhi script use double letters, but in Romanized Sindhi Text they may be written in different ways.

Rule-1
Symbols are not used in Romanized Sindhi Text (RST) because typing them is time-consuming when two people are communicating in writing on different platforms.
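One way to apply Rule-1 automatically is sketched below. It assumes that "symbols" means every character other than ASCII letters, digits and whitespace; the rule itself does not define the set, and the function name strip_symbols is hypothetical.

```python
import re


def strip_symbols(text: str) -> str:
    """Remove every character that is not an ASCII letter, digit or whitespace."""
    no_symbols = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # Collapse any runs of whitespace left behind by removed symbols.
    return re.sub(r"\s+", " ", no_symbols).strip()


print(strip_symbols("hin sal, asan penn Karachi ghuman wendason!!"))
# hin sal asan penn Karachi ghuman wendason
```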

Rule-2
ب and ٻ are written with the same Roman letters. ب (ba) and ٻ (ba) are in most cases written identically in Romanized Sindhi text.
In Romanized form, Q is used as K and K is used as Q.

Rule-14
In Romanized form, W is used as V and V is used as W.
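Because Q and K, and W and V, are interchangeable in Romanized form, software that compares Romanized Sindhi words may need to treat such spellings as equal. A minimal sketch follows; the choice of K and W as canonical letters, and the names rsr_key and same_word, are assumptions made here and are not part of the rules.

```python
# Assumed canonical letters: the rules state that the pairs are
# interchangeable but do not say which letter of each pair to prefer.
VARIANT_MAP = str.maketrans({"q": "k", "v": "w"})


def rsr_key(word: str) -> str:
    """Map a Romanized word to a key that ignores case and Q/K, V/W variation."""
    return word.lower().translate(VARIANT_MAP)


def same_word(a: str, b: str) -> bool:
    """True when two spellings differ only by case or by the Q/K and V/W variation."""
    return rsr_key(a) == rsr_key(b)


print(same_word("wendason", "vendason"))  # True
print(same_word("Karachi", "Qarachi"))    # True
print(same_word("sal", "hin"))            # False
```

Such a key could be used when searching or counting words, so that variant spellings of one word are grouped together.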

Rule-15
No word starts with X, although X is used in a few cases.
3. This year we will also be visiting Karachi.
هن سال اسان پڻ ڪراچي گهمڻ وينداسون
hin sal asan penn Karachi ghuman wendason.

CONCLUSION
In this research work, rules for Romanized Sindhi text were introduced for use on different platforms such as mobile text messages, WhatsApp chats, Facebook comments and Twitter (tweets). The literature review showed that no trained Sindhi corpus is available for the text classification of Romanized Sindhi text. This study offers a solution to the problems faced in text communication; the rules given in this paper are useful for computer and mobile users, who can apply them for different purposes. The work may help people in different regions of the world communicate easily on social media and may also support sentiment analysis and summarization of Romanized Sindhi text in the future.
Quaid-e-Awam University of Engineering, Science and Technology, Nawabshah, Pakistan. The authors also gratefully acknowledge the financial support provided by the Quaid-e-Awam University of Engineering, Science and Technology, Nawabshah.