Glyph Identification and Character Recognition for Sindhi OCR

Today's computers can process text in many human languages. Instructions and data can be supplied through various input methods; OCR (Optical Character Recognition) and handwritten character recognition are the input methods in which a scanned page containing text is converted into editable text. A change in the language of the scanned text demands a different recognition algorithm, because every language and script poses its own set of challenges. Latin script poses fewer difficulties than Arabic script and the languages written in it, and OCR systems for Latin-script languages are close to perfection. Very little work has been done on the regional languages of Pakistan. In this paper the glyphs of Sindhi, a regional language of Pakistan, are identified, along with the number of characters and connected components. A graphical user interface has been created to perform the identification of glyphs and characters of the Sindhi language. The glyphs of characters are successfully identified from a scanned page, and this information can be used to recognize characters. Glyph identification can also be used to identify the language, to select a suitable recognition algorithm, and to achieve a higher recognition rate.


INTRODUCTION
Various input devices capture text, voice and image data. OCR is an input method that takes little time [1] to capture large amounts of text. The text on a scanned paper or document is converted into editable text, which is much faster than typing the same amount of text on a keyboard. OCR applies a set of algorithms to identify, segment, extract and recognize machine-printed text. The text in an image may be in different languages or scripts, and each script demands suitable algorithms to segment and recognize its glyphs and characters [2]. Non-cursive scripts such as Latin and Cyrillic pose fewer challenges and are easier to recognize than Arabic and Indian scripts; the latter pose more challenges and still need a lot of attention from researchers [3]. scripts with addition of new problems and challenges.
Arabic has fewer character variations built on the same base shape, whereas Sindhi extends the same base shape with more dots, creating more characters. The placement and orientation of the dots are altered so that more characters can be formed in Sindhi. A complete list of issues and challenges is given in [5][6].
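The dot-based extension can be seen directly in Unicode. The short Python sketch below lists several letters that share the beh-like base shape; the selection of letters is made here for illustration and is not taken from the paper.

```python
import unicodedata

# Letters sharing the same "beh"-like base shape; they differ only in the
# number, position, or orientation of dots. Illustrative selection only.
BEH_FAMILY = [
    "\u0628", "\u062A", "\u062B",            # in the basic Arabic alphabet
    "\u067B", "\u0680", "\u067E",            # dotted variants used in Sindhi
    "\u067F", "\u067D", "\u067A",
]

def describe(chars):
    """Return (code point, Unicode name) pairs for the given characters."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]

for code_point, name in describe(BEH_FAMILY):
    print(code_point, name)
```

A recognizer for Sindhi therefore has to separate classes that differ only in small dot components, which is why the dots are treated as glyphs in their own right.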

RELATED WORK
Latin OCRs are at their peak level of accuracy, whereas research on languages that adopt the Arabic script still needs attention. OCR research on the national and regional languages of Pakistan is at a very early stage. A few researchers are working on regional languages [7][8][9][10][11]. The created text images are in .bmp format and

Glyph Identification
A glyph is the presentation of a character shape that is displayed when a character is rendered.
A glyph can represent one or more characters.
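One way to count the separate shapes in a binarized character image is connected-component labeling. The following is a minimal Python sketch under that assumption (the paper's own implementation is in MATLAB and is not shown); the function name and the toy image are hypothetical.

```python
from collections import deque

def count_components(binary):
    """Count 8-connected groups of ink pixels (value 1) in a 2-D list."""
    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # New component found: flood-fill it with a breadth-first search.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count

# Toy image: one base stroke with two separate dots above it.
sample = [
    [0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
]
print(count_components(sample))
```

In the toy image the base stroke and each dot form their own component, so a single character can contribute several components.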

Character Identification and Recognition
Character identification is a preliminary step for multiscript recognition, in which characters from multiple languages are optically recognized [2]. The proposed system for glyph identification and recognition of the Sindhi script is illustrated in Fig. 1.

Design and Development
The proposed system has been implemented in MATLAB 2015, and an interactive graphical user interface has been created, as shown in Fig. 2. The steps of the proposed system are arranged in sequence so that the process can be understood easily. The stages of the proposed system are described in the following subsections.

Load Image
Pressing the load image button loads an image into the image box, where it is scaled to fit. The image box shows the resized copy of the text image, while the system also keeps the original image. Any image format supported by MATLAB can be loaded. The loading of an image is shown in Fig. 3(a).

Preprocessing
The first stage loads an image, and this loaded image is the input to the preprocessing stage, where it is converted into a binary image [7], as shown in Fig. 3(b).
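The conversion to a binary image can be sketched as follows. This is a minimal Python illustration that assumes a fixed global threshold of 128 and the common BT.601 luminance weights; the paper does not state which thresholding method or parameters were used.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples to gray levels in the range 0-255."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]

def binarize(gray_image, threshold=128):
    """Mark dark pixels (ink) as 1 and light pixels (paper) as 0.

    The threshold value is an assumption made for this sketch.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray_image]

# Toy 2x2 colour image: black ink on white paper.
rgb = [
    [(0, 0, 0), (255, 255, 255)],
    [(255, 255, 255), (0, 0, 0)],
]
print(binarize(to_grayscale(rgb)))
```

Representing ink as 1 keeps the later stages simple, since summing a row or column then gives its ink count directly.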

Segmentation
The segmentation stage is considered the most important and crucial stage of any OCR. Here we used horizontal and vertical segmentation for separation of text lines from an image and characters from text lines.
Free space is an indicator for segmenting text lines from a text image and characters from text lines. After
FIG. 3(a). LOADING OF AN IMAGE
FIG. 3(b). PREPROCESSING STAGES
$Hr_I = \sum f_{ij}(I)$
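A common way to use free space for segmentation is the projection profile: rows with no ink separate text lines, and columns with no ink separate characters within a line. The Python sketch below illustrates that approach under the assumption that ink pixels are 1 and background pixels are 0; it is not a transcription of the paper's equation, and all function names are hypothetical.

```python
def horizontal_profile(binary):
    """Number of ink pixels in each row."""
    return [sum(row) for row in binary]

def vertical_profile(binary):
    """Number of ink pixels in each column."""
    return [sum(col) for col in zip(*binary)]

def runs(profile):
    """Return (start, end) pairs, end exclusive, of consecutive non-zero entries."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            spans.append((start, i))       # free space ends the run
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans

def segment(binary):
    """Split a page into text lines, then each line into character boxes.

    Returns a list of lines; each line is a list of
    (top, bottom, left, right) boxes with exclusive bottom/right.
    """
    result = []
    for top, bottom in runs(horizontal_profile(binary)):
        line = binary[top:bottom]
        boxes = [(top, bottom, left, right)
                 for left, right in runs(vertical_profile(line))]
        result.append(boxes)
    return result

page = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(segment(page))
```

This kind of column-based split only works when characters are separated by blank columns, which fits the isolated-character setting; connected cursive text would need a different strategy.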

Mehran University Research
The Equation (1)

Classification
In this stage the segmented characters are used as input. Features are extracted with a feature extraction algorithm [19] and passed to a feedforward neural network that recognizes the characters, as shown in Fig. 5.
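The data flow of this stage can be sketched in Python as below. Zoning (ink density per grid cell) is used here only as a stand-in, since the algorithm of [19] is not described in this section, and the weights are hand-picked rather than trained; the labels "top" and "bottom" are hypothetical.

```python
import math

def zoning_features(glyph, zones=2):
    """Ink density in each cell of a zones x zones grid over a binary glyph."""
    rows, cols = len(glyph), len(glyph[0])
    features = []
    for zr in range(zones):
        for zc in range(zones):
            r0, r1 = zr * rows // zones, (zr + 1) * rows // zones
            c0, c1 = zc * cols // zones, (zc + 1) * cols // zones
            cell = [glyph[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            features.append(sum(cell) / len(cell) if cell else 0.0)
    return features

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def feedforward(features, layers):
    """Propagate features through a list of (weights, biases) layers.

    weights[j] holds the input weights of neuron j in that layer.
    """
    activations = features
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(neuron, activations)) + b)
            for neuron, b in zip(weights, biases)
        ]
    return activations

def classify(glyph, layers, labels):
    """Return the label whose output neuron has the highest activation."""
    outputs = feedforward(zoning_features(glyph), layers)
    return labels[outputs.index(max(outputs))]

# Hand-picked (untrained) weights: one hidden layer and one output layer.
LAYERS = [
    ([[1, 1, -1, -1], [-1, -1, 1, 1]], [0, 0]),
    ([[1, -1], [-1, 1]], [0, 0]),
]
LABELS = ["top", "bottom"]

top_heavy = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
print(classify(top_heavy, LAYERS, LABELS))
```

In a real system the weights would be learned from labeled character images, and the output layer would have one neuron per character class.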

Output
The extracted features are used to recognize the characters. Using the language mapping, each character is identified as belonging to a specific language. We mapped the Sindhi script according to its number of characters and number of glyphs. The number of glyphs is the indicator of a particular language. After glyph identification, an OCR for isolated characters is applied so that the characters can be recognized. The recognized characters are displayed on the user interface, and the export alphabet button exports them to any text editor.

Export Alphabet
The last stage of the proposed system exports the alphabet, in the same sequence, to a text editor supporting Unicode.
In our case we export a text file containing the recognized characters. This text file can be opened in any text editor supporting Unicode.
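The export step amounts to writing the recognized characters to a Unicode-encoded text file. A minimal Python sketch is given below; it assumes UTF-8 as the encoding, and the function name and file name are hypothetical.

```python
from pathlib import Path

def export_characters(characters, path):
    """Write recognized characters, in their original order, to a UTF-8 file."""
    text = "".join(characters)
    Path(path).write_text(text, encoding="utf-8")
    return text
```

For example, `export_characters(["\u0628", "\u067B"], "recognized.txt")` would produce a file that any Unicode-aware editor can open. Stating the encoding explicitly avoids depending on the platform default, which may not be able to represent Arabic-script characters.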