Zernike Moments Based Handwritten Pashto Character Recognition Using Linear Discriminant Analysis

This paper presents an efficient Optical Character Recognition (OCR) system for offline recognition of isolated Pashto characters. Developing an OCR system for handwritten characters is challenging because handwritten characters vary in both shape and style, and they usually also vary from one writer to another. Recognizing handwritten Pashto letters is even more challenging because no standard handwritten Pashto character database is available. For experimental and simulation purposes, a handwritten Pashto character database was developed by collecting handwritten samples from university students on A4-sized pages. The collected samples were then scanned, segmented and preprocessed to form a medium-sized database of 14784 handwritten Pashto character images (336 distinct handwritten samples for each of the 44 characters in the Pashto script). Zernike moments are used as the feature extractor of the proposed OCR system to extract the features of each individual character. Linear Discriminant Analysis (LDA) is used as the recognition tool, operating on the feature map calculated with Zernike moments. The proposed system is validated with 10-fold cross-validation, and an overall accuracy of 63.71% is obtained for isolated handwritten Pashto characters.


INTRODUCTION
In the last decades, a lot of research has been reported on machine learning and pattern recognition problems. Optical Character Recognition (OCR) is a significant research problem in pattern recognition. OCR converts images of text into a computer-readable format. State-of-the-art techniques have been proposed for languages such as English, Chinese, Arabic, Hindi, Dari and Persian, among others, and high accuracy has been reported for these languages. Cursive-script languages like Arabic, Pashto and Urdu remain open research fields because of the complexity of their writing and word formation. In addition, the writing styles of these languages vary from person to person, and even vary slightly for the same person on different occasions. These are the main hurdles to attaining state-of-the-art performance in cursive-script languages.
1 Department of Computer Science, University of Swabi, KP, Pakistan. Email: a sardarjehangir88@gmail.com, b mukhlisdagai@gmail.com, c engr.sulaiman88@gmail.com (Corresponding Author), d shahnazir@uoswabi.edu.pk, e anwar@uoswabi.edu.pk
According to the literature, almost no research has been reported on handwritten Pashto character recognition. Boufenar et al. [1] presented an Artificial Immune Recognition (AIR) system using 7 types of features, including both statistical and structural features, for offline Arabic letter recognition. Abandah and Anssari [2] recognized handwritten Arabic letters using Normalized Central Moments (NCMs) and Zernike Moments (ZMs) as features with Support Vector Machines (SVM). Bhuiyan and Alsaade [3] presented a hybrid neural network approach for Arabic character recognition. The system is composed of a Bidirectional Associative Memory and a Multi-Layer Perceptron (BAMMLP), and it produces classification results in less than 1 ms. Oujaoura et al. [4] suggested a method for offline Arabic letter identification using three feature extraction techniques, including Zernike moments, in conjunction with neural networks. Zernike moments outperformed the other two in recognition rate. Sulaiman et al. [5] used KNN and an artificial neural network for handwritten Pashto character recognition based on zoning features.
Naz et al. [6] investigated Urdu Nastali'q text recognition using multiple geometrical features and Multi-Dimensional Long Short-Term Memory (MDLSTM) neural networks. Jameel et al. [7] used basis spline (B-Spline) curves as a feature extractor and a neural network for the recognition of isolated Urdu characters. Jameel and Kumar [8] also proposed basis spline curves in conjunction with an Artificial Neural Network (ANN) for offline handwritten Urdu characters. Ahmed et al. [9] presented an algorithm for Urdu letter identification using a Bidirectional Long Short-Term Memory (BLSTM) system. They also introduced a new database, the Urdu-Nasta'liq handwritten dataset. This paper presents an OCR system for the recognition of isolated handwritten Pashto characters. The Pashto language has a character set of 44 letters. It shares the same cursive style as Arabic, Persian and Urdu. Pashto text is written from right to left.
The paper is organized as follows. Section 2 reviews related work on Pashto text recognition. Section 3 describes the Pashto script. Section 4 describes the proposed methodology of the OCR system, which uses Zernike moments as a feature extractor for isolated handwritten Pashto character recognition. Section 5 explains the results of the proposed research, followed by the conclusion in Section 6.

RELATED WORK
For the last few decades, handwritten character recognition has been a prominent research problem in the field of image processing and machine learning. Significant improvements have been made in OCR systems for languages like English, Chinese and Japanese. Languages like Arabic, Persian and Urdu still need an effective handwritten character recognition system. The main problem associated with these languages is their cursive writing style. The Pashto language shares the same cursive nature. Several studies on automatic OCR systems have been reported for languages like Arabic, Persian and Urdu, but almost no work has been reported on handwritten Pashto characters. A little work has been reported on printed letter recognition in Pashto; for example, Ahmad et al. [10] investigated an optical character recognition system for printed Pashto characters using k-nearest neighbors.
Tavoli et al. [11] proposed a new feature extractor for the recognition of Arabic and Persian words, namely the Statistical Geometric Components of Straight Lines (SGCSL) technique. Abandah and Anssari [2] recognized handwritten Arabic letters using NCMs and ZMs as features with an SVM classifier. Bhuiyan and Alsaade [3] presented a hybrid neural network approach for Arabic character recognition consisting of a bidirectional associative memory and a multilayer perceptron (BAMMLP). Boufenar et al. [1] presented an Artificial Immune Recognition (AIR) system using 7 types of features, including both statistical and structural features, for offline handwritten Arabic character recognition.
Sahlol et al. [12] presented an Arabic OCR system using a number of optimizers. The CENPARMI dataset was used to test the system with three classifiers: Linear Discriminant Analysis (LDA), SVM and Random Forest Trees (RFT). Oujaoura et al. [4] suggested a method for offline Arabic character recognition using three feature extractors in conjunction with neural networks. Zernike moments outperformed the other two in recognition rate. Aranian et al. [13] proposed a hybrid approach using an artificial neural network, a genetic algorithm and a quantum genetic algorithm for the identification of Persian handwritten characters. They also performed feature dimensionality reduction on the datasets. Shafique et al. [14] suggested the use of a neural network for the recognition of handwritten Sindhi characters. Naz et al. [15] proposed an entity recognition system for the Urdu language using hybrid unigram and bigram approaches based on the IJCNLP NE dataset and the CRL NE dataset.
Naz et al. [16] presented a hybrid approach for Urdu Nastali'q text recognition using a hierarchical combination of Convolutional Neural Networks (CNN) and MDLSTM. They tested the system on the Urdu Printed Text line Images (UPTI) dataset, producing state-of-the-art recognition results. Naz et al. [17] suggested geometrical features and multidimensional long short-term memory for Urdu Nasta'liq text recognition using a sliding window technique. Naz et al. [18] suggested the use of zoning features and 2DLSTM networks for Urdu text recognition. The system performance was evaluated on the UPTI dataset. Ahmed et al. [9] presented an algorithm for Urdu letter identification using BLSTM. They introduced a new database called the Urdu Nastali'q Handwritten Dataset (UNHD). This paper presents an OCR system for offline isolated Pashto characters using Zernike moments as the feature extraction technique and LDA as the classification tool.

PASHTO SCRIPT
Pashto is the official language of Afghanistan and a major language of the Pashtun tribe in the northern areas (Khyber Pakhtunkhwa) of Pakistan. Census estimates for 2007-2009 put the number of native speakers around the globe at about 40-60 million [19]. The language has a hard dialect and a soft dialect. The soft dialect is termed Southern, while the hard dialect is known as Northern. The two differ from each other on a phonological basis. The script is cursive in nature and has borrowed all the characters of the Arabic, Persian and Urdu scripts with some modification, plus six additional characters specific to Pashto, giving the 44-character set shown in Fig. 1.

PROPOSED METHODOLOGY
Any handwritten OCR system consists of three major parts: the input (handwritten character images), a feature extraction tool (to calculate discriminative features from the handwritten characters) and a classification tool for recognition. For the proposed Handwritten Pashto Characters Recognition (HPCR) system, a handwritten Pashto character database is developed as the input, Zernike moments are used for feature extraction, and linear discriminant analysis is selected for recognition. Fig. 2 shows the proposed methodology of the HPCR system.

Database Development
No handwritten Pashto character database is available for simulation purposes. A database of 14784 handwritten Pashto character samples is therefore developed by collecting samples from students and teachers at the university, varying in age, gender and educational background. Table 1 shows the age-wise distribution of the collected samples. The Pashto script consists of 44 characters, which are difficult to fit on one page, so the handwritten samples are collected on two pages. The first 23 characters are collected on one page, shown in Fig. 3, and the remaining 21 on the other page, shown in Fig. 4. Each page is divided into six columns to obtain varied samples from different students.
The scanned images are then processed further to extract the individual character samples by applying preprocessing steps (discussed in the next section) to build the database for the proposed OCR system.

Preprocessing
To extract uniform features, it is necessary to apply some filtering techniques to the character database so that the images are uniform. The database images contained noise, and the characters appeared at different locations (left, right, top, bottom). The images are therefore normalized to fixed-size character images.
In this research work, the noise (black dots) is removed using thresholding. In our experiments, we arrived at an optimum threshold value of 30. After noise removal, the morphological operations of erosion and dilation were applied to fix the skeleton of the handwritten characters in the sliced images. After the morphological operations, all the characters are centered in the sliced images and converted to a fixed size of 80 × 80. The result obtained is shown in Fig. 5.
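The preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not say whether the threshold of 30 is an intensity or an area cutoff (the sketch treats it as a grayscale distance from white), the 3 × 3 structuring element and the erosion-then-dilation order are assumed, and the function names are hypothetical.

```python
import numpy as np

def _neighbourhood(mask):
    """Stack the 3x3 neighbourhood of every pixel (zero padded)."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    h, w = mask.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])

def erode(mask):
    # A pixel survives only if its whole 3x3 neighbourhood is ink.
    return _neighbourhood(mask).all(axis=0)

def dilate(mask):
    # A pixel becomes ink if any pixel in its 3x3 neighbourhood is ink.
    return _neighbourhood(mask).any(axis=0)

def preprocess(gray, threshold=30, size=80):
    """Binarize, clean, centre and resize one sliced character image.

    Assumptions (not stated in the paper): 8-bit grayscale input with dark
    ink on a light page; `threshold` is a distance from white.
    """
    gray = np.asarray(gray)
    ink = (255 - gray.astype(int)) > threshold
    # Erosion followed by dilation removes isolated dots smaller than 3x3.
    ink = dilate(erode(ink))
    rows = np.flatnonzero(ink.any(axis=1))
    cols = np.flatnonzero(ink.any(axis=0))
    if rows.size == 0:                      # blank slice
        return np.zeros((size, size), dtype=bool)
    glyph = ink[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    # Centre the cropped glyph on a square canvas to keep its aspect ratio.
    h, w = glyph.shape
    side = max(h, w)
    canvas = np.zeros((side, side), dtype=bool)
    top, left = (side - h) // 2, (side - w) // 2
    canvas[top:top + h, left:left + w] = glyph
    # Nearest-neighbour resize to size x size.
    idx = np.arange(size) * side // size
    return canvas[np.ix_(idx, idx)]
```

Cropping to the bounding box before resizing is what removes the dependence on where the writer placed the character in the cell. Note that a 3 × 3 opening also erases strokes thinner than three pixels, so the structuring element would need tuning against the scan resolution.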

Feature Extraction
Feature extraction is the process by which the essential characteristics of the character image are represented in a much lower-dimensional space.
Rpq is the orthogonal radial polynomial given in equation (2). Zernike moments are calculated here for a digital image with order p and repetition q (p - |q| must be even, with |q| ≤ p), and are given by equation (3). Thirty-six Zernike features are calculated for each image, up to order 10 and repetition 10.
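For reference, the standard textbook definitions of the Zernike basis, radial polynomial and discrete moment are given below. They may differ in normalization and numbering from the equations referred to in the text.

$$V_{pq}(\rho,\theta) = R_{pq}(\rho)\,e^{jq\theta}$$

$$R_{pq}(\rho) = \sum_{s=0}^{(p-|q|)/2} (-1)^s \frac{(p-s)!}{s!\,\left(\frac{p+|q|}{2}-s\right)!\,\left(\frac{p-|q|}{2}-s\right)!}\,\rho^{\,p-2s}$$

$$Z_{pq} = \frac{p+1}{\pi} \sum_{x}\sum_{y} f(x,y)\,V^{*}_{pq}(\rho,\theta), \qquad x^2+y^2 \le 1$$

The following sketch evaluates these definitions with NumPy. It assumes that only q ≥ 0 is used and that the moment magnitudes |Z_pq| serve as features (neither is stated in the text), and it scales the sum by the pixel area so that the result approximates the continuous integral. The function names are hypothetical. Counting the (p, q) pairs with p ≤ 10, 0 ≤ q ≤ p and p - q even gives 36, which matches the feature count reported above.

```python
import numpy as np
from math import factorial

def zernike_indices(max_order=10):
    """All (p, q) with 0 <= q <= p <= max_order and p - q even."""
    return [(p, q) for p in range(max_order + 1)
            for q in range(p + 1) if (p - q) % 2 == 0]

def radial_poly(p, q, rho):
    """Zernike radial polynomial R_pq evaluated on an array of radii."""
    r = np.zeros_like(rho, dtype=float)
    for s in range((p - q) // 2 + 1):
        coeff = ((-1) ** s * factorial(p - s)
                 / (factorial(s)
                    * factorial((p + q) // 2 - s)
                    * factorial((p - q) // 2 - s)))
        r += coeff * rho ** (p - 2 * s)
    return r

def zernike_features(img, max_order=10):
    """Magnitudes |Z_pq| of a square image mapped onto the unit disc."""
    n = img.shape[0]                       # assumes a square image
    # Pixel centres mapped symmetrically into (-1, 1).
    coords = (2 * np.arange(n) + 1 - n) / n
    x, y = np.meshgrid(coords, coords)
    rho, theta = np.hypot(x, y), np.arctan2(y, x)
    f = img.astype(float) * (rho <= 1.0)   # ignore pixels outside the disc
    pixel_area = (2.0 / n) ** 2
    feats = []
    for p, q in zernike_indices(max_order):
        basis_conj = radial_poly(p, q, rho) * np.exp(-1j * q * theta)
        z = (p + 1) / np.pi * np.sum(f * basis_conj) * pixel_area
        feats.append(abs(z))
    return np.array(feats)
```

Using magnitudes discards the phase factor that a rotation introduces, which is why Zernike magnitudes are commonly treated as rotation-invariant descriptors.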

Classification
Classification is the most important step in the development of an OCR system. An efficient linear discriminant analysis (LDA) classifier is proposed to classify the handwritten Pashto characters based on the calculated Zernike feature map. LDA is a generalization of Fisher's linear discriminant. In related work, an OCR system was developed for glyph and Sindhi character recognition; in that approach, the glyphs are successfully identified from scanned images and the characters are recognized. Awan et al. [14] used a neural network for the recognition of handwritten Sindhi characters based on a zoning feature extraction algorithm. Although all these techniques work well for the problems they address, they do not carry over to the Pashto script because of its large character set and the variation among its characters. This paper therefore suggests the use of LDA for classification in the proposed OCR system. LDA is generally used as a dimensionality reduction tool and performs well in multi-class problems. The technique works by finding a projection that gives the maximum separation between the means of the projected classes and the minimum variance within each projected class. Fig. 5 presents a generalized model of the LDA classification tool.
The Pashto script consists of 44 characters (in other words, 44 classes), which makes recognition a multi-class problem. Fig. 6 represents a conventional multi-class LDA classifier system.
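A multi-class LDA classifier of the kind described here can be sketched in a few lines of NumPy. This is an illustrative implementation of the common shared-covariance formulation, in which each class k gets a linear score x·Σ⁻¹μ_k - ½ μ_k·Σ⁻¹μ_k + log π_k and the highest score wins. It is not the authors' code. The class name `SimpleLDA` and the small ridge term `reg` are assumptions added for numerical stability.

```python
import numpy as np

class SimpleLDA:
    """Multi-class LDA with a pooled (shared) within-class covariance."""

    def fit(self, X, y, reg=1e-6):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        n, d = X.shape
        # Per-class means and empirical class priors.
        self.means_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        # Pooled within-class scatter, normalised to a covariance estimate.
        sw = np.zeros((d, d))
        for c, mu in zip(self.classes_, self.means_):
            diff = X[y == c] - mu
            sw += diff.T @ diff
        sw /= (n - len(self.classes_))
        sw += reg * np.eye(d)              # assumed ridge term for stability
        # Row k of coef_ is Sigma^{-1} mu_k.
        self.coef_ = np.linalg.solve(sw, self.means_.T).T
        self.intercept_ = (-0.5 * np.sum(self.means_ * self.coef_, axis=1)
                           + np.log(self.priors_))
        return self

    def decision_function(self, X):
        # One linear discriminant score per class.
        return np.asarray(X, dtype=float) @ self.coef_.T + self.intercept_

    def predict(self, X):
        return self.classes_[np.argmax(self.decision_function(X), axis=1)]
```

For the system described in this paper, `X` would hold one 36-dimensional Zernike feature vector per character image and `y` the 44 class labels. Because every class shares one covariance matrix, the decision boundaries between any two characters are linear in the feature space.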

RESULTS AND DISCUSSIONS
Recognition results are generated with the LDA classification tool, based on the Zernike feature map calculated in the previous step. For training and testing purposes, the calculated feature map is divided in a 2:1 ratio. With this split, an accuracy of 63.71% is obtained using 10-fold cross-validation. LDA is also tested on varying training and test sets. The training set takes the values 50%, 55%, 60%, 65%, 70%, 75% and 80% of the data, while the remainder is used as the test set. The accuracy calculated for each of these training and test sets is shown in Fig. 7.

Fig. 7: Training Set vs Accuracy Graph
It is evident from Fig. 7 that as the training set grows, the recognition accuracy of the proposed HPCR system also increases.
The time consumed for each training and test split is also measured for the proposed HPCR system. Classification accuracy and time consumption are plotted against the varying training and test sets in Fig. 8.
It is evident from Fig. 8 that when the training set grows, the accuracy of the proposed HPCR system increases along with its time consumption. Across the simulations with varying training and test sets based on the Zernike feature map, 63.71% is the highest accuracy achieved by the proposed HPCR system.
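The evaluation protocol described in this section, 10-fold cross-validation plus a sweep over training fractions with timing, can be sketched as below. It is a hedged illustration: the splits are plain random permutations (whether the authors stratified by class is not stated), the function names are hypothetical, and `nearest_centroid_fit_predict` is only a placeholder classifier so that the block is self-contained. It is not the LDA classifier used in the paper.

```python
import time
import numpy as np

def nearest_centroid_fit_predict(X_train, y_train, X_test):
    """Placeholder classifier (NOT the paper's LDA): nearest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dist = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

def kfold_accuracy(fit_predict, X, y, k=10, seed=0):
    """Mean accuracy over k folds of a random (unstratified) partition."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(np.mean(pred == y[test_idx]))
    return float(np.mean(scores))

def holdout_sweep(fit_predict, X, y,
                  fractions=(0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80),
                  seed=0):
    """Accuracy and wall-clock time for each training-set fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    results = []
    for frac in fractions:
        n_train = int(round(frac * len(y)))
        train_idx, test_idx = order[:n_train], order[n_train:]
        start = time.perf_counter()
        pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        elapsed = time.perf_counter() - start
        results.append((frac, float(np.mean(pred == y[test_idx])), elapsed))
    return results
```

Any classifier can be plugged in through the `fit_predict` callable, which takes training features, training labels and test features and returns predicted labels. The tuples returned by `holdout_sweep` correspond to the quantities plotted in Fig. 7 and Fig. 8: training fraction, accuracy and elapsed time.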

CONCLUSION
In this paper, an OCR system for the recognition of handwritten Pashto characters is presented. A medium-sized database of 14784 characters is developed by collecting samples at the university from people varying in age, gender and educational background. Zernike moments are used as the feature extractor in the proposed OCR system, and an efficient LDA classifier is used to classify the individual character images. An accuracy of 63.71% was obtained using 10-fold cross-validation.
In future, we intend to improve the accuracy of the system by using different combinations of features and classifiers. We also want to increase the number of database samples to achieve higher accuracy, and to extend this work to word and script recognition.