A Database for Urdu Text Detection and Recognition in Natural Scene Images

This paper describes a novel database for Urdu text detection and recognition in natural scene images. Many standard benchmarks have been published for Latin text, and remarkable classification and recognition techniques for text extraction in natural scenes have been proposed on them. Recently, a dataset for multi-language text in natural scene images was published by the International Conference on Document Analysis and Recognition (ICDAR). It contains natural scene images in six different languages, including Arabic, Korean and Chinese. Currently, no dataset is available for Urdu text in natural scene images. Therefore, the main objective of this paper is to create a novel dataset of Urdu text in natural scene images and provide it to the research community for developing and evaluating state-of-the-art algorithms for text localization and recognition. The dataset consists of cropped words and segmented character images from natural scenes. All characters are manually segmented from the captured images. The images are captured under varying lighting conditions and include low resolution, occlusion and perspective distortion. The dataset consists of 8000 cropped Urdu word images and 16000 segmented Urdu character images in different forms (isolated, initial, medial and final). The dataset is further enlarged by synthetically generating Urdu characters and placing them on real background images. The dataset is compared with recently published Arabic natural scene datasets and Latin text datasets, including ARASTI, ICDAR03 and Chars74k. The proposed dataset contains more natural scene images, segmented characters and cropped words, which indicates that it can be used as a benchmark for recognizing Urdu text in natural scene images.


INTRODUCTION
Text recognition in natural scene images has become a useful and challenging task in many real-world applications. The text within natural scene images carries valuable information that helps to interpret the scene and to understand other textual cues. Text is one of the most common means of communication. Text extraction in natural images is generally divided into two phases: detection and recognition. In detection, the image is checked to determine whether it contains text; in recognition, the detected text is converted into machine-readable form. Text recognition has traditionally been performed on scanned documents, where the text is usually black on a plain white background and arranged in regular lines. In scanned documents, the text usually appears in a consistent font type, size, color and style and on fixed lines. Therefore, Optical Character Recognition (OCR) systems perform very well and accurately on scanned documents. However, these OCR systems fail when applied to text in natural scene images because of various challenges, including complex backgrounds, uneven lighting conditions, low resolution, blur, occlusion, and variations in font size, type, color and orientation. Natural scene images also contain many other objects whose structure resembles text, which makes the recognition process even more complex.
However, more than 100 languages are commonly written and spoken around the world, and many natural scene images contain text in more than one language.
This suggests that if text recognition in natural scene images is carried out for other languages, it could help foreign tourists to translate and understand what is written on road signboards, shop names, advertisement banners and product labels.
Recently, some research work on isolated Arabic and Urdu character recognition in natural scene images has been reported, and a dataset for Arabic scene text recognition has been developed [11]. This is the first benchmark for Arabic character recognition in natural images. Baseline research on isolated Urdu character recognition in natural scene images has been done in [12], [13]. However, no dataset is available for Urdu text recognition in natural scene images. The availability of standard datasets is important for evaluating existing state-of-the-art algorithms and for training and testing machine learning classifiers for scene text recognition. Therefore, the main objective of this research is to capture natural images and to build from them a dataset of cropped Urdu words and segmented Urdu characters. The dataset will later be extended to support whole-image text detection and end-to-end text extraction algorithms.
The rest of the paper is organized as follows. The next section describes related existing datasets for other scripts. Section III presents the proposed dataset, including the segmented characters and words. Section IV explains the characteristics of the synthetic dataset, and Section V gives concluding remarks and some possible future enhancements to the dataset.

PROPOSED DATASET
The proposed dataset is compared with the currently available character datasets in natural images, and the numbers of images, cropped words and segmented characters are reported for comparison. The character classes have unbalanced numbers of samples because some characters are rarely used in text while others are very common. As a result, the number of samples per character class ranges from 30 to 1580. To overcome the problem of unbalanced classes, a synthetic dataset of Urdu characters is created. The details of the synthetic dataset are described in the next section.
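To make the class imbalance concrete, the following minimal Python sketch computes how many synthetic samples each character class would need to reach a common target size. It is only an illustration under stated assumptions: the function name `synthetic_quota`, the class names, and the choice of the largest class as the default target are hypothetical and are not taken from the procedure used to build the dataset. Only the two sample counts (30 and 1580) come from the text.

```python
def synthetic_quota(counts, target=None):
    """Return the number of synthetic samples needed per class.

    counts: mapping from class name to number of real samples.
    target: desired samples per class; defaults to the largest class
            (an assumption made for this sketch).
    """
    if target is None:
        target = max(counts.values())
    # A class that already meets the target needs no synthetic samples.
    return {cls: max(0, target - n) for cls, n in counts.items()}


# Hypothetical class names; the counts are the reported extremes.
counts = {"least_frequent_class": 30, "most_frequent_class": 1580}
quota = synthetic_quota(counts)
print(quota)
```

Under these assumptions, the rarest class would need 1550 synthetic samples to match the most frequent one. A lower `target` could be passed instead if full balancing is not required.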
Cropped Urdu Word Image Dataset: a dataset of 8000 cropped word images from natural scenes has also been developed. In future, the dataset will be extended with more cropped words and characters, and ground-truth bounding boxes at the word level will also be created.