Artificial Urdu Text Detection and Localization from Individual Video Frames

In the current era of technology, acquiring information from images and videos has become a crucial task due to the rapid development of data mining and machine learning. This information can be textual, visual, or a combination of the two. Text appearing in images or videos is a significant source of information and plays a vital role in understanding their content. Developing a unified method to detect text is hard, as textual properties (i.e. font, size, color, illumination, orientation, etc.) may vary against complex backgrounds. So far, the multimedia and computer vision communities have been unable to standardize any ideal approach for extracting text reliably. In this paper, a novel method is proposed to detect and localize artificial Urdu text in individual video frames. First, the Sobel and Canny edge detection operators are applied to the input frame and merged with MSER (Maximally Stable Extremal Region) detected regions. Next, geometric constraints are applied to eliminate obvious non-text regions with large and small variations. Non-text regions are refined further using the stroke width transform. An SVM (Support Vector Machine) classifier is trained to classify text and non-text objects. Finally, bounding boxes are used to localize the text. Experimental results show that the proposed method is more robust and efficient than state-of-the-art methods.


INTRODUCTION
Text exists under varying conditions such as font, size, color, orientation, and illumination, as shown in Fig. 1. Researchers have developed mechanisms and methods for several specific applications such as assistance for visually impaired people [4], video indexing and retrieval [5], document analysis [6], content-based image search, automatic translation, and so on. Over the last decade, many researchers and scientists have paid increasing attention to text acquisition from video images; however, it remains a challenging task due to the varying properties of text, including unwanted reflections, shadows, and complex backgrounds.

FIG. 1. SAMPLE IMAGES OF ARTIFICIAL URDU TEXT UNDER VARYING CONDITIONS
In the last few years, many researchers have explored methods for text detection and localization in videos in order to develop robust video retrieval systems [7][8][9][10].
Most of these methods focus on English, with some on Chinese and other languages. However, very little work exists for Urdu text. Urdu is the national language of Pakistan and an official language in several states of India. As reported on Wikipedia, Urdu has 65 million native speakers and 40 million second-language speakers worldwide. The language formed under the influence of Arabic and Persian and is written right to left. Like Arabic and Farsi characters, Urdu characters can take distinct shapes depending on their position in a word, i.e. initial, middle, final, or standalone. Urdu has 38 letters, and there is no distinction between upper- and lower-case characters.
Recently, notable progress has been made on Urdu text processing.
Researchers have investigated robust methods for Urdu OCR (Optical Character Recognition) [11][12][13], Urdu handwritten text recognition [14], and Urdu document analysis [15]. However, extraction of Urdu text from video images is still a rarely explored research area compared to other languages (e.g. English, Chinese). As there are more than 105 Urdu TV channels worldwide covering sports, movies, music, news, religion, and education, there is a great need to extract Urdu text more robustly. Considering this, we propose a new approach which efficiently detects and localizes artificial Urdu text in video frames.
In this paper, we propose a framework which robustly detects and localizes Urdu text. The framework is more robust and efficient than the state-of-the-art methods available for the Urdu language. The contributions of this paper are:
 A novel framework based on MSER and SWT (Stroke Width Transform) is proposed to detect artificial Urdu text.
 The proposed method is robust for images with complex backgrounds while keeping computational complexity low.
The rest of the paper is organized as follows: In Section 2, the literature available for Urdu and other languages is described. In Section 3, the proposed methodology is presented. Experimental results and performance evaluation are given in Section 4. Section 5 concludes the paper and outlines future directions of the proposed work.

RELATED WORK
The

PROPOSED METHOD
In this section, we describe a novel framework to detect and localize artificial Urdu text in video frames. The general workflow of our proposed methodology is shown in Fig. 2.

Text Detection
To detect textual regions in an individual frame, we first apply the Sobel and Canny edge detectors to find potential edges, as shown in Fig. 3(b-c). Then the MSER [21] feature detector is employed, as it extracts the most features in areas of high contrast and regular color intensity. The detected regions are shown in Fig. 3(d). The pixel area is set to 120 < Area < 400 and the threshold is set to 3. The Sobel and Canny edge maps are merged with the MSER regions to cope with blurred frames.
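The edge/region merge described above can be sketched in a few lines. This is a minimal NumPy-only illustration on a synthetic frame: the Sobel gradient magnitude stands in for the edge maps (a real pipeline would also use Canny, e.g. via OpenCV), and a simple intensity threshold stands in for the MSER-detected regions; all thresholds here are illustrative assumptions, not the paper's values.

```python
import numpy as np

def sobel_edges(gray, thresh=1.0):
    """Gradient-magnitude edge map via 3x3 Sobel kernels (borders skipped)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = gray.shape
    mag = np.zeros((h, w))
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = gray[i - 1:i + 2, j - 1:j + 2]
            gx = np.sum(kx * patch)   # horizontal gradient
            gy = np.sum(ky * patch)   # vertical gradient
            mag[i, j] = np.hypot(gx, gy)
    return mag > thresh

# Synthetic frame: a bright square (a "character") on a dark background.
frame = np.zeros((10, 10))
frame[3:7, 3:7] = 10.0

edges = sobel_edges(frame, thresh=1.0)
region_mask = frame > 5.0          # stand-in for an MSER-detected region
merged = edges | region_mask       # union, mimicking the edge/MSER merge
```

Merging the two maps recovers character interiors that edge detection alone misses (interior pixels have zero gradient), which is how the merge helps on blurred frames.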

Localization and Validation
The obtained regions are localized as text and non-text objects and then validated using geometric constraints to filter out non-text objects. Binarization of the input frame is enhanced via Otsu's method. To filter out obvious non-text objects, we use simple geometric constraints such as width, height, and aspect ratio. Objects with maximum and minimum variations are eliminated first. Then we threshold the aspect ratio of objects at 0.2, as Urdu text can have connecting characters. We use constraints different from those of ICDAR [22], which were observed to be the best features by [23]. The filtered regions are shown in Fig. 4(a).
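A geometric filtering step of this kind can be sketched as follows. The width and height bounds below are hypothetical placeholders (the paper only gives the 0.2 aspect-ratio threshold); the aspect ratio is taken as height/width, with wide, low boxes kept down to 0.2 to accommodate connected Urdu ligatures.

```python
# Candidate regions as (x, y, width, height) boxes.
def filter_regions(boxes, min_w=5, max_w=300, min_h=8, max_h=100,
                   min_aspect=0.2):
    kept = []
    for (x, y, w, h) in boxes:
        if not (min_w <= w <= max_w and min_h <= h <= max_h):
            continue  # eliminate regions that are far too small or too large
        if h / w < min_aspect:
            continue  # too flat even for a wide Urdu ligature
        kept.append((x, y, w, h))
    return kept

candidates = [
    (10, 10, 60, 20),   # plausible text ligature (aspect 0.33)
    (0, 0, 2, 2),       # too small (noise)
    (5, 5, 400, 50),    # too wide (background band)
    (50, 50, 100, 10),  # aspect ratio 0.1 < 0.2 -> rejected
]
print(filter_regions(candidates))  # only the first box survives
```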

Segmentation and Extraction
The extraction step separates background pixels from the foreground. We use stroke width variation [24] for further extraction of true text regions. The stroke width is the length of a straight line from a text pixel to another pixel along its gradient direction; it measures the width of the curves and lines that make up a character. Text regions tend to have low stroke-width variation, while non-text regions tend to have high variation. A skeleton image of the remaining regions is obtained by computing the distance transform from each pixel to its nearest boundary pixel. We set the threshold rate to 0.3 and apply the procedure to each region filtered in the previous step. The stroke-width distances segment out further non-text objects, as shown in Fig. 4(b).
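The distance-transform idea above can be illustrated with a simplified proxy (not the paper's exact SWT): the per-column maximum of the distance transform approximates the stroke skeleton's half-width, and its coefficient of variation measures stroke-width consistency. A uniform bar (text-like stroke) should score lower than a wedge of varying thickness; the masks and the ridge heuristic here are illustrative assumptions.

```python
import numpy as np

def distance_transform(mask):
    """Brute-force Euclidean distance from each foreground pixel to the
    nearest background pixel (fine for small illustrative masks)."""
    bg = np.argwhere(~mask)
    dist = np.zeros(mask.shape)
    for (i, j) in np.argwhere(mask):
        dist[i, j] = np.min(np.hypot(bg[:, 0] - i, bg[:, 1] - j))
    return dist

def stroke_width_variation(mask):
    """Coefficient of variation of the distance-transform ridge
    (per-column maximum) -- a rough proxy for stroke-width consistency."""
    ridge = distance_transform(mask).max(axis=0)
    ridge = ridge[ridge > 0]
    return ridge.std() / ridge.mean()

# Uniform bar (text-like stroke) vs. a wedge of growing thickness (non-text).
bar = np.zeros((12, 20), dtype=bool)
bar[4:8, 2:18] = True

wedge = np.zeros((12, 20), dtype=bool)
for c in range(2, 18):
    wedge[1:1 + c // 2, c] = True

cv_bar = stroke_width_variation(bar)
cv_wedge = stroke_width_variation(wedge)
```

Under this proxy the bar's variation stays well below a 0.3-style cutoff while the wedge's does not, mirroring how the thresholded stroke-width variation rejects non-text regions.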

Character Classification
The

EXPERIMENTAL RESULTS
In this section, we briefly present the experimental results and performance evaluation. All experiments are implemented and executed on a computer with an Intel Core i3-2100 CPU at 3.10 GHz and 6 GB of RAM.

Dataset
We evaluated the proposed approach on the publicly accessible Artificial Urdu Text Dataset [25] and compared the results with state-of-the-art methods available for Urdu text. The dataset consists of 1000 individual video images captured from different Urdu TV channels (e.g. news, sports, business, entertainment, and religion).
All images have a uniform dimension of 720x576 pixels and are stored in "png" file format.

Experimental Setup
To evaluate performance, we used the universally accepted area-based precision p and recall r measures, defined as p = Σ_{r_e ∈ E} m(r_e, T) / |E| and r = Σ_{r_t ∈ T} m(r_t, E) / |T|, where E is the set of estimated words, T is the set of ground-truth targets, and m(r, R) is the best area-overlap match of rectangle r against the set R. The f-measure, the harmonic mean f = 2pr / (p + r), combines precision and recall into a single score.
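These measures can be computed directly from bounding boxes. This sketch assumes the standard ICDAR-style best-match overlap m(r, R) = max over r' in R of 2·area(r ∩ r') / (area(r) + area(r')); boxes are hypothetical (x1, y1, x2, y2) rectangles.

```python
def area(b):
    # box as (x1, y1, x2, y2); degenerate boxes clamp to zero area
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def inter(a, b):
    # area of the intersection rectangle (zero if disjoint)
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))

def best_match(box, others):
    # m(box, others): best area-overlap match against any box in `others`
    return max((2 * inter(box, o) / (area(box) + area(o)) for o in others),
               default=0.0)

def prf(estimates, targets):
    """Area-based precision, recall, and f-measure over word boxes."""
    p = sum(best_match(e, targets) for e in estimates) / len(estimates)
    r = sum(best_match(t, estimates) for t in targets) / len(targets)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# One estimate exactly matching one of two ground-truth words:
p, r, f = prf([(0, 0, 10, 10)], [(0, 0, 10, 10), (20, 20, 30, 30)])
# -> p = 1.0, r = 0.5, f ≈ 0.667
```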

ACKNOWLEDGEMENT
The authors are extremely thankful to the anonymous reviewers for their valuable comments and suggestions, which helped us improve the quality of the manuscript.