A Review on Devanagari Character Recognition

© 2018 IJRAR August 2018, Volume 5, Issue 3

www.ijrar.org (E-ISSN 2348-1269, P- ISSN 2349-5138)

Pooja Sharma
Assistant Professor, D. A. V. College, Abohar, Punjab, India

Abstract. Optical Character Recognition (OCR) is a pattern recognition process. Many researchers have studied English character recognition, but Indian languages are considerably more complicated because of their structure and the computation they require. Devanagari is the most popular script in India and is used for various languages including Sanskrit, Hindi, Marathi and Kashmiri, yet comparatively little research has been done on it. This article presents a review of previous research on Devanagari character recognition and some applications of OCR systems.

Index Terms - Devanagari, Optical character recognition, Segmentation, Pattern recognition, Indian scripts.

I. INTRODUCTION

Character recognition is a research problem that has been studied for many years. Optical character recognition is the procedure of automatically recognizing optically scanned and digitized character images and converting them into an electronic text document [1]. Devanagari is an Indian script used by millions of people. Many Indian languages are written in Devanagari, among them Hindi, Sanskrit, Kashmiri and Marathi. English character recognition has been studied extensively and many commercial systems exist for it. For Indian languages, however, research is very limited because of the complex formation of the scripts. This paper gives a brief review of Devanagari OCR and its applications.

Character recognition is classified into two types: printed and handwritten character recognition. Printed documents can be further classified into good quality printed documents and degraded printed documents. Handwritten character recognition can also be classified into two types, offline and online character recognition, as shown in Figure 1.

Figure 1. Different character recognition systems

An on-line recognition system, also known as a dynamic or real-time recognition system, captures the position of the pen together with temporal information such as the number and order of the strokes of a character, directly from the interface while the user is writing. Off-line character recognition is carried out after the writing or printing task has been completed, and takes a scanned copy of the handwritten or printed character as input. The main difference between the two is that on-line character recognition has real-time, contextual information, whereas off-line character recognition systems do not.

Depending on the type of text, character recognition systems are further divided into machine-printed and handwritten recognition systems. Handwritten character recognition is used to improve communication between humans and machines. Off-line handwritten recognition is extremely hard and complex, and cursive writing makes the recognition process harder still. Handwritten characters show large variation in their basic shape because of factors such as the accuracy of the acquisition device, the width of the pen, the type of ink, the stroke size and the location of the character in the

IJRAR19035

International Journal of Research and Analytical Reviews (IJRAR) www.ijrar.org

word. In addition, the psychological and physical condition of the writer affects the writing style and the accuracy of the recognition system.

OCR converts various types of digital images and scanned documents into searchable and editable data. It can be used for reading entrance examination forms and for processing victims' applications and criminal records in police stations. OCR can also be used to convert handwritten documents into searchable and easily accessible digital forms. Other applications include mail sorting, zip code reading, reading aids for blind people, reading customer-filled forms such as tax forms, validation of passports, verification of account numbers, accounting for airline passenger tickets, automatic accounting procedures for processing utility bills, automated office archiving, text retrieval and improved human-computer interfaces (pen-based computers). It can also be used for language processing, converting document images to ASCII format and designing multimedia systems.

II. GENERAL CHARACTERISTICS OF DEVANAGARI SCRIPT

The word Devanagari is derived from the Sanskrit words Deva (god) and Nagari (city), which together mean "city of gods". The script is phonetically based and written from left to right. Devanagari belongs to the family of scripts descended from the ancient Brahmi script and emerged around the 11th century AD. It was initially developed to write Sanskrit but was later adopted for many other languages. Devanagari is often described as the parent of many Indian scripts. It is used to write languages such as Hindi, Marathi, Bhili, Marwari, Magahi, Nepali, Bhojpuri, Maithili, Newari, Pahari, Santhali, Mundari, Kashmiri, Tharu, Konkani and Sindhi [3]. The basic character set of Devanagari consists of 36 consonants (Vyanjan) and 13 vowels (Swar). The script has particular composition rules for joining consonants, vowels and modifiers. The set of modifier symbols is called matras.

The combination of two consonants, or of a consonant and a vowel, forms a compound character. Compound characters (conjuncts) can also combine three or four characters. Devanagari contains almost 280 compound characters [5]. Devanagari differs from the Roman script in many ways; for example, it has no concept of upper-case or lower-case characters.
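As a small illustration of these composition rules, the sketch below assumes Unicode-encoded text, which the paper itself does not discuss. In Unicode, a conjunct is written as consonants joined by the virama sign, and a matra is a vowel sign that follows its consonant. The helper names (`make_conjunct`, `codepoint_names`, `has_case`) are hypothetical and chosen only for this example.

```python
import unicodedata

VIRAMA = "\u094D"   # DEVANAGARI SIGN VIRAMA (halant), suppresses the inherent vowel
KA = "\u0915"       # DEVANAGARI LETTER KA
SSA = "\u0937"      # DEVANAGARI LETTER SSA
MATRA_I = "\u093F"  # DEVANAGARI VOWEL SIGN I (a matra)


def make_conjunct(*consonants):
    """Join consonants with the virama so a renderer can draw one compound character."""
    return VIRAMA.join(consonants)


def codepoint_names(text):
    """List the Unicode name of every code point in the text."""
    return [unicodedata.name(ch) for ch in text]


def has_case(ch):
    """True if the character has distinct upper-case and lower-case forms."""
    return ch.upper() != ch.lower()


conjunct = make_conjunct(KA, SSA)   # the compound character "ksha"
syllable = KA + MATRA_I             # consonant followed by a matra

print(conjunct, codepoint_names(conjunct))
print(syllable, codepoint_names(syllable))
print("KA has case:", has_case(KA), "| 'a' has case:", has_case("a"))
```

One consequence for OCR is that a single visual shape may correspond to three or more code points, so a recognizer has to decide whether its output classes are whole conjuncts or their component letters.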

Figure 1: Vowels (13) and Consonants (36)

III. DIFFERENT STAGES IN THE RECOGNITION PROCESS

Character recognition is an important task in pattern recognition. The recognition process is affected by factors such as noise, font size and broken lines or characters, all of which influence the results of the recognition system. An optical character recognition system has four stages, as follows.

A. Preprocessing Stage

Preprocessing applies procedures such as smoothing, filtering and enhancement to a digital image so that the subsequent algorithms of the OCR software can read it more easily. The paper

document is usually scanned by an optical scanner and converted into a picture. A picture is made up of picture elements, known as pixels. Each pixel generally takes one of two values, ON or OFF: ON indicates that the pixel is visible and OFF that it is not [4]. At this stage the data is available as an image, which can be analyzed further to retrieve the important information. Preprocessing consists of the following stages:

Figure 2: Stages of preprocessing

1. Binarization

Binarization, or thresholding, is the conversion of a gray-scale image into a binary image. There are two approaches: global thresholding and local thresholding. In global thresholding, a single threshold value is selected for the whole image, based on an estimate of the background level from the intensity histogram. In local or adaptive thresholding, a different threshold is selected for each pixel according to information from its local area. The main purpose of binarization is to identify the extent of objects and to focus on shape analysis.

2. Noise elimination

Noise is a major obstacle in image pattern recognition because it degrades image quality. It may be introduced at different stages, such as image capture, transmission and compression, and can cause disconnected line segments, large gaps between lines and similar defects. These errors must be removed so that the information can be retrieved efficiently. Various filters and morphological operations are available for removing image noise. The Gaussian filter is one popular and effective choice. Noise elimination is also called smoothing: it reduces fine-textured noise to improve image quality. Morphological operations are used to connect unconnected pixels, remove isolated pixels and smooth pixel boundaries.

3. Size normalization

Normalization is used to obtain characters of uniform size, and this stage can greatly reduce the amount of data. Character patterns come in different sizes, but the input to the recognition system is generally an array of fixed size, so size normalization is required to fit the image to that size. Normalization reduces the size of the image without altering its structure.

4. Thinning

Thinning is a morphological operation that removes selected foreground pixels from a binary image. It is the final stage of preprocessing. Thinning produces a skeleton of the image without losing its topological properties. Thinning algorithms use both boundary pixel analysis and connectivity analysis.

B. Segmentation

Segmentation is an important process that largely determines the success rate of a character recognition system. In segmentation, an image or document is divided into disjoint and homogeneous regions [5]. This task is performed by extracting boundaries, and there are various approaches for finding character bounds. A Devanagari document is decomposed into lines and words using horizontal and vertical projections, respectively. Words can be split further into individual characters by removing the shiro-rekha (headline). A Devanagari word can be separated into three parts: the upper part is the portion above the shiro-rekha, the middle part contains the core characters, and the lower part may contain optional modifiers. Character segmentation of Devanagari is therefore very difficult because of the various modifiers.

C. Feature Extraction

The main goal of feature extraction is to extract the essential features of the symbols. It is the most important step in an OCR system and, because high recognition performance depends on it, also a complex one; it is said to be the most difficult task in pattern recognition. The choice of feature extraction technique is therefore an important factor. Feature extraction methods include Deformable Templates, Zoning, Projection Histogram, Template matching,

and Contour Profile. Table I shows some feature extraction methods for various representation forms such as gray scale, binary and vector [6].
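Two of the methods named above, zoning and projection histograms, are simple enough to sketch directly. The example below is a minimal illustration in pure Python, assuming a binary character image stored as a list of rows of 0/1 values. The function names and the toy glyph are hypothetical, and the paper does not prescribe a particular zone count.

```python
def zoning_features(binary, zones=2):
    """Split the image into zones x zones cells and return the ink density of each cell.

    Assumes the image has at least `zones` rows and `zones` columns.
    """
    height, width = len(binary), len(binary[0])
    features = []
    for zy in range(zones):
        y0, y1 = zy * height // zones, (zy + 1) * height // zones
        for zx in range(zones):
            x0, x1 = zx * width // zones, (zx + 1) * width // zones
            ink = sum(binary[y][x] for y in range(y0, y1) for x in range(x0, x1))
            features.append(ink / ((y1 - y0) * (x1 - x0)))
    return features


def projection_histograms(binary):
    """Count ON pixels in every row (horizontal) and every column (vertical)."""
    horizontal = [sum(row) for row in binary]
    vertical = [sum(column) for column in zip(*binary)]
    return horizontal, vertical


# Toy 4x4 glyph: a filled block in the top-left corner and one pixel bottom-right.
glyph = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]

print(zoning_features(glyph))        # [1.0, 0.0, 0.0, 0.25]
print(projection_histograms(glyph))  # ([2, 2, 0, 1], [2, 2, 0, 1])
```

The resulting numbers can be concatenated into one feature vector and passed to any of the classifiers mentioned in this review.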

Table I. Feature extraction methods for gray scale, binary, vector
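The feature extraction methods in Table I operate on preprocessed images. As a rough illustration of the binarization and size normalization steps described under Preprocessing, the sketch below uses pure Python on a gray-scale image stored as a list of rows of 0-255 values. Otsu's method is used here as one common way of choosing a global threshold from the histogram; the paper does not name a specific method, and the window size, offset and function names are assumptions made for this example.

```python
def otsu_threshold(gray):
    """Global threshold: the level that maximises between-class variance (Otsu)."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_between = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        weight_light = total - weight_dark
        if weight_dark == 0 or weight_light == 0:
            continue  # one class is empty, variance is undefined
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        between = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between > best_between:
            best_t, best_between = t, between
    return best_t


def binarize_global(gray):
    """Mark a pixel ON (1) when it is at or below the single global threshold."""
    t = otsu_threshold(gray)
    return [[1 if value <= t else 0 for value in row] for row in gray]


def binarize_local(gray, window=3, offset=10):
    """Mark a pixel ON when it is darker than its neighbourhood mean minus an offset."""
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = []
    for y in range(height):
        ys = range(max(0, y - half), min(height, y + half + 1))
        row = []
        for x in range(width):
            xs = range(max(0, x - half), min(width, x + half + 1))
            mean = sum(gray[j][i] for j in ys for i in xs) / (len(ys) * len(xs))
            row.append(1 if gray[y][x] < mean - offset else 0)
        result.append(row)
    return result


def normalize_size(binary, size=8):
    """Crop to the ink bounding box, then resample to size x size (nearest neighbour)."""
    rows = [y for y, row in enumerate(binary) if any(row)]
    cols = [x for x in range(len(binary[0])) if any(row[x] for row in binary)]
    if not rows:
        return [[0] * size for _ in range(size)]  # blank image, nothing to crop
    top, left = rows[0], cols[0]
    height, width = rows[-1] - top + 1, cols[-1] - left + 1
    return [[binary[top + y * height // size][left + x * width // size]
             for x in range(size)] for y in range(size)]


# Toy 5x5 image: light background (200) with a single dark ink pixel (50).
gray = [[200] * 5 for _ in range(5)]
gray[2][2] = 50
global_ink = binarize_global(gray)
local_ink = binarize_local(gray)
print(otsu_threshold(gray), global_ink == local_ink)
```

On evenly lit input such as this toy image the two approaches agree. The local variant is usually preferred when illumination varies across the page, at the cost of more computation per pixel.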

D. Character Recognition

A good text recognizer has many commercial and practical applications, including documentation of library materials, processing of cheques in banks, searching data in scanned books, extracting data from paper documents, and automating organizations such as post offices, which involve many manual text interpretation tasks. Many different approaches are used for text recognition, among them feature extraction, template matching, Support Vector Machine (SVM) algorithms, geometric approaches, fuzzy logic and neural networks. Many methods and comparative studies on Devanagari character recognition can be found in [7].

IV. REVIEW OF PREVIOUS APPROACHES

India is a country where many different languages and scripts are used. Hindi is the language most often written in Devanagari. It is an official language of India and, after Chinese and English, the third most spoken language in the world. Hindi is the main language of documentation in Rajasthan, New Delhi, Madhya Pradesh, Uttar Pradesh, Himachal Pradesh, Chhattisgarh, Uttarakhand, Bihar and Haryana. Devanagari script therefore appears in many kinds of documents, such as application forms, bank cheques, envelopes, answer sheets and railway reservation forms, and a growing number of websites are hosted in Devanagari. Many commercial systems are available for reading and searching English text, but comparable systems for Devanagari are still at the development stage.

Bansal et al. (2010) [8]: This paper describes the segmentation of irregular text words in Gurmukhi script. It discusses the segmentation of words that contain skewed or irregular headlines and broken, touching or overlapping characters. New techniques such as contour tracing are described, supported by horizontal and vertical projections.

Kumar et al. (2010) [9]: This paper discusses the segmentation of scanned text images. The full image is treated as a large window, which is split into smaller windows corresponding to lines. Once the lines are recognized, the words within each line are identified, and finally the characters are recognized. The paper also describes a variable-sized window concept.

Garg N. et al. (2011) [10]: In an OCR system the recognition rate decreases when half characters touch full characters, and analyzing the presence of half characters is a very complicated task. This paper proposes a new algorithm based on the structural properties of the document to segment half characters in handwritten Hindi text. Results are reported for both handwritten and printed Hindi text. The proposed algorithm achieves a segmentation accuracy of 83.02% for half characters in handwritten text and 87.5% in printed text.

Kumar and Singh (2011) [11]: Tests were conducted on a number of documents and the results were obtained with high accuracy. Some characters in the lower zone of lines were detected almost correctly. The coordinates of the detected lines and words are used to locate the characters. Character segmentation is carried out in two parts: (i) acquire the segmented region R; (ii) verify whether R contains a meaningful symbol. If R is meaningful it is accepted; otherwise it is rejected.

Rhead et al. (2012) [12]: This paper describes aspects of the relevant legislation and standards as applied to number plates. Various manufacturing techniques and specifications of the component parts are also noted. Several fixing methods and fixing locations are discussed, together with their impact on the captured image.

Badawy, W. et al. (2012) [13]: This paper discusses automatic license plate recognition (ALPR), which extracts vehicle license plate information from an image or a sequence of images. The extracted information is used in several applications, such as electronic payment systems (toll payment, parking fee payment) and freeway and arterial monitoring systems. ALPR uses a black-and-white, color or infrared camera to acquire images.
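Several of the segmentation approaches reviewed above, like the segmentation stage in Section III, rely on horizontal and vertical projections. The sketch below is a minimal pure-Python illustration of that idea, not a reconstruction of any cited algorithm. It assumes a binary image given as a list of rows of 0/1 values, treats the densest band of rows in a word as the shiro-rekha, and splits the region below it at empty columns. The `ratio` parameter and all function names are hypothetical.

```python
def runs(flags):
    """Return (start, end) pairs, end exclusive, for each run of truthy flags."""
    spans, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(flags)))
    return spans


def segment_lines(page):
    """Horizontal projection: text lines are runs of rows that contain ink."""
    return runs([sum(row) > 0 for row in page])


def split_characters(word, ratio=0.8):
    """Locate the headline band, then split the area below it at empty columns."""
    row_ink = [sum(row) for row in word]
    peak = max(row_ink)
    if peak == 0:
        return None, []  # blank image
    top = bottom = row_ink.index(peak)
    # Grow the band while neighbouring rows are nearly as dense as the peak row.
    while top > 0 and row_ink[top - 1] >= ratio * peak:
        top -= 1
    while bottom + 1 < len(row_ink) and row_ink[bottom + 1] >= ratio * peak:
        bottom += 1
    below = word[bottom + 1:]
    # Vertical projection of the core zone: characters are runs of inked columns.
    column_ink = [sum(column) for column in zip(*below)]
    return (top, bottom + 1), runs([ink > 0 for ink in column_ink])


# Toy word, 6 rows x 12 columns: a full-width headline over two character bodies.
word = [[0] * 12, [1] * 12]
word += [[0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0] for _ in range(4)]

headline, characters = split_characters(word)
print(headline, characters)  # (1, 2) [(1, 4), (6, 10)]

# Toy page, 7 rows x 3 columns: two text lines separated by blank rows.
page = [[0] * 3, [1] * 3, [1] * 3, [0] * 3, [0] * 3, [1] * 3, [0] * 3]
print(segment_lines(page))   # [(1, 3), (5, 6)]
```

This simple split fails when characters touch or when modifiers sit above the headline or below the core zone, which is the difficulty that several of the reviewed papers address.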

Ntirogiannis et al. (2013) [14]: This paper argues that document image binarization is an important part of the document image analysis and recognition pipeline because it affects the later stages of the process. Evaluating a binarization technique helps in studying its algorithmic behavior and verifying its effectiveness by providing qualitative and quantitative indications of performance. The paper presents a pixel-based binarization evaluation methodology for historical handwritten or machine-printed document images. In the proposed evaluation scheme, the recall and precision measures are modified using a weighting scheme that diminishes any potential evaluation bias.

Kumar et al. (2014) [15]: This paper describes the segmentation of handwritten Gurmukhi characters, defining segmentation together with the digitization process and pre-processing techniques. It discusses the Water Reservoir method, which is used for segmentation and identification of characters.

Goyal et al. (2015) [16]: This paper describes the motivation for and applications of OCR for analyzing single characters. OCR is also referred to as handwritten character recognition or intelligent character recognition. Image recognition is a very difficult part of OCR because of the distortion present in images. The technique is also used for typing Hindi characters on iPhones and iPods.

Kaur et al. (2017) [17]: This paper presents a template of an OCR system that gives full alphabetical recognition of written characters simply by scanning the document. Documents are scanned using a scanner and passed to the OCR system, which recognizes the characters in the scanned documents and converts them into coded data. The review summarizes the methods to give the reader a better understanding.

V. CONCLUSION

OCR is one of the important applications of pattern recognition, and its popularity is increasing day by day. OCR for Indian scripts, however, is still at a preliminary stage, and a large amount of research is required to handle the issues and complexity of Devanagari character recognition (DCR). This paper presented an overview of existing approaches to DCR. Over many years, researchers have carried out comparatively few investigations into OCR for Devanagari script. Segmentation is responsible for most of the errors in the recognition system; if whole words are identified without segmentation, the recognition rate may therefore increase. The accuracy of a character recognition system depends on various factors such as the training set, the availability of sample data, the number of parameters used in the recognition process and the test data. Use of a dictionary also helps to improve recognition accuracy.

REFERENCES

[1] Bag S. and Harit G. 2013. A survey on optical character recognition for Bangla and Devanagari scripts. Sādhanā, Vol. 38, Part 1, pp. 133-168. © Indian Academy of Sciences.
[2] Kumar M., Jindal M. K. and Sharma R. K. 2011. Review on OCR for Handwritten Indian Scripts Character Recognition. Advances in Digital Image Processing and Information Technology, pp. 268-276, DPPR.
[3] Malanker A. and Patel M. 2014. Handwritten Devanagari Script Recognition: A Survey. IOSR Journal of Electrical and Electronics Engineering (IOSR-JEEE), e-ISSN: 2278-1676, p-ISSN: 2320-3331, Volume 9, Issue 2, Ver. II, pp. 80-87.
[4] Pratap N. and Arya S. 2012. A Review of Devnagari Character Recognition from Past to Future. International Journal of Computer Science and Telecommunications, Volume 3, Issue 6.
[5] Indira B. and Sudha T. 2010. A Pragmatic Approach for Reading Number Plates of Indian Vehicles. International Journal of Neural Networks and Applications, 3(1), pp. 15-18.
[6] Patwardhan S. and Deshmukh R. 2015. A Review on Offline Handwritten Recognition of Devnagari Script. International Journal of Computer Applications (0975 – 8887), Volume 117.
[7] Trier D., Jain A. K. and Taxt T. 1996. Feature Extraction Method for Character Recognition – A Survey. Pattern Recognition, Vol. 29, No. 4, pp. 641-662.
[8] Bansal G. and Sharma D. 2010. Isolated Handwritten Words Segmentation Techniques in Gurmukhi Script. International Journal of Computer Applications, Vol. 1, No. 24, pp. 104-111.
[9] Kumar M., Jindal M. K. and Sharma R. K. 2014. Segmentation of Isolated and Touching Characters in Offline Handwritten Gurmukhi Script Recognition. International Journal Information Technology and Computer Science, pp. 58-63.
[10] Garg N. K., Kaur L. and Jindal M. K. 2011. The segmentation of half characters in Handwritten Hindi Text. Springer-Verlag Berlin Heidelberg, pp. 48-53.
[11] Kumar R. and Singh A. 2011. Algorithm to Detect and Segment Gurmukhi Handwritten Text into Lines, Words and Characters. IACSIT International Journal of Engineering and Technology, Vol. 3, No. 4.
[12] Rhead M. 2012. Accuracy of automatic number plate recognition (ANPR) and real world UK number plate problems. IEEE International Carnahan Conference on Security Technology (ICCST).

[13] Badawy W. 2012. Automatic License Plate Recognition (ALPR): A State of the Art Review. IEEE International Conference on Document Analysis and Recognition.
[14] Ntirogiannis K., Gatos B. and Pratikakis I. 2013. A Performance Evaluation Methodology for Historical Document Image Binarization. IEEE International Conference on Document Analysis and Recognition.
[15] Kumar D., Koshti and Govilkar S. 2014. Segmentation of Touching Characters in Handwritten Devanagri Script. International Journal of Computer Science and its Applications, Vol. 2, Issue 2, pp. 83-87.
[16] Goyal N. and Jain S. 2015. A REVIEW: Optimized Hindi Script Recognition using OCR Feature Extraction Technique. International Journal of Innovative Research in Computer and Communication Engineering (An ISO 3297: 2007 Certified Organization), Vol. 3, Issue 8.
[17] Kaur J. and Kaur R. 2017. Review of the Character Recognition System Process and Optical Character Recognition Approach. International Journal of Computer Science and Mobile Computing, Vol. 6, Issue 5.
