Faculty of Information and Communication Technology
ILLUMINATION REMOVAL AND TEXT SEGMENTATION FOR AL-QURAN USING BINARY REPRESENTATION
Laith Nazeeh Jamil Bany Melhem
Master of Computer Science (Internetworking Technology)
2015
ILLUMINATION REMOVAL AND TEXT SEGMENTATION FOR AL-QURAN USING BINARY REPRESENTATION
LAITH NAZEEH JAMIL BANY MELHEM
A thesis submitted in fulfilment of the requirements for the degree of Master of Computer Science (Internetworking Technology)
Faculty of Information and Communication Technology
UNIVERSITI TEKNIKAL MALAYSIA MELAKA
2015
DECLARATION
I declare that this project entitled “Illumination Removal and Text Segmentation for Al-Quran Using Binary Representation” is the result of my own research except as cited in the references. The project has not been accepted for any degree and is not concurrently submitted in candidature of any other degree.
Signature
:
........................................................................
Name
:
LAITH NAZEEH JAMIL BANY MELHEM
Date
:
........................................................................
APPROVAL
I hereby declare that I have read this project and in my opinion this project is sufficient in terms of scope and quality for the award of Master of Computer Science (Internetworking Technology).
Signature
:
.............................................
Supervisor Name
:
Dr. MOHD SANUSI AZMI
Date
:
.............................................
DEDICATION
I dedicate this work to those who have never stopped their daily support since I was born: my dear mother and my kind father. They never hesitated to provide me with every facility to push me forward as much as they could. This work is a simple and humble reply to all the goodness I have received from them over the years. Also to my brothers (Mohammad, Hamzah), my sisters (Rawan, Rana, Shoroq), my grandfather, my grandmother, my aunt, my uncle, my friends, and all those whom I love (may Allah bless them all).
ABSTRACT
The segmentation of Al-Quran needs to be studied carefully. This is because Al-Quran is the book of Allah swt; any incorrect segmentation will affect the holiness of Al-Quran. A major difficulty is the appearance of illumination around text areas, as well as noisy black stripes. In this study, we propose a novel algorithm for detecting the illumination on Al-Quran pages. Our aim is to segment Al-Quran pages into pages without illumination, and further into text line images, without any changes to the content. First we apply pre-processing, which includes binarization. Then we detect the illumination of the Al-Quran pages; in this stage, we introduce the vertical and horizontal white percentages, which have proved efficient for detecting the illumination. Finally, the new images are segmented into text lines. The experimental results on several Al-Quran pages from different Al-Quran styles demonstrate the effectiveness of the proposed technique.
ABSTRAK
Proses penemberengan Al-Quran memerlukan kajian yang berhati-hati. Ini kerana Al-Quran adalah kitab Allah swt. Sebarang kesalahan penemberengan akan memberikan kesan kepada kesucian Al-Quran. Kesukaran yang dihadapi adalah illuminasi yang mengelilingi kawasan teks Al-Quran dan juga garisan hitam. Pada kajian ini, kami mencadangkan satu algoritma baharu untuk mengenalpasti illuminasi pada setiap muka Al-Quran. Tujuan adalah untuk menembereng Al-Quran muka ke muka dan baris ke baris tanpa mengubah apa-apa pada kandungan Al-Quran. Mulanya, prapemprosesan digunakan dengan menggunakan proses binari. Kemudian, illuminasi dikenalpasti. Pada tahap ini, kami memperkenalkan peratusan menegak dan mendatar berdasarkan kepada piksel putih yang dapat melaksanakan penemberengan dengan baik. Akhir sekali, imej baru terhasil yang bebas dari illuminasi. Keputusan ujikaji menggunakan stail Al-Quran menunjukkan teknik cadangan adalah efektif.
ACKNOWLEDGEMENT
First and foremost, praise to Allah for giving me the opportunity, the strength and the patience to finally complete my project after all the challenges and difficulties. I would like to take this opportunity to express my sincere acknowledgement to my supervisor, Dr. Mohd Sanusi Bin Azmi from the Faculty of Information & Communication Technology, Universiti Teknikal Malaysia Melaka (UTeM), for his essential supervision, support and encouragement towards the completion of this thesis. Thanks to the King Fahd Glorious Quran Printing Complex for publishing the styles of Al-Quran used during this research. To my beloved family and the jewel of my heart, my mother: thank you for the sacrifices, patience, support and compassion that have entered my life. Not forgetting all my colleagues and friends striving for their Master's, who provided inspiration, guidance and shared experiences. Special thanks to all my peers, my father, my beloved mother and my siblings for their moral support in completing this degree. Lastly, thank you to everyone who played a crucial part in the realization of this project.
TABLE OF CONTENTS

PAGE
DECLARATION
APPROVAL
DEDICATION
ABSTRACT ......................................................... i
ABSTRAK ......................................................... ii
ACKNOWLEDGEMENT ................................................ iii
TABLE OF CONTENTS ............................................... iv
LIST OF TABLES .................................................. vi
LIST OF FIGURES ................................................ vii

CHAPTER
1. INTRODUCTION .................................................. 1
   1.1 Introduction .............................................. 1
   1.2 Research Background ....................................... 3
   1.3 Problem Statement ......................................... 4
   1.4 Research Questions ........................................ 5
   1.5 Research Objectives ....................................... 5
   1.6 Project Significance ...................................... 5
   1.7 The Scope of Research ..................................... 6
   1.8 Expected Outcomes ......................................... 6
   1.9 Conclusion ................................................ 6
2. LITERATURE REVIEW ............................................. 7
   2.1 Introduction .............................................. 7
   2.2 Image Pre-processing ...................................... 8
       2.2.1 Progress in Binarization Studies .................... 8
       2.2.2 Border/Illumination Removal ........................ 14
   2.3 Arabic OCR ............................................... 16
   2.4 Image Segmentation ....................................... 18
   2.5 Techniques for Document Text Segmentation ................ 19
       2.5.1 Projection Profiles ................................ 20
       2.5.2 Hough Transform Approach ........................... 20
       2.5.3 Smearing Methods ................................... 21
       2.5.4 Dynamic Programming ................................ 22
       2.5.5 Other Techniques ................................... 23
   2.6 Text Segmentation Analysis ............................... 26
   2.7 Conclusion ............................................... 34
3. RESEARCH METHODOLOGY ......................................... 35
   3.1 Introduction ............................................. 35
   3.2 Research Methodology ..................................... 35
       3.2.1 Research Framework ................................. 35
   3.3 Task Framework ........................................... 38
   3.4 Experimental Test Framework .............................. 40
       3.4.1 Experiment I ....................................... 40
   3.5 Research Tool ............................................ 41
   3.6 Conclusion ............................................... 41
4. IMPLEMENTATION ............................................... 42
   4.1 Introduction ............................................. 42
   4.2 Data Collection .......................................... 42
   4.3 Implementation Process ................................... 43
       4.3.1 Image Binarization ................................. 43
       4.3.2 Page Segmentation .................................. 45
       4.3.3 Text Line Segmentation ............................. 53
   4.4 Conclusion ............................................... 56
5. RESULT AND TESTING ........................................... 57
   5.1 Introduction ............................................. 57
   5.2 Testing .................................................. 57
   5.3 Questionnaire ............................................ 58
   5.4 Result ................................................... 62
   5.5 Conclusion ............................................... 65
6. CONCLUSION AND FUTURE WORK ................................... 66
   6.1 Introduction ............................................. 66
   6.2 Summary .................................................. 66
   6.3 Limitation of the Project ................................ 68
   6.4 Future Works / Further Research .......................... 68
REFERENCES ...................................................... 69
APPENDICES ...................................................... 79
   Appendix A Questionnaire ..................................... 79
   Appendix B Result ............................................ 85
LIST OF TABLES

TABLE    TITLE    PAGE
Table 1.1: Summary of Ayat, Page and Line from printed Al-Quran .... 2
Table 2.1: Image processing steps for Jawi pattern recognition (Azmi, 2013) .... 9
Table 2.2: Categorization of segmentation algorithms (Nikos et al., 2010) .... 18
Table 2.3: Text line segmentation methods analysis .... 26
Table 3.1: Tools and programming languages utilized in this research .... 41
LIST OF FIGURES

FIGURE    TITLE    PAGE
Figure 1.1: Al-Quran writing styles .... 4
Figure 2.1: Example of an image with noisy black border and noisy text region .... 14
Figure 2.2: General Arabic OCR systems capabilities .... 16
Figure 3.1: Research Framework .... 36
Figure 3.2: Al-Quran writing styles .... 37
Figure 3.3: The Illumination Removal in Al-Quran and Segmentation to Pages and Lines Framework .... 39
Figure 3.4: Page (10) from Holy Al-Quran .... 40
Figure 4.1: Al-Quran writing styles .... 43
Figure 4.2: The Illumination Removal in Al-Quran and Segmentation to Pages and Lines Framework .... 44
Figure 4.3: Overall steps for removing page frames .... 45
Figure 4.4: Flowchart for blank space and Illumination detection and removal .... 47
Figure 4.5: The process for removing the exterior blank space and marginal Illumination from an Al-Quran page .... 48
Figure 4.6: The process for removing the Illumination from an Al-Quran page .... 50
Figure 4.7: The process for removing the interior blank space from an Al-Quran page .... 52
Figure 4.8: Flowchart for text line segmentation .... 54
Figure 4.9: The process for text line segmentation .... 55
Figure 5.1: Pie Chart for the Gender .... 58
Figure 5.2: Pie Chart for the Race .... 59
Figure 5.3: Pie Chart for the Country .... 59
Figure 5.4: Pie Chart for the Student Category .... 60
Figure 5.5: Pie Chart for the Faculty .... 60
Figure 5.6: Column Chart for Segmentation .... 61
Figure 5.7: Pie Chart for Page Segmentation .... 61
Figure 5.8: Pie Chart for Text Line Segmentation .... 62
Figure 5.9: User Interface for Selection of File(s) .... 63
Figure 5.10: User Interface for Binarization .... 63
Figure 5.11: User Interface for Saving Pages and Lines .... 63
Figure 5.12: The Output Files .... 64
Figure 5.13: Result of Page Segmentation .... 64
Figure 5.14: Result of Text Line Segmentation .... 65
CHAPTER 1
INTRODUCTION
1.1 Introduction

Image processing is a popular research area in computer science. Today, image processing not only focuses on fundamental issues addressed by researchers but also on the suitability of the research to several domains, such as biometrics (Phillips et al., 1998), geographical information systems (Câmara et al., 1996), character recognition (Omar, 2000), document analysis (Sauvola and Pietikäinen, 2000) and others. Image processing is a technique to enhance raw images received from cameras/sensors placed on satellites, space probes and aircraft, or pictures taken in normal day-to-day life, for various applications (Rao, 2004). There are three main categories of image processing: image enhancement, image rectification and restoration, and image classification. Various image processing techniques have been developed during the last four to five decades; most were developed for enhancing images obtained from unmanned spacecraft, space probes and military reconnaissance flights. Image processing systems are becoming popular due to the easy availability of powerful personal computers, large-capacity memory devices, graphics software, etc. (Rao, 2004).

According to Khairuddin Omar (2010) and Mohammad Faidzul et al. (2010), the image processing body of knowledge consists of several phases, starting from data collection, followed by pre-processing, feature extraction, feature selection, classification, and post-processing (Azmi, 2013; Nasrudin et al., 2010; Omar, 2000). Each phase of image processing has sub-processes. Omar (2000) categorized the pre-processing phase for Arabic/Jawi character recognition into binarization, edge detection, thinning, and segmentation, before feature extraction takes place. This research focuses on segmentation for the Holy Quran; segmentation for the Holy Quran is based on the processes for segmenting Arabic/Jawi handwritten texts.

The Holy Quran is the book of Allah swt. Al-Quran consists of 30 chapters, 114 Surah and 6236 Ayat. However, the number of pages and lines differs between publishers. Table 1.1 below shows examples of printed Al-Quran from different publishers.

Table 1.1: Summary of Ayat, Page and Line from printed Al-Quran

Al-Quran (Version)                           Ayat   Page   Line per page   Total line
Madinah                                      6236   604    15              9060
Al-Quran Al-Hakeem                           6236   608    15              9120
Al-Quran Al-Kareem                           6236   617    15              9225
Al-Quran Al-Majeed                           6236   855    13              11115
Mushaf Al-Madinah Quran Majeed {Nastaleeq}   6236   619    15              9285
Mushaf Al-Madinah Quran Majeed               6236   625    15              9375

Based on Table 1.1, the printing of Al-Quran is not uniform: although the number of chapters, Surah and Ayat is the same, the number of pages and lines differs. This difference in the number of pages and lines is an interesting topic to be studied, especially in the segmentation process. The segmentation of Al-Quran needs to be studied carefully, because Al-Quran is the book of Allah swt: any incorrect segmentation will affect the holiness of Al-Quran. Some segmentation techniques currently exist, such as the Naive Bayes classifier (Bidgoli and Boraghi, 2010); however, these techniques focus on segmenting objects such as text and faces (Khattab et al., 2014). There are also segmentation techniques for Arabic and Jawi characters (Omar, 2000). Although Arabic and Jawi characters are quite close to those of Al-Quran, the Arabic and Jawi techniques do not handle diacritical marks (Omar, 2000).

In this research, a technique for segmenting Al-Quran will be proposed. The technique will consider diacritical marks (Tashkil) in order to protect the holiness of Al-Quran. The proposed technique will be evaluated by comparison with the original Al-Quran text; any missing diacritical marks (Tashkil), words or sentences will be considered an incorrect result.
1.2 Research Background

Al-Quran is the last book of Allah swt. As shown in Table 1.1, each printed Al-Quran has a different number of pages and lines. Besides, as Figure 1.1 shows, Al-Quran is written in different styles of writing and with different illumination. Illumination here refers to the decoration on every page of Al-Quran.
Figure 1.1: Al-Quran writing styles
The segmentation process in this research is used to prepare images for the feature extraction process. From Al-Quran, the texts and diacritical marks (Tashkil) will be extracted; thus, the illumination and some empty space will be removed. In this research, “illumination” refers to the art of embellishing and decorating the Holy Quran, as used in Islamic arts. It appears on the first and last pages of Al-Quran, at the head of each Surah, on the border of each page, and in other places (Tajabadi et al., 2009). The decorative arrays and Islamic designs in this holy Quranic art come from the consolidated ideas and worldviews of the artists of this field, so much so that many calligraphists, and no fewer illuminators, could recite the Quran from memory; even the few who could not were so familiar with its verses that the verses had become an integral part of their nature (Lings, 1998). The manifestation of spirituality in the illumination of the Quran is such that it has made this art a worthy companion of the Holy Quran; indeed, manifesting the divine realm is the duty of the art evoked by the word of God (Lings, 1998).
1.3 Problem Statement

The Arabic language (the language of the Quran) has markings called “diacritical marks” or “diacritics” that represent short vowels or other sounds; if one of these diacritical marks is ignored, the meaning of the word changes. There are many Arabic character recognition techniques that can recognize the characters of a text or of a whole page (Khairuddin Omar, 2000; Mohamad Faidzul, 2010), but all of these techniques recognize characters without considering the diacritical marks, which may affect the meaning of the Quran's words and the holiness of Al-Quran.
1.4 Research Questions

How does the segmentation process for Al-Quran happen in the image processing domain?

How can the illumination occurring in Al-Quran, with its different forms, be segmented without missing any diacritic?
1.5 Research Objectives

This study has the following objectives:

• To propose a framework for segmenting Al-Quran into pages and lines
• To propose a technique for segmenting the texts of Al-Quran

1.6 Project Significance

The Holy Quran is very important to Muslims with respect to its authenticity. In this project, an illumination removal technique for Holy Quran pages will be proposed, which enables us to remove the illumination of a page according to the percentage of binary values in the Quran page image. The result can be used by researchers to compare copies of the Holy Quran in order to identify the originality of the copies.
1.7 The Scope of Research

The scope of this research is:

i. This research will primarily focus on removing the border of the Quran pages by cropping the page of the Quran first, then cropping the page line by line without the empty space.

ii. The technique will be applied to all Quran pages except the first and second pages, “Surat Al-Fatihah and the first page of Surat Al-Baqarah”, whose writing style comes with a circular border.
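The line-by-line cropping described above can be driven by row-wise white-pixel percentages, in the spirit of the horizontal white percentage mentioned in the abstract. The sketch below is illustrative only: the helper names and the blank-row threshold are assumptions for this example, not the thesis's exact algorithm.

```python
# Sketch: detect text-line bands in a binarized page (0 = black text
# pixel, 1 = white background pixel) using the white ratio of each row.
# The 0.95 blank-row threshold is an assumed value for illustration.
def white_ratio(row):
    """Fraction of white pixels in one row of a binary image."""
    return sum(row) / len(row)

def line_bands(binary, blank=0.95):
    """Return (start, end) row ranges of text lines, end exclusive."""
    bands, start = [], None
    for y, row in enumerate(binary):
        if white_ratio(row) < blank:      # row contains text pixels
            if start is None:
                start = y                 # a new text line begins
        elif start is not None:
            bands.append((start, y))      # blank row closes the line
            start = None
    if start is not None:                 # page ends inside a line
        bands.append((start, len(binary)))
    return bands

page = [
    [1, 1, 1, 1],   # blank margin
    [1, 0, 0, 1],   # line 1
    [0, 0, 1, 1],   # line 1
    [1, 1, 1, 1],   # blank gap between lines
    [1, 0, 1, 1],   # line 2
]
print(line_bands(page))   # [(1, 3), (4, 5)]
```

Each returned band can then be cropped out of the page image, which is one plausible way to obtain text-line images without the empty space between them.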
1.8 Expected Outcomes

The main expected outcome of this research is to design an application providing a technique that produces a better image of the Holy Quran page without the border, then crops the page image line by line after line segmentation.
1.9 Conclusion

This thesis addresses the problem of removing the image border from sensitive digital Holy Quran page images. We propose a robust approach that removes the border without any changes to the content of the Holy Quran pages, in order to maintain the authenticity of the Quran text-image content. Our objective is to segment the pages of the Quran into lines and also to remove the illumination.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction

Nowadays we live in a world that is almost entirely digital, so many digital documents are available, and research on document originality exists (Qadir and Ahmad, 2006). In this research, the study is done on Al-Quran. There are many printed versions of Al-Quran as well as digital copies. This research performs segmentation on Al-Quran, removing the illumination and segmenting Al-Quran into lines. It prepares Al-Quran for the feature extraction phase, and the final aim is to validate the originality of Al-Quran.

There is much research on segmentation; however, segmenting Al-Quran must be done carefully in order to preserve its holiness. The research closest to Al-Quran is Arabic/Jawi segmentation, but it is not suitable to be applied to Al-Quran because of the diacritical marks (Tashkil). The first step before the segmentation phase is pre-processing the documents to produce a clean image. Al-Quran page images contain illumination, and detecting and removing these unwanted areas is critical to achieving better text segmentation results. Before the illumination detection and removal take place, we first perform image binarization using the efficient technique proposed by Gatos et al. (2006).

There are some segmentation techniques for Arabic and Jawi characters (Omar, 2000). Omar (2000) categorized the pre-processing phase for Arabic/Jawi character recognition into binarization, edge detection, thinning, and segmentation, before feature extraction takes place. In our research we try to understand image processing and text segmentation. The images used in our study are images of Al-Quran pages. At this point, previous studies on text line segmentation will be explained in detail. The main objective of this research is to find the best method to segment Al-Quran pages without missing any character or diacritical mark.
2.2 Image Pre-processing

2.2.1 Progress in Binarization Studies

A binary image (Stathis et al., 2008) is a digital image that has just two feasible values for every pixel. Normally two colors are used, black and white, although any two colors can be used. The color used for the objects in the image is the foreground color, while the rest of the image is the background color. Binary images (Su et al., 2011) frequently occur in image processing as masks or as the outcome of operations such as segmentation and thresholding. A few input/output devices, for example laser printers and bi-level computer displays, can only handle bi-level images. Binary images are formed from color images by segmentation. Various approaches and techniques have been developed to improve the quality of document images. Binarization is one of the most important pre-processing steps; it separates the foreground and background of a document image, converting a grayscale document image into a binary one. Image binarization is typically executed in the pre-processing phase of document image processing. It is the process of separating pixel values into two collections: black as foreground and white as background.
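As a concrete illustration of this separation into black foreground and white background, the following minimal sketch applies a fixed global threshold. This is a simplification for illustration only: the thesis itself uses the adaptive technique of Gatos et al. (2006), and the threshold value of 127 here is merely a common default, not a value from that work.

```python
# Global-threshold binarization sketch (hypothetical helper): every
# grayscale pixel above the threshold becomes background (white, 1),
# the rest becomes foreground (black, 0).
def binarize(gray, threshold=127):
    """Map a 2-D list of 0-255 intensities to a 2-D list of 0/1."""
    return [[1 if px > threshold else 0 for px in row] for row in gray]

page = [
    [250, 252,  30, 249],   # one dark (text) pixel in a light row
    [251,  20,  25, 248],   # two dark pixels
]
print(binarize(page))   # [[1, 1, 0, 1], [1, 0, 0, 1]]
```

A single global threshold works well on evenly lit pages; the adaptive methods surveyed in this section exist precisely because degraded or unevenly illuminated documents violate that assumption.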
Mohd Sanusi (2013) shows the pre-processing steps carried out by researchers on Jawi script. This process is based on the process performed by Khairuddin Omar (2000). The overall stages used by the pioneers of Jawi script research include conversion of the image to a color-scale representation, skew and slant correction, noise removal, and thinning to a frame (skeleton). However, not all of the steps used by Khairuddin Omar (2000) are used by other researchers in their pre-processing, as shown in Table 2.1.

Table 2.1: Image processing steps for Jawi pattern recognition (Azmi, 2013)

Researcher for previous studies   Format Conversion   Skew and slant correction   Noise Removal   Thinning
Khairuddin Omar (2000)            √                   √                           √               √
Mazani Manaf (2002)               √                                               √               √
Mohammad Roslim (2002)                                                                            √
Mohammad Faidzul (2010)           √

Referring to Table 2.1, Khairuddin Omar (2000) performed a transformation to binary format during the pre-processing phase, while Mazani Manaf (2002) and Mohammad Faidzul (2010) converted to grayscale format instead. These researchers then carried out noise removal. For noise removal, Khairuddin Omar (2000) used a median filter and then the technique proposed by Sharaf El-Deen et al. (2003) to change the image format to a binary scale. After this, skew and slant correction was implemented using a gradient orientation histogram. The final procedure carried out by Khairuddin Omar (2000) in pre-processing was skeleton thinning, using the sequential Safe-Point Thinning Algorithm (SPTA) proposed by Naccache and Shinghal (1984). The SPTA algorithm was also used by Mohammad Roslim (2002) for thinning Jawi script.

Mazani Manaf (2002) employed gamma and intensity correction in pre-processing, then used a linear function with the technique of Parker (1994) to convert the image to grayscale format. Noise removal was then executed using an erosion operation followed by reconstruction, as proposed by Zhang and Suen (1984), after which a median filter was applied for de-noising. Finally, Mazani Manaf (2002) performed thinning using a simple sequential thinning algorithm based on Zhang and Suen (1984).

Mohammad Faidzul (2010) in his study took samples from nine writers. He first performed segmentation manually; he did not perform noise removal in his research. From this manual segmentation he obtained 993 characters and 540 sub-words in total from the nine authors. Next, format conversion was performed using a binary threshold of 127. Mohammad Faidzul (2010) did not perform skew and slant correction or skeleton thinning.

Abdenour Sehad et al. (2013) presented a capable scheme for the binarization of ancient and degraded document images based on texture features. The suggested technique is adaptive threshold-based; the threshold is calculated using a descriptor centered on a co-occurrence matrix. The scheme was verified objectively on the DIBCO dataset of degraded documents, and also subjectively on a set of ancient degraded documents provided by a national library. The outcomes are acceptable and promising, presenting an improvement over classical approaches.
Hossein Ziaei Nafchi et al.(2013) (Nafchi et al., 2013) has concluded that the preprocessing and post processing phases meaningfully advance the performance of binarization approaches, particularly in the situation of harshly degraded ancient documents. An unverified post processing technique is presented founded on the phasepreserved denoised image and also phase congruency features extracted from the input image. The central part of the technique comprises of two robust mask images that can be used to cross the false positive pixels on the production of the binarization technique. Firstly, a mask with an extreme recall value is attained from the denoised image with the help of morphological procedures. In parallel, a second cover is acquired dependent upon stage congruency features. At that point, a median filter is utilized to evacuate noise on these two masks, which then are utilized to rectify the yield of any binarization strategy. Jon Parker et al.(2013) (Parker et al., 2013) has studied that regularly documents of notable noteworthiness are ran across in a state of deterioration. Such archives are regularly examined to all the while history and announce a disclosure. Changing over the data found inside such reports to open information happens all the more rapidly and inexpensively if a programmed technique to upgrade these corrupted archives is utilized as opposed to improving each one document image by hand. A novel mechanized image upgrade approach that indulges no preparation information was introduced. The methodology was valid to images of typewritten text in addition to hand written text or both. Konstantinos Ntirogiannis et al.(2013) (Ntirogiannis et al., 2013) has analysed that document image binarization is of incredible value in recognition pipeline and document image examination as it disturbs further phases of the recognition procedure. 
The assessment of a binarization technique helps in examining its algorithmic behaviour and in confirming its adequacy, by giving qualitative and quantitative indications of its performance. A pixel-based binarization evaluation approach for historical handwritten/machine-printed document images has been proposed. In this evaluation procedure, the recall and precision measures are suitably adjusted using a weighting scheme that reduces any potential evaluation bias. Additional performance metrics of the proposed evaluation scheme consist of the percentage rates of broken and missed text, false alarms, background noise, character enlargement, and merging.
Rabeux et al. (2013) proposed an approach to predict the outcome of binarization algorithms on a given document image according to its state of degradation, since document degradations result in binarization errors. The degradation of a document image is characterized using different features based on the intensity, amount and position of the degradation. These features allow prediction models of binarization algorithms to be built that are very accurate according to R² values and p-values. The prediction models are then used to select the best binarization algorithm for a given document image.
Gaceb et al. (2013) studied a smart binarization technique that considers different degradations of document images. The nature of every pixel is approximated using hierarchical local thresholding in order to classify it as a foreground, background or ambiguous pixel. The ambiguous pixels, which represent the corrupted zones, cannot be binarized with the same local thresholding. The global quality of the image is estimated from the density of these degraded pixels. If the image is degraded, a second separation is applied to the ambiguous pixels, splitting them into background or foreground using an improved relaxation method.
Wagdy et al. (2013) implemented a fast and efficient document image clean-up and binarization technique based on retinex theory and global thresholding. This technique combines local and global thresholding with the concept of retinex theory, which can efficiently improve degraded and poor-quality document images. A fast global threshold is then used to convert the document image into binary form. The new method overcomes the limitations of related global threshold techniques.
Papavassiliou et al. (2012) discussed an effective technique based on mathematical morphology for extracting text regions from degraded document images. The main stages of the methodology are: a) top-hat-by-reconstruction to construct a filtered image with a reasonable background; b) region growing, beginning from a set of seed points and attaching to each seed neighbouring pixels of similar intensity; and c) conditional extension of the initially detected text regions based on the values of the second derivative of the filtered image.
Su et al. (2012) studied a document image binarization framework that makes use of the Markov Random Field (MRF) model. The framework separates the document image pixels into three classes: document background, document foreground text, and uncertain pixels. The uncertain pixels are assigned to the foreground or background category by incorporating the MRF model and boundary information.
Patvardhan et al. (2012) observed that document images may contain difficult backgrounds, i.e. shading or noise. Their binarization method makes document images suitable for OCR using the discrete curvelet transform. The curvelet transform is used to eliminate difficult image backgrounds and white Gaussian noise, and gives an improved binarized document image; it also helps to enhance text shape even in the presence of noise. This method is able to eliminate high-frequency Gaussian noise and low-frequency complex backgrounds, and shows better performance.
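Several of the techniques above reduce, at some stage, to picking a single global threshold for the whole page. As an illustrative sketch (not the exact step of any cited paper), Otsu's classic method chooses the threshold that maximizes the between-class variance of the gray-level histogram:

```python
import numpy as np

def otsu_threshold(gray):
    """Find the global threshold that maximizes between-class variance
    (Otsu's method), then binarize: text -> 0, background -> 255."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                 # cumulative class probability
    mu = np.cumsum(prob * np.arange(256))   # cumulative class mean
    mu_t = mu[-1]
    # between-class variance for every candidate threshold
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    sigma_b = np.nan_to_num(sigma_b)
    t = int(np.argmax(sigma_b))
    binary = np.where(gray > t, 255, 0).astype(np.uint8)
    return t, binary

# A toy "document": dark text strokes (30) on a bright page (220)
page = np.full((8, 8), 220, dtype=np.uint8)
page[3, 1:7] = 30                 # one horizontal stroke of "text"
t, binary = otsu_threshold(page)
assert 30 <= t < 220              # threshold falls between the two modes
assert binary[3, 2] == 0 and binary[0, 0] == 255
```

On a clean bimodal histogram such as this toy page, any threshold between the two modes is optimal; the methods surveyed above differ mainly in how they cope when degradations destroy this bimodality.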
2.2.2 Border/Illumination Removal Approaches
The proposed approaches for document segmentation and character recognition usually treat the scanned images as ideal, noise-free images. However, several factors may degrade the quality of the full document image. When a page is scanned from a book, text from a neighbouring page can also be captured in the image of the current page. These unwanted areas are called "noisy text regions". In addition, whenever a scanned page does not completely cover the scanner's image size, black borders usually appear in the image. These unwanted regions are called "noisy black borders". Figure 2.1 shows noisy black borders as well as noisy text regions. All these problems influence the performance of segmentation and recognition processes. If page segmentation algorithms take noisy text regions into account, text recognition accuracy decreases, since the text recognition system usually outputs several extra characters in these regions. The goal of border detection is to find the principal text region and to ignore the noisy text and black borders.
Figure 2.1: Example of an image with noisy black border and noisy text region
The most common approach to eliminating marginal noise is to perform document cleaning by filtering out connected components based on their size and aspect ratio. However, when characters from the adjacent page are also present, they usually cannot be filtered out using only these features. There are only a few techniques in the literature for page border detection, and they are mainly focused on printed document images. Le et al. (Le et al., 1996) propose a border-removal technique predicated on the classification of blank, non-textual and textual rows and columns, the location of border objects, and an analysis of crossing counts of textual squares and projection profiles. This approach relies on many heuristics. Moreover, it assumes that the page borders are very close to the image edges and that a blank space separates the border from the contents of the image; this assumption is often violated. Fan et al. (Fan et al., 2002) propose a technique for detecting black noisy regions that overlap with the text, but do not assume the presence of noisy text regions. They propose a framework that reduces the image resolution in order to detect, by threshold filtering, the black borders that hide text, and then applies the deletion process to the original image. Avila et al. (Ávila and Lins, 2004) propose "invading" and "non-invading" border algorithms that work as flood-fill algorithms. The "non-invading border" algorithm assumes that the document information does not merge with the noisy black border; to curb the flooding in the connected area, it uses two document-related parameters, the maximum size of a segment belonging to the document text and the maximum distance between lines. The "invading border" algorithm, on the other hand, does not assume that the text regions are free of the noisy black border: if the noisy black border merges with a text region of the document, the whole area, including that part of the text region, is flooded and removed. Dey et al. (Dey et al., 2012) propose a technique for removing margin noise from printed document images. First, they perform layout analysis to detect words, lines, and paragraphs in the document image, and the detected elements are classified into text and non-text components based on their characteristics (size, position, etc.). The geometric properties of the text blocks are then used to detect and remove the margin noise. Finally, Agrawal and Doermann (Agrawal and Doermann, 2013) present a clutter detection and removal algorithm for complex document images. They propose a distance transform-based approach which aims to remove irregular and non-periodic clutter noise from binary document images independently of the clutter's position, size, shape and connectivity with the text.
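The size/aspect-ratio filtering mentioned at the start of this section can be sketched as follows. The area and aspect-ratio thresholds below are illustrative assumptions, not values taken from any of the cited papers:

```python
import numpy as np
from collections import deque

def filter_components(binary, min_area=4, max_aspect=5.0):
    """Remove connected components whose size or aspect ratio marks them
    as marginal noise rather than text (a common first cleaning step)."""
    h, w = binary.shape
    seen = np.zeros_like(binary, dtype=bool)
    out = np.zeros_like(binary)
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and not seen[sy, sx]:
                # BFS to collect one 4-connected component
                comp, q = [], deque([(sy, sx)])
                seen[sy, sx] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and \
                                binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                ys = [p[0] for p in comp]
                xs = [p[1] for p in comp]
                bh = max(ys) - min(ys) + 1
                bw = max(xs) - min(xs) + 1
                aspect = max(bh, bw) / min(bh, bw)
                # keep only components that look like text
                if len(comp) >= min_area and aspect <= max_aspect:
                    for y, x in comp:
                        out[y, x] = 1
    return out

img = np.zeros((10, 20), dtype=np.uint8)
img[4:7, 5:9] = 1        # a text-like blob: kept
img[0, :] = 1            # a thin border line (aspect ratio 20): removed
clean = filter_components(img)
assert clean[5, 6] == 1 and clean[0, 3] == 0
```

As the surveyed papers note, this simple filter fails exactly where characters from the adjacent page have text-like size and aspect ratio, which is what motivates the dedicated border-detection techniques above.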
2.3 Arabic OCR
Optical Character Recognition (OCR) systems transform large amounts of documents, either machine-printed or handwritten, into machine-encoded text, in spite of distortion, noise, resolution variations and other degrading factors.
Figure 2.2: General Arabic OCR systems capabilities
[Figure: a taxonomy in which Character Recognition divides into Off-Line and On-Line systems, Off-Line into Machine Printed and Handwritten, targeting either Isolated Characters or Cursive Words.]
Figure 2.2 shows the main capabilities of Arabic OCR systems. Obviously, these systems differ in their character recognition capabilities. The sophistication of an off-line OCR system depends on the type and number of fonts to be recognized. An omni-font OCR machine can recognize most non-stylized fonts without having to maintain huge databases of font-specific information; omni-font technology is usually characterized by the use of feature extraction. However, no OCR machine performs equally well, or even usably well, on all the fonts used by modern computers. The first step in any OCR system is to capture text data and transform it into a digital form. Recognition systems differ in how they acquire their input: there are on-line and off-line systems, as described in Figure 2.2. On-line (or real-time) systems (Alimi, 1997; Al-Emami and Usher, 1990) recognize the text while the user is writing it, e.g. on a digital tablet. The tablet captures the (x, y) coordinates of the pen location while it is moving, generating a one-dimensional vector of points whose length depends on the tablet resolution (points/inch) and the sampling rate (points/second). On-line systems achieve high recognition performance because each character is represented by a vector of points ordered by time, i.e. the representation is time-dependent. The user of such a system can directly see the output of the recognition system and verify the results; however, such systems are limited to recognizing handwritten text. Off-line systems recognize the text after it has been written or printed on pages (Al-Muhtaseb et al., 2008; Cheung et al., 2001). Most text of interest is already printed in documents or books, and the need to convert it into an electronic medium gives great value to off-line recognition systems. Unlike on-line systems, off-line systems have no time-dependent information: each page of text is represented by a two-dimensional array of pixel values. The system may acquire the input text using scanners (Al-Muhtaseb et al., 2008; Cheung et al., 2001).
2.4 Image Segmentation
In order to extract features from the text image, it should be segmented into lines, words, characters or primitives. Arabic OCR systems are classified into two major types depending on the method of segmentation being used: segmentation-based systems and segmentation-free systems. The segmentation procedure is the most challenging phase for any Arabic OCR system because of the cursive nature of the Arabic script. This challenge arises in segmentation-based systems, while segmentation-free systems avoid the problem.
Table 2.2: Categorization of segmentation algorithms (Nikos et al., 2010)
[Table: for each proposed segmentation algorithm — X-Y cuts, RLSA, Docstrum, whitespace analysis, constrained text line, Hough transform, Voronoi, and scale space analysis — existing research is marked against hand-written documents, printed documents, diacritical marks, page segmentation, text line segmentation, and Arabic text segmentation.]
Various document image segmentation techniques have been proposed in the literature. These techniques can be categorized based on the document image segmentation algorithm that they adopt. The best known of these segmentation algorithms are the following: X-Y cuts or projection-profile based (Nagy and Seth, 1984), the Run Length Smoothing Algorithm (RLSA) (Wahl et al., 1982), component grouping (Feldbach and Tonnies, 2001), document spectrum (O'Gorman, 1993), whitespace analysis (Baird, 1994), constrained text lines (Breuel, 2002), the Hough transform (Hough, 1962; Duda and Hart, 1972), Voronoi tessellation (Kise et al., 1998) and scale space analysis (Manmatha and Rothfeder, 2005). All of the above segmentation algorithms are mainly designed for contemporary documents. Table 2.2 categorizes the aforementioned segmentation algorithms and depicts the way they have been used in document processing.
2.5 Techniques for Document Text Segmentation
One of the early tasks in a handwriting recognition system is the segmentation of a handwritten document image into text lines, defined as the process of delimiting the region of every text line on a document image. The overall performance of a handwritten character recognition system strongly depends on the results of the text line segmentation process. If the quality of the results produced by the text line segmentation stage is poor, the accuracy of the text recognition procedure suffers. Thus, the algorithms employed for these two stages are critical for the overall recognition procedure. Existing text line segmentation methods can be grouped into four basic categories: methods making use of projection profiles, methods based on the Hough transform, smearing methods and, finally, methods based on the principle of dynamic programming. In addition, many methods exist that cannot be clearly classified in a specific category, since they employ particular techniques of their own.
2.5.1 Projection Profiles
Several methods make use of projection profiles (Bruzzone and Coffetti, 1999; Arivazhagan, 2007). In (Bruzzone and Coffetti, 1999), the original image is divided into vertical slices and, for every vertical slice, the histogram of horizontal runs is calculated. The technique assumes that the text contained in one slice is parallel within the slice. Arivazhagan et al. (Arivazhagan, 2007) partition the original image into vertical strips called chunks and calculate the projection profile for each chunk. The first candidate lines are extracted from the first chunks. These lines traverse around any obstructing handwritten connected component by associating it with the text line above or below. The decision is made by either (i) modelling the text lines as bivariate Gaussian densities and evaluating the probability of the component for each Gaussian, or (ii) the probability obtained from a distance metric.
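The basic projection-profile idea these methods refine — sum the foreground pixels of each row and treat maximal runs of non-empty rows as text lines — can be sketched as:

```python
import numpy as np

def text_lines_from_profile(binary):
    """Horizontal projection profile: count foreground pixels per row,
    then take maximal runs of non-empty rows as text-line bands."""
    profile = binary.sum(axis=1)
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                     # a line band opens
        elif count == 0 and start is not None:
            lines.append((start, y - 1))  # band closes at a blank row
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines

# Two synthetic "text lines" separated by blank rows
page = np.zeros((12, 30), dtype=np.uint8)
page[2:4, 3:25] = 1
page[7:10, 2:28] = 1
assert text_lines_from_profile(page) == [(2, 3), (7, 9)]
```

This global profile only works for roughly horizontal, well-separated lines; the chunk-wise variants above exist precisely because handwritten lines are skewed and touching, so the profile must be computed per vertical strip.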
2.5.2 Hough Transform Approach
Several methods make use of the Hough transform, including (Fletcher and Kasturi, 1988), (Louloudis et al., 2008), (Likforman-Sulem et al., 1995) and (Pu and Shi, 1998). The Hough transform is a powerful tool used in many areas of document analysis that is able to locate skewed lines of text. Starting from a set of points of the original image, the technique finds the lines that best fit these points. The points considered in the voting process of the Hough transform are usually either the gravity centers (Fletcher and Kasturi, 1988; Louloudis et al., 2008; Likforman-Sulem et al., 1995) or the minima points (Pu and Shi, 1998) of the connected components.
In further detail, Likforman-Sulem et al. (Likforman-Sulem et al., 1995) developed a technique based on a hypothesis–validation scheme. Potential alignments are hypothesized in the Hough domain and validated in the image domain. The units for the Hough transform are the centroids of the connected components. A set of units aligned in the image along a line with parameters (ρ, θ) contributes to the corresponding cell (ρ, θ) of the Hough domain, so alignments containing many units correspond to high-peaked cells of the Hough domain. A more recent method using the Hough transform was proposed by Louloudis et al. (Louloudis et al., 2008). The main contributions of the approach are a) the partitioning of the connected component space into three distinct spatial sub-domains (small, normal and large), from which only the normal connected components are used in the Hough transformation step, b) a block-based Hough transform step for the detection of potential text lines, and c) a post-processing step for the detection of text lines the Hough transform did not reveal, as well as the separation of vertically connected parts of adjacent text lines. The Hough transform can also be applied to fluctuating lines of handwritten drafts, as in (Pu and Shi, 1998). The Hough transform is first applied to minima points (units) in a vertical strip on the left of the image. The alignments in the Hough domain are searched starting from a main direction, by grouping cells in an exhaustive search in six directions. Then a moving window, associated with a clustering scheme in the image domain, assigns the remaining units to alignments. The clustering scheme (Natural Learning Algorithm) allows the creation of new lines starting in the middle of the page.
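The voting step common to these Hough-based methods can be sketched on component centroids; the angular range and 1-degree/1-pixel resolution below are illustrative choices, not the parameters of any cited paper:

```python
import numpy as np

def hough_peak(points, thetas_deg=range(0, 180), rho_res=1.0):
    """Vote each point (e.g. a connected-component centroid) into a
    (rho, theta) accumulator; the highest cell is the dominant alignment,
    where rho = x*cos(theta) + y*sin(theta)."""
    votes = {}
    for theta in thetas_deg:
        t = np.deg2rad(theta)
        for x, y in points:
            rho = round((x * np.cos(t) + y * np.sin(t)) / rho_res)
            votes[(rho, theta)] = votes.get((rho, theta), 0) + 1
    (rho, theta), count = max(votes.items(), key=lambda kv: kv[1])
    return rho * rho_res, theta, count

# Five centroids lying on the horizontal line y = 10, plus one outlier
centroids = [(5, 10), (20, 10), (35, 10), (50, 10), (65, 10), (40, 3)]
rho, theta, count = hough_peak(centroids)
assert (rho, theta, count) == (10.0, 90, 5)
```

The outlier contributes stray votes but no peak, which is why these methods follow the Hough step with validation in the image domain before accepting an alignment as a text line.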
2.5.3 Smearing Methods
Smearing methods mainly include the fuzzy RLSA (Shi and Govindaraju, 2004) and the adaptive RLSA (Makridis et al., 2007). The fuzzy RLSA measure is calculated for every pixel of the initial image and describes "how far one can see when standing at a pixel along the horizontal direction". By applying this measure, a new grayscale image is created, which is binarized, and the text lines are extracted from the new image. The adaptive RLSA (Makridis et al., 2007) is an extension of the classical RLSA, in the sense that additional smoothing constraints are set with respect to the geometrical properties of neighbouring connected components. The replacement of background pixels with foreground pixels is performed whenever these constraints are satisfied.
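The classical horizontal RLSA that both variants extend can be sketched as follows; the smoothing threshold is an illustrative parameter (in practice it is tuned to the expected inter-character gap):

```python
import numpy as np

def horizontal_rlsa(binary, threshold=4):
    """Run-Length Smoothing: fill background runs shorter than `threshold`
    between two foreground pixels on the same row, merging characters
    into word- or line-level blobs."""
    out = binary.copy()
    for row in out:
        ones = np.flatnonzero(row)
        for a, b in zip(ones[:-1], ones[1:]):
            if 0 < b - a - 1 <= threshold:   # short background gap
                row[a:b] = 1                 # smear it closed
    return out

line = np.array([[1, 0, 0, 1, 0, 0, 0, 0, 0, 1]], dtype=np.uint8)
smoothed = horizontal_rlsa(line, threshold=3)
# the 2-pixel gap is filled; the 5-pixel gap is kept
assert smoothed.tolist() == [[1, 1, 1, 1, 0, 0, 0, 0, 0, 1]]
```

The fuzzy and adaptive variants replace the single fixed `threshold` with, respectively, a per-pixel visibility measure and constraints derived from neighbouring component geometry.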
2.5.4 Dynamic Programming
Text line segmentation methods based on the dynamic programming principle have been presented recently (Nicolaou and Gatos, 2009; Saabni et al., 2014). They try to segment text lines by finding an optimal path on the background of the document image travelling from the left to the right edge. The approach of Nicolaou et al. (Nicolaou and Gatos, 2009) is based on the topological assumption that for every text line there exists a path from one side of the image to the other that crosses only that text line. The image is first blurred, and in a second step tracers are used to follow the black-most and white-most paths from right to left as well as from left to right. The final goal is to shred the image into text line areas. Saabni et al. (Saabni et al., 2014) propose a method which computes an energy map of a text image and determines the seams that pass across and between text lines. Two different algorithms are described, one for binary and one for grayscale images. In the first algorithm (binary case), each seam passes along the middle of a text line and marks the components that make up its letters and words; at a final step, the unmarked components are assigned to the closest text line. In the second algorithm (grayscale case), the seams are calculated on the distance transform of the grayscale image.
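The optimal-path idea can be sketched with a straightforward dynamic program over an energy map. This is a simplified illustration of the principle, not the algorithm of either cited paper:

```python
import numpy as np

def horizontal_seam(energy):
    """Dynamic programming: cheapest left-to-right path that moves at most
    one row per column step (the kind of seam used between text lines)."""
    h, w = energy.shape
    cost = energy.astype(float).copy()
    for x in range(1, w):
        for y in range(h):
            lo, hi = max(0, y - 1), min(h, y + 2)
            cost[y, x] += cost[lo:hi, x - 1].min()
    # backtrack from the cheapest cell in the last column
    seam = [int(np.argmin(cost[:, -1]))]
    for x in range(w - 1, 0, -1):
        y = seam[-1]
        lo, hi = max(0, y - 1), min(h, y + 2)
        seam.append(lo + int(np.argmin(cost[lo:hi, x - 1])))
    return seam[::-1]          # row index of the seam at every column

# High energy on two "text lines" (ink), low energy in the gap (row 2)
energy = np.full((5, 6), 10.0)
energy[2, :] = 0.0
assert horizontal_seam(energy) == [2, 2, 2, 2, 2, 2]
```

With the energy map inverted (low between lines, high on ink), such seams thread through the inter-line gaps; Saabni et al. instead derive the energy from a distance transform so that seams can also follow the middle of each line.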
2.5.5 Other Techniques
Related work following other methodologies includes Nicolas et al. (Nicolas et al., 2004), who consider the text line extraction problem from an Artificial Intelligence perspective. The objective is to cluster the connected components of the document into homogeneous sets that correspond to the text lines of the document. To solve this problem, a search is applied over the graph defined by the connected components as vertices and the distances among them as edges. The work in (Shi et al., 2005) makes use of the Adaptive Local Connectivity Map: a grayscale image is the input to the technique, and a new image is calculated by summing the intensities of every pixel's neighbours in the horizontal direction. Since the new image is also a grayscale image, a thresholding method is applied and the resulting connected components are grouped into location maps. In (Kennard and Barrett, 2006), the count of foreground/background transitions in a binarized image is used to detect likely text line areas, and a min-cut/max-flow graph cut algorithm is used to split up text areas that appear to encompass more than one line of text. Text lines containing relatively little text information are then merged with nearby text lines. Li et al. (Li et al., 2008) presented a method which models text line detection as an image segmentation problem, enhancing the text line structures using a Gaussian window and evolving the text line boundaries with the level set method. The method described in (Lemaitre and Camillerapp, 2006) is based on a notion of perceptive vision: at a certain distance, text lines can be seen as line segments. This method uses the theory of Kalman filtering to detect text lines in low resolution images.
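The connectivity-map step of Shi et al. — summing each pixel's horizontal neighbours so that ink-dense rows light up as text-line ridges — can be sketched as follows. The window size `span` is an illustrative assumption, not the paper's adaptive window:

```python
import numpy as np

def local_connectivity_map(gray, span=2):
    """A simplified connectivity map in the spirit of the ALCM: each output
    pixel is the sum of intensities within `span` horizontal neighbours,
    implemented by summing shifted copies of the image."""
    out = np.zeros(gray.shape, dtype=float)
    for dx in range(-span, span + 1):
        shifted = np.zeros_like(out)
        if dx < 0:
            shifted[:, :dx] = gray[:, -dx:]   # shift left
        elif dx > 0:
            shifted[:, dx:] = gray[:, :-dx]   # shift right
        else:
            shifted[:, :] = gray
        out += shifted
    return out

ink = np.zeros((3, 7))
ink[1, 2:5] = 1.0                     # ink on the middle row only
acm = local_connectivity_map(ink, span=1)
assert acm[1, 3] == 3.0 and acm[0, 3] == 0.0
```

Thresholding this map yields blobs that follow the text-line ridges, which is what the location-map grouping in the original method then exploits.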
Weliwitage et al. (Weliwitage et al., 2005) presented a cut text minimization technique for text line segmentation of handwritten documents in English. An optimization technique varies the cutting angle and starting location to minimize the number of text pixels cut while tracking between two text lines. In (Basu et al., 2008), a text line extraction technique is presented for multi-skewed handwritten documents of Bengali or English text. It assumes that hypothetical water flows from both the left and right sides of the image frame and faces obstruction from the characters of the text lines. The stripes of areas left unwetted on the image frame are finally labelled to extract the text lines. In (Zahour et al., 2007), a text line segmentation method is presented for handwritten or printed historical documents containing Arabic letters. The first step classifies the document into two classes using the K-means scheme; the classes correspond to the complexity of the document (easy or not easy to segment). A document whose characters overlap or touch is divided into vertical strips. From the horizontal projection result, the extracted text blocks are classified into three categories: large, average and small text blocks. The lines are obtained by segmenting the large text blocks using the spatial relationship that matches adjacent blocks within two successive strips. Documents without touching or overlapping characters are segmented without the large-block segmentation module. From 100 experiments on historical documents, the researchers claim 96% accuracy on that sample. Yin and Liu (Yin and Liu, 2008) propose an approach based on minimum spanning tree (MST) clustering with new distance measures. In a first step, the connected components of the document image are grouped into a tree by minimum spanning tree clustering with a new distance measure. Then the tree edges are dynamically cut using a new objective function to form text lines and find the number of clusters. This approach can be applied to many documents, is entirely parameter-free, and handles curved and multi-skewed lines.
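The MST-clustering idea of Yin and Liu can be sketched with Euclidean distances and a simple long-edge cut; the `cut_factor` heuristic below is an illustrative stand-in for their learned distance measure and objective function:

```python
import numpy as np

def mst_text_lines(centroids, cut_factor=2.0):
    """Build a minimum spanning tree over component centroids (Prim's
    algorithm), then cut edges much longer than the average edge so the
    tree splits into per-line clusters."""
    pts = np.asarray(centroids, dtype=float)
    n = len(pts)
    in_tree, edges = {0}, []
    while len(in_tree) < n:               # Prim: grow tree one edge at a time
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree:
                    d = float(np.linalg.norm(pts[i] - pts[j]))
                    if best is None or d < best[0]:
                        best = (d, i, j)
        edges.append(best)
        in_tree.add(best[2])
    mean_len = sum(e[0] for e in edges) / len(edges)
    # union-find over the edges we keep (the short ones)
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for d, i, j in edges:
        if d <= cut_factor * mean_len:
            parent[find(i)] = find(j)
    roots = sorted({find(i) for i in range(n)})
    return [sorted(i for i in range(n) if find(i) == r) for r in roots]

# Two horizontal runs of centroids, far apart vertically
pts = [(0, 0), (5, 0), (10, 0), (0, 40), (5, 40), (10, 40)]
clusters = mst_text_lines(pts)
assert sorted(map(tuple, clusters)) == [(0, 1, 2), (3, 4, 5)]
```

Within-line neighbours are a few pixels apart while inter-line edges are much longer, so cutting the long MST edges recovers one cluster per text line without fixing the number of lines in advance.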
2.6 Text Segmentation Analysis
Table 2.3: Text line segmentation methods analysis

Bruzzone et al. (1999), "An algorithm for extracting cursive text lines" — Projection profiles method. The algorithm is based on the analysis of horizontal run projections and on connected component grouping and splitting, applied to a partition of the input image into vertical strips, in order to deal with undulated or skewed text. The goal of the algorithm is to preserve ascending and descending characters from being corrupted by arbitrary cuts. The algorithm has been designed for cursive text and can also be applied to hand-printed text.

Arivazhagan et al. (2007), "A statistical approach to line segmentation in handwritten documents" — Projection profiles method. The projection profile of every vertical strip (chunk) is calculated. The first candidate lines are extracted from the first chunks. These lines traverse around any obstructing handwritten connected component by associating it with the text line above or below. This decision is made by either (i) modelling the text lines as bivariate Gaussian densities and evaluating the probability of the component for each Gaussian, or (ii) the probability obtained from a distance metric.

Likforman et al. (1995), "A Hough based algorithm for extracting text lines in handwritten documents" — Hough transform method. Potential alignments are hypothesized in the Hough domain and validated in the image domain. The gravity centers of the connected components are the units for the Hough transform.

Pu and Shi (1998), "A natural learning algorithm based on Hough transform for text lines extraction in handwritten documents" — Hough transform method. The Hough transform is first applied to minima points (units) in a vertical strip on the left of the image. The alignments in the Hough domain are searched starting from a main direction, by grouping cells in an exhaustive search in six directions. Then a moving window, associated with a clustering scheme in the image domain, assigns the remaining units to alignments. The clustering scheme (natural learning algorithm) allows the creation of new lines starting from the middle of the page.

Louloudis et al. (2008), "Text line detection in handwritten documents" — Hough transform method. The methodology incorporates a block-based Hough transform approach which takes into account the gravity centers of parts of connected components. After the first candidate text line extraction, a post-processing step is used to correct possible splittings as well as to detect text lines that the previous step did not reveal. A key idea in the whole procedure is the partitioning of the connected component domain into three distinct sub-domains, each of which is treated in a different manner.

Shi and Govindaraju (2004), "Line separation for complex document images using fuzzy runlength" — Smearing method. The fuzzy RLSA measure is calculated for every pixel of the initial image and describes "how far one can see when standing at a pixel along the horizontal direction". By applying this measure, a new grayscale image is created, which is binarized, and the text lines are extracted from the new image.

Gatos et al. (2006), "Adaptive degraded document image binarization" — Smearing method. The adaptive RLSA is an extension of the classical RLSA, in the sense that additional smoothing constraints are set with respect to the geometrical properties of neighbouring connected components. The replacement of background pixels with foreground pixels is performed whenever these constraints are satisfied.

Shi et al. (2005), "Text extraction from gray scale historical document images using adaptive local connectivity map" — Other. A methodology that makes use of the adaptive local connectivity map: a grayscale image is the input, and a new image is calculated by summing the intensities of every pixel's neighbours in the horizontal direction. Since the new image is also a grayscale image, a thresholding method is applied and the resulting connected components are grouped into location maps.

Kennard et al. (2006), "Separating lines of text in free-form handwritten historical documents" — Other. The method uses the count of foreground/background transitions in a binarized image to determine areas of the document that are likely to be text lines. A min-cut/max-flow graph cut algorithm is used to split up text areas that appear to encompass more than one line of text. Text lines containing relatively little text information are then merged with nearby text lines.

Lemaitre and Camillerapp (2006), "Text line extraction in handwritten document with Kalman filter applied on low resolution image" — Other. A methodology based on a notion of perceptive vision: at a certain distance, text lines can be seen as line segments. The method uses the theory of Kalman filtering to detect text lines in low resolution images.

Nicolas et al. (2004), "Text line segmentation in handwritten document using a production system" — Other. The text line extraction problem is considered from an Artificial Intelligence perspective. The objective is to cluster the connected components of the document into homogeneous sets that correspond to the text lines. To solve this problem, a search is applied over the graph defined by the connected components as vertices and the distances among them as edges.

Li et al. (2008), "Script-independent text line segmentation in freestyle handwritten documents" — Other. A technique that models text line detection as an image segmentation problem, enhancing the text line structures using a Gaussian window and evolving the text line boundaries with the level set method.

Weliwitage et al. (2005), "Handwritten document offline text line segmentation" — Other. A cut text minimization technique for text line segmentation of handwritten documents in English. An optimization technique varies the cutting angle and starting location to minimize the number of text pixels cut while tracking between two text lines.

Basu et al. (2008), "Text line extraction from multi-skewed handwritten documents" — Other. A technique for multi-skewed handwritten documents of Bengali or English text. It assumes that hypothetical water flows from both the left and right sides of the image frame and faces obstruction from the characters of the text lines. The stripes of areas left unwetted on the image frame are finally labelled to extract the text lines.

Yin and Liu (2008), "Handwritten text line extraction based on minimum spanning tree clustering" — Other. An approach based on minimum spanning tree (MST) clustering with new distance measures. First, the connected components of the document image are grouped into a tree by MST clustering with a new distance measure. Then the tree edges are dynamically cut using a new objective function to form text lines and find the number of clusters. The approach can be applied to many documents, is entirely parameter-free, and handles curved and multi-skewed lines.

Stamatopoulos et al. (2009), "A method for combining complementary techniques for document image segmentation" — Other. A combination method of different segmentation techniques. The goal is to exploit the segmentation results of complementary techniques and specific features of the initial image so as to generate improved segmentation results.

Roy et al. (2008), "Morphology based handwritten line segmentation using foreground and background information" — Other. A method based on morphological operations and run-length smearing. First, RLSA is applied to obtain each single word as a component. Next, the foreground of this smoothed image is eroded to obtain several seed components from the individual words of the document; erosion is also applied to the background portions to find boundary information of the text lines. Finally, the lines are segmented using the boundary information and the positional information of the seed components.

Du et al. (2008), "Text line segmentation in handwritten documents using Mumford-Shah model" — Other. A method based on the Mumford–Shah model. The algorithm is claimed to be script independent. In addition, morphing is used to remove overlaps between neighbouring text lines and to connect broken ones.
2.7 Conclusion
This literature review discussed the pre-processing phase for Arabic/Jawi character recognition, covering binarization using many techniques, as well as approaches for removing illumination and segmenting the Holy Quran without any change to its content. Maintaining the authenticity of the Quran text-image content guides both the investigation and implementation phases, since detecting and removing these unwanted areas is critical to achieving better text segmentation results.
CHAPTER 3
RESEARCH METHODOLOGY
3.1 Introduction
This chapter discusses the research methodology that will be carried out in order to achieve the research objectives mentioned earlier in Chapter 1. The research methodology incorporates the sequential, logical process that forms the conceptual structure of the tasks performed throughout the research, and it is realized through the research framework described below.
3.2 Research Methodology

The study discusses the methodology through the conceptual framework, the task framework and the experimental framework. Each framework contains the details of the implementation of its sub-sections.

3.2.1 Research Framework

The conceptual framework of the study is divided into two phases: the investigation phase and the implementation phase. The phases are shown in Figure 3.1.
[Diagram: Research Framework. Investigation Phase: 1. Problem summarization; 2. Research on illumination detection and text line segmentation processing; 3. Research on previous techniques used in illumination detection and text line segmentation. Implementation Phase (Task Framework): 1. Data collection; 2. Binarization method; 3. Image detection; 4. Image segmentation; 5. Image of each page and line.]
Figure 3.1: Research Framework
I. Investigation Phase

In this phase, a study of the research domain is carried out. The background, interests, problems, and current issues of the domain are studied to determine the scope of the study and to establish its aim. Once the domain is specified, the investigation proceeds through a literature review of the factors involved in the domain and identifies previous research associated with the scope and domain of the study.
II. Implementation Phase

Once the objectives and the problem statement were established in Chapter 1, the following phase is completed through the implementation phase of this study. Within the implementation phase, the task framework is used as a guideline for this research.
i. Data Collection

Data collection is the method the researcher used to obtain the data and information. In this research, we used printed text: images of Arabic words written as text. The printed text data consist of different copies of the Holy Quran. As shown in Figure 3.2, Al-Quran is written with different styles of writing and different illumination. Illumination here refers to the decoration on every page of Al-Quran.
Figure 3.2: Al-Quran writing styles

ii. Image Binarization

The binarization process converts a gray-scale or colored image into a binary image. A binary image (Stathis et al., 2008) is a digital image that has just two feasible values for every pixel. Normally two colors, black and white, are used for a binary image, although any two colors can be used. The color used for the objects in the image is the foreground color, while the rest of the image is the background color.
iii. Image Detection

Detection is applied to the images for the following:
a) Illumination detection
b) Text line detection

iv. Image Segmentation

The images are segmented into the following segments:
a) Segmentation into pages
b) Segmentation into text lines

The segmented images are then saved as an image of each page and line using the proposed method.
3.3 Task Framework

Below we present a framework of techniques (see Figure 3.3) that enables page and text line segmentation of a set of Al-Quran pages. In this research, the method is divided into six steps to segment the Quran pages into lines and pages without illumination: (a) pre-processing, encompassing binarization and noise removal; (b) the illumination on the pages is detected; (c) the page is segmented without illumination; (d) the text lines on the segmented page are detected; (e) the text lines are segmented; and (f) the segmented pages and lines are saved as images.
[Flowchart: Image/data → Binary image → Illumination detection → Segments to page → Text line detection → Segments to text line → Save image of each page / Save image of each line]
Figure 3.3: The Illumination Removal in Al-Quran and Segmentation to Pages and Lines Framework
At the end of the task framework, the best-practice techniques for page and text line segmentation will be obtained. The images used in this research are from Al-Quran. Figure 3.4 illustrates a sample Al-Quran image.
Figure 3.4: Page (10) from Holy Al-Quran
3.4 Experimental Test Framework

In this research, one experimental test is conducted to find the most effective techniques for segmenting Holy Quran page images. The experiment is based on finding the best percentage threshold for detecting the illumination and the text lines, in order to segment Al-Quran pages into text lines and pages without illumination. The experimental test is explained together with its objective; the algorithms used, the input, and the results obtained from the algorithm are also presented.

3.4.1 Experiment I
i. Objective: obtain the best percentage for detecting the illumination and text lines, to segment Al-Quran pages into text lines and pages without illumination.
ii. Input: data as images of Al-Quran pages.
iii. Algorithm: the proposed method.
iv. Output: an image of each text line and of each page without illumination.
3.5 Research Tool

We used several research tools to support this study. Table 3.1 shows a summary of the necessary equipment for this research.

Table 3.1: Tools and programming languages utilized in this research

Steps: Data Collection, Binarization, Detection, Segmentation
Datasets: Images of Holy Quran pages
Programming Language: Java
Tools: ImageJ
3.6 Conclusion

This chapter discussed the methodology utilized to solve the research problem. Two phases were identified and used in this study: the investigation phase and the implementation phase. Next, the task framework was presented to indicate the general implementation of this study. The experimental test design was described and, finally, the research tools used throughout the study were reported.
CHAPTER 4
IMPLEMENTATION
4.1 Introduction

In the previous chapter, the methodology was explained in detail. The methodology consists of two phases: investigation and implementation of the proposed system. This chapter discusses in detail the design and development of the system and the task framework used for this research. Among others, it explains the requirements determination and structuring activity based on the research methodology discussed in Chapter 3. In this chapter, suitable solutions are provided for the issues discussed previously; the solutions are based on the objectives mentioned in Chapter 1 (Introduction). Therefore, this chapter also discusses the proposed method.
4.2 Data Collection

The images used are from Al-Quran pages, taken from different copies of the Holy Quran. As shown in Figure 4.1, Al-Quran is written with different styles of writing and different illumination. Illumination here refers to the decoration on every page of Al-Quran.
Figure 4.1: Al-Quran writing styles
4.3 Implementation Process

Below we present a framework of techniques (see Figure 4.2) that enables page and text line segmentation of a set of Al-Quran pages. In this research, the method is divided into six steps to segment the Quran pages into lines and pages without illumination: (a) pre-processing, encompassing binarization and noise removal; (b) the illumination on the pages is detected; (c) the page is segmented without illumination; (d) the text lines on the segmented page are detected; (e) the text lines are segmented; and (f) the segmented pages and lines are saved as images. The proposed framework focuses on the two most important steps: page segmentation and text line segmentation. All steps used in the process of Al-Quran page segmentation are explained in detail below.
4.3.1 Image Binarization

Binarization is performed in the pre-processing step of document analysis and is designed to segment the text from the document background. Many algorithms have been proposed for this task. In this study, the Otsu method is used for binarization (Otsu, 1979).
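The thesis implements this step in Java with ImageJ; purely as an illustrative sketch, Otsu's threshold selection can be written as follows in Python with NumPy. The function names are ours, and the output convention (white background mapped to zero) follows the thesis's later description of the binary representation; everything else is an assumption of this sketch, not the thesis's code.

```python
import numpy as np

def otsu_threshold(gray):
    """Return Otsu's threshold for a 2-D array of 8-bit gray values.

    Otsu's method picks the threshold that maximizes the between-class
    variance of the two resulting pixel classes (Otsu, 1979).
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # class means
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray):
    """Binarize: bright background pixels -> 0 (white), dark ink -> 255.

    The white-as-zero convention mirrors the thesis's binary representation,
    where the frequency of zero values counts white pixels.
    """
    t = otsu_threshold(gray)
    return np.where(gray >= t, 0, 255).astype(np.uint8)
```

Maximizing the between-class variance is equivalent to minimizing the within-class variance, which is why a single sweep over the 256 candidate thresholds suffices.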
[Flowchart: Image/data → Binary image → Illumination detection → Segments to page → Text line detection → Segments to text line → Save image of each page / Save image of each line]
Figure 4.2: The Illumination Removal in Al-Quran and Segmentation to Pages and Lines Framework
4.3.2 Page Segmentation

Our methodology detects and removes blank space and illumination from the Holy Quran pages. The blank space and illumination removal method relies upon the density of the binary values (the binary representation). We propose a new methodology that detects the page frames as three frames: (i) the exterior blank space and marginal illumination; (ii) the illumination; and (iii) the interior blank space, based on the horizontal and vertical white-pixel percentages. Our aim is to segment the Al-Quran page image into a page without blank space or illumination.
Figure 4.3: Overall steps for removing page frames
i. Frame 1 (the exterior blank space and marginal illumination):

In the first frame step, the blank space and marginal illumination of the page are removed. The flowchart for blank space and marginal illumination detection and removal is shown in Figure 4.4. To achieve this, the image is first converted into a binary representation so that the frequency of zero-valued (white) pixels can be calculated. Consider the input gray-scale page image with dimensions X × Y. Our aim is to find the page frame defined by new coordinates, as demonstrated in Figure 4.5. We assume a constant threshold value T for this frame.

Next, the horizontal edges of the first frame are detected by calculating the white percentage from the top, WP_up(l), for each line l from 0 to Y−1:

WP_up(l) = (number of white pixels in row l / X) × 100    (1)

When WP_up(l) crosses the threshold T, the detection of the upper limit stops and l is identified as the upper limit. We then calculate the white percentage from the bottom, WP_down(l), for each line from Y−1 down to 0:

WP_down(l) = (number of white pixels in row l / X) × 100    (2)

When WP_down(l) crosses the threshold T, the detection of the lower limit stops and l+1 is identified as the lower limit. Next, the vertical edges of the first frame are detected by calculating the white percentage from the left, WP_left(c), for each column c from 0 to X−1:

WP_left(c) = (number of white pixels in column c / Y) × 100    (3)

When WP_left(c) crosses the threshold T, the detection of the left limit stops and c is identified as the left limit. We then calculate the white percentage from the right, WP_right(c), for each column from X−1 down to 0:

WP_right(c) = (number of white pixels in column c / Y) × 100    (4)

When WP_right(c) crosses the threshold T, the detection of the right limit stops and c+1 is identified as the right limit. Once the horizontal and vertical frame edges are detected, the frame is obtained and a new image is cropped at point (left, upper) with size (right − left) × (down − upper). The page frame is thus identified and generated using a cropping operation, as shown in Figure 4.5.
Figure 4.4: Flowchart for blank space and Illumination detection and removal
Figure 4.5: The process for removing the exterior blank space and marginal Illumination from Al-Quran page
ii. Frame 2 (illumination):

In the second frame step, the illumination of the page is removed. The flowchart for illumination detection and removal is shown in Figure 4.4. To achieve this, the result of the previous frame is processed to calculate the frequency of zero-valued (white) pixels in the new image. Consider the new input image with new dimensions X × Y. Our aim is to find the next page frame defined by new coordinates, as demonstrated in Figure 4.6. We assume a constant threshold value T for this frame.

Next, the horizontal edges of the second frame are detected by calculating WP_up(l) for each line from 0 to Y−1 using formula (1). When WP_up(l) crosses the threshold T for five consecutive rows, the detection of the upper limit stops and l is identified as the upper limit. We then calculate WP_down(l) for each line from Y−1 down to 0 using formula (2); when it crosses the threshold for five consecutive rows, the detection of the lower limit stops and l is identified as the lower limit. Next, the vertical edges of the frame are detected by calculating WP_left(c) for each column from 0 to X−1 using formula (3); when it crosses the threshold for five consecutive columns, the detection of the left limit stops and c is identified as the left limit. We then calculate WP_right(c) for each column from X−1 down to 0 using formula (4); when it crosses the threshold for five consecutive columns, the detection of the right limit stops and c is identified as the right limit.

Once the horizontal and vertical frame edges are detected, a new image is cropped at point (left, upper) with size (right − left) × (down − upper). The next page frame is identified and generated using a cropping operation, as shown in Figure 4.6.
Figure 4.6: The process for removing the Illumination from Al-Quran page
iii. Frame 3 (the interior blank space):

In the third frame step, the interior blank space of the page is removed. The flowchart for this detection and removal is shown in Figure 4.4. To achieve this, the result of the previous frame is processed to calculate the frequency of zero-valued (white) pixels in the new image. Consider the new input image with new dimensions X × Y. Our aim is to find the next page frame defined by new coordinates, as demonstrated in Figure 4.7. We assume a constant threshold value T for this frame.

Next, the horizontal edges of the third frame are detected by calculating WP_up(l) for each line from 0 to Y−1 using formula (1). When WP_up(l) crosses the threshold T, the detection of the upper limit stops and l is identified as the upper limit. We then calculate WP_down(l) for each line from Y−1 down to 0 using formula (2); when it crosses the threshold, the detection of the lower limit stops and l+1 is identified as the lower limit. Next, the vertical edges of the frame are detected by calculating WP_left(c) for each column from 0 to X−1 using formula (3); when it crosses the threshold, the detection of the left limit stops and c is identified as the left limit. We then calculate WP_right(c) for each column from X−1 down to 0 using formula (4); when it crosses the threshold, the detection of the right limit stops and c+1 is identified as the right limit.
Figure 4.7: The process for removing the interior blank space from Al-Quran page
Once the horizontal and vertical frame edges are detected, a new image is cropped at point (left, upper) with size (right − left) × (down − upper). The Al-Quran page is thus identified, and the page without illumination is generated using a cropping operation, as shown in Figure 4.7.
4.3.3 Text Line Segmentation

Once the Al-Quran pages have been segmented, we proceed to segment all text lines from the image. Our aim is to save each text line as an image without any blank space. The flowchart for text line segmentation is shown in Figure 4.8. Consider the new input image with new dimensions X × Y. Our aim is to find the frame of every text line on the page, defined by new coordinates, as demonstrated in Figure 4.9.

The horizontal edges of each text line are detected by calculating WP_up(l) for each line from 0 to Y−1 using formula (1). If the threshold condition on WP_up(l) does not hold, we continue to the next line. Otherwise, if at least one blank row precedes the current line, the upper pointer is shifted to this line; otherwise, a new image is cropped at point (0, upper) with size X × (down − upper). These steps are repeated, saving each line of the 'ayat' of Al-Quran, until the end of the page. Using this technique we produce images containing all the lines of each page, as shown in Figure 4.9.
Figure 4.8: Flowchart for text line segmentation
Figure 4.9: The process for text line segmentation
4.4 Conclusion

The page and text line segmentation method has been elaborated in this chapter. The methods for segmenting Al-Quran pages were categorized into two parts: page segmentation and text line segmentation. Both follow a similar segmentation process. It can be concluded from the work done for this project that the proposed method produces promising results for segmenting Al-Quran pages and lines without missing any diacritical marks. Therefore, the segmentation process may preserve the holiness of Al-Quran.
CHAPTER 5
RESULT AND TESTING
5.1 Introduction

In this chapter, we define the testing and evaluation procedures performed throughout the development process and when examining the final version of the application. This includes a discussion of the results, which outlines the standards for the achievement of the project. The chapter concludes with an overview of the testing results with regard to segmentation without missing words or diacritical marks.
5.2 Testing

Testing occurred throughout the various stages of the application development to ensure adequate performance during page segmentation. The testing includes checks for both the words and the diacritical marks. Concerning the correctness of the software, we carried out informal tests on every completed page. These deal with Al-Quran page segmentation in particular, to make sure that all pages have been segmented and tested for validity and do not contain any missing word or diacritical mark.

We performed many experimental tests to verify the validity of the proposed methodology, each time changing the segmentation percentage to suit all the Al-Quran writing styles. We used different styles of Al-Quran and different pages. To calibrate the segmentation precision, we calculated the density of illumination for each page frame in all the Al-Quran writing styles and then extracted their average values.
5.3 Questionnaire

We conducted a questionnaire to evaluate the performance of the application. The questionnaire included two sections. The first section contains demographic and personal questions, such as gender, race, country and study category. The second section includes two questions to evaluate the results of the application, to verify that no word or diacritical mark went missing during the segmentation process. We distributed the questionnaire to 16 students at Universiti Teknikal Malaysia Melaka (UTeM): 12 male and 4 female, as shown in Figure 5.1.
[Pie chart: male 75%, female 25%]
Figure 5.1: Pie Chart for the Gender
The following figures show the study sample information from the demographic questions: race (Figure 5.2), country (Figure 5.3), student category (Figure 5.4), and faculty (Figure 5.5).
[Pie chart of respondent race; categories: Malay, Arab, Kadazandusun, Indian, Chinese; values: 69%, 19%, 12%, 0%, 0%]
Figure 5.2: Pie Chart for the Race
[Pie chart of respondent country; categories: China, Jordan, Malaysia, Iraq, Yemen; values: 81%, 13%, 6%, 0%, 0%]
Figure 5.3: Pie Chart for the Country
[Pie chart of student category; categories: PhD, Master, Degree, Diploma; values: 75%, 19%, 6%, 0%]
Figure 5.4: Pie Chart for the Student Category
[Pie chart of faculty; categories: FTMK, FKEKK, FKM, FKE, FPTT, FKP; values: 44%, 32%, 6%, 6%, 6%, 6%]
Figure 5.5: Pie Chart for the Faculty
Figure 5.6 illustrates the evaluation results after applying all segmentation steps.
[Column chart comparing the counts of "No Missing Words" and "No Missing Diacritical Marks" responses for page segmentation and text line segmentation]
Figure 5.6: Column Chart for Segmentation
Figure 5.7 illustrates the evaluation results after applying only the first step of the segmentation (page segmentation), in which the illumination is removed.
[Pie chart: Yes 100%, No 0%]
Figure 5.7: Pie Chart for Page Segmentation – No Missing Word / No Missing Diacritical Marks
Figure 5.8 illustrates the evaluation results after applying only the second step of the segmentation (text line segmentation), in which the text lines are detected.
[Pie chart: Yes 100%, No 0%]
Figure 5.8: Pie Chart for Text Line Segmentation – No Missing Word / No Missing Diacritical Marks
5.4 Result

After several tests and the survey, we concluded that the following are the best results obtained, with no text or diacritical marks missing from the Al-Quran pages. Figure 5.9 shows the user interface we built in Java; it contains the following options: upload image(s), binarization (Figure 5.10), and saving the image as a page, a text line, or both (Figure 5.11).
Figure 5.9: User Interface for Selection File(s)
Figure 5.10: User Interface for Binarization
Figure 5.11: User Interface for Save Pages and Line
Figure 5.12 illustrates the output files for each copy of Al-Quran.
Figure 5.12: The Output Files
Figure 5.13 illustrates the results of a sample page after applying only page segmentation, in which the illumination is removed.
Figure 5.13: Result of Page Segmentation

Figure 5.14 illustrates the results of a sample page after applying only text line segmentation, in which the text lines are detected.
Figure 5.14: Result of Text Line Segmentation

5.5 Conclusion

The tests and results shown above demonstrate the success of the segmentation process for several copies of Al-Quran, saving the images as text lines and pages without illumination. This means the proposed methodology can be applied to Al-Quran pages without missing any word or diacritical mark. Therefore, the segmentation process preserved the holiness of Al-Quran.
CHAPTER 6
CONCLUSION AND FUTURE WORK
6.1 Introduction

This is the last chapter of this project. It summarizes the most important achievements of the project, namely segmenting Al-Quran pages, the phases of the project that led to the results, and the limitations of the project. This chapter concludes the research conducted in this project and recommends directions for further research. It begins with a summary of the project; limitations on the progress of this research are provided in the following section, and further research recommendations are presented in the final section.
6.2 Summary

This project comprises six chapters. Chapter 1 served as an introduction to the research problem, outlining the objectives and the scope of this project. The research presented in this project is concerned with digital image processing, focusing on segmenting the images of Al-Quran pages. Its purpose is to review previous research on segmentation methods used for Arabic/Jawi handwritten texts.

Chapter 2 presents the background of image processing and segmentation for Arabic/Jawi handwritten text images and other text images. The images used in this research are images of Al-Quran pages. Previous research on text image segmentation is elaborated, but it did not focus on Al-Quran pages and diacritical marks. The main aim of this study is to find the better segmentation technique.

Chapter 3 discusses the methodology utilized to solve the research problem. Two phases were identified and used in this study: the investigation phase and the implementation phase. The investigation phase includes summarizing the research problem, explaining the process of image segmentation of Al-Quran pages, and identifying the techniques used for image segmentation in previous work. Next, the task framework was presented to indicate the general implementation of this study. The objective of the task framework is to divide the method into six steps, from which the best-practice methods for page and text line segmentation can be obtained. Finally, Java was used throughout the study as the research tool.

Chapter 4 presents the design and development of the system and the task framework used for this research. Among others, it explains the requirements determination and structuring activity based on the research methodology discussed in Chapter 3, and the proposed method is also discussed. The page and text line segmentation method is discussed in detail. The techniques for page segmentation of Al-Quran pages were classified into two parts: page segmentation and text line segmentation. Both follow a similar segmentation process.

Chapter 5 presents the findings and results for the techniques utilized in the page segmentation of Al-Quran pages. Based on the results, the techniques have proven to give the best segmentation. It can be concluded from the work done for this project that the proposed method produces promising results for segmenting Al-Quran pages.
6.3 Limitation of the Project

A few limitations have been identified:
- Due to the short time period, this technique cannot be applied to the first and second pages of Al-Quran, 'Surat Al-Fatihah' and the first page of 'Surat Al-Baqarah', whose writing styles come with a circle border.
- The method uses fixed constant values in the segmentation process.

6.4 Future Works / Further Research

Further research might be directed towards the following:
- Extending coverage to all Al-Quran pages regardless of writing style (circle border).
- Extending the methods applied in the segmentation process for Al-Quran page images; in the future, dynamic values can be applied in the segmentation methods.
REFERENCES

Agrawal, M. and Doermann, D., 2013. Clutter noise removal in binary document images. International Journal on Document Analysis and Recognition, 16(4), pp.351–369.
Al-Emami, S. and Usher, M., 1990. On-line recognition of handwritten Arabic characters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7), pp.704–710.
Alimi, A.M., 1997. An evolutionary neuro-fuzzy approach to recognize on-line Arabic handwriting. Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1.
Al-Muhtaseb, H.A., Mahmoud, S.A. and Qahwaji, R.S., 2008. Recognition of off-line printed Arabic text using Hidden Markov Models. Signal Processing, 88(12), pp.2902–2912.
Arivazhagan, M., 2007. A statistical approach to line segmentation in handwritten documents. Document Recognition and Retrieval XIV, Proceedings of SPIE, San Jose, CA, USA, 6500, pp.65000T–1–11.
Ávila, B.T. and Lins, R.D., 2004. A new algorithm for removing noisy borders from monochromatic documents. Proceedings of the 2004 ACM symposium on Applied computing - SAC ’04, p.1219.
Azmi, M.S., 2013. Fitur Baharu Dari Kombinasi Geometri Segitiga dan Pengezonan utk Paleografi Jawi Digital.
Baird, H.S., 1994. Background structure in document images. Document Image Analysis, pp.17–34.
Basu, S. et al., 2008. Text line extraction from multi-skewed handwritten documents. Proceedings of the 27th Chinese Control Conference, CCC, 40, pp.412–415.
Bidgoli, A.M. and Boraghi, M., 2010. A language independent text segmentation technique based on naive Bayes classifier. 2010 International Conference on Signal and Image Processing, pp.11–16.
Breuel, T.M., 2002. Two Algorithms for Geometric Layout Analysis. Proceedings of the Workshop on Document Analysis Systems, Princeton, NJ, USA. 2002 pp. 188–199.
Bruzzone, E. and Coffetti, M.C., 1999. An algorithm for extracting cursive text lines. Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR ’99 (Cat. No.PR00318), pp.2–5.
Câmara, G., Souza, R.C.M., Freitas, U.M. and Garrido, J., 1996. Spring: Integrating remote sensing and gis by object-oriented data modelling. Computers and Graphics (Pergamon), 20(3), pp.395–403.
Cheung, A., Bennamoun, M. and Bergmann, N.W., 2001. Arabic optical character recognition system using recognition-based segmentation. Pattern Recognition, 34(2), pp.215–233.
Dey, S., Mukhopadhyay, J., Sural, S. and Bhowmick, P., 2012. Margin Noise Removal From Printed Document Images. Workshop on Document Analysis and Recognition, (iv), pp.86–93.
Du, X., Pan, W. and Bui, T.D., 2008. Text line segmentation in handwritten documents using Mumford-Shah model. Pattern Recognition, 42(12), pp.3136–3145.
Duda, R.O. and Hart, P.E., 1972. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1), pp.11–15.
Fan, K.C., Wang, Y.K. and Lay, T.R., 2002. Marginal noise removal of document images. Pattern Recognition, 35(11), pp.2593–2611.
Feldbach, M. and Tonnies, K.D., 2001. Line detection and segmentation in historical church registers. Proceedings of Sixth International Conference on Document Analysis and Recognition.
Fletcher, L.A. and Kasturi, R., 1988. Robust algorithm for text string separation from mixed text/graphics images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6), pp.910–918.
Gaceb, D., Lebourgeois, F. and Duong, J., 2013. Adaptative smart-binarization method: For images of business documents. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pp.118–122.
Gatos, B., Pratikakis, I. and Perantonis, S.J., 2006. Adaptive degraded document image binarization. Pattern Recognition, 39(3), pp.317–327.
Hough, P.V.C., 1962. Method and means for recognizing complex patterns. U.S. Patent 3,069,654.

Kennard, D.J. and Barrett, W.A., 2006. Separating lines of text in free-form handwritten historical documents. Proceedings - Second International Conference on Document Image Analysis for Libraries, DIAL 2006, pp.12–23.
Khattab, D., Theobalt, C., Hussein, A.S. and Tolba, M.F., 2014. Modified GrabCut for human face segmentation. Ain Shams Engineering Journal, 5(4), pp.1083–1091.
Kise, K., Sato, A. and Iwata, M., 1998. Segmentation of Page Images Using the Area Voronoi Diagram. Computer Vision and Image Understanding, 70(3), pp.370–382.
Le, D.X., Thoma, G.R. and Wechsler, H., 1996. Automated borders detection and adaptive segmentation for binary document images. Proceedings - International Conference on Pattern Recognition, 3, pp.737–741.
Lemaitre, A. and Camillerapp, J., 2006. Text line extraction in handwritten document with Kalman Filter applied on low resolution image. Proceedings - Second International Conference on Document Image Analysis for Libraries, DIAL 2006, 2006, pp.38–45.
Li, Y., Zheng, Y., Doermann, D. and Jaeger, S., 2008. Script-independent text line segmentation in freestyle handwritten documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8), pp.1313–1329.
Likforman-Sulem, L., Hanimyan, A. and Faure, C., 1995. A Hough based algorithm for extracting text lines in handwritten documents. Proceedings of 3rd International Conference on Document Analysis and Recognition, 2, pp.774–777.
Lings, M., 1998. The Quranic Art of Calligraphy and Illumination,
Louloudis, G., Gatos, B., Pratikakis, I. and Halatsis, C., 2008. Text line detection in handwritten documents. Pattern Recognition, 41(12), pp.3758–3772.
Makridis, M., Nikolaou, N. and Gatos, B., 2007. An efficient word segmentation technique for historical and degraded machine-printed documents. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, 1(Icdar), pp.178–182.
Manmatha, R. and Rothfeder, J.L., 2005. A scale space approach for automatically segmenting words from historical handwritten documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), pp.1212–1225.
Nafchi, H.Z., Moghaddam, R.F. and Cheriet, M., 2013. Application of phase-based features and denoising in postprocessing and binarization of historical document images. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pp.220–224.
Nagy, G. and Seth, S., 1984. Hierarchical representation of optically scanned documents. Proceedings of International Conference on Pattern Recognition, pp.347–349.
Nasrudin, M.F., Omar, K., Choong-Yeun, L. and Zakaria, M.S., 2010. Pengecaman aksara jawi menggunakan jelmaan surih. Sains Malaysiana, 39(2), pp.291–297.
Nicolaou, A. and Gatos, B., 2009. Handwritten text line segmentation by shredding text into its lines. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pp.626–630.
Nicolas, S., Paquet, T. and Heurte, L., 2004. Text line segmentation in handwritten document using a production system. Proceedings - International Workshop on Frontiers in Handwriting Recognition, IWFHR, pp.245–250.
Ntirogiannis, K., Gatos, B. and Pratikakis, I., 2013. Performance evaluation methodology for historical document image binarization. IEEE Transactions on Image Processing, 22(2), pp.595–609.
O’Gorman, L., 1993. Document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(11), pp.1162–1173.
Omar, K., 2000. Pengecaman Tulisan Tangan Teks Jawi Menggunakan Pengkelas Multiaras [Handwritten Jawi Text Recognition Using a Multilevel Classifier]. Universiti Putra Malaysia.
Papavassiliou, V., Simistira, F., Katsouros, V. and Carayannis, G., 2012. A morphology based approach for binarization of handwritten documents. Proceedings - International Workshop on Frontiers in Handwriting Recognition, IWFHR, pp.577–581.
Parker, J., Frieder, O. and Frieder, G., 2013. Automatic enhancement and binarization of degraded document images. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pp.210–214.
Patvardhan, C., Verma, A.K. and Lakshmi, C.V., 2012. Denoising of Document Images using Discrete Curvelet Transform for OCR Applications. International Journal of Computer Applications, 55(10), pp.20–27.
Phillips, P., McCabe, R. and Chellappa, R., 1998. Biometric image processing and recognition. European Signal Processing Conference.
Pu, Y. and Shi, Z., 1998. A natural learning algorithm based on hough transform for text lines extraction in handwritten documents. Proceedings of the 6th International Workshop on Frontiers in Handwriting Recognition, pp.637–646.
Qadir, M.A. and Ahmad, I., 2006. Digital text watermarking: Secure content delivery and data hiding in digital documents. IEEE Aerospace and Electronic Systems Magazine, 21(11), pp.18–21.
Rabeux, V., Journet, N., Vialard, A. and Domenger, J.P., 2013. Quality evaluation of ancient digitized documents for binarization prediction. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pp.113–117.
Rao, K.M.M., 2004. Overview of Image Processing. Proceedings of a workshop on image processing and pattern recognition, pp.1–7.
Roy, P., Pal, U. and Lladós, J., 2008. Morphology based handwritten line segmentation using foreground and background information. Proceedings of the International Conference on Frontiers in Handwriting Recognition, pp.5–10.
Saabni, R., Asi, A. and El-Sana, J., 2014. Text line extraction for historical document images. Pattern Recognition Letters, 35(1), pp.23–33.
Sauvola, J. and Pietikäinen, M., 2000. Adaptive document image binarization. Pattern Recognition, 33(2), pp.225–236.
Sehad, A., Chibani, Y., Cheriet, M. and Yaddaden, Y., 2013. Ancient degraded document image binarization based on texture features. Proceedings of the International Symposium on Image and Signal Processing and Analysis (ISPA), pp.182–186.
Shi, Z. and Govindaraju, V., 2004. Line separation for complex document images using fuzzy runlength. Proceedings of the First International Workshop on Document Image Analysis for Libraries, DIAL 2004.
Shi, Z., Setlur, S. and Govindaraju, V., 2005. Text extraction from gray scale historical document images using adaptive local connectivity map. Eighth International Conference on Document Analysis and Recognition (ICDAR’05).
Stamatopoulos, N., Gatos, B. and Perantonis, S.J., 2009. A method for combining complementary techniques for document image segmentation. Pattern Recognition, 42(12), pp.3158–3168.
Stathis, P., Kavallieratou, E. and Papamarkos, N., 2008. An evaluation survey of binarization algorithms on historical documents. 2008 19th International Conference on Pattern Recognition, pp.2–5.
Su, B., Lu, S. and Tan, C.L., 2012. A learning framework for degraded document image binarization using Markov random field. Proceedings of the 21st International Conference on Pattern Recognition (ICPR), pp.13–16.
Su, B., Lu, S. and Tan, C.L., 2011. Combination of document image binarization techniques. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pp.22–26.
Tajabadi, R., Mashayekhi, K. and Shabani, S., 2009. Illumination position in the growth of Islamic Art. Paper presented at the first national conference on Shiite arts.
Wagdy, M., Faye, I. and Rohaya, D., 2013. Fast and Efficient Document Image Clean Up and Binarization Based on Retinex Theory, pp.8–10.
Wahl, F.M., Wong, K.Y. and Casey, R.G., 1982. Block segmentation and text extraction in mixed text/image documents. Computer Graphics and Image Processing, 19(1), p.94.
Weliwitage, C., Harvey, A.L. and Jennings, A.B., 2005. Handwritten document offline text line segmentation. Proceedings of the Digital Image Computing: Techniques and Applications, DICTA 2005, pp.184–187.
Yin, F. and Liu, C.L., 2008. Handwritten text line extraction based on minimum spanning tree clustering. Proceedings of the International Conference on Wavelet Analysis and Pattern Recognition, ICWAPR’07, 3, pp.1123–1128.
APPENDICES
Appendix A Questionnaire
FAKULTI TEKNOLOGI MAKLUMAT DAN KOMUNIKASI
ILLUMINATION REMOVAL AND TEXT SEGMENTATION FOR AL-QURAN USING BINARY REPRESENTATION
A) PERSONAL BACKGROUND

1. Gender *
   [ ] Male
   [ ] Female

2. Race *
   [ ] Malay
   [ ] Indian
   [ ] Chinese
   [ ] Jordanian
   [ ] Iraqi
   [ ] Arab
   [ ] Others (…….......)

3. Country *
   [ ] Malaysia
   [ ] China
   [ ] Others (................)

4. Student Category *
   [ ] PhD
   [ ] Master
   [ ] Degree
   [ ] Diploma
   [ ] Others (…...……)

5. Faculty *
   [ ] FTMK
   [ ] FKM
   [ ] FKEKK
   [ ] FKE
   [ ] FPTT
   [ ] FKP
   [ ] Others (………)
B) STUDENTS’ EVALUATION OF THE SYSTEM RESULTS

1. Refer to Table A.3 (left side: the original image; right side: the segmented page). Please evaluate whether the result is missing any words or diacritical marks (vowel and vocalization signs), and tick (√) the appropriate box for your answer.

Table A.1: Question 1

Page #    No Missing Words    No Missing Diacritical
10        Yes / No            Yes / No
11        Yes / No            Yes / No
603       Yes / No            Yes / No
604       Yes / No            Yes / No
2. In case of a “No” answer, please give a short explanation: ………………………………………………………………………………………………… …………………………………………………………………………………………………
3. Refer to Table A.4 (left side: the original image; right side: the segmented text lines). Please evaluate whether the result is missing any words or diacritical marks (vowel and vocalization signs), and tick (√) the appropriate box for your answer.
Table A.2: Question 2 (Page # 10)

Line #    No Missing Words    No Missing Diacritical
1         Yes / No            Yes / No
2         Yes / No            Yes / No
3         Yes / No            Yes / No
4         Yes / No            Yes / No
5         Yes / No            Yes / No
6         Yes / No            Yes / No
7         Yes / No            Yes / No
8         Yes / No            Yes / No
9         Yes / No            Yes / No
10        Yes / No            Yes / No
11        Yes / No            Yes / No
12        Yes / No            Yes / No
13        Yes / No            Yes / No
14        Yes / No            Yes / No
4. In case of a “No” answer, please give a short explanation: ………………………………………………………………………………………………… …………………………………………………………………………………………………
Table A.3: For Question 1 (pages 10, 11, 603 and 604; left: original image, right: image of segmentation)
[Images omitted]
Table A.4: For Question 2 (page 10; left: original image, right: images of segmented lines 1–14)
[Images omitted]
Appendix B Results
[Result images omitted]