Faculty of Information and Communication Technology

ILLUMINATION REMOVAL AND TEXT SEGMENTATION FOR AL-QURAN USING BINARY REPRESENTATION

Laith Nazeeh Jamil Bany Melhem

Master of Computer Science (Internetworking Technology)

2015

ILLUMINATION REMOVAL AND TEXT SEGMENTATION FOR AL-QURAN USING BINARY REPRESENTATION

LAITH NAZEEH JAMIL BANY MELHEM

A thesis submitted in fulfilment of the requirements for the degree of Master of Computer Science (Internetworking Technology)

Faculty of Information and Communication Technology

UNIVERSITI TEKNIKAL MALAYSIA MELAKA

2015

DECLARATION

I declare that this project entitled "Illumination Removal and Text Segmentation for Al-Quran Using Binary Representation" is the result of my own research except as cited in the references. The project has not been accepted for any degree and is not concurrently submitted in candidature of any other degree.

Signature : ........................................................................

Name : LAITH NAZEEH JAMIL BANY MELHEM

Date : ........................................................................

APPROVAL

I hereby declare that I have read this project and in my opinion this project is sufficient in terms of scope and quality for the award of Master of Computer Science (Internetworking Technology).

Signature : .............................................

Supervisor Name : Dr. MOHD SANUSI AZMI

Date : .............................................

DEDICATION

I would like to dedicate this work to those who have never stopped supporting me daily since I was born, my dear mother and my kind father. They never hesitated to provide me with every facility to push me forward as much as they could. This work is a simple and humble reply to the great goodness I have received from them over the years. Also to my brothers (Mohammad, Hamzah), my sisters (Rawan, Rana, Shoroq), my grandfather, my grandmother, my aunt, my uncle, my friends and all those whom I love (may Allah bless them all).

ABSTRACT

The process of segmenting Al-Quran needs to be studied carefully. This is because Al-Quran is the book of Allah swt, and any incorrect segmentation will affect the holiness of Al-Quran. A major difficulty is the appearance of illumination around text areas as well as noisy black stripes. In this study, we propose a novel algorithm for detecting the illumination on Al-Quran pages. Our aim is to segment Al-Quran pages into pages without illumination, and to segment Al-Quran pages into text line images, without any changes to the content. First, we apply pre-processing, which includes binarization. Then, we detect the illumination of the Al-Quran pages. In this stage, we introduce vertical and horizontal white-pixel percentages, which have proved efficient for detecting the illumination. Finally, the new images are segmented into text lines. Experimental results on several Al-Quran pages from different Al-Quran styles demonstrate the effectiveness of the proposed technique.


ABSTRAK

Proses penemberengan Al-Quran memerlukan kajian yang berhati-hati. Ini kerana Al-Quran adalah kitab Allah swt. Sebarang kesalahan penemberengan akan memberikan kesan kepada kesucian Al-Quran. Kesukaran yang dihadapi adalah illuminasi yang mengelilingi kawasan teks Al-Quran dan juga garisan hitam. Pada kajian ini, kami mencadangkan satu algoritma baharu untuk mengenalpasti illuminasi pada setiap muka Al-Quran. Tujuan adalah untuk menembereng Al-Quran muka ke muka dan baris ke baris tanpa mengubah apa-apa pada kandungan Al-Quran. Mulanya, prapemprosesan digunakan dengan menggunakan proses binari. Kemudian, illuminasi dikenalpasti. Pada tahap ini, kami memperkenalkan peratusan menegak dan mendatar berdasarkan kepada piksel putih yang dapat melaksanakan penemberengan dengan baik. Akhir sekali, imej baru terhasil yang bebas dari illuminasi. Keputusan ujikaji menggunakan stail Al-Quran menunjukkan teknik cadangan adalah efektif.


ACKNOWLEDGEMENT

First and foremost, praise be to Allah for giving me the opportunity, the strength and the patience to finally complete my project after all the challenges and difficulties. I would like to take this opportunity to express my sincere acknowledgement to my supervisor, Dr. Mohd Sanusi Bin Azmi from the Faculty of Information & Communication Technology, Universiti Teknikal Malaysia Melaka (UTeM), for his essential supervision, support and encouragement towards the completion of this thesis. Thanks to the King Fahd Glorious Quran Printing Complex for publishing the styles of Al-Quran used during this research. To my beloved family and the jewel of my heart, my mother: thank you for the sacrifices, patience, support and compassion that have entered my life. Not forgetting all my colleagues and friends striving for their Master's, who inspired me with vision, guidance and shared experiences. Special thanks to all my peers, my father, my beloved mother and my siblings for their moral support in completing this degree. Lastly, thank you to everyone who played a crucial part in the realization of this project.


TABLE OF CONTENTS

                                                            PAGE
DECLARATION
APPROVAL
DEDICATION
ABSTRACT  i
ABSTRAK  ii
ACKNOWLEDGEMENT  iii
TABLE OF CONTENTS  iv
LIST OF TABLES  vi
LIST OF FIGURES  vii

CHAPTER
1. INTRODUCTION  1
   1.1 Introduction  1
   1.2 Research Background  3
   1.3 Problem Statement  4
   1.4 Research Questions  5
   1.5 Research Objectives  5
   1.6 Project Significance  5
   1.7 The Scope of Research  6
   1.8 Expected Outcomes  6
   1.9 Conclusion  6
2. LITERATURE REVIEW  7
   2.1 Introduction  7
   2.2 Image Pre-processing  8
       2.2.1 Progress in Binarization Studies  8
       2.2.2 Border/Illumination Removal  14
   2.3 Arabic OCR  16
   2.4 Image Segmentation  18
   2.5 Techniques for Document Text Segmentation  19
       2.5.1 Projection Profiles  20
       2.5.2 Hough Transform Approach  20
       2.5.3 Smearing Methods  21
       2.5.4 Dynamic Programming  22
       2.5.5 Other Techniques  23
   2.6 Text Segmentation Analysis  26
   2.7 Conclusion  34
3. RESEARCH METHODOLOGY  35
   3.1 Introduction  35
   3.2 Research Methodology  35
       3.2.1 Research Framework  35
   3.3 Task Framework  38
   3.4 Experimental Test Framework  40
       3.4.1 Experiment I  40
   3.5 Research Tool  41
   3.6 Conclusion  41
4. IMPLEMENTATION  42
   4.1 Introduction  42
   4.2 Data Collection  42
   4.3 Implementation Process  43
       4.3.1 Image Binarization  43
       4.3.2 Page Segmentation  45
       4.3.3 Text Line Segmentation  53
   4.4 Conclusion  56
5. RESULT AND TESTING  57
   5.1 Introduction  57
   5.2 Testing  57
   5.3 Questionnaire  58
   5.4 Result  62
   5.5 Conclusion  65
6. CONCLUSION AND FUTURE WORK  66
   6.1 Introduction  66
   6.2 Summary  66
   6.3 Limitation of the Project  68
   6.4 Future Works / Further Research  68
REFERENCES  69
APPENDICES  79
   Appendix A Questionnaire  79
   Appendix B Result  85

LIST OF TABLES

TABLE  TITLE  PAGE
Table 1.1: Summary of Ayat, Page and Line from printed Al-Quran  2
Table 2.1: Image processing steps for Jawi pattern recognition (Azmi, 2013)  9
Table 2.2: Categorization of segmentation algorithms (Nikos et al., 2010)  18
Table 2.3: Text line segmentation methods analysis  26
Table 3.1: Tools and programming languages that utilized in this research  41

LIST OF FIGURES

FIGURE  TITLE  PAGE
Figure 1.1: Al-Quran writing styles  4
Figure 2.1: Example of an image with noisy black border and noisy text region  14
Figure 2.2: General Arabic OCR systems capabilities  16
Figure 3.1: Research Framework  36
Figure 3.2: Al-Quran writing styles  37
Figure 3.3: The Illumination Removal in Al-Quran and segmentation to Pages and Lines Framework  39
Figure 3.4: Page (10) from Holy Al-Quran  40
Figure 4.1: Al-Quran writing styles  43
Figure 4.2: The Illumination Removal in Al-Quran and segmentation to Pages and Lines Framework  44
Figure 4.3: Overall steps for removing page frames  45
Figure 4.4: Flowchart for blank space and Illumination detection and removal  47
Figure 4.5: The process for removing the exterior blank space and marginal Illumination from Al-Quran page  48
Figure 4.6: The process for removing the Illumination from Al-Quran page  50
Figure 4.7: The process for removing the interior blank space from Al-Quran page  52
Figure 4.8: Flowchart for text line segmentation  54
Figure 4.9: The process for text line segmentation  55
Figure 5.1: Pie Chart for the Gender  58
Figure 5.2: Pie Chart for the Race  59
Figure 5.3: Pie Chart for the Country  59
Figure 5.4: Pie Chart for the Student Category  60
Figure 5.5: Pie Chart for the Faculty  60
Figure 5.6: Column Chart for Segmentation  61
Figure 5.7: Pie Chart for Page Segmentation  61
Figure 5.8: Pie Chart for Text Line Segmentation  62
Figure 5.9: User Interface for Selection File(s)  63
Figure 5.10: User Interface for Binarization  63
Figure 5.11: User Interface for Save Pages and Line  63
Figure 5.12: The Output Files  64
Figure 5.13: Result of Page Segmentation  64
Figure 5.14: Result of Text Line Segmentation  65

CHAPTER 1

INTRODUCTION

1.1 Introduction

Image processing is a popular research area in computer science. Today, image processing not only focuses on the fundamental issues addressed by researchers but also on the suitability of the research to several domains, such as biometrics (Phillips et al., 1998), geographical information systems (Câmara et al., 1996), character recognition (Omar, 2000), document analysis (Sauvola and Pietikäinen, 2000) and others. Image processing is a technique to enhance raw images received from cameras/sensors placed on satellites, space probes and aircraft, or pictures taken in normal day-to-day life, for various applications (Rao, 2004). There are three main categories of image processing: image enhancement, image rectification and restoration, and image classification. Various image processing techniques have been developed during the last four to five decades. Most of the techniques were developed for enhancing images obtained from unmanned spacecraft, space probes and military reconnaissance flights. Image processing systems are becoming popular due to the easy availability of powerful personal computers, large memory devices, graphics software, etc. (Rao, 2004).

Besides, according to Khairuddin Omar (2010) and Mohammad Faidzul et al. (2010), the image processing body of knowledge consists of several phases, starting from data collection, pre-processing, feature extraction, feature selection, classification and post-processing (Azmi, 2013; Nasrudin et al., 2010; Omar, 2000). Each phase in image processing has sub-processes. Omar (2000) categorized the pre-processing phase for Arabic/Jawi character recognition into binarization, edge detection, thinning and segmentation before feature extraction takes place. This research focuses on segmentation for the Holy Quran; segmentation for the Holy Quran is therefore based on the processes used for segmenting Arabic/Jawi handwritten texts.

The Holy Quran is the book of Allah swt. Al-Quran consists of 30 chapters, 114 Surah and 6236 Ayat. However, the number of pages and lines differs between publishers. Table 1.1 below shows examples of printed Al-Quran from different publishers.

Table 1.1: Summary of Ayat, Page and Line from printed Al-Quran

Al-Quran (Version)                         | Ayat | Page | Line per page | Total line
Madinah                                    | 6236 | 604  | 15            | 9060
Al-Quran Al-Hakeem                         | 6236 | 608  | 15            | 9120
Al-Quran Al-Kareem                         | 6236 | 617  | 15            | 9225
Al-Quran Al-Majeed                         | 6236 | 855  | 13            | 11115
Mushaf Al-Madinah Quran Majeed {Nastaleeq} | 6236 | 619  | 15            | 9285
Mushaf Al-Madinah Quran Majeed             | 6236 | 625  | 15            | 9375

Based on Table 1.1, the segmentation of Al-Quran is not uniform. Although the number of chapters, surah and ayat is the same, the number of pages and lines differs. This difference in the number of pages and lines makes it an interesting topic to study, especially the segmentation process. The process of segmenting Al-Quran needs to be studied carefully. This is because Al-Quran is the book of Allah swt, and any incorrect segmentation will affect the holiness of Al-Quran. Some segmentation techniques currently exist, such as the Naive Bayes classifier (Bidgoli and Boraghi, 2010); however, these techniques focus on segmenting objects, such as text and face segmentation (Khattab et al., 2014). There are also segmentation techniques for Arabic and Jawi characters (Omar, 2000). Although Arabic and Jawi characters are quite close to those of Al-Quran, the Arabic and Jawi texts used in those techniques do not carry diacritical marks (Omar, 2000). In this research, a technique for segmenting Al-Quran will be proposed. The technique will consider diacritical marks (Tashkil) in order to protect the holiness of Al-Quran. The proposed technique will be evaluated by comparison with the original Al-Quran text; any missing diacritical marks (Tashkil), words or sentences will be considered an incorrect result.

1.2 Research Background

Al-Quran is the last book of Allah swt. As shown in Table 1.1, the number of pages and lines differs for each printed Al-Quran. Besides, as shown in Figure 1.1, Al-Quran is written with different styles of writing and different illumination. Illumination here refers to the decoration on every page of Al-Quran.

Figure 1.1: Al-Quran writing styles

The segmentation process in this research is used to prepare images for the feature extraction process. In Al-Quran, the text and diacritical marks (Tashkil) will be extracted; thus, the illumination and some empty space will be removed. In this research, "illumination" refers to the art of embellishing and decorating the Holy Quran, as used in Islamic arts. It appears on the first and last pages of Al-Quran, at the head of each Surah, on the border of each page of Al-Quran and in other places (Tajabadi et al., 2009). Forming decorative arrays and Islamic designs in this holy Quranic art has come from the consolidated ideas and worldviews of the artists of this field, so much so that many calligraphers and no fewer illuminators could recite the Quran from memory, and even those who could not were so familiar with its verses that the Quran had become an integral part of their nature (Lings, 1998). The manifestation of spirituality in the illumination of the Quran is such that it has made this art worthy of accompanying the Holy Quran; indeed, manifesting the divine realm is the duty of the art evoked by the word of God, and it can be said that the Quran itself offers opportunities that stimulate religion (Lings, 1998).

1.3 Problem Statement

The Arabic language (the language of the Quran) has markings called "diacritical marks" or "diacritics" that represent short vowels or other sounds; if one of these diacritical marks is ignored, the meaning of the word changes. There are many Arabic character recognition techniques that can recognize the characters of a text or of a whole page (Khairuddin Omar, 2000; Mohamad Faidzul, 2010), but all of these techniques recognize characters without considering the diacritical marks, which may affect the meaning of the Quran's words and the holiness of Al-Quran.

1.4 Research Questions

• How does the segmentation process for Al-Quran happen in the image processing domain?
• How can the illumination that occurs in Al-Quran, in its different forms, be segmented without missing any diacritic?

1.5 Research Objectives

This study has the following objectives:
• To propose a framework for segmenting Al-Quran into pages and lines
• To propose a technique for segmenting the texts of Al-Quran

1.6 Project Significance

The Holy Quran is very important to Muslims with respect to its authenticity. In this project, a technique for removing the illumination from Holy Quran pages is proposed, which enables us to remove the illumination of a page according to the percentage of binary values in the Quran page image. The result can be used by researchers to compare copies of the Holy Quran in order to identify the originality of the copies.

1.7 The Scope of Research

The scope of this research is:
i. This research primarily focuses on removing the border of the Quran pages by first cropping the page of the Quran and then cropping the page line by line without the empty space.
ii. The technique is applied to all Quran pages except the first two pages, Surat Al-Fatihah and the first page of Surat Al-Baqarah, whose writing style comes with a circular border.

1.8 Expected Outcomes

The main expected outcome of this research is to design an application that produces a better image of the Holy Quran page without the border, and then crops the page image line by line after line segmentation.

1.9 Conclusion

This thesis addressed the problem of removing the image border from sensitive digital Holy Quran page images. We propose a robust approach that removes the border without any changes to the content of the Holy Quran pages, in order to maintain the authenticity of the Quran text-image content. Our objective is to segment each page of the Quran into lines and also to remove the illumination.

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

Nowadays we are living in a world that is almost entirely digital, so many digital documents are available, and research on document originality exists (Qadir and Ahmad, 2006). For this research, the study is carried out on Al-Quran. There are many printed versions of Al-Quran as well as digital copies. This research performs segmentation on Al-Quran, removing the illumination and segmenting Al-Quran into lines. It prepares Al-Quran for the feature extraction phase, and the final aim is to validate the originality of Al-Quran.

There is a great deal of research on segmentation; however, segmenting Al-Quran needs to be done carefully in order to preserve the holiness of Al-Quran. The research closest to Al-Quran concerns Arabic/Jawi segmentation, but those techniques are not suitable for Al-Quran because of its diacritical marks (Tashkil), words and sentences. The first step before the segmentation phase is pre-processing the documents to produce a clean image of the document. Al-Quran page images contain illumination, and detecting and removing these unwanted areas is critical for achieving better text segmentation results. Before illumination detection and removal take place, we first proceed to image binarization using the efficient technique proposed in (Gatos et al., 2006). There are some segmentation techniques for Arabic and Jawi characters (Omar, 2000). Omar (2000) categorized the pre-processing phase for Arabic/Jawi character recognition into binarization, edge detection, thinning and segmentation before feature extraction takes place.

In this research we try to understand image processing and text segmentation. The images used in our study are images of Al-Quran pages. The previous studies on text line segmentation are explained in detail in this chapter. The main objective of this research is to find the best method to segment Al-Quran pages without missing any character or diacritical mark.

2.2 Image Pre-processing

2.2.1 Progress in Binarization Studies

A binary image (Stathis et al., 2008) is a digital image that has just two feasible values for every pixel. Normally, two colors are used for a binary image, i.e. black and white, although any two colors can be used. The color used for the objects in the image is the foreground color, while the rest of the image is the background color. Binary images (Su et al., 2011) frequently occur in image processing as masks or as the outcome of operations such as segmentation and thresholding. A few input/output devices, for example laser printers and bi-level computer displays, can only handle bi-level images. Binary images are formed from color images by segmentation. Various approaches and techniques have been developed to improve the quality of document images. Binarization is one of the most important pre-processing steps, which consists of separating the foreground and background of document images. It converts a grayscale document image into a binary document image. Image binarization is typically executed in the pre-processing phase of image processing for many kinds of documents. It is the process of separating pixel values into two collections: black as foreground and white as background.
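As a concrete illustration of this foreground/background split, the following is a minimal Python sketch using a simple global (Otsu) threshold. It only illustrates the binarization step described above; it is not the adaptive technique of Gatos et al. (2006) adopted later in this work, and the file names are hypothetical.

```python
import cv2

# Minimal global binarization sketch (Otsu's threshold), for illustration only.
def binarize(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # THRESH_BINARY_INV makes the dark ink the white (255) foreground.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    return binary

if __name__ == "__main__":
    bw = binarize("quran_page.png")      # hypothetical input page
    cv2.imwrite("quran_page_bw.png", bw)
```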

Mohd Sanusi (2013) summarizes the pre-processing steps carried out by researchers working on the Jawi script. This process is predicated on the process performed by Khairuddin Omar (2000). The overall stages utilized by the pioneer of Jawi script research, Khairuddin Omar (2000), comprise format conversion of the images, skew and slant correction, noise removal, and thinning to a frame (skeleton). However, not all of the steps utilized by Khairuddin Omar (2000) are utilized by other researchers to conduct pre-processing, as shown in Table 2.1.

Table 2.1: Image processing steps for Jawi pattern recognition (Azmi, 2013)

Researcher for previous studies | Format Conversion | Skew and slant correction | Noise Removal | Thinning
Khairuddin Omar (2000)          | √                 | √                         | √             | √
Mazani Manaf (2002)             | √                 |                           | √             | √
Mohammad Roslim (2002)          |                   |                           |               | √
Mohammad Faidzul (2010)         | √                 |                           |               |

Referring to Table 2.1, Khairuddin Omar (2000) performed the transformation to binary format during the pre-processing phase, whereas Mazani Manaf (2002) and Mohammad Faidzul (2010) converted to grayscale format during the transformation. These researchers then carried out noise removal. Khairuddin Omar (2000) used a median filter for noise removal and then used the technique proposed by Sharaf El-Deen et al. (2003) to change the image format to a binary scale. After this, skew and slant correction of the image was implemented using a gradient orientation histogram. The final procedure carried out by Khairuddin Omar (2000) in pre-processing was skeleton thinning. The algorithm used by Khairuddin Omar (2000) for thinning was the sequential Safe-Point Thinning Algorithm (SPTA) proposed by Naccache and Shinghal (1984). The SPTA algorithm was also used by Mohammad Roslim (2002) for thinning the Jawi script. Mazani Manaf (2002) employed gamma and intensity correction in pre-processing, then used a linear function with the technique devised by Parker (1994) to convert the image to grayscale format. The noise removal step was then executed using an erosion operation followed by reclamation, as proposed by Zhang and Suen (1984), after which a median filter was used for de-noising. Finally, Mazani Manaf (2002) performed thinning using a simple sequential thinning algorithm based on Zhang and Suen (1984). Mohammad Faidzul (2010), in his study, took samples from nine writers. The first step he performed was segmentation, which was done manually; however, he did not perform noise removal in his research. From this manual segmentation, he obtained a total of 993 characters and 540 sub-words from the nine authors. Next, format conversion was performed using a binary threshold of 127. Mohammad Faidzul (2010) did not perform skew and slant correction or skeleton thinning.

Abdenour Sehad et al. (2013) (Sehad et al., 2013) present a capable scheme for the binarization of ancient and degraded document images, grounded on texture features. The suggested technique is adaptive threshold-based; the threshold is calculated using a descriptor centered on a co-occurrence matrix. The scheme is verified objectively on degraded documents from the DIBCO dataset and subjectively using a set of ancient degraded documents provided by a national library. The outcomes are acceptable and promising, presenting an improvement over classical approaches.
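To make the generic pre-processing chain summarised in Table 2.1 more tangible, the following is a hedged Python sketch covering format conversion, median-filter noise removal, fixed-threshold binarization and thinning. It is not the pipeline of any of the cited works: the kernel size is illustrative, the threshold of 127 merely echoes the value reported for Mohammad Faidzul (2010), the thinning uses scikit-image's Zhang-Suen-style skeletonization rather than SPTA, and skew/slant correction is omitted.

```python
import cv2
import numpy as np
from skimage.morphology import skeletonize

# Hedged sketch of a generic Jawi/Arabic pre-processing chain; parameters are
# illustrative only and skew/slant correction is deliberately left out.
def preprocess(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # format conversion to grayscale
    denoised = cv2.medianBlur(gray, 3)              # median-filter noise removal
    _, bw = cv2.threshold(denoised, 127, 255,
                          cv2.THRESH_BINARY_INV)    # binary scale, threshold 127
    skeleton = skeletonize(bw > 0)                  # Zhang-Suen style thinning
    return skeleton.astype(np.uint8) * 255
```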


Hossein Ziaei Nafchi et al. (2013) (Nafchi et al., 2013) concluded that pre-processing and post-processing phases meaningfully improve the performance of binarization approaches, particularly for severely degraded ancient documents. An unsupervised post-processing technique is presented, founded on the phase-preserved denoised image and on phase congruency features extracted from the input image. The central part of the technique comprises two robust mask images that can be used to reject false-positive pixels in the output of a binarization technique. First, a mask with a high recall value is obtained from the denoised image with the help of morphological procedures; in parallel, a second mask is acquired based on phase congruency features. A median filter is then used to remove noise from these two masks, which are subsequently used to rectify the output of any binarization method.

Jon Parker et al. (2013) (Parker et al., 2013) observed that documents of notable significance are regularly discovered in a state of deterioration. Such documents are often examined to simultaneously record and announce a discovery. Converting the information found within such documents to open data happens more rapidly and inexpensively if an automatic technique to enhance these degraded documents is used instead of improving each document image by hand. A novel automated image enhancement approach that requires no training data was introduced. The methodology is applicable to images of typewritten text, handwritten text, or both.

Konstantinos Ntirogiannis et al. (2013) (Ntirogiannis et al., 2013) note that document image binarization is of great value in the recognition pipeline and in document image analysis, as it affects the subsequent phases of the recognition procedure. The assessment of a binarization technique helps in examining its algorithmic behaviour, as well as confirming its adequacy, by giving qualitative and quantitative indications of its performance. A pixel-based binarization assessment approach for historical handwritten/machine-printed document images has been proposed. In the proposed assessment procedure, the recall and precision evaluation measures are suitably adjusted using a weighting scheme that reduces any potential evaluation bias. Additional performance metrics of the proposed assessment scheme comprise the percentage rates of broken and missed text, false alarms, background noise, character enlargement, and merging.

Vincent Rabeux et al. (2013) (Rabeux et al., 2013) propose an approach to predict the outcome of binarization algorithms on a given document image according to its state of degradation. Document degradation results in binarization errors. The degradation of a document image is characterized using different features based on the strength, amount and position of the degradation. These features allow prediction models of binarization algorithms to be built that are very accurate according to R2 values and p-values. The prediction models are used to select the best binarization algorithm for a given document image.

Djamel Gaceb et al. (2013) (Gaceb et al., 2013) studied a smart binarization technique for images. This technique considers different degradations of document images. The nature of every pixel is estimated using hierarchical local thresholding in order to classify it as a foreground, background or ambiguous pixel. The ambiguous pixels, which represent the corrupted zones, cannot be binarized with the same local thresholding. The global quality of the image is estimated from the density of these degraded pixels. If the image is degraded, a second separation is applied to the ambiguous pixels to split them into background or foreground; this second process uses their improved relaxation method.

Marian Wagdy et al. (2013) (Wagdy et al., 2013) implemented a fast and efficient document image clean-up and binarization technique based on retinex theory and global thresholding. This technique joins local and global thresholding with the concept of retinex theory, which can efficiently improve degraded and poor-quality document images. A fast global threshold is then utilized to convert the document image into binary form. The new method overcomes the limitations of the related global threshold techniques.

Vassilis Papavassiliou et al. (2012) (Papavassiliou et al., 2012) discussed a capable technique based on mathematical morphology for extracting text regions from degraded document images. The fundamental stages of the methodology are a) top-hat-by-reconstruction to construct a filtered image with a reasonable background, b) region growing starting from a set of seed points and attaching to each seed its similar-intensity neighbour pixels, and c) conditional extension of the first detected text regions based on the values of the second derivative of the filtered image.

Bolan Su et al. (2012) (Su et al., 2012) studied a document image binarization framework that makes use of the Markov Random Field (MRF) model. The framework separates the document image pixels into three classes, i.e. document background, document foreground text, and uncertain pixels, using an established binarization method. Uncertain pixels are assigned to the foreground and background categories by incorporating the MRF model and boundary information.

C. Patvardhan et al. (2012) (Patvardhan et al., 2012) studied that images may contain a difficult background, i.e. shading or noise. Their binarization method for document images makes them suitable for OCR using the discrete curvelet transform. The curvelet transform is used to eliminate difficult image backgrounds and white Gaussian noise and gives an improved binarized document image. The curvelet transform also helps to enhance the text shape even in the presence of noise. This method is capable of eliminating high-frequency Gaussian noise and low-frequency complex backgrounds and shows better performance.

2.2.2 Border/Illumination Removal Approaches

The proposed approaches for document segmentation and character recognition usually assume that the scanned images are ideal, noise-free images. However, there are several factors that may degrade the image of a full document. When a page is scanned from a book, text from a neighbouring page can also be captured in the image of the current page; these unwanted areas are called "noisy text regions". In addition, whenever a scanned page does not completely cover the scanner's image size, there will usually be black borders in the image; these unwanted regions are called "noisy black borders". Figure 2.1 shows noisy black borders as well as noisy text regions. All these problems influence the performance of the segmentation and recognition processes. If page segmentation algorithms take noisy text regions into account, text recognition accuracy decreases, since the text recognition system usually outputs several extra characters in these regions. The goal of border detection is to find the principal text region and to ignore the noisy text and black borders.

Figure 2.1: Example of an image with noisy black border and noisy text region


The most common approach to eliminate marginal noise is to perform document cleaning by filtering out connected components based on their size and aspect ratio. However, when characters from the adjacent page are also present, they usually cannot be filtered out using only these features. There are only a few techniques in the literature for page border detection, and they are mainly focused on printed document images. Le et al. (Le et al., 1996) propose a border removal technique that is predicated on the classification of blank, non-textual and textual rows and columns, the location of border objects, and an analysis of crossing counts of textual squares and projection profiles. Many heuristics are used in this approach. Moreover, it is assumed that the page borders are very close to the image edges and are separated from the contents of the image by a blank space; however, this assumption is often violated. Fan et al. (Fan et al., 2002) propose a technique to detect black noisy regions that overlap the text, but they do not assume the presence of noisy text regions. They propose a framework that reduces the image resolution to detect and remove the black borders, which may hide text, by a threshold filter, thus isolating the border of the image, and they apply the deletion process on the original image. In (Ávila and Lins, 2004), Avila et al. propose "non-invading" and "invading" border algorithms that work as flood-fill algorithms. The "non-invading border" algorithm assumes that the information of the document may be merged with the noisy black border; to curb the flooding in the connected area, it uses two parameters related to the document, the maximum size of a segment belonging to the document text and the maximum distance between lines. On the other hand, the "invading" algorithm supposes that the text areas will not be invaded by the noisy black borders; if the noisy black border merges with a text region of the document, the whole area, including that part of the text region, is flooded and removed. Dey et al. (Dey et al., 2012) propose a technique for removing margin noise from printed document images. Firstly, they perform layout analysis to detect words, lines, and paragraphs in the document image, and the detected elements are classified into text and non-text components based on their characteristics (size, position, etc.). The geometric properties of the text blocks are then used to detect and remove the margin noise. Finally, Agrawal and Doermann (Agrawal and Doermann, 2013) present a clutter detection and removal algorithm for complex document images. They propose a distance transform-based approach which aims to remove irregular and non-periodic clutter noise from binary document images independently of the clutter's position, size, shape and connectivity with the text.
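The simplest form of the noisy-black-border idea discussed above can be sketched as follows: outer rows and columns whose black-pixel ratio is very high are assumed to belong to the border and are cropped away. This is only an assumption-laden illustration (the 0.8 ratio is arbitrary), not one of the cited algorithms and not the illumination-removal technique proposed in this thesis.

```python
import numpy as np

# Crop outer rows/columns that are almost entirely foreground (ink/border).
# 'binary' is expected with foreground > 0; the ratio threshold is illustrative.
def trim_black_borders(binary, border_ratio=0.8):
    fg = binary > 0
    h, w = fg.shape
    top, bottom, left, right = 0, h, 0, w
    while top < bottom and fg[top].mean() > border_ratio:
        top += 1
    while bottom > top and fg[bottom - 1].mean() > border_ratio:
        bottom -= 1
    while left < right and fg[top:bottom, left].mean() > border_ratio:
        left += 1
    while right > left and fg[top:bottom, right - 1].mean() > border_ratio:
        right -= 1
    return binary[top:bottom, left:right]
```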

2.3 Arabic OCR

Optical Character Recognition (OCR) systems transform large numbers of documents, either printed or handwritten, into machine-encoded text, ideally regardless of transformations, noise, resolution variations and other factors.

Figure 2.2: General Arabic OCR systems capabilities (character recognition divided into off-line and on-line; machine-printed and handwritten; isolated characters and cursive words)

Figure 2.2 shows the main capabilities of Arabic OCR systems. Obviously, they differ in their character recognition capabilities. The sophistication of an off-line OCR system depends on the type and number of fonts to be recognized. An omni-font OCR machine can recognize most non-stylized fonts without having to maintain huge databases of specific font information. Usually, omni-font technology is characterized by the use of feature extraction. However, no OCR machine performs equally well, or even usably well, on all the fonts used by modern computers. The first step in any OCR system is to capture text data and transform it into a digital form. Recognition systems differ in how they acquire their input; there are two different ways, on-line and off-line systems, as described in Figure 2.2. On-line (or real-time) systems (Alimi, 1997; Al-Emami and Usher, 1990) recognize the text while the user is writing it, e.g. on a digital tablet. The tablet captures the (x, y) coordinates of the pen location while it is moving. This generates a one-dimensional vector of these points, which depends on the tablet resolution (points/inch) and the sampling rate (points/second). On-line systems have high recognition performance, since each character is represented by a time-dependent vector of points sorted by the time factor. The user of such a system can directly see the output of the recognition system and verify the results. These systems are limited to recognizing handwritten text only. Off-line systems recognize the text after it has been written or printed on pages (Al-Muhtaseb et al., 2008; Cheung et al., 2001). Most interesting text is already printed in documents or books, and the need to convert it into an electronic medium gives great value to off-line recognition systems. Unlike on-line systems, off-line systems do not have information dependent on the time factor. Each page of text is represented by a two-dimensional array of pixel values. The system may acquire the input text using scanners (Al-Muhtaseb et al., 2008; Cheung et al., 2001).


2.4 Image Segmentation

In order to extract features from a text image, it should be segmented into lines, words, characters or primitives. Arabic OCR systems are classified into two major types depending on the method of segmentation used: segmentation-based systems and segmentation-free systems. The segmentation procedure is the major challenging phase for any Arabic OCR system because of the cursive nature of the Arabic script. This challenge occurs in segmentation-based systems, while segmentation-free systems avoid the problem.

Table 2.2: Categorization of segmentation algorithms (Nikos et al., 2010)

words, characters or primitives. Arabic OCR systems are classified into two major types depending on the method of segmentation been used: segmentation–based systems and segmentation–free systems. The segmentation procedure is the major challenging phase for any Arabic OCR system because of the cursive nature of the Arabic script. This challenge occurs at segmentation based systems while segmentation –free systems avoid this problem. Table 2.2: Categorization of segmentation algorithms (Nikos et al., 2010) Existing Research

Proposed

HandSegmentation

Printed

Diacritical Page

Text line

Arabic Text

written algorithm

documents

Marks segmentation segmentation Segmentation

documents

Segmentation

X-Y cuts

*

*

*

*

RLSA

*

*

*

*

Docstrum

*

*

*

*

*

*

*

*

*

Whitespace analysis Constrained text line Hough transform

*

Voronoi

*

* *

Scale space analysis

*

*

*

*

18

Various document image segmentation techniques have been proposed in the literature. These techniques can be categorized based on the document image segmentation algorithm that they adopt. The best known of these segmentation algorithms are the following: X-Y cuts or projection-profile based (Nagy and Seth, 1984), the Run Length Smoothing Algorithm (RLSA) (Wahl et al., 1982), component grouping (Feldbach and Tonnies, 2001), document spectrum (O'Gorman, 1993), whitespace analysis (Baird, 1994), constrained text lines (Breuel, 2002), Hough transform (Hough, 1962; Duda and Hart, 1972), Voronoi tessellation (Kise et al., 1998) and scale space analysis (Manmatha and Rothfeder, 2005). All of the above segmentation algorithms are mainly designed for contemporary documents. Table 2.2 categorizes all of the aforementioned segmentation algorithms and depicts the way they have been used in document processing.

2.5 Techniques for Document Text Segmentation

One of the early tasks in a handwriting recognition system is the segmentation of a handwritten document image into text lines, which is defined as the process of defining the region of every text line on a document image. The overall performance of a handwritten character recognition system depends strongly on the results of the text line segmentation process. If the quality of the results produced by the text line segmentation stage is poor, this will affect the accuracy of the text recognition procedure. Thus, the algorithms employed for these two stages are critical for the overall recognition procedure. We can group existing text line segmentation methods into four basic categories: methods making use of projection profiles, methods based on the Hough transform, smearing methods and, finally, methods based on the principle of dynamic programming. Also, many methods exist that cannot be clearly classified in a specific category, since they employ particular techniques.

2.5.1 Projection Profiles

Several methods make use of projection profiles (Bruzzone and Coffetti, 1999; Arivazhagan, 2007). In (Bruzzone and Coffetti, 1999), the original image is divided into vertical slices, and at every vertical slice the histogram of horizontal runs is calculated. The assumption of this technique is that the text contained in one slice is parallel. Arivazhagan et al. (Arivazhagan, 2007) partition the original image into vertical strips called chunks. The projection profile is calculated for each chunk, and the first candidate lines are extracted from the first chunks. These lines traverse around any obstructing handwritten connected component by associating it with the text line above or below. The decision is made by either (i) modelling the text lines as bivariate Gaussian densities and evaluating the probability of the component for each Gaussian, or (ii) the probability obtained from a distance metric.
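A minimal sketch of the projection-profile idea is given below: the number of ink pixels in each row is counted, and consecutive rows whose count exceeds a threshold are grouped into a text line band. The threshold and the whole-page row-wise formulation are illustrative assumptions; the cited methods work chunk-wise and with more elaborate decisions.

```python
import numpy as np

# Horizontal projection profile sketch: returns (top, bottom) row ranges of
# text lines. 'binary' has foreground (ink) > 0; 'min_ink' is an illustrative
# per-row threshold, not a value from the cited works or from this thesis.
def segment_lines(binary, min_ink=1):
    profile = (binary > 0).sum(axis=1)      # ink pixels per row
    in_line, start, lines = False, 0, []
    for y, count in enumerate(profile):
        if count >= min_ink and not in_line:
            in_line, start = True, y        # a text line band starts
        elif count < min_ink and in_line:
            in_line = False
            lines.append((start, y))        # the band ends
    if in_line:
        lines.append((start, len(profile)))
    return lines
```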

2.5.2 Hough Transform Approach

Several methods make use of the Hough transform, including (Fletcher and Kasturi, 1988), (Louloudis et al., 2008), (Likforman-Sulem et al., 1995) and (Pu and Shi, 1998). The Hough transform is a powerful tool used in many areas of document analysis that is able to locate skewed lines of text. Starting from a few points of the original image, the technique is well suited to extracting the lines formed by these points. The points considered in the voting process of the Hough transform are usually either the gravity centers (Fletcher and Kasturi, 1988; Louloudis et al., 2008; Likforman-Sulem et al., 1995) or minima points (Pu and Shi, 1998) of the connected components.


In further detail, Likforman (Likforman-Sulem et al., 1995) developed a technique based on a hypothesis-validation scheme. Potential alignments are hypothesized in the Hough domain and validated in the image domain. The units for the Hough transform are the centroids of the connected components. A set of units aligned in the image along a line with parameters (ρ, θ) is included in the corresponding cell (ρ, θ) of the Hough domain; alignments including many units correspond to high-peaked cells of the Hough domain. A recent method using the Hough transform was proposed by Louloudis et al. (Louloudis et al., 2008). The main contributions of the approach correspond to a) the partitioning of the connected component space into three distinct spatial sub-domains (small, normal and large), from which only normal connected components are used in the Hough transform step, b) a block-based Hough transform step for the detection of potential text lines, and c) a post-processing step for the detection of text lines that the Hough transform did not reveal, as well as the separation of vertically connected parts of adjacent text lines. The Hough transform can also be applied to fluctuating lines of handwritten drafts, as in (Pu and Shi, 1998). The Hough transform is first applied to minima points (units) in a vertical strip on the left of the image. The alignments in the Hough domain are searched starting from a main direction, by grouping cells in an exhaustive search in six directions. Then a moving window, associated with a clustering scheme in the image domain, assigns the remaining units to alignments. The clustering scheme (Natural Learning Algorithm) allows the creation of new lines starting in the middle of the page.
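To clarify how centroid-based Hough voting works in these methods, the following toy sketch accumulates votes from connected-component centroids into a (rho, theta) accumulator; peaks then correspond to candidate text-line alignments. The angle range and bin sizes are illustrative assumptions, and the sketch omits the validation and post-processing steps of the cited approaches.

```python
import numpy as np

# Toy Hough voting over centroids: each (x, y) votes for lines
# rho = x*cos(theta) + y*sin(theta). 'diag' is the image diagonal in pixels.
def hough_votes(centroids, diag, theta_deg=np.arange(80, 101)):
    thetas = np.deg2rad(theta_deg)              # near-horizontal lines: theta ~ 90 deg
    acc = np.zeros((2 * diag + 1, thetas.size), dtype=int)
    for x, y in centroids:
        rhos = np.rint(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(thetas.size)] += 1
    return acc                                  # peaks = candidate text-line alignments
```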

2.5.3 Smearing Methods

Smearing methods mainly include the fuzzy RLSA (Shi and Govindaraju, 2004) and the adaptive RLSA (Makridis et al., 2007). The fuzzy RLSA measure is calculated for every pixel of the initial image and describes "how far one can see when standing at a pixel along the horizontal direction". By applying this measure, a new grayscale image is created, which is binarized, and the lines of text are extracted from the new image. The adaptive RLSA (Makridis et al., 2007) is an extension of the traditional RLSA, in the sense that extra smoothing constraints are set with respect to the geometrical properties of neighbouring connected components. The replacement of background pixels by foreground pixels is performed whenever these constraints are satisfied.
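A hedged sketch of the classical (non-fuzzy, non-adaptive) horizontal RLSA underlying these smearing methods is shown below: background runs shorter than a fixed gap between two foreground pixels are filled, smearing the characters of a line into a single blob. The gap value is illustrative.

```python
import numpy as np

# Classical horizontal RLSA sketch on a 0/255 binary image (foreground = 255).
def horizontal_rlsa(binary, max_gap=20):
    out = binary.copy()
    for row in out:                             # rows are views, edited in place
        fg = np.flatnonzero(row > 0)            # indices of foreground pixels
        for a, b in zip(fg[:-1], fg[1:]):
            if 1 < b - a <= max_gap:            # short background run between ink
                row[a:b] = 255                  # fill the gap
    return out
```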

2.5.4 Dynamic Programming

Text line segmentation methods based on the dynamic programming principle were presented recently (Nicolaou and Gatos, 2009; Saabni et al., 2014). They try to segment text lines by finding an optimal path on the background of the document image travelling from the left to the right edge. The approach of Nicolaou et al. (Nicolaou and Gatos, 2009) is based on the topological presumption that for every text line there exists a path from one side of the image to the other which crosses only that single text line. The image is first blurred, and in a second step tracers are used to follow the black-most and white-most paths from right to left as well as from left to right. The final goal is to shred the image into text line areas. Saabni et al. (Saabni et al., 2014) propose a method which computes an energy map of a text image and determines the seams that pass across and between text lines. Two different algorithms are described (one for binary and one for grayscale images). Concerning the first algorithm (binary case), each seam passes along the middle of a text line and marks the components that make up its letters and words; in a final step, the unmarked components are assigned to the closest text line. For the second algorithm (grayscale case), the seams are calculated on the distance transform of the grayscale image.
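The following is a minimal dynamic-programming sketch of the optimal-path idea described above: given a cost map in which ink pixels are expensive, a least-cost left-to-right seam is computed, and such a seam can serve as a separating path between two text lines. It is an illustrative, assumption-based example, not the algorithms of Nicolaou and Gatos (2009) or Saabni et al. (2014).

```python
import numpy as np

# Least-cost left-to-right seam by dynamic programming.
# 'cost' is a 2-D map where ink pixels carry high cost; returns one row index
# per column describing a background path that separates text lines.
def min_cost_seam(cost):
    h, w = cost.shape
    dp = cost.astype(float)
    back = np.zeros((h, w), dtype=int)
    for x in range(1, w):
        for y in range(h):
            lo, hi = max(0, y - 1), min(h, y + 2)      # allow moves up/straight/down
            k = lo + int(np.argmin(dp[lo:hi, x - 1]))
            back[y, x] = k
            dp[y, x] += dp[k, x - 1]
    y = int(np.argmin(dp[:, -1]))                      # cheapest endpoint on the right
    seam = [y]
    for x in range(w - 1, 0, -1):                      # backtrack to the left edge
        y = back[y, x]
        seam.append(y)
    return seam[::-1]
```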


2.5.5 Other Techniques

Related work using other methodologies includes Nicolas et al. (Nicolas et al., 2004). In this work, the problem of text line extraction is considered from an Artificial Intelligence perspective. The objective is to group the connected components of the document into homogeneous sets that correspond to the text lines of the document. To solve this problem, a search is applied over the graph defined by the connected components as vertices and the distances among them as edges. A recent paper (Shi et al., 2005) makes use of the Adaptive Local Connectivity Map: a grayscale image is the input to the technique, and a new image is calculated by summing the intensities of every pixel's neighbours in the horizontal direction. Since the new image is also a grayscale image, a thresholding methodology is applied, and the connected components are grouped into location maps by employing a grouping methodology. In (Kennard and Barrett, 2006), a technique for binarized images uses the count of foreground/background transitions to detect text line areas, and a min-cut/max-flow graph cut algorithm is used to segment text areas that appear to contain more than one line of text. Text lines with little text information are merged with other text lines. Yi Li (Li et al., 2008) presented a method which models text line detection as an image segmentation problem by enhancing text line structures using a Gaussian window and adopting the level set method to evolve the boundaries of the text lines. The method described in (Lemaitre and Camillerapp, 2006) is based on a notion of perceptive vision: at a certain distance, text lines can be seen as line segments. This method is based on the theory of Kalman filtering to detect text lines in low-resolution images.


Weliwitage et al. (Weliwitage et al., 2005) presented a technique that includes cut text minimization for text line segmentation of handwritten English documents. To do this, an optimization technique is applied that varies the cutting angle and starting location in order to minimize the number of text pixels cut while tracking between two text lines. In (Basy et al., 2008), a text line extraction technique is presented for multi-skewed handwritten documents of Bengali or English text. It assumes that hypothetical water flows from both the left and right sides of the image frame and faces obstruction from the characters of the text lines. The stripes of areas left unwetted on the image frame are finally labelled for the extraction of text lines. In (Zahour et al., 2007), a text line segmentation method is presented for handwritten or printed historical documents containing Arabic script. In the first step, the K-means scheme is used to classify the documents into two classes corresponding to the complexity of the document (easy or not easy to segment). A document with overlapping and touching characters is divided into vertical strips. From the horizontal projection result, the extracted text blocks are classified into three categories: large, average and small text blocks. The lines are obtained by segmenting the large text blocks using a spatial relationship that matches adjacent blocks within two successive strips. Documents that do not have touching or overlapping characters are segmented without the large-block segmentation module. From 100 experiments on historical documents, the researchers claim 96% accuracy on that sample. Yin (Yin and Liu, 2008) proposes an approach based on minimum spanning tree (MST) clustering with new distance measures. In the first step, the connected components of the document image are grouped into a tree by minimum spanning tree clustering with a new distance measure. Then, using a new objective function, the tree edges are cut dynamically to form text lines by finding the number of clusters. This approach can be applied to many documents with curved and multi-skewed lines and is totally parameter-free. Stamatopoulos et al. (Stamatopoulos et al., 2009) present a combination method of different segmentation techniques. The goal is to exploit the segmentation results of complementary techniques and specific features of the initial image so as to generate improved segmentation results. Roy et al. (Roy et al., 2008) propose a method based on morphological operations and run-length smearing. In the first step, RLSA is applied to obtain a single word as a component. In the next step, the foreground of this smoothed image is eroded to obtain several seed components from the individual words of the document; erosion is also applied to the background portions to find some boundary information of the text lines. In the last step, the lines are segmented using the boundary information and the positional information of the seed components. Finally, Du et al. (Du et al., 2008) propose a method based on the Mumford-Shah model. The algorithm is claimed to be script-independent. In addition, morphing is used to remove overlaps between neighbouring text lines and to connect broken ones.


2.6 Text Segmentation Analysis

Table 2.3: Text line segmentation methods analysis

Bruzzone et al. (1999). An algorithm for extracting cursive text lines. Category: Projection profiles method.
The algorithm is based on the analysis of horizontal run projections and on connected component grouping and splitting over a partition of the input image into vertical strips, in order to deal with undulating or skewed text. The goal of the algorithm is to prevent the ascending and descending characters from being corrupted by arbitrary cuts. The algorithm has been designed for cursive text and can also be applied to hand-printed text.

Arivazhagan et al. (2007). A statistical approach to line segmentation in handwritten documents. Category: Projection profiles method.
The projection profile of every vertical strip (chunk) is calculated. The first candidate lines are extracted among the first chunks. These lines traverse around any obstructing handwritten connected component by associating it to the text line above or below. This decision is made either by (i) modeling the text lines as bivariate Gaussian densities and evaluating the probability of the component for each Gaussian, or (ii) by the probability obtained from a distance metric.

Likforman et al. (1995). A Hough based algorithm for extracting text lines in handwritten documents. Category: Hough transform method.
Potential alignments are hypothesized in the Hough domain and validated in the image domain. The gravity centers of the connected components are the units for the Hough transform.

Pu and Shi (1998). A natural learning algorithm based on Hough transform for text lines extraction in handwritten documents. Category: Hough transform method.
The Hough transform is first applied to minima points (units) in a vertical strip on the left of the image. The alignments in the Hough domain are searched starting from a main direction, by grouping cells in an exhaustive search in six directions. A moving window, associated with a clustering scheme in the image domain, then assigns the remaining units to alignments. The clustering scheme (natural learning algorithm) allows the creation of new lines starting from the middle of the pages.

Louloudis et al. (2008). Text line detection in handwritten documents. Category: Hough transform method.
The methodology incorporates a block-based Hough transform approach which takes into account the gravity centers of parts of connected components. After the first candidate text line extraction, a post-processing step is used to correct possible splitting as well as to detect text lines that the previous step did not reveal. A key idea in the whole procedure is the partitioning of the connected component domain into three distinct sub-domains, each of which is treated in a different manner.

Shi and Govindaraju (2004). Line separation for complex document images using fuzzy runlength. Category: Smearing method.
The fuzzy RLSA measure is calculated for every pixel of the initial image and describes "how far one can see when standing at a pixel along the horizontal direction". By applying this measure, a new grayscale image is created, which is binarized, and the lines of text are extracted from the new image.

Gatos et al. (2006). Adaptive degraded document image binarization. Category: Smearing method.
The adaptive RLSA is an extension of the traditional RLSA, in the sense that additional smoothing constraints are set with respect to the geometrical properties of neighboring connected components. Background pixels are replaced with foreground pixels whenever these constraints are satisfied.

Shi et al. (2005). Text extraction from gray scale historical document images using adaptive local connectivity map. Category: Other.
A methodology that makes use of the adaptive local connectivity map. A grayscale image is the input to the technique, and a new image is calculated by summing the intensities of every pixel's neighbors in the horizontal direction. Since the new image is also a grayscale image, a thresholding method is applied and the connected components are grouped into location maps by employing a grouping methodology.

Kennard et al. (2006). Separating lines of text in free-form handwritten historical documents. Category: Other.
The method uses the count of foreground/background transitions in a binarized image to determine areas of the document that are likely to be text lines. A min-cut/max-flow graph cut algorithm is used to split up text areas that appear to encompass more than one line of text. Text lines containing relatively little text information are then merged into nearby text lines.

Lemaitre and Camillerapp (2006). Text line extraction in handwritten document with Kalman Filter applied on low resolution image. Category: Other.
A methodology based on a notion of perceptive vision: at a certain distance, text lines can be seen as line segments. The method uses the theory of Kalman filtering to detect text lines on low resolution images.

Nicolas et al. (2004). Text line segmentation in handwritten document using a production system. Category: Other.
A method that considers text line extraction from the perspective of Artificial Intelligence. The objective is to gather the connected components of the document into homogeneous sets that correspond to the text lines of the document. To solve this problem, a search is applied over the graph defined by the connected components as vertices and the distances among them as edges.

Li et al. (2008). Script-independent text line segmentation in freestyle handwritten documents. Category: Other.
A technique that models text line detection as an image segmentation problem, enhancing text line structures using a Gaussian window and applying the level set method to evolve the boundaries of the text lines.

Weliwitage et al. (2005). Handwritten document offline text line segmentation. Category: Other.
A technique that includes cut text minimization for text line segmentation of handwritten English documents. An optimization technique is applied that varies the cutting angle and the starting location to reduce the number of text pixels cut while tracking between two text lines.

Basu et al. (2008). Text line extraction from multi-skewed handwritten documents. Category: Other.
A technique for multi-skewed handwritten documents of Bengali or English text. It assumes that hypothetical water flows from both the right and left sides of the image frame and is obstructed by the characters of the text lines. The stripes of areas left unwetted on the image frame are finally labeled for the extraction of text lines.

Yin and Liu (2008). Handwritten text line extraction based on minimum spanning tree clustering. Category: Other.
An approach based on minimum spanning tree (MST) clustering with new distance measures. First, the connected components in the document image are grouped into a tree by MST clustering with a new distance measure. Then the tree edges are dynamically cut using a new objective function to form text lines and to find the number of clusters. The approach can be applied to many documents with curved and multi-skewed lines and is totally parameter-free.

Stamatopoulos et al. (2009). A method for combining complementary techniques for document image segmentation. Category: Other.
A combination method of different segmentation techniques. The goal is to exploit the segmentation results of complementary techniques and specific features of the initial image so as to generate improved segmentation results.

Roy et al. (2008). Morphology based handwritten line segmentation using foreground and background information. Category: Other.
A method based on morphological operations and run-length smearing. First, RLSA is applied to obtain a single word as a component. Next, the front side of this smoothed image is eroded to obtain several seed components from the individual words of the document. Erosion is also applied to the background portions to find boundary information of the text lines. Finally, the lines are segmented using the boundary information and the positional information of the seed components.

Du et al. (2008). Text line segmentation in handwritten documents using Mumford-Shah model. Category: Other.
A method based on the Mumford-Shah model. The algorithm is claimed to be script independent. In addition, morphing is used to remove overlaps between neighboring text lines and to connect broken ones.

2.7 Conclusion

This chapter discussed the pre-processing phase for Arabic/Jawi character recognition, covering binarization using many techniques, and motivated a robust approach for removing illumination and segmenting the holy Quran without any change to its content. Maintaining the authenticity of the Quran text-image content drives both the investigation and the implementation phases, since detecting and removing these unwanted areas is critical to achieving better text segmentation results.

CHAPTER 3

RESEARCH METHODOLOGY

3.1 Introduction

This chapter discusses the research methodology carried out in order to achieve the research objectives stated in Chapter 1. The research methodology comprises the sequential, logical process that shaped the tasks performed throughout the research. The chapter also presents the research framework that guides the implementation of the study.

3.2 Research Methodology

The study methodology is organized around the conceptual framework, the task framework and the experimental framework. Each framework contains the details of the implementation of its sub-sections.

3.2.1 Research Framework

The conceptual framework of the study is divided into two phases: the investigation phase and the implementation phase. The phases are shown in Figure 3.1.

Research Framework

Investigation Phase:
1. Problem summarization
2. Research on illumination detection and text line segmentation processing
3. Research on previous techniques used in illumination detection and text line segmentation

Implementation Phase (Task Framework):
1. Data Collection
2. Binarization method
3. Image detection
4. Segmentation Image
5. Image of each page and line

Figure 3.1: Research Framework

I. Investigation Phase

In this phase, the domain of study is investigated. The background, interests, problems and current issues of the domain are studied to determine the scope of the study and to establish its aim. Once the domain is specified, the investigation phase proceeds through a literature review of the factors involved in the domain and identifies previous research associated with the scope and domain of the study.

II. Implementation Phase

Once the objectives and the problem statement were established in Chapter 1, the following phase is completed as the implementation phase of this study. Within the implementation phase, the task framework is used as a guideline for this research.

i. Data Collection

Data collection is the method the researcher used to obtain the data and information. In this research, we used printed text, i.e. images of Arabic words written in typographic form. The printed text data consist of different copies of the Holy Quran. As shown in Figure 3.2, Al-Quran is written with different styles of writing and different illumination. Illumination here refers to the decoration on every page of Al-Quran.

Figure 3.2: Al-Quran writing styles

ii. Binarization Image

The binarization process converts a grayscale or colored image into a binary image. A binary image (Stathis et al., 2008) is a digital image that has just two possible values for every pixel. Normally, two colors are used for a binary image, i.e. black and white, although any two colors can be used. The color used for the objects in the image is the foreground color, while the rest of the image is the background color.
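To make the binary representation concrete, the following minimal Java sketch (an illustration only, not the thesis software; the array representation, the class and method names, and the fixed threshold are assumptions) converts a grayscale image into a binary image using a single global threshold, with 0 standing for a white pixel as in the rest of this study.

public class SimpleBinarizer {
    // Convert a grayscale image (values 0-255) into a binary image where
    // 0 represents a white/background pixel and 1 a black/foreground pixel,
    // matching the convention used in this study.
    public static int[][] binarize(int[][] gray, int threshold) {
        int height = gray.length;
        int width = gray[0].length;
        int[][] binary = new int[height][width];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Dark pixels (below the threshold) are treated as text or decoration.
                binary[y][x] = (gray[y][x] < threshold) ? 1 : 0;
            }
        }
        return binary;
    }
}

A fixed threshold is only adequate for clean images; the Otsu method adopted in Chapter 4 chooses the threshold automatically from the image histogram.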

iii. Image Detection

The images undergo the following detection steps:
a) Illumination detection
b) Text line detection

iv. Segmentation Image

The images are segmented as follows:
a) Segments to page
b) Segments to text line
The segmented images are then used to save the image of each page and each line using the proposed method.

3.3 Task Framework

Below, we present a framework of techniques (see Figure 3.3) which enables page and text line segmentation of a set of Al-Quran pages. In this research, the process is divided into six steps to segment the Quran pages into lines and pages without illumination: (a) a pre-processing step comprising binarization and noise removal is applied, (b) the illumination on the pages is detected, (c) the page is segmented into a page without illumination, (d) the text lines on the segmented page are detected, (e) the page is segmented into text lines, and (f) the segmented pages and lines are saved as images.

Image/data -> Binary image -> Illumination detection -> Segments to page -> Save image of each page
Segments to page -> Text line detection -> Segments to text line -> Save image of each line

Figure 3.3: The Illumination Removal in Al-Quran and Segmentation to Pages and Lines Framework

At the end of the task framework, the best-practice techniques for page and text line segmentation are obtained. The images used in this research are taken from Al-Quran. Figure 3.4 illustrates a sample Al-Quran image.

Figure 3.4: Page (10) from Holy Al-Quran

3.4 Experimental Test Framework

In this research, one experimental test was conducted to find the most effective techniques for segmenting the images of the Holy Quran pages. The experiment is based on finding the best percentage thresholds to detect the illumination and the text lines, in order to segment Al-Quran pages into text lines and pages without illumination. The experimental test is explained together with its objective. The algorithms used, the input, and the results obtained from the algorithm are also described.

3.4.1 Experiment I

i. Objective: obtain the best percentage thresholds to detect the illumination and text lines, and segment Al-Quran pages into text lines and pages without illumination.

ii. Input: images of Al-Quran pages.
iii. Algorithm: the proposed method.
iv. Output: an image of each text line and of each page without illumination.

3.5 Research Tool

We used several research tools to support the research in this study. Table 3.1 summarizes the necessary equipment for this research.

Table 3.1: Tools and programming languages utilized in this research

Steps: Data Collection, Binarization, Detection, Segmentation
Datasets: Images of Holy Quran pages
Programming Language: Java
Tools: ImageJ

3.6 Conclusion

This chapter discussed the methodology used to solve the research problem. Two phases were identified and applied in this study: the investigation phase and the implementation phase. The task framework was then presented to indicate the overall implementation of the study. The experimental test design was described and, finally, the research tools used throughout the study were reported.

CHAPTER 4

IMPLEMENTATION

4.1 Introduction

In the previous chapter, the methodology was explained in detail. The methodology consists of two phases: investigation and implementation of the proposed system. This chapter discusses in detail the design and development of the system and the task framework used for this research. Among others, it explains the requirements determination and structuring activities based on the research methodology discussed in Chapter 3. In this chapter, suitable solutions are provided for the issues discussed previously. The solutions are based on the objectives stated in Chapter 1 (Introduction). Therefore, this chapter also discusses the proposed method.

4.2 Data Collection

The images used are Al-Quran pages taken from different copies of the Holy Quran. As shown in Figure 4.1, Al-Quran is written with different styles of writing and different illumination. Illumination here refers to the decoration on every page of Al-Quran.

Figure 4.1: Al-Quran writing styles

4.3 Implementation Process

Below, we present a framework of techniques (see Figure 4.2) which enables page and text line segmentation of a set of Al-Quran pages. In this research, the process is divided into six steps to segment the Quran pages into lines and pages without illumination: (a) a pre-processing step comprising binarization and noise removal is applied, (b) the illumination on the pages is detected, (c) the page is segmented into a page without illumination, (d) the text lines on the segmented page are detected, (e) the page is segmented into text lines, and (f) the segmented pages and lines are saved as images. The proposed framework focuses on the two most important steps: page segmentation and text line segmentation. All steps used in the process of segmenting Al-Quran pages are explained in detail below.

4.3.1 Image Binarization

Binarization is performed in the pre-processing step of document analysis and is designed to separate the text from the document background. Many algorithms have been proposed for binarizing a document. In this study, the Otsu method is used for binarization (Otsu, 1979).
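As a rough sketch of how the Otsu method selects the threshold (a generic textbook formulation given here for illustration; it is not the actual code used in this project, and the class and method names are assumptions), the 256-bin histogram of the grayscale image is scanned and the threshold that maximizes the between-class variance is returned.

public class OtsuThreshold {
    // Return the threshold (0-255) that maximizes the between-class variance
    // of the grayscale histogram, following Otsu (1979).
    public static int compute(int[][] gray) {
        int[] hist = new int[256];
        int total = 0;
        for (int[] row : gray) {
            for (int v : row) {
                hist[v]++;
                total++;
            }
        }
        double sumAll = 0;
        for (int i = 0; i < 256; i++) {
            sumAll += i * (double) hist[i];
        }
        double sumBackground = 0;
        int weightBackground = 0;
        double bestVariance = -1;
        int bestThreshold = 0;
        for (int t = 0; t < 256; t++) {
            weightBackground += hist[t];                 // pixels at or below candidate t
            if (weightBackground == 0) continue;
            int weightForeground = total - weightBackground;
            if (weightForeground == 0) break;
            sumBackground += t * (double) hist[t];
            double meanBackground = sumBackground / weightBackground;
            double meanForeground = (sumAll - sumBackground) / weightForeground;
            double betweenVariance = (double) weightBackground * weightForeground
                    * (meanBackground - meanForeground) * (meanBackground - meanForeground);
            if (betweenVariance > bestVariance) {        // keep the best separating threshold
                bestVariance = betweenVariance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }
}

The returned value can then be applied as in the simple binarizer sketched in Chapter 3 to obtain the binary page image.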

Image/data -> Binary image -> Illumination detection -> Segments to page -> Save image of each page
Segments to page -> Text line detection -> Segments to text line -> Save image of each line

Figure 4.2: The Illumination Removal in Al-Quran and Segmentation to Pages and Lines Framework

4.3.2 Page Segmentation

Our methodology detects and removes blank space and illumination from the Holy Quran pages. The blank space and illumination removal method relies on the density of the binary values (the binary representation). We propose a new methodology that detects the page content through three successive frames: i. the exterior blank space and marginal illumination, ii. the illumination, and iii. the interior blank space. The detection is based on the horizontal and vertical white pixel percentages. Our aim is to segment the Al-Quran page image into a page without blank space or illumination.

Figure 4.3: Overall steps for removing page frames

i. Frame 1 (the exterior blank space and marginal illumination):

In the first frame step, the blank space and marginal illumination of the page are removed. The flowchart for blank space and marginal illumination detection and removal is shown in Figure 4.4. In order to achieve this, the image is first processed and converted into a binary representation so that the total frequency of the zero value (white color) can be calculated. Consider an input grayscale page image with dimensions X × Y. Our aim is to find the frame of the page defined by the new coordinates, as demonstrated in Figure 4.5. A constant threshold value for the white percentage is assumed for this frame.

Next, the horizontal edges of the first frame are detected by calculating the White Percentage from Up, WP_U(l), for each row l from 0 to Y-1:

    WP_U(l) = \frac{100}{X} \sum_{x=0}^{X-1} [B(x,l) = 0]        (1)

where B(x,y) is the value of the binary image at column x and row y, 0 represents a white pixel, and [·] equals 1 when the condition holds and 0 otherwise. When WP_U(l) reaches the threshold assumed for this frame, the detection of the upper limit stops and l is identified as the upper limit. The White Percentage from Down, WP_D(l), is then calculated for each row l from Y-1 down to 0 using the same expression:

    WP_D(l) = \frac{100}{X} \sum_{x=0}^{X-1} [B(x,l) = 0]        (2)

When WP_D(l) reaches the threshold, the detection of the lower limit stops and l+1 is identified as the lower (down) limit.

Next, the vertical edges of the first frame are detected by calculating the White Percentage from Left, WP_L(l), for each column l from 0 to X-1:

    WP_L(l) = \frac{100}{Y} \sum_{y=0}^{Y-1} [B(l,y) = 0]        (3)

When WP_L(l) reaches the threshold, the detection of the left limit stops and l is identified as the left limit. The White Percentage from Right, WP_R(l), is then calculated for each column l from X-1 down to 0 using the same expression:

    WP_R(l) = \frac{100}{Y} \sum_{y=0}^{Y-1} [B(l,y) = 0]        (4)

When WP_R(l) reaches the threshold, the detection of the right limit stops and l+1 is identified as the right limit.

The frame is detected once the horizontal and vertical frame edges have been found. The new image is then cropped at the point (left, upper) with size (right - left) × (down - upper). The next page frame is identified and generated using this cropping operation, as shown in Figure 4.5.

Figure 4.4: Flowchart for blank space and Illumination detection and removal


Figure 4.5: The process for removing the exterior blank space and marginal Illumination from Al-Quran page
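The following Java sketch illustrates the Frame 1 scan described above. It is a simplified illustration under stated assumptions: the binary image uses 0 for white (as in the text), the threshold is a placeholder rather than the constant actually used in this study, the stopping condition is written as "the white percentage drops below the threshold", which is only one plausible reading of the condition, and the class and method names are not taken from the thesis software.

public class FrameDetector {
    // White percentage of row l in a binary image where 0 means white (formulas (1)/(2)).
    public static double rowWhitePercentage(int[][] binary, int l) {
        int width = binary[0].length;
        int white = 0;
        for (int x = 0; x < width; x++) {
            if (binary[l][x] == 0) white++;
        }
        return 100.0 * white / width;
    }

    // White percentage of column l (formulas (3)/(4)).
    public static double columnWhitePercentage(int[][] binary, int l) {
        int height = binary.length;
        int white = 0;
        for (int y = 0; y < height; y++) {
            if (binary[y][l] == 0) white++;
        }
        return 100.0 * white / height;
    }

    // Find the frame limits by scanning from each side until the white
    // percentage drops below the given threshold (an assumed reading of the
    // elided stopping condition). Returns {upper, down, left, right}.
    public static int[] detectFrame(int[][] binary, double threshold) {
        int height = binary.length;
        int width = binary[0].length;
        int upper = 0, down = height, left = 0, right = width;

        for (int l = 0; l < height; l++) {             // scan rows from the top
            if (rowWhitePercentage(binary, l) < threshold) { upper = l; break; }
        }
        for (int l = height - 1; l >= 0; l--) {        // scan rows from the bottom
            if (rowWhitePercentage(binary, l) < threshold) { down = l + 1; break; }
        }
        for (int l = 0; l < width; l++) {              // scan columns from the left
            if (columnWhitePercentage(binary, l) < threshold) { left = l; break; }
        }
        for (int l = width - 1; l >= 0; l--) {         // scan columns from the right
            if (columnWhitePercentage(binary, l) < threshold) { right = l + 1; break; }
        }
        return new int[] { upper, down, left, right };
    }
}

Once the four limits are known, the crop itself could be performed, for example, with java.awt.image.BufferedImage.getSubimage(left, upper, right - left, down - upper).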

ii. Frame 2 (the illumination):

In the second frame step, the illumination of the page is removed. The flowchart for illumination detection and removal is shown in Figure 4.4. In order to achieve this, the result of the previous frame is processed to calculate the total frequency of the zero value (white color) in the new image. Consider the new input image with dimensions X × Y. Our aim is to find the next frame of the page defined by the new coordinates, as demonstrated in Figure 4.6. A constant threshold value for the white percentage is assumed for this frame.

Next, the horizontal edges of the second frame are detected by calculating WP_U(l) for each row l from 0 to Y-1 using formula (1). When WP_U(l) satisfies the threshold for five consecutive rows, the detection of the upper limit stops and l is identified as the upper limit. WP_D(l) is then calculated for each row l from Y-1 down to 0 using formula (2). When WP_D(l) satisfies the threshold for five consecutive rows, the detection of the lower limit stops and l is identified as the lower (down) limit.

Next, the vertical edges of the second frame are detected by calculating WP_L(l) for each column l from 0 to X-1 using formula (3). When WP_L(l) satisfies the threshold for five consecutive columns, the detection of the left limit stops and l is identified as the left limit. WP_R(l) is then calculated for each column l from X-1 down to 0 using formula (4). When WP_R(l) satisfies the threshold for five consecutive columns, the detection of the right limit stops and l is identified as the right limit.

The frame is detected once the horizontal and vertical frame edges have been found. The new image is then cropped at the point (left, upper) with size (right - left) × (down - upper). The next page frame is identified and generated using this cropping operation, as shown in Figure 4.6.


Figure 4.6: The process for removing the Illumination from Al-Quran page
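A small sketch of the "five consecutive rows" rule used in this frame is given below. It is an assumption about how the elided condition might be coded (here the condition is taken to be "the row is mostly white"), and it reuses the row white-percentage helper from the Frame 1 sketch, assuming both classes sit in the same package.

public class Frame2Helper {
    // Scan rows from the top and return the first row at which the white-
    // percentage condition has held for five consecutive rows. The comparison
    // direction is an assumption; the thesis text does not reproduce it.
    public static int upperLimit(int[][] binary, double threshold) {
        int run = 0;
        for (int l = 0; l < binary.length; l++) {
            if (FrameDetector.rowWhitePercentage(binary, l) >= threshold) {
                run++;
                if (run == 5) return l;   // l is identified as the upper limit
            } else {
                run = 0;                  // the run must be uninterrupted
            }
        }
        return 0;                          // fallback if no such run is found
    }
}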

iii. Frame 3 (the interior blank space):

In the third frame step, the interior blank space of the page is removed. The flowchart for this detection and removal is shown in Figure 4.4. In order to achieve this, the result of the previous frame is processed to calculate the total frequency of the zero value (white color) in the new image. Consider the new input image with dimensions X × Y. Our aim is to find the next frame of the page defined by the new coordinates, as demonstrated in Figure 4.7. A constant threshold value for the white percentage is assumed for this frame.

Next, the horizontal edges of the third frame are detected by calculating WP_U(l) for each row l from 0 to Y-1 using formula (1). When WP_U(l) reaches the threshold, the detection of the upper limit stops and l is identified as the upper limit. WP_D(l) is then calculated for each row l from Y-1 down to 0 using formula (2). When WP_D(l) reaches the threshold, the detection of the lower limit stops and l+1 is identified as the lower (down) limit.

Next, the vertical edges of the third frame are detected by calculating WP_L(l) for each column l from 0 to X-1 using formula (3). When WP_L(l) reaches the threshold, the detection of the left limit stops and l is identified as the left limit. WP_R(l) is then calculated for each column l from X-1 down to 0 using formula (4). When WP_R(l) reaches the threshold, the detection of the right limit stops and l+1 is identified as the right limit.

The frame is detected once the horizontal and vertical frame edges have been found. The new image is then cropped at the point (left, upper) with size (right - left) × (down - upper). The Al-Quran page is identified and the page without illumination is generated using this cropping operation, as shown in Figure 4.7.

Figure 4.7: The process for removing the interior blank space from Al-Quran page

4.3.3 Text Line Segmentation

Once the Al-Quran pages have been segmented, we proceed to segment all text lines from the page image. Our aim is to save each text line as an image without any blank space. The flowchart for text line segmentation is shown in Figure 4.8. Consider the new input image with dimensions X × Y. Our aim is to find the frame of every text line on the page, defined by new coordinates, as demonstrated in Figure 4.9.

Next, the horizontal edges of each text line are detected by calculating the white percentage of each row l from 0 to Y-1, as defined in formula (1). If the white percentage of row l satisfies the condition assumed for line detection, we continue to the next row. Otherwise, if at least one blank row precedes it, the upper pointer is shifted to this row; otherwise, the new image is cropped at the point (0, upper) with size X × (down - upper). The previous steps are repeated and each line of the 'ayat' of Al-Quran is saved until the end of the Al-Quran page. Using this technique, we produce images that contain all the lines of each page, as shown in Figure 4.9.


Figure 4.8: Flowchart for text line segmentation


Figure 4.9: The process for text line segmentation
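The sketch below illustrates one possible reading of the line-scan procedure above: rows whose white percentage is at or above a placeholder threshold are treated as blank, and each maximal run of non-blank rows is cropped from the page and collected as one text line image. The threshold, the blank-row interpretation and all names are assumptions rather than the thesis implementation, and the row white-percentage helper from the Frame 1 sketch is reused.

import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

public class TextLineSegmenter {
    // Split the binarized page into text-line images. 'binary' uses 0 for white
    // and is assumed to have the same dimensions as 'page'; 'blankThreshold'
    // is a placeholder value.
    public static List<BufferedImage> segmentLines(int[][] binary, BufferedImage page,
                                                   double blankThreshold) {
        List<BufferedImage> lines = new ArrayList<>();
        int height = binary.length;
        int upper = -1;                                   // start row of the current text line
        for (int l = 0; l < height; l++) {
            boolean blank = FrameDetector.rowWhitePercentage(binary, l) >= blankThreshold;
            if (!blank && upper < 0) {
                upper = l;                                // first text row after a blank gap
            } else if (blank && upper >= 0) {
                // The blank row closes the current line: crop rows [upper, l).
                lines.add(page.getSubimage(0, upper, page.getWidth(), l - upper));
                upper = -1;
            }
        }
        if (upper >= 0) {                                 // last line reaches the page bottom
            lines.add(page.getSubimage(0, upper, page.getWidth(), height - upper));
        }
        return lines;
    }
}

Each returned sub-image could then be written to disk, for example with javax.imageio.ImageIO.write(image, "png", file), which corresponds to the "save image of each line" step in Figure 4.2.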

4.4 Conclusion

The page and text line segmentation method has been elaborated in this chapter. The methods applied to Al-Quran pages were categorized into two parts: page segmentation and text line segmentation. Both parts follow a similar segmentation process. It can be concluded from the work done for this project that the method produces promising results for segmenting Al-Quran pages and lines without missing any diacritical marks. Therefore, the segmentation process preserves the holiness of Al-Quran.

CHAPTER 5

RESULT AND TESTING

5.1 Introduction

This chapter describes the testing and evaluation procedures performed throughout the development process and when examining the final version of the application. It also includes a discussion of the results, which outlines the criteria for the success of the project. The chapter concludes with an overview of the testing results with regard to segmentation without missing words or diacritical marks.

5.2 Testing

Testing occurred throughout the various stages of the application development in order to ensure adequate performance during page segmentation. The testing includes checks for both the words and the diacritical marks. Concerning the correctness of the software, we carried out informal tests for every completed page. These tests focus on Al-Quran page segmentation in particular, so as to ensure that all pages have been segmented and validated and do not contain any missing words or diacritical marks. We performed many experimental tests to verify the validity of the proposed methodology, each time changing the segmentation percentage to suit all Al-Quran writing styles. We used different styles of Al-Quran and different pages. In order to calculate the precision of the segmentation, we calculated the density of the illumination for each page frame in every Al-Quran writing style and then extracted their average values.

5.3 Questionnaire

We conducted a questionnaire to evaluate the performance of the application. The questionnaire included two sections. The first section contains demographic and personal questions, such as gender, race, country and study category. The second section includes two questions to evaluate the results of the application, i.e. to verify that no words or diacritical marks are missing after the segmentation process. We distributed the questionnaire to 16 students at Universiti Teknikal Malaysia Melaka (UTeM), 12 male and 4 female, as shown in Figure 5.1.

Figure 5.1: Pie Chart for the Gender (75% male, 25% female)

The following figures show the demographic information of the study sample: race (Figure 5.2), country (Figure 5.3), student category (Figure 5.4) and faculty (Figure 5.5).

Figure 5.2: Pie Chart for the Race (Malay, Arab, Kadazandusun, Indian, Chinese)

Figure 5.3: Pie Chart for the Country (Malaysia, Jordan, China, Iraq, Yemen)

Figure 5.4: Pie Chart for the Student Category (PhD, Master, Degree, Diploma)

Figure 5.5: Pie Chart for the Faculty (FTMK, FKEKK, FKM, FKE, FPTT, FKP)

Figure 5.6 illustrates the evaluation results after applying all segmentation steps.

Figure 5.6: Column Chart for Segmentation (number of respondents reporting no missing words and no missing diacritical marks, for page segmentation and for text line segmentation)

Figure 5.7 illustrates the evaluation results after applying only the first step of the segmentation (page segmentation), in which the illumination is removed.

Figure 5.7: Pie Chart for Page Segmentation - No Missing Words / No Missing Diacritical Marks (100% "Yes", 0% "No")

Figure 5.8 illustrates the evaluation results after applying only the second step of the segmentation (text line segmentation), in which the text lines are detected.

Figure 5.8: Pie Chart for Text Line Segmentation - No Missing Words / No Missing Diacritical Marks (100% "Yes", 0% "No")

5.4 Result

After several tests and after analyzing our survey results, we concluded that the following results are the best we have obtained: the desired output is produced without missing any text or diacritical marks from the Al-Quran pages. Figure 5.9 shows the user interface that we built in the Java language; it provides the following options: upload image(s), binarization (Figure 5.10), and saving the image as a page, a text line, or both (Figure 5.11).

Figure 5.9: User Interface for Selection File(s)

Figure 5.10: User Interface for Binarization

Figure 5.11: User Interface for Save Pages and Line


Figure 5.12 illustrates the output files for each copy of Al-Quran.

Figure 5.12: The Output Files

Figure 5.13 illustrates the results of a sample page after applying only page segmentation, in which the illumination is removed.

Figure 5.13: Result of Page Segmentation

Figure 5.14 illustrates the results of a sample page after applying only text line segmentation, in which the text lines are detected.

Figure 5.14: Result of Text Line Segmentation

5.5 Conclusion

The tests and results shown above demonstrate the success of the segmentation process for several copies of Al-Quran and the saving of the images as text lines and pages without illumination. This means that the proposed methodology can be applied to Al-Quran pages without missing any word or diacritical mark. Therefore, the segmentation process preserved the holiness of Al-Quran.

CHAPTER 6

CONCLUSION AND FUTURE WORK

6.1 Introduction

This is the last chapter of the project. It summarizes the most important achievement of this project, which is segmenting Al-Quran pages. It summarizes the phases of the project that lead to the results, as well as the limitations of the project, and it concludes the research conducted while recommending directions for further research or future work. The chapter begins with a summary of the project, the limitations of the research are given in the following section, and further research recommendations are presented in the final section.

6.2 Summary

This project is made up of six chapters. Chapter 1 served as an introduction to the research problem, outlining the objectives and the scope of the project. The research presented in this project is concerned with digital image processing, focusing on segmenting images of Al-Quran pages, and its purpose includes reviewing previous research on segmentation methods used for Arabic/Jawi handwritten texts.

Chapter 2 presented the background on image processing and segmentation for Arabic/Jawi handwritten text images and other text images. The images used in this research are images of Al-Quran pages. Previous research on text image segmentation was elaborated, but it did not focus on Al-Quran pages and diacritical marks. The main aim of this study is to find a better segmentation technique.

Chapter 3 discussed the methodology used to solve the research problem. Two phases were identified and used in this study: the investigation phase and the implementation phase. The investigation phase includes summarizing the research problem, explaining the process of image segmentation for Al-Quran pages, and identifying the techniques used in previous image segmentation work. The task framework was then presented to indicate the overall implementation of the study; its objective is to divide the method used into six steps, from which the best-practice methods for page and text line segmentation can be obtained. Finally, the use of Java throughout the study as a research tool was mentioned.

Chapter 4 presented the design and development of the system and the task framework used for this research. Among others, it explained the requirements determination and structuring activities based on the research methodology discussed in Chapter 3, and the proposed method was also discussed. The page and text line segmentation method was described in detail. The techniques applied to Al-Quran pages were classified into two parts, page segmentation and text line segmentation, and both follow a similar segmentation process.

Chapter 5 presented the findings and results for the techniques used in segmenting Al-Quran pages. Based on the results, the techniques have proven to give the best segmentation outcome. It can be concluded from the work done for this project that the method produces promising results for segmenting Al-Quran.

6.3 Limitation of the Project

A few limitations have been identified:

- This technique cannot be applied to the first two pages, namely 'Surat Al-Fatihah' and the first page of 'Surat Al-Baqarah' in Al-Quran, whose writing style comes with a circular border for these two pages, due to the short time period.
- The method uses a fixed constant value in the segmentation process.

6.4 Future Works / Further Research

Further research might be directed towards the following:

- Extending the method to all Al-Quran pages regardless of the writing style (circular border).
- Extending the methods applied in the segmentation process for images of Al-Quran pages. In the future, a dynamic value can be applied in the segmentation methods.

REFERENCES

Agrawal, M. and Doermann, D., 2013. Clutter noise removal in binary document images. International Journal on Document Analysis and Recognition, 16(4), pp.351–369.

Al-Emami, S. and Usher, M., 1990. On-line recognition of handwritten Arabic characters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7), pp.704–710.

Alimi, a. M., 1997. An evolutionary neuro-fuzzy approach to recognize on-line Arabic handwriting. Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1.

Al-Muhtaseb, H. a., Mahmoud, S. a. and Qahwaji, R.S., 2008. Recognition of off-line printed Arabic text using Hidden Markov Models. Signal Processing, 88(12), pp.2902– 2912.

Arivazhagan, M., 2007. A statistical approach to line segmentation in handwritten documents. Document Recognition and Retrieval XIV, Proceedings of SPIE, San Jose, CA, USA, 6500, pp.65000T–1–11.

Ávila, B.T. and Lins, R.D., 2004. A new algorithm for removing noisy borders from monochromatic documents. Proceedings of the 2004 ACM symposium on Applied computing - SAC ’04, p.1219.

Azmi, M.S., 2013. Fitur Baharu Dari Kombinasi Geometri Segitiga dan Pengezonan utk Paleografi Jawi Digital.


Baird, H.S., 1994. Background structure in document images. Document Image Analysis, pp.17–34.

Basy, S. et al., 2008. Text line extraction from multi-skewed handwritten documents. Proceedings of the 27th Chinese Control Conference, CCC, 40, pp.412–415.

Bidgoli, a. M. and Boraghi, M., 2010. A language independent text segmentation technique based on naive bayes classifier. 2010 International Conference on Signal and Image Processing, pp.11–16.

Breuel, T.M., 2002. Two Algorithms for Geometric Layout Analysis. Proceedings of the Workshop on Document Analysis Systems, Princeton, NJ, USA. 2002 pp. 188–199.

Bruzzone, E. and Coffetti, M.C., 1999. An algorithm for extracting cursive text lines. Proceedings of the Fifth International Conference on Document Analysis and Recognition. ICDAR ’99 (Cat. No.PR00318), pp.2–5.

Câmara, G., Souza, R.C.M., Freitas, U.M. and Garrido, J., 1996. Spring: Integrating remote sensing and gis by object-oriented data modelling. Computers and Graphics (Pergamon), 20(3), pp.395–403.

Cheung, A., Bennamoun, M. and Bergmann, N.W., 2001. Arabic optical character recognition system using recognition-based segmentation. Pattern Recognition, 34(2), pp.215–233.


Dey, S., Mukhopadhyay, J., Sural, S. and Bhowmick, P., 2012. Margin Noise Removal From Printed Document Images. Workshop on Document Analysis and Recognition, (iv), pp.86–93.

Du, X., Pan, W. and Bui, T.D., 2008. Text line segmentation in handwritten documents using Mumford-Shah model. Pattern Recognition, 42(12), pp.3136–3145.

Duda, R.O. and Hart, P.E., 1972. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1), pp.11–15.

Fan, K.C., Wang, Y.K. and Lay, T.R., 2002. Marginal noise removal of document images. Pattern Recognition, 35(11), pp.2593–2611.

Feldbach, M. and Tonnies, K.D., 2001. Line detection and segmentation in historical church registers. Proceedings of Sixth International Conference on Document Analysis and Recognition.

Fletcher, L.A. and Kasturi, R., 1988. Robust algorithm for text string separation from mixed text/graphics images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6), pp.910–918.

Gaceb, D., Lebourgeois, F. and Duong, J., 2013. Adaptative smart-binarization method: For images of business documents. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pp.118–122.


Gatos, B., Pratikakis, I. and Perantonis, S.J., 2006. Adaptive degraded document image binarization. Pattern Recognition, 39(3), pp.317–327.

Hough, P.V.C., 1962. Method and means for recognizing complex patterns.

Kennard, D.J. and Barrett, W.A., 2006. Separating lines of text in free-form handwritten historical documents. Proceedings - Second International Conference on Document Image Analysis for Libraries, DIAL 2006, pp.12–23.

Khattab, D., Theobalt, C., Hussein, A.S. and Tolba, M.F., 2014. Modified GrabCut for human face segmentation. Ain Shams Engineering Journal, 5(4), pp.1083–1091.

Kise, K., Sato, A. and Iwata, M., 1998. Segmentation of Page Images Using the Area Voronoi Diagram. Computer Vision and Image Understanding, 70(3), pp.370–382.

Le, D.X., Thoma, G.R. and Wechsler, H., 1996. Automated borders detection and adaptive segmentation for binary document images. Proceedings - International Conference on Pattern Recognition, 3, pp.737–741.

Lemaitre, A. and Camillerapp, J., 2006. Text line extraction in handwritten document with Kalman Filter applied on low resolution image. Proceedings - Second International Conference on Document Image Analysis for Libraries, DIAL 2006, 2006, pp.38–45.

Li, Y., Zheng, Y., Doermann, D. and Jaeger, S., 2008. Script-independent text line segmentation in freestyle handwritten documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8), pp.1313–1329.


Likforman-Sulem, L., Hanimyan, a. and Faure, C., 1995. A Hough based algorithm for extracting text lines in handwritten documents. Proceedings of 3rd International Conference on Document Analysis and Recognition, 2, pp.774–777.

Lings, M., 1998. The Quranic Art of Calligraphy and Illumination,

Louloudis, G., Gatos, B., Pratikakis, I. and Halatsis, C., 2008. Text line detection in handwritten documents. Pattern Recognition, 41(12), pp.3758–3772.

Makridis, M., Nikolaou, N. and Gatos, B., 2007. An efficient word segmentation technique for historical and degraded machine-printed documents. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, 1(Icdar), pp.178–182.

Manmatha, R. and Rothfeder, J.L., 2005. A scale space approach for automatically segmenting words from historical handwritten documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), pp.1212–1225.

Nafchi, H.Z., Moghaddam, R.F. and Cheriet, M., 2013. Application of phase-based features and denoising in postprocessing and binarization of historical document images. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pp.220–224.

NAGY, G. and SETH, S., 1984. Hierarchical representation of optically scanned documents. Proceedings of International Conference on Pattern Recognition, pp.347–349.


Nasrudin, M.F., Omar, K., Choong-Yeun, L. and Zakaria, M.S., 2010. Pengecaman aksara jawi menggunakan jelmaan surih. Sains Malaysiana, 39(2), pp.291–297.

Nicolaou, a. and Gatos, B., 2009. Handwritten text line segmentation by shredding text into its lines. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pp.626–630.

Nicolas, S., Paquet, T. and Heurte, L., 2004. Text line segmentation in handwritten document using a production system. Proceedings - International Workshop on Frontiers in Handwriting Recognition, IWFHR, pp.245–250.

Ntirogiannis, K., Gatos, B. and Pratikakis, I., 2013. Performance evaluation methodology for historical document image binarization. IEEE Transactions on Image Processing, 22(2), pp.595–609.

O’Gorman, L., 1993. Document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(11), pp.1162–1173.

Omar, K., 2000. Pengecaman Tulisan Tangan Teks Jawi Menggunakan Penkelas Multiaras. Universiti Putra Malaysia.

Papavassiliou, V., Simistira, F., Katsouros, V. and Carayannis, G., 2012. A morphology based approach for binarization of handwritten documents. Proceedings - International Workshop on Frontiers in Handwriting Recognition, IWFHR, pp.577–581.


Parker, J., Frieder, O. and Frieder, G., 2013. Automatic enhancement and binarization of degraded document images. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pp.210–214.

Patvardhan, C., K. Verma, a. and V. Lakshmi, C., 2012. Denoising of Document Images using Discrete Curvelet Transform for OCR Applications. International Journal of Computer Applications, 55(10), pp.20–27.

Phillips, P., McCabe, R. and Chellappa, R., 1998. Biometric image processing and recognition. European Signal Processing Conference.

Pu, Y. and Shi, Z., 1998. A natural learning algorithm based on hough transform for text lines extraction in handwritten documents. Proceedings of the 6th International Workshop on Frontiers in Handwriting Recognition, pp.637–646.

Qadir, M.A. and Ahmad, I., 2006. Digital text watermarking: Secure content delivery and data hiding in digital documents. IEEE Aerospace and Electronic Systems Magazine, 21(11), pp.18–21.

Rabeux, V., Journet, N., Vialard, A. and Domenger, J.P., 2013. Quality evaluation of ancient digitized documents for binarization prediction. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, pp.113–117.

Rao, K.M.M., 2004. Overview of Image Processing. Proceedings of a workshop on image processing and pattern recognition, pp.1–7.


Roy, P., Pal, U. and Lladós, J., 2008. Morphology based handwritten line segmentation using foreground and background information. Conference on Frontiers in Handwriting , pp.5–10.

Saabni, R., Asi, A. and El-Sana, J., 2014. Text line extraction for historical document images. Pattern Recognition Letters, 35(1), pp.23–33.

Sauvola, J. and Pietikäinen, M., 2000. Adaptive document image binarization. Pattern Recognition, 33(2), pp.225–236.

Sehad, A., Chibani, Y., Cheriet, M. and Yaddaden, Y., 2013. Ancient degraded document image binarization based on texture features. , (Ispa), pp.182–186.

Shi, Z. and Govindaraju, V.G.V., 2004. Line separation for complex document images using fuzzy runlength. First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings.

Shi, Z.S.Z., Setlur, S. and Govindaraju, V., 2005. Text extraction from gray scale historical document images using adaptive local connectivity map. Eighth International Conference on Document Analysis and Recognition (ICDAR’05).

Stamatopoulos, N., Gatos, B. and Perantonis, S.J., 2009. A method for combining complementary techniques for document image segmentation. Pattern Recognition, 42(12), pp.3158–3168.


Stathis, P., Kavallieratou, E. and Papamarkos, N., 2008. An evaluation survey of binarization algorithms on historical documents. 2008 19th International Conference on Pattern Recognition, pp.2–5.

Su, B., Lu, S. and Tan, C., 2012. A learning framework for degraded document image binarization using Markov random field. Pattern Recognition (ICPR), 2012 21st , (Icpr), pp.13–16.

Su, B., Lu, S. and Tan, C.L., 2011. Combination of document image binarization techniques. Proceedings of the International Conference on Document Analysis and Recognition, ICDAR. 2011 pp. 22–26.

Tajabadi, R., Mashayekhi, K. and Shabani, S., 2009. Illumination position in the growth of Islamic Art. Paper presented at the first national conference on Shiite arts.

Wagdy, M., Faye, I. and Rohaya, D., 2013. Fast and Efficient Document Image Clean Up and Binarization Based on Retinex Theory. , pp.8–10.

Wahl, F.M., Wong, K.Y. and Casey, R.G., 1982. Block segmentation and text extraction in mixed text/image documents. Computer Graphics and Image Processing, 19(1), p.94.

Weliwitage, C., Harvey, A.L. and Jennings, A.B., 2005. Handwritten document offline text line segmentation. Proceedings of the Digital Imaging Computing: Techniques and Applications, DICTA 2005. 2005 pp. 184–187.


Yin, F. and Liu, C.L., 2008. Handwritten text line extraction based on minimum spanning tree clustering. Wavelet Analysis and Pattern Recognition, 2007. ICWAPR’07. International Conference on, 3, pp.1123–1128.


APPENDICES

Appendix A Questionnaire

FAKULTI TEKNOLOGI MAKLUMAT DAN KOMUNIKASI

ILLUMINATION REMOVAL AND TEXT SEGMENTATION FOR AL-QURAN USING BINARY REPRESENTATION

A) PERSONAL BACKGROUND

1. Gender *: Male / Female

2. Race *: Malay / Indian / Chinese / Arab / Others (…….......)

3. Country *: Malaysia / China / Jordan / Iraq / Others (................)

4. Student Category *: PhD / Master / Degree / Diploma / Others (…...……)

5. Faculty *: FTMK / FKM / FKEKK / FKE / FPTT / FKP / Others (………)

B) STUDENTS’ EVALUATION ON OUR SYSTEM RESULT

1. Refer to Table A.3: the left side shows the original image and the right side the segmented page. Please evaluate whether any words or diacritical marks (vowel, vocalization sign) are missing, and tick (√) the appropriate box.

Table A.1: Question 1

Page #   No Missing Words   No Missing Diacritical
10       Yes / No           Yes / No
11       Yes / No           Yes / No
603      Yes / No           Yes / No
604      Yes / No           Yes / No

2. In case of a “No” answer, please give a short explanation:
………………………………………………………………………………………………… …………………………………………………………………………………………………

3. Refer to Table A.4: the left side shows the original image and the right side the segmented text lines. Please evaluate whether any words or diacritical marks (vowel, vocalization sign) are missing, and tick (√) the appropriate box.

Table A.2: Question 2

Page #   Line #   No Missing Words   No Missing Diacritical
10       1        Yes / No           Yes / No
10       2        Yes / No           Yes / No
10       3        Yes / No           Yes / No
10       4        Yes / No           Yes / No
10       5        Yes / No           Yes / No
10       6        Yes / No           Yes / No
10       7        Yes / No           Yes / No
10       8        Yes / No           Yes / No
10       9        Yes / No           Yes / No
10       10       Yes / No           Yes / No
10       11       Yes / No           Yes / No
10       12       Yes / No           Yes / No
10       13       Yes / No           Yes / No
10       14       Yes / No           Yes / No

4. In case of a “No” answer, please give a short explanation:
………………………………………………………………………………………………… …………………………………………………………………………………………………

Table A.3: For Question 1 (original image and image of segmentation for pages 10, 11, 603 and 604)

Table A.4: For Question 2 (original image of page 10 and images of segmentation for text lines 1 to 14)

Appendix B RESULT
