Retrieval of Historical Documents by Word Spotting

Nikoleta Doulgeri, Ergina Kavallieratou
University of the Aegean, 83200 Samos, Greece
[email protected]

ABSTRACT

The implementation of word spotting is not an easy procedure, and it becomes even harder in the case of historical documents, since it requires character recognition and indexing of the document images. A general technique for word spotting is presented, independent of OCR, that automatically represents the user's text queries as word images and compares them with the word images extracted from the document images. The proposed system does not require training; the only required preprocessing task is the determination of the alphabet. Global shape features are used to describe the words. They are general enough to capture the form of the word and appropriately normalized in order to cope with the usual problems of variance in resolution, word width and fonts. A novel normalization technique that makes use of interpolation is presented. In our experiments, we analyze the dependence of the system on its parameters and show that its performance is similar to that of trainable systems.

Keywords: document analysis, document image processing, document indexing, word spotting

1. INTRODUCTION

The volume of digitized collections of document images has increased significantly in recent years, since digitization means and procedures have improved: they do not affect the documents, help preserve them better, and present them in a more attractive way. This was mostly in the interest of historical documents, whose maintenance and study was a big headache for most libraries: they could not avoid paper deterioration due to natural causes such as time and humidity, while at the same time they had to impose strict rules on their use by researchers, since frequent handling could damage them further. The digitization of those documents and their availability in digital form solved many of these problems and, at the same time, offered an alternative way to make them available to scholars and researchers. On the other hand, it gave new impulse to an old problem of document processing, document image retrieval.

Document image retrieval has always been of interest to libraries, institutions and even companies that possess large volumes of document images in digital form. At first, the task was handled with keywords that someone had to index manually, an expensive and unreliable procedure. Thus, a new research field was created, that of document image retrieval, in order to solve the task automatically. A technique widely used in this direction is word spotting.

Word spotting is a term borrowed from the speech recognition field and introduced to document image retrieval by Manmatha [1]. It consists in localizing the parts of the text specified by query words, by comparing them with a user-provided template of the text image. It is a generic approach that can be applied to any document written in any language, using any alphabet, pictographs or ideograms. The implementation of word spotting is not an easy procedure, since the matching procedure, even in printed text, has to deal with specific problems: type of fonts, size of fonts, size of words, resolution, printing quality etc. It becomes even harder in the case of historical documents, where more difficult situations have to be faced, such as paper deterioration that introduces noise, bad printing quality that creates uneven fonts and spaces, uneven illumination etc. As Baird [2] mentions, most published methods for retrieval of document images first attempt recognition and transcription, followed by indexing and search operating on the resulting encoded text. He states that at OCR character error rates below 5%, information retrieval methods suffer little loss of either recall or precision, while at error rates above 20%, both recall and precision degrade significantly.

A detailed survey on word spotting and information retrieval from document images up to 1997 can be found in Doermann [3]. Historically, the use of an index of descriptors for each document, provided manually by experts, was the first approach to the problem [4]. Next, with the improvement of the character recognition field, OCR packages were applied to documents in order to convert them to text. Thus, Edwards et al. [5] described an approach that transcribes and retrieves Medieval Latin manuscripts with generalized Hidden Markov Models. Their hidden states correspond to characters and the space between them. One training instance is used per character and character n-grams are used, yielding a transcription accuracy of 75%. Motivated by the fact that the OCR accuracy requirements for information retrieval are considerably low, methods able to tolerate OCR recognition errors have also been developed [6].

More recently, with the improvement of the document image processing (DIP) field, techniques that make use of images instead of OCR were also introduced. Leydier [7] uses DIP techniques to create a pattern dictionary for each document and then performs word spotting using the gradient angle as a feature and a matching algorithm. Kolcz et al. [8] described an approach for retrieving handwritten documents using word image templates. Their word image comparison algorithm is tested on matching the provided templates to segmented manuscript lines from the Archive of the Indies collection. Konidaris et al. [9] propose a technique for keyword-guided word spotting in historical printed documents. They use synthetic word images as queries and perform word segmentation using dynamic parameters and hybrid feature extraction. Finally, they use user feedback to optimize the retrieval. Matching of whole words in printed documents is also performed by Balasubramanian [10]. In this approach, a Dynamic Time Warping (DTW) based partial matching scheme is used to overcome the morphological differences between the words. A similar technique is used in the case of historical documents [11], where noisy handwritten document images are preprocessed into one-dimensional feature sets and compared using the DTW algorithm. Rath [12] presents a method for retrieval from large collections of handwritten historical documents using statistical models. Using a word image matching algorithm, he clusters occurrences of the same word in a collection of handwritten documents. When clusters that contain index terms are labeled, a partial index can be built for the document corpus, which can then be used for ASCII querying. Lu [13] presents an approach with the capability of searching for a word portion in document images. A feature string is synthesized according to the character sequence in the user-specified word, and each word image extracted from the documents is represented by the corresponding feature string. Then, an inexact string matching technique is used to measure the similarity between the two feature strings.

Although the above mentioned works can give very good results in the specific applications they were built for, some of them require extensive manual work and can only process predefined keywords, while many of them require extensive training. As a solution to these problems we propose a general technique for word spotting, independent of OCR: the representation of the user queries by word images and their comparison with the words in the document images.
The whole idea is based on the global shape features used to describe the words. They are general enough to capture the form of the word and appropriately normalized in order to cope with the usual problems of different resolutions, word sizes and font styles. In Section 2, the system is presented and the feature extraction methodology is analyzed further; experimental results are given in Section 3, and we conclude in Section 4.

2. SYSTEM PRESENTATION

Our system is presented in Fig. 1. The binarization procedure is performed by the algorithm presented in [14], while the tasks of skew angle correction, line segmentation, slant correction and word segmentation are the ones presented in [15], used in a general-purpose system for document image processing. The slant correction task is not always necessary, but one of the books used in our experiments (§3) included a lot of text in italics. The remaining tasks: query word synthesis, word cleaning, feature extraction, smoothing and normalization, as well as the comparison procedure, are presented in the next subsections.

2.1 Query Word Synthesis

The word in search, the query, is given by the user in ASCII. In order to compare it with word images, we need to transform it into an image. The only preprocessing our system requires is the determination of the alphabet in bitmap form. However, we consider this a trivial procedure, since it takes less than a minute.

Fig. 1. The proposed system. (Pipeline: the document images pass through binarization, skew correction, line segmentation, slant correction, word segmentation and word cleaning, while the text query passes through word synthesis; both branches then go through feature extraction, smoothing and normalization before the comparison.)

The alphabet is determined automatically, by introducing images like the one of Fig. 2. The characters are written in MS Word and then transformed into PDF. When such an image is introduced into our system, it is transformed into a bitmap and then segmented into characters automatically. Each character image is saved in the alphabet directory under a filename of the form x.bmp, where x is a character from a to z for the lowercase characters, or a double character from aa to zz for the corresponding uppercase characters. As the character order in the image is fixed (alphabetic order is assumed), the automatic correspondence is easy. Moreover, it is easy to change the fonts and repeat the procedure for different fonts or styles (bold, normal).

Fig. 2. Alphabet determination image.

When a text query is given by the user, the text is transformed into a bmp image by using, for each character x, the corresponding x.bmp representation if it is a lowercase character, or xx.bmp if it is uppercase. Between two consecutive characters, a space of two blank pixel columns is added (Fig. 3). However, the blank width is not that important, since a normalization step is included in our system. The system has been built for English; however, it can work for different alphabets or symbols as well, if the appropriate correspondence is determined in the same way. Our experiments (§3) showed that bold characters perform better than normal ones and that the resolution of the PDF affects the performance.

Fig. 3. Examples of bmp images synthesized from text queries.
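As a rough illustration of this synthesis step, the following Python sketch (NumPy and Pillow are our own choice of tools; the paper does not specify an implementation) concatenates per-character bitmaps named x.bmp / xx.bmp from a hypothetical alphabet/ directory, inserting two blank columns between consecutive characters. Baseline alignment between characters is ignored for brevity.

```python
import numpy as np
from PIL import Image

def load_glyph(ch, alphabet_dir="alphabet"):
    """Load the bitmap of a character: x.bmp for lowercase, xx.bmp for uppercase."""
    name = ch.lower() * 2 if ch.isupper() else ch
    img = Image.open(f"{alphabet_dir}/{name}.bmp").convert("1")
    return np.array(img, dtype=np.uint8)            # 1 = white, 0 = black

def synthesize_query(word, alphabet_dir="alphabet", gap=2):
    """Concatenate character bitmaps with `gap` blank columns in between."""
    glyphs = [load_glyph(ch, alphabet_dir) for ch in word]
    height = max(g.shape[0] for g in glyphs)
    padded = []
    for g in glyphs:
        # pad every glyph to the common height with white rows at the bottom
        # (proper baseline alignment is ignored in this sketch)
        pad = np.ones((height - g.shape[0], g.shape[1]), dtype=np.uint8)
        padded.append(np.vstack([g, pad]))
    spacer = np.ones((height, gap), dtype=np.uint8)  # two blank columns
    columns = []
    for i, g in enumerate(padded):
        columns.append(g)
        if i < len(padded) - 1:
            columns.append(spacer)
    return np.hstack(columns)

# Example: synthesize the query word "Napoleon" as a binary image
# query_img = synthesize_query("Napoleon")
# Image.fromarray(query_img * 255).save("query.bmp")
```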

2.2 Word Cleaning

After word segmentation, since the methodology we use [15] relies on the first black pixels in order to perform line and word segmentation, a special word cleaning task is included in our system. In the case of historical documents, where a lot of noise can remain in the image after the binarization procedure, we can end up with noisy word images inappropriate for comparison (Fig. 4a). In this task, the vertical and horizontal histograms of the word are calculated. Then we start from the peak value of each histogram (max) and move towards the edges of the word: up and down for the horizontal histogram and left and right for the vertical one. When a zero value is found, before keeping that position as an edge, we make sure that no other big enough value (>max/4) exists in the rest of the histogram; otherwise we keep on scanning. Examples of words cleaned by this procedure are shown in Fig. 4b.

Fig. 4. Words before (a) and after (b) cleaning.
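A minimal sketch of the cleaning step described above, assuming the word is a binary NumPy array with black pixels equal to 1; the function names are ours, not the paper's.

```python
import numpy as np

def find_edges(hist, threshold_ratio=4):
    """Scan from the histogram peak towards both ends; stop at a zero bin
    unless a value > max/threshold_ratio still lies further out."""
    peak = int(np.argmax(hist))
    max_val = hist[peak]
    right = len(hist) - 1
    for i in range(peak, len(hist)):
        if hist[i] == 0 and hist[i:].max() <= max_val / threshold_ratio:
            right = i
            break
    left = 0
    for i in range(peak, -1, -1):
        if hist[i] == 0 and hist[:i + 1].max() <= max_val / threshold_ratio:
            left = i
            break
    return left, right

def clean_word(word_img):
    """Crop a noisy word image using its horizontal and vertical histograms."""
    h_hist = word_img.sum(axis=1)   # black pixels per row
    v_hist = word_img.sum(axis=0)   # black pixels per column
    top, bottom = find_edges(h_hist)
    left, right = find_edges(v_hist)
    return word_img[top:bottom + 1, left:right + 1]
```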

2.3 Feature Extraction

During the feature extraction procedure, each word image, either a query word or a segmented and cleaned word, is represented by a vector of the form:

[h, w, ascs, descs, hist, UProf, DProf, MProf, asc1, asc2, …, 0, desc1, desc2, …, 0]

where:

h: the height of the word image, measured in pixels.

w: the width of the word image, measured in pixels.

ascs: the number of estimated ascenders of the word image. We first estimate the upper and lower baselines of the word image (Fig. 5), by considering the points of the horizontal histogram whose value falls below w/3 [15]. Then we estimate the number of ascenders as the peaks between valleys in the vertical histogram of the upper zone (Fig. 6), with height greater than one third of the upper zone.

Fig. 5. Zones and baselines.

descs: the number of estimated descenders of the word image. They are estimated in the same way as ascs, using the lower zone (Fig. 5) instead of the upper zone of the word image.

Fig. 6. Vertical histogram of the upper zone. Estimation of the ascender areas.

hist: the vertical histogram of the whole word image.

UProf: the upper profile of the word: the first black pixel of each column of the image, starting from above (Fig. 7).

Fig. 7. Upper and lower profiles of a word.

DProf: the lower profile of the word: the first black pixel of each column of the image, starting from below (Fig. 7).

MProf: the middle profile of the word. We cut the word horizontally in the middle between the baselines. We assume that at this height there are fewer black pixels and more open areas of the characters. Thus, by taking the lower profile of the zone above this cut, more information is captured about the internal area of the word.

asci: from each ascender area, as determined above, we keep the position of the peak in the vertical histogram as representative of the ascender. Each position is normalized by dividing it by the width of the word.

desci: in a similar way, from each descender area, we keep the position of the peak in the vertical histogram as representative of the descender. As before, each position is normalized by dividing it by the width of the word. In order to achieve homogeneous vector sizes, we assume a maximum of 10 ascenders and 10 descenders and reserve the corresponding positions in the vector, filling them with zeros when there are fewer.

After a lot of experimentation, the above features proved to capture the best description of the shape of the word (§3).
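The following sketch computes part of this descriptor (height, width, vertical histogram, upper and lower profiles) for a binary word image with black pixels equal to 1. The baseline, ascender and descender estimation and the middle profile are omitted for brevity, and all names are our own.

```python
import numpy as np

def profiles(word_img):
    """Upper and lower profiles: position of the first black pixel per column,
    scanning from the top and from the bottom respectively."""
    h, w = word_img.shape
    upper = np.full(w, h, dtype=float)    # h marks a column with no black pixel
    lower = np.full(w, h, dtype=float)
    for col in range(w):
        rows = np.flatnonzero(word_img[:, col])
        if rows.size:
            upper[col] = rows[0]              # distance from the top
            lower[col] = h - 1 - rows[-1]     # distance from the bottom
    return upper, lower

def extract_features(word_img):
    """Part of the word descriptor: h, w, hist, UProf, DProf."""
    h, w = word_img.shape
    hist = word_img.sum(axis=0).astype(float)  # black pixels per column
    uprof, dprof = profiles(word_img)
    return {"h": h, "w": w, "hist": hist, "UProf": uprof, "DProf": dprof}
```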

2.4 Smoothing

Right after the feature extraction procedure, a smoothing stage follows, in order to deal with small variations due to different fonts and printing styles, as well as possible noise remaining in the word image. We applied smoothing of different degrees: using an average of 3 points (the value of every point is replaced by the average of the values of the point and its 2 neighbours), 5 points (4 neighbours), 7 points (6 neighbours) and 9 points (8 neighbours). The results are presented in Section 3.

2.5 Normalization

The size of the word images can vary significantly, even when they contain the same word, due to different printing styles, fonts and image resolution. In order to compare the extracted vectors, we need a standard vector size and the same order of magnitude, so a normalization stage is included. In fact, the normalization stage consists of two steps: division by value, to achieve a common order of magnitude, and interpolation, to achieve a standard vector size. The division by value is applied to the histogram and the three profiles, which are divided by the height of the word image to normalize their values to the interval [0, 1]. Moreover, the positions of the ascenders and descenders, as already mentioned, are divided by the width of the word image to fit in the same interval.

Fig. 8. The interpolation procedure. (Flowchart: the number of points to add or remove and a step equal to Size/Points are computed; at integer positions the original value f(i) is kept, while at non-integer positions the new value is w1·f(i) + w2·f(i+1), with w2 the decimal part of the position and w1 = 1 − w2.)
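The sketch below illustrates the moving-average smoothing of Section 2.4 and a linear-interpolation resize in the spirit of Fig. 8 (explained in the following paragraph). NumPy's np.interp is used here as a stand-in for the authors' own interpolation routine, so the code is an assumption rather than the exact procedure of the paper; it reuses the feature dictionary from the earlier sketch.

```python
import numpy as np

def smooth(vec, points=5):
    """k-point moving average: each value is replaced by the mean of itself
    and its (points - 1) nearest neighbours."""
    kernel = np.ones(points) / points
    return np.convolve(vec, kernel, mode="same")

def resize(vec, final_size=175):
    """Resample a feature sequence (profile or histogram) to a fixed length.
    Intermediate values are weighted averages of their two neighbours."""
    old_positions = np.linspace(0, 1, num=len(vec))
    new_positions = np.linspace(0, 1, num=final_size)
    return np.interp(new_positions, old_positions, vec)

def normalize(features, final_size=175):
    """Divide the histogram and profiles by the word height, then resample
    them to `final_size` values (a common normalized word width).
    MProf and the ascender/descender positions are omitted in this sketch."""
    h = features["h"]
    out = {}
    for name in ("hist", "UProf", "DProf"):
        out[name] = resize(features[name] / h, final_size)
    return out
```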

Interpolation is a well known method in the field of signal processing and image analysis [16]. It is used to compute intermediate values among a set of values. In our case, we use the interpolation procedure when we wish either to increase or to decrease the vector size. Specifically, we compute the number of points that must be added to the initial vector f, if we wish to increase its length, or the number of points to leave out of the vector f, if we wish to decrease its length. The next step is to find the positions in the initial vector where these points should be added or deleted. The value of every new point is a combination of its neighbours. For example, if a new point must be placed at position 2.6, its value is calculated as the sum of 40% of the value of point 2 and 60% of the value of point 3. The interpolation procedure is shown in Fig. 8. In Section 3, results for different vector sizes are given.

2.6 Comparison

After feature extraction, smoothing and normalization of both the query word and the words of the document images, the vector of the query word is compared to those of the words of the document collection. However, not every single word is compared to the query word. We pose two criteria in order to decide whether a word should be compared or not:

1. The ratio width/height of the word should fall in the interval [qratio−1, qratio+1], where qratio is the corresponding ratio of the query word.

2. The values ascs and descs of the word should fall in the intervals [ascs−1, ascs+1] and [descs−1, descs+1], respectively, of the query word.

Then, as a distance criterion, we use the Manhattan distance. If Q(i) is the vector of the query word and W(i) the vector of the word under examination, their distance is:

d = Σ_{i≥5} |Q(i) − W(i)|

The first four values of the vector are not taken into account, since they have already been examined by the criteria mentioned above.
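A minimal sketch of the comparison step, under the assumption that each descriptor is a NumPy array starting with [h, w, ascs, descs] followed by the normalized features of Section 2.3; the helper names are ours.

```python
import numpy as np

def is_candidate(query_vec, word_vec):
    """Pre-filter: compare only words whose width/height ratio and numbers of
    ascenders/descenders are close to those of the query."""
    qh, qw, q_ascs, q_descs = query_vec[:4]
    wh, ww, w_ascs, w_descs = word_vec[:4]
    q_ratio, w_ratio = qw / qh, ww / wh
    return (abs(w_ratio - q_ratio) <= 1
            and abs(w_ascs - q_ascs) <= 1
            and abs(w_descs - q_descs) <= 1)

def manhattan_distance(query_vec, word_vec):
    """Manhattan (L1) distance over all entries except the first four."""
    return float(np.abs(query_vec[4:] - word_vec[4:]).sum())

def rank_words(query_vec, word_vecs):
    """Return the indices of candidate words sorted by increasing distance."""
    scored = [(manhattan_distance(query_vec, v), i)
              for i, v in enumerate(word_vecs)
              if is_candidate(query_vec, v)]
    return [i for _, i in sorted(scored)]
```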

3. EXPERIMENTAL RESULTS

Here, the experimental data set is described and results are given for different fonts used in the query word synthesis, different degrees of smoothing and different normalized sizes at the interpolation stage. Moreover, the contribution of the features used in the feature vector is presented, and finally the precision-recall curve is given for the specific data.

3.1 Experimental Data

In our experiments, two different books were used in order to find the appropriate values for all the parameters and criteria mentioned above:

1. "Prospetto delle piante che si trovano nell'isola di Cefalonia", Dr Niccolo Dallaporta, Corfu, 1821. This book is written in Italian, but it also includes words in Greek and English. No separation was performed; all the words were handled as candidates. Some of the places mentioned also appear in the second book. Moreover, many of its words are printed in italic fonts, so the slant correction task was also included.

2. "Travels in Italy, Greece and the Ionian Islands", Br. H.W. Williams, Edinburgh, 1820.

Table 1. The query words of the experiments, their occurrences and results.

queries      occ.  res.      queries         occ.  res.
Apollo        2     2        Tusculan         2     2
Charles       4     4        Wallace          2     2
Edinburgh     2     2        agriculture      2     2
Egyptian      1     1        anecdote         1     1
Elba          2     2        apartment        2     1
English       4     4        art              2     2
Falcone       1     1        brother          3     3
Holland       1     1        collection       1     1
King          5     4        company          1     1
Laocoon       1     1        daughter         3     3
Linneo        1     0        denominazione    1     0
Michael       1     1        exhibition       1     1
Naples        1     1        family           4     2
Napoleon      3     3        government       1     1
Prince        3     3        group            3     3
Rome          5     5        houses           1     1
Rotundo       1     1        knowledge        1     1
Rufinella     2     2        language         2     2
Sambuco       1     1        liberty          1     1
Sardinia      1     1        musuem           4     0
Scotch        2     2        nobleman         2     2
Scots         2     2        officers         2     1
Spain         7     7        telescope        2     2
Titus         1     1        uncle            2     2

A set of 10 pages was used, 3 from the first book and 7 from the second, including a total of 2013 words. For our queries, we tried to choose words from both books, as well as some common ones. Typical words that someone could be interested in looking for in a book were chosen. Finally, we tried to cover different word widths. In Table 1, the 48 queries are shown, as well as their occurrences (100 in total) in that set of pages and the number of times each word was found in the best of our results (89%), that is, for smoothing of 5 points, a normalized word width of 175 pixels and Times New Roman (bold) as the query font. Our results were extracted by counting, for each query, the correct words returned among the first N places, where N is the number of occurrences of that word.

3.2 Query Font

In the query word synthesis (§2.1), we experimented with five different fonts, namely Arial, Courier, Helvetica, Palatino Linotype and Times New Roman, using the normal and bold version in each case. The results are shown in Table 2.

Table 2. Results for different fonts.

Font                 Normal   Bold
Arial                 43%     44%
Courier               27%      8%
Helvetica             46%     43%
Palatino Linotype     77%     83%
Times New Roman       87%     89%

3.2.1 PDF Resolution

Next, we experimented with different resolutions of the bitmaps used to insert the font set into the system for the word synthesis task. Specifically, resolutions of 150, 300 and 600 dots per inch were considered. The results for Times New Roman (bold) are shown in Table 3.

Table 3. Results according to resolution.

Resolution   Result
150 dpi      87%
300 dpi      89%
600 dpi      87%

3.3 Smoothing

In our experiments, we used smoothing of 0 (no smoothing), 3, 5, 7 and 9 points. The results are shown in Table 4.

Table 4. Results for different degrees of smoothing.

Points used         Result
0 (no smoothing)    87%
3                   88%
5                   89%
7                   89%
9                   89%

3.4 Normalized Word Width

In the normalization stage, we experimented with normalized word widths from 10 to 250 pixels. Bearing in mind that a resolution of 300 dpi and Times New Roman (bold) were used, a mean character width of about 30 pixels can be assumed, to get an idea of the word width corresponding to each case. The results are presented in Table 5.

Table 5. Results according to the normalized word width.

Word width    Result
10 pixels     67%
30 pixels     86%
50 pixels     87%
100 pixels    87%
150 pixels    87%
175 pixels    89%
200 pixels    89%
250 pixels    88%

3.5 Feature Contribution

In order to appreciate the contribution of each feature of our vector to the result, we tested our system with several of them removed. The graph of Fig. 9 shows the system performance without several of the features. We can conclude that the external profiles are the most valuable feature, while the number of ascenders and descenders proved to be the least important one.

Fig. 9. The contribution of some of our features: system performance without the upper/lower profiles, without the histogram, without the middle profile and without the ascender/descender counts (values ranging from 73% to 86%).

3.6 Precision-Recall Curve

Here, the precision-recall curve is presented for the specific data. Precision is defined as the percentage of correctly retrieved words over all retrieved words, while recall is the percentage of correctly retrieved words over the number of occurrences of the specific words in the pages. The retrieved words were first taken to be the top N, where N is the number of occurrences of each word as shown in Table 1, and then the top 10 per query, 20, 30, etc., up to 100 words per query. The curve is shown in Fig. 10.

Fig. 10. The precision-recall curve (recall ranging approximately from 0.885 to 0.925).
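For illustration, a small sketch of how precision and recall can be computed for a single query at a given cutoff; the data structures (a ranked list of word labels and the occurrence counts of Table 1) are assumptions of ours.

```python
def precision_recall(ranked_labels, query, occurrences, cutoff):
    """Precision and recall for one query, keeping the top `cutoff` results.

    ranked_labels: word labels sorted by increasing distance to the query.
    occurrences:   number of times the query word appears in the pages.
    """
    retrieved = ranked_labels[:cutoff]
    correct = sum(1 for label in retrieved if label == query)
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / occurrences if occurrences else 0.0
    return precision, recall

# Example: evaluate the query "Rome" (5 occurrences in Table 1) at N = 5
# p, r = precision_recall(ranked, "Rome", occurrences=5, cutoff=5)
```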

4. CONCLUSION

A system suitable for word spotting in historical document databases, which requires neither training nor OCR, has been presented. The only preprocessing procedure that our system requires is the determination of the alphabet from a bitmap image. The text query of the user is then transformed automatically into a word image and compared to all the word images extracted from the database. For the comparison, the word images are represented by a feature vector designed to capture as much information as possible about the word shape. Next, the vector is normalized to deal with the variance in fonts and image resolution. A novel normalization technique that makes use of interpolation has been presented. In our experiments, the dependence of the system on several of its parameters was examined. We concluded that our system performs best with Times New Roman fonts and a resolution of 300 dpi, as far as the query transformation procedure is concerned, while at the level of the feature vector description the best performance is achieved using 5-point smoothing and a normalized width around 175-200 pixels.

The performance of our system can be characterized as quite good and similar to that of state-of-the-art systems, bearing in mind that it requires neither training, nor OCR, nor indexing. Here, we demonstrated it using English texts of the early 19th century. However, we strongly believe that it can be used on historical documents in different languages and with different symbols. Our intention is to experiment further in this field.

REFERENCES

[1] R. Manmatha, C. Han, and E.M. Riseman, "Word Spotting: A New Approach to Indexing Handwriting", Proc. Conference on Computer Vision and Pattern Recognition, pp. 631-637 (1996).
[2] H.S. Baird, V. Govindaraju, and D.P. Lopresti, "Document Analysis Systems for Digital Libraries: Challenges and Opportunities", DAS 2004, LNCS 3163, pp. 1-16 (2004).
[3] D. Doermann, "The Indexing and Retrieval of Document Images: A Survey", Computer Vision and Image Understanding, 70(3), pp. 287-298 (June 1998).
[4] G. Salton, "Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer", Addison-Wesley, Reading, MA (1989).
[5] J. Edwards, Y.W. Teh, D. Forsyth, R. Bock, M. Maire, and G. Vesom, "Making Latin Manuscripts Searchable Using gHMM's", Proc. 18th Annual Conference on Neural Information Processing Systems, pp. 385-392 (Cambridge, USA, 2004).
[6] Y. Ishitani, "Model-based Information Extraction Method Tolerant of OCR Errors for Document Images", Proc. Sixth International Conference on Document Analysis and Recognition, pp. 908-915 (Seattle, USA, 2001).
[7] Y. Leydier, F. LeBourgeois, and H. Emptoz, "Textual Indexation of Ancient Documents", DocEng'05, pp. 111-117 (November 2-4, 2005).
[8] A. Kolcz, J. Alspector, M. Augusteijn, R. Carlson, and G.V. Popescu, "A Line-Oriented Approach to Word Spotting in Handwritten Documents", Pattern Analysis & Applications, vol. 3, no. 2, pp. 153-168 (2000).
[9] T. Konidaris, B. Gatos, K. Ntzios, I. Pratikakis, S. Theodoridis, and S.J. Perantonis, "Keyword-Guided Word Spotting in Historical Printed Documents Using Synthetic Data and User Feedback", International Journal on Document Analysis and Recognition, vol. 9, no. 2, pp. 167-177 (2007).
[10] A. Balasubramanian, M. Meshesha, and C.V. Jawahar, "Retrieval from Document Image Collections", Proc. DAS 2006, pp. 1-12 (2006).
[11] T.M. Rath and R. Manmatha, "Word Image Matching Using Dynamic Time Warping", Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 521-527 (2003).
[12] T.M. Rath, R. Manmatha, and V. Lavrenko, "A Search Engine for Historical Manuscript Images", Proc. ACM SIGIR Conference, pp. 369-376 (2004).
[13] Y. Lu and C.L. Tan, "Information Retrieval in Document Image Databases", IEEE Trans. Knowledge and Data Engineering, vol. 16, no. 11, pp. 1398-1410 (Nov. 2004).
[14] E. Kavallieratou, "A Binarization Algorithm Specialized on Document Images and Photos", Proc. 8th International Conference on Document Analysis and Recognition, vol. I, pp. 463-467 (Seoul, 2005).
[15] E. Kavallieratou, N. Fakotakis, and G. Kokkinakis, "An Off-line Unconstrained Handwriting Recognition System", International Journal of Document Analysis and Recognition, no. 4, pp. 226-242 (2002).
[16] E. Meijering, "A Chronology of Interpolation: From Ancient Astronomy to Modern Signal and Image Processing", Proc. IEEE, vol. 90, no. 3, pp. 319-342 (March 2002).