Cleaning and Enhancing Historical Document Images

Ergina Kavallieratou¹ and Hera Antonopoulou²

¹ Department of Information and Communication Systems Engineering, University of the Aegean, 83200 Karlovassi, Samos, Greece
[email protected]
² Computer Technology Institute, 26500 Patras, Greece
[email protected]

Abstract. In this paper we present a recursive algorithm for cleaning and enhancing historical document images. Most algorithms used to clean and enhance documents, or to transform them into binary images, combine complicated image processing techniques, which increases the computational cost and complexity. Our algorithm simplifies the procedure by taking into account special characteristics of document images. Moreover, because the algorithm consists of iterated steps, it is more flexible with respect to the needs of the user. In the experimental results, a comparison with other methods is provided.

1 Introduction

The binarization of images is a long-investigated field with remarkable accomplishments, some of which have also been applied to documents and historical documents. One of the older methods in image binarization is Otsu's [6], based on the variance of pixel intensity. Bernsen [1] calculates local thresholds using the neighbours of each pixel. Niblack [5] uses the local mean and standard deviation. Sauvola [7] presents a method specialised for document images that applies two algorithms in order to calculate a different threshold for each pixel.

As far as the more recent problem of historical documents is concerned, Leedham [3] compares some of the traditional methods on degraded document images, while Gatos [2] builds up a new method by combining existing techniques. The methods of Shi [8] and Yan [9] have been applied to historical documents from the US Library of Congress. Leydier [4] works with colour document images and implements a serialization of the k-means algorithm. Some of the above-mentioned methods have also used specific filters or algorithms for cleaning the document as an additional module.

Historical documents suffer from bad storage conditions and poor contrast between foreground and background due to humidity, paper deterioration and ink seepage. Moreover, the fragility of these documents does not allow access to many researchers, whereas a legible digitised version is more accessible.

In the next section, a description of the algorithm is given, while in section 3 the algorithm is analysed in detail. Some experimental results and a short comparison with traditional binarization methods are described in section 4. Finally, our conclusions are provided in section 5.

J. Blanc-Talon et al. (Eds.): ACIVS 2005, LNCS 3708, pp. 681–688, 2005. © Springer-Verlag Berlin Heidelberg 2005

2 Algorithm Description

As input, we assume greyscale historical document images in which the foreground tones (characters, graphics, etc.) stand out against the background (including spots, stains, wrinkles, etc.). As an example, consider the historical documents of fig. 1. Our images are described by the equation:

I(x, y) = r,   r ∈ [0, 1]   (1)

where x and y are the horizontal and vertical coordinates of the image, and r takes any value between 0 and 1, with r = 1 standing for white and r = 0 for black. Our intention is to transform the intermediate grey tones to either black (r = 0) for the foreground or white (r = 1) for the background.
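For illustration, the model of eq. (1) corresponds to dividing an 8-bit greyscale scan by 255; a minimal NumPy sketch (the tiny synthetic patch stands in for a real scanned page):

```python
import numpy as np

# Stand-in for an 8-bit greyscale scan with values 0..255.
# (A real document would be loaded with an imaging library.)
scan = np.array([[0, 128, 255],
                 [64, 192, 32]], dtype=np.uint8)

# Map to the model of eq. (1): I(x, y) = r, r in [0, 1],
# where r = 1 is white and r = 0 is black.
I = scan.astype(np.float64) / 255.0

print(I.min(), I.max())  # bounds lie in [0, 1]
```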

Fig. 1. Historical Documents before and after the application of our algorithm

The algorithm is based on the fact that a document image includes very few pixels of useful information (foreground) compared to the size of the image (foreground + background). According to our experiments, the black pixels rarely exceed 10% of the total pixels in the document. Taking advantage of this fact, we assume that the average of the pixel values of a document image is determined mainly by the background, even when the document is quite clean. This claim is supported by fig. 2, which depicts the histograms of the above examples, respectively. The same figure shows two thresholds, from our method and from Otsu's method, as well as the average value in each case. It is obvious that the average value always lies on the background side of either threshold.

Using this fact, our method consists of two procedures that are applied alternately. In the first part, the average colour value of the image is calculated and then subtracted from the image; in the second part, we perform histogram equalisation, so that the values of the remaining pixels expand to occupy the full range of greyscale tones. Briefly, the algorithm consists of the following steps:


Fig. 2. The histograms of the corresponding documents of figure 1. The thresholds extracted with the proposed method (--), Otsu’s method (-·) and the average value of the pixels (··).

1. Calculation of the average pixel value (Ti) of the image.
2. Subtraction of Ti from all the pixels of the image.
3. Histogram equalisation.
4. Repetition of steps 1–3 till the Ti − Ti−1
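The iterated procedure above can be sketched in NumPy. This is a sketch under stated assumptions, not the authors' implementation: the clipping of negative values after the subtraction, the 256-bin equalisation, and the stopping rule are our choices (the source text truncates the stopping condition; here we stop when successive averages Ti change by less than `eps`):

```python
import numpy as np

def clean_document(img, eps=1 / 255, max_iter=20):
    """Alternate mean subtraction and histogram equalisation.

    `img` is a greyscale image with values in [0, 1] as in eq. (1).
    `eps` and `max_iter` are hypothetical parameters of this sketch.
    """
    prev_T = None
    for _ in range(max_iter):
        T = img.mean()                        # step 1: average value Ti
        if prev_T is not None and abs(T - prev_T) < eps:
            break                             # step 4: change in Ti is small
        # Step 2: subtract Ti; clipping to [0, 1] is an assumption.
        img = np.clip(img - T, 0.0, 1.0)
        # Step 3: histogram equalisation via the normalised CDF.
        hist, edges = np.histogram(img, bins=256, range=(0.0, 1.0))
        cdf = hist.cumsum().astype(np.float64) / img.size
        img = np.interp(img.ravel(), edges[:-1], cdf).reshape(img.shape)
        prev_T = T
    return img
```

In practice `img` would be the normalised greyscale scan; the returned image again spans [0, 1], with intermediate grey tones pushed toward black foreground and white background.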