A Real-World Noisy Unstructured Handwritten Notebook Corpus for Document Image Analysis Research

Jin Chen
CSE Department, Lehigh University
Bethlehem, PA 18015, USA
[email protected]

Daniel Lopresti
CSE Department, Lehigh University
Bethlehem, PA 18015, USA
[email protected]

Bart Lamiroy*
Nancy Université – LORIA
Campus Scientifique, BP 239
54506 Vandoeuvre Cedex, France
[email protected]

* From January 2010 to July 2011, Bart Lamiroy was a Visiting Scientist in the Department of Computer Science and Engineering at Lehigh University, on an INRIA délégation with the Unité de Recherche Nancy – Grand Est.

ABSTRACT

Traditionally, document image analysis (DIA) is conducted on datasets that are prepared for research purposes. Many existing handwriting datasets, however, do not necessarily represent the range of problems we wish to solve in real life. In this work, we introduce a noisy and unstructured handwriting dataset that aims to promote and evaluate robust document analysis algorithms for real-world challenges, placing particular emphasis on the process of building and curating the dataset. First, we explain the data acquisition process and characterize its critical features as noisy and unstructured. Then, we discuss a set of real-world scenarios that might benefit from using our notebook dataset. As an ongoing activity, we have so far collected 18 handwritten notebooks from nine college students, resulting in a total of 499 pages. We expect to collect over 100 notebooks, or equivalently about 3,000 pages, from at least 50 students. The dataset is available to the research community via the Lehigh document analysis and exploitation (DAE) platform.

Categories and Subject Descriptors

I.7.5 [Document and Text Processing]: Document Capture – document analysis

General Terms

dataset, measurement, performance, repository, ground-truth, interpretation

1. INTRODUCTION

Document Image Analysis (DIA) is the subfield of digital image processing that aims at converting document images into a symbolic form for modification, storage, retrieval, reuse, and transmission [13]. To achieve this goal, a fundamental task is to transcribe text from image format into symbolic format, known as optical character recognition (OCR) or handwriting recognition. For this purpose, researchers collect appropriate data or simply make use of existing public datasets for performance comparison. Standard datasets have been provided and widely acknowledged because they enable performance comparison among different methods. Most of them, however, acquire various artifacts during the process of dataset building and curation. Figure 1 shows a categorization based on two measures of the data collection process:

• spontaneous or elicited: whether the handwritten samples are affected by the data collectors.

• raw or curated: whether the post-processing of a dataset might compromise real-world scenarios. In this paper, we define curation as post-processing in which researchers intentionally select particular types of raw data for their research tasks. For our own dataset, the only post-processing is to discard handwritten pages that contain personal identifying information, to protect personal privacy.

Thus, dataset artifacts include elicited and/or curated handwriting. Most existing datasets are either elicited, because the researchers affected the human subjects' normal handwriting, or curated, because the researchers re-organized or sifted the data they collected. A breakdown of these datasets is given in Table 1. For example, during the data acquisition of IAM [11], the authors imposed the following restrictions:

• The paper documents contain pre-printed separating lines so that the individual parts can be extracted automatically.

• The authors required the use of rulers during writing, with a 1.5 cm spacing between lines.

• A human subject might be interrupted when the supervisor observes limited space left for writing on the page.


Although these restrictions largely simplify pre-processing and analysis for their handwriting recognition system, the resulting scenario differs from real-world situations. To promote and evaluate robust algorithms for real-world scenarios, we strive to provide a handwriting dataset that is spontaneous and free of curation. Specifically, all the data we collect were not prepared in research labs or for a particular DIA task; rather, they exist for students' personal use. In addition, we strive to protect test subjects' privacy by excluding pages that contain personal identifying information: person names, email addresses, phone numbers, etc.

Table 1: A breakdown of several public off-line handwriting datasets.

Dataset           | Scale      | Writers | Source      | Process | Purpose                                        | Comments
IAM [11]          | 9K lines   | 400     | Elicited    | Raw     | handwriting recognition with language modeling | use rulers; human intervention
SUNY Buffalo [18] | 1 page     | 1,500   | Elicited    | Raw     | handwritten document examination testimony     | the same document; various samples
Firemaker [17]    | N/A        | 250     | Elicited    | Curated | writer identification/verification             | write fixed text
NIST (SD3) [20]   | 2.6K forms | 2,600   | Elicited    | Curated | character recognition                          | unclear on motivation
RIMES [6]         | 12K pages  | 1,300   | Elicited    | Curated | handwriting recognition                        | mails, forms
IBN SINA [5]      | 51 folios  | 1       | Spontaneous | Curated | historical handwriting recognition             | historical documents
Mormon Diary [1]  | 63K pages  | 376     | Spontaneous | Raw     | historical document analysis                   | historical diaries
Germana [14]      | 764 pages  | 1       | Spontaneous | Raw     | historical document analysis                   | historical manuscript
CENPARMI [19]     | 17K words  | N/A     | Elicited    | Raw     | U.S. Zip code recognition                      | dead letter envelopes
CEDAR [7]         | 70K words  | N/A     | Spontaneous | Curated | U.S. Zip code recognition                      | dead letter envelopes
Lehigh Notebook*  | 499 pages  | 9       | Spontaneous | Raw     | various DIA research                           | course notebooks

* As an ongoing activity, we are aiming for about 100 notebooks from 50 students, for a total of about 3,000 pages.

Another important characteristic of the Lehigh notebook dataset is that it is hosted on the document analysis and exploitation (DAE) server at Lehigh, which supports multiple competing interpretations for a given document rather than a single dictated ground-truth. To facilitate broad document analysis, we provide a set of document metadata for each page, such as the author (in the form of an author ID), the notebook index, etc. In addition, the DAE platform enables researchers to provide their own interpretations of a given document. We discuss metadata, structural information, and the DAE platform further in Section 5.

For the remainder of this paper: Section 2 reviews related datasets; Section 3 explains the details of data acquisition and processing; Section 4 discusses a series of real-world applications that can benefit from the Lehigh notebook dataset; Section 5 briefly describes the concept of the DAE platform and the format in which it hosts the Lehigh notebook dataset; and Section 6 concludes with a discussion of future work.

2. RELATED WORK

In the DIA area, several handwriting datasets have served as standard benchmarks. Figure 1 shows an approximate categorization of existing handwriting datasets based on their application areas and data sources. In this section, we discuss only Mormon Diary and Germana, since these two are spontaneous and largely free of curation.

The Mormon Diary [1] dataset contains Mormon missionary diaries dating from 1832 to 1960. For DIA researchers, it is a real-world handwriting dataset that can be used for multiple tasks, such as historical writer identification, historical handwriting recognition, etc. The Germana [14] dataset is a 764-page historical Spanish manuscript written by a single author in 1891. Germana provides access to the raw images as well as annotations of their informative contents.

However, there are important differences between historical and modern handwriting. For example, the writing implements differ (fountain pens vs. ballpoints, etc.), the handwriting styles differ, and the lexicons differ as well. The document degradations also differ, which may require different processing techniques; the major sources of degradation for historical documents are wear and damage to the paper, water stains, rips, etc. From this perspective, although these two datasets and our Lehigh notebook dataset are all spontaneous collections that support a range of DIA research, they have grown in different directions and are thus useful for different DIA tasks.

Two observations motivated us to provide a new handwriting dataset for promoting and evaluating algorithms in DIA research. First, we want the dataset to reflect real-world challenges that traditional datasets cannot support, while keeping the raw format and composition of the data as much as possible. Second, we would like to support broader research rather than narrowly focusing on handwritten digit recognition and the like. As a result, we have created the Lehigh notebook dataset.

[Figure 1: Characterizing existing off-line handwriting datasets along two axes: elicited vs. spontaneous and raw vs. curated. Elicited and raw: IAM, SUNY Buffalo, CENPARMI; elicited and curated: Firemaker, NIST Digit, RIMES; spontaneous and raw: Lehigh Notebook, Germana, Mormon Diary; spontaneous and curated: IBN SINA, CEDAR.]

3. LEHIGH NOTEBOOK DATASET

[Figure 2: Some example notebook pages from the Lehigh notebook dataset: (a) one page of a computer science course; (b) one page of a statistics course; (c) one page of an electrical engineering course.]

With a sufficient amount of training data and carefully designed features, traditional handwriting recognition systems can perform reasonably well. However, there is a distinction between research scenarios and real-world ones. For research purposes, it is reasonable to start out on specific types of data in order to tune features and classifiers. Current DIA techniques, however, are still unable to handle all forms of real-world complexity, because handwritten documents exhibit endless variation: complicated layouts such as hand-drawn tables, forms, and sketches are possible, along with other artifacts such as rulings and legends, and these components can appear almost anywhere within a page.

The process of creating and post-processing document datasets is critical. First, we want to make sure that test subjects provide their normal handwriting without intervention from data collectors; in other words, all the data are spontaneous. Second, after data collection, we want to preserve the raw data as much as possible. We retain all raw data except pages containing personal identifying information. Strictly speaking, this post-processing is one type of curation; it is, however, necessary to protect test subjects' privacy.

The result is the Lehigh notebook dataset. The notebooks were used by college students for their courses, which ensures spontaneous handwriting. All kinds of handwriting are thus possible in this collection: handwritten text, hand-drawn graphs/sketches, hand-drawn tables, and handwritten annotations from people other than the authors. The notebooks also come from a variety of disciplines, so it is common to observe plain handwritten text, math equations, engineering drawings, chemistry symbols, etc. In addition, we did not adjust the distribution of any particular types of notebooks or their contents.

To process bound notebooks, we separated the pages while preserving page order; for loose notebooks, we likewise preserved the order in which we received them. A few notebook pages contained personal identifying information such as names, email addresses, or phone numbers, so we excluded them from the scanning. So far, we have excluded about 20 pages to protect test subjects' personal privacy.

Each notebook page was scanned at 600 dpi into PDF files, using an HP copier with the bitonal setting under the plain-text mode. (In the near future, we will also provide a full-color version of the dataset.) After scanning, we processed all PDF files with the Linux tool pdfimages to extract the images as Portable Bitmaps (PBM), and then converted the PBM images into TIFF format with no compression. A common dimension of a notebook page image is 5104 × 6600 pixels (width × height). A minimal sketch of this conversion pipeline is given at the end of this section.

It is important to consider the scale of any dataset for DIA tasks. As an ongoing activity, the Lehigh notebook dataset has been growing steadily. So far, we have collected 18 notebooks from nine college students, a total of 499 document pages. As more notebooks become available, we expect this dataset to reach a reasonable size compared to existing ones. Therefore, for now we do not consider the current scale a serious limitation; rather, we would like to highlight the variety the dataset contains, how that variety reflects real-world problems, and how it can benefit those tasks. Figure 2 shows a few examples.

In reviewing the data we collected, we noted two things. First, pre-printed ruling lines are usually present in notebooks. These rulings are designed to guide people's handwriting; in practice, handwriting usually overlaps them, which can cause serious problems for handwriting recognition. Second, our dataset contains notebooks from a variety of courses, ranging from computer science and electrical engineering to statistics, so it is common to observe various kinds of components placed almost arbitrarily, e.g., hand-drawn sketches/figures, handwritten tables, etc. In the following section, we demonstrate a series of real-world research problems that might be conducted on this dataset.
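The following is a minimal sketch (our illustration, not the exact published pipeline) of the PDF-to-TIFF conversion step described above. It assumes the pdfimages tool from poppler-utils and the Pillow imaging library are installed; the file and directory names are hypothetical.

import subprocess
from pathlib import Path

from PIL import Image  # Pillow

def pdf_to_tiff(pdf_path: str, out_dir: str) -> None:
    """Extract page images from a scanned PDF and save them as
    uncompressed TIFF files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # pdfimages (poppler-utils) writes bitonal images as .pbm by default.
    subprocess.run(["pdfimages", pdf_path, str(out / "page")], check=True)
    for pbm in sorted(out.glob("page-*.pbm")):
        # Pillow writes uncompressed TIFF unless a compression is requested.
        Image.open(pbm).save(pbm.with_suffix(".tiff"), format="TIFF")

pdf_to_tiff("notebook01.pdf", "tiff_pages")  # hypothetical input file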

4. POSSIBLE DIA RESEARCH

For DIA, one primary task is to translate an input document image into a textual transcription so that other techniques, such as those in natural language processing (NLP), can be applied. However, the bulk of the information a document corpus conveys goes beyond such a transcription. We refer to this as structural information, which consists of document metadata and other non-textual information. For example, in handwritten documents, writer idiosyncrasies are one kind of structural information that can be exploited for many applications, including handwriting recognition and writer identification. Other examples include the date of creation, the language of the script, the topics involved, the page layout specifications, the 2-D arrangement of handwritten text (e.g., tables), and the order of the current page in a collection (e.g., a notebook). An example is shown in Figure 3.

In real-world applications, the goal is often to process a large collection of noisy and unstructured documents automatically. What we expect machines to produce is a set of organized documents with a variety of structural information. For example, we would like to know which pages belong to the same collection (stapled, or bound together), or to the same topic at the semantic level. This might involve cross-page DIA tasks, as will be discussed in Section 4.2. Within each page, the structural information should include the correct reading order of the page, document component regions, and other page artifact information such as ruling line specifications, bleed-through, etc. Single-page DIA tasks are discussed in Section 4.1. In some scenarios where a high-value collection is relatively small and holds significant interest, manual processing may be a feasible alternative, e.g., the George Washington dataset in [10].

4.1 DIA on a Single Page

4.1.1 Bleed-through Processing

Real-world OCR systems may fail due to a large variety of artifacts and noise. Bleed-through is one common issue that arises in historical document image processing [12, 21]: ink from the verso side of a handwritten document shows through on the recto side. It turns out that bleed-through also occurs in double-sided handwritten notebook pages. On the other hand, it is relatively easy for researchers to generate reference interpretations for modern documents (we introduce the concept of interpretation in detail in Section 5). Thus, it is possible to use the Lehigh notebook dataset to design and evaluate bleed-through algorithms. Figure 5(a) and Figure 5(b) show two sample pages that contain bleed-through.

4.1.2 Logical Component Grouping

In real life, given a document page, human readers first identify its logical components. Such pages might contain various kinds of components: handwritten text, hand-drawn sketches/figures, hand-drawn tables, other people's annotations, etc. In addition, such components might be placed in an arbitrary way. By conducting recognition and segmentation simultaneously, human beings have no problem deciding a logical grouping; it is much harder for machines to do so. To facilitate progress in this regard, the Lehigh notebook dataset is a suitable resource: for example, when it comes to handwritten components other than plain text, people usually ignore the space constraints suggested by rulings.

Also, people sometimes use human-readable indicators, e.g., arrows, to guide the logical component grouping. Figure 5(d) shows an example of a complicated grouping indicated by hand-drawn arrows. Thus, it is important to recognize such indicators in order to determine the logical component grouping.

4.1.3 Handwritten Annotation Analysis

In real-world handwritten documents, annotations usually represent other people's interpretations of the document, so they are important to detect and analyze. However, since annotations are usually concise and context-dependent, they are not easy for machines to understand. Figure 5(e) shows an example of an annotated document page. Handwritten annotation analysis is a relatively new area in which little work has been done so far. Several interesting research questions, not restricted to DIA, include:

• Who wrote the annotations?

• If an annotation is a question, can we find answers within the page?

• If it is a comment or a correction, can we derive the correct form from the context, at the syntactic and semantic levels?

We hypothesize that the ability to understand symbolic handwritten annotations, such as question marks and check marks, will benefit the understanding of other real-world documents as well.

4.1.4 Writer Identification

Traditional writer identification research is usually conducted on carefully prepared datasets such as IAM [11]. In some real-world cases, however, it is almost impossible to determine the original author. For example, researchers may want to conduct writer identification on historical documents from various sources. Since our notebook dataset shares many characteristics with historical documents (bleed-through, artifacts, etc.), it is reasonable to design and evaluate new writer identification techniques on it.

4.1.5 Table Understanding

The two-dimensional arrangement of text cells conveys critical information beyond the symbolic content itself. For example, each table cell's relationships with its row header and column header form a logical triple over these three table components, which is valuable for knowledge extraction. In real-world handwritten documents, tables are a common means of organizing information, so it is important to design and evaluate table understanding techniques on representative datasets. Our Lehigh notebook dataset may therefore be appropriate for researchers in table understanding, as more and more table pages become available. Figure 5(h) and Figure 5(i) show two examples; a minimal sketch of the cell-triple representation is given below.
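The sketch below is our own illustration of this cell-triple representation, assuming that a handwritten table has already been recognized into a two-dimensional grid of cell strings (the grid contents are hypothetical).

from typing import List, Tuple

def table_to_triples(grid: List[List[str]]) -> List[Tuple[str, str, str]]:
    """Turn a recognized table (first row = column headers, first
    column = row headers) into (row, column, value) triples."""
    col_headers = grid[0][1:]
    triples = []
    for row in grid[1:]:
        row_header, cells = row[0], row[1:]
        for col_header, value in zip(col_headers, cells):
            triples.append((row_header, col_header, value))
    return triples

# Hypothetical recognition result for a small handwritten table:
grid = [
    ["",      "Midterm", "Final"],
    ["Alice", "88",      "92"],
    ["Bob",   "75",      "81"],
]
print(table_to_triples(grid))
# [('Alice', 'Midterm', '88'), ('Alice', 'Final', '92'), ...]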

4.2 DIA on Multiple Pages

So far, a large amount of work in document analysis has focused on isolated pages, ignoring the fact that a page is often part of a larger document collection. Obviously, the pages that comprise a single document should be linked: built-in relationships such as tables of contents, indices, captions, and footnotes are valuable information for organizing documents. Human readers use this information extensively, and algorithms should, too. We believe the Lehigh notebook dataset is one attempt to fill this void.

<page id="page294">
  <path>/TIFF/lehigh1003nb2009page294.tiff</path>
  <resolution>600 x 600</resolution>
  <value_list>
    <value_list_item>
      <property>author subject</property>
      <value>1003</value>
    </value_list_item>
    <value_list_item>
      <property>notebookID</property>
      <value>nb2009</value>
    </value_list_item>
    <value_list_item>
      <property>creation date</property>
      <value>2010/09/15</value>
    </value_list_item>
    <value_list_item>
      <property>Subject</property>
      <value>mathematics, statistics, linear models</value>
    </value_list_item>
    ...
  </value_list>
</page>

Figure 3: An XML markup of the metadata for the Lehigh notebook document shown in Figure 5(f). This document page is also accessible via: http://dae.cse.lehigh.edu/DAE/?q=browse/dataitem/554349.
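As an illustration of how such metadata might be consumed programmatically, here is a minimal parsing sketch using Python's standard ElementTree module. It assumes the element names shown in Figure 3, which may differ from the actual DAE schema, and uses only a subset of the properties.

import xml.etree.ElementTree as ET

# Element names follow Figure 3 and are assumptions; the actual
# DAE schema may differ.
xml_text = """
<page id="page294">
  <path>/TIFF/lehigh1003nb2009page294.tiff</path>
  <value_list>
    <value_list_item>
      <property>notebookID</property>
      <value>nb2009</value>
    </value_list_item>
    <value_list_item>
      <property>creation date</property>
      <value>2010/09/15</value>
    </value_list_item>
  </value_list>
</page>
"""

root = ET.fromstring(xml_text)
metadata = {
    item.findtext("property"): item.findtext("value")
    for item in root.iter("value_list_item")
}
print(metadata["notebookID"])     # -> nb2009
print(metadata["creation date"])  # -> 2010/09/15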

4.2.1 Page Order Analysis

It is natural for people to determine the page order before trying to understand an unstructured document collection. To facilitate automatic understanding of a collection of documents, machines should determine the page order as well. In our discussion, we define page order as the logical sequence in which notebook pages ought to be interpreted. This matters most for a set of loose pages, which reflects a real-world scenario in which only an unstructured document collection is available; there, the physical order of the pages is not necessarily the same as the logical page order. Thus, although our dataset contains notebooks with an obvious physical page order, it can be used to test algorithms for cases where only loose pages are available. Commonly used indicators include page numbers, creation dates, etc. Figure 5(f) and Figure 5(g) show two consecutive pages within the same notebook. It is also important to bear in mind that neither page numbers nor creation dates are sufficient to determine the page order: two pages may come from different notebooks yet carry consistent page order indicators, e.g., when the same person numbers the pages of two course notebooks of the same style. A minimal sketch of ordering loose pages by such indicators is given below.
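The following sketch is our own illustration, not an algorithm from the dataset. It assumes that a page number and/or a creation date has already been recognized for each loose page (e.g., by an earlier recognition step) and sorts the pages stably, so that pages lacking indicators keep their received order relative to each other.

from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class LoosePage:
    scan_id: str                       # identifier in received (scan) order
    page_number: Optional[int] = None  # recognized handwritten page number, if any
    created: Optional[date] = None     # recognized/estimated creation date, if any

def order_pages(pages: List[LoosePage]) -> List[LoosePage]:
    """Sort by (creation date, page number). Missing dates sort first,
    and a missing page number falls back to the received position,
    so unlabeled pages keep their relative order."""
    indexed = list(enumerate(pages))
    indexed.sort(key=lambda item: (
        item[1].created or date.min,
        item[1].page_number if item[1].page_number is not None else item[0],
    ))
    return [page for _, page in indexed]

pages = [
    LoosePage("scan-1", page_number=31, created=date(2010, 9, 15)),
    LoosePage("scan-2", page_number=30, created=date(2010, 9, 15)),
    LoosePage("scan-3"),  # no indicators recognized
]
print([p.scan_id for p in order_pages(pages)])  # ['scan-3', 'scan-2', 'scan-1']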

4.2.2 Logical Clustering

Similar to the situation with page order, people also need to know whether there are sets of pages that are coherent in terms of their physical structure (notebooks) or their logical structure (topics, discussed in Section 4.2.3). For human beings, such analysis might facilitate understanding of the document contents or of the temporal order of the events those documents record. For machines, such analysis can also benefit style-based document analysis; one example is whole-book recognition [22]. Our Lehigh notebook dataset provides a natural collection of documents that belong to different notebooks. By providing document metadata such as notebook IDs and ruling specifications, it makes it possible to design and evaluate logical clustering algorithms. To do so, it is appealing to make use of the ruling line specifications: for example, two pages with different ruling line structures most likely originate from different notebooks (a minimal sketch of such grouping is given below). Note that for tasks like handwriting recognition, rulings are usually treated as noise or artifacts [2]; in this scenario, however, such an artifact can be useful in determining the structure of notebooks. As a prerequisite step for notebook restoration, we are currently evaluating our ruling line detection algorithm [3] on this dataset.
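As a rough illustration (ours, not an algorithm from the paper), the sketch below groups pages by their detected ruling specifications. It assumes a ruling line spacing (in pixels) and a ruling line count have already been measured for each page; the tolerance value is arbitrary.

from typing import Dict, List, Tuple

def cluster_by_ruling(
    pages: Dict[str, Tuple[float, int]],  # page id -> (line spacing in px, line count)
    spacing_tol: float = 2.0,
) -> List[List[str]]:
    """Group pages whose ruling spacing agrees within a tolerance and
    whose ruling line counts match exactly."""
    clusters: List[Tuple[float, int, List[str]]] = []
    for page_id, (spacing, count) in sorted(pages.items()):
        for ref_spacing, ref_count, members in clusters:
            if count == ref_count and abs(spacing - ref_spacing) <= spacing_tol:
                members.append(page_id)
                break
        else:
            # No compatible cluster yet: start a new one for this ruling style.
            clusters.append((spacing, count, [page_id]))
    return [members for _, _, members in clusters]

pages = {"p1": (49.8, 33), "p2": (50.1, 33), "p3": (62.0, 27)}
print(cluster_by_ruling(pages))  # [['p1', 'p2'], ['p3']]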

4.2.3 Document Topic Tracking

In real life, an unstructured document collection might record a variety of topics, and understanding these topics might benefit understanding of the entire collection. This scenario is reflected in tracking course topics within notebooks: over a semester, a course might move through several topics. For automatic methods, this is hardly a problem for the DIA community alone; rather, it is desirable to bring in techniques from the NLP and IR communities that can help decide the topic model of each page.

Although we have listed only a few research tasks that could use the Lehigh notebook dataset, it can serve other applications as well. For example, writer identification is usually conducted on plain and clean documents, as in the IAM dataset, but it would be interesting to evaluate existing techniques on real-world noisy documents such as notebook pages; new techniques can also be designed and tested on our dataset if necessary. In addition, many multi-page DIA applications require coherent document groups (notebooks in our case) for testing, so our notebook dataset can facilitate such research tasks as well. We believe there remains ample room for further discussion of potential applications.

5. THE DAE PLATFORM

The document analysis and exploitation (DAE) platform is designed to promote the storage, visualization, and interpretation of collections of noisy documents. More specifically, it allows peer evaluation of algorithms and datasets through crowd cooperation [8, 9] (see Figure 4). One important idea it conveys is to allow multiple interpretations and measurements to coexist, so that measuring, comparing, and interpreting require the presence of a well-defined context. This way of perceiving document interpretation sheds new light on how to consider noisy documents, since it covers not only the noise that affects the physical transcription of the author's syntax onto the document support, but also the lack of contextual knowledge that affects the interpretation by the reader [8].

[Figure 4: A screenshot of the DAE platform where the Lehigh notebook dataset is shown for user editing.]

Under this paradigm, the well-known term ground-truth is simply an instance of one person's interpretation. It is therefore natural that one document page, or a zone of a page, will receive multiple interpretations over time. How, then, do we know which one is a "correct" interpretation? Here, the notion of on-line reputation as practiced in Web 2.0 recommender systems might hold the key [15, 23, 16]. Since researchers and algorithms have already acquired informal reputations, extending these to document interpretation data can provide a mechanism for deciding which interpretations to choose.

Hence, we do not attempt to provide a complete set of ground-truth interpretations for various DIA tasks; rather, we provide general structural information for each page. For example, an XML-like metadata label might look like Figure 3. In this markup, the author and notebook ID information were recorded during data collection, so they are more likely to be trusted. In addition, we also mark physical characteristics present in the notebook pages, such as rulings. Pre-printed ruling lines have certain properties: they are parallel and employ consistent spacing, length, and thickness. Making use of this ruling line model, we can mark up rulings conveniently [3]; a minimal sketch of such a model is given at the end of this section.

Some structural information consists only of the data annotators' interpretations. For example, in Figure 3 the creation date is an estimate derived from preceding pages (the author wrote a date on one of them). Such metadata is clearly not precise, yet it can still be useful for deciding the page ordering. The subject entry raises even more interesting questions, e.g., what is the subject of this particular notebook page? One person might tag it as a math course, while another might treat it as a statistics course; someone else might mark it at a finer level, such as "linear models." Thus, we should allow multiple interpretations of a single document.

Within the DAE platform, researchers can collaboratively create reference data for a given input. Currently, we have been creating interpretations of the ruling line specifications, which are likely to benefit multiple research tasks such as notebook structure restoration. Other interpretations will hopefully be provided by the research community over time.
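The dataclass below is a minimal sketch (our own illustration, not the DAE schema) of such a ruling line model: parallel lines with consistent spacing, length, and thickness, from which the individual ruling positions on a page can be generated.

from dataclasses import dataclass
from typing import List

@dataclass
class RulingSpec:
    """Pre-printed ruling model: parallel horizontal lines with
    consistent spacing, length, and thickness (all in pixels)."""
    first_line_y: float  # vertical position of the topmost ruling
    spacing: float       # distance between consecutive rulings
    length: float        # horizontal extent of each ruling
    thickness: float     # stroke thickness
    count: int           # number of rulings on the page

    def line_positions(self) -> List[float]:
        """Vertical positions of all rulings implied by the model."""
        return [self.first_line_y + i * self.spacing for i in range(self.count)]

# Hypothetical specification for a 600-dpi scan of a ruled page:
spec = RulingSpec(first_line_y=420.0, spacing=150.0, length=4800.0,
                  thickness=4.0, count=33)
print(spec.line_positions()[:3])  # [420.0, 570.0, 720.0]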

6. DISCUSSION

One observation that inspired our work is that most public handwriting datasets were carefully prepared for research purposes, and such artificial effects might simplify problems that are common in real life. In this work, we have introduced a real-world noisy unstructured handwritten notebook dataset intended to help promote and evaluate robust document analysis algorithms. As an ongoing activity, the Lehigh notebook dataset has been growing steadily: so far, we have made available 18 notebooks from nine college students, resulting in a total of 499 pages. Although the effort has just started, the dataset already has characteristics of interest to the DIA community as well as the NLP and IR communities. Currently, the bitonal scanned version of the Lehigh notebook dataset is available via the DAE platform hosted at Lehigh (http://dae.cse.lehigh.edu/DAE/). It is important to acknowledge that any dataset is biased, and our Lehigh notebook dataset is no exception. Future work includes a full-color scanned version and several applications in which more structural information is exploited. In addition, we invite discussion concerning the usage of such a dataset.

7. ACKNOWLEDGMENTS

The authors thank the anonymous reviewers for their insightful feedback, and the anonymous college students who contributed their course notebooks for our research purposes. This work was supported in part by a DARPA IPTO grant administered by Raytheon BBN Technologies.

8. REFERENCES

[1] The Mormon missionary diaries. http://lib.byu.edu/digital/mmd/create.php.

[2] H. Cao, R. Prasad, and P. Natarajan. A stroke regeneration method for cleaning rule-lines in handwritten document images. In Proceedings of the MOCR Workshop at the 10th International Conference on Document Analysis and Recognition, 2007.

[3] J. Chen and D. Lopresti. A model-based ruling line detection algorithm for noisy handwritten documents. In Proceedings of the 11th International Conference on Document Analysis and Recognition, 2011. To appear.

[4] J. Chen and D. Lopresti. Table detection in noisy off-line handwritten documents. In Proceedings of the 11th International Conference on Document Analysis and Recognition, 2011. To appear.

[5] R. Farrahi, M. Cheriet, M. Adankon, K. Filonenko, and R. Wisnovsky. IBN SINA: a database for research on processing and understanding of Arabic manuscripts images. In Proceedings of the 9th International Workshop on Document Analysis Systems, pages 11–17, 2010.

[6] E. Grosicki, M. Carre, J. Brodin, and E. Geoffrois. Results of the RIMES evaluation campaign for handwritten mail processing. In Proceedings of the 11th International Conference on Frontiers in Handwriting Recognition, pages 941–945, 2008.

[7] J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.

[8] B. Lamiroy and D. Lopresti. A platform for storing, visualizing, and interpreting collections of noisy documents. In Proceedings of the Fourth ACM Workshop on Analytics for Noisy Unstructured Text Data, pages 11–18, 2010.

[9] B. Lamiroy, D. Lopresti, H. Korth, and J. Heflin. How carefully designed open resource sharing can help and expand document analysis research. In Proceedings of Document Recognition and Retrieval XVIII (IS&T/SPIE International Symposium on Electronic Imaging), 2011.

[10] R. Manmatha and T. Rath. Indexing of handwritten historical documents - recent progress. In Proceedings of the Symposium on Document Image Understanding, pages 77–85, 2003.

[11] U. Marti and H. Bunke. The IAM-database. International Journal on Document Analysis and Recognition, 5:39–46, 2002.

[12] R. Moghaddam, D. Rivest-Hénault, I. Bar-Yosef, and M. Cheriet. A unified framework based on the level set approach for segmentation of unconstrained double-sided document images suffering from bleed-through. In Proceedings of the 10th International Conference on Document Analysis and Recognition, pages 441–445, 2009.

[13] G. Nagy. Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):36–62, 2000.

[14] D. Pérez, L. Tarazón, N. Serrano, F. Castro, O. R. Terrades, and A. Juan. The GERMANA database. In Proceedings of the International Conference on Document Analysis and Recognition, pages 301–305, 2009.

[15] W. Raub and J. Weesie. Reputation and efficiency in social interactions: an example of network effects. American Journal of Sociology, 96(3):626–654, 1990.

[16] J. Sabater and C. Sierra. Review on computational trust and reputation models. Artificial Intelligence Review, 24(1):33–60, 2005.

[17] L. Schomaker and L. Vuurpijl. Forensic writer identification: a benchmark data set and a comparison of two systems. Technical report, Nijmegen, 2000.

[18] S. Srihari, S. Cha, H. Arora, and S. Lee. Individuality of handwriting. Journal of Forensic Sciences, 47:1–17, 2002.

[19] C. Suen, C. Nadal, R. Legault, T. Mai, and L. Lam. Computer recognition of unconstrained handwritten numerals. Proceedings of the IEEE, 80(7):1162–1180, 1992.

[20] R. Wilkinson, J. Geist, S. Janet, P. Grother, C. Burges, R. Creecy, B. Hammond, J. Hull, N. Larsen, T. Vogl, and C. Wilson. The first census optical character recognition systems conference. NISTIR 4912, The U.S. Bureau of Census and the National Institute of Standards and Technology, Gaithersburg, MD, USA, 1992.

[21] C. Wolf. Document ink bleed-through removal with two hidden Markov random fields and a single observation field. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3):431–447, 2010.

[22] P. Xiu and H. Baird. Whole-book recognition using mutual-entropy-driven model adaptation. In Proceedings of Document Recognition and Retrieval XV (IS&T/SPIE International Symposium on Electronic Imaging), 2008.

[23] B. Yu and M. Singh. A social mechanism of reputation management in electronic communities. In Cooperative Information Agents IV - The Future of Information Agents in Cyberspace, pages 355–393, 2000.

[Figure 5: Some example notebook pages from the Lehigh notebook dataset: (a) the recto side of a double-sided document; (b) the verso side of a double-sided document; (c) one page of a computer science course; (d) reading order indicated by hand-drawn arrows; (e) an example page containing handwritten annotations at the bottom; (f) page 30 in Notebook #2009; (g) page 31 in Notebook #2009; (h) one page containing a handwritten table; (i) one page containing a handwritten table.]