Hiding Information in Document Images

J. Brassil, S. Low, N. F. Maxemchuk, L. O’Gorman
AT&T Bell Laboratories, Murray Hill, NJ

Consider two distinct document pages, which we label A and B. The pages are sufficiently similar that they are indistinguishable upon visual inspection, even when compared "side-by-side." However, in one of the pages the locations of certain text words have been shifted horizontally by an imperceptible amount. Suppose that either page is selected at random and is reproduced, possibly recursively, using either an ordinary plain paper copier or facsimile device. Can one determine if such a copy was derived from A or B? In this paper we show how to use spatial information to reliably hide data in document images. Experimental results reveal that this information survives the distortions introduced by noisy image reproduction devices. We also describe the implementation of a prototype document marking and distribution system. The system can be used to discourage unauthorized reproduction and dissemination of either copyrighted documents distributed electronically, or paper copies of confidential executive memoranda.

1. Motivation

Copyright protection is becoming more elusive as computer networks such as the global Internet are increasingly used to deliver electronic documents. Network distribution offers the promise of reaching vast numbers of recipients. It also allows information to be tailored and preprocessed to meet the needs of each recipient. However, these same distribution networks represent an enormous business threat to information providers — the unauthorized redistribution of copyrighted materials. Rather than attempt to prevent unauthorized document copying and dissemination, we propose technology to discourage it. Our emphasis is strictly on "commercial grade" document security; we focus on security techniques which are simple to implement, rather than those that are extremely resistant to attack. Our security goals are modest; we hope to make unauthorized copying and dissemination of electronic publications at least as difficult as if the publications were distributed on durable media (e.g. paper). If publishers can easily use computer networks for document distribution, with no additional fear of revenue loss due to "bootlegging" than they already face, then we have achieved our goal.

Traditional approaches to discouraging document reproduction relied heavily on the distribution of physical media. Classical security paper techniques [1] include printing technologies (e.g. intaglios), optical elements (e.g. holograms), and magnetic materials (e.g. ink). Media-based security techniques obviously cannot be used with electronically distributed documents. Indeed, copies of documents distributed on noiseless communication networks are exact and indistinguishable from an original.

In this paper we introduce a scheme to surreptitiously embed information in one important class of documents, namely formatted, black and white document images. Each document recipient (i.e. subscriber) receives a document containing a unique set of marks [2]. Each mark corresponds to an imperceptible horizontal displacement of a textual object. Since the information is not observable upon casual inspection of a document, a recipient may or may not be aware of its presence. We will show that information hidden in this fashion can be reliably recovered, even from severely degraded copies of marked documents. The marks placed in an image can be used as a "fingerprint"; a recovered, unauthorized document copy can be traced to the original, authorized recipient. In addition, personal information which a document recipient may be unwilling to make public can also be included in an image. Indeed, being aware that such information can be hidden makes a document more "valuable" to a recipient, and less attractive to redistribute. In addition to discouraging copying of electronic documents, our techniques can also be used to deter "leaks" of paper copies of closely held executive correspondence.

Figure 1 - Illustration of word-shift encoding. The first line is unshifted; the second line contains 5 words each shifted by 1/150 inch. Note that the spacing in both lines appears natural. The third line is an overlay of the first two lines. (Each of the three lines reads "In order for electronic publishing to become accepted, publishers must be ...".)

The remainder of the paper is organized as follows. In the next section we introduce our technique and discuss an implementation of a document marking system. We then discuss image defects which characterize "noisy" document reproduction devices, such as plain paper copiers and facsimile devices. In Section 4 we present experimental results that show that spatial information can be recovered from even rather poor quality document reproductions. System implementation issues are addressed in Section 5, and in the final section we summarize our work.

2. Document Marking and Distribution System Implementation

Our document marking scheme uses word-shift encoding; each mark corresponds to a minute, horizontal displacement of a certain text word (Figure 1). The marking system requires an electronic version of the document to be distributed. This "original" document is assumed to have variable spacing between adjacent text words. Our prototype system operates on documents in the PostScript Page Description Language (PDL). PostScript was chosen primarily because it permits arbitrarily precise text setting, and is the most common PDL in use today. We will discuss certain consequences of choosing to work with the PostScript format in Section 5.

A document marking system comprises an encoder and a decoder, which we describe in the next two subsections. The encoder uses word-shift encoding to embed a unique codeword in each intended recipient's document. The decoder analyzes a recovered (and possibly degraded) document image and extracts the embedded codeword.

2.1 Encoder

The encoder comprises a preprocessor and a word shifter (Figure 2). The preprocessor facilitates the displacement of arbitrary individual text words. PostScript text setting operators act on tokens, or character strings, rather than individual text words. The location of each token is specified by a horizontal and vertical user space coordinate on a virtual page. The preprocessor simply parses collections of one or more tokens in the original file into a new set of tokens, each corresponding to a separately set text word. After preprocessing, each text word is individually positioned on the virtual page. The word shifter modifies the horizontal coordinate of each text word to be shifted by adding or subtracting a constant value. The constant value corresponds to the desired displacement in device space (e.g. pixels) when the document is rendered (e.g. printed).

Each document recipient is assigned a unique codeword. The correspondence between the codeword and recipient is maintained in a codebook. Each codeword symbol corresponds to a displacement of a specific text word. In the current implementation of the encoder, the alphabet is binary, with each bit indicating the left or right displacement of a word. In Section 4 we briefly discuss the desired error-correcting properties of a code; the property of robustness to collusive attacks is treated in [3].

The choice of preferred text words to shift has been made carefully, as follows. The encoder first determines if a line has a sufficient number of words to encode; short lines are not encoded. Though a line with at least three words is encodable, a line spanning an entire column length is most desirable. On each encodable text line, the 2nd, 4th, 6th, etc., word (from the left margin) is displaced. To maintain column justification, neither the first nor last word on any line is shifted. Experimentation has shown that this selection process facilitates decoding, encodes a potentially large number of bits/page, and is aesthetically acceptable. We do not claim that this choice of text words is optimal in any sense. Indeed, we continue to explore alternatives, including those which shift entire "blocks" of adjacent text words.

Figure 2 - The logical architecture of an encoder: a PostScript document passes through the PostScript preprocessor and the word-shift encoder, which consults the codebook and renders a distinct bitmap (Version 1, Version 2, ..., Version N) for each recipient.

After word-shifting each page is rendered. Depending on the application and corresponding document delivery mechanism, the page is either printed by a local printer (e.g. for hand-delivered executive correspondence), or a bitmap is generated (e.g. for network-based distribution). Our system uses World Wide Web technology for document delivery on the Internet. Subscribers requesting copies of documents available on-line are asked to register and provide identifying information (for the purposes of both encoding and billing). The encoder is implemented as a Common Gateway Interface script, and requested documents are encoded on-demand.
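To make the word-selection and shifting rule concrete, the following sketch is ours, not the authors' PostScript encoder; the function name encode_line, the constant SHIFT_INCHES, and the use of a plain list of positions are all illustrative stand-ins for editing token coordinates on the virtual page.

```python
# Hypothetical sketch of per-line word-shift encoding (not the paper's
# PostScript implementation). word_x holds the horizontal position of each
# word on one text line, in inches.

SHIFT_INCHES = 1.0 / 150.0   # 2 pixels at 300 dpi, as in the experiments

def encode_line(word_x, bits):
    """Shift the 2nd, 4th, 6th, ... words left (bit 0) or right (bit 1).

    Lines with fewer than three words are left unencoded; the first and
    last words are never moved, so column justification is preserved.
    Returns the new word positions and the number of bits consumed.
    """
    if len(word_x) < 3:
        return list(word_x), 0
    shifted = list(word_x)
    used = 0
    for i in range(1, len(word_x) - 1, 2):      # interior words 2, 4, 6, ...
        if used >= len(bits):
            break
        shifted[i] += SHIFT_INCHES if bits[used] else -SHIFT_INCHES
        used += 1
    return shifted, used
```

For instance, encode_line([0.0, 0.9, 1.7, 2.6, 3.4], [1, 0]) moves the second word right and the fourth word left by 1/150 inch while leaving the line ends untouched. In the authors' prototype the displacement is instead applied to the horizontal coordinate of the corresponding PostScript token; the list manipulation above merely stands in for that step.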

2.2 Decoder

The decoding process is depicted in Figure 3. There we assume that a degraded paper copy of an encoded document is recovered. For the purpose of this discussion, we will assume that a recovered document comprises a single page. The recovered page is scanned at an 8-bit depth and is subject to the following image processing operations — edge cropping, binarization, salt-and-pepper noise removal, deskewing (rotation of text lines to the horizontal), and thinning (of individual characters). Each encoded text line is extracted from the recovered page image and decoded separately. Information about the original (unencoded) page image is needed for decoding, though the original page image itself is not strictly required. Decoding requires the set of centroid (i.e. center of mass) locations for each word on each encoded line. Additional specification of word locations in the unencoded image (e.g. bounding boxes) is currently used to increase decoding performance when the recovered page image is severely degraded. Since a potentially large number of original documents may be available on the document distribution system, a design goal is to keep document-specific information to a minimum.

A vertical projection profile (i.e. the number of ON bits per column) is calculated for each corresponding line in both the original and recovered page images. The pair of profiles is first used for word segmentation. This is a multipass operation, with each pass effectively designed to segment words from increasingly noisy profiles. The segmentation operation is complete when 1) the number of words identified in the recovered line image matches the number of words in the original text line image, and 2) the locations of words in the recovered text line image are within the neighborhood of their expected locations, as determined by the corresponding word locations in the original text line image.
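As a minimal illustration of these measurements (our sketch, not the authors' multipass segmenter), a projection profile can be computed from a binarized line image and word centroids estimated from it; the gap threshold min_gap and the function names are assumptions made for the example.

```python
import numpy as np

# Illustrative sketch of the profile-based measurements described above (not
# the paper's multipass segmenter). `line` is a binarized text-line image with
# ON pixels equal to 1; `min_gap` is an assumed number of blank columns that
# separates two words.

def vertical_profile(line):
    """Number of ON bits in each pixel column of the line image."""
    return line.sum(axis=0)

def word_centroids(profile, min_gap=8):
    """Segment the profile into words and return one centroid per word.

    A word is a maximal run of occupied columns not interrupted by a gap of
    at least `min_gap` empty columns; its centroid is the profile-weighted
    mean column index (a single, real-valued number, as in footnote 1).
    """
    cols = np.flatnonzero(profile > 0)
    if cols.size == 0:
        return []
    # Split the occupied columns wherever a sufficiently wide gap occurs.
    breaks = np.flatnonzero(np.diff(cols) >= min_gap) + 1
    centroids = []
    for seg in np.split(cols, breaks):
        centroids.append(float(np.average(seg, weights=profile[seg])))
    return centroids
```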

The centroid location is then calculated for each corresponding word in both the original and recovered text line images.¹ Suppose that the ith word was shifted on an encoded line in the recovered image. Let x_{i-1}, x_i, x_{i+1} be the centroid locations for words i-1, i, and i+1 in the original image. Label the three corresponding word centroids in the recovered image x'_{i-1}, x'_i, x'_{i+1}. Define the differences between adjacent word centroids on the same image as

d_l = x_i - x_{i-1},      d_r = x_{i+1} - x_i,
d'_l = x'_i - x'_{i-1},   d'_r = x'_{i+1} - x'_i.      (1)

Then the decision rule our decoder implements is as follows:

if d'_l - d_l < d'_r - d_r : decide word i shifted left,

if d'_l - d_l > d'_r - d_r : decide word i shifted right.      (2)

That is, we consider the distance between the centroids of a shifted word and the words to its immediate left and right in the recovered image, relative to the corresponding distances in the original unshifted text line. If the relative distance between the centroids of the center word and the word to its left (right) has decreased, then it is hypothesized that the center word was shifted left (right).

__________________ 1. More precisely, a centroid is calculated for each horizontal section of the profile judged to correspond to each word on the text line. Each centroid is a single, real-valued number.
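Once the centroids are available, decision rule (1)-(2) reduces to a few comparisons. The sketch below is ours, not the authors' decoder; it assumes the original and recovered centroid lists are already aligned word-for-word.

```python
# Illustrative implementation of decision rule (1)-(2), assuming the word
# centroids of the original and recovered text lines are already known and
# aligned (same number of words, same order).

def detect_shift(orig, recov, i):
    """Return "left" or "right" for word i, with 1 <= i <= len(orig) - 2.

    orig  : centroid positions on the original (unshifted) text line
    recov : centroid positions on the recovered text line
    """
    d_l = orig[i] - orig[i - 1]           # d_l
    d_r = orig[i + 1] - orig[i]           # d_r
    dp_l = recov[i] - recov[i - 1]        # d'_l
    dp_r = recov[i + 1] - recov[i]        # d'_r
    return "left" if (dp_l - d_l) < (dp_r - d_r) else "right"
```

Ties (equal differences) are not covered by rule (2), and a line whose segmentation yields a different word count cannot be decided at all; that is exactly the failure mode examined in Section 4.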

Figure 3 - The architecture of the decoder: an illicit hard copy of unknown version is scanned, the page image is analyzed, and the decoder compares it against the original document and the codebook to identify the recipient's version (here, Version 3).

Before we consider the performance of our decoder, it is instructive to briefly examine how documents degrade when reproduced imperfectly. We anticipate that document images will be illicitly redistributed both by noiseless communication channels (e.g. computer networks) and noisy communication channels (e.g. facsimile). While decoding an uncorrupted document is trivial, decoding a distorted document is not. Since the most common noisy image reproduction devices in use today are plain paper copiers and facsimile devices, we will focus on image defects associated with these devices.

3. Noisy Image Channels

Most image defects present in documents produced by plain paper copiers or telecopiers appear minor and are readily overlooked by readers. But even minor defects take on a particular importance when hiding data in "white space." Fortunately, we can identify those defects which present the largest obstacles to reliable decoding, and specify how to encode documents to minimize their effect.

Expansion or shrinkage of copy size is present to some degree in nearly every image reproduction device. In some cases size changes are purposely introduced for perceived reproduction quality improvement or as an anti-counterfeiting measure. The expansion along the length and width of a page is typically different. For spatially encoded documents to survive noisy reproduction, the reproduction devices must possess a high degree of geometric fidelity (i.e. linearity). While geometric linearity depends heavily on the specific implementation of a reproduction device [4], observed nonlinearities generally tend to increase with increasing distance across a page. To counter the effect of page size changes, we encode information in the relative rather than absolute position of textual objects. We also encode information independently along both the width of the page (i.e. word shifting) and along the length of the page (i.e. vertical shifting of text lines [5]), but not along both dimensions jointly.

Certain large-scale image defects affect sizable regions of a page (i.e. > 1 cm²). One such phenomenon observed in recursively copied pages is "baseline waviness" (i.e., text rising above and/or falling below the baseline). This effect can be seen in Figure 5. To counter such large-scale spatial defects, we encode information in textual objects which are relatively close (e.g. adjacent text words). In addition, a large number of words are left unshifted to serve as reference points. Though we have not implemented the approach, collections of reference points can in theory be used to "correct" geometric nonlinearities.

Many other image defects are readily observed but do not appear to have a dramatic effect on our detection results [6]. This includes salt-and-pepper noise, some of which is easily removed from copies. Linear text line skew is approximately corrected by image rotation. Both edge raggedness (i.e. blurring) and fading have surprisingly little consequence for detection performance [7, 8]. Other researchers have concluded that blurring tends to be isotropic about each text character [9], and we consequently speculate that blurring does not dramatically alter the position of word centroids.

Image distortion is usually more severe in one dimension — either along the length or the width of a page — than the other. This is typically the "paper direction," or the orientation of paper moving through an image reproduction device. Variable paper thickness, drums and wheels out-of-round, nonconstant paper speed, etc., all contribute to more distortion in the paper direction. Note that a paper direction along the width (length) of a page will have more of an adverse effect on word (line) shift encoding. Since a recovered document may have been reproduced on different devices with different paper directions, a marking system should simultaneously encode along both directions to increase decoding performance. However, in the remainder of this paper we discuss word shifting exclusively, which encodes information along the page width.
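To give a rough sense of why relative encoding tolerates modest size changes, consider the following back-of-the-envelope calculation (ours, not taken from the paper). Suppose the recovered copy of an encoded line is, in effect, a uniform horizontal scaling of that line by a factor a, and that word i was shifted right by δ = 1/150 inch when the page was encoded. Then d'_l = a(d_l + δ) and d'_r = a(d_r - δ), so

d'_l - d_l - (d'_r - d_r) = (a - 1)(d_l - d_r) + 2aδ.

Rule (2) still yields the correct decision unless (a - 1)(d_r - d_l) exceeds 2aδ, roughly 1/75 inch; with a 1% size change this requires the two adjacent centroid gaps to differ by more than about an inch, which is unlikely for neighboring words. Real devices are not uniformly linear, of course, which is why nonlinearity that grows across the page, together with localized defects, remains the dominant concern and why nearby reference words help.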

Figure 4 - Fourth generation plain paper copy of the encoded page (scanned at 300dpi, binarized and inverted).

Figure 5 - The encoded page after standard resolution facsimile transmission.

The image defects introduced by a copier or facsimile are time-varying on several time scales. Mechanical parts wear, loose toner and paper fragments accumulate, etc. Poor machine maintenance is perhaps the most notable cause of severe image defects. Time variance causes certain difficulties, which we discuss in Section 5.

4. Experimental Results

To test how well word-shift encoded documents could be decoded after passing through noisy image reproduction channels, we performed the following experiments. A single page of justified, single-column, 10 point Times-Roman text was created. The test page comprised 30 lines each with a sufficient number of words to be encodable, with a total of 177 shiftable words. Both an original (unshifted) and an encoded page were printed on an Apple LaserWriter IIntx 300 dpi laser printer. Each shifted word on the encoded page was displaced by 2 pixels (1/150 inch). A 2 pixel displacement was selected because it was judged to be sufficiently small that a trained observer has some difficulty detecting its presence. The reader can inspect Figure 1 and judge for themselves.

The printed, encoded page was recursively copied four times on a Xerox 5052 plain paper copier (producing a 1st, 2nd, 3rd and 4th generation copy). This model has a paper direction along the page width. Figure 4 shows a portion of the 4th generation copy. In a separate pair of experiments, the printed, encoded page was transmitted between two Xerox 7033 Telecopiers (i.e. facsimile devices). One facsimile transmission, as shown in Figure 5, was performed at "standard" resolution (100x200 dpi), and one at "superfine" resolution (300x300 dpi). The paper direction of this facsimile device is along the page length.

Table 1 presents decoding performance for several of the images "recovered" from these experiments. The first column (unencoded detection rate) displays the fractional number of shifted words whose directional shift was correctly identified. As expected, decoding performance worsened as the images degraded with increasing copy generation. Decoding performance also worsened with decreased resolution in the facsimile experiments.

Copy           unencoded            coded
0              174/177              30/30
3              145/177 (154/177)    27/30 (29/30)
4              134/177 (144/177)    26/30 (28/30)
Hi res. fax    166/177              30/30
Std. res. fax  140/177 (143/177)    26/30 (27/30)

Table 1 - Summary of decoding performance. Results after improved word segmentation are in parentheses.

Since errors were anticipated when detecting individual word shifts, we used a simple error-correcting (repetition) code to reliably encode a single bit on each of the 30 encodable lines. Each line contained between 5 and 8 shifted words. If the majority of word shifts on a line could be correctly identified, then the bit encoded in the text line would be correctly decoded. The second column of Table 1 displays the coded detection results.

A closer examination of decoding performance is presented in Table 2. This shows the number of bits correctly decoded on each text line.

line #  copy 0  copy 3  copy 4  std. fax  hi-res
0  6/6  5/6  5/6  4/6  5/6
2  6/6    5/6  5/6
3  6/6     5/6
4  6/6
5  3/5  0/5  0/5  3/5  4/5
6  6/6  0/6  0/6  5/6
7  5/5    0/5
8  5/5  4/5  3/5  4/5
9  6/6    5/6
10  8/8    6/8
11  6/6  5/6  5/6  5/6  5/6
13  5/6   4/6  3/6
14  6/6   5/6
15  6/6  5/6  5/6
16  7/7  6/7  5/7  6/7
17  6/6  4/6  4/6  5/6
18  7/7
19  6/6  3/6  2/6  5/6  5/6
20  5/5   4/5  2/5  5/6
21  5/5  4/5  4/5
22  5/5  4/5  4/5
23  7/7   6/7  5/7  5/7
26  6/6
27  6/6  4/6  2/6  4/6
28  6/6  5/6  5/6  3/6
29  5/5   4/5
30  5/5    3/5
32  6/6  4/6  4/6  4/6  5/6
33  7/7  5/7  5/7  6/7
34  5/5  3/5  3/5   4/5

Table 2 - The fractional number of bits decoded correctly on each text line. For clarity, the absence of an entry on a line indicates no change from the entry in the column to the left.

The results justify a number of remarks:

1. Decoding performance does not decrease monotonically as images degrade (e.g. with increasing copy generation). This is a positive sign, suggesting that additional "processing" might improve detection performance.

2. Decoding the original, printed test page results in 3 bit errors. This reflects the fact that neither printing nor scanning is a noiseless operation. Note that in this experiment both the original and test pages were printed on the same device. But in some applications each page would likely be printed on a different device. This will likely result in additional decoding errors.

3. The 0/x entries are due to an initial failed attempt at word segmentation. If the number of words on the recovered text line image did not match the corresponding number on the original text line image, the decoder abandoned processing. Since word segmentation failures accounted for a significant number of errors, we repeated the segmentation phase, as described below.

4. Certain text lines, such as line 5, seem to be error-prone. In many applications it is possible to identify such lines empirically before any documents are distributed. In those cases, the system would avoid encoding such lines to maximize decoding reliability.

5. Visually examining an image to discover the cause of a given bit error is not always revealing. While in some cases the word corresponding to an erroneous bit has obviously been subject to particularly severe distortion, in other cases the error cause is not apparent.

6. Improved error-correction techniques, likely including a stronger code, interleaving, and diversity, appear to be required.

To improve decoding performance, the dominant source of errors, failed word segmentation, merited a closer look. These failures occurred on lines 5 and 6 in the copier experiments, and line 7 in the low resolution facsimile experiment. In all three lines the failure cause was immediately apparent. A long, narrow blotch of extraneous copier toner in the right margin extended from just outside the right column edge of line 5 down past the right column edge of line 6. This blotch was mistakenly identified as a word on each of the lines. For line 7 in the facsimile experiment, the combination of the very low scanning resolution and the presence of a punctuation mark (i.e. an apostrophe) also resulted in the false detection of an extra word. Both types of errors were easily remedied by minor adjustments in the decoder (e.g., adjusting a constant threshold). The initially unsegmentable lines were then segmented a second time with success, and the decoding results are summarized in the parenthesized entries in Table 1. However, the occurrence of these errors is a reminder that decoders must be designed to operate across documents subject to a wide range of noise.
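As a small illustration of the per-line repetition code used in these experiments (our sketch, not the authors' code), decoding a line's bit reduces to a majority vote over the per-word shift decisions; the function name and the tie-handling choice are assumptions.

```python
# Illustrative majority-vote decoding of the per-line repetition code
# (a hypothetical sketch, not the authors' implementation).

def decode_line_bit(decisions):
    """Recover the single bit embedded on one encoded text line.

    decisions : per-word outputs ("left" or "right") of the decision rule
                applied to the shifted words on the line.
    Returns 1 if the majority of words were judged shifted right, else 0.
    Ties decode to 0 here; the paper does not specify tie handling.
    """
    rights = sum(1 for d in decisions if d == "right")
    return 1 if rights > len(decisions) / 2 else 0
```

With five to eight shifted words per line, a single misdetected word does not flip the line's bit, which is why the coded column of Table 1 stays high even as the unencoded detection rate degrades.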

5. System and Implementation Issues

Like their media-based counterparts, electronic document security schemes which use image "watermarks" can be defeated by a technically sophisticated attacker. The challenge for the attacker is to develop a generic tool to remove or obscure marks in a large class of documents. This must ideally be achieved with neither loss of presentation quality nor expert manual intervention. We believe that such a tool can likely be implemented, though our document image analysis experience suggests that the complexity of this task is easily underestimated. For more discussion of potential attacks on both marked documents and photos, see [5] and [10].

A number of implementation issues arose while developing our system. Implementing an encoder is, in principle, simple. The primary complication encountered is that the PostScript generated by different word processing applications differs. Hence, our preprocessor and word shifter were initially designed to accept the PostScript generated by 3 common word processing applications. We are currently developing a new encoder which relies on the Adobe Acrobat Distiller to preprocess PostScript generated from arbitrary word processing software.

Printable bitmaps (i.e. the PostScript image operator) are ideally distributed to make documents both printable and moderately difficult to alter. But the consequences of distributing images include lower presentation quality, larger file sizes, and the need for image rendering before distribution. The latter two disadvantages are aggravated when delivering documents over relatively low speed communications lines from a single-threaded httpd server. One approach to addressing the problem of rendering images on demand, which we did not implement, is to encode pre-existing page images (i.e. bitmaps) directly. This would be a necessary choice for distributing documents existing only in scanned image form. A second alternative is to "look ahead" by encoding and rendering pages before they are requested. A disadvantage of this technique is that it might not facilitate embedding user-specific information.

In contrast to encoders, implementing a decoder is technically challenging. Good performance requires state-of-the-art document analysis tools to perform noise reduction, text "zoning", and word segmentation. Manual intervention by experts can enhance decoding performance in some cases, such as when extraneous writing appears on a recovered page. Fortunately, decoding is envisioned as being a relatively infrequent operation without a significant time constraint.

Some issues also arose in data collection. The time-varying nature of reproduction device defects obviously diminishes the repeatability of experiments. Perhaps more disturbing, we have not identified a simple, composite measure of image defects which effectively indicates whether a document image can be decoded successfully. Hence we have been forced to rely on poorly defined substitutes, such as the image copy generation number.

Despite these limitations, we have demonstrated that the document marking technique works well and we have successfully implemented a prototype system. Looking forward, we expect technological improvements and new developments (e.g. color copiers, liquid toner) to lessen the defects introduced by noisy reproduction devices. We also expect to see facsimile devices support much higher resolution. These anticipated technological improvements suggest that the performance of our marking techniques will likely increase with time.

6. Summary

Document delivery by computer network offers information providers the opportunity to reach a large audience more quickly and cheaply than does media-based distribution. To facilitate the transition to network distribution, we have proposed a technique for embedding information in document images to discourage unauthorized copying and dissemination. The technique is shown to be simple and robust, and suitable for "commercial grade" document security. A prototype document marking system has been implemented. Experimentation has demonstrated that our "watermarks" can be recovered, even from degraded documents. By discouraging unauthorized redistribution, document marking offers publishers a greater level of security than they currently enjoy.

Acknowledgement - Thanks to Kay Hane and Aleta Lapone for developing the encoder, and Dave Kristol for maintaining the server.

References

1. R. L. van Renesse, "Optical Document Security", Artech House, Boston, 1993.
2. N. R. Wagner, "Fingerprinting," Proceedings of the 1983 Symposium on Security and Privacy, IEEE Computer Society, April, 1983, pp. 18-22.
3. D. Boneh, J. Shaw, "Collusion-Secure Fingerprinting for Digital Data," Princeton University Technical Report, 1994.

4. L. P. Cordella, G. Nagy, "Quantitative Functional Characterization of an Image Digitization System", 6th International Conference on Pattern Recognition, October, 1982, pp. 535-537.
5. J. Brassil, S. Low, N. Maxemchuk, L. O'Gorman, "Electronic Marking and Identification Techniques to Discourage Document Copying," Proceedings of IEEE INFOCOM'94, vol. 3, Toronto, June, 1994, pp. 1278-1287.
6. H. S. Baird, "Document Image Defect Models," Structured Document Image Analysis, (H. S. Baird, H. Bunke, K. Yamamoto, eds.), Springer-Verlag, Berlin, 1992, pp. 546-556.
7. L. B. Schein, Electrophotography and Development Physics, 2nd Ed., Springer-Verlag, 1992.
8. L. B. Schein, G. Beardsley, "Offset Quality Electrophotography", Journal of Imaging Science and Technology, vol. 37, no. 5, October, 1993, pp. 451-461.
9. C. A. Glasbey, G. W. Horgan, D. Hitchcock, "A Note on the Grey-scale Response and Sampling Properties of a Desktop Scanner", Pattern Recognition Letters, vol. 15, July, 1994, pp. 705-711.
10. G. Caronni, "Assuring Ownership Rights of Digital Images", ftp://ktik0.ethz.ch/pub/tagging, submitted for publication, 1994.
11. K. Tanaka, Y. Nakamura, K. Matsui, "New Integrated Coding Schemes for Computer Aided Facsimile", First International Conference on Systems Integration, IEEE Computer Society, Morristown, NJ, April 1990, pp. 275-281.
12. G. Nagy, "Optical Scanning Digitizers", IEEE Computer Magazine, May, 1983, pp. 13-24.