EFFICIENT SIDE INFORMATION ENCODING FOR TEXT HARDCOPY DOCUMENTS

Paulo Borges, Ebroul Izquierdo
Dept. of Electronic Engineering, Queen Mary University of London, London, UK

Joceli Mayer
Dept. of Electrical Engineering, Federal University of Santa Catarina, Florianopolis, Brazil

Abstract

This paper proposes a new coding method that significantly increases the signal-to-watermark ratio in document watermarking algorithms. A possible approach to text document watermarking is to consider text characters as a data structure consisting of several modifiable features such as size, shape, position, and luminance, among others. In existing algorithms, these features are modified sequentially according to the bit values to be embedded. In contrast, the solution proposed here uses a positional information coding approach to embed information. Using this approach, the information is related to the position of the modified characters, and not to the bit embedded in each character. This coding is based on combinatorial analysis and, given a distortion constraint, it can embed more bits than the usual methods. An analysis showing the superior performance of positional coding for this type of application is presented. Experiments validate the analysis and the applicability of the method.

1 Introduction

Due to the undesired financial and security impact that fraudulent documents often cause, document authentication is seen as an indispensable research area, one that has experienced significant growth in recent years. Important documents are still often exchanged between companies and people in paper form. As a consequence, the development of a reliable method for the certification of hardcopy documents remains a critical challenge.

Several text watermarking techniques have been proposed in the literature. Brassil et al. [11] describe and compare several mechanisms to embed and decode information in documents, which are remarkably robust to the print and scan (PS) channel. One method is called line-shift coding, where a line is moved up or down according to the bit to be embedded. Line centroids can be used as references in blind detection (without the original document). Variations of the method include character and word shift coding [12, 9, 10], but they are essentially different implementations of the same fundamental idea. A different approach modifies the pixels of the characters [5, 13], for example by flipping a pixel from black to white, and vice versa. This class of methods has a large capacity potential, but since it relies on small dots, the printing and scanning operations must be performed at very high resolutions to reduce errors due to thresholding in the binarization of the scanned image.

978-1-4244-1696-7/07/$25.00 ©2007 IEEE.

Another important technique is called text luminance modulation (TLM) [2, 3]. It slightly modulates the luminance of text characters to embed side information. This modification is performed according to the target side information and, for cases of low PS distortion, it can be set to cause a very low perceptual impact while remaining detectable after printing and scanning.

A characteristic common to all the techniques described above is that the modifications of the characters (or other features) are performed in sequence, according to the bit to be embedded. In contrast, this paper proposes a new coding scheme where the embedded message is associated with the position of the modified feature in the text. It is called position based coding (PBC) and it is applicable to digital and printed documents. Considering a document composed of K characters, there are many possible ways (given by the binomial coefficient) to choose a given number Q of modified characters. To each of these possibilities a different bit string can be assigned, making use of a coding scheme. The detector searches the document for the Q positions that present the modified feature. The decoding process relates these positions to a bit string.

This technique is evaluated regarding perceptual impact, and it is shown that, for the same number of inserted bits, it uses less energy than the traditionally employed sequential modulation (SM). PBC is especially advantageous when the channel noise is so strong that the modifiable feature must be considerably modified to ensure detectability. In this case, SM often causes a disturbing pattern spread over the text. In contrast, using PBC, the modification is also visible, but it is confined to only Q characters, having a lower impact on the readability of the text.

This paper is organized as follows. Section 2 describes SM coding. Section 3 introduces PBC and presents a comparative analysis with SM, regarding the perceptual impact and the payload. Section 3.4 analyzes the error rate using PBC. To illustrate the validity of the analysis and the applicability of PBC, selected experimental results are presented in Section 4. The paper closes with relevant conclusions in Section 5.

2 Sequential Modulation

PBC is applicable to all of the hardcopy document watermarking methods described in the previous section. However, for the system description and the examples in this paper, luminance is chosen as the modifiable feature. Therefore, TLM is used to illustrate the underlying process.

Using TLM, each character in the original digital document is labeled as c_i, i = 1, 2, ..., K, where K is the total number of characters. The characters are labeled from left to right, and from top to bottom. In TLM, information is embedded by individually altering the luminance of c_i through an embedding function in which each character has its luminance modulated from black to any value in the real-valued discrete alphabet Ω = {ω_1, ω_2, ..., ω_S} of cardinality S, so that each symbol conveys log2 S bits of information. Considering a spatial coordinate system for each character, indexed by coordinates (m, n), c_i(m, n) is modulated by a gain w_i, w_i ∈ Ω. Assuming that c_i(m, n) ∈ {0, 1} and w_i ∈ [0, 1], the modified luminance pixels are in the range [0, 1], from white (level 0) to black (level 1). The general embedding function is given by:

$$ s_i(m, n) = w_i \, c_i(m, n) \qquad (1) $$

where s_i is the output element. The process is illustrated in Fig. 1 for S = 2, where the intensity changes have been augmented to make them visible and to illustrate the process.
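As an illustration of the embedding function (1), the following minimal Python sketch scales the pixels of one binary character image by a gain and then recovers the average luminance over the character's own pixels. The function names, the nested-list image representation, and the gain value 0.84 are assumptions made for this example; they do not come from the paper.

```python
def embed_tlm(char_pixels, gain):
    """Apply Eq. (1): s_i(m, n) = w_i * c_i(m, n) to one character.

    char_pixels is a 2-D list with entries in {0, 1} (1 = black ink),
    gain is the luminance gain w_i in [0, 1]. Both are hypothetical inputs.
    """
    return [[gain * p for p in row] for row in char_pixels]


def average_luminance(modulated, mask):
    """Average the modulated values over the pixels that belong to the character."""
    values = [s for s_row, c_row in zip(modulated, mask)
              for s, c in zip(s_row, c_row) if c == 1]
    return sum(values) / len(values)


# A tiny made-up 2x3 "character" with four ink pixels.
character = [[0, 1, 1],
             [1, 1, 0]]
watermarked = embed_tlm(character, 0.84)
luminance = average_luminance(watermarked, character)
```

In this noiseless setting the average luminance simply returns the gain, which is why a detector based on the character average can distinguish the symbols of Ω.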

Figure 1: Example of text watermarking through TLM.

In the detection process, the characters have their average luminances evaluated, although other statistics can also be used [4].

3 Position Based Coding

This section introduces an alternative coding scheme for TLM, which reduces the distortion of the text while maintaining the transmission rate. The scheme is also applicable to other methods based on the modification of individual characters, as discussed in Section 1. Using the proposed coding method, information is related to the position of the modulated characters in the document. This coding scheme is referred to as position based coding (PBC), as opposed to the sequential modulation (SM) described in Section 2. A related method focused on digital images has been proposed in [6, 7], where the authors embed information in an image by adding pseudo-random blocks to it in different positions, according to the information.

3.1 Positional Encoding

Using positional encoding, the information to be embedded is related to the position of a given number of modulated characters in the document. Consider a document with K characters. A message b is embedded into the document by modulating Q different characters in the text, where Q < K. Therefore, the set of indexes i of the modulated characters represents the embedded information. For instance, using an appropriate coding rule, a bit string b = [11010] can be embedded into the text

POSITIONAL ENCODING

by modulating 2 characters (Q = 2) such that the watermarked text becomes

POSITIONAL ENCODING

A different bit string, b = [01110] for example, could produce

POSITIONAL ENCODING

with a different pair of modulated characters. An efficient coding rule mapping "input information" ↔ "output positions" is described in [7]. Notice that this coding rule does not use a preset codebook, which would be computationally expensive. Instead, a mathematical relationship between input and output is used.

3.2 The Coding Scheme

This section describes a coding scheme for mapping a single number, which represents the information to be embedded, into Q other numbers, which represent the Q character reference positions in a document image. The encoding rule has some special characteristics:

• The input number is the decimal representation of the binary string to be embedded. The input is represented by Z.
• The output numbers (representing the positions where the modified characters will be placed) must be integers. Each output number is represented by a_i, with i = 1, 2, ..., Q.
• The values of the output numbers are constrained to be less than or equal to K.

The coding process is performed by using a mapping function

$$ Z = c_1 a_1 + c_2 a_2 + \cdots + c_Q a_Q \qquad (2) $$

where c_i are coefficients and a_i represent character positions in an image. It is suggested to use coefficients given by c_i = K^(i−1), i = 1, ..., Q. Given an input Z, to determine the outputs a_i, the following rule is used:

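$$ a_Q = \left\lfloor \frac{Z}{c_Q} \right\rfloor \qquad (3) $$

$$ a_i = \left\lfloor \frac{Z - \sum_{j=i+1}^{Q} c_j a_j}{c_i} \right\rfloor \qquad (4) $$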

An example to illustrate the use of the mapping function is presented next.

Example: Suppose that Q = 2 and the mapping function is expressed by Z = 1 · a_1 + 14 · a_2. Given an input Z = 138, for example, the output is given by:

$$ a_Q = a_2 = \left\lfloor \frac{138}{14} \right\rfloor = 9 \qquad (5) $$

$$ a_1 = \left\lfloor \frac{138 - \sum_{j=2}^{Q} c_j a_j}{c_1} \right\rfloor = \left\lfloor \frac{138 - 14 \cdot 9}{1} \right\rfloor = 12 \qquad (6) $$

Figure 2 illustrates the above mapping, where 138 corresponds to a_1 = 12 and a_2 = 9.
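The mapping rule of (2)-(4) can be sketched in a few lines of Python. This is only an illustrative reading of the equations, assuming the suggested coefficients c_i = K^(i−1); the function names are hypothetical, and the sketch makes no attempt to enforce that the resulting positions are distinct or within the range required by a particular document.

```python
def encode_positions(z, k, q):
    """Map the integer z to q values a_1..a_q using Eqs. (3) and (4).

    Coefficients are assumed to be c_i = k**(i-1), so index 0 holds c_1.
    """
    coeffs = [k ** i for i in range(q)]
    positions = [0] * q
    remainder = z
    # Work from a_Q down to a_1; each step is a floor division.
    for i in reversed(range(q)):
        positions[i] = remainder // coeffs[i]
        remainder -= coeffs[i] * positions[i]
    return positions


def decode_positions(positions, k):
    """Invert the mapping with Eq. (2): Z = sum of c_i * a_i."""
    return sum((k ** i) * a for i, a in enumerate(positions))


# The worked example of the text: Z = 138 with c_1 = 1 and c_2 = 14.
a = encode_positions(138, 14, 2)   # [a_1, a_2] = [12, 9]
z = decode_positions(a, 14)        # 138
```

Because the outputs are obtained arithmetically, no codebook has to be stored, which is the property the coding rule is chosen for.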

Figure 2: Example of combination table for Q = 2.

To retrieve the embedded information, the reverse process (decoding) is performed simply by taking as input the positions a_i obtained from the watermark detection process. In this example, Z = 1 · 12 + 14 · 9 = 138.

3.3 Distortion of PBC Versus SM

In the analysis of this section, assume that S = 2 (described in Section 2). Let R_P be the embedding capacity of a document using PBC, representing the number of bits embeddable. R_P depends on K and on Q. It is given by:

$$ R_P = \log_2 \binom{K}{Q} \qquad (7) $$

where $\binom{K}{Q} = \frac{K!}{Q!\,(K-Q)!}$ is the binomial coefficient in combinatorial analysis [8]. The payload given in (7) is illustrated as a function of K and Q in Fig. 3. In contrast, the embedding capacity using SM is simply a function of K, given by

$$ R_S = K \qquad (8) $$

Figure 3: Capacity of PBC as function of K and of the distortion D. (Axes: K, distortion; vertical axis: bits.)

In order to compare the embedding capacities of PBC and SM, (7) and (8) must be expressed as a function of the same parameters for the same distortion level. Let Q be the number of modified characters. Let the amount of distortion in a modulated character be represented by δ. Let the average distortion caused by embedding a message in a document be represented by D, given by

$$ D = \frac{Q \delta}{K} \qquad (9) $$

For simplicity, consider δ = 1 for the rest of this paper. Although the message b to be embedded is defined by the user, consider b as a realization of a random process. In PBC, Q is deterministic, independent of the bit string b. In contrast, using SM, the distortion depends on the embedded message. For instance, the distortion caused by b = [111101] is stronger than that caused by b = [000010]. Assume that bit 0 and bit 1 in b occur with equal probability, such that

$$ p_0 = p_1 = 0.5 \qquad (10) $$

where p_0 and p_1 are the probabilities of occurrence of bits 0 and 1, respectively. Using SM, p_0 and p_1 translate directly into the probabilities of occurrence of non-modulated and modulated characters. In a document composed of K characters, the expected number of modulated characters (or 'bit 1' characters) using SM is

$$ E\{Q\} = K p_1 = 0.5 K \;\therefore\; K = 2 E\{Q\} \qquad (11) $$

Because in PBC Q is deterministic, E{Q} = Q. Using this result, the substitution of (11) into (8) yields

$$ R_S = 2Q \qquad (12) $$

Fig. 4 shows the capacities of both methods, illustrating the overall better performance of PBC, for K = 200. This figure shows that if 20 out of the 200 characters in the document are modulated (D = 20/200 = 0.1), for example, PBC encodes 90 bits whereas SM encodes 40 bits. A surface corresponding to the ratio R_P/R_S is given in Fig. 5. A two dimensional representation of the ratio is given in Fig. 6, for K = 200.

Figure 4: Capacity as a function of the distortion D, for K = 200. (Axes: distortion, bits; legend: Payload PBC, Payload NPC.)

Figure 5: Ratio R_P/R_S, as a function of the distortion D. (Axes: K, distortion; vertical axis labeled "Ratio RP / RN".)

Figure 6: Ratio R_P/R_S, as a function of the distortion D, for K = 100. (Axes: distortion; vertical axis labeled "Ratio PBC/NPC".)
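The capacity comparison of (7) and (12) is easy to reproduce numerically. The sketch below is a hedged illustration with hypothetical function names: it evaluates both capacities for a given K and Q, and also finds the smallest Q for which PBC can carry a given number of bits, assuming δ = 1 as in the text.

```python
import math


def pbc_capacity_bits(k, q):
    """R_P = log2 C(K, Q), Eq. (7)."""
    return math.log2(math.comb(k, q))


def sm_capacity_bits(q):
    """R_S = 2Q, Eq. (12), valid for equiprobable bits."""
    return 2 * q


def min_modulated_chars(n_bits, k):
    """Smallest Q such that C(K, Q) >= 2**n_bits, or None if no Q suffices."""
    for q in range(k + 1):
        if math.comb(k, q) >= 2 ** n_bits:
            return q
    return None


# The operating point quoted in the text: K = 200, Q = 20 (D = 0.1).
r_p = pbc_capacity_bits(200, 20)
r_s = sm_capacity_bits(20)

# Distortion D = Q / K needed to carry a 40-bit message in K = 200 characters.
q_pbc = min_modulated_chars(40, 200)
d_pbc = q_pbc / 200
d_sm = 0.5 * 40 / 200   # SM modulates half of the 40 carrier characters on average
```

For K = 200 and Q = 20 the sketch gives a PBC capacity slightly above 90 bits against 40 bits for SM, in line with the values read from Fig. 4. For a 40-bit message it gives Q = 7 for PBC, that is D = 0.035 against an expected D = 0.1 for SM.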

3.4 Detection

In the digital-only domain, in applications where the document does not suffer significant distortions, the detection error rate is practically zero. However, in hardcopy applications, the PS channel can be seen as a noisy communications channel, causing several distortions to the image, such as low-pass filtering, non-linear gains, and additive noise [2, 4, 1]. The simplest detection metric to determine the embedded luminance is the average luminance of the character, given in (13). It is known from detection theory that this detection statistic is the Neyman-Pearson detector (which minimizes the error probability) when detecting a change in the mean level under Gaussian noise, which is the framework of this application. By mapping the (m, n) coordinates in (1) to a one-dimensional notation, the detection metric d_i for the i-th character is given by:

$$ d_i = \frac{1}{N_i} \sum_{n=1}^{N_i} y_i(n), \qquad (13) $$

where N_i is the number of pixels in character i and y_i(n) is the printed and scanned version of s_i(n). In the S = 2 case, for example, if the average luminance is above a threshold λ (possibly determined through statistical training, for a given PS channel), y_i is assumed to be modulated (bit 1).

Otherwise, it is assumed to be non-modulated (bit 0).

In a hardcopy application, considering the sum of the several distortions of the PS channel, it is observed experimentally that the detector output can be modeled as a normal random variable, as supported by the central limit theorem. This assumption is illustrated in Fig. 7, where the distributions of the average luminances of printed and scanned modulated and non-modulated characters are presented. This is also supported by experiments presented in [3]. With these assumptions, let d_N be a random variable representing the result of the detection statistic for non-modulated characters and assume it is distributed according to d_N ∼ N(µ_dN, σ²_dN). Similarly, let d_M represent the detection statistic of modulated characters, distributed according to d_M ∼ N(µ_dM, σ²_dM).

Because of the noisy channel, a disadvantage of PBC is that if the detection process wrongly assumes a character y_i to be modified, the embedded message is entirely lost. In contrast, SM has the advantage that bits can be recovered through error correcting codes, at the cost of some usable embedding rate. To reduce this problem in PBC, an alternative detection scheme that does not rely on the threshold λ is employed. Instead of evaluating whether the average luminance of y_i is greater or smaller than λ, the Q characters with the highest average luminance are determined. This causes a significant reduction in the error rate, as shown in the following.

In PBC, only Q among K characters are modulated. The detector selects as the modulated characters only the Q locations which provide the Q highest detection values. Therefore, to determine the error probability, the probability p_e that a non-modulated character presents a higher detection value than a modulated character must be determined. This probability is:

$$
\begin{aligned}
p_e &= \Pr(d_N > d_M) \\
    &= \int_{\mu_{d_M}}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_{d_N}} \, e^{-\frac{(x-\mu_{d_N})^2}{2\sigma_{d_N}^2}} \, dx \\
    &= \frac{1}{2}\,\operatorname{erfc}\!\left(\frac{\mu_{d_M}-\mu_{d_N}}{\sigma_{d_N}\sqrt{2}}\right) \qquad (14)\text{-}(17)
\end{aligned}
$$

Equation (17) describes the probability of erroneously detecting a single character as modulated. The total error probability must take into account K detections. Notice that erroneously detecting a single character results in missing the entire message. Thus, considering all the characters in the document, the probability of finding the wrong message using PBC is:

$$ p_{PBC} = 1 - (1 - p_e)^K \qquad (18) $$

$$ p_{PBC} = 1 - \left[ 1 - \frac{1}{2}\operatorname{erfc}\!\left( \frac{\mu_{d_M} - \mu_{d_N}}{\sigma_{d_N}\sqrt{2}} \right) \right]^K \qquad (19) $$
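Equations (17) and (18) can be evaluated directly with the Python standard library. The sketch below is illustrative only: the function names are hypothetical, and the parameter values are the ones reported for Experiment 1 in Section 4 (µ_dN = 0.196, µ_dM = 0.276, σ_dN = 0.0121). Since p_e can be very small, the power in (18) is computed through log1p and expm1, which is a numerical choice made here and not something stated in the paper.

```python
import math


def single_char_error_prob(mu_n, mu_m, sigma_n):
    """p_e of Eq. (17): 0.5 * erfc((mu_dM - mu_dN) / (sigma_dN * sqrt(2)))."""
    return 0.5 * math.erfc((mu_m - mu_n) / (sigma_n * math.sqrt(2)))


def pbc_message_error_prob(p_e, k):
    """p_PBC of Eq. (18): 1 - (1 - p_e)**K, evaluated stably for tiny p_e."""
    return -math.expm1(k * math.log1p(-p_e))


# Parameters reported for Experiment 1 in Section 4.
p_e = single_char_error_prob(0.196, 0.276, 0.0121)
p_pbc = {k: pbc_message_error_prob(p_e, k) for k in (20, 50, 100, 200)}
```

For small p_e the message error probability grows almost linearly with K, approximately K · p_e, which is the trend plotted against K in Fig. 8.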

An estimate of the parameters µ_dN, µ_dM, σ²_dN, and σ²_dM is given in Section 4.

When the channel noise is strong, using a perceptually transparent modulation causes a very high error rate. To suppress that, the modulation gain can be increased to a visible level. However, using SM, empirical tests indicate that this causes a disturbing pattern on the text. On the other hand, using PBC, only Q characters are modified. Although the modulation is visible, it is localized and does not affect the readability of the text.
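The difference between threshold detection and the "Q highest values" rule of this section can be explored with a small Monte Carlo sketch. It is a simulation under stated assumptions, not the authors' experimental code: detection values are drawn from the two Gaussian models d_N and d_M instead of being measured on printed and scanned characters, the function name and the seed are hypothetical, and the numeric parameters are those reported for Experiment 1 in Section 4.

```python
import random


def simulate_message_errors(k, q, mu_n, sigma_n, mu_m, sigma_m, lam, trials, seed=0):
    """Estimate message error rates for threshold and top-Q detection.

    A trial counts as an error when the detected set of modulated
    positions differs from the true set.
    """
    rng = random.Random(seed)
    threshold_errors = 0
    top_q_errors = 0
    for _ in range(trials):
        true_positions = set(rng.sample(range(k), q))
        # Simulated detector outputs d_i for all K characters.
        d = [rng.gauss(mu_m, sigma_m) if i in true_positions
             else rng.gauss(mu_n, sigma_n)
             for i in range(k)]
        # Rule 1: every character whose statistic exceeds lambda is "modulated".
        detected_thr = {i for i, value in enumerate(d) if value > lam}
        # Rule 2: only the Q characters with the highest statistics are kept.
        ranked = sorted(range(k), key=d.__getitem__, reverse=True)
        detected_top = set(ranked[:q])
        threshold_errors += detected_thr != true_positions
        top_q_errors += detected_top != true_positions
    return threshold_errors / trials, top_q_errors / trials


thr_rate, top_rate = simulate_message_errors(
    k=200, q=2, mu_n=0.196, sigma_n=0.0121, mu_m=0.276, sigma_m=0.0142,
    lam=0.233, trials=2000)
```

Under these assumptions the threshold rule fails in a sizeable fraction of the trials for K = 200, while the top-Q rule fails far less often, which is the qualitative behavior the section argues for.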

4 Experiments

The purpose of this section is to illustrate, through Monte Carlo simulations, the applicability of PBC and the validity of the analyses. Practical aspects of the TLM method, such as segmentation and character indexing, are discussed in depth in [4]. The experiments were conducted with an HP LaserJet 1320 printer and a Canon CanoScan LiDE 20 scanner. The printing and scanning resolutions were set to 600 dots/inch and 300 pixels/inch, respectively.

4.1 Experiment 1

This experiment implemented the text hardcopy watermarking system of [2, 3], which embeds data by modifying the luminances of characters while respecting a perceptual transparency requirement. Initially, a text sample composed of 10,320 characters is generated, as in 'abcdef...'. The font tested was Arial, size 12. The luminances of the characters were randomly modified to {0, 0.16} with equal probability. In the experiments, small text elements such as commas and dots are not watermarked. These elements are composed of a smaller number of pixels, making them more susceptible to segmentation and detection errors. The characters are printed and scanned and have their average luminances evaluated using (13). The characters are separated from the background using simple thresholding, although more advanced methods for document segmentation could be employed [14]. Segmentation errors are not observed in this set of tests; however, it is clear that they may cause synchronization errors in decoding. Increasing the scanning resolution is an efficient option to reduce the occurrence of wrong segmentation, at the expense of computational complexity. For this set of tests, the obtained values for the parameters are µ_dN = 0.196, µ_dM = 0.276, σ_dN = 0.0121, and σ_dM = 0.0142, and the detection histogram is presented in Fig. 7. In this case, the detection threshold that minimizes the detection error is λ = 0.233.

Figure 7: Distribution of the average luminance of the characters. (Legend: non-modulated characters, modulated characters; the threshold λ is marked.)

4.2 Experiment 2

In this experiment, 200 sample documents composed of K characters are printed, for K = 20, 50, 100 and 200. The characters in the documents are modulated such that the samples have a message embedded according to the PBC algorithm. For this set of samples, only 2 among K characters are modulated (Q = 2). The detection process determines the 2 modulated characters, decoding their positions into the embedded bit string. Using the threshold λ = 0.233 determined in Experiment 1 to classify the modulated and the non-modulated characters, a significant error rate is observed, presented in Table 1. However, using the detection procedure where only the 2 characters with the highest average luminances are determined, no detection errors are observed. Notice that no errors are observed because the theoretical error rate is very low, and the number of trials is not large enough to properly illustrate this rate (because printing and scanning is time consuming). This experiment, however, illustrates the applicability of the method.

Table 1: Experimental message error rates.

K      Using threshold λ    Using highest d
20     1.94 × 10^-2         0
50     4.70 × 10^-2         0
100    8.80 × 10^-2         0
200    0.177                0

4.3 Experiment 3

To validate the analysis of Section 3.4, an experiment similar to Experiment 2 is performed in the digital domain, using computer generated normally distributed noise to simulate the channel distortion. Therefore, no printing and scanning is performed, allowing a larger number of trials. The observed error rates as a function of K are presented in Fig. 8, represented by the triangles. This figure also plots the theoretical error rates determined from (19), illustrating the agreement between theory and practice.

Figure 8: Theoretical error rate in PBC. (Axes: K, error probability; legend: theoretical, experimental.)

4.4 Experiment 4

In this experiment, random binary messages b are generated, with bits 0 and bits 1 equiprobable. The size of the messages varies from 1 to 100 bits. The messages are embedded into a document composed of K = 200 characters, using the PBC and the SM coding methods. Fig. 9 shows a plot representing the distortions when the messages are encoded using PBC and SM. This plot illustrates that, for a document of size K = 200, PBC can embed the same number of bits as SM while causing a lower distortion. Notice that in PBC the distortion (number of modified characters) is deterministic for a given number of bits embedded, whereas in SM the distortion is non-deterministic, dependent on the bit values in b.

Figure 9: Comparison between the distortions in PBC and in SM as a function of the number of bits embedded, for a document composed of K = 200 characters. (Axes: bits embedded, distortion D; legend: SM, PBC.)

5 Conclusions

This paper proposes and analyzes a new information coding method applicable to text watermarking. In traditional approaches, the bit string representing the side message is embedded by performing modifications in the text in sequential order, according to each bit. In contrast, in the proposed method, side information is transmitted according to the position of the modifications in the text. An analysis determining the embedding capacity of the method is presented. Given an average distortion constraint, it is shown that positional coding can transmit more bits than the usual methods. Notice that when the channel noise is strong the modulation gain must be increased to a visible level to ensure detectability. In this case, using SM causes a disturbing pattern on the text. In contrast, using PBC, although visible, the modulation is local, resulting in a lower degradation of the readability of the text. Although most of this paper uses the luminance of the characters as the modifiable feature, PBC is also applicable to the text watermarking algorithms presented in Section 1. An analysis determining the error rate using PBC is presented. Computer simulations illustrate the superior performance of the proposed method and validate the analyses.

References

[1] N. D. Quintela and F. Pérez-González, "Visible encryption: Using paper as a secure channel," in Proc. of SPIE, USA, 2003.
[2] R. Villán, S. Voloshynovskiy, O. Koval, J. Vila, E. Topak, F. Deguillaume, Y. Rytsar and T. Pun, "Text data-hiding for digital and printed documents: theoretical and practical considerations," in Proc. of SPIE, Elect. Imaging, USA, 2006.
[3] P. V. Borges and J. Mayer, "Document watermarking via character luminance modulation," IEEE Int'l Conf. on Acoustics, Speech and Signal Processing, ICASSP 06, Toulouse, France, May 2006.
[4] P. V. Borges and J. Mayer, "Text luminance modulation for hardcopy watermarking," Signal Processing, Elsevier, January 2007.
[5] Min Wu and Bede Liu, "Data hiding in binary image for authentication and annotation," IEEE Trans. on Multimedia, August 2004.
[6] R. Baitello, M. Barni, F. Bartolini, V. Cappellini, "From watermark detection to watermark decoding: a PPM approach," Signal Processing 81, pp. 1261-1271, 2001.
[7] Paulo Vinicius K. Borges and Joceli Mayer, "Analysis of position based watermarking," Pattern Analysis and Applications, Springer-Verlag, 2006.
[8] J. Riordan, Introduction to Combinatorial Analysis, Dover Publications, 2002.
[9] A. M. Alattar and O. M. Alattar, "Watermarking electronic text documents containing justified paragraphs and irregular line spacing," Proc. of SPIE, Volume 5306, June 2004.
[10] H. Yang and A. C. Kot, "Text document authentication by integrating inter character and word spaces watermarking," Proc. IEEE Int'l Conf. on Multimedia and Expo, 2004.
[11] J. T. Brassil, S. Low, N. F. Maxemchuk, "Copyright protection for the electronic distribution of text documents," Proc. of IEEE, Volume 87, No. 7, pp. 1181-1196, July 1999.
[12] D. Huang and H. Yan, "Interword distance changes represented by sine waves for watermarking text images," IEEE Trans. on Circuits and Systems for Video Technology, Volume 11, No. 12, pp. 1237-1245, Dec. 2001.
[13] T. Amano, "A feature calibration method for watermarking of document images," in IEEE Proc. of the Fifth Int'l Conf. on Document Analysis and Recognition, ICDAR '99, 20-22 Sept. 1999.
[14] Y. Yang and H. Yan, "An adaptive logical method for binarization of degraded document images," Pattern Recognition, Vol. 33, pp. 787-807, 2000.