Linguistic Steganography Detection Using Statistical Characteristics of Correlations between Words

Zhili Chen*, Liusheng Huang, Zhenshan Yu, Wei Yang, Lingjun Li, Xueling Zheng, and Xinxin Zhao

National High Performance Computing Center at Hefei, Department of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, China
[email protected]

Abstract. Linguistic steganography is a branch of Information Hiding (IH) that uses written natural language to conceal secret messages. It plays an important role in the Information Security (IS) area. Previous work on linguistic steganography was mainly focused on steganography itself, and there was little research on attacks against it. In this paper, a novel statistical algorithm for linguistic steganography detection is presented. We use the statistical characteristics of correlations between the general service words gathered in a dictionary to classify given text segments into stego-text segments and normal text segments. In an experiment that blindly detects three different linguistic steganography approaches, Markov-Chain-Based, NICETEXT and TEXTO, the total accuracy of discovering stego-text segments and normal text segments is found to be 97.19%. Our results show that linguistic steganalysis based on correlations between words is promising.

1 Introduction

As text-based Internet information and information dissemination media, such as e-mail, blogs and text messaging, rise rapidly in people's lives today, the importance and volume of text data are increasing at an accelerating pace. This growth in the significance of digital text in turn raises concerns about the use of text media as a covert channel of communication. One such covert means of communication is known as linguistic steganography. Linguistic steganography makes use of written natural language to conceal secret messages; the whole idea is to hide the very presence of the real messages. Linguistic steganography algorithms embed messages into a cover text in a covert manner such that the presence of the embedded messages in the resulting stego-text cannot be easily discovered by anyone except the intended recipient.

Previous work on linguistic steganography was mainly focused on how to hide messages. One method of modifying text to embed a message is to substitute selected words with their synonyms so that the meaning of the modified sentences is preserved as much as possible. A steganography approach based on synonym substitution is the system proposed by Winstein [1].

There are some other approaches, among which NICETEXT and TEXTO are the best known. The NICETEXT system generates natural-looking cover text using a mixture of word substitution and Probabilistic Context-Free Grammars (PCFG) ([2], [3]). The system contains a dictionary table and a style template; the style template can be generated using a PCFG or a sample text. The dictionary is used to randomly generate sequences of words, while the style template controls word generation, capitalization, punctuation, and white space by selecting natural sequences of parts of speech. The NICETEXT system is intended to protect the privacy of cryptograms from detection by censors. TEXTO is a linguistic steganography program designed to transform uuencoded or PGP ASCII-armoured data into English sentences [4]. It is used to facilitate the exchange of binary data, especially encrypted data. TEXTO works just like a simple substitution cipher, with each of the 64 ASCII symbols used by PGP ASCII armour or uuencode replaced by an English word. Not all of the words in the resulting text are significant; only the nouns, verbs, adjectives, and adverbs fill in the preset sentence structures. Punctuation and "connecting" words (or any other words not in the dictionary) are ignored.

Markov-Chain-Based is another linguistic steganography approach, proposed in [5]. This approach regards text generation as signal transmission from a Markov signal source. It builds a state transfer chart of the Markov signal source from a sample text; a part of such a chart, with branches tagged by equal probabilities that are represented by one or more bits, is illustrated in Fig. 1. The approach then uses the chart to generate cover text according to the secret messages.

Fig. 1. A part of tagged state transfer chart

The approaches described above generate innocuous-looking stego-text to deceive attackers. However, they have drawbacks. For example, the first approach sometimes substitutes synonyms that do not agree with correct English usage or with the genre and author style of the given text. The latter three approaches are detectable by a human warden, because the stego-text they generate does not read with a natural, coherent and complete sense. They can therefore only be used in communication channels where computers alone act as attackers.

A few detection methods have been proposed that exploit these drawbacks. The paper [6] brought forward an attack against systems based on synonym substitution, especially the system presented by Winstein, using a 3-gram language model. The experimental accuracy of this method on classifying steganographically modified sentences was 84.9%, and on unmodified sentences 38.6%. Another detection method, inspired by the design ideas of conception charts, was proposed in [7], using a measure of the correlation between sentences. The accuracy of the simulated detection using this method was 76%. Both methods fall short of the accuracy that practical detection requires. In addition, the first method requires a great deal of computation to calculate the large number of parameters of the 3-gram language model, and the second requires a database of rules that takes a lot of work to build.

Our research examines the drawbacks of the last three steganography approaches, aiming to accurately detect the application of these approaches in given text segments and to bring forward a blind detection method for linguistic steganography that generates cover text. We have found a novel, efficient and accurate detection algorithm that uses the statistical characteristics of the correlations between the general service words gathered in a dictionary to distinguish between stego-text segments and normal text segments.
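To make the Markov-Chain-Based generation concrete before we develop the detector, the following is a minimal sketch of the embedding idea with a toy, hand-made transfer chart. It is our illustration under stated assumptions, not code from [5], where the chart is built from a sample text:

```python
# Toy state transfer chart: each state lists its equally probable successor
# branches. With 2^k branches, k secret bits select the branch; a state with
# a single branch encodes nothing and its word is simply emitted.
CHART = {
    "she":  ["is"],
    "is":   ["a"],
    "a":    ["woman", "man", "good", "kind"],  # 4 branches -> 2 bits
    "good": ["actor", "mother"],               # 2 branches -> 1 bit
}

def embed(bits, state="she"):
    """Walk the chart, consuming secret bits at every branching state."""
    out = [state]
    while state in CHART and bits:
        branches = CHART[state]
        k = len(branches).bit_length() - 1   # bits consumed at this state
        idx = int(bits[:k], 2) if k else 0   # k message bits pick the branch
        bits = bits[k:]
        state = branches[idx]
        out.append(state)
    return " ".join(out)

print(embed("101"))  # -> "she is a good mother" ("10" picks "good", "1" picks "mother")
```

The receiver, holding the same chart, recovers the bits by re-tracing which branch each generated word corresponds to.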

2 Important Notions

2.1 N-Window Mutual Information (N-WMI)

In the area of statistical Natural Language Processing (NLP), an information-theoretic measure for discovering interesting collocations is Mutual Information (MI) [8]. MI is originally defined between two particular events x and y. In the case of NLP, the MI of two particular words x and y is defined as follows:

MI(x, y) = \log_2 \frac{P(x, y)}{P(x)P(y)} = \log_2 \frac{P(x|y)}{P(x)} = \log_2 \frac{P(y|x)}{P(y)}    (1)

Here, P(x, y), P(x) and P(y) are the occurrence probabilities of "xy", "x" and "y" in the given text. In our case, we regard these as the occurrence probabilities of the word pairs "xy", "x?" and "?y" in the given text, respectively, where "?" represents any word.

In natural language, a collocation is usually defined as an expression consisting of two or more sequential words. In our case, we investigate pairs of words within a certain distance. With this distance constraint, we introduce the following definitions.

N-Window Word Pair (N-WWP): Any pair of words in the same sentence with a distance less than N (N is an integer greater than 1). Here, the distance of a pair of words equals the number of words between them plus 1. Note that an N-WWP is order-related. In Fig. 2, the numbered boxes represent the words in a sentence and the variable d represents the distance of a word pair; the 3-WWPs in the sentence are illustrated by arrowed, folded lines. Hereafter, we denote the N-WWP "xy" as ⟨x, y⟩.

Fig. 2. An illustration of 3-WWP

N-Window Collocation (N-WC): An N-WWP with frequent occurrence. In some sense, our detection results are partially determined by the distribution of N-WCs in the given text segment, as we will see later.

N-Window Mutual Information (N-WMI): The MI of an N-WWP, which we use to measure its occurrence. An N-WWP is an N-WC if its N-WMI is greater than a particular value. With this definition, we can use equation (1) to evaluate the N-WMI of words x and y in a particular text segment. Given a certain text segment, let the counts of occurrences of all N-WWPs, of the N-WWP ⟨x, y⟩, of ⟨x, ?⟩ and of ⟨?, y⟩ be C, C_{xy}, C_x and C_y, respectively. Denoting the N-WMI value by MI_N, the evaluation is as follows:

MI_N(x, y) = \log_2 \frac{P(x, y)}{P(x)P(y)} = \log_2 \frac{C_{xy}/C}{(C_x/C)(C_y/C)} = \log_2 \frac{C \, C_{xy}}{C_x C_y}    (2)

Because of the significance of N-WMI in our detection algorithm, we give a further explanation with an example. Given the sentence "We were fourteen in all, and all young men.", let us evaluate the 4-WMI of ⟨in, all⟩. All 4-WWPs in the sentence are: ⟨we, were⟩, ⟨we, fourteen⟩, ⟨we, in⟩, ⟨were, fourteen⟩, ⟨were, in⟩, ⟨fourteen, in⟩, ⟨were, all⟩, ⟨fourteen, all⟩, ⟨in, all⟩, ⟨fourteen, and⟩, ⟨in, and⟩, ⟨all, and⟩, ⟨in, all⟩, ⟨all, all⟩, ⟨and, all⟩, ⟨all, young⟩, ⟨and, young⟩, ⟨all, young⟩, ⟨and, men⟩, ⟨all, men⟩, ⟨young, men⟩. We get C = 21, C_{in,all} = 2, C_{in} = 3, C_{all} = 6. Then we have:

MI_4(in, all) = \log_2 \frac{C \, C_{in,all}}{C_{in} C_{all}} = \log_2 \frac{21 \times 2}{3 \times 6} = 1.2224
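The computation above is mechanical enough to sketch in a few lines of Python. This is a minimal illustration rather than the authors' implementation, and it assumes a simple lowercase tokenization that drops punctuation:

```python
import math
import re

def n_wwps(sentence, n):
    """Enumerate all N-WWPs: ordered pairs of words in the same sentence
    whose distance (words between them plus 1) is less than n."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [(words[i], words[j])
            for i in range(len(words))
            for j in range(i + 1, min(i + n, len(words)))]

def n_wmi(pairs, x, y):
    """Evaluate equation (2): MI_N(x, y) = log2(C * C_xy / (C_x * C_y))."""
    c = len(pairs)                           # C: count of all N-WWPs
    c_xy = pairs.count((x, y))               # C_xy: occurrences of <x, y>
    c_x = sum(p[0] == x for p in pairs)      # C_x: occurrences of <x, ?>
    c_y = sum(p[1] == y for p in pairs)      # C_y: occurrences of <?, y>
    return math.log2(c * c_xy / (c_x * c_y))

pairs = n_wwps("We were fourteen in all, and all young men.", n=4)
print(len(pairs))                  # 21
print(n_wmi(pairs, "in", "all"))   # 1.2224..., matching the worked example
```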

2.2 N-Window Variance of Mutual Information (N-WVMI)

Suppose D is the general service word dictionary and M is the number of words in D; then we can form M × M different pairs of words from the dictionary D. In any given text segment, we can calculate the N-WMI value of each different word pair in dictionary D and obtain an N-WMI matrix. However, it is probable that some items of the matrix have no values because their corresponding pairs of words are absent from the given text segment; we say that such items are not present. We denote the N-WMI matrix of the training corpus by T_{M×M}, and that of a sample text segment by S_{M×M}. When all items of both T_{M×M} and S_{M×M} are present, we define the N-Window Variance of Mutual Information (N-WVMI) as:

V = \frac{1}{M \times M} \sum_{i=1}^{M} \sum_{j=1}^{M} (S_{ij} - T_{ij})^2    (3)

When either an item of S_{M×M} or its corresponding item of T_{M×M} does not exist, we say that the pair of items is not present; otherwise it is present. For example, if either S_{ij} or T_{ij} does not exist, the pair of items in position (i, j) is not present. Suppose I pairs of items are present; we then evaluate the N-WVMI as:

V = \frac{1}{I} \sum_{i=1}^{M} \sum_{j=1}^{M} (S_{ij} - T_{ij})^2 \, \delta(i, j)    (4)

Here, δ(i, j) = 1 if both S_{ij} and T_{ij} are present, and δ(i, j) = 0 otherwise. When I = M × M, equation (4) reduces to equation (3).
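A compact NumPy sketch of equation (4) follows. The encoding of absent items as NaN is our assumption for illustration, not a detail specified in the paper:

```python
import numpy as np

def n_wvmi(S, T):
    """Equation (4): mean squared difference over the I positions where
    both the sample matrix S and the training matrix T are present."""
    present = ~np.isnan(S) & ~np.isnan(T)           # delta(i, j)
    return np.mean((S[present] - T[present]) ** 2)

# Toy 2x2 matrices; only positions (0,0) and (1,0) are present in both.
T = np.array([[1.0, 2.0], [0.5, np.nan]])
S = np.array([[1.5, np.nan], [0.0, 3.0]])
print(n_wvmi(S, T))   # (0.5**2 + 0.5**2) / 2 = 0.25
```

When no item is absent, `present` is all-true and the same code evaluates equation (3).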

2.3 Partial Average Distance (PAD)

The N-WVMI is defined to distinguish, in principle, the statistical characteristics of normal text segments from those of stego-text segments. But a more precise statistical variable is necessary for accurate detection. Therefore, the Partial Average Distance (PAD) of the two N-WMI matrices S_{M×M} and T_{M×M} is defined as follows:

D_{\alpha,K} = \frac{1}{K} \sum_{i=1}^{M} \sum_{j=1}^{M} |S_{ij} - T_{ij}| \, [|S_{ij} - T_{ij}| > \alpha] \, \lambda_K(i, j)    (5)

In this equation, α represents a threshold on the distance between two N-WMI values, and K means that only the K greatest items of S_{M×M} are taken into account. As we can see, equation (5) averages the differences between those items of S_{M×M} and T_{M×M} that have great N-WMI values and great distances, as these items well represent the statistical characteristics of the two types of text segments. The expressions [|S_{ij} − T_{ij}| > α] and λ_K(i, j) are evaluated as:

[|S_{ij} - T_{ij}| > \alpha] = \begin{cases} 1 & \text{if } |S_{ij} - T_{ij}| > \alpha \\ 0 & \text{otherwise} \end{cases}

\lambda_K(i, j) = \begin{cases} 1 & \text{if } S_{ij} \text{ is among the } K \text{ greatest items} \\ 0 & \text{otherwise} \end{cases}
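A matching NumPy sketch of equation (5). As before, this is our illustration; it assumes absent items of S have already been replaced by -inf so that they never rank among the K greatest:

```python
import numpy as np

def pad(S, T, alpha=2.0, K=100):
    """Equation (5): average |S_ij - T_ij| over positions that are among
    the K greatest items of S and whose difference exceeds alpha."""
    d = np.abs(S - T)
    lam = np.zeros(S.shape, dtype=bool)              # lambda_K(i, j)
    lam.flat[np.argsort(S, axis=None)[-K:]] = True   # K greatest items of S
    return d[(d > alpha) & lam].sum() / K            # divides by K, per eq. (5)
```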

3 Method

In natural language, normal text has many inherent statistical characteristics that cannot be reproduced by text generated by the linguistic steganography approaches we investigate. We have observed the following: there is a strong correlation between words in the same sentence of normal text, but this correlation is weakened considerably in generated text. The reason is that a normal sentence has a natural, coherent and complete sense, while a generated sentence does not. For example, a sentence beginning with "She is a ..." more likely reads "She is a woman teacher", or "She is a beautiful actress", or "She is a mother" in normal text, whereas "She is a man", or "She is a good actor", or "She is a father" can plausibly appear only in generated text. This shows that the word "she" has a strong correlation with "woman", "actress" and "mother", but a weak correlation with "man", "actor" and "father". Therefore, we expect the N-WMI of "she" with "woman", "actress" or "mother" to be greater than that of "she" with "man", "actor" or "father". In our research, we use N-WMI to measure the strength of the correlation between two words.

In our experiment, we build a corpus from the novels of Charles Dickens, a great English novelist of the Victorian period, and name it Charles-Dickens-Corpus. We build another corpus from novels by several novelists whose last names begin with the letter "S" and name it S-Corpus. Finally, we build a third corpus from cover text generated by the linguistic steganography algorithms we investigate, NICETEXT, TEXTO and Markov-Chain-Based, and call it Bad-Corpus. We then build the training corpus from Charles-Dickens-Corpus, the good testing sample set from S-Corpus and the bad testing sample set from Bad-Corpus. The training corpus consists of about 400 text documents amounting to more than 10 MB. There are 184 text documents in the good testing sample set and 422 text documents in the bad testing sample set. The general service word dictionary D contains the 2000 words most widely used in English (that is, M = 2000). We let N = 4; that is, we use 4-WMI. The following procedure is then employed.

First, we process the training corpus as one large text segment to obtain the training N-WMI matrix T_{M×M} using dictionary D. Our program reads every text document, splits it into sentences, counts the occurrences of all the N-WWPs over D contained in the training corpus, and obtains C. For every N-WWP, we also obtain its C_{xy}, C_x and C_y along the way. We can then evaluate the N-WMI of every N-WWP with equation (2) and obtain the training N-WMI matrix T_{M×M}. In this step we can store T_{M×M} to disk for later use, so that if the related configuration parameters are not altered, we can simply read T_{M×M} from disk in subsequent sample text segment detection.


Second, we process a sample text segment to obtain the sample N-WMI matrix S_{M×M}. The procedure is similar to the first step, but here our program reads only a text segment of a certain size, such as 10 kB, from every sample text document. In this way, we can control the size of the sample text segment.

Third, we evaluate the N-WVMI value V of the sample text segment from S_{M×M} and T_{M×M} with equation (3) or (4). Some care must be taken in this step: if some pairs of items are absent from the matrices S_{M×M} and T_{M×M}, we use equation (4), that is, we calculate the variance over only the I pairs of items that are present; otherwise, equation (4) reduces to equation (3). In this step another variable, the PAD value D_{α,K}, is calculated by equation (5). This variable is a useful auxiliary classification feature in addition to the N-WVMI value V. In the experiment, we let α = 2 and K = 100, so we calculate D_{2,100}.

Finally, we use an SVM classifier to classify the sample text segment as a stego-text segment or a normal text segment according to the values V and D_{2,100}; see the sketch after this section's figure description.

Fig. 3 shows the flow of the detection procedure. Solid arrows represent the flow of data, while dashed arrows indicate that nothing is transferred or processed when T_{M×M} has already been stored. The thick dashed rectangle indicates the whole detection system. There are two key flows in the system: training and testing. The training process is not always required before the testing process; once completed, it does not need to be repeated in subsequent detection unless some configuration parameters are changed. The testing process contains two steps, evaluating the sample N-WMI matrix and then the classification features, the N-WVMI and PAD values, and ends with an output from an SVM classifier [9] indicating whether the testing sample text segment is stego-text.
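The final classification step thus reduces to two-dimensional supervised learning. A minimal sketch, using scikit-learn's SVC (a wrapper around LIBSVM [9]) with invented toy feature values purely for illustration:

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn's interface to LIBSVM [9]

# Each sample is the feature pair (V, D_{2,100}) computed as in Sect. 2;
# labels: 0 = normal text segment, 1 = stego-text segment.
# The numbers below are made-up toy values, not measured ones.
X_train = np.array([[0.8, 0.1], [0.7, 0.2], [2.9, 1.4], [3.1, 1.7]])
y_train = np.array([0, 0, 1, 1])

clf = SVC()                   # default RBF kernel; the kernel choice is our assumption
clf.fit(X_train, y_train)

print(clf.predict(np.array([[2.7, 1.5]])))   # -> [1]: classified as stego-text
```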

Fig. 3. Flow of the detection procedure


Table 1. Sample text set and detection results

Type       Generator            Sample   Success   Failure   Accuracy
Good Set   --                      184       178         6       --
Bad Set    Markov-Chain-Based      100        89        11       94.01%
Bad Set    NICETEXT                212       212         0       98.48%
Bad Set    TEXTO                   110       110         0       97.96%
Total                              606       589        17       97.19%

4 Results and Discussion

In our experiment, a total of 606 testing sample text segments, each with a size of 20 kB, are detected using their 4-WVMIs. The composition of the sample text set is presented in Table 1. Using an SVM classifier, we obtain a total accuracy of 97.19%. Note that the accuracy for each linguistic steganography method (denoted by LSMethod) is computed as follows:

Accuracy(LSMethod) = \frac{SUC(GoodSet) + SUC(LSMethod)}{SAM(GoodSet) + SAM(LSMethod)}

where SUC represents the number of successfully classified text segments and SAM represents the number of sample text segments. For example, for Markov-Chain-Based, Table 1 gives Accuracy = (178 + 89)/(184 + 100) = 267/284 ≈ 94.01%.

In Table 1, we can see that the accuracy of detecting stego-text segments generated by Markov-Chain-Based is noticeably lower than for the other two algorithms. The probable reason is that, when there is only one branch in the state transfer chart, the Markov-Chain-Based method sometimes embeds secret messages by adding white space between successive words copied from the sample text, rather than by generating new words. For example, consider a text segment generated by Markov-Chain-Based: ". . . I'll wait a year, according to the forest to tell each other than a brown thrush sang against a tree, held his mouth shut and shook it out, the elder Ammon suggested sending for Polly. . . ." We can see that the algorithm adds white space between the words "according" and "to", between "sending" and "for", and so on, and that these words are copied into the generated text directly from the sample text. This preserves more of the properties of normal text.

Fig. 4 shows the testing results of all testing samples, and Fig. 5 - Fig. 7 show the testing results of the samples generated by Markov-Chain-Based, NICETEXT and TEXTO respectively. As discussed above, the accuracy of detecting the Markov-Chain-Based method is slightly lower. The results of detecting the other two algorithms appear ideal at a text segment size of 20 kB, but when the segment size falls below 5 kB, for example to 2 kB, the accuracy decreases markedly. This is determined by the characteristics of the statistical algorithm, so sample text segments larger than 5 kB are recommended.


Fig. 4. Testing results of all testing samples

Fig. 5. Testing results of testing samples generated by Markov-Chain-Based

In addition, our algorithm is time efficient, although we have not measured this rigorously: it takes about one minute to detect more than 600 sample text segments.


Fig. 6. Testing results of testing samples generated by NICETEXT

Fig. 7. Testing results of testing samples generated by TEXTO

Overall, the results of our research appear promising. We have accurately detected the three aforementioned linguistic steganography methods in a blind way, and our method may be suited to detecting other or new linguistic steganography methods that generate natural-looking cover text. For other linguistic steganography methods, such as Synonym-Substitution-Based or translation-based steganography, detection based on the characteristics of correlations between words may still work; that is our future work.

5 Conclusion

In this paper, a statistical linguistic steganography detection algorithm has been presented. We use the statistical characteristics of the correlations between the general service words gathered in a dictionary to classify given text segments into stego-text segments and normal text segments. The strength of the correlation is measured by N-window mutual information (N-WMI). The total accuracy is as high as 97.19%, and the accuracies of blindly detecting the three different linguistic steganography approaches, Markov-Chain-Based, NICETEXT and TEXTO, are 94.01%, 98.48% and 97.96% respectively.

Our research mainly focuses on detecting linguistic steganography that embeds secret messages by generating cover text. But it is easy to modify our general service word dictionary to fit the detection of Synonym-Substitution-Based algorithms and other linguistic steganography methods that modify the content of a cover text. Therefore, our algorithm is widely applicable in linguistic steganalysis.

Many interesting new challenges are involved in the analysis of linguistic steganography algorithms, called linguistic steganalysis, which has little or no counterpart in other media domains such as image or video. Linguistic steganalysis performance depends strongly on many factors, such as the length of the hidden message and the way the cover text is generated. However, our research shows that linguistic steganalysis based on correlations between words is promising.

Acknowledgement

This work was supported by the NSF of China (Grant Nos. 60773032 and 60703071), the Ph.D. Program Foundation of the Ministry of Education of China (No. 20060358014), the Natural Science Foundation of Jiangsu Province of China (No. BK2007060), and the Anhui Provincial Natural Science Foundation (No. 070412043).

References

1. Winstein, K.: Lexical steganography through adaptive modulation of the word choice hash, http://alumni.imsa.edu/~keithw/tlex/lsteg.ps
2. Chapman, M.: Hiding the Hidden: A Software System for Concealing Ciphertext as Innocuous Text (1997), http://www.NICETEXT.com/NICETEXT/doc/thesis.pdf
3. Chapman, M., Davida, G., Rennhard, M.: A Practical and Effective Approach to Large-Scale Automated Linguistic Steganography. In: Davida, G.I., Frankel, Y. (eds.) ISC 2001. LNCS, vol. 2200, pp. 156–167. Springer, Heidelberg (2001)


4. Maher, K.: TEXTO, ftp://ftp.funet.fi/pub/crypt/steganography/texto.tar.gz
5. Shu-feng, W., Liu-sheng, H.: Research on Information Hiding. Master's thesis, University of Science and Technology of China (2003)
6. Taskiran, C., Topkara, U., Topkara, M., et al.: Attacks on lexical natural language steganography systems. In: Proceedings of SPIE (2006)
7. Ji-jun, Z., Zhu, Y., Xin-xin, N., et al.: Research on the detecting algorithm of text document information hiding. Journal on Communications 25(12), 97–101 (2004)
8. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. Publishing House of Electronics Industry, Beijing (January 2005)
9. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm