Improved MMR Technique for Extractive Text Summarization

International Journal of Mechatronics, Electrical and Computer Technology, Vol. 4(11), Apr. 2014, pp. 589-601, ISSN: 2305-0543

Maryam Kiabod*, Mohammad Naderi Dehkordi, Sayed Mehran Sharafi
Department of Computer Engineering, Najafabad Branch, Islamic Azad University, Isfahan, Iran
*Corresponding author's email: [email protected]

Abstract
Text summarization is a process that reduces the length of the original text and saves the reader time by presenting the main concept of the text. The two kinds of text summarization approaches are extractive and abstractive techniques. Our technique is based on the extractive approach, in which scores are assigned to the sentences of the text and the sentences with the highest scores are taken as the most important. One of the methods used in the extractive approach is the MMR technique. In this paper, we use the MMR technique in combination with graph theory and a genetic algorithm to extract the most important sentences of a text. The MMR technique is improved and its equation is changed; in addition, a genetic algorithm is used to obtain better results in calculating the weights of the graph edges. The results show that this technique performs better than the standard MMR technique.

Keywords: MMR technique, Genetic Algorithm, Graph Theory, Word Weight, Sentence Weight

1. INTRODUCTION
As the amount of information grows rapidly, text summarization is becoming more important: it gives faster access to the main concept of a text. There are two kinds of text summarization techniques: extractive and abstractive. An extractive technique selects the most important sentences of the original text and puts them together to form the summary; an abstractive method uses natural language processing to derive the main concept of the text and produce the summary. In the extractive method, two characteristics should be considered: finding the sentences of the text which are more important than others, and the coherence of the summary.


One approach to the first requirement is to assign a numerical measure to each sentence (called the sentence score) and then choose the sentences with the highest scores to form the summary. For this purpose, statistical criteria and semantic relations among words are used to weight each word of the original text, and the sentence score is then calculated from the word weights. Some studies applied TF/IDF (term frequency-inverse document frequency) [1], the number of words occurring in the title [2], and the number of numerical data items [3] as statistical criteria to weight the words of the text. Using these criteria alone does not produce a human-quality summary. As a result, NLP and lexical cohesion [4] are used to guarantee the cohesion of the summary. To apply lexical cohesion, chains of related words are formed that represent the cohesive structure of the text. Halliday and Hasan [5] classified lexical cohesion into two categories: reiteration and collocation. The reiteration category covers repetition, synonymy, and hyponymy, while the collocation category deals with the co-occurrence of words in the original text. One of the newer methods for determining the most important sentences of a text is the MMR technique; its benefit is that it considers both maximal relevance (novelty) and minimal redundancy, and it has been improved in several ways. Another method used in extractive text summarization systems is graph theory. This method assigns a node to each sentence of the text, and the edge weights are the similarities between pairs of sentences; the score of each sentence is calculated from the edge weights and node weights. We use this method to improve the MMR technique and to calculate the importance of a sentence in the text. We also use a Genetic Algorithm to constitute the sentence vectors of the text. The rest of this paper is organized as follows. Section 2 reviews previous work on text summarization systems. Section 3 presents our technique. Section 4 describes the experimental results and evaluation. Finally, we conclude and suggest future work in Section 5.


2. Text Summarization Approaches
Automatic text summarization dates back to the fifties. In 1958, Luhn [6] created a text summarization system based on weighting the sentences of a text; he used word frequency to identify the topic of the text. Several methods rely on statistical criteria. Edmundson [7] used the cue method (cue words such as "introduction", "conclusion", and "result"), the title method, and the location method to determine the weight of sentences. Statistical methods suffer from not considering the cohesion of the text. Kupiec, Pedersen, and Chen [8] suggested a trainable method to summarize the original text, in which the number of votes collected by a sentence determines the probability of that sentence being included in the summary. Another method is the graph approach proposed by Kruengkrai and Jaruskulchai [9] to determine the text title and produce a summary. Their approach takes advantage of both the local and global properties of sentences: they used clusters of significant words within each sentence to calculate the local property of a sentence, and the relations among all sentences in the text to determine the global property. Besides statistical methods, there are approaches that consider semantic relations among words; these methods need linguistic knowledge. Chen, Wang, and Guan [10] proposed an automated text summarization system based on lexical chains, where a lexical chain is a series of interrelated words in a text. WordNet is a lexical database that includes relations among words such as synonymy, hyponymy, meronymy, and others. Svore, Vanderwende, and Burges [11] used a machine learning algorithm to summarize text. Kyoomarsi, Khosravi, Eslami, and Khosravyan Dehkordy [12] proposed an approach based on fuzzy logic; fuzzy logic alone, however, does not guarantee the cohesion of the summary. Qazvinian, Sharif Hassanabadi, and Halavati applied a genetic algorithm in a text summarization system [13]. A genetic approach was also used to improve content selection in automatic text summarization by Khosravyan Dehkordi, Kumarci, and Khosravi [14]; it was based on statistical tools. Latent Semantic Analysis [15] is another approach used in text summarization systems. Abdel Fattah and Ren [16] proposed a technique based on regression to estimate the weights of text features. In a regression model, a mathematical function relates the output to the input variables.


Feature parameters were considered as the input variables, and the training phase identifies the corresponding outputs. Some methods combine algorithms, such as fuzzy logic and PSO [17]. Salim, Suanmali, and Binwahlan [18] proposed a technique based on fuzzy logic in which text features (such as similarity to the title, sentence length, similarity to keywords, etc.) are given to the fuzzy logic system as input parameters. Carbonell and Goldstein [19] presented MMR (Maximal Marginal Relevance) as a text summarization technique. In this approach, a greedy algorithm selects the sentences of the text most relevant to the user query while minimizing redundancy with the sentences already included in the summary; a linear combination of these two criteria is used to choose the best sentences for the summary. Carbonell and Goldstein [19] used cosine similarity to calculate both properties. Xie and Liu [20] used a centroid score for the first property and cosine similarity for the second. Murray, Renals, and Carletta [21] adapted this technique using different measures of novelty. McDonald [22] used an optimization algorithm to solve a new formulation of the summarization task.

3. Proposed Technique
The goal in extractive text summarization is to assign numerical scores to sentences and select the sentences with the highest scores to form the summary. There are many techniques for calculating the importance of text sentences; one of the most recent is the MMR technique, proposed by Carbonell and Goldstein in 1998 to consider both minimal redundancy and maximal relevance. Several techniques have been proposed to improve it. In this article, we change the equation of the MMR technique to improve its performance. The method proposed in this article is recursive and composed of four steps, as follows (a sketch of the whole loop is given after the list).

1. The input document is transformed into a labeled graph, where each node stands for a sentence and the edges represent the similarity relationships between the sentences.

2. The MMR technique is applied to the graph in order to identify suitable sentences to be extracted and used in the summary.


3. The sentences are sorted according to the score returned by the MMR technique; the sentence with the largest score is selected and inserted into the summary node of the graph, and the node containing that sentence is removed from the graph.

4. The number of sentences in the summary node of the graph determines whether the algorithm terminates: if it reaches a threshold, the algorithm stops; otherwise it returns to the second step.

In the following sections, further details on the method are provided. We discuss the features used as the labels of the graph and describe the structure used to implement this technique.
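Since the paper provides no reference implementation, the following is a minimal sketch of the four-step loop. The whitespace tokenization, plain bag-of-words vectors, and the value lambda = 0.7 are illustrative assumptions; the paper's own sentence vectors come from the Genetic Algorithm of Section 6, and its score is the improved MMR of equation (6).

```python
# A minimal sketch of the four-step summarization loop (assumptions:
# whitespace tokenization, bag-of-words vectors, lambda = 0.7).
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(v * b[t] for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def summarize(sentences: list[str], title: str, max_sentences: int,
              lam: float = 0.7) -> list[str]:
    vec = {s: Counter(s.lower().split()) for s in sentences}   # step 1: sentence nodes
    summary, remaining = [title], list(sentences)              # summary node seeded with the title
    while remaining and len(summary) - 1 < max_sentences:      # step 4: size threshold
        summ_vec = Counter()
        for s in summary:                                      # centroid of the summary node
            summ_vec.update(s.lower().split())
        def mmr(s: str) -> float:                              # step 2: score each node
            others = [t for t in remaining if t != s]
            rel = sum(cosine(vec[s], vec[t]) for t in others) / len(others) if others else 0.0
            return lam * rel - (1 - lam) * cosine(vec[s], summ_vec)
        best = max(remaining, key=mmr)                         # step 3: highest-scoring sentence
        summary.append(best)
        remaining.remove(best)                                 # drop its node from the graph
    return summary[1:]
```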

4. Pre-processing
In this step, we prepare the original text to be analyzed in the following sections. First, we use sentence segmentation to identify the sentences of the text. After that, sentence tokenization is used to identify the words of each sentence. Part-of-speech tagging is then used to determine the type of each word in the sentence. Since nouns are the crucial words in any text, we retain the nouns of the text and remove all other word types from the sentences. The final step is pruning the text words to reduce the number of words processed in the next sections, which leads to a faster analysis. The pruning criterion is the term frequency of a word: if the term frequency of a word is less than a threshold, we remove that word. We obtained the best results with a threshold of one quarter of the sum of the term frequencies of all text words, calculated by equation (1):

threshold = \frac{1}{4} \sum_{i=1}^{N} TF(w_i)    (1)

where TF(w_i) is the term frequency of word w_i and N is the number of distinct words in the text.
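As an illustration, the pre-processing pipeline might look as follows; the choice of NLTK and of the Penn Treebank "NN*" tags for nouns are assumptions, since the paper does not name its tools. The pruning threshold follows equation (1) as stated.

```python
# A sketch of the pre-processing step: sentence segmentation, tokenization,
# POS tagging, noun filtering, and pruning by the equation (1) threshold.
# NLTK is an assumption; requires nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger').
from collections import Counter
import nltk

def preprocess(text: str) -> list[list[str]]:
    sentences = nltk.sent_tokenize(text)                    # sentence segmentation
    noun_sents = []
    for sent in sentences:
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))     # tokenization + POS tagging
        noun_sents.append([w.lower() for w, tag in tagged if tag.startswith('NN')])
    tf = Counter(w for sent in noun_sents for w in sent)    # term frequencies of nouns
    threshold = sum(tf.values()) / 4                        # equation (1)
    return [[w for w in sent if tf[w] >= threshold] for sent in noun_sents]
```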


5. Calculating the Weights of Words
In this step, we assign weights to the text nouns to be used in the next step. There are statistical and linguistic criteria to compute this score; we use three criteria, two statistical and one linguistic. The two statistical criteria are the term frequency of a noun and the normalized number of sentences containing it. For the linguistic criterion, we use WordNet as a lexical database to determine the semantic relations between words. We compute word weights with equation (2):

weight(w) = \alpha \cdot TF(w) + \beta \cdot SenCount(w) + \theta \cdot sem\text{-}sim(w, title)    (2)

where \alpha, \beta, and \theta are feature weights that determine the importance of the features, TF is the term frequency of the word, SenCount is the number of sentences containing the word normalized by the total number of sentences in the text, and sem-sim is the maximum semantic similarity, determined with WordNet, between the word and the title. These parameters lie in the range [0, 1] and sum to 1.0. To obtain them, we use a two-layer backpropagation neural network with nine neurons in the hidden layer, a maximum error of 0.001, and a learning rate of 0.2. The connection weights of this network are initialized in the range [0, 1], and the sigmoid function is used as the transfer function. The importance of each parameter is determined by the average of the connection weights attached to the input neuron that represents that parameter [23]. After training the neural network on the training dataset, we use these weights to calculate the weights of the parameters.
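A sketch of equation (2) follows. The values of alpha, beta, and theta below are illustrative placeholders (the paper learns them with the neural network), and WordNet path similarity is one possible choice for sem-sim; the paper does not specify which WordNet measure it uses.

```python
# Equation (2): weight(w) = alpha*TF + beta*SenCount + theta*sem-sim(w, title).
# Requires nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def sem_sim(word: str, title_word: str) -> float:
    """Maximum WordNet path similarity over all synset pairs (an assumed measure)."""
    best = 0.0
    for s1 in wn.synsets(word):
        for s2 in wn.synsets(title_word):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best

def word_weight(word: str, tf: float, sen_count: float, title_words: list[str],
                alpha: float = 0.4, beta: float = 0.3, theta: float = 0.3) -> float:
    s = max((sem_sim(word, t) for t in title_words), default=0.0)
    return alpha * tf + beta * sen_count + theta * s
```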

6. The Representation of Text by a Graph
In a natural language text, the sentences are related to each other; in the graph representation, this relatedness is expressed by the labels of the graph edges. We add a node to the graph that contains the sentences of the summary and initialize it with the title of the text. In each cycle of the algorithm, the most important sentence, i.e. the node with the highest score, is added to the summary node, and the node containing that sentence is removed from the graph. Finally, the summary node contains the output of the algorithm.


In this work, we consider undirected weighted links. The nodes of the graph contain the sentences of the text, and the relatedness of two sentences is measured by the similarity between them. There are several techniques to compute the similarity between two sentences; in our case, we use cosine similarity, but we change the way the vector for each sentence is constituted. In this article, we propose a Genetic Algorithm to create the vectors of the text sentences. Figure 1 shows the graph model of the text.

Figure 1: The representation of text by graph (sentence nodes S1-S4 connected to one another and to the summary node by similarity-weighted edges Sim i,j and Sim i,summ)
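The graph of Figure 1 can be built as sketched below, reusing the cosine helper from the earlier snippet; networkx is an assumption (any adjacency structure would do), and the sentence vectors are whatever representation is chosen; the paper builds them with the Genetic Algorithm described next.

```python
# A sketch of the Figure 1 graph: one node per sentence, a summary node
# seeded with the title, and cosine-similarity edge weights.
import networkx as nx

def build_graph(sentence_vectors: dict, title_vector) -> nx.Graph:
    g = nx.Graph()
    g.add_node("summary", centroid=title_vector)          # summary node, initialized with the title
    ids = list(sentence_vectors)
    for i in ids:
        g.add_node(i, vector=sentence_vectors[i])
    for a, i in enumerate(ids):
        for j in ids[a + 1:]:
            g.add_edge(i, j, weight=cosine(sentence_vectors[i], sentence_vectors[j]))  # Sim i,j
        g.add_edge(i, "summary", weight=cosine(sentence_vectors[i], title_vector))     # Sim i,summ
    return g
```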

A Genetic Algorithm [24] is a technique for locating optimal solutions by working on a population of potential solutions. Each population is a collection of chromosomes, and a chromosome is formed by a sequence of genes. In each generation, genetic operators are applied to the current population to produce a new population. The Genetic Algorithm uses a fitness function and a selection mechanism to determine which chromosomes survive and produce the children of the next generation; after several generations, it converges to an optimal solution. We use these properties of the Genetic Algorithm to find the similarity between sentences of the text.


The Genetic Algorithm fitness function that we used is shown in equation (3); the parameter settings for the Genetic Algorithm are shown in Table 1.

(3)

Table 1: Parameter settings for the GA

Setting Type        Value
Encoding Scheme     Floating-Point
Population Size     100
Generations         2000
Selection           Roulette Wheel
Crossover           One Point
Crossover Rate      0.8
Mutation Rate       0.06
Elitism             Yes
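A compact GA skeleton matching the Table 1 settings is sketched below. The fitness function is left as a parameter because equation (3) did not survive extraction; non-negative fitness values and at least two genes per chromosome are assumed.

```python
# A GA skeleton per Table 1: floating-point genes, population 100,
# 2000 generations, roulette-wheel selection, one-point crossover at 0.8,
# per-gene mutation at 0.06, and elitism.
import random

def run_ga(fitness, n_genes: int, pop_size: int = 100, generations: int = 2000,
           cx_rate: float = 0.8, mut_rate: float = 0.06) -> list[float]:
    pop = [[random.random() for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c) for c in pop]
        total = sum(scores)

        def roulette() -> list[float]:
            """Roulette-wheel selection (assumes non-negative fitness)."""
            r, acc = random.uniform(0, total), 0.0
            for chrom, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return chrom
            return pop[-1]

        children = [pop[scores.index(max(scores))][:]]        # elitism
        while len(children) < pop_size:
            p1, p2 = roulette(), roulette()
            if random.random() < cx_rate:                     # one-point crossover
                cut = random.randrange(1, n_genes)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            child = [random.random() if random.random() < mut_rate else g
                     for g in child]                          # per-gene mutation
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# e.g. run_ga(lambda c: sum(c), n_genes=8, generations=50) drives genes toward 1.0
```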

7. MMR Technique
MMR considers the relevance of the sentences and the redundancy of the summary sentences at the same time. The final score of a given sentence is calculated by equation (4):

MMR(S_i) = \lambda \cdot sim_1(S_i, D) - (1 - \lambda) \cdot sim_2(S_i, Summ)    (4)

where D is the average of the document vectors, Summ is the average of the sentence vectors already extracted into the summary, and \lambda adjusts the combined score to emphasize relevance or to avoid redundancy. The two similarity functions sim_1 and sim_2 represent the similarity of a sentence to the entire document and to the selected summary, respectively. The sentences with the highest MMR scores are iteratively added to the summary until the summary reaches a predefined size.
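With the cosine helper from the earlier sketch, this baseline score is a one-liner; lambda = 0.7 is an illustrative value, not one reported by the paper.

```python
# Equation (4): relevance to the document centroid D versus redundancy with
# the summary centroid Summ.
def mmr_score(sentence_vec, doc_centroid, summary_centroid, lam: float = 0.7) -> float:
    return (lam * cosine(sentence_vec, doc_centroid)
            - (1 - lam) * cosine(sentence_vec, summary_centroid))
```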


In our technique, we change the MMR formula to adapt it to the graph representation. The MMR score of each node of the graph is calculated by equation (5):

MMR(S_i) = \lambda \cdot \frac{1}{n} \sum_{j \neq i} sim(S_i, S_j) - (1 - \lambda) \cdot sim(S_i, Summ)    (5)

where n denotes the number of edges connecting the node to the other nodes of the graph, excluding the summary node. In this way, the relatedness of a sentence to the remaining sentences of the text is calculated as the average of the similarities between that sentence and the remaining sentences. As mentioned before, we use cosine similarity to compute the similarity between two sentences, improved by a Genetic Algorithm that creates the vector for each sentence of the text. The proposed formula, however, introduces a new problem: a sentence that is very similar to some sentences of the text and dissimilar to the others receives almost the same score as a sentence with roughly the same similarity to all text sentences. To solve this problem, we use the variance measure to distinguish between these two kinds of sentences, and the MMR equation is changed according to equation (6).

MMR(S_i) = \lambda \cdot \left[ \frac{1}{n} \sum_{j \neq i} sim(S_i, S_j) - Var(S_i) \right] - (1 - \lambda) \cdot sim(S_i, Summ)    (6)

where n is the number of text sentences and the variance is calculated by equation (7):

Var(S_i) = \frac{1}{n} \sum_{j \neq i} \left( sim(S_i, S_j) - avg(S_i) \right)^2    (7)

where sim(S_i, S_j) is the similarity between sentences S_i and S_j, and avg(S_i) is the average similarity between the sentence S_i and the other sentences of the text. In our opinion, the more significant words of each sentence play a more important role in determining the similarity between two sentences.


As a result, we rank the words of each sentence of the text according to their importance probability (calculated by the Genetic Algorithm in the previous section) in decreasing order. Then, we compute the cosine similarity of the vectors of the two sentences to calculate the similarity score in the MMR technique. The cosine similarity is calculated by equation (8):

cos(A, B) = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2} \cdot \sqrt{\sum_i B_i^2}}    (8)

where A and B are the attribute vectors of the two sentences. These vectors are usually the term frequencies of the words in the text; in our technique, each vector holds, in decreasing order, the probabilities of the words of the sentence being included in the significant-words list. As mentioned before, we use the Genetic Algorithm to determine this probability; in this way, the importance probabilities of a sentence's words constitute the vector of that sentence.
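Putting equations (5) to (8) together gives the sketch below. The vectors are lists of GA-derived word-importance probabilities in decreasing order; the placement of the variance term inside the relevance bracket follows the reconstruction of equation (6) above and is an assumption.

```python
# A sketch of equations (5)-(8): cosine over ranked importance-probability
# vectors, average similarity as relevance, variance as a penalty.
import math

def cosine_ranked(a: list[float], b: list[float]) -> float:
    """Equation (8), zero-padding the shorter ranked-probability vector."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def improved_mmr(i: int, vectors: list[list[float]], summary_vec: list[float],
                 lam: float = 0.7) -> float:
    sims = [cosine_ranked(vectors[i], v) for j, v in enumerate(vectors) if j != i]
    avg = sum(sims) / len(sims)                            # relevance, equation (5)
    var = sum((s - avg) ** 2 for s in sims) / len(sims)    # equation (7)
    return lam * (avg - var) - (1 - lam) * cosine_ranked(vectors[i], summary_vec)  # equation (6)
```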

8. Evaluation
Text summarization evaluation is a complicated task. The pyramid method is a recent technique for evaluating the effectiveness of text summarization systems [25]. Using it has several benefits: it accounts for the variation in content selection across human summaries and makes the results less dependent on the models used for evaluation. The main unit in this method is the Semantic Content Unit (SCU), which links different surface realizations with the same meaning. In an ideal summary, an SCU from a lower tier is not included unless all SCUs from higher tiers are included as well. The number of human summarizers who expressed the SCU in their summaries determines the weight of each SCU. Because this annotation is very labor intensive, a special tool (DUCView) has been developed to facilitate the process. The score assigned to each summary is calculated by dividing the total SCU weight by the optimal content score. The total SCU weight is determined by equation (9):

D = \sum_{i=1}^{n} i \cdot D_i    (9)

where i is the weight of every SCU in tier T_i and D_i is the number of SCUs in the summary that appear in T_i.
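For concreteness, equation (9) is a weighted count; a minimal sketch follows, where `observed` (a naming assumption) maps a tier index i to D_i.

```python
# Equation (9): each SCU in tier T_i carries weight i; observed[i] = D_i,
# the number of summary SCUs that appear in tier T_i.
def total_scu_weight(observed: dict[int, int]) -> int:
    return sum(i * d_i for i, d_i in observed.items())
```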


The optimal content score is calculated by equation (10):

Max = \sum_{i=j+1}^{n} i \cdot |T_i| + j \cdot \left( X - \sum_{i=j+1}^{n} |T_i| \right)    (10)

where X denotes the summary size and j is the first tier, counting top-down, such that the sum of its cardinality and the cardinalities of the tiers above it is greater than or equal to X.
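The full pyramid score then combines equations (9) and (10), as sketched below; `tiers` maps a tier index i to |T_i|, and the sketch assumes the pyramid contains at least X SCUs in total.

```python
# Equation (10): the best total weight attainable with X SCUs, filled greedily
# from the highest tiers downward; the pyramid score divides equation (9) by it.
def optimal_content_score(tiers: dict[int, int], x: int) -> int:
    score, taken = 0, 0
    for i in sorted(tiers, reverse=True):       # walk tiers top-down
        if taken + tiers[i] >= x:
            return score + i * (x - taken)      # fill the remainder from tier j = i
        score += i * tiers[i]
        taken += tiers[i]
    return score                                # reached only if the pyramid has fewer than x SCUs

def pyramid_score(observed: dict[int, int], tiers: dict[int, int], x: int) -> float:
    return total_scu_weight(observed) / optimal_content_score(tiers, x)
```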

We use DUC2002 (www.nlpir.nist.gov) as input data to train the neural network and to test our technique. We compare our results with the MMR technique proposed by Carbonell and Goldstein; in their approach, the sentence vector consists of the term frequencies of the sentence words. The results are shown in Figure 2.

Figure 2: The comparison of the pyramid score with the proposed technique and without it

The results show that using the proposed equation in the MMR technique yields better performance in selecting the most important sentences of the text.

Conclusion
In this paper, we improved the MMR technique. We applied a Genetic Algorithm to identify the importance of the words of each sentence, investigated the use of graph theory in the MMR technique, and finally changed the MMR formula, using a variance measure to distinguish sentences that are similar to only some text sentences from those similar to all of them.


The evaluation results show that using the proposed equation in the MMR technique results in better performance. In future work, we intend to change the fitness function and to use other similarity factors in the MMR technique.

References
[1] M. Wasson, "Using leading text for news summaries: Evaluation results and implications for commercial summarization applications", in Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the ACL, 1998; 1364-1368.
[2] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval", Information Processing and Management, 1988; 24:513-523. Reprinted in: K. Sparck Jones and P. Willett (eds.), Readings in Information Retrieval, Morgan Kaufmann, 1997; 323-328.
[3] C.Y. Lin, "Training a selection function for extraction", in Proceedings of the Eighth International Conference on Information and Knowledge Management, Kansas City, Missouri, United States, 1999; 55-62.
[4] M. Hoey, "Patterns of Lexis in Text", Oxford: Oxford University Press, 1991.
[5] M. Halliday and R. Hasan, "Cohesion in English", London: Longman, 1976.
[6] H.P. Luhn, "The automatic creation of literature abstracts", IBM Journal of Research and Development, 1958; 2:159-165.
[7] H.P. Edmundson, "New methods in automatic extracting", Journal of the ACM, 1969; 16(2):264-285.
[8] J. Kupiec, J. Pedersen and F. Chen, "A trainable document summarizer", in Proceedings of the 18th ACM-SIGIR Conference, 1995; 68-73.
[9] C. Kruengkrai and C. Jaruskulchai, "Text summarization using local and global properties of sentences", in Proceedings of the IEEE/WIC International Conference on Web Intelligence, 2003; 13-16.
[10] Y. Chen, X. Wang and Y. Guan, "Automatic text summarization based on lexical chains", in Advances in Natural Computation, 2005; 947-951.
[11] K. Svore, L. Vanderwende and C. Burges, "Enhancing single-document summarization by combining RankNet and third-party sources", in Proceedings of EMNLP-CoNLL, 2007.
[12] F. Kyoomarsi, H. Khosravi, E. Eslami and P.K. Dehkordy, "Optimizing text summarization based on fuzzy logic", in Proceedings of the Seventh IEEE/ACIS International Conference on Computer and Information Science, IEEE, 2008; 347-352.
[13] V. Qazvinian, L. Sharif Hassanabadi and R. Halavati, "Summarising text with a genetic algorithm-based sentence extraction", International Journal of Knowledge Management Studies (IJKMS), 2008; 4:426-444.
[14] P.K. Dehkordi, F. Kumarci and H. Khosravi, "Text summarization based on genetic programming", International Journal of Computing and ICT Research, 2009; 3.
[15] S. Hariharan, "Multi document summarization by combinational approach", International Journal of Computational Cognition, 2010; 8:68-74.
[16] M.A. Fattah and F. Ren, "Automatic text summarization", Proceedings of World Academy of Science, Engineering and Technology, 2008; 27:192-195.
[17] L. Suanmali, M.S. Binwahlan and N. Salim, "Sentence features fusion for text summarization using fuzzy logic", IEEE, 2009; 142-145.
[18] N. Salim, L. Suanmali and M.S. Binwahlan, "Fuzzy swarm based text summarization", Journal of Computer Science, 2009; 338-346.
[19] J. Carbonell and J. Goldstein, "The use of MMR, diversity-based reranking for reordering documents and producing summaries", in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998; 335-336.
[20] S. Xie and Y. Liu, "Using corpus and knowledge-based similarity measure in Maximum Marginal Relevance for meeting summarization", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2008; 4985-4988.
[21] G. Murray, S. Renals and J. Carletta, "Extractive summarization of meeting recordings", in Proceedings of the 9th European Conference on Speech Communication and Technology, 2005; 593-596.
[22] R. McDonald, "A study of global inference algorithms in multi-document summarization", in Proceedings of the European Conference on IR Research, 2007; 557-564.
[23] S. Zadeh and N. Sharif, "Evaluation of effective parameters and their effect on summarization systems using neural network (in Persian)", Fifth Annual International Conference of the Computer Society of Iran, 2008.
[24] D.E. Goldberg, "Genetic Algorithms in Search, Optimization, and Machine Learning", Addison-Wesley Professional, 1989.
[25] A. Nenkova, R. Passonneau and K. McKeown, "The pyramid method: Incorporating human content selection variation in summarization evaluation", ACM Transactions on Speech and Language Processing, 2007; 4(2).