Arabic Rhetorical Relations Extraction for Answering "Why" and "How to" Questions

Jawad Sadek1, Fairouz Chakkour2, Farid Meziane1

1 School of Computing Science & Engineering, University of Salford, Manchester, England
2 Computer Engineering Department, Aleppo University, Aleppo, Syria

Abstract. In this study we exploit the discourse structure of Arabic text to automatically find answers to non-factoid questions ("Why" and "How to"). Our method is based on Rhetorical Structure Theory (RST), which many studies have shown to be an effective approach for computational linguistics applications such as text generation, text summarization and machine translation. For both types of questions we assign one or more rhetorical relations that help to discover the corresponding answers. This is the first Arabic question answering system that attempts to answer "Why" and "How to" questions.

Keywords: Information Retrieval, Text Mining, Question Answering for Arabic, non-factoid questions, Discourse analysis.

1 Introduction

In recent years the internet has witnessed explosive growth in the amount of text available. This has motivated researchers in natural language processing to develop systems and tools that generate direct answers containing the specific information the user is looking for, instead of a list of relevant documents. These systems are known as question answering (QA) systems. Researchers in the field of QA have developed many systems for different natural languages. Most of these systems focus on factoid questions such as who, what, where and when [1][2][3]. A study on answering why questions in English was presented in [4]; however, no attempt has been made to create a system that can handle why and how to questions for the Arabic language. In this paper we develop an Arabic text analysis tool that extracts Arabic rhetorical relations which can be used within an RST structure to automatically discover answers to why and how to questions in Arabic.

2 Methodology for RST-based Question Answering

Finding answers to why and how to questions involves searching for arguments in texts. RST distinguishes between the part of a text that realizes the primary goal of the writer, termed the nucleus, and the part that provides supplementary material, termed the satellite. This distinction makes it an appropriate tool for analyzing argumentative paragraphs. Consider the following example, taken from an Arabic website, which illustrates the method used to extract answers. The text is broken into seven elementary units delimited by square brackets, which produce the schema shown in Fig. 1.

[حذر بحث علمي حديث من أن قناديل بحر عملاقة قد تهيمن على محيطات العالم]1 [جراء الصيد الجائر والتغيرات المناخية وأنشطة بشرية أخرى قد تؤدي لفناء الثروة السمكية.]2 [وتحذر دراسة أجراها "مركز CSIRO للأبحاث البحرية والجوية" الأسترالي، من نوع ضخم من قناديل البحر، يدعى "نورمورا Normura" وله قابلية النمو ليصل حجمه إلى حجم مصارع سومو ياباني، وقد يزن 200 كيلوغرام، بقطر يبلغ المترين.]3 [ويعمل باحثون على تجربة تقنيات مختلفة للسيطرة على انتشار قناديل البحر،]4 [منها استخدام الموجات الصوتية لتفجير تلك المخلوقات التي تتميز بجسم شفاف، وتطوير شبكات خاصة للقضاء عليها.]5 [ويعزو الباحثون التزايد الهائل في أعداد قناديل البحر]6 [للصيد الجائر للأسماك التي تقتات على قناديل البحر وتتنافس معها على موارد الغذاء.]7

[A recent scientific study warns that giant jellyfish may dominate the world's oceans]1 [due to overfishing, climate change and other human activities, which could lead to the destruction of fisheries.]2 [A study led by "CSIRO Marine and Atmospheric Research" in Australia warns of a giant jellyfish called "Normura" that can grow as big as a sumo wrestler; it can weigh up to 200 kilograms and reach 2 meters in diameter.]3 [Researchers are experimenting with different methods to control jellyfish,]4 [some of which involve using sound waves to explode these transparent-bodied creatures and developing special nets to cut them up.]5 [Researchers said that the cause of this explosion in the number of jellyfish]6 [is the overfishing of fish that feed on small jellyfish and compete with them for food.]7

Given the following question of the why type, relating to the above text, we need to extract the answer according to the derived schema.

{What is the cause of the increasing number of jellyfish?}

{ما سبب تزايد اعداد قناديل البحر؟}

We notice that the question words match unit6. Furthermore, unit7 provides the cause of the problem stated in unit6, which means that an interpretation relation holds between unit7 and unit6; it is labeled Rel3 in the schema (Fig. 1). Because the question is relevant to unit6, we can select the other part of the relation, unit7, as a candidate answer.

Fig. 1. A schematic representation of the text.

Now consider the following question, which belongs to the how-to type:

{How do we control jellyfish blooms?}

{كيف يمكن لنا الحد من انتشار قناديل البحر؟}

One can observe that unit5 gives some methods for solving the problem mentioned in unit4, so we conclude that an explanation relation holds between the two units (Rel2). Since the question corresponds to unit4, we can select the other part of the relation, unit5, as the answer.
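The selection step in the two examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Relation` class, the word-overlap matcher and the abbreviated English unit texts are hypothetical stand-ins, and the paper does not specify how the question is matched to a unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str
    left: int   # number of the left unit
    right: int  # number of the right unit


def tokens(text):
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,?!\"'").lower() for w in text.split()} - {""}


def best_matching_unit(question, units):
    """Assumed matcher: pick the unit sharing the most words with the question."""
    q = tokens(question)
    return max(units, key=lambda i: len(q & tokens(units[i])))


def candidate_answers(question, units, relations):
    """Return the other span of every relation that involves the matched unit."""
    matched = best_matching_unit(question, units)
    answers = []
    for rel in relations:
        if rel.left == matched:
            answers.append(units[rel.right])
        elif rel.right == matched:
            answers.append(units[rel.left])
    return answers


# Abbreviated English paraphrases of units 4-7 (illustrative only).
UNITS = {
    4: "Researchers are experimenting with different methods to control jellyfish",
    5: "some of these methods involve the use of sound waves and special nets",
    6: "Researchers said that the cause of the growing number of jellyfish",
    7: "is the overfishing of fish that feed on jellyfish and compete with them for food",
}
# Rel2 and Rel3 as named in the text.
RELATIONS = [Relation("Explanation", 4, 5), Relation("Interpretation", 6, 7)]

WHY_Q = "What is the cause of the increasing number of jellyfish?"
HOW_Q = "How do we control jellyfish blooms?"

if __name__ == "__main__":
    print(candidate_answers(WHY_Q, UNITS, RELATIONS))
    print(candidate_answers(HOW_Q, UNITS, RELATIONS))
```

A real system would presumably match on the Arabic text and use a more robust similarity measure; the sketch only shows that, once a unit is matched, the answer is read off the relation's other span.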

3 Text Structure Derivation

3.1 Rhetorical Relations Selection

A set of Arabic rhetorical relations must be compiled for use in why and how to question answering systems. We analyzed Arabic texts with the aim of extracting the relations that lead to answers to these types of questions. We identified four new rhetorical relations (Causal, Evidence, Explanation and Purpose); Table 1 gives their definitions according to the constraints stated in [5]. We also selected four rhetorical relations (Interpretation, Base, Result and Antithesis) identified by Al-Sanea [6], who used RST for Arabic text summarization.

3.2 Determining the Elementary Units and Rhetorical Relations

To derive the structure of a text automatically, we first need to determine its elementary units and then find the rhetorical relations that hold between them. Punctuation and cue phrases can play an important role in a variety of natural language processing tasks [7]. Cue phrases are the connectives (words, phrases, letters, ...) that writers use as cohesive ties between adjacent clauses and sentences and that help the reader understand the text. Our work uses cue phrases and punctuation as indicators of the boundaries between elementary textual units and as a basis for hypothesizing the rhetorical relations that hold between them. In this context, we analyzed Arabic texts and studied the Arabic style of linking linguistic units at all levels [8][9] to generate a set of cue phrases. For example, the Explanation relation can be identified by the occurrence of cue phrases such as "من خلال" (through), "عن طريق" (by way of) and "بواسطة" (by means of). Similarly, "وأشار" (and he pointed out), "أكد" (he confirmed) and "وقال" (and he said) can signal an Evidence relation.
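As a minimal sketch of cue-phrase lookup, the six cue phrases quoted above can be stored in a table and searched for as whole words. The table layout, the function name `find_cues` and the sample sentence are assumptions made for illustration; the full cue-phrase set used in this work is larger and is not reproduced here.

```python
import re

# Only the six cue phrases quoted in the text; the mapping structure is assumed.
CUE_PHRASES = {
    "من خلال": "Explanation",
    "عن طريق": "Explanation",
    "بواسطة": "Explanation",
    "وأشار": "Evidence",
    "أكد": "Evidence",
    "وقال": "Evidence",
}


def find_cues(text):
    """Return (position, cue, relation) triples, sorted by position.

    A cue is matched only when it is not embedded in a longer word, so that
    "أكد" does not fire inside another word such as "تأكد".
    """
    hits = []
    for cue, relation in CUE_PHRASES.items():
        pattern = r"(?<!\w)" + re.escape(cue) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            hits.append((match.start(), cue, relation))
    return sorted(hits)


# Hypothetical sample sentence, not taken from the evaluation corpus.
SAMPLE = "وأشار الباحثون إلى أن السيطرة تتم عن طريق شبكات خاصة"

if __name__ == "__main__":
    for position, cue, relation in find_cues(SAMPLE):
        print(position, cue, relation)
```

Whole-word matching is a simplification: Arabic attaches conjunctions and prepositions to the following word, which is presumably why some cues above are listed with the prefix "و" already attached.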

Table 1. Definitions of the extracted Arabic rhetorical relations.

| Relation | Definitional element | Definition |
| --- | --- | --- |
| Causal | Constraints on the nucleus, N: | None. |
| Causal | Constraints on the satellite, S: | S presents a cause for the situation presented in N. |
| Causal | Constraints on the N + S: | Without the presentation of S, the reader might not know the particular cause of the situation in N. |
| Causal | The effect: | The reader recognizes the situation presented in S as a cause of the situation presented in N. |
| Evidence | Constraints on the nucleus, N: | The writer states information that may need to be believed by the reader (R). |
| Evidence | Constraints on the satellite, S: | The writer presents what supports his claim in N. |
| Evidence | Constraints on the N + S: | R's comprehension of S increases his belief of N. |
| Evidence | The effect: | The reader's belief of N is increased. |
| Explanation | Constraints on the combination of nuclei: | The reader (R) won't comprehend the information before reading both nuclei. |
| Explanation | The effect: | R completely comprehends the writer's notion. |
| Purpose | Constraints on the nucleus, N: | Presents an activity or event that needs justification for the reader (R) to be convinced. |
| Purpose | Constraints on the satellite, S: | The writer provides a justification. |
| Purpose | Constraints on the N + S: | S presents a purpose for the event stated in N. |
| Purpose | The effect: | R recognizes the aim of the activity presented in N. |

Each cue phrase is associated with the features described in [7], so that rhetorical relations can be identified from their values. We assigned some of these features to the extracted cue phrases (Relation, Position, Status, Linking, Regular Expression and Action) and added the feature Question Position, which specifies the part of a relation that relates to the question. The break action feature value specifies where to create an elementary unit boundary in the input text; it takes one of the following values:

● Normal: inserts a unit boundary immediately before the occurrence of the cue phrase.
● Normal_then_comma: inserts a unit boundary immediately before the occurrence of the cue phrase and another unit boundary immediately after the first comma that follows. If no comma is found before the end of the sentence, a unit boundary is created at the end of the sentence.
● Normal_then_to: inserts a unit boundary immediately before the occurrence of the cue phrase and another unit boundary immediately after the first occurrence of the preposition "إلى" (to).
● Nothing: inserts no unit boundary, but assigns an action value to the next cue.
● End: inserts a unit boundary immediately after the occurrence of the cue.
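The break actions listed above can be sketched as a function that maps a cue occurrence to boundary offsets. This is a hedged illustration, not the system's Java code: the function names and the English stand-in sentences are hypothetical, the "assigns an action value to the next cue" part of Nothing is not modelled, and the behaviour of Normal_then_to when no preposition follows is an assumption (only the first boundary is kept).

```python
COMMAS = (",", "،")  # Latin and Arabic commas


def unit_boundaries(sentence, cue, action, to_word="إلى"):
    """Return the character offsets where unit boundaries are inserted."""
    start = sentence.find(cue)
    if start == -1 or action == "Nothing":
        return []
    end = start + len(cue)
    if action == "Normal":
        return [start]
    if action == "End":
        return [end]
    if action == "Normal_then_comma":
        hits = [p for p in (sentence.find(c, end) for c in COMMAS) if p != -1]
        # Fall back to the end of the sentence when no comma follows the cue.
        return [start, min(hits) + 1 if hits else len(sentence)]
    if action == "Normal_then_to":
        pos = sentence.find(to_word, end)
        # Assumption: keep only the first boundary if the preposition is absent.
        return [start] if pos == -1 else [start, pos + len(to_word)]
    raise ValueError(f"unknown break action: {action}")


def split_units(sentence, offsets):
    """Cut the sentence at the given offsets and drop empty pieces."""
    cuts = [0] + sorted(set(offsets)) + [len(sentence)]
    pieces = (sentence[a:b].strip() for a, b in zip(cuts, cuts[1:]))
    return [p for p in pieces if p]


# English stand-ins for Arabic text and cues (illustrative only).
S1 = "Fish stocks fell because nets were too fine, and jellyfish spread."
S2 = "They left in order to save fuel"

if __name__ == "__main__":
    for action in ("Normal", "Normal_then_comma", "End", "Nothing"):
        print(action, split_units(S1, unit_boundaries(S1, "because", action)))
    print(split_units(S2, unit_boundaries(S2, "in order", "Normal_then_to", to_word="to")))
```

The `to_word` parameter exists only so that the English stand-in can be run; with Arabic input the default "إلى" applies.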

The algorithm presented in Fig. 2 identifies the elementary units of a text and derives the rhetorical relations that relate them, where a rhetorical relation has the form rhet_rel(relation name, left span, right span). In step 7 the algorithm hypothesizes all possible relations signaled by the cue phrase under scrutiny.

Input: A text T.
Output: A list RR of relations that hold among units.
1. RR := null;
2. Determine the set C of all cue phrases in T;
3. Use the position and action properties of each cue phrase in C in order to insert textual boundaries;
4. for each cue ∈ C
5.     rr := null;
6.     while there is a relation that cue can relate
7.         rr := rr ⊕ rhet_rel(name(cue), l(cue), r(cue));
8.     end while
9.     RR := RR ∪ {rr};
10. end for

Fig. 2. Algorithm that extracts relations for a given text.
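A runnable Python sketch of the algorithm in Fig. 2 is given below, under several simplifying assumptions: the cue table uses hypothetical English stand-ins, only the Normal break action is applied in step 3, the left and right spans are taken to be the units immediately before and at the cue, and the ⊕ disjunction of step 7 is represented as a list of alternative hypotheses.

```python
import re
from typing import NamedTuple


class RhetRel(NamedTuple):
    name: str
    left: str
    right: str


# Hypothetical cue table: each cue lists every relation it may signal.
CUES = {
    "in order to": ["Purpose"],
    "since": ["Causal", "Evidence"],
}


def extract_relations(text, cues=CUES):
    """Return RR: one list of alternative relation hypotheses per cue occurrence."""
    # Step 2: determine all cue occurrences in the text, ordered by position.
    found = sorted(
        (match.start(), cue)
        for cue in cues
        for match in re.finditer(r"(?<!\w)" + re.escape(cue) + r"(?!\w)", text)
    )
    # Step 3: insert a boundary before each cue ("Normal" break action only).
    cuts = [0] + [pos for pos, _ in found] + [len(text)]
    units = [text[a:b].strip() for a, b in zip(cuts, cuts[1:])]
    # Steps 4-10: hypothesize every relation that each cue can signal.
    RR = []
    for i, (_, cue) in enumerate(found):
        rr = []                        # step 5
        for name in cues[cue]:         # step 6
            rr.append(RhetRel(name, units[i], units[i + 1]))  # step 7
        RR.append(rr)                  # step 9
    return RR


TEXT = "Fishers removed predators in order to sell them since demand was high"

if __name__ == "__main__":
    for alternatives in extract_relations(TEXT):
        print(alternatives)
```

Keeping the alternatives of one cue together preserves the fact that they are competing hypotheses about the same pair of spans; choosing among them is outside the scope of this sketch.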

4 Evaluation

We developed a system in the Java programming language and performed an experiment similar to the one described in [4]. We selected a number of texts (150-350 words each) from Arabic news websites. The texts were used as found: grammatical and orthographic mistakes were not corrected. We distributed the texts to 15 people from different disciplines, who were asked to read them and to formulate why and how to questions whose answers could be found in the text. They were also asked to answer the questions they formulated. As a result we collected a total of 98 why and how to question-answer pairs. We posed the 98 questions to our system and compared the answers it retrieved with the subject-formulated answers; an answer was considered correct if it matched the answer selected by the subject. The system answered 54 questions correctly (55% of all questions). Table 2 presents the frequency distribution of the rhetorical relations extracted in this work. As Table 2 shows, the Result and Base relations together led to only three correct answers. This is because of the nature of the texts used in our experiment, which are news texts. However, these two types of relations can play a much more important role if other types of texts, such as organizations' reports, are used.

Table 2. Frequency of the relations that led to correct answers.

| Relation | # correct answers | % correct answers |
| --- | --- | --- |
| Interpretation | 12 | 22.2 |
| Result | 1 | 1.9 |
| Explanation | 11 | 20.3 |
| Purpose | 10 | 18.5 |
| Antithesis | 3 | 5.6 |
| Causal | 9 | 16.7 |
| Evidence | 6 | 11.1 |
| Base | 2 | 3.7 |
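The reported figures can be recomputed from the counts in Table 2. The short Python check below is only an arithmetic illustration; the variable names are hypothetical. Recomputing with standard rounding gives 20.4 for Explanation where the table lists 20.3, presumably a truncation or an adjustment so that the column sums to 100.

```python
# Correct answers per relation, copied from Table 2.
correct_by_relation = {
    "Interpretation": 12,
    "Result": 1,
    "Explanation": 11,
    "Purpose": 10,
    "Antithesis": 3,
    "Causal": 9,
    "Evidence": 6,
    "Base": 2,
}
total_questions = 98

# Overall accuracy: correct answers over all questions posed.
total_correct = sum(correct_by_relation.values())
accuracy = 100 * total_correct / total_questions

# Share of the correct answers contributed by each relation.
shares = {
    relation: round(100 * count / total_correct, 1)
    for relation, count in correct_by_relation.items()
}

if __name__ == "__main__":
    print(total_correct, round(accuracy, 1))
    print(shares)
```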

5 Conclusion

Deriving the discourse structure is very important for extracting answers to argument questions. We manually analyzed Arabic texts with the aim of extracting a set of Arabic rhetorical relations, associated with a set of cue phrases, that lead to the correct answers for why and how to questions. As the first Arabic question answering system that attempts to answer "Why" and "How to" questions, our system gave good preliminary results in the evaluation. In future we plan to study different types of texts, which may increase the number of rhetorical relations extracted in this work. Furthermore, we expect that the overall performance of our system will be reduced if longer and more specialized texts, such as scientific and economic documents, are used. Hence, the use of simple cue phrases needs to be expanded to include more complex patterns based on the text structure and the syntax of the Arabic language.

References

1. Kannan, G., Hammoui, A., Al-Shalabi, R., Swalha, M.: A new Question Answering System for Arabic Language. American Journal of Applied Science, pp. 797-805 (2009)
2. Benajiba, Y., Rosso, P., Lyhyaoui, A.: Implementation of the Arabic QA Question Answering System's Computers. In: ICTC (2007)
3. Hammou, B., Abu-salem, H., Lytinen, S., Evens, M.: QARAB: A question answering system to support the Arabic language. In: Workshop on Computational Approaches to Semitic Languages, ACL (2002)
4. Verberne, S., Boves, L., Oostdijk, N.: Discourse-based answering of why-questions. Traitement Automatique des Langues, Special issue on Computational Approaches to Discourse and Document Processing 47(2), pp. 21-41 (2007)
5. Mann, W.C., Thompson, S.: A Rhetorical Structure Theory: Toward a functional theory of text organization (1988)
6. Mathkour, H., Touir, A., Al-Sanea, W.: Parsing Arabic Texts Using Rhetorical Structure Theory. Journal of Computer Science 4(9), 713-720 (2008)
7. Marcu, D.: The Theory and Practice of Discourse Parsing and Summarization. The MIT Press, London (2000)
8. Jattal, M.: Nezam al-Jumlah, pp. 127-140. Aleppo University (1979)
9. Haskour, N.: Al-Sababieh fe Tarkeb Al-Jumlah Al-Arabih. Aleppo University (1990)