Solving Combination of Thai Auxiliary Verb Based on Finite-state Approach

Taneth Ruangrajitpakorn, Prachya Boonkwan, Peerachet Porkaew, Thepchai Supnithi

Human Language Technology Laboratory
National Electronics and Computer Technology Center
112 Thailand Science Park, Phahonyothin Road, Klong 1, Klong Luang, Pathumthani, 12120, Thailand
+66-2-564-6900 Ext. 2237, Fax: +66-2-564-6772
{taneth.ruangrajitpakorn, peerachet.porkaew, prachya.boonkwan, thepchai.supnithi}@nectec.or.th

Abstract

This paper presents a computational model for auxiliary verb orderings in Thai. It targets the pre-processing step of an efficient Thai parser by identifying the allowable scrambling patterns in advance. Ordering schemes, obtained by corpus observation, are modelled by a finite-state transducer in which auxiliary verbs are categorised into groups. This improves the speed of the parser because it greatly reduces the backtracking steps in derivations.

Keywords: Thai auxiliary verb, Thai syntax, finite-state approach, parsing technique, natural language processing

1 Introduction

Syntactic parsing is a challenging, yet indispensable, task in natural language processing. It analyses an input sentence into a computation-friendly structure amenable to semantic interpretation. Issues in parsing are language-specific; that is, they depend on the grammatical characteristics of the target language. For Thai in particular, parsing is quite a complicated task. Conspicuously analytic and with rigid word order, Thai makes use of particles and auxiliary words to express grammatical information such as tense, number, and aspect. Orderings of auxiliary verbs play a crucial role in determining tense, aspect, and mood. Some of them allow scrambling while others do not. Consider the following examples.

(1) a. sǒmˑtcʰāːj kāmˑlāŋ tcà làp
       Somchai PROG FUT sleep
    b. sǒmˑtcʰāːj tcà kāmˑlāŋ làp
       Somchai FUT PROG sleep

Both sentences (1a) and (1b) convey the same meaning, ‘Somchai is going to sleep.’

(2) a. mīːˑtcʰāj tɔ̂ŋ kʰə̄ːj tʰāmˑŋāːn nāj tàːŋˑpràˑtʰêːt
       Meechai must PAST work in foreign country
    b. mīːˑtcʰāj kʰə̄ːj tɔ̂ŋ tʰāmˑŋāːn
       Meechai PAST must work

The sentence in (2a) means ‘Meechai must have worked in a foreign country.’ The sentence in (2b) means ‘Meechai had to work in a foreign country.’

(3) a. rút ʔàːt tɔ̂ŋ sàˑdɛ̄ːŋ kʰɔ̄nˑsə̀ːt
       Ruj may must perform concert
    b. *rút tɔ̂ŋ ʔàːt sàˑdɛ̄ːŋ kʰɔ̄nˑsə̀ːt
       Ruj must may perform concert

The sentence in (3a) means ‘Ruj may have to perform a concert tonight.’ The sentence in (3b) is ungrammatical.

From the above examples, there are three classes of scrambling permission among Thai auxiliary verbs. In (1), scrambling kāmˑlāŋ ‘PROG’ and tcà ‘FUT’ is allowed without any change in meaning. In (2), scrambling tɔ̂ŋ ‘must’ and kʰə̄ːj ‘PAST’ is also allowed, but the meaning changes endocentrically. Finally, in (3), scrambling ʔàːt ‘may’ and tɔ̂ŋ ‘must’ is not allowed, as (3b) shows. These cases should be carefully taken into account for the sake of a high-quality Thai parser.

This paper presents a preliminary study on how to model Thai auxiliary verbs and their orderings in a simple sentence. An insightful analysis of each pattern is provided and formalised into a non-deterministic finite-state transducer. The objective of this research is to develop an intelligent chunker of Thai auxiliary verbs that can distinguish permitted patterns from the others. One of its prospective applications is its eventual integration into an efficient and robust Thai parser.

The rest of this paper is organised as follows. Section 2 explains the definition and scope of Thai auxiliary verbs. The syntactic structure of auxiliary verbs is explained in Section 3. Section 4 explains the process of the system. Section 5 presents the discussion. Finally, Section 6 concludes this paper and lists future work.

| Functional Type | Sub-type | Pre/Post | Thai Word [IPA] | English Equivalence |
|---|---|---|---|---|
| aspect | progressive | pre | กำลัง /kāmˑlāŋ/, กำลังจะ /kāmˑlāŋ tcà/ | continuous aspect |
| aspect | perfect | pre | ได้ /dâːj/ | perfect tense |
| direction | upward | post | ขึ้น /khɯ̂n/ | x |
| direction | inward | post | เข้า /khâo/, มา /māː/ | x |
| direction | outward | post | ไป /pāj/, ออก /ʔɔ̀ːk/ | x |
| direction | downward | post | ลง /lōŋ/ | x |
| modal | uncertainty | pre | คง /kʰōŋ/, คงจะ /kʰōŋˑtcà/ | would |
| modal | recommendation | pre | ควร /kʰuān/, ควรจะ /kʰuānˑtcà/, น่า /nâː/, น่าจะ /nâːˑtcà/, พึง /pʰɯ̄ŋ/, พึงจะ /pʰɯ̄ŋˑtcà/ | should |
| modal | obligatory | pre | ต้อง /tɔ̂ŋ/, จะต้อง /tcàˑtɔ̂ŋ/, ต้องจะ /tɔ̂ŋˑcà/, จำต้อง /tcāmˑtɔ̂ŋ/, จำเป็นต้อง /cāmˑpēnˑtɔ̂ŋ/, จำเป็นจะต้อง /tcāmˑpēnˑtcàˑtɔ̂ŋ/ | must, have to |
| modal | option | pre | อาจ /ʔàːt/, อาจจะ /ʔàːtˑtcà/ | may |
| modal | possibility | pre | สามารถ /sǎːˑmâːt/ | can |
| modal | possibility | post | ได้ /dâːj/, เป็น /pēn/ | can |
| tense | past | pre | เคย /kʰɤ̄ːj/ | past tense |
| tense | future | pre | จะ /tcà/, จัก /tcàk/ | future tense |
| tense | present | pre | ย่อม /jɔ̂m/ | present tense |
| tense | past | post | แล้ว /lɛ́ːʋ/ | past tense |
| tense | present | post | อยู่ /jùː/ | present tense |
| mood | imperative | pre | จง /tcōŋ/ | imperative form |
| mood | imperative | post | ไว้ /ʋáj/, เสีย /siǎ/ | imperative form |

Table 1. List of Thai auxiliary verbs with their description and their position

2 Thai Auxiliary Verb

Since Thai is an analytic language, the meaning of a Thai sentence depends on word order and additional function words. An auxiliary verb is one of the most frequently used function words in Thai. In linguistics, an auxiliary verb is defined as a verb that gives further semantic or syntactic information, accompanies the main verb in a clause, and helps to make distinctions in modality, mood, voice, aspect, and tense (Diller, 1988). However, three points need to be considered for Thai.

Firstly, Thai auxiliary verbs can modify a main verb from either a preceding or a succeeding position (Noss, 1964). In technical terms, an auxiliary verb that comes before a main verb is called a pre-auxiliary verb. Conversely, an auxiliary verb that follows a main verb is called a post-auxiliary verb (Sornlertlamvanich, 1998).

Secondly, based on the definition above, we notice that there is another type of Thai word that can combine with other auxiliary verbs and modify a main verb. This kind of word expresses direction. For instance, the word “ขึ้น” adds an upward direction to a verb and is located after the verb. As this type of word can collocate with auxiliary verbs and has a similar function, we treat it as an auxiliary verb in this paper. In total, 37 Thai auxiliary verbs are considered in this paper. Thai auxiliary verbs, their functions, and English equivalents are listed in Table 1.

Lastly, certain Thai auxiliary verbs can combine with other auxiliary verbs to add extra meanings or functions to the main verb. In rare cases, more than five auxiliary verbs can grammatically accompany the main verb. An example of ordinary auxiliary verb usage is shown in Figure 1a. An example of a complex combination of auxiliary verbs is illustrated in Figure 1b.

| Word | IPA | POS |
|---|---|---|
| เขา | /kʰāoː/ | noun |
| กำลัง | /kāmˑlāŋ/ | pre-aux |
| นอน | /nɔ̂nː/ | verb |
| อยู่ | /jùː/ | post-aux |

Literal translation: he is sleeping.

Figure 1a. An example of auxiliary verb usage

| Word | IPA | POS |
|---|---|---|
| เขา | /kʰāoː/ | noun |
| น่าจะ | /nâːˑtcà/ | pre-aux |
| ต้อง | /tɔ̂ŋ/ | pre-aux |
| สามารถ | /sǎːˑmâːt/ | pre-aux |
| จดจำ | /tcòtˑtcām/ | verb |
| ตัวอักษร | /tuāˑʔàkˑsɔ̌ːn/ | noun |
| ได้ | /dâːj/ | post-aux |
| แล้ว | /lɛ̂ːʋ/ | post-aux |

Literal translation: he should have to be able to remember an alphabet.

Figure 1b. An example of complex auxiliary verb usage

As the complex construction in Figure 1b shows, parsing such a derivation is a major issue for syntactic NLP, since the construction is ambiguous and makes it hard for a parser to build the correct tree. At present, there is no Thai parser that handles this issue.

3 Finite-State of Thai Auxiliary Verb

From observation, fixing the position and ordering of Thai auxiliary verbs is a key factor in increasing parsing accuracy. Therefore, we construct a finite-state automaton (Moore, 1956) that can analyse any given combination of Thai auxiliary verbs to help the parser understand its structure.

Figure 2. Thai pre-auxiliary verb’s structure

3.1 System Overview

This system is the part of a parser that manages auxiliary verbs in any sentence. All input sentences need to be word-segmented using JWordSeg (http://www.suparsit.com/nlp-tools). All words in the input sentence are required to be POS-annotated using an existing algorithm (Sornlertlamvanich, 1997). The process of analysing auxiliary verbs is explained below.

Firstly, the system searches for verb(s) and auxiliary verb(s) in the input. Then, the system classifies each auxiliary verb found as either a pre-auxiliary verb or a post-auxiliary verb by matching it against the pre-auxiliary and post-auxiliary word lists given in Table 2 and Table 3. However, the word “ได้” is ambiguous: it can be both a pre-auxiliary verb and a post-auxiliary verb. This word has to be checked before it can be assigned to either class. The first condition is that if it is located before the entire verb cluster, it is a pre-auxiliary verb. If it follows an emotionally expective verb (see footnote 1) and is not followed by another verb, it is also treated as a pre-auxiliary verb. Otherwise, the word “ได้” is handled as a post-auxiliary verb.

The system takes the pre-auxiliary verb set and then the post-auxiliary verb set as input. The finite-state automaton returns success when no word is left in the input set once it reaches an end state. Otherwise, it returns failure, which indicates non-parsable auxiliary verb usage in the source sentence.
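The classification rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `parse_tagged`, `split_auxiliaries`, and `EXPECTIVE_VERBS` are hypothetical, the `word@pos` input format is borrowed from Step 2 of Section 4, and "before the entire verb cluster" is approximated as "before the first verb".

```python
# Hypothetical sketch of the pre/post classification of auxiliary verbs.
AMBIGUOUS = "ได้"

# Word lists taken from the pre- and post-auxiliary tables (without "ได้").
PRE_WORDS = {
    "อาจ", "อาจจะ", "ควร", "ควรจะ", "น่า", "น่าจะ", "พึง", "พึงจะ",
    "คง", "คงจะ", "กำลัง", "จะ", "กำลังจะ", "จัก", "ย่อมจะ", "ย่อม",
    "ต้อง", "จะต้อง", "ต้องจะ", "จำเป็นต้อง", "จำต้อง", "จำเป็นจะต้อง",
    "เคย", "สามารถ", "จง",
}
POST_WORDS = {
    "เข้า", "ออก", "ขึ้น", "ลง", "ไป", "มา", "เป็น", "แล้ว", "อยู่", "เสีย", "ไว้",
}

# Assumed examples of emotionally expective verbs (hope, wish).
EXPECTIVE_VERBS = {"หวัง", "ปรารถนา"}


def parse_tagged(tagged):
    """Parse 'word@pos|word@pos|...' into a list of (word, pos) pairs."""
    return [tuple(item.split("@")) for item in tagged.split("|")]


def split_auxiliaries(tokens):
    """Return (pre_set, post_set) of auxiliary verbs in sentence order."""
    verb_positions = [i for i, (_, pos) in enumerate(tokens) if pos == "verb"]
    # Assumption: the verb cluster starts at the first verb.
    first_verb = verb_positions[0] if verb_positions else len(tokens)
    pre, post = [], []
    for i, (word, pos) in enumerate(tokens):
        if pos != "aux":
            continue
        if word == AMBIGUOUS:
            before_cluster = i < first_verb
            after_expective = i > 0 and tokens[i - 1][0] in EXPECTIVE_VERBS
            followed_by_verb = (
                i + 1 < len(tokens) and tokens[i + 1][1] == "verb"
            )
            if before_cluster or (after_expective and not followed_by_verb):
                pre.append(word)
            else:
                post.append(word)
        elif word in PRE_WORDS:
            pre.append(word)
        elif word in POST_WORDS:
            post.append(word)
    return pre, post
```

Keeping the ambiguity check separate from the plain list lookup mirrors the text: only “ได้” needs positional context, while every other auxiliary verb is decided by its surface form alone.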

3.2 Finite-State of a Pre-auxiliary Verb Set

According to the definition in Section 2, there are 26 pre-auxiliary verbs in total. A fixed pattern for grammatical auxiliary verb usage is arranged sequentially based on their functional type. Pre-auxiliary verbs are categorised by functional type in Table 2.

| Functional Type | No. | Word/IPA |
|---|---|---|
| probability expression | 1 | อาจ /ʔàːt/, อาจจะ /ʔàːtˑtcà/ |
| recommendation expression | 2 | ควร /kʰuān/, ควรจะ /kʰuānˑtcà/, น่า /nâː/, น่าจะ /nâːˑtcà/, พึง /pʰɯ̄ŋ/, พึงจะ /pʰɯ̄ŋˑtcà/ |
| uncertainty expression | 3 | คง /kʰōŋ/, คงจะ /kʰōŋˑtcà/ |
| time expression#I (footnote 2) | 4 | กำลัง /kāmˑlāŋ/, จะ /tcà/, กำลังจะ /kāmˑlāŋˑtcà/, จัก /tcàk/, ย่อมจะ /jɔ̂mˑtcà/, ย่อม /jɔ̂m/ |
| obliged expression | 5 | ต้อง /tɔ̂ŋ/, จะต้อง /tcàˑtɔ̂ŋ/, ต้องจะ /tɔ̂ŋˑtcà/, จำเป็นต้อง /tcāmˑpēnˑtɔ̂ŋ/, จำต้อง /tcāmˑtɔ̂ŋ/, จำเป็นจะต้อง /tcāmˑpēnˑtcàˑtɔ̂ŋ/ |
| time expression#II (footnote 2) | 6 | เคย /kʰə̄ːj/ |
| possibility expression | 7 | สามารถ /sǎːˑmâːt/ |
| time expression#III (footnote 2) | 8 | ได้ /dâːj/ |
| imperative expression | 9 | จง /tcōŋ/ |

Table 2. Category of Thai pre-auxiliary verbs with their functional type

The number in the middle column identifies the corresponding link in the finite-state automaton in Figure 2. All of the pre-auxiliary verbs are sent to the start state. The system then consumes the sequence of auxiliary verbs one by one, moving from state to state. If the leftmost word of the input matches the criterion of a link, the system follows that link to the next state, and the matched word is removed from the input set. If the word does not match the criterion of the next link, the system can take an empty transition (denoted ‘e’ in the automaton) to move to the subsequent state.

Footnote 1: An emotionally expective verb is a verb that expresses an emotional wish that can possibly happen, for example “หวัง” (hope) and “ปรารถนา” (wish).

Footnote 2: Time expression#I, time expression#II, and time expression#III are split because of their different usage. The detail is shown in Figure 2.
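The matching procedure can be illustrated with a small acceptor. Figure 2 is not reproduced in this text, so the sketch below rests on a loud simplifying assumption: the links form a single chain in the order of the category numbers, categories 1 to 3 share the first link (as the walkthrough in Section 4 states), and every link can be skipped by an empty transition. The imperative link (category 9) is left out because its position cannot be recovered from the text. The names `CATEGORY`, `LINKS`, and `accepts_pre` are hypothetical, and only a few words per category are listed.

```python
# Hypothetical, simplified acceptor for a pre-auxiliary verb set.
# Assumption: one linear chain of links; the real automaton in Figure 2
# may contain additional paths for permitted scrambling.
CATEGORY = {
    "อาจ": 1, "อาจจะ": 1,
    "ควร": 2, "น่าจะ": 2,
    "คง": 3, "คงจะ": 3,
    "กำลัง": 4, "จะ": 4,
    "ต้อง": 5, "จะต้อง": 5,
    "เคย": 6,
    "สามารถ": 7,
    "ได้": 8,
}

# Categories accepted by each link, in the assumed chain order.
LINKS = [{1, 2, 3}, {4}, {5}, {6}, {7}, {8}]


def accepts_pre(words):
    """Return True if every word is consumed by the time the end state is reached."""
    remaining = list(words)
    for accepted in LINKS:
        if remaining and CATEGORY.get(remaining[0]) in accepted:
            remaining.pop(0)  # link matched: consume the leftmost word
        # otherwise the empty transition 'e' leads to the next state
    return not remaining  # success only if no word is left


print(accepts_pre(["น่าจะ", "ต้อง", "สามารถ"]))  # sequence from Figure 1b
print(accepts_pre(["สามารถ", "ต้อง", "น่าจะ"]))  # reversed order
```

Because the chain is strictly ordered, this sketch would also reject the scrambled order of example (2b), which the paper treats as grammatical; capturing such cases requires the extra paths of the full automaton.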

3.3 Finite-State of a Post-auxiliary Verb Set

In total, there are 11 post-auxiliary verbs following the description in Section 2. They are all categorised in Table 3.

| Functional Type | No. | Word/IPA |
|---|---|---|
| directional expression#I (footnote 3) | 1 | เข้า /kʰâo/, ออก /ʔɔ̀ːk/, ขึ้น /kʰɯ̂n/, ลง /lōŋ/ |
| directional expression#II (footnote 3) | 2 | ไป /pāj/, มา /māː/ |
| possibility expression | 3 | ได้ /dâːj/, เป็น /pēn/ |
| time expression | 4 | แล้ว /lɛ̂ːʋ/, อยู่ /jùː/ |
| imperative expression#I (footnote 4) | 5 | เสีย /siǎ/ |
| imperative expression#II (footnote 4) | 6 | ไว้ /ʋáj/ |

Table 3. Category of Thai post-auxiliary verbs with their functional type

The centre column contains the number used in the finite-state automaton in Figure 3. The process in Figure 3 works according to the finite-state approach described in Section 3.2.

Figure 3. Thai post-auxiliary verb’s structure

Footnote 3: Directional expression#I and directional expression#II are separated because a word in directional expression#I can be followed by a word in directional expression#II, but directional expression#II cannot be placed before directional expression#I.

Footnote 4: Though both imperative expression#I and #II express the same mood, they have different usage restrictions. Imperative expression#II can follow a directional expression, but imperative expression#I cannot.
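The two ordering restrictions stated in footnotes 3 and 4 can be checked directly on a post-auxiliary sequence. The sketch below encodes only those two restrictions, not the full automaton of Figure 3. The names `POST_CATEGORY` and `violates_footnotes` are hypothetical, and reading "follow" as "occur anywhere after" is an assumption.

```python
# Category numbers follow the post-auxiliary table:
# 1-2 directional expression#I/#II, 5-6 imperative expression#I/#II.
POST_CATEGORY = {
    "เข้า": 1, "ออก": 1, "ขึ้น": 1, "ลง": 1,
    "ไป": 2, "มา": 2,
    "ได้": 3, "เป็น": 3,
    "แล้ว": 4, "อยู่": 4,
    "เสีย": 5,
    "ไว้": 6,
}


def violates_footnotes(words):
    """Return True if the sequence breaks a restriction from footnote 3 or 4."""
    seen_directional = False  # any directional expression seen so far
    seen_dir2 = False         # a directional expression#II seen so far
    for word in words:
        category = POST_CATEGORY[word]
        if category == 1 and seen_dir2:
            return True   # footnote 3: #II may not precede #I
        if category == 5 and seen_directional:
            return True   # footnote 4: imperative#I may not follow a directional
        if category == 2:
            seen_dir2 = True
        if category in (1, 2):
            seen_directional = True
    return False
```

A pairwise check of this kind is weaker than the automaton, since it says nothing about the other categories, but it makes the asymmetry between the two directional groups and the two imperative groups explicit.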

4 Process of the system

Let us exemplify the process of the finite-state automata. Let the test input sentence be “เขาน่าจะต้องสามารถจดจำตัวอักษรได้แล้ว”. The process is described step by step.

Step 1: The input is word-segmented into eight words: “เขา|น่าจะ|ต้อง|สามารถ|จดจำ|ตัวอักษร|ได้|แล้ว”.

Step 2: Each word is annotated with its POS. The result is “เขา@noun|น่าจะ@aux|ต้อง@aux|สามารถ@aux|จดจำ@verb|ตัวอักษร@noun|ได้@aux|แล้ว@aux”.

Step 3: The system examines each word to identify the auxiliary verbs in the input. Then it separates the types of auxiliary verbs by matching their surface forms against the data in Table 2 and Table 3. It takes “น่าจะ”, “ต้อง”, and “สามารถ” as the pre-auxiliary verb set because these words match the word list in Table 2. Since “แล้ว” matches the word list in Table 3, it is assigned to the post-auxiliary verb set. However, “ได้” can be both a pre-auxiliary and a post-auxiliary verb, so the system has to check its location and collocation. Since it is not located before the verb and does not follow an emotionally expective verb, it is handled as a post-auxiliary verb.

Step 4: The pre-auxiliary verb set, which contains น่าจะ|ต้อง|สามารถ, is first taken as the input for the pre-auxiliary verb automaton.

Step 4.1: According to Figure 2, the input begins at the start state q0. The system begins with the leftmost word, “น่าจะ”. This word is a recommendation expression, so it matches the criterion of the first link, which requires a probability expression (condition 1), a recommendation expression (condition 2), or an uncertainty expression (condition 3). Therefore, the matched word “น่าจะ” is removed from the input set and the system moves to state q1.

Step 4.2: The input set now contains two words, “ต้อง” and “สามารถ”. The system focuses on the next word, “ต้อง”. Since this word does not match the link’s condition 4, the system transits through an empty link (e) and reaches the next state (q3). At state q3, the word matches the condition of link 5, which accepts the obliged expression “ต้อง”, and the system moves to state q4. Then, the matched word is removed from the input set.

Step 4.3: The input set now contains one word, “สามารถ”. At state q4, the system takes the empty link (e) because the word does not satisfy the restriction of link 6, and it moves to state q6. Then, the word matches the restriction of link 7 because it is a possibility expression pre-auxiliary verb, and the system transits to state q7. No word is left in the input set, so the system passes through the empty link (e) and reaches the final state q8. Since the input reaches the end state without any word remaining, this sequence is grammatical and parsable.

Step 5: After the pre-auxiliary verb automaton finishes, the focus moves to the two remaining post-auxiliary verbs. They enter the post-auxiliary verb automaton and begin at the start state q0 in Figure 3.

Step 5.1: The leftmost word is “ได้”, which first takes the empty links (e) to state q2 because it does not match the restrictions of links 1 and 2. However, it satisfies the condition of link 3 and passes to state q3 since it is a possibility expression.

Step 5.2: At state q3, the remaining word “แล้ว” meets the restriction of the next link (link 4) and reaches state q4, which is an end state.

Step 6: After running through both finite-state automata, no input word is left. Thus, the example sentence contains well-formed auxiliary verb usage and can be parsed successfully.

The example shows the potential of the finite-state automata to parse combinations of multiple auxiliary verbs. The next example, however, shows a non-parsable input sentence: “*เขาสามารถต้องน่าจะจดจำตัวอักษรแล้วไว้”.

Step 1: The input is word-segmented into eight words: “เขา|สามารถ|ต้อง|น่าจะ|จดจำ|ตัวอักษร|แล้ว|ไว้”.

Step 2: Each word is tagged with its POS. The result is “เขา@noun|สามารถ@aux|ต้อง@aux|น่าจะ@aux|จดจำ@verb|ตัวอักษร@noun|แล้ว@aux|ไว้@aux”.

Step 3: The system examines each word to identify the auxiliary verbs in the input. Then it separates the types of auxiliary verbs by matching their surface forms against the data in Table 2 and Table 3. It takes “สามารถ”, “ต้อง”, and “น่าจะ” as the pre-auxiliary verb set because these words match the word list in Table 2. Since “แล้ว” and “ไว้” match the word list in Table 3, they are assigned to the post-auxiliary verb set.

Step 4: The pre-auxiliary verb set, which contains สามารถ|ต้อง|น่าจะ, is first taken as the input for the pre-auxiliary verb automaton.

Step 4.1: According to Figure 2, the input begins at the start state q0. The system begins with the leftmost word, “สามารถ”. At state q0, this word does not match the condition of link 1, so it passes through the empty link (e) to the next state q1. The conditions of the following links do not accept the input word either, so empty links (e) are taken repeatedly until the system reaches state q6. At state q6, the word is finally accepted by the condition of link 7 because it is a possibility expression, and the system moves to state q7. After matching, the word “สามารถ” is removed and the next word is checked.

Step 4.2: At state q7, the word “ต้อง” is checked against the following link (link 8). However, it does not match the condition of link 8. Therefore, the system moves to the final state q8 by taking the empty link (e). Words are still left in the input set at the end state, so the system returns failure, and the remaining words in both the pre-auxiliary and the post-auxiliary verb sets need not be checked.

The given pre-auxiliary verb set fails to satisfy the finite-state automaton. It can be concluded that this sentence contains ungrammatical auxiliary verb usage. These processes show the potential of the finite-state automata to indicate whether the auxiliary verbs in a simple sentence can be parsed. In the next section, the limitations of the automata are discussed in detail.

5 Discussion

The given finite-state automata can capture Thai auxiliary verb patterns for computational use. In this section, we examine their scope and limitations.

Unfortunately, the negator marker “ไม่” can occur together with auxiliary verbs, a case the current automata do not yet cover. The position of a negator can vary within the sequence of both pre-auxiliary and post-auxiliary verbs. For pre-auxiliary verbs, there are three types of negator usage. The first group of pre-auxiliary verbs only allows a negator before it. The second group only allows a negator after it. The last group allows a negator both before and after it. When pre-auxiliary verbs combine in a sentence, these restrictions leave several slots between them where a negator cannot be inserted. For instance, the pre-auxiliary verbs “จะ” and “กำลัง” both allow a negator after them, giving “จะ|ไม่” and “กำลัง|ไม่” respectively. However, when they combine as “กำลัง|จะ”, a negator cannot be inserted between the two words. The only possible form is “กำลัง|จะ|ไม่”, since a negator cannot be inserted before the word “จะ”. This issue becomes much more complex once several pre-auxiliary verbs scramble.

For post-auxiliary verbs, a negator can be inserted before a few of them. Otherwise, the negator is put before the verb to negate the whole meaning of the verb phrase. Moreover, more than one negator can combine with auxiliary verbs in a sentence to form a double negation. There is no research or method that comprehensively explains its process and meaning. In the next section, we conclude this paper and list future work.
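The “กำลัง” and “จะ” example discussed in this section can be expressed as a slot check. The sketch assumes one possible formalisation that the paper does not give: a negator between two pre-auxiliary verbs must be licensed by both neighbours, so the word on its left must allow a following negator and the word on its right must allow a preceding one. Only the two words whose behaviour is stated in the text are listed, and the names `NEG_SLOTS` and `negator_ok` are hypothetical.

```python
NEGATOR = "ไม่"

# Positions in which each pre-auxiliary verb tolerates a negator.
# Only the two words discussed in the text are included.
NEG_SLOTS = {
    "กำลัง": {"after"},
    "จะ": {"after"},
}


def negator_ok(sequence):
    """Check every negator in a pre-auxiliary sequence against its neighbours."""
    for i, word in enumerate(sequence):
        if word != NEGATOR:
            continue
        left = sequence[i - 1] if i > 0 else None
        right = sequence[i + 1] if i + 1 < len(sequence) else None
        # Assumed rule: the left neighbour must allow a negator after it.
        if left is not None and "after" not in NEG_SLOTS.get(left, set()):
            return False
        # Assumed rule: the right neighbour must allow a negator before it.
        if right is not None and "before" not in NEG_SLOTS.get(right, set()):
            return False
    return True
```

Under this reading, “กำลัง|จะ|ไม่” passes while “กำลัง|ไม่|จะ” fails, because “จะ” does not license a preceding negator, which matches the restriction described in the text.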

6 Conclusion and Future Work

Thai auxiliary verb patterns are an issue for syntactic parsing. Since each auxiliary verb has a unique grammatical function, their order is essential in NLP applications. In conclusion, the given automata are able to handle the difficulty of analysing Thai auxiliary verb combinations computationally. Unfortunately, the automata have technical restrictions. Directions for future work are listed below.

1. An experiment will be conducted to evaluate the quality of the syntactic analysis.
2. Thai auxiliary verb patterns become much more complicated when a negator appears, since some auxiliary verbs only allow a succeeding or a preceding negator while others allow both positions.
3. Once the automata for syntactic parsing are complete, the next interesting topic is semantic function. Orderings of auxiliary verbs play a crucial role in determining tense, aspect, and mood. Some of them allow scrambling while others do not. Examples are shown in Section 1.

References

Ciravegna, F., and Lavelli, A., 1999, Parsing using cascades of rules: an Information Extraction perspective, In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL99), June 1999, Bergen, Norway.

Diller, A. V. N., 1988, Thai Syntax and "National Grammar", Pergamon Press, Oxford.

Moore, E. F., 1956, "Gedanken-experiments on Sequential Machines", Automata Studies, Annals of Mathematics Studies (34), Princeton, N.J.: Princeton University Press: 129–153.

Noss, R. B., 1964, Thai Reference Grammar, U.S. Government Printing Office, Washington, DC.

Sornlertlamvanich, V., Charoenporn, T., and Isahara, H., 1997, ORCHID: Thai Part-of-Speech Tagged Corpus, National Electronics and Computer Technology Center Technical Report, pp. 5-19.

Sornlertlamvanich, V., Takahashi, N., and Isahara, H., 1998, Thai Part-Of-Speech tagged corpus: ORCHID, In Proceedings of the Oriental COCOSDA Workshop: 131-138.

JWordSeg, word-segmentation toolkit, 2007. Available from: http://www.suparsit.com/nlp-tools