Solving Combination of Thai Auxiliary Verb Based on Finite-state Approach Taneth Ruangrajitpakorn Prachya Boonkwan
Peerachet Porkaew Thepchai Supnithi
Human Language Technology Laboratory National Electronics and Computer Technology Center 112 Thailand Science Park, Phahonyothin Road, Klong 1, Klong Luang Pathumthani, 12120, Thailand +66-2-564-6900 Ext.2237, Fax.: +66-2-564-6772 {taneth.ruangrajitpakorn, peerachet.porkaew, prachya.boonkwan, thepchai.supnithi }@nectec.or.th
Abstract
This paper presents a computational model of auxiliary verb orderings in Thai. It targets the pre-processing step of an efficient Thai parser by determining in advance which scrambling patterns are allowable. The ordering schemes, obtained by corpus observation, are modelled with a finite-state transducer in which auxiliary verbs are categorised into groups. This improves the speed of the parser, as it greatly reduces backtracking in derivations.
Keywords: Thai auxiliary verb, Thai syntax, finite-state approach, parsing technique, natural language processing
1 Introduction

Syntactic parsing is a challenging yet indispensable task in natural language processing. It analyses an input sentence into a computation-friendly structure amenable to semantic interpretation. Parsing issues are language-specific; that is, they depend on the grammatical characteristics of the target language. For Thai in particular, parsing is quite a complicated task. Conspicuously analytic and with rigid word order, Thai makes use of particles and auxiliary words to express grammatical information such as tense, number, and aspect. Orderings of auxiliary verbs play a crucial role in determining tense, aspect, and mood. Some of them allow scrambling while the others do not. Consider the following examples.

(1) a. sǒmˑtcʰāːj kāmˑlāŋ tcà làp
       Somchai PROG FUT sleep
    b. sǒmˑtcʰāːj tcà kāmˑlāŋ làp
       Somchai FUT PROG sleep

Both sentences (1a) and (1b) convey the same meaning: ‘Somchai is going to sleep.’

(2) a. mīːˑtcʰāj tɔ̂ŋ kʰə̄ːj tʰāmˑŋāːn nāj tàːŋˑpràˑtʰêːt
       Meechai must PAST work in foreign country
    b. mīːˑtcʰāj kʰə̄ːj tɔ̂ŋ tʰāmˑŋāːn
       Meechai PAST must work

The sentence in (2a) means ‘Meechai must have worked in a foreign country.’ The sentence in (2b) means ‘Meechai had to work in a foreign country.’

(3) a. rút ʔàːt tɔ̂ŋ sàˑdɛ̄ːŋ kʰɔ̄nˑsə̀ːt
       Ruj may must perform concert
    b. *rút tɔ̂ŋ ʔàːt sàˑdɛ̄ːŋ kʰɔ̄nˑsə̀ːt
       Ruj must may perform concert

The sentence in (3a) means ‘Ruj may have to perform a concert tonight.’ The sentence in (3b) is ungrammatical.

From the above examples, there are three classes of scrambling permission in Thai auxiliary verbs. In (1), scrambling kāmˑlāŋ ‘PROG’ and tcà ‘FUT’
is allowed without any change in meaning. In (2), scrambling tɔ̂ŋ ‘must’ and kʰə̄ːj ‘PAST’ is also allowed, but the meaning is changed endocentrically. Finally, in (3), scrambling ʔàːt ‘may’ and tɔ̂ŋ ‘must’ is not allowed, as (3b) shows. These cases should be carefully taken into account for the sake of a high-quality Thai parser.

This paper presents a preliminary study on how to model Thai auxiliary verbs and their orderings in a simple sentence. An analysis of each pattern is provided and formalised into a non-deterministic finite-state transducer. The objective of this research is to develop an intelligent chunker of Thai auxiliary verbs that can distinguish permitted patterns from the others. One of its prospective applications is eventual integration into an efficient and robust Thai parser.

The rest of this paper is organised as follows. Section 2 explains the definition and scope of Thai auxiliary verbs. Section 3 describes the syntactic structure of auxiliary verbs. Section 4 explains the process of the system. Section 5 discusses the scope and limitations of the approach. Finally, Section 6 concludes this paper and lists future work.

Functional Type | Sub-type | Pre/Post | Thai Word [IPA] | English Equivalence
aspect | progressive | pre | กำลัง /kāmˑlāŋ/, กำลังจะ /kāmˑlāŋˑtcà/ | continuous aspect
aspect | perfect | pre | ได้ /dâːj/ | perfect tense
direction | upward | post | ขึ้น /kʰɯ̂n/ | x
direction | inward | post | เข้า /kʰâo/, มา /māː/ | x
direction | outward | post | ไป /pāj/, ออก /ʔɔ̀ːk/ | x
direction | downward | post | ลง /lōŋ/ | x
modal | uncertainty | pre | คง /kʰōŋ/, คงจะ /kʰōŋˑtcà/ | would
modal | recommendation | pre | ควร /kʰuān/, ควรจะ /kʰuānˑtcà/, น่า /nâː/, น่าจะ /nâːˑtcà/, พึง /pʰɯ̄ŋ/, พึงจะ /pʰɯ̄ŋˑtcà/ | should
modal | obligatory | pre | ต้อง /tɔ̂ŋ/, จะต้อง /tcàˑtɔ̂ŋ/, ต้องจะ /tɔ̂ŋˑtcà/, จำต้อง /tcāmˑtɔ̂ŋ/, จำเป็นต้อง /tcāmˑpēnˑtɔ̂ŋ/, จำเป็นจะต้อง /tcāmˑpēnˑtcàˑtɔ̂ŋ/ | must, have to
modal | option | pre | อาจ /ʔàːt/, อาจจะ /ʔàːtˑtcà/ | may
modal | possibility | pre | สามารถ /sǎːˑmâːt/ | can
modal | possibility | post | ได้ /dâːj/, เป็น /pēn/ | can
tense | past | pre | เคย /kʰə̄ːj/ | past tense
tense | future | pre | จะ /tcà/, จัก /tcàk/ | future tense
tense | present | pre | ย่อม /jɔ̂m/ | present tense
tense | past | post | แล้ว /lɛ́ːʋ/ | past tense
tense | present | post | อยู่ /jùː/ | present tense
mood | imperative | pre | จง /tcōŋ/ | imperative form
mood | imperative | post | ไว้ /ʋáj/, เสีย /siǎ/ | imperative form

Table 1. List of Thai auxiliary verbs with their description and their position
2 Thai Auxiliary Verb

Since Thai is an analytic language, the meaning of a Thai sentence depends on word order and on additional function words. Auxiliary verbs are among the most frequently used function words in Thai. In linguistics, an auxiliary verb is defined as a verb that gives further semantic or syntactic information about the main verb it accompanies in a clause and helps to make distinctions in modality, mood, voice, aspect, and tense (Diller, 1988). However, three points need to be considered for Thai.

Firstly, a Thai auxiliary verb can modify the main verb from either a preceding or a succeeding position (Noss, 1964). In technical terms, an auxiliary verb placed before the main verb is called a pre-auxiliary verb, whereas one placed after the main verb is called a post-auxiliary verb (Sornlertlamvanich, 1998).

Secondly, based on the definition above, there is another type of Thai word that can combine with other auxiliary verbs and modify a main verb. This kind of word expresses direction. For instance, the word “ขึ้น” adds an upward direction to a verb and is located after the verb. As this type of word can collocate with auxiliary verbs and has a similar function, we treat it as an auxiliary verb in this paper. In total, 37 Thai auxiliary verbs are covered in this paper. The auxiliary verbs, their functions, and their English equivalents are presented in Table 1.

Lastly, certain Thai auxiliary verbs can combine with other auxiliary verbs to add extra meanings or functions to the main verb. In rare cases, more than five auxiliary verbs can grammatically accompany the main verb. A simple use of auxiliary verbs is shown in Figure 1a. A complex combination of auxiliary verbs is illustrated in Figure 1b.

Word | IPA | POS
เขา | /kʰāoː/ | noun
กำลัง | /kāmˑlāŋ/ | pre-aux
นอน | /nɔ̂nː/ | verb
อยู่ | /jùː/ | post-aux
Literal translation: he is sleeping

Figure 1a. An example of auxiliary verb usage

Word | IPA | POS
เขา | /kʰāoː/ | noun
น่าจะ | /nâːˑtcà/ | pre-aux
ต้อง | /tɔ̂ŋ/ | pre-aux
สามารถ | /sǎːˑmâːt/ | pre-aux
จดจำ | /tcòtˑtcām/ | verb
ตัวอักษร | /tuāˑʔàkˑsɔ̌ːn/ | noun
ได้ | /dâːj/ | post-aux
แล้ว | /lɛ̂ːʋ/ | post-aux
Literal translation: he should have to be able to remember an alphabet.

Figure 1b. An example of complex auxiliary verb usage

As the complex construction in Figure 1b shows, such sequences are a major issue for syntactic NLP, because the construction is ambiguous and makes it difficult for a parser to build the correct tree. At present, no Thai parser handles this issue.
3 Finite-State of Thai Auxiliary Verb

From observation, fixing the position and ordering of Thai auxiliary verbs is a key factor in increasing parsing accuracy. Therefore, we construct finite-state automata (Moore, 1956) that can analyse any given combination of Thai auxiliary verbs to help the parser recognise its structure.

Figure 2. Thai pre-auxiliary verb’s structure

3.1 System Overview
This system is the part of a parser that manages the auxiliary verbs in a sentence. All input sentences need to be word-segmented using JWordSeg (http://www.suparsit.com/nlp-tools). All words in the input sentence are then required to be POS-annotated using an existing algorithm (Sornlertlamvanich, 1997). The process of analysing auxiliary verbs is explained below.

Firstly, the system searches for verb(s) and auxiliary verb(s) in the input. Then, the system classifies each auxiliary verb found as either a pre-auxiliary verb or a post-auxiliary verb by matching it against the pre-auxiliary and post-auxiliary word lists given in Table 2 and Table 3. However, the word “ได้” is ambiguous: it can be either a pre-auxiliary verb or a post-auxiliary verb, so it has to be checked before it is assigned to one of the two sets. If it is located before the entire verb cluster, it is a pre-auxiliary verb. If it follows an emotionally expective verb¹ and is not followed by another verb, it is also treated as a pre-auxiliary verb. Otherwise, the word “ได้” is handled as a post-auxiliary verb.

The system takes the pre-auxiliary verb set and the post-auxiliary verb set as input, in that order. A finite-state automaton returns success when no word is left in the input set once it reaches an end state. Otherwise, it returns failure, which indicates non-parsable auxiliary verb usage in the source sentence.

3.2 Finite-State of a Pre-auxiliary Verb Set

According to the definition in Section 2, there are 26 pre-auxiliary verbs in total. The fixed pattern for grammatical auxiliary verb usage is arranged sequentially by functional type. Pre-auxiliary verbs are categorised by functional type in Table 2.

Functional Type | No. | Word/IPA
probability expression | 1 | อาจ /ʔàːt/, อาจจะ /ʔàːtˑtcà/
recommendation expression | 2 | ควร /kʰuān/, ควรจะ /kʰuānˑtcà/, น่า /nâː/, น่าจะ /nâːˑtcà/, พึง /pʰɯ̄ŋ/, พึงจะ /pʰɯ̄ŋˑtcà/
uncertainty expression | 3 | คง /kʰōŋ/, คงจะ /kʰōŋˑtcà/
time expression#I² | 4 | กำลัง /kāmˑlāŋ/, จะ /tcà/, กำลังจะ /kāmˑlāŋˑtcà/, จัก /tcàk/, ย่อมจะ /jɔ̂mˑtcà/, ย่อม /jɔ̂m/
obliged expression | 5 | ต้อง /tɔ̂ŋ/, จะต้อง /tcàˑtɔ̂ŋ/, ต้องจะ /tɔ̂ŋˑtcà/, จำเป็นต้อง /tcāmˑpēnˑtɔ̂ŋ/, จำต้อง /tcāmˑtɔ̂ŋ/, จำเป็นจะต้อง /tcāmˑpēnˑtcàˑtɔ̂ŋ/
time expression#II² | 6 | เคย /kʰə̄ːj/
possibility expression | 7 | สามารถ /sǎːˑmâːt/
time expression#III² | 8 | ได้ /dâːj/
imperative expression | 9 | จง /tcōŋ/

Table 2. Category of Thai pre-auxiliary verbs with their functional type

The number in the middle column labels the corresponding link in the finite-state automaton in Figure 2. All of the pre-auxiliary verbs are sent to the start state. The system then moves through the sequence of auxiliary verbs one word at a time. If the leftmost word of the input matches the criterion of a link, the system follows that link to the next state and removes the matched word from the input set. If the word does not match the criterion of the next link, the system can take the empty transition (written ‘e’ in the automaton) to move to the following state.

¹ A verb that expresses an emotional wish that can possibly happen, for example “หวัง” (hope) and “ปรารถนา” (wish).
² Time expression#I, time expression#II, and time expression#III are split because of their different usage. The detail is shown in Figure 2.
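The matching procedure of Section 3.2 can be sketched as a small acceptor. This is only a minimal illustration under stated assumptions, not the automaton of Figure 2: the figure's exact topology is not recoverable from the text, so the sketch assumes a plain chain in which each link may be matched at most once or skipped by an empty transition, with the first link accepting categories 1, 2, or 3. Placing category 9 last is a guess. The names `PRE_CATEGORY`, `SLOTS`, `accepts_categories`, and `accepts_pre_aux` are hypothetical, and words are keyed in standard Thai orthography.

```python
# Category numbers follow the pre-auxiliary table in this section.
PRE_CATEGORY = {
    "อาจ": 1, "อาจจะ": 1,
    "ควร": 2, "ควรจะ": 2, "น่า": 2, "น่าจะ": 2, "พึง": 2, "พึงจะ": 2,
    "คง": 3, "คงจะ": 3,
    "กำลัง": 4, "จะ": 4, "กำลังจะ": 4, "จัก": 4, "ย่อมจะ": 4, "ย่อม": 4,
    "ต้อง": 5, "จะต้อง": 5, "ต้องจะ": 5,
    "จำเป็นต้อง": 5, "จำต้อง": 5, "จำเป็นจะต้อง": 5,
    "เคย": 6, "สามารถ": 7, "ได้": 8, "จง": 9,
}

# ASSUMPTION: a linear chain of links, each optional and usable once.
# The real Figure 2 has further states that this sketch does not model.
SLOTS = [{1, 2, 3}, {4}, {5}, {6}, {7}, {8}, {9}]


def accepts_categories(categories):
    """Return True if the category sequence is consumed by the chain."""
    remaining = list(categories)
    for allowed in SLOTS:
        if remaining and remaining[0] in allowed:
            # The leftmost word matches this link: consume it.
            remaining.pop(0)
        # Otherwise the empty transition 'e' moves on to the next state.
    # Success only if no word is left once the end state is reached.
    return not remaining


def accepts_pre_aux(words):
    """Look up each word's category, then run the chain."""
    return accepts_categories([PRE_CATEGORY[w] for w in words])


# Pre-auxiliary sets of the two walkthrough sentences in Section 4.
GOOD = ["น่าจะ", "ต้อง", "สามารถ"]
BAD = ["สามารถ", "ต้อง", "น่าจะ"]
```

Under these assumptions the sketch accepts `GOOD` and rejects `BAD`. Because it is a strict chain, it also rejects scrambled but grammatical orders such as example (2b) in Section 1, which the full automaton would need additional states to cover.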
3.3 Finite-State of a Post-auxiliary Verb Set

In total, there are 11 post-auxiliary verbs according to the description in Section 2. They are all categorised in Table 3.

Functional Type | No. | Word/IPA
directional expression#I³ | 1 | เข้า /kʰâo/, ออก /ʔɔ̀ːk/, ขึ้น /kʰɯ̂n/, ลง /lōŋ/
directional expression#II³ | 2 | ไป /pāj/, มา /māː/
possibility expression | 3 | ได้ /dâːj/, เป็น /pēn/
time expression | 4 | แล้ว /lɛ̂ːʋ/, อยู่ /jùː/
imperative expression#I⁴ | 5 | เสีย /siǎ/
imperative expression#II⁴ | 6 | ไว้ /ʋáj/

Table 3. Category of Thai post-auxiliary verbs with their functional type

The centre column contains the number used in the finite-state automaton in Figure 3. The process in Figure 3 follows the finite-state approach described in Section 3.2.

Figure 3. Thai post-auxiliary verb’s structure

³ Directional expression#I and directional expression#II are separated because a word in directional expression#I can be followed by a word in directional expression#II, but directional expression#II cannot be placed before directional expression#I.
⁴ Although imperative expression#I and #II express the same mood, they have different usage restrictions. Imperative expression#II can follow a directional expression, but imperative expression#I cannot.
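The post-auxiliary constraints can be illustrated with an explicit transition table and a standard simulation of a non-deterministic automaton with empty transitions. Only the order of links 1 to 4 and the two footnoted restrictions come from the text. Where links 5 and 6 attach is an assumption of this sketch (link 5 leaves the start state only, link 6 leaves the state reached after the directional links), as are the state names and all identifiers such as `TRANSITIONS` and `accepts_post`.

```python
EPS = None  # label of the empty transition 'e'
START, FINAL = "q0", {"q4"}

# (state, category number) -> set of next states.
# ASSUMPTION: hypothetical topology, not a copy of Figure 3.
TRANSITIONS = {
    ("q0", 1): {"q1"}, ("q0", EPS): {"q1"},   # directional expression#I
    ("q1", 2): {"q2"}, ("q1", EPS): {"q2"},   # directional expression#II
    ("q2", 3): {"q3"}, ("q2", EPS): {"q3"},   # possibility expression
    ("q3", 4): {"q4"}, ("q3", EPS): {"q4"},   # time expression
    ("q0", 5): {"q4"},  # imperative#I: only when no directional was read
    ("q2", 6): {"q4"},  # imperative#II: may follow a directional
}


def eps_closure(states):
    """All states reachable from `states` through empty transitions."""
    seen, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for nxt in TRANSITIONS.get((state, EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def accepts_post(categories):
    """Return True if the category sequence can end in a final state."""
    current = eps_closure({START})
    for cat in categories:
        moved = set()
        for state in current:
            moved |= TRANSITIONS.get((state, cat), set())
        current = eps_closure(moved)
        if not current:
            return False  # no link accepts this word
    return bool(current & FINAL)
```

For instance, `accepts_post([1, 2])` holds while `accepts_post([2, 1])` does not, which mirrors footnote 3, and `accepts_post([1, 6])` holds while `accepts_post([1, 5])` does not, which mirrors footnote 4.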
4 Process of the system

Let us illustrate the process of the finite-state automata. Let the test input sentence be “เขาน่าจะต้องสามารถจดจำตัวอักษรได้แล้ว”. The process is described step by step.

Step 1: The input is word-segmented into eight words: “เขา|น่าจะ|ต้อง|สามารถ|จดจำ|ตัวอักษร|ได้|แล้ว”.

Step 2: Each word is annotated with its POS. The result is “เขา@noun|น่าจะ@aux|ต้อง@aux|สามารถ@aux|จดจำ@verb|ตัวอักษร@noun|ได้@aux|แล้ว@aux”.

Step 3: The system examines each word to identify the auxiliary verbs in the input. It then separates them by type by matching their surface forms against the data in Table 2 and Table 3. It takes “น่าจะ”, “ต้อง”, and “สามารถ” as the pre-auxiliary verb set because these words match the word list in Table 2. Since “แล้ว” matches the word list in Table 3, it is assigned to the post-auxiliary verb set. However, “ได้” can be either a pre-auxiliary or a post-auxiliary verb, so the system has to check its location and collocation. Since it is not located before the verb and does not follow an emotionally expective verb, it is handled as a post-auxiliary verb.

Step 4: The pre-auxiliary verb set, which contains น่าจะ|ต้อง|สามารถ, is first taken as the input of the pre-auxiliary verb automaton.

Step 4.1: According to Figure 2, the input begins at the start state q0. The system begins with the leftmost word, “น่าจะ”. The first link requires a probability expression (condition 1), a recommendation expression (condition 2), or an uncertainty expression (condition 3), and this word matches because it is a recommendation expression. Therefore, the matched word “น่าจะ” is removed from the input set and the system moves to state q1.

Step 4.2: The input set now contains two words, “ต้อง” and “สามารถ”. The system focuses on the next word, “ต้อง”. Since this word does not match the condition of link 4, the system transits through an empty link (e) and reaches the next state (q3). At state q3, the word matches the condition of link 5, which accepts the obliged expression “ต้อง”, and the system moves to state q4. The matched word is then removed from the input set.

Step 4.3: The input set now contains one word, “สามารถ”. At state q4, the system takes the empty link (e) because the word does not satisfy the restriction of link 6, and it moves to state q6. Then the word matches the restriction of link 7 because it is a possibility expression pre-auxiliary verb, and the system transits to state q7. No word is left in the input set, so the system takes the empty link (e) and reaches the final state q8. Since the input reaches the end state without any word remaining, the sequence is grammatical and parsable.

Step 5: After the pre-auxiliary verb automaton finishes, the focus moves to the two remaining post-auxiliary verbs. They enter the post-auxiliary verb automaton and begin at the start state q0 of Figure 3.

Step 5.1: The leftmost word is “ได้”. It first takes empty links (e) to state q2 because it does not match the restrictions of links 1 and 2. However, it satisfies the condition of link 3 and passes to state q3 since it is a possibility expression.

Step 5.2: At state q3, the remaining word “แล้ว” meets the restriction of the next criterion (link 4) and reaches state q4, which is an end state.

Step 6: After running through both automata, no input word is left. Thus the example sentence contains well-formed auxiliary verb usage and can be parsed successfully.
The example shows the potential of the finite-state automata to parse combinations of multiple auxiliary verbs. The next example, however, shows a non-parsable input sentence: “*เขาสามารถต้องน่าจะจดจำตัวอักษรแล้วไว้”.

Step 1: The input is word-segmented into eight words: “เขา|สามารถ|ต้อง|น่าจะ|จดจำ|ตัวอักษร|แล้ว|ไว้”.

Step 2: Each word is tagged with its POS. The result is “เขา@noun|สามารถ@aux|ต้อง@aux|น่าจะ@aux|จดจำ@verb|ตัวอักษร@noun|แล้ว@aux|ไว้@aux”.

Step 3: The system examines each word to identify the auxiliary verbs in the input. It then separates them by type by matching their surface forms against the data in Table 2 and Table 3. It takes “สามารถ”, “ต้อง”, and “น่าจะ” as the pre-auxiliary verb set because these words match the word list in Table 2. Since “แล้ว” and “ไว้” match the word list in Table 3, they are assigned to the post-auxiliary verb set.

Step 4: The pre-auxiliary verb set, which contains สามารถ|ต้อง|น่าจะ, is first taken as the input of the pre-auxiliary verb automaton.

Step 4.1: The input begins at the start state q0 of Figure 2, with the leftmost word “สามารถ”. At state q0, this word does not match the condition of link 1, so it passes through the empty link (e) to the next state q1. The conditions of the following links do not accept the word either, so empty links (e) are taken repeatedly until state q6 is reached. At state q6, the word is finally accepted by the condition of link 7 because it is a possibility expression, and the system moves to state q7. After matching, the word “สามารถ” is removed and the next word is checked.

Step 4.2: At state q7, the word “ต้อง” is checked against the following link (link 8), but it does not match the condition of link 8. Therefore, the system moves to the final state q8 by taking the empty transition (e). However, the input set is not empty at the end state, so the system returns failure, and the remaining words in the pre-auxiliary and post-auxiliary verb sets need not be checked.

The given pre-auxiliary verb set fails to satisfy the automaton, so the sentence contains ungrammatical auxiliary verb usage. These processes show the potential of the finite-state approach to decide whether the auxiliary verbs of a simple sentence can be parsed. In the next section, the limitations of the automata are discussed in detail.
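Step 3 of both walkthroughs, together with the rule for “ได้” in Section 3.1, can be sketched as a function that splits the POS-tagged words into a pre-auxiliary set and a post-auxiliary set. This is a hedged sketch: the tag names `aux` and `verb` follow Step 2, but `split_aux`, the word sets, and the reading of "follows" as "immediately follows" and of "followed by another verb" as "any later verb in the sentence" are assumptions made for illustration. Words are keyed in standard Thai orthography.

```python
AMBIGUOUS = "ได้"
# Footnote examples of emotionally expective verbs.
EXPECTIVE_VERBS = {"หวัง", "ปรารถนา"}
# Word lists from the pre- and post-auxiliary tables (without the ambiguous word).
PRE_WORDS = {
    "อาจ", "อาจจะ", "ควร", "ควรจะ", "น่า", "น่าจะ", "พึง", "พึงจะ",
    "คง", "คงจะ", "กำลัง", "จะ", "กำลังจะ", "จัก", "ย่อมจะ", "ย่อม",
    "ต้อง", "จะต้อง", "ต้องจะ", "จำเป็นต้อง", "จำต้อง", "จำเป็นจะต้อง",
    "เคย", "สามารถ", "จง",
}
POST_WORDS = {"เข้า", "ออก", "ขึ้น", "ลง", "ไป", "มา", "เป็น",
              "แล้ว", "อยู่", "เสีย", "ไว้"}


def split_aux(tagged):
    """Split (word, pos) pairs into (pre_aux_words, post_aux_words)."""
    verb_positions = [i for i, (_, pos) in enumerate(tagged) if pos == "verb"]
    pre, post = [], []
    for i, (word, pos) in enumerate(tagged):
        if pos != "aux":
            continue
        if word == AMBIGUOUS:
            # Condition 1: located before the entire verb cluster.
            before_verbs = bool(verb_positions) and i < verb_positions[0]
            # Condition 2: right after an expective verb, no verb later on.
            after_expective = i > 0 and tagged[i - 1][0] in EXPECTIVE_VERBS
            verb_follows = any(j > i for j in verb_positions)
            if before_verbs or (after_expective and not verb_follows):
                pre.append(word)
            else:
                post.append(word)
        elif word in PRE_WORDS:
            pre.append(word)
        elif word in POST_WORDS:
            post.append(word)
        else:
            raise ValueError(f"unknown auxiliary verb: {word}")
    return pre, post


# Tagged form of the first walkthrough sentence (Step 2).
TAGGED = [("เขา", "noun"), ("น่าจะ", "aux"), ("ต้อง", "aux"),
          ("สามารถ", "aux"), ("จดจำ", "verb"), ("ตัวอักษร", "noun"),
          ("ได้", "aux"), ("แล้ว", "aux")]
```

Calling `split_aux(TAGGED)` places the three words before the verb in the pre-auxiliary set and the two words after the noun in the post-auxiliary set, as in Step 3 of the first walkthrough.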
5 Discussion

The given finite-state automata capture Thai auxiliary verb patterns for computational use. In this section, we examine their scope and limitations.

One limitation is the negator “ไม่”, which can occur together with auxiliary verbs. Its position can vary within sequences of both pre-auxiliary and post-auxiliary verbs. For pre-auxiliary verbs, there are three types of negator usage. The first group of pre-auxiliary verbs only allows a negator before it. The second group only allows a negator after it. The last group allows a negator both before and after it. When pre-auxiliary verbs combine in a sentence, the negator cannot be inserted into several of the slots between them because of these restrictions. For instance, the pre-auxiliary verbs “จะ” and “กำลัง” both allow a following negator, giving “จะ|ไม่” and “กำลัง|ไม่” respectively. However, when they combine as “กำลัง|จะ”, the negator cannot be inserted between the two words. The only possible form is “กำลัง|จะ|ไม่”, since a negator cannot be inserted before the word “จะ”. The issue becomes much more complex once several pre-auxiliary verbs scramble.

For post-auxiliary verbs, a negator can be inserted before only a few of them. Otherwise, the negator is placed before the verb to negate the meaning of the whole verb phrase. Moreover, more than one negator can combine with auxiliary verbs in a sentence to form a double negation. There is no research or method that comprehensively explains this process and its meaning. The next section concludes this paper and lists future work.
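The interaction between neighbouring pre-auxiliary verbs and the negator can be illustrated with a local adjacency check. This is a sketch of one possible reading, not a method proposed in the paper: it assumes that a negator is acceptable only if the auxiliary on its left allows a following negator and the auxiliary on its right allows a preceding one. The table `NEG_POSITIONS` holds just the two words whose behaviour is stated above, and all identifiers are hypothetical.

```python
NEGATOR = "ไม่"
PROG = "กำลัง"
FUT = "จะ"

# Where each pre-auxiliary verb tolerates a negator ("before", "after").
# Only the two words described in the text are listed.
NEG_POSITIONS = {
    PROG: {"after"},
    FUT: {"after"},
}


def negators_allowed(sequence):
    """Check every negator against its immediate auxiliary neighbours."""
    for i, token in enumerate(sequence):
        if token != NEGATOR:
            continue
        left = sequence[i - 1] if i > 0 else None
        right = sequence[i + 1] if i + 1 < len(sequence) else None
        # The auxiliary on the left must allow a negator after it.
        if left in NEG_POSITIONS and "after" not in NEG_POSITIONS[left]:
            return False
        # The auxiliary on the right must allow a negator before it.
        if right in NEG_POSITIONS and "before" not in NEG_POSITIONS[right]:
            return False
    return True
```

With this table, `negators_allowed([PROG, FUT, NEGATOR])` holds and `negators_allowed([PROG, NEGATOR, FUT])` does not, matching the example in the text. Scrambled sequences and double negation are outside what this local check can express.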
6 Conclusion and Future Work

Thai auxiliary verb patterns are an issue for syntactic parsing. Since each auxiliary verb has its own grammatical function, their order matters in NLP applications. In conclusion, the given automata can resolve combinations of Thai auxiliary verbs for computational processing. However, the automata have technical restrictions, which motivate the following future work.

1. An experiment will be done to evaluate the quality of syntactic analysis.
2. Thai auxiliary verb patterns become much more complicated with the appearance of a negator, since some auxiliary verbs only allow a succeeding or a preceding negator while others allow both positions.
3. Once the automata can decide syntactic parsability, the next topic of interest is semantic function. Orderings of auxiliary verbs play a crucial role in determining tense, aspect, and mood. Some of them allow scrambling while the others do not. Examples are shown in Section 1.

References

Ciravegna, F., and Lavelli, A., 1999, Parsing using cascades of rules: an Information Extraction perspective, In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL99), June 1999, Bergen, Norway.
Diller, A. V. N., 1988, Thai Syntax and "National Grammar", Pergamon Press, Oxford.
Moore, E. F., 1956, "Gedanken-experiments on Sequential Machines", Automata Studies, Annals of Mathematics Studies (34), Princeton, N.J.: Princeton University Press: 129–153.
Noss, R. B., 1964, Thai Reference Grammar, U. S. Government Printing Office, Washington, DC.
Sornlertlamvanich, V., Charoenporn, T., and Isahara, H., 1997, ORCHID: Thai Part-of-Speech Tagged Corpus, National Electronics and Computer Technology Center Technical Report, pp. 5-19.
Sornlertlamvanich, V., Takahashi, N., and Isahara, H., 1998, Thai Part-Of-Speech tagged corpus: ORCHID, In Proceedings of the Oriental COCOSDA Workshop: 131-138.
JWordSeg, word-segmentation toolkit, 2007. Available from: http://www.suparsit.com/nlp-tools