Applications of Finite-State Language Processing: Selected Papers from the 2008 International NooJ Conference

Edited by

Tamas Varadi, Judit Kuti and Max Silberztein

CAMBRIDGE SCHOLARS PUBLISHING

TABLE OF CONTENTS

Preface of the Editors

Max Silberztein: Disambiguation Tools for NooJ

Kimmo Koskenniemi: HFST: Modular Compatibility for Open Source Finite-state Tools

Bozo Bekavac, Zeljko Agic, Marko Tadic: Interacting Croatian NERC System and Intex/NooJ Environment

Sergio Matos, Belinda Maia: NooJ and Corpografo - a New Partnership

Anabela Barreiro: Linguistic Resources and Applications for Portuguese Processing and Machine Translation

Kristina Vuckovic, Nives Mikelic Preradovic, Zdravko Dovedan: Verb Valency Enhanced Croatian Lexicon

Zoltan Alexin: Generating Large and Optimized FSA for Morphological Analysis

Kata Gabor: Creating a Shallow-parsed Hungarian Corpus with NooJ

Cristina Mota: Combining NooJ with Co-Training for NER in Portuguese

Elina Chadjipapa, Eleni Papadopoulou, Zoe Gavriilidou: New data in the Greek NooJ module: Compounds and Proper Nouns

Hela Fehri, Kais Haddar, Abdelmajid Ben Hamadou: Automatic Recognition and Semantic Analysis of Arabic Named Entities

Slim Mesfar: Morphological Grammars for Standard Arabic Tokenisation

Huei-Chi Lin: Constitution of Encoded Texts

Simonetta Vietri: The Formalization of Italian Lexicon-Grammar Tables in a NooJ Pair Dictionary/Grammar

Milos Utvic: The Regular Derivation in Serbian: Principles and Classification Using NooJ

Stasa Vujicic, Dusko Vitas: Odonym Recognition in Serbian

Sandra Gucul-Milojevic, Vanja Radulovic, Cvetana Krstev: A View on the Representation of Women in Serbian Newspaper Texts

Istvan Csury: Problems of Corpus Annotation on the Level of Semantic Constructions in Discourse

Bea Ehmann, Vera Garami: Narrative Psychological Content Analysis with NooJ: Linguistic Markers of Time Experience in Self Reports

Helene Pignot, Odile Piton: Language Processing of 17th Century English with NooJ

List of contributors

Verb Valency Enhanced Croatian Lexicon

Kristina Vučković, Nives Mikelić Preradović, Zdravko Dovedan

Faculty of Humanities and Social Sciences, University of Zagreb
Department of Information Sciences
Ivana Lučića 3, Zagreb, Croatia

CHAPTER ONE INTRODUCTION

In this paper we show how verb valency data, added to the Croatian dictionary in NooJ, enhances the recognition of VP-chunks as well as NP-chunks and PP-chunks in a sentence. The paper is a joint project that grew out of two separate PhD theses written at the Department of Information Sciences of the Faculty of Humanities and Social Sciences in Zagreb. For the first thesis, a chunker for Croatian had to be built using NooJ. At the same time, for the second thesis, a Croatian verb valency lexicon had to be constructed. Our NooJ dictionary has over 36 000 entries, of which 1 884 are verbs. Before this work, each verb was marked only with its category and inflectional paradigm. Additional data has been added from the Croatian Valency Lexicon of Verbs, Version 2.0008 (CROVALLEX), which was built as an attempt to formally describe the valency frames of Croatian verbs. By combining these two projects we obtained better chunker results for Croatian and thus prepared the ground for building a Croatian parser.

In Chapter Two we give the main characteristics of the CROVALLEX system and describe the data we selected from it. In the following chapters we show the conversion of data from one system to the other, as well as the ways in which we enhanced not only our dictionary but also our grammars for recognizing tenses of reflexive verbs. We then describe the new grammars that were enabled by this new data, and conclude with the improved precision, recall and f-measure results and future work.

CHAPTER TWO CROATIAN VALENCY LEXICON OF VERBS

One of the aims of CROVALLEX is to become a module for the automatic processing of Croatian texts, especially for automatic syntactic analysis, provided it is integrated into a system that generates sentence representations using a tectogrammatical parser.

Main Characteristics

CROVALLEX contains 1 739 verbs with 5 118 valency frames, which makes an average of about 3 valency frames per verb. It also contains 173 syntactic-semantic classes (more accurately, 72 classes with two further levels of subdivision). These classes have been derived from VerbNet (a verb lexicon based on Levin’s verb classes, which also provides selectional restrictions attached to semantic roles) and refined and modified specifically for Croatian. The motivation for introducing this semantic classification in CROVALLEX was that it simplifies systematic consistency checking and allows more general observations about the data. The 1 739 verbs were selected from the Croatian frequency dictionary [Moguš99] according to their frequency. The description of valency frames is based on the Functional Generative Description (FGD), which has been developed by the Czech linguist Petr Sgall and his collaborators since the 1960s.

On the topmost level, CROVALLEX is divided into word entries. The content of a word entry corresponds to the traditional notion of a lexeme. Each word entry relates to one or more headword lemmas. The word entry consists of a sequence of frame entries relevant for the lemma, where each frame entry corresponds to one of the lemma’s meanings. Information about the aspect of the lemma is assigned to each word entry as a whole.

The verb lemma is the infinitive form of the verb, which in the case of lexical homonyms and homographs is followed by a Roman numeral in superscript. The reflexive particle “se” is part of the infinitive only if the verb is a derived reflexive (e.g. vratiti se) or reflexiva tantum (e.g. penjati se).

Lexical homonyms are groups of two lemmas which have the same spelling and wordform, but differ considerably in their meanings. They may also differ in their etymology (e.g. hȉtati I - to rush vs. hȉtati II - to throw), aspect (e.g. matírati I inf. - to make something appear beamless vs. matírati II fin. - to defeat), or conjugated forms (izvezem [first person sg.] for izvesti I - embroider vs. izvesti [first person sg.] for izvesti II - export).

Homographs are groups of two lemmas which have the same wordform but a different accent, and also differ considerably in their meanings. They may also differ in their etymology (e.g. ìskapati I - leak out drop by drop vs. iskápati II - excavate), aspect (e.g. ìsplakati I fin. - cry one’s eyes out vs. isplákati II inf. - rinse), or conjugated forms (napadnem [first person sg.] for nàpasti I - attack vs. napasem [first person sg.] for nàpāsti II - graze).

Each word entry consists of a sequence of frame entries, corresponding to the individual meanings of the headword lemma. The primary and most frequent meanings are listed first, whereas rare and idiomatic meanings are listed last. Each frame entry contains a description of the valency frame itself and of the frame attributes.

Selected Data

In CROVALLEX, a valency frame consists of a sequence of frame slots. Each frame slot corresponds to one complementation of the given verb. The following attributes are assigned to each slot: functor, list of possible morphemic forms and type of complementation. Functors (labels of “deep roles”, similar to theta-roles) express the types of relations between verbs and their complementations. Functors are divided into inner participants (actants) and free modifications. At this point, we decided not to use data on functors for our verb valency enhancements. The information we used is that connected with frame slots and their attribute values.

Each frame slot in a sentence can be expressed by a limited set of morphemic means, called morphemic forms. In CROVALLEX, the set of possible forms is defined either explicitly or implicitly. If the form is defined explicitly, it is enumerated in a list attached to the given slot. If the form is defined implicitly, no list is specified, because the set of possible forms is implied by the functor of the respective slot. The list of forms attached to a frame slot may contain values of the following types:

• Pure (prepositionless) case. There are seven morphological cases in Croatian. In the CROVALLEX notation, they have traditional numbering: 0 - hidden nominative, 1 - nominative, 2 - genitive, 3 - dative, 4 - accusative, 5 - vocative, 6 - locative, and 7 - instrumental.
• Prepositional case. The lemma of the preposition and the number of the required morphological case are specified (e.g. od+2, na+4, o+6...).
• Subordinating conjunction. The lemma of the conjunction is specified.
• Infinitive construction. The abbreviation “inf” stands for infinitive verbal complementation and can appear together with a preposition (e.g. “nego+inf”) and with the morphological case (e.g. “inf+4”).
• Construction with adjectives. The abbreviation “adj-number” stands for an adjective complementation in the given case, e.g. adj-7 (“Osjećam se osvježenim” - “I feel refreshed”).
• Construction with adverbs. The abbreviation “adv-adverb_word” stands for an adverb complementation in the specific form, e.g. adv-hrabro (“Osjećam se hrabro” - “I feel brave”).
• Construction with nominative predicate. The abbreviation “nom_pred” stands for a complementation that represents a nominative predicate, e.g. nom_pred (“Historija je postala legendom” - “History has become legend”).

For this first stage of our experiment, we used only the data on prepositionless and prepositional case, and marked each as an obligatory or typical attribute. Obligatory attributes have to be filled in every frame, while typical attributes may be empty.

Regarding the aspect of the verb, there are three kinds of verbs in CROVALLEX: perfective verbs, imperfective verbs and dual aspect verbs. In CROVALLEX, the value of aspect is attached to each word entry as a whole (i.e. it is the same for all its frames). Dual aspect verbs (e.g. “analizirati”, “bombardirati”) can be used in different contexts either as perfective or as imperfective. The last piece of information we took from CROVALLEX is whether or not the verb is marked as reflexive.
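The case numbering and the prepositional notation described in this section can be illustrated with a short Python sketch. It is not part of CROVALLEX or NooJ; the function name `parse_form` and the return format are assumptions made for illustration, and only the two form types used in this stage of the experiment (prepositionless and prepositional case) are handled.

```python
# Case numbering as listed in the CROVALLEX notation described above.
CASES = {
    "0": "hidden nominative",
    "1": "nominative",
    "2": "genitive",
    "3": "dative",
    "4": "accusative",
    "5": "vocative",
    "6": "locative",
    "7": "instrumental",
}


def parse_form(form):
    """Split a morphemic form such as '4' or 'od+2' into (preposition, case).

    Hypothetical helper: returns None for every other form type
    (conjunctions, inf, adj-, adv-, nom_pred), which this sketch ignores.
    """
    prep, sep, number = form.rpartition("+")
    if number not in CASES or prep == "inf":
        return None
    # A prepositionless case has no '+' separator, so no preposition.
    return (prep if sep else None, CASES[number])


print(parse_form("4"))
print(parse_form("od+2"))
print(parse_form("nego+inf"))
```

Forms such as “inf+4” are deliberately rejected, since there the number qualifies an infinitive construction rather than a prepositional phrase.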

CHAPTER THREE CONVERSION OF DATA

CROVALLEX outputs all its data as a single XML file and does not allow us to select only the information we need. Thus, before proceeding to the NooJ dictionary, it was important to prepare the data in a format readable by NooJ dictionaries. This process also included selecting only the information we needed: the aspect of the verb, the reflexive property, and the obligatory/typical prepositional and prepositionless cases. We built a small converter for this purpose, which was also able to compare the data with our existing dictionary list and add the necessary data directly to it. All the verbs that were not in the dictionary at the time were added manually, together with their data from CROVALLEX. The same was done for those verbs that were in our dictionary but not in CROVALLEX. A dictionary entry for a verb has thus changed from

isključiti,V+FLX=UGOJITI

to

isključiti,V+FLX=UGOJITI+Aspect=fin+pov+DCobl=0+DCobl=Acc+PCtyp=G+DCtyp=D

where +pov marks the reflexive property of a verb, +DCobl marks an obligatory prepositionless case, +DCtyp marks a typical prepositionless case, +PCtyp marks a typical prepositional case and +PCobl marks an obligatory prepositional case.
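The shape of the enhanced entry can be reproduced with a minimal Python sketch. This is not the converter described above, whose implementation is not given in the paper; the function `build_entry` and its parameters are hypothetical, and only the feature names (Aspect, pov, DCobl, DCtyp, PCobl, PCtyp) and the example values come from the text.

```python
def build_entry(lemma, paradigm, aspect, reflexive, cases):
    """Assemble a NooJ-style dictionary line from the selected valency data.

    Hypothetical helper. `cases` is an ordered list of (feature, value)
    pairs, where feature is one of DCobl, DCtyp, PCobl, PCtyp.
    """
    features = ["V", f"FLX={paradigm}", f"Aspect={aspect}"]
    if reflexive:
        features.append("pov")  # reflexive property of the verb
    features.extend(f"{name}={value}" for name, value in cases)
    return f"{lemma}," + "+".join(features)


entry = build_entry(
    "isključiti",
    "UGOJITI",
    "fin",
    True,
    [("DCobl", "0"), ("DCobl", "Acc"), ("PCtyp", "G"), ("DCtyp", "D")],
)
print(entry)
# isključiti,V+FLX=UGOJITI+Aspect=fin+pov+DCobl=0+DCobl=Acc+PCtyp=G+DCtyp=D
```

Keeping the case features as an ordered list of pairs, rather than a dictionary, allows the same feature (here DCobl) to appear more than once, as it does in the example entry.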

Adjusting the Grammars

The addition of new verb properties required minor adjustments to our existing grammars for VP recognition. This can best be seen in Figure 2-1, which shows a grammar for recognizing the Perfect Tense of verbs that are not reflexive (V+PDR) and of those that are reflexive (V+PDR+pov). The latter class of verbs requires the reflexive pronoun “se” before the Perfect Tense form of the verb, after it, or between the main verb and the auxiliary verb.

Figure 2-1

All our graphs for verb phrase recognition have been extended with a similar addition (see the dashed box in Figure 2-1) for recognizing tenses of reflexive verbs.

New Grammars

The new verb information has also enabled us to build new grammars that recognize noun phrases and prepositional phrases following the verb phrase. Since each verb is now marked with its obligatory and typical prepositional and prepositionless cases, our grammars can check whether the noun phrase and the prepositional phrase immediately following the verb phrase are the ones required (i.e. obligatory) or possible (i.e. typical) for the main verb of the VP. Since Croatian has a very flexible word order, the order of NP-chunks and PP-chunks after the VP-chunk is not fixed, so we had to include both versions (see Figure 2-2).

Figure 2-2

Similar grammars have also been constructed for verbs whose prepositionless case marking is accusative, locative, dative or another case. By applying these grammars we have managed to disambiguate the case markings of the NP-chunks and PP-chunks following the verb phrase. As is well known, NooJ applies its grammars in a cascade. First come the grammars that have both the DC (direct case = prepositionless case) and the PC (prepositional case) attributes marked as obligatory; then those with DC obligatory and PC typical; then those with PC obligatory and DC typical; then those that have only DC or only PC marked as obligatory; and finally those that have only DC or only PC marked as typical.
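The order-independent check and the cascade described in this section can be sketched in Python. This is an illustration under stated assumptions, not the NooJ graphs themselves: the data structures (a verb as a mapping from feature names to sets of cases, a chunk as a pair of kind and candidate cases) and the function `disambiguate` are hypothetical, and the sketch considers at most one NP-chunk and one PP-chunk after the verb phrase.

```python
# Grammar types in the order of the cascade described above, highest
# priority first. Each pair names the verb feature checked for the NP (DC)
# and for the PP (PC); None means that chunk kind is not checked.
CASCADE = [
    ("DCobl", "PCobl"),
    ("DCobl", "PCtyp"),
    ("DCtyp", "PCobl"),
    ("DCobl", None),
    (None, "PCobl"),
    ("DCtyp", None),
    (None, "PCtyp"),
]


def disambiguate(verb, chunks):
    """Return ({kind: cases}, grammar_type) for the first type that fits.

    `verb` maps feature names to the sets of cases the verb allows.
    `chunks` lists the (kind, candidate_cases) pairs after the VP, in any
    order; kind is 'NP' or 'PP'. Returns None if no grammar type applies.
    """
    found = dict(chunks)  # the order of NP and PP after the verb is free
    for dc, pc in CASCADE:
        result = {}
        for kind, feature in (("NP", dc), ("PP", pc)):
            if feature is None:
                continue
            # Keep only the candidate cases licensed by the verb.
            cases = found.get(kind, set()) & verb.get(feature, set())
            if not cases:
                break
            result[kind] = cases
        else:
            return result, (dc, pc)
    return None


verb = {"DCobl": {"Acc"}, "PCtyp": {"G"}}
np = ("NP", {"Nom", "Acc"})
pp = ("PP", {"G", "D"})
print(disambiguate(verb, [np, pp]))
print(disambiguate(verb, [pp, np]))
```

Both calls select the grammar type with DC obligatory and PC typical and reduce the ambiguous chunks to accusative and genitive, regardless of which chunk comes first.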

CHAPTER FOUR RESULTS

As our gold standard we used a subset of 137 sentences, already described in [Vučković08], with 1 150 manually marked NP-chunks. Before using CROVALLEX data, the chunker recognized 1 099 NP-chunks. Of these, 601 were NP-chunks with correct case marking, while 437 had more than one case marking. After we added CROVALLEX data, the chunker recognized 1 070 NP-chunks, 729 of which had correct case marking, 26 had false case marking, and 246 had multiple case markings. The remaining NP-chunks were falsely marked as NPs. Precision, recall and f-measure were also calculated for NP-chunks, as shown in Table 4-1.

            Before CROVALLEX   After CROVALLEX
Precision   33.34              68.13
Recall      52.26              63.39
f-measure   40.69              65.68

Table 4-1

If we compare the results before and after adding the data, we see considerable improvements in our system. We believe that these results can be further improved by adding other information from CROVALLEX that was not included at this stage of our project. Such data include, among others, adjective constructions, adverb constructions, and infinitive constructions with or without a preposition.
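The paper does not state the formulas used for Table 4-1. As a minimal sketch, assuming the standard definitions (precision = correct / recognized, recall = correct / gold standard, f-measure = their harmonic mean), the values in the “After CROVALLEX” column can be reproduced from the counts reported in this chapter. The function name `prf` is an assumption made for illustration.

```python
def prf(correct, recognized, gold):
    """Precision, recall and f-measure in percent, rounded to two decimals."""
    precision = correct / recognized
    recall = correct / gold
    f_measure = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * x, 2) for x in (precision, recall, f_measure))


# Counts after adding CROVALLEX data: 729 correctly case-marked NP-chunks
# out of 1 070 recognized, against 1 150 in the gold standard.
print(prf(correct=729, recognized=1070, gold=1150))
# (68.13, 63.39, 65.68)
```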

CHAPTER FIVE CONCLUSION AND FUTURE WORK

In this paper we have shown how an NP-chunk and PP-chunk recognizer can perform better when fed with verb valency information. For this purpose we merged two separate projects: the Croatian Verb Valency Lexicon and the Chunker for Croatian. This has proven to be a good choice, since our original chunker has improved its precision and recall. We believe that its performance can be improved even further by adding other relevant data from CROVALLEX that are not included at this stage of the project. This we leave for future work.

REFERENCES

[Abney91] Steven Abney: Parsing by Chunks, in Principle-Based Parsing, (eds.) R. Berwick, S. Abney, C. Tenny, Kluwer Academic Publishers, 257-278, 1991.

[Mikelić08] Nives Mikelić Preradović: Pristupi izradi strojnog tezaurusa za hrvatski jezik, PhD Thesis, Faculty of Humanities and Social Sciences, Zagreb, 2008.

[Moguš99] M. Moguš, M. Bratanić, M. Tadić: Hrvatski čestotni rječnik, Zavod za lingvistiku i Školska knjiga, Zagreb, 1999.

[Silberztein08] Max Silberztein: NooJ v2, Manual, http://www.nooj4nlp.net/NooJ_Manual, 2008.

[Vučković08] Kristina Vučković, Marko Tadić, Zdravko Dovedan: Rule Based Chunker for Croatian, in Proceedings of the Sixth International Conference on Language Resources and Evaluation LREC 2008, Marrakech: ELRA, 2008.