
Lang Resources & Evaluation (2013) 47:1305–1314 DOI 10.1007/s10579-013-9230-7 BRIEF REPORT

The Romanian wordnet in a nutshell

Dan Tufiş · Verginica Barbu Mititelu · Dan Ştefănescu · Radu Ion

Published online: 9 May 2013 © Springer Science+Business Media Dordrecht 2013

Abstract The project on the Romanian wordnet has been under continuous development for more than 10 years. The wordnet has been in constant use in many projects and applications, which determined, to a large extent, its content and its coverage of various lexical domains. The article presents the most recent developments of the Romanian wordnet and offers quantitative data for its current version.

Keywords Aligned wordnets · BalkaNet · EuroWordNet · Interlingual mapping · Lexical ontology · Ontology projection · Princeton WordNet · Romanian

D. Tufiş (corresponding author) · V. B. Mititelu · D. Ştefănescu · R. Ion
Romanian Academy Research Institute for Artificial Intelligence “M. Drăgănescu”, Bucharest, Romania

1 Introduction

The Princeton WordNet (PWN) (Miller et al. 1990; Miller 1995; Fellbaum 1998) is a lexical semantic network whose recent version (PWN3.0) contains 117,659 concepts (lexicalized by 155,287 words) related by semantic and lexical relations. The lexical stock covers the open class categories and is distributed among four semantic networks, each of them corresponding to a different word class: nouns, verbs, adjectives and adverbs. The notion of meaning in PWN is equivalent to the notion of concept and it is represented, according to a differential lexicographic theory, by a series of words which, in specific contexts, could be mutually substituted (a synset). The organizing relations are specific to the grammatical category of the literals in a synset. The major relations in PWN are: synonymy, hypernymy, meronymy (for nouns), troponymy, entailment (for verbs).

Besides the semantic relations that hold between synsets, there are several other relations that relate literals, called lexical relations. The most important one is synonymy. Defined as identity of meaning and substitutability in most but not necessarily all contexts, synonymy establishes the sets of words occurring in the nodes of the network (the synsets). Most relations for adjectives and adverbs (e.g. related nouns, participle) are lexical relations. Derivational relations link words with the same root, irrespective of their part of speech: teach (verb) – teacher (noun), teacher (noun) – teachership (noun). Antonymy holds between literals of the same part of speech and is instantiated by pairs from all four grammatical categories represented in PWN.

The influence of the WordNet project in the domain of natural language processing was enormous and several other projects have been initiated to complement the information offered by PWN. Among the most important such initiatives was the alignment of PWN synsets to the concepts of the SUMO&MILO (Niles and Pease 2001) upper and mid-level ontology, which turned the ensemble PWN + SUMO&MILO into a proper lexical ontology. Another enhancement of PWN was the development of the DOMAINS (Bentivogli et al. 2004) hierarchical classification system, which assigns each synset of PWN a DOMAINS class.

2 Multilingual wordnets: EuroWordNet and BalkaNet

As mentioned before, the impact of the PWN on the NLP systems for English was unanimously acclaimed by the researchers and developers of language processing systems and, as a consequence, in 1996, the European Commission decided to finance EuroWordNet (EWN) (Vossen 1998), a large project aiming at developing similar lexical resources for several major European languages: Dutch, English, French, German, Italian and Spanish. The most innovative feature of this project was the idea to have the synsets of the monolingual semantic dictionaries aligned via an Inter-Lingual Index (ILI), to allow cross-lingual navigation from one language to the others. The ILI represented a conceptualization of the meanings, lexicalized in different languages by specific synonymy sets. An initial set of 1,024 Common Base Concepts (CBCs) was selected and defined as PWN1.5 synsets, and EWN defined for them (actually only for 1,012 CBCs) a top-ontology that has been the common semantic framework for defining the relations in each individual wordnet separately. To express the cross-lingual relations among the synsets in one language and the language-independent concepts of ILI, the EWN project defined 20 distinct types of binary equivalence relations (EQ-SYN, EQ-HYPO, EQ-MERO, etc.). While PWN was essentially focused on representing paradigmatic relations among the synsets, EWN considered the syntagmatic relations as well. As compared to PWN, the set of internal relations defined by EWN is much larger (90), including case relations (Agent, Object, Patient, Instrument, etc.) and derivative lexical relations (XPOS-SYNONYMY: to adore/adoration).¹

After three successful years, the initial EWN project was extended for two more years with the task to include in the multilingual ontology four other languages: Basque, Catalan, Czech and Estonian.

A significant follow-up of EWN was the BalkaNet (BKN) European project (Stamou et al. 2002), which observed the EWN methodology and brought into the multilingual lexical ontology five Balkan area languages: Bulgarian, Greek, Romanian, Serbian and Turkish. The Czech language was also included as a liaison to the EWN aligned wordnets. The major objective of this project was to build core semantic networks (8,000 synsets) for the new languages and to ensure full cross-lingual compatibility with the other 9 semantic networks built in EWN. The philosophy of the BKN architecture was similar to EWN, but it brought several innovations such as: more precise design methodologies, a common XML codification of the monolingual wordnets, the introduction of valence frames for verbs and deverbal nouns, an increased set of lexical relations (dealing with perfective/imperfective aspect and the rich inflectional morphology of the Balkan languages), support for non-lexicalized concepts, the definition of region-specific concepts, etc.

The concepts considered highly relevant for the Balkan languages (see details in Tufiş et al. 2004) were identified and called BKN Base Concepts. These are classified in three sets of increasing size (BCS1, BCS2 and BCS3). Altogether there are 8,516 concepts that were lexicalized in each of the BKN wordnets. BCS1 was an extension (according to the Conceptual Density Principle, see the next section) of the EWN CBC set. The monolingual wordnets had to have their synsets aligned to the equivalent synsets of the PWN. BCS1, BCS2 and BCS3 were adopted as core wordnets for several other wordnet projects such as Hungarian, Slovene, Arabic, etc.

Most recently, an effort to develop a language-independent module to which all existing wordnets can be connected and which ensures better machine processing of lexical information, even cross-linguistically, took the form of the KYOTO project (Fellbaum and Vossen 2012).

3 The ongoing RWN development and its current status

By the end of the BKN project (August 2004) the Romanian wordnet (RWN henceforth), built by a joint team from the Research Institute for Artificial Intelligence of the Romanian Academy and the Faculty of Informatics of the “Al. I. Cuza” University of Iaşi, contained almost 18,000 synsets, conceptually aligned to PWN 2.0 and, through it, to the synsets of all the BKN wordnets. Afterwards, the Research Institute for Artificial Intelligence of the Romanian Academy undertook the task of maintaining and further developing the RWN.

¹ PWN 3.0 includes 14 noun–verb relations (morpho-semantic links) such as AGENT, INSTRUMENT, LOCATION, RESULT, etc.


During the BKN project we created in-house tools for the development of our wordnet: WNBuilder and WNCorrect (Tufiş and Barbu 2004). The former is used for the development and syntactic correction of the synsets, while the latter is used for the semantic correction. Two basic development principles of the BKN methodology (Tufiş et al. 2004) were strictly observed: the Hierarchy Preservation Principle (HPP) (i.e., the hierarchical structure of the concepts in a wordnet is the same irrespective of the natural language for which the wordnet is developed) and the Conceptual Density Principle (CDP) (i.e., once a concept is selected to be implemented, all its ancestors up to the unique beginners are also selected). The HPP allows for all semantic relations to be automatically imported from PWN. The CDP compliance prevents the existence of “orphan synsets” (Tufiş et al. 2004).

Over time, various criteria for selecting the concepts to be implemented have been observed: a complete coverage of the 1984 corpus, of the newspaper articles corpus NAACL2003, of the Acquis Communautaire corpus, of the Eurovoc thesaurus, and as much as possible of the Wikipedia lexical stock.

Specific to the RWN is the way sense numbers are assigned to literals. Whenever a word is present in our electronic explanatory dictionary of Romanian (EXPD), its sense number is preserved in the RWN synset. In EXPD the hierarchical organization of word meanings is outlined by the sense numbering system (as exemplified below for the simplified entry BAIE):

1. Scăldat, scaldă, îmbăiere. (bathing)
   1.1 Cadă, vas special de îmbăiat (bathtub)
   1.2 cantitate mare de sânge pierdută de cineva (large quantity of blood lost by someone)
       1.2.1 (prin extensiune) măcel. (bloodbath)
   …
2. Mină (din care se extrag minerale) (mine from which ores and minerals are extracted).
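The nesting implied by such dotted sense numbers can be recovered mechanically. The following is a minimal Python sketch, assuming that sense numbers are stored as plain strings such as "1.2.1"; the function names and the dictionary holding the BAIE entry are illustrative choices, not part of the RWN tools.

```python
from typing import List, Optional


def parent_sense(sense: str) -> Optional[str]:
    """Return the enclosing sense number, or None for a top-level sense."""
    head, sep, _ = sense.rpartition(".")
    return head if sep else None


def ancestors(sense: str) -> List[str]:
    """List all enclosing sense numbers, nearest first."""
    chain = []
    current = parent_sense(sense)
    while current is not None:
        chain.append(current)
        current = parent_sense(current)
    return chain


# Simplified BAIE entry from the text (English glosses only).
baie = {
    "1": "bathing",
    "1.1": "bathtub",
    "1.2": "large quantity of blood lost by someone",
    "1.2.1": "bloodbath",
    "2": "mine from which ores and minerals are extracted",
}

for number, gloss in baie.items():
    print(number, gloss, ancestors(number))
```

In this representation the "bloodbath" sense 1.2.1 is reached from sense 1 through sense 1.2, while sense 2 has no ancestors, which mirrors the two sense groups of the entry.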

One can notice the two different sense groups: in the former group the first meaning is connected with bathing; by metonymy, the second refers to the container used for bathing; by extension, the third designates a large quantity of blood; the fourth is further extended to refer to bloodshed; the latter group contains only one meaning, completely unrelated to the previous ones. We decided to maintain these nested sense numbers for literals because they can be viewed as an extra “relation” in wordnet, which keeps track of the semantic evolutions of words.

When a Romanian word needed as an equivalent of an English one is not in the EXPD, it is assigned a sense “number” x (e.g. strămutare:x). As many words are polysemous, it is very probable that the same literal with the same sense number x occurs in more than one synset: strămutare:x occurs in the Romanian synsets corresponding to the English synsets {move:2, relocation:2}, {shift:1, displacement:2} and {resettlement:1, relocation:1}.

Another special case is represented by words which are in EXPD but for which the sense equivalent to the English one is not listed. The lexicographer carefully examines the attested word senses in order to find the closest one. If it exists, the unattested sense gets the same sense number as this one with “.x” added at its end (note that x is not a variable, but a sense number not yet assigned). An example is plural:1.x with the gloss “the form of a word that is used to denote more than one”: the word plural is registered in EXPD, but not with this meaning; the closest meaning it has is “grammatical category that shows that there are two or more things of the same type”, which is its first sense in EXPD. Thus, the hierarchical organization of meanings remains unaltered. If it does not exist, i.e. the sense under consideration is not close to any of the recorded senses in EXPD, then the “x” sense “number” is assigned to it (e.g. familie:x, equivalent to the English family:1 with the gloss “a social unit living together”), so it is treated as a distinct and unrelated sense. Note the identical treatment of the senses unattested in EXPD (although the word exists in EXPD) and the senses of a word not occurring in EXPD.

3.1 Semantic validation and conflict resolution

When a wordnet is developed manually by several lexicographers working independently, semantic conflicts are unavoidable. We call conflicting literals or, simply, conflicts those cases when the same literal with the same sense number occurs in at least two different synsets. We do not consider literals with sense numbers ending in x conflicts in our network, as we envisage that, at a certain moment, we will assign proper numbers to these senses, in a way consistent with the other literals. We identified the main causes for conflicting literals in the RWN version aligned to PWN2.0:

● The equivalent PWN 2.0 synsets are extremely difficult, even impossible, to distinguish from each other. As a consequence, their Romanian counterparts were identical. For example, inacţiune:1 appears in the Romanian synsets equivalent to the English {inaction:1, inactivity:1, inactiveness:1} (gloss: the state of being inactive) and {inaction:2, inactiveness:2} (gloss: a state of no activity). A proof of the redundancy is the fact that in PWN 3.0 only the former synset is preserved, while the latter was eliminated (and thus, this conflict disappeared in RWN3.0).

● Diaphasic variation. In PWN there are synsets in hyponymy relation, between which the only difference concerns the usage register: {mental hospital:1, psychiatric hospital:1, mental institution:1, institution:5, mental home:1, insane asylum:1, asylum:2} with the gloss “a hospital for mentally incompetent or unbalanced person” and {Bedlam:2, booby hatch:1, crazy house:1, cuckoo’s nest:1, funny farm:1, funny house:1, loony bin:1, madhouse:1, nut house:1, nuthouse:1, sanatorium:2, snake pit:2} with the gloss “pejorative terms for an insane asylum”. Both synsets were implemented in Romanian, ignoring the stylistic difference, with the same literals with the same sense numbers: balamuc:1.1, ospiciu:1.1, casă_de_nebuni:x, and this had to be corrected.

● BILI synsets doubling ILI synsets. At the end of the BKN project, the developed wordnets contained, besides the synsets aligned to ILI, some hundreds of synsets with literals lexicalizing concepts specific to the Balkan area. These synsets bear the tag BILI, instead of ILI. We noticed that, because PWN contains some concepts beyond the American and British culture and civilization, some of the BILI concepts double ILI ones. For example, basma:1 was introduced as a BILI synset, although it is also the equivalent term for {kerchief:1} with the gloss “a square scarf that is folded into a triangle and worn over the head or about the neck”.

● PWN contains many metonymies that are not registered in our EXPD. For example, {revistă:1} was given as the equivalent for both {magazine:2} (product consisting of a paperback periodic publication as a physical object) and {magazine:1, mag:1} (a periodic paperback publication). Another sense number needs to be postulated for revistă to match magazine:2 (e.g. revistă:1.x).
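The definition of a conflict given in this section (the same literal with the same sense number in at least two synsets, provisional x senses excluded) lends itself to an automatic check. The sketch below is only an illustration in Python: the synset identifiers, the dictionary layout and the function name are assumptions made for the example and do not describe WNCorrect.

```python
from collections import defaultdict


def find_conflicts(synsets):
    """Return {(literal, sense): [synset ids]} for literals used in 2+ synsets.

    synsets maps a synset id to a list of (literal, sense number) pairs.
    Sense numbers ending in "x" are provisional and are not reported.
    """
    occurrences = defaultdict(set)
    for synset_id, literals in synsets.items():
        for literal, sense in literals:
            if sense.endswith("x"):
                continue  # covers both "x" and forms such as "1.x"
            occurrences[(literal, sense)].add(synset_id)
    return {key: sorted(ids) for key, ids in occurrences.items() if len(ids) > 1}


# Toy data built from examples in the text; the synset ids are hypothetical.
rwn_sample = {
    "syn_inaction_a": [("inacţiune", "1")],
    "syn_inaction_b": [("inacţiune", "1")],
    "syn_move": [("strămutare", "x")],
    "syn_shift": [("strămutare", "x")],
    "syn_magazine": [("revistă", "1")],
    "syn_magazine_object": [("revistă", "1.x")],
}

conflicts = find_conflicts(rwn_sample)
print(conflicts)
```

On this sample only inacţiune:1 is reported; strămutare:x is ignored although it occurs twice, and giving revistă the provisional sense 1.x in the second synset removes the conflict with revistă:1.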

3.2 Mapping RWN onto PWN3.0

Since the beginning of the RWN project we have been concerned with its alignment to the latest PWN version available. We present below our strategy for aligning RWN to PWN3.0.

We exploited the following facts: RWN has already been mapped to PWN2.0, and it is far easier to map PWN2.0 to PWN3.0 than to map RWN directly to PWN3.0. As such mappings between PWN2.0 and PWN3.0 already exist (see for example the one from the NLP Research Group at UPC²), mapping RWN to PWN3.0 might seem a trivial process. However, the existing mappings are automatically generated and no validation was performed on them. Moreover, as synsets appear or disappear between versions, there are many cases of 1-to-many or 1-to-0 mappings instead of 1-to-1, forcing us to decide what to do with the RWN data in such situations.

We decided to develop our own automatic mapping algorithm between PWN2.0 and PWN3.0. The results were compared to those obtained at UPC, the latter being treated as a gold standard. Both precision and recall values were around 95 % and the differences were manually validated or invalidated. Thus, we obtained a highly reliable mapping between PWN2.0 and PWN3.0 synsets.³

For the 1-to-1 mappings we simply transferred the PWN 3.0 synset identifiers to RWN. For the other types, we manually checked the related RWN data in order to properly decide what to do with it. There were 513 such situations, representing 1.23 % of the RWN synsets, out of which 56 were of 1-to-2 type and 457 of n-to-1 type. All cases were manually validated: two lexicographers took them in turn and decided on a solution in each case.

The 1-to-2 type is the result of the reorganization of synsets by moving literals from one synset into another, of the introduction of new word senses in PWN3.0 as compared to PWN2.0 and of the improvement of the glosses. Our task here was to choose the best fitting English synset for the Romanian one and to implement the other English synsets into Romanian.
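The comparison of two PWN2.0-to-PWN3.0 mappings and the separation of the 1-to-1 cases from those needing manual inspection can be sketched as follows. This is a hedged Python illustration: representing a mapping as a set of (PWN2.0 id, PWN3.0 id) pairs, the function names and the toy identifiers are all assumptions, and the sketch says nothing about how the actual mapping algorithm produces its pairs.

```python
from collections import defaultdict


def precision_recall(candidate, reference):
    """Score a candidate mapping against a reference mapping.

    Both arguments are sets of (pwn20_id, pwn30_id) pairs.
    """
    agreed = candidate & reference
    precision = len(agreed) / len(candidate) if candidate else 0.0
    recall = len(agreed) / len(reference) if reference else 0.0
    return precision, recall


def classify(pairs):
    """Label each PWN2.0 id as '1-to-1', '1-to-many' or 'n-to-1'."""
    targets = defaultdict(set)  # PWN2.0 id -> PWN3.0 ids it maps to
    sources = defaultdict(set)  # PWN3.0 id -> PWN2.0 ids mapped onto it
    for old, new in pairs:
        targets[old].add(new)
        sources[new].add(old)
    kinds = {}
    for old, news in targets.items():
        if len(news) > 1:
            kinds[old] = "1-to-many"
        elif len(sources[next(iter(news))]) > 1:
            kinds[old] = "n-to-1"
        else:
            kinds[old] = "1-to-1"
    return kinds


# Hypothetical identifiers, chosen only to exercise every case.
candidate = {("a", "A"), ("b", "B1"), ("b", "B2"), ("c", "D"), ("d", "D")}
reference = {("a", "A"), ("b", "B1"), ("c", "D"), ("d", "D"), ("e", "E")}

print(precision_recall(candidate, reference))
print(classify(candidate))
```

Pairs on which the two mappings disagree (here ("b", "B2") and ("e", "E")) are the ones that would go to manual validation; identifiers labelled '1-to-1' can be transferred directly.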
The n-to-1 type is represented by cases when two or more synsets in which different words occurred were merged. For these we decided upon the new form of the synsets and glosses.

² http://nlp.lsi.upc.edu/web/index.php?option=com_content&task=view&id=21&Itemid=59.
³ http://nlptools.racai.ro/nlptools/index.php?page=pwn3to2.

Tables 1 and 2 give a quantitative summary of the current RWN3.0. We mention that out of the total number of synsets 541 are labeled as BILI synsets (i.e. concepts specific to the Balkan area, as identified in the BKN project). They are not dangling nodes, but are attached in the most appropriate place in the network, usually via a semantic relation with the hypernym(s). For example, colac with the gloss “a kind of bread, in the form of a ring, obtained by braiding some ropes of dough” has pâine (bread) as hypernym.

Table 1 POS distribution of the synsets in RWN3.0

POS        Literals  Synsets  Senses
Noun       38875     41061    56594
Verb       6749      10063    16122
Adjective  4450      4822     8265
Adverb     2798      3065     4026
Total      52357     58725    86175

Table 2 Some of the internal relations used in the RWN3.0

Relation                Count
Hypernym                47171
Near_antonym            3939
Domain_TOPIC            3676
Also_see/near_also_see  1520
Part_holonym            5422
Near_eng_derivat        46093
Similar_to              4088
Instance_hypernym       3869
Verb_group              1504
Attribute               868
Member_holonym          1870
Cause                   187

3.3 Import of DOMAINS 3.2 and of SUMO/MILO annotations into RWN3.0

Based on the mapping between the PWN versions, it was possible to transfer the IRST DOMAINS 3.2 and SUMO/MILO annotations from PWN2.0 to PWN3.0 and, afterwards, to import them into RWN3.0. RWN3.0 covers most of the DOMAINS-3.2, SUMO, MILO and domain ontologies concepts existing in PWN3.0 (Table 3).

Table 3 PWN3.0 versus RWN3.0 ontological labeling (DOMAINS, SUMO, MILO)

Labels             PWN3.0  RWN3.0
DOMAINS-3.2        164     164
SUMO               824     787
MILO               1273    1202
Domain ontologies  863     715

During the building of the RWN3.0 and the importing of the DOMAINS3.2 and SUMO/MILO annotations from PWN2.0 via PWN3.0 we detected that 13,946 English synsets of the PWN2.0 were not aligned to SUMO/MILO. We tried to remedy this omission by a very simple automatic procedure, followed by a human
validation and post-editing phase. The procedure may be briefly described as follows: Let A be a synset not mapped onto a SUMO/MILO concept and B a synset which is either a hypernym/hyponym of A (when A is a nominal or a verbal synset) or is similar-to A (when A is an adjectival synset). Then, if A and B have the same DOMAINS label, A should be labeled with the same SUMO/MILO concept as B. This procedure was repeated until no new assignment was possible. For unmapped adverbial synsets nothing has been done yet since there is no “transfer” semantic relation to follow. Using this procedure, we were able to assign SUMO/MILO mappings for 12,693 English synsets, thus being left with 1,253 unmapped synsets: 561 for adverbs, 587 for adjectives, 80 for nouns and 25 for verbs. The 80 nominal synsets and the 25 verbal synsets have been manually mapped. Currently there are still 1,148 unlabeled synsets (561 for adverbs, 587 for adjectives) for which there is no structural information to help.

Our experiments involving the mapping of the SUMO/MILO concepts on the PWN synsets also revealed quite a large number of contradictions in the hierarchical structuring of the two resources. We found many situations in which for two synsets S1 and S2, with S2 being a hypernym of S1, the SUMO/MILO concept assigned to S2 is a sub-class or sub-domain of the concept assigned to S1, as encoded in the description file (kif) containing the SUMO/MILO information (Table 4). For example, in PWN 2.0, synset 13322644-n, which contains AIDS:1 (“a serious (often fatal) disease of the immune system transmitted through blood products especially by sexual contact or contaminated needles”), is a hyponym of the synset 13322073-n, which contains infectious_disease:1 (“a disease transmitted only by a specific kind of contact”). On the other hand, DiseaseOrSyndrome, the SUMO/MILO concept assigned to 13322644-n (AIDS:1), is more general than InfectiousDisease, the SUMO/MILO concept assigned to 13322073-n. The SUMO/MILO kif file explicitly states that InfectiousDisease is a subclass of DiseaseOrSyndrome.

The majority of these contradictions (more than 95 %) are still preserved for PWN 3.0. The example given above holds for PWN 3.0 as well, but the synset identifiers are 14127782-n (AIDS) and 14127211-n (infectious_disease). For PWN 2.0 the total number of such contradictions is 8,027, while for PWN 3.0 it is 7,830. Complete lists of contradictions are available on-line.⁴ In the RWN2.0 and RWN3.0 we made the mapping corrections as we considered appropriate, but did not modify the SUMO/MILO mappings to PWN2.0 and PWN3.0.
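Both the label transfer procedure and the contradiction check of this section are simple graph operations. The Python sketch below illustrates them under stated assumptions: the dictionary-based data layout, the function names and the toy synsets s1 to s4 are hypothetical, the subclass relation is supplied as a plain dictionary rather than read from the kif file, and taking the first suitable neighbour is a tie-breaking choice made here, not one documented in the text.

```python
def propagate(neighbours, domains, sumo):
    """Transfer SUMO/MILO labels until no new assignment is possible.

    neighbours: synset id -> ids of its hypernyms/hyponyms (or similar-to synsets)
    domains:    synset id -> DOMAINS label
    sumo:       synset id -> SUMO/MILO concept of the already mapped synsets
    """
    labels = dict(sumo)
    changed = True
    while changed:
        changed = False
        for a in domains:
            if a in labels:
                continue
            for b in neighbours.get(a, ()):
                # Assumption: the first labelled neighbour with the same
                # DOMAINS label wins.
                if b in labels and domains.get(b) == domains[a]:
                    labels[a] = labels[b]
                    changed = True
                    break
    return labels


def is_subclass(child, parent, subclass_of):
    """True if `child` is a proper descendant of `parent` in the ontology."""
    seen = set()
    stack = list(subclass_of.get(child, ()))
    while stack:
        concept = stack.pop()
        if concept == parent:
            return True
        if concept not in seen:
            seen.add(concept)
            stack.extend(subclass_of.get(concept, ()))
    return False


def contradictions(hypernym_of, labels, subclass_of):
    """Pairs (s1, s2), s2 a hypernym of s1, whose concepts run the other way."""
    return [
        (s1, s2)
        for s1, hypernyms in hypernym_of.items()
        for s2 in hypernyms
        if s1 in labels and s2 in labels
        and is_subclass(labels[s2], labels[s1], subclass_of)
    ]


# Label transfer on hypothetical synsets s1..s4.
neighbours = {"s2": ["s1"], "s3": ["s2"], "s4": ["s1"]}
domains = {"s1": "medicine", "s2": "medicine", "s3": "medicine", "s4": "sport"}
propagated = propagate(neighbours, domains, {"s1": "DiseaseOrSyndrome"})

# Contradiction check on the AIDS example (PWN 3.0 identifiers from the text).
hypernym_of = {"14127782-n": ["14127211-n"]}
pwn_labels = {"14127782-n": "DiseaseOrSyndrome", "14127211-n": "InfectiousDisease"}
subclass_of = {"InfectiousDisease": ["DiseaseOrSyndrome"]}
found = contradictions(hypernym_of, pwn_labels, subclass_of)

print(propagated)
print(found)
```

In the toy run s2 and s3 inherit the label of s1 in turn, s4 stays unlabeled because its DOMAINS label differs, and the AIDS/infectious_disease pair is reported as a contradiction.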

⁴ http://nlptools.racai.ro/nlptools/index.php?page=contradictionsPwn2SMzip and http://nlptools.racai.ro/nlptools/index.php?page=contradictionsPwn3SMzip, respectively.

Table 4 Examples of contradictions for PWN 3.0

1. Synset S1: 02504458-n (African elephant:1)
   Definition (S1): An elephant native to Africa having enormous flapping ears and ivory tusks
   SUMO (S1): Mammal
   Synset S2: 02503517-n (Elephant:1)
   Definition (S2): Five-toed pachyderm
   SUMO (S2): Elephant

2. Synset S1: 09923263-n (Cinderella:1)
   Definition (S1): A woman whose merits were not been recognized but who then achieves sudden success and recognition
   SUMO (S1): Human
   Synset S2: 10787470-n (Woman:1)
   Definition (S2): An adult female person (as opposed to a man)
   SUMO (S2): Woman

3. Synset S1: 03073296-n (Colt:2)
   Definition (S1): A kind of revolver
   SUMO (S1): Weapon
   Synset S2: 04086273-n (Revolver:1)
   Definition (S2): A pistol with a revolving cylinder (usually having six chambers for bullets)
   SUMO (S2): RevolverGun

4. Synset S1: 03447721-n (Gong:1)
   Definition (S1): A percussion instrument consisting of a metal plate that is struck with a softheaded drumstick
   SUMO (S1): Musical-instrument
   Synset S2: 03915437-n (Percussion instrument:1)
   Definition (S2): A musical instrument in which the sound is produced by one object striking another
   SUMO (S2): Percussion-instrument

5. Synset S1: 06420678-n (Vocabulary:1)
   Definition (S1): A listing of the words used in some enterprise
   SUMO (S1): Text
   Synset S2: 06418693-n (Wordbook:1)
   Definition (S2): A reference book containing words (usually with their meanings)
   SUMO (S2): Book

4 Conclusions and further work

The development of RWN is a continuous project, keeping up with the new updates of the PWN, in line with the international interest for building wordnets.⁵ The increase in its coverage is steady, with the choice of the new synsets imposed by the applications built on the basis of RWN. There are several applications we developed using RWN as an underlying resource: word sense disambiguation, word alignment, question-answering in open domains, connotation analysis, machine translation, etc. The state-of-the-art performance of these systems is undeniably rooted in the quality and the coverage of RWN.

RWN can be browsed on our language web services platform.⁶ The browser uses hyperbolic graph representations and visualizes in a friendly manner all the synsets in which a given literal appears, together with its corresponding synonyms, the semantic relations for each of its senses, the definition of each sense, and the DOMAINS and SUMO/MILO labels. RWN can be downloaded from the META-SHARE node at RACAI.⁷

⁵ http://globalwordnet.org.

Acknowledgments The new work reported here was supported by the Romanian Academy and by the European Community’s Seventh Framework Programme under METANET4U Grant Agreement no. 270893.

References

Bentivogli, L., Forner, P., Magnini, B., & Pianta, E. (2004). Revising WordNet domains hierarchy: Semantics, coverage, and balancing. Proceedings of COLING 2004 Workshop on “Multilingual Linguistic Resources” (pp. 101–108).
Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Fellbaum, C., & Vossen, P. (2012). Challenges for a multilingual wordnet. Language Resources and Evaluation, published online: 10 May 2012.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38, 39–41.
Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 235–244.
Niles, I., & Pease, A. (2001). Towards a standard upper ontology. Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (pp. 2–9).
Stamou, S., Oflazer, K., Pala, K., Christodoulakis, D., Cristea, D., Tufiş, D., Koeva, S., Totkov, G., Dutoit, D., & Grigoriadou, M. (2002). BALKANET: A Multilingual Semantic Network for the Balkan Languages. Proceedings of the International Wordnet Conference (pp. 12–24).
Tufiş, D., & Barbu, E. (2004). A methodology and associated tools for building interlingual wordnets. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004) (pp. 1068–1070).
Tufiş, D., Cristea, D., & Stamou, S. (2004). BalkaNet: Aims, Methods, Results and Perspectives. In Tufiş, D. (Ed.) Special Issue on BalkaNet of the Romanian Journal of Information Science and Technology, vol. 7, no. 1–2 (pp. 9–43).
Vossen, P. (Ed.). (1998). A multilingual database with lexical semantic networks. Dordrecht: Kluwer Academic Publishers.

⁶ http://nlp.racai.ro/WnBrowser/.
⁷ http://ws.racai.ro:9191/.