Building Lexicons out of a Database for Idioms

Ricarda Dormeyer and Ingrid Fischer
IMMD II, University of Erlangen-Nürnberg

Abstract

In this paper, we present a computational dictionary for German verbal idioms, called Phraseo-Lex, where idioms are classified according to a wide range of description criteria. It was designed as a source of idiomatic knowledge for both the human user and applications in natural language processing. We show how it can be used for the latter, namely by generating detailed idiomatic entries for a given natural language processing lexicon. We introduce the implementation of a mapping between part of the Phraseo-Lex dictionary entry and one particular NLP lexicon. The goal system used consists of a chart parser, a syntactic formalism similar to PATR-II, and a separate semantic component using a combination of Discourse Representation Theory and λ-calculus.

1. The Phraseo-Lex Dictionary

Idioms, especially verbal idioms, are the most complex multi-word units found in natural language, and different researchers have described a wide range of syntactic, semantic, and pragmatic properties typical of them. In this paper we present the Phraseo-Lex system (Keil, 1997), a database for the representation of German verbal idioms. By restricting ourselves to the relatively specialized phenomenon of verbal idioms, we gain the advantage that we can incorporate a large number of features in our dictionary.

Phraseo-Lex was designed for two different purposes: on the one hand as a computational dictionary for humans, and on the other hand as a source and an instrument to generate lexicons for natural language processing systems. This approach led to a clear separation between the different classification criteria and features, as well as to the design of a library of interface functions that enables other programs to access the different kinds of linguistic knowledge in an easy and unambiguous way. An advantage of this approach is that linguists and builders of natural language processing systems use the same system for different kinds of work, thereby detecting different faults and improving it in different ways.

For the traditional linguist, we offer a graphical user interface for adding idiomatic knowledge in an intuitive way. An example of this is an idiom’s syntactic structure and the categories of the participating lexemes, both of which are described together by graphically building a phrase structure tree. The linguistic content of the tree is processed by the system and can be retrieved for each of the idiom parts separately, using query functions from the interface mentioned above. This is especially useful for the generation of lexicons in a grammar formalism where an idiom is not represented as one single unit, but separated into its constituents.
For natural language processing, the Phraseo-Lex dictionary may be useful in two different ways. Firstly, the systematic collection of idiomatic knowledge it offers can be used when designing a formal representation of idioms, as it helps to ensure that this representation accounts for all relevant phenomena. Secondly, after such a representation has been planned, the contents of the database can be converted into it automatically, thus adding idiomatic knowledge to a given NLP lexicon. To generate idiomatic lexical entries in this way, an additional lexicon-building program is needed. It works by mapping the information it gets using the Phraseo-Lex interface functions to the representation needed by the goal system.

The usefulness of a computational dictionary for this purpose is mainly due to the fact that verbal idioms are quite diverse, and many of their characteristics still seem to be of a rather arbitrary nature. Even Schenk (1995), who tries to come up with a theory that predicts the syntactic behaviour of idioms, admits in the end that “not all operations on idioms have been accounted for. For example, restrictions on passivization or imperative formation in the case of phrasal idioms do not follow from this analysis” (Schenk, 1995). Therefore, when representing idioms in a formal grammar, it is useful to have a detailed linguistic database at one’s disposal, where the knowledge needed has already been entered for the individual idiom.

In the remainder of this paper, we first give an overview of some of the idiom characteristics captured in the Phraseo-Lex dictionary in section 2. This overview is restricted to the features most interesting for natural language processing. In section 3 we introduce a natural language processing system that uses Discourse Representation Theory as its semantic framework, together with an adequate way to represent idioms in it. In section 4 we describe an exemplary mapping algorithm that converts Phraseo-Lex dictionary entries into lexicon entries for this goal system.

2. Syntax and Semantics of Verbal Idioms

The Phraseo-Lex computational dictionary contains an extensive description of German verbal idioms concerning syntactic, semantic, and pragmatic criteria, as well as a graphical user interface and a search component that allows the human user to retrieve all dictionary entries with any given set of characteristics from the database. A complete description of the system from the human user’s point of view can be found in (Keil, 1997). In the following overview, we restrict ourselves to those syntactic and semantic criteria we believe to be relevant for the representation of idioms in natural language processing.

2.1. Lemma and Base Lexemes

In Phraseo-Lex, as in conventional dictionaries, a dictionary entry is headed by its lemma. It is represented in a citation form similar to the traditional one for verbal idioms, but extended by information about the subject position. Examples (1) and (2) show lemmata for an idiom with a variable subject, and one where the subject is a fixed part of the idiom.

(1) (jmd.) einen Bock schießen
    (sb.)  a     buck shoot
    to make a mistake

(2) der Kopf raucht jdm.
    the head smokes sb.
    sb.’s head is spinning

Additionally, an idiom is indexed with a list of its content words, called the idiom’s base lexemes. They are given in their traditional citation form, which often differs from the inflected form in the idiom’s lemma. When turning a Phraseo-Lex entry into a lexicon entry in a given NLP lexicon, they can be used to identify the words the idiom consists of.
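The indexing by base lexemes described above can be pictured as a small inverted index. The following is a minimal sketch and not the Phraseo-Lex implementation: the entry layout, the function names `build_index` and `idioms_for`, and the base lexemes listed for idiom (2) are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical miniature of two dictionary entries: the lemma in citation form
# plus the base lexemes (content words in their traditional citation form).
# The base lexemes given for the second entry are an assumption.
ENTRIES = [
    {"lemma": "(jmd.) einen Bock schießen", "base_lexemes": ["Bock", "schießen"]},
    {"lemma": "der Kopf raucht jdm.", "base_lexemes": ["Kopf", "rauchen"]},
]


def build_index(entries):
    """Map each base lexeme to the lemmata of all idioms that contain it."""
    index = defaultdict(list)
    for entry in entries:
        for lexeme in entry["base_lexemes"]:
            index[lexeme].append(entry["lemma"])
    return dict(index)


def idioms_for(index, lexeme):
    """Return the lemmata indexed under a base lexeme (empty list if none)."""
    return index.get(lexeme, [])


INDEX = build_index(ENTRIES)
```

Looking up the citation form rather than the inflected form from the lemma is what lets a lexicon builder find the corresponding ordinary word in an NLP lexicon.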

2.2. Syntactic Features

We call the classification into idioms with a fixed subject and those with a variable one the idiom’s syntactic type. The remaining syntactic structure is described by means of a phrase structure tree. The phrase structure grammar needed to construct such a tree is implicitly given by an interactive tree building facility, which is part of the Phraseo-Lex graphical user interface. It contains a display of the tree and a column of syntactic category buttons that can be used to add nodes to the tree. The phrase structure tree serves to encode the verb’s dependents, their case, the words they contain and their syntactic categories.

Just like simple verbs, verbal idioms require certain dependents to appear with them in the sentence. These are called the idiom’s external valencies. Additionally, the idiom’s internal structure can also be described in terms of valency theory, namely as the verb’s complements or internal valencies. The idiom in example (1) has one internal valency, einen Bock (a buck), and one external valency, a noun phrase in the nominative case. This valency structure can be computed from the information encoded in the phrase structure tree, with the subject position being taken from the idiom’s syntactic type.

A typical characteristic of idioms is their resistance to syntactic manipulations: many idioms cannot undergo certain transformations without losing their idiomatic reading; furthermore, syntactic anomalies and unique components may occur. In Phraseo-Lex, a given set of syntactic transformations can be marked as possible or impossible for each idiom, for example passivization, relativization, negation, wh-question, adnominal modification, and quantification. Transformations that apply to the idiom as a whole can be marked as possible, impossible or undecidable; for those that apply to a constituent of the idiom, typically to a noun phrase, a list of base lexemes that allow the transformation can be given.

A syntactic anomaly is a construction that is considered syntactically incorrect in nonidiomatic language, but nevertheless occurs in some idioms, for example a missing determiner, or a deviation in verb valency. Phraseo-Lex provides a list of syntactic anomalies found in German idioms, from which those applying to the current idiom, if any, can be selected. A unique component is a word that does not exist outside the idiom it occurs in. In Phraseo-Lex, each idiom’s unique components are listed explicitly.
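To make the syntactic description more concrete, the sketch below models it as a small record type. It is an illustration under stated assumptions, not the Phraseo-Lex data model: all class, field and method names are hypothetical, the valencies are entered directly instead of being computed from a phrase structure tree, and treating an unmarked whole-idiom transformation as undecidable is a choice made here for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mark(Enum):
    """Three-way marking for transformations applying to the whole idiom."""
    POSSIBLE = "possible"
    IMPOSSIBLE = "impossible"
    UNDECIDABLE = "undecidable"


@dataclass(frozen=True)
class Valency:
    category: str        # syntactic category, e.g. "NP"
    case: str            # e.g. "nom", "acc"
    internal: bool       # True for a fixed part of the idiom
    words: tuple = ()    # fixed words of an internal valency


@dataclass
class SyntacticDescription:
    fixed_subject: bool  # the idiom's syntactic type
    valencies: list
    whole_idiom: dict = field(default_factory=dict)      # transformation -> Mark
    per_constituent: dict = field(default_factory=dict)  # transformation -> base lexemes

    def external_valencies(self):
        return [v for v in self.valencies if not v.internal]

    def internal_valencies(self):
        return [v for v in self.valencies if v.internal]

    def whole_idiom_mark(self, transformation):
        # Assumption: an unmarked transformation counts as undecidable.
        return self.whole_idiom.get(transformation, Mark.UNDECIDABLE)

    def constituent_allows(self, transformation, base_lexeme):
        return base_lexeme in self.per_constituent.get(transformation, [])


# Idiom (1): one external nominative NP, one internal accusative NP.
BOCK_SCHIESSEN = SyntacticDescription(
    fixed_subject=False,
    valencies=[
        Valency("NP", "nom", internal=False),
        Valency("NP", "acc", internal=True, words=("einen", "Bock")),
    ],
    per_constituent={"adnominal modification": ["Bock"]},
)
```

Keeping whole-idiom marks and per-constituent lists in separate fields mirrors the distinction drawn in the text between transformations of the idiom as a whole and transformations of one of its constituents.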

2.3. Semantic Features

The classical view on the semantics of idioms is that they do not have an internal semantic structure, i.e. they are semantically noncompositional. This was even taken as one of the defining criteria for idioms. Čermák supports this position when calling the idiom’s noncompositionality “a conditio sine qua non for its semantic substance” and claiming that “semantically, the idiom is a holistic, Gestalt phenomenon, a feature often acknowledged, which excludes any possibility of an objective semantic analysis” (Čermák, 1998). This traditional view was challenged by Wasow et al., who claimed that there exists a class of idioms for which parts of the idiom “have identifiable meanings which combine to produce the meaning of the whole” (Wasow et al., 1983), i.e. a class of compositional idioms. A more recent view recognizes a continuum between fixed idiomatic expressions on the one hand and freely combinable words on the other hand, with different degrees of both syntactic flexibility and semantic analyzability in between (Abeillé, 1995; Dobrovol’skij, 1995).

Taking this into account, we decide for each idiom part separately whether or not we take it to have a meaning of its own, i.e. we distinguish between meaningful and meaningless idiom parts. This leads to a third class of idioms, called partially compositional idioms, which consists of the idioms having both meaningful and meaningless components.

We describe the meaning of an idiom by means of one or more literal, non-idiomatic paraphrases, one of which is selected as the main paraphrase. The main paraphrase is supposed to reflect the idiom’s semantic type, i.e. it should take the identifiable parts of meaning into account. This means that for compositional idioms, the main paraphrase should have the same syntactic structure as the idiom, i.e. each meaningful idiom part should be assigned an appropriate paraphrase part with the same meaning.
An example of this is given in (1), where Bock corresponds to mistake, and schießen corresponds to make. This mapping between idiom parts and paraphrase parts is made explicit in a dictionary section called semantic structure. Furthermore, each internal or external valency is assigned a semantic role (agent, patient, etc.), where internal valencies that do not carry independent meaning are marked as having no role. We can proceed like this because meaningful parts of an idiom can be considered figurative arguments. Figure 1 shows the semantic structure for the compositional idiom in (1) and for the noncompositional idiom in (3).

    Lemma      Role      Paraphrase   Referent
    jmd.       Agent     sb.
    Bock       Patient   mistake

    Lemma      Role      Paraphrase   Referent
    jmd.       Agent     sb.
    Handtuch   No Role

Figure 1: Semantic structure of a compositional and a noncompositional idiom
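The three-way distinction between compositional, partially compositional and noncompositional idioms follows mechanically from the per-part decision about meaning. The sketch below illustrates this under assumptions: the names `IdiomPart` and `semantic_type` are hypothetical, a missing paraphrase part (`None`) stands for a meaningless idiom part, and encoding the verb of the Handtuch idiom from Figure 1 as meaningless is a simplification made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IdiomPart:
    lexeme: str
    paraphrase: Optional[str]  # None marks a meaningless idiom part


def semantic_type(parts):
    """Classify an idiom by how many of its parts carry their own meaning."""
    meaningful = [p for p in parts if p.paraphrase is not None]
    if len(meaningful) == len(parts):
        return "compositional"
    if not meaningful:
        return "noncompositional"
    return "partially compositional"


def role_of(part, assigned_role):
    """Internal valencies without independent meaning carry no semantic role."""
    return assigned_role if part.paraphrase is not None else "No Role"


# Idiom (1): Bock corresponds to mistake, schießen corresponds to make.
bock = [IdiomPart("Bock", "mistake"), IdiomPart("schießen", "make")]
# Handtuch idiom: no part is paired with a paraphrase part (assumed encoding).
handtuch = [IdiomPart("Handtuch", None), IdiomPart("werfen", None)]
```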

(3) (jmd.) das Handtuch werfen
    (sb.)  the towel    throw
    to give up

Modification of part of an idiom occurs frequently in idiom usage, and often takes the form of an adjective or genitive attribute modifying an idiomatic noun. We follow Ernst (1981) in distinguishing between external and internal modification. External modification takes place if an attribute modifies the meaning of the idiom as a unit, internal modification if it modifies only part of the idiom’s meaning. For internal modification to be possible, the modified part must have a separate meaning, i.e. this kind of modification does not take place in noncompositional idioms. Examples of internal modification of the Bock component of idiom (1) and of external modification of the noncompositional idiom (4) are given in sentences (5) and (6), respectively.

(4) (jmd.) etwas in den Griff bekommen
    (sb.)  s.th. in the grip  get
    to master s.th.

(5) Lisa hat einen großen Bock geschossen.
    Lisa has a     big    buck shot.
    Lisa has made a big mistake.

(6) Lisa bekommt das Thema in den politischen Griff.
    Lisa gets    the issue in the political   grip.
    Lisa masters the issue politically.

Although external modification is in principle possible for almost any verbal idiom, in practice some idioms are never modified. An example of this is idiom (3); in more than a hundred occurrences of this idiom in German newspapers, we did not find any modification of the Handtuch part. Because of this, in Phraseo-Lex the modifiable parts of an idiom are listed explicitly.

3. The Goal System

The example goal system we chose to convert our database into is a prototypical natural language processing system consisting of a chart parser, a syntactic component using a unification grammar formalism similar to PATR-II, and a separate semantic component which combines Discourse Representation Theory (Kamp & Reyle, 1993) with a version of λ-calculus. This allows discourse representation structures to be built bottom-up from incomplete DRSs for single words, an approach that was originally introduced by Pinkal. The system has been implemented in the programming language Scheme. For a more detailed description, see (Fischer, Geistert & Görz, 1996).

We represent idioms in a distributed manner, by assigning a separate representation to each of the elements the idiom consists of. To make sure that parts of an idiom appear in a sentence only in the context of the idiom as a whole, we need to assign to each of the verb’s dependents an attribute determining its verbal head, and to the idiomatic verb a list of its obligatory dependents. For V-NP idioms without a fixed determiner or an obligatory adnominal modifier, like the one in example (1), this is all that is necessary to ensure correct idiom processing; for others, like the one in example (3), the noun and its determiner must be treated the same way. The syntactic representation of the relevant parts of idiom (1), as it is constructed automatically by our mapping program, is given below in Figure 7.

    MAIN-CL → NP V NP
      (NP 1 agr number)  = (V agr number)
      (NP 1 agr person)  = (V agr person)
      (V obl agr case1)  = nom
      (NP 1 agr case)    = nom
      (V obl agr case2)  = (NP 3 agr case)
      (NP 3 head)        = (V sem)
      (NP 3 sem)         = (V val2)
      MAIN-CL            = V
      (MAIN-CL subj)     = NP 1
      (MAIN-CL obj)      = NP 3

    NP → DET N
      (DET agr)          = (N agr)
      (DET sem)          = (N det sem)
      (DET noun)         = (N sem)
      NP                 = N
      (NP det stem)      = (DET stem)

Figure 2: Grammar rules extended for idiom processing
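The two idiom-specific equations of the MAIN-CL rule in Figure 2, (NP 3 head) = (V sem) and (NP 3 sem) = (V val2), can be read as a licensing check between the verb and its object. The sketch below is a deliberate simplification and not the goal system's unifier: lexical entries are flat Python dictionaries, the function names are hypothetical, and an absent feature is assumed to be compatible with any value, as is usual in PATR-II-style unification. The feature values follow the entries for Bock and schießt shown in Figure 7.

```python
def compatible(left, right):
    """Two atomic feature values unify if either is absent or both are equal."""
    return left is None or right is None or left == right


def main_clause_accepts(verb, obj):
    """Check only the idiom-related equations of the MAIN-CL rule."""
    return (compatible(obj.get("head"), verb.get("sem"))
            and compatible(obj.get("sem"), verb.get("val2")))


# Idiomatic entries, with the values from Figure 7.
idiom_verb = {"stem": "schiessen", "sem": "schiessen vp13", "val2": "bock vp13"}
idiom_noun = {"stem": "bock", "sem": "bock vp13", "head": "schiessen vp13"}

# Literal entries carry neither a val2 nor a head feature.
literal_verb = {"stem": "schiessen", "sem": "schiessen"}
literal_noun = {"stem": "bock", "sem": "bock"}
```

With these entries, the idiomatic verb combines only with the idiomatic noun and vice versa, while the literal pair is unaffected because neither entry carries the additional features.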

To ensure that the relations encoded in the additional features are processed correctly, the grammar rules must be extended accordingly. No rules need to be added to the system; it is sufficient to modify the existing ones. When generating idiomatic entries from our database, this encoding must be done manually. Figure 2 shows two grammar rules of our system, including the extensions for idiom processing. The additional features mentioned above are therefore a val2 feature in the verbal entry, a head feature and, for idioms with a fixed determiner, a det sem feature in the noun’s entry, and a noun feature for the idiomatic determiner. In nonidiomatic language, the lexical entries do not contain these attributes, and unification using the rules given here always succeeds.

In λ-DRT, two different kinds of λ-abstraction can take place: it is possible to abstract over a discourse referent (predicative DRS) or over a complete DRS (partial DRS). The sentence DRS is built up by functional composition, an operation on a partial DRS as functor and a predicative DRS as argument. Figure 3 shows a DRS for a short sentence and the incomplete DRSs it is built of. The noun phrase is represented by a partial DRS, the verb by a predicative one. The discourse referent e represents the event described by the sentence.
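The way a sentence DRS arises from a partial and a predicative DRS can be imitated with ordinary functions. The sketch below is only an approximation of λ-DRT, written for a sentence such as Lisa sleeps: the `DRS` class and its `merge` method are hypothetical names, λ-abstraction is modelled by Python functions, and the composition step is reduced to plain function application followed by a merge of referents and conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DRS:
    referents: tuple = ()
    conditions: tuple = ()

    def merge(self, other):
        """Combine two DRSs by concatenating referents and conditions."""
        return DRS(self.referents + other.referents,
                   self.conditions + other.conditions)


def sleep(x):
    """Predicative DRS for the verb: abstraction over a discourse referent."""
    return DRS(("e",), (f"e: sleep({x})",))


def lisa(q):
    """Partial DRS for the noun phrase: still waits for the rest of the sentence."""
    return DRS(("x",), ("Lisa(x)",)).merge(q("x"))


# The noun phrase acts as functor, the verb as argument.
sentence = lisa(sleep)
```

The resulting structure contains the referents x and e and the conditions Lisa(x) and e: sleep(x); the order of the referents is an artefact of the tuple representation and carries no meaning.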

    Sentence DRS:              [ e, x | Lisa(x), e: sleep(x) ]
    Noun phrase (partial DRS): λQ. [ x | Lisa(x) ] + Q(x)
    Verb (predicative DRS):    λx. [ e | e: sleep(x) ]

Figure 3: DRSs to build the sentence Lisa sleeps.

The treatment of a given idiom in this semantic formalism depends on the idiom’s compositionality. As described above, we distinguish compositional idioms from noncompositional ones, or rather, semantically meaningful idiom parts from meaningless ones. Figure 4 shows the semantic representation of the sentences Lisa schießt einen Bock and Lisa wirft das Handtuch, respectively. In the following, we show how these DRSs are composed from the semantic representation of the single words they consist of.

    Lisa schießt einen Bock:   [ e, x, y | Lisa(x), mistake(y), e: make(x, y) ]
    Lisa wirft das Handtuch:   [ e, x | Lisa(x), e: give up(x) ]

Figure 4: DRSs for idiomatic sentences

A semantically meaningful idiom part is treated in much the same way as a nonidiomatic word, the main difference being that it is required to appear only in the context of the idiom as a whole. Idiom parts of this kind usually are syntactically variable with regard to the determiners they select as well as with regard to the adjunction of modifiers. Figure 5 shows the necessary DRSs for representing the meaning of idiom (1). The determiner in this idiom shows the usual behaviour known from nonidiomatic language, and therefore does not need to be represented explicitly as part of the idiom.

    Bock:      [ | mistake(x) ]
    schießen:  [ e | e: make(x, y) ]

Figure 5: DRSs for the idiom parts Bock and schießen (λ-prefixes not shown)

An idiom part that does not carry independent meaning, on the other hand, is assigned an empty semantic representation in our system, with all the idiom’s meaning concentrated in the verbal entry. Figure 6 shows the DRSs for idiom (3). Unlike the one in example (1), in this idiom the definite determiner is fixed, and it does not have the referential meaning usually carried by definite determiners. We model this fact by assigning it an empty semantic representation. This also means that we must provide empty semantic entries for each syntactic category, using the idiom parts’ syntactic entries and the grammar rules given above to prevent a semantically empty determiner, or preposition, from combining with a semantically meaningful noun. The empty semantic entries can be added in advance to the parsing system to prepare it for the automatic generation of the idiomatic lexicon, either manually or in a first automated step.

    das, Handtuch:  empty DRS, + Q(x) + R(x)
    werfen:         [ e | e: give up(x) ]

Figure 6: DRSs for the idiom parts das, Handtuch, and werfen (λ-prefixes not shown)

Please note that the verbal entry of the compositional example abstracts over a DRS that is to become one of its arguments, whereas in the noncompositional example, this DRS is to modify the verb’s event referent. In this way, we model the nonreferential status of the semantically empty constituent and at the same time allow for adnominal modification to take place, re-interpreting it as modifying the idiom as a whole, as suggested by Ernst (1981).

4. The Mapping Process

The mapping from the Phraseo-Lex dictionary to the lexicon introduced above has been implemented for the limited range of phenomena the goal system allows. We will restrict our description to idioms consisting of a simple verb and several noun phrases acting as the idiom’s internal and external valencies. Since our goal system contains separate syntactic and semantic lexical entries, two distinct mappings must take place. In the following, we will first describe the generation of syntactic idiom entries, and after that the construction of the DRSs belonging to them.

4.1. Generating Syntactic Entries

A nonidiomatic noun in our goal system is described syntactically by specifying its case, number, and gender attributes, as well as its stem, i.e. its nominative singular form, and a keyword pointing to its semantic entry. A verbal entry consists of attributes for number and person, a valency list determining each dependent’s case, its infinitive form, and also a semantics keyword. Our syntax lexicon is a full-form lexicon, i.e. each inflected form of a noun or verb is listed with an entry of its own.

In Phraseo-Lex, the lexical material concerning an idiomatic noun consists of the inflected form appearing in the idiom’s citation form, and its base lexeme form, i.e. nominative singular. Since we assume idioms in general to be flexible, we need to generate lexical entries not only for these, but for all forms differing in case and number the lexeme may appear in. This information, as well as the gender attribute, must be taken from the goal lexicon, since it is not part of the Phraseo-Lex dictionary entry. This means that the mapping process constructs idioms from ordinary words, thus reflecting the fact that idioms are built from the normal lexical material of a language. It does this by looking up the given word’s base lexeme form in the lexicon (from the point of view of the goal system, its stem), thus finding the whole range of inflected forms, and then manipulating those by changing their semantics keyword and adding a head and possibly a det sem attribute. The head attribute is taken from the verb’s semantics attribute.

In its current form, Phraseo-Lex does not contain information about the flexibility of the determiners. Therefore, in our implementation of the det sem attribute, we took a semantically meaningful idiom part as having a variable determiner, and a semantically empty part as having a fixed one. For nouns marked as not quantifiable in Phraseo-Lex, only the singular or the plural half of the syntactic entries is generated, as appropriate. Phraseo-Lex does not provide the information whether a given lexeme is in singular or plural form; again, the mapping algorithm must determine this by comparing its inflected form to all the forms available in the goal lexicon.

Unique components form an exception to this; they must be introduced into the lexicon by the mapping program. Since even unique components can be quite flexible in actual usage, it is necessary to add entries for different case forms to the lexicon. This poses a problem, since Phraseo-Lex does not provide an inflection pattern for unique components, nor information about their gender. We solve it by representing only the nominative, dative and accusative singular forms of the word, which are identical for most words, and by assigning them any gender that matches the determiner given in the idiom’s citation form. A more satisfying solution would require a morphological component; since an idiom’s unique component usually consists of regular morphemes, it is possible to conclude its gender and inflection pattern from these in a systematic way.

If a noun cannot be found in the goal lexicon, our current implementation displays a warning and adds the noun in the same way as it would add a unique component. An alternative approach would be to stop the mapping process and ask the user to enter the missing noun. Our decision to proceed otherwise is mainly due to the small size of our goal system, which would force us to do extensive preliminary work first if we had chosen the second option.

The valency list of the idiomatic verb, which is encoded in the Phraseo-Lex phrase structure tree, can be retrieved by a single Phraseo-Lex interface function. If the idiom has a variable subject, the verb is given in its infinitive form only, i.e. all inflected forms must be taken from the goal lexicon. Here, syntactic variation is too large to create the inflected forms without morphological knowledge; the verb’s entry is therefore required to exist when the mapping process starts. A deviation in valency, however, may be handled similarly to a unique component, namely by changing the verb’s valency features while keeping its inflectional pattern. For idioms with a fixed subject, the situation is different; here, only one inflected form is to be generated, and this form is part of the idiom’s citation form and can be retrieved by a Phraseo-Lex interface function. The Phraseo-Lex interface also contains functions to determine whether a given valency position is an internal or an external valency. According to this information, the val features are set up.

Figure 7 shows the syntactic entries for the nominative singular of Bock and the third person singular of schießen, as they are generated by our lexicon building program. The features val2 (for the second valency position) and head (for the verbal head) contain the information necessary to find the other relevant part for building the idiom, namely its semantics keyword. Every part of the idiom is marked with an extra ending, in our example vp13. This is due to the fact that the same words can occur in different idioms and should not be mixed up during parsing. The semantics keywords and the additional val2 and head features are the only changes to the literal words’ entries that have to be made in this example.

    bock: N
      (agr case)       = nom
      (agr number)     = singular
      (agr person)     = third
      (agr gender)     = mas
      head             = schiessen vp13
      stem             = bock
      sem              = bock vp13

    schiesst: V
      (agr number)     = singular
      (agr person)     = third
      (obl agr case1)  = nom
      (obl agr case2)  = acc
      (obl agr case3)  = no
      val2             = bock vp13
      stem             = schiessen
      sem              = schiessen vp13

Figure 7: Syntactic entries for the idiom parts Bock and schießt
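The noun side of this mapping (look up the stem in the full-form lexicon, copy every inflected form, change the semantics keyword, add a head attribute) can be sketched as follows. This is an illustration under assumptions and not the authors' lexicon building program, which was written for a Scheme-based system: the dictionary layout, the toy lexicon contents and the function name `idiomatic_noun_entries` are hypothetical, and the number of the citation form is passed in as a parameter instead of being determined by comparison with the goal lexicon.

```python
# Hypothetical full-form goal lexicon: one entry per inflected form.
GOAL_LEXICON = [
    {"form": "bock", "cat": "N", "case": "nom", "number": "singular",
     "gender": "mas", "stem": "bock", "sem": "bock"},
    {"form": "bock", "cat": "N", "case": "acc", "number": "singular",
     "gender": "mas", "stem": "bock", "sem": "bock"},
    {"form": "boecke", "cat": "N", "case": "nom", "number": "plural",
     "gender": "mas", "stem": "bock", "sem": "bock"},
]


def idiomatic_noun_entries(lexicon, stem, verb_sem, ending,
                           quantifiable=True, citation_number="singular"):
    """Derive idiomatic noun entries from the literal entries sharing a stem."""
    entries = []
    for entry in lexicon:
        if entry["cat"] != "N" or entry["stem"] != stem:
            continue
        # Nouns marked as not quantifiable keep only the number of the citation form.
        if not quantifiable and entry["number"] != citation_number:
            continue
        derived = dict(entry)                 # leave the literal entry untouched
        derived["sem"] = f"{stem} {ending}"   # new semantics keyword, e.g. "bock vp13"
        derived["head"] = verb_sem            # taken from the verb's semantics attribute
        entries.append(derived)
    return entries


bock_entries = idiomatic_noun_entries(GOAL_LEXICON, "bock", "schiessen vp13", "vp13")
```

Copying the literal entries instead of modifying them keeps the literal reading available to the parser alongside the idiomatic one.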

4.2. Generating Semantic Entries

For each of the lexemes given above, one semantic entry is generated, consisting of a discourse representation structure. For building the verb DRS, the verb valency is needed again. For each noun phrase in the verb’s list of dependents, a λ-abstraction is added to its λ-list. If the noun phrase is an internal valency with no semantic role, i.e. if it is not an argument, the abstraction is over a DRS that modifies the verb’s event referent, thus ensuring that an adnominal modifier is interpreted as modifying the event represented by the verb. Otherwise, an argument is added to the condition describing the verb’s meaning. The Phraseo-Lex interface contains a function that determines for a given internal valency whether it carries a semantic role or not.

    schießen:  [ e | e: ?(x, y) ]
    werfen:    [ e | e: ?(x) ]

Figure 8: DRSs in construction for the idiom parts schießen and werfen (λ-prefixes not shown)

At this stage, the verb DRSs for examples (1) and (3) have the form given in Figure 8. The actual semantic content of the verb, which is still missing here, is taken from the Phraseo-Lex semantic description, where each meaningful idiomatic element is assigned the paraphrase part corresponding to it.

Each of the verb’s internal valencies needs a DRS of its own. If the noun phrase is not an argument, it is assigned the appropriate empty DRSs, like the empty-det DRS and the empty-noun DRS given in Figure 6. In this case, there is nothing left to do except to make the noun’s syntax entry point to the semantics entry empty-noun. Empty DRSs for the different syntactic categories have been added to the semantics lexicon in advance. Otherwise, a noun DRS is constructed and filled with the appropriate semantic content, in the same way as it is done when building the verb DRS.

5. Conclusion

We have introduced a detailed description of German verbal idioms in the area of syntax and semantics, which has been implemented in a computational dictionary using a database system. The database can be used to automatically generate idiomatic lexicons for natural language processing. We have shown such a mapping process for a small example system using Discourse Representation Theory as the semantic formalism. In the future, we plan to test the generation of lexicons with several different linguistic formalisms, including HPSG. If necessary, we will adapt or complete the attributes the dictionary contains and thus improve the linguistic functions in the process, leading to a complete, general, theory-independent interface.

6. References

Abeillé, Anne (1995). The Flexibility of French Idioms: A Representation with Lexicalized Tree Adjoining Grammar. In M. Everaert, E.J. van der Linden, A. Schenk, & R. Schreuder (eds.), Idioms: Structural and Psychological Perspectives (pp. 15–42). Hillsdale, NJ: Lawrence Erlbaum Associates.

Čermák, František (1998). Substance of Idioms: Perennial problems, lack of data or theory? In Proceedings of the 3rd International Symposium on Phraseology (pp. 32–38). Stuttgart.

Dobrovol’skij, Dmitrij (1995). Kognitive Aspekte der Idiom-Semantik. Tübingen: Gunter Narr Verlag.

Ernst, Thomas (1981). Grist for the Linguistic Mill: Idioms and ‘Extra’ Adjectives. Journal of Linguistic Research 1, pp. 51–68.

Fischer, Ingrid; Geistert, Bernd; Görz, Günther (1996). Chart-based Incremental Semantics Construction with Anaphora Resolution Using λ-DRT. In Botley, S.; Glass, J. (eds.), Proceedings of the Discourse Anaphora and Anaphor Resolution Colloquium (pp. 235–244). Lancaster, UK.

Fischer, Ingrid; Keil, Martina (1996). Parsing Decomposable Idioms. In Proceedings of Coling (pp. 388–393). Copenhagen.

Kamp, Hans; Reyle, Uwe (1993). From Discourse to Logic. Dordrecht, Netherlands: Kluwer Academic Press.

Keil, Martina (1997). Wort für Wort - Repräsentation und Verarbeitung verbaler Phraseologismen (Phraseo-Lex). Tübingen: Max Niemeyer Verlag.

Schenk, André (1995). The Syntactic Behavior of Idioms. In M. Everaert, E.J. van der Linden, A. Schenk, & R. Schreuder (eds.), Idioms: Structural and Psychological Perspectives (pp. 253–271). Hillsdale, NJ: Lawrence Erlbaum Associates.

Wasow, T.; Sag, I.; Nunberg, G. (1983). Idioms: An Interim Report. In Proceedings of the XIIth International Congress of Linguists (pp. 102–115). Tokyo.