LEARNING SUBCATEGORIZATION FRAMES FROM CORPORA: A CASE STUDY FOR MODERN GREEK

Manolis Maragoudakis, Katia Lida Kermanidis and George Kokkinakis
Wire Communications Laboratory, University of Patras, 26500 Rio, Greece
{mmarag, kerman, gkokkin}@wcl.ee.upatras.gr

ABSTRACT

Certain Natural Language Processing (NLP) applications, such as parsing and semantic processing, require complete lexicons that provide subcategorization information for a word of interest, i.e. the necessary information about the set(s) of syntactic constituents the word must combine with in order for its meaning to be fully expressed. Modern Greek exhibits high flexibility in the allowable orderings of its syntactic phrases, as well as a rich variety of syntactic constructions that may function as arguments to verbs. In this paper, we describe a set of machine learning techniques used to automatically extract the subcategorization frames of verbs from corpora.

1. INTRODUCTION

Verb subcategorization is an important issue, especially for parsing and grammar development, as it provides the parser with syntactic and/or semantic information about a verb's arguments, i.e. the set of restrictions the verb imposes on its arguments. Machine-readable dictionaries are not available for some languages, and dictionaries listing subcategorization frames (SFs) usually give only the expected frames rather than the actual ones and are therefore incomplete. Acquiring frames automatically from corpora overcomes these problems altogether.

Previous work on learning subcategorization frames focuses mainly on English ([2], [9], [3], [7]). [1] deals with Italian, [5] and [6] with German, [8] with Japanese, and [13] with Czech. In most of the above approaches the input corpus is fully parsed, and many learn only a small number of frames ([2], [9]).

In this paper we present a method for acquiring verb subcategorization frames for Modern Greek automatically from chunked corpora. We use the statistical metrics of [13] to discover frames. As a wide-coverage syntactic parser for Modern Greek is not available, the input corpus is preprocessed by POS and case tagging ([11], [12]), as well as by a robust chunker that detects intrasentential phrases based on minimal resources ([10]). The frames are not known beforehand but are learned automatically. We first describe some properties of the Modern Greek language that are relevant to the task of SF acquisition and need to be taken into consideration.

2. PROPERTIES OF MODERN GREEK

Modern Greek (MG) is a "free word-order" language. The arguments of a verb do not have fixed positions; they are basically determined by their morphology and especially by their case. Noun phrases (NPs), prepositional phrases (PPs), adverbs and secondary clauses (SCs) may function as arguments to verbs. Example (1) shows that a single verb (πιστεύω - believe) can take as arguments all of the above syntactic constituents.

(1)

a. Πιστεύω την Ελένη
   believe[1sg] Helen[NPacc]
   'I believe Helen.'

b. Πιστεύω πως θα έρθει
   believe[1sg] that come[3sgFut]
   'I believe that he will come.'

c. Πιστεύω στο Θεό
   believe[1sg] in[PREP] God[acc]
   'I believe in God.'

d. Έτσι πιστεύω
   so[ADV] believe[1sg]
   'I believe so.'

Verbs select for particular prepositions, particular types of secondary clauses, and particular cases (accusative or genitive) of their NP complements. We have used the following set of labels to tag the possible verb arguments. Numbers next to NPs indicate cases; they may be followed by certain letters in the case of pronouns, denoting the type of the pronoun. Numbers next to PPs indicate the preposition, and those next to adverbial phrases indicate the type of the adverb (temporal, relative etc.). SCs are tagged by the conjunction, adverb or pronoun that introduces them.

NPs: N1, N2, N3
Pronouns: N1-A, N1-P, N3-P, N1-D, N2-D, …
PPs: P1 (ανά), P2 (από), P3 (για), P4 (δια), …
SCs: να, αν, που, πως, …
ADVBs: ADVP1, ADVP2, ADVP3

The verb itself is tagged as being active or passive. Three more tags have been included, referring to whether a verb phrase may contain a weak personal pronoun in the genitive or accusative case. These verb types will be referred to simply as verbs henceforth.

3. TASK DESCRIPTION

In this section we present a detailed description of our method. Furthermore, we describe the process of obtaining and preparing the input data, as well as the output produced by our algorithms. The main goal of this work is to address subcategorization using as limited resources as possible for the preprocessing of the corpus, while at the same time being able to learn a large number of frames for a large number of verbs. We consider as an argument of a verb every syntactic constituent that co-occurs frequently with the verb.

The corpus was selected from the Greek newspaper Το Βήμα (118K words, about 5.000 sentences, general-purpose articles). Preprocessing includes basic grammatical tagging (POS, case for NPs and voice for verbs) ([11]), phrase chunking ([10]) (detecting noun phrases, verb phrases, prepositional phrases, adverbial phrases and the conjunctions that join the phrases together), and detection of the headword of each noun phrase, since its case is considered to be the case of the entire phrase. The chunker works with only a 450-keyword lexicon, containing closed-class words such as articles, prepositions etc., and a 300-suffix lexicon containing the most common suffixes of Modern Greek words.

The annotated corpus is then used to identify the verb and its surrounding phrases in each sentence. To make the following discussion easier to follow, we refer to the set of a verb and its neighboring phrases as an observed vector (OV). The phrases near the verb are referred to as verb neighbors (VNs).

After a number of experiments¹, we found that a satisfactory observed vector (OV) should contain up to 2 phrases (VNs) before the verb and up to 3 phrases after it. This constitutes an improvement over related work in which an OV may contain all the daughters of a verb ([13]), even if they are quite distant from it. If a phrase exceeds the (-2, +3) boundaries, it is very unlikely to be a correct argument of the verb.

Our primary intention is to examine the extracted OVs and determine the correct subcategorization frames, which in most cases are actually a subset of the OVs. However, we need to face a problem that appears often: in real sentences it is very common for a subcategorization frame to be accompanied by one or more adjuncts. Thus, the extracted OV almost certainly contains noise. The approach of Brent and others, which counts all of the OVs and decides which of them are strongly associated with a given verb, does not seem to work properly in our situation: in MG it is fairly rare for a correct frame to be observed in isolation in a sentence, particularly in a long one. That is why we decided to consider not only the possible subsets of the produced OVs, but (due to the syntactic flexibility of phrases in MG described in Section 2) also all the permutations of the constituents forming the possible subsets. We used a naive algorithm that records the frequency of each subset of an OV. Large, infrequent subsets are rarely correct frames, so they are apt to be discarded. Moreover, small but infrequent ones are probably missing some arguments and are therefore rejected as well.
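The (-2, +3) windowing can be sketched as follows. This is a minimal illustration, not the authors' actual extraction module; the chunk representation as (label, headword) pairs is our own assumption:

```python
def extract_ovs(chunks, before=2, after=3):
    """Collect observed vectors (OVs): for every verb phrase in a chunked
    sentence, take up to `before` phrases preceding it and up to `after`
    phrases following it as its verb neighbors (VNs)."""
    ovs = []
    for i, (label, head) in enumerate(chunks):
        if label == "VP":
            left = chunks[max(0, i - before):i]
            right = chunks[i + 1:i + 1 + after]
            # Keep only the phrase labels; other verb phrases are skipped.
            vns = [lab for lab, _ in left + right if lab != "VP"]
            ovs.append((head, vns))
    return ovs

# Toy chunked sentence as (phrase label, headword) pairs
sentence = [("N1", "Ελένη"), ("VP", "πιστεύω"), ("N3", "ιστορία"), ("P5", "σε")]
print(extract_ovs(sentence))  # → [('πιστεύω', ['N1', 'N3', 'P5'])]
```

The `before`/`after` limits correspond to the parameterization mentioned in the footnote; phrases outside the window are simply never recorded as candidate arguments.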

N1 N3 P5 (1)
N1 N3 (1)    N1 P5 (1)    N3 P5 (1)
N1 (1+1)     N3 (1+1)     P5 (1+1)

Fig. 1: Computing the subsets of the frame N1 N3 P5. The count of each subset is given in parentheses.
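The naive subset-counting step illustrated in Fig. 1 can be sketched as follows; representing each subset as a sorted tuple makes all permutations of the same constituents count as one candidate frame. This is our own minimal illustration, not the paper's implementation:

```python
from collections import Counter
from itertools import combinations

def count_subsets(ovs):
    """Count every non-empty subset of each OV's phrase list. Order is
    ignored: a sorted tuple stands for all permutations of the same
    constituents, reflecting the free phrase order of Modern Greek."""
    counts = Counter()
    for phrases in ovs:
        for r in range(1, len(phrases) + 1):
            for subset in combinations(sorted(phrases), r):
                counts[subset] += 1
    return counts

# Two observed vectors; their shared subsets accumulate counts, as in Fig. 1
counts = count_subsets([["N1", "N3", "P5"], ["N3", "P5"]])
print(counts[("N3", "P5")])  # → 2
```

Large, low-count subsets and small, low-count subsets would then be filtered out against frequency thresholds, as described above.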

After refining the OVs, we have completed the process of defining the input data. In order to determine the correct subcategorization frames, it is necessary to assign a score to each candidate vector (frame). This metric allows us to classify a hypothesized set of candidate dependents as arguments or adjuncts. Note that we have not previously labeled the input data. We calculate the scores using the following well-known statistical methods.

¹ We have developed a module for extracting the OVs, which is easily parameterized in terms of the number of phrases to be taken into account before and after the verb.

3.1 Log Likelihood Statistic

Making the hypothesis that the distribution of an OV f in the data is independent of the distribution of a verb v, we can use the log likelihood statistic [13] to detect frames highly associated with verbs. This hypothesis is expressed as p(f) = p(f|v) = p(f|!v), meaning that the distribution of f given v is the same as the distribution of f given that v is not present (!v). We use the following counts:

k1 = c(f, v), the count of a given frame f with a given verb v
cv = c(f, v) + c(!f, v), the count of a given verb v
k2 = c(f, !v), the count of frame f with every verb other than v
cnv = c(f, !v) + c(!f, !v), the count of every verb other than v

Using the above values:

p1 = k1/cv
p2 = k2/cnv
p = (k1 + k2)/(cv + cnv)

The log likelihood statistic is then given by:

-2logλ = 2[logL(p1, k1, cv) + logL(p2, k2, cnv) − logL(p, k1, cv) − logL(p, k2, cnv)]

where logL(p, k, n) = k·log(p) + (n − k)·log(1 − p).

3.2 T-score

The T-score statistic is computed by the following equation, using the definitions of the previous section:

T = (p1 − p2) / √(σ²(cv, p1) + σ²(cnv, p2))

where σ²(n, p) = n·p·(1 − p).

Based on the above metrics, we observed that if a candidate frame scores a high log likelihood and a relatively low T-score (according to thresholds that arose during the training process), it should be considered a valid argument.

4. RESULTS

Our method was tested on a population of approximately 87.000 candidate frames (including the subsets), generated from about 8.000 observed vectors (see Table 1). The number of verb types that occurred in the corpus more than 5 times is 864. To evaluate our results, we used a 600-sentence test corpus in which the correct frames were hand-labeled. Recall is the number of correctly suggested frames divided by the number of correct frames in the test corpus; precision is the number of correctly suggested frames divided by the number of all frames suggested as correct. Note that certain factors influenced our outcome negatively: errors made by the chunker, along with the high frequency of coordinate conjunctive phrases in the corpus², reduced the number of frames correctly marked as arguments, thereby decreasing precision and recall.

² The chunker characterizes the coordinate conjunctions as separate phrases.

Corpus Size: 967KB, 118K words, 5.020 sentences
Observed Vectors: 8.048
Observed Vectors (with subsets): 87.412
Precision: 73%
Recall: 82%

Table 1: Statistical analysis of our approach.

5. CONCLUSION

A method for automatically extracting subcategorization frames for a plethora of verbs in MG has been presented. The results are comparable to those of other related work, while using limited resources for the linguistic preprocessing. We estimate that with an error-free chunker and with the problem of the conjunction phrases eliminated, we could achieve an accuracy of more than 75%. We are currently experimenting with the thresholds of the log likelihood and the T-score in order to achieve better accuracy.

We are also integrating the SF information produced by our technique into a shallow syntactic parser for Greek. Finally, we are considering using the correctly tagged frames as the training corpus of a memory-based learner, which will automatically classify new candidate frames without having to take all the previously mentioned metrics (see Sections 3.1 and 3.2) into account.
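To make the scoring of Sections 3.1 and 3.2 concrete, the two metrics can be sketched in code. The counts in the example are invented and the function names are ours, not part of the described system:

```python
import math

def logL(p, k, n):
    """logL(p, k, n) = k*log(p) + (n - k)*log(1 - p); assumes 0 < p < 1."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def log_likelihood(k1, cv, k2, cnv):
    """-2*log(lambda) of Section 3.1: large values mean the frame f and the
    verb v are unlikely to be distributed independently."""
    p1, p2 = k1 / cv, k2 / cnv
    p = (k1 + k2) / (cv + cnv)
    return 2 * (logL(p1, k1, cv) + logL(p2, k2, cnv)
                - logL(p, k1, cv) - logL(p, k2, cnv))

def t_score(k1, cv, k2, cnv):
    """T-score of Section 3.2, using the variance s2(n, p) = n*p*(1 - p)."""
    p1, p2 = k1 / cv, k2 / cnv
    variance = cv * p1 * (1 - p1) + cnv * p2 * (1 - p2)
    return (p1 - p2) / math.sqrt(variance)

# Invented counts: frame seen 30 times in 100 occurrences of verb v,
# and 50 times in 1000 occurrences of all other verbs.
print(round(log_likelihood(30, 100, 50, 1000), 1))  # → 54.2
print(round(t_score(30, 100, 50, 1000), 4))         # → 0.0302
```

Under the decision rule described in Section 3.2, a candidate frame is accepted when the first score is high and the second relatively low, against thresholds tuned during training.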

6. REFERENCES

[1] Basili, R., M. T. Pazienza and M. Vindigni (1997), Corpus-driven Unsupervised Learning of Verb Subcategorization Frames. Proceedings of the Conference of the Italian Association for Artificial Intelligence, AI*IA 97, Rome.
[2] Brent, M. (1993), From Grammar to Lexicon: Unsupervised Learning of Lexical Syntax. Computational Linguistics, vol. 19, no. 3, pp. 243-262.
[3] Briscoe, T. and J. Carroll (1997), Automatic Extraction of Subcategorization from Corpora. Proceedings of the 5th ANLP Conference, pp. 356-363, ACL, Washington D.C.
[4] Carroll, J. and G. Minnen (1998), Can Subcategorization Probabilities Help a Statistical Parser? Proceedings of the 6th ACL/SIGDAT Workshop on Very Large Corpora, Montreal, Canada.
[5] De Lima, F. (1997), Acquiring German Prepositional Subcategorization Frames from Corpora. Proceedings of the 5th Workshop on Very Large Corpora (WVLC-5).
[6] Eckle, J. and U. Heid (1996), Extracting Raw Material for a German Subcategorization Lexicon from Newspaper Text. Proceedings of the 4th International Conference on Computational Lexicography, COMPLEX'96, Budapest, Hungary.
[7] Gahl, S. (1998), Automatic Extraction of Subcorpora Based on Subcategorization Frames from a Part-of-Speech Tagged Corpus. Proceedings of COLING-ACL 1998, pp. 428-432.
[8] Kawahara, D., N. Kaji and S. Kurohashi (2000), Japanese Case Structure Analysis by Unsupervised Construction of a Case Frame Dictionary. Proceedings of COLING 2000.
[9] Manning, C. (1993), Automatic Acquisition of a Large Subcategorization Dictionary from Corpora. Proceedings of the 31st Meeting of the ACL, pp. 235-242.
[10] Stamatatos, E., N. Fakotakis and G. Kokkinakis (2000), A Practical Chunker for Unrestricted Text. Proceedings of the 2nd International Conference on Natural Language Processing (NLP2000), pp. 139-150.
[11] Sgarbas, K., N. Fakotakis and G. Kokkinakis (1995), A PC-KIMMO-Based Morphological Description of Modern Greek. Literary and Linguistic Computing, Vol. 10, No. 3, Oxford University Press, pp. 189-201.
[12] Sgarbas, K., N. Fakotakis and G. Kokkinakis (1999), A Morphological Description of Modern Greek Using the Two-Level Model (in Greek). Proceedings of the 19th Annual Workshop, Division of Linguistics, University of Thessaloniki, Greece, April 23-25, 1999, pp. 419-433.
[13] Zeman, D. and A. Sarkar (2000), Learning Verb Subcategorization from Corpora: Counting Frame Subsets. Proceedings of LREC 2000, pp. 227-233.