
F. Kiefer, G. Kiss and J. Pajzs (eds.), Papers in Computational Lexicography COMPLEX’99, pp. 219-228

TEI-Encoding of a Core Explanatory Dictionary of Romanian

Dan Tufiş1,2, Georgiana Rotariu2, Ana-Maria Barbu2
1 Romanian Academy (RACAI), 2 ICI Bucharest
[email protected], geo/[email protected]

ABSTRACT

The development of large Lexical DataBases (LDBs) is only just emerging in most of the CE countries. CONCEDE is an EU project aiming to harmonise the methodologies, tools and (to a lesser extent) resources for building LDBs for six CE languages: Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene. The paper addresses the specific problems of TEI encoding of the Romanian Explanatory Dictionary. The Romanian Explanatory Dictionary (DEX, second edition, 1996) is the reference dictionary of Romanian; it was developed by the Institute for Linguistics of the Romanian Academy and published by the Univers Enciclopedic Publishing House. DEX is meant for a wide public, so its lexicographic content is relatively rich: the head-words belong mostly to what is generally called the basic vocabulary of a language, including regional variants, but the dictionary also contains various technical or specialised terms as well as neologisms and recent lexical imports.

Proceedings of COMPLEX’99 International Conference on Computational Lexicography, Pecs, June 16-19, 1999

1. Introduction

The development of large Lexical DataBases (LDBs) is only just emerging in most of the CE countries. CONCEDE is an EU project aiming to harmonise the methodologies, tools and (to a lesser extent) resources for building LDBs for six CE languages: Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene. The project adopted an incremental approach, so it was important to have a generic sampling method for deciding, at each step, which head-words to include in the lexical database. This procedure is described in the paper that gives an overall presentation of the CONCEDE project. Here we address the specific problems of TEI encoding of the Romanian Explanatory Dictionary.

2. The Romanian Explanatory Dictionary

The Romanian Explanatory Dictionary (DEX, 1996) is the reference dictionary of Romanian. It was developed by the Institute for Linguistics of the Romanian Academy and published by the Univers Enciclopedic Publishing House. DEX is meant for a wide public, so its lexicographic content is relatively rich: the head-words belong mostly to what is generally called the basic vocabulary of a language, including regional variants, but the dictionary also contains various technical or specialised terms as well as neologisms and recent lexical imports. It contains about 65,000 entries, each carrying plenty of information, some of it explicit and some implicit in the layout conventions. The information categories are the following: head-word, accentuation, inflected forms, accent shift (where applicable), pronunciation, grammatical information on the head-word and inflected forms, sense definitions, references to other head-words or other sense definitions, phrasal constructions, usage information, lexical relations (synonyms, hypernyms, hyponyms and sometimes antonyms), head-word variants, etymology. In most cases, examples and synonymy series accompany sense definitions. Where applicable, and distinctly marked, the dictionary entries contain expressions and locutions headed by the head-word (or one of its inflected forms).

Because the copyright for the electronic version of the dictionary was not held by the Romanian Academy and the copyright holders did not agree to provide the raw texts, we had to keyboard the information from the printed dictionary. In doing so, all the information implicit in the layout (see section 3.3) was made explicit by means of specific mark-up.

Because the CONCEDE project did not give us enough resources to keyboard all the entries, and because we wanted our work to be as useful as possible for various applications, we decided to type in selectively the most frequent words found in our corpora (more than 10,000,000 words). We extracted a list of about 15,000 candidate lemmas existing in DEX, of which more than 10,000 entries have already been keyboarded. According to preliminary tests on the annotated part of our corpora, about 1,000,000 words annotated in conformance with the CES DTD (Ide, 1998; Dimitrova et al., 1998; Ide and Véronis, 1995; Ide and Véronis, 1994), these 10,000 lemmas (and their inflected forms) will cover at least 90% of new texts. Encoding these lemmas into an LDB would therefore create a useful lexical resource for most NLP applications.
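The coverage figure quoted above can be illustrated with a minimal sketch. It assumes the corpus is already lemmatised (one lemma per running word) and measures the share of tokens whose lemma is in the selected list. The function name `lemma_coverage` and the toy data are hypothetical and are not taken from the CONCEDE tools.

```python
from collections import Counter


def lemma_coverage(corpus_lemmas, dictionary_lemmas):
    """Return the fraction of corpus tokens whose lemma is in the dictionary."""
    counts = Counter(corpus_lemmas)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for lemma, n in counts.items() if lemma in dictionary_lemmas)
    return covered / total


# Toy data (hypothetical): ten running words reduced to their lemmas.
corpus = ["an", "fi", "al", "fi", "carte", "an", "xyz", "fi", "al", "fi"]
selected = {"an", "fi", "al", "carte"}

coverage = lemma_coverage(corpus, selected)
print(f"{coverage:.0%}")
```

Counting tokens rather than distinct lemmas is the design choice that matters here: a short list of very frequent lemmas can cover most running text even though it is a small part of the dictionary.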


3. Extraction and conversion of data from the printed dictionary

3.1. The extraction process

Professional typists did the keyboarding; they were instructed to follow the layout of the printed dictionary exactly, except under some well-specified conditions (see below). We developed a program that is aware of all the conventions in the printed form of DEX (described in sections 3.2 and 3.3) as well as of the TEI.dictionary. This program is rather sensitive to a pre-specified order of the information types but is quite permissive in defining the lexicographic conventions. Note that the current version of the program cannot achieve a fully automated conversion: the average is about 80%, and although we are confident that this figure could be improved, we doubt that a fully error-free automated procedure is possible. There are two reasons. First, some entries, for one reason or another, do not conform to the general structuring and conventions mentioned above (and discussed next). Second, and more seriously, certain encoding decisions do not depend on syntactic criteria but require interpretation and human judgement. For the vast majority of the selected entries the conventions were strictly observed, which significantly simplified the automatic conversion from the MS Word format (the Word files were actually exported as HTML files, and the conversion started from that format) to the target SGML encoding.

3.2. The ordering of various information types in the printed dictionary

Information associated with a head-word in DEX observes the structuring and typographical conventions shown in Figure 1 and explained below.

AL, A, ai, ale, art. 1. (Articol posesiv sau genitival, înaintea pronumelui posesiv sau a substantivului în genitiv posesiv, când cuvântul care posedă nu are articol enclitic) Carte a elevului. 2. (Înaintea numeralelor ordinale, începând cu „al doilea”) Cartea a zecea. – Lat. illum, illam.

Labels marked on the entry: L, F, G, S, E, U, Ex

L - lemma (head-word); F - inflected forms; G - grammatical information; H - homographs; S - sense; Ss - secondary sense; D - definition; V - variants; E - etymology; Ex - examples; U - phrasal unit; I - usage information

Figure 1: The layout of an entry in DEX

The order of the different classes of information is (more often than not) as follows:
1. The lemma form of the head-word
2. Inflected forms, where applicable; when the inflected forms have specific grammatical information, it is specified together with the corresponding inflected form. If an inflected form is associated with a specific sense of the head-word, it is accompanied by a reference to that sense
3. Global grammatical information
4. The explanation(s) of the head-word; these are either direct definitions (grouped by sense) or indirect definitions (specified as references to other head-words or, in the case of function words, by explaining their usage)
5. Information on pronunciation, variants and irregular inflectional paradigms
6. Etymological information

3.3. Lexicographic conventions

Thanks to the systematic use of some prescribed lexicographic conventions (some explicitly documented in the preface of the printed dictionary, others used implicitly but consistently), the conversion from HTML format to the target encoding was substantially simplified. This section briefly reviews these conventions.
- The head-words are always written in uppercase characters. If they belong to a homonymy class, the head-words (or the words appearing in the definitions of some head-words) are differentiated by numeric superscripts. The stressed vowel is always represented in the printed dictionary by an accented letter (á, é, ó, etc.). In the keyboarded format of the printed dictionary, the accented vowels in the head-words were represented as a quote followed by the corresponding vowel ('a, 'e, 'o, etc.). Distant senses are marked by uppercase letters (A, B, etc.) or Roman numerals (I, II, etc.). Explicitly related senses are numbered with lowercase letters (a, b, etc.) or Arabic numerals (1, 2, etc.). Senses that depend on a main sense are marked by a black diamond (♦). Phrasal units (locutions, expressions, compounds, etc.) subordinated to a main sense, as well as some specialised senses that do not require a new definition, are signalled by a white diamond (◊).
- The equal sign (=) signals the definition of a phrasal unit. Double quotes surround a collocation when they appear inside a sense definition, or a gloss when they appear in the etymological section.
- Square brackets contain pronunciation information, lexical variants and specific irregular inflected forms. Each type of information in square brackets is identified by a specialised label (Pr: for pronunciation, Var: for variants, or a grammatical label, such as Prez.ind., for irregular inflected forms). When more than one type of bracketed information is provided, the types are separated by dashes.
- Usage information is provided in parentheses.
- The symbols < and + are used in providing further information on the etymology of the head-word. The etymological information always appears as the last field of the entry and is systematically introduced by a dash.
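Two of the conventions above lend themselves to a short illustration: the keyboarded stress notation and the labelled square-bracket field. The sketch below is not the authors' conversion program. The function names, the label table and the sample bracket content are hypothetical, and the label spellings (Pr, Var) follow the description above.

```python
import re
import unicodedata

# Labels named in the conventions above; any other label (e.g. "Prez.ind.")
# is treated as a grammatical label introducing an irregular inflected form.
BRACKET_LABELS = {"Pr": "pronunciation", "Var": "variant"}


def restore_stress(keyed):
    """Turn the keyboarded stress notation ('a) into an accented vowel (á)."""
    # Put U+0301 COMBINING ACUTE ACCENT after the vowel, then compose (NFC).
    marked = re.sub(
        r"'([AEIOUaeiouĂÂÎăâî])",
        lambda m: m.group(1) + "\u0301",
        keyed,
    )
    return unicodedata.normalize("NFC", marked)


def parse_bracket(field):
    """Split '[Label: text – Label: text]' into (kind, label, text) triples."""
    inner = field.strip().strip("[]")
    items = []
    # Information types are separated by dashes surrounded by spaces.
    for part in re.split(r"\s+[–-]\s+", inner):
        label, _, text = part.partition(":")
        label = label.strip()
        kind = BRACKET_LABELS.get(label.rstrip("."), "irregular form")
        items.append((kind, label, text.strip()))
    return items


print(restore_stress("CENTR'AL"))
# Hypothetical bracket content, used only to exercise the three label types.
for item in parse_bracket("[Pr: pro-nun – Var: varianta – Prez.ind.: forma]"):
    print(item)
```

Splitting only on dashes that have spaces on both sides keeps hyphenated syllabifications inside a pronunciation field intact.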

4. TEI-encoding of the dictionary

The first experimental step towards encoding a sample of DEX entries tried to preserve both possible views, the editorial and the lexical (for a detailed discussion of the possible views of a paper dictionary and of methods to reconcile them, see Ide and Véronis, 1995). Given the variety of information in DEX (and also some inconsistencies in the printed dictionary), ensuring TEI conformance while preserving both the editorial and the lexical views proved practically impossible. We therefore reformulated our goal: to ensure as much TEI conformance as possible and to fully preserve the lexical information in the printed dictionary. Once we finally decided to adopt the lexical encoding schema, we committed ourselves to a prescribed order of the elements inside the element as shown in Figure 2.

................. ........... (.............. | ...........)+ (................)*

Figure 2: The structure of an encoded entry

In most cases the printed dictionary observes the ordering and structure in Figure 2 (see Figure 1); whenever it did not, the necessary positional changes were made to comply with the prescribed structure. Additionally, we removed the special graphical characters that were either irrelevant for our purposes or had become redundant because of the explicit SGML mark-up. The phrasal units and definitions were expanded where applicable, which improved both clarity and ease of exploitation (as shown by the experiments we made using SGML-QL (Véronis, 1997)). The preliminary experiments with storing and exploiting our TEI dictionary in a standard database (ORACLE) show that the automatic conversion is much more feasible.

This initial encoding proved to be a challenge for various reasons:

• The entries do not always have a homogeneous structure. To overcome the observed inconsistencies we initially used the entryFree element, whose consistency restrictions are very loose and which therefore easily allows a straightforward intermediary encoding of all the lexical information provided in the printed dictionary. The displaced information (with respect to our ordering) was easily spotted and moved to the appropriate position.
• The lexical information is frequently specified implicitly. For instance, consider the lexical entry below: CENTR'AL, -Ă, centrali, -e, adj., s.f. .... The grammatical information s.f. (feminine noun) refers not to the lemma but to the word form obtained by adding the suffix -Ă, which is probably not easy to infer for a non-native speaker of Romanian. The proper expansion associates CENTR'AL, centrali with the implicit information "adj.m." (masculine adjective) and CENTR'ALĂ, centrale with "adj.f., s.f." (feminine adjective or feminine noun).
• A large part of the information in the printed dictionary is given in a format meant to save space, relying on the human reader's ability to expand the compressed form. Space was saved in basically two ways. The first, which poses no problems for machine processing, is the use of abbreviations throughout the dictionary. The second, which is troublesome, is the printing of phrases (definitions, examples, collocations, etc.) in a shortened form. Such a phrase consists of fixed and variable components, but expanding it into the meaningful combinations relies, as said before, on the human reader of the dictionary. A simple case of a shortened expression is the following: "în (sau din) două vorbe (sau cuvinte)". It stands for the following four variants: "în două vorbe", "din două vorbe", "în două cuvinte", "din două cuvinte".

For the sake of consistency (and conformance with the TEI-encoding lexical view recommendations) we finally decided to expand both the abbreviations and the shortened phrases.
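The expansion of a shortened phrase can be sketched as a Cartesian product over its variable slots. The sketch below handles only the simple pattern shown above, a single word followed by "(sau alternative)". It is an illustration under that assumption, not the authors' procedure, and the names `SLOT` and `expand_phrase` are hypothetical. It also generates every combination mechanically, whereas choosing the meaningful ones may still need human judgement.

```python
import itertools
import re

# A word followed by "(sau alternative)" forms one variable slot.
SLOT = re.compile(r"(\S+)\s+\(sau\s+([^)]+)\)")


def expand_phrase(shortened):
    """Expand 'x (sau y)' alternations into all full variants."""
    parts = []  # each item is the list of alternatives for one position
    pos = 0
    for match in SLOT.finditer(shortened):
        fixed = shortened[pos:match.start()].strip()
        if fixed:
            parts.append([fixed])  # fixed component: a single alternative
        parts.append([match.group(1), match.group(2).strip()])
        pos = match.end()
    tail = shortened[pos:].strip()
    if tail:
        parts.append([tail])
    return [" ".join(combo) for combo in itertools.product(*parts)]


variants = expand_phrase("în (sau din) două vorbe (sau cuvinte)")
for variant in variants:
    print(variant)
```

A phrase with no slot is returned unchanged as a one-element list, so the function can be applied uniformly to every phrase in an entry.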


The intermediary encoding mentioned above paved the way for migrating from entryFree to the more constrained entry element. However, the basic structure of the entry element of teidict2.dtd had to be slightly modified to cover all the lexical information found in DEX (e.g. the content model of the gramGrp element, lbl domain, etc.). The figures below describe the main extensions we made to the basic DTD in order to accommodate all the information provided in the printed dictionary (the extension files below are appropriately referred to in tei2.dtd).

Figure 3: concede.ent

Figure 4: concede.dtd

5. Adapting teidict2.dtd for DEX

Up to this phase of the project we have avoided any lexical loss in the SGML encoding as compared to the (implicit or explicit) information provided by DEX. To that end, we slightly modified teidict2.dtd. As said before, the initial encoding used entryFree markup, but followed the ordering in Figure 2. We encoded about 300 entries out of the 500 selected as described in section 2.


The modification of the markup for was done by using XEmacs macros and appropriate specifications (see Figures 3 and 4), in the extension files TEI.extensions.ent and TEI.extensions.dtd, of the elements and attributes which were in the initial encoding (and which we wanted to preserve). In the following we dwell on these modifications and illustrate why they were needed.

1. The element is included into the element and because in written Romanian the accent is not marked, and information was kept distinct.

2. Another element that was embedded into is . This was necessary in order to make it possible to associate grammatical information that is pertinent only to a specific inflected form or variant. For instance, the inflected forms in the direct cases (nominative and accusative), both singular and plural, are implicit, but for the oblique cases the orthographic forms are explicitly associated with case information.

Example 1
acel ac`el acea ace`a acei acele acelui acelei genitiv, dativ singular acelor genitiv, dativ plural ...

3. Romanian is a strongly inflected language, and therefore preserving the morphological information provided in DEX is absolutely necessary (when this information is provided, the word form in question is usually irregular). Therefore the entity %m.morphInfo was included into the content of . The morphological information is systematically provided for the head-word, and also for the inflected forms that have variants. In such cases, the morphological information is provided at the beginning of the entry.


When the inflectional paradigm of the head-word is irregular, DEX specifies the irregular forms together with the corresponding morphological information towards the end of the entry (just before the etymological information). According to the encoding schema in Figure 2, these word forms and the associated grammatical information were moved into the first element, which contains the orthographic, orthoepic and morphological information applying to the head-word. Specific morphological information may also apply only to some senses (see Example 2) or phrasal units (see Example 3).

Example 2
fi ... verb conjugarea IV intranzitiv A. Verb predicativ ...

Example 3
locuţiune verbală A-i fi cuiva drag (cineva sau ceva) a-i plăcea, a îndrăgi, a iubi.

The HTML-SGML conversion program mentioned before explicitly generates all the implicit grammatical information of the head-words (infinitive for verbs, masculine singular for adjectives or pronouns, etc.).

4. In order to easily identify the words that collocate with a specific head-word we included the element into (see Example 4).

Example 4
an adverb ... figurat Precedat de mai Acum câţiva ani. ...

6. Related entries

The re markup was used for encoding the phrasal units. DEX contains many phrasal units and distinguishes among locutions, expressions, syntagms, constructs and compounds. The description of a phrasal unit has a structure similar to (but simpler than) that of a regular entry. A maximal description of a phrasal unit contains all the top-level elements shown in Figure 2, except for the and . Below we provide examples for each type of phrasal unit mentioned before.

Example 5
de la a la z de la început până la sfârşit totul, în întregime.

Example 6
locuţiune adverbială An de an An cu an în fiecare an, mereu.

Example 7
cinci-degete substantiv plantă erbacee târâtoare, cu frunzele formate din cinci foliole şi cu flori galbene (Potentilla reptans).

Example 8
Faţă de masă material textil, plastic etc. folosit spre a acoperi o masă (când se mănâncă sau ca ornament).

Acknowledgements

The work reported here was jointly funded by the CONCEDE European project (PL96-1142) and by a grant of the Romanian Academy (GAR#187/1998).

References

DEX (1996). Coteanu, I., Seche, L., Seche, M. (coord.): Dicţionarul Explicativ al Limbii Române, Ediţia a II-a. Univers Enciclopedic, Bucureşti.
Dimitrova, L., Erjavec, T., Ide, N., Kaalep, J. H., Petkevič, V., and Tufiş, D. (1998): Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages. In Proceedings of COLING-ACL’98, Montreal, Canada, pp. 315-319.
Erjavec, T., Ide, N. (1998): The Multext-EAST Corpus. In Proceedings of the First International Conference on Language Resources and Evaluation, LREC’98, Granada, pp. 971-974.
Ide, N., Véronis, J., eds. (1995): Text Encoding Initiative. Kluwer Academic Publishers, Dordrecht / Boston / London.
Landini, Gabriel (1997): Zipf's laws in the Voynich Manuscript. http://web.bham.ac.uk/G.Landini/evmt/zipf.htm
Ide, N. (1998): Corpus Encoding Standard: SGML Guidelines for Encoding Linguistic Corpora. In Proceedings of the First International Language Resources and Evaluation Conference, Granada, Spain. See also http://www.cs.vassar.edu/CES/.
Ide, N. and Véronis, J. (1994): Multext (Multilingual Tools and Corpora). In Proceedings of the 14th International Conference on Computational Linguistics, COLING'94, Kyoto, pp. 90-96.