Homograph Ambiguity Resolution in Front-End Design for Portuguese TTS Systems

Daniela Braga 1, Luís Coelho 2, Fernando Gil V. Resende Jr. 3

1 Microsoft Language Development Center, MSFT, Portugal
2 Polytechnic Institute of Oporto, Portugal
3 Federal University of Rio de Janeiro, Brazil

Abstract

In this paper, a module for homograph disambiguation in Portuguese Text-to-Speech (TTS) systems is proposed. The module works with a part-of-speech (POS) parser, used to disambiguate homographs that belong to different parts of speech, and a semantic analyzer, used to disambiguate homographs that belong to the same part of speech. The proposed algorithms are meant to solve a significant part of homograph ambiguity in European Portuguese (EP) (106 homograph pairs so far). The system is ready to be integrated into a Letter-to-Sound (LTS) converter. The algorithms were trained and tested with different corpora, and the experiments yielded an accuracy rate of 97.8%. The methodology is also valid for Brazilian Portuguese (BP), since 95 homograph pairs are exactly the same as in EP. A comparison with a probabilistic approach was also carried out and the results are discussed.

Index Terms: Text-to-Speech, homograph, disambiguation, Part-of-Speech parser, semantic analysis

1. Introduction

Homograph ambiguity is a well-known and difficult problem in TTS conversion. In Portuguese it is responsible for an error rate of 0.62% in our LTS conversion, which means that in a 1000-sentence corpus containing 9090 words there are 57 mispronounced homographs, because the correct transcription is not the default one produced by the converter. Although this error rate may not seem very significant, such errors can be quite jarring from the user's point of view. This is a complex issue, since most of the time it depends on morphological and syntactic information. For instance, in <almoço> [al 'mosu] (noun, 'lunch') / <almoço> [al 'mOsu] (verb, 'I have lunch'), the change in tonic vowel quality depends on the POS of each word, noun and verb respectively (SAMPA for EP is used for phonetic transcription, with one extension to represent the velar lateral consonant, e.g. <sal> ['sal ], 'salt'). Sometimes, when both words belong to the same POS, homograph disambiguation can only be done by using semantic information (e.g. <besta> ['beSt6] (noun, 'beast') / <besta> ['bESt6] (noun, 'crossbow')).

The main work on homograph disambiguation for TTS systems can be found in [1], in which the author establishes a typology of homograph pairs for English, describes several techniques traditionally used to resolve homograph ambiguity (N-gram taggers, Bayesian classifiers and decision trees) and proposes a hybrid algorithm combining the best of the three techniques.

Regarding homograph disambiguation in EP TTS systems, the work of Ribeiro et al. [2], [3] can be mentioned. Although homograph disambiguation is not the main topic of that work, it shows that morphosyntactic information is responsible for good TTS performance. In [2], Ribeiro et al. compare POS parsers that use a hybrid approach (probabilistic plus linguistic rules) with ones that use a probabilistic approach. The results seem to show better performance for the hybrid approach. A table is given with a typology of the morphosyntactic ambiguities that influence the LTS converter. However, as no ambiguity case is accompanied by examples, it is not clear whether the ambiguities are caused by homonym pairs or by homograph pairs. The updates made in [3] seem to confirm that the morphosyntactic ambiguity analyzed is homonym-based, which has little impact on LTS conversion, although it is very important for prosodic generation, namely for focus and prosodic group boundaries.

In Seara et al. [4], [5], a POS parser is presented to predict the vowel quality of noun and verb forms in Brazilian Portuguese (BP) TTS systems. Although this is very interesting work on homograph disambiguation, especially because it shows that the variation in vowel quality along the verb inflection can be predicted, it does not cover homograph pairs whose disambiguation is semantically determined. Both morphosyntactic and semantic analysis were considered in the approach of Ferrari et al. [6] to homograph disambiguation in BP TTS systems. The Cognitive Grammar framework was proposed and a corpora-based analysis was pursued in order to find the expected neighboring constructional schemes. This approach was tested with only one example. Although it is a very interesting approach, each existing homograph pair would still need a similar study.

In this paper, we present a module that applies to both EP and BP and that resolves a wide range of homograph ambiguities, whether morphosyntactically or semantically determined. The paper is structured as follows: in section 2, the architecture of the homograph disambiguation system is described; in section 3, tests and results are shown and a comparison with a probabilistic technique is discussed; in section 4, our main conclusions are presented.


2. System Architecture

The homograph disambiguation module can be seen as part of the POS parser and is integrated into the LTS converter, one of the most relevant modules of the TTS front-end component. In this section, we describe the system design and its components, the methodology used, the proposed homograph typology and the corresponding algorithms.


2.1. Methodology


The first milestone of our work was collecting as many homographs as possible, either from the literature [7], [8], [9] or from the performance results of our LTS converter [10]. We obtained a library of 106 homographs for EP, of which 95 are also valid for BP. This was a rather difficult task, since no complete list of homographs for Portuguese is available or published. The homograph collection is still in progress, because the module can only handle a given homograph if it is present in the library. The second milestone was organizing the homographs into types, according to their grammatical category and phonetic alternation. Each type was given a code number to which an algorithm was matched. The third milestone was building algorithms with a set of conditions for each homograph type. At this stage, several libraries were collected [8], [9], [11], as described in section 2.2, and development of a POS parser began. Disambiguation rules were trained with three different corpora: CETEMPúblico (a corpus of newspaper language) [12], COMPARA (a Portuguese-English aligned corpus of literary translations) [13] and EUROPARL-Opus (a multilingual aligned corpus of transcriptions of European Parliament debates) [14]. This diversity of corpora was important to help us find different contexts for each homograph pair, because one of the two possible readings of a homograph is often more likely to be found in one corpus type than in another. For instance, in the pair <gosto> [o] ('taste') / <gosto> [O] ('I like'), it is very difficult to find the second form (verb in the first person singular) in a journalistic corpus like CETEMPúblico, where subjectivity is avoided. However, it is commonly found in literary texts, such as those in COMPARA. The system was then tested with a different corpus, as described in section 3.
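The pairing of a type code with an algorithm made of conditions can be pictured as a lookup table of rule functions. The sketch below is only an illustration of that idea: the type code 1, the word lists and the conditions themselves are hypothetical and are not the rules used in the paper.

```python
from typing import Callable, Dict, List, Optional

# A rule receives the tokenized sentence and the homograph position
# and returns a label for the selected reading.
Rule = Callable[[List[str], int], str]

# Hypothetical context cues, chosen only for this illustration.
SUBJECT_PRONOUNS = {"eu"}
DETERMINERS = {"o", "um", "meu", "este"}


def noun_verb_rule(tokens: List[str], i: int) -> str:
    """Illustrative conditions for a noun/verb pair such as <gosto>."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    if prev in SUBJECT_PRONOUNS:   # e.g. "eu gosto" -> verb reading
        return "verb"
    if prev in DETERMINERS:        # e.g. "o gosto" -> noun reading
        return "noun"
    return "noun"                  # assumed default reading


# Homographs library: word -> type code (the code is a placeholder).
HOMOGRAPH_TYPES: Dict[str, int] = {"gosto": 1}

# Type code -> matching algorithm.
ALGORITHMS: Dict[int, Rule] = {1: noun_verb_rule}


def disambiguate(tokens: List[str], i: int) -> Optional[str]:
    """Return the reading for tokens[i], or None if it is not a known homograph."""
    code = HOMOGRAPH_TYPES.get(tokens[i].lower())
    if code is None:
        return None
    return ALGORITHMS[code](tokens, i)
```

With these toy conditions, `disambiguate(["Eu", "gosto", "de", "ler"], 1)` selects the verb reading and `disambiguate(["O", "gosto", "do", "vinho"], 1)` selects the noun reading. A word that is absent from the library is left untouched, which mirrors the remark that the module depends on the homograph being present in the library.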

[Figure 1 components: Text; Word breaker; Homograph type identification; Homograph disambiguation algorithms; Morphosyntactic analysis; Semantic analysis; Phonetic transcription. Libraries: Homographs library; Closed POS library; Morphemes library; Lemmas library; Irregular verbs library; Preparatory subject expressions library; Restricted lexical combinations library; Wordnets library.]

Figure 1: Homograph disambiguation system architecture.

Figure 1 shows the system architecture. The text is first split into words. The system then searches for homographs. Once a word is found in the homographs library, it is mapped to its type and passed to the corresponding disambiguation algorithm. In Figure 1, the libraries needed for morphosyntactic analysis are shown on the left, whereas the libraries used for semantic analysis are placed on the right.
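The first stages of this flow (word breaking, lookup in the homographs library, type identification) can be sketched as follows. This is a minimal illustration under stated assumptions: the regular-expression word breaker, the library entries and the type codes are hypothetical stand-ins, not the components of the actual system.

```python
import re
from typing import List, NamedTuple


class Hit(NamedTuple):
    position: int   # index of the word in the sentence
    word: str       # surface form as found in the text
    type_code: int  # homograph type used to select an algorithm


# Hypothetical excerpt of a homographs library: word -> type code.
HOMOGRAPHS_LIBRARY = {"gosto": 1, "besta": 2}


def word_breaker(text: str) -> List[str]:
    """Split text into words; \\w matches Unicode letters for str patterns."""
    return re.findall(r"\w+", text)


def find_homographs(text: str) -> List[Hit]:
    """Return every word of the text that is listed in the homographs library."""
    hits = []
    for i, word in enumerate(word_breaker(text)):
        code = HOMOGRAPHS_LIBRARY.get(word.lower())
        if code is not None:
            hits.append(Hit(i, word, code))
    return hits
```

Each `Hit` carries the type code that a later stage would use to choose the disambiguation algorithm, while words outside the library pass straight on to the default transcription.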

2.3. Homograph Typology

Tables 1 and 2 display the homograph typology. Table 1 shows homographs belonging to different POS, whereas Table 2 shows homographs belonging to the same POS. The ambiguity in Table 1 is resolved with morphosyntactic analysis, whereas semantic information is needed to resolve the ambiguity in Table 2. This distinction was the first criterion used to design our homograph typology. The second criterion was the POS opposition and the inherent vocalic alternation of each homograph pair. As the tables show, the most productive opposition is between Noun and Verb from a morphological point of view, and between [e]/[E] and [o]/[O] from a phonetic point of view. The systematic pattern is that tonic vowels are typically closed in nouns and open in verb forms. Types 1 and 2 account for 68.3% of the homographs library. Types 14, 15 and 20 need both morphosyntactic and semantic analysis, since they have three possible outputs. Types 12 and 24 show a different vocalic alternation: it occurs in the pre-tonic syllable and, because of that, the opposition is [@]/[E].
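The noun/verb pattern described above (closed tonic vowel in nouns, open tonic vowel in verb forms) can be written as a small selection function once the POS is known. The sketch below assumes SAMPA symbols as used in the text; the function names and the transcription template are hypothetical and serve only to illustrate the rule.

```python
# Closed -> open tonic vowel, for the two most productive alternations.
CLOSED_TO_OPEN = {"e": "E", "o": "O"}


def tonic_vowel(closed_vowel: str, pos: str) -> str:
    """Pick the tonic vowel of a noun/verb homograph from its POS tag."""
    if closed_vowel not in CLOSED_TO_OPEN:
        raise ValueError(f"unsupported alternation: {closed_vowel!r}")
    if pos == "noun":
        return closed_vowel                  # nouns: typically closed
    if pos == "verb":
        return CLOSED_TO_OPEN[closed_vowel]  # verb forms: typically open
    raise ValueError(f"unsupported POS: {pos!r}")


def transcribe(template: str, closed_vowel: str, pos: str) -> str:
    """Fill a template whose tonic vowel slot is written as {V}."""
    return template.format(V=tonic_vowel(closed_vowel, pos))
```

For example, with the illustrative template `"g{V}Stu"` for <gosto>, the noun tag fills the slot with [o] and the verb tag with [O]. Pairs that share a POS cannot be handled this way, which is why they are routed to semantic analysis instead.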

2.2. Libraries

The following libraries were gathered:
1) Homographs library, containing 106 homograph pairs grouped in 24 types.
2) Closed POS library, containing the parts of speech that have a fixed number of items (pronouns, prepositions, adverbs, conjunctions, contractions, articles, numbers, determiners, interjections).
3) Morphemes library, containing noun, verb, adjective and adverb suffixes, prefixes, and Latin and Greek affixes.
4) Lemmas library, containing the Portuguese Jspell dictionary [15], with about 34000 morphologically annotated words.
5) Irregular verbs library, containing the inflected forms of the main irregular Portuguese verbs.
6) 'Preparatory subject' expressions library, containing expressions with the verb to be in the third person + adjective followed by that-clauses: