SEGMENTING SPOKEN LANGUAGE UTTERANCES INTO CLAUSES FOR SEMANTIC CLASSIFICATION

Narendra K. Gupta and Srinivas Bangalore
AT&T Labs-Research, 180 Park Avenue, Florham Park, NJ 07932

ABSTRACT

Robust spoken language understanding in large-scale conversational dialog applications is usually performed by classification of the user utterances into one or many semantic classes. The features used for classification are sensitive to variations caused by artifacts of spoken language, such as edits, repairs and other dysfluencies. Furthermore, the performance of these classifiers typically degrades when the user's utterance contains multiple semantic classes. In this paper, we present a semantic classification technique that first automatically removes dysfluencies and segments the user's utterance into clauses, and then classifies the utterance based on the classification of the clauses. We show that this preprocessing improves the semantic classification accuracy for utterances and significantly decreases the amount of training data needed for a given classification accuracy level.

1. INTRODUCTION

Spoken language understanding in a conversational system typically consists of a syntactic analyzer and a semantic analyzer. The syntactic analyzer transforms the user's utterance into a representation that is suitable for the semantic analyzer to produce interpretations of that utterance. Several instantiations [1, 2, 3] of the spoken language understanding component in a conversational system use a syntactic analyzer that is based on a hand-written grammar and a set of hand-crafted application-independent rules that map the syntactic structures into application-independent semantic representations. These semantic representations are then transformed into application-specific representations that are more appropriate to the task of the conversational system. The two levels of semantic representation imply that porting a dialog system to a new domain involves incorporating domain-specific lexical items, identifying domain-specific semantic predicates, and defining mappings from general predicates to domain-specific predicates. Moreover, the syntactic analyzer needs to accommodate spoken language phenomena such as (a) ungrammaticality, (b) dysfluencies like speech repairs, repeats, and explicit/implicit repairs, and (c) unpredictable word errors introduced by speech recognizers. Although these systems have the potential of creating a rich semantic representation, they are not robust and are time-consuming to develop for an application domain.

In recent times, a class of dialog applications based on classification techniques has gained practical significance [4, 5, 6]. Classification methods originally designed for text classification of documents have been increasingly used for semantic classification of user utterances in such dialog systems. These dialog systems have had reasonable success in the context of large-scale call-routing applications. The classification techniques train models from data that associate user utterances with a limited number of actions that the dialog system can perform. The advantage of this approach is that it is robust, rapidly trainable from data, and directly associates a user utterance with the semantics of the application. Most of the successful practical systems mentioned above use lexical n-grams as features to learn a classification model. While it is surprising that these simple features allow for such useful systems, the performance of such a system typically degrades when there are multiple semantic classes in a user's utterance. Associating the entire user utterance with all the classes creates spurious associations and makes the classification problem even harder. In contrast, we believe that the semantic categories for a user's utterance can be arrived at in a compositional manner from the semantic classes of each of its component clauses. Some of the clauses might not even contribute to the semantic classes in the context of an application.

In this paper, we present an approach to semantic classification that first automatically removes dysfluencies and segments the spoken language utterance into clauses, and then derives the semantic class for the utterance based on the semantic classes of the clauses. Although there has been previous work on accommodating speech repairs in the syntactic analyzer, there has been little work on studying their effect on the semantics of the application. In contrast to these previous works, in this paper we study the impact of preprocessing on the semantics of an application.

The following is the outline of the paper. In Section 2, we describe our approach to semantic classification. The segmentation of a spoken language utterance into clauses is discussed in Section 3. In Section 4, we discuss the data used for the experiments. We present the details of the experiments and discuss the results in Section 5.

2. SEMANTIC CLASSIFIER

The semantic classification task involves assigning one or more application-specific labels (semantic classes) to a user's utterance. This task is achieved by training a classifier on a set of user utterances annotated with semantic classes. Given that there may be multiple semantic concepts embedded in an utterance, the semantic classifier needs to be capable of assigning a set of classes to an utterance – a multi-class classifier.

We used a machine-learning tool called boostexter, which is based on the boosting family of algorithms first proposed in [7] and extended to the multi-class setting in [8]. For semantic classification, boostexter uses n-grams as the space of possible weak classifiers. Each weak classifier assigns weights (positive or negative) to the different classes depending on whether a specific n-gram is present or absent in the user's utterance. At each iteration of classifier training, boostexter selects a single weak classifier and its weight based on the reduction in a loss function. The classifier output for a user's utterance is the weighted sum of the outputs of all the weak classifiers. These weighted sums can be converted to probability values using logistic or exponential functions, depending on the loss function minimized by boostexter. Since boostexter is a multi-class classifier, the probabilities assigned to the classes are independent of each other and do not sum to one. A semantic interpreter can use an appropriate threshold on these probabilities to determine the semantic classes present in the input text.
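As a rough illustration of this scoring scheme (not the actual boostexter implementation; the n-grams, weights and probability calibration below are invented for the example), per-class scores accumulated from n-gram weak classifiers can be mapped to independent probabilities and thresholded:

import math
from collections import defaultdict

def ngrams(tokens, n_max=3):
    """Enumerate the word n-grams (up to trigrams here) of an utterance."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

# Hypothetical weak classifiers: n-gram -> {semantic class: weight}.  boostexter
# learns these automatically; the entries below are invented for illustration.
WEAK_CLASSIFIERS = {
    "credit": {"Billing Credit": 1.4, "Explanation of Bill": -0.2},
    "my bill": {"Explanation of Bill": 1.1},
    "did not make": {"Unrecognized Number": 1.7},
}

def classify(utterance, threshold=0.2):
    """Accumulate per-class scores from the weak classifiers, convert the scores
    to independent probabilities with a logistic function, and threshold them."""
    feats = set(ngrams(utterance.lower().split()))
    scores = defaultdict(float)
    for ngram, class_weights in WEAK_CLASSIFIERS.items():
        for cls, weight in class_weights.items():
            # Simplification: add the weight when the n-gram is present and
            # subtract it when absent (boostexter keeps separate weights).
            scores[cls] += weight if ngram in feats else -weight
    probs = {cls: 1.0 / (1.0 + math.exp(-2.0 * s)) for cls, s in scores.items()}
    return {cls for cls, p in probs.items() if p >= threshold}

print(classify("I want these charges taken off because I did not make those calls"))
# -> {'Unrecognized Number'} with these toy weights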
Although lexical n-gram features are easy to compute, they are sensitive to the variations common in natural language input. It is typical for the same semantic class to be associated with many variants of a user's utterance. In order to learn reliable models, boostexter requires sufficient training data to capture these variations and their semantic classes. These variations are exacerbated by phenomena common in spoken language such as ungrammaticality, dysfluencies and unpredictable speech recognition errors. The data set needed to train a reliable classifier expands combinatorially when a user's utterance is associated with multiple semantic classes. It is evident that data collection and annotation is a time-consuming and tedious task. In order to alleviate this data bottleneck, we propose to reduce the variability introduced by dysfluencies and to segment the user's utterances into clauses, with each clause (see footnote 1) associated with its semantic classes. We show that boostexter achieves the same performance with half the training data when it is trained on the processed (dysfluency-free and segmented) version of the training corpus as compared to the unprocessed training corpus. Another advantage of segmenting the user's utterance into clauses and associating the clauses with the appropriate semantic classes is that it allows for better classification. When the user's utterance contains multiple semantic classes, the entire utterance is annotated with all the classes. This causes spurious n-gram associations with semantic classes and makes the classification problem harder; performance degrades rapidly in such cases. In contrast, we show that the semantic categories for a user's utterance can be arrived at in a compositional manner from the semantic classes of each of its component clauses. Some of the clauses might not even contribute to the semantic classes of the utterance in the context of an application and are labeled as Garbage.

In Table 1, we present a sample corpus of user utterances annotated with semantic classes from the How May I Help You? (HMIHY) application [4] (see footnote 2). The utterances are annotated with one or more semantic classes, and in cases where there is more than one semantic class, the order among the classes is not significant. In Table 2, we present the clausified and dysfluency-free version of the utterances from Table 1, along with the semantic class for each clause. Note that with the segmentation into clauses, some utterances have the same semantic class repeated for their component clauses, thus amplifying the association of that semantic class with the utterance. In the next section, we discuss the process of clausification.

3. CLAUSIFIER

Spoken language contains certain characteristic features that need to be accommodated in a conversational system. These include (a) ungrammaticality of utterances, (b) presence of dysfluencies [9] such as repeats, restarts, and explicit/implied repairs, and (c) absence of essential punctuation marks such as sentence endings and comma-separated enumerations. These features make the word strings resulting from recognition or transcription of speech syntactically and semantically incoherent.

The task of identifying sentence boundaries, speech repairs and dysfluencies has been a focus of spoken language parsing research for several years [10, 11, 12, 13, 14, 15]. Most of the previous approaches cope with dysfluencies and speech repairs in the parser by providing ways for the parser to skip over syntactically ill-formed parts of an utterance. In more recent work [16, 17, 18], the problem of parsing speech is viewed as a two-step process in which a preprocessing step is used to identify speech repairs. In contrast to these previous works, in this paper we study the impact of such preprocessing on the semantics of an application.

Fig. 1. Cascade of classifiers for clausification: the user's utterance is passed through stages that extract filled pauses, discourse markers and explicit edits, identify and remove edits, identify coordinating conjunctions, and identify and insert segment boundaries, producing the clausified utterance.

Footnote 1: We use the term clause to represent a segment of an utterance that has a subject and a predicate. However, due to ungrammaticality and elision in spoken language, we also treat meaning-bearing segments consisting of only fragmented phrases as clauses. Furthermore, for this version of the system, we do not extract embedded clauses.

Footnote 2: The goal in this application is to understand the caller's responses to the open-ended prompt How May I Help You? and to route the call based on the meaning of the response.

Utterance 1: I am I am supposed to be on that att $number rate $ItemRate uh in state charge that you have been charging me $ItemRate and I am uh I am on the $ItemRate out of state which I am correct on that but I need somebody to change that from $currency-us to $currency-us and go back and look at the history of my account and give me a credit for all the over charge of $ItemRate you been changing now that should not be hard to understand
Semantic labels: Explanation of Bill; Billing Credit

Utterance 2: I have already gone through all of the options it still does not answer my question on the rate that I was charged on my current bill
Semantic labels: Help; Explanation of Bill

Utterance 3: I want them these charges taken off because I did not make those calls
Semantic labels: Unrecognized Number; Billing Credit

Table 1. A sample user's utterance annotated with semantic classes

Utterance 1:
  I am supposed to be on that att $number rate $ItemRate in state charge   [Explanation of Bill]
  you have been charging me $ItemRate   [Explanation of Bill]
  I am on the $ItemRate out of state which I am correct on that   [Explanation of Bill]
  I need somebody to change that from $currency-us to $currency-us   [Billing Credit]
  go back   [Garbage]
  look at the history of my account   [Garbage]
  give me a credit for all the over charge of $ItemRate you been charging   [Billing Credit]
  that should not be hard to understand   [Garbage]

Utterance 2:
  I have already gone through all of the options   [Help]
  it still does not answer my question on the rate that I was charged on my current bill   [Explanation of Bill]

Utterance 3:
  I want these charges taken off   [Billing Credit]
  I did not make those calls   [Unrecognized Number]

Table 2. Clausified and dysfluency-free utterance with each clause associated with its semantic class
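For illustration, the utterance-level labels of Table 1 can be recovered from the clause-level labels of Table 2 by taking the union of the clause labels and dropping Garbage, which is the combination rule used later in Section 5.2. A minimal sketch:

def utterance_labels(clause_labels):
    """Combine clause-level labels into utterance-level labels:
    the union of the clause labels, ignoring the Garbage class."""
    return {label for labels in clause_labels for label in labels if label != "Garbage"}

# Clause labels for utterance 1 in Table 2.
utt1 = [
    {"Explanation of Bill"}, {"Explanation of Bill"}, {"Explanation of Bill"},
    {"Billing Credit"}, {"Garbage"}, {"Garbage"}, {"Billing Credit"}, {"Garbage"},
]
print(utterance_labels(utt1))   # the Table 1 labels for utterance 1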

In [18], we presented a discriminative approach to identify speech repairs and segment the input utterance into clauses – self-contained, syntactic units embodying a single concept in the form of a single subject-predicate set. We presented a cascade of classifiers to detect sentence boundaries, identify speech repairs and edit them out, and identify coordinating conjunctions in order to break the sentences into clausal units. We called this cascade a clausifier. We also presented experiments on human-human conversations obtained from the Switchboard corpus. The individual classifiers in the cascade classified each word boundary as an edit, a conjunction or a clause boundary on the basis of the words and their parts-of-speech to the left and right of the boundary. We have extended our clausifier to take advantage of easy-to-identify dysfluencies as additional features that the individual classifiers in the cascade can use to better identify restarts and repairs. Figure 1 shows the cascade of classifiers we use for clausification. As an example, consider the transcription of an utterance from HMIHY (see footnote 3), shown as utterance 1 in Table 1. When this text is passed through an ideal clausifier, all the dysfluencies, edits, sentence boundaries and coordinating conjunctions are identified and tagged as shown in Figure 2. The individual tags can then be interpreted to produce the clauses shown for utterance 1 in Table 2. As can be seen, the clauses in Table 2 are much more coherent and free from the artifacts of spoken language.

Footnote 3: In this utterance, the named entities are substituted with tokens that represent the values of these entities.
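To make the word-boundary classification concrete, the sketch below extracts the kind of left/right word and part-of-speech features described above; the feature names, window size and example tag sequence are assumptions made for this illustration, not the actual feature set of [18]:

from typing import Dict, List

def boundary_features(words: List[str], pos: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Features for the boundary between words[i-1] and words[i]:
    the words and POS tags in a small window on the left and right."""
    feats = {}
    for k in range(1, window + 1):
        feats[f"w-{k}"] = words[i - k] if i - k >= 0 else "<s>"
        feats[f"p-{k}"] = pos[i - k] if i - k >= 0 else "<s>"
        feats[f"w+{k}"] = words[i + k - 1] if i + k - 1 < len(words) else "</s>"
        feats[f"p+{k}"] = pos[i + k - 1] if i + k - 1 < len(words) else "</s>"
    # Crude cue for repeats: the same word on both sides of the boundary.
    feats["same-word"] = str(0 < i < len(words) and words[i - 1] == words[i])
    return feats

words = "I am I am supposed to be on that rate".split()
pos = ["PRP", "VBP", "PRP", "VBP", "VBN", "TO", "VB", "IN", "DT", "NN"]
# One feature dictionary per word boundary; a trained classifier (not shown)
# would map each dictionary to edit, conjunction, clause boundary, or none.
for i in range(1, len(words)):
    print(i, boundary_features(words, pos, i))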



[e: I am] I am supposed to be on that att $number rate $ItemRate [f: uh] in state [s] you have been charging [e: charge that] me $ItemRate [c: and] [e: I am] [f: uh] I am on the $ItemRate out of state which I am correct on that [c: but] I need somebody to change that from $currency-us to $currency-us [c: and] go back [c: and] look at the history of my account [c: and] give me a credit for all the over charge of $ItemRate you been changing [s] [d: now] that should not be hard to understand

Tags: [s] sentence boundary; [c] coordinating conjunction; [f] filled pause; [d] discourse marker; [e] edit.
Fig. 2. Example output after processing with the clausifier.

We show that this normalization not only improves the classifier but also has an impact on the size of the training corpus needed to achieve a certain level of performance.
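The tag-interpretation step can be sketched as follows, using the bracketed rendering of the tags shown above: edited material, filled pauses and discourse markers are dropped, and the remaining words are split into clauses at sentence boundaries and coordinating conjunctions. The bracket notation and the helper below are illustrative assumptions, not the clausifier's actual output format:

import re

# Clausifier tags rendered as [tag] or [tag: words], where e = edit,
# f = filled pause, d = discourse marker, c = coordinating conjunction,
# s = sentence boundary.
TAG = re.compile(r"\[(\w+)[^\]]*\]")

def clauses(tagged):
    """Drop edited/filled-pause/discourse-marker material and split the rest
    into clauses at sentence boundaries and coordinating conjunctions."""
    out, current, pos = [], [], 0
    for m in TAG.finditer(tagged):
        current.extend(tagged[pos:m.start()].split())
        if m.group(1) in ("s", "c"):   # clause break; the conjunction itself is dropped
            if current:
                out.append(" ".join(current))
                current = []
        # tags e, f and d: the tagged words are simply discarded
        pos = m.end()
    current.extend(tagged[pos:].split())
    if current:
        out.append(" ".join(current))
    return out

# Abridged excerpt of utterance 1, rendered with the bracket notation above.
example = ("[e: I am] I am supposed to be on that att $number rate $ItemRate "
           "[f: uh] in state charge [s] you have been charging me $ItemRate "
           "[c: and] go back [c: and] look at the history of my account")
for clause in clauses(example):
    print(clause)   # each printed line matches a clause of utterance 1 in Table 2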

Type of dysfluency                   Human-Human Dialogs (Switchboard)   Human-Machine Dialogs (HMIHY)   F-value of classifier in the cascade
Sentence continuation across turns   1.76%                               0                               -
Sentence boundaries                  5.8%                                4.09%                           64
Edits                                3.00%                               1.49%                           32
Coordinating Conjunctions            3.44%                               2.14%                           97
Discourse Markers                    2.30%                               0.45%                           73
Explicit Edits                       0.25%                               0.06%                           89
Filled Pauses                        3.35%                               3.21%                           100 (deterministic)

Table 3. Dysfluencies in human-human and human-machine dialogs and the performance of each classifier in the cascade.

4. HUMAN-MACHINE DIALOG DATA

Our primary interest in this work is in understanding user utterances in the context of human-computer dialog systems. There is ample anecdotal evidence that people speak differently when they know they are talking to a machine. Table 3 compares (see footnotes 4 and 5) the occurrence of various phenomena in human-human conversations against human-machine conversations. It is interesting to note that users do not continue sentences across turns and use fewer edits, discourse markers and explicit edits, but use coordinating conjunctions and filled pauses with similar frequencies.

In previous work, we used the human-human utterances in the Switchboard corpus to evaluate the performance of the clausifier. However, owing to the differences presented in Table 3 and the fact that there are no publicly available human-machine utterances annotated with dysfluencies and semantic labels, we were compelled to annotate a corpus of our own. We have annotated a relatively small corpus of 4000 transcribed utterances from human-computer dialogs (HMIHY) with all the dysfluencies, restarts and repairs, and segment and clause boundaries. We have also annotated each utterance with the semantic classes appropriate for the HMIHY application. There are a total of 56 semantic classes for this application. We used this data for all the experiments described in this paper.

Footnote 4: In addition to the differences shown in the table, human-human conversations contain back-channels that are absent in human-machine conversations.
Footnote 5: Sentence boundary numbers are computed after removing back-channels and boundaries at turns.

5. EXPERIMENTAL RESULTS

Given that the annotated data set is relatively small, the results presented in this paper are based on 10-fold cross validation. In each run, the data was randomly partitioned into 90% for training and 10% for testing. We trained both a clausifier and several semantic classifiers using the training partitions, and tested both the clausifier and the classifiers on the test partitions. In this section, we provide evaluation results for the clausifier, for the semantic classifier using hand-annotated clauses as input, and for the semantic classifier using the output of a trained clausifier.

5.1. Performance of Clausifier

In Table 3, we present the F-values of the individual classifiers in the cascade shown in Figure 1. Identification of explicit edits, discourse markers and coordinating conjunctions is an easier problem.

In spite of the small data set, the performance of these components is reasonable. Identification of sentence boundaries and edits is a more difficult problem and needs more data to learn good models.

We also report the recall and precision of the clausifier at the clause level. More specifically, clause-level recall is the percentage of clauses present in the reference that are correctly identified by the clausifier, and clause-level precision is the percentage of clauses present in the output of the clausifier that are present in the reference. Table 4 provides clause-level recall and precision for different partitions of the test data.

Partition of test data                 % of data   Recall   Precision   F-Value
All test data                          100         51.3     56.8        53.9
No clausification action needed        29          89.8     88.1        88.9
Some clausification action is needed   71          45.7     51.5        48.4

Table 4. Performance of the clausifier on different partitions of the data

As can be seen from Table 4, the overall performance of the clausifier is relatively low, at an F-value of 53.9. In cases where no clausifier action is needed (row 2), the clausifier sometimes wrongly introduces annotations, which results in an F-value of 88.9 at the clause level. For cases where some clausifier action is needed (row 3), the clausifier has an F-value of 48.4. In spite of this low performance, we show that clausification helps in improving the semantic classification accuracy for transcribed data.

5.2. Performance of Semantic Classifier

Using the 4000 examples, we carried out several experiments to demonstrate the performance improvements in semantic classification obtained by using clausified labeled data rather than labeled raw utterances (without clause information). In all the experiments, we trained two separate semantic classifiers. The Raw model was trained on the raw transcriptions of utterances, each labeled with one or more semantic labels. The Clausified model was trained on hand-clausified transcribed data (all dysfluencies removed, and segmented at both sentence and clause boundaries) where each clause is individually assigned one or more labels. Some clauses may not be important for the application and are labeled as Garbage.
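Under these definitions, clause-level precision and recall reduce to counting exact matches between the reference clauses and the clausifier output. A minimal sketch, assuming clauses are compared as multisets of word strings:

from collections import Counter

def clause_prf(reference, hypothesis):
    """Clause-level precision, recall and F-value.
    recall    = fraction of reference clauses that the clausifier produced
    precision = fraction of produced clauses that appear in the reference"""
    ref, hyp = Counter(reference), Counter(hypothesis)
    correct = sum((ref & hyp).values())          # multiset intersection
    precision = correct / sum(hyp.values()) if hyp else 0.0
    recall = correct / sum(ref.values()) if ref else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

reference = ["go back", "look at the history of my account", "I did not make those calls"]
hypothesis = ["go back", "look at the history of my account and give me a credit"]
print(clause_prf(reference, hypothesis))   # (0.5, 0.333..., 0.4)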


These models were tested on randomly held-out test data. The Raw model was tested on raw (un-clausified) transcribed/recognized text, and the Clausified model on clausified transcribed/recognized text. A clausifier, trained on the same data that was used to train the Clausified model, was used to clausify the test data for testing the Clausified model. The semantic classes of an utterance were computed as the union of the semantic classes assigned to its constituent clauses (see footnote 6), ignoring the Garbage class. To evaluate the classifier, we used the weighted precision and recall of all the semantic classes and calculated the F-measure. More specifically, for a given confidence threshold on the classifier output, we computed the total number of true positives, false positives and false negatives over all the semantic classes. From these numbers, we calculated precision and recall and subsequently the F-measure. All the results presented here are at a threshold of 0.2, which was found to be the best operating point in all the experiments.
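This evaluation amounts to micro-averaging precision and recall over all semantic classes at a fixed confidence threshold. A small sketch, assuming the classifier returns a per-class probability for each test utterance:

def micro_prf(gold, predicted_probs, threshold=0.2):
    """Pool true/false positives and false negatives over all classes and
    utterances at the given confidence threshold, then compute P, R and F."""
    tp = fp = fn = 0
    for gold_labels, probs in zip(gold, predicted_probs):
        predicted = {cls for cls, p in probs.items() if p >= threshold}
        tp += len(predicted & gold_labels)
        fp += len(predicted - gold_labels)
        fn += len(gold_labels - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

gold = [{"Billing Credit", "Unrecognized Number"}, {"Help", "Explanation of Bill"}]
probs = [{"Billing Credit": 0.9, "Unrecognized Number": 0.1, "Help": 0.05},
         {"Help": 0.6, "Explanation of Bill": 0.4}]
print(micro_prf(gold, probs))   # TP=3, FP=0, FN=1 -> (1.0, 0.75, 0.857...)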



Fig. 3. Training curve for the Raw and Clausified models: F-value at a confidence value threshold of 0.2 versus the number of training examples (1000 to 3000).

5.2.1. Performance on Transcribed Speech

Table 5 shows the performance of the Raw and Clausified models for different partitions of the transcribed test data. The performance of the automatically Clausified model improves by 2% over the Raw model when tested on all the test data; with a perfectly clausified test set, the improvement is as large as 6% in semantic classification accuracy over the Raw model. It is also interesting to note that the Clausified model outperforms the Raw model on the partition of the test data where some clausification action is needed (63 vs 59), while the Raw model performs better on the partition where no clausification action is needed (74 vs 71) – a consequence of the poor precision of the clausifier. Furthermore, the performance of the Clausified model progressively improves over that of the Raw model as the number of semantic classes increases. Also, notice that in this application 60% of the data is annotated with a single semantic class. With a more complex dialog application, one would expect more of the data to be annotated with richer semantic information. This suggests that clausification is an essential step in such conversational systems.

Partition of test data            % of total   Clausified model,           Clausified model,           Raw model,
                                               hand-clausified test        auto-clausified test        raw test
All Test Data                     100          68                          64                          62
No Clausification Needed          29           71                          71                          74
Clausification Needed             71           67                          63                          59
Single Semantic Class             60           63                          62                          63
Exactly Two Semantic Classes      27           73                          69                          68
More than Two Semantic Classes    14           69                          63                          55

Table 5. Performance (F-values) of classifiers trained on raw utterances and trained on clausified utterances on various partitions of transcribed test data.

5.2.2. Effect of Size of Training Data

In this experiment, we study the effect of clausification on the size of the training data needed to reach a given level of accuracy. Figure 3 shows the training curves for the Raw and Clausified models. In this experiment, we trained the two models on increasing numbers of training utterances and tested each model on a fixed set of test utterances. In order to estimate the best-case performance of the Clausified model, we used hand-annotated clauses for the test set. It can be seen from Figure 3 that the Clausified model produces performance comparable to the Raw model with less than half the training examples. This is a significant result, given that data annotation is a very expensive and time-consuming enterprise.

Footnote 6: In this paper, due to space constraints, we will not discuss other ways of combining the classes of clauses to arrive at the classes for an utterance.

6. CONCLUSIONS

In most current large-scale conversational systems, spoken language understanding is performed as semantic classification using automatically trained classifiers. The performance of these classifiers degrades rapidly when the user's utterance contains multiple semantic classes. In this paper, we have presented an approach that segments spoken language utterances into clauses and associates semantic classes with the clauses, with the aim of improving semantic classification of the user utterance. In the context of the HMIHY conversational application, we have shown that a classifier trained on processed (dysfluency-free and clausified) data can produce the same level of performance with less than half the training data when compared against a classifier trained on unprocessed data. This is a significant result given that data annotation is an expensive and time-consuming enterprise. Furthermore, we have shown that a classifier trained on processed data significantly outperforms a classifier trained on unprocessed data. The performance gain is as much as 8% for test data that contain multiple semantic classes, indicating that clausification is an essential step for complex conversational systems.

7. REFERENCES

[1] J. Allen et al., "The TRAINS project: A case study in building a conversational planning agent," Journal of Experimental and Theoretical AI, vol. 7, pp. 7–48, 1995.
[2] Hiyan Alshawi, David Carter, Richard Crouch, Steve Pullman, Manny Rayner, and Arnold Smith, CLARE – A Contextual Reasoning and Cooperative Response Framework for the Core Language Engine, SRI International, Cambridge, England, 1992.
[3] David Goodine, Stephanie Seneff, Lynette Hirschman, and Michael Phillips, "Full Integration of Speech and Language Understanding in the MIT Spoken Language System," in Proceedings of Eurospeech, 1991.
[4] A. L. Gorin, G. Riccardi, and J. H. Wright, "How May I Help You?," Speech Communication, vol. 23, pp. 113–127, 1997.

[5] Jennifer Chu-Carroll and Bob Carpenter, "Vector-based natural language call routing," Computational Linguistics, vol. 25, no. 3, pp. 361–388, 1999.
[6] P. Natarajan, R. Prasad, B. Suhm, and D. McCarthy, "Speech-enabled natural language call routing: BBN Call Director," in Proceedings of ICSLP, 2002.
[7] R. E. Schapire, "A brief introduction to boosting," in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.
[8] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297–336, December 1999.
[9] M. Meteer et al., "Dysfluency annotation stylebook for the Switchboard corpus," distributed by LDC, 1995.
[10] J. Bear, J. Dowding, and E. Shriberg, "Integrating multiple knowledge sources for detection and correction of repairs in human-computer dialog," in Proceedings of ACL, 1992, pp. 56–63.
[11] Stephanie Seneff, "A relaxation method for understanding spontaneous speech utterances," in Proceedings, Speech and Natural Language Workshop, San Mateo, CA, 1992.
[12] Peter Heeman, Speech Repairs, Intonation Boundaries and Discourse Markers: Modeling Speakers' Utterances, Ph.D. thesis, University of Rochester, 1997.
[13] T. Ruland, C. J. Rupp, J. Spilker, H. Weber, and K. L. Worm, "Making the most of multiplicity: A multi-parser multi-strategy architecture for the robust processing of spoken language," Tech. Rep., DFKI, Verbmobil report 230, 1998.
[14] M. G. Core and L. K. Schubert, "A syntactic framework for speech repairs and other disruptions," in Proceedings of ACL, 1999, pp. 413–420.
[15] Elizabeth Shriberg, Andreas Stolcke, Dilek Hakkani-Tür, and Gokhan Tur, "Prosody-based automatic segmentation of speech into sentences and topics," Speech Communication, no. 1–2, September 2000.
[16] A. Stolcke and E. Shriberg, "Statistical language modeling for speech disfluencies," in Proceedings of ICASSP, Atlanta, GA, 1996.
[17] E. Charniak and M. Johnson, "Edit detection and parsing for transcribed speech," in Proceedings of NAACL, Pittsburgh, PA, 2001.
[18] Narendra Gupta and Srinivas Bangalore, "Extracting clauses for spoken language understanding in conversational systems," in Proceedings of EMNLP, 2002, pp. 273–280.