Evaluation of context effects in sentence recognition

Adelbert W. Bronkhorst
TNO Human Factors, POB 23, 3769 ZG Soesterberg, The Netherlands

Thomas Brand and Kirsten Wagener
Carl von Ossietzky Universität Oldenburg, AG Medizinische Physik, 26111 Oldenburg, Germany

(Received 11 April 2001; revised 17 October 2001; accepted 8 January 2002)

It was investigated whether the model for context effects, developed earlier by Bronkhorst et al. [J. Acoust. Soc. Am. 93, 499–509 (1993)], can be applied to results of sentence tests used for the evaluation of speech recognition. Data for two German sentence tests, which differed with respect to their semantic content, were analyzed. They had been obtained from normal-hearing listeners using adaptive paradigms in which the signal-to-noise ratio was varied. It appeared that the model can accurately reproduce the complete pattern of scores as a function of signal-to-noise ratio: both sentence recognition scores and proportions of incomplete responses. In addition, it is shown that the model can provide a better account of the relationship between average word recognition probability (p_e) and sentence recognition probability (p_w) than the relationship p_w = p_e^j, which has been used in previous studies. Analysis of the relationship between j and the model parameters shows that j is, nevertheless, a very useful parameter, especially when it is combined with the parameter j′, which can be derived using the equivalent relationship p_w,0 = (1 − p_e)^j′, where p_w,0 is the probability of recognizing none of the words in the sentence. These parameters not only provide complementary information on context effects present in the speech material, but they can also be used to estimate the model parameters. Because the model can be applied to both speech and printed text, an experiment was conducted in which part of the sentences was presented orthographically with 1–3 missing words. The results revealed a large difference between the values of the model parameters for the two presentation modes.
This is probably due to the fact that, with speech, subjects can reduce the number of alternatives for a certain word using partial information that they have perceived (i.e., not only using the sentence context). A method for mapping model parameters from one mode to the other is suggested, but the validity of this approach has to be confirmed with additional data. © 2002 Acoustical Society of America. [DOI: 10.1121/1.1458025] PACS numbers: 43.71.Gv [DOS]

I. INTRODUCTION

In both clinical and experimental audiology, sentence tests are used extensively for the evaluation of speech perception abilities in normal and hearing-impaired listeners. An important advantage of sentence tests over word tests is that the performance-intensity function is steeper, so that thresholds (e.g., the speech reception threshold or SRT: the level corresponding to a score of 50%) can be determined more accurately (Plomp and Mimpen, 1979; Nilsson et al., 1994; Kollmeier and Wesselkamp, 1997). Other practical advantages are that sentence recognition is a very easy task and that sentences are more representative of everyday conversation than isolated words. An obvious disadvantage of sentence tests is that they do not provide detailed information concerning phoneme perception and the occurrence of confusions. One of the main features of a sentence is that it normally contains a wealth of contextual information. This information can be globally divided into two parts: syntactic information, concerning the structure of the sentence, and semantic information, relating to its meaning (see, e.g., Boothroyd and Nittrouer, 1988, for examples of speech materials with varying contextual information). In addition, there may be

J. Acoust. Soc. Am. 111 (6), June 2002

coarticulatory effects which enhance the probability of correct identification of words within a sentence because of the information contained in transitions from and to neighboring words. Furthermore, there can be context effects that are not mediated by the sentence itself, but that depend on a priori knowledge of the listener of, for example, the sentence topic or the set out of which the sentences are drawn. Data obtained with various types of sentence material show that the performance-intensity function depends to a considerable degree on the amount of contextual information (Kalikow et al., 1977; Boothroyd and Nittrouer, 1988; Van Rooij and Plomp, 1991; Olsen et al., 1996). This affects the applicability of sentence tests, because when the sentence recognition scores should quantify (peripheral) hearing capabilities, it must be assumed that the influence of the contextual information is constant across listeners. A simple example of a situation where this assumption is violated is the repeated presentation of a restricted set of sentences to the same listener. This allows the listener to learn the set and causes a significant increase of the recognition scores (cf. the effect of set size on word recognition demonstrated by Miller et al., 1951). In several early studies, methods were proposed for the quantitative analysis of context effects in the recognition of



printed text; these include the measure of linguistic entropy described by Shannon (1951) and the "cloze" procedure used by Taylor (1953) and Treisman (1965). Such measures are also useful when analyzing speech recognition performance because they capture a considerable part of the contextual information present in speech. Van Rooij and Plomp (1991), for example, found that the linguistic entropy can account for differences of up to 3 dB in the SRT in noise, obtained with the corpus of everyday Dutch sentences developed by Plomp and Mimpen (1979). In a recent study by van Wijngaarden et al. (2002), it was shown that linguistic entropy can also be used to explain differences in speech recognition performance between native and non-native listeners. Methods based on printed text have, however, limited use in evaluating context effects in speech because of the fundamental differences between speech recognition and recognition of printed text. In the last decades, several alternative methods have been developed that do not suffer from this disadvantage because they can be applied directly to speech. The approach that is presently most widely used makes use of two equations, formulated by Boothroyd (1978) and Boothroyd and Nittrouer (1988). The first equation expresses the relationship between the recognition probabilities p_e and p_w of elements and wholes (e.g., phonemes and words, or words and sentences):

p_w = p_e^j,  (1)

where j is a parameter that depends on the amount of contextual information. This parameter is equal to n (the number of elements) when there is no context, and it decreases when the amount of context increases. The value of j indicates the effective number of independent elements in a whole. The second equation gives the relationship between the recognition probabilities p_e and q_e of elements with and without context (e.g., words presented in sentences and in isolation):

p_e = 1 − (1 − q_e)^k,  (2)
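Both equations can be inverted in closed form once the relevant probabilities have been measured. The following sketch (our illustration, not part of the original analysis) shows the resulting estimators for j and k:

```python
import math

def estimate_j(p_w: float, p_e: float) -> float:
    """Invert Eq. (1), p_w = p_e**j, to estimate j from measured scores."""
    return math.log(p_w) / math.log(p_e)

def estimate_k(p_e: float, q_e: float) -> float:
    """Invert Eq. (2), p_e = 1 - (1 - q_e)**k, to estimate k."""
    return math.log(1.0 - p_e) / math.log(1.0 - q_e)

# Five words with no context: p_w = p_e**5, so j recovers n = 5.
print(estimate_j(0.8 ** 5, 0.8))   # approximately 5
# Context that doubles the number of independent channels gives k = 2:
print(estimate_k(0.36, 0.2))       # approximately 2
```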

The parameter k has a value of 1 when there is no context, and it increases when context is added. As explained by Boothroyd and Nittrouer (1988), k can be interpreted as the proportional increase in the number of channels with independent information that occurs when context is added to an element. For details concerning the derivation of Eqs. (1) and (2), the reader is referred to Boothroyd and Nittrouer (1988). The parameters j and k have been used to quantify differences between speech tests (Boothroyd and Nittrouer, 1988; Bosman, 1989; Bronkhorst et al., 1993) and effects of hearing impairment on speech recognition (Olsen et al., 1996; Grant and Seitz, 2000). Another application of the parameter j is in models that predict the statistics of speech recognition scores (Kollmeier and Wesselkamp, 1997; Brand, 2000). Advantages of these parameters are that they provide a convenient, single measure of contextual information and that they can be derived relatively easily from speech recognition data. However, there is a problem involved in applying j to results for meaningful sentences, because several studies have found that j is then not constant but increases as a function of p_e (Boothroyd and Nittrouer, 1988; Kollmeier and Wesselkamp, 1997; Wagener et al., 1999b). This would suggest that it is preferable to use k instead of j in that case. However, because data for speech material with and without context must be combined in order to calculate k, estimation of this parameter is not only more difficult, but also less reliable, than estimation of j.

A radically different method for modeling context effects was recently proposed by Müsch and Buus (2001a), as part of their model for predicting speech intelligibility that is based on statistical signal detection theory. In the model, it is assumed that the listener correlates the internal representation of the perceived speech with templates of a number of alternatives, and that these correlations can be mapped to distributions with nonzero mean for the target and zero mean for the other alternatives. The mean of the target distribution depends on the parts of the frequency spectrum that are conveyed. The variances of the distributions are all the same, and they are determined by noise originating from three sources: nonideal production (articulation), audibility, and linguistic entropy (lack of context). An interesting aspect of this approach is that two sources of contextual information are treated separately. Information concerning set size (response alternatives) is dealt with directly in converting d′ (the mean of the target distribution divided by the standard deviation of the distributions) to recognition probability [see Eq. (1) in Müsch and Buus, 2001a]. This part is based on earlier work by Green and Birdsall (1958), who used a similar approach to explain the effect of set size on word recognition, as measured by Miller et al. (1951). The remaining contextual information is modeled by choosing a certain level of "cognitive" noise. Müsch and Buus (2001a, 2001b) have successfully applied their model to predict effects of set size, speech material, and frequency-domain distortions on word recognition.
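The standard m-alternative form of this d′-to-probability conversion, which goes back to Green and Birdsall, is P(correct) = ∫ φ(x − d′) Φ(x)^(m−1) dx: the target's correlate must exceed all m − 1 zero-mean alternatives. A minimal numerical sketch (our illustration, not code from the Müsch and Buus model) is:

```python
import math

def phi(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_correct(d_prime: float, m: int, lo: float = -8.0, hi: float = 8.0,
              steps: int = 4000) -> float:
    """Trapezoidal evaluation of the m-alternative recognition probability:
    the integral of phi(x - d') * Phi(x)**(m - 1) over x."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * phi(x - d_prime) * Phi(x) ** (m - 1)
    return total * h

# With d' = 0 the listener merely guesses among m alternatives:
print(round(p_correct(0.0, 10), 3))   # -> 0.1
```

Increasing d′ (better audibility) or decreasing m (a smaller response set) both raise the predicted recognition probability, which is the set-size effect of Miller et al. (1951) in this framework.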
However, because the predictions are based on multiple parameters, not only those related to context, it is difficult to determine how reliable the estimates of set size and cognitive noise power are. Furthermore, it is not clear whether the model is equally successful in predicting sentence recognition.

A third approach, which is somewhat similar to that of Boothroyd (1978) and Boothroyd and Nittrouer (1988), was developed by Bronkhorst et al. (1993). They presented a model that provides a single framework for the analysis of recognition probabilities of wholes and elements with or without context, and that can predict the probabilities p_w,m that m elements of the whole (m = 0,1,...,n, where n is the number of elements) are correctly recognized. In the model, the recognition process is broken up into two parts: recognition of elements without context (in isolation) and guessing of missing elements using contextual information. It is assumed that the chance of correctly guessing a missing element (c_i) is independent of the position of the element in the whole, but depends only on the number of missing elements (i). The probability that m elements of the whole are recognized is obtained by summing over all permutations of combinations of k recognized elements (k = 0,...,m) and m − k guessed elements. The parameters c_i, i = 1,...,n (the c values) quantify the amount of contextual information; their value lies between 1 over the number of response alternatives (when no context is present and a random choice has to be made) and 1. A more detailed explanation of the model can be found in Bronkhorst et al. (1993); the equations for calculating p_w,m are also given in Appendix A of this paper.¹

Bronkhorst et al. (1993) have shown that their model can be applied to recognition of CVC words, presented either auditorily or orthographically. Using c values that were estimated from word counts in a CVC lexicon, they were able to predict effects of set size on CVC word recognition. Recently, the model has also been used to improve the performance of multiband speech recognizers (Hagen and Bourlard, 2001). Drawbacks of the model are, however, that the mathematical equations involved are relatively complex, especially for large numbers of elements, and that the number of parameters is large (in principle, either n or n − 1; see Appendix A). As a result, the model cannot easily be applied to experimentally obtained scores, in particular when only average recognition scores of wholes and elements are determined. Furthermore, although application of the model to sentence recognition was added as an example in the paper by Bronkhorst et al. (1993), there has been no thorough evaluation of how well sentence recognition scores can be predicted by the model.

The purpose of the present paper is to address the abovementioned problems associated with the model of Bronkhorst et al. (1993), and to investigate more closely the relationship between this model and the approach developed by Boothroyd (1978) and Boothroyd and Nittrouer (1988). The paper can be divided into two parts. In the first part, the model is applied to speech recognition data obtained with two German sentence tests, which differ with respect to semantic content, and a comparison is made between results for auditory and orthographic presentation.
In the second part, the relationships between different context parameters (the c values, k, j, and j′, a new parameter introduced in this paper) are investigated in detail. All parameters quantify the same thing, and it is evident that they should be intimately related. The analysis of the relationship between the c values themselves is particularly relevant for application of the model because it is probable that they can be represented by a smaller number of truly independent parameters. One possible representation, based on a recursive relation, was already proposed by Bronkhorst et al. (1993) and verified for c values derived from a CVC lexicon (see footnote 5 of that paper).

II. PREDICTION OF RESULTS OF SENTENCE TESTS

A. Speech material

The data used in this study were obtained with two different German sentence tests: the so-called Göttingen and Oldenburg tests. For both tests, extensive validations have been carried out, and great effort was taken to equalize the sentences and sentence lists with respect to word recognition score (Kollmeier and Wesselkamp, 1997; Wagener et al., 1999a, 1999b). The Göttingen test comprises 200 simple, meaningful sentences of varying length (between three and seven words), pronounced by a male talker. The Oldenburg test is based on a test developed several years ago by Hagerman (1982) for the Swedish language. It consists of five-word sentences with a fixed syntactic structure that are constructed by randomly choosing one of ten alternatives for each of the five words. The structure of the sentences is: name, verb, number, adjective, object; e.g., "Peter gets three wet knives." The sentences are syntactically correct, but the meaning can be somewhat strange. Of the 10^5 different sentences that can be constructed in this way, ten lists of ten sentences were selected for the test. The sentences were generated by combining recordings of pairs of words, pronounced by the same talker as used for the Göttingen test. In terms of contextual information, the sentences of the Göttingen and Oldenburg tests are comparable to the high- and low-predictability sentences developed by Boothroyd and Nittrouer (1988).

B. Collection of speech recognition data

The speech recognition data used in this study were obtained from normal-hearing listeners with hearing thresholds better than 15 dB HL at octave frequencies ranging from 125 to 8000 Hz. The listeners were between 16 and 42 years of age. Recognition performance was measured using a procedure in which the signal-to-noise (S/N) ratio was varied adaptively. The masker was speech-shaped noise with a fixed rms level of 65 dB. In both tests, word scoring was used, and listeners could draw their responses from open sets. The data for the Göttingen test were collected in a study aimed at optimizing adaptive procedures with respect to their efficiency in estimating the SRT and the slope of the performance-intensity function (Brand, 2000, 2002). Results for 12 listeners were analyzed. Each of the listeners was presented with 1 list of 20 sentences and 6 lists of 30 sentences. The data for the Oldenburg test originated from experiments in which the learning effects that occur with this material were investigated (Wagener et al., 1999b). Results for two groups of normal-hearing listeners were used; one group (10 subjects) had previous experience with this test, the other (19 subjects) did not. In order to minimize the effect of learning in the present analysis, only 4 of the 5 lists (of 30 sentences) completed by the first group, and 2 of the 6 lists (of 20 sentences) completed by the second group, were included.

C. Application of the model to speech

FIG. 1. Patterns of responses obtained for the four-, five-, and six-word sentences of the Göttingen test (G4, G5, and G6) and the five-word sentences of the Oldenburg test (O5). The symbols indicate proportions of whole and partial sentences that are reproduced correctly, plotted as a function of the word recognition score. The dashed lines represent model predictions. The standard deviations between measured and predicted values are indicated in the upper right-hand corners of the panels.

As shown in Appendix A, the probabilities p_w,m that m of the n elements in a whole are recognized correctly can be expressed as functions of the variable q, the probability of recognizing elements without context, and the parameters c_i (i = 1,...,n), which quantify the amount of contextual information. When p_w,m is known for all values of m larger than 0, the average element recognition probability p_e can be calculated as well. However, when working with sentence material, it is normally not feasible to measure q directly, and the only available experimental data are estimates of p_w,m (and p_e) obtained as a function of a certain independent variable, for example the S/N ratio. In order to apply the model to these data, we have used an iterative procedure consisting of the following steps: (1) initial estimates of c_i (i = 1,...,n − 1) are chosen; (2) using Eqs. (A1) and (A2), the probabilities p_w,m (m = 0,...,n) and p_e are calculated for values of q between 0 and 1 in steps of 0.025; (3) the values of q that correspond to the measured values of p_e are determined by linear interpolation, and the probabilities p_w,m for those values are calculated; (4) the rms difference between measured and predicted values of p_w,m is determined; (5) when the rms difference is not minimal, new estimates of c_i are chosen and the procedure is continued at step (2). The minimization was carried out using the MATLAB routine fmins, which employs the Nelder–Mead simplex (direct search) method. The MATLAB routine that was used to evaluate p_w,m and p_e is listed in Appendix B.

In the adaptive procedure used for collecting speech recognition data, the step size for changing the S/N ratio was not fixed but was determined by the performance of the listener. Consequently, data were available for a continuum of S/N ratios. In order to obtain average values for fixed values of the S/N ratio, data within consecutive intervals with a width of 1.5 dB were pooled across sentence lists and listeners. When fewer than 15 sentences had S/N ratios falling within an interval, the results for that interval were discarded. For the Göttingen test, this resulted in data for 7, 9, and 6 S/N ratios for sentences with 4, 5, and 6 words, respectively; the results for sentences with 3 and 7 words were discarded because insufficient data were available. For the Oldenburg test, which consists only of five-word sentences, data for eight S/N ratios were obtained. Application of the iterative fitting procedure to these data resulted in estimates of c_1,...,c_(n−1) (n = 4,5,6) for the two types of material and for the three sentence lengths. It was assumed that c_n was equal to zero in all cases because no random guessing had been allowed during administration of the tests (see Appendix A).
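Steps (1)-(5) can be reproduced with any simplex minimizer; the sketch below uses SciPy's Nelder-Mead routine (the successor of MATLAB's fmins) together with a hypothetical stand-in for Eqs. (A1) and (A2), which are not reproduced here:

```python
import math
import numpy as np
from scipy.optimize import minimize

def model_pwm(q, c):
    """Hypothetical stand-in for Eq. (A1): returns p_{w,m}, m = 0..n, given
    element probability q and context parameters c (length n - 1).  The real
    model sums over recognized and guessed elements; here a binomial with a
    context-boosted word probability is used purely to exercise the loop."""
    n = len(c) + 1
    p = q + (1.0 - q) * float(np.mean(c)) * q   # context raises the word score
    return np.array([math.comb(n, m) * p**m * (1.0 - p)**(n - m)
                     for m in range(n + 1)])

def rms_misfit(c, pe_meas, pwm_meas):
    """Steps (2)-(4): tabulate the model on a q grid, interpolate at the
    measured p_e values, and return the rms deviation."""
    c = np.clip(c, 0.0, 1.0)
    qs = np.arange(0.0, 1.0001, 0.025)            # step (2)
    table = np.array([model_pwm(q, c) for q in qs])
    n = table.shape[1] - 1
    pe = table[:, 1:] @ np.arange(1, n + 1) / n   # p_e = (1/n) sum m p_{w,m}
    pred = np.array([[np.interp(p, pe, table[:, m])   # step (3)
                      for m in range(n + 1)] for p in pe_meas])
    return float(np.sqrt(np.mean((pred - pwm_meas) ** 2)))

def fit_c(pe_meas, pwm_meas, n):
    """Steps (1) and (5): initial guesses refined by the simplex search.
    pe_meas and pwm_meas would come from the pooled 1.5-dB bins."""
    res = minimize(rms_misfit, x0=[0.5] * (n - 1), args=(pe_meas, pwm_meas),
                   method='Nelder-Mead')
    return np.clip(res.x, 0.0, 1.0), res.fun
```

With synthetic scores generated from the stand-in model itself, fit_c drives the rms misfit to near zero; for real data, model_pwm must be replaced by the Appendix A equations.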
The data points and the optimal predictions are plotted in Fig. 1. Results for the Göttingen sentences with 4, 5, and 6 words, and for the Oldenburg sentences, are shown in separate panels. Each (dashed) curve corresponds to a certain value of m, the number of correctly perceived words. It can be seen that there is, in general, a good correspondence between data and predictions, as indicated by the standard deviations of the differences, shown in the upper right-hand corners of the panels. The obtained values of c_i are listed in Table I and plotted in the left-hand panel of Fig. 2. Standard errors of these estimates were calculated using the residuals and partial derivatives of p_w,m with respect to the c values (Snedecor and Cochran, 1978). For all types of material, estimates of j were also determined, using a one-dimensional iterative procedure similar to the one described above; the results are also listed in Table I. It should be noted that these represent average values of j, because the dependence of j on p_e (discussed in Sec. III B below) was ignored in this analysis.

D. Orthographic presentation of sentences

An experiment was performed in which 150 of the 200 sentences of the Göttingen test were presented orthographically with 1–3 missing words. Fifty sentences, all with 3, 5, or 7 words, were left out. This was done in order to limit the number of words to 4–6, and to get a more even distribution of sentences with different lengths than in the original set. The four-word sentences were presented only with one or two missing words; the others also with three missing words. Each number of missing words occurred in the same proportion of sentences (i.e., 50% for the four-word sentences and 33.3% for the five- and six-word sentences). The missing words in the sentences were chosen randomly. The sentences were divided into five lists of 30 sentences. The order of presentation of the lists was balanced over subjects. Five native German subjects participated in the experiment; they all worked at the Hanse Wissenschaftskolleg in Delmenhorst.


TABLE I. Context parameters obtained by fitting the present model and Eqs. (1) and (3) to sentence recognition data obtained with the Göttingen and Oldenburg tests. The standard errors (s.e.) of the estimates are given as well. The values in the upper and lower parts of the table represent results for auditory and orthographic stimulus presentation, respectively.

                           Göttingen test                          Oldenburg test
                   4 words         5 words         6 words           5 words
# words        Estimate  s.e.  Estimate  s.e.  Estimate  s.e.   Estimate  s.e.
Auditory
  c_1            0.72    0.03    0.87    0.01    0.92    0.01     0.53    0.03
  c_2            0.55    0.04    0.75    0.02    0.89    0.02     0.43    0.03
  c_3            0.19    0.03    0.44    0.03    0.71    0.04     0.28    0.03
  c_4                            0.13    0.02    0.36    0.04     0.12    0.01
  c_5                                            0.13    0.02
  j              2.25    0.19    2.29    0.34    2.41    0.29     3.96    0.33
  j′             2.50    0.26    3.02    0.15    3.32    0.18     3.43    0.08
Orthographic
  c_1            0.18    0.06    0.47    0.04    0.68    0.07     ≤0.1
  c_2            0.28    0.06    0.34    0.02    0.41    0.07     ≤0.1
  c_3                            0.18    0.02    0.29    0.03     ≤0.1

They were asked to complete the sentences, ensuring that the answers were meaningful and syntactically correct. It should be noted that we also obtained results for orthographic presentation of the Oldenburg test. However, for reasons explained below and in footnote 3, these results were discarded.

E. Application of the model to results for orthographic presentation

In analyzing results for orthographic presentation, the c values can be obtained directly, because the probabilities of perceiving individual words (without context) are then either 1 or 0, and at most one term remains when the probabilities p_w,m are evaluated using Eq. (A1). Thus, when subjects complete sentences from which one word was removed, an estimate of c_1 can be obtained by dividing the number of times that this word was answered by the number of sentences. Similarly, the proportions of sentences with 2, 3,... missing words that are "correctly" completed (i.e., the answer is equal to the target sentence) are estimates of c_1c_2, c_1c_2c_3, etc. Direct estimates of c_2, c_3,... are obtained by dividing the average number of "correct" words in sentences with 2, 3,... missing words by the number of those sentences.

In scoring the results, target words and answers were first converted to one type of notation, containing no diereses (i.e., "ä" was converted to "ae," etc.) and using "ss" instead of "ß." By counting both the average number of correct words and the number of correctly completed sentences, estimates of c_1, c_2, c_3, c_1c_2, and c_1c_2c_3 were obtained for the five- and six-word sentences, and estimates of c_1, c_2, and c_1c_2 for the four-word sentences. For each subject, the c values were determined iteratively, minimizing the squared deviations from the estimates.² Subsequently, results were averaged across subjects. The results are listed in Table I and presented graphically in Fig. 2 (right panel). It appears that, in all cases, the c values are much lower than those obtained for auditory presentation. The large discrepancy between the two sets of data is remarkable, and, although we were not able to obtain reliable estimates of the c values for orthographic presentation of the Oldenburg sentences,³ it is clear that these cannot be larger than 1 over the number of alternatives for each word (i.e., 0.1). This is indicated graphically in the right-hand panel of Fig. 2 by symbols with error bars. A possible explanation for the large discrepancy between results for the two presentation modes is given in Sec. IV C.
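The notation conversion and the counting scheme can be sketched in a few lines of Python (the words below are invented examples, not items from the Göttingen test):

```python
def normalize(s: str) -> str:
    """Convert to the dierese-free scoring notation ("ä" -> "ae", "ß" -> "ss")."""
    for a, b in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"),
                 ("Ä", "Ae"), ("Ö", "Oe"), ("Ü", "Ue"), ("ß", "ss")):
        s = s.replace(a, b)
    return s.lower()

def estimate_c1(targets, answers):
    """c_1: proportion of single-gap sentences completed with the target word."""
    hits = sum(normalize(t) == normalize(a) for t, a in zip(targets, answers))
    return hits / len(targets)

def estimate_c1c2(completed_ok, n_two_gap):
    """c_1 c_2: proportion of two-gap sentences reproduced exactly."""
    return completed_ok / n_two_gap

print(normalize("Straße"))                    # -> "strasse"
print(estimate_c1(["schöne"], ["schoene"]))   # -> 1.0
```

The normalization ensures that "schöne" and "schoene" count as the same answer, so only genuine word errors reduce the c-value estimates.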

FIG. 2. Model parameters (c values) plotted as a function of the index i (representing the number of missing words), determined by fitting the model to data obtained with the Göttingen (G4, G5, G6) and Oldenburg (O5) sentence tests. The left- and right-hand panels show results for auditory and orthographic stimulus presentation, respectively. The curves in the left-hand panel were obtained by fitting Eq. (5) to the data points; those in the right-hand panel result from transforming the fitted values to equivalent values for orthographic presentation, using the procedure described in Sec. IV C.


III. RELATIONSHIP BETWEEN CONTEXT PARAMETERS

A. Definition of the parameter j′

As mentioned in the Introduction, the parameter j, introduced by Boothroyd and Nittrouer (1988), can be interpreted as the effective number of independent elements in a whole. Normally, j is defined using Eq. (1), but it can also be defined using the following, equivalent equation:

p_w,0 = (1 − p_e)^j′,  (3)
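Like j, the parameter j′ follows from a single logarithm once p_w,0 and p_e have been measured. A small sketch (our illustration, with made-up numbers rather than Table I data):

```python
import math

def estimate_j_prime(p_w0: float, p_e: float) -> float:
    """Invert Eq. (3), p_w0 = (1 - p_e)**j', to estimate j'."""
    return math.log(p_w0) / math.log(1.0 - p_e)

# Without context, missing every word behaves like n independent misses:
# for n = 2 words at p_e = 0.5, p_w0 = 0.25 and j' = 2.
print(estimate_j_prime(0.25, 0.5))   # -> 2.0
# Context makes all-miss responses more likely than independence predicts,
# which lowers j' below n:
print(estimate_j_prime(0.4, 0.5))    # below 2
```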

The parameter is called j′ because j and j′ are, in general, not equal to each other. This is illustrated in Table I, which lists estimates of j and j′, obtained by fitting Eqs. (1) and (3), respectively, to the results of the Göttingen and Oldenburg speech tests. It appears that j′ is larger than j, regardless of sentence length, for the Göttingen data, and smaller than j for the Oldenburg data. The values of j′ thus cover a smaller range than those of j, which suggests that j′ depends less on the type of speech material than j does. As will be shown in Sec. III C, the parameter j′ is of interest because it quantifies the influence of context at low values of p_e, whereas j is related to context effects at high values of p_e.

B. Using the model to evaluate other context parameters

FIG. 3. Predicted dependence of j/n (upper panel), j′/n (middle panel), and k (lower panel) on the element recognition probability p_e for the c values obtained for the Göttingen (G4, G5, G6) and Oldenburg (O5) sentence tests. The variable n represents the number of words in the sentence.

Given that the model allows evaluation of p_w,n (= p_w), p_w,0, and p_e as a function of q and the c values, it is easy to express j, j′, and k, defined in Eqs. (1), (3), and (2), respectively, as functions of the same variables. The resulting equations are given in Appendix A. Shown as well are expressions that can be used to evaluate the three parameters for values of q close to 0 or 1. Because p_e is equal to q for these values, the same limits apply when the parameters are evaluated as a function of p_e. According to these expressions, and assuming that c_n = 0 and c_(n−1) > 0 (which will normally be the case), the limit of j for low values of p_e is 1, while j′ has a higher limit, equal to n/[1 + c_(n−1)(n − 1)]. Both j and j′ converge to a value of n for high values of p_e. The parameter k approaches 1 + c_(n−1)(n − 1) when p_e is small and is close to 1 when p_e is large. Numerical evaluations show that j typically increases as a function of p_e while j′ remains relatively constant, with a minimum for high values of p_e; k either decreases as a function of p_e or increases to a maximum for values of p_e close to 1. This is illustrated in Fig. 3, which shows the dependence of j/n, j′/n, and k on p_e, calculated using the c values for auditory presentation of the Göttingen and Oldenburg sentences listed in Table I. The curves were obtained by evaluating j, j′, k, and p_e for values of q between 0 and 1 using Eqs. (A3)–(A6). The predicted increase of j as a function of p_e is in agreement with results of earlier studies. Boothroyd and Nittrouer (1988), who assumed a linear relationship, found slopes of 0.87 and 2.23 for their high- and low-predictability sentence material, respectively. Results of listening tests with the German sentence material, performed at fixed S/N ratios, yield slopes of 0.68 and 1.88 for the Göttingen and Oldenburg material, respectively (Kollmeier and Wesselkamp,

1997; Wagener et al., 1999b). As can be seen in the top panel of Fig. 3, the model also predicts that the slope is largest for the Oldenburg material, and the slopes of the curves (in the middle part) are about the same as those observed by Kollmeier and Wesselkamp (1997) and Wagener et al. (1999b). (Note that, because j/n is plotted in the figure, the slopes of the curves should be multiplied by the number of words.)

C. Relationship between c values

A problem associated with the application of the present model is that the number of free parameters increases with the number of words in a sentence. Although it may be argued that the context effects in long sentences are, potentially, more complex than those in short sentences, it is questionable whether this increase in complexity should really require one parameter per extra word, especially when the
sentence only conveys one simple message (as is normally the case in sentence tests). It is therefore probable that there exists a relationship between the c values which can be modeled using only a few independent parameters. When the c values are interpreted as chances of filling in missing words, it is intuitively clear that such a relationship should exist because, on average, the number of possible sentences that fit an incomplete set of words should increase in an orderly manner when the number of words in the set is reduced. The increase is, of course, related to syntax, vocabulary, and meaning, but it should not be too different for sentences of varying length that show similar redundancy. The c values for the Göttingen sentences, plotted in Fig. 2, provide support for this point because they indeed show a very regular dependence on the index i (which can be interpreted as the number of missing words) for all sentence lengths. A simple way of modeling this dependence was already proposed by Bronkhorst et al. (1993). They suggested that the number of competing alternatives ν_i = (1/c_i − 1) is multiplied by a constant factor A_c when the number of perceived elements is reduced by one (i.e., it increases exponentially), while it is limited to a certain maximum value ν_max:

    \nu_{i+1} = \left[ \frac{1}{A_c \nu_i} + \frac{1}{\nu_{\max}} \right]^{-1}.    (4)

In this way, c values that were obtained by counting alternatives in a CVC lexicon (plotted in the left-hand panel of Fig. 3 of Bronkhorst et al., 1993) could be predicted accurately. It appears that the same rule can be applied to predict the dependence of the c values on the size of the set of CVC words: (1/c − 1) can be multiplied by a constant factor A_s each time the set size is doubled. An excellent fit (s.d. = 0.016) to all c values can then be obtained using only four independent parameters: c_1 for the smallest set, A_c, A_s, and ν_max. However, Eq. (4) is less successful in predicting the c values for the Göttingen sentence test. In particular, it cannot account for the decrease of the (negative) slope that occurs as a function of i, and it is only possible to obtain a reasonable fit when unrealistically large values of ν_max are used. A better fit is obtained when the increase in the (total) number of alternatives is modeled with a power function:

    c_{i+1} = c_i^{\alpha} + (1 - c_i^{\alpha}) \, \frac{c_{\min}}{1 + c_{\min}^{\alpha - 1}}.    (5)
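For concreteness, the recursion in Eq. (5) can be sketched in a few lines of Python. This is an illustrative implementation, not the authors' code; the example uses the fitted parameter values for the six-word Göttingen sentences quoted in the text (c_1 = 0.94, α_c = 2.69, c_min = 0.035), and c_n is set to 0 because subjects are asked to refrain from random guessing.

```python
def c_values(c1, alpha, c_min, n):
    """Iterate the power-function recursion of Eq. (5) to obtain
    c_1 ... c_{n-1} for an n-word sentence; c_n is set to 0 because
    subjects are asked to refrain from random guessing."""
    c = [c1]
    for _ in range(n - 2):
        ci_a = c[-1] ** alpha
        c.append(ci_a + (1.0 - ci_a) * c_min / (1.0 + c_min ** (alpha - 1.0)))
    return c + [0.0]

# Fitted values for the six-word Goettingen sentences quoted in the text
print([round(ci, 3) for ci in c_values(0.94, 2.69, 0.035, 6)])
```

The resulting sequence decreases monotonically from c_1 toward a floor near c_min, which is the regular pattern visible in Fig. 2.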

This equation will converge to a certain minimum (equal or close to c_min for values of α greater than about 1.5) when i is large. With this relation, one is able to predict the dependence of the c values on both i and the number of words in the sentence, using different values for α (α_c and α_l), but only a single estimate of c_min and, as fourth parameter, c_1 for the six-word sentences. In fitting the data, α_l is first substituted in Eq. (5) to obtain estimates of c_1 for the other sentence lengths; subsequently, α_c is used to predict the dependence of c on i for each of the three sentence lengths. The best fit (s.d. = 0.026) is obtained for c_1 = 0.94, α_c = 2.69, α_l = 2.23, and c_min = 0.035. Using the same value of c_min, but different values of c_1 and α_c (0.55 and 1.60, respectively), the c values for the Oldenburg test can also be reproduced relatively accurately (s.d. = 0.024). The results of the fits are shown in the left-hand panel of Fig. 2. Because one can also obtain a good four-parameter fit to the c values derived from the CVC lexicon with this equation, it probably represents a more generic description of the relationship between c values than Eq. (4).

Although the exact values of the parameters used in Eq. (5) will be different for each sentence test, it is not unreasonable to assume that the values found for the two German sentence tests, in particular α_c, α_l, and c_min, are good first-order approximations of parameter values for other, similar sentence tests. This makes it possible to apply the present model to other data using only a minimum number of free parameters. For example, assuming that the high- and low-predictability sentences of Boothroyd and Nittrouer (1988) are similar to the Göttingen and Oldenburg sentences, respectively, we can use the two values of α_c and the single value of c_min, given above, to fit the model to their data (listed in their Table III), while varying only one parameter: c_1. The optimal fits for the high- and low-predictability material were obtained for c_1 = 0.79 and 0.29, respectively. Standard deviations were 0.050 and 0.046, respectively. In comparison, Boothroyd and Nittrouer (1988) obtained standard deviations of 0.051 and 0.049, respectively, using a modified version of Eq. (1) in which j depends linearly on p_e.

D. Approximate relationships between j, j′, and the c values

Both j and j′ can be derived easily from the results of sentence tests, and the parameter j has been used in a number of previous studies (Boothroyd and Nittrouer, 1988; Bosman, 1989; Olsen et al., 1996; Kollmeier and Wesselkamp, 1997). It therefore seems useful to look more carefully at the relationship between j and j′ on the one hand, and the c values on the other, because, if estimates of the latter can be derived from the former, it becomes possible to apply the present model when less data are available, and without the complex iterative determination of the c values. Given that the analytic relationships given in Appendix A are not very helpful for practical purposes, we carried out numerical evaluations in order to obtain simpler, approximate relationships. It was assumed that the dependence of the c values on i (i = 1,...,n−1) can be modeled adequately with Eq. (5) and that c_n equals 0. For given values of c_i, estimates of j and j′ were determined by first calculating p_w and p_w,0 as a function of p_e using Eqs. (A1) and (A3) (for values of q that increase in steps of 0.025 between 0 and 1), and by subsequently fitting p_w and p_w,0, evaluated using Eqs. (1) and (3), respectively, to these data. Because we wanted to obtain average values of j and j′, the dependence of these parameters on p_e was not taken into account. In the calculations, we used values of n ranging from 4 to 7; c_1 had values between 0.2 and 0.9, increasing in steps of 0.1; α_c ranged from 1.5 to 5 (increment 0.5), and 1/c_min was chosen between 10 and 30 (increment 5). This resulted in a total of 1280 combinations

of these four parameters. The values of j and j′, obtained for these combinations, were divided by n and then submitted to a multiple linear regression analysis, using as independent variables both the four parameters above and the parameter c_{n-1} multiplied by n. The latter combination (which is not truly independent because it is obtained from other parameters) was added because initial calculations showed that it accounts for most of the variance of j′. The analysis revealed, rather surprisingly, that c_1 is very strongly correlated with j/n. More than 96% of the variance of j/n is explained by the following simple relationship:

    \frac{j}{n} = 1.04 - 0.63 \, c_1.    (6)

When only values of α_c greater than or equal to 2 are considered, the explained variance is even higher (99%). Because j does not depend on c_min, and only slightly on α_c (when α_c > 1.5), Eq. (6) will hold regardless of how exactly the dependence of c_i on i is modeled, as long as c_i does not decrease too slowly as a function of i. Equation (6) demonstrates that the parameters c_1 and j are essentially equivalent in most cases. Given that c_1 can be interpreted as the chance of completing a whole with only a single missing element, something which is most likely to occur when the element recognition probability p_e is high, it can be concluded that both c_1 and j quantify performance at high values of p_e. As mentioned above, it was found that j′ is highly correlated with nc_{n-1}. This relationship, however, turned out to be nonlinear, and a better correspondence was obtained when log(j′/n) was used as the dependent variable in the regression analysis. It appeared that 93% of the variance of j′/n is captured by the following relationship:

    \frac{j'}{n} = e^{-0.08 - 0.37 \, n c_{n-1}}.    (7)
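Because Eqs. (6) and (7) are simple, they can be inverted to obtain first estimates of c_1 and c_{n-1} from measured values of j and j′. The Python sketch below is illustrative only (the function names are ours, not from the study) and is, of course, no more accurate than the regressions themselves.

```python
import math

def c1_from_j(j, n):
    # Invert Eq. (6): j/n = 1.04 - 0.63 c_1
    return (1.04 - j / n) / 0.63

def cn_minus_1_from_jprime(j_prime, n):
    # Invert Eq. (7): j'/n = exp(-0.08 - 0.37 n c_{n-1})
    return -(math.log(j_prime / n) + 0.08) / (0.37 * n)
```

Given these two endpoint estimates, Eq. (5) can then supply the intermediate c values.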

The explained variance can be increased slightly (to 95%) when c_1 is included as the second parameter, but, for simplicity, this was not done. Because j′ mainly depends on n and c_{n-1}, and hardly on the other parameters, we can again conclude that the exact shape of the dependence of c_i on i does not matter. In addition, it can be deduced in a similar way as above that both c_{n-1} and j′ quantify performance at low values of p_e.

IV. DISCUSSION

A. Validity of the model

As stated in the Introduction, one of the aims of this paper is to check whether the present model can be used to predict results of sentence tests. This was investigated by applying it to data obtained with two different types of sentences, with varying length. It appeared that the model can accurately reproduce the complete pattern of responses—scores for reproduction of both complete and partial sentences—in all cases. As stated earlier (see footnote 1), this does not mean that the scheme underlying the model (in particular the two-stage process of sensory perception and guessing of elements) is a valid representation of actual speech perception. It only demonstrates that the model is able to extract all contextual information present in the scores. Interestingly, the results for the two sentence tests deviate in a way that reflects the fundamental differences between them: the c values for the Oldenburg test are not only smaller than those for the Göttingen test, indicating that there is less context, but they also decrease less as a function of the index i (interpreted as the number of missing elements), which means that the context depends less on the amount of perceived information. This can be explained by the fact that the context is mainly a priori information concerning syntax and the word set that is used.

The analysis of the relationship between j and the c values provides additional support for the model. In studies employing either the same sentence material as used here (Kollmeier and Wesselkamp, 1997; Wagener et al., 1999b) or different material (Boothroyd and Nittrouer, 1988), it was found that j is not constant, but increases as a function of the word recognition probability p_e. This increase is predicted by the model, and it is shown that the data of Boothroyd and Nittrouer (1988) that were fitted using a modified version of Eq. (1), in which j depends linearly on p_e, can be predicted at least as accurately by the model. The fact that no increase was found for other types of material (e.g., the zero-predictability sentences and the CVC words used by Boothroyd and Nittrouer, 1988) is not in contradiction with the predictions. As shown in Fig. 3 (top panel), there are cases when j is virtually constant over a wide range of values of p_e. When there is no context and the c values are all zero, this is trivial, because the model then predicts that j is constant (equal to the number of elements).

A shortcoming of the model is that it uses many parameters—normally the number of elements minus 1—which implies that an extensive data set is required to obtain reliable estimates. It appears, however, that the experimentally obtained c values show a clear regularity: they decrease monotonically as a function of the index i and they converge to a certain minimum. It is shown that this dependence can be modeled using a recursive relation with three free parameters: c_1, α_c, which determines the decrease as a function of i, and the minimum value c_min. Although the particular relation proposed here [Eq. (5)] may not be accurate in all cases—it has only been verified with a limited amount of data—it nevertheless provides a useful extension of the model because, in practice, the c values can only be determined with limited accuracy.

B. Relationship between context parameters

Another aim of the present paper was to perform a detailed analysis of the relationships between the context parameters j, k, and the c values. Within the framework of the model, this is straightforward because j and k can be expressed as functions of the c values (and of q, the probability of recognizing elements without context). Such a relationship can also be formulated for a new parameter, j′, defined in Eq. (3), which, just as j itself, can be interpreted as the effective number of elements in the whole. Although the relationships are useful for detailed evaluations, shown, for example, in Fig. 3, they are rather complex and not very helpful for practical purposes. We therefore carried out a
statistical analysis, using Eq. (5) to generate "realistic" sets of c values and concentrating on the parameters j and j′ which, just as the c values, can be derived from the scores obtained with one type of material (evaluation of the parameter k requires comparison of results for speech with and without context). The outcome shows that in most cases (when α_c > 1.5, which means that c_i should decrease not too slowly as a function of the index i), there are almost one-to-one relations between j and c_1, and between j′ and c_{n-1}. From this, two conclusions can be drawn. First, it appears that both j and j′ yield only partial information concerning the context: one can roughly say that j quantifies the effect of context when speech perception is good, and j′ that when speech perception is poor. This is not too surprising because j is based on responses that are entirely correct, and j′ on responses that are entirely incorrect. Second, using Eqs. (6) and (7) in combination with Eq. (5), one can easily map j (and j′) to c values, requiring estimates of only two (or one) additional parameter(s). This facilitates application of the model when only minimal information is available.

C. Effect of presentation mode

One of the advantages of the model, which was also utilized in the study by Bronkhorst et al. (1993), is that it can be applied to both speech and text. However, because the c values capture all contextual information present in the material, differences can occur between results for the two presentation modes, even when the speech is the verbal equivalent of the text. Such differences were found by Bronkhorst et al. (1993)—only for presentation of words in noise—and they were ascribed to coarticulation. Given that coarticulation and other potential factors, such as intonation, probably have a small effect in sentences,4 it is at first sight puzzling that the present results show such a large difference for both types of sentences. In order to understand this, we must look more closely at the way in which the stimuli are degraded in the two presentation modes. During orthographic presentation, elements have recognition probabilities of either 1 or 0, and no intermediate values. In fact, this corresponds most closely to the scheme underlying the model, because a clear distinction can be made between sensory perception of elements that are presented and pure guessing to fill in the missing elements. During auditory presentation, all elements are degraded to a certain extent and, except when the degradation is severe, they are never "missing": there is always some sensory information present. The fact that the model does not take this information into account when a certain element is taken as missing (i.e., when it would not be correctly identified without sentence context) is the most probable explanation of the discrepancy between results for auditory and orthographic presentation.5 Performance is better in the former case (the c values are higher) because, even when the sensory information is insufficient to identify the element without context, it can always be used to reduce the number of alternatives from which the response should be chosen.
A simple example is when the listener has only heard one phoneme of a certain word in a sentence. Without context, identification of the word would be virtually impossible (i.e., q is essentially zero), but when the context reduces the number of candidates to, say, ten (which means that c would be 0.1 during orthographic presentation), the additional information might well narrow it down to only two or three alternatives (and c would lie between 0.33 and 0.5).6 It would evidently be very useful to be able to transform a set of c values obtained for one presentation mode to equivalent values for the other presentation mode. For example, one could then predict the performance-intensity function of sentences before actually recording them. The relationship between results for the two presentation modes can also be used to differentiate between processing skills for speech and text, e.g., when testing children or non-native subjects. A possible approach for obtaining this transformation is derived here. However, given that we can base it only on a limited amount of data, it is tentative and subject to validation with further data. The approach is based on the observation that Eq. (5) can accurately predict the results for orthographic presentation, using values of α_c (for the dependence on the index i) and α_l (for the dependence on sentence length) that are proportional to the values for auditory presentation (i.e., that are obtained from the latter values by multiplying them with a constant β, where β < 1). The finding that α_c and α_l are lower for orthographic than for auditory presentation is in line with the explanation given above because, in the former case, the c values represent "real" contextual information that is spread out over the sentence and thus less sensitive to reductions of the number of words that are available than the local sensory information that is presumably used to reduce the number of alternatives during auditory presentation. Considering first the dependence of the c values on sentence length, and assuming that we can also apply Eq.
(5) to map values of c_1 from auditory to orthographic mode (disregarding the second term), it can be derived that the value of α to be entered in Eq. (5) is proportional to β^{6−n}, where n (n = 4,5,6) is the number of words in the sentence. In order to limit the number of free parameters to only one, we now make one additional assumption: that α reduces to a value of 1 when n is equal to 0. This implies that

    \alpha = \beta^{-n}.    (8)
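As an illustration of Eq. (8), the mapping of c_1 between modes reduces to a single power function. The sketch below is ours, not the authors' code; it assumes, as stated above, that the second term of Eq. (5) can be disregarded, and the value β = 0.71 used in the example is the fitted value reported in the text.

```python
def c1_orthographic(c1_auditory, beta, n):
    """Map c_1 from auditory to orthographic presentation by raising it
    to the power alpha = beta**(-n), per Eq. (8), with Eq. (5) reduced
    to its first (power-function) term."""
    return c1_auditory ** (beta ** (-n))
```

With the Göttingen six-word value c_1 = 0.94 and β = 0.71 this yields a substantially lower orthographic c_1, consistent with the weaker context found for text.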

We have used this expression to obtain estimates of c_1 for orthographic presentation from the fitted values of c_1 for auditory presentation, and we have applied Eq. (5), with α_c equal to 2.69β (using the value of α_c for auditory presentation, given in Sec. III C), to predict the dependence of c on the index i. The c values were then fitted to the data for orthographic presentation; the best result (s.d. of fit 0.039) is obtained for β = 0.71. The predicted values are plotted as lines in the right-hand panel of Fig. 2. Given that only one free parameter is used in the transformation, the agreement between data and predictions is remarkably good. The figure also shows predictions of c values for orthographic presentation of the Oldenburg sentences. These were obtained in the same way as above, keeping β the same, but now multiplying it by 1.6 (the value of α_c for that material). Because the predictions fall within the expected range (between 0 and 0.1), we can conclude that, although further validation is

clearly warranted, the present, simple method appears to be quite effective.

ACKNOWLEDGMENTS

This research was made possible by a fellowship granted to the first author by the Hanse Wissenschaftskolleg in Delmenhorst, Germany. The authors wish to thank Birger Kollmeier for his support and three anonymous reviewers for their critical evaluation of an earlier version of the paper.

APPENDIX A: MODEL EQUATIONS

The model of Bronkhorst et al. (1993) yields predictions of the probabilities p_w,m that m (m = 0,...,n) elements of wholes containing n elements are recognized. These are evaluated as a function of the probabilities that elements in position i of the whole (i = 1,...,n) are recognized without context (i.e., in isolation). The latter are designated by the symbol q_i. For simplicity, it will be assumed here that q_i is independent of i. The probabilities p_w,m can then be calculated as follows:

    p_{w,n} = q^n + n c_1 q^{n-1}(1-q) + \frac{n(n-1)}{2!} c_1 c_2 q^{n-2}(1-q)^2 + \cdots + c_1 c_2 \cdots c_n (1-q)^n,

    p_{w,n-1} = (1-c_1)\left[ n q^{n-1}(1-q) + 2\,\frac{n(n-1)}{2!}\, c_2 q^{n-2}(1-q)^2 + \cdots + n\, c_2 \cdots c_n (1-q)^n \right],

    p_{w,n-2} = (1-2c_2+c_1 c_2)\left[ \frac{n(n-1)}{2!}\, q^{n-2}(1-q)^2 + 3\,\frac{n(n-1)(n-2)}{3!}\, c_3 q^{n-3}(1-q)^3 + \cdots + n\, c_3 \cdots c_n (1-q)^n \right],

    \vdots

    p_{w,0} = \left[ 1 - n c_n + \frac{n(n-1)}{2!}\, c_n c_{n-1} - \frac{n(n-1)(n-2)}{3!}\, c_n c_{n-1} c_{n-2} + \cdots \right](1-q)^n.    (A1)

The element recognition probability p_e can be derived simply from the probabilities p_w,m using

    p_e = p_{w,n} + \frac{n-1}{n}\, p_{w,n-1} + \frac{n-2}{n}\, p_{w,n-2} + \cdots + \frac{1}{n}\, p_{w,1},    (A2)

which can also be expressed as

    p_e = q + c_1 q^{n-1}(1-q) + 2\,\frac{n-1}{2!}\, c_2 q^{n-2}(1-q)^2 + 3\,\frac{(n-1)(n-2)}{3!}\, c_3 q^{n-3}(1-q)^3 + \cdots + c_n (1-q)^n.    (A3)
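The expressions above can be evaluated numerically; the Python sketch below is an illustrative stand-in for the MATLAB procedure of Appendix B, not a transcription of it. The general terms (binomial coefficients times products of c values) are inferred from the leading terms printed in Eqs. (A1) and (A3) and should be read as an assumption.

```python
from math import comb, prod

def p_w_n(q, c):
    """First line of Eq. (A1): probability that all n elements are
    recognized; c = [c_1, ..., c_n].  Term m sums over m missed
    elements that are all filled in, with chance c_1 * ... * c_m."""
    n = len(c)
    return sum(comb(n, m) * q**(n - m) * (1 - q)**m * prod(c[:m])
               for m in range(n + 1))

def p_element(q, c):
    """Eq. (A3): element recognition probability p_e."""
    n = len(c)
    return q + sum(comb(n - 1, m - 1) * c[m - 1] * q**(n - m) * (1 - q)**m
                   for m in range(1, n + 1))
```

With all c values zero the two functions reduce to q**n and q, as the model requires when no context is present.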

A MATLAB procedure for the evaluation of these expressions is listed in Appendix B. In the equations, the parameters c_i represent the chances of correctly guessing one missing element given that i of the n elements were missed. Because it is, in general, more difficult to fill in an element when fewer elements have been perceived, c_i will normally decrease as a function of i. The value of c_n, the chance of guessing one element when nothing has been heard, has a special significance: it quantifies the effect of random guessing when no sensory information is present. As can be derived easily from Eq. (A3), it also is the lower limit of p_e. In most cases, subjects are asked to refrain from this kind of guessing, which means that c_n should be taken equal to zero. Using the above equations, one can easily express the context parameters j, j′, and k, defined in Eqs. (1), (3), and (2), respectively, as functions of q and c_i, i = 1,...,n:

    j = \frac{\log\{ q^n + n c_1 q^{n-1}(1-q) + \cdots + c_1 \cdots c_n (1-q)^n \}}{\log\{ q + c_1 q^{n-1}(1-q) + \cdots + c_n (1-q)^n \}},    (A4)

    j' = \frac{\log\left\{ \left[ 1 - n c_n + \frac{n(n-1)}{2!}\, c_n c_{n-1} - \frac{n(n-1)(n-2)}{3!}\, c_n c_{n-1} c_{n-2} + \cdots \right](1-q)^n \right\}}{\log\{ 1 - q - c_1 q^{n-1}(1-q) - \cdots - c_n (1-q)^n \}},    (A5)

    k = \frac{\log\{ 1 - q - c_1 q^{n-1}(1-q) - \cdots - c_n (1-q)^n \}}{\log(1-q)}.    (A6)

Unfortunately, these relationships cannot be simplified analytically, and numerical evaluation shows that the three context parameters depend in a complex manner on the c values. It is, however, relatively straightforward to calculate j, j′, and k for values of q close to 0 and close to 1:

    \lim_{q \to 0} j = \begin{cases} \dfrac{\log(c_1 \cdots c_n)}{\log(c_n)}, & c_n > 0 \\ m, & c_1,\ldots,c_{n-m} > 0,\; c_{n-m+1},\ldots,c_n = 0 \end{cases} \qquad \lim_{q \to 1} j = n,    (A7)

    \lim_{q \to 0} j' = \begin{cases} \dfrac{\log(1 - n c_n + n(n-1) c_n c_{n-1}/2 - \cdots)}{\log(1 - c_n)}, & c_n > 0 \\[1ex] \dfrac{n}{1 + c_{n-1}(n-1)}, & c_n = 0 \end{cases} \qquad \lim_{q \to 1} j' = n,    (A8)

    \lim_{q \to 0} k = \begin{cases} \infty, & c_n > 0 \\ 1 + c_{n-1}(n-1), & c_n = 0 \end{cases} \qquad \lim_{q \to 1} k = 1.    (A9)
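These limits can be checked numerically. The sketch below is illustrative Python (not the authors' code) that evaluates Eqs. (A4)–(A6) at a small q, using general terms inferred from the printed leading terms of Eqs. (A1) and (A3), and assuming c_n = 0 so that p_w,0 reduces to (1−q)^n.

```python
from math import comb, log, prod

def context_params(q, c):
    """Evaluate j, j', and k of Eqs. (A4)-(A6) from q and c = [c_1..c_n],
    assuming c_n = 0 so that p_{w,0} = (1 - q)**n."""
    n = len(c)
    p_wn = sum(comb(n, m) * q**(n - m) * (1 - q)**m * prod(c[:m])
               for m in range(n + 1))
    p_w0 = (1 - q)**n
    p_e = q + sum(comb(n - 1, m - 1) * c[m - 1] * q**(n - m) * (1 - q)**m
                  for m in range(1, n + 1))
    return (log(p_wn) / log(p_e),
            log(p_w0) / log(1 - p_e),
            log(1 - p_e) / log(1 - q))

c = [0.8, 0.55, 0.35, 0.2, 0.0]        # five-word example, c_5 = 0
j, j_prime, k = context_params(0.01, c)
# q -> 0 limits for c_n = 0: j' -> n/(1 + c_{n-1}(n-1)), k -> 1 + c_{n-1}(n-1)
```

At q = 0.01 the values of j′ and k are already within about 1% of their q → 0 limits, whereas j converges much more slowly toward its limit of 1.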

APPENDIX B: COMPUTER PROGRAM FOR EVALUATION OF THE MODEL EQUATIONS

A MATLAB program that evaluates Eqs. (A1) and (A2) when the c values and the number of elements are given. By default, calculations are performed for values of q—the recognition probabilities of elements without context—ranging from 0 to 1 in steps of 0.025. Optionally, a user-defined vector with q values can be specified.

1. It should be made clear that the present model is not designed to model the perception of speech or text; its only purpose is to provide a description of the pattern of errors that is obtained when speech or text is perceived in nonoptimal conditions. In explaining the model, we refer to perceptual and cognitive aspects when we state that we use probabilities that elements are perceived and chances that elements are guessed, but this is only to illustrate how the mathematical relationships can be interpreted.

2. In general, the stress in these fits was small. In other words, the product of c_1 and c_2 was always close to the observed value of c_1 c_2, and the product of c_1, c_2, and c_3 was very close to the observed value of c_1 c_2 c_3. This indicates that the hypothesis used in deriving the model—that c_i depends (on average) only on i, and not on the position(s) of the missing word(s) in the sentence—is valid for the results obtained with orthographic presentation.

3. The five subjects that participated in the experiment with the Göttingen sentences also completed 30 sentences from the Oldenburg test with one or two missing words. They were instructed to adhere to the fixed syntactical structure of these sentences. It appeared that c_1 and c_2 were essentially zero in this test. Subjects that have experience with the Oldenburg test can reach higher c values because they have learned part of the set of possible words. However, their performance could not be evaluated using orthographic presentation because a simple strategy (always filling in words of one sentence that they have remembered) already ensures that all the c values will be 0.1.

4. Evidence for this is provided by the fact that Boothroyd and Nittrouer (1988) found no context effects (a j equal to the number of words) with their zero-predictability sentences.

5. Our explanation thus suggests that the c values for auditory presentation overestimate the actual influence of context. It is important to note that this conclusion also applies to other context parameters, and in particular to j, because of the direct relationship between the c values and the other context parameters.

6. This explanation seems to be at odds with the fact that Bronkhorst et al. (1993) found virtually the same c values for CVC words presented orthographically and auditorily (in quiet). There are, however, two reasons why one would expect the effect of presentation mode to be smaller for CVC words than for sentences. First, there is a smaller difference between element recognition scores without and with context when the elements are phonemes than when they are words, for the simple reason that there are far fewer alternative phonemes than alternative words. As a result, a reduction of the number of potential alternatives based on partial sensory information has less effect. Second, it is more likely that useful partial information can be extracted from a word than from a single phoneme, because individual phonemes within a word can be recognized, in particular vowels with a relatively high level.

Boothroyd, A. (1978). "Speech perception and sensorineural hearing loss," in Auditory Management of Hearing-Impaired Children, edited by M. Ross and T. G. Giolas (University Park, Baltimore).
Boothroyd, A., and Nittrouer, S. (1988). "Mathematical treatment of context effects in phoneme and word recognition," J. Acoust. Soc. Am. 84, 101–114.
Bosman, A. J. (1989). "Speech perception by the hearing impaired," Doctoral thesis (University of Utrecht, The Netherlands).
Brand, T. (2000). "Analysis and optimization of psychophysical procedures in audiology," Doctoral thesis (University of Oldenburg, Germany).
Brand, T. (2002). "Efficient adaptive procedures for threshold and concurrent slope estimates for psychophysics and speech intelligibility tests," J. Acoust. Soc. Am. 111, 2801–2810.
Bronkhorst, A. W., Bosman, A. J., and Smoorenburg, G. F. (1993). "A model for context effects in speech recognition," J. Acoust. Soc. Am. 93, 499–509.
Grant, K. W., and Seitz, P. F. (2000). "The recognition of isolated words and words in sentences: Individual variability in the use of sentence context," J. Acoust. Soc. Am. 107, 1000–1011.
Green, D. M., and Birdsall, T. G. (1958). "The effect of vocabulary size on articulation score," Technical Memorandum No. 81 and Technical Note AFCRC-TR-57-58 (University of Michigan: Electronic Defense Group).
Hagen, A., and Bourlard, H. (2001). "Error correcting posterior combination for robust multiband speech recognition," Proceedings Eurospeech 2001, Scandinavia.
Hagerman, B. (1982). "Sentences for testing speech intelligibility in noise," Scand. Audiol. 11, 79–87.
Kalikow, D. N., Stevens, K. N., and Elliott, L. L. (1977). "Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability," J. Acoust. Soc. Am. 61, 1337–1351.
Kollmeier, B., and Wesselkamp, M. (1997). "Development and evaluation of a German sentence test for objective and subjective speech intelligibility assessment," J. Acoust. Soc. Am. 102, 2412–2421.
Miller, G. A., Heise, G. A., and Lichten, W. (1951). "The intelligibility of speech as a function of the context of the test materials," J. Exp. Psychol. 41, 329–335.
Müsch, H., and Buus, S. (2001a). "Using statistical decision theory to predict speech intelligibility. I. Model structure," J. Acoust. Soc. Am. 109, 2896–2909.
Müsch, H., and Buus, S. (2001b). "Using statistical decision theory to predict speech intelligibility. II. Measurement and prediction of consonant-discrimination performance," J. Acoust. Soc. Am. 109, 2910–2920.
Nilsson, M., Soli, S. D., and Sullivan, J. A. (1994). "Development of the Hearing In Noise Test for the measurement of speech reception thresholds in quiet and in noise," J. Acoust. Soc. Am. 95, 1085–1099.
Olsen, W. O., van Tasell, D. J., and Speaks, C. E. (1996). "Phoneme and word recognition for words in isolation and in sentences," Ear Hear. 18, 175–188.


Plomp, R., and Mimpen, A. M. (1979). "Improving the reliability of testing the speech reception threshold for sentences," Audiology 18, 43–52.
Shannon, C. E. (1951). "Prediction and entropy of printed English," Bell Syst. Tech. J. 30, 50–64.
Snedecor, G. W., and Cochran, W. G. (1978). Statistical Methods (Iowa State Press, Ames, USA), pp. 465–467.
Taylor, W. L. (1953). "Cloze procedure: A new tool for measuring readability," J. Quart. 30, 415–433.
Treisman, A. M. (1965). "Verbal responses and contextual constraint," J. Verbal Learn. Verbal Behav. 4, 118–128.
Van Rooij, J. C. G. M., and Plomp, R. (1991). "The effect of linguistic entropy on speech perception in noise in young and elderly listeners," J. Acoust. Soc. Am. 90, 2985–2991.
Van Wijngaarden, S. J., Steeneken, H. J. M., and Houtgast, T. (2002). "Quantifying the intelligibility of speech in noise for non-native listeners," J. Acoust. Soc. Am. 111, 1906–1916.
Wagener, K., Brand, T., and Kollmeier, B. (1999a). "Entwicklung und Evaluation eines Satztests für die deutsche Sprache. II. Optimierung des Oldenburger Satztests," Z. Audiol. 38(2), 44–56.
Wagener, K., Brand, T., and Kollmeier, B. (1999b). "Entwicklung und Evaluation eines Satztests für die deutsche Sprache. III. Evaluation des Oldenburger Satztests," Z. Audiol. 38(3), 86–95.
