Multichannel speech intelligibility and talker recognition using monaural, binaural, and three-dimensional auditory presentation

Rob Drullman a) and Adelbert W. Bronkhorst
TNO Human Factors Research Institute, Department of Perception, P.O. Box 23, 3769 ZG Soesterberg, The Netherlands

a) Electronic mail: [email protected]

(Received 13 January 1999; revised 19 July 1999; accepted 17 December 1999)

In a 3D auditory display, sounds are presented over headphones in such a way that they seem to originate from virtual sources in a space around the listener. This paper describes a study on the possible merits of such a display for bandlimited speech with respect to intelligibility and talker recognition against a background of competing voices. Different conditions were investigated: speech material (words/sentences), presentation mode (monaural/binaural/3D), number of competing talkers (1–4), and virtual position of the talkers (in 45° steps around the front horizontal plane). Average results for 12 listeners show an increase of speech intelligibility for 3D presentation with two or more competing talkers compared to conventional binaural presentation. The ability to recognize a talker is slightly better, and the time required for recognition is significantly shorter, for 3D presentation in the presence of two or three competing talkers. Although absolute localization of a talker is rather poor, spatial separation appears to have a significant effect on communication. For speech intelligibility, talker recognition, and localization alike, no difference is found between the use of an individualized 3D auditory display and a general display. © 2000 Acoustical Society of America. [S0001-4966(00)01104-3]

PACS numbers: 43.66.Pn, 43.66.Qp, 43.72.Kb [DWG]

INTRODUCTION

In various communication systems, such as those used for teleconferencing, emergency telephone systems, aeronautics, and (military) command centers, there may be a need to monitor several channels simultaneously. Conventional systems present speech over one or two channels, which may lead to reduced intelligibility in critical situations, i.e., when more than two talkers are talking at the same time. Alternatively, the signals can be presented by means of a 3D auditory display, where sounds presented over headphones are filtered binaurally in such a way that they seem to originate from virtual sources in a space around the listener. As in normal (nonheadphone) listening, such a 3D system makes much better use of the capacities of the human auditory system, particularly with respect to sound localization and spatial separation. Spatial separation of the voices improves speech perception (the "cocktail party effect," cf. Cherry, 1953) and may also facilitate the identification of the talkers. Spatialized or 3D audio over headphones is obtained by filtering an incoming signal according to head-related transfer functions (HRTFs). These transfer functions are an essential part of a 3D auditory display, because they simulate the acoustic properties of the head and ears of the listener, on which spatial hearing is based. HRTFs are essentially a set of filter pairs that contain the directional information of the sound source as it reaches the listener's eardrums. When listening over headphones, replacing the transfer from headphone to eardrums by the HRTFs results in the perception of a virtual sound outside the head of the listener. Thus, an external sound source can be simulated for any direction for which the HRTFs exist.

Several studies on the efficacy of 3D auditory displays for speech communication have shown positive results. These results were obtained by both HRTF processing and more generic binaural listening techniques. Bronkhorst and Plomp (1992) used artificial-head (KEMAR) recordings of short sentences in frontal position and temporally modulated speech noise (simulating competing talkers) at various other azimuths in the horizontal plane. They evaluated intelligibility in terms of the speech-reception threshold (SRT), i.e., the speech-to-noise ratio needed for 50% intelligibility. For normal-hearing listeners, the gain occurring when one to six noise maskers were moved from the front to positions around the listener varied from 1.5 to 8 dB, depending on the number of maskers and on their positions. Begault and Erbe (1994) used nonindividualized HRTF filtering for spatializing four-letter words ("call signs") against a background of diotic multitalker babble. With naive listeners, an advantage of up to 6 dB in SRT was found for 3D presentation (60° and 90° azimuths) compared with diotic presentation. In a subsequent study, Begault (1995) used words against diotic speech noise. Again, at ±90° azimuth an average advantage of 6 dB in SRT re diotic presentation was found. Ricard and Meirs (1994) used nonindividualized HRTFs to measure the SRT of synthetic speech in the horizontal plane against a bandlimited white-noise masker in frontal position. They found an average maximum threshold decrease of 5 dB when the speech source was shifted to ±90° azimuth.
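All of these displays rely on the filtering operation described above. The fragment below makes it concrete (a minimal Python sketch, assuming head-related impulse responses, HRIRs, are available as arrays; it is not the implementation used in any of the cited studies, and all names are placeholders): each talker's mono signal is convolved with the HRIR pair for its virtual azimuth, and the binaural results are summed.

import numpy as np
from scipy.signal import fftconvolve

def spatialize(mono, hrir_left, hrir_right):
    """Convolve a mono signal with an HRIR pair; returns an (N, 2) stereo array."""
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=1)

def mix_talkers(signals, azimuths, hrirs):
    """Spatialize each talker at its virtual azimuth and mix into one stream.

    signals  : list of 1-D arrays with equalized speech levels
    azimuths : azimuth in degrees per talker, e.g. [-90, -45, 0, 45, 90]
    hrirs    : dict azimuth -> (hrir_left, hrir_right), e.g. 128-tap filters
    """
    n = max(len(s) for s in signals) + max(len(hl) for hl, hr in hrirs.values()) - 1
    out = np.zeros((n, 2))
    for sig, az in zip(signals, azimuths):
        st = spatialize(sig, *hrirs[az])
        out[: len(st)] += st
    return out

Mixing only after filtering each source separately is what creates the percept of spatially separated talkers; presenting the same mixture without the HRIR step would correspond to the diotic conditions discussed above.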

The question of how intelligibility is affected when multiple talkers are presented at different (virtual) positions, in a way that each talker could be understood per se—a situation different from and presumably more difficult than using noise as a masker—was recently studied by a number of authors. Crispien and Ehrenberg (1995) used HRTF filtering for four concurrent talkers, each at a different azimuth and elevation, pronouncing short sentences. Although listeners knew the position of the desired talker and the same stimuli were presented three times, the intelligibility scores for words (not entire sentences) were on average 51%. In a simulated cocktail party situation, Yost et al. (1996) used bandlimited speech (words) uttered by up to three simultaneous talkers. Speech was presented over seven possible loudspeakers in a front semicircle around either a human listener, a single microphone, or a KEMAR. With three concurrent talkers, average word intelligibility for all utterances together was similar (about 40%) for live listening and listening to KEMAR recordings, whereas monaural listening scored only 18%. Peissig and Kollmeier (1997) measured subjective SRTs with HRTF filtering for sentences at 0°, masked by maximally three concurrent talkers at various azimuths. For normal-hearing listeners the maximum gains relative to presenting all talkers at 0° were 8 and 5 dB, for conditions with two and three competing talkers, respectively. Ericson and McKinley (1997) measured sentence intelligibility for two and four concurrent talkers in pink noise (65 and 105 dB SPL, respectively), using a speech-to-noise ratio of 5–10 dB. Subjects had to reproduce sentences that contained a certain call sign (i.e., the talker to monitor was not fixed). Diotic, dichotic, and directional presentations (KEMAR recordings in the horizontal plane) were compared. With two talkers, scores were more than 90% for both dichotic and directional presentation when the two talkers were separated by at least 90°. With four talkers and low-level noise, the advantage of directional over diotic presentation with a mixed group of male and female talkers was maximally about 30% (90° separation of the talkers). Finally, Hawley et al. (1999) used up to four concurrent sentences from one talker, presented over either loudspeakers or headphones (KEMAR recordings) at seven azimuths in the front horizontal plane. The azimuth configurations varied by taking different minimum angles between target and nearest competing speech. When all sentences originated from different positions, keyword scores for three competing sentences varied from about 65% (nearest competing sentence 30–90° from target) to about 90% (nearest competing sentence 120° or more from target).

In the present study, two types of speech material were used (words, sentences), spoken by up to five concurrent talkers, and the virtual positions of the talkers in the horizontal plane were varied systematically. The listener's task was to attend to a single target talker in the presence of one or more competing talkers. Presentation of the speech signals was done monaurally, binaurally, or via 3D audio. In addition, two modes of 3D presentation were used: one with individualized HRTFs and one with general HRTFs. Individualized HRTFs were adapted to the specific acoustic characteristics of the individual listener and had to be measured for each person, whereas general HRTFs were obtained from just one person and were used by all listeners.

Several studies (Wightman and Kistler, 1989b; Wenzel et al., 1993; Bronkhorst, 1995) have demonstrated that individualized HRTFs—and individualized headphone calibration (Pralong and Carlile, 1996)—are important for correct localization of virtual sound sources, but their use may be less relevant for speech intelligibility (cf. Begault and Wenzel, 1993).

Apart from speech intelligibility, two more aspects were investigated in this study, viz., talker recognition and talker localization. These points are relevant within certain contexts (e.g., teleconferencing, military communication), as it is not only important to know what is being said, but also who and where the talker is. Except for Pollack et al. (1954), who used different combinations of two concurrent talkers, there has not, to our knowledge, been any study on the effect of multiple talkers on talker identification or verification, either with humans or machines.

In summary, the present study extends the approaches taken by previous studies with multiple talkers in a number of ways: (1) two types of speech material were employed to measure intelligibility; (2) testing was done with both individualized and general HRTFs (most of the earlier studies used KEMAR); (3) monaural (monotic) and binaural (dichotic) conditions were studied in addition to 3D conditions; and (4) talker recognition and talker localization were measured in addition to speech intelligibility.

Many of the applications of 3D audio mentioned in the first paragraph make use of radio or telephone communication. Hence, listeners hear the speech signals through a limited bandwidth. Begault (1995) has shown positive results of 3D audio for both full (44.1-kHz sampling) and low (8-kHz sampling) bandwidth systems. Yost et al. (1996) used utterances low-pass filtered at 4 kHz. In order to obtain a reliable estimate of the performance of a 3D auditory display in critical situations, all speech signals in the present experiments were bandlimited to 4 kHz. This was the only restriction on the signals; no extra deteriorating effects such as speech coding were applied.

I. HEAD-RELATED TRANSFER FUNCTIONS

A. Measurement of individual HRTFs

Prior to the speech intelligibility and talker recognition experiments, the HRTFs of each individual subject were measured. The HRTF measurement setup is situated in an anechoic room and consists of a chair for the subject and a rotatable arc with a movable trolley on which the sound source is mounted. The source is a Philips AD 2110/SQ8 midrange tweeter with a frequency range of 0.25–20 kHz. The distance from the source to the center of the arc is 1.14 m. The stimulus is a computer-generated time-stretched pulse (Aoshima, 1981), equalized to compensate for the nonlinear response of the tweeter.

The sound reaching the subject's ears was recorded by two miniature microphones (Sennheiser KE-4-211-2) mounted in a foam earplug and inserted into the ear canals. This blocked-ear-canal method is different from measurements with miniature probe microphones (open-ear-canal measurement), where the sound is recorded close to the eardrum. Generally, either method can give good localization performance
(Wightman and Kistler, 1989a; Pralong and Carlile, 1994; Bronkhorst, 1995; Pösselt et al., 1986; Møller et al., 1995; Hartung, 1995). An advantage of the blocked-ear-canal method is the stability of the microphone position during the entire measuring process (including the subsequent headphone measurement), and a better signal-to-noise ratio at high frequencies compared to probe microphones.

A subject was seated on the chair in the center of the arc (reference position). In order to ensure the stability of the positions measured, head movements were monitored with a head tracker (Polhemus ISOTRAK). If the deviation from the reference position was more than 1 cm or 5°, the subject was instructed (by means of auditory feedback) to adjust his/her position. Angular deviations smaller than 5° were compensated for by adjusting the position of the sound source. A total of 965 positions were measured (more than strictly needed for the present experiments), evenly divided over 360° azimuth and all elevations above −60° (60° below the horizontal plane), with a resolution of 5–6°. The test signal was generated and recorded at a sampling rate of 50 kHz (antialias filter at 18.5 kHz, 24 dB/oct roll-off) by a PC board with a two-channel AD/DA converter and a DSP32C floating-point processor. The signal had a duration of 10.2 ms, and a time window was used in order to eliminate reflections. The level of the signals at the reference position was 70 dBA. The average of 50 signals per direction (played with 25-ms intervals in one measurement) was adopted as the HRTF of each ear.

After the free-field measurements, the transfer function of the sound presented through headphones (Sennheiser HD 530) was determined. As the transfer from headphone to ear depends somewhat on the specific placement of the headphone on the subject's head, ten headphone measurements were performed and the average (calculated in the dB domain) was adopted as the headphone-transfer function. For the implementation in the 3D auditory display, the free-field HRTFs of each subject were deconvolved by his/her headphone-transfer function and adapted to the sampling rate of 12.5 kHz used in the listening experiments. Each HRTF was implemented as a 128-tap finite impulse response (FIR) filter.
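A rough outline of this post-processing chain is sketched below (Python; a hedged illustration rather than the authors' actual code: the paper does not specify how the spectral division was stabilized, so the regularization constant, like all names here, is an assumption).

import numpy as np
from scipy.signal import resample_poly

FS_MEAS = 50000   # sampling rate of the HRTF measurements (Hz)
FS_OUT = 12500    # sampling rate used in the listening experiments (Hz)
N_TAPS = 128      # length of the final FIR filters

def average_responses(recordings):
    # average of the 50 responses recorded for one direction and ear
    return np.mean(np.asarray(recordings), axis=0)

def window_reflections(ir, n_keep):
    # time window: keep only the direct sound, discarding reflections
    return ir[:n_keep]

def equalize_by_headphone(hrir, headphone_ir, eps=1e-3):
    # deconvolve the free-field response by the headphone-transfer function;
    # eps regularizes the division at spectral nulls (an assumption here)
    n = 2 * max(len(hrir), len(headphone_ir))
    h = np.fft.rfft(hrir, n)
    p = np.fft.rfft(headphone_ir, n)
    g = h * np.conj(p) / (np.abs(p) ** 2 + eps)
    return np.fft.irfft(g, n)

def to_playback_filter(ir):
    # resample 50 kHz -> 12.5 kHz (factor 4) and truncate to 128 taps
    ir = resample_poly(ir, up=1, down=FS_MEAS // FS_OUT)
    return ir[:N_TAPS]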

FIG. 1. Examples of 12 individual HRTFs (thin lines) and the general HRTF (heavy line) for the right ear at two different azimuth angles. The dotted vertical line marks the upper frequency of 4 kHz employed in the present study.

B. General HRTFs

One of the goals of the present study was to assess the effect of the use of individualized HRTFs versus nonindividualized (general) HRTFs on speech intelligibility and talker recognition. Therefore, prior to the formal perceptual experiments described below, a pilot experiment was carried out in order to find the best general HRTF set among a set of eight (i.e., HRTFs from eight different persons). The selection was based on a relatively difficult localization task, presenting four directions on the left and four directions on the right that lie approximately on the cone of confusion, at a virtual distance of 1.14 m (the same task was used by Bronkhorst, 1995). Using computer-generated pink-noise stimuli (50-kHz sampling, 18-kHz bandwidth, 500-ms duration), eight subjects had to indicate the direction of the stimuli by pressing labeled buttons on a hand-held box. The subjects were different from the persons whose HRTFs were used, and also different from those who participated in the listening experiments described in Secs. II and III. The results (in terms of absolute scores) showed the best set to have on average 53% correct localization. This was not significantly higher than for the other seven sets, but it did have a significantly lower rate of front–back confusions. As a consequence, these HRTFs were selected as the general HRTFs to be used in the subsequent experiments. As with the individualized HRTFs, the general set was implemented as FIR filters with 128 taps at a 12.5-kHz sampling frequency. Figure 1 gives examples of the individual HRTFs of the 12 subjects that were used in the listening experiments (Secs. II and III) and of the general HRTF. The two panels refer to two different azimuths in the horizontal plane. Particularly up to 4 kHz (the upper frequency in the experiments), all HRTFs are quite similar.

II. SPEECH INTELLIGIBILITY IN A MULTITALKER ENVIRONMENT

A. Speech material

The experiment on the effect of presentation mode on speech intelligibility consisted of two parts: one with monosyllabic words and one with short sentences. In this way, the influence of redundancy (absence or presence of a meaningful context) could be assessed.

1. Words

The word material consisted of 192 meaningful Dutch consonant–vowel–consonant (CVC) syllables (Bosman, 1989). Recordings of the CVC syllables were made with four male talkers and one female talker (25–45 years old). The recordings were made on digital audiotape (DAT) in an anechoic room. One of the male talkers was used as target talker throughout the experiment, i.e., it was the task of the subjects to understand what he said. The other talkers were
competing talkers. Target and competing speech were always selected from the same set of CVC syllables. The digital output from the DAT was stored via an Ariel DAT link in separate computer files (48-kHz sampling rate, 16-bit resolution). In order to equalize the levels of the different talkers, the words were grouped in 16 lists of 12 and the A-weighted speech level of each list was determined (i.e., speech parts that were more than 14 dB below peak level were discarded, cf. Steeneken and Houtgast, 1986). As this method is not applicable to single words, the levels were calculated for 12 concatenated words. The single words were then rescaled so as to have the desired speech level. Any remaining variation in level between the words is to be attributed to the normal variation found in everyday speech. Finally, the words were downsampled to 12.5 kHz (using standard MATLAB software, including appropriate low-pass filtering) and digitally low-pass filtered at 4 kHz.1
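The level-equalization step can be sketched as follows (a minimal Python illustration, not the authors' code; A-weighting is omitted for brevity, and the frame length is an assumption rather than a value from the paper).

import numpy as np

def speech_level_db(signal, fs, frame_ms=20, range_db=14.0):
    """Mean level over the speech-active parts: frames more than `range_db`
    below the peak frame level are discarded (cf. Steeneken and Houtgast, 1986)."""
    n = int(fs * frame_ms / 1000)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    power = np.mean(frames ** 2, axis=1)
    level = 10 * np.log10(power + 1e-12)
    keep = level > level.max() - range_db
    return 10 * np.log10(np.mean(power[keep]))

def equalize(words, fs, target_db):
    """Scale a list of words so that their concatenation has the target level,
    mirroring the use of 12 concatenated words described above."""
    concat = np.concatenate(words)
    gain_db = target_db - speech_level_db(concat, fs)
    gain = 10 ** (gain_db / 20)
    return [w * gain for w in words]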

2. Sentences

The sentence material for the target talker consisted of 540 everyday sentences of eight to nine syllables. They were read by a trained male talker in a soundproof room and recorded directly onto computer hard disk, using a sampling rate of 44.1 kHz, a 16-kHz antialiasing low-pass filter, and 16-bit resolution. The sentences for the competing talkers consisted of 65 similar sentences (Plomp and Mimpen, 1979), read by three male talkers and one female talker. Recordings of these talkers were made on DAT in a soundproof room; the single sentences were stored in separate files via the DAT link. For both the target talker and the competing talkers (ages 30–45), all sentences were individually equalized with respect to speech level. Subsequently, all sentences were downsampled to 12.5 kHz and low-pass filtered at 4 kHz.

FIG. 2. Survey of the conditions for monaural, binaural, and 3D presentation (top view of subject looking ahead), with T as position of the target talker and 1–4 as positions of the competing talkers. Only conditions with the target talker in the frontal position or in the right quadrant are shown; conditions for the left quadrant were created by mirroring across the median axis.

B. Experimental design

The listening experiment consisted of two identical tests, viz., one with the words and one with the sentences as stimuli. Either listening test was set up to assess the following aspects:
(1) presentation mode, i.e., monaural, binaural, and 3D with individual or general HRTFs;
(2) number of competing talkers;
(3) in the case of 3D presentation, the positions of the target talker and the competing talker(s).

Figure 2 shows a survey of all experimental conditions. The number of competing talkers varied from one to four. In the monaural presentation, this led to four conditions. In the binaural condition a selection of six conditions was made, with two varieties in the cases of three and four competing talkers. In the 3D presentations, the possible positions of the talkers were five directions in the horizontal plane: −90°, −45°, 0°, 45°, or 90° azimuth, where a negative sign refers to the left-hand side and a positive sign to the right-hand side. Thus, the talkers were placed in a virtual space on the front horizontal semicircle around the listener. The talkers' utterances were mixed for presentation to the listeners; in the case of 3D, mixing was done after filtering each source
separately with the appropriate HRTF. While varying the position of the target talker, the positions of the competing talkers were chosen in such a way that the angle from the target to the nearest competing talker was either 45°, 90°, or 135° (in one case, 180°). Figure 2 only displays conditions where the target talker is in the right quadrant.

In total, 4 (monaural) + 6 (binaural) + 22 (3D, individualized HRTFs) + 22 (3D, general HRTFs) = 54 conditions were considered per listening test. In each condition, 10 words/sentences were presented. There were not sufficient words to have a different stimulus per condition; most words were presented twice and some three times in order to meet the required 540 CVC stimuli. Single words have low redundancy (and thus a low chance of being remembered), unlike the sentences, of which 540 separate stimuli existed.
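One way to assemble the stimulus for a single 3D trial, anticipating the selection constraint described in the procedure below (no two talkers pronounce the same item), is sketched here; the helper and its names are hypothetical, not the authors' code.

import random

def build_trial(words, target_word, competing_azimuths):
    """Return (word, azimuth) pairs: competing words are mutually distinct
    and all differ from the target word."""
    pool = [w for w in words if w != target_word]
    competing = random.sample(pool, len(competing_azimuths))
    return list(zip(competing, competing_azimuths))

Each selected source would then be filtered with the HRIR pair for its azimuth and the filtered signals summed, as sketched in the Introduction.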

C. Procedure

A total of 12 students from the University of Utrecht, with ages ranging from 20 to 26 years, participated as subjects. They did not report any hearing deficits and had not recently been exposed to loud noises. The subjects were paid for their services. Subjects were tested in a soundproof room in two sessions, one for the words and one for the sentences. Half of the subjects started with the words, the other half with the sentences. For each session and each subject a different sequence was made for the mode of presentation and talker configuration (Fig. 2). The order in which the stimuli were presented was fixed; the sequence of the presentation modes was varied according to a 4×4 Latin square, to avoid possible order and learning effects. Within each presentation mode the talker configurations were pseudorandomized, in the sense that trials with a fixed position for the target talker were presented in blocks and the number of competing talkers was assigned at random. The number of trials per block varied from 60 to 90, depending on the 3D condition (cf. Fig. 2, with ten words/sentences per condition).

Competing talkers 1, 3, and 4 were men; competing talker 2 was a woman. This numbering corresponds to the order in which competing talkers were added. The words/sentences they pronounced were randomly selected in such a way that no two competing talkers pronounced the same word/sentence, which in turn was always different from the one pronounced by the target talker. At the beginning of a block with a new position for the target talker, subjects would first hear three words/sentences from the target talker alone. This was done to make the subject aware of the target talker's position, so that he/she would focus on that voice and that position during the following trials in the block. These three words/sentences were always the same and did not occur as test trials. Subjects received the target talker in the right quadrant in the word test and in the left quadrant in the sentence test, or vice versa. In 3D, the competing talkers could come from all directions, as shown in Fig. 2. In terms of Fig. 2, subjects got either the conditions shown there or conditions mirrored across the median axis.

The stimuli were generated by two PC sound boards, each with a DSP32C processor and a two-channel DA converter. The outputs from the sound boards were mixed (separately for left and right), led through a 4.5-kHz antialiasing low-pass filter (Krohn-Hite 3342), and presented to each
subject through Sennheiser HD 530 headphones. The level of presentation was approximately 65 dBA, as verified by measurements with an artificial ear (Brüel & Kjær 4152). Each stimulus was presented only once, and the task of the subject was to reproduce the word or the sentence of the target talker without a single error. Target talker and competing talkers started speaking at the same moment. Depending on the particular set of stimuli, the offset synchrony would vary by up to a few hundred ms. No feedback as to the correctness of the subject's responses was given. Before the actual test, a practice round with 18 stimuli (from the same target talker, but with items that did not occur in the test) was presented in order to familiarize the subjects with the procedure. The practice round consisted of binaural presentations and 3D presentations with general HRTFs.
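The mode-ordering scheme mentioned above can be illustrated with a simple cyclic 4×4 Latin square over the four presentation modes; whether the authors used a cyclic or a fully balanced square is not stated, so this is only an illustration.

MODES = ["mon", "bin", "3D-individual", "3D-general"]

def latin_square(items):
    # row r is the cyclic shift of the item list by r positions, so every
    # mode occurs exactly once in each row and in each column
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]

for group, order in enumerate(latin_square(MODES), start=1):
    print(f"subject group {group}: {order}")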

D. Results

FIG. 3. Mean scores for words (panel A) and sentences (panel B) as a function of the number of competing talkers, with presentation mode as a parameter. The curves for 3D give the averages of the different talker configurations; the hatched area around the 3D curves indicates the range of scores for the talker configurations (based on the pooled data for 3D with general and individual HRTFs).

For each condition the percentage of correctly received words/sentences was scored. Subsequent analysis of the results was done by means of an analysis of variance for repeated measures. Unless specified otherwise, tests were performed at the 5% significance level. For the sake of convenience, we will abbreviate the different conditions for monaural, binaural, and 3D presentation as mon, bin, and 3D, respectively, and write the number of competing talkers and the talker configuration in parentheses, with reference to Fig. 2. For example, mon(3) stands for monaural with three competing talkers; bin(4B) stands for binaural with four competing talkers, configuration B; 3D(1E) stands for 3D presentation, one competing talker, configuration E.

First of all, for both words and sentences there were no significant differences between the 3D conditions with individualized and general HRTFs. In view of the similarity of the HRTFs (cf. Fig. 1), this was not surprising. For the subsequent statistical analysis, the results of the two 3D presentations per condition were averaged.

Figure 3 shows the mean percentage of words (panel A) and sentences (panel B) correct for the different presentation modes as a function of the number of competing talkers. The scores decrease as the number of competing talkers increases, as expected. For binaural presentation, the results for three and four competing talkers only refer to conditions
bin(3B) and bin(4B), respectively. It appeared that bin(3A) and bin(4A) showed very high scores (at least 92% correct for words), similar to bin(1A). Hence, extra competing talkers in the ear opposite the target talker have no negative effect on intelligibility. The data for 3D have been averaged over the different talker configurations. These average scores are represented by the black filled circles (individual HRTFs) or squares (general HRTFs), which are connected by a solid line. The hatched area around the 3D curves indicates the range of scores (averaged over individual and general) for the different talker configurations. On average, 3D presentation scores for sentences are 83% and 51% in the case of two and three competing talkers, respectively, compared to 43% and 1% for binaural presentation. The effects of presentation mode, number of competing talkers, and their interaction are significant. The trends for words and sentences are quite comparable: significantly higher intelligibility scores for 3D presentation, particularly with two and three competing talkers. Even with four competing talkers the benefit of 3D is present, although intelligibility remains rather low. As illustrated by the hatched areas in Fig. 3, the results with 3D presentation differ for the various talker configurations. A discussion of these results can be found in the Appendix.

III. TALKER RECOGNITION IN A MULTITALKER ENVIRONMENT

A. Speech material

The speech material for the talker-recognition experiment consisted of ten sentences which had been read out from Dutch newspaper articles by 12 male and 4 female talkers. The sentences were relatively long fragments, varying from 3.9 to 7.1 s with an average of 6.3 s. They had been recorded directly onto the computer hard disk via an Ariel Proport 565, sampled at 16 kHz with 16-bit resolution. The sentences were digitally low-pass filtered at 4 kHz and downsampled to 12.5 kHz. In the same way as for the intelligibility experiment, all sentences were equalized individually with respect to their A-weighted speech levels.

Of the 12 male talkers, two were selected to be the target talkers. One of them had served as target talker of the CVC syllables in the previous intelligibility experiment. As the performance of talker recognition is expected to depend on the particular talker, the second talker was selected to have different objective (long-term average spectrum) and subjective characteristics (vocal effort, monotony, speaking rate). The target talkers read out ten extra sentences that were used to familiarize the subjects with the particular talkers ("talker-familiarization material"). These sentences were processed in the same way as the test material.

B. Experimental design

The experimental conditions were identical to those used in the intelligibility experiment. All ten sentences were used once in each condition. With respect to the relevant experimental factors employed (presentation mode, number of competing talkers, talker configuration in 3D), the question
was whether the subjects could recognize the target talker and, if present, localize him among a number of concurrent talkers. The identification—or rather verification: target talker present or not—is actually no more than a simple yes/no detection task. Over all experimental conditions, six out of the ten sentences per condition were pronounced by the target talker and four by a different talker. Apart from recognizing and localizing the target talker, the time listeners needed to come to their decision was recorded as well. In this way an extra differentiation could be made between the conditions, making it possible to test the hypothesis that 3D presentation would demand less time for recognition.

C. Procedure

The same 12 subjects as in the intelligibility experiment participated in this experiment. The order of the stimuli was randomized for each subject. The sequence of the presentation modes was varied according to 4×4 Latin squares, which were different for the two target talkers. Within each presentation mode the talker configuration and left/right presentation were randomized. As in the intelligibility experiment, competing talkers 1, 3, and 4 were men and competing talker 2 was a woman. For each trial, the male competing talkers were randomly selected from ten talkers, the female talker from four. In case the target talker was not presented, a substitute was also selected out of the ten male talkers. It was not excluded that more than one talker (either target or competing talker) would pronounce the same sentence. On each trial all talkers started speaking at virtually the same time; mutual delays were within 20 ms.

The setup and equipment for generating the stimuli were the same as for the intelligibility experiment. The subjects' responses were registered by means of a button box connected to a Tucker-Davis PI2 module. This box had five "target buttons" for the five different directions and one button designated "not present." The task of the subjects was to decide whether the target talker was presented or not, and to press the appropriate button as soon as they were confident. Thus, the response/direction and the reaction time were registered for each condition. Measurement of the reaction time started at the onset of the target sentence. The subjects were not informed about the presentation mode, nor was any feedback given as to the correctness of their response.

The experiment was run in two sessions, one session per target talker. Half of the subjects started with the first target talker, the other half with the second one. Before the actual session, subjects had to listen carefully to the target talker, in order to ensure recognition of the correct talker. A voice-familiarization list of ten sentences (not included in the test) was presented twice, monaurally and via 3D with general HRTFs. After that, subjects were given a practice round of 48 sentences with competing talkers, presented binaurally and in 3D. During this practice round, feedback was given.

D. Results

As it was not the aim to draw conclusions on the recognizability of one particular talker, the data of both target talkers were pooled before any further analysis of the results. This also has the advantage that more data points per condition were available, which increases the validity of the subsequent data processing. Of the 12 subjects, one lost track of the target talker during the second test and one had been pressing continuously on one of the buttons during the first test. Their data were discarded.

FIG. 4. Mean recognition scores for four presentation modes as a function of the number of competing talkers. The curves for 3D give the averages of the different talker configurations; the hatched area around the 3D curves indicates the range of scores for the talker configurations (based on the pooled data for 3D with general and individual HRTFs).

1. Recognition

The examination of the recognition scores was done after the raw responses in each condition were transformed to an unbiased percentage of correct responses. This method uses the theory of signal detection (Macmillan and Creelman, 1991; Gescheider, 1997) to estimate the true recognition scores, i.e., corrected for guessing. The unbiased scores were used for the subsequent analyses of variance for repeated measures.
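One standard way to carry out such a correction for a yes/no task is the equal-variance Gaussian model of signal detection theory, in which the unbiased proportion correct equals Phi(d'/2). The exact formula used in the paper is not given, so the sketch below is an assumption consistent with Macmillan and Creelman (1991).

from scipy.stats import norm

def unbiased_percent_correct(hits, n_signal, false_alarms, n_noise):
    # log-linear guard against hit/false-alarm rates of exactly 0 or 1
    h = (hits + 0.5) / (n_signal + 1)
    f = (false_alarms + 0.5) / (n_noise + 1)
    d_prime = norm.ppf(h) - norm.ppf(f)   # sensitivity, free of response bias
    return 100 * norm.cdf(d_prime / 2)    # unbiased percent correct

# Example: 6 target-present and 4 target-absent sentences per condition,
# with 5 hits and 1 false alarm.
print(round(unbiased_percent_correct(5, 6, 1, 4), 1))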

Figure 4 shows the recognition scores (direction not necessarily correct) for the four presentation modes as a function of the number of competing talkers. Again, as in the intelligibility experiment, the results for individualized and general HRTFs in the 3D conditions did not show a significant difference. The effects of presentation mode and number of competing talkers are significant, but their interaction is not. Compared to intelligibility (Fig. 3), the effect of competing talkers is (far) less pronounced for talker recognition. The slopes of the four curves are virtually identical. Going from one competing talker to four, the average decrease in performance is from about 88% to 69%. Because of the virtually identical scores for the two 3D presentation modes, the results for general and individual HRTFs were pooled and the unbiased percent-correct recognition per condition was recomputed prior to the statistical analysis. Subsequent post hoc testing (Tukey HSD) of both factors revealed a slightly better performance for 3D than for monaural and binaural presentation (81% vs 74%/75%), and a gradual, significant decrease when going from one to two competing talkers (86% to 80%) and on to three or four competing talkers (72% and 68%, respectively). Overall, the configuration of the talkers in 3D presentation has no effect on the listeners' performance (see the Appendix).

FIG. 5. Average localization scores for 3D presentation modes as a function of the number of competing talkers. The hatched area around the curves indicates the range of scores for the different talker configurations (based on the pooled data for 3D with general and individual HRTFs).

2. Localization

In order to analyze the localization scores, only those responses were considered in which the target talker was indeed presented and recognized by the subject. Also for localization, no significant difference was found between the 3D conditions with individualized and general HRTFs, and the averages of the two were used for further analysis. Figure 5 shows the localization performance for the 3D presentation modes as a function of the number of competing talkers. The main effect is significant, and post hoc analysis (Tukey HSD) revealed significantly decreasing steps going from two to four competing talkers (57% down to 43%). As with recognition (Fig. 4), the decrease in localization performance is quite gradual. Overall, the localization scores are rather poor, with relatively best performance when the target talker is at 0° azimuth (see the Appendix).

3. Reaction times

Analysis of the reaction times was performed on those responses where the target talker was indeed presented and recognized by the subject, but not necessarily correctly localized. Before going into the statistical analysis of the results, there is one aspect of the experimental design that should be mentioned first. The task given to the subjects consisted in fact of two subtasks which were executed more or less simultaneously. The subtasks were (1) recognizing the target talker and, if recognized, (2) determining his location. Pressing a button could only be done as soon as the second subtask was accomplished. In the case of monaural presentation, this boils down to a simple yes/no task, since determining the location is trivial (all talkers are presented to one ear). There is a random factor of presenting the signals to the left or right ear, which makes it slightly less trivial. With binaural presentation there are three buttons to be considered (−90°, 90°, and not present), and with 3D presentation six buttons (five directions and not present). It is known that reaction time increases as the number of response alternatives increases (cf. Wickens, 1984). This means that the reaction times for 3D will a priori be longer than those for monaural and binaural presentation. Therefore, no direct comparison between the three presentation modes is possible.

In order to get an estimate of the effect of having to make a choice out of two, three, or six buttons on the reaction times, a subset of the conditions was presented to six new subjects. This time the question was simply: is the target talker present or not? The button box contained only two buttons, assigned "yes" and "no." As in the original recognition test, subjects listened to the voice-familiarization sentences and got a practice round in order to familiarize themselves with the procedure. The conditions tested consisted of the following subset: mon(2), mon(4), bin(1A), bin(3B), bin(4B), 3D(1A), 3D(2D), 3D(2F), 3D(3B), 3D(4A), 3D(4B), and 3D(4C). For the 3D conditions, only those with general HRTFs were used. The results of this experiment were compared with the results for the same subset in the original experiment. It turned out that the differences in reaction times were on average 660 ms for monaural, 720 ms for binaural, and 1360 ms for 3D. These differences are merely a consequence of simplifying the task of the listener. It appears that this even applies to monaural presentation. A separate analysis of variance on the reaction-time differences showed that monaural and binaural were the same, but 3D was significantly higher. On the basis of these results, it was concluded that the a priori longer reaction time for 3D presentation in the original test was at least 600 ms. Therefore, 600 ms was subtracted from the original 3D results, so that the reaction time of mere recognition was obtained and a comparison between the presentation modes could be made. The correction of 600 ms was fixed for all 3D conditions, as there was no statistical evidence that it depends on the number of competing talkers.

Figure 6 gives the corrected mean reaction times for the four presentation modes as a function of the number of competing talkers. There are significant effects of presentation mode, number of competing talkers, and the interaction between them. An analysis of the main effects per presentation mode revealed that the reaction times for binaural and 3D presentation increase significantly as the number of competing talkers increases. Further analysis (planned comparisons) showed 3D to have the shortest reaction times for up to three competing talkers. Compared to binaural presentation with two to four competing talkers, 3D presentation gives on average 840-ms shorter reaction times. There is hardly an effect of 3D talker configuration on the reaction times (see the Appendix).

FIG. 6. Mean reaction times (with 600-ms correction for 3D) as a function of the number of competing talkers. The curves for 3D give the averages of the different talker configurations; the hatched area around the curve indicates the range of scores for the configurations (based on the pooled data for 3D with general and individual HRTFs).
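The size of the adopted correction can be checked against the retest numbers reported above (a worked sketch, not an analysis from the paper):

# mean reaction-time reduction produced by simplifying the task to yes/no
diffs_ms = {"mon": 660, "bin": 720, "3D": 1360}
# monaural and binaural did not differ, so their mean estimates the pure
# task-simplification effect common to all modes
baseline = (diffs_ms["mon"] + diffs_ms["bin"]) / 2   # 690 ms
excess_3d = diffs_ms["3D"] - baseline                # about 670 ms
# the paper adopts a conservative fixed correction of 600 ms for all 3D data
print(excess_3d >= 600)                              # True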

IV. DISCUSSION

The effects of monaural, binaural, and 3D presentation of bandlimited speech were investigated with respect to intelligibility and talker recognition in situations with multiple concurrent talkers. The main question in this paper was to what extent a 3D auditory display can offer benefits over monaural and binaural presentation in realistic, critical conditions. In summary, the results show that for two or more competing talkers, 3D presentation yields better intelligibility than monaural and binaural presentation, and somewhat better and much faster talker recognition.

Another important result is that there is no significant difference in performance between the use of individualized and nonindividualized HRTFs. This is true for all experimental findings in this paper: intelligibility, talker recognition, and talker localization. The absence of a difference is probably due to the bandlimited nature of the speech materials and, consequently, the use of low-pass filtered HRTFs. In addition, for spatial information in the horizontal plane, a detailed definition of the HRTFs is less critical for intelligibility and talker recognition. Hawley et al. (1999) also did not find differences between intelligibility scores of sentences presented in the front horizontal plane by means of loudspeakers or KEMAR recordings, the latter in fact being a kind of nonindividualized HRTF. For localization, however, they did find better absolute scores in the actual sound field than with virtual presentation.

A. Intelligibility

In the monaural condition with only one competing (male) talker, relatively low scores for both words and sentences are found. With an average speech-to-masker ratio of 0 dB, a score for sentences of only 36% (all words correct) indicates that subjects had great difficulty separating the two male voices. In a similar monaural test, Stubbs and Summerfield (1990) found a score of 57% for two concurrent male talkers. However, this score refers to the percentage of keywords reproduced correctly, not entire sentences, for which the score would clearly be lower. A keyword score of about 45% was reported by Hawley et al. (1999) for monaural virtual listening, i.e., monaural listening to KEMAR recordings of two concurrent sentences produced by one male talker at ±90°. Ericson and McKinley (1997) found a relatively high sentence-intelligibility score of 70%–75% for diotic presentation in pink noise [5–10 dB signal-to-noise ratio (SNR)]. Festen and Plomp (1990) found an SRT for sentences of about +1 dB in the case of a male talker masked by his own voice, viz., time-reversed speech. This implies a score below 50% for a speech-to-masker ratio of 0 dB. It should be noted
that much lower SRTs are normally obtained when target and competing talker are of opposite sex or when modulated noise is used (Festen and Plomp, 1990; Hygge et al., 1992; Peters et al., 1998). The monaural results for two and more competing talkers are consistent with the steepness of the intelligibility curve near threshold (10%–15% per dB). That is, with two competing talkers, the speech-to-masker ratio is −3 dB and the sentence score drops to 5%, eventually reducing to 0% for additional competing talkers.

The binaural curves for three and four competing talkers in Fig. 3 are based on the configurations bin(3B) and bin(4B), i.e., with two competing talkers at the same ear as the target talker. There is no significant difference between binaural scores for three and four competing talkers, indicating that two competing talkers instead of one at the other ear does not decrease performance. One can even state that the presence or absence of competing speech in the other ear does not influence intelligibility at all, in view of the results for the monaural conditions: comparing mon(1) with bin(2B) and mon(2) with bin(3B) or bin(4B) justifies this conclusion.

We see that 3D presentation gives the highest intelligibility for more than one competing talker. For one competing talker, similar results can be obtained with binaural presentation. The findings for one and two competing speakers are essentially in agreement with results on spatial separation found by Yost et al. (1996) for three and two sources, respectively. Although there are situations in which binaural presentation with three or four competing talkers is superior (viz., when only the target talker is presented to one ear and all competing talkers to the other ear), one can in general conclude, when drawing horizontal iso-intelligibility lines in Fig. 3, that binaural presentation with two or three competing talkers yields about the same intelligibility as 3D presentation with three or four competing talkers, respectively. So, 3D presentation allows for an extra competing talker compared to binaural presentation, and for two more talkers compared to monaural presentation. Moreover, 3D presentation makes it possible to follow any of the talkers more easily (although certain azimuths have a slight advantage, see the Appendix).

In the present experimental design, the target talker and each of the competing talkers were given equal speech levels. This means that with two, three, and four competing talkers, the speech-to-masker ratio decreases, on average, by 3, 4.8, and 6 dB, respectively (the combined level of N equally loud maskers exceeds that of a single masker by 10 log10 N dB). Apart from the increase in background level, a second aspect plays a significant role, viz., the change in spectro-temporal properties from a single voice to voice babble. Both factors increase the speech-reception threshold (cf. Festen and Plomp, 1990).

A number of 3D configurations of our experiment can be compared with the results for KEMAR recordings that Bronkhorst and Plomp (1992) obtained from normal-hearing subjects, i.e., their conditions with sentences presented at 0° azimuth and fluctuating speech noise at ±30° and ±90°. We used different sentence material (and a different talker) for the target speech, and HRTFs instead of artificial-head filtering, but we may get some insight as to the effect of using real instead of simulated (modulated noise) speech.

TABLE I. Comparison of the differences in scores between monaural and 3D presentation for selected conditions in the present study with the difference in estimated scores (based on differences in SRT) between frontal and spatialized presentation of KEMAR recordings in a study by Bronkhorst and Plomp (1992). Maskers refer to competing talkers or modulated noise, respectively.

                              Present study    Bronkhorst and Plomp(a)
  # Maskers   Condition       Diff. score      Diff. SRT (dB)   Est. diff. score
  1           3D(1G)          59%              7.0              >70%
  2           3D(2G)          67%              3.6              40%-55%
  4           3D(4G)          16%              0.9              10%-15%

  (a) Differences in SRT including a 1-dB threshold increase for diotic instead of monaural presentation.

Correcting for the total level of competing speech, as mentioned above, Bronkhorst and Plomp found SRTs of −20.0 and −11.2 dB for configurations 3D(1G) and 3D(2G), respectively. In view of the psychometric curve of speech-to-noise ratio vs intelligibility, which is also valid for fluctuating noise (Festen and Plomp, 1990), these SRTs imply sentence-intelligibility scores near 100% for equally loud target and competing speech. The scores found in the present study amount to 95% and 72%, respectively. Hence, one may tentatively conclude that having two talkers instead of noises [configuration 3D(2G)] makes the task more difficult. This finding is corroborated when taking configuration 3D(4G), for which Bronkhorst and Plomp (with two maskers at ±30° instead of ±45° in the present case) come to a level-corrected SRT of −4.0 dB, corresponding to a sentence intelligibility of about 65%.2 This is much higher than the score of 16% we find with four competing talkers.

If we consider the gain of 3D compared to monaural presentation in the above three conditions, we get difference scores (for sentences) as shown in the third column of Table I. The same data for the fluctuating noise maskers of Bronkhorst and Plomp (1992) are shown relative to their conditions of frontal presentation, that is, the conditions in which all sources were recorded by KEMAR at 0° azimuth and presented binaurally to the listeners. The difference-SRT values are compensated for an assumed 1-dB gain for diotic vs monotic presentation. From these difference SRTs, an estimate is made of the expected intelligibility score (right column), under the assumption of a slope of 10%–15% per dB. Given the differences between the two studies, the variance in the data, and the assumptions made, we may conclude that the scores are more or less in line. One has to bear in mind, though, that the use of multiple fluctuating noise maskers is not always a good predictor for modeling multiple talkers, particularly when they are of the same sex. It does not account for segregation of voices and for the distraction that may occur due to simultaneous intelligibility of two or more talkers (cf. Festen and Plomp, 1990; but see Duquesnoy, 1983).
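The estimates in the right column of Table I follow directly from multiplying the SRT differences by the assumed slope; a short worked check with the values from Table I (the table's published ranges are rounded, so the match is approximate):

# expected score difference = SRT difference (dB) x slope (% per dB),
# with the slope of the psychometric function taken as 10-15% per dB
for srt_diff in (7.0, 3.6, 0.9):
    low, high = 10 * srt_diff, 15 * srt_diff
    print(f"{srt_diff} dB -> {low:.0f}%-{high:.0f}%")
# 7.0 dB exceeds the usable range of the scale, hence ">70%" in the table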

B. Talker recognition

For all presentation modes, the recognition scores depend less on the number of competing talkers than intelligibility scores.
The recognition scores we found for 3D presentation are slightly higher than those for monaural or binaural presentation. They vary from 89% with one competing talker to 71% with four (Fig. 4). These relatively high values indicate that talker recognition is an easier task than word or sentence reception. Recognizing the target talker may be described as waiting for the moment that spectral and/or temporal dips occur in the competing voices. In a way this process is similar to understanding a message, but for talker recognition it may end at an earlier stage, as one does not need to capture the entire utterance of the target talker. As the number of competing talkers increases, the background level and the spectro-temporal "filling" increase, making recognition gradually more difficult.

While the effect on recognition per se (target talker present or not) appears not to be very dependent on the number of competing talkers, a different picture arises for the time needed for actual recognition (Fig. 6). For both binaural and 3D presentation, reaction times increase significantly when the number of competing talkers increases. Subjects apparently have to wait longer for the moment to "catch a glimpse" of the target talker. In other words, listeners still have a good chance of being sure they heard the target talker, but they take their time. The results on reaction times show there is a clear release from masking when three or more voices are spatially separated, yielding an average difference of over 800 ms compared with binaural presentation. With only one competing talker, 3D presentation with maximum spatial separation, 3D(1A), does not do better than binaural presentation, bin(1A). The difference in reaction time of about 150 ms, as found in the original test (after the 600-ms correction) and in the retest, appeared not to be statistically significant. In general, for one competing talker there is no significant difference between bin(1A) and the average of the 3D talker configurations (as plotted in Fig. 6).

For the monaural presentation, an asymptote in reaction time is already reached with two competing talkers. At first glance it looks as if subjects had great difficulty recognizing the talker in these conditions and more or less "gave up" by pressing a button. But this explanation is contradicted by the monaural recognition scores themselves (66% or more correct, Fig. 4), which are not significantly different from the binaural scores. An alternative explanation may be that recognition with monaural presentation involves a relatively easy task of selective attention, whereas with binaural or 3D presentation the listener uses divided attention (cf. Yost et al., 1996). It should be noted that reaction times for 3D presentation found in this study overestimate those that will occur in realistic situations, since the location of the talker will then generally be known in advance.

C. Localization

Absolute localization of the target talker seems generally poor and becomes gradually more difficult as the number of competing talkers increases (Fig. 5). With a closed response set of five alternatives, the localization scores are on average around 50%. The finding that even for localization the use of either individualized or nonindividualized
HRTFs does not yield significantly different results is not really surprising in view of the relatively low intersubject variability in HRTFs for frequencies below 4 kHz (cf. Fig. 1; see also Shaw, 1974; Møller et al., 1995). The bandlimited nature of the speech signals and the consequent use of low-pass filtered HRTFs eliminate certain perceptual cues. These cues are primarily associated with the perception of elevation and externalization (cf. Bronkhorst, 1995). The former aspect is not particularly relevant in the present experiments, as all sources were presented in the horizontal plane. The latter point may, however, have played a role in the relatively poor localization scores, as some subjects mentioned hearing the voices close to their head and not at a distance.

Begault and Wenzel (1993) studied horizontal-plane localization of wideband speech stimuli of a single talker presented to inexperienced listeners, using nonindividualized HRTFs and open-set responses. They report that up to 46% of the stimuli were heard inside the head and found an average error angle of 28°. They conclude that most listeners can obtain useful azimuth information, the results being comparable with those for broadband noise stimuli. This conclusion was also drawn by Ricard and Meirs (1994) for synthesized, bandlimited speech, and by Gilkey and Anderson (1995), who used real sources and compared performance for speech and click stimuli. The latter study only reports left–right errors for four subjects, which were on average 16°. In studies with simultaneous speech sources in the virtual horizontal plane, Hawley et al. (1999) found on average 72% correct for two to four concurrent sentences, with seven response alternatives. In those experiments the listeners knew the sentence in advance and only had to pinpoint its location. In another study with bandlimited speech in the horizontal plane, Yost et al. (1996) had subjects localize all two or three words that were presented simultaneously. With a closed response set of seven azimuths, scores were over 80% for two talkers and about 60% for three talkers. These are relatively high scores, but one has to keep in mind that the subjects could listen to the presentation as often as they liked. Our score of 51% (three competing talkers) seems comparatively low, but the task for our listeners was more complex, i.e., first detecting the target talker and subsequently determining his location.

In summary, localizing a target talker among a number of competing talkers when signals are bandlimited does yield relatively poor scores, but not to a dramatic extent in the light of the results reported in the literature. Besides, spatial separation as obtained in a 3D auditory display is more important for intelligibility and talker recognition than accurate localization.

V. CONCLUSIONS

From the results of the experiments in this paper, using bandlimited (4-kHz) speech signals and truncated HRTFs, the following conclusions can be drawn:

(1) There is no difference in performance between a 3D auditory display based on individualized HRTFs and one based on general HRTFs. This conclusion applies to all scores assessed for speech intelligibility, talker recognition (including the time required for recognition), and talker localization. This means that no individual adaptation of a bandlimited (4-kHz) communication system is needed in a practical application of an auditory display with many users.

(2) Compared to conventional monaural and binaural presentation, 3D presentation yields better speech intelligibility with two or more competing talkers, in particular for sentence intelligibility. Equivalent performance is achieved with 3D presentation compared to binaural presentation when one talker is added, and compared to monaural presentation when two or three talkers are added. However, in specific conditions (all competing talkers on the side opposite the target talker) binaural presentation may be superior to 3D. Within the 3D configurations examined, intelligibility is highest when the target talker is at −45° or 45° azimuth.

(3) Talker-recognition scores are higher for 3D than for monaural and binaural presentation, but the differences are small. Recognition scores depend less on the number of competing talkers than intelligibility scores. The virtual positions of the talkers in 3D are not a relevant factor.

(4) For binaural and 3D presentation, the time required to correctly recognize a talker increases with the number of competing talkers. For two or more competing talkers, 3D presentation requires significantly less time than binaural presentation.

(5) Absolute localization of a talker is relatively poor and becomes gradually more difficult as the number of competing talkers increases.

ACKNOWLEDGMENTS

This work was supported by the Royal Netherlands Navy. We wish to thank Dr. Niek Versfeld and colleagues of the Experimental Audiology Group at the ENT Department of the Free University Hospital in Amsterdam for providing part of the sentence material. The comments and suggestions of the three anonymous reviewers on earlier versions of this paper are greatly appreciated.

APPENDIX: EFFECT OF TALKER CONFIGURATION IN 3D PRESENTATION

In both the intelligibility and the recognition experiments, the configuration of the talkers in the case of 3D presentation was a factor in the design. In this Appendix we will discuss in more detail two situations with respect to the minimum angle between the target talker and the nearest competing talker, viz., 45° or 90°. These conditions will be referred to as 45°-tca and 90°-tca, respectively (tca = target-competing-talker angle). The second aspect of the talker configuration is the azimuth of the target talker, which could have an absolute value of 0°, 45°, or 90°. Planned comparisons at a 5% significance level were performed for the statistical analyses. Plots are shown only for those cases in which relatively substantial differences between configurations were found.

Figure A1 displays the mean scores for words and sentences in the 90°-tca and 45°-tca conditions as a function of the azimuth of the target talker. The 90°-tca conditions only occurred with one or two competing talkers and, for one azimuth (90°), with three competing talkers (filled single diamond). Words and sentences show the same trend with respect to the azimuth of the target talker. In the 90°-tca conditions with only one competing talker, the target talker's position is not relevant; with two competing talkers, a significantly worse score is found if the target talker is placed at 0°. The best target azimuth is 45°. In the 45°-tca conditions the highest scores are found at 45° or 0° with one competing talker, at 45° with two competing talkers, at 45° or 0° with three competing talkers, and at any azimuth with four competing talkers. Hence, it would be best to place the target talker at 45°. Comparison of the 90°-tca and 45°-tca results for one and two competing talkers generally shows that the scores are equal or higher for 90°-tca, depending on the speech material and the azimuth of the target talker. Although there is no clear uniform result, enlarging the minimum angle between the target talker and the nearest competing talker can have a positive effect on speech intelligibility (cf. Peissig and Kollmeier, 1997).

FIG. A1. Mean intelligibility scores for words and sentences as a function of the azimuth of the target talker, with the number of competing talkers and 45°- or 90°-tca as parameters.

For talker recognition and reaction times, no or only minor effects of configuration are found; no plots are shown for these measures. Only a small difference in recognition is found between 3D(3A) and 3D(3B): when the target talker is at 90°, there is a slight benefit in having the nearest of three competing talkers 90° away. Comparison between the 90°-tca and 45°-tca conditions for one or two competing talkers shows no significant differences. Hence, talker recognition is not critically dependent on the spatial segregation of target talker and competing talkers. As to the reaction times, only for 45°-tca with four competing talkers is a significantly shorter reaction time found, for a target position of 90°. Comparison between the 90°-tca and 45°-tca conditions for one or two competing talkers again shows no significant differences.

Figure A2 shows the localization scores in the 90°-tca and 45°-tca conditions as a function of the azimuth of the target talker. For both 90°-tca and 45°-tca there is a significant effect of target position, with 0° giving the best performance. The distribution of the localization errors did not show any response bias toward 0°. Comparison between the 90°-tca and 45°-tca conditions for one or two competing talkers shows no significant differences.

FIG. A2. Mean localization scores as a function of the azimuth of the target talker, with the number of competing talkers and 45°- or 90°-tca as parameters.
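The planned comparisons mentioned above are not spelled out in the paper. Purely as an illustration of the kind of test involved, the following Python sketch runs one such comparison as a paired t-test at the 5% level; the per-subject scores are invented, and the paper does not state that paired t-tests were the exact procedure used.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-subject intelligibility scores (%) for 12 listeners
# in two talker configurations; these numbers are made up for illustration.
rng = np.random.default_rng(0)
scores_90tca = rng.normal(70, 8, size=12)
scores_45tca = rng.normal(64, 8, size=12)

# Planned (a priori) comparison between the two configurations,
# evaluated at a 5% significance level.
t_stat, p_value = ttest_rel(scores_90tca, scores_45tca)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, significant: {p_value < 0.05}")
```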

1 Downsampling from 44.1 to 12.5 kHz was done by resampling with a ratio of 2/7, which actually yields a sampling rate of 12.6 kHz. Playing the result at a rate of 12.5 kHz introduces a slight, negligible mismatch (0.8%).
2 We assume an SRT for sentences in noise of −5 dB and a slope of the psychometric function of 15% per dB around the 50% point (Plomp and Mimpen, 1979). Hence, an SRT of −4 dB results in an intelligibility score of 65%.
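As a rough illustration of the resampling arithmetic in footnote 1 (the test signal, tone frequency, and variable names are ours, not part of the original processing chain), a polyphase resampler such as SciPy's can realize the 2/7 ratio and reproduce the 0.8% playback mismatch:

```python
import numpy as np
from scipy.signal import resample_poly

FS_IN = 44100      # original sampling rate in Hz
UP, DOWN = 2, 7    # rational resampling ratio from footnote 1

# Dummy one-second test tone standing in for the speech material.
t = np.arange(FS_IN) / FS_IN
x = np.sin(2 * np.pi * 440.0 * t)

# Polyphase resampling by 2/7; an anti-aliasing filter is applied internally.
y = resample_poly(x, UP, DOWN)

fs_true = FS_IN * UP / DOWN   # = 12600 Hz, the actual rate of y
fs_play = 12500               # nominal playback rate in Hz
mismatch = (fs_true - fs_play) / fs_true
print(f"actual rate {fs_true:.0f} Hz; playback mismatch {100 * mismatch:.1f}%")
```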
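One way to read the arithmetic in footnote 2 is to approximate the psychometric function as linear around its 50% point with slope s = 15% per dB, with the 50% point at the reference SRT of −5 dB, and to evaluate it 1 dB above that point:

\[
S(\mathrm{SNR}) \approx 50\% + s\,(\mathrm{SNR} - \mathrm{SRT}), \qquad s = 15\%/\mathrm{dB},
\]
\[
S(-4\ \mathrm{dB}) \approx 50\% + 15\,\tfrac{\%}{\mathrm{dB}} \times \bigl(-4 - (-5)\bigr)\,\mathrm{dB} = 65\%.
\]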

Aoshima, N. (1981). "Computer-generated pulse signal applied for sound measurement," J. Acoust. Soc. Am. 69, 1484–1488.
Begault, D. R. (1995). "Virtual acoustic displays for teleconferencing: Intelligibility advantage for telephone grade audio," Audio Engineering Society 98th Convention Preprint 4008 (AES, New York).
Begault, D. R., and Wenzel, E. M. (1993). "Headphone localization of speech," Hum. Factors 35, 361–376.
Begault, D. R., and Erbe, T. (1994). "Multichannel spatial auditory display for speech communication," J. Audio Eng. Soc. 42, 819–826.
Bosman, A. J. (1989). "Speech perception by the hearing impaired," Ph.D. dissertation, University of Utrecht.
Bronkhorst, A. W., and Plomp, R. (1992). "Effect of multiple speechlike maskers on binaural speech recognition in normal and impaired hearing," J. Acoust. Soc. Am. 92, 3132–3139.
Bronkhorst, A. W. (1995). "Localization of real and virtual sound sources," J. Acoust. Soc. Am. 98, 2542–2553.
Cherry, E. C. (1953). "Some experiments on the recognition of speech, with one and with two ears," J. Acoust. Soc. Am. 25, 975–979.
Crispien, K., and Ehrenberg, T. (1995). "Evaluation of the cocktail-party effect for multiple speech stimuli within a spatial auditory display," J. Audio Eng. Soc. 43, 932–941.
Duquesnoy, A. J. (1983). "Effect of a single interfering noise or speech source on the binaural sentence intelligibility of aged persons," J. Acoust. Soc. Am. 74, 739–743.
Ericson, M. A., and McKinley, R. L. (1997). "The intelligibility of multiple talkers separated spatially in noise," in Binaural and Spatial Hearing in Real and Virtual Environments, edited by R. H. Gilkey and T. R. Anderson (Erlbaum, Mahwah, NJ), Chap. 32, pp. 701–724.
Festen, J. M., and Plomp, R. (1990). "Effects of fluctuating noise and interfering speech on the speech reception threshold for impaired and normal hearing," J. Acoust. Soc. Am. 88, 1725–1736.
Gescheider, G. A. (1997). Psychophysics: The Fundamentals (Erlbaum, Mahwah, NJ).
Gilkey, R. H., and Anderson, T. R. (1995). "The accuracy of absolute localization judgments for speech stimuli," J. Vestib. Res. 5, 487–497.
Hartung, K. (1995). "Messung, Verifikation und Analyse von Aussenohrübertragungsfunktionen," in Fortschritte der Akustik—DAGA '95 (DPG-GmbH, Bad Honnef, Germany), pp. 755–758.
Hawley, M. L., Litovsky, R. Y., and Colburn, H. S. (1999). "Speech intelligibility and localization in a multi-source environment," J. Acoust. Soc. Am. 105, 3436–3448.
Hygge, S., Rönnberg, J., Larsby, B., and Arlinger, S. (1992). "Normal-hearing and hearing-impaired subjects' ability to just follow conversation in competing speech, reversed speech, and noise backgrounds," J. Speech Hear. Res. 35, 208–215.
Macmillan, N. A., and Creelman, C. D. (1991). Detection Theory: A User's Guide (Cambridge University Press, Cambridge).
Møller, H., Sørensen, M. F., Hammershøi, D., and Jensen, C. B. (1995). "Head-related transfer functions of human subjects," J. Audio Eng. Soc. 43, 300–320.
Peissig, J., and Kollmeier, B. (1997). "Directivity of binaural noise reduction in spatial multiple noise-source arrangements for normal and impaired listeners," J. Acoust. Soc. Am. 101, 1660–1670.
Peters, R. W., Moore, B. C. J., and Baer, T. (1998). "Speech reception thresholds in noise with and without spectral and temporal dips for hearing-impaired and normally hearing people," J. Acoust. Soc. Am. 103, 577–587.
Plomp, R., and Mimpen, A. M. (1979). "Improving the reliability of testing the speech reception threshold," Audiology 18, 43–52.
Pollack, I., Pickett, J. M., and Sumby, W. H. (1954). "On the identification of talkers by voice," J. Acoust. Soc. Am. 26, 403–406.
Pösselt, C., Schröter, J., Opitz, M., Divenyi, P. L., and Blauert, J. (1986). "Generation of binaural signals for research and home entertainment," in Proceedings of the 12th International Congress on Acoustics (Toronto, Canada), Vol. 1, B1–6 (Beauregard Press, Canada).
Pralong, D., and Carlile, S. (1994). "Measuring the human head-related transfer functions: A novel method for the construction and calibration of a miniature in-ear recording system," J. Acoust. Soc. Am. 95, 3435–3444.
Pralong, D., and Carlile, S. (1996). "The role of individualized headphone calibration for the generation of high fidelity virtual auditory space," J. Acoust. Soc. Am. 100, 3785–3793.
Ricard, G. L., and Meirs, S. L. (1994). "Intelligibility and localization of speech from virtual directions," Hum. Factors 36, 120–128.
Shaw, E. A. G. (1974). "Transformation of sound pressure level from the free field to the eardrum in the horizontal plane," J. Acoust. Soc. Am. 56, 1848–1861.
Steeneken, H. J. M., and Houtgast, T. (1986). "Comparison of some methods for measuring speech levels," Report IZF 1986-20, TNO Institute for Perception, Soesterberg, The Netherlands.
Stubbs, R. J., and Summerfield, Q. (1990). "Algorithms for separating the speech of interfering talkers: Evaluations with voiced sentences, and normal-hearing and hearing-impaired listeners," J. Acoust. Soc. Am. 87, 359–372.
Wenzel, E. M., Arruda, M., Kistler, D. J., and Wightman, F. L. (1993). "Localization using nonindividualized head-related transfer functions," J. Acoust. Soc. Am. 94, 111–123.
Wickens, C. D. (1984). Engineering Psychology and Human Performance (Merrill, Columbus, OH), Chap. 10, pp. 337–376.
Wightman, F. L., and Kistler, D. J. (1989a). "Headphone simulation of free-field listening: I. Stimulus synthesis," J. Acoust. Soc. Am. 85, 858–867.
Wightman, F. L., and Kistler, D. J. (1989b). "Headphone simulation of free-field listening: II. Psychophysical validation," J. Acoust. Soc. Am. 85, 868–878.
Yost, W. A., Dye, R. H., and Sheft, S. (1996). "A simulated cocktail party with up to three sound sources," Percept. Psychophys. 58, 1026–1036.
