Speech perception and talker segregation: Effects of level, pitch, and tactile support with multiple simultaneous talkers

Rob Drullman and Adelbert W. Bronkhorst
TNO Human Factors, Dept. of Perception, P.O. Box 23, 3769 ZG Soesterberg, The Netherlands

(Received 22 March 2004; revised 12 August 2004; accepted 13 August 2004)

Speech intelligibility was investigated by varying the number of interfering talkers, the level and mean pitch differences between target and interfering speech, and the presence of tactile support. In a first experiment the speech-reception threshold (SRT) for sentences was measured for a male talker against a background of one to eight interfering male talkers or speech noise. Speech was presented diotically and vibro-tactile support was given by presenting the low-pass-filtered signal (0–200 Hz) to the index finger. The benefit in the SRT resulting from tactile support ranged from 0 to 2.4 dB and was largest for one or two interfering talkers. A second experiment focused on masking effects of one interfering talker. The interference was the target talker's own voice with its mean pitch increased by 2, 4, 8, or 12 semitones. Level differences between target and interfering speech ranged from -16 to +4 dB. Results from measurements of correctly perceived words in sentences show an intelligibility increase of up to 27% due to tactile support. Performance gradually improves with increasing pitch difference. Louder target speech generally helps perception, but results for level differences depend considerably on pitch differences. Differences in performance between noise and speech maskers, and between speech maskers with various mean pitches, are explained by the effect of informational masking. © 2004 Acoustical Society of America. [DOI: 10.1121/1.1802535]

PACS numbers: 43.66.Wv, 43.71.Es [AK]

J. Acoust. Soc. Am. 116 (5), November 2004, Pages: 3090–3098

I. INTRODUCTION

In a recent paper on the cocktail party phenomenon, Bronkhorst (2000) gives an extensive review of studies on the intelligibility of speech presented against a background of competing speech. Bronkhorst identifies a number of topics that merit further study; two of these will be touched upon in this paper. First, we will consider the interference by competing talkers (particularly in diotic listening), their number, and the role of different voices. Second, there is the closely related issue of selective attention, in particular the situation where the listener concentrates on one particular talker, trying to minimize the distraction by other talkers. Apart from maximizing the acoustic difference between target and interference (level, spectrum, and pitch cues), another way to facilitate the focusing of attention is to provide information through a different modality. The type of support examined here is tactile (and in some cases visual) information about the temporal fluctuations in the speech signal.

A. Multitalker environment

In an environment with many competing talkers, the target speech is embedded in a background of voice babble, with masking characteristics that come close to those of speech noise. That is, both have a similar long-term average spectrum, limited spectral and temporal fluctuations, and a lack of linguistic content. Going to fewer competing talkers will increase the difference in masking, with a gradual shift from energetic to more informational masking. Energetic masking refers to the case when the interference (e.g., noise or speech) contains sufficient energy to make the target speech (partly) inaudible or at least unintelligible; informational masking occurs when (portions of) the speech interference is intelligible and so similar to the target speech that it becomes difficult for the listener to disentangle target and interfering speech. The effect of the similarity of target and masking voice(s) is most clearly demonstrated in studies with different-sex vs same-sex maskers (cf. Drullman and Bronkhorst, 2000). Recently, Brungart (2001) and Brungart et al. (2001) have conducted a series of experiments on energetic and informational masking using the Coordinate Response Measure (CRM) paradigm, in which target and masker phrases are very similar. In a first study with a single interfering talker, Brungart (2001) investigated same-talker, same-sex, and different-sex target and masker voices. He found a clear dominance of informational masking over energetic masking, with least masking for different-sex talkers and most masking for same-talker conditions. The results were not monotonically related to the signal-to-noise ratio (SNR).1 In a follow-up study, Brungart et al. (2001) used multiple simultaneous talkers and found essentially similar results. At negative SNRs, however, performance strongly depended on the number of interfering talkers: intelligibility was worse with two or three interfering talkers than with a single interfering talker. Regarding the issue of voice (dis)similarity, in particular between male and female voices, an important role is attributed to differences in pitch when segregating simultaneous speech signals. The effect of pitch, defined on a segmental level, on the segregation of concurrent vowels has been the subject of several studies (e.g., Assmann and Summerfield, 1990; Culling and Darwin, 1993). For longer speech messages, Brokx and Nooteboom (1982) performed an interesting study using the same talker for target and interference


while manipulating the pitch difference. An experiment with LPC-resynthesized monotonous target speech showed a 20% increase in keyword recognition when going from same pitch to a 3-semitone difference. With a 12-semitone difference intelligibility decreased again, which the authors explain by the inseparability of the harmonics of target and interfering speech, yielding perceptual fusion of coinciding voiced portions. The results of a second experiment with natural speech and normal intonation did not show this decrease: keyword recognition was about 20% better when the target pitch was double that of the interference than when the pitches were the same. Brokx and Nooteboom (1982) related these findings to perceptual fusion, occurring when two simultaneous sounds have identical pitches, and to perceptual tracking when the pitches of target and interference cross each other, so that the listener may inadvertently switch his attention from target to interfering speech. In a recent study, Darwin et al. (2003) used PSOLA processing (pitch synchronous overlap add; Moulines and Charpentier, 1990), which yields more natural-sounding manipulated speech than LPC resynthesis, with target and interfering speech from the same voice. Darwin et al. (2003) measured male and female talkers with the CRM paradigm, preserving the normal intonation of the sentences. They found no improvement when going from 0 to 1 semitone pitch difference, but a 12% increase in score when going to a difference of 2 semitones, averaged over SNRs of -6 to +3 dB. Stretching the difference to 12 semitones yielded a 24% increase.

B. Tactile support

Several methods of (vibro)tactile stimulation have been used in the past as an aid to speech perception by people with severe hearing impairment (Sherrick, 1984). Experiments have shown (limited) benefits in speech perception, indicating that tactile information can indeed provide some support. In several cases, multiple-channel tactile support was investigated; in this paper, however, we restrict ourselves to single-channel tactile support. Normally, information is conveyed by presenting the amplitude fluctuations of the (filtered) speech signal to a tactile transducer attached to the hand, wrist, or finger. Converting acoustic information in this manner appears to be appropriate, both in terms of sensitivity (with some amplification of modulation frequencies below 20 Hz, which are especially important for speech) and recognition of syllable rhythm and stress (Weisenberger, 1986; Weisenberger and Miller, 1987). A study by Summers et al. (1994) on the perception of time-varying pulse trains showed that information transfer for vibrotactile stimuli was highest when frequency and amplitude modulation were used together. This result was demonstrated explicitly for the perception of voice fundamental frequency and the identification of sentence stress. Although these and other studies in principle give ample evidence for the efficacy of tactile information transfer, the question still remains whether there is a natural or direct mapping of acoustic events into the tactile domain: our fingers are simply not made to listen. Mahar et al. (1994) performed a number of cross-modal experiments to examine the similarity of auditory and tactile representations of speech(-like) stimuli. They concluded that comparisons between auditory and tactile stimuli are easier to perform than comparisons between auditory and visual stimuli when it comes to word stress, amplitude variations, and temporally distributed information in general.

In this paper two experiments are described in which the effects of both acoustic factors and tactile support on speech intelligibility in multiple-talker conditions were studied. In the first experiment the speech-reception threshold (SRT) for sentences was measured against a background of one to eight interfering talkers or against speech noise. Vibro-tactile support was given by presenting the low-pass-filtered speech signal (0–200 Hz) simultaneously to the index finger. In order to obtain a clearer picture of the interaction between informational masking and the presence of tactile support, a second experiment was carried out, which focused on the effects of masking for one interfering talker, using different pitches and levels of the target and interfering voices. Again, speech intelligibility was determined both with and without tactile support.

II. EXPERIMENT 1: MULTIPLE COMPETING TALKERS

A. Stimuli, design

The target stimuli consisted of 13 lists of 13 meaningful Dutch sentences of 8 to 9 syllables, pronounced by a male talker and representing conversational speech (Versfeld et al., 2000). The interfering sound was either masking noise with the long-term average spectrum (and average rms level) of this talker or speech of other male talkers. A total of 60 sentences (15 sentences for each of 4 talkers) was used to create the speech interference. These meaningful sentences were taken from the original set of Plomp and Mimpen (1979), and were all different from the target sentences. The 15 sentences for each talker were concatenated and used either as a single interfering voice or mixed to produce 2, 4, or 8 interfering voices. The latter number was actually created by doubling the material for the 4-talker version; i.e., each talker occurred twice but with different initial sentences. All sentence and noise material was stored in WAV files with 44.1-kHz sampling rate and 16-bit resolution. Because of the spectral shaping of the masking noise, it provides optimal (long-term) energetic masking of the target speech. In order to optimize the amount of energetic masking by the interfering talkers, their spectra were equalized with respect to the noise. Spectral shaping was done on each individual file with interfering speech (containing 1, 2, 4, or 8 talkers) in 8 octave bands from 63 Hz to 8 kHz, using a Behringer 8024 equalizer. Subsequently, the speech files were given the same A-weighted speech level as the noise (speech parts that were more than 14 dB below peak level were discarded, cf. Steeneken and Houtgast, 1986). The end result was that the overall long-term level of the speech masker was kept constant (equal to the noise masker), regardless of the number of interfering voices. However, it should be noted that equalizing the spectra is not enough to equalize energetic masking. Due to temporal modulations in the speech maskers, less masking will be produced than for a nonfluctuating masker (cf. Festen and Plomp, 1990). As the amount of fluctuations is reduced when more interfering talkers are present, we may assume that the difference in energetic masking between speech and noise maskers will decrease with increasing number of interfering talkers. For the sake of clarity, the long-term rms values of the target sentences and of the speech or noise maskers were used to define the (global) SNR in the listening experiments.

There were ten conditions in which five types of interference were investigated in an experimental design with and without tactile support. Three extra conditions (with 1, 2, or 8 interfering talkers) were added in which the tactile support was replaced by visual support. This was done in order to determine whether stimulating a different modality with the same information would give the same results.

B. Signal processing

FIG. 1. Block diagram of the signal processing. Speech and interfering signal are mixed at a variable SNR and presented monaurally. Tactile information is derived from the low-pass filtered speech signal.
Figure 1 shows a block diagram of the signal processing for the presentation of auditory and tactile information. For the auditory pathway, target speech was mixed with the interference (noise or interfering speech) at a variable SNR and presented diotically over earphones. Tactile information consisted of the low-pass filtered target speech signal, presented through a vibrator. The cutoff frequency for the low-pass filter was set at 200 Hz. In this way the tactile stream contained all relevant information about the amplitude fluctuations (cf. Drullman et al., 1994) and, moreover, conveyed the pitch variations of the target voice, which might be used as an additional cue. Signal processing was done in real time, using a standard PC equipped with an Aardvark Direct Pro 24/96 sound card. The auditory streams for target speech and interference were led through Tucker Davis PA4 attenuators and a SM3 mixer, a Krohn-Hite 3342 low-pass filter at 8 kHz (48 dB/oct), and a Tucker Davis HB6 headphone amplifier, and presented over Sennheiser HD 250 headphones. The tactile stream was led through a Krohn-Hite 3342 low-pass filter (200 Hz, with 20-dB gain) and a B&K 2706 amplifier to a B&K 4810 shaker. The setting of the amplifier and shaker was such that the maximum acceleration was 60 m/s^2 at the speech peaks. This level was sufficient for the subjects to feel the tactile stimulation and not so high as to be uncomfortable. For the visual support, the shaker was replaced by a small hand-held box with a red LED. The LED was controlled in the same way as the tactile driver, i.e., the brightness varied in accordance with the low-pass filtered speech amplitude.
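As an illustration of this processing chain, the sketch below mixes target and interference at a given global SNR (defined from long-term rms levels, as in Sec. II A) and derives a tactile drive signal with a simple one-pole digital low-pass filter. This is only a hypothetical stand-in for the analog attenuator and Krohn-Hite filter hardware actually used; the function names and the filter topology are our own.

```python
import math

FS = 44100  # sampling rate of the stimulus files (Hz)

def rms(x):
    """Long-term rms level, as used to define the global SNR."""
    return math.sqrt(sum(s * s for s in x) / len(x))

def mix_at_snr(target, interference, snr_db):
    """Scale the target so that 20*log10(rms_target/rms_interference)
    equals snr_db, then add the two signals sample by sample."""
    gain = 10 ** (snr_db / 20) * rms(interference) / rms(target)
    return [gain * t + i for t, i in zip(target, interference)]

def tactile_signal(target, cutoff_hz=200.0, fs=FS):
    """One-pole low-pass stand-in for the 200-Hz filter that derived
    the vibrotactile drive signal from the target speech."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    out, acc = [], 0.0
    for s in target:
        acc += alpha * (s - acc)  # leaky integrator
        out.append(acc)
    return out
```

A one-pole filter rolls off far more gently than the laboratory filter, but it preserves the point of the design: frequency components well below 200 Hz (envelope and voice pitch) pass to the vibrator, while the rest of the speech spectrum is strongly attenuated.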

C. Subjects

Subjects were 12 normal-hearing native Dutch students, mostly from Utrecht University, whose ages ranged from 19 to 26 years. All had pure-tone air-conduction thresholds less than 15 dB HL at octave frequencies from 125 Hz to 2 kHz and less than 25 dB HL up to 8 kHz. They were paid for their services.

D. Procedure

From the 13 lists of sentences, five were used for auditory presentation only, five for auditory presentation with tactile support, and three for auditory presentation with visual support. Lists were presented in a fixed order; each list was used for one condition. Blocks of five auditory or auditory+tactile conditions were presented in a sequence. Half of the subjects started with the auditory-only conditions, the other half with the auditory+tactile conditions. The first condition in a block was the noise interference; the four interfering-talker conditions were balanced according to a 4x4 Latin square. The final three lists were used for the auditory+LED conditions (1, 2, or 8 interfering talkers), which were presented in random order. In the SRT measurements, the level of the interference was fixed at 60 dBA for every condition; the level of the target sentences was changed according to an up-down adaptive procedure (Plomp and Mimpen, 1979). The first sentence in a list was presented at a level below the reception threshold. This sentence was repeated, each time at a 4-dB higher level, until the listener could reproduce it without a single error. The remaining 12 sentences were then presented only once, in a simple up-down procedure with a step size of 2 dB. A response was scored as correct only if all words were correctly reproduced by the subject. Subjects made verbal responses, which were not recorded. The average SNR for sentences 4–13 was adopted as the SRT for that particular condition. The interference was presented in a continuous loop and was silenced in between the individual target sentences by muting the PA4 attenuator. The interference started 500 ms before and ended 500 ms after each target sentence.

Measurements were conducted for each subject individually in a sound-proof room. Subjects received written and oral instructions. During the conditions with tactile stimulation, they had their right arm on a special rest, so that they could feel the vibrations of the B&K shaker with their index finger, through a hole in the armrest.2 An accelerometer mounted on the shaker merely functioned as the contact between shaker and finger. Prior to the actual tests, subjects were given a brief training (10–15 min) in order to become familiar with the target voice and with the procedure. Single words, short combinations of words (2–4 syllables), and entire sentences (different from the ones used in the tests) were presented at a fixed SNR of -2 dB. While listening to target speech and interference, subjects received the tactile stimulation. They were explicitly told to note the correspondence between what they heard and felt. All subjects reported that they did indeed note this correspondence. There was no training for the visual support.
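The adaptive up-down procedure of Plomp and Mimpen (1979) described above can be sketched in a few lines, here with an idealized listener whose probability of repeating a sentence correctly follows a logistic psychometric function of SNR. The listener model and its slope parameter are our own illustrative assumptions, not part of the experiment.

```python
import math
import random

def simulate_srt(true_srt_db, slope_pct_per_db=15.0, start_snr=-20.0, seed=1):
    """Simulate one 13-sentence list of the Plomp & Mimpen (1979)
    up-down SRT procedure with an idealized logistic listener."""
    rng = random.Random(seed)
    k = 4.0 * (slope_pct_per_db / 100.0)  # logistic rate giving this slope at 50%

    def p_correct(snr):
        return 1.0 / (1.0 + math.exp(-k * (snr - true_srt_db)))

    # Sentence 1: repeated at 4-dB increments until reproduced correctly.
    snr = start_snr
    while rng.random() > p_correct(snr):
        snr += 4.0
    # Sentences 2-13: presented once each; 2 dB down after a correct
    # response, 2 dB up after an incorrect one.
    correct, levels = True, []
    for _ in range(12):
        snr += -2.0 if correct else 2.0
        levels.append(snr)
        correct = rng.random() < p_correct(snr)
    # The SRT is the mean SNR at which sentences 4-13 were presented.
    return sum(levels[2:]) / len(levels[2:])
```

With a steep psychometric function the estimate converges close to the true 50% point; with a shallow slope, as reported later for a single interfering talker, the track wanders and the estimates scatter much more.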


E. Results

The mean results of the SRT measurements as a function of type of interference and tactile/visual support are plotted in Fig. 2. Due to a technical error in the visual support for the first subject, results for the audio+LED conditions are based on 11 tests instead of 12. There appears to be a small overall effect of a lower SRT (i.e., higher intelligibility) when tactile or visual support is given. A two-way ANOVA with repeated measures revealed significant effects of tactile support (p < 0.05) and type of interference (p < 0.001), but no significant interaction. Overall, the mean SRT for auditory presentation is -3.4 dB and for auditory+tactile -4.4 dB. A separate analysis for the LED conditions showed that the overall change in SRT due to visual support is barely significant (p = 0.066). Neither is there any significant interaction with the number of competing talkers. However, closer inspection of the data shows that the statistical analysis is strongly affected by the large variance in the conditions with only one interfering talker, as shown in Fig. 2. The standard deviations without and with tactile support are 4.0 and 3.4 dB, respectively, which is 2 to 4 times higher than the standard deviations for conditions with more interfering talkers or noise. We have therefore repeated the within-subjects ANOVA while excluding the single-interfering-talker conditions, and found a significant interaction between interference type and the presence of tactile or visual support (p = 0.028). This confirms what is easily observed in Fig. 2, namely that the nonauditory support causes a relatively large (2.4-dB) release from masking when there are two interfering talkers, but that the effect vanishes when the number of interfering talkers is increased, or when noise is used as interference. The conditions with only one interfering talker are nevertheless of interest because the mean SRTs are very low: -9.1 dB for auditory only and -11.0 dB with tactile support. Although the standard deviations are high, the SRT scores are by far the best of all conditions. With only one interfering voice, it is probably possible to listen in the dips of the temporal envelope (momentary low-energy speech fragments) of the interference. This effect is apparently subject dependent and is greatly reduced when more interfering voices are present.

FIG. 2. Mean SRTs and standard deviations as a function of interference (1, 2, 4, or 8 interfering talkers, or noise). The more negative the SRT, the better the intelligibility.

III. EXPERIMENT 2: PITCH-VARYING SINGLE INTERFERING TALKER

The results of the above experiment indicate that the gain resulting from tactile or visual support is largest when there are only one or two interfering talkers, and disappears when the interference is babble or noise. Because informational masking has a similar dependency on the type of interference, this suggests that the gain is mainly a release from informational masking. However, it is not clear how much informational masking occurs in the particular speech material that we have used. As discussed in the Introduction, earlier research has indicated that informational masking depends on pitch differences between target and interfering speech (Brokx and Nooteboom, 1982; Darwin et al., 2003). We therefore decided to carry out a second experiment in which the effect of tactile support was evaluated as a function of controlled pitch differences between target and interfering speech. In a first approach, the same SRT paradigm as in experiment 1 was used. Results with 12 subjects showed no effect of tactile support and some effect of pitch difference. However, there was a large difference in performance between subjects. We concluded that for this type of experiment, with a single interfering voice that can be highly similar to the target voice, the SRT paradigm is not suitable. This is because the slope of the psychometric function can be very shallow or can even take on negative values (e.g., Brungart, 2001), so that the adaptive procedure converges poorly. To obtain a better differentiation between conditions, the percentage of correctly received words in a sentence list was used, as described in the following subsections.

A. Stimuli, design

A total of 420 sentences of a male target talker was taken from the same database as used in the first experiment (Versfeld et al., 2000). Another 39 sentences from the same talker were used as interference. Both target sentences and (concatenated) interfering sentences were processed in such a way (see below) that the difference in average pitch would vary between conditions. Sixty conditions for target speech and interference were investigated in an experimental design with five average pitch differences of 0, 2, 4, 8, and 12 semitones and six fixed SNRs of -16, -12, -8, -4, 0, and +4 dB, each with and without tactile support. The 420 target sentences were divided into 60 lists of 7 sentences, and instead of using an adaptive design as in the SRT measurements, series of sentences were now presented at fixed SNRs. Intelligibility was measured by scoring the number of correctly received words per sentence (which contained on average 6.1 words).

B. Signal processing

Of all 420 target sentences the mean pitch per sentence was determined first. The algorithm used is based on an accurate autocorrelation method, with a frame duration of 10 ms (Boersma, 1993). The overall mean pitch of the 420 sentences was 105 Hz with a standard deviation of 15 Hz. In order to have control over the mean pitch differences between target and interfering sentences, all target sentences were resynthesized to have a mean pitch of 105 Hz. That is, the entire pitch contour of a sentence (voiced parts) was shifted up or down to obtain a mean of 105 Hz; the natural pitch contour was preserved. Signal processing was done by means of a PSOLA algorithm, implemented in the PRAAT software package (Boersma and Weenink, 1996). Subsequently, the interfering sentences were processed by making copies with mean pitches at 0, +2, +4, +8, and +12 semitones relative to 105 Hz. Signal processing for the presentation of the auditory and tactile stimuli was identical to that in the first experiment (see Fig. 1). In this second experiment there was no visual (LED) presentation.

TABLE I. Mean word scores (standard deviations) in percentages for the different conditions in experiment 2.

                       Auditory only                                Auditory+tactile
                       Δ pitch (semitones)                          Δ pitch (semitones)
SNR (dB)    0        2        4        8        12       |  0        2        4        8        12
-16         38 (26)  38 (17)  43 (22)  51 (21)  53 (17)  |  40 (28)  45 (21)  49 (23)  46 (22)  65 (17)
-12         64 (19)  55 (23)  60 (27)  70 (17)  79 (12)  |  56 (16)  71 (17)  64 (22)  82 (11)  84 (12)
-8          70 (24)  72 (22)  72 (22)  82 (13)  95 (6)   |  79 (23)  79 (15)  74 (18)  87 (11)  94 (7)
-4          56 (21)  61 (24)  69 (23)  89 (12)  96 (4)   |  65 (16)  66 (24)  73 (15)  93 (7)   99 (2)
0           46 (14)  49 (26)  76 (18)  97 (4)   99 (1)   |  73 (21)  76 (15)  90 (13)  98 (3)   100 (1)
+4          76 (29)  78 (25)  92 (9)   100 (1)  100 (0)  |  87 (7)   91 (12)  98 (3)   99 (2)   99 (1)

C. Subjects

Subjects were ten normal-hearing native Dutch students, whose ages ranged from 19 to 25 years. They were different from the subjects in the first experiment. All had pure-tone air-conduction thresholds less than 15 dB HL at octave frequencies from 125 Hz to 2 kHz and less than 25 dB HL up to 8 kHz. They were paid for their services.

D. Procedure

From the 60 lists of 7 sentences, 30 were used for auditory-only presentation and 30 for auditory+tactile presentation. Lists were presented in a fixed order, one list for one condition. The 30 auditory and 30 auditory+tactile conditions were presented consecutively. Half of the subjects started with the auditory conditions, the other half with the auditory+tactile conditions. Different mean pitches for interfering speech were balanced according to a 5x5 Latin square; the order of presentation of the SNRs was randomized. Conditions with the same pitch were presented in a block. Measurements were performed for each subject individually in a sound-proof room. Subjects listened to the target speech and had to reproduce as many words of the target sentence as they could. Each target sentence was presented once, and word scores for sentences 2–7 were counted. The rest of the procedure was similar to that of the first experiment. Training was done prior to the real test, with +6 and +2 semitones pitch difference at a fixed SNR of -4 dB.

FIG. 3. Mean number of words correct as a function of SNR for target and interfering speech of the same talker. Data are pooled over auditory only and auditory+tactile support. Different curves correspond to mean pitch differences between target and interfering voice.
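For reference, the pitch manipulations in this experiment are defined on a logarithmic scale: a difference of n semitones corresponds to a frequency ratio of 2^(n/12). The small sketch below (the helper names are ours) lists the mean pitches of the interfering voice implied by the shifts relative to the 105-Hz target mean.

```python
def semitones_to_ratio(n):
    """A difference of n semitones is a frequency ratio of 2**(n/12)."""
    return 2.0 ** (n / 12.0)

def shifted_mean_pitch(base_hz, n_semitones):
    """Mean pitch after shifting a voice by n_semitones."""
    return base_hz * semitones_to_ratio(n_semitones)

# The five interferer mean pitches of experiment 2, in Hz (rounded):
pitches = {n: round(shifted_mean_pitch(105.0, n), 1) for n in (0, 2, 4, 8, 12)}
```

The 12-semitone condition thus doubles the mean pitch to 210 Hz, the octave relationship that Brokx and Nooteboom (1982) found problematic for monotonized speech because the harmonics of the two voices coincide.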

E. Results

Table I shows the results (means and standard deviations) for the different experimental conditions. Figures 3–5 display the results graphically, pooled over tactile conditions, pitch differences, and SNRs, respectively. As in the previous experiment, relatively high standard deviations are found. For the sake of clarity, the spread in the data is not displayed in Figs. 3–5, but is given only in Table I. The scores were analyzed by means of a repeated-measures ANOVA with three factors.3 For the full experimental design, significant main effects are found for tactile support (p < 0.05), pitch difference (p < 0.001), and SNR (p < 0.001). Of the four interactions, only the interaction between pitch difference and SNR is significant (p < 0.001). The results of Figs. 4 and 5 indicate a small overall increase from 71% to 77% in word score due to tactile support. Performance improves with increasing pitch difference (Fig. 5), both with and without tactile support. A closer look at Fig. 3 shows a dip in the performance near 0-dB SNR when the mean pitch difference is 0 or 2 semitones. This is a situation where target and interference are most alike acoustically, and tactile support would presumably be most effective. Indeed, the data in Table I show a maximal increase of 27% for 0- and 2-semitone pitch differences at 0-dB SNR. For the entire range of SNRs, Fig. 6 displays the scores for the pooled 0- and 2-semitone conditions. A separate ANOVA on this subset reveals a significant interaction between tactile support and SNR (p < 0.02). Post hoc comparisons (Tukey HSD) show a significant effect of tactile support at 0-dB SNR: from 48% words correct without to 75% words correct with tactile support.

FIG. 4. Mean number of words correct as a function of SNR for target and interfering speech of the same talker. Data are pooled over a range of mean pitch differences between the voices (0 to 12 semitones). The two curves show the results with and without tactile support.

FIG. 5. Mean number of words correct as a function of mean pitch difference between target and interfering speech of the same talker. Data are pooled over a range of SNRs (-16 to +4 dB). The two curves show the results with and without tactile support.

FIG. 6. As Fig. 4, but data pooled over mean pitch differences of 0 and 2 semitones only.

IV. GENERAL DISCUSSION

A. Tactile support

Both experiments 1 and 2 show evidence of a significant benefit of tactile support for sentence and word intelligibility. Due to the different measurement methods, the results of the two experiments are not directly comparable. We tried to employ the same SRT paradigm in experiment 2 as in experiment 1, but had to use a word-score paradigm because of the large differences in SRT performance between individual listeners. In experiment 1 a tactile benefit of 0–2.4 dB in SRT (average 1 dB) is found. This gain does not, however, translate into a very large improvement in intelligibility such as found for sentences in steady-state noise, for which the psychometric function has a steep slope (about 15% per dB; see Versfeld et al., 2000). In particular, when there is a single interfering talker, the steepness is reduced substantially. This cannot be derived from the data of experiment 1,4 but an indication is given by the results of experiment 2 (Fig. 4), where we see slopes of about 5%/dB around 50% intelligibility. Such shallow slopes for conditions with one interfering talker were also found by Brungart (2001) and Brungart et al. (2001). Thus, the 1-dB average gain in SRT found in experiment 1 seems to be roughly consistent with the average improvement in intelligibility of 6% in experiment 2. All in all, we find evidence that presenting speech information through a different modality can indeed give some support in perception. This is true for vibrotactile support and, although investigated only in a limited set of conditions, also for visual support. It is most probably the focusing of attention that is responsible for the slight benefit: listeners are made aware of the moments the target talker is present, which can give them just a little advantage of "listening in." The tactile or visual information itself does not contain any new or distinct elements that are not already present in the auditory signal.
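The conversion between an SRT gain in decibels and a word-score gain in percent can be made explicit with a logistic psychometric function. The functional form and parameter names below are illustrative assumptions for this back-of-the-envelope check, not the analysis actually performed.

```python
import math

def psychometric(snr_db, srt_db, slope_pct_per_db):
    """Logistic psychometric function: proportion of words correct as a
    function of SNR, with the given slope (in % per dB) at the 50% point."""
    k = 4.0 * slope_pct_per_db / 100.0
    return 1.0 / (1.0 + math.exp(-k * (snr_db - srt_db)))

def gain_in_percent(srt_shift_db, slope_pct_per_db):
    """Intelligibility change (in %) at the old SRT after the SRT is
    lowered by srt_shift_db, near the 50% point of the function."""
    return 100.0 * (psychometric(0.0, -srt_shift_db, slope_pct_per_db) - 0.5)
```

At a 5%/dB slope, a 1-dB downward shift of the SRT yields an intelligibility gain of about 5%, in line with the rough consistency noted above between the 1-dB SRT benefit of experiment 1 and the 6% word-score benefit of experiment 2; at the 15%/dB slope typical of steady-state noise the same shift would be worth roughly three times as much.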
In this respect the visual cues given in our experiment are fundamentally different from the cues provided by lipreading.

B. Number of interfering talkers

The results of experiment 1 (Fig. 2) show a clear increase of the SRT when two or more interfering talkers are present. Since the long-term interference levels were equal, this effect must be attributed to reduced "dip listening": the presence of more talkers removes the advantage of listening to the target speech during low-energy fragments of the interfering speech. This effect has been noted in many other studies on speech perception (e.g., Brungart et al., 2001; Hawley et al., 2004), and dip listening is in fact a very general psychoacoustical phenomenon (cf. Buus, 1985). In the present study, the advantage of dip listening already vanishes when going from one to two interfering talkers: pooled over the tactile conditions, the SRT increases from −10 to −2 dB, and the latter level remains approximately constant with four or eight interfering talkers. Previous experiments with the same type of sentences showed an intelligibility score of only 36% for two simultaneous male voices presented at an SNR of 0 dB (Drullman

R. Drullman and A. W. Bronkhorst: Audio-tactile cues in multitalker speech


and Bronkhorst, 2000). Those results were obtained with monaural listening, so scores could be slightly better for diotic presentation, as in the present experiment 1. Nevertheless, a score of 50% (the definition of the SRT) at an SNR of −9.1 dB, as in experiment 1 (left-hand black bar in Fig. 2), would normally imply a (much) higher intelligibility at 0 dB. This discrepancy can probably be explained by the fact that level differences between target and masker act as a cue to the listeners (cf. Brungart, 2001). A low-level target voice can easily be discerned from a single high-level masking voice. The opposite is also true, and the result is that the psychometric function can have a local minimum around 0-dB SNR (see Fig. 3). Because the SRT paradigm starts at a low (negative) SNR, it normally also converges to a low SRT, except when it reaches SNRs around 0 dB: then the outcome becomes unpredictable. This, we think, is the main reason why the SRT paradigm proved unsuccessful in our second experiment, and it probably also accounts for the very large standard deviations found for a single interfering talker in experiment 1. A quantitative comparison between the present experiment 1 (without tactile support) and the study by Brungart et al. (2001) with multiple simultaneous talkers (one, two, or three interfering talkers) is difficult. The CRM paradigm they used is more a word-intelligibility than a sentence-intelligibility task. Moreover, their target and interfering speech differ in only three keywords, and the nature and timing of target and interfering speech are such that masking of the keywords is maximal, so dip listening is less probable than with our speech material and paradigm. This holds even for a single interfering talker, because of the temporal comodulation between target and interference. Despite the differences between the two studies, results from both our experiment 1 and from the experiments by Brungart et al.
(2001) show a substantial decrease in performance for more than one interfering talker. Our SRTs are always at (slightly) negative SNRs, and performance appears to fall into one of three categories: single interfering talker, many interfering talkers, or interfering noise. Surprisingly, the use of noise instead of speech makes it easier to understand the target talker, even compared to having as many as eight interfering talkers: the SRT decreases from about −1.5 dB to about −5 dB for noise. This difference between speech and noise maskers is also clearly present in the results of Brungart et al. (2001) and is best explained by the effect of informational masking (see also Sec. IV D below).

C. Pitch and level differences

The results of experiment 2 show that performance gradually improves when the difference in pitch between target and interfering voice increases (Fig. 5). The improvement is, however, not monotonic, as can be seen in Fig. 3. When the difference is only 2 semitones, there is virtually no improvement over the condition with identical voices. These results are not fully in agreement with those of Darwin et al. (2003). As mentioned in the Introduction, they found increases of 12% up to 24% for pitch differences of 2 and 12 semitones, respectively, averaged over SNRs of −6 to +3 dB. In our experiment 2,

J. Acoust. Soc. Am., Vol. 116, No. 5, November 2004

averaged over −8- to +4-dB SNR, the increase for a 12-semitone pitch difference is 36%. This may be explained by differences in sentence material and task, but it may also be caused by the particular talker. Darwin et al. (2003) report different results for four male talkers, across whom listener performance can vary significantly. The largest effect of the imposed pitch differences is found for talkers with a rather flat intonation. When a talker has a more expressive intonation, the effect is reduced and can even disappear. In another study, Bird and Darwin (1998) used monotone, natural male speech and PSOLA processing to present two sentences from the same talker with pitch differences from 0 to 12 semitones. With a limited number of interfering sentences, Bird and Darwin (1998) measured maximum increases in word score of about 18% and 40% for pitch differences of 2 and 12 semitones, respectively. In summary, our results show increases in performance for large pitch differences (12 semitones) that are generally in line with results from other studies. Differences must be attributed to talker characteristics, sentence material, and methodology. For small pitch differences (2 semitones), our experiment does not show any improvement in intelligibility, whereas other studies do find an improvement of over 10%.

We already mentioned that level differences in experiment 1 appear to act as a cue for separating two simultaneous voices. The results of experiment 2 indicate that a higher level for the target speech generally helps perception, but that the results depend on pitch differences. An increase of the SNR causes a monotonic improvement of the intelligibility for conditions with large pitch differences of 8 and 12 semitones (Fig. 3). This monotonic relation starts to break down for a pitch difference of 4 semitones and disappears for differences of 2 and 0 semitones. In the latter cases, we see a clear decrease in the scores at SNRs of −4 and 0 dB.
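For reference, the semitone shifts used in experiment 2 translate into fixed F0 ratios; a shift of n semitones multiplies the fundamental by 2^(n/12). The mean F0 of 110 Hz below is purely illustrative (the paper does not report the target talker's mean pitch).

```python
# F0 ratio for a pitch shift of n semitones: ratio = 2 ** (n / 12).
MEAN_F0_HZ = 110.0  # hypothetical mean pitch for a male talker

for semitones in (0, 2, 4, 8, 12):
    ratio = 2.0 ** (semitones / 12.0)
    print(f"{semitones:2d} st: F0 ratio {ratio:.3f} "
          f"({MEAN_F0_HZ * ratio:.1f} Hz)")
# 12 semitones (one octave) double the fundamental (ratio 2.000),
# while 2 semitones change it by only about 12% (ratio 1.122).
```

The octave shift thus roughly spans the male-to-female F0 range, whereas the 2-semitone shift stays well within a single talker's natural pitch variation, which is consistent with the absence of a benefit at 2 semitones.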
These level and pitch results lead to the conclusion that if the pitch difference is large, the cues for separating the two voices are strong enough to yield a high intelligibility score. If, on the other hand, the pitches are closer together, similar levels for target and interference make separation, and hence intelligibility, more difficult. These findings are consistent with those of Brungart (2001), who reports large differences in performance between different-sex and same-sex (or same-talker) conditions. However, when only the pitch is manipulated, as in the study by Darwin et al. (2003), the differences in performance are less clear than in the present study. Darwin et al. (2003) found a similar combined effect of pitch and level differences only when they changed both pitch and vocal-tract length, in effect simulating two sex-related features of the voices. Results from the present study indicate that pitch manipulation alone may be sufficient to induce the level effect.

D. Energetic and informational masking

We did not record the responses of the subjects when they reproduced the target sentence, so we cannot make a proper analysis of the errors they made. From our observations during the experiments we conclude that intrusions were sometimes made, but that more often subjects reproduced only a few words or drew complete blanks.


The subjects knew that the target and interfering sentences consisted of meaningful, everyday speech (syntactically and semantically correct), so intrusions are less likely to occur than in the CRM paradigm used by Brungart and colleagues. We think the fact that interfering noise and the speech of many talkers yield different performance in the first experiment is sufficient evidence for informational masking. As discussed in Sec. IV B, speech intelligibility decreases when the number of interfering talkers increases, and part of this effect can be attributed to reduced opportunities for dip listening. Momentary differences in energetic masking can, of course, occur during simultaneous presentation of target and interfering talker(s), as there is no fixed temporal alignment between them. It is also plausible that informational masking decreases as a function of the number of talkers: when the interference sounds more like voice babble and separate words are no longer perceived clearly, there is less chance that the target voice is confused with one of the interfering voices. The assumed reduction of informational masking with four and eight interfering talkers in experiment 1 implies that the SRT should approach the SRT in noise. Yet, there remains a gap of some 3.5 dB in favor of the noise masker. We cannot explain this by differences in energetic masking: speech babble and speech noise have very similar short- and long-term spectro-temporal characteristics. So, it must be assumed that some form of informational masking remains. A possible explanation is provided by Hawley et al. (2004), who distinguish between different levels of linguistic informational masking. In the case of one or two interfering talkers, when the interfering speech is still intelligible, grammatical and semantic information affect the ability of the listener to single out the target speech.
With four or eight interfering talkers, the resulting babble can act as an informational masker at a lower linguistic level, by claiming phonetic and lexical processing resources even if this does not lead to any understanding of the interfering speech.

V. CONCLUSIONS

From the present experiments on multimodal speech perception, we can draw the following conclusions.

(1) Tactile support of diotically presented speech yields a moderate but significant increase of intelligibility in critical listening conditions. The observed benefit in intelligibility is between 0% and 27% (average 6%) and the shift in the SRT between 0 and 2.4 dB (average 1 dB). Although this was tested only in a limited number of conditions, a similar benefit seems to occur when the temporal modulations are presented visually.

(2) The benefit due to tactile support depends on the type of interference. It is relatively large when there are one or two interfering talkers, but vanishes when the interference is steady-state noise or babble. When target and interfering voice are the same and pitch differences are introduced, the maximum benefit is reached for a minimal pitch difference (0–2 semitones) and equal average speech levels.

(3) An increasing pitch difference between target and interfering speech causes a gradual increase of the intelligibility. The increase is up to 36%, averaged over SNRs from −8 to +4 dB.

(4) In conditions where target and interfering speech are highly similar (same voices with no or small pitch differences), the psychometric function becomes very shallow and can even show a local minimum around 0-dB SNR.

1 The generic term "signal-to-noise ratio" (SNR) is used throughout the paper. Noise can be either actual noise or interfering speech. "Signal" stands for target speech.
2 The index finger was used as the most convenient means to convey tactile information and because there is synchrony in the perception of the auditory and tactile stimuli (Hirsh and Sherrick, 1961).
3 The statistical analyses presented here are based on the raw percentage scores. Analysis of the arcsine-transformed scores yields essentially the same results in terms of significance.
4 Analogous to Versfeld et al. (2000) and Plomp and Mimpen (1979), we tried to derive psychometric functions from the SRT data for the conditions with one interfering talker in experiment 1. However, due to the limited number of data points per SNR over 12 subjects, it was not possible to obtain reliable slopes.

Assmann, P. F., and Summerfield, Q. (1990). "Modeling the perception of concurrent vowels: Vowels with different fundamental frequencies," J. Acoust. Soc. Am. 88, 680–697.
Bird, J., and Darwin, C. J. (1998). "Effects of a difference in fundamental frequency in separating two sentences," in Psychophysical and Physiological Advances in Hearing, edited by A. R. Palmer, A. Rees, A. Q. Summerfield, and R. Meddis (Whurr, London, UK), pp. 263–269.
Boersma, P. (1993). "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound," Proc. Institute of Phonetic Sciences University of Amsterdam 17, 97–110.
Boersma, P., and Weenink, D. (1996). "PRAAT. A system for doing phonetics by computer," Report 132, Institute of Phonetic Sciences, University of Amsterdam (http://www.fon.hum.uva.nl/praat).
Brokx, J. P. L., and Nooteboom, S. G. (1982). "Intonation and the perceptual separation of simultaneous voices," J. Phonetics 10, 23–36.
Bronkhorst, A. W. (2000). "The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions," Acust. Acta Acust. 86, 117–128.
Brungart, D. S. (2001). "Informational and energetic masking effects in the perception of two simultaneous talkers," J. Acoust. Soc. Am. 109, 1101–1109.
Brungart, D. S., Simpson, B. D., Ericson, M. A., and Scott, K. R. (2001). "Informational and energetic masking effects in the perception of multiple simultaneous talkers," J. Acoust. Soc. Am. 110, 2527–2538.
Buus, S. (1985). "Release from masking caused by envelope fluctuations," J. Acoust. Soc. Am. 78, 1958–1965.
Culling, J. F., and Darwin, C. J. (1993). "Perceptual separation of simultaneous vowels: Within and across-formant grouping by F0," J. Acoust. Soc. Am. 93, 3454–3467.
Darwin, C. J., Brungart, D. S., and Simpson, B. D. (2003). "Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers," J. Acoust. Soc. Am. 114, 2913–2922.
Drullman, R., and Bronkhorst, A. W. (2000). "Multichannel speech intelligibility and talker recognition using monaural, binaural, and three-dimensional auditory presentation," J. Acoust. Soc. Am. 107, 2224–2235.
Drullman, R., Festen, J. M., and Plomp, R. (1994). "Effect of temporal envelope smearing on speech reception," J. Acoust. Soc. Am. 95, 1053–1064.
Festen, J. M., and Plomp, R. (1990). "Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing," J. Acoust. Soc. Am. 88, 1725–1736.
Hawley, M. L., Litovsky, R. L., and Culling, J. F. (2004). "The benefit of binaural hearing in a cocktail party: Effect of location and type of interferer," J. Acoust. Soc. Am. 115, 833–843.
Hirsh, I. J., and Sherrick, C. E. (1961). "Perceived order in different sense modalities," J. Exp. Psychol. 62, 423–432.



Mahar, D. M., Mackenzie, B., and McNicol, D. (1994). "Modality-specific differences in the processing of spatially, temporally, and spatiotemporally distributed information," Perception 23, 1369–1386.
Moulines, E., and Charpentier, F. (1990). "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Commun. 9, 453–467.
Plomp, R., and Mimpen, A. M. (1979). "Improving the reliability of testing the speech reception threshold for sentences," Audiology 18, 43–52.
Sherrick, C. E. (1984). "Basic and applied research on tactile aids for deaf people: Progress and prospects," J. Acoust. Soc. Am. 75, 1325–1342.
Steeneken, H. J. M., and Houtgast, T. (1986). "Comparison of some methods for measuring speech levels," Report IZF 1986-20, TNO Institute for Perception, Soesterberg, The Netherlands.



Summers, I. R., Dixon, P. R., Cooper, P. G., Gratton, D. A., Brown, B. H., and Stevens, J. C. (1994). "Vibrotactile and electrotactile perception of time-varying pulse trains," J. Acoust. Soc. Am. 95, 1548–1558.
Versfeld, N. J., Daalder, L., Festen, J. M., and Houtgast, T. (2000). "Method for the selection of sentence materials for efficient measurement of the speech reception threshold," J. Acoust. Soc. Am. 107, 1671–1684.
Weisenberger, J. M. (1986). "Sensitivity to amplitude-modulated vibrotactile signals," J. Acoust. Soc. Am. 80, 1707–1715.
Weisenberger, J. M., and Miller, J. D. (1987). "The role of tactile aids in providing information about acoustic stimuli," J. Acoust. Soc. Am. 82, 906–916.
