Journal of Experimental Psychology: Human Perception and Performance 1990, Vol. 16, No. 4, 742-754

Copyright 1990 by the American Psychological Association, Inc 0096-1523/90/$00.75

Duplex Perception: A Comparison of Monosyllables and Slamming Doors

Carol A. Fowler
Haskins Laboratories, New Haven, Connecticut, and Dartmouth College

Lawrence D. Rosenblum
University of California, Riverside

This document is copyrighted by the American Psychological Association or one of its allied publishers. This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

Duplex perception has been interpreted as revealing distinct systems for general auditory perception and speech perception. The systems yield distinct experiences of the same acoustic signal, the one conforming to the acoustic structure itself and the other to its source in vocal-tract activity. However, this interpretation has not been tested by examining whether duplex perception can be obtained for nonspeech sounds that are not plausibly perceived by a specialized system. In five experiments, we replicate some of the phenomena associated with duplex perception of speech using the sound of a slamming door. Similarities between subjects' responses to syllables and door sounds are striking enough to suggest that some conclusions in the speech literature should be tempered: that (a) duplex perception is special to sounds for which there are perceptual modules and (b) duplex perception occurs because distinct systems have rendered different percepts of the same acoustic signal.

Liberman and Mattingly (1985, 1989) proposed that speech perception is subserved by a "module" (Fodor, 1983, 1985) distinct from the general auditory system. The phonetic module fits the defining characteristics of modules outlined by Fodor (1983). There is neurophysiological evidence for a distinct part of the nervous system responsible for language. In addition, listeners have little conscious awareness of speech processing in that they cannot describe speech as sounding like anything else but speech (in contrast to, say, Morse code, which can be heard as both a meaningful message and a series of meaningless "dots and dashes"); that is, speech perception is "cognitively impenetrable." Further, speech processing is "informationally encapsulated"; for example, listeners hear synthetic speech and even highly impoverished "sine-wave speech" (Remez, Rubin, Pisoni, & Carrell, 1981) as speech even though they are aware that such signals are not produced by a human vocal tract. Finally, there is evidence that speech processing is domain specific; that is, it is special to speech. Evidence for this last claim comes from comparisons of the objects of speech perception with ostensible objects of nonspeech auditory perception. In speech, dimensions of the percept fail to correspond, in many cases at least, to obvious dimensions of structure in the acoustic speech signal and do correspond to the vocal-tract activity that might have produced the signal (see Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967, and Liberman & Mattingly, 1985, for reviews and discussion of the findings). In contrast, in at least some cases, objects of nonspeech auditory perception correspond more closely to acoustic structure. For example,

in isolation, a frequency glide will sound to a listener like a pitch glide. However, integrated with an acoustic signal for a consonant-vowel (CV) syllable, it will contribute to the perception of a consonant of which the pitch glide is not experienced as a part. Moreover, rather different frequency glides that sound quite distinct in isolation may sound like indistinguishable tokens of a given consonant when they are integrated into different syllabic frames. Apparently, they sound alike just when they signal the same consonantal gesture of the vocal tract (Liberman et al., 1967). The acoustic signals for the same gesture are different in different syllables because of coarticulatory overlap of consonant and vowel production. These findings and compatible others led to the development of the motor theory of speech perception. In the theory, phonetic gestures of the vocal tract, and not acoustic structure per se, are perceptual primitives of speech, whereas acoustic structure itself provides the primitives for auditory perception of other sounds. This implies that perceptual processing of speech must be distinct from general auditory processing, and it must somehow recover the articulatory source of the acoustic speech signal. According to the motor theory, to help unravel effects of coarticulation on the acoustic signal, listeners engage their own speech motor systems in perceiving the speech of others. The talker's acoustic signal serves as the basis for a hypothesis as to the sequence of consonants and vowels that, when coarticulated, would give rise to the signal. In the theory, listeners use their speech motor systems (in Liberman & Mattingly, 1985, an "innate vocal tract synthesizer") to test the hypothesis. This gives the percept its motor character.
Not unexpected from this perspective are findings that talkers' idiosyncratic methods of articulating some phonetic segments (Bell-Berti, Raphael, Pisoni, & Sawusch, 1979) appear to affect their perception of those segments as produced by a speech synthesizer. The phenomenon of duplex perception (e.g., Liberman, Isenberg, & Rakerd, 1981; Mann & Liberman, 1983; Rand, 1974) has been interpreted as providing particularly strong evidence in favor of two claims of the motor theory: that

This research was supported by National Institute for Child and Human Development Grant HD 01994 to Haskins Laboratories. We thank Alvin Liberman for his comments on the manuscript and Bruno Repp for supplying the synthetic-speech stimuli used in the experiments. Correspondence concerning this article should be addressed to Carol A. Fowler, Haskins Laboratories, 270 Crown Street, New Haven, Connecticut 06511.


speech perception is subserved by a specialized perceptual system, and that the speech module yields a perceptual object that is not immediately based on acoustic structure. In duplex perception, listeners report hearing a single piece of acoustic input as both speech and nonspeech simultaneously. In one paradigm (e.g., Mann & Liberman, 1983), the first and second formants of a synthetic consonant-vowel speech syllable and the steady-state part of its third formant (the "base") are presented to one ear, while the remaining portion of the signal, the third formant transition, is presented to the other ear. The third formant transitions chosen for isolation in the experiment determine place of articulation for the syllable-initial consonant. However, in isolation, the transition sounds like a nonspeech pitch glide or "chirp." When the transition and base are presented dichotically in the appropriate temporal relationship, listeners report hearing an unambiguous integrated syllable in the ear receiving the base, and, at the same time, a nonspeech "chirp" in the ear receiving the transition. For the motor theorist, the duplex phenomenon provides strong support for a distinct speech module. The fact that the transition is heard simultaneously both as part of a speech syllable and as a nonspeech chirp implies that two distinct perceptual mechanisms give rise to the perceptual experience: one yielding perception of the speech syllable and one yielding perception of the pitch glide. Heard as nonspeech, the transition sounds like the frequency glide that it is, whereas, integrated with a syllable, it sounds like a consonant of which the glide is not experienced as a part. Liberman and Mattingly (1989) suggested that although the general auditory perceiving system yields a percept of the acoustic input itself, the speech module and a small number of other highly specialized modules of the auditory system do not.
Whereas the outputs of the general auditory system are "homomorphic" with respect to the structure of stimulation at the sense organ (in that, for example, a frequency glide sounds like a pitch glide), the output of the speech module is "heteromorphic." A question is why listeners do not always experience duplex perception. That is, how does the speech module know what, in stimulation, belongs to its domain and how does it ordinarily prevent the general auditory system from yielding a distinct homomorphic percept of stimulation in the speech module's domain? Recently, Whalen and Liberman (1987) proposed an answer obtained using a new methodology for observing duplex perception. Their answer is that the speech module is preemptive: It gets initial access to acoustic input, and the general auditory system gets its leftovers. For their experiment, Whalen and Liberman used a synthetic speech base but a sinusoidal third formant transition and sounds were presented diotically rather than dichotically. The transitions were sinusoids to enhance the distinctness of the two parts of the duplex percept, which now would be heard in the same spatial location. In addition, however, a finding of duplex perception with sinusoidal transitions is all the more remarkable, because base and transition are discordant. In the experiment, subjects were first asked to label the transitions presented in isolation as either "da" or "ga." On this task, although subjects were able to discriminate the


transitions, as evidenced by their consistent assignment of one transition to the category "da" and the other to "ga," they heard neither as particularly "da"- or "ga"-like; averaged over subjects, accuracy was random on the task. Next, a "duplexity threshold" was determined for each subject. The base, which sounded like an ambiguous "da" to most subjects, was presented diotically with one of the transitions. Subjects were instructed to adjust the intensity level of the transition until they just started to hear both an ambiguous syllable ("da" or "ga" depending on which transition was presented) and a nonspeech "chirp." At these critical intensity levels, tests of duplexity were conducted by having subjects match the duplexed transition to transitions presented in isolation. Subjects performed well above chance on this task, indicating that true duplexity was occurring. Likewise, subjects were very accurate in identifying the consonant as "da" or "ga" both at the duplexity threshold and below it. There was in fact no significant difference between subjects' syllable-labeling performance when the transitions were presented at intensities above and below the duplexity threshold. The results of the experiment by Whalen and Liberman can be summarized as follows. When the base was presented alone, subjects generally reported hearing an ambiguous "da." When the transition was present, but at intensities below the duplexity threshold, subjects reported hearing an unambiguous "da" or "ga," depending on the transition. Finally, at transition intensities at or above the duplexity threshold, subjects reported hearing both the unambiguous syllable and a distinct nonspeech chirp, which corresponded to whichever transition was present. Because the transition contributes to the speech percept at intensity levels lower than those at which it is heard as nonspeech, Whalen and Liberman concluded that the processing of the transition as speech has priority. 
In other words, the speech module "preempts" the input and then passes any remainder of the signal to the general auditory system. Whalen and Liberman concluded that such preemption reflects "the profound biological significance of speech" (p. 171). We interpret the earlier research findings that led to the motor theory differently than did Liberman and his colleagues, and therefore are disposed to look for an alternative account of duplex perception. Evidence that speech percepts frequently fail to reflect the acoustic structure in an obvious way, but do reflect the vocal tract behaviors producing the signal, led Liberman and his colleagues to conclude that perception of speech is different from perception of other sound-producing events, and led them to look inside the perceiver for an explanation for the difference. From a different perspective, however, evidence that listeners to speech recover the physical sound-producing activity of the vocal tract suggests that, in its public aspect, perceiving speech is very much like perceiving other environmental events, and that there is no need to look inside the perceiver to explain how the percept acquires its motor character (Fowler, 1986; Fowler & Rosenblum, in press; Rosenblum, 1987). By the "public aspect" of perception, we refer to the events in an environment that perceivers can be shown to recover in perception and the information in stimulation available to perceptual systems that support that recovery. Accordingly,


the public aspect of perception is complementary to its covert or private aspect, which includes whatever processing the nervous system may undertake in subserving perceptual recovery of environmental events. When we suggest that, in its public aspect, speech perception is like perception of other environmental events, we are proposing that the kinds of things that can serve as perceptual objects (physical events in the environment) and the role that stimulation at the sense organ plays in perception are alike, but we are not making a claim that, necessarily, covert nervous system activities are the same. Visual perceivers experience seeing a world outside of themselves; they do not experience seeing the structured optic array that stimulates the retina. In visual perception, that is, optical structure serves as information for its causal environmental source (e.g., Gibson, 1966, 1979). Accordingly, in Liberman and Mattingly's terms, visual perception, like speech perception, is generally "heteromorphic"; dimensions of perceptual experience are not dimensions of the optical structure at the sense organ. More positively, however, dimensions of perceptual experience are those of the environmental causes of structure. Optic arrays are not objects of perception themselves; rather, they are the means by which observers perceive the environment. By the same token, if a perceiver sees, for example, a glass on a table and reaches out to pick it up, most likely he or she will pick up an object conforming in haptic experience to the object experienced visually. That is, in haptic perception, perceivers feel real-world objects that cause complex deformations of the skin (cf. Gibson, 1962); they do not perceive the skin deformations themselves. Like vision, haptic perception is heteromorphic.
More generally, in its public aspect, perception is the means by which organisms come to know their environment by using structure in stimulation at the sense organs, not as an object of perception itself, but as information about its causes in the environment. The kind of perception that Liberman and Mattingly call "heteromorphic" allows perceivers to know just one world, the one out there. That kind of perception is "homomorphic," mostly, with respect to environmental causes of structure in stimulation. In speech, listeners use an acoustic signal as information for its causal source—the gestures of the vocal tract that realize the talker's phonetic message. However, we claim, they do the same thing when they receive acoustic products of any sound-producing event; speech perception is not special in this regard. What then of duplex perception? Does it not, in any case, suggest that there is a distinct mechanism subserving speech perception? We hypothesize that it has another interpretation. The purpose of our experiments was to explore that possibility.

Experiment 1

The research by Whalen and Liberman suggested a starting point for our research. In that study, listeners only reported a duplex percept when the intensity of the transition was rather high. Possibly, the sine-wave transition was wholly integrated with the syllabic base until its intensity was too high to constitute a plausible syllable-initial transition for the base.

Excess energy in the transition was perceived as the acoustic product of a distal source distinct from the syllable. If that manner of "parsing" the acoustic structure underlies the duplex percept, then it should be possible to obtain duplex perception, and "preemptiveness," for the acoustic product of almost any sound-producing event if part of the product is made unnaturally intense in relation to the remainder. There have been reports of duplex perception in the literature for stimuli other than speech. However, in our view, some of the reports are mistaken and others do not challenge Whalen and Liberman's interpretation of their findings. Bregman (1987) reported instances of duplexity in visual perception. His examples are of occlusion of part of one object by another. In one example, part of a square is occluded by a transparent object with opaque stripes; one of the stripes overlays most of one side of the square. Even so, most viewers identify the partially occluded object as a complete square. In our view, this kind of example lacks the essential properties of duplex perception. The observer does not see one fragment of the pictorial display in qualitatively different ways at the same time. The observer sees that a stripe hides a side of the square. Moreover, the observer finds it completely acceptable if he or she is told that, in fact, the square is incomplete. In contrast, the isolated formant transition does not include a /g/. Rather, the transition gives rise to two qualitatively very distinct perceptual experiences, and the two perceived objects can even be heard as emanating from different locations in space. Finally, listeners cannot hear the syllable base unintegrated with the transition in the way that they can imagine an incomplete square behind the occlusion. There are also reports of duplex perception for musical stimuli (Collins, 1985; Pastore, Schmuckler, Rosenblum, & Szczesiul, 1983). 
We chose to look for duplex perception for other nonspeech sounds, however, because we wanted to use sounds for which it could not plausibly be supposed that listeners have a specialized perceptual module. A safe category of sounds in that regard would seem to be the category of products of human artifacts. We chose the sound of a door slamming. Our view of perception requires us to begin our investigation with sounds that are causal consequences of sound-producing events rather than arbitrarily synthesized sound patterns. This creates two difficulties for an attempt to replicate Whalen and Liberman's procedures exactly or, to a lesser degree, to replicate other procedures in which duplex perception is achieved for speech. One is that we do not have the means currently to substitute a caricature of a fragment of a door-slam sound for the fragment itself analogous to Whalen and Liberman's sine-wave formant transitions. This makes it more difficult to detect true duplexity, because the duplexed fragment is not as acoustically distinct from the integrated sound as it would be if it could be caricatured. A second difficulty is that nonspeech sounds must rarely, if ever, occur in da/ga-like pairs in which the sounds share most of their acoustic structure and differ by an easily parsable fragment of the spectrum. Whereas Whalen and Liberman and other investigators could show that perception of the base is changed appropriately depending on which transition is presented simultaneously with it, and could test for a duplex percept by having subjects identify the nonspeech chirp on trials designed


to elicit duplex perception, we cannot. However, we can ask whether duplex responses increase systematically with intensity of the chirp analog, and we can compare the relative frequency of duplex responses with responses indicating that the sound fragments do not integrate.


Method

Subjects. Subjects were 16 students at Dartmouth College, Hanover, New Hampshire, who participated in the experiment for course credit. All reported normal hearing.

Materials. We recorded the slamming of a metal door to a sound-attenuating chamber. Next, we low-pass filtered the signal at 10 kHz and sampled it at 20 kHz using a 16-bit A to D converter. Figure 1 (top left) shows the waveform of the acoustic signal, which was 169 ms in duration, and (top right) a spectral cross-section taken 10 ms from stimulus onset. We then filtered the signal digitally in two ways. We low-pass filtered the signal at 3 kHz to make a "base" analogue (bottom right of Figure 1), and we high-pass filtered it, also at 3 kHz, to make a "chirp" analogue, henceforth, the "excerpt" (bottom left). (The filtering distorts the signals somewhat. This may move our manipulation in the direction of that of Whalen and Liberman's use of sinusoidal transitions in making the fit between the parts somewhat discordant. However, we found the effects of filtering unnoticeable when the fragments were recombined in their appropriate intensity relations.) To us, the base sounded like a door slam; however, it was discriminable from the original metal-door sound because the metallic clanging sound was largely absent. The excerpt did not sound like a door slam, but rather like the sound of something being shaken. Finally, we made new versions of the high-pass filtered signal by attenuating or amplifying it. Sampled digitized voltages constituting the excerpt were multiplied by 0, 0.05, 0.1, 0.15, and 0.2 to make a low-intensity series, by 0.9, 0.95, 1, 1.05, and 1.1 to make an intermediate-intensity series, and by 4, 4.5, 5, 5.5, and 6 to make a high-intensity series. We made four test sequences from the set of originals. The first, meant to familiarize listeners with the sounds, presented the original metal-door slam, the base, and the excerpt in sequence five times. There was 1 s between items in a sequence of three and 3 s between sequences. The second test order, an identification test, presented each of the three signals (original metal door, base, and excerpt) eight times each in random order. There were 3 s between trials. In both of these sequences, stimuli were presented diotically at a comfortable listening level. Each of 45 trials in both final tests presented the base paired with 1 of the 15 attenuated or amplified excerpts. Each version of the excerpt appeared 3 times on the test. Trials were randomized, and there were 3 s between trials. The tests differed only in that, in one, stimuli were presented diotically, whereas in the other, they were presented dichotically.

Procedure. Subjects were run in groups of 1 to 3 students. They took the tests in the order described previously except that the order of the diotic and dichotic versions of the test was counterbalanced. Subjects were first told that they would be listening to three different sounds, and that we wanted to know how people would identify or describe them. As we played the repeating sequence of signals (metal door, base, excerpt), they were to try to identify each sound. Failing that, they were to describe the sound as best they could. Having collected their written identifications or descriptions, we next gave them names to call each sound. We told them that the first sound (the original metal-door sound) was the sound of a metal door slamming, the second (the base) was the sound of a wooden door slamming, and the third (the excerpt) was the sound of some small metal pellets being shaken in a cup.
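The stimulus construction described under Materials can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: the article does not specify the digital filters, so an ideal FFT mask stands in for them; the door recording is replaced by a hypothetical burst of decaying noise; and the function names (`split_at_cutoff`, `make_trials`, `present`) and the ear assignment in the dichotic case are invented for the example. The sampling rate, the 3-kHz cutoff, the 15 gain factors, and the 3 repetitions per excerpt come from the text.

```python
import random

import numpy as np

FS = 20_000     # sampling rate in Hz (from the text)
CUTOFF = 3_000  # low-pass / high-pass boundary in Hz (from the text)

# Multipliers applied to the excerpt (from the text).
GAINS = {
    "low": [0.0, 0.05, 0.1, 0.15, 0.2],
    "middle": [0.9, 0.95, 1.0, 1.05, 1.1],
    "high": [4.0, 4.5, 5.0, 5.5, 6.0],
}


def split_at_cutoff(signal, fs=FS, cutoff=CUTOFF):
    """Return (base, excerpt): the parts below and at-or-above `cutoff` Hz.

    Assumption: an ideal FFT mask replaces the unspecified digital filters.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    low = np.where(freqs < cutoff, spectrum, 0.0)
    high = spectrum - low
    base = np.fft.irfft(low, n=len(signal))
    excerpt = np.fft.irfft(high, n=len(signal))
    return base, excerpt


def make_trials(base, excerpt, repetitions=3, seed=0):
    """Pair the base with every scaled excerpt `repetitions` times, shuffled."""
    trials = [
        {"range": name, "gain": gain, "base": base, "excerpt": gain * excerpt}
        for name, gains in GAINS.items()
        for gain in gains
        for _ in range(repetitions)
    ]
    random.Random(seed).shuffle(trials)
    return trials


def present(trial, mode):
    """Return an (n, 2) left/right array for one trial.

    Assumption: in the dichotic case the base goes left and the excerpt right;
    the article does not state the ear assignment.
    """
    if mode == "diotic":
        mix = trial["base"] + trial["excerpt"]
        return np.column_stack([mix, mix])
    if mode == "dichotic":
        return np.column_stack([trial["base"], trial["excerpt"]])
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical stand-in for the 169-ms door recording: decaying noise.
rng = np.random.default_rng(0)
n = round(0.169 * FS)
door = rng.standard_normal(n) * np.exp(-np.arange(n) / (0.03 * FS))

base, excerpt = split_at_cutoff(door)
trials = make_trials(base, excerpt)
```

With a complementary split of this kind, the base plus the unscaled excerpt reproduces the input, which matches the observation that filtering effects were unnoticeable when the fragments were recombined in their appropriate intensity relations. Real filters with finite roll-off would only approximate this.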
We played the five sequences of

[Figure 1 graphic not reproduced. Waveform panel x-axis: seconds (0 to 0.160); spectral panels x-axis: frequency in kHz (0 to 10).]

Figure 1. (Clockwise from top left) Waveform of the metal-door sound; spectral cross-section from the door sound; cross-section after low-pass filtering; cross-section after high-pass filtering.


the three sounds over to the subjects and asked them to associate the names "metal door" (M), "wooden door" (W), and "shaking sound" (S) to the three sounds, respectively. Next, they took the 24-item identification test to determine whether they could distinguish and label the three sounds. On each trial, they were asked to identify the sounds as the metal door, the wooden door, or the shaking sound by writing an identifying letter on the answer sheet. They were asked to guess if they were uncertain. On the diotic test, we told subjects that they would either hear one of the three sounds on each trial or else they would hear two of the sounds presented simultaneously. On their answer sheet next to each trial number, they were to write one identifying letter if they heard just one of the three sounds or two different letters if they heard two, guessing at the identity of either one or both if necessary. They were not told that when they heard two sounds, one would always be a shaking sound; accordingly, any pairing of the three sounds constituted an acceptable answer on those trials on which listeners heard two sounds. Instructions on the dichotic test were the same except that subjects were asked to assign their identifications to the ear in which the identified stimulus had been presented. On their answer sheet, the left-most of two blank spaces for each trial was for identification of stimuli heard in the left ear and the right-most space for identifications of stimuli heard in the right ear.
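The written response format for the diotic test (one letter for a single sound, two different letters for two sounds) lends itself to a simple tally. The sketch below is an assumption about how such answer sheets might be scored, not the authors' scoring procedure; the function names and the sample records are hypothetical, and grouping by intensity range is just one convenient way to summarize the counts. It does not handle the ear-specific answers of the dichotic test.

```python
from collections import Counter, defaultdict

LABELS = {"M", "W", "S"}  # metal door, wooden door, shaking sound


def classify(response):
    """Normalize a written response to a category such as "M", "W", or "MS"."""
    letters = [ch for ch in response.upper() if ch in LABELS]
    if not 1 <= len(letters) <= 2 or len(set(letters)) != len(letters):
        raise ValueError(f"unscorable response: {response!r}")
    return "".join(sorted(letters))


def percent_by_range(records):
    """Percentage of each response category within each intensity range.

    `records` is an iterable of (intensity_range, response) pairs.
    """
    counts = defaultdict(Counter)
    for intensity_range, response in records:
        counts[intensity_range][classify(response)] += 1
    return {
        name: {cat: 100.0 * k / sum(c.values()) for cat, k in c.items()}
        for name, c in counts.items()
    }


# Hypothetical answer-sheet entries, for illustration only.
records = [
    ("low", "W"), ("low", "W"), ("low", "M"), ("low", "W"),
    ("high", "MS"), ("high", "sm"),
]
summary = percent_by_range(records)
```

Sorting the letters treats "MS" and "SM" as the same category, which suits the diotic test, where no ear assignment is recorded.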

Results

Identifications and descriptions. Six subjects identified the metal-door slam as the sound of a door slamming or being closed. All subjects identified the sound as a hard collision of some sort. (For example, 4 subjects reported hearing the sound of a drum beat and 2 subjects reported the sound of a foot fall, such as "footsteps on stairs" and "boot clomping on floor.") Four subjects identified the base as a door closing and, as for the metal door, the remainder specified some form of hard collision. (For example, 2 heard a drum beat, 1 heard a book being dropped, and another heard a heavy box being dropped.) No one associated the excerpt with the sound of a door closing. Almost all reported something being shaken (e.g., maracas, castanets, tambourine, keys, baby rattle). Although listeners were not particularly good at identifying the door slams as such, they were very good at narrowing the possibilities down to a highly restricted class of sound-producing events (hard collisions involving rather heavy objects). It is not possible to tell from the literature whether listeners' accuracy in identification of the door sound as such is or is not typical of sounds of this duration and familiarity. In any case, important for our purposes, listeners associated the excerpt with a rather different class of events than the door sound and the base.

Identification test. No subject made an error identifying the shaking sound. All errors were confusions between the metal door and the base. On average, however, performance was high (91% correct) on the 16 trials of the identification test in which the sound was either the metal door or the base. The errors were evenly divided between those conditions.

Duplex tests. The most frequently used response categories on these tests were M, W, and MS (metal door and shaking sound). We present responses in those categories in Figure 2 for the diotic (top) and the dichotic (bottom) tests.

[Figure 2 graphic not reproduced. Panels: diotic, dichotic; x-axis: excerpt intensity (low, middle, high); legend: wooden door, metal door, duplex.]

Figure 2. Major response categories in the diotic (top) and dichotic (bottom) conditions of Experiment 1.

In the figure, responses are collapsed across intensities within each of the three intensity ranges (low, middle, and high) of the excerpt. In general, there were no obvious trends in use of the response categories within each intensity range. On the diotic test, by analogy with the findings of Whalen and Liberman (1987), we might expect subjects to hear just the base when the intensity of the excerpt is far below its natural intensity relation to the base, to hear the integrated metal door at intermediate-intensity levels (which surround and include the natural intensity relation of the excerpt to the base), and to hear the metal door plus a residual shaking sound at high-intensity levels of the excerpt. That is exactly what we found. The W responses were predominant at low intensities of the excerpt (76% of total responses to stimuli in that intensity range), but they were rare at medium (13.8%) and high (1.3%) intensities. The decrease with intensity is highly significant, F(2, 28) = 138.8, p < .0001. (The Fs here and on other duplex identification tests are based on an arc-sine transformation of the data, required because variances were very different across the intensity ranges. In particular, they decreased markedly with decreases in response frequency. Analyses of variance [ANOVAs] using the arc-sine transformed data did not change the outcome in any important way. Effects that were significant using untransformed data remained significant under transformation. Likewise, effects that were nonsignificant in the one analysis were nonsignificant in the other.) The M responses were uncommon at low (12.4%) and high intensities (11.7%), but predominant at intermediate intensities (64.8%). The effect of intensity was significant, F(2, 28) = 34.44, p