Audio-Visual Integration: Generalization Across Talkers

A Senior Honors Thesis

Presented in Partial Fulfillment of the Requirements for graduation with research distinction in Speech and Hearing Science in the undergraduate colleges of The Ohio State University


Courtney Matthews

The Ohio State University

June 2012

Project Advisor: Dr. Janet M. Weisenberger, Department of Speech and Hearing



Abstract

Maximizing a hearing-impaired individual’s speech perception performance involves training in both auditory and visual sensory modalities. In addition, some researchers have advocated training in audio-visual speech integration, arguing that it is an independent process (e.g., Grant and Seitz, 1998). Some recent training studies (James, 2009; Gariety, 2009; DiStefano, 2010; Ranta, 2010) have found that skills trained in auditory-only conditions do not generalize to audio-visual conditions, or vice versa, supporting the idea of an independent integration process, but suggesting limited generalizability of training. However, the question remains whether training can generalize in other ways, for example across different talkers. In the present study, five listeners received ten training sessions in auditory, visual, and audio-visual perception of degraded speech syllables spoken by three talkers, and were tested for improvements with an additional two talkers. A comparison of pre-test and post-test results showed that listeners improved with training across all modalities, with both the training talkers and the testing talkers, indicating that across-talker generalization can indeed be achieved. Results for stimuli designed to elicit McGurk-type audio-visual integration also suggested increases in integration after training, whereas other measures did not. Results are discussed in terms of the value of different measures of integration performance, as well as their implications for the design of improved aural rehabilitation programs for hearing-impaired persons.

Acknowledgments

I would like to thank my project advisor, Dr. Janet Weisenberger, for all of her guidance and support throughout my honors thesis process. Because of her, I was able to expand my knowledge more than I could have ever expected. I am extremely grateful for her time, assistance and patience. I would also like to thank my subjects for their flexibility, time and effort.

This project was supported by an ASC Undergraduate Scholarship and an SBS Undergraduate Research grant.

Abstract
Acknowledgments
Table of Contents
Chapter 1: Introduction and Literature Review
Chapter 2: Method
Chapter 3: Results and Discussion
Chapter 4: Summary and Conclusions
Chapter 5: References
List of Figures
Figures 1-13

Chapter 1: Introduction and Literature Review

Effective aural rehabilitation programs provide hearing-impaired patients with training that can be generalized to different situations. It is important that patients can apply this training to everyday circumstances of speech perception. Maximizing an individual’s speech perception performance involves training in both auditory and visual sensory modalities. Although it has long been known that listeners will use both auditory and visual sensory modalities in situations where the auditory signal is compromised in some way (for example, listeners with hearing impairment), research has shown that listeners will use both of these modalities even when the auditory signal is perfect.

McGurk and MacDonald (1976) found that when listeners were presented simultaneously with a visual syllable /ga/ and an auditory syllable /ba/, they perceived the sound /da/, a “fusion” response. Although the auditory /ba/ was in no way distorted, the response occurs because the brain cannot ignore the visual stimulus. The resulting perception integrates, or fuses, the auditory stimulus /ba/, which has a bilabial place of articulation, and the visual stimulus /ga/, which has a velar place of articulation, to form /da/, which has an intermediate alveolar place of articulation. When the stimuli were reversed, so that an auditory /ga/ was presented with a visual /ba/, the most common response was a “combination” response of /bga/. A “combination” response occurs because the visual stimulus is too prominent to be ignored, so rather than fusing the stimuli the brain combines the prominent visual stimulus with the auditory signal to create a new perception. Subsequent studies have explored the limits of this audio-visual integration.
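The response categories described above can be made concrete with a small scoring sketch. The function below is only an illustration, assuming consonant-plus-/a/ syllables written as plain strings; the name `classify_response` and the one-entry `FUSION_TABLE` are hypothetical and are not taken from McGurk and MacDonald (1976) or from the scoring procedure of the present study.

```python
# Hypothetical lookup of expected fusion percepts, keyed by (auditory, visual).
# Only the pairing discussed in the text is listed.
FUSION_TABLE = {("ba", "ga"): "da"}


def classify_response(auditory: str, visual: str, response: str) -> str:
    """Label a response to a discrepant auditory/visual syllable pair."""
    if response == auditory:
        return "auditory"
    if response == visual:
        return "visual"
    if response == FUSION_TABLE.get((auditory, visual)):
        return "fusion"
    # A combination response keeps both consonants, in either order,
    # followed by the shared vowel (for example "bga").
    a_cons, v_cons, vowel = auditory[0], visual[0], auditory[1:]
    if response in (a_cons + v_cons + vowel, v_cons + a_cons + vowel):
        return "combination"
    return "other"


# The two cases from the text:
# classify_response("ba", "ga", "da")  -> "fusion"
# classify_response("ga", "ba", "bga") -> "combination"
```

Tallying these labels across trials would give one simple count of how often a listener integrates the two modalities, although other measures of integration are possible.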

To understand the nature of this integration, it is important to consider the types of auditory and visual cues that are available in the speech signal.

Auditory Cues for Speech Perception

In most situations the auditory cue alone is sufficient for listeners to understand speech sounds. Within the auditory signal there are three main cues for identifying speech: place of articulation, manner of articulation and voicing. Place of articulation refers to the physical location within the oral cavity where the airstream is obstructed. Included in this category are bilabials (/b,m,p/), labiodentals (/f,v/), interdentals (/ð,θ/), alveolars (/s,z/), palatal-alveolars (/ʒ,ʃ/), palatals (/j,r/) and velars (/k,g,ŋ/). The manner of articulation refers to the way in which the articulators move and come in contact with each other during sound production. This includes stops (/p,b,t,d,k,g/), fricatives (/f,v,θ,s,z,h/), affricates (/tʃ,dʒ/), nasals (/m,n,ŋ/), liquids (/l,r/) and glides (/j/). Voicing indicates whether or not the vocal folds vibrate during the production of the sound. If they do vibrate, the sound is referred to as a voiced sound (/b,d,g,v,z,m,n,w,j,l,r,ð,ŋ,ʒ,dʒ/), and if they do not, it is referred to as a voiceless sound (/p,t,k,f,s,h,θ,ʃ,tʃ/) (Ladefoged, 2006). Cues to place, manner and voicing are present in the acoustic signal, in characteristics such as formant transitions, turbulence and resonance, and voice onset time.
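The three cues can be pictured as a small feature table. The sketch below is an illustration only: it covers a subset of English consonants, and the names `Features`, `CONSONANTS` and `differing_features` are hypothetical. It shows how two consonants can be compared feature by feature, which is the kind of comparison that underlies place, manner and voicing confusions.

```python
from typing import NamedTuple


class Features(NamedTuple):
    place: str
    manner: str
    voiced: bool


# Illustrative subset of English consonants (not a complete inventory).
CONSONANTS = {
    "p": Features("bilabial", "stop", False),
    "b": Features("bilabial", "stop", True),
    "m": Features("bilabial", "nasal", True),
    "f": Features("labiodental", "fricative", False),
    "v": Features("labiodental", "fricative", True),
    "t": Features("alveolar", "stop", False),
    "d": Features("alveolar", "stop", True),
    "s": Features("alveolar", "fricative", False),
    "z": Features("alveolar", "fricative", True),
    "k": Features("velar", "stop", False),
    "g": Features("velar", "stop", True),
}


def differing_features(a: str, b: str) -> list:
    """Return the names of the features on which two consonants differ."""
    fa, fb = CONSONANTS[a], CONSONANTS[b]
    return [name for name in Features._fields
            if getattr(fa, name) != getattr(fb, name)]


# differing_features("p", "b") -> ["voiced"]
# differing_features("b", "g") -> ["place"]
```

A pair such as /b/ and /g/ differs only in place, which is why the McGurk stimuli described earlier isolate the place cue.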

Visual Cues for Speech Perception

Although most of the information required for comprehending a speech signal can be obtained from auditory cues, McGurk and MacDonald (1976) showed that visual cues also play an important role in speech perception. Visual cues become especially useful in situations where the auditory signal is compromised, but as their study showed, even when the auditory signal is perfect, visual cues are still used by listeners. The sole characteristic of speech production that can be reliably visually detected is place of articulation, but even the results of this observation are often ambiguous (Jackson, 1988).

A primary reason that it is extremely difficult to identify speech sounds by visual cues alone is the fact that many sounds look alike. Sounds that look alike are referred to as viseme groups, sets of phonemes that use the same place of articulation but vary in their voicing characteristics and manner of articulation (Jackson, 1988). Since place of articulation is the primary observable feature of speech sounds, it is extremely difficult to differentiate among phonemes that use the same place. The phonemes /p,b,m/ are an example of a viseme group; they all use a bilabial place of articulation, making them visually indistinguishable. It is also important to note that talkers are not all identical and that the clarity of visual speech cues can vary greatly. Jackson (1988) found that it was easier to speechread talkers who created more viseme categories than talkers who created fewer. There are also other talker features that contribute to the ability to speechread, including gestures, head and eye movements and even mouth shape. All of these visual cues can aid a listener in any speaking situation, but especially in situations in which the auditory signal is compromised.
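A rough sketch can show why more viseme categories make a talker easier to speechread. It rests on two simplifying assumptions that are not claims from Jackson (1988): viseme groups are defined purely by place of articulation, and a viewer always identifies the correct group but guesses uniformly within it. The table `PLACE` and both function names are hypothetical.

```python
from collections import defaultdict

# Illustrative place labels for a small consonant set (assumption).
PLACE = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "s": "alveolar", "z": "alveolar",
    "k": "velar", "g": "velar",
}


def viseme_groups(place_by_phoneme: dict) -> dict:
    """Group phonemes that share a place label into one viseme group."""
    groups = defaultdict(list)
    for phoneme, place in place_by_phoneme.items():
        groups[place].append(phoneme)
    return dict(groups)


def expected_visual_only_accuracy(place_by_phoneme: dict) -> float:
    """Accuracy if the group is always seen correctly and the phoneme
    is guessed uniformly within it, with each phoneme equally likely."""
    groups = viseme_groups(place_by_phoneme)
    total = len(place_by_phoneme)
    # A phoneme in a group of size n is guessed correctly with chance 1/n.
    return sum(len(g) * (1 / len(g)) for g in groups.values()) / total


# viseme_groups(PLACE)["bilabial"] -> ["p", "b", "m"]
```

Under these assumptions the expected accuracy equals the number of groups divided by the number of phonemes, so a talker who produces more distinguishable groups yields higher visual-only scores.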

Speech Perception with Reduced Auditory and Visual Signals

Studies have shown that speech can still be intelligible in situations where the auditory cues are compromised. This is because speech signals are somewhat “redundant,” meaning that they contain more than the minimum information required for identifying the sounds. Shannon et al. (1995) performed a study with speech signals modified to be similar to those produced by a cochlear implant. This was achieved by removing the fine structure information of the speech signals and replacing it with band-limited noise, while maintaining the temporal envelope of the speech. Different numbers of noise bands were used, and intelligibility increased as the number of frequency bands increased. However, high levels of speech recognition were reached with as few as three bands, indicating that speech signals can still be identified even with a large amount of information removed.
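The processing described above can be sketched as a simple noise vocoder. This is a minimal standard-library illustration, not the implementation used by Shannon et al. (1995) or in the present study: the second-order band-pass filter, the 50 Hz envelope cutoff and the band edges in the usage note are all assumptions, and every function name is hypothetical.

```python
import math
import random


def bandpass(signal, fs, low_hz, high_hz):
    """Second-order band-pass filter (standard biquad, unity peak gain)."""
    center = math.sqrt(low_hz * high_hz)
    q = center / (high_hz - low_hz)
    w0 = 2 * math.pi * center / fs
    alpha = math.sin(w0) / (2 * q)
    b0, b2 = alpha, -alpha
    a0, a1, a2 = 1 + alpha, -2 * math.cos(w0), 1 - alpha
    out, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for x in signal:
        y = (b0 * x + b2 * x2 - a1 * y1 - a2 * y2) / a0
        out.append(y)
        x2, x1, y2, y1 = x1, x, y1, y
    return out


def envelope(signal, fs, cutoff_hz=50.0):
    """Temporal envelope: full-wave rectify, then one-pole low-pass."""
    coeff = math.exp(-2 * math.pi * cutoff_hz / fs)
    env, state = [], 0.0
    for x in signal:
        state = (1 - coeff) * abs(x) + coeff * state
        env.append(state)
    return env


def noise_vocode(signal, fs, band_edges_hz, seed=0):
    """Replace fine structure in each band with envelope-modulated noise."""
    rng = random.Random(seed)
    output = [0.0] * len(signal)
    for low, high in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        env = envelope(bandpass(signal, fs, low, high), fs)
        noise = [rng.gauss(0.0, 1.0) for _ in signal]
        carrier = bandpass(noise, fs, low, high)  # band-limited noise
        for i, (e, c) in enumerate(zip(env, carrier)):
            output[i] += e * c
    return output
```

For a three-band version, a call such as `noise_vocode(samples, 16000, [100, 800, 1500, 4000])` would use three adjacent bands; these edges are placeholders chosen for illustration, not the values from the original study.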

The study discussed above was expanded by Shannon et al. in 1998. Four manipulations were applied in the study: the location of the band division was varied, the spectral distribution of the envelopes was warped, the frequencies of the envelope cues were shifted and spectral smearing was applied. The factors that most negatively influenced intelligibility were found to be the warping of the spectral distribution and shifting the tonotopic organization of the envelope. The exact frequency cutoffs and overlapping of the bands did not affect speech intelligibility as greatly.

Another study that examined the speech intelligibility of degraded auditory signals was performed by Remez et al. (1981), who reduced speech sounds to three sine waves that followed the three formants of the original auditory signal. Although it was reported that the signals were unnatural-sounding, they were highly intelligible to the listeners. This study further suggests that auditory cues are packed with more information than absolutely needed for identification, and that even highly degraded speech signals can still be understood.
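The idea of replacing speech with three sinusoids can be sketched as follows. This is a simplified illustration rather than the procedure of Remez et al. (1981): each sinusoid follows one formant frequency track, the frequency is held constant within each 10 ms frame, and the amplitudes are fixed instead of following the formant amplitudes. The function name and all parameter values are assumptions.

```python
import math


def sine_wave_speech(formant_tracks_hz, amplitudes, fs, frame_s=0.01):
    """Sum one sinusoid per formant track.

    formant_tracks_hz: one list of per-frame frequencies for each formant.
    amplitudes: one fixed amplitude per formant (a simplification).
    """
    n_frames = len(formant_tracks_hz[0])
    samples_per_frame = int(round(fs * frame_s))
    # Accumulating phase keeps each sinusoid continuous across frames.
    phases = [0.0] * len(formant_tracks_hz)
    out = []
    for frame in range(n_frames):
        for _ in range(samples_per_frame):
            sample = 0.0
            for k, track in enumerate(formant_tracks_hz):
                phases[k] += 2 * math.pi * track[frame] / fs
                sample += amplitudes[k] * math.sin(phases[k])
            out.append(sample)
    return out


# Invented tracks for illustration only (not measured formants):
tracks = [
    [300 + 40 * i for i in range(10)],   # rising first formant
    [1000 + 20 * i for i in range(10)],  # rising second formant
    [2500] * 10,                         # steady third formant
]
samples = sine_wave_speech(tracks, [1.0, 0.5, 0.25], fs=8000)
```

In practice the formant tracks would be measured from a recorded utterance; the invented values here only show the shape of the input.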

Degraded visual cues can also still be useful signals in understanding speech. Munhall et al. (2004) studied whether degraded visual cues affected speech intelligibility. They employed visual images degraded through band-pass and low-pass spatial filtering, which were presented to listeners along with auditory signals in noise. High spatial frequency information was apparently not needed for speech perception, and it was concluded that compromised visual signals can nonetheless be accurately identified (Munhall et al., 2004).
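Low-pass spatial filtering can be illustrated with a simple mean (box) filter over a grayscale image. This is a stand-in chosen for brevity and is not the filter used by Munhall et al. (2004); the function name `box_blur` and the list-of-rows image format are assumptions.

```python
def box_blur(image, radius=1):
    """Low-pass filter a grayscale image given as a list of rows.

    Each output pixel is the mean of the input pixels inside a square
    window of side 2 * radius + 1, clipped at the image borders.
    """
    height, width = len(image), len(image[0])
    blurred = []
    for r in range(height):
        row = []
        for c in range(width):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
            ]
            row.append(sum(window) / len(window))
        blurred.append(row)
    return blurred


# A checkerboard contains only fine detail (high spatial frequencies),
# so blurring reduces its contrast.
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]
smoothed = box_blur(checker)
```

A larger `radius` removes progressively coarser detail, which is one way to picture how much fine visual information can be discarded before the coarse shape of the image is lost.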

Audio-Visual Integration of Reduced Information Stimuli

Studying audio-visual integration processes with compromised auditory signals is especially important because it simulates the experience of hearing-impaired persons and provides insights into what promotes optimal perception. Information learned from these studies can then be used when designing aural rehabilitation programs for hearing-impaired individuals. For this reason, some researchers have advocated specific training in audio-visual speech integration for aural rehabilitation programs.

Grant and Seitz (1998) offered evidence to support the idea that audio-visual integration is a process separate from auditory-only or visual-only speech perception. In experiments with hearing-impaired persons, they found that audio-visual integration could not be predicted from auditory-only or visual-only performance, leading them to argue for independence of the integration process. Grant and Seitz thus suggested that specific integration training should also be incorporated into successful aural rehabilitation programs.

Effects of Training in Recent Studies

More recent studies have further explored the relative value of modality-specific speech perception training. Many of these studies have employed normal-hearing listeners who have been presented with some form of degraded auditory stimulus to approximate situations encountered by hearing-impaired individuals. In our laboratory, James (2009) and Gariety (2009) tested syllable perception with syllables that had been degraded to mimic those generated by cochlear implants. To create their auditory stimuli they used a method similar to that employed by Shannon et al. (1995), in which the fine structure details of auditory stimuli were replaced with band-limited noise while preserving the temporal envelope. James (2009) and Gariety (2009) showed that the auditory-only component can be successfully trained. However, this training did not generalize to the audio-visual condition and thus did not improve integration results, leaving a question about whether integration is a skill that can benefit from training.

Ranta (2010) and DiStefano (2010) addressed the question of whether integration ability can be trained. They employed stimuli similar to those used by James (2009), but trained listeners only in the audio-visual condition. Results showed that integration can be trained, but the skills did not generalize to the auditory-only or the visual-only condition. The results of these studies suggest that skills do not generalize across modalities, supporting the argument that integration is a process independent of auditory-only or visual-only processing. However, because the value of aural rehabilitation programs is highly dependent on skills generalization, the question still remains whether this form of training can generalize in other ways, for example across different talkers.

