Audio-Visual Integration: Generalization Across Talkers

A Senior Honors Thesis Presented in Partial Fulfillment of the Requirements for Graduation ...
Some evidence suggests that this type of generalization is possible. For example, Richie and Kewley-Port (2008) trained listeners to identify vowels using audio-visual integration techniques. They found that the training was successful: trained listeners improved from pre-test to post-test in both syllable recognition and sentence recognition, whereas untrained listeners did not. More importantly, a substantial degree of generalization across talkers was observed. They suggest that audio-visual speech perception is a skill that, when trained appropriately, can benefit speech perception in persons with hearing impairment, and argued that incorporating these techniques into aural rehabilitation could form an important and effective part of a successful program for hearing-impaired individuals.
Present Study The results from Richie and Kewley-Port (2008) offer encouragement for the possibility that across-talker generalization can be obtained. However, the question remains whether similar talker generalization can be observed for the consonant-based degraded stimuli used by Ranta (2010) and DiStefano (2010). The present study addresses this question by providing training in audio-visual speech integration with one set of talkers and testing for integration improvement with a different set of talkers. A group of normal-hearing listeners received ten training sessions in audio-visual perception of speech syllables produced by three talkers. The auditory component of these syllables was degraded in a manner consistent with the signals produced by multichannel cochlear implants (Shannon et al., 1995), similar to the methods used by James (2009), DiStefano (2010), and Ranta (2010). Listeners were periodically tested for improvement in auditory-only, visual-only and audio-visual perception with stimuli produced by both the training talkers and two additional talkers who had not been used in training. Consistent with the results of Richie and Kewley-Port, it was anticipated that integration would improve substantially for the training talkers. A smaller but still noticeable improvement was anticipated for the non-training talkers, reflecting some degree of generalization. Regardless of the outcome, the findings should provide new insights into the limits of generalizability of audio-visual integration training, and into how to design more effective aural rehabilitation programs for hearing-impaired patients.
Participants The present study included five listeners, two males and three females, aged 21 years. All five had normal hearing as well as normal or corrected vision, by self-report. Participants were compensated $150 for their participation. Materials Speech samples previously recorded from five adult talkers, two male and three female native Midwestern English speakers, were used as the stimuli.
Stimuli Selection A limited set of eight syllables was presented, all of which satisfied the following criteria:
1. The pairs of stimuli were minimal pairs; the initial consonant was their only point of difference.
2. All stimuli contained the vowel /ae/, selected because of its lack of lip rounding or lip extension, which can create speech reading difficulties.
3. Each category of articulation, including place (bilabial, alveolar, velar), manner (stop, fricative, nasal), and voicing (voiced or voiceless), was represented.
4. All syllables were presented without a carrier phrase.
The same set of single-syllable stimuli was used for each of the conditions:
The degraded audio-visual conditions included the following four dual-syllable (dubbed) stimuli. The first item in the pair represents the auditory stimulus while the second indicates the visual stimulus.
Stimuli Recording and Editing The stimuli used in this study were identical to those used in recent studies (e.g., James, 2009; DiStefano, 2010; Ranta, 2010) in order to yield comparable results.
Speech samples from the five talkers were degraded using a MATLAB script designed by Delgutte (2003). The speech signal was filtered into two broad spectral bands; the fine structure of each band was then replaced with band-limited noise while the temporal envelope remained intact. The result was a two-channel stimulus, similar to those used by Shannon et al. (1998). Using a commercial video editing program, Video Explosion Deluxe, the degraded auditory stimuli were dubbed onto the visual stimuli.
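The degradation procedure can be sketched as a simple two-channel noise vocoder. This is an illustrative reconstruction in Python, not the Delgutte (2003) MATLAB script itself; the filter order, band edges, and envelope-extraction method shown here are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def vocode_two_channel(signal, fs, crossover_hz=1500.0):
    """Two-channel noise vocoding: for each spectral band, replace the
    fine structure with band-limited noise but keep the temporal envelope.
    Band edges and filter order are hypothetical, not the study's values."""
    rng = np.random.default_rng(0)
    bands = [(80.0, crossover_hz), (crossover_hz, min(8000.0, fs / 2 - 1))]
    out = np.zeros(len(signal))
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, signal)
        envelope = np.abs(hilbert(band))                    # temporal envelope
        carrier = sosfilt(sos, rng.standard_normal(len(signal)))  # band-limited noise
        out += envelope * carrier                           # envelope-modulated noise
    return out
```

Summing the envelope-modulated noise bands preserves the speech rhythm and amplitude cues while removing spectral fine structure, which is what makes the stimuli approximate the output of a two-channel cochlear implant processor.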
The final step involved burning the stimulus sets onto DVDs using Sonic MY DVD. Four DVDs were created for each of the five talkers, each containing sixty stimuli arranged in random order to prevent memorization by the participants.
Visual Presentation All participants were initially pre-tested in the degraded auditory, visual and audio-visual conditions, and then received training in all three of these conditions. The visual portion of the stimulus was presented on a 50 cm video monitor positioned approximately 60 cm outside the window of a sound-attenuating booth. The monitor was at eye level to the participants and positioned about 120 cm away from them. The stimuli were presented from recorded DVDs on a DVD player. During auditory-only presentation the monitor screen was darkened.
Degraded Auditory Presentation The degraded auditory stimuli were presented from the headphone output of the DVD player through 300-ohm TDH-39 headphones at a level of approximately 75 dB SPL.
Testing Procedure Testing was conducted in the Ohio State University’s Speech and Hearing Department located in Pressey Hall. Participants were instructed to read over a set of instructions explaining the procedure and listing a closed set of 14 possible responses. Included in the response set were the 8 presented stimuli along with 6 additional possibilities reflecting McGurk-type fusion and combination responses to the discrepant stimuli. These additional responses were the syllables dat, nat, pcat, ptat, bgat and bdat.
Each participant was tested individually in a sound-attenuating booth that faced the video monitor located outside of the booth. Auditory stimuli were transmitted through headphones inside the booth. The examiner recorded and scored the participant’s verbal responses as heard via an intercom system. Each participant was initially administered a pre-test including stimuli selected from a set of 15 DVDs, three for each of the five talkers, each DVD containing 60 randomly ordered syllables. In the pre-test, the listeners were presented with one DVD from each talker in each of the three listening conditions (auditory-only, visual-only and audio-visual). Each DVD contained 30 congruent stimuli expected to elicit the correct response. The remaining 30 stimuli were discrepant, designed to elicit McGurk-type responses. Participants were instructed to listen to/watch each DVD and to verbally report the syllable they perceived for each stimulus. During the pre-test no feedback was provided.
The pre-test was followed by five training sessions in which participants received audio-visual training on two DVDs for each of the three training talkers. When congruent stimuli were presented, if the participant provided the correct response the examiner visually reinforced it with a head nod; if the response was incorrect the examiner provided the correct response via an intercom system. For the discrepant stimuli, the appropriate responses were as follows, with the first column representing the visual stimulus, the second the auditory stimulus, and the third the expected McGurk-type response:
As with the congruent stimuli, if the participant responded correctly the examiner provided visual reinforcement, whereas if they responded incorrectly they were told the appropriate McGurk-type response via an intercom system. The McGurk-type responses were treated as the appropriate responses because Ranta’s (2010) study provided evidence that such responses can be trained; using them here allowed us to determine whether that training would generalize to other talkers.
Upon completing the five training sessions a mid-test identical to the pre-test was administered. Next, participants had five more training sessions identical to the first five.
Upon completing the additional five training sessions, a post-test identical to the mid-test and the pre-test was administered to the participants. Each test took approximately 2-3 hours and the training sessions took approximately 8-10 hours. Training was divided into 1 or 2 sessions at a time. The participants were frequently encouraged to take breaks in order to prevent fatigue.
Results of the pre-test, mid-test and post-test were analyzed to determine whether improvements were seen in all three modalities and whether these improvements generalized from the training talkers to the testing talkers. Percent correct performance data for the congruent stimuli are presented first, followed by the percent response results for the discrepant stimuli.
Percent Correct Performance Figure 1 displays the averaged results for overall percent correct intelligibility performance in each modality for the auditory-only (A-only), visual-only (V-only) and audio-visual (A+V) (congruent) conditions for each testing situation, pre-test, mid-test and post-test. Results are shown for the stimuli produced by training talkers. Listeners showed improvements from pre-test to post-test in all three modalities. A two-factor repeated measures analysis of variance (ANOVA) was performed on arcsine-transformed percentages to assess the improvements and evaluate whether differences observed across testing sessions were statistically significant. ANOVA results indicated a significant main effect of test (pre vs. post), F(1,4)=50.525, p=.002, as well as a significant main effect of modality (A-only, V-only, A+V), F(2,8)=87.364, p<.001. There was no significant interaction found between test and modality, F(2,8)=2.65, p=.13 (ns).
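The arcsine transform applied before the ANOVA can be illustrated as follows. The thesis does not state which variant was used; the sketch below assumes the common variance-stabilizing form 2·arcsin(√p) for a proportion p.

```python
import math

def arcsine_transform(proportion):
    """Variance-stabilizing arcsine transform commonly applied to
    proportion-correct scores before ANOVA: 2 * arcsin(sqrt(p)).
    (Assumed form; the thesis does not specify the exact variant.)"""
    return 2.0 * math.asin(math.sqrt(proportion))

# Proportions near 0 or 1 have compressed variance; the transform
# spreads them out symmetrically around pi/2 (the image of p = .5).
lo = arcsine_transform(0.25)   # pi/3
hi = arcsine_transform(0.75)   # 2*pi/3
```

The transform stretches the scale near the floor and ceiling, making the homogeneity-of-variance assumption of the repeated-measures ANOVA more plausible for percent-correct data.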
Pairwise comparisons were also performed for these data. Results showed that there was no significant difference between the means of A-only and V-only performance, mean difference=.194, p=.015. A significant difference was found between A-only and A+V, mean difference=.456, p=.001, and between V-only and A+V, mean difference=.65, p<.001.
It is important to note that the significant improvement from pre-test to post-test in all three modalities generalized to the testing talkers as well, as shown in Figure 2.
Figure 2 shows the results for overall percent correct intelligibility performance in each of the listening conditions, A-only, V-only and A+V, for each testing situation, pre-test, mid-test and post-test, for the talkers not used in the training sessions (i.e., the testing talkers). ANOVA results for the testing talkers revealed a significant main effect of test (pre vs. post), F(1,4)=45.499, p=.003, as well as a significant main effect of modality (A-only, V-only, A+V), F(2,8)=115.052, p<.001. As with the training talkers, there was no significant interaction found between test and modality, F(2,8)=1.431, p=.29 (ns).
Pairwise comparisons also revealed results similar to those of the training talkers. There was no significant difference between A-only and V-only, mean difference=.027, p=.591.
A significant difference was seen between A-only and A+V, mean difference=.550, p<.001, as well as between V-only and A+V, mean difference=.523, p<.001.
Figures 3-5 display these data in a format allowing easier comparison. In Figure 3 results are shown for percent correct performance in the A-only condition across tests with training and testing talkers, for side-by-side comparison. This graph shows that the listeners improved their performance from pre-test to post-test with both the training talkers and the testing talkers. ANOVA results revealed that there was a significant main effect of test (pre vs. post), F(1,4)=37.440, p=.004, as well as a significant main effect of talker (training vs. testing), F(1,4)=252.066, p<.001. In Figure 4, results for the V-only condition are displayed. ANOVA results for these data show a significant effect of test, F(1,4)=141.307, p<.001, but no difference across talkers, F(1,4)=.385, p=ns, and no significant interaction, F(1,6)=.234, p=ns. Figure 5 shows data for the A+V condition.
Here no significant effects were observed across tests, F(1,4)=4.550, p=.100, nor across talkers, F(1,4)=4.369, p=.105. Again, no interaction was observed, F(1,4)=1.395, p=.303.
Integration performance with the congruent stimuli across tests is shown in Figure 6. The averages for training talkers and testing talkers are shown. Here integration is defined as the difference between the percent correct in the A+V condition and the best single modality performance (A-only or V-only). Using this measure, the amount of integration actually declines slightly from pre-test to post-test for both the training talkers and the testing talkers. A two-factor ANOVA revealed that there was no significant main effect of test (pre vs. post), F(1,4)=3.642, p=.13. There was also no significant main effect of talker (training vs. testing), F(1,4)=1.359, p=.30. This decrease in integration could be attributed to the fact that the listeners showed greater improvements in the A-only and V-only conditions as compared to the A+V condition.
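The integration measure defined above can be expressed as a one-line computation. The scores below are hypothetical, chosen only to illustrate how integration can shrink even when every modality improves; they are not the study's data.

```python
def integration(av_correct, a_only_correct, v_only_correct):
    """Integration = A+V percent correct minus the best
    single-modality (A-only or V-only) percent correct."""
    return av_correct - max(a_only_correct, v_only_correct)

# Hypothetical percent-correct scores (not the study's data):
# every score rises from pre-test to post-test, yet integration falls,
# because the single-modality gains outpace the A+V gain.
pre = integration(av_correct=80.0, a_only_correct=45.0, v_only_correct=60.0)
post = integration(av_correct=88.0, a_only_correct=70.0, v_only_correct=72.0)
```

This makes the pattern in Figure 6 concrete: a decline in this difference score does not imply that listeners integrated less, only that the A+V advantage over the best single modality narrowed.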
Figure 7 examines the results for stimuli produced by individual talkers. This figure shows the pre-test and post-test percent correct responses in the A-only condition, averaged across listeners, for the three training talkers as well as the two testing talkers. It is important to note that training talkers JK and EA and the testing talkers KS and DA all began with similar baseline percent correct intelligibility. However, training talker LG started off with a percent correct intelligibility that was slightly higher than the others, and listeners showed a greater improvement in this modality with this talker.