Audio-Visual Integration: Generalization Across Talkers

A Senior Honors Thesis Presented in Partial Fulfillment of the Requirements for Graduation ...
The average percent correct responses for the pre-test and post-test for each of the talkers in the V-only condition are displayed in Figure 8. Unlike in the A-only condition, no single talker in this modality showed a baseline average intelligibility notably higher than the rest. Again, improvements were seen from pre-test to post-test for the training talkers, and this improvement appeared to generalize to the testing talkers.
These results are similar to those found in Figure 9, which shows percent correct responses for the pre-test and post-test for each of the talkers in the A+V condition.
Here we see that LG did have a higher baseline average intelligibility, but the difference was not as great as that seen in the A-only condition. Two important features of these data stand out. First, each of the talkers, training and testing alike, showed an improvement in performance from pre-test to post-test, indicating that generalization occurred. Second, the pre-test average intelligibility for all talkers in this condition was higher than in the single-modality conditions. Even at the post-test, there was still room for improvement, ruling out a possible ceiling effect. Thus, ceiling effects do not explain the decrease in integration observed in Figure 6.
Integration of Discrepant Stimuli

The responses to discrepant stimuli in the A+V condition were categorized as “auditory” (percent of the time the subject chose the auditory stimulus as the response), “visual” (percent of the time the subject chose a response reflecting the visual place of articulation), or “other” (any other type of response). Figure 10 shows the percent response averaged across listeners for the pre-test and post-test discrepant stimuli for the training talkers, and Figure 11 shows the results for the testing talkers. While an increase in “other” responses is seen, this increase was not statistically significant: ANOVA results revealed no main effect of test (pre vs. post), F(1,4)=6.221, p=.067, just missing the .05 alpha level. There was also no significant main effect of talker (training vs. testing), F(1,4)=.125, p=.74. The fact that “other” responses increased from pre-test to post-test for both the training talkers and the testing talkers suggests a decrease in reliance on the individual modalities and a possible increase in audio-visual integration.
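The categorization scheme described above can be sketched as follows. Note that the syllable labels and the place-of-articulation table are illustrative assumptions for the sketch, not the study's actual stimulus set or coding rules.

```python
# Sketch of the discrepant-trial response categorization described in the text.
# The syllable set and place-of-articulation table below are illustrative
# assumptions, not the study's actual stimuli.

PLACE = {"ba": "bilabial", "pa": "bilabial", "ma": "bilabial",
         "da": "alveolar", "ta": "alveolar", "na": "alveolar",
         "ga": "velar", "ka": "velar"}

def categorize(response, auditory, visual):
    """Label one discrepant-trial response as 'auditory', 'visual', or 'other'."""
    if response == auditory:
        return "auditory"                        # matched the auditory token
    if PLACE.get(response) == PLACE.get(visual):
        return "visual"                          # matched the visual place of articulation
    return "other"                               # neither modality alone explains it

# Example: auditory /ba/ dubbed onto visual /ga/
print(categorize("ba", "ba", "ga"))  # auditory
print(categorize("ka", "ba", "ga"))  # visual (velar, like /ga/)
print(categorize("da", "ba", "ga"))  # other
```

Under this scheme a classic McGurk fusion such as /da/ falls into "other", which is why the "other" responses are analyzed further below.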
To determine whether there had indeed been an increase in integration, the responses in the “other” category were further analyzed. Figures 12 and 13 show the results.
In Figures 12 and 13, “fusion” and “combination” responses indicate McGurk-type integration, whereas the “neither” category represents responses that show no integration. For both the training talkers and the testing talkers we see an increase in “fusion” and “combination” responses from pre-test to post-test and a corresponding decrease in “neither” responses. This pattern suggests that training facilitated integration of the discrepant stimuli and that this integration process generalized from the training talkers to the testing talkers. However, ANOVA results revealed that the main effect of test was not statistically significant, F(1,4)=4.438, p=.103, although the main effect of talker approached significance, F(1,4)=6.831, p=.059.
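The further split of "other" responses into McGurk-type categories can be sketched as follows. The fusion table is an illustrative assumption based on the classic auditory /ba/ + visual /ga/ → perceived /da/ effect (McGurk & MacDonald, 1976), not the study's actual coding rules.

```python
# Sketch of splitting "other" responses into fusion / combination / neither.
# The fusion table is an illustrative assumption based on the classic
# /ba/ + /ga/ -> /da/ McGurk effect, not the study's actual coding rules.

FUSIONS = {("ba", "ga"): "da",   # (auditory, visual) -> fused percept
           ("pa", "ka"): "ta"}

def mcgurk_category(response, auditory, visual):
    """Classify an 'other' response as 'fusion', 'combination', or 'neither'."""
    if response == FUSIONS.get((auditory, visual)):
        return "fusion"          # intermediate place of articulation
    a_c, v_c = auditory[0], visual[0]
    if a_c in response and v_c in response and response not in (auditory, visual):
        return "combination"     # contains consonants from both modalities, e.g. /bga/
    return "neither"             # no evidence of integration

print(mcgurk_category("da", "ba", "ga"))   # fusion
print(mcgurk_category("bga", "ga", "ba"))  # combination
print(mcgurk_category("la", "ba", "ga"))   # neither
```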
Overall, the present results indicate that training in the A-only, V-only and A+V conditions with one set of talkers does generalize to a different set of talkers. For both sets of talkers, improvements in all testing modalities were observed from pre-test to post-test. Further, results for discrepant stimuli suggest that audio-visual integration increased from pre-test to post-test, as measured by an increase in McGurk-type fusion and combination responses. In contrast, integration for congruent stimuli, measured as the difference between A+V and the best single modality (A or V), appeared to decrease after training, because the improvement in single-modality conditions was greater than that for the A+V condition. This apparent inconsistency can be attributed to differences in the way integration was measured in the present study, and it argues for further investigation into the utility of different measures of integration.
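The congruent-stimuli integration measure just described can be sketched directly. The percent-correct values below are illustrative, not the study's data; they simply show how measured integration can shrink even when every score improves.

```python
# Integration for congruent stimuli, computed as in the text:
# the A+V score minus the best single-modality (A or V) score.
# The percent-correct values below are illustrative, not the study's data.

def integration_gain(a_only, v_only, av):
    """A+V benefit beyond the better single modality (percentage points)."""
    return av - max(a_only, v_only)

pre  = integration_gain(a_only=40.0, v_only=30.0, av=60.0)   # 20.0
post = integration_gain(a_only=65.0, v_only=50.0, av=75.0)   # 10.0

# The single-modality scores improved more than the A+V score, so the
# measured integration decreases even though every score went up.
print(pre, post)  # 20.0 10.0
```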
Grant (2002) critiqued and compared several models for predicting integration efficiency. He focused specifically on two models, the pre-labeling model of Braida and the fuzzy logic model of Massaro, and argued that the pre-labeling model is superior. One primary difference between these two models lies in their assumptions about the time course of audio-visual integration. The pre-labeling model assumes that integration occurs early in the cognitive process, prior to a response decision. The fuzzy logic model, in contrast, assumes that integration occurs later, after initial response decisions have been made for each individual modality. Grant applied both models to one data set and found conflicting results: the fuzzy logic model suggested there were no significant signs of inefficient integrators, while the pre-labeling model showed significant differences. Grant argued for the use of the pre-labeling model because the fuzzy logic model uses a formula designed to minimize the difference between obtained and predicted scores. This creates a model that attempts to fit obtained A+V scores rather than act as a tool to predict optimal audio-visual speech perception performance. Rather than attempting to fit observed data, the pre-labeling model estimates audio-visual performance from single-modality information and predicts performance under the assumption that there is no interference across modalities. In situations where this model has been used, the predicted audio-visual scores were always greater than or equal to actual performance, whereas the predictions made using the fuzzy logic model were equally distributed between over-predicting and under-predicting. Grant concluded that the pre-labeling model places a stronger emphasis on individual differences and is therefore a better model for measuring integration efficiency.
Tye-Murray et al. (2007) further analyzed the pre-labeling model. This model, as well as a computationally simpler integration efficiency model, was used to compare integration results for normal-hearing and hearing-impaired subjects. Consistent with Grant’s findings, the pre-labeling model predicted higher integration performance than that observed for both hearing-impaired and normal-hearing listeners. However, this model found no significant difference between the two groups of listeners, suggesting that while neither group achieved its maximum integration ability, their performances were comparable. The integration efficiency model also found no significant difference between the two groups. Unlike the pre-labeling model, the integration efficiency model predicted audio-visual scores that were consistently lower than the actual scores. The integration efficiency model takes into account single-modality performance for an individual listener. Tye-Murray et al. argued that this is beneficial because it allows a deeper investigation into a listener’s skills, which can lead to the most effective rehabilitation strategy. This model offers insight into a listener’s strengths, weaknesses, and integration ability, and allows for the formation of a rehabilitation strategy that is customized for each hearing-impaired individual.
Recently, Altieri (2008) proposed a different type of model of audio-visual integration, one that employs listener reaction time as an indicator of cognitive processing complexity. While the present study did not collect reaction time data, future work could add this measure to empirical studies to determine its potential usefulness for aural rehabilitation.
Future work could build on the present results by comparing the measures used here to model-predictive measures (Grant & Seitz, 1998), simple measures of integration efficiency (Tye-Murray et al., 2007), and processing capacity measures (Altieri, 2008) to determine which, if any, of these measures can be used to develop optimized aural rehabilitation strategies for hearing-impaired persons. Nonetheless, these results support the generalizability of training in audio-visual speech perception for aural rehabilitation programs, and they argue strongly for inclusion of training in all modalities (auditory, visual, and audio-visual) to achieve maximum benefits.
DiStefano, S. (2010). Can audio-visual integration improve with training? Senior Honors Thesis, The Ohio State University.
Gariety, M. (2009). Effects of training on intelligibility and integration of sine-wave speech. Senior Honors Thesis, The Ohio State University.
Grant, K.W. & Seitz, P.F. (1998). Measures of auditory-visual integration in nonsense syllables and sentences. The Journal of the Acoustical Society of America, 104 (4), 2438-2450.
James, K. (2009). The effects of training on intelligibility of reduced information speech stimuli. Senior Honors Thesis, The Ohio State University.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.

Ranta, A. (2010). How does feedback impact training in audio-visual speech perception? Senior Honors Thesis, The Ohio State University.
Richie, C. & Kewley-Port, D. (2008). The effects of auditory-visual vowel identification training on speech recognition under difficult listening conditions. Journal of Speech, Language, and Hearing Research, 51, 1607-1619.
Shannon, R.V., Zeng, F.G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270, 303-304.
Tye-Murray, N., Sommers, M. S., & Spehar, B. (2007). Audiovisual integration and lipreading abilities of older adults with normal and impaired hearing. Ear & Hearing, 28(5), 656-668.
Figure 3: Percent correct responses for A-only tests, averaged separately across listeners for training talkers and testing talkers

Figure 4: Percent correct responses for V-only tests, averaged separately across listeners for training talkers and testing talkers

Figure 5: Percent correct responses for A+V congruent stimuli tests, averaged separately across listeners for training and testing talkers

Figure 6: Amount of integration by test, averaged across listeners separately for training talkers and testing talkers

Figure 7: Percent correct responses for pre-test and post-test, averaged by talker in the A-only condition

Figure 8: Percent correct responses for pre-test and post-test, averaged by talker in the V-only condition

Figure 9: Percent correct responses for pre-test and post-test, averaged by talker in the A+V condition

Figure 10: Percent response for discrepant stimuli averaged for training talkers across listeners, for pre-test and post-test

Figure 11: Percent response for discrepant stimuli averaged for testing talkers across listeners, for pre-test and post-test

Figure 12: McGurk-type integration results for pre-test and post-test, averaged across listeners for training talkers

Figure 13: McGurk-type integration results for pre-test and post-test, averaged across listeners for testing talkers