A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering ...
Next, I consider a number of plots looking for evidence of goat-, lamb-, and wolf-type speaker populations, similar to those shown in the prior work of Doddington et al. The first plot, addressing goat-like tendencies, that is, a propensity to cause missed detection errors in true speaker trials, is shown in Figures 3.12 and 3.13 for males and females, respectively.
Here, the average true speaker score is plotted against the number of true speaker trials for each speaker. In these plots, a large number of outlying average target scores would indicate greater variability across speakers. However, it is not clear in either plot that there are more outliers than would be expected if the target score distribution did not depend on the speaker, though there appear to be a handful of goat-ish speakers among the females with fewer than five target trials (indicated by the points showing the lowest average target scores).
To look for a population of lambs, namely those speakers who cause false alarm errors as target speakers, I plot the true speaker scores for a target model against the highest impostor score for that target model. This plot is shown in Figure 3.14 for males, and in Figure 3.15 for female speakers. In both male and female plots there is a large cluster of points indicating speakers without lamb-ish tendencies, i.e., those with maximum impostor scores less than, or on par with, true speaker scores. However, there are also many instances showing maximum impostor scores greater than the target scores for target models, and also greater than most other maximum impostor scores, suggesting lamb-like tendencies for some speakers.
Finally, in Figures 3.16 and 3.17, I plot the average maximum impostor score against the number of test conversation sides for each impostor speaker, for male and female speakers, respectively. As was the case with the earlier plots of average target scores, there is no clear evidence that the average maximum impostor score distributions are speaker-dependent.
Interestingly, there seem to be more outliers on the low end, i.e., with low average maximum impostor scores, than on the high end (which would indicate wolf-ish tendencies).
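The per-speaker quantities behind these plots can be computed directly from lists of trial scores. The sketch below is a minimal illustration using hypothetical (speaker, score) tuples for target trials and (target, impostor, score) tuples for impostor trials; the "average maximum impostor score" for wolves is implemented under one plausible reading (per-target maxima averaged over targets), which the text above does not pin down.

```python
from collections import defaultdict

def per_speaker_stats(target_trials, impostor_trials):
    """Per-speaker quantities behind the goat/lamb/wolf plots.

    target_trials:   iterable of (speaker_id, score) true speaker trials
    impostor_trials: iterable of (target_id, impostor_id, score)
    """
    # Goat evidence: average true speaker score and trial count per speaker.
    by_speaker = defaultdict(list)
    for spk, score in target_trials:
        by_speaker[spk].append(score)
    goat = {spk: (sum(v) / len(v), len(v)) for spk, v in by_speaker.items()}

    # Lamb evidence: highest impostor score against each target model.
    lamb = {}
    for tgt, _, score in impostor_trials:
        lamb[tgt] = max(lamb.get(tgt, score), score)

    # Wolf evidence: for each impostor, the maximum score achieved against
    # each target model, averaged over targets (one reading of "average
    # maximum impostor score").
    per_target_max = defaultdict(dict)
    for tgt, imp, score in impostor_trials:
        d = per_target_max[imp]
        d[tgt] = max(d.get(tgt, score), score)
    wolf = {imp: sum(d.values()) / len(d) for imp, d in per_target_max.items()}
    return goat, lamb, wolf
```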
3.1.4 Effects of Speaker Demographics on System Scores

Continuing with the UBM-GMM system with T-norm, using Switchboard-1 electret conversation sides, I now switch focus to consider whether speaker demographics are evident in system scores. For the Switchboard-1 corpus, the following information is available for each speaker: sex, birth year, education level, and dialect area. The possible education levels are less than high school, less than college, college, and more than college. The dialect area corresponds to the region where the speaker lived for his or her first 10 years; the possible areas include New England, North Midland, South Midland, Western, New York City, Northern, Southern, and Mixed. In order to assess which characteristics have an impact on the scores produced by the system, I performed an analysis of variance (ANOVA) test for a number of different score distributions, described below. In each case, the reported significance level p is the probability of incorrectly concluding that the distributions are not the same.
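As a concrete illustration of the ANOVA test used throughout this section, the sketch below applies scipy's one-way ANOVA to two groups of per-speaker average scores. The score values are invented for illustration; they are not taken from the experiments reported here.

```python
# Illustrative one-way ANOVA: do two groups of per-speaker average
# scores share a common mean? (Score values below are invented.)
from scipy.stats import f_oneway

male_avg_scores = [0.42, 0.38, 0.45, 0.40, 0.44, 0.39]
female_avg_scores = [0.31, 0.28, 0.35, 0.30, 0.33, 0.29]

stat, p = f_oneway(male_avg_scores, female_avg_scores)
# A small p rejects the null hypothesis of a common mean across groups.
print(f"F = {stat:.2f}, p = {p:.4f}")
```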
Since trial independence is an incorrect assumption, target scores were averaged for target
speakers over all target trials for each speaker before the ANOVA analysis was done. I first looked at the target scores for female speakers compared to the target scores for male speakers. In this case, I found that the distribution of male target scores differed significantly from the distribution of female target scores (p < 0.01), with male target trials having a higher average score. Next, for female and male speakers separately, I considered the effect of age, education level, and dialect. The target score distributions for different age groups (20-29, 30-39, 40-49, and 50-69) did show a significant difference (p = 0.054 for females and p = 0.013 for males), meaning that the score distribution for at least one age group differed from the rest. However, a pair-wise comparison test (designed to keep the total probability of error to less than 10%) showed that significant differences only occurred between two pairs of distributions: 20-29 versus 30-39 (for males only) and 20-29 versus 50-69 (for both males and females). Education level did not result in differing target score distributions for either sex. Finally, although there appeared to be some differences in distributions for different dialects, only males showed a significant difference (i.e., at least one dialect's distribution was different, with p = 0.013), and pair-wise comparisons found only two pairs of dialects to have significantly different score distributions (New England versus New York City and Northern versus New York City).
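The exact pair-wise comparison procedure is not specified above; one simple stand-in that keeps the family-wise error rate below 10% is a Bonferroni-corrected set of two-sample t-tests across group pairs, sketched here with invented per-age-group score samples.

```python
# Bonferroni-corrected pairwise t-tests across (invented) age-group
# score samples, keeping the family-wise error rate below 10%.
from itertools import combinations
from scipy.stats import ttest_ind

groups = {
    "20-29": [0.30, 0.28, 0.33, 0.29, 0.31],
    "30-39": [0.38, 0.41, 0.37, 0.40, 0.39],
    "50-69": [0.44, 0.46, 0.43, 0.47, 0.45],
}

pairs = list(combinations(groups, 2))
alpha = 0.10 / len(pairs)  # corrected per-comparison threshold
results = {}
for a, b in pairs:
    _, p = ttest_ind(groups[a], groups[b])
    results[(a, b)] = p < alpha  # True if significantly different
```

The actual test used in the analysis may have been a different multiple-comparison procedure (e.g., Tukey's); Bonferroni is shown only because it is the simplest correction with the stated error guarantee.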
For impostor scores, the assumption of trial independence is again incorrect. In this case, I considered three different approaches: an assumption of target-test speaker pair independence, wherein impostor scores are averaged for each target-test speaker pair; an assumption of target speaker independence, wherein impostor scores are averaged for each target speaker; and an assumption of impostor speaker independence, wherein impostor scores are averaged for each impostor speaker. As would be expected, there is a significant difference in score distributions for same-sex speaker trials and different-sex speaker trials (p < 0.001 for all averaging approaches). When comparing scores for which the target and impostor speaker have an age difference of 5 years or less to scores for which the age difference between speakers is greater than 5 years, there is also a significant difference for both females and males (p < 0.001 when averaging for each speaker pair or for impostor speakers, p ≤ 0.028 when averaging for target speakers). A comparison of the scores where the target and impostor speakers have the same education level to scores where the speakers have different education levels did not show any significant differences for either sex. Finally, looking at trials with speakers of the same dialect area versus trials with speakers of different dialect areas, there were significant differences when treating the speaker pairs independently (p = 0.079 for females and p = 0.013 for males), and for females when treating the impostor speakers independently (p = 0.063). Perhaps more significant differences are not found in this case because the dialect region information collected does not accurately reflect dialectal differences for all the speakers.
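The three averaging approaches described above amount to a single grouping step with different keys. A minimal sketch, assuming hypothetical (target_id, impostor_id, score) tuples:

```python
from collections import defaultdict

def average_impostor_scores(trials, assumption):
    """Average impostor scores under one of the three independence
    assumptions: 'pair', 'target', or 'impostor'.

    trials: iterable of (target_id, impostor_id, score)
    """
    keyfn = {
        "pair": lambda tgt, imp: (tgt, imp),
        "target": lambda tgt, imp: tgt,
        "impostor": lambda tgt, imp: imp,
    }[assumption]
    grouped = defaultdict(list)
    for tgt, imp, score in trials:
        grouped[keyfn(tgt, imp)].append(score)
    return {k: sum(v) / len(v) for k, v in grouped.items()}
```

The averaged values, rather than the raw trial scores, are then what enter the ANOVA, so that correlated trials from the same speaker (or pair) are not treated as independent samples.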
3.2 Analysis of Recent System and Data Set

I now move on from Switchboard-1 analysis to the more recent SRE08 corpus, which contains greater degrees of channel variability. The SRE08 short2-short3 condition uses roughly 2.5-3 minutes of speech for both training and testing. This speech may be taken from one side of a conversation between two people, or from part of an interview.
Furthermore, the data includes both telephone and microphone channels (there are 14 types of microphones).
Using the short2 and short3 conversation sides, I generate a set of trials diﬀerent from those used in the NIST evaluation. For my purposes, I use conversation sides from all speakers with at least 5 available speech utterances. In some cases, the same conversation side was recorded on multiple channels (telephone and microphones, or just microphones).
In these cases, I selected only one instance of that conversation side, in order to prevent the introduction of confounding factors due to having the same lexical content across diﬀerent speech samples. There are 416 speakers (256 female, 160 male), with 3049 conversation sides, and a total of 22,210 target trials. For each impostor speaker pair, ﬁve impostor trials are chosen (along with the corresponding trials that have the train and test data switched), for a total of 453,600 impostor trials.
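The impostor-trial sampling described above can be sketched as follows, assuming a hypothetical mapping from speaker ids to their conversation sides. In practice the pairing would be restricted to same-sex speakers, which is consistent with the stated total: C(256,2) + C(160,2) = 45,360 same-sex pairs, times 5 sampled pairs, times 2 train/test orderings, gives 453,600 trials.

```python
import random
from itertools import combinations

def build_impostor_trials(sides_by_speaker, per_pair=5, seed=0):
    """Sample impostor trials: up to `per_pair` (train, test) side pairs
    per speaker pair, plus the corresponding switched trials.

    sides_by_speaker: dict mapping speaker id -> list of conversation sides
    """
    rng = random.Random(seed)
    trials = []
    for s1, s2 in combinations(sorted(sides_by_speaker), 2):
        candidates = [(a, b) for a in sides_by_speaker[s1]
                      for b in sides_by_speaker[s2]]
        chosen = rng.sample(candidates, min(per_pair, len(candidates)))
        for train, test in chosen:
            trials.append((train, test))
            trials.append((test, train))  # train/test switched counterpart
    return trials
```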
In order to better address the effects of channel variability, I use a UBM-GMM system with simplified factor analysis applied, implemented with the ALIZE toolkit. The UBM is trained using 1553 conversation sides from Fisher and Switchboard-2. The rank-70 eigenchannel U matrix for simplified factor analysis is trained using 1900 conversation sides from SRE04 telephone data (99 speakers with 10 conversation sides each) and SRE05 microphone data (91 speakers with 10 conversation sides each). For the given set of trials, the system has a minimum DCF of 0.382 and an EER of 8.93%.
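The minimum DCF and EER can be obtained by sweeping a decision threshold over the observed scores. The sketch below assumes the SRE08 primary cost parameters (C_miss = 10, C_FA = 1, P_target = 0.01); note that NIST-style DCF figures are often additionally normalized by the cost of the best no-decision system, min(C_miss·P_target, C_FA·(1−P_target)) = 0.1, which this sketch does not do.

```python
def min_dcf_and_eer(target_scores, impostor_scores,
                    c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Sweep a decision threshold over the observed scores and return
    (minimum DCF, EER in percent). Cost parameters follow the SRE08
    primary cost function; no normalization is applied.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    n_tgt, n_imp = len(target_scores), len(impostor_scores)
    best_dcf = float("inf")
    eer, eer_gap = None, float("inf")
    for thr in thresholds:
        p_miss = sum(s < thr for s in target_scores) / n_tgt
        p_fa = sum(s >= thr for s in impostor_scores) / n_imp
        dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
        best_dcf = min(best_dcf, dcf)
        if abs(p_miss - p_fa) < eer_gap:   # closest point to P_miss = P_fa
            eer_gap, eer = abs(p_miss - p_fa), (p_miss + p_fa) / 2
    return best_dcf, 100.0 * eer
```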
3.2.1 Target Trials and Goat-ish Behavior

I begin by performing an analysis of variance (ANOVA) test using all target trial scores for each speaker in order to determine whether there is a speaker effect on the means. With a resulting p < 0.001, the null hypothesis that the target scores come from the same (speaker-independent) distribution can be rejected. Figure 3.18 shows a box plot of the male target scores, by speaker. It is clear that the distributions vary across speakers in this case.
Similarly, application of the Bartlett multiple-sample test for equal variances to the target scores also rejects the hypothesis that the scores come from normal distributions with the same variance.
Next, I perform a Kruskal-Wallis test, a non-parametric analysis of variance test that uses ranks and avoids the need for an assumption that the scores are normally distributed.
Once again, the results of such a test on the target scores are conclusive in rejecting the null hypothesis that the score distributions do not depend on the speaker, with p < 0.001.
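Both the Bartlett and Kruskal-Wallis tests are available in scipy. The sketch below runs them on three invented per-speaker score samples, one of which differs markedly in both location and spread, so that both null hypotheses are rejected.

```python
# Bartlett (equal variances) and Kruskal-Wallis (rank-based ANOVA) on
# three invented per-speaker target score samples.
from scipy.stats import bartlett, kruskal

speaker_scores = [
    [0.10, 0.15, 0.12, 0.14, 0.11],   # speaker 1: low scores, low spread
    [0.45, 0.50, 0.48, 0.52, 0.47],   # speaker 2: mid scores, low spread
    [0.80, 0.95, 0.70, 0.99, 0.85],   # speaker 3: high scores, high spread
]

_, p_bartlett = bartlett(*speaker_scores)  # tests equal variances
_, p_kruskal = kruskal(*speaker_scores)    # rank-based test of location
print(f"Bartlett p = {p_bartlett:.3g}, Kruskal-Wallis p = {p_kruskal:.3g}")
```

The Kruskal-Wallis test only assumes that scores within each group are exchangeable, so it remains valid when, as here, the normality assumption behind the standard ANOVA is questionable.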
3.2.2 Impostor Trials and Lamb-ish or Wolf-ish Behavior

After averaging impostor scores for each impostor speaker pair, I considered both the set of these average impostor scores for each target speaker (looking for lambs) and the set of average impostor scores for each test speaker (looking for wolves). In both cases, application of ANOVA did not reject the null hypothesis (p ≥ 0.44 for female, male, and all speakers).
Similarly, the Kruskal-Wallis test did not reject the null hypothesis that these scores do not depend on the speaker, though the female speakers came closest to a significant difference, with p = 0.11.
3.2.3 Distribution of Errors Across Speakers

Using the threshold corresponding to the minimum DCF, errors for each speaker are counted. In particular, I count the number of false rejections (to find goats), the number of false acceptance errors as the target speaker (to find lambs), and the number of false acceptance errors as the test speaker (to find wolves). Cumulative distributions of these errors are plotted for female and male speakers in Figures 3.19 and 3.20, respectively.
The distribution of errors is highly speaker-dependent for female speakers, for all three types of errors. In the case of false rejections, 50% of the errors are due to 38 speakers, or roughly 15% of the total. This is even more drastic for false acceptances as the target speaker, for which 18 speakers (roughly 7%) cause 50% of the errors. For false acceptances as the test speaker, 61 speakers (about 24%) account for 50% of the errors.
The story is similar for male speakers. Once again, a speaker-dependent distribution of missed detection errors is observed, with 23 speakers (about 14%) producing 50% of the errors. Only 25 speakers (16%) account for 50% of the false alarms as targets, while 33 speakers (21%) produce 50% of the false alarms as impostor speakers.
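The "N speakers account for 50% of the errors" figures above reduce to sorting per-speaker error counts and accumulating from the worst offenders down. A minimal sketch with invented counts:

```python
def speakers_covering_share(errors_per_speaker, share=0.5):
    """Smallest number of speakers whose summed error counts reach
    `share` of the total, starting from the worst offenders."""
    counts = sorted(errors_per_speaker.values(), reverse=True)
    needed = share * sum(counts)
    covered = 0
    for n, count in enumerate(counts, start=1):
        covered += count
        if covered >= needed:
            return n
    return len(counts)
```

Running this separately on the false rejection, false-acceptance-as-target, and false-acceptance-as-test counts yields the goat, lamb, and wolf concentration figures, respectively.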
The uneven distribution of errors across speakers suggests goat-like, lamb-like, and wolf-like tendencies for both male and female speakers.
3.3 Discussion

The examination and analysis of system scores presented here has demonstrated that automatic speaker recognition system performance is dependent on the speakers. Speakers may be difficult to correctly verify as the true speaker, and speakers may generate high impostor scores, as either the target speaker, the test speaker, or both.
However, I have also observed a dependence on which segments are selected for training and testing; certain conversation side train-test pairings may produce errors, while others corresponding to the same speaker or speaker pair may not, and scores are not symmetric for a given pair of conversation sides (i.e., switching which utterance is used to train the target model will change the score). Such results suggest that any attempts to predict or use information about how a system will respond to speakers may need to take an approach involving
Figure 3.20: Cumulative distribution of errors across male speakers, for false rejections, false acceptances as the target, and false acceptances as the impostor.
conversation pairs. At the same time, averaging scores over sets of trials corresponding to a speaker can give a better sense of overall tendencies.
Furthermore, I have observed that there can often be a large degree of variation across speaker pairs; for the same target speaker, impostor scores may change significantly from impostor speaker to impostor speaker. As such, I move away from the separate concepts of lamb and wolf, toward a discussion of difficult-to-distinguish impostor speaker pairs, i.e., those pairs for which the system is likely to produce false alarm errors. At the same time, it is useful to keep in mind that within a given speaker population, there may well be an overall tendency for a particular speaker to cause false alarms across a number of speaker pairings.