A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering ...
Table 5.5 shows the tradeoff in recall, precision, specificity, and F-measure values observed when varying the threshold for making a difficult speaker decision, again considering threshold values of -0.5, 0 (corresponding to the results of Table 5.4), and 0.5. These results are given for the SVM using a third order polynomial kernel, with rank normalized feature statistics.
Threshold   Recall   Precision   Specificity   F-measure
-0.5        0.895    0.653       0.415         0.755
 0          0.736    0.737       0.677         0.737
 0.5        0.561    0.848       0.877         0.675
Table 5.5: Recall, precision, speciﬁcity, and F-measure values for detecting diﬃcult target speakers using a third order polynomial kernel SVM trained with rank normalized feature statistics, comparing three diﬀerent decision thresholds for diﬃcult target speaker detection.
In the difficult target speaker case, the drop in specificity (for a high recall threshold) and the drop in recall (for a high precision threshold) are larger than in the difficult impostor speaker case. Obtaining close to 90% recall requires accepting a false alarm rate of almost 60%. Again considering the operating point for low false alarms, with 5% of the difficult speaker labels being incorrect (a threshold around 0.95), the average recall is 0.374, and the average precision is 0.922. Thus, in order to avoid incorrectly labeling difficult target speakers, almost two-thirds of the difficult target speakers will not be found. Such a low recall rate may not be sufficient for many applications. Given the difficult nature of the task, however, this approach provides an initial starting point that may be improved upon in the future.
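The threshold sweep behind Table 5.5 can be sketched as follows. The scores and labels here are toy values, not the dissertation's data, and the decision rule (score above the threshold means "difficult") is an assumption for illustration; `detection_metrics` is a hypothetical helper name.

```python
import numpy as np

def detection_metrics(scores, labels, threshold):
    """Recall, precision, specificity, and F-measure for a
    difficult-speaker decision rule: predict 'difficult' when
    score > threshold (labels: 1 = difficult, 0 = easy)."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    pred = scores > threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    tn = np.sum(~pred & (labels == 0))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return recall, precision, specificity, f_measure

# Toy scores: raising the threshold trades recall for precision/specificity.
scores = [-0.8, -0.3, 0.2, 0.4, 0.9, -0.6, 0.1, 0.7]
labels = [0, 0, 1, 1, 1, 0, 0, 1]
for t in (-0.5, 0.0, 0.5):
    print(t, detection_metrics(scores, labels, t))
```

On this toy data the same pattern as Table 5.5 appears: the low threshold catches every difficult speaker but at reduced specificity, while the high threshold reverses the tradeoff.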
Next, Table 5.6 shows results for an SVM using a third order polynomial kernel and rank normalized input features, for the three sets of speech feature statistics.
Feature set   Recall   Precision   Specificity   F-measure
speech1       0.736    0.737       0.677         0.737
speech2       0.747    0.749       0.694         0.748
speech3       0.753    0.746       0.686         0.749

Table 5.6: Recall, precision, specificity, and F-measure values for detecting difficult target speakers using a third order polynomial kernel SVM trained with rank normalized feature statistics, comparing three sets of speech feature statistics, [speech1], [speech2], and [speech3].
As with the diﬃcult impostor speaker detection task, adding feature statistics (mean and variance normalized MFCCs or formant frequencies g1-g4) does not change results by much, though there are some small improvements.
CHAPTER 5. DETECTING DIFFICULT SPEAKERS

To this point, my approach has treated male and female speakers together. However, male and female speakers may behave differently. In order to see whether difficult target speaker detection improves when females and males are treated as two separate cases, female- and male-specific SVMs are trained. One disadvantage of this approach is that there are fewer easy and difficult speakers to use for training. I consider two sets of SVMs, one trained using the 20% most and least difficult speakers (female or male), and one trained using the 25% most and least difficult speakers. Table 5.7 shows recall, precision, specificity, and F-measure values for male and female difficult target speaker detection. In each case, the number of easy and difficult speaker examples is given (note that there are 256 female speakers and 160 male speakers in total). These results are all for SVMs using third order polynomial kernels and rank normalized input feature statistics.
Table 5.7: Recall, precision, speciﬁcity, and F-measure values for detecting diﬃcult target speakers using a third order polynomial kernel SVM trained with rank normalized feature statistics, using SVMs trained separately for female and male speakers, with either 20% or around 25% of speakers taken as diﬃcult or easy examples.
In both female and male cases, the results do not improve over treating both sexes together. Recall increases slightly, at the cost of lower precision and speciﬁcity. Furthermore, note that increasing the number of speakers used as diﬃcult and easy examples does not improve results. Including more speakers also means that the speakers used for training are not necessarily the best examples of diﬃcult (or easy) ones, which potentially counteracts any gain from having more training examples. Training separate female and male SVMs for ﬁnding diﬃcult impostor speakers gave results similar to those observed here: there were no gains over using a sex-independent SVM, and increasing the number of training examples to 25% also failed to improve results compared to using the top and bottom 20%. Given more female and male speakers for training, an approach using separate female and male SVMs may yield improvements. However, for the data available here, it is better to maximize the training examples and use the same SVM to detect diﬃcult female and male speakers.
This chapter presented an approach for detecting easy and difficult speakers, for both target (true) speakers and impostor speakers. As input, I used a set of feature statistics calculated over speech regions, where the features include fundamental frequency, formant frequencies, energy, spectral slope, and MFCCs (both with and without mean and variance normalization).
Based on the results for the data set used here, this approach is more successful at ﬁnding diﬃcult impostor speakers than diﬃcult target speakers. One reason why ﬁnding diﬃcult target speakers is more challenging than ﬁnding diﬃcult impostors is that while there may be similar characteristics across diﬃcult impostor speakers (which make them confusable with other speakers), the characteristics that make target speakers diﬃcult may vary more from speaker to speaker. In both cases, however, recall and precision rates over 0.7 (or 0.8 in the case of diﬃcult impostors) can be obtained. Furthermore, the threshold for picking a diﬃcult speaker can be varied according to what errors are most important to minimize.
For a false alarm rate of 5%, over 60% of diﬃcult impostor speakers will still be found, and 37% of diﬃcult target speakers. Given the challenging nature of the task, these recall rates are not particularly high (especially in the case of target speakers). However, for certain applications, the loss in recall may still be worth the gain in precision and speciﬁcity. Given enough training examples of diﬃcult and easy speakers, there may be gains from treating female and male speakers separately. With limited data, though, better results are obtained by using the combined set of training examples in one sex-independent SVM.
One advantage of using feature statistics as the input to the SVM is that the statistics can be calculated over an individual conversation side or over a set of conversation sides for the given speaker. This allows difficult speaker detection to work with varying amounts of available data. In my approach here, each conversation side of the easy and difficult speakers is used separately, without exploiting the availability of multiple conversation sides per speaker.
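The property described above, that pooling any number of frames (or sides) yields a fixed-length input vector, can be sketched as follows. The feature count, the use of mean and standard deviation as the statistics, and the helper name `conversation_statistics` are illustrative assumptions, not the dissertation's exact recipe.

```python
import numpy as np

def conversation_statistics(frames):
    """Collapse frame-level features (n_frames x n_features), e.g. F0,
    energy, formants, into a fixed-length statistics vector (mean and
    standard deviation per feature), independent of how many frames or
    conversation sides were pooled."""
    frames = np.asarray(frames, dtype=float)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# One conversation side or several pooled sides produce input vectors
# of the same dimensionality for the SVM.
one_side = np.random.randn(500, 6)      # 500 frames, 6 features
three_sides = np.random.randn(1500, 6)  # more data, same-sized output
assert (conversation_statistics(one_side).shape
        == conversation_statistics(three_sides).shape == (12,))
```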
One avenue for future exploration is to see how results change depending on the number of utterances used for each speaker. It may also be possible to ﬁnd better feature statistics for detecting diﬃcult speakers; the optimal feature statistics may be diﬀerent for diﬃcult target and impostor speakers, as well as for female and male speakers.
Another possible direction for future investigation is to see how well diﬃcult conversation sides can be detected. The results of my error analysis, as well as the related work of Kahn et al. [34, 33], have shown that there can be particular conversation sides of a speaker that cause more errors than others. Being able to detect these “bad” utterances may provide very useful information for improving system performance.
Chapter 6

Conclusions and Future Work

The focus of this dissertation was on the intrinsic, speaker-based factors that contribute to errors in automatic speaker recognition systems. Inspired by the well-known work of Doddington et al., which both categorized speakers according to their tendencies to cause errors and demonstrated the existence of such speaker types, I aimed to further explore the phenomenon of speaker-dependent system performance. There are two main components of this exploration, which are reviewed in the following sections. Section 6.1 describes the analysis of speaker behavior for two data sets and two types of automatic speaker recognition systems, with which I both confirm and build upon previous results demonstrating that system performance depends on speaker characteristics. Having established that certain speakers are more likely to cause errors than others, I then discuss a simple approach for finding these difficult speakers in Section 6.2. Section 6.3 concludes with a discussion of contributions and possible future work.
6.1 Analysis of Speaker Behavior

The aforementioned work of Doddington et al. analyzed errors only for female speakers, using data from the NIST 1998 Speaker Recognition Evaluation. In order to expand such analysis, I examined two data sets and two types of automatic speaker recognition systems, looking for speaker-dependent behaviors for both male and female speakers. The first data set was Switchboard-1, a corpus of conversational speech collected from the telephone. I further restricted this data to one type of telephone handset in order to limit the effects of extrinsic channel variability. Using scores from a GMM-UBM system, I began by considering a score confusion matrix for a set of 34 speakers with 10 conversation sides each. It was observed that the speakers varied both in how high their average true speaker scores were and in how consistent the true speaker scores were across target-test pairs. There was also variability in how different target models of the same speaker behaved; for some speakers, scores were consistent across all models, while for others there was greater score variation.
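The per-speaker quantities read off the score confusion matrix can be sketched as follows. The matrix here is filled with random values purely for illustration (the real one holds GMM-UBM scores), and the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n_speakers, n_sides = 34, 10

# Hypothetical score confusion matrix: rows are target models (one per
# speaker), columns are test conversation sides (10 per speaker,
# concatenated speaker by speaker).
scores = rng.normal(size=(n_speakers, n_speakers * n_sides))

# True-speaker scores lie on the block diagonal: each speaker's model
# scored against that same speaker's own conversation sides.
blocks = scores.reshape(n_speakers, n_speakers, n_sides)
true_scores = blocks[np.arange(n_speakers), np.arange(n_speakers)]

level = true_scores.mean(axis=1)   # how high each speaker's true scores are
spread = true_scores.std(axis=1)   # how consistent they are across pairs
```

Speakers with a low `level` or a high `spread` are the ones whose true-speaker trials behave inconsistently, the pattern described in the text.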
Some impostor speaker pairs were more confusable than others, and some speakers had overall tendencies to have higher impostor scores.
Extending this analysis to include a large number of trials and speakers in Switchboard-1, I continued to show examples of varying speaker behavior, in terms of tendencies to have high or low target or impostor scores. For both female and male speakers, there was a correlation (around 0.6) between a tendency to cause high impostor scores as the target speaker and a tendency to cause high impostor scores as the test speaker.
For the Switchboard-1 data, I also investigated the possible effects of speaker sex, age, education level, and dialect area on system scores. Using analysis of variance (ANOVA) tests, I found significant differences between male and female score distributions. Significant differences were also found between the score distributions for impostor speakers with less than a five-year age difference from the target and those with more than a five-year age difference. The results for education level and dialect area were inconclusive. Based on such findings, I concluded that the most salient of these speaker demographics was sex, a result in line with other observations regarding differences in speaker recognition behavior between males and females.
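The one-way ANOVA underlying these comparisons can be sketched as follows. The group data is synthetic (two score distributions with shifted means standing in for female and male scores), and `one_way_anova_f` is my own minimal implementation of the standard F statistic, not the dissertation's code.

```python
import numpy as np

def one_way_anova_f(*groups):
    """One-way ANOVA F statistic: between-group mean square divided by
    within-group mean square. Large F suggests the group means differ."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    n_total = sum(g.size for g in groups)
    k = len(groups)
    grand = np.concatenate(groups).mean()
    ss_between = sum(g.size * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical female vs. male score distributions with shifted means:
rng = np.random.default_rng(2)
female = rng.normal(0.0, 1.0, 300)
male = rng.normal(0.4, 1.0, 300)
f_stat = one_way_anova_f(female, male)
```

In practice the F statistic would be compared against the F distribution with (k - 1, n - k) degrees of freedom to obtain a p-value.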
For the second data set, I used a more recent collection of conversational and interview speech used in the 2008 NIST Speaker Recognition Evaluation (SRE08); this data contains much more channel variability, including not only landline and cellular telephone data, but also data from a variety of microphones. For this corpus, I used a GMM-UBM system with simpliﬁed factor analysis, in order to better handle the diﬀerences in channel. Once again, a variety of speaker-dependent system performance was observed, including tendencies to cause false alarm or false rejection errors. For both female and male speakers, 50% of the false rejection and false alarm errors were caused by only 15-25% of the speakers.
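The error-concentration statistic quoted above (half the errors from 15-25% of speakers) can be computed as sketched here. The error counts are invented to be skewed; `fraction_causing_half` is a hypothetical helper name.

```python
import numpy as np

def fraction_causing_half(errors_per_speaker):
    """Smallest fraction of speakers accounting for 50% of all errors."""
    counts = np.sort(np.asarray(errors_per_speaker))[::-1]  # worst first
    cumulative = np.cumsum(counts)
    n_needed = np.searchsorted(cumulative, 0.5 * counts.sum()) + 1
    return n_needed / counts.size

# Hypothetical skewed error counts: a few speakers dominate the errors.
errors = [40, 25, 15, 8, 5, 3, 2, 1, 1, 0]
print(fraction_causing_half(errors))  # → 0.2
```

Here two of the ten speakers (20%) already account for 65 of the 100 errors, so half the errors are reached at a 0.2 speaker fraction, mirroring the 15-25% figure in the text.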
6.2 Difficult Speaker Detection

My approach for finding difficult speakers began with a method for calculating measures of similarity between impostor speaker pairs. Using statistics of features such as energy, formant frequencies, fundamental frequency, and spectral slope, calculated over all speech, I obtained a variety of simple distance measures that could successfully select both easy- and difficult-to-distinguish speaker pairs, as evaluated by differences in detection cost and false alarm probability across a large number of systems. Of the performance measures tested, the best feature-based measure for finding the most and least difficult-to-distinguish speaker pairs was the Euclidean distance between vectors of the mean first, second, and third formant frequencies. Even greater success was attained by the Kullback-Leibler (KL) divergence between pairs of speaker-specific GMMs. Furthermore, an examination of the smallest and largest distances (as computed by the KL divergence) revealed individual speaker tendencies to consistently fall among the most (or least) difficult-to-distinguish speaker pairs.
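The two distance measures named above can be sketched as follows. The formant values are invented, and the KL computation is given for single diagonal Gaussians as a closed-form stand-in: the GMM-to-GMM KL divergence used in the dissertation has no closed form and is typically approximated (e.g. by Monte Carlo sampling or matched-pair bounds).

```python
import numpy as np

def formant_distance(means_a, means_b):
    """Euclidean distance between mean F1-F3 vectors of two speakers."""
    return np.linalg.norm(np.asarray(means_a, float) - np.asarray(means_b, float))

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL divergence KL(p || q) between two diagonal Gaussians, a
    simplified stand-in for the speaker-specific GMM case."""
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

# Hypothetical mean formant frequencies (Hz) for two speakers:
d = formant_distance([512, 1490, 2510], [498, 1530, 2470])
kl = gaussian_kl([512, 1490], [900, 2500], [498, 1530], [1100, 2300])
```

A small `d` or `kl` flags a confusable (difficult-to-distinguish) pair; identical distributions give a KL of exactly zero.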
I then used a set of feature statistics calculated over speech regions to train a support vector machine (SVM) for classifying speakers as easy or difficult.