«A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering ...»
4.2 Results
System performance for the selected speaker pairs is reported using the minimum detection cost function (DCF) and false alarm (FA) rate, since I am concerned with finding difficult-to-distinguish impostor pairs. The DCF is defined as a weighted sum of the miss (i.e., not identifying a target speaker match) and false alarm (i.e., identifying an impostor speaker as the target speaker) error probabilities:
\[ \mathrm{DCF} = C_{\mathrm{Miss}} \times P_{\mathrm{Miss}|\mathrm{Target}} \times P_{\mathrm{Target}} + C_{\mathrm{FalseAlarm}} \times P_{\mathrm{FalseAlarm}|\mathrm{NonTarget}} \times (1 - P_{\mathrm{Target}}) \tag{4.2} \]
In Equation (4.2), $C_{\mathrm{Miss}}$ and $C_{\mathrm{FalseAlarm}}$ are the relative costs of detection errors, and $P_{\mathrm{Target}}$ is the a priori probability of the specified target speaker. SRE08 used $C_{\mathrm{Miss}} = 10$, $C_{\mathrm{FalseAlarm}} = 1$, and $P_{\mathrm{Target}} = 0.01$.
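As a worked instance of Equation (4.2), substituting the SRE08 parameters quoted above gives the expression below. The coefficients are plain arithmetic on those three values, not additional results from the evaluation.

```latex
\begin{align*}
\mathrm{DCF} &= 10 \times P_{\mathrm{Miss}|\mathrm{Target}} \times 0.01
              + 1 \times P_{\mathrm{FalseAlarm}|\mathrm{NonTarget}} \times (1 - 0.01) \\
             &= 0.1\, P_{\mathrm{Miss}|\mathrm{Target}}
              + 0.99\, P_{\mathrm{FalseAlarm}|\mathrm{NonTarget}}
\end{align*}
```

At equal error probabilities, a false alarm therefore contributes roughly ten times as much to the cost as a miss.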
For a given decision threshold, the FA rate is defined as:
\[ P_{\mathrm{FalseAlarm}} = \frac{\text{number of false alarm errors}}{\text{total number of nontarget trials}} \tag{4.3} \]
For each speaker recognition system, I compute the percent difference in minimum DCF for the most (and least) similar speaker pairs relative to the minimum DCF over all speaker pairs. Relative to a FA rate of 1% on all speaker pairs, I calculate the percent difference in FA rate (at the decision threshold yielding 1% FA on all trials) for the most (and least) similar pairs. These relative differences are then averaged over all systems. With each feature-measure, if more similar (i.e., closer) speaker pairs correspond to difficult-to-distinguish speaker pairs, then differences in the DCF and FA rate should be positive and significant. The converse holds for less similar speaker pairs, which will have significant negative differences if they are easier for systems to distinguish.
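The quantities described above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring tool used for the experiments: the function names are hypothetical, scores are assumed to be plain lists of floats, and a trial is assumed to be accepted when its score is greater than or equal to the threshold.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Return (P_miss, P_fa), accepting trials whose score >= threshold (assumed convention)."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa


def min_dcf(target_scores, nontarget_scores, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum of the DCF in Equation (4.2) over all candidate thresholds."""
    candidates = sorted(set(target_scores) | set(nontarget_scores))
    candidates.append(candidates[-1] + 1.0)  # a threshold that rejects every trial
    best = float("inf")
    for t in candidates:
        p_miss, p_fa = error_rates(target_scores, nontarget_scores, t)
        dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
        best = min(best, dcf)
    return best


def threshold_at_fa(nontarget_scores, fa_rate=0.01):
    """Threshold allowing at most fa_rate false alarms on these trials (ties may exceed it)."""
    ranked = sorted(nontarget_scores, reverse=True)
    allowed = int(fa_rate * len(ranked))  # number of false alarms permitted
    return ranked[allowed - 1] if allowed > 0 else float("inf")


def percent_difference(subset_value, overall_value):
    """Percent difference of a subset statistic relative to the all-pairs statistic."""
    return 100.0 * (subset_value - overall_value) / overall_value
```

Under these assumptions, the FA comparison would fix the threshold with `threshold_at_fa` on all nontarget trials, evaluate `error_rates` on the trials of the selected pairs at that same threshold, and pass the two FA rates to `percent_difference`.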
Figures 4.1 and 4.2 show performance differences for the top 1% most and least similar speaker pairs, respectively. For each feature group, the feature-measure pair yielding the largest DCF and FA changes is presented. Similarly, Figures 4.3 and 4.4 show results when considering the top 5% most and least similar speaker pairs, respectively.
Features of each type can select speaker pairs for which the most (or least) similar pairs have worse (or better) performance than all speaker pairs. Furthermore, this difference in performance typically increases when a smaller fraction of speaker pairs is used, i.e., there is a bigger difference for the most similar 1% of speaker pairs than for the most similar 5%.
Note that these differences in performance are not uniform across speaker verification systems.
The feature-measure that yields the largest average difference in performance for the 1% most similar speaker pairs is the Euclidean distance between vectors of the mean first, second, and third formant frequencies. The next best feature-measures include other formant-based measures, the percent difference of median energy, and the correlation of histograms of LPC frequencies with minimum magnitude requirement. For the 1% least similar speaker pairs, results are fairly similar across feature-measures, with the correlation of LPC frequency histograms and spectral slope yielding the smallest differences. The Euclidean distance between vectors of the mean first, second, and third formant frequencies also appears to be the best feature-measure for finding the 5% most difficult-to-distinguish speaker pairs, with the percent difference of the sum of formants and the absolute difference in LTAS local peak height being the next best. As with the 1% least similar speaker pairs, the 5% least similar show very consistent results across feature-measures, with reduced effectiveness for the correlation of LPC frequency histograms and spectral slope.
CHAPTER 4. PREDICTING DIFFICULT-TO-DISTINGUISH SPEAKER PAIRS 61
Figure 4.4: Relative differences in DCF and FA rate for the least similar 5% of speaker pairs, compared to all speaker pairs.
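To make the best-performing feature-measure concrete, the sketch below ranks speaker pairs by the Euclidean distance between their mean formant vectors and takes the most or least similar fraction. It is only an illustration: the function names, the dictionary layout, and the speaker labels and formant values in the example are hypothetical, and the restriction to same-sex pairs with impostor trials is omitted.

```python
import math
from itertools import combinations


def rank_pairs_by_formant_distance(mean_formants):
    """Rank all speaker pairs, most similar first.

    mean_formants maps a speaker label to its (F1, F2, F3) mean frequencies in Hz.
    Returns a list of (distance, speaker_a, speaker_b) tuples sorted by distance.
    """
    ranked = []
    for a, b in combinations(sorted(mean_formants), 2):
        ranked.append((math.dist(mean_formants[a], mean_formants[b]), a, b))
    ranked.sort()
    return ranked


def select_fraction(ranked_pairs, fraction, most_similar=True):
    """Return the top `fraction` of pairs from either end of the ranking (at least one pair)."""
    n = max(1, math.ceil(fraction * len(ranked_pairs)))
    return ranked_pairs[:n] if most_similar else ranked_pairs[-n:]


# Hypothetical values, for illustration only.
example = {
    "spk_a": (500.0, 1500.0, 2500.0),
    "spk_b": (503.0, 1504.0, 2500.0),
    "spk_c": (600.0, 1700.0, 2700.0),
}
ranked = rank_pairs_by_formant_distance(example)
closest = select_fraction(ranked, 0.01, most_similar=True)
farthest = select_fraction(ranked, 0.01, most_similar=False)
```

Only the ordering of the distances is used for selection, so the scale of the formant values matters only insofar as it changes which pairs fall at each end of the ranking.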
Detection error tradeoff (DET) curves are shown for example systems in Figures 4.5 and 4.6, using the Euclidean distance between vectors of the means of the first, second, and third formants, and the percent difference of the median energy, respectively. Although the system in Figure 4.6 has good separation among the different DET curves, there is more overlap in the DET curves of Figure 4.5. Furthermore, Figure 4.5 reveals an asymmetry in behavior for dissimilar and similar speaker pairs, showing that the performance on difficult-to-distinguish speaker pairs is closer to performance on all speaker pairs. While this asymmetry does not exist for all systems and all sets of selected speaker pairs (as evidenced by Figure 4.6), the trend does hold in most cases.
Given that I am using at most a few coarsely calculated features, it is impressive to see the differences in performance that can be obtained using these measures to select easy- or difficult-to-distinguish speaker pairs. Much of this success comes from the information gained by the relative ranking of speaker pairs. As a single, standalone number, a feature-measure may not have much use. However, when taken in the context of a group of feature-measures corresponding to a set of speaker pairs, the absolute values of the feature-measures no longer matter; instead, the gain lies in being able to order a set of speaker pairs from least to most similar.
While the results presented thus far are indeed promising, the differences in performance for similar speaker pairs (relative to all speaker pairs) still have potential to increase further. Accordingly, I test a measure that utilizes Gaussian mixture models, with the motivation that GMMs may better predict speaker recognition system performance, given that many systems utilize cepstral feature-trained GMMs. Using SRI’s tools for training GMMs for speaker recognition, I trained speaker-specific GMMs via maximum a posteriori (MAP) adaptation from a universal background model trained on Fisher data. The input features were 12th-order MFCCs plus energy, with deltas and double-deltas, and the models used 1024 Gaussians. For each unique pair of speaker-specific GMMs, an approximation to the Kullback-Leibler (KL) divergence (based on the unscented transform) was used to measure similarity. Results are shown in Figure 4.7.
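The sketch below shows one common form of an unscented-transform approximation to the KL divergence between two GMMs. It is an assumption-laden illustration rather than the SRI implementation: it assumes diagonal covariances, places 2d sigma points per Gaussian at the mean plus or minus sqrt(d) standard deviations along each axis, and uses hypothetical function names and a hypothetical list-of-tuples model layout. The source does not state whether or how the divergence was symmetrized, so only the one-directional quantity is computed.

```python
import math


def gmm_logpdf(x, gmm):
    """Log-density at point x of a diagonal-covariance GMM.

    gmm is a list of (weight, mean, variance) tuples, where mean and variance
    are equal-length sequences of floats (assumed layout).
    """
    logs = []
    for weight, mean, var in gmm:
        ll = math.log(weight)
        for xi, m, v in zip(x, mean, var):
            ll -= 0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        logs.append(ll)
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))


def kl_unscented(f, g):
    """Approximate KL(f || g) by averaging log f - log g over sigma points of f."""
    total = 0.0
    for weight, mean, var in f:
        d = len(mean)
        acc = 0.0
        for k in range(d):
            step = math.sqrt(d * var[k])  # sigma-point offset along axis k
            for sign in (1.0, -1.0):
                point = list(mean)
                point[k] += sign * step
                acc += gmm_logpdf(point, f) - gmm_logpdf(point, g)
        total += weight * acc / (2 * d)
    return total


# Sanity check on single Gaussians, where this approximation is exact:
# unit variances and means one unit apart give KL = 0.5.
f = [(1.0, (0.0, 0.0), (1.0, 1.0))]
g = [(1.0, (1.0, 0.0), (1.0, 1.0))]
kl_fg = kl_unscented(f, g)
```

Because the divergence is not symmetric, a pairwise similarity built on it would need some convention for combining the two directions; summing `kl_unscented(f, g)` and `kl_unscented(g, f)` is one possibility, offered here only as an assumption.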
Compared to previous feature-measures, the KL divergence is indeed more effective at finding difficult- and easy-to-distinguish speaker pairs. DET curves for an example system are shown in Figure 4.8. Again, relative to performance on all speaker pairs, there is a larger performance gap for dissimilar speaker pairs than for similar speaker pairs.
Returning to the groups of speaker pairs selected by the KL divergence approximation for GMMs, I more closely examine the 1%, 3%, 5%, 10%, and 20% most and least similar speaker pairs. Overall, there are 150 speakers, with 87 female and 63 male, for which there are 1815 same-sex impostor speaker pairs with impostor trials in the SRE08 short2-short3 task. For the groups of speaker pairs with larger values for KL divergence, that is, those speaker pairs that are expected to be easier for systems to distinguish, the majority are male (close to 75% on average). The opposite tendency, with more similar pairs tending to be female, holds to a lesser extent, although the groups with the lowest 1% and 3% of KL divergence values still have more male speaker pairs. These results suggest that there is a greater range of differences among male speakers, so that there are likely to be more dissimilar male speaker pairs.
Figure 4.5: DET curves for an illustrative speaker recognition system, using the Euclidean distance between vectors of the mean first, second, and third formant frequencies for speaker pair selection.
Figure 4.6: DET curves for an illustrative speaker recognition system, using the percent difference of median energy for speaker pair selection.
Figure 4.8: DET curves for an illustrative speaker recognition system, using the approximated KL divergence between speaker-specific GMMs to select speaker pairs.
Furthermore, examining the number of times a particular speaker appears in a group of similar or dissimilar speaker pairs, we note that there tend to be two types of speakers: those who appear frequently as members of difficult-to-distinguish speaker pairs, and those who occur frequently as members of easy-to-distinguish speaker pairs. In fact, there are 15 speakers (1 male, 14 female) that never appear in the most-dissimilar groups, and 24 speakers (10 male, 14 female) that never appear in the most-similar groups. Such a result is consistent with the existence of wolves and lambs, that is, the tendencies of a speaker to cause false alarm errors.
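The per-speaker tally described above amounts to counting how often each speaker occurs across the selected groups. A minimal sketch follows, assuming each group is a list of (speaker, speaker) tuples; the function names and the speaker labels in the example are hypothetical.

```python
from collections import Counter


def appearance_counts(pair_groups):
    """Count how many selected pairs each speaker belongs to, across all groups."""
    counts = Counter()
    for group in pair_groups:
        for speaker_a, speaker_b in group:
            counts[speaker_a] += 1
            counts[speaker_b] += 1
    return counts


def never_appearing(all_speakers, pair_groups):
    """Speakers that occur in none of the given groups of pairs."""
    counts = appearance_counts(pair_groups)
    return sorted(s for s in all_speakers if counts[s] == 0)


# Hypothetical groups, e.g. the most similar 1% and 3% of pairs.
speakers = ["s1", "s2", "s3", "s4"]
groups = [[("s1", "s2")], [("s1", "s2"), ("s1", "s3")]]
counts = appearance_counts(groups)
absent = never_appearing(speakers, groups)
```

Running the same tally separately on the most-similar and most-dissimilar groups gives the two lists of never-appearing speakers that the text compares.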
4.3 Discussion
In summary, the results of this investigation demonstrate that it is possible to predict which speaker pairs will be difficult for a typical speaker recognition system to distinguish. Both difficult- and easy-to-distinguish speaker pairs can be selected using a measure of similarity calculated from features like pitch, energy, or spectral slope. For the features considered here, using the Euclidean distance between vectors of mean first, second, and third formant frequencies produces the largest difference in performance for similar and dissimilar speaker pairs. An even more successful measure is the KL divergence calculated between speaker-specific GMMs. Overall, the degree of success is higher for selecting dissimilar speaker pairs than it is for selecting similar speaker pairs, possibly because similarity in a single characteristic is not necessarily sufficient to identify a difficult-to-distinguish speaker pair. Although the feature-measures cannot match the effectiveness of finding difficult-to-distinguish speaker pairs by actually selecting such pairs using results for a given system, they still provide potentially useful information about speakers. In particular, one may be able to determine an overall tendency of a speaker to be similar or dissimilar to other speakers. Additionally, being able to rank a set of speaker pairs can be quite informative.
In the next chapter, I build upon this approach by using a set of feature statistics in order to detect difficult speakers. I consider the task of finding difficult target speakers, who are prone to causing false rejection errors, separately from the task of finding difficult impostor speakers, who are prone to causing false alarms. Specifically, I train support vector machine (SVM) classifiers using examples of the most and least difficult target and impostor speakers.
Chapter 5 Detecting Difficult Speakers
It has been observed that simple feature statistics can be used to provide measures of similarity between speakers. Up to this point, I have used these feature statistics individually. Now, I investigate one method for using them jointly in order to make a prediction about whether a speaker will be difficult, either as a true speaker or an impostor speaker. In particular, I train a support vector machine (SVM) to distinguish between examples of the speakers who cause the most and fewest errors, corresponding to the most and least difficult speakers, respectively. Since speaker behavior is different for target and impostor speakers, I train separate SVMs for detecting difficult true speakers (who will cause false rejections) and difficult impostor speakers (who will cause false alarms).
I begin by discussing the data set that will be used for these experiments in Section 5.1. Section 5.2 describes the selection of feature statistics used as input to the SVMs. Details of SVM training are covered in Section 5.3, including the method for determining the difficult and easy speakers to use for training. The results of experiments are given in Section 5.4, and Section 5.5 concludes with a discussion of lessons learned.
5.1 Data Set for SVM Experiments
For this approach, I need to find speakers who cause very many or very few errors (of either the false rejection or false alarm type). Accordingly, these speakers need to have enough true speaker and impostor trials available for us to make a reliable decision about these error tendencies. This is especially an issue for the true speaker errors, given the limited number of target trials that are available.
In order to maximize the number of true speaker trials, as well as have a reasonable number of impostor trials, I use the same set of SRE08 data that I used for the analysis of Section 3.2. In particular, I take selected conversation sides from the SRE08 short2 and short3 train and test conditions, which correspond to roughly 2.5-3 minutes of speech per sample. I choose conversation sides from all speakers with at least 5 available speech utterances. Some