# A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering ...

Although the training and test sets are disjoint, they are selected from the same database of conversation sides of SRE08. In practice, it is reasonable to assume that a set of domain-specific data, representative of the data used in a given type of speaker recognition application, will be available for training. Due to data sparsity, I take a round robin approach (specifically, 10-fold cross-validation) in order to best utilize the available data.

## 5.2 Selection of Feature Statistics

The feature statistics under consideration include statistics of energy, spectral slope, fundamental frequency, formant frequency, and MFCC features, where the statistics can be calculated over frames corresponding to various regions, including phones, groups of phones, and all speech. In previous work on finding difficult-to-distinguish impostor speaker pairs, I had success using feature statistics calculated over the whole utterance or all speech regions. I take the same approach here by choosing to calculate the feature statistics over all frames of speech. One additional motivation for such a choice is that it is generally more convenient to simply calculate statistics using speech frames rather than frames of particular phonetic regions, given that it is less computationally expensive to implement a speech/nonspeech detector than it is to obtain phonetic transcripts from an automatic speech or phone recognition system.

The complete set of features is as follows.

1. Energy [en], calculated in MATLAB, using 25ms frames with a 10ms stepsize

2. Spectral slope [spsl], calculated in MATLAB, using 30ms frames with a 10ms stepsize

3. Fundamental frequency [f0], calculated with the Snack sound toolkit [66], using the ESPS method, which relies on the normalized cross correlation function and dynamic programming, with a default window length of 7.5ms and a stepsize of 10ms, default minimum pitch of 60Hz and default maximum pitch of 400Hz

4. First three formant frequencies, [f1,f2,f3], calculated with the Snack sound toolkit, which estimates speech formant trajectories using dynamic programming for continuity constraints and the roots of a 12th order linear predictor polynomial as candidates; a default window length of 49ms, a stepsize of 10ms, default cos4 windowing function, default preemphasis of 0.7, and a nominal first formant frequency of 500Hz, specifying the number of formants to be 3

5. First four formant frequencies, [g1,g2,g3,g4], calculated with the same settings as [f1-f3], except for the specification that the number of formants is 4 (note that looking for 3 formants produces different outputs than looking for 4 formants)

6. 19th order MFCCs plus energy [C0-C19], calculated using the Hidden Markov Model Toolkit (HTK) [72], using 26 filter banks ranging from 200Hz to 3300Hz, a frame length of 25ms, a stepsize of 10ms, and no normalization

7. Mean- and variance-normalized 19th order MFCCs plus energy [N0-N19], calculated with HTK using the same settings as [C0-C19]

The set of statistics computed for each feature over speech regions comprises the mean, median, standard deviation, skewness, kurtosis, minimum, and maximum.
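As a concrete sketch, the seven per-feature statistics can be computed from frame-level values with NumPy and SciPy. The `utterance_stats` helper and the synthetic f0 track below are illustrative only, not the dissertation's actual code:

```python
import numpy as np
from scipy import stats

def utterance_stats(frames):
    """Compute the seven summary statistics used here (mean, median,
    standard deviation, skewness, kurtosis, minimum, maximum) over the
    frame-level values of one feature, restricted to speech frames."""
    frames = np.asarray(frames, dtype=float)
    return np.array([
        frames.mean(),
        np.median(frames),
        frames.std(ddof=0),
        stats.skew(frames),
        stats.kurtosis(frames),  # excess kurtosis by default
        frames.min(),
        frames.max(),
    ])

# Example: a synthetic "f0" track standing in for real speech frames
rng = np.random.default_rng(0)
f0 = rng.normal(120.0, 15.0, size=500)
print(utterance_stats(f0).shape)  # one 7-dimensional statistics vector
```

Concatenating one such 7-dimensional vector per feature (or per formant/MFCC dimension) yields the feature-statistic vectors used as SVM input below.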

I include each type of feature and statistic in order to obtain feature statistics that may be informative in differing ways. However, since the two sets of formant frequencies (calculated by finding the first three [f1-f3] or the first four [g1-g4]) are related, as are the normalized and non-normalized MFCCs ([N0-N19] and [C0-C19]), I consider three groups of features, with differing degrees of similarity among the features:

1. energy [en], spectral slope [spsl], fundamental frequency without zeros [f0no0], fundamental frequency including zeros [f0with0], the set of the ﬁrst three formant frequencies [f1-f3], and non-normalized MFCCs [C0-C19], for a total of 187 statistics [speech1]

2. same as (1), with addition of normalized MFCCs [N0-N19], for a total of 327 statistics [speech2]

3. same as (2), with addition of the ﬁrst four formant frequencies [g1-g4], for a total of 355 statistics [speech3]

## 5.3 SVM Training

In order to train an SVM classifier to detect difficult speakers, there must be training data corresponding to such difficult speakers, as well as to non-difficult speakers who provide negative examples. To determine these speakers, I utilize the scores from an automatic speaker recognition system. Given a particular decision threshold, I can then evaluate how many false rejection and false acceptance errors occur among the trials of a given speaker, and rank the speakers according to these error rates. For each speaker, false acceptance errors committed as the target are counted together with false acceptance errors committed as the impostor (in other words, I do not distinguish between lamb-ish and wolf-ish speaker tendencies).


Roughly the top and bottom 20% of speakers (ranked according to their error rates) are used for training and testing. In particular, I take 80 speakers from each end of the difficulty spectrum. Speakers with the lowest frequency of errors provide negative training examples, while speakers with the most frequently occurring errors provide positive examples. For this speaker selection, I utilize scores from a UBM-GMM system with simplified factor analysis applied. Details of the implementation may be found in Section 3.2. In order to count errors, I use the decision threshold corresponding to an overall false alarm rate of 1%.
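The selection step above amounts to counting per-speaker errors at a fixed threshold and ranking. The following sketch illustrates the idea for false acceptances; the function name, argument layout, and data structures are hypothetical, not taken from the dissertation's implementation:

```python
import numpy as np

def select_difficult_and_easy(scores, speaker_ids, labels, threshold, n_per_end=80):
    """Rank speakers by false-acceptance rate at a fixed decision threshold
    and return the most and least error-prone speakers.

    scores      : trial scores from the recognition system
    speaker_ids : speaker associated with each trial
    labels      : True for target trials, False for impostor trials
    """
    errors, counts = {}, {}
    for s, spk, is_target in zip(np.asarray(scores), speaker_ids, np.asarray(labels)):
        counts[spk] = counts.get(spk, 0) + 1
        # a false acceptance: an impostor trial scoring above the threshold
        if not is_target and s >= threshold:
            errors[spk] = errors.get(spk, 0) + 1
    rates = {spk: errors.get(spk, 0) / counts[spk] for spk in counts}
    ranked = sorted(rates, key=rates.get)      # fewest errors first
    easy = ranked[:n_per_end]                  # negative examples
    hard = ranked[-n_per_end:]                 # positive examples
    return hard, easy
```

In the dissertation's setting, the threshold would be the one giving a 1% overall false alarm rate, and `n_per_end` would be 80.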

As mentioned previously, there is limited data available in SRE08; to deal with this data sparsity, I utilize a round robin (or 10-fold cross-validation) approach, with 10 splits of the data. Given 10 disjoint sets of 4 difficult and 4 easy speakers, I use 9 of the sets to train the SVM and the remaining one to test, with each set serving as the test set exactly once. The results are then calculated across the ten test sets. To further ensure that these results are representative, I run the experiment 10 times, with random selection of the 10 splits each time.

Each speaker has 5 or more conversation sides that are used as separate examples. I consider two separate SVMs: one to detect diﬃcult true speakers, i.e., those that are prone to causing false rejection errors, and one to detect diﬃcult impostor speakers, i.e., those that are prone to causing false alarms.

In addition to considering a linear kernel for the SVM, I also test polynomial kernels of orders 2 and 3, in the event that a nonlinear mapping may prove useful for the detection task at hand. Furthermore, I use the input feature statistics both as-is and with a rank normalization applied. Rank normalization, wherein the features are assigned a relative ranking from minimum to maximum, is a technique that often yields improvements in the context of speaker recognition systems with SVM classifiers. The rank normalization mapping is learned from the examples used to train the SVM, and then applied to both the train and the test data. The SVMs are implemented using the SVMlight toolkit [32].
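A minimal sketch of the rank normalization idea is given below: the mapping is fit on the training examples and then applied to both train and test data. The class name and interface are hypothetical, and this only illustrates the normalization itself (the dissertation's classifiers use SVMlight, not this code):

```python
import numpy as np

class RankNormalizer:
    """Map each feature to its relative rank in [0, 1], with the ranking
    learned from the SVM training examples and then applied to both the
    train and the test data."""

    def fit(self, X):
        # store each training column in sorted order
        self.sorted_cols_ = np.sort(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        n = self.sorted_cols_.shape[0]
        out = np.empty_like(X)
        for j in range(X.shape[1]):
            # fraction of training values lying below each new value
            out[:, j] = np.searchsorted(self.sorted_cols_[:, j], X[:, j]) / n
        return out
```

The normalized vectors would then be fed to linear, order-2, or order-3 polynomial kernel SVMs.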

First, I show results for detecting difficult impostor speakers in Section 5.4.1, followed by results for detecting difficult target speakers in Section 5.4.2.

### 5.4.1 Detecting Difficult Impostor Speakers

The recall, precision, specificity, and F-measure are given in Table 5.1 for three versions of the SVM classifier using the [speech1] set of feature statistics, which may or may not be rank normalized [rank,nonorm]. In these results, the SVM is detecting difficult impostor speakers, who are likely to cause false acceptance errors, either as the target model or the test speaker. The three SVMs differ in the kernel that they use, which may be a linear kernel [linear], a second order polynomial kernel [poly2], or a third order polynomial kernel [poly3]. These results are averages across the 10 runs of a 90% train / 10% test round-robin approach.
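For reference, the four performance measures reported throughout this section can be computed from the binary decisions as follows (a straightforward sketch using the standard definitions):

```python
def detection_metrics(y_true, y_pred):
    """Recall, precision, specificity, and F-measure from binary
    decisions, where 1 = difficult speaker and 0 = easy speaker."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn)                 # true positive rate
    precision = tp / (tp + fp)              # correctness of "difficult" labels
    specificity = tn / (tn + fp)            # true negative rate
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, specificity, f_measure
```

(The sketch assumes at least one example of each outcome, so the denominators are nonzero.)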

These results show that it is, in fact, possible to detect difficult impostor speakers most of the time, with recall and precision rates of around 0.84 and 0.86, respectively. Furthermore, note that rank normalization yields much more balanced results. Additionally, the linear and polynomial kernels perform about the same in terms of recall, with the polynomial kernels giving a slight gain in precision, specificity, and F-measure.

Depending on the application, the recall or the precision may be more important. For those situations where it is important to ﬁnd all of the diﬃcult speakers, at the cost of including some easy speakers, the threshold for making the diﬃcult distinction can be lowered, thereby increasing the recall. On the other hand, it may be important to be very accurate about any diﬃcult speaker labels, at the cost of missing some diﬃcult speakers. In this scenario, the threshold for making the decision can be raised, and the precision increased.


For the linear kernel SVM using rank normalized feature statistics, results are given in Table 5.2 for three different thresholds: -0.5, 0 (corresponding to the values in Table 5.1), and 0.5.

The choice of threshold for the detection of a diﬃcult speaker allows one to adjust according to the most important criterion. Even when improving results for one particular measure, the results stay fairly good across all performance measures, though the speciﬁcity (or true negative rate) does drop to 0.616 when the recall is increased. Since many applications might ﬁnd it very important to be correct about the speakers that are labeled as diﬃcult, I will examine the low false alarm case (this corresponds to a high speciﬁcity and precision). In particular, consider a false alarm rate of 5%, meaning that 5% of the diﬃcult labels will be incorrect. For the corresponding threshold (which is around 0.83), average recall is 0.612, and average precision is 0.959, for the SVM with a linear kernel and rank normalization of the input features. In other words, in order to be 95% correct about diﬃcult speaker decisions, over 60% of the diﬃcult speakers are found. Though this is not a very high recall rate, it may still be suﬃcient for some applications, and it provides a reasonable starting point on a ﬁrst try at this task.
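The operating point described above, where the threshold is raised until a target precision is met, can be sketched as a sweep over the SVM output scores. This is an illustrative helper (hypothetical name and interface), not the procedure's actual implementation:

```python
import numpy as np

def threshold_for_precision(svm_scores, y_true, target_precision=0.95):
    """Return the lowest decision threshold whose precision on held-out
    data reaches the target (e.g. 95% correct difficult-speaker labels),
    or None if no threshold achieves it. Lower thresholds keep more
    recall, so we take the largest qualifying prefix of ranked scores."""
    order = np.argsort(svm_scores)[::-1]        # highest scores first
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y == 1)                      # positives accepted so far
    fp = np.cumsum(y == 0)                      # negatives accepted so far
    precision = tp / (tp + fp)
    ok = np.where(precision >= target_precision)[0]
    if len(ok) == 0:
        return None
    k = ok[-1]                                  # largest prefix meeting target
    return np.sort(svm_scores)[::-1][k]
```

Raising `target_precision` trades recall for precision, exactly the trade-off discussed for the 5% false alarm operating point.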

Now, let us compare performance for the linear kernel SVM, when using more feature statistics, in particular, the [speech2] and [speech3] sets, which add normalized MFCCs and a diﬀerent set of four formant frequencies. Results for the three feature sets are given in Table 5.3.

In this case, the additional speech-based feature statistics do not add much information for distinguishing between easy and diﬃcult impostor speakers.

### 5.4.2 Detecting Difficult Target Speakers

Now, I present results for an SVM classifier trained to detect difficult target speakers, who tend to cause false rejection errors. Table 5.4 shows the recall, precision, specificity, and F-measure for SVMs trained using the set of [speech1] input feature statistics, both with and without rank normalization, for three SVM kernels, namely linear, order two polynomial, and order three polynomial. In each case, the results presented correspond to an average over ten runs of a round robin approach using a 90% / 10% split of the data.

In this case, the results are fairly reasonable, though detection of difficult target speakers is not as successful as the detection of difficult impostors. The intuition behind why difficult target speakers are not detected as successfully as difficult impostor speakers is as follows.

To cause false alarm errors, impostor speakers must be confusable with other speakers; so, there may be overall characteristics that make a speaker more average or more similar to other speakers within the population. On the other hand, the characteristics that make a target speaker hard to recognize as himself may vary from speaker to speaker, so that it is harder to capture all the ways in which a single conversation side may indicate a tendency to cause false rejections.

Returning to the results of Table 5.4, observe that rank normalization once again substantially improves performance overall. In the case of difficult target speakers, there are small


Table 5.1: Recall, precision, specificity, and F-measure values for detecting difficult impostor speakers using SVMs with different kernels (linear, second order polynomial [poly2], and third order polynomial [poly3]), with the [speech1] set of feature statistics as input, with or without rank normalization applied [rank,nonorm].

Table 5.2: Recall, precision, specificity, and F-measure values for detecting difficult impostor speakers using a linear kernel SVM trained with rank normalized feature statistics, comparing three different decision thresholds for difficult impostor speaker detection.

| Threshold | Recall | Precision | Specificity | F-measure |
|-----------|--------|-----------|-------------|-----------|
| -0.5      | 0.915  | 0.770     | 0.616       | 0.836     |
| 0         | 0.838  | 0.851     | 0.794       | 0.844     |
| 0.5       | 0.726  | 0.914     | 0.904       | 0.809     |

Table 5.3: Recall, precision, specificity, and F-measure values for detecting difficult impostor speakers using a linear kernel SVM trained with rank normalized feature statistics, comparing three sets of speech feature statistics, [speech1], [speech2], and [speech3].

| Feature set | Recall | Precision | Specificity | F-measure |
|-------------|--------|-----------|-------------|-----------|
| speech1     | 0.838  | 0.851     | 0.794       | 0.844     |
| speech2     | 0.831  | 0.852     | 0.798       | 0.841     |
| speech3     | 0.830  | 0.859     | 0.809       | 0.844     |

Table 5.4: Recall, precision, specificity, and F-measure values for detecting difficult target speakers using SVMs with different kernels (linear, second order polynomial, and third order polynomial), with the [speech1] set of feature statistics as input, with or without rank normalization applied.