# «A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering ...»

Finally, the preliminary work regarding the eﬀects of speaker demographics suggests that while sex is a factor in the score distributions, the other diﬀerences are not particularly informative with respect to system scores. Besides the ANOVA analysis, I observed other diﬀerences in behavior between male and female speakers. In general, male speakers appear to vary more widely from one another, in the sense that a given male target speaker will produce diﬀerent ranges of scores for diﬀerent male test speakers. On the other hand, female target speakers may often produce similar scores for diﬀerent female test speakers. Going forward, my work will continue to consider results over the entire population, as well as for males and females separately.

Chapter 4 Predicting Difficult-to-distinguish Speaker Pairs

As I have shown, automatic speaker recognition system performance depends at least in part on intrinsic speaker characteristics, and speakers may have a tendency to produce false alarm or false rejection errors. Beyond a general per-speaker tendency to produce false alarm errors, automatic speaker recognition systems are expected to vary across impostor speaker pairs in how successfully they classify those pairs. By comparing the performance for a given speaker pair to performance over all speaker pairs, one can determine which speaker pairs are most (or least) difficult for a given system. Although these difficult-to-distinguish impostor speaker pairs may vary to some degree from system to system, I am most interested in finding the speaker pairs that perform poorly for any speaker recognition system. Thus, rather than relying on a particular speaker recognition system's output to select such speaker pairs, I aim to find the universally difficult-to-distinguish speaker pairs by utilizing a variety of features, such as pitch, formant frequencies, or energy.

There are several motivations for trying to predict the diﬃcult-to-distinguish impostor speaker pairs. First of all, if the speaker pairs most likely to cause errors can be identiﬁed, such information may be able to open a line of research into determining some of the issues related to intrinsic factors that remain in speaker recognition. Another possible application of this work would be as a tool for NIST to select more diﬃcult trials for future Speaker Recognition Evaluations, in order to present an even more challenging task. Finally, being able to ﬁnd the speaker pairs that are diﬃcult for an automatic system to distinguish could prove particularly useful in selecting a focus for a human expert in a speaker recognition task that utilizes both automatic system scores as well as human analysis, or as a method for sub-sampling the most salient speech samples in a speaker recognition task where it is impractical to fully process all the data that exists.

This investigation considers a basic set of features, including fundamental frequency statistics, energy statistics, long-term average spectrum (LTAS) energy statistics, formant

## CHAPTER 4. PREDICTING DIFFICULT-TO-DISTINGUISH SPEAKER PAIRS 55

frequency statistics, histograms of frequencies obtained from linear predictive (LP) analysis, and spectral slope statistics. These feature choices are motivated by prior work in speaker recognition and other tasks involving characterization of speaker differences. For instance, speaker recognition approaches have used features like pitch and energy distributions or dynamics [1], prosodic statistics including duration and pitch-related features [59], and jitter and shimmer [25]. Formant frequencies and bandwidths, obtained using linear predictive analysis, were used as descriptors for perceptual speaker characterization by Necioğlu et al. [56], while McDougall and Nolan showed that formant frequency dynamics are speaker discriminative [49]. Kuwabara and Sagisaka considered many acoustic parameters as influences upon voice individuality, including pitch frequency, contour and fluctuation, formant frequencies, trajectories and bandwidths, and LTAS [41].

The aforementioned features, along with appropriate distance measures, are used to select speaker pairs that are closer, or more similar, in terms of a given feature-measure pair. The goal is to find feature-measures for which similar speaker pairs correspond to speaker pairs that are difficult for automatic speaker recognition systems to distinguish. As a more complex measure that may better predict speaker recognition system behavior, I also test the approximated Kullback-Leibler (KL) divergence between speaker-adapted Gaussian mixture models (trained on MFCC features).

I begin by describing my approach in greater detail in Section 4.1. Results are given in Section 4.2, and Section 4.3 provides a summary and discussion of ﬁndings.

4.1 Approach

This approach tests a variety of measures calculated from different features as a criterion for selecting similar (or dissimilar) speaker pairs for speaker recognition. I describe the features considered in Section 4.1.1, and the measures and process of speaker pair selection are discussed in Section 4.1.2. The data used is covered in Section 4.1.3.

4.1.1 Features

The features described below are examined as potentially useful for speaker pair selection. Features are calculated using either MATLAB with the Voicebox toolkit [10] or Praat [7]. The terms given in brackets are the names I will use to refer to the features.

Note that the feature statistics calculated using Praat are computed over the entire input file, including both speech and non-speech regions. The statistics calculated with MATLAB are computed over only those regions of the input designated as speech by the voice activity detection (VAD) provided by NIST.

1. Pitch statistics (Praat): mean, median, range, and mean average slope of the fundamental frequency [f0 mean, f0 med, f0 range, f0 mas]. The range was set to consider fundamental frequencies between 75Hz and 600Hz, with all other settings corresponding to default Praat parameters.

2. Jitter and shimmer (Praat): jitter relative average perturbation, and shimmer 5-point amplitude perturbation quotient [jitt rap, shim apq5]. Jitter describes the variations in pitch. The relative average perturbation (RAP) computes the absolute difference between a pitch period and the average of that period and its two neighbors, then takes the average of this absolute difference and divides it by the average pitch period. Settings for computing the jitter RAP include a minimum fundamental frequency of 75Hz, a maximum fundamental frequency of 600Hz, a minimum period of 0.0001, a maximum period of 0.02, and a maximum period factor of 1.3 (which denotes the largest difference between consecutive intervals that will be included in the jitter computation).

Shimmer describes varying loudness (or amplitude) in the voice. The ﬁve-point Amplitude Perturbation Quotient (APQ5) calculates the average absolute diﬀerence between the amplitude of a period and the average of the amplitudes of it and its four closest neighbours, and then divides this average absolute diﬀerence by the average amplitude.

Parameter settings for computing the shimmer APQ5 include a minimum fundamental frequency of 75Hz, a maximum fundamental frequency of 600Hz, a minimum period of 0.0001, a maximum period of 0.02, a maximum period factor of 1.3, and a maximum amplitude factor of 1.6 (denoting the largest possible diﬀerence in amplitude between consecutive intervals that will be included in the shimmer computation).

3. Formant frequency statistics (Praat): mean and median of the first three formants [f1 mean, f1 med, f2 mean, f2 med, f3 mean, f3 med]. The relevant parameter settings for formant frequency calculation include a window length of 25ms, a step size of 6.25ms, a +3 dB point for an inverted low-pass filter (with a slope of +6 dB/octave) of 50Hz (this is a pre-emphasis filter used to create a flatter spectrum), a maximum number of 4 formants, and a maximum formant frequency of 4000Hz (due to the bandlimited nature of the data used here).

4. Energy statistics (Praat): mean and median energy [en mean, en med]. Default Praat settings were used, including a designation to subtract the overall mean energy.

5. Long term average spectrum energy statistics (Praat): mean, standard deviation, range, slope, and local peak height of LTAS energy [ltas mean, ltas stddev, ltas range, ltas slope, ltas lph]. Praat parameter settings include a ﬁlter bandwidth of 100Hz and a frequency range from 0 to 4000Hz. Furthermore, for local peak height calculation, there is a minimum peak height of 2400 and a maximum peak height of 3200.

6. Histograms of frequencies from roots of the LPC polynomial (MATLAB/Voicebox): frequencies obtained from linear predictive coding (LPC) order 8 or order 14 polynomial coefficient roots (both with and without a minimum magnitude requirement of 0.78 and 0.88, respectively¹) contribute to a histogram with a bin size of 5 Hz covering the 5-3995 Hz range [hist8all, hist8minmag, hist14all, hist14minmag]. A frame length of 25ms and step size of 10ms were used for calculating the LPC coefficients.

7. Spectral slope statistics (MATLAB): mode and median of spectral slope, calculated over frequency range 0-4000 Hz [mode specsl, med specsl]. A frame length of 30ms and step size of 10ms were used to calculate per-frame spectral slope values, from which the mode and median values were computed.
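As an illustration of the histogram feature in item 6, the following is a minimal Python sketch rather than the MATLAB/Voicebox code actually used. It assumes an 8 kHz sampling rate, autocorrelation-method LPC with no analysis window, and that only roots in the upper half of the z-plane contribute; none of these details are stated above. The function names (`lpc_coefficients`, `root_frequencies`, `lpc_root_histogram`) are hypothetical.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC (assumed variant): returns a so that
    x[n] is approximated by sum_k a[k-1] * x[n-k], k = 1..order."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    # Solve the Toeplitz normal equations R a = r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def root_frequencies(a, fs, min_magnitude=0.0):
    """Frequencies in Hz of the upper-half-plane roots of the LPC polynomial."""
    roots = np.roots(np.concatenate(([1.0], -np.asarray(a))))
    keep = (roots.imag > 0) & (np.abs(roots) >= min_magnitude)
    return np.angle(roots[keep]) * fs / (2.0 * np.pi)

def lpc_root_histogram(signal, fs=8000, order=8, min_magnitude=0.0,
                       frame_ms=25.0, step_ms=10.0):
    """Accumulate root frequencies over all frames into 5 Hz bins on 5-3995 Hz."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))
    step = int(round(fs * step_ms / 1000.0))
    edges = np.arange(5, 4000, 5)  # bin edges 5, 10, ..., 3995 Hz
    hist = np.zeros(len(edges) - 1, dtype=np.int64)
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len]
        if not np.any(frame):
            continue  # all-zero frame: the normal equations are singular
        a = lpc_coefficients(frame, order)
        hist += np.histogram(root_frequencies(a, fs, min_magnitude), bins=edges)[0]
    return hist, edges
```

Under these assumptions, the `hist8all` and `hist8minmag` variants would correspond to `order=8` with `min_magnitude=0.0` and `min_magnitude=0.78`, and the order 14 variants to `order=14` with `0.0` and `0.88`.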

4.1.2 Measures and speaker pair selection

Features are calculated for each speech sample, and a measure is computed for every unique speaker pair in two different ways. The first method averages the feature values over all conversation sides of each speaker, and then calculates the measure for each speaker pair using these average per-speaker feature values [featavg]. The second method calculates a measure for each possible pairing of conversation sides for a given speaker pair (with one conversation side for each speaker), and then averages these measure values to obtain a single value for each unique speaker pair [measavg].
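The two averaging schemes can give different values for the same speaker pair. The sketch below illustrates this for a scalar feature, using absolute difference as the example measure. The function names and the toy pitch values are hypothetical and are not taken from the experiments.

```python
from itertools import product

import numpy as np

def absdiff(x, y):
    """Absolute difference between two scalar feature values."""
    return abs(x - y)

def featavg(sides_a, sides_b, measure=absdiff):
    """Average each speaker's per-side feature values, then apply the measure once."""
    return float(measure(np.mean(sides_a), np.mean(sides_b)))

def measavg(sides_a, sides_b, measure=absdiff):
    """Apply the measure to every cross-speaker pairing of sides, then average."""
    return float(np.mean([measure(x, y) for x, y in product(sides_a, sides_b)]))

# Toy example: mean f0 (Hz) for two conversation sides of each speaker.
speaker_a = [100.0, 120.0]
speaker_b = [110.0, 110.0]
print(featavg(speaker_a, speaker_b))  # per-speaker averages coincide
print(measavg(speaker_a, speaker_b))  # every side pairing differs by 10 Hz
```

In this toy case featavg reports identical speakers, because the per-speaker averages hide the within-speaker variation, while measavg retains it.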

For scalar features, absolute difference [absdiff] and percent difference [pctdiff] are used as measures, where percent difference for values x and y is defined as

$$\text{Percent difference} = \frac{|x - y|}{x + y} \tag{4.1}$$

when x and y have the same sign (it is not used for features with both positive and negative values). In addition to the individual formants, sums of formants are used as scalar features (with absolute and percent difference measures), and the Euclidean distance [eucldist] is also calculated for vectors of formant frequencies, e.g. (f1,f2,f3). For the histograms of frequencies from LP analysis, a correlation coefficient [corr] is calculated as a measure of similarity. Table 4.1 summarizes the possible feature-measure combinations, grouped according to feature type.
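The measures described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the percent difference follows Equation 4.1 as printed, and the correlation coefficient is assumed to be the Pearson coefficient, which the text does not specify.

```python
import numpy as np

def pctdiff(x, y):
    """Percent difference |x - y| / (x + y), defined only for same-sign values."""
    if x * y <= 0:
        raise ValueError("percent difference requires nonzero values of the same sign")
    return abs(x - y) / (x + y)

def eucldist(u, v):
    """Euclidean distance between feature vectors, e.g. (f1, f2, f3)."""
    return float(np.linalg.norm(np.asarray(u, dtype=float) - np.asarray(v, dtype=float)))

def corr(hist_a, hist_b):
    """Pearson correlation between two histograms (assumed choice of coefficient)."""
    return float(np.corrcoef(hist_a, hist_b)[0, 1])

# Toy values, not taken from the experiments.
print(pctdiff(100.0, 120.0))                              # 20 / 220
print(eucldist((500, 1500, 2500), (503, 1504, 2500)))     # 3-4-5 triangle
print(corr([1, 4, 2, 0], [3, 9, 5, 1]))                   # second is 2x + 1
```

Smaller values of the first two measures indicate more similar speakers, whereas a larger correlation indicates more similar histograms, so the direction of "most similar" must be tracked per measure when ranking pairs.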

Based on the measure for each unique speaker pair, those pairs with the highest and lowest 1% (or 5%) of values are selected to determine if the measure of speaker similarity corresponds to the degree of diﬃculty for a speaker recognition system. For absolute diﬀerence, percent diﬀerence, and Euclidean distance, smaller values should indicate more similar speakers, while for correlation coeﬃcients, higher values indicate greater speaker similarity.
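The selection step can be sketched as follows. The rounding rule is an assumption: rounding the pair count up is consistent with the counts reported later in this chapter (19 and 91 of 1815 speaker pairs for 1% and 5%), but the exact rule is not stated. The function name and the speaker labels are hypothetical.

```python
import math

def select_extremes(pair_values, fraction=0.01):
    """Return the speaker pairs with the lowest and highest measure values.

    pair_values maps a speaker pair, e.g. ("spk_a", "spk_b"), to its measure.
    The number of pairs kept is rounded up (assumed rounding rule).
    """
    k = math.ceil(fraction * len(pair_values))
    ranked = sorted(pair_values, key=pair_values.get)
    return ranked[:k], ranked[-k:]

# Toy example with four pairs, keeping 25% at each end.
values = {("a", "b"): 3.0, ("a", "c"): 0.5, ("b", "c"): 7.2, ("a", "d"): 1.1}
lowest, highest = select_extremes(values, fraction=0.25)
print(lowest, highest)
```

For a distance-like measure the lowest-valued pairs are the most similar speakers, while for the correlation coefficient the highest-valued pairs are.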

¹ These values (the minimum magnitude requirements of 0.78 and 0.88 for the LPC polynomial roots) were chosen based on a preliminary inspection of histograms, and were not optimized for selecting speaker pairs.


4.1.3 Speech corpora

The 2008 NIST Speaker Recognition Evaluation (SRE08) includes a condition (short2-short3) that uses roughly 2.5-3 minutes of speech for each of training and testing [53]. This speech is taken either from one side of a conversation between two people over the telephone (possibly recorded on a microphone), or from part of an interview recorded on a microphone (some interviewer speech may be present). Additional interview data was released for a follow-up evaluation experiment designed to further explore the new interview style of data collection.

Corpus for feature-measure calculation

Speech data from the followup evaluation is used to calculate features for the speakers. In particular, speech recorded on microphone 2 (a lavalier microphone placed on the subject) is used since it has good sound quality. These speaker features are then used in conjunction with a similarity measure in order to predict difficult- and easy-to-distinguish speaker pairs.

The majority of speakers have four conversation sides used for the measure calculation (a small minority have three or ﬁve conversation sides).

Corpus for evaluation of selected speaker pairs

The data used to evaluate speaker-pair selection differ in several respects from the data used to perform the selection. Specifically, the selection data were collected in an interview, while the evaluation data were collected in either an interview or a telephone conversation. Also, the selection data were collected using a lavalier microphone, whereas the evaluation data were collected using a variety of microphones, including a telephone handset. Furthermore, though the speakers contained in each set are the same, the selection data do not overlap with the evaluation data.

Speaker recognition system submissions from the SRE08 short2-short3 condition are used to compute performance on trials for the selected 1% (or 5%) of most and least similar speaker pairs. Of the 34 sites that shared their system submissions for the short2-short3 condition, 33 are used in the results. The total number of trials for short2-short3 (after removing trials for speakers not found in the selection data) is 55013, with 1815 unique impostor speaker pairs. When keeping 1% (or 19) of the speaker pairs, there are around 4000 trials on average, while 5% (or 91) of the speaker pairs corresponds to an average of roughly 11000 trials. When filtering trials for selected speaker pairs, I removed target trials of speakers not included in any of the selected speaker pairs.