
«A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering ...»


4.2 Results

System performance for the selected speaker pairs is reported using the minimum detection cost function (DCF) and false alarm (FA) rate, since I am concerned with finding difficult-to-distinguish impostor pairs. The DCF is defined as a weighted sum of the miss (i.e., not identifying a target speaker match) and false alarm (i.e., identifying an impostor speaker as the target speaker) error probabilities:

$$\mathrm{DCF} = C_{\mathrm{Miss}} \times P_{\mathrm{Miss}|\mathrm{Target}} \times P_{\mathrm{Target}} + C_{\mathrm{FalseAlarm}} \times P_{\mathrm{FalseAlarm}|\mathrm{NonTarget}} \times (1 - P_{\mathrm{Target}}) \tag{4.2}$$

In Equation (4.2), $C_{\mathrm{Miss}}$ and $C_{\mathrm{FalseAlarm}}$ are the relative costs of detection errors, and $P_{\mathrm{Target}}$ is the a priori probability of the specified target speaker. SRE08 used $C_{\mathrm{Miss}} = 10$, $C_{\mathrm{FalseAlarm}} = 1$, and $P_{\mathrm{Target}} = 0.01$.
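As a minimal sketch of Equation (4.2), the cost can be evaluated directly from a pair of error probabilities. The function name and the example error rates below are illustrative assumptions; only the default cost parameters are the SRE08 values quoted above.

```python
def detection_cost(p_miss, p_false_alarm,
                   c_miss=10.0, c_false_alarm=1.0, p_target=0.01):
    """Evaluate Equation (4.2) for one operating point.

    Defaults are the SRE08 parameters quoted in the text.
    """
    miss_term = c_miss * p_miss * p_target
    false_alarm_term = c_false_alarm * p_false_alarm * (1.0 - p_target)
    return miss_term + false_alarm_term


# Hypothetical operating point: 10% misses, 2% false alarms.
example_dcf = detection_cost(p_miss=0.10, p_false_alarm=0.02)
print(f"DCF = {example_dcf:.4f}")
```

With these parameters a false alarm probability is weighted by 0.99 and a miss probability by 0.1, so the false alarm term dominates the cost unless misses are far more frequent.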

For a given decision threshold, the FA rate is defined as:

$$P_{\mathrm{FalseAlarm}} = \frac{\text{number of false alarm errors}}{\text{total number of nontarget trials}} \tag{4.3}$$

For each speaker recognition system, I compute the percent difference in minimum DCF for the most (and least) similar speaker pairs relative to the minimum DCF over all speaker pairs. Relative to a FA rate of 1% on all speaker pairs, I calculate the percent difference in FA rate (at the decision threshold yielding 1% FA on all trials) for the most (and least) similar pairs. These relative differences are then averaged over all systems. With each feature-measure, if more similar (i.e., closer) speaker pairs correspond to difficult-to-distinguish speaker pairs, then differences in the DCF and FA rate should be positive and significant. The converse holds for less similar speaker pairs, which will have significant negative differences if they are easier for systems to distinguish.
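The quantities described above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used for the reported results: scores at or above the threshold are treated as accepted, the percent difference is taken as (subset - all) / all, and the toy scores and the fixed threshold of 2.0 are invented (the text instead uses the threshold that yields 1% FA on all trials).

```python
from math import inf


def error_rates(target_scores, nontarget_scores, threshold):
    """Miss rate and FA rate (Equation 4.3) when scores >= threshold are accepted."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa


def min_dcf(target_scores, nontarget_scores,
            c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum of Equation (4.2) over all candidate thresholds."""
    candidates = sorted(set(target_scores) | set(nontarget_scores))
    candidates.append(candidates[-1] + 1.0)  # threshold that rejects every trial
    best = inf
    for threshold in candidates:
        p_miss, p_fa = error_rates(target_scores, nontarget_scores, threshold)
        cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
        best = min(best, cost)
    return best


def percent_difference(subset_value, overall_value):
    """Relative change of a subset metric with respect to the all-pairs metric."""
    return 100.0 * (subset_value - overall_value) / overall_value


# Invented toy scores, for illustration only.
targets = [2.0, 3.0, 4.0, 5.0]
all_nontargets = [-3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 2.5, 3.0]
similar_pair_nontargets = [0.5, 1.0, 2.5, 3.0]  # trials from the "most similar" pairs

threshold = 2.0  # fixed here; the text uses the threshold giving 1% FA on all trials
_, fa_all = error_rates(targets, all_nontargets, threshold)
_, fa_similar = error_rates(targets, similar_pair_nontargets, threshold)
fa_change = percent_difference(fa_similar, fa_all)
toy_min_dcf = min_dcf(targets, all_nontargets)
print(f"FA change: {fa_change:+.0f}%, min DCF on all pairs: {toy_min_dcf:.3f}")
```

A positive `fa_change` corresponds to the case described in the text, where the selected pairs produce more false alarms than speaker pairs overall.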

Figures 4.1 and 4.2 show performance differences for the top 1% most and least similar speaker pairs, respectively. For each feature group, the feature-measure pair yielding the largest DCF and FA changes is presented. Similarly, Figures 4.3 and 4.4 show results when considering the top 5% most and least similar speaker pairs, respectively.

Features of each type can select speaker pairs for which the most (or least) similar have worse (or better) performance than all speaker pairs. Furthermore, this difference in performance typically increases when a smaller fraction of speaker pairs is used, i.e., there is a bigger difference for the most similar 1% of speaker pairs than for the most similar 5%.

It should be noted that differences in performance are not uniform across different speaker verification systems.

The feature-measure that yields the largest average difference in performance for the 1% most similar speaker pairs is the Euclidean distance between vectors of the mean first, second, and third formant frequencies. The next best feature-measures include other formant-based measures, the percent difference of median energy, and the correlation of histograms of LPC frequencies with a minimum magnitude requirement. For the 1% least similar speaker pairs, results are fairly similar across feature-measures, with the correlation of LPC frequency histograms and spectral slope yielding the smallest differences. The Euclidean distance between vectors of the mean first, second, and third formant frequencies also appears to be the best feature-measure for finding the 5% most difficult-to-distinguish speaker pairs, with the percent difference of the sum of formants and the absolute difference in LTAS local peak height being the next best. As with the 1% least similar speaker pairs, the 5% least similar show very consistent results across feature-measures, with reduced effectiveness for the correlation of LPC frequency histograms and spectral slope.

Figure 4.4: Relative differences in DCF and FA rate for the least similar 5% of speaker pairs, compared to all speaker pairs.

Detection error tradeoff (DET) curves are shown for example systems in Figures 4.5 and 4.6, using the Euclidean distance between vectors of the means of the first, second and third formants, and the percent difference of the median energy, respectively. Although the system in Figure 4.6 has good separation among the different DET curves, there is more overlap in the DET curves of Figure 4.5. Furthermore, Figure 4.5 reveals an asymmetry in behavior for dissimilar and similar speaker pairs, showing that the performance on difficult-to-distinguish speaker pairs is closer to performance on all speaker pairs. While this asymmetry does not exist for all systems and all sets of selected speaker pairs (as evidenced by Figure 4.6), the trend does hold in most cases.

Given that I am using at most a few coarsely calculated features, it is impressive to see the differences in performance that can be obtained using these measures to select easy- or difficult-to-distinguish speaker pairs. Much of this success comes from the information gained by the relative ranking of speaker pairs. As a single, standalone number, a feature-measure may not have much use. However, when taken in the context of a group of feature-measures corresponding to a set of speaker pairs, the absolute values of the feature-measures no longer matter; instead, the gain lies in being able to order a set of speaker pairs from least to most similar.
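The ranking idea can be illustrated with a short sketch that orders speaker pairs by the Euclidean distance between their mean formant vectors and keeps the closest fraction. The speaker labels and formant values are hypothetical, and the restriction to same-sex pairs used elsewhere in this chapter is omitted for brevity.

```python
from itertools import combinations
from math import ceil, dist

# Hypothetical mean (F1, F2, F3) values in Hz for four speakers.
mean_formants_hz = {
    "spk_a": (500.0, 1500.0, 2500.0),
    "spk_b": (510.0, 1490.0, 2520.0),
    "spk_c": (650.0, 1700.0, 2700.0),
    "spk_d": (480.0, 1450.0, 2450.0),
}


def rank_pairs_by_distance(features):
    """Return all unordered speaker pairs, most similar (smallest distance) first."""
    pairs = combinations(sorted(features), 2)
    return sorted(pairs, key=lambda pair: dist(features[pair[0]], features[pair[1]]))


def most_similar(ranked_pairs, fraction):
    """Keep the top fraction of a ranked list, always at least one pair."""
    count = max(1, ceil(fraction * len(ranked_pairs)))
    return ranked_pairs[:count]


ranked = rank_pairs_by_distance(mean_formants_hz)
top_pairs = most_similar(ranked, fraction=0.01)
print("most similar:", top_pairs, "least similar:", ranked[-1])
```

Only the order of the distances is used for selection, which matches the observation that the absolute value of a feature-measure matters less than the ranking it induces.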

While the results presented thus far are promising, the differences in performance for similar speaker pairs (relative to all speaker pairs) could still be larger. Accordingly, I test a measure that utilizes Gaussian mixture models (GMMs), with the motivation that GMMs may better predict speaker recognition system performance, given that many systems utilize cepstral feature-trained GMMs. Using SRI’s tools for training GMMs for speaker recognition [37], I trained speaker-specific GMMs via maximum a posteriori (MAP) adaptation from a universal background model trained on Fisher data. The input features were 12th order MFCCs plus energy, with deltas and double-deltas, and the models used 1024 Gaussians. For each unique pair of speaker-specific GMMs, an approximation to the Kullback-Leibler (KL) divergence (based on the unscented transform [26]) was used to measure similarity. Results are shown in Figure 4.7.
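For readers unfamiliar with this kind of approximation, the sketch below shows one common form of an unscented-transform estimate of the KL divergence between two GMMs: each component of the first model contributes 2d sigma points (its mean shifted by plus or minus the square root of d times the variance along each axis), and the log-likelihood ratio is averaged over those points. It assumes diagonal covariances and may differ in detail from the approximation of [26] as used for the reported results; the function names and toy models are illustrative only.

```python
from math import exp, log, pi, sqrt


def log_gmm_density(x, weights, means, variances):
    """Log density of a diagonal-covariance GMM at x, via log-sum-exp."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_norm = -0.5 * sum(log(2.0 * pi * v) for v in var)
        quad = -0.5 * sum((xi - m) ** 2 / v for xi, m, v in zip(x, mu, var))
        log_terms.append(log(w) + log_norm + quad)
    peak = max(log_terms)
    return peak + log(sum(exp(t - peak) for t in log_terms))


def unscented_kl(gmm_f, gmm_g):
    """Approximate KL(f || g) using sigma points of each component of f.

    Each GMM is a (weights, means, variances) tuple with diagonal covariances.
    """
    weights, means, variances = gmm_f
    dim = len(means[0])
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        component_sum = 0.0
        for k in range(dim):
            offset = sqrt(dim * var[k])  # sigma-point displacement along axis k
            for sign in (1.0, -1.0):
                point = list(mu)
                point[k] += sign * offset
                component_sum += (log_gmm_density(point, *gmm_f)
                                  - log_gmm_density(point, *gmm_g))
        total += w * component_sum / (2 * dim)
    return total


# Toy check with single-component "GMMs": unit variances, means one unit apart.
gmm_a = ([1.0], [[0.0, 0.0]], [[1.0, 1.0]])
gmm_b = ([1.0], [[1.0, 0.0]], [[1.0, 1.0]])
kl_ab = unscented_kl(gmm_a, gmm_b)
kl_aa = unscented_kl(gmm_a, gmm_a)
print(f"KL(a||b) = {kl_ab:.3f}, KL(a||a) = {kl_aa:.3f}")
```

For two single Gaussians with equal covariance the log-likelihood ratio is quadratic in x, so the sigma-point average reproduces the closed-form value of half the squared mean distance (0.5 in the toy check). Note that the KL divergence is asymmetric; whether and how it was symmetrized for each speaker pair is not stated here.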

Compared to previous feature-measures, the KL divergence is indeed more effective at finding difficult- and easy-to-distinguish speaker pairs. DET curves for an example system are shown in Figure 4.8. Again, relative to performance on all speaker pairs, there is a larger performance gap for dissimilar speaker pairs than for similar speaker pairs.

Returning to the groups of speaker pairs selected by the KL divergence approximation for GMMs, I more closely examine the 1%, 3%, 5%, 10%, and 20% most and least similar speaker pairs. Overall, there are 150 speakers (87 female and 63 male), for which there are 1815 same-sex impostor speaker pairs with impostor trials in the SRE08 short2-short3 task. For the groups of speaker pairs with larger values of KL divergence, that is, those speaker pairs that are expected to be easier for systems to distinguish, the majority are male (close to 75% on average). The opposite tendency, with more similar pairs tending to be female, holds to a lesser extent, although the groups with the lowest 1% and 3% of KL divergence values still have more male speaker pairs. These results suggest that there is a greater range of differences among male speakers, so that there are likely to be more dissimilar male speaker pairs.

Figure 4.5: DET curves for an illustrative speaker recognition system, using the Euclidean distance between vectors of the mean first, second, and third formant frequencies for speaker pair selection.

Figure 4.6: DET curves for an illustrative speaker recognition system, using the percent difference of median energy for speaker pair selection.

Figure 4.8: DET curves for an illustrative speaker recognition system, using the approximated KL divergence between speaker-specific GMMs to select speaker pairs.

Furthermore, examining the number of times a particular speaker appears in a group of similar or dissimilar speaker pairs, we note that there tend to be two types of speakers: those who appear frequently as members of difficult-to-distinguish speaker pairs, and those who occur frequently as members of easy-to-distinguish speaker pairs. In fact, there are 15 speakers (1 male, 14 female) that never appear in the most-dissimilar groups, and 24 speakers (10 male, 14 female) that never appear in the most-similar groups. Such a result is consistent with the existence of wolves and lambs, that is, the tendencies of a speaker to cause false alarm errors.
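The per-speaker counts described above can be obtained with a simple tally. The sketch below is illustrative only: the speaker labels and the selected pairs are hypothetical, and the helper names are not taken from the original analysis.

```python
from collections import Counter


def appearance_counts(pairs):
    """Count how often each speaker occurs across a group of speaker pairs."""
    return Counter(speaker for pair in pairs for speaker in pair)


def never_appearing(all_speakers, pairs):
    """Speakers that occur in none of the given pairs."""
    counts = appearance_counts(pairs)
    return sorted(s for s in all_speakers if counts[s] == 0)


# Hypothetical group of "most similar" pairs drawn from five speakers.
speakers = ["spk_a", "spk_b", "spk_c", "spk_d", "spk_e"]
most_similar_group = [("spk_a", "spk_b"), ("spk_a", "spk_c"), ("spk_a", "spk_d")]

similar_counts = appearance_counts(most_similar_group)
absent_from_similar = never_appearing(speakers, most_similar_group)
print(similar_counts.most_common(1), absent_from_similar)
```

In this toy group, `spk_a` plays the role of a speaker who is frequently a member of difficult-to-distinguish pairs, while `spk_e` never appears in one.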

4.3 Discussion

In summary, the results of this investigation demonstrate that it is possible to predict which speaker pairs will be difficult for a typical speaker recognition system to distinguish. Both difficult- and easy-to-distinguish speaker pairs can be selected using a measure of similarity calculated from features like pitch, energy, or spectral slope. For the features considered here, using the Euclidean distance between vectors of mean first, second, and third formant frequencies produces the largest difference in performance for similar and dissimilar speaker pairs. An even more successful measure is the KL divergence calculated between speaker-specific GMMs. Overall, the degree of success is higher for selecting dissimilar speaker pairs than it is for selecting similar speaker pairs, possibly because similarity in a single characteristic is not necessarily sufficient to identify a difficult-to-distinguish speaker pair. Although the feature-measures cannot match the effectiveness of finding difficult-to-distinguish speaker pairs by actually selecting such pairs using results for a given system, they still provide potentially useful information about speakers. In particular, one may be able to determine an overall tendency of a speaker to be similar or dissimilar to other speakers. Additionally, being able to rank a set of speaker pairs can be quite informative.

In the next chapter, I build upon this approach by using a set of feature statistics in order to detect difficult speakers. I consider the task of finding difficult target speakers, who are prone to causing false rejection errors, separately from the task of finding difficult impostor speakers, who are prone to causing false alarms. Specifically, I train support vector machine (SVM) classifiers using examples of the most and least difficult target and impostor speakers.

Chapter 5 Detecting Difficult Speakers

It has been observed that simple feature statistics can be used to provide measures of similarity between speakers. Up to this point, I have used these feature statistics individually. Now, I investigate one method for using them jointly in order to make a prediction about whether a speaker will be difficult, either as a true speaker or an impostor speaker. In particular, I train a support vector machine (SVM) to distinguish between examples of the speakers who cause the most and fewest errors, corresponding to the most and least difficult speakers, respectively. Since speaker behavior is different for target and impostor speakers, I train separate SVMs for detecting difficult true speakers (who will cause false rejections) and difficult impostor speakers (who will cause false alarms).

I begin by discussing the data set that will be used for these experiments in Section 5.1. Section 5.2 describes the selection of feature statistics used as input to the SVMs. Details of SVM training are covered in Section 5.3, including the method for determining the difficult and easy speakers to use for training. The results of experiments are given in Section 5.4, and Section 5.5 concludes with a discussion of lessons learned.

5.1 Data Set for SVM Experiments

For this approach, I need to find speakers who cause very many or very few errors (of either the false rejection or false alarm type). Accordingly, these speakers need to have enough true speaker and impostor trials available for us to make a reliable decision about these error tendencies. This is especially an issue for the true speaker errors, given the limited number of target trials that are available.

In order to maximize the number of true speaker trials, as well as have a reasonable number of impostor trials, I use the same set of SRE08 data that I used for the analysis of Section 3.2. In particular, I take selected conversation sides from the SRE08 short2 and short3 train and test conditions, which correspond to roughly 2.5-3 minutes of speech per sample. I choose conversation sides from all speakers with at least 5 available speech utterances. Some conversation sides were recorded on multiple channels (telephone and microphones, or just microphones). In these cases, I select only one instance of that conversation side, in order to prevent the introduction of confounding factors due to having the same lexical content across different speech samples. There are 416 speakers (256 female, 160 male), with 3049 conversation sides, and a total of 22,210 target trials. For each impostor speaker pair, five impostor trials are chosen (along with the corresponding trials that have the train and test data switched), for a total of 453,600 impostor trials.
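The impostor trial total can be reproduced with a short calculation, assuming that impostor speaker pairs are unordered same-sex pairs (an assumption carried over from the same-sex pairs of Chapter 4; it is not restated for this data set). Variable names are illustrative.

```python
from math import comb

female_speakers = 256
male_speakers = 160

# Assumption: impostor pairs are unordered pairs of same-sex speakers.
same_sex_pairs = comb(female_speakers, 2) + comb(male_speakers, 2)

trials_per_pair = 5   # five impostor trials chosen per speaker pair
directions = 2        # each trial also appears with train and test data switched

impostor_trials = same_sex_pairs * trials_per_pair * directions
print(same_sex_pairs, impostor_trials)
```

Under this assumption there are 45,360 speaker pairs, and the resulting count matches the 453,600 impostor trials stated above.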
