
«A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering ...»


Although the training and test sets are disjoint, they are selected from the same database of SRE08 conversation sides. In practice, it is reasonable to assume that a set of domain-specific training data will be available that is representative of the data used in a given type of speaker recognition application. Because the data are sparse, I take a round-robin approach (specifically, 10-fold cross-validation) to make the best use of the available data.

5.2 Selection of Feature Statistics

The feature statistics under consideration include statistics of energy, spectral slope, fundamental frequency, formant frequency, and MFCC features, where the statistics can be calculated over frames corresponding to various regions, including phones, groups of phones, and all speech. In the previous work on finding difficult-to-distinguish impostor speaker pairs, I had success using feature statistics calculated over the whole utterance or all speech regions. I take the same approach here by choosing to calculate the feature statistics over all frames of speech. One additional motivation for such a choice is that it is generally more convenient to simply calculate statistics using speech frames rather than frames of particular phonetic regions, given that it is less computationally expensive to implement a speech/nonspeech detector than it is to obtain phonetic transcripts from an automatic speech or phone recognition system.

The complete set of features is as follows.

1. Energy [en], calculated in MATLAB, using 25ms frames with a 10ms stepsize

2. Spectral slope [spsl], calculated in MATLAB, using 30ms frames with a 10ms stepsize

3. Fundamental frequency [f0], calculated with the Snack sound toolkit [66], using the ESPS method, which relies on the normalized cross correlation function and dynamic programming, with a default window length of 7.5ms and a stepsize of 10ms, default minimum pitch of 60Hz and default maximum pitch of 400Hz

4. First three formant frequencies, [f1,f2,f3], calculated with the Snack sound toolkit, which estimates speech formant trajectories using dynamic programming for continuity constraints and the roots of a 12th order linear predictor polynomial as candidates; a default window length of 49ms, a stepsize of 10ms, default cos^4 windowing function, default preemphasis of 0.7, and a nominal first formant frequency of 500Hz, specifying the number of formants to be 3

5. First four formant frequencies, [g1,g2,g3,g4], calculated with the same settings as [f1-f3], except for the specification that the number of formants is 4 (note that looking for 3 formants produces different outputs than looking for 4 formants)

6. 19th order MFCCs plus energy [C0-C19], calculated using the Hidden Markov Model Toolkit (HTK) [72], using 26 filter banks ranging from 200Hz to 3300Hz, frame length of 25ms, stepsize of 10ms, no normalizations

7. Mean- and variance-normalized 19th order MFCCs plus energy [N0-N19], calculated with HTK with the same settings as [C0-C19]

The statistics computed for each feature over speech regions are the mean, median, standard deviation, skewness, kurtosis, minimum, and maximum.
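The seven statistics listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `summary_statistics`, `utterance_vector`, and `STAT_NAMES` are hypothetical, the speech/nonspeech mask is taken as given, and the biased (population) moment estimators are an assumption, since the text does not say which estimators were used.

```python
import math
import statistics

# Order in which the statistics are stored (an illustrative choice).
STAT_NAMES = ("mean", "median", "std", "skewness", "kurtosis", "min", "max")


def summary_statistics(values):
    """Return the seven statistics for one feature track (speech frames only)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments with the biased (divide-by-n) estimator: an assumption.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std": std,
        "skewness": m3 / std ** 3 if std > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 if m2 > 0 else 0.0,  # non-excess kurtosis
        "min": min(values),
        "max": max(values),
    }


def utterance_vector(features, speech_mask):
    """Concatenate the statistics of every feature over the speech frames.

    features: dict mapping a feature name (e.g. "en") to per-frame values.
    speech_mask: per-frame booleans from a speech/nonspeech detector.
    """
    vector = []
    for name in sorted(features):
        speech_values = [
            v for v, is_speech in zip(features[name], speech_mask) if is_speech
        ]
        stats = summary_statistics(speech_values)
        vector.extend(stats[key] for key in STAT_NAMES)
    return vector
```

Each conversation side then yields one fixed-length vector, regardless of its duration, which is what makes it usable as a single SVM example.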

I include each type of feature and statistic in order to obtain feature statistics that may be informative in differing ways. However, since the two sets of formant frequencies (calculated by finding the first three [f1-f3] or the first four [g1-g4]) are related, as are the normalized and non-normalized MFCCs ([N0-N19] and [C0-C19]), I consider three groups of features, with differing degrees of similarity among the features:

1. energy [en], spectral slope [spsl], fundamental frequency without zeros [f0no0], fundamental frequency including zeros [f0with0], the set of the first three formant frequencies [f1-f3], and non-normalized MFCCs [C0-C19], for a total of 187 statistics [speech1]

2. same as (1), with addition of normalized MFCCs [N0-N19], for a total of 327 statistics [speech2]

3. same as (2), with addition of the first four formant frequencies [g1-g4], for a total of 355 statistics [speech3]

5.3 SVM Training

In order to train an SVM classifier to detect difficult speakers, there must be training data that corresponds to such difficult speakers, as well as to non-difficult speakers who will provide negative examples. To determine these speakers, I utilize the scores from an automatic speaker recognition system. Given a particular decision threshold, I can then evaluate how many false rejection and false acceptance errors occur among the trials of a given speaker, and rank the speakers according to these error rates. For each speaker, false acceptance errors as the target are counted along with false acceptance errors as the impostor (in other words, I do not distinguish between lamb-ish and wolf-ish speaker tendencies).

CHAPTER 5. DETECTING DIFFICULT SPEAKERS 74

Roughly the top and bottom 20% of speakers (ranked according to their error rates) are used for training and testing. In particular, I take 80 speakers from each end of the difficulty spectrum. The speakers with the fewest errors provide negative training examples, while the speakers with the most errors provide positive examples.
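The speaker selection described here can be sketched as follows. This is a minimal illustration, not the author's implementation: the trial format `(model_speaker, test_speaker, score)`, the function names, and the tie-breaking by speaker ID are all assumptions. A false acceptance is attributed to both speakers in the trial, following the statement that errors as the target and as the impostor are counted together.

```python
from collections import defaultdict


def speaker_error_rates(trials, threshold, kind):
    """Per-speaker error rate for one error type.

    trials: iterable of (model_speaker, test_speaker, score) tuples
            (a hypothetical format chosen for this sketch).
    kind: "false_accept" (impostor trials) or "false_reject" (target trials).
    """
    want_target = kind == "false_reject"
    errors = defaultdict(int)
    counts = defaultdict(int)
    for model_spk, test_spk, score in trials:
        is_target = model_spk == test_spk
        if is_target != want_target:
            continue  # this trial cannot produce the requested error type
        accepted = score >= threshold
        is_error = accepted != is_target
        # An impostor trial involves two speakers; both are charged with it.
        for spk in {model_spk, test_spk}:
            counts[spk] += 1
            errors[spk] += int(is_error)
    return {spk: errors[spk] / counts[spk] for spk in counts}


def select_extremes(rates, n_each):
    """Return (easy, difficult): n_each speakers from each end of the ranking."""
    ranked = sorted(rates, key=lambda spk: (rates[spk], spk))
    return ranked[:n_each], ranked[-n_each:]
```

With real scores, `threshold` would be the operating point of the speaker recognition system, and `n_each` would be the number of speakers taken from each end of the ranking.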

For this speaker selection, I utilize scores from a UBM-GMM system with simplified factor analysis applied. Details of the implementation may be found in Section 3.2. In order to count errors, I use the decision threshold corresponding to an overall false alarm rate of 1%.

As mentioned previously, there is limited data available in SRE08; to deal with this data sparsity, I utilize a round robin (or 10-fold cross-validation) approach, with 10 splits of the data. Given 10 disjoint sets of 4 difficult and 4 easy speakers, I use 9 of the sets to train the SVM, and the remaining 1 to test, with each set being the test set exactly once. The results are then calculated across the ten test sets. To further ensure that these results are representative, I run the experiment 10 times, with random selection of the 10 splits each time.
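A round-robin split of this kind might be sketched as below. The function name, the seeding scheme, and the speaker IDs are illustrative assumptions; the only properties taken from the text are that the folds are disjoint sets of speakers, that each fold holds equal numbers of difficult and easy speakers, and that each fold serves as the test set exactly once.

```python
import random


def round_robin_splits(difficult, easy, n_folds=10, seed=0):
    """Yield (train_speakers, test_speakers) pairs, one per fold.

    Speakers, not individual conversation sides, are assigned to folds, so
    every example of a test speaker is held out of training.
    """
    rng = random.Random(seed)
    difficult, easy = list(difficult), list(easy)
    rng.shuffle(difficult)
    rng.shuffle(easy)
    # Fold i takes every n_folds-th speaker of each class, starting at i.
    folds = [difficult[i::n_folds] + easy[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [spk for j, fold in enumerate(folds) if j != i for spk in fold]
        yield train, test


# Repeating the experiment with a different random partition each time:
# for run in range(10):
#     for train, test in round_robin_splits(difficult, easy, seed=run):
#         ...  # train the SVM on `train`, evaluate on `test`
```

Varying `seed` across runs gives the repeated experiments with a new random selection of splits each time.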

Each speaker has 5 or more conversation sides that are used as separate examples. I consider two separate SVMs: one to detect difficult true speakers, i.e., those that are prone to causing false rejection errors, and one to detect difficult impostor speakers, i.e., those that are prone to causing false alarms.

In addition to considering a linear kernel for the SVM, I also test polynomial kernels of orders 2 and 3, in the event that a nonlinear mapping may prove useful for the detection task at hand. Furthermore, I use the input feature statistics both unnormalized and with rank normalization applied. Rank normalization, wherein the features are assigned a relative ranking from minimum to maximum, is a technique that often yields improvements in the context of speaker recognition systems with SVM classifiers. The rank normalization mapping is learned from the examples used to train the SVM, and then applied to both the train and the test data. The SVMs are implemented using the SVMlight toolkit [32].
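One common way to realize rank normalization is to replace each feature value by the fraction of training values that do not exceed it. The sketch below assumes that definition; the text does not specify the exact mapping, and the function names are hypothetical. The key point it illustrates is that the mapping is fitted on the training examples only and then reused unchanged for the test examples.

```python
from bisect import bisect_right


def fit_rank_normalizer(train_rows):
    """Learn the mapping: keep the sorted training values of each feature."""
    return [sorted(column) for column in zip(*train_rows)]


def apply_rank_normalizer(sorted_columns, rows):
    """Map each value to the fraction of training values that are <= it.

    Outputs lie in [0, 1]; test values below the training minimum map to 0
    and values above the training maximum map to 1.
    """
    return [
        [
            bisect_right(column, value) / len(column)
            for column, value in zip(sorted_columns, row)
        ]
        for row in rows
    ]
```

Because every feature ends up on the same [0, 1] scale, statistics with very different ranges (for example, formant frequencies in Hz and MFCC skewness) contribute comparably to the kernel.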


First, I will show results for detecting a difficult impostor speaker in Section 5.4.1, followed by results of detecting a difficult target speaker in Section 5.4.2.

5.4.1 Detecting Difficult Impostor Speakers

The recall, precision, specificity, and F-measure are given in Table 5.1 for three versions of the SVM classifier using the [speech1] set of feature statistics, which may or may not be rank normalized [rank,nonorm]. In these results, the SVM is detecting difficult impostor speakers, who are likely to cause false acceptance errors, either as the target model or the test speaker. The three SVMs differ in the kernel that they use, which may be a linear kernel [linear], a second order polynomial kernel [poly2], or a third order polynomial kernel [poly3]. These results are averages across the 10 runs of a 90% train - 10% test round-robin approach.

These results show that it is, in fact, possible to detect difficult impostor speakers most of the time, with recall and precision rates of around 0.84 and 0.86, respectively. Furthermore, note that rank normalization yields much more balanced results. Additionally, the linear and polynomial kernels perform about the same in terms of recall, with polynomial kernels giving a slight gain in precision, specificity, and F-measure.

Depending on the application, the recall or the precision may be more important. For those situations where it is important to find all of the difficult speakers, at the cost of including some easy speakers, the threshold for making the difficult distinction can be lowered, thereby increasing the recall. On the other hand, it may be important to be very accurate about any difficult speaker labels, at the cost of missing some difficult speakers. In this scenario, the threshold for making the decision can be raised, and the precision increased.
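The trade-off can be made concrete with a small sketch that computes the four measures from SVM output scores at a chosen threshold. The function names and the toy scores are illustrative assumptions; the formulas are the standard definitions, with the F-measure taken as the harmonic mean of precision and recall.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def detection_metrics(scores, labels, threshold=0.0):
    """Recall, precision, specificity and F-measure at one decision threshold.

    scores: SVM outputs, one per example.
    labels: True for a difficult speaker example, False for an easy one.
    An example is labeled difficult when its score exceeds the threshold.
    """
    tp = fp = tn = fn = 0
    for score, is_difficult in zip(scores, labels):
        predicted = score > threshold
        if predicted and is_difficult:
            tp += 1
        elif predicted:
            fp += 1
        elif is_difficult:
            fn += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f_measure": f_measure(precision, recall),
    }


# Toy scores (made up for illustration): four difficult, four easy examples.
scores = [1.2, 0.7, 0.3, -0.2, 0.4, -0.4, -0.9, -1.5]
labels = [True, True, True, True, False, False, False, False]
for threshold in (-0.5, 0.0, 0.5):
    print(threshold, detection_metrics(scores, labels, threshold))
```

On the toy data, lowering the threshold to -0.5 raises recall at the expense of precision and specificity, while raising it to 0.5 does the reverse.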


For the linear kernel SVM using rank normalized feature statistics, results are given in Table 5.2 for three different thresholds: -0.5, 0 (corresponding to the values in Table 5.1), and 0.5.

The choice of threshold for the detection of a difficult speaker allows one to adjust according to the most important criterion. Even when improving results for one particular measure, the results stay fairly good across all performance measures, though the specificity (or true negative rate) does drop to 0.616 when the recall is increased. Since many applications might find it very important to be correct about the speakers that are labeled as difficult, I will examine the low false alarm case (this corresponds to a high specificity and precision). In particular, consider a false alarm rate of 5%, meaning that 5% of the difficult labels will be incorrect. For the corresponding threshold (which is around 0.83), average recall is 0.612, and average precision is 0.959, for the SVM with a linear kernel and rank normalization of the input features. In other words, in order to be 95% correct about difficult speaker decisions, over 60% of the difficult speakers are found. Though this is not a very high recall rate, it may still be sufficient for some applications, and it provides a reasonable starting point on a first try at this task.

Now, let us compare performance for the linear kernel SVM, when using more feature statistics, in particular, the [speech2] and [speech3] sets, which add normalized MFCCs and a different set of four formant frequencies. Results for the three feature sets are given in Table 5.3.

In this case, the additional speech-based feature statistics do not add much information for distinguishing between easy and difficult impostor speakers.

5.4.2 Detecting Difficult Target Speakers

Now, I present results for an SVM classifier trained to detect difficult target speakers, who tend to cause false rejection errors. Table 5.4 shows the recall, precision, specificity, and F-measure for SVMs trained using the set of [speech1] input feature statistics, both with and without rank normalization, for three SVM kernels, namely linear, order two polynomial, and order three polynomial. In each case, the results presented correspond to an average over ten runs of a round robin approach using a 90% - 10% split of the data.

In this case, the results are fairly reasonable, though detection of difficult target speakers is not as successful as the detection of difficult impostors. The intuition behind why difficult target speakers are not detected as successfully as difficult impostor speakers is as follows.

To cause false alarm errors, impostor speakers must be confusable with other speakers; so, there may be overall characteristics that make a speaker more average or more similar to other speakers within the population. On the other hand, the characteristics that make a target speaker hard to recognize as himself may vary from speaker to speaker, so that it is harder to capture all the ways in which a single conversation side may indicate a tendency to cause false rejections.

Returning to the results of Table 5.4, observe that rank normalization once again helps to improve performance overall. In the case of difficult target speakers, there are small gains from using polynomial kernels instead of linear, with a third order polynomial kernel improving results more than a second order polynomial.

Table 5.1: Recall, precision, specificity, and F-measure values for detecting difficult impostor speakers using SVMs with different kernels (linear, second order polynomial [poly2], and third order polynomial [poly3]), with the [speech1] set of feature statistics as input, with or without rank normalization applied [rank,nonorm].

Threshold   Recall   Precision   Specificity   F-measure
-0.5        0.915    0.770       0.616         0.836
0           0.838    0.851       0.794         0.844
0.5         0.726    0.914       0.904         0.809

Table 5.2: Recall, precision, specificity, and F-measure values for detecting difficult impostor speakers using a linear kernel SVM trained with rank normalized feature statistics, comparing three different decision thresholds for difficult impostor speaker detection.

Feature set   Recall   Precision   Specificity   F-measure
speech1       0.838    0.851       0.794         0.844
speech2       0.831    0.852       0.798         0.841
speech3       0.830    0.859       0.809         0.844

Table 5.3: Recall, precision, specificity, and F-measure values for detecting difficult impostor speakers using a linear kernel SVM trained with rank normalized feature statistics, comparing three sets of speech feature statistics, [speech1], [speech2], and [speech3].

Table 5.4: Recall, precision, specificity, and F-measure values for detecting difficult target speakers using SVMs with different kernels (linear, second order polynomial, and third order polynomial), with the [speech1] set of feature statistics as input, with or without rank normalization applied.
