
«A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering ...»


Finally, the preliminary work regarding the effects of speaker demographics suggests that while sex is a factor in the score distributions, the other differences are not particularly informative with respect to system scores. Besides the ANOVA analysis, I observed other differences in behavior between male and female speakers. In general, male speakers appear to vary more widely from one another, in the sense that a given male target speaker will produce different ranges of scores for different male test speakers. On the other hand, female target speakers may often produce similar scores for different female test speakers. Going forward, my work will continue to consider results over the entire population, as well as for males and females separately.

Chapter 4. Predicting Difficult-to-distinguish Speaker Pairs

As I have shown, automatic speaker recognition system performance depends at least in part on intrinsic speaker characteristics, and speakers may have a tendency to produce false alarm or false rejection errors. Beyond a general per-speaker tendency to produce false alarm errors, automatic speaker recognition systems can be expected to vary across impostor speaker pairs in how successfully they classify those pairs. By comparing the performance for a given speaker pair to performance over all speaker pairs, one can determine which speaker pairs are most (or least) difficult for a given system. Although these difficult-to-distinguish impostor speaker pairs may vary to some degree from system to system, I am most interested in finding the speaker pairs that perform poorly for any speaker recognition system. Thus, rather than relying on a particular speaker recognition system's output to select such speaker pairs, I aim to find the universally difficult-to-distinguish speaker pairs by utilizing a variety of features, such as pitch, formant frequencies, or energy.

There are several motivations for trying to predict the difficult-to-distinguish impostor speaker pairs. First of all, if the speaker pairs most likely to cause errors can be identified, such information may be able to open a line of research into determining some of the issues related to intrinsic factors that remain in speaker recognition. Another possible application of this work would be as a tool for NIST to select more difficult trials for future Speaker Recognition Evaluations, in order to present an even more challenging task. Finally, being able to find the speaker pairs that are difficult for an automatic system to distinguish could prove particularly useful in selecting a focus for a human expert in a speaker recognition task that utilizes both automatic system scores as well as human analysis, or as a method for sub-sampling the most salient speech samples in a speaker recognition task where it is impractical to fully process all the data that exists.

This investigation considers a basic set of features, including fundamental frequency statistics, energy statistics, long-term average spectrum (LTAS) energy statistics, formant frequency statistics, histograms of frequencies obtained from linear predictive (LP) analysis, and spectral slope statistics. These feature choices are motivated by prior work in speaker recognition and other tasks involving characterization of speaker differences. For instance, speaker recognition approaches have used features like pitch and energy distributions or dynamics [1], prosodic statistics including duration and pitch-related features [59], and jitter and shimmer [25]. Formant frequencies and bandwidths, obtained using linear predictive analysis, were used as descriptors for perceptual speaker characterization by Necioğlu et al. [56], while McDougall and Nolan showed that formant frequency dynamics are speaker discriminative [49]. Kuwabara and Sagisaka considered many acoustic parameters as influences upon voice individuality, including pitch frequency, contour and fluctuation, formant frequencies, trajectories and bandwidths, and LTAS [41].

The aforementioned features, along with appropriate distance measures, are used to select speaker pairs that are closer, or more similar, in terms of a given feature-measure pair. The goal is to find feature-measures for which similar speaker pairs correspond to speaker pairs that are difficult for automatic speaker recognition systems to distinguish. As a more complex measure that may better predict speaker recognition system behavior, I also test the approximated Kullback-Leibler (KL) divergence between speaker-adapted Gaussian mixture models (trained on MFCC features).
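The text does not state which approximation of the KL divergence is used. As an illustration only, the sketch below assumes one common choice for GMMs adapted from a shared background model with only the means adapted: the upper bound given by the weighted sum of per-component Gaussian KL divergences, which with shared diagonal covariances reduces to a variance-normalized squared distance between adapted means. The function name, argument layout, and the diagonal-covariance assumption are hypothetical and not taken from the source.

```python
def approx_kl_divergence(weights, means_a, means_b, variances):
    """Upper bound on KL(a || b) for two mean-adapted GMMs.

    Assumes (hypothetically) that both models share the mixture weights and
    diagonal covariances of a common background model, so that only the
    component means differ between speakers a and b.

    weights   : list of mixture weights, one per component
    means_a   : list of mean vectors (one per component) for speaker a
    means_b   : list of mean vectors (one per component) for speaker b
    variances : list of diagonal variance vectors (one per component)
    """
    total = 0.0
    for w, mu_a, mu_b, var in zip(weights, means_a, means_b, variances):
        # KL between two Gaussians with equal covariance:
        # 0.5 * (mu_a - mu_b)^T Sigma^{-1} (mu_a - mu_b)
        component_kl = 0.5 * sum(
            (a - b) ** 2 / v for a, b, v in zip(mu_a, mu_b, var)
        )
        total += w * component_kl
    return total


# Toy two-component, two-dimensional example (made-up numbers).
toy_weights = [0.5, 0.5]
toy_means_a = [[0.0, 0.0], [1.0, 1.0]]
toy_means_b = [[1.0, 0.0], [1.0, 3.0]]
toy_variances = [[1.0, 1.0], [2.0, 2.0]]
toy_kl = approx_kl_divergence(toy_weights, toy_means_a, toy_means_b, toy_variances)
```

Under these assumptions the quantity is symmetric in the two speakers, so it can be used directly as a pairwise dissimilarity, with smaller values indicating more similar speaker models.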

I begin by describing my approach in greater detail in Section 4.1. Results are given in Section 4.2, and Section 4.3 provides a summary and discussion of findings.

4.1 Approach

This approach tests a variety of measures calculated from different features as a criterion for selecting similar (or dissimilar) speaker pairs for speaker recognition. I describe the features considered in Section 4.1.1, and the measures and process of speaker pair selection are discussed in Section 4.1.2. The data used is covered in Section 4.1.3.

4.1.1 Features

The features described below are examined as potentially useful for speaker pair selection. Features are calculated using either MATLAB with the Voicebox toolkit [10], or Praat [7]. The terms given in brackets are the names used to refer to the features.





Note that the feature statistics calculated using Praat are computed over the entire input file, including both speech and non-speech regions. The features calculated with MATLAB compute statistics over only those regions of the input designated as speech by the voice activity detection (VAD) provided by NIST.

1. Pitch statistics (Praat): mean, median, range, and mean average slope of the fundamental frequency [f0 mean, f0 med, f0 range, f0 mas]. The range was set to consider fundamental frequencies between 75Hz and 600Hz, with all other settings corresponding to default Praat parameters.

2. Jitter and shimmer (Praat): jitter relative average perturbation, and shimmer 5-point amplitude perturbation quotient [jitt rap, shim apq5]. Jitter describes the variations in pitch. The relative average perturbation (RAP) computes the absolute difference between a pitch period and the average of that period and its two neighbors, then takes the average of this absolute difference and divides it by the average pitch period. Settings for computing the jitter RAP include a minimum fundamental frequency of 75Hz, a maximum fundamental frequency of 600Hz, a minimum period of 0.0001 s, a maximum period of 0.02 s, and a maximum period factor of 1.3 (which denotes the largest difference between consecutive intervals that will be included in the jitter computation).

Shimmer describes varying loudness (or amplitude) in the voice. The five-point Amplitude Perturbation Quotient (APQ5) calculates the average absolute difference between the amplitude of a period and the average of the amplitudes of that period and its four closest neighbors, and then divides this average absolute difference by the average amplitude.

Parameter settings for computing the shimmer APQ5 include a minimum fundamental frequency of 75Hz, a maximum fundamental frequency of 600Hz, a minimum period of 0.0001 s, a maximum period of 0.02 s, a maximum period factor of 1.3, and a maximum amplitude factor of 1.6 (denoting the largest possible difference in amplitude between consecutive intervals that will be included in the shimmer computation).

3. Formant frequency statistics (Praat): mean and median of the first three formants [f1 mean, f1 med, f2 mean, f2 med, f3 mean, f3 med]. The relevant parameter settings for formant frequency calculation include a window length of 25ms, a step size of 6.25ms, a +3 dB point for an inverted low-pass filter (with a slope of +6 dB/octave) of 50Hz (this is a pre-emphasis filter used to create a flatter spectrum), a maximum number of 4 formants, and a maximum formant frequency of 4000Hz (due to the bandlimited nature of the data used here).

4. Energy statistics (Praat): mean and median energy [en mean, en med]. Default Praat settings were used, including a designation to subtract the overall mean energy.

5. Long term average spectrum energy statistics (Praat): mean, standard deviation, range, slope, and local peak height of LTAS energy [ltas mean, ltas stddev, ltas range, ltas slope, ltas lph]. Praat parameter settings include a filter bandwidth of 100Hz and a frequency range from 0 to 4000Hz. Furthermore, for local peak height calculation, there is a minimum peak height of 2400 and a maximum peak height of 3200.

6. Histograms of frequencies from roots of the LPC polynomial (MATLAB/Voicebox): frequencies obtained from linear predictive coding (LPC) order 8 or order 14 polynomial coefficient roots (both with and without a minimum magnitude requirement of 0.78 and 0.88, respectively¹) contribute to a histogram with a bin size of 5 Hz covering the 5-3995 Hz range [hist8all, hist8minmag, hist14all, hist14minmag]. A frame length of 25ms and step size of 10ms were used for calculating the LPC coefficients.

7. Spectral slope statistics (MATLAB): mode and median of spectral slope, calculated over frequency range 0-4000 Hz [mode specsl, med specsl]. A frame length of 30ms and step size of 10ms were used to calculate per-frame spectral slope values, from which the mode and median values were computed.
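The jitter RAP and shimmer APQ5 definitions in item 2 share one structure: an average absolute deviation from a local mean, normalized by the global mean. The following Python sketch illustrates that arithmetic on already-extracted sequences of pitch periods and peak amplitudes. It is not Praat's implementation: it ignores the period and amplitude limits listed above, and the function names are hypothetical.

```python
def perturbation_quotient(values, window):
    """Mean absolute deviation of each value from the mean of the `window`
    values centered on it, divided by the mean of all values."""
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd number >= 3")
    if len(values) < window:
        raise ValueError("need at least `window` values")
    half = window // 2
    deviations = []
    # Only positions with a full set of neighbors on both sides are used.
    for i in range(half, len(values) - half):
        local_mean = sum(values[i - half:i + half + 1]) / window
        deviations.append(abs(values[i] - local_mean))
    mean_deviation = sum(deviations) / len(deviations)
    overall_mean = sum(values) / len(values)
    return mean_deviation / overall_mean


def jitter_rap(periods):
    """Relative average perturbation: each period vs. itself and 2 neighbors."""
    return perturbation_quotient(periods, window=3)


def shimmer_apq5(amplitudes):
    """Five-point APQ: each amplitude vs. itself and its 4 closest neighbors."""
    return perturbation_quotient(amplitudes, window=5)


# Made-up pitch periods (seconds) and peak amplitudes for illustration.
example_periods = [0.010, 0.011, 0.010, 0.011, 0.010]
example_amplitudes = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0]
example_rap = jitter_rap(example_periods)
example_apq5 = shimmer_apq5(example_amplitudes)
```

A perfectly regular voice (constant periods and amplitudes) yields zero for both quantities, and larger cycle-to-cycle variation yields larger values.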

4.1.2 Measures and speaker pair selection

Features are calculated for each speech sample, and a measure is computed for every unique speaker pair in two different ways. The first method averages the feature values over all conversation sides of each speaker, and then calculates the measure for each speaker pair using these average per-speaker feature values [featavg]. The second method calculates a measure for each possible pairing of conversation sides for a given speaker pair (with one conversation side for each speaker), and then averages these measure values to obtain a single value for each unique speaker pair [measavg].
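The difference between the two methods can be seen in a small Python sketch. It assumes a scalar feature with one value per conversation side and uses absolute difference as the measure; the function names follow the bracketed labels, but the code itself is illustrative rather than the implementation used in this work.

```python
from itertools import product
from statistics import mean


def absdiff(x, y):
    """Absolute difference between two scalar feature values."""
    return abs(x - y)


def featavg(sides_a, sides_b, measure=absdiff):
    """Average each speaker's feature values first, then apply the measure."""
    return measure(mean(sides_a), mean(sides_b))


def measavg(sides_a, sides_b, measure=absdiff):
    """Apply the measure to every cross-speaker pairing of conversation
    sides, then average the resulting measure values."""
    return mean(measure(x, y) for x, y in product(sides_a, sides_b))


# Made-up mean f0 values (Hz) for two conversation sides per speaker.
speaker_a_sides = [100.0, 140.0]
speaker_b_sides = [120.0, 120.0]

featavg_value = featavg(speaker_a_sides, speaker_b_sides)  # 0.0
measavg_value = measavg(speaker_a_sides, speaker_b_sides)  # 20.0
```

In this toy case the two speakers have identical average feature values, so featavg reports no difference, while measavg retains the session-to-session variability of speaker A.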

For scalar features, absolute difference [absdiff] and percent difference [pctdiff] are used as measures, where percent difference for values x and y is defined as

\[ \text{Percent difference} = \frac{|x - y|}{(x + y)}, \tag{4.1} \]

when x and y have the same sign (it is not used for features with both positive and negative values). In addition to the individual formants, sums of formants are used as scalar features (with absolute and percent difference measures), and the Euclidean distance [eucldist] is also calculated for vectors of formant frequencies, e.g. (f1,f2,f3). For the histograms of frequencies from LP analysis, a correlation coefficient [corr] is calculated as a measure of similarity. Table 4.1 summarizes the possible feature-measure combinations, grouped according to feature type.
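As a rough illustration of these measures, the Python sketch below implements percent difference, Euclidean distance, and a correlation coefficient. Two points are assumptions rather than statements from the text: the correlation is taken to be the Pearson coefficient, and the denominator of the percent difference is taken in absolute value so that the result stays non-negative for two negative inputs (for positive values this coincides with Equation 4.1).

```python
import math


def pctdiff(x, y):
    """Percent difference of Equation 4.1, for values of the same sign."""
    if x * y <= 0:
        raise ValueError("pctdiff requires nonzero values of the same sign")
    # abs() in the denominator is an assumption; see the lead-in.
    return abs(x - y) / abs(x + y)


def eucldist(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def corr(h1, h2):
    """Pearson correlation between two equal-length histograms (assumed)."""
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    var1 = sum((a - m1) ** 2 for a in h1)
    var2 = sum((b - m2) ** 2 for b in h2)
    return cov / math.sqrt(var1 * var2)


# Made-up values: two mean f0 values, two (f1, f2, f3) vectors, two histograms.
f0_pctdiff = pctdiff(100.0, 120.0)
formant_dist = eucldist((500.0, 1500.0, 2500.0), (503.0, 1504.0, 2500.0))
hist_corr = corr([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Note the opposite orientations: small values of pctdiff and eucldist indicate similar speakers, whereas for corr it is large values that do.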

Based on the measure for each unique speaker pair, those pairs with the highest and lowest 1% (or 5%) of values are selected to determine if the measure of speaker similarity corresponds to the degree of difficulty for a speaker recognition system. For absolute difference, percent difference, and Euclidean distance, smaller values should indicate more similar speakers, while for correlation coefficients, higher values indicate greater speaker similarity.
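The selection step amounts to ranking all speaker pairs by their measure value and keeping both tails. A minimal Python sketch follows; the function name and data layout are hypothetical, and rounding the number of kept pairs up is an assumption, chosen because it reproduces the counts of 19 and 91 pairs reported for 1815 pairs in Section 4.1.3.

```python
import math


def select_extremes(pair_scores, fraction):
    """Return (lowest, highest): the speaker pairs whose measure values fall
    in the bottom and top `fraction` of all pairs.

    pair_scores : dict mapping a speaker-pair key to its measure value
    fraction    : e.g. 0.01 for 1% or 0.05 for 5%
    """
    # Rounding up is an assumption (see the lead-in).
    k = math.ceil(fraction * len(pair_scores))
    ranked = sorted(pair_scores, key=pair_scores.get)  # ascending by value
    return ranked[:k], ranked[-k:]


# Synthetic scores for 1815 speaker pairs, purely for illustration.
toy_scores = {("spk%d" % i, "spk%d" % (i + 1)): float(i) for i in range(1815)}
lowest_1pct, highest_1pct = select_extremes(toy_scores, 0.01)
lowest_5pct, highest_5pct = select_extremes(toy_scores, 0.05)
```

For the distance-type measures the `lowest` list holds the most similar pairs, while for correlation coefficients the `highest` list does.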

¹ The minimum magnitude values (0.78 and 0.88) used for the LPC root histograms were chosen based on a preliminary inspection of histograms, and were not optimized for selecting speaker pairs.


4.1.3 Speech corpora

The 2008 NIST Speaker Recognition Evaluation (SRE08) includes a condition (short2-short3) which uses roughly 2.5-3 minutes of speech for each of training and testing [53]. This speech is taken either from one side of a conversation between two people over the telephone (possibly recorded on a microphone), or from part of an interview recorded on a microphone (some interviewer speech may be present). Additional interview data was released for a followup evaluation experiment designed to further explore the new interview style of data collection.

Corpus for feature-measure calculation

Speech data from the followup evaluation is used to calculate features for the speakers. In particular, speech recorded on microphone 2 (a lavalier microphone placed on the subject) is used since it has good sound quality. These speaker features are then used in conjunction with a similarity measure in order to predict difficult- and easy-to-distinguish speaker pairs.

The majority of speakers have four conversation sides used for the measure calculation (a small minority have three or five conversation sides).

Corpus for evaluation of selected speaker pairs

The data used to evaluate speaker-pair selection is different in several respects from the data used to perform the selection. Specifically, the selection data were collected in an interview, while the evaluation data were collected in either an interview or a telephone conversation. Also, the selection data were collected using a lavalier microphone, whereas the evaluation data were collected using a variety of microphones, including a telephone handset. Furthermore, though the speakers contained in each set are the same, the selection data does not overlap with evaluation data.

Speaker recognition system submissions from the SRE08 short2-short3 condition are used to compute performance on trials for the selected 1% (or 5%) of most and least similar speaker pairs. Of the 34 sites that shared their system submissions for the short2-short3 condition, 33 are used in the results. The total number of trials for short2-short3 (after removing trials for speakers not found in the selection data) is 55013, with 1815 unique impostor speaker pairs. When keeping 1% (or 19) of the speaker pairs, there are around 4000 trials on average, while 5% (or 91) of the speaker pairs corresponds to an average of roughly 11000 trials. When filtering trials for selected speaker pairs, I removed target trials of speakers not included in any of the selected speaker pairs.
