
«A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering ...»


the same from time to time. Similarly, there may be speakers for which an automatic speaker recognition system makes more decision errors. There are many sources of variation within and across speakers that may contribute to causing such errors, including basic physical attributes, language, accent, characteristics of speaking style, and changes in emotional state or health.

This thesis is inspired by the analysis of Doddington et al. [22], in which the authors characterized speakers in terms of their error tendencies. The default, well-behaved speakers are “sheep.” Speakers who cause a proportionately high number of false rejection errors as the target speaker are called “goats.” Those speakers who tend to cause false acceptance errors as the target speaker are “lambs,” and those who tend to cause false acceptance errors as the impostor speaker are labeled “wolves.” The existence of such speaker types was demonstrated through statistical tests using the outputs of an automatic speaker recognition system. Further analysis of additional data sets and different types of speaker recognition systems can provide more insight into the dependence that system performance has on the speakers.
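As a rough illustration of these error tendencies, the following Python sketch tallies errors per speaker from a list of scored trials. It uses plain counts rather than the statistical tests of [22], and the trial tuples, speaker labels, scores, and threshold are all hypothetical values chosen for the example.

```python
from collections import Counter


def tally_error_tendencies(trials, threshold):
    """Count errors per speaker for trials given as (target, test, score).

    A trial is accepted when its score is above the threshold (assumption:
    a score equal to the threshold is rejected).
    """
    false_rejects_as_target = Counter()    # goat-like tendency
    false_accepts_as_target = Counter()    # lamb-like tendency
    false_accepts_as_impostor = Counter()  # wolf-like tendency
    for target, test, score in trials:
        accepted = score > threshold
        if target == test and not accepted:
            false_rejects_as_target[target] += 1
        elif target != test and accepted:
            false_accepts_as_target[target] += 1
            false_accepts_as_impostor[test] += 1
    return false_rejects_as_target, false_accepts_as_target, false_accepts_as_impostor


# Hypothetical trials: (target speaker, test speaker, system score).
trials = [
    ("A", "A", 0.2),   # true-speaker trial, rejected
    ("A", "A", 1.5),   # true-speaker trial, accepted
    ("B", "C", 1.1),   # impostor trial, accepted
    ("B", "D", 0.9),   # impostor trial, accepted
    ("C", "D", -0.4),  # impostor trial, rejected
    ("A", "D", 0.8),   # impostor trial, accepted
]
goat_like, lamb_like, wolf_like = tally_error_tendencies(trials, threshold=0.5)
```

Speakers with high counts in one of the three tallies would be candidates for the corresponding label; deciding whether a count is "proportionately high" still requires a statistical test over a realistic number of trials.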

Given that automatic speaker recognition system performance does depend on speaker characteristics, knowing which speakers are likely to cause errors is information that could prove useful for improving decision accuracy. Yet, limited work has been done to find these difficult speakers without the benefit of having a system’s output.

Furthermore, a number of real-world applications that rely on automatic speaker recognition technology could benefit from being able to find the most similar speakers or the trials for which a decision is most difficult. Inherent to certain tasks are populations of in-set and out-of-set speakers. That is, there may be a set of known speakers (i.e., in-set speakers), with associated speech samples, that needs to be distinguished from other, unknown speakers (i.e., out-of-set speakers). One example of this type of real-world application is fraud detection, where a company is trying to prevent fraud in the use of a call center or other phone-based system. Given a database of speaker models trained using speech samples from people known to have committed fraud, an automatic system may compare new speech data from incoming calls to the database of fraudster speaker models in order to detect possible fraudulent attempts, which must then be verified by a human listener. However, a human expert would be unable to listen to all calls if there are a large number of potential matches between new speech data and the fraudster models. A method for selecting the most error-prone speakers could thus prove very useful for focusing the efforts of a human listener in a smart way.

–  –  –

conversational speech, and one that is a more recent collection of both conversational speech and interview-style speech, recorded on a variety of channels, including landline and cellular telephone, as well as various types of microphones. Furthermore, in addition to considering a traditional speaker recognition system approach, for the second data set I utilize the outputs of a more contemporary approach that is better able to handle variations in channel.

The second component of this thesis investigates a straightforward approach to predicting speakers that will be difficult for a system to recognize correctly. I use a variety of features to calculate feature statistics that are then used to compute a measure of similarity between speaker pairs. By ranking these similarity measures for a set of impostor speaker pairs, I determine which speaker pairs are easy for a system to distinguish and which are difficult to distinguish. I then develop an approach for combining a set of feature statistics in order to produce a comprehensive measure of how likely it is that a speaker will cause errors.

In particular, I use support vector machine (SVM) classifiers trained to distinguish between difficult and easy examples, in order to detect difficult impostor and target speakers.

I begin by covering relevant background material in Chapter 2, including typical features and systems for automatic speaker recognition, intrinsic speaker characteristics, and related error analyses of speaker recognition systems. Next, I explore the speaker-dependent performance of systems in Chapter 3. In Chapter 4, I introduce a simple approach to finding difficult-to-distinguish speaker pairs. I then describe a technique for detecting difficult target or impostor speakers in Chapter 5. Finally, I summarize and conclude my work in Chapter 6.

Chapter 2 Background

There are several broad areas of prior work relevant to this dissertation. I begin in Section 2.1 by setting up the speaker recognition problem, while in Sections 2.2, 2.3, 2.4, and 2.5 I provide details about features, system approaches, relevant speech corpora, and measures of system performance, respectively. There are a number of intrinsic speaker qualities, which account for intra-speaker variability, as well as differences between speakers, that I describe in Section 2.6. The most directly related work involves error analysis pertaining to speaker recognition systems, which I discuss in Section 2.7.

2.1 The Speaker Recognition Problem

As its name implies, automatic speaker recognition attempts to recognize, or identify, a given speaker by processing his/her speech automatically, that is to say, in a fully objective and reproducible manner, without the aid of human listening or analysis. In order to be able to recognize the speaker of a given test utterance, it is necessary to have training data first, so that the system can “learn” the speaker of interest. The term speaker recognition can be used to refer to a variety of tasks. One type of task is speaker identification, where the system must produce the identity of the speaker, given a test utterance, from a set of speakers. With closed-set speaker identification, the number of speakers in the set is fixed, and the system must choose which among the given speakers is a match to the speaker of the test utterance. Open-set speaker identification adds a layer of complexity by allowing the test utterance to belong to a speaker not in the set of speakers for whom there is training data available. A second type of task is speaker verification, in which a target speaker is hypothesized to match the test speaker, and the system must determine whether or not the test speaker identity is as claimed.
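The three task types can be contrasted with a small Python sketch. It assumes the system has already produced one score per enrolled speaker for a test utterance; the speaker names, scores, and threshold are hypothetical, and rejecting an open-set utterance by thresholding the best score is one simple choice made for illustration, not a method taken from this work.

```python
def closed_set_identify(scores):
    """Return the enrolled speaker with the highest score."""
    return max(scores, key=scores.get)


def open_set_identify(scores, threshold):
    """Return the best-scoring enrolled speaker, or None if even the best
    score is not above the threshold (utterance treated as out-of-set)."""
    best = closed_set_identify(scores)
    return best if scores[best] > threshold else None


def verify(scores, claimed_speaker, threshold):
    """Accept or reject the claimed identity based on that speaker's score."""
    return scores[claimed_speaker] > threshold


# Hypothetical scores of one test utterance against three enrolled speakers.
scores = {"spk1": -0.3, "spk2": 1.2, "spk3": 0.4}

identified = closed_set_identify(scores)               # always picks someone
maybe_identified = open_set_identify(scores, 2.0)      # may pick no one
claim_accepted = verify(scores, "spk3", threshold=0.5)
```

Closed-set identification always returns an enrolled speaker, whereas the open-set and verification variants both depend on a threshold and can therefore produce the false acceptance and false rejection errors discussed below.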

Regardless of the type of task, the problem may be further characterized as being text-dependent or text-independent. In the text-dependent case, the train and test utterances are required to be a specific word or set of words; the system can then exploit the knowledge of what is spoken in order to make a better decision. For the text-independent case, there is no constraint on what is said in the speech utterances, allowing for generalization to a wider variety of situations.

The dissertation work focuses on the text-independent speaker verification task. For each target (or hypothesis) speaker and test utterance pair, the system must decide whether or not the speaker identities are the same. In this case, two types of errors arise: false acceptance (or false alarm) and false rejection (or missed detection). A false accept occurs when the system incorrectly verifies an impostor test speaker as the target speaker. A false reject occurs when the system fails to verify a true test speaker as the target speaker. A trial refers to a target speaker and test utterance pair. In general, the training data of a target speaker may include one or more samples of speech, of varying lengths, and the test data may also include varying lengths of speech samples. For my purposes, the train and test utterances will both be a single conversation side, which is typically 2.5-3 minutes of speech. Therefore, a trial will correspond to a pair of train and test conversation sides. For each trial, the corresponding score simply refers to the output of a speaker recognition system given that train and test data. The score may or may not correspond to a likelihood. Furthermore, in order to make a decision for a trial given its score, there must be a decision threshold; then, the system will decide that it’s a true speaker trial if the score is above the decision threshold, or decide that it’s an impostor trial if the score is below the decision threshold.
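The decision rule and the two error types can be made concrete with a short Python sketch. The scores, labels, and threshold below are made up for illustration, and treating a score exactly equal to the threshold as a rejection is an assumption, since the text only specifies the behavior above and below the threshold.

```python
def accept(score, threshold):
    """Decide 'true speaker trial' when the score is above the threshold."""
    return score > threshold


def error_rates(trials, threshold):
    """Compute (false acceptance rate, false rejection rate).

    trials is a list of (score, is_target) pairs, where is_target is True
    for true-speaker trials and False for impostor trials.
    """
    n_target = sum(1 for _, is_target in trials if is_target)
    n_impostor = len(trials) - n_target
    false_accepts = sum(
        1 for score, is_target in trials
        if not is_target and accept(score, threshold)
    )
    false_rejects = sum(
        1 for score, is_target in trials
        if is_target and not accept(score, threshold)
    )
    far = false_accepts / n_impostor if n_impostor else 0.0
    frr = false_rejects / n_target if n_target else 0.0
    return far, frr


# Hypothetical trials: (system score, whether it is a true-speaker trial).
trials = [
    (2.1, True), (0.4, True), (1.7, True),
    (-1.3, False), (0.9, False), (-0.2, False),
]
far, frr = error_rates(trials, threshold=0.5)
```

Raising the threshold in this sketch trades false acceptances for false rejections, which is why a threshold must be fixed before error counts for individual speakers can be compared.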

In general, speaker recognition errors may be caused by both extrinsic factors, such as channel effects or noise, and intrinsic factors, such as age, sex, speaking style, or other inherent speaker attributes. My focus is on the effects of intrinsic speaker characteristics.

In order to perform a speaker recognition task, a system must first parameterize the speech in a meaningful way that will allow the system to distinguish and characterize speakers and their speech; this step is addressed next in Section 2.2, which discusses some relevant features commonly used in speech processing applications. A number of typical system approaches and methods are then discussed in Section 2.3, while I describe commonly utilized speech corpora and performance measures in Sections 2.4 and 2.5. In Section 2.6, I will describe a variety of intrinsic factors that contribute to variations both within an individual speaker and across different speakers, and consider the potential impacts of such speaker characteristics, before concluding with an overview of relevant error analyses of speaker recognition systems in Section 2.7.

2.2 Speech Features

The process of parameterizing a raw input, for example, speech, is referred to as feature extraction. For speech processing, low-level features are those based directly on frames of the speech signal, where frames correspond to a moving window, typically 25 ms long, with a step size of typically 10 ms. A length of 25 ms and a step size of 10 ms correspond to an overlap of 15 ms between consecutive speech frames. High-level features, on the other hand, usually incorporate information from more than just one frame of speech, and include, for example, speaker idiosyncrasies, prosodic patterns, pronunciation patterns, and word usage. The low-level acoustic features most often used in speaker recognition tasks are Mel-frequency cepstral coefficients, or MFCCs, which are described in Section 2.2.1. Section 2.2.2 provides a brief introduction to other acoustic and prosodic features, such as formant frequencies. Finally, Section 2.2.3 introduces various types of speech segments, which may be used to calculate different types of features.
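The framing arithmetic above can be sketched in a few lines of Python. The 8 kHz sampling rate and the all-zero placeholder signal are assumptions made for the example, and trailing samples that do not fill a whole frame are simply dropped here, which is one of several possible conventions.

```python
def frame_signal(samples, sample_rate, frame_ms=25, step_ms=10):
    """Split a signal into overlapping frames of frame_ms, advancing by step_ms."""
    frame_len = int(round(sample_rate * frame_ms / 1000))
    step = int(round(sample_rate * step_ms / 1000))
    frames = []
    start = 0
    # Keep only complete frames; a partial final frame is discarded.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += step
    return frames


sample_rate = 8000                # assumed telephone-band sampling rate
signal = [0.0] * sample_rate      # one second of placeholder samples
frames = frame_signal(signal, sample_rate)

frame_len = len(frames[0])                        # samples per 25 ms frame
overlap_ms = 25 - 10                              # overlap between frames
overlap_samples = frame_len - sample_rate * 10 // 1000
```

At this sampling rate each frame holds 200 samples and consecutive frames share 120 of them, matching the 15 ms overlap stated above.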

2.2.1 Cepstral Features

MFCCs are generated by the process shown in Figure 2.1. First, an optional pre-emphasis filter is applied, to enhance the higher spectral frequencies and compensate for the unequal perception of loudness at different frequencies. Next, the speech signal is windowed as described above and the squared magnitude of the fast Fourier transform (FFT) is calculated for each frame. A Mel-frequency triangular filter bank is then applied, where Mel refers to an auditory scale based on pitch perception. There are different versions of the transformation from linear frequency scale to Mel frequency. One example, taken from [57], is given by

–  –  –

where S_k are the log-spectral vectors from the previous step, K is the total number of log-spectral coefficients, and L is the number of coefficients to be kept (this is called the order of the MFCCs), with L ≤ K.
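As a minimal sketch of two of these steps, the Python code below converts linear frequency to Mel frequency and applies a discrete cosine transform to K log-spectral values, keeping L coefficients. Both formulas are widely used textbook forms and are assumptions of this example: the Mel mapping shown (2595 log10(1 + f/700)) is only one of the versions in circulation and may differ from the one taken from [57], and the cosine transform is written in a common form that uses the symbols S_k, K, and L as defined above.

```python
import math


def hz_to_mel(f_hz):
    """One common Hz-to-Mel mapping (assumed; other versions exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def cepstral_coefficients(log_spectrum, num_coeffs):
    """Cosine transform of K log-spectral values S_1..S_K, keeping L values.

    Assumed form: c_n = sum_{k=1}^{K} S_k * cos(n * (k - 1/2) * pi / K),
    for n = 1..L, with L <= K.
    """
    K = len(log_spectrum)
    if num_coeffs > K:
        raise ValueError("number of kept coefficients L must satisfy L <= K")
    return [
        sum(
            s_k * math.cos(n * (k - 0.5) * math.pi / K)
            for k, s_k in enumerate(log_spectrum, start=1)
        )
        for n in range(1, num_coeffs + 1)
    ]


# Hypothetical log filter-bank outputs for one frame (K = 8), keeping L = 4.
log_spectrum = [1.2, 1.5, 1.1, 0.7, 0.4, 0.3, 0.2, 0.1]
mfcc = cepstral_coefficients(log_spectrum, num_coeffs=4)
```

With this form of the transform, a perfectly flat log spectrum yields coefficients that are all zero, so the retained values describe the shape of the spectral envelope rather than its overall level.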

–  –  –

Furthermore, an energy term and/or its derivative can also be included in the feature parameterization.

Other commonly used cepstral features include linear-frequency cepstral coefficients (or LFCCs), which use a linear rather than Mel-based frequency bank, as well as features based on linear prediction, such as linear predictive coding coefficients (LPCCs) and perceptual linear prediction features (PLPs).

2.2.2 Other Acoustic and Prosodic Features

Formant frequencies correspond to resonances of the vocal tract and can often be measured in spectrograms by amplitude peaks in the frequency spectrum. Vowels in particular can be largely characterized by the first and second formants, though any voiced speech segment will produce formants.

The fundamental frequency, or f0, is an acoustic property corresponding to the lowest harmonic in the frequency spectrum. Pitch and fundamental frequency are often used interchangeably as terms, though pitch is an auditory property that is perceived by human listeners, who place sounds on a pitch scale ranging from low to high. The intonation of speech is the pitch pattern. Jitter is a term to describe varying pitch in the voice. A related feature is shimmer, which describes varying loudness in the voice.
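Jitter and shimmer can be quantified in several ways. The Python sketch below uses one common formulation, assumed here for illustration: the mean absolute difference between consecutive values, divided by the mean value, applied to pitch periods for jitter and to peak amplitudes for shimmer. The period and amplitude sequences are hypothetical.

```python
def relative_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    mean_abs_diff = sum(
        abs(b - a) for a, b in zip(values, values[1:])
    ) / (len(values) - 1)
    return mean_abs_diff / (sum(values) / len(values))


# Hypothetical consecutive pitch periods (ms) and peak amplitudes of a vowel.
periods_ms = [8.0, 8.2, 7.9, 8.1]
amplitudes = [0.50, 0.48, 0.52, 0.50]

jitter = relative_perturbation(periods_ms)   # cycle-to-cycle pitch variation
shimmer = relative_perturbation(amplitudes)  # cycle-to-cycle loudness variation
```

Both measures are zero for a perfectly periodic signal with constant amplitude and grow as the cycle-to-cycle variation increases.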

Other commonly used prosodic features include energy distributions and dynamics, and duration and timing information, such as speech rate or average duration of various speech segments. Prosody will be revisited in more detail in Section 2.6.1.

2.2.3 Speech Segments

One concept that arises when considering higher-level features is that of speech segments. The basic linguistic unit of speech is that of a phone, which corresponds to a vowel or consonant speech sound that may be described in terms of articulatory movements and acoustic properties. Phonemes are sounds that are used to differentiate words [42]. For instance, in the words got and not, /g/ and /n/ are two different phonemes that lead to different meanings. Phonemes may be pronounced in different ways, leading to different phones that are all instances of the same phoneme; although there are differences in pronunciation of these phones, their meaning does not change. In the remainder of this thesis, the term phone is used to refer to phoneme.

Going beyond the phone, segments may be defined as groups of phones or syllables, as well as words and sentences. All of these types of segments may be used as the basis for calculating various types of features.
