
2.3 System Approaches and Methodologies

A number of statistical and discriminative-training-based methods have been explored for the speaker recognition task. Two of the most successful modeling approaches are the Gaussian mixture model (GMM) and the support vector machine (SVM), both discussed below. Other techniques have utilized hidden Markov models (HMMs), artificial neural networks such as multi-layer perceptrons (MLPs), and vector quantization (VQ).

2.3.1 Gaussian Mixture Model

A GMM models the distribution of a speaker's feature vectors as a weighted sum of Gaussian densities:

p(x|λ) = Σ_{i=1}^{N} w_i N(x; μ_i, Σ_i)

for a mixture of N Gaussians, where the mixture weights w_i sum to one and each N(x; μ_i, Σ_i) is a Gaussian density with mean vector μ_i and covariance matrix Σ_i.

In a speaker recognition setting, there are several GMM approaches that can be taken. Here, only the currently prevalent approach, referred to as UBM-GMM, is described. Two GMM models are needed: one for the target speaker and one for the background model [64]. Using training data from a large number of speakers, a speaker-independent universal background model, or UBM, is generated. The UBM training data is a type of system-level training data, chosen to be completely disjoint from the training data used to train target models for a given set of trials. So that every target speaker model lies in the same space and can be compared to the others, each speaker-dependent model is adapted from the UBM using maximum a posteriori (MAP) adaptation on the corresponding target speaker training data. For a given test utterance X and a given target speaker, a log likelihood ratio (LLR) can then be calculated:

LLR(X) = log p(X|λ_target) − log p(X|λ_UBM)    (2.7)

Comparing the LLR to a threshold, Θ, determines the decision made about the test speaker's identity: if LLR(X) > Θ, the test speaker is identified as a true speaker match; otherwise, the test speaker is determined to be an impostor. The LLR is the score for the UBM-GMM system.
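As a concrete illustration, the following is a minimal sketch of the UBM-GMM pipeline using scikit-learn's GaussianMixture. The data arrays, mixture size, relevance factor r = 16, and zero threshold are all illustrative assumptions, and only the component means are MAP-adapted (a common simplification):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data: rows are short-term cepstral feature vectors.
rng = np.random.default_rng(0)
ubm_feats = rng.normal(size=(5000, 20))    # pooled background speakers
target_feats = rng.normal(size=(300, 20))  # target speaker training data
test_feats = rng.normal(size=(200, 20))    # one test utterance X

# 1. Train the speaker-independent UBM on the pooled background data.
ubm = GaussianMixture(n_components=64, covariance_type='diag',
                      max_iter=50, random_state=0).fit(ubm_feats)

# 2. MAP-adapt the UBM means to the target speaker (means only).
r = 16.0                                    # relevance factor (assumed)
post = ubm.predict_proba(target_feats)      # responsibilities (frames, N)
n_i = post.sum(axis=0)                      # soft counts per mixture
f_i = post.T @ target_feats                 # first-order statistics
alpha = (n_i / (n_i + r))[:, None]
target = GaussianMixture(n_components=64, covariance_type='diag')
target.weights_ = ubm.weights_              # reuse UBM weights/covariances
target.covariances_ = ubm.covariances_
target.precisions_cholesky_ = ubm.precisions_cholesky_
target.means_ = (alpha * f_i / np.maximum(n_i[:, None], 1e-8)
                 + (1.0 - alpha) * ubm.means_)

# 3. Score with the per-frame average of the LLR of Eq. 2.7.
llr = target.score(test_feats) - ubm.score(test_feats)
theta = 0.0                                 # stand-in for the threshold Θ
print('true speaker' if llr > theta else 'impostor')
```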

2.3.2 Support Vector Machine

The SVM is a two-class discriminative classifier: given labelled training examples, it finds a separating hyperplane with maximum margin, possibly after mapping the inputs into a higher-dimensional space via a kernel function.

The SVM is used in speaker recognition by taking one or more positive examples of the target speaker, as well as a set of negative examples of impostor speakers, and producing a hyperplane decision boundary. Since there are far more impostor speaker examples than target speaker examples, a weighting factor is typically used to make the target example(s) count as much as all of the impostor examples. Once the hyperplane for a given target speaker is known, the test speaker can be classified as belonging to either the target speaker or impostor speaker class. Instead of a log likelihood ratio, a score can be produced by using the distance of the test data from the hyperplane boundary.
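A minimal sketch of this setup, using scikit-learn's SVC with a linear kernel; the feature vectors and their dimensionality are placeholders, and class_weight='balanced' plays the role of the weighting factor described above:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
target_X = rng.normal(loc=0.5, size=(5, 40))       # few target examples
impostor_X = rng.normal(loc=-0.5, size=(500, 40))  # many impostor examples

X = np.vstack([target_X, impostor_X])
y = np.array([1] * len(target_X) + [0] * len(impostor_X))

# 'balanced' reweights classes inversely to their frequencies, so the few
# target examples count as much as all of the impostor examples together.
svm = SVC(kernel='linear', class_weight='balanced').fit(X, y)

# The score is the signed decision value (proportional to the distance of
# the test data from the hyperplane), not a log likelihood ratio.
test_x = rng.normal(size=(1, 40))
score = svm.decision_function(test_x)[0]
print('target' if score > 0 else 'impostor', score)
```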

2.3.3 A Brief Historical Overview of Types of Systems

Automatic speaker recognition systems can be categorized by the type of features they use and by the type of statistical modeling tool they use. Features range from low-level and short-term (based directly on the acoustics of the speech) to higher-level features incorporating longer spans of time, including prosodic, lexical, or semantic information. MFCCs are an example of low-level, short-term features, while phone n-gram counts are higher-level, longer-term features. The overview of systems provided here, while not exhaustive, covers a variety of feature types and statistical learning methods, and is intended to give an idea of the range of approaches that have proven successful. In some cases, although a system alone may not perform very well compared to other systems, it may still be successful by contributing to a system fusion.

One conventional approach that has already been described in Section 2.3 is the cepstral GMM system [64, 61]. The cepstral SVM system utilizes a generalized linear discriminant sequence kernel to train an SVM classifier on a sequence of input cepstral features [12].

Some methods attempt to combine the advantages of the generative modeling of GMMs with the discriminative power of SVMs. One such approach is an SVM classifier that uses GMM supervectors as features [14]. The supervectors are the concatenated mean vectors from a GMM that has been MAP-adapted from a UBM to a speaker’s data, with the idea that this mapping from an utterance into a high-dimensional supervector space is similar to an SVM sequence kernel.
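A sketch of the supervector construction, reusing the MAP-adaptation step (and the ubm object and relevance factor) from the UBM-GMM sketch above:

```python
import numpy as np

def gmm_supervector(ubm, feats, r=16.0):
    """Concatenate MAP-adapted mixture means into one supervector."""
    post = ubm.predict_proba(feats)             # responsibilities
    n_i = post.sum(axis=0)
    f_i = post.T @ feats
    alpha = (n_i / (n_i + r))[:, None]
    means = (alpha * f_i / np.maximum(n_i[:, None], 1e-8)
             + (1.0 - alpha) * ubm.means_)
    return means.ravel()                        # (n_components * n_features,)

# Each utterance, whatever its length, maps to one fixed-length vector,
# which can then be fed to an SVM exactly as in the previous sketch.
```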

Another successful approach is the MLLR-SVM system, which uses maximum-likelihood linear regression (MLLR) transforms from a speech recognition system as features for speaker recognition [69, 68]. In the context of a speech recognition system, MLLR applies an affine transform to the Gaussian mean vectors in order to map speaker-independent means to speaker-dependent means. The coefficients from one or more of these MLLR adaptation transforms are used in an SVM speaker recognition system, with very good results.
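The feature extraction itself is simple once the transforms are available. In the sketch below, the transform [A|b] is a hypothetical stand-in for one estimated by an ASR system, and the dimensionality is likewise assumed:

```python
import numpy as np

# Hypothetical MLLR transform from an ASR system: it maps
# speaker-independent Gaussian means to speaker-adapted ones,
# mu_adapted = A @ mu + b.
d = 39                                 # assumed feature dimensionality
rng = np.random.default_rng(2)
A = np.eye(d) + 0.01 * rng.normal(size=(d, d))
b = rng.normal(scale=0.01, size=d)

# The SVM feature vector is simply the flattened transform coefficients;
# several transforms would be concatenated the same way.
mllr_features = np.concatenate([A.ravel(), b])  # length d*d + d
```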

One type of non-acoustic feature is the word n-gram, where n-gram can encompass unigrams, bigrams, and so forth. The motivation for using such a feature for speaker recognition is that there are idiolectal differences among speakers, i.e., speakers vary in their word usage. Speaker-dependent unigram and bigram language models were first used in a target to background likelihood ratio framework, with promising results [21].
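A toy version of that likelihood-ratio framework with unigram models; the transcripts, vocabulary handling, and add-ε smoothing are all simplifications chosen here for illustration:

```python
import math
from collections import Counter

def unigram_logprob(words, counts, vocab_size, eps=1.0):
    """Log probability of a word sequence under an add-eps unigram model."""
    total = sum(counts.values())
    return sum(math.log((counts[w] + eps) / (total + eps * vocab_size))
               for w in words)

# Hypothetical transcripts (in practice, ASR output).
target_words = "you know i mean you know well".split()
background_words = ("the of and to a in that it is was " * 50).split()
test_words = "you know i think you know".split()

target_lm = Counter(target_words)
background_lm = Counter(background_words)
vocab = len(set(target_words) | set(background_words) | set(test_words))

# Target-to-background likelihood ratio over the test transcript.
llr = (unigram_logprob(test_words, target_lm, vocab)
       - unigram_logprob(test_words, background_lm, vocab))
```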

There are also phone-based approaches. Similar to the word n-gram modeling, the phone n-gram system first used frequency counts of phone n-grams in a likelihood ratio framework, where the phones are found using a phone recognizer, or possibly phone recognizers for multiple languages [2]. The use of phonetic information was extended in a number of techniques, including the use of binary trees [55], cross-stream modeling [30], and SVMs [13, 29]. Another example is a pronunciation modeling approach, in which word-level automatic speech recognition (ASR) phone streams are compared with open-loop phone streams [39].





Additional methods seek to take advantage of the speaker information present in words by using word-conditioning. A keyword HMM system trains background HMMs for a number of keywords and adapts them to each speaker; a likelihood ratio between the background and speaker models for each word is then calculated for a given test utterance, and the likelihood ratios are combined to produce a final system score [6]. The word-conditioned phone n-gram system considers phone n-grams only for a specific set of keywords [43].

A number of approaches have used prosodic features, including pitch and energy distributions or dynamics [1], and prosodic statistics such as duration- and pitch-related features [59]. Nonuniform Extraction Region Features (NERFs) consider a number of features, including maximum or mean pitch, duration patterns, and energy contours, over various regions of speech delimited by some sort of event, such as short pauses, long pauses, or schwas [35].

2.3.4 Channel Compensation Techniques

One obvious component of a speech signal that is unrelated to the speech (or speaker) itself is the channel on which the speech is recorded. Although most speech corpora have been collected over the telephone, there are different types of handsets, including cellular, and there has also been a recent collection of data using different types of microphones. The biggest effect of having different types of channels present in the data occurs when there is a channel mismatch between the training and test data. That is, if a system's target speaker model is trained using data from an electret telephone handset, for instance, but the test speech was collected from a carbon-button telephone handset, the test speech will "sound" different to the system, even if the speaker is the same for both. In speaker recognition systems, the effects of channel variation are typically addressed using normalizations at the feature level, the model level, or the score level. Because the various approaches operate in different domains and in different ways, they often improve performance when applied on top of one another.

Historically, channel effects have been the dominant cause of errors in automatic speaker recognition tasks. In early speaker recognition work, a mismatch between the telephone handset types of the training and test data caused error rates over four times as great as in the matched-handset case [62]. In the most recent 2010 NIST Speaker Recognition Evaluation, the effects of channel mismatch still exist, but to a far lesser extent, with very low overall error rates for the best systems despite increased amounts of channel variability.

Feature-level Normalizations

Cepstral mean subtraction (CMS) is a fairly simple technique that is applied at the feature level [3].

CMS subtracts the time average from the output cepstrum in order to produce a zero-mean log cepstrum. That is, for the temporal sequence of each cepstral coefficient c_m over T frames,

ĉ_m(t) = c_m(t) − (1/T) Σ_{τ=1}^{T} c_m(τ)    (2.9)

The purpose of CMS is to remove the effects of the transmission channel, yielding improved robustness. However, any non-linear channel effects will remain, as will any time-varying linear channel effects. Furthermore, CMS can remove some of the speaker characteristics, as the average cepstrum does contain speaker-specific information.
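In code, CMS is a one-liner over a matrix of frames; the array layout is an assumption:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the time average of each cepstral coefficient (Eq. 2.9).

    cepstra: array of shape (T, n_coeffs), one row per frame. A fixed
    linear channel is additive in the log-cepstral domain, so removing
    the mean removes it; non-linear or time-varying effects remain.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```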

Another feature-level channel compensation method is feature mapping [63]. Feature mapping aims to map features from different channels into the same channel-independent feature space. A channel-independent root GMM is trained, and channel-dependent background GMMs are adapted from the root. Feature-mapping functions are obtained from the model parameter changes between the channel-independent and channel-dependent models. The most likely channel is detected for the speaker data, which is then mapped to the channel-independent space. Target speaker models are adapted using the mapped features, and during verification the mapped features of the test utterance are used for scoring. The root GMM is used as the UBM for calculating the log likelihood ratios.
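A compact sketch of the mapping step, assuming diagonal-covariance scikit-learn GaussianMixture models whose components remain aligned with the root model (adaptation preserves component correspondence); the per-frame shift-and-scale rule below is one common form of the feature-mapping function:

```python
import numpy as np

def map_features(feats, root_gmm, channel_gmms):
    """Map features from their detected channel into the root space.

    channel_gmms: dict of channel name -> GMM adapted from root_gmm,
    with mixture components aligned to the root's components.
    """
    # 1. Detect the most likely channel for this data.
    chan = max(channel_gmms, key=lambda c: channel_gmms[c].score(feats))
    cd = channel_gmms[chan]
    # 2. For each frame, take its top Gaussian in the channel model and
    #    shift/scale the frame by that component's parameter change.
    top = cd.predict(feats)                  # component index per frame
    scale = np.sqrt(root_gmm.covariances_[top] / cd.covariances_[top])
    return root_gmm.means_[top] + (feats - cd.means_[top]) * scale
```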

Within-class covariance normalization (WCCN) is a feature normalization technique for SVM systems [28]. In this method, a generalized linear kernel is trained, using class label information (i.e., a target or impostor speaker), in order to find orthonormal directions in the feature space that maximize information relevant to the task. The weights of those directions are optimized to minimize an upper bound on the error rate.
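A sketch of the standard closed-form realization of WCCN, in which the kernel is induced by the inverse of the expected within-class covariance; this assumes enough examples per speaker for W to be well conditioned:

```python
import numpy as np

def wccn_projection(X, labels):
    """Return B such that the kernel k(x, y) = (B.T @ x) . (B.T @ y).

    X: (n, d) expansion-space vectors; labels: speaker of each row.
    W is the within-class covariance averaged over speakers, and B is
    a Cholesky factor of its inverse, so that B @ B.T = inv(W).
    """
    d = X.shape[1]
    W = np.zeros((d, d))
    classes = np.unique(labels)
    for c in classes:
        Xc = X[labels == c]
        W += np.cov(Xc, rowvar=False, bias=True)
    W /= len(classes)
    return np.linalg.cholesky(np.linalg.inv(W))

# usage: train the SVM on the transformed data X @ wccn_projection(X, labels)
```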

Model-level Normalizations

Speaker model synthesis (SMS) is a GMM model-based technique that utilizes channel-dependent models [70]. Rather than having one speaker-independent UBM, the SMS approach begins with a channel- and gender-independent root model, and then uses Bayesian adaptation to obtain channel- and gender-dependent background models. Channel-specific target speaker models are also adapted from the appropriate background model, after the gender and channel of the target speaker's training data have been detected. Furthermore, a transformation for each pair of channels is calculated using the channel-dependent background models; this transformation maps the weights, means, and variances of a channel-a model to the corresponding parameters of a channel-b model. During testing, if the detected channel of the test utterance matches the channel of the target speaker model, then that speaker model and the appropriate channel-dependent background model are used to calculate the LLR for that test utterance. If, on the other hand, the detected channel of the test utterance does not match that of the target speaker model, a new speaker model is synthesized using the previously calculated transformation between the target and test channels. The synthesized model and the corresponding channel-dependent background model are then used to calculate the LLR for the test utterance.

Nuisance attribute projection (NAP) is another model-based technique, designed for use in SVM systems [67]. This method aims to remove “nuisance” dimensions, that is, those irrelevant to the task of speaker recognition, by projecting points in the expansion space of the SVM onto a subspace designed to be more resistant to channel effects. A projection matrix is created (using a training data set) in order to minimize the average cross-channel distance, with a weight matrix which can be formulated to not only reduce cross-channel distances, but also increase cross-speaker distances. This minimization problem reduces to an eigenvalue problem, where the eigenvectors with the largest eigenvalues must be found.
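A simplified sketch of the projection: the full method derives the nuisance directions from weighted cross-channel distances, but the essence is an eigenvalue problem whose leading eigenvectors V define the projection P = I − V Vᵀ. Here the nuisance directions are stood in for by the top principal directions of a matrix of within-speaker, cross-channel difference vectors:

```python
import numpy as np

def nap_projection(diff_vectors, k):
    """P = I - V V^T, removing the k leading nuisance directions.

    diff_vectors: (n, d) within-speaker cross-channel differences
    (a simplification of the weighted formulation in [67]).
    """
    centered = diff_vectors - diff_vectors.mean(axis=0)
    # Right singular vectors = eigenvectors of the covariance matrix,
    # ordered by decreasing eigenvalue.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    V = Vt[:k].T                             # (d, k), orthonormal columns
    return np.eye(diff_vectors.shape[1]) - V @ V.T

# usage: project expansion vectors with X @ nap_projection(diffs, k=10)
# (the projection matrix is symmetric) before SVM training and scoring.
```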


2.3.5 Current State-of-the-Art Systems

One current state-of-the-art approach utilizes joint factor analysis (JFA), which models speaker and session variability in GMMs [38]. A target speaker GMM is adapted from a UBM, and the speaker is represented by the means, covariances, and weights of the GMM.

JFA assumes that a speaker- and channel-dependent supervector can be decomposed into the sum of a speaker supervector, s, and a channel supervector, c. The speaker supervector is modelled as s = m + Dz + V y, where m is the speaker- and channel-independent supervector from the UBM, D is a diagonal matrix, V is a low-rank rectangular matrix, and y and z are independent, normally distributed random vectors whose components correspond to the speaker and residual factors, respectively. The channel-dependent supervector is modelled as c = U x, where U is a low-rank rectangular matrix and x is a normally distributed vector whose components correspond to the channel factors. By estimating the speaker space matrix V, the channel space matrix U, and the residual matrix D, the speaker, channel, and residual factors can be calculated, and a score for a trial can be computed using a simple linear product. A simplified version of factor analysis can also be applied to a UBM-GMM system, using only the channel space matrix U, to do eigenchannel MAP adaptation [71, 48].
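The decomposition itself is easy to express; the sketch below simply composes a supervector from the model's pieces. The subspace matrices are random placeholders, since their EM estimation (and the resulting scoring) is beyond a few lines, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
CF = 64 * 20          # supervector dim: n_components * n_features (assumed)
R_v, R_u = 300, 100   # assumed speaker and channel subspace ranks

m = rng.normal(size=CF)            # UBM supervector (speaker/channel indep.)
V = rng.normal(size=(CF, R_v))     # speaker space (low rank); EM-trained
U = rng.normal(size=(CF, R_u))     # channel space (low rank); EM-trained
D = np.diag(rng.uniform(size=CF))  # residual matrix (diagonal)

y = rng.normal(size=R_v)           # speaker factors
x = rng.normal(size=R_u)           # channel factors
z = rng.normal(size=CF)            # residual factors

s = m + D @ z + V @ y              # speaker supervector
c = U @ x                          # channel supervector
M = s + c                          # speaker- and channel-dependent supervector
```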

Another current approach that developed from JFA is the i-vector system [19]. In this method, the total variability is modeled in a single matrix, rather than in separate speaker and channel spaces; i.e., s = m + Tw, where T is the total variability matrix and w is the i-vector (short for intermediate-size vector). The matrix T is trained in a similar way to V in the previous approach, and i-vectors are extracted. Linear discriminant analysis (LDA) and WCCN are applied to the i-vectors as channel compensation, and a score is produced using cosine distance scoring.
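Scoring a trial then reduces to a dot product between channel-compensated i-vectors, as sketched below; the combined LDA+WCCN projection A is assumed to have been trained on a labelled development set:

```python
import numpy as np

def cosine_score(w_enroll, w_test):
    """Cosine distance score between two compensated i-vectors."""
    return (w_enroll @ w_test /
            (np.linalg.norm(w_enroll) * np.linalg.norm(w_test)))

# With A the combined LDA + WCCN projection (trained on development data),
# a trial score is: cosine_score(A.T @ w_enroll_raw, A.T @ w_test_raw),
# compared against a threshold as in the UBM-GMM system.
```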
