

«A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering ...»


CHAPTER 2. BACKGROUND

A later study examining mimicry also aimed to determine how closely an impersonator could match certain acoustic parameters of his speech to those of the target figure [24]. A professional impersonation artist was given three excerpts of speech from well-known figures and asked to imitate these speakers as closely as possible in terms of voice quality, speech style, and speech rate. A comparative recording of the same speech material was made with the artist using his natural voice and speaking style, in order to find the extent to which the artist had to change his voice. The impersonator was able to successfully change his global speech rate, though he had less control over more local articulatory timing. Global fundamental frequency was also successfully matched by the impersonator, who was able to both increase and decrease his mean fundamental frequency (by 15-30 Hz) in order to do so. The impersonator had varying degrees of success at matching the first three formant frequencies of his speech to the targets.

There have also been a number of studies exploring the effects of voice modification on automatic speaker recognition systems. The effects of intentional voice alterations (such as changing pitch or adopting an accent) were tested both in human listening experiments and on automatic speaker recognition system performance [36]. The speech was collected from normal subjects (that is, people who are not professional or expert mimics), in a setting that simulated a telephone conversation. Speakers were asked to disguise their voices in a variety of ways, including changing pitch, changing duration, and mimicking an accent.

Automatic speaker recognition performance using a cepstral UBM-GMM system was evaluated for two conditions: training and test data from normal voice; and training from normal voice and testing from disguised voice. The normal-normal condition produced an EER of almost 0%, while the normal-disguised condition had an EER of 7.5%. However, using the decision threshold from the normal-normal system on the normal-disguised trials yielded an increase in false rejection rate from 7% to 40%, suggesting that systems are vulnerable to intentional voice disguises. A human listening experiment asked subjects to listen to two samples of about 5 seconds of speech and decide whether the utterances were spoken by the same speaker; if unsure, listeners could hear additional 5-second speech utterances, up to a limit of 20 seconds, at which point they had to make a final decision. The results indicated that in the normal-normal condition, automatic performance was similar to the lower quartile of human performance, though the automatic system performed better than humans in the normal-disguised case.
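The gap between a per-condition EER and the error rate at a threshold carried over from another condition can be illustrated with a minimal sketch. The scores below are invented toy values, not data from [36], and the function names are hypothetical; the threshold sweep is one simple way to approximate an EER, not necessarily the procedure used in the study.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate).

    A trial is accepted when its score is >= threshold.
    """
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return frr, far


def equal_error_rate(target_scores, impostor_scores):
    """Try every observed score as a threshold and return (eer, threshold)
    at the point where the two error rates are closest."""
    best = None
    for t in sorted(set(target_scores) | set(impostor_scores)):
        frr, far = error_rates(target_scores, impostor_scores, t)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2, t)
    return best[1], best[2]


# Toy scores (assumed for illustration only).
normal_targets = [2.0, 2.5, 3.0, 3.5]
disguised_targets = [0.2, 1.0, 2.2, 3.0]
impostors = [-1.0, -0.5, 0.0, 0.5]

# Threshold tuned on the normal-normal condition ...
eer, thr = equal_error_rate(normal_targets, impostors)
# ... then reused unchanged on the normal-disguised trials.
frr_disguised, far_disguised = error_rates(disguised_targets, impostors, thr)
```

In this toy setup the normal-normal condition is perfectly separable, yet half of the disguised target trials fall below the carried-over threshold, which mirrors the kind of false rejection increase described above.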

Another study investigated the effects of a transfer function-based voice transformation on automatic speaker recognition performance [8]. In the source-filter model of speech production, speech is modeled as a convolution of a sound source (i.e., the vocal cords) and a linear acoustic filter (i.e., the vocal tract). In the spectral domain, a speech signal X is then given by X(f) = H(f)S(f), where S(f) is the Fourier transform of the source signal and H(f) is the transfer function corresponding to the filter characteristics of a speaker, where transfer function refers to the mapping of input to output in the frequency domain for a linear time-invariant system (such as a filter). Given knowledge of the speaker recognition method, the voices of impostors were modified to target a specific speaker. By transforming the impostor speech in such a way as to match the transfer function of a targeted speaker, they were able to increase the false alarm rate of the system from less than 1% to 97%, when using the targeted speaker’s training utterance, and to 50% when using a different utterance of the targeted speaker. A previous study also tested computer voice-altered impostors, using a speech synthesis algorithm to model the spectral characteristics of a target voice [58]. In this case, the false acceptance rate increased from 1.5% to 86%.
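The relation X(f) = H(f)S(f) is the convolution theorem applied to the source-filter model. The following is a small numerical check, assuming a toy pulse-train source and a made-up three-tap impulse response (neither is taken from [8]); the helper names are hypothetical and the naive DFT is used only to keep the sketch dependency-free.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for tiny signals)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def convolve(s, h):
    """Linear convolution of two finite sequences."""
    out = [0.0] * (len(s) + len(h) - 1)
    for i, sv in enumerate(s):
        for j, hv in enumerate(h):
            out[i + j] += sv * hv
    return out


source = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # toy excitation s(t)
filt = [1.0, 0.5, 0.25]                  # toy vocal-tract impulse response h(t)

x = convolve(source, filt)               # time domain: x = s * h
n = len(x)

# Zero-pad both inputs to the output length so the DFT bins line up.
S = dft(source + [0.0] * (n - len(source)))
H = dft(filt + [0.0] * (n - len(filt)))
X = dft(x)

# Frequency domain: X(f) should equal H(f) S(f) in every bin.
max_err = max(abs(X[k] - H[k] * S[k]) for k in range(n))
```

Because the filter enters the spectrum as a multiplicative factor, replacing one speaker's H(f) with another's is, at least in this idealized model, a per-frequency rescaling of the signal.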


2.7 Speaker Recognition Error Analysis

2.7.1 A Speaker Menagerie

One of the inspirations for this thesis is the work of Doddington et al., who classified speakers in groups according to the types of speaker recognition errors they cause [22]. There are 4 types of speakers defined: “goats,” speakers who cause a large number of false rejections as a target speaker; “lambs,” speakers who cause a large number of false accepts as a target; “wolves,” speakers who cause a large number of false accepts as an impostor test speaker; and “sheep,” the default type of speaker. Through the use of statistical tests, the presence of goats, lambs, and wolves was shown for a UBM-GMM system using data from NIST’s 1998 Speaker Recognition Evaluation, for female speakers only.

The score for each trial of target-test pairs was considered a function of the test speaker index j and the model speaker index k. Thus, a score probability density function for a given test speaker (j) and model speaker (k) would be f_s(·|j, k). By asserting the null hypothesis that there are no speaker differences, the existence of goats, lambs, and wolves could be shown by considering different score distributions and rejecting the null hypothesis. For the case of goats, the density function need only include the case where j = k, in which the density should not depend on k if goats do not exist; that is, without goats, the distribution of true speaker scores should be the same for each true speaker. For the lambs and wolves analysis, the case of interest is j ≠ k, in which the density should not depend on k if lambs do not exist, and should not depend on j if wolves do not exist. That is, if there are no lambs, the distribution of impostor scores should be the same regardless of the model speaker, while if there are no wolves, then the distribution of impostor scores should be the same regardless of the test speaker.
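The three null hypotheses can be written compactly as follows. This is only a restatement of the paragraph above; the symbols f_s^true, g, and h are notation introduced here for illustration and do not appear in [22].

```latex
\begin{align*}
H_0^{\text{goat}} &: f_s(x \mid k, k) = f_s^{\text{true}}(x)
  && \text{for every model speaker } k,\\
H_0^{\text{lamb}} &: f_s(x \mid j, k) = g(x \mid j)
  && \text{for all } j \neq k \text{ (no dependence on } k\text{)},\\
H_0^{\text{wolf}} &: f_s(x \mid j, k) = h(x \mid k)
  && \text{for all } j \neq k \text{ (no dependence on } j\text{)}.
\end{align*}
```

Rejecting the first hypothesis indicates goats, rejecting the second indicates lambs, and rejecting the third indicates wolves.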

For goats, analysis comprised computing means and variances for the sets of scores belonging to the same true speaker, and then determining if the means and variances depend on the speaker. Under the assumption that the means and variances do not depend on the speaker, only 5% of the true speaker score means should lie outside the 2.5 and 97.5 percentiles of the hypothetical speaker-independent underlying score distribution with appropriate mean and variance; if this does not hold true, then the speakers below the hypothetical 2.5 percentile can be categorized as goats. The results showed that there were, in fact, more outliers than could be accounted for by a single speaker-independent distribution.
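A rough sketch of this percentile check is given below. It assumes a normal approximation, so that under the null hypothesis the mean of n true speaker scores lies within about 1.96 standard errors of the pooled mean 95% of the time; the exact form of the hypothetical distribution used in [22] is not specified here, and the function name and toy scores are illustrative only.

```python
import math
import statistics


def flag_goats(true_scores_by_speaker, z=1.96):
    """Flag speakers whose mean true-speaker score falls below the 2.5th
    percentile expected under a single speaker-independent distribution.

    Returns (goats, outlier_fraction), where outlier_fraction counts means
    outside either tail. Assumption: means are approximately normal with
    the pooled mean and pooled variance / n.
    """
    pooled = [s for scores in true_scores_by_speaker.values() for s in scores]
    mu = statistics.mean(pooled)
    sigma = statistics.pstdev(pooled)

    goats = []
    outliers = 0
    for speaker, scores in true_scores_by_speaker.items():
        half_width = z * sigma / math.sqrt(len(scores))
        mean_score = statistics.mean(scores)
        if mean_score < mu - half_width:
            goats.append(speaker)
            outliers += 1
        elif mean_score > mu + half_width:
            outliers += 1
    return goats, outliers / len(true_scores_by_speaker)


# Toy true-speaker scores (assumed values, four trials per speaker).
scores = {
    "A": [2.0, 2.2, 1.8, 2.0],
    "B": [2.1, 1.9, 2.0, 2.0],
    "C": [1.9, 2.1, 2.0, 2.0],
    "D": [0.0, 0.2, -0.2, 0.0],
}
goats, outlier_fraction = flag_goats(scores)
```

An outlier fraction well above 0.05 is the signal that a single speaker-independent distribution cannot account for the observed means.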

For lambs, graphical analysis involved plotting the maximum impostor score for a model speaker against each true speaker score for that model speaker. Although this plot did not indicate any lamb sub-population of models in this analysis, the models with high maximum impostor score may be considered lamb-like.

For wolves, the maximum impostor score is first computed for each test utterance, and then the means and variances of the sets of maximum impostor scores for the same test speaker are calculated. As with the distribution considered in the goat speaker analysis, the means are compared with the 2.5 and 97.5 percentiles of a hypothetical speaker-independent underlying score distribution; if more than 5% of the means lie outside these hypothetical percentiles, then there is a speaker dependence, and the test speakers with means above the hypothetical 97.5 percentile may be considered wolves. Once again, there were more outliers than could be accounted for by a single distribution, indicating the existence of wolf-ish speakers.

Furthermore, the F-test, Kruskal-Wallis test, and Durbin test were used to reject the null hypotheses at the 0.01 significance level for goats, lambs, and wolves. The F-test is a one-way analysis of variance test used to determine statistically whether there is a speaker effect. The F-test was applied to test for potential goats by using all true speaker scores for each speaker, while it tested for potential lambs and wolves by first averaging the scores corresponding to the same model-test speaker pair (over all test utterances), and then using all impostor trials for the model speakers (in the lamb case) or test speakers (in the wolf case). The Kruskal-Wallis test is also a one-way analysis of variance, but it is non-parametric and uses ranks. For speakers with at least 5 true speaker trials, all the true speaker scores were used (goats). As with the F-test, the impostor scores were averaged for each model-test speaker pair before the test was applied (for lambs and wolves). Ranks are assigned to all of the mean scores, and ranks are summed for each speaker. Finally, the Durbin test is a two-way analysis of variance by ranks, and was applied only to impostor scores (for lambs and wolves testing), for which the data could be viewed as conditioned on the two different speakers (i.e., the model and test speakers for each impostor score). As with the previous tests, impostor scores were first averaged across test utterances, and then the Durbin test assigned ranks to the averaged scores. The ranks were then summed for each model or test speaker, corresponding to the lamb or wolf test, respectively.
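The rank-sum step of the Kruskal-Wallis test can be sketched as below. This computes only the standard H statistic, without the tie correction or the chi-squared lookup needed for a p-value, and the function names and toy groups are illustrative rather than taken from [22]. Each group stands for one speaker's scores.

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_g^2 / n_g) - 3 (N + 1),
    where R_g is the rank sum of group g (no tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)

    weighted = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Three toy "speakers" with fully separated scores.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

A large H relative to the chi-squared distribution with (number of groups - 1) degrees of freedom indicates a speaker effect; the per-group rank sums are also what allow speakers to be ordered by how goat-like they are.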

Using the rank sums from the Durbin test, a mild correlation of about 0.26 was found to exist between lambs and wolves. No correlations were found between goats and either lambs or wolves. Furthermore, the speakers were ranked according to how goat-like they were (using the Kruskal-Wallis test) and according to how wolf-like and lamb-like they were (using the Durbin test). Then, a cumulative distribution of errors for the rank-ordered speakers showed that the 25% most goat-like speakers contributed 75% of the false rejection errors, though false alarm errors were more evenly distributed across speakers.
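The cumulative error distribution over rank-ordered speakers can be computed as in the following sketch. The speaker ranking and error counts are invented so that the first quarter of speakers accounts for 75% of the errors; they are not the NIST 1998 data, and the function name is hypothetical.

```python
def cumulative_error_share(ranked_speakers, errors_by_speaker):
    """Return the cumulative fraction of errors after each speaker,
    walking the list from most to least goat-like."""
    total = sum(errors_by_speaker[s] for s in ranked_speakers)
    shares = []
    running = 0
    for speaker in ranked_speakers:
        running += errors_by_speaker[speaker]
        shares.append(running / total)
    return shares


# Assumed ranking (most goat-like first) and false rejection counts.
ranking = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
false_rejections = {
    "s1": 30, "s2": 30, "s3": 5, "s4": 5,
    "s5": 4, "s6": 3, "s7": 2, "s8": 1,
}

shares = cumulative_error_share(ranking, false_rejections)
top_quarter = len(ranking) // 4           # 2 of 8 speakers
top_quarter_share = shares[top_quarter - 1]
```

A curve that rises steeply at the start, as here, indicates errors concentrated in a few speakers; a curve close to the diagonal indicates errors spread evenly, as reported for the false alarms.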

2.7.2 Related Work

Poh et al. extended the work of Doddington et al. by developing a user-specific score normalization (referred to as F-norm’s variant) in order to address “badly behaved” users of the system, i.e., those users who degrade system performance [60]. Furthermore, for a multimodal biometrics context, Poh et al. developed a fusion technique that decides whether or not to fuse the output of several systems on a per user basis.

For a closed-set speaker identification task, Jin and Waibel implemented a “naive delambing method” in order to reduce the effects of speakers who were likely to be identified as another speaker [31]. In the context of a vector quantization (VQ) based technique, in which codebooks are trained for each speaker, Jin and Waibel found that the closest match in cross-validation testing for some speakers was not the correct speaker himself, and thus developed a method for modifying the codebooks in such cases. Additionally, to further reduce the effects of lamb-like speakers, these lamb speakers were located in the set (using cross-validation testing), and a threshold was set for each lamb speaker’s belief heuristic value, so that identification as that lamb speaker could occur only if the score was above the belief heuristic.

2.7.3 Session Variability

Beyond considering the effects of different types of speakers, there has also been work investigating the impact that the particular training and test utterances used have on system performance [34]. A UBM-GMM system with factor analysis on male telephone data from the 2008 NIST Speaker Recognition Evaluation was first analyzed with respect to performance dependence on the target speaker, focusing on the lambs and wolves of the aforementioned Doddington menagerie. Results showed an uneven distribution of false alarm errors, with 26% of the speakers causing 50% of the errors, and the 6% worst speakers accounting for 17% of the errors. The distribution of false rejection errors was also uneven, with 8% of the target speakers causing 50% of the false rejection errors, and 25% of these errors were due to 6% of the speakers.

The study also investigated the effect of the training sample used for each target speaker. Baseline performance corresponded to the training segment selected in the NIST evaluation. The best and worst training utterances were also defined for each speaker by finding the utterance that minimized or maximized the sum of false acceptance and false rejection rates, respectively. The baseline NIST performance had an EER of 12.1%, while using the best training data yielded an EER of 4.1% and using the worst training data generated an EER of 21.9%. The variability in performance demonstrated that the choice of training segment can have a significant impact.
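The selection rule for the best and worst training utterances can be sketched as follows, assuming that a false acceptance rate and a false rejection rate have already been measured for each candidate training utterance of a speaker. The utterance names and rates are hypothetical and are not results from [34].

```python
def pick_training_utterances(rates_by_utterance):
    """Given {utterance: (false_acceptance_rate, false_rejection_rate)},
    return (best, worst): the utterances that minimize and maximize the
    sum of the two error rates."""
    def total_error(utterance):
        far, frr = rates_by_utterance[utterance]
        return far + frr

    best = min(rates_by_utterance, key=total_error)
    worst = max(rates_by_utterance, key=total_error)
    return best, worst


# Assumed per-utterance error rates for one target speaker.
candidate_rates = {
    "utt_a": (0.02, 0.05),
    "utt_b": (0.10, 0.20),
    "utt_c": (0.04, 0.06),
}
best, worst = pick_training_utterances(candidate_rates)
```

Repeating this per speaker yields the "best" and "worst" training sets whose EERs bracket the baseline. Since the selection uses the same trials that are later scored, the resulting figures are best read as bounds on the effect of training segment choice rather than as achievable operating points.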

Additional work investigated possible causes for the variable performance [33]. In particular, using data from NIST SRE08 as well as a French database of controlled read speech, BREF 120, the dependence of performance on training session was further analyzed. When switching the train and test segments of the sets used in the aforementioned work on SRE08, they found that the ranking of performance remained the same. That is, the inverted case corresponding to the original worst training segments (which become test segments in the inversion) still had the highest EER (17%) and the inverted case corresponding to the original best training segments (which are test segments in the inversion) had the lowest EER (7.4%), with the inverted NIST set performing in between the two (at 13.5%). However, the differences in performance were smaller than in the original case, suggesting that the choice of training excerpts has a greater effect than the choice of testing excerpts.

Analysis of system performance on the BREF 120 database for both male and female speakers also showed a range of performance between choosing the best training utterances and the worst, with random selection of training segments yielding performance in between the best and the worst. The distribution of phonetic content between different training excerpts was examined as a possible contributing cause for the difference in performance.
