«A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering ...»

2.4 Speech Corpora

There are a number of conversational speech corpora utilized for speaker verification tasks. Older corpora include Switchboard-1, Switchboard-2, and Fisher [45, 46, 17]. They contain speech data collected from telephone conversations between pairs of speakers; these conversations are typically around 5 minutes in length, so that each conversation side (i.e., the side of the conversation corresponding to one speaker) is roughly 2.5 minutes in length. In addition to landline telephone data, there is a cellular telephone data set of Switchboard-2.

The National Institute of Standards and Technology (NIST) has coordinated Speaker Recognition Evaluations since 1997, and there are multiple corpora available from these evaluations; the most commonly used data sets correspond to the NIST 2004, 2005, 2006, 2008, and 2010 Speaker Recognition Evaluations (SREs) [50, 51, 52, 53, 54]. The evaluation data is taken from various stages of the larger Mixer collection [15, 16]. Each of the aforementioned SRE data sets includes conversational telephone speech. Conversational speech recorded on a variety of microphones was included starting in SRE05. SRE08 introduced a different style of speech, specifically that of an interview; in these cases, most speech belongs to the interviewee, though some interviewer speech may be present. I will refer to each speech sample or utterance, whether obtained from a conversation or an interview, as a conversation side.

2.5 Performance Measures for the Speaker Verification Task

The NIST Speaker Recognition Evaluations use two performance measures for speaker recognition systems, namely the detection cost function (DCF) and the equal error rate (EER). As mentioned previously, there are two types of errors that occur in speaker verification tasks: false acceptances, or false alarms, in which an impostor speaker is incorrectly verified as the target, and false rejections, or misses, in which a true speaker is rejected as the target. For every decision threshold, there will be false alarm and miss rates that indicate the probability of each type of error occurring.

The DCF is defined as a weighted sum of the miss and false alarm error probabilities:

$$\mathrm{DCF} = C_{\mathrm{Miss}} \times P_{\mathrm{Miss}|\mathrm{Target}} \times P_{\mathrm{Target}} + C_{\mathrm{FalseAlarm}} \times P_{\mathrm{FalseAlarm}|\mathrm{NonTarget}} \times (1 - P_{\mathrm{Target}}) \tag{2.13}$$

In Equation (2.13), $C_{\mathrm{Miss}}$ and $C_{\mathrm{FalseAlarm}}$ are the relative costs of detection errors, and $P_{\mathrm{Target}}$ is the a priori probability of the specified target speaker. I will use the values from SRE08, namely, $C_{\mathrm{Miss}} = 10$, $C_{\mathrm{FalseAlarm}} = 1$, and $P_{\mathrm{Target}} = 0.01$. When DCF is given here, it refers to the minimum possible DCF, i.e., to a cost that has been minimized over possible values of the decision threshold. The equal error rate (EER) is simply the rate at which false alarm and miss probabilities are equal.
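To make the two measures concrete, the following is a minimal sketch of how the minimum DCF and the EER could be computed from lists of target and non-target trial scores. The function names, the convention that a trial is accepted when its score is at or above the threshold, and the approximation of the EER by the threshold where the two error rates are closest are all assumptions made for illustration; this is not the official NIST scoring tool.

```python
from typing import Sequence, Tuple


def error_rates(target_scores: Sequence[float],
                nontarget_scores: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (P_miss, P_false_alarm), accepting trials with score >= threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa


def min_dcf_and_eer(target_scores: Sequence[float],
                    nontarget_scores: Sequence[float],
                    c_miss: float = 10.0,
                    c_fa: float = 1.0,
                    p_target: float = 0.01) -> Tuple[float, float]:
    """Sweep the decision threshold; return (minimum DCF, approximate EER).

    The default costs and prior are the SRE08 values quoted in the text.
    """
    scores = sorted(set(target_scores) | set(nontarget_scores))
    # Add one threshold above every score so "reject everything" is covered.
    thresholds = scores + [scores[-1] + 1.0]

    best_dcf = float("inf")
    best_gap = float("inf")
    eer = 0.0
    for t in thresholds:
        p_miss, p_fa = error_rates(target_scores, nontarget_scores, t)
        # Weighted sum of the two error probabilities, as in Equation (2.13).
        dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
        best_dcf = min(best_dcf, dcf)
        # Approximate the EER at the threshold where the two rates are closest.
        gap = abs(p_miss - p_fa)
        if gap < best_gap:
            best_gap = gap
            eer = (p_miss + p_fa) / 2.0
    return best_dcf, eer


if __name__ == "__main__":
    # Made-up scores, purely for illustration.
    targets = [2.0, 1.5, 0.5, 3.0]
    nontargets = [-1.0, 0.0, 1.0, -2.0]
    print(min_dcf_and_eer(targets, nontargets))
```

With a low target prior such as 0.01, the false alarm term dominates the cost, so the threshold that minimizes the DCF tends to sit higher than the threshold at which the two error rates are equal.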

The minimum DCF and EER capture only two possible operating points for a system.

In order to get a better sense of how good a system is overall, there are detection error tradeoff (DET) plots, which plot the false alarm rate against the miss rate over the entire range of decision thresholds [47]. By using a normal deviate scale on both axes, a receiver operating characteristic (ROC) curve becomes approximately a straight line when the score distributions are close to Gaussian. The better the system, the closer the DET curve will be to the lower left of the plot (i.e., smaller error rates).
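As a rough illustration of how the points of a DET curve could be generated, the sketch below sweeps the threshold over all observed scores and warps each (false alarm, miss) pair with the inverse standard normal CDF. The function name `det_points`, the clipping constant `eps`, and the use of the normal deviate (probit) warping commonly applied to DET axes are assumptions of this sketch rather than details taken from the dissertation.

```python
from statistics import NormalDist
from typing import List, Sequence, Tuple


def det_points(target_scores: Sequence[float],
               nontarget_scores: Sequence[float],
               eps: float = 1e-6) -> List[Tuple[float, float]]:
    """Return (probit(P_fa), probit(P_miss)) pairs, one per candidate threshold."""
    probit = NormalDist().inv_cdf  # inverse CDF of the standard normal
    points = []
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # inv_cdf is undefined at exactly 0 and 1, so clip the rates slightly.
        p_miss = min(max(p_miss, eps), 1.0 - eps)
        p_fa = min(max(p_fa, eps), 1.0 - eps)
        points.append((probit(p_fa), probit(p_miss)))
    return points


if __name__ == "__main__":
    # Made-up scores, purely for illustration.
    for x, y in det_points([2.0, 1.5, 0.5, 3.0], [-1.0, 0.0, 1.0, -2.0]):
        print(f"{x:8.3f} {y:8.3f}")
```

The resulting pairs can be passed to any plotting library; points nearer the lower left correspond to smaller error rates.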

2.6 Intrinsic Speaker Qualities

In general, a speech sample is affected by both intrinsic and extrinsic factors, where extrinsic factors include noise, room acoustics, and channel effects. Since the focus of my dissertation work is on intrinsic speaker characteristics, I now discuss a variety of issues and concepts relevant to a discussion of inherent speaker qualities. A brief overview of some of the major sources of variation within and among speakers is given in Section 2.6.1, including physical attributes, accent or dialect, prosody, and emotion. Additionally, in order to further explore the inherent difficulties of a speaker recognition task, the concept of the distinctiveness or recognizability of a speaker is covered in Section 2.6.2, along with various studies in which human listening has been applied to a speaker-related task. Finally, Section 2.6.3 presents work that deals with voice modifications attempted in order to fool an automatic speaker recognition system, as these studies are indicative of the effects that varying speaker characteristics can have.

2.6.1 Sources of Speaker Variation

Physical Attributes

At the most basic level, a person’s voice is characterized by his vocal apparatus. The length of the vocal tract, the size of the vocal folds in the larynx, the size and shape of the nasal cavity, and other anatomical features all contribute to the acoustic properties of a person’s speech, affecting formant frequencies of vowels, average pitch, pitch range, and qualities such as breathiness and nasality [20]. While an individual has a certain amount of control over the frequency characteristics of his speech and can speak outside of his typical range of everyday speech frequencies, the effects of other physical attributes, such as the size and shape of the nasal cavity, cannot be manipulated.

A person’s voice will also be affected by his age and health. Physical changes that occur as a child grows into an adult are the most obvious example of aging effects, especially for male voices. However, the voice quality also changes as an adult grows older. Examination of voice spectrograms for a set of subjects over a period of years showed that the frequency of the point of concentration of formants and the mean pitch frequency decreased with increasing age, and the individual distribution curves of mean pitch frequency became more narrow, i.e., the ability to vary fundamental frequency was lost in the aging process [23].

Furthermore, a person’s health will impact the way his voice sounds; for instance, a cold may make the voice hoarse or more nasal.

Language, Dialect, and Accent

The language choice of a speaker is another source of speaker individuality. In the case of multi-lingual speakers, their native language will typically influence the way they produce the speech sounds of other languages, giving rise to a foreign accent.

Furthermore, word and phone pronunciation can vary widely, even within the same language, leading to accents among native speakers. There are many accents in English, for instance: not only are there British, American, and Australian accents, but there are local regional accents within each of those groups.

In addition to different accents, languages often include different dialects, which may vary in the usage of certain words or grammatical forms, as well as word pronunciations. Variations in dialect may reflect geographical, age, socio-economic, or educational differences between speakers.

Variability in Speech Production and Prosody

Humans are able to listen to speech and identify the words and phones that are spoken. However, the same word or phone may be produced in varying ways. Speakers will differ in the precise ways of articulating a sound, as well as the degree of coarticulation between consecutive sounds. Speech rate, often measured by the number of words or phones per second, is another characteristic that will vary from speaker to speaker.

In linguistics, prosody refers to various acoustic properties of speech that can convey additional information about the utterance or speaker. Types of prosodic information include loudness, pitch, tone, intonation, rhythm, and lexical stress. Variations in prosody may indicate things such as sarcasm, speaker emotion, emphasis, or whether an utterance is a statement or a question. Furthermore, prosody is suprasegmental, meaning that prosodic features are not limited to any one segment, but occur at a higher level, across multiple segments.

The concept of speech rhythm involves a number of timing parameters, including the tempo, pauses, and various durational patterns, which may, for example, be measured as the mean and standard deviation of word or phone lengths. The prosodic tendencies of a given speaker help to define his speaking style. Additional lexical information, such as word usage and the relative frequency of disfluency classes (including pause-fillers, discourse markers, or backchannel expressions), can also contribute to a speaker’s individual speaking style. As described in Section 2.3.3, several of the higher-level systems for speaker recognition attempt to capture such individual variations in order to differentiate between speakers.
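The durational measures mentioned above can be illustrated with a small sketch. It assumes a hypothetical time alignment given as `(label, start_sec, end_sec)` tuples for the words or phones of one conversation side; the function name, the input format, and the choice to count pauses in the elapsed time used for speech rate are assumptions made for illustration, not the feature extraction used by any system described in this chapter.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

Segment = Tuple[str, float, float]  # (label, start time in s, end time in s)


def rhythm_features(segments: List[Segment]) -> Dict[str, float]:
    """Compute simple durational statistics for time-aligned words or phones."""
    durations = [end - start for _, start, end in segments]
    # Elapsed time from the first segment's start to the last segment's end,
    # so pauses between segments lower the rate (an assumption of this sketch).
    elapsed = segments[-1][2] - segments[0][1]
    return {
        "mean_duration": mean(durations),
        "std_duration": stdev(durations),  # sample standard deviation
        "rate_per_second": len(segments) / elapsed,
    }


if __name__ == "__main__":
    # Made-up word alignment, purely for illustration.
    words = [("so", 0.0, 0.2), ("I", 0.3, 0.4),
             ("went", 0.4, 0.8), ("home", 0.9, 1.5)]
    print(rhythm_features(words))
```

The same function applies unchanged to phone-level alignments, since it only depends on segment boundaries.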

Emotion

The emotional state of a speaker can also impact the characteristics of his speech. A number of acoustic parameters can be involved in conveying an emotion: the level, range and contour of the fundamental frequency (perceived as pitch); the vocal energy or amplitude (perceived as voice intensity); the energy distribution across the frequency spectrum (perceived in voice quality or timbre); formant location (related to articulation perception); and a number of timing parameters, such as tempo and pauses [5].

As an example, joy typically manifests in speech as increases in the mean, range, and variability of fundamental frequency, along with an increase in mean energy. Joy may also cause a higher rate of articulation.

2.6.2 Speaker Recognizability or Inherent Challenges

A concept that is related to inherent speaker characteristics is the recognizability of a person’s voice. One human listening experiment asked subjects to rate the distinctiveness of different speakers, in terms of a seven-point scale describing how easy or hard the voice would be to remember [40]. An error analysis of a speaker recognition system that will be discussed in Section 2.7 also attempted to find speakers who were hard for the system to recognize. Though the results of human listening tasks may not always correspond to results obtained by automatic systems, they provide insight into the nature of challenges inherent to speaker recognition tasks.

Speaker verification by human listeners was compared to machine performance using NIST 1998 Speaker Recognition Evaluation data [65]. The human task was designed to emulate the paradigm of the NIST evaluation as closely as possible, though human constraints due to memory and fatigue imposed limits on both the number of trials and the length of the speech samples. Listeners were asked to make a same or different speaker discrimination with confidence ratings (10 levels). Results showed that human listening, when individual decisions were combined, was comparable to or even better than typical computer algorithms, especially in the case of mismatched train and test handsets.

Recently, the 2010 NIST Speaker Recognition Evaluation included a human-assisted speaker recognition task [27]. Participating sites evaluated a subset of trials, selected to be difficult, using any human-assisted technique, including listening and examination of spectrograms or other features. The decision could be based on a group of humans, with no restriction on the use of experts or naive listeners. Analysis of results showed that this was largely a challenging task for humans, with fairly high error rates on many of the selected trials. For these difficult trials, automatic systems performed better than humans.

A study of voice identification by human listeners, relating to the reliability of the testimony of an earwitness (in a legal setting), examined a variety of issues, including familiar versus unfamiliar voices, the reliability or accuracy of voice identification, reliability as a function of time, and reliability as a function of whether or not the listener is trying to remember the voice [18]. Examination of various studies yielded a number of conclusions. First, the length of the heard speech does not seem to have a large effect. Voice disguise and even unintentional changes in tone were found to greatly reduce identification accuracy, even under ideal conditions. When comparing incidentally and intentionally memorized voices, there was little evidence that voice identifications by witnesses who were unprepared or had little time to initiate efficient encoding strategies would be reliable. In terms of delay between the time of hearing the initial speech and making a voice identification, the greater the delay, the greater the likelihood of error and unreliability. Examination of the relationship between witness accuracy and confidence level showed promising but inconclusive results.

2.6.3 Voice Modifications

As mentioned in Section 2.6.1, speakers can manipulate their voices in certain ways, even if they cannot change certain physical attributes, like vocal tract lengths or the size and shape of their nasal cavities. Changes in a speaker’s voice, intentional or not, can impact speaker recognition performance.

One early study examined the effects of voice disguise and voice imitation on spectrograms [23]. For voice disguise, subjects kept the speech content the same across samples, but were allowed to differ from their normal voice in terms of pitch frequency, rate of articulation, pronunciation, and dialect. Comparison of the formant positions indicated that the formants could be shifted higher or lower than the normal voice, though the first formant was comparatively stable. In terms of voice imitation, the imitator was able to vary his mean fundamental frequency considerably in order to be more similar to a target, though he was generally unable to precisely match the formants or instantaneous fundamental frequencies of the speaker being imitated. It makes sense that the imitator could successfully change his overall average fundamental frequency, even if precise instantaneous fundamental frequencies could not be matched, given that the imitator is changing his voice according to his memory of perceived pitch of the target speaker (which may not match the actual instantaneous values). Similarly, although formant frequencies can potentially be changed, a speaker has certain habits of articulating speech sounds (leading to certain formant frequencies) that are often difficult to manipulate consciously over a continuous speech utterance.

The imitator was largely successful in imitating the speech melody of a given target.
