«A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering ...»
the same from time to time. Similarly, there may be speakers for which an automatic speaker recognition system makes more decision errors. There are many sources of variation within and across speakers that may contribute to causing such errors, including basic physical attributes, language, accent, characteristics of speaking style, and changes in emotional state or health.
This thesis is inspired by the analysis of Doddington et al., in which the authors characterized speakers in terms of their error tendencies. The default, well-behaved speakers are “sheep.” Speakers who cause a proportionately high number of false rejection errors as the target speaker are called “goats.” Those speakers who tend to cause false acceptance errors as the target speaker are “lambs,” and those who tend to cause false acceptance errors as the impostor speaker are labeled “wolves.” The existence of such speaker types was demonstrated through statistical tests using the outputs of an automatic speaker recognition system. Further analysis of additional data sets and different types of speaker recognition systems can provide more insight into the dependence that system performance has on the speakers.
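As a rough illustration of these error tendencies, the following Python sketch tabulates, for each speaker, the rates that correspond to goat-like, lamb-like, and wolf-like behavior. It is only a counting exercise and not the statistical testing used by Doddington et al. The function name, the `(target_id, test_id, score)` trial layout, and the rule that a trial is accepted when its score is strictly above the threshold are all assumptions made for the example.

```python
from collections import defaultdict


def per_speaker_error_rates(trials, threshold):
    """Tabulate per-speaker error rates from scored trials.

    trials: iterable of (target_id, test_id, score) tuples (hypothetical layout).
    A trial is a true-speaker trial when target_id == test_id.
    A trial is accepted when score > threshold (assumed decision rule).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for target, test, score in trials:
        accepted = score > threshold
        if target == test:
            counts[target]["true_trials"] += 1
            if not accepted:
                counts[target]["false_rejects"] += 1
        else:
            counts[target]["impostor_trials_as_target"] += 1
            counts[test]["impostor_trials_as_impostor"] += 1
            if accepted:
                counts[target]["false_accepts_as_target"] += 1
                counts[test]["false_accepts_as_impostor"] += 1

    def ratio(numerator, denominator):
        # Avoid division by zero for speakers with no trials of a given kind.
        return numerator / denominator if denominator else 0.0

    rates = {}
    for speaker, c in counts.items():
        rates[speaker] = {
            # High value: goat-like tendency.
            "false_reject_rate": ratio(c["false_rejects"], c["true_trials"]),
            # High value: lamb-like tendency.
            "false_accept_as_target_rate": ratio(
                c["false_accepts_as_target"], c["impostor_trials_as_target"]),
            # High value: wolf-like tendency.
            "false_accept_as_impostor_rate": ratio(
                c["false_accepts_as_impostor"], c["impostor_trials_as_impostor"]),
        }
    return rates
```

Speakers whose rates are all low would correspond to the default “sheep” category.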
Given that automatic speaker recognition system performance does depend on speaker characteristics, knowing which speakers are likely to cause errors is information that could prove useful for improving decision accuracy. Yet, limited work has been done to ﬁnd these diﬃcult speakers without the beneﬁt of having a system’s output.
Furthermore, a number of real-world applications rely on automatic speaker recognition technology and could benefit from being able to find the most similar speakers or the trials for which a decision is most difficult. Inherent to certain tasks are populations of in-set and out-of-set speakers. That is, there may be a set of known speakers (i.e., in-set speakers), with associated speech samples, that needs to be distinguished from other, unknown speakers (i.e., out-of-set speakers). One example of this type of real-world application is fraud detection, where a company is trying to prevent fraud in the use of a call center or other phone-based system. Given a database of speaker models trained using speech samples from people known to have committed fraud, an automatic system may compare new speech data from incoming calls to the database of fraudster speaker models in order to detect possible fraudulent attempts, which must then be verified by a human listener. However, a human expert would be unable to listen to all calls if there are a large number of potential matches between new speech data and the fraudster models. A method for selecting the most error-prone speakers could thus prove very useful for focusing the efforts of a human listener in a smart way.
conversational speech, and one that is a more recent collection of both conversational speech and interview-style speech, recorded on a variety of channels, including landline and cellular telephone, as well as various types of microphones. Furthermore, in addition to considering a traditional speaker recognition system approach, for the second data set I utilize the outputs of a more contemporary approach that is better able to handle variations in channel.
The second component of this thesis investigates a straightforward approach to predicting which speakers will be difficult for a system to recognize correctly. I use a variety of features to calculate feature statistics that are then used to compute a measure of similarity between speaker pairs. By ranking these similarity measures for a set of impostor speaker pairs, I determine those speaker pairs that are easy for a system to distinguish and those that are difficult to distinguish. I then develop an approach for combining a set of feature statistics in order to produce a comprehensive measure of how likely it is that a speaker will cause errors.
In particular, I use support vector machine (SVM) classiﬁers trained to distinguish between diﬃcult and easy examples, in order to detect diﬃcult impostor and target speakers.
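The pair-ranking idea described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method developed in later chapters: each speaker is summarized by a hypothetical vector of feature statistics, and Euclidean distance stands in for the similarity measure (a smaller distance is taken to mean a more similar, and presumably harder to distinguish, pair). The function name and the input layout are illustrative only.

```python
import math
from itertools import combinations


def rank_speaker_pairs(speaker_stats):
    """Rank all speaker pairs from most to least similar.

    speaker_stats: dict mapping a speaker id to a list of feature
    statistics (hypothetical layout; all lists must have equal length).
    Returns a list of (speaker_a, speaker_b, distance), smallest distance first.
    """
    pairs = []
    for a, b in combinations(sorted(speaker_stats), 2):
        # Euclidean distance is an assumed stand-in for the similarity measure.
        distance = math.dist(speaker_stats[a], speaker_stats[b])
        pairs.append((a, b, distance))
    pairs.sort(key=lambda item: item[2])
    return pairs
```

Pairs at the top of the returned list would be candidates for the difficult-to-distinguish category, and pairs at the bottom for the easy category.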
I begin by covering relevant background material in Chapter 2, including typical features and systems for automatic speaker recognition, intrinsic speaker characteristics, and related error analyses of speaker recognition systems. Next, I explore the speaker-dependent performance of systems in Chapter 3. In Chapter 4, I introduce a simple approach to ﬁnding diﬃcult-to-distinguish speaker pairs. I then describe a technique for detecting diﬃcult target or impostor speakers in Chapter 5. Finally, I summarize and conclude my work in Chapter 6.
Chapter 2 Background
There are several broad areas of prior work relevant to this dissertation. I begin in Section 2.1 by setting up the speaker recognition problem, while in Sections 2.2, 2.3, 2.4, and 2.5 I provide details about features, system approaches, relevant speech corpora, and measures of system performance, respectively. There are a number of intrinsic speaker qualities, which account for intra-speaker variability, as well as differences between speakers, that I describe in Section 2.6. The most directly related work involves error analysis pertaining to speaker recognition systems, which I discuss in Section 2.7.
2.1 The Speaker Recognition Problem
As its name implies, automatic speaker recognition attempts to recognize, or identify, a given speaker by processing his/her speech automatically, that is to say, in a fully objective and reproducible manner, without the aid of human listening or analysis. In order to be able to recognize the speaker of a given test utterance, it is necessary to have training data first, so that the system can “learn” the speaker of interest. The term speaker recognition can be used to refer to a variety of tasks. One type of task is speaker identification, where the system must produce the identity of the speaker, given a test utterance, from a set of speakers. With closed-set speaker identification, the number of speakers in the set is fixed, and the system must choose which among the given speakers is a match to the speaker of the test utterance. Open-set speaker identification adds a layer of complexity by allowing the test utterance to belong to a speaker not in the set of speakers for whom there is training data available. A second type of task is speaker verification, which involves a hypothetical target speaker match to the test speaker, and the system must determine whether or not the test speaker identity is as claimed.
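The difference between closed-set and open-set identification can be made concrete with a small Python sketch. It assumes that a system has already produced one score per enrolled speaker for the test utterance, with higher scores indicating a better match. The function names, the score dictionary, and the use of a rejection threshold for the open-set case are assumptions made for illustration.

```python
def identify_closed_set(scores):
    """Return the enrolled speaker with the highest score.

    scores: dict mapping an enrolled speaker id to a match score
    (hypothetical layout; higher means a better match).
    """
    return max(scores, key=scores.get)


def identify_open_set(scores, threshold):
    """Return the best-matching enrolled speaker, or None when even the
    best score does not exceed the (assumed) rejection threshold, meaning
    the test speaker is judged to be outside the enrolled set."""
    best = identify_closed_set(scores)
    return best if scores[best] > threshold else None
```

In the closed-set case some enrolled speaker is always returned, whereas the open-set case may reject every enrolled speaker.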
Regardless of which type of task, the problem may be further characterized as being text-dependent or text-independent. In the text-dependent case, the train and test utterances are required to be a specific word or set of words; the system can then exploit the knowledge of what is spoken in order to better make a decision. For the text-independent case, there is no constraint on what is said in the speech utterances, allowing for generalization to a wider variety of situations.
This dissertation focuses on the text-independent speaker verification task. For each target (or hypothesis) speaker and test utterance pair, the system must decide whether or not the speaker identities are the same. In this case, two types of errors arise: false acceptance (or false alarm) and false rejection (or missed detection). A false accept occurs when the system incorrectly verifies an impostor test speaker as the target speaker. A false reject occurs when the system fails to verify a true test speaker as the target speaker. A trial refers to a target speaker and test utterance pair. In general, the training data of a target speaker may include one or more samples of speech, of varying lengths, and the test data may also include varying lengths of speech samples. For my purposes, the train and test utterances will both be a single conversation side, which is typically 2.5-3 minutes of speech. Therefore, a trial will correspond to a pair of train and test conversation sides. For each trial, the corresponding score simply refers to the output of a speaker recognition system given that train and test data. The score may or may not correspond to a likelihood. Furthermore, in order to make a decision for a trial given its score, there must be a decision threshold; the system will then decide that it is a true speaker trial if the score is above the decision threshold, or that it is an impostor trial if the score is below the decision threshold.
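The threshold decision and the two error types can be illustrated with a short Python sketch. The function names and the `(score, is_true_speaker)` trial layout are assumptions made for the example. The text does not say how a score exactly equal to the threshold is handled, so the sketch arbitrarily treats it as a rejection.

```python
def accept(score, threshold):
    """Decide 'true speaker' when the score is above the threshold.
    A score equal to the threshold is rejected (an arbitrary choice)."""
    return score > threshold


def error_rates(trials, threshold):
    """Compute (false_accept_rate, false_reject_rate).

    trials: list of (score, is_true_speaker) pairs (hypothetical layout).
    """
    false_accepts = false_rejects = impostor_trials = true_trials = 0
    for score, is_true_speaker in trials:
        if is_true_speaker:
            true_trials += 1
            if not accept(score, threshold):
                false_rejects += 1  # true speaker not verified
        else:
            impostor_trials += 1
            if accept(score, threshold):
                false_accepts += 1  # impostor verified as the target
    fa_rate = false_accepts / impostor_trials if impostor_trials else 0.0
    fr_rate = false_rejects / true_trials if true_trials else 0.0
    return fa_rate, fr_rate
```

Raising the threshold trades false accepts for false rejects, and lowering it does the opposite.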
In general, speaker recognition errors may be caused by both extrinsic factors, such as channel eﬀects or noise, and intrinsic factors, such as age, sex, speaking style, or other inherent speaker attributes. My focus is on the eﬀects of intrinsic speaker characteristics.
In order to perform a speaker recognition task, a system must ﬁrst parameterize the speech in a meaningful way that will allow the system to distinguish and characterize speakers and their speech; this step is addressed next in Section 2.2, which discusses some relevant features commonly used in speech processing applications. A number of typical system approaches and methods are then discussed in Section 2.3, while I describe commonly utilized speech corpora and performance measures in Sections 2.4 and 2.5. In Section 2.6, I will describe a variety of intrinsic factors that contribute to variations both within an individual speaker and across diﬀerent speakers, and consider the potential impacts of such speaker characteristics, before concluding with an overview of relevant error analyses of speaker recognition systems in Section 2.7.
2.2 Speech Features
The process of parameterizing a raw input, for example, speech, is referred to as feature extraction. For speech processing, low-level features are those based directly on frames of the speech signal, where frames correspond to a moving window, typically 25 ms long, with a given step size of typically 10 ms. A length of 25 ms and step size of 10 ms corresponds to an overlap of 15 ms between speech frames. High-level features, on the other hand, usually incorporate information from more than just one frame of speech, and include, for example, speaker idiosyncrasies, prosodic patterns, pronunciation patterns, and word usage. The type of low-level acoustic features most often used in speaker recognition tasks are Mel-frequency cepstral coefficients, or MFCCs, which are described in Section 2.2.1. Section 2.2.2 provides a brief introduction to other acoustic and prosodic features, such as formant frequencies. Finally, Section 2.2.3 introduces various types of speech segments, which may be used to calculate different types of features.
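The framing described above (a 25 ms window moved in 10 ms steps, giving a 15 ms overlap) can be sketched as follows. The function name and its parameters are assumptions made for illustration, and frames that would run past the end of the signal are simply dropped, which is one of several possible conventions.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0, step_ms=10.0):
    """Split a sequence of samples into overlapping frames.

    samples: sequence of audio samples.
    sample_rate: samples per second.
    Trailing samples that do not fill a whole frame are discarded
    (an assumed convention).
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += step
    return frames
```

For example, at an assumed sampling rate of 8000 Hz, each frame holds 200 samples and consecutive frames start 80 samples apart, so adjacent frames share 120 samples, which is 15 ms of speech.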
2.2.1 Cepstral Features
MFCCs are generated by the process shown in Figure 2.1. First, an optional pre-emphasis filter is applied, to enhance the higher spectral frequencies and compensate for the unequal perception of loudness at different frequencies. Next, the speech signal is windowed as described above and the squared magnitude of the fast Fourier transform (FFT) is calculated for each frame. A Mel-frequency triangular filter bank is then applied, where Mel refers to an auditory scale based on pitch perception. There are different versions of the transformation from linear frequency scale to Mel frequency. One example, taken from , is given by
where S_k are the log-spectral vectors from the previous step, K is the total number of log-spectral coefficients, and L is the number of coefficients to be kept (this is called the order of the MFCCs), with L ≤ K.
Furthermore, an energy term and/or its derivative can also be included in the feature parameterization.
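To make the last two steps of the MFCC process more concrete, the following Python sketch shows one widely used Mel conversion and one common form of the discrete cosine transform written in terms of S_k, K, and L. Both formulas are assumptions: the text notes that several versions of the Mel transformation exist, and the ones below are not necessarily those of the cited reference. The function names are illustrative.

```python
import math


def hz_to_mel(frequency_hz):
    """One common Hz-to-Mel conversion (assumed; other versions exist)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def cepstral_coefficients(log_spectrum, order):
    """Apply a discrete cosine transform to log filter-bank outputs.

    log_spectrum: the K log-spectral values S_1..S_K for one frame.
    order: the number L of coefficients to keep, with L <= K.
    Assumed form: c_n = sum_{k=1..K} S_k * cos(n * (k - 1/2) * pi / K),
    for n = 1..L.
    """
    K = len(log_spectrum)
    if order > K:
        raise ValueError("order L must not exceed K")
    coefficients = []
    for n in range(1, order + 1):
        c_n = sum(
            s_k * math.cos(n * (k - 0.5) * math.pi / K)
            for k, s_k in enumerate(log_spectrum, start=1)
        )
        coefficients.append(c_n)
    return coefficients
```

With this form, a perfectly flat log spectrum yields coefficients of zero for every n from 1 to L, since the cosine terms cancel, which reflects the fact that the coefficients capture spectral shape rather than overall level.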
Other commonly used cepstral features include linear-frequency cepstral coefficients (or LFCCs), which use a linear rather than Mel-based filter bank, as well as features based on linear prediction, such as linear prediction cepstral coefficients (LPCCs) and perceptual linear prediction features (PLPs).
2.2.2 Other Acoustic and Prosodic Features
Formant frequencies correspond to resonances of the vocal tract and can often be measured in spectrograms by amplitude peaks in the frequency spectrum. Vowels in particular can be largely characterized by the first and second formants, though any voiced speech segment will produce formants.
The fundamental frequency, or f0, is an acoustic property corresponding to the lowest harmonic in the frequency spectrum. Pitch and fundamental frequency are often used interchangeably as terms, though pitch is an auditory property that is perceived by human listeners, who place sounds on a pitch scale ranging from low to high. The intonation of speech is the pitch pattern. Jitter is a term to describe varying pitch in the voice. A related feature is shimmer, which describes varying loudness in the voice.
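Jitter and shimmer can be quantified in several ways. The sketch below uses one common "local" definition: the mean absolute difference between consecutive values, divided by the mean value. Applied to consecutive pitch periods it gives a jitter measure, and applied to consecutive peak amplitudes it gives a shimmer measure. The function name and input layout are assumptions, and the extraction of periods and amplitudes from the speech signal is not shown.

```python
def relative_perturbation(values):
    """Mean absolute difference of consecutive values, relative to the mean.

    values: per-cycle measurements, e.g. pitch periods in seconds (jitter)
    or peak amplitudes (shimmer). Hypothetical input layout; values are
    assumed to be positive.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    differences = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_difference = sum(differences) / len(differences)
    mean_value = sum(values) / len(values)
    return mean_difference / mean_value
```

A perfectly steady voice would give a value of zero for both measures, and larger values indicate more cycle-to-cycle variation.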
Other commonly used prosodic features include energy distributions and dynamics, and duration and timing information, such as speech rate or average duration of various speech segments. Prosody will be revisited in more detail in Section 2.6.1.
2.2.3 Speech Segments
One concept that arises when considering higher-level features is that of speech segments. The basic linguistic unit of speech is that of a phone, which corresponds to a vowel or consonant speech sound that may be described in terms of articulatory movements and acoustic properties. Phonemes are sounds that are used to differentiate words. For instance, in the words got and not, /g/ and /n/ are two different phonemes that lead to different meanings. Phonemes may be pronounced in different ways, leading to different phones that are all instances of the same phoneme; although there are differences in pronunciation of these phones, their meaning does not change. In the remainder of this thesis, the term phone is used to refer to phoneme.
Going beyond the phone, segments may be defined as groups of phones or syllables, as well as words and sentences.
All of these types of segments may be used as the basis for calculating various types of features.