A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering
2.4 Speech Corpora

There are a number of conversational speech corpora used for speaker verification tasks. Older corpora include Switchboard-1, Switchboard-2, and Fisher [45, 46, 17]. They contain speech data collected from telephone conversations between pairs of speakers; these conversations are typically around 5 minutes in length, so each conversation side (i.e., the side of the conversation corresponding to one speaker) is roughly 2.5 minutes long.
In addition to landline telephone data, there is also a cellular telephone data set associated with Switchboard-2.
The National Institute of Standards and Technology (NIST) has coordinated Speaker Recognition Evaluations since 1997, and there are multiple corpora available from these evaluations; the most commonly used data sets correspond to the NIST 2004, 2005, 2006, 2008, and 2010 Speaker Recognition Evaluations (SREs) [50, 51, 52, 53, 54]. The evaluation data is taken from various stages of the larger Mixer collection [15, 16]. Each of the aforementioned SRE data sets includes conversational telephone speech. Conversational speech recorded on a variety of microphones was included starting in SRE05. SRE08 introduced a different style of speech, specifically that of an interview; in these cases, most speech belongs to the interviewee, though some interviewer speech may be present. I will refer to each speech sample or utterance, whether obtained from a conversation or an interview, as a conversation side.
2.5 Performance Measures for the Speaker Verification Task

The NIST Speaker Recognition Evaluations use two performance measures for speaker recognition systems, namely the detection cost function (DCF) and the equal error rate (EER). As mentioned previously, there are two types of errors that occur in speaker verification tasks: false acceptances, or false alarms, in which an impostor speaker is incorrectly verified as the target, and false rejections, or misses, in which a true speaker is rejected as the target. For every decision threshold, there will be false alarm and miss rates that indicate the probability of each type of error occurring.
The DCF is defined as a weighted sum of the miss and false alarm error probabilities:

DCF = C_Miss × P_Miss|Target × P_Target + C_FalseAlarm × P_FalseAlarm|NonTarget × (1 − P_Target)    (2.13)

In Equation (2.13), C_Miss and C_FalseAlarm are the relative costs of detection errors, and P_Target is the a priori probability of the specified target speaker. I will use the values from SRE08, namely, C_Miss = 10, C_FalseAlarm = 1, and P_Target = 0.01. When DCF is given here, it refers to the minimum possible DCF, i.e., to a cost that has been minimized over possible values of the decision threshold. The equal error rate (EER) is simply the rate at which the false alarm and miss probabilities are equal.
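As a concrete illustration, both measures can be computed from a set of trial scores by sweeping the decision threshold. The sketch below is illustrative only: the function names are my own, and using the observed scores as candidate thresholds is a simplifying choice, not part of the NIST tooling. The default cost parameters are the SRE08 values given above.

```python
def error_rates(scores, labels, threshold):
    """Miss and false alarm rates at one decision threshold.
    labels: True for target trials, False for impostor trials."""
    targets = [s for s, l in zip(scores, labels) if l]
    impostors = [s for s, l in zip(scores, labels) if not l]
    p_miss = sum(s < threshold for s in targets) / len(targets)
    p_fa = sum(s >= threshold for s in impostors) / len(impostors)
    return p_miss, p_fa

def min_dcf(scores, labels, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum of the detection cost function of Equation (2.13),
    taken over the candidate thresholds (here, the observed scores)."""
    costs = []
    for t in sorted(set(scores)):
        p_miss, p_fa = error_rates(scores, labels, t)
        costs.append(c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target))
    return min(costs)

def eer(scores, labels):
    """Equal error rate: the operating point where miss and false alarm
    probabilities are (approximately) equal."""
    best_gap, best_rate = None, None
    for t in sorted(set(scores)):
        p_miss, p_fa = error_rates(scores, labels, t)
        if best_gap is None or abs(p_miss - p_fa) < best_gap:
            best_gap, best_rate = abs(p_miss - p_fa), (p_miss + p_fa) / 2
    return best_rate
```

For a perfectly separable score set, both measures are zero; any overlap between target and impostor score distributions pushes both above zero.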
CHAPTER 2. BACKGROUND

The minimum DCF and EER capture only two possible operating points for a system. In order to get a better sense of how good a system is overall, there are detection error tradeoff (DET) plots, which plot the false alarm rate against the miss rate over the entire range of decision thresholds. By plotting both error rates on a normal deviate scale, the curve for a system whose target and impostor scores are normally distributed becomes approximately a straight line. The better the system, the closer the DET curve will be to the lower left of the plot (i.e., smaller error rates).
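To make the DET construction concrete, the sketch below (my own illustrative code, not NIST's plotting tool) sweeps the threshold to collect (false alarm, miss) pairs and maps each probability to a normal deviate using the standard library's `NormalDist`; plotting the transformed pairs produces the DET curve.

```python
from statistics import NormalDist

def det_points(scores, labels):
    """Sweep the decision threshold over all observed scores and return
    (false alarm rate, miss rate) pairs spanning the operating range.
    labels: True for target trials, False for impostor trials."""
    targets = [s for s, l in zip(scores, labels) if l]
    impostors = [s for s, l in zip(scores, labels) if not l]
    points = []
    for t in sorted(set(scores)):
        p_miss = sum(s < t for s in targets) / len(targets)
        p_fa = sum(s >= t for s in impostors) / len(impostors)
        points.append((p_fa, p_miss))
    return points

def probit(p, eps=1e-6):
    """Map a probability to a normal deviate (inverse standard normal CDF).
    DET plots use this scale on both axes, which is why systems with
    Gaussian score distributions trace nearly straight lines."""
    return NormalDist().inv_cdf(min(max(p, eps), 1 - eps))
```

Raising the threshold moves along the curve from the high-false-alarm end toward the high-miss end; the `eps` clipping simply keeps the transform finite at rates of exactly 0 or 1.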
2.6 Intrinsic Speaker Qualities

In general, a speech sample is affected by both intrinsic and extrinsic factors, where extrinsic factors include noise, room acoustics, and channel effects. Since the focus of my dissertation work is on intrinsic speaker characteristics, I now discuss a variety of issues and concepts relevant to a discussion of inherent speaker qualities. A brief overview of some of the major sources of variation within and among speakers is given in Section 2.6.1, including physical attributes, accent or dialect, prosody, and emotion. Additionally, in order to further explore the inherent difficulties of a speaker recognition task, the concept of the distinctiveness or recognizability of a speaker is covered in Section 2.6.2, along with various studies in which human listening has been applied to a speaker-related task. Finally, Section 2.6.3 presents work that deals with voice modifications attempted in order to fool an automatic speaker recognition system, as these studies are indicative of the effects that varying speaker characteristics can have.
2.6.1 Sources of Speaker Variation

Physical Attributes

At the most basic level, a person's voice is characterized by his vocal apparatus. The length of the vocal tract, the size of the vocal folds in the larynx, the size and shape of the nasal cavity, and other anatomical features all contribute to the acoustic properties of a person's speech, affecting formant frequencies of vowels, average pitch, pitch range, and qualities such as breathiness and nasality. While an individual has a certain amount of control over the frequency characteristics of his speech and can speak outside of his typical range of everyday speech frequencies, the effects of other physical attributes, such as the size and shape of the nasal cavity, cannot be manipulated.
A person's voice will also be affected by his age and health. Physical changes that occur as a child grows into an adult are the most obvious example of aging effects, especially for male voices. However, voice quality also changes as an adult grows older. Examination of voice spectrograms for a set of subjects over a period of years showed that the frequency of the point of concentration of formants and the mean pitch frequency decreased with increasing age, and the individual distribution curves of mean pitch frequency became narrower, i.e., the ability to vary fundamental frequency was lost in the aging process.
Furthermore, a person's health will impact the way his voice sounds; for instance, a cold may make the voice hoarse or more nasal.
Language, Dialect, and Accent

The language choice of a speaker is another source of speaker individuality. In the case of multi-lingual speakers, their native language will typically influence the way they produce the speech sounds of other languages, giving rise to a foreign accent.
Furthermore, word and phone pronunciation can vary widely, even within the same language, leading to accents among native speakers. There are many accents in English, for instance: not only are there British, American, and Australian accents, but there are local regional accents within each of those groups.
In addition to different accents, languages often include different dialects, which may vary in the usage of certain words or grammatical forms, as well as word pronunciations. Variations in dialect may reflect geographical, age, socio-economic, or educational differences between speakers.
Variability in Speech Production and Prosody

Humans are able to listen to speech and identify the words and phones that are spoken. However, the same word or phone may be produced in varying ways. Speakers will differ in the precise ways of articulating a sound, as well as the degree of coarticulation between consecutive sounds. Speech rate, often measured by the number of words or phones per second, is another characteristic that will vary from speaker to speaker.
In linguistics, prosody refers to various acoustic properties of speech that can convey additional information about the utterance or speaker. Types of prosodic information include loudness, pitch, tone, intonation, rhythm, and lexical stress. Variations in prosody may indicate things such as sarcasm, speaker emotion, emphasis, or whether an utterance is a statement or a question. Furthermore, prosody is suprasegmental, meaning that prosodic features are not limited to any one segment, but occur at a higher level, across multiple segments.
The concept of speech rhythm involves a number of timing parameters, including the tempo, pauses, and various durational patterns, which may, for example, be measured as the mean and standard deviation of word or phone lengths. The prosodic tendencies of a given speaker help to define his speaking style. Additional lexical information, such as word usage and the relative frequency of disfluency classes (including pause-fillers, discourse markers, or backchannel expressions), can also contribute to a speaker's individual speaking style. As described in Section 2.3.3, several of the higher-level systems for speaker recognition attempt to capture such individual variations in order to differentiate between speakers.
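As an illustration of such timing measurements, the sketch below computes a few simple rhythm statistics from a list of phone durations. The function name, feature names, and inputs are hypothetical, chosen only to mirror the quantities described above (speech rate, and the mean and standard deviation of segment lengths).

```python
from statistics import mean, stdev

def rhythm_features(phone_durations, total_seconds):
    """Simple timing statistics of a speech sample: speaking rate
    (phones per second over the whole sample, pauses included) and the
    mean and standard deviation of phone durations (in seconds)."""
    return {
        "phones_per_sec": len(phone_durations) / total_seconds,
        "mean_phone_dur": mean(phone_durations),
        "std_phone_dur": stdev(phone_durations),
    }

# Hypothetical usage: three phones within a one-second stretch of speech.
feats = rhythm_features([0.1, 0.2, 0.3], total_seconds=1.0)
```

A slow, deliberate speaker would show a lower `phones_per_sec` and longer mean durations than a fast one; larger `std_phone_dur` reflects more variable durational patterns.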
Emotion

The emotional state of a speaker can also impact the characteristics of his speech.
A number of acoustic parameters can be involved in conveying an emotion: the level, range, and contour of the fundamental frequency (perceived as pitch); the vocal energy or amplitude (perceived as voice intensity); the energy distribution across the frequency spectrum (perceived in voice quality or timbre); formant location (related to articulation perception); and a number of timing parameters, such as tempo and pauses.
As an example, joy typically manifests in speech as increases in the mean, range, and variability of fundamental frequency, along with an increase in mean energy. Joy may also cause a higher rate of articulation.
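The pitch-related quantities mentioned here (mean, range, and variability of fundamental frequency) are straightforward to state as code. The sketch below assumes a hypothetical per-frame F0 track in Hz with unvoiced frames marked as 0, a common convention in pitch trackers; the function and field names are my own.

```python
from statistics import mean, pstdev

def f0_statistics(f0_track_hz):
    """Mean, range, and variability of F0 over the voiced frames of a
    pitch track; unvoiced frames (F0 == 0) are excluded."""
    voiced = [f for f in f0_track_hz if f > 0]
    return {
        "mean_hz": mean(voiced),
        "range_hz": max(voiced) - min(voiced),
        "std_hz": pstdev(voiced),
    }
```

Under the description above, a joyful rendition of an utterance would tend to show larger values of all three statistics than a neutral rendition by the same speaker.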
2.6.2 Speaker Recognizability or Inherent Challenges

A concept that is related to inherent speaker characteristics is the recognizability of a person's voice. One human listening experiment asked subjects to rate the distinctiveness of different speakers on a seven-point scale describing how easy or hard the voice would be to remember. An error analysis of a speaker recognition system that will be discussed in Section 2.7 also attempted to find speakers who were hard for the system to recognize. Though the results of human listening tasks may not always correspond to results obtained by automatic systems, they provide insight into the nature of challenges inherent to speaker recognition tasks.
Speaker verification by human listeners was compared to machine performance using NIST 1998 Speaker Recognition Evaluation data. The human task was designed to emulate the paradigm of the NIST evaluation as closely as possible, though human constraints due to memory and fatigue imposed a limit on both the number of trials and the length of speech samples. Listeners were asked to make a same- or different-speaker discrimination with confidence ratings (10 levels). Results showed that human listening, when individual decisions were combined, was comparable to or even better than typical computer algorithms, especially in the case of mismatched train and test handsets.
Recently, the 2010 NIST Speaker Recognition Evaluation included a human-assisted speaker recognition task. Participating sites evaluated a subset of trials, selected to be difficult, using any human-assisted technique, including listening and examination of spectrograms or other features. The decision could be based on a group of humans, with no restriction on the use of experts or naive listeners. Analysis of results showed that this was largely a challenging task for humans, with fairly high error rates on many of the selected trials. For these difficult trials, automatic systems performed better than humans.
A study of voice identification by human listeners, relating to the reliability of the testimony of an earwitness (in a legal setting), examined a variety of issues, including familiar versus unfamiliar voices, the reliability or accuracy of voice identification, reliability as a function of time, and reliability as a function of whether or not the listener is trying to remember the voice. Examination of various studies yielded a number of conclusions. First, the length of the heard speech does not seem to have a very large effect. Voice disguise and even unintentional changes in tone were found to greatly reduce identification accuracy, even under ideal conditions. When comparing incidentally and intentionally memorized voices, there was little evidence that voice identifications by witnesses who were unprepared or had little time to initiate efficient encoding strategies would be reliable. In terms of the delay between hearing the initial speech and making a voice identification, the greater the delay, the greater the likelihood of error and unreliability. Examination of the relationship between witness accuracy and confidence level showed promising but inconclusive results.
2.6.3 Voice Modifications

As mentioned in Section 2.6.1, speakers can manipulate their voices in certain ways, even if they cannot change certain physical attributes, like vocal tract length or the size and shape of the nasal cavity. Changes in a speaker's voice, intentional or not, can impact speaker recognition performance.
One early study examined the effects of voice disguise and voice imitation on spectrograms. For voice disguise, subjects kept the speech content the same across samples, but were allowed to differ from their normal voice in terms of pitch frequency, rate of articulation, pronunciation, and dialect. Comparison of the formant positions indicated that the formants could be shifted higher or lower than in the normal voice, though the first formant was comparatively stable. In terms of voice imitation, the imitator was able to vary his mean fundamental frequency considerably in order to be more similar to a target, though he was generally unable to precisely match the formants or instantaneous fundamental frequencies of the speaker being imitated. It makes sense that the imitator could successfully change his overall average fundamental frequency, even if precise instantaneous fundamental frequencies could not be matched, given that the imitator is changing his voice according to his memory of the perceived pitch of the target speaker (which may not match the actual instantaneous values). Similarly, although formant frequencies can potentially be changed, a speaker has certain habits of articulating speech sounds (leading to certain formant frequencies) that are often difficult to manipulate consciously over a continuous speech utterance.
The imitator was largely successful in imitating the speech melody of a given target.