«A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering ...»
2.3 System Approaches and Methodologies
There are a number of statistical and discriminative-training based methods that have been explored for the speaker recognition task. Two of the most successful modeling approaches that have been used are the Gaussian mixture model (GMM) and the support vector machine (SVM), which are discussed here. Other techniques have utilized hidden Markov models (HMMs), artificial neural networks such as multi-layer perceptrons (MLPs), or vector quantization (VQ).
for a mixture of N Gaussians.
In a speaker recognition setting, there are several GMM approaches that can be taken.
Here, only the currently prevalent approach, referred to as UBM-GMM, is described. Two GMM models are needed: one for the target speaker and one for the background model. Using training data from a large number of speakers, a speaker-independent universal background model, or UBM, is generated. The UBM training data is a type of system-level training data, which is chosen to be completely disjoint from the training data used to train target models for a given set of trials. So that all target speaker models lie in the same space and can be compared with one another, the speaker-dependent models (using the corresponding target speaker training data) are adapted from the UBM using maximum a posteriori (MAP) adaptation. For a given test utterance X, and a given target speaker, a
log likelihood ratio (LLR) can then be calculated:
$$\mathrm{LLR}(X) = \log p(X \mid \lambda_{\mathrm{target}}) - \log p(X \mid \lambda_{\mathrm{UBM}}) \tag{2.7}$$
Comparing the LLR to a threshold, Θ, determines the decision made about the test speaker's identity: if LLR(X) > Θ, the test speaker is identified as a true speaker match; otherwise, the test speaker is determined to be an impostor. The LLR is the score for the UBM-GMM system.
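The scoring rule in Equation 2.7 can be illustrated with a minimal sketch. It assumes diagonal-covariance GMMs, treats frames as independent so that the utterance log-likelihood is a sum over frames, and takes both models as already trained; the function names and the tuple layout `(weights, means, variances)` are illustrative choices, not part of the original system.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Total log-likelihood of frames X (T x D) under a diagonal-covariance GMM."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]                           # (T, N, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (N,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, N)
    log_mix = np.log(weights) + log_comp
    # Log-sum-exp over mixture components, computed stably.
    peak = log_mix.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(log_mix - peak).sum(axis=1))
    return float(frame_ll.sum())

def llr_score(X, target, ubm):
    """LLR(X) = log p(X | target model) - log p(X | UBM)."""
    return gmm_log_likelihood(X, *target) - gmm_log_likelihood(X, *ubm)

def accept(X, target, ubm, theta=0.0):
    """True if the test speaker is accepted as the target (LLR above threshold)."""
    return llr_score(X, target, ubm) > theta
```

In practice the threshold `theta` would be tuned on held-out trials; the default of 0 here is only a placeholder.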
The SVM is used in speaker recognition by taking one or more positive examples of the target speaker, as well as a set of negative examples of impostor speakers, and producing a hyperplane decision boundary. Since there are far more impostor speaker examples than target speaker examples, a weighting factor is typically used to make the target example(s) count as much as all of the impostor examples. Once the hyperplane for a given target speaker is known, the test speaker can be classiﬁed as belonging to either the target speaker or impostor speaker class. Instead of a log likelihood ratio, a score can be produced by using the distance of the test data from the hyperplane boundary.
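As a rough illustration of the weighting and scoring described above, the sketch below trains a linear SVM with a weighted hinge loss and scores a test vector by its signed distance from the hyperplane. Plain subgradient descent stands in for a proper SVM solver, and the hyperparameters (`lam`, `lr`, `epochs`) are arbitrary assumptions chosen for a toy problem.

```python
import numpy as np

def train_weighted_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM via subgradient descent on a class-weighted hinge loss.

    y holds +1 for target examples and -1 for impostor examples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Weight target examples so that together they count as much as all impostors.
    pos_weight = np.sum(y < 0) / np.sum(y > 0)
    sample_w = np.where(y > 0, pos_weight, 1.0)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                    # examples violating the margin
        coef = sample_w * active * y
        w -= lr * (lam * w - (coef @ X) / len(y))
        b -= lr * (-coef.sum() / len(y))
    return w, b

def svm_score(x, w, b):
    """Signed distance of x from the hyperplane; positive means target side."""
    return float((x @ w + b) / np.linalg.norm(w))
```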
2.3.3 A Brief Historical Overview of Types of Systems
Automatic speaker recognition systems can be categorized by the type of features they use and by the type of statistical modeling tool that they use. Features may range from low-level and short-term (based directly on the acoustics of the speech) to higher-level features that span longer stretches of time, such as prosodic, lexical, or semantic features. MFCCs are an example of low-level, short-term features, while phone n-gram counts are higher-level, longer-term features. The overview of systems provided here, while not exhaustive, covers a variety of feature types and statistical learning methods, and is intended to give an idea of the range of approaches that have proven successful. In some cases, although a system alone may not perform very well compared to other systems, it may still be useful as a contributor to a system fusion.
One conventional approach that has already been described in Section 2.3 is the cepstral GMM system [64, 61]. The cepstral SVM system utilizes a generalized linear discriminant sequence kernel to train an SVM classifier on a sequence of input cepstral features.
Some methods attempt to combine the advantages of the generative modeling of GMMs with the discriminative power of SVMs. One such approach is an SVM classifier that uses GMM supervectors as features. The supervectors are the concatenated mean vectors from a GMM that has been MAP-adapted from a UBM to a speaker's data, with the idea that this mapping from an utterance into a high-dimensional supervector space is similar to an SVM sequence kernel.
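The construction of a supervector can be sketched as follows. The example assumes a diagonal-covariance UBM and mean-only relevance MAP adaptation, which is a common choice but not necessarily the exact recipe of the cited systems; the relevance factor of 16 is likewise an assumption.

```python
import numpy as np

def map_adapt_means(X, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM to frames X (T x D)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]
    log_post = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
                - 0.5 * np.sum(diff ** 2 / variances, axis=2))        # (T, N)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)       # per-frame responsibilities
    n = post.sum(axis=0)                          # soft counts per component
    data_mean = (post.T @ X) / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + relevance))[:, None]        # data-dependent adaptation weight
    return alpha * data_mean + (1.0 - alpha) * means

def supervector(adapted_means):
    """Concatenate the N adapted mean vectors into one (N * D)-dimensional vector."""
    return np.asarray(adapted_means).reshape(-1)
```

Components that see little data keep means close to the UBM, so supervectors from different speakers differ mainly in the components their speech actually occupies.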
Another successful approach is the MLLR-SVM system, which uses maximum-likelihood linear regression (MLLR) transforms from a speech recognition system as features for speaker recognition [69, 68]. In the context of a speech recognition system, MLLR applies an affine transform to the Gaussian mean vectors in order to map speaker-independent means to speaker-dependent means. The coefficients from one or more of these MLLR adaptation transforms are used in an SVM speaker recognition system with very good results.
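To make the feature construction concrete, the sketch below applies an affine transform (A, b) to a set of Gaussian means and flattens one or more such transforms into a single SVM input vector. Estimating the transforms themselves requires a speech recognizer and is not shown; the function names are illustrative.

```python
import numpy as np

def apply_mllr(means, A, b):
    """Map speaker-independent means (N x D) to speaker-dependent means: mu' = A mu + b."""
    return np.asarray(means, dtype=float) @ A.T + b

def mllr_feature_vector(transforms):
    """Flatten a list of (A, b) transforms into one feature vector for an SVM."""
    return np.concatenate([np.hstack([A, b[:, None]]).reshape(-1) for A, b in transforms])
```

With D-dimensional means, each transform contributes D * (D + 1) coefficients to the feature vector.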
One type of non-acoustic feature is the word n-gram, where n-gram can encompass unigrams, bigrams, and so forth. The motivation for using such a feature for speaker recognition is that there are idiolectal differences among speakers, i.e., speakers vary in their word usage. Speaker-dependent unigram and bigram language models were first used in a target to background likelihood ratio framework, with promising results.
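A toy version of the target-to-background likelihood ratio over word n-grams might look like the following. Add-one smoothing, joint (rather than conditional) n-gram probabilities, and per-n-gram averaging are simplifying assumptions made here for brevity and are not taken from the cited work.

```python
import math
from collections import Counter

def ngrams(words, n):
    """All contiguous n-grams of a word list, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def train_ngram_model(words, n, vocab_size):
    """Return a function giving add-one-smoothed n-gram probabilities."""
    counts = Counter(ngrams(words, n))
    total = sum(counts.values())

    def prob(gram):
        return (counts[gram] + 1) / (total + vocab_size ** n)

    return prob

def ngram_llr(test_words, target_model, background_model, n):
    """Average log likelihood ratio of the test n-grams, target vs. background."""
    grams = ngrams(test_words, n)
    return sum(math.log(target_model(g) / background_model(g)) for g in grams) / len(grams)
```

A speaker who habitually says "you know" would raise the target probability of that bigram above the background probability, giving a positive score for test speech containing it.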
CHAPTER 2. BACKGROUND 11
There are also phone-based approaches.
Similar to the word n-gram modeling, the phone n-gram system first used frequency counts of phone n-grams, where phones are found using a phone recognizer, or possibly phone recognizers for multiple languages, in a likelihood ratio framework. The use of phonetic information was extended in a number of techniques, including the use of binary trees, cross-stream modeling, and SVMs [13, 29]. Another example is a pronunciation modeling approach, where word-level automatic speech recognition (ASR) phone streams are compared with open-loop phone streams.
Additional methods seek to take advantage of the speaker information present in words, by using word-conditioning. A keyword HMM system trains background HMMs for a number of keywords, and adapts them to the speaker; a likelihood ratio between the background and speaker models for each word is then calculated for a given test utterance, and the likelihood ratios are combined to produce a final system score. The word-conditioned phone n-gram system considers phone n-grams only for a specific set of keywords.
A number of approaches have used prosodic features, including pitch and energy distributions or dynamics, and prosodic statistics including duration and pitch related features. Nonuniform Extraction Region Features (NERFs) consider a number of features, including maximum or mean pitch, duration patterns, and energy contours, for various regions of speech, which are delimited by some sort of event, such as short pauses, long pauses, or schwas.
2.3.4 Channel Compensation Techniques
One obvious component of a speech signal that is unrelated to the speech (or speaker) itself is the channel on which the speech is recorded. Although most speech corpora have been collected using the telephone, there are different types of handsets, including cellular, and there has also been a recent collection of data using different types of microphones. The biggest effect of having different types of channels present in the data occurs when there is a channel mismatch between the training and test data. That is, if a system's target speaker model is trained using data from an electret telephone handset, for instance, but the test speech was collected from a carbon-button telephone handset, it will "sound" different to the system, even if the speaker is the same for both. In speaker recognition systems, the effects of channel variation are typically addressed using normalizations at the feature level, the model level, or the score level. Because these approaches operate in different domains and in different ways, they often yield further gains when applied on top of one another.
Historically, channel effects have been the dominating cause of errors in automatic speaker recognition tasks. In early speaker recognition work, mismatch in the type of telephone handset of train and test data caused error rates over four times as great as in the case of matched handsets. In the most recent 2010 NIST Speaker Recognition Evaluation, the effects of channel mismatch still exist, but to a far lesser extent, with very low overall error rates for the best systems, despite increased amounts of channel variability.
Feature-level Normalizations
Cepstral mean subtraction (CMS) is a fairly simple technique that is applied at the feature level.
CMS subtracts the time average from the output cepstrum in order to produce a zero mean log cepstrum. That is, for a temporal sequence of each cepstral coefficient $c_m$,

$$\hat{c}_m(t) = c_m(t) - \frac{1}{T} \sum_{\tau=1}^{T} c_m(\tau) \tag{2.9}$$

The purpose of CMS is to remove the effects of the transmission channel, yielding improved robustness. However, any non-linear channel effects will remain, as will any time-varying linear channel effects. Furthermore, CMS can remove some of the speaker characteristics, as the average cepstrum does contain speaker-specific information.
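A minimal sketch of CMS over a whole utterance is given below, assuming the cepstra are stored as a T x M array with one row per frame. It also shows why the technique helps: a fixed linear channel appears as a constant additive offset in the cepstral domain, which the subtraction removes.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-coefficient time average from a (T x M) cepstral sequence."""
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# A constant channel offset added to every frame vanishes after CMS.
clean = np.array([[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]])
channel_offset = np.array([5.0, -3.0])
same = np.allclose(cepstral_mean_subtraction(clean + channel_offset),
                   cepstral_mean_subtraction(clean))
```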
Another feature-level channel compensation method is feature mapping. Feature mapping aims to map features from different channels into the same channel-independent feature space. A channel-independent root GMM is trained, and channel-dependent background GMMs are adapted from the root. Feature-mapping functions are obtained from the model parameter changes between the channel-independent and channel-dependent models. The most likely channel is detected for the speaker data, which is then mapped to the channel-independent space. Adaptation to target speaker models is done using mapped features, and during verification, the mapped features of the test utterance are used for scoring.
The root GMM is used as the UBM for calculating the log likelihood ratios.
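One commonly described form of the mapping function selects the top-scoring Gaussian in the detected channel-dependent model and then shifts and scales the feature toward the corresponding root-model Gaussian. The sketch below assumes that form, with diagonal covariances and a one-to-one correspondence between components of the two models; the exact function used in the cited work may differ.

```python
import numpy as np

def map_feature(x, cd_weights, cd_means, cd_stds, ci_means, ci_stds):
    """Map one feature vector from a channel-dependent space to the root space."""
    x = np.asarray(x, dtype=float)
    # Weighted log-likelihood of x under each channel-dependent Gaussian
    # (the constant term shared by all components is omitted).
    log_lik = (np.log(cd_weights)
               - np.sum(np.log(cd_stds), axis=1)
               - 0.5 * np.sum(((x - cd_means) / cd_stds) ** 2, axis=1))
    i = int(np.argmax(log_lik))                  # top-scoring component
    return (x - cd_means[i]) * ci_stds[i] / cd_stds[i] + ci_means[i]
```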
Within-class covariance normalization (WCCN) is a feature normalization technique for SVM systems. In this method, a generalized linear kernel is trained, using class label information (i.e., a target or impostor speaker), in order to find orthonormal directions in the feature space that maximize information relevant to the task. The weights of those directions are optimized to minimize an upper bound on the error rate.
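In practice WCCN is often implemented in closed form: the expected within-class covariance matrix W is estimated from labeled data, and features are transformed by a matrix B satisfying B Bᵀ = W⁻¹, so that a linear kernel on the transformed features equals xᵀ W⁻¹ y. The sketch below assumes that closed form and a well-conditioned W; real systems usually add smoothing before inverting.

```python
import numpy as np

def within_class_covariance(X, labels):
    """Average of the per-class covariance matrices of the rows of X."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    W = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        centered = X[labels == c] - X[labels == c].mean(axis=0)
        W += centered.T @ centered / len(centered)
    return W / len(classes)

def wccn_transform(X, labels):
    """Lower-triangular B with B @ B.T = inverse within-class covariance."""
    W = within_class_covariance(X, labels)
    return np.linalg.cholesky(np.linalg.inv(W))
```

Features stored as rows are normalized with `X @ B`, after which the within-class covariance is the identity.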
Model-level Normalizations
Speaker model synthesis (SMS) is a GMM model-based technique that utilizes channel-dependent models. Rather than having one speaker-independent UBM, the SMS approach begins with a channel- and gender-independent root model, and then uses Bayesian adaptation to obtain channel- and gender-dependent background models. Channel-specific target speaker models are also adapted from the appropriate background model, after the gender and channel of the target speaker's training data have been detected. Furthermore, a transformation for each pair of channels is calculated using the channel-dependent background models; this transformation maps the weights, means, and variances of a model for channel a to the corresponding parameters of a model for channel b. During testing, if the detected channel of the test utterance matches the type of channel of the target speaker model, then that speaker model and the appropriate channel-dependent background model are used to calculate the LLR for that test utterance. On the other hand, if the detected channel of the test utterance is not a match to the target speaker model, then a new speaker model is synthesized using the previously calculated transformation between the target and test channels.
Then, the synthesized model and the corresponding channel-dependent background model are used to calculate the LLR for the test utterance.
Nuisance attribute projection (NAP) is another model-based technique, designed for use in SVM systems. This method aims to remove "nuisance" dimensions, that is, those irrelevant to the task of speaker recognition, by projecting points in the expansion space of the SVM onto a subspace designed to be more resistant to channel effects. A projection matrix is created (using a training data set) in order to minimize the average cross-channel distance, with a weight matrix which can be formulated to not only reduce cross-channel distances, but also increase cross-speaker distances. This minimization problem reduces to an eigenvalue problem, where the eigenvectors with the largest eigenvalues must be found.
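A simplified version of the eigenvalue problem is sketched below. It assumes the weight matrix only pairs sessions of the same speaker, in which case the nuisance directions are the leading eigenvectors of the within-speaker scatter matrix; the more general weighting that also increases cross-speaker distances is not shown.

```python
import numpy as np

def nap_projection(X, speaker_ids, k):
    """Projection P = I - U U^T removing the top-k within-speaker directions.

    X holds one expansion-space vector per row; several rows share a speaker.
    """
    X = np.asarray(X, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    centered = X.copy()
    for s in np.unique(speaker_ids):
        mask = speaker_ids == s
        centered[mask] -= X[mask].mean(axis=0)   # keep only session variability
    scatter = centered.T @ centered
    eigvals, eigvecs = np.linalg.eigh(scatter)   # eigenvalues in ascending order
    U = eigvecs[:, -k:]                          # eigenvectors with largest eigenvalues
    return np.eye(X.shape[1]) - U @ U.T
```

Applying `P` to both training and test vectors discards the directions along which recordings of the same speaker differ most.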
2.3.5 Current State-of-the-Art Systems
One current state-of-the-art approach utilizes joint factor analysis (JFA), which models speaker and session variability in GMMs. A target speaker GMM is adapted from a UBM, and the speaker is represented by the means, covariance, and weights of the GMM.
JFA assumes that a speaker- and channel-dependent supervector can be decomposed into the sum of a speaker supervector, s, and a channel supervector, c. Furthermore, the speaker supervector is modelled as s = m + Dz + Vy, where m is the speaker- and channel-independent supervector from the UBM, D is a diagonal matrix, V is a low-rank rectangular matrix, and y and z are independent normally distributed random vectors, with components corresponding to the speaker and residual factors, respectively. The channel-dependent supervector is modelled as c = Ux, where U is a low-rank rectangular matrix and x is a normally distributed vector whose components correspond to the channel factors. By estimating the speaker space matrix V, the channel space matrix U, and the residual matrix D, the speaker, channel, and residual factors can be calculated, and a score for a trial can be computed using a simple linear product. A simplified version of factor analysis can also be applied to a UBM-GMM system, using only the channel space matrix U, to do eigenchannel MAP adaptation [71, 48].
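The decomposition can be written out directly, as in the sketch below. It only composes a supervector from given matrices and factors; estimating V, U, and D and inferring the factors from data, which is the substantial part of JFA, is not shown. The diagonal matrix D is stored as a vector `d`, which is an implementation choice made here.

```python
import numpy as np

def jfa_supervector(m, V, y, d, z, U, x):
    """Compose M = s + c with s = m + V y + D z and c = U x.

    Returns the full supervector M, the speaker part s, and the channel part c.
    """
    s = m + V @ y + d * z      # speaker supervector (d is the diagonal of D)
    c = U @ x                  # channel supervector
    return s + c, s, c
```

Subtracting the estimated channel term `c` from an observed supervector leaves the speaker-dependent part, which is the intuition behind channel compensation in this model.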
Another current approach that developed from JFA is the i-vector system. In this method, the total variability is modeled in a single matrix, rather than in separate speaker and channel spaces, i.e., s = m + Tw, where T is the total variability matrix, and w is the i-vector (which stands for an intermediate size vector). The matrix T is trained in a similar way as V is in the previous approach, and i-vectors are extracted. Linear discriminant analysis (LDA) and WCCN are applied to the i-vectors as channel compensation, and a score is produced using cosine distance scoring.
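Cosine distance scoring itself is simple to state in code. The sketch below assumes the two i-vectors have already been extracted and passed through whatever channel compensation is in use; the threshold is a hypothetical parameter that would be set on development data.

```python
import numpy as np

def cosine_score(w_target, w_test):
    """Cosine similarity between a target i-vector and a test i-vector."""
    w_target = np.asarray(w_target, dtype=float)
    w_test = np.asarray(w_test, dtype=float)
    return float(w_target @ w_test / (np.linalg.norm(w_target) * np.linalg.norm(w_test)))

def accept_trial(w_target, w_test, threshold):
    """Accept the trial as a target match if the cosine score exceeds the threshold."""
    return cosine_score(w_target, w_test) > threshold
```

Because only the angle between the vectors matters, the score is insensitive to the overall magnitude of either i-vector.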