A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Engineering - Electrical Engineering ...
A later study examining mimicry also aimed to determine how closely an impersonator could match certain acoustic parameters of his speech to those of the target figure. The professional impersonation artist was given three excerpts of speech from well-known figures and asked to imitate these speakers as closely as possible in terms of voice quality, speech style, and speech rate. A comparative recording of the same speech material was made with the artist using his natural voice and speaking style, in order to determine the extent to which the artist had to change his voice. The impersonator was able to successfully change his global speech rate, though he had less control over more local articulatory timing. He also successfully matched global fundamental frequency, both increasing and decreasing his mean fundamental frequency (by 15-30 Hz) to do so. The impersonator had varying degrees of success at matching the first three formant frequencies of his speech to the targets.
There have also been a number of studies exploring the effects of voice modification on automatic speaker recognition systems. The effects of intentional voice alterations (such as changing pitch or adopting an accent) were tested both in human listening experiments and on an automatic speaker recognition system. The speech was collected from normal subjects (that is, people who are not professional or expert mimics) in a setting that simulated a telephone conversation. Speakers were asked to disguise their voices in a variety of ways, including changing pitch, changing duration, and mimicking an accent.
Automatic speaker recognition performance using a cepstral UBM-GMM system was evaluated for two conditions: training and test data from normal voice; and training from normal voice with testing on disguised voice. The normal-normal condition produced an EER of almost 0%, while the normal-disguised condition had an EER of 7.5%. Moreover, applying the decision threshold from the normal-normal system to the normal-disguised trials raised the false rejection rate from 7% to 40%, suggesting that such systems are vulnerable to intentional voice disguise. A human listening experiment asked subjects to listen to two samples of about 5 seconds of speech and decide whether the utterances were spoken by the same speaker; if unsure, listeners could hear additional 5-second utterances, up to a limit of 20 seconds, at which point they had to make a final decision. The results indicated that in the normal-normal condition, automatic performance was similar to the lower quartile of human performance, though the automatic system performed better than the humans in the normal-disguised case.
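The EER figures quoted here are the operating points where the false rejection and false acceptance rates coincide. A minimal sketch of how an EER can be estimated from lists of trial scores (the function name and the simple threshold sweep are illustrative, not the evaluation's actual scoring code):

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Sweep a decision threshold over the observed scores and return the
    error rate at the point where false rejection and false acceptance
    rates are closest to equal."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        frr = float(np.mean(target_scores < t))     # targets wrongly rejected
        far = float(np.mean(impostor_scores >= t))  # impostors wrongly accepted
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2.0
    return eer
```

With perfectly separated score distributions this returns 0.0, matching the near-0% EER of the normal-normal condition; overlapping distributions push the EER up, as in the normal-disguised condition.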
Another study investigated the effects of a transfer function-based voice transformation on automatic speaker recognition performance. In the source-filter model of speech production, speech is modeled as the convolution of a sound source (i.e., the vocal cords) and a linear acoustic filter (i.e., the vocal tract). In the spectral domain, a speech signal X is then given by X(f) = H(f)S(f), where S(f) is the Fourier transform of the source signal and H(f) is the transfer function corresponding to the filter characteristics of a speaker; a transfer function is the mapping of input to output in the frequency domain for a linear time-invariant system (such as a filter). Given knowledge of the speaker recognition method, the voices of impostors were modified to target a specific speaker. By transforming the impostor speech to match the transfer function of a targeted speaker, the authors were able to increase the false alarm rate of the system from less than 1% to 97% when using the targeted speaker's training utterance, and to 50% when using a different utterance of the targeted speaker. A previous study also tested computer voice-altered impostors, using a speech synthesis algorithm to model the spectral characteristics of a target voice.
In this case, the false acceptance rate increased from 1.5% to 86%.
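The source-filter relation X(f) = H(f)S(f) can be checked numerically: multiplying the source spectrum by the transfer function is equivalent to convolving the source with the filter's impulse response. A toy numpy sketch (the signals are random stand-ins, not speech):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
s = rng.standard_normal(n)             # toy "glottal source" signal
h = np.array([1.0, 0.5, 0.25, 0.125])  # toy "vocal tract" impulse response

# Frequency domain: multiply source spectrum S(f) by transfer function H(f).
N = n + len(h) - 1  # zero-pad so circular convolution equals linear convolution
x_freq = np.fft.ifft(np.fft.fft(s, N) * np.fft.fft(h, N)).real

# Time domain: convolve the source with the filter's impulse response.
x_time = np.convolve(s, h)

assert np.allclose(x_freq, x_time)
```

A transfer-function attack in this framework amounts to replacing H(f) with an estimate of the target speaker's transfer function before resynthesis.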
2.7 Speaker Recognition Error Analysis

2.7.1 A Speaker Menagerie

One of the inspirations for this thesis is the work of Doddington et al., who classified speakers into groups according to the types of speaker recognition errors they cause. Four types of speakers are defined: "goats," speakers who cause a large number of false rejections as a target speaker; "lambs," speakers who cause a large number of false accepts as a target; "wolves," speakers who cause a large number of false accepts as an impostor test speaker; and "sheep," the default type of speaker. Through the use of statistical tests, the presence of goats, lambs, and wolves was shown for a UBM-GMM system using data from NIST's 1998 Speaker Recognition Evaluation, for female speakers only.
The score for each trial of target-test pairs was considered a function of the test speaker index j and the model speaker index k. Thus, the score probability density function for a given test speaker j and model speaker k is f_s(·|j, k). By asserting the null hypothesis that there are no speaker differences, the existence of goats, lambs, and wolves could be shown by considering different score distributions and disproving the null hypothesis. For the case of goats, the density function need only include the case where j = k, in which the density should not depend on k if goats do not exist; that is, without goats, the distribution of true speaker scores should be the same for each true speaker. For the lamb and wolf analyses, the case of interest is j ≠ k, in which the density should not depend on k if lambs do not exist, and should not depend on j if wolves do not exist. That is, if there are no lambs, the distribution of impostor scores should be the same regardless of the model speaker, while if there are no wolves, the distribution of impostor scores should be the same regardless of the test speaker.
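The three conditional score sets used in the menagerie analysis can be sketched by splitting trials on whether j = k (the speaker indices and score values below are invented for illustration):

```python
from collections import defaultdict

# Toy trial scores indexed by (test speaker j, model speaker k).
scores = {
    (1, 1): [2.1, 1.8], (2, 2): [0.3],     # true-speaker trials, j == k
    (1, 2): [-1.0], (2, 1): [-0.5, -0.7],  # impostor trials, j != k
}

true_by_speaker = defaultdict(list)    # goat analysis: f_s(.|j, k) with j == k
impostor_by_model = defaultdict(list)  # lamb analysis: grouped by model speaker k
impostor_by_test = defaultdict(list)   # wolf analysis: grouped by test speaker j

for (j, k), vals in scores.items():
    if j == k:
        true_by_speaker[j].extend(vals)
    else:
        impostor_by_model[k].extend(vals)
        impostor_by_test[j].extend(vals)
```

Each of the three groupings then yields one family of per-speaker distributions whose speaker-independence is what the null hypothesis asserts.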
For goats, the analysis comprised computing means and variances for the sets of scores belonging to the same true speaker, and then determining whether the means and variances depend on the speaker. Under the assumption that they do not, only 5% of the true speaker score means should lie outside the 2.5th and 97.5th percentiles of a hypothetical speaker-independent underlying score distribution with appropriate mean and variance; if this does not hold, then the speakers below the hypothetical 2.5th percentile can be categorized as goats. The results showed that there were, in fact, more outliers than could be accounted for by a single speaker-independent distribution.
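The percentile comparison can be sketched as follows. The single normal fit to the pooled scores and the per-speaker scaling of the null distribution of the mean are simplifying assumptions of this illustration, not the exact procedure of the original study, and the scores in the test data are invented:

```python
import numpy as np
from scipy.stats import norm

def goat_candidates(true_scores_by_speaker, alpha=0.05):
    """Flag speakers whose mean true-speaker score falls below the lower
    alpha/2 percentile of a hypothetical speaker-independent normal
    distribution fitted to all true-speaker scores. Under the null, the
    mean of n scores from one speaker is ~ N(mu, sigma^2 / n)."""
    pooled = np.concatenate([np.asarray(v, float)
                             for v in true_scores_by_speaker.values()])
    mu, sigma = pooled.mean(), pooled.std(ddof=1)
    goats = []
    for spk, vals in true_scores_by_speaker.items():
        n = len(vals)
        cutoff = norm.ppf(alpha / 2, loc=mu, scale=sigma / np.sqrt(n))
        if np.mean(vals) < cutoff:
            goats.append(spk)
    return goats
```

If no speaker effect existed, only about alpha/2 of speakers would be flagged by chance; the study found far more such outliers than a single distribution could explain.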
For lambs, graphical analysis involved plotting the maximum impostor score for a model speaker against each true speaker score for that model speaker. Although this plot did not indicate any lamb sub-population of models in this analysis, the models with high maximum impostor score may be considered lamb-like.
For wolves, the maximum impostor score was computed for each test utterance, and then the means and variances of the sets of maximum impostor scores for the same test speaker were calculated. As with the distribution considered in the goat speaker analysis, the means are compared with the 2.5th and 97.5th percentiles of a hypothetical speaker-independent underlying score distribution; if more than 5% of the means lie outside these hypothetical percentiles, then there is a speaker dependence, and the test speakers with means above the hypothetical 97.5th percentile may be considered wolves. Once again, there were more outliers than could be accounted for by a single distribution, indicating the existence of wolf-ish speakers.
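The two-step wolf computation (maximum per test utterance, then per-speaker statistics of those maxima) can be sketched as follows; the trial scores are invented, and the final percentile comparison is omitted for brevity:

```python
import numpy as np

# Toy impostor trials: (test speaker, test utterance, model speaker) -> score.
trials = {
    ("w", "u1", "m1"): 3.0, ("w", "u1", "m2"): 4.0,
    ("w", "u2", "m1"): 3.5, ("w", "u2", "m2"): 2.5,
    ("s", "u3", "m1"): -1.0, ("s", "u3", "m2"): -2.0,
    ("s", "u4", "m1"): -1.5, ("s", "u4", "m2"): -0.5,
}

# Step 1: maximum impostor score per test utterance.
max_per_utt = {}
for (spk, utt, _model), score in trials.items():
    key = (spk, utt)
    max_per_utt[key] = max(score, max_per_utt.get(key, float("-inf")))

# Step 2: mean of those maxima per test speaker; wolves would be the speakers
# whose mean lands above the hypothetical 97.5th percentile.
maxima_by_speaker = {}
for (spk, _utt), m in max_per_utt.items():
    maxima_by_speaker.setdefault(spk, []).append(m)
mean_max = {spk: float(np.mean(v)) for spk, v in maxima_by_speaker.items()}
```

Here "w" consistently scores high against every model, so its mean maximum is well above that of "s", which is exactly the wolf-like pattern the test looks for.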
Furthermore, the F-test, the Kruskal-Wallis test, and the Durbin test were used to reject the null hypotheses at the 0.01 significance level for goats, lambs, and wolves. The F-test is a one-way analysis of variance used to determine statistically whether there is a speaker effect. It was applied to test for potential goats using all true speaker scores for each speaker; to test for potential lambs and wolves, the scores corresponding to the same model-test speaker pair were first averaged (over all test utterances), and then all impostor trials were used for the model speakers (in the lamb case) or the test speakers (in the wolf case). The Kruskal-Wallis test is also a one-way analysis of variance, but it is non-parametric and uses ranks. For speakers with at least 5 true speaker trials, all the true speaker scores were used (goats). As with the F-test, the impostor scores were averaged for each model-test speaker pair before the test was applied (for lambs and wolves). Ranks are assigned to all of the mean scores, and the ranks are summed for each speaker. Finally, the Durbin test is a two-way analysis of variance by ranks, and was applied only to impostor scores (for lamb and wolf testing), for which the data could be viewed as conditioned on the two different speakers (i.e., the model and test speakers for each impostor score). As with the previous tests, impostor scores were first averaged across test utterances, and then the Durbin test assigned ranks to the averaged scores. The ranks were then summed for each test or model speaker, corresponding to the lamb or wolf test, respectively.
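The first two of these tests are available directly in scipy (there is no built-in Durbin test, which would need a manual implementation). A sketch on synthetic per-speaker score groups, where one speaker is shifted downward to act as a goat so that both tests should reject the no-speaker-effect null:

```python
import numpy as np
from scipy.stats import f_oneway, kruskal

rng = np.random.default_rng(1)
# Five ordinary speakers plus one goat-like speaker (synthetic scores).
groups = [rng.normal(0.0, 1.0, 20) for _ in range(5)]
groups.append(rng.normal(-3.0, 1.0, 20))  # goat-like speaker, shifted mean

f_stat, f_p = f_oneway(*groups)  # parametric one-way ANOVA
h_stat, h_p = kruskal(*groups)   # non-parametric rank-based analogue
```

With a 3-sigma shift in one group, both p-values fall far below the 0.01 level used in the study, so both tests reject the null hypothesis of no speaker effect.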
Using the rank sums from the Durbin test, a mild correlation of about 0.26 was found between lambs and wolves. No correlations were found between goats and either lambs or wolves. Furthermore, the speakers were ranked according to how goat-like they were (using the Kruskal-Wallis test) and how wolf-like and lamb-like they were (using the Durbin test). A cumulative distribution of errors for the rank-ordered speakers then showed that the 25% most goat-like speakers contributed 75% of the false rejection errors, though false alarm errors were more evenly distributed across speakers.
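A correlation between per-speaker rank sums, such as the lamb-wolf correlation reported here, can be measured with a rank correlation coefficient. A sketch using Spearman's rho on invented rank sums (the numbers below are illustrative only and do not reproduce the study's 0.26 value):

```python
from scipy.stats import spearmanr

# Invented per-speaker Durbin rank sums for the lamb and wolf tests.
lamb_rank_sums = [10, 40, 25, 60, 35, 50]
wolf_rank_sums = [15, 30, 45, 55, 20, 40]

# Spearman's rho measures how correlated the two speaker orderings are.
rho, p_value = spearmanr(lamb_rank_sums, wolf_rank_sums)
```

A rho near zero would indicate that being lamb-like says nothing about being wolf-like; the study's value of roughly 0.26 suggests a mild relationship.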
2.7.2 Related Work

Poh et al. extended the work of Doddington et al. by developing a user-specific score normalization (referred to as F-norm's variant) to address "badly behaved" users of the system, i.e., those users who degrade system performance. Furthermore, for a multimodal biometrics context, Poh et al. developed a fusion technique that decides whether or not to fuse the output of several systems on a per-user basis.
For a closed-set speaker identification task, Jin and Waibel implemented a "naive de-lambing method" to reduce the effects of speakers who were likely to be identified as another speaker. In the context of a vector quantization (VQ) based technique, in which codebooks are trained for each speaker, Jin and Waibel found that for some speakers the closest match in cross-validation testing was not the correct speaker himself, and they thus developed a method for modifying the codebooks in such cases. Additionally, to further reduce the effects of lamb-like speakers, these lamb speakers were located in the set (using cross-validation testing), and a threshold was set on each lamb speaker's belief heuristic value, so that identification as that lamb speaker could occur only if the score was above the belief heuristic.
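The per-lamb threshold idea can be sketched as follows (the function, speaker names, and score values are invented for illustration, not Jin and Waibel's implementation):

```python
def identify(match_scores, lamb_thresholds):
    """Closed-set identification with per-lamb belief thresholds.
    A lamb-like speaker may be returned only if its match score clears
    that speaker's threshold; otherwise the next-best match is considered.

    match_scores: dict speaker -> match score (higher is better).
    lamb_thresholds: dict lamb speaker -> minimum score to accept."""
    for spk in sorted(match_scores, key=match_scores.get, reverse=True):
        if spk in lamb_thresholds and match_scores[spk] < lamb_thresholds[spk]:
            continue  # lamb matched too weakly; fall through to next-best speaker
        return spk
    return None
```

The effect is that a lamb-like speaker can no longer win an identification on a mediocre score, which is what makes such speakers attract false identifications in the first place.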
2.7.3 Session Variability

Beyond considering the effects of different types of speakers, there has also been work investigating the impact that the particular training and test utterances have on system performance. A UBM-GMM system with factor analysis on male telephone data from the 2008 NIST Speaker Recognition Evaluation was first analyzed with respect to performance dependence on the target speaker, focusing on the lambs and wolves of the aforementioned Doddington menagerie. Results showed an uneven distribution of false alarm errors, with 26% of the speakers causing 50% of the errors and the 6% worst speakers accounting for 17% of the errors. The distribution of false rejection errors was also uneven: 8% of the target speakers caused 50% of the false rejection errors, and 25% of these errors were due to 6% of the speakers.
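Statements like "x% of the speakers cause y% of the errors" come from a cumulative distribution over speakers sorted by error count. A sketch (the error counts are invented, not the study's data):

```python
import numpy as np

def speaker_fraction_for_error_share(errors_per_speaker, share):
    """Fraction of speakers (worst offenders first) needed to account for
    `share` of the total errors."""
    counts = np.sort(np.asarray(errors_per_speaker, float))[::-1]
    cumulative = np.cumsum(counts) / counts.sum()
    k = int(np.searchsorted(cumulative, share)) + 1  # speakers needed
    return k / len(counts)
```

For example, with per-speaker error counts [5, 3, 1, 1], a quarter of the speakers account for half of the errors, mirroring the kind of skew reported above.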
The study also investigated the effect of the training sample used for each target speaker.
Baseline performance corresponded to the training segment selected in the NIST evaluation.
The best and worst training utterances were also defined for each speaker by finding the utterance that minimized or maximized, respectively, the sum of the false acceptance and false rejection rates. The baseline NIST performance had an EER of 12.1%, while using the best training data yielded an EER of 4.1% and using the worst training data generated an EER of 21.9%. This variability in performance demonstrated that the choice of training segment can have a significant impact.
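The best/worst selection criterion can be sketched directly (utterance identifiers and rates below are invented for illustration):

```python
def rank_training_utterances(error_rates):
    """Pick the best and worst training utterance for a speaker by the sum
    of false acceptance and false rejection rates, as in the selection
    described above.

    error_rates: dict utterance id -> (false_acceptance, false_rejection),
    the rates obtained with the model trained on that utterance."""
    cost = {utt: fa + fr for utt, (fa, fr) in error_rates.items()}
    best = min(cost, key=cost.get)    # minimizes FA + FR
    worst = max(cost, key=cost.get)   # maximizes FA + FR
    return best, worst
```

Sweeping over all candidate training utterances per speaker in this way produces the best-case and worst-case EER bounds reported above.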
Additional work investigated possible causes for the variable performance. In particular, using data from NIST SRE08 as well as BREF 120, a French database of controlled read speech, the dependence of performance on the training session was further analyzed. When the train and test segments of the sets used in the aforementioned SRE08 work were switched, the ranking of performance remained the same. That is, the inverted case corresponding to the original worst training segments (which become test segments in the inversion) still had the highest EER (17%), and the inverted case corresponding to the original best training segments had the lowest EER (7.4%), with the inverted NIST set performing in between the two (at 13.5%). However, the differences in performance were smaller than in the original case, suggesting that the choice of training excerpts has a greater effect than the choice of testing excerpts.
Analysis of system performance on the BREF 120 database for both male and female speakers also showed a range of performance between choosing the best training utterances and the worst, with random selection of training segments yielding performance in between the best and the worst. The distribution of phonetic content across different training excerpts was examined as a possible contributing cause for the difference in performance.