
Uncertainty Reduction in Speaker Verification with Short Duration Utterances

Maghsoodi, Nooshin | 2019

  1. Type of Document: Ph.D. Dissertation
  2. Language: Farsi
  3. Document No: 53068 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Sameti, Hossein
  7. Abstract:
  8. Voice biometrics are used in today's telephone-based speaker verification because they uniquely suit remote access. Implementing such systems, however, poses significant challenges; one of them is the need for sufficient data in the enrollment phase: the system requires data that cover the phonetic variations of the language in order to discriminate between speakers. In real applications it is not practical to ask speakers to produce long utterances, so an ideal speaker verification system should be able to detect impostors without any constraint on the input lexicon, whether the utterances are long or short. Our early experiments showed that the performance of factor-analysis-based approaches such as the i-vector degrades dramatically when the input utterances are too short, and that reducing the resulting estimation uncertainty can improve verification accuracy. In this thesis we propose several methods for improving short-duration speaker verification, all based on approximating this uncertainty and then reducing it.
     We combine Hidden Markov Models (HMMs) with i-vector extractors to address text-dependent speaker recognition with random digit strings. In the first method, we examine ways to perform channel and uncertainty compensation and propose a novel way of using the uncertainty in the i-vector estimates (a sketch of the underlying i-vector posterior is given after the keyword list below). We employ digit-specific HMMs to segment the utterances into digits, to align frames to HMM states, and to extract Baum-Welch statistics. Making use of the natural partition of the input features into digits, we train digit-specific i-vector extractors on top of each HMM and extract well-localized i-vectors, each modelling only the phonetic content of a single digit. Experiments on RSR2015 Part III show that the proposed method attains an average relative EER improvement of 39% over the baseline and 12% over the state-of-the-art method. Similar conclusions are drawn from experiments on the RedDots corpus, where the same method is evaluated on phrases.
     We further propose a deep neural network with an additional fine-tuning step driven by a function of the uncertainty, as well as an RBM-like structure that extracts a content subspace and uses it to normalize the input features against lexical variability. Another proposed method replaces factor-analysis-based approaches (e.g., i-vector and JFA) with a discriminative Gaussian process latent variable model that learns low-dimensional discriminative features from the speaker supervector space. The rationale is that the large lexical variation of short utterances can be encoded by the Gaussian process covariance function; the method also uses the distribution of the hidden variables in the discriminative cost function. Experiments show a relative EER improvement of over 20% with respect to the baseline methods.
  9. Keywords:
  10. Speaker Verification ; Uncertainty Propagation ; Factor Analysis ; Gaussian process ; Deep Convolutional Neural Networks ; Short Duration ; Hidden Markov Model
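Sketch of the i-vector posterior referred to in the abstract. This is a minimal illustration, not the author's code: it assumes the standard factor-analysis formulation in which, given per-component Baum-Welch statistics, the posterior covariance of the i-vector quantifies the estimation uncertainty that grows as the amount of speech shrinks. The variable names (N, F, T, Sigma) and the NumPy implementation are illustrative assumptions.

import numpy as np

def ivector_posterior(N, F, T, Sigma):
    """Posterior mean (the i-vector) and covariance given Baum-Welch statistics.
    N     : (C,)      zeroth-order statistics per Gaussian/HMM-state component
    F     : (C, D)    centered first-order statistics per component
    T     : (C, D, R) total-variability matrix, one D x R block per component
    Sigma : (C, D)    diagonal UBM covariances
    """
    C, D, R = T.shape
    L = np.eye(R)                              # precision: I + sum_c N_c T_c' Sigma_c^-1 T_c
    b = np.zeros(R)                            # sum_c T_c' Sigma_c^-1 F_c
    for c in range(C):
        Tc_scaled = T[c] / Sigma[c][:, None]   # Sigma_c^-1 T_c (diagonal covariance)
        L += N[c] * (T[c].T @ Tc_scaled)
        b += Tc_scaled.T @ F[c]
    cov = np.linalg.inv(L)                     # posterior covariance = uncertainty of the estimate
    w = cov @ b                                # posterior mean = the i-vector
    return w, cov

# Toy usage: few frames (small N) give a precision L close to the identity,
# hence a large posterior covariance, i.e. high uncertainty for short utterances.
rng = np.random.default_rng(0)
C, D, R = 8, 20, 10
w, cov = ivector_posterior(N=np.full(C, 0.5),
                           F=rng.standard_normal((C, D)),
                           T=rng.standard_normal((C, D, R)) * 0.1,
                           Sigma=np.ones((C, D)))
print(w.shape, np.trace(cov))

Propagating this posterior covariance into the scoring back-end, rather than using the point estimate alone, is the kind of uncertainty compensation the abstract describes.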
