
Design and Improvement of Sequence-level Objective Functions for DNN-based Large Vocabulary Continuous Speech Recognition

Hadian, Hossein | 2019

  1. Type of Document: Ph.D. Dissertation
  2. Language: Farsi
  3. Document No: 51866 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Sameti, Hossein
  7. Abstract:
  8. This thesis addresses the problem of large vocabulary continuous speech recognition (LVCSR). Numerous research results in recent years have proved the effectiveness of deep neural networks (DNNs) for LVCSR, and consequently many methods have been proposed to incorporate DNNs into LVCSR systems. One way to view these methods is through the objective function used to train the DNN: a frame-level objective function is defined locally on individual frames, whereas a sequence-level objective function is defined on whole sequences. Since speech recognition is essentially a sequential problem, this thesis focuses on designing and improving sequence-level objective functions for DNNs (the standard forms of the two main objectives are given after this list). The main methods proposed for this problem in the literature are the sequence-discriminative lattice-free maximum mutual information (LF-MMI) method and connectionist temporal classification (CTC). The state-of-the-art LF-MMI method is based on hidden Markov models (HMMs) and MMI; its only drawback, shared with all other HMM-DNN methods, is that it relies on a previously trained HMM-Gaussian mixture model (GMM) system. CTC does not have this issue but gives significantly worse results than LF-MMI: a 30-50% relatively higher word error rate (WER). CTC is based on a probabilistic model that assumes successive output labels are independent given the input and that is not capable of sub-phonetic modeling.

     In this research, we propose four new methods. The first is a new method based on LF-MMI that makes it independent of previously trained models; in this respect, the method, which we call flat-start LF-MMI, is comparable to CTC. It enables discriminative training of a context-dependent (CD) acoustic model from scratch (without requiring any previously trained models or alignments) in a single stage. To allow CD modeling from scratch, we propose using full biphones without any state tying. Evaluation results show a 10-30% relative improvement in WER over similar methods such as CTC. The second proposed method is a new approach to creating supervisions in LF-MMI: we relax the time constraints in the supervisions, which gives the network more freedom to learn new alignments. This yields a 1-3% relative WER reduction on various databases while speeding up supervision creation (a costly step in LF-MMI training) by a factor of 2-4. In particular, using this method we improve the state-of-the-art WER on Switchboard from 13.2% to 12.7%. Finally, we propose two further methods to reduce overfitting in LF-MMI and flat-start LF-MMI. The first is a novel regularization method for all MMI-based methods (including LF-MMI) that prevents overfitting to noisy data by connecting the numerator and denominator graphs; we show its effectiveness by evaluating it on data with noisy labels. The second is to use pruning in the forward-backward computation to constrain the supervision (i.e., the numerator graph) in flat-start LF-MMI, which yields a further 2-3% relative WER reduction while making training faster.
  9. Keywords:
  10. Large Vocabulary Continuous Speech Recognition ; Deep Neural Networks ; Sequence-level Objective Function ; Probabilistic Modeling ; Continuous Speech Recognition ; End-to-End Modeling
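
For reference, the two sequence-level objectives discussed in the abstract can be written out explicitly. The equations below are the standard forms from the sequence-discriminative training and CTC literature; the notation is generic and may differ from that used in the thesis itself. The MMI criterion maximizes the log-posterior of the reference transcript W_u of each utterance u with acoustic observations O_u:

    \mathcal{F}_{\mathrm{MMI}} = \sum_{u=1}^{U} \log \frac{p(\mathbf{O}_u \mid \mathbb{M}_{W_u}) \, P(W_u)}{\sum_{W'} p(\mathbf{O}_u \mid \mathbb{M}_{W'}) \, P(W')}

Here \mathbb{M}_W denotes the HMM compiled for word sequence W and P(W) is the language-model prior. The numerator is computed by forward-backward over the supervision (numerator) graph; in LF-MMI the denominator sum is computed exactly over a phone-level denominator graph rather than approximated with lattices. The supervision relaxation, the numerator-denominator regularization, and the pruned forward-backward proposed in the thesis all operate on these two graphs.

The CTC objective marginalizes over all frame-level alignment paths \pi that collapse, via the mapping \mathcal{B}, to the label sequence \mathbf{y}:

    p(\mathbf{y} \mid \mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})} \prod_{t=1}^{T} p(\pi_t \mid \mathbf{x})

The per-frame factorization \prod_t p(\pi_t \mid \mathbf{x}) is the independence assumption the abstract refers to: given the input \mathbf{x}, the output labels at different frames are modeled as independent of one another.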
