Design and Improvement of Sequence-level Objective Functions for DNN-based Large Vocabulary Continuous Speech Recognition

Hadian, Hossein; Sameti, Hossein

Hadian, Hossein | 2019


Type of Document: Ph.D. Dissertation
Language: Farsi
Document No: 51866 (19)
University: Sharif University of Technology
Department: Computer Engineering
Advisor(s): Sameti, Hossein
Abstract:
This thesis addresses large vocabulary continuous speech recognition (LVCSR). Numerous results in recent years have demonstrated the effectiveness of deep neural networks (DNNs) for LVCSR, and many methods have been proposed for incorporating DNNs into LVCSR systems. One way to categorize these methods is by the objective function used to train the DNN. A frame-level objective function is defined locally on individual frames, whereas a sequence-level objective function is defined on whole sequences. Since speech recognition is inherently a sequential problem, this thesis focuses on designing and improving sequence-level objective functions for DNNs. The main methods proposed in the literature for this problem are the sequence-discriminative lattice-free maximum mutual information (LF-MMI) method and connectionist temporal classification (CTC). The state-of-the-art LF-MMI method is based on hidden Markov models (HMMs) and MMI. Its only drawback, shared with all other HMM-DNN methods, is its reliance on a previously trained HMM-Gaussian mixture model (GMM) system. CTC does not have this issue, but its results are significantly worse than those of LF-MMI: its word error rate (WER) is 30-50% higher in relative terms. CTC is based on a probabilistic model that assumes unconditional independence between successive phonemes and is not capable of subphonetic modeling.
In this research, we propose four new methods. The first is a variant of LF-MMI that removes its dependence on previously trained models; in this respect the method, which we call flat-start LF-MMI, is comparable to CTC. It enables discriminative training of a context-dependent (CD) acoustic model from scratch, without any previously trained models or alignments, in a single stage. To allow CD modeling from scratch, we propose using full biphones without any state tying. Evaluation results show a 10-30% relative improvement in WER compared to other similar methods such as CTC.
The second proposed method is a new approach to creating supervisions in LF-MMI. We relax the time constraints in the supervisions, which gives the network more freedom to learn new alignments. This yields a 1-3% relative WER reduction on various databases and speeds up supervision creation (a costly step in LF-MMI training) by a factor of 2 to 4. In particular, using this method, we improve the state-of-the-art WER on Switchboard from 13.2% to 12.7%. Finally, we propose two methods to reduce overfitting in LF-MMI and flat-start LF-MMI. The first is a novel regularization method for all MMI-based methods (including LF-MMI) that can prevent overfitting to noisy data by connecting the numerator and denominator graphs. We demonstrate its effectiveness by evaluating it on data with noisy labels. The second is to use pruning in the forward-backward algorithm to constrain the supervision (i.e., the numerator graph) in flat-start LF-MMI. This gives a further 2-3% relative WER reduction for the proposed flat-start LF-MMI method while also making it faster.
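Note on the figures: the abstract reports improvements both as absolute WER values and as relative reductions. The following is a minimal sketch of how the two relate, assuming the usual definition of relative reduction as the absolute difference divided by the baseline WER. The function name `relative_wer_reduction` is illustrative and does not come from the thesis.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Switchboard figures quoted in the abstract (absolute WER, in percent).
switchboard = relative_wer_reduction(13.2, 12.7)
print(f"{switchboard:.1%}")  # prints 3.8%
```

Under this definition, the 0.5-point absolute improvement on Switchboard corresponds to a relative reduction of about 3.8%.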
Keywords:
Large Vocabulary Continuous Speech Recognition ; Deep Neural Networks ; Sequence-level Objective Function ; Probabilistic Modeling ; Continuous Speech Recognition ; End-to-End Modeling

Table of Contents:
List of Figures
List of Tables
List of Symbols
List of Abbreviations
Introduction
- Problem statement and its importance
- Previous approaches to the problem
- Proposed methods and structure of the report
Previous Work
- Introduction
- General structure of HMM-based speech recognition methods
  - Neural networks for local modeling: the CE method
- Sequence-level objective functions that depend on prior alignments
  - Maximum mutual information and its variants
  - The LF-MMI method
- Sequence-level objective functions that do not depend on prior alignments
  - The CTC method
  - The WER objective function in the CTC framework
  - The RNN-T method
  - Attention-based methods
  - Other methods
- Summary
A Preliminary Study of Several Aspects of LF-MMI
- Introduction
- Phoneme duration modeling using deep neural networks
- Using a reduced frame rate in the CE method
- Using boosting in LF-MMI
- Conclusion
Flat-start LF-MMI
- Introduction
- An objective function based on HMM maximum likelihood and independent of any prior model
  - The flat-start ML objective function
- Extension to LF-MMI
  - The denominator-graph language model
  - Implementation issues
- Context-dependent modeling without a state-tying tree
- Lexicon-free and end-to-end modeling
- Summary
Unconstrained Supervision for LF-MMI
- Introduction
- Constrained supervision
- Unconstrained supervision
- Summary
Proposed Methods for Reducing Overfitting
- Introduction
- Reducing overfitting in the MMI framework by connecting the numerator and denominator
  - Forward probabilities
  - Backward probabilities
- Pruned forward-backward
- Conclusion
Experiments
- Introduction
- Experimental setup
- Experiments on flat-start LF-MMI
  - Flat start
  - Effect of training utterance length
  - Comparison of the error rates of flat-start ML and flat-start LF-MMI
  - HMM topology
  - Tree-free modeling
  - Analysis of the gap between regular and flat-start LF-MMI
  - Effect of the language model and speaker adaptation
  - Final results with phonemes
  - Final results with characters
- Experiments on unconstrained supervision
  - Chunk length
  - Tolerance
  - Noisy data
  - Effect on speed
  - Final results
- Experiments on numerator-to-denominator leakage
- Experiments on pruned forward-backward
- Summary
Conclusion and Future Work
- Summary
- Conclusion
- Future work
References
Persian-to-English Glossary
English-to-Persian Glossary
