Improving Speech-to-Speech Translation in Audio Dubbing Systems

Rashidi, Sina; Sameti, Hossein

Rashidi, Sina | 2024

Type of Document: M.Sc. Thesis
Language: Farsi
Document No: 57440 (19)
University: Sharif University of Technology
Department: Computer Engineering
Advisor(s): Sameti, Hossein
Abstract:
Direct speech-to-speech translation, in which all components are trained jointly, has an advantage over cascaded approaches: it uses a simple yet effective pipeline and produces output with low inference time. However, direct speech-to-speech translation models suffer from data scarcity, because they require parallel speech data in the source and target languages. In this thesis, we present a novel direct speech-to-speech translation model for translating Persian speech to English. The model is based on discrete speech units; it performs speech-to-unit translation with a pretrained conformer-based encoder and a transformer-based causal decoder that uses relative-position multi-head attention. The generated speech units are converted to a speech waveform by a unit-based neural vocoder. The model is trained without relying on intermediate text features. To address the data scarcity issue, we also build a new corpus of parallel Persian and English speech by translating the transcriptions of Persian speech into English with a large language model and then synthesizing the output speech with a state-of-the-art text-to-speech model. This corpus provides approximately 6 times more parallel data than the existing datasets. Experimental results show that, compared to direct baselines, the proposed model achieves 1.6 more ASR BLEU without the new corpus and 4.6 more ASR BLEU with it.
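The corpus-construction steps described in the abstract (translate each Persian transcription to English, then synthesize English speech from the translation) can be sketched roughly as follows. This is only an illustrative outline, not the thesis's implementation: the names `ParallelExample`, `build_parallel_corpus`, `translate`, and `synthesize` are hypothetical, and the two callables stand in for whichever large language model and text-to-speech system were actually used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class ParallelExample:
    """One source/target pair in the synthetic parallel corpus (hypothetical layout)."""
    source_audio_path: str  # path to the Persian speech recording
    source_text: str        # Persian transcription of that recording
    target_text: str        # English translation of the transcription
    target_audio: bytes     # synthesized English speech


def build_parallel_corpus(
    records: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],    # assumed wrapper around an LLM translator
    synthesize: Callable[[str], bytes], # assumed wrapper around a TTS model
) -> List[ParallelExample]:
    """Pair each Persian recording with synthesized English speech."""
    corpus: List[ParallelExample] = []
    for audio_path, transcript in records:
        english = translate(transcript).strip()
        if not english:
            # Skip items for which the translator returned nothing usable.
            continue
        corpus.append(
            ParallelExample(
                source_audio_path=audio_path,
                source_text=transcript,
                target_text=english,
                target_audio=synthesize(english),
            )
        )
    return corpus
```

Passing the translator and synthesizer in as plain callables keeps the sketch independent of any particular model or library; the resulting source and target audio pairs are what a direct speech-to-speech model would be trained on.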
Keywords:
Audio Dubbing ; Direct Speech-to-Speech Translation ; Speech-to-Speech Translation

Book Contents

Introduction
- Problem definition
- Importance of the topic
- Literature review
- Research objectives
- Thesis structure
Basic concepts
- Introduction
- Neural networks
- Artificial neural networks
- Convolutional neural networks
  - Structure of convolutional neural networks
- Recurrent neural networks
  - Structure of recurrent neural networks
  - Long short-term memory networks
- Transformers
  - Encoder
  - Decoder
  - Attention mechanism
  - Positional encoding
- Conformer
  - Structure of the conformer
- Generative adversarial networks
  - Structure of generative adversarial networks
- Large language models
- Evaluation metrics
  - Character error rate
  - Word error rate
  - BLEU
  - ASR BLEU
  - METEOR
  - MOS
- Summary
Previous work
- Introduction
- Spectrogram-based models
  - Translatotron
  - Translatotron 2
- Models based on discrete speech units
  - Speech-to-unit translation
  - Discrete unit-based vocoder
- Self-supervised pretraining in speech-to-speech translation models
  - UnitY
- Textless Translatotron
  - Structure of Textless Translatotron
  - Performance of Textless Translatotron
- Summary
Proposed approach
- Introduction
- Data used
  - Common Voice data
  - CVSS data
  - LJSpeech data
- Structure of the proposed model
  - Encoder
  - Decoder
  - Length adaptor
  - Preprocessing of target speech data
  - Vocoder
- Data augmentation methods
  - Time warping
  - Frequency masking
  - Time masking
- Generating new data
  - More Persian data
  - Machine translation of the data transcripts into English
  - Converting the translated texts into target speech
  - The constructed speech corpus
- Summary
Experiments and new results
- Introduction
- Implementation method
- Building the new data corpus
- Model training process
  - Prerequisite steps for training the main model
  - Training the main model
- Model training results
  - Results of training the model on CVSS data
  - Results of training the model on the constructed corpus
  - Model performance results with the METEOR metric
- Sample model outputs
- Summary
Conclusion
- Innovations and contributions of this research
- Remaining issues
- Suggestions for future work
References
Glossary