- Type of Document: M.Sc. Thesis
- Language: Farsi
- Document No: 57440 (19)
- University: Sharif University of Technology
- Department: Computer Engineering
- Advisor(s): Sameti, Hossein
- Abstract:
- Direct speech-to-speech translation, in which all components are trained jointly, is advantageous over cascaded approaches because it uses a simple yet effective pipeline that produces output with low inference latency. Direct models, however, suffer from data scarcity, since they require parallel speech data in the source and target languages. In this thesis, we present a novel direct speech-to-speech translation model that translates Persian speech to English. The model is based on discrete speech units and performs speech-to-unit translation with a pretrained conformer-based encoder and a transformer-based causal decoder that uses relative-position multi-head attention. The generated speech units are converted to a speech waveform by a unit-based neural vocoder. Model training does not rely on intermediate text features. To address data scarcity, we also build a new corpus of parallel Persian-English speech by translating the transcriptions of Persian speech into English with a large language model and then synthesizing the English speech with a state-of-the-art text-to-speech model. This corpus provides approximately 6 times more parallel data than existing datasets. Experimental results show that, compared to direct baselines, the proposed model achieves 1.6 higher ASR-BLEU without the new corpus and 4.6 higher ASR-BLEU when the newly built corpus is used.
- Keywords:
- Audio Dubbing ; Direct Speech-to-Speech Translation ; Speech-to-Speech Translation
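The abstract's central idea, mapping continuous acoustic features to discrete speech units before vocoding, can be sketched with a toy nearest-centroid quantizer. This is a minimal illustration only: the thesis derives its units from a pretrained encoder, and the centroids and feature vectors below are made-up placeholders, not the actual learned clusters.

```python
# Toy sketch of discrete speech-unit extraction (illustrative only; the
# thesis obtains units from a pretrained model, not hand-picked centroids).
from math import dist

def quantize(features, centroids):
    """Map each continuous feature vector to the index of its nearest centroid.

    Consecutive duplicate units are collapsed, a common step before
    passing the unit sequence to a unit-based vocoder.
    """
    units = []
    for vec in features:
        unit = min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
        if not units or units[-1] != unit:
            units.append(unit)
    return units

# Illustrative 2-D "acoustic features" and 3 centroids (purely made up).
centroids = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
frames = [(0.1, -0.1), (0.2, 0.1), (0.9, 1.1), (1.1, 0.9), (2.1, 0.2)]
print(quantize(frames, centroids))  # -> [0, 1, 2]
```

In the actual system, a speech-to-unit translation model predicts such a unit sequence in the target language, and a unit-based neural vocoder converts it back to a waveform.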
Table of Contents
- Introduction
- Basic Concepts
- Related Work
- Proposed Method
- Experiments and Results
- Conclusion
- References
- Glossary