Text Separation of Single-Channel Audio Sources Using Deep Neural Networks

Ramazani Bonab, Amirhossein; Motahari, Abolfazl

Ramazani Bonab, Amirhossein | 2022

Type of Document: M.Sc. Thesis
Language: Farsi
Document No: 56287 (19)
University: Sharif University of Technology
Department: Computer Engineering
Advisor(s): Motahari, Abolfazl
Abstract:
Audio source separation is one of the oldest problems in audio processing and has been studied for more than half a century. Recent research in this field has focused mainly on improving the quality of the separated audio with the help of deep neural networks. Yet most applications of audio source separation, such as meeting transcription, do not need each speaker's separated audio. They need a pipeline that converts overlapping speech to text: given audio in which several people speak, it outputs the text spoken by each person present. The pipelines proposed for this purpose so far consist of two main models, the first responsible for audio separation and the second for speech-to-text conversion, and their drawbacks include their error rate and high processing load. In this research, we examine audio source separation models and introduce new pipelines that reduce the error rate and the processing load and improve privacy protection. To reduce the error rate of the existing pipelines, we train the audio-to-text conversion model on the outputs of the audio source separation model. To reduce the processing load and protect privacy, we separate the audio sources in a representation space. This lets us separate the sources in a smaller space, which improves the overall speed of the process and reduces the processing load imposed on the system. The test results show a 2-fold decrease in execution time and a 10% error rate in the two-speaker case.
Keywords:
Audio Processing ; Deep Learning ; Encoder-Decoder ; Audio Source Separation ; Privacy Preserving
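
The abstract reports an error rate, and the table of contents lists edit distance, character error rate, and word error rate as the evaluation metrics. The following is a minimal sketch of the standard definitions of these metrics, not the thesis's own evaluation code; the function names are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, because the denominator is the reference length only.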

Table of Contents

Introduction
- Problem statement
- Applications
- Challenges
  - Data
  - Number of audio sources
  - Pipeline error rate
  - Computational cost
- Contributions
- Chapter structure
Background concepts
- Introduction
- Attention mechanism
  - Multi-head attention
  - Self-attention
- Transformer architecture
- Summary
Related work
- Introduction
- Single-channel methods
  - Frequency-domain methods
  - Time-domain methods
- Multi-channel methods
- Summary
Proposed method
- Introduction
- Formal problem statement
- Background architectures
  - wav2vec2.0 model
  - Dual-path transformer model
- Proposed audio source separation pipelines
  - Baseline pipeline
  - Improved pipeline
  - Representation-space pipeline
  - Improved representation-space pipeline
- Summary
Experiments
- Introduction
- Test data
- Evaluation metrics
  - Edit distance
  - Character error rate
  - Word error rate
- Experimental results
  - Baseline pipeline results
  - Improved pipeline results
  - Representation-space pipeline results
  - Improved representation-space pipeline
- Comparison of results
  - Word error rate comparison
  - Comparison of the time required by each pipeline
  - Privacy preservation comparison
- Summary
Conclusion and future work
- Conclusion
- Future work
  - Designing a pipeline for noisy audio signals
  - Designing multilingual pipelines
  - Simplifying the encoder and decoder structure in the audio source separation model
References
Glossary
List of abbreviations
Supplementary material
