Speech Enhancement Using Deep Neural Networks

Mohammadian Kalkhoran, Parisa; Sameti, Hossein

Mohammadian Kalkhoran, Parisa | 2022

Type of Document: M.Sc. Thesis
Language: Farsi
Document No: 55821 (31)
University: Sharif University of Technology
Department: Languages and Linguistics Center
Advisor(s): Sameti, Hossein
Abstract:
Quality and intelligibility are two aspects of speech that are degraded by factors such as background noise and echo. The performance of many commercial and military speech-based systems depends on at least one of these aspects. This research therefore aims to design an enhancement model that removes background noise and reverberation from the speech signal. The model is trained with supervised deep learning in the time domain: the input is the raw waveform of speech mixed with noise and reverberation, and the output is the enhanced waveform. The architecture proposed in this thesis is based on U^2-Net, which was originally introduced for salient object detection in images. U^2-Net has a two-level nested configuration with a U-shaped encoder-decoder structure, which brings two advantages: it extracts richer global and local information at different scales, and it increases the depth of the network without increasing the computational cost. In the proposed system, a multimodal loss function trains the model in both the time and time-frequency domains, so that the model is optimized in both domains simultaneously and the quality of the enhanced signal better matches human perception. The model also uses data augmentation to make better use of the available data. A portion of the DNS Challenge3 data, supplemented with data from the Large Farsdat dataset, was used for training. Two advanced speech enhancement models, FullSubNet in the time-frequency domain and Denoiser in the time domain, were selected for comparison. However, due to limited hardware and limited research time, the model was not fully implemented, and the study remained at an initial stage.
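The abstract does not give the exact form of the loss, so the following is only a minimal sketch of one common way to combine a time-domain term with a time-frequency term: an L1 distance on the waveform plus an L1 distance on STFT magnitudes. The function names, the frame size, the hop, and the weight `alpha` are illustrative assumptions and are not taken from the thesis. It is written in plain Python with a naive DFT so that it runs without dependencies; a training setup would normally use a tensor library instead.

```python
import cmath
import math


def frames(signal, size, hop):
    """Split a sample sequence into overlapping frames (an incomplete tail is dropped)."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def magnitude_spectrum(frame):
    """Magnitudes of the one-sided DFT of a Hann-windowed frame (naive O(N^2) DFT)."""
    n = len(frame)
    windowed = [
        x * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
        for i, x in enumerate(frame)
    ]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]


def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def multi_domain_loss(estimate, target, frame_size=64, hop=32, alpha=0.5):
    """Time-domain L1 plus alpha times the STFT-magnitude L1.

    frame_size, hop and alpha are hypothetical defaults chosen for illustration.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    if len(target) < frame_size:
        raise ValueError("signal is shorter than one frame")
    time_term = l1(estimate, target)
    est_frames = frames(estimate, frame_size, hop)
    tgt_frames = frames(target, frame_size, hop)
    spec_term = sum(
        l1(magnitude_spectrum(e), magnitude_spectrum(t))
        for e, t in zip(est_frames, tgt_frames)
    ) / len(est_frames)
    return time_term + alpha * spec_term
```

With this formulation, identical waveforms give a loss of zero, and both terms grow as the estimate departs from the clean target, so a single scalar penalizes errors in both domains at once.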
Trained on two hours of data in the Google Colab environment, the FullSubNet model achieved SI-SNR 9.49, STOI 0.91, WB-PESQ 1.72, and NB-PESQ 2.58. Although the training data was far smaller than in the original article, the result was close to the reported outcome in terms of intelligibility (STOI), while in terms of quality (PESQ) performance was weaker. On the pilot conversations, although the FullSubNet and Denoiser models significantly improved quality, no improvement in automatic speech recognition was observed.
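Of the metrics reported above, SI-SNR is simple enough to compute directly. The sketch below follows the commonly used definition (zero-mean both signals, project the estimate onto the target, then compare the energy of the projection with the energy of the residual, in dB). It is not the thesis's evaluation code, which the abstract does not show; the function name and the `eps` stabilizer are illustrative assumptions.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB for two equal-length sequences.

    eps is a small stabilizer (an assumption of this sketch) that avoids
    division by zero and log of zero.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)
    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to obtain the scaled reference.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t) + eps
    scale = dot / target_energy
    s_target = [scale * b for b in t]
    # Whatever the projection does not explain is treated as noise.
    residual = [a - s for a, s in zip(e, s_target)]
    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(r * r for r in residual) + eps
    return 10.0 * math.log10(signal_energy / noise_energy + eps)
```

Because the estimate is projected onto the target before the ratio is taken, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is why this metric is preferred over plain SNR when the output level of an enhancement model is arbitrary.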
Keywords:
Speech Enhancement ; Noise Removal ; Deep Neural Networks ; Speech Destruction ; Clean Speech ; Reverberation ; Dereverberation ; Background Noise
