Text Separation of Single-Channel Audio Sources Using Deep Neural Networks

Ramazani Bonab, Amirhossein | 2022

  1. Type of Document: M.Sc. Thesis
  2. Language: Farsi
  3. Document No: 56287 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Motahari, Abolfazl
  7. Abstract:
  8. Audio source separation is one of the oldest problems in audio processing and has been studied for more than half a century. Recent research in this field has focused mainly on improving the quality of the separated audio with the help of deep neural networks. Yet most applications of audio source separation, such as meeting transcription, do not need the separated audio of each speaker. They need a pipeline that converts overlapping speech to text: given audio in which several people speak, it outputs the text spoken by the people present in the environment. The pipelines proposed for this purpose so far consist of two main models, the first responsible for audio separation and the second for text conversion, and their disadvantages include a high error rate and a high processing load. In this research, we examine audio source separation models and introduce new pipelines that reduce the error rate and the processing load and improve privacy protection. To reduce the error rate of existing pipelines, we train the audio-to-text conversion model on the outputs of the audio source separation model. To reduce the processing load and protect privacy, we separate the audio sources in a representation space. This lets us separate the sources in a smaller space, which improves the overall speed of the process and reduces the processing load imposed on the system. The test results show a 2-fold decrease in execution time and a 10% error rate in the case of two speakers.
  9. Keywords:
  10. Audio Processing; Deep Learning; Encoder-Decoder; Audio Source Separation; Privacy Preserving
