
Speech Activity Detection Using Deep Networks

Shahsavari, Sajad | 2017

  1. Type of Document: M.Sc. Thesis
  2. Language: Farsi
  3. Document No: 49624 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Sameti, Hossein
  7. Abstract:
  8. In this paper, we introduce a new dataset for speech activity detection (SAD) and evaluate several common methods on it: Gaussian mixture models (GMMs), artificial neural networks (ANNs), and recurrent neural networks (RNNs). We collected the dataset with a semi-supervised approach based on subtitled movies, reaching a labeling accuracy of 95%. This semi-automatic method makes it possible to collect large amounts of labeled audio data with very high diversity in language, speaker, and channel. We model SAD as a two-class classification task: speech versus non-speech. With GMMs, we use two separate mixtures, one to model speech and one to model non-speech. With neural networks, we place a softmax layer at the end of the network, with two neurons representing speech and non-speech, and train the network using stochastic gradient descent to minimize the cross-entropy loss. The input to our models is the concatenation of MFCC and PLP features extracted from audio frames. We also investigate the effect of context by taking past and future frames into account. Our results show that adding context improves performance for both GMMs and ANNs. Across our experiments, we achieved an accuracy of 81.61% using GMMs as the baseline and 88.01% using ANNs.
  9. Keywords:
  10. Gaussian Mixture Modeling ; Deep Multilayer Perceptron ; Recurrent Neural Networks ; Speech Activity Detection (SAD) ; Deep Neural Networks
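
The abstract mentions adding context by taking past and future frames into account. A minimal sketch of one common way to do this, frame splicing, follows. The function name `add_context`, the window sizes, and the edge handling (repeating the first or last frame) are illustrative assumptions; the record does not state which settings the thesis used.

```python
def add_context(frames, left=2, right=2):
    """Concatenate each frame with `left` past and `right` future frames.

    `frames` is a list of per-frame feature vectors (for example, MFCC and
    PLP values joined together). Window sizes are assumed, not from the thesis.
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-left, right + 1):
            # Clamp the index so edge frames reuse the first or last frame.
            neighbor = min(max(t + offset, 0), last)
            window.extend(frames[neighbor])
        spliced.append(window)
    return spliced


# Three toy frames with two features each, one frame of context per side.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
spliced = add_context(frames, left=1, right=1)
```

Each output vector is `left + right + 1` times as long as an input frame, so the classifier (GMM or neural network) sees a short stretch of audio around the frame it labels.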
