Speech Activity Detection Using Deep Networks

Shahsavari, Sajad; Sameti, Hossein

Shahsavari, Sajad | 2017

Type of Document: M.Sc. Thesis
Language: Farsi
Document No: 49624 (19)
University: Sharif University of Technology
Department: Computer Engineering
Advisor(s): Sameti, Hossein
Abstract:
In this thesis, we introduce a new dataset for speech activity detection (SAD) and evaluate several common methods on it, namely GMM, ANN, and RNN. We collected the dataset with a semi-supervised approach based on subtitled movies, reaching a labeling accuracy of 95%. This semi-automatic method makes it possible to collect large amounts of labeled audio data with very high diversity in language, speaker, and channel. We model SAD as a two-class classification task: speech versus non-speech. When using GMMs, we use two separate mixtures, one to model speech and one to model non-speech. In the case of neural networks, we use a softmax layer at the end of the network, with two neurons representing speech and non-speech, and train the network using stochastic gradient descent to minimize cross-entropy loss. The input to our models is the concatenation of MFCC and PLP features extracted from audio frames. We also investigate the effect of context by taking past and future frames into account. Our results show that adding context improves performance for both GMMs and ANNs. Across our experiments, we achieved an accuracy of 81.61% using GMMs as the baseline and 88.01% using ANNs.
Keywords:
Gaussian Mixture Modeling ; Deep Multilayer Perceptron ; Recurrent Neural Networks ; Speech Activity Detection (SAD) ; Deep Neural Networks
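
The use of context mentioned in the abstract (taking past and future frames into account) is commonly implemented by splicing neighboring feature vectors onto each frame. The following is a minimal sketch of that idea, not the thesis's actual code. The function name `stack_context`, the `left` and `right` window sizes, and the edge handling (repeating the first or last frame) are all assumptions made for illustration; the abstract does not state the window size or how utterance boundaries were treated.

```python
from typing import List, Sequence


def stack_context(
    frames: Sequence[Sequence[float]], left: int, right: int
) -> List[List[float]]:
    """Concatenate each frame with `left` past and `right` future frames.

    `frames` holds one feature vector per audio frame (for example the
    MFCC and PLP values concatenated). Frames near the edges are padded
    by repeating the first or last frame (an illustrative assumption).
    """
    n = len(frames)
    stacked: List[List[float]] = []
    for t in range(n):
        row: List[float] = []
        for offset in range(-left, right + 1):
            # Clamp the index so edge frames reuse the nearest valid frame.
            idx = min(max(t + offset, 0), n - 1)
            row.extend(frames[idx])
        stacked.append(row)
    return stacked


# Toy example: three frames with two features each, one frame of context
# on either side, so every output vector has 3 * 2 = 6 values.
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
stacked = stack_context(frames, left=1, right=1)
```

Each stacked vector would then be the input to the classifier (GMM or neural network) in place of the single-frame features, which grows the input dimension by a factor of `left + right + 1`.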
