Speech activity detection using deep neural networks

Shahsavari, S ; Sharif University of Technology

  1. Type of Document: Article
  2. DOI: 10.1109/IranianCEE.2017.7985293
  3. Abstract:
  4. In this paper, we introduce a new dataset for speech activity detection (SAD) and evaluate several common methods on it, namely GMM, DNN, and RNN. We collected the dataset with a semi-supervised approach based on subtitled movies, with a labeling accuracy of 95%. This semi-automatic method makes it possible to collect large amounts of labeled audio data with very high diversity in language, speaker, and channel. We model SAD as a classification task with two classes, speech and non-speech. With GMMs, we use two separate mixtures, one to model speech and one to model non-speech. With neural networks, we place a softmax layer with two neurons, representing speech and non-speech, at the end of the network, and train the network using stochastic gradient descent to minimize cross-entropy loss. The input to our models is the concatenation of MFCC and PLP features extracted from audio frames. We also investigate the effect of context by taking past and future frames into account. Our results show that adding context improves performance for both GMMs and DNNs. Across our experiments, we achieved an accuracy of 81.61% using GMMs and 85.18% using DNNs. © 2017 IEEE
  5. Keywords:
  6. Speech ; Speech recognition ; Stochastic systems ; Audio frames ; Classification tasks ; Cross entropy ; Labeling accuracies ; Semi-supervised ; Semiautomatic methods ; Speech activity detections ; Stochastic gradient descent ; Deep neural networks
  7. Source: 2017 25th Iranian Conference on Electrical Engineering, ICEE 2017, 2 May 2017 through 4 May 2017 ; 2017 , Pages 1564-1568 ; 9781509059638 (ISBN)
  8. URL: https://ieeexplore.ieee.org/document/7985293
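
The two mechanisms named in the abstract, adding past and future frames as context and training a two-neuron softmax output with stochastic gradient descent on cross-entropy loss, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single softmax layer with no hidden layers, toy one-dimensional "features" in place of concatenated MFCC and PLP vectors, and edge padding by frame repetition. All function names and hyperparameters (`splice_context`, `train_sgd`, `lr`, `epochs`) are hypothetical.

```python
import math
import random

def splice_context(frames, left, right):
    """Concatenate each frame with `left` past and `right` future frames.
    Assumption: edges are padded by repeating the first or last frame."""
    n = len(frames)
    spliced = []
    for t in range(n):
        row = []
        for k in range(t - left, t + right + 1):
            idx = min(max(k, 0), n - 1)  # clamp index to the valid range
            row.extend(frames[idx])
        spliced.append(row)
    return spliced

def softmax2(z0, z1):
    """Numerically stable softmax over two logits."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)

def logits(w, b, x):
    return [sum(wj * xj for wj, xj in zip(w[c], x)) + b[c] for c in range(2)]

def train_sgd(features, labels, epochs=200, lr=0.1, seed=0):
    """Train a two-output softmax layer with per-sample SGD on cross-entropy."""
    rng = random.Random(seed)
    dim = len(features[0])
    w = [[0.0] * dim for _ in range(2)]
    b = [0.0, 0.0]
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            p = softmax2(*logits(w, b, x))
            for c in range(2):
                # Gradient of cross-entropy w.r.t. logit c is p_c - 1[c == y].
                g = p[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    w[c][j] -= lr * g * x[j]
                b[c] -= lr * g
    return w, b

def predict(w, b, x):
    """Return 0 (non-speech) or 1 (speech) for one spliced frame."""
    z = logits(w, b, x)
    return 0 if z[0] >= z[1] else 1

# Toy data: one made-up feature per frame; label 1 = speech, 0 = non-speech.
frames = [[0.1], [0.2], [0.1], [2.0], [2.2], [1.9], [0.1], [0.2]]
labels = [0, 0, 0, 1, 1, 1, 0, 0]

spliced = splice_context(frames, left=1, right=1)  # 3 values per frame
w, b = train_sgd(spliced, labels)
preds = [predict(w, b, x) for x in spliced]
print(preds)
```

With `left=1` and `right=1`, each input vector is three times the size of a single frame's feature vector, so the classifier sees the neighbouring frames when labeling the current one. Replacing the single layer with a deeper network, or with one GMM per class, would leave the splicing step unchanged.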