Improving Speech Signal Models for Statistical Parametric Speech Synthesis

Khorram, Soheil; Sameti, Hossein

Khorram, Soheil | 2014

Type of Document: Ph.D. Dissertation
Language: Farsi
Document No: 47068 (19)
University: Sharif University of Technology
Department: Computer Engineering
Advisor(s): Sameti, Hossein
Abstract:
Statistical parametric speech synthesis (SPSS) has dominated speech synthesis research over the last decade, owing to advantages such as high intelligibility and flexibility. Decision tree-clustered, context-dependent hidden semi-Markov models are typically used in SPSS to represent the probability densities of acoustic features given contextual factors. This research addresses four major limitations of this decision tree-based structure: (a) the decision tree structure lacks adequate context generalization; (b) it is unable to express complex context dependencies; (c) parameters generated from this structure exhibit sudden transitions between adjacent states; (d) the structure is unable to capture dependencies between adjacent frames within a state.
To alleviate these limitations, we propose three novel statistical parametric models. (i) Hidden maximum entropy model (HMEM) using multiple overlapped decision trees: this model replaces the non-overlapped clusters of a single decision tree with overlapped clusters generated through multiple decision trees. The model distribution is then estimated as the smoothest (maximum entropy) distribution that captures the first- and second-order moments of the training samples in each overlapped cluster. Owing to the simultaneous use of multiple overlapped decision trees and the maximum entropy criterion, the first three issues above are considerably alleviated. (ii) HMEM incorporating a soft decision tree architecture: this structure extends conventional hard decision tree-based clustering to a soft clustering approach. In this model, the soft clustering scheme together with maximum entropy estimation provides promising generalization capabilities. (iii) Gaussian conditional random field (GCRF): this model is a type of maximum entropy model that not only captures first and second moments but also preserves correlations between adjacent frames in a hidden state. It therefore removes the invalid assumption of independence between frames within a state and provides a more accurate modeling approach.
For each proposed model, we have also designed maximum likelihood (ML)-based algorithms to cluster contexts and estimate model parameters. In addition, parameter generation and forward-backward (or Viterbi) algorithms are introduced for all proposed models.
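The idea behind model (i) can be illustrated with a minimal sketch. It is not the dissertation's formulation, only a consequence of the maximum entropy principle under stated assumptions: if the constraints are the first and second moments of a scalar feature within each overlapped cluster, the maximum entropy density for a given context has the form exp(a·o + b·o²), where a and b are sums of per-cluster parameters over the clusters that contain the context. The function name and the (a, b) parameterization below are hypothetical.

```python
def gaussian_from_active_clusters(active_params):
    """Combine per-cluster natural parameters into one Gaussian.

    active_params: list of (a, b) pairs, one per overlapped cluster that
    contains the current context (hypothetical parameterization).
    The density is proportional to exp(a_sum * o + b_sum * o**2),
    which is a Gaussian whenever b_sum < 0.
    """
    a_sum = sum(a for a, _ in active_params)
    b_sum = sum(b for _, b in active_params)
    if b_sum >= 0:
        raise ValueError("summed second-order parameter must be negative")
    variance = -1.0 / (2.0 * b_sum)
    mean = a_sum * variance
    return mean, variance


# One cluster equivalent to N(mean=2, var=0.5): a = mean/var, b = -1/(2*var).
single = gaussian_from_active_clusters([(4.0, -1.0)])

# Adding a second overlapped cluster shifts the mean and sharpens the variance.
combined = gaussian_from_active_clusters([(4.0, -1.0), (0.0, -1.0)])
```

In this sketch every context gets its own mean and variance from the combination of clusters it falls into, which is one way to see how overlapped clusters can generalize across contexts better than a single hard partition.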
Experimental results show that HMEM using heuristic contextual regions improves the accuracy of the voiced/unvoiced detector by 1.5%; furthermore, the CMOS score of this system is 0.5 higher than that of the baseline system for limited databases (fewer than 200 utterances) and 0.5 lower for larger databases. Overlapped decision trees considerably increase the overall quality of the baseline system. More precisely, mel-cepstrum parameters generated by overlapped decision trees are 0.35 dB closer to natural speech. In addition, the proposed soft decision tree structure reduces the root-mean-square error (RMSE) of F0 modeling by approximately 40 cents. In sum, all objective and subjective evaluations confirm the superiority of HMEM using overlapped and soft decision trees over the predominant hidden Markov model-based synthesizers.
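For readers unfamiliar with the units above, the following sketch shows how F0 error in cents and mel-cepstral distortion in dB are commonly computed. The abstract does not state the exact definitions used, so the formulas here are the widely used conventions and are an assumption: cents as 1200·log2 of the frequency ratio over frames voiced in both signals, and mel-cepstral distortion as (10/ln 10)·sqrt(2·Σ(c − c')²) averaged over frames, excluding the 0th (energy) coefficient. Function names are hypothetical.

```python
import math


def f0_rmse_cents(f0_ref, f0_syn):
    """RMSE of F0 in cents over frames voiced in both signals (F0 > 0)."""
    diffs = [
        1200.0 * math.log2(s / r)
        for r, s in zip(f0_ref, f0_syn)
        if r > 0 and s > 0
    ]
    if not diffs:
        raise ValueError("no frames are voiced in both signals")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def mel_cepstral_distortion_db(ref_frames, syn_frames):
    """Mean mel-cepstral distortion in dB, skipping coefficient 0."""
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    per_frame = [
        scale * math.sqrt(sum((r - s) ** 2 for r, s in zip(rf[1:], sf[1:])))
        for rf, sf in zip(ref_frames, syn_frames)
    ]
    return sum(per_frame) / len(per_frame)


# An octave error on one voiced frame and a perfect match on another;
# the middle frame is unvoiced in the reference and is skipped.
f0_error = f0_rmse_cents([100.0, 0.0, 200.0], [200.0, 150.0, 200.0])

# A single frame whose first cepstral coefficient differs by 1.0.
mcd = mel_cepstral_distortion_db([[0.0, 1.0]], [[5.0, 0.0]])
```

Under these conventions, 100 cents corresponds to one semitone, so a reduction of about 40 cents is a little under half a semitone.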
Keywords:
Hidden Semi-Markov Model ; Decision Tree ; Statistical Parametric Speech Synthesis ; Context-dependent Acoustic Modeling ; Maximum Entropy Model ; Overlapped Decision Tree ; Soft Decision Tree ; Gaussian Conditional Random Field