Using Structural Language Modeling in Continuous Speech Recognition Systems

M.Sc. Thesis, Sharif University of Technology. SheikhShab, Golnar (Author); Sameti, Hossein (Supervisor)

Abstract

The language model is one of the most important parts of an automatic speech recognition system; it incorporates knowledge of natural language into the system to improve its accuracy. The language model conventionally used in recognition systems is the n-gram, which is usually extracted from a large corpus using the relative frequency method. An n-gram model approximates the probability of a word sequence by multiplying its n-gram probabilities and thus does not take long-distance dependencies into account. Syntactic language models are therefore of interest. In this research, after probing different syntactic language models, a method for re-estimating the n-gram model, introduced by Stolcke in 1994, was...
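As a rough illustration of the n-gram approximation described in this abstract, the sketch below estimates bigram probabilities by relative frequency and multiplies them to score a sentence. The function names and the `<s>`/`</s>` boundary markers are assumptions made for this example and are not taken from the thesis; no smoothing is applied, so unseen bigrams get probability zero.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts

def bigram_prob(prev, word, history_counts, bigram_counts):
    """Relative-frequency estimate P(word | prev) = c(prev, word) / c(prev)."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]

def sentence_prob(tokens, history_counts, bigram_counts):
    """Approximate P(sentence) as a product of bigram probabilities."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    prob = 1.0
    for prev, word in zip(padded, padded[1:]):
        prob *= bigram_prob(prev, word, history_counts, bigram_counts)
    return prob
```

Because each factor conditions only on the previous word, a dependency between words that are far apart in the sentence cannot influence the score, which is the limitation the abstract points out.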

Text-Independent Speaker Identification in Large Population Applications

M.Sc. Thesis, Sharif University of Technology. Zeinali, Hossein (Author); Sameti, Hossein (Supervisor)

Abstract

Human speech conveys much information, such as semantic content, emotion and even speaker identity. This thesis addresses text-independent speaker identification (SI) in large-population applications. Identification (test) time has become one of the most important issues in recent real-time systems. Identification time depends on the cost of computing likelihoods between the test features and the registered speaker models. For real-time SI applications, the system must identify an unknown speaker quickly; hence conventional SI methods cannot be used. The main goal of this thesis is to propose several methods that reduce identification time without any loss of identification...
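To make the cost argument concrete, here is a minimal sketch of exhaustive closed-set identification. It assumes each registered speaker is modeled by a single diagonal-covariance Gaussian, which is a simplification chosen for brevity; the abstract does not state the model type, and the names `diag_gauss_loglik` and `identify` are hypothetical.

```python
import math

def diag_gauss_loglik(frame, mean, var):
    """Log-density of one feature frame under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )

def identify(frames, speaker_models):
    """Return the best-scoring speaker and all per-speaker log-likelihoods.

    speaker_models maps a speaker id to a (mean, var) pair. Every registered
    model is scored against every frame, so the cost grows as
    num_speakers * num_frames * feature_dim.
    """
    scores = {
        spk: sum(diag_gauss_loglik(f, mean, var) for f in frames)
        for spk, (mean, var) in speaker_models.items()
    }
    return max(scores, key=scores.get), scores
```

The linear growth with the number of registered speakers is what makes this exhaustive search slow for large populations.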

Music Emotion Recognition

M.Sc. Thesis, Sharif University of Technology. Pouyanfar, Samira (Author); Sameti, Hossein (Supervisor)

Abstract

Measuring the emotion of music is one way to determine music content. Music emotion detection is applicable to music retrieval, music genre recognition and music data management software. Music emotion is studied in different sciences such as physiology, psychology, musicology and engineering. First, we collected a database of different types of music with various emotions. These data were labeled according to their emotions. In this project, four emotions (angry, happy, relaxed and sad) are used as labels, based on Thayer’s two-dimensional emotion model. As in other recognition systems, there are two basic steps in music emotion recognition: feature extraction...

Learning Dialogue Management in Spoken Dialogue Systems

M.Sc. Thesis, Sharif University of Technology. Habibi, Maryam (Author); Sameti, Hossein (Supervisor)

Abstract

The use of spoken dialogue systems (SDSs) in real life is growing rapidly because of advances in the design and management of these systems. Traditional touch-tone computer telephony systems are being replaced by SDSs. In a typical SDS, the user speaks naturally to the system over a phone line and the system provides the required information or performs the required action. Banking and ticket reservation are typical examples of prevalent SDSs. A spoken dialogue system has four units: automatic speech recognition (ASR), natural language understanding (NLU), dialogue management (DM), and spoken language generation (SLG). In this work, the first spoken dialogue...

Introducing a Hybrid Language Model for Improving Performance of Continuous Speech Recognition Systems

Ph.D. Dissertation, Sharif University of Technology. Bahrani, Mohammad (Author); Sameti, Hossein (Supervisor)

Abstract

Utilizing a language model is one of the most effective methods for improving speech recognition performance. Several types of language models have been proposed for speech recognition applications, each trying to model some part of language information, such as n-gram models, syntactic models, and semantic models. Although n-gram, syntactic and semantic models are able to model different structures that exist in natural language, they each capture only specific linguistic phenomena. None of them can simultaneously take into account all language phenomena in a unified probabilistic framework. Recently, a number of semantic models called "latent topic...

Speech Enhancement Based on Statistical Methods

Ph.D. Dissertation, Sharif University of Technology. Veisi, Hadi (Author); Sameti, Hossein (Supervisor)

Abstract

This work focuses on single-channel speech enhancement using a hidden Markov model (HMM) based on the minimum mean square error (MMSE) estimator, and proposes an HMM-based speech enhancement method in the Mel-frequency domain. The MMSE estimator results in a weighted-sum filtering of the noisy signal, in which accurate estimation of the filter values and filter weights is the main challenge. Cepstral-domain modeling for speech enhancement is motivated by accurate filter selection in this domain. In the proposed framework, Mel-frequency spectral (MFS) and Mel-frequency cepstral (MFC) features are studied and evaluated experimentally. In addition to the spectrum estimator, magnitude spectrum, log-magnitude spectrum...
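The weighted-sum filtering mentioned in this abstract can be sketched as follows. The example assumes that each HMM state contributes one per-frequency gain vector (for instance a Wiener-style filter) and that the weights are state posteriors for the current frame; neither detail is spelled out in the abstract, and the function name is hypothetical.

```python
def mmse_weighted_sum(noisy_spectrum, filters, weights):
    """Combine per-state filters into one estimate: sum_k w_k * H_k[f] * Y[f].

    noisy_spectrum: list of spectral values Y[f] for one frame.
    filters: one gain vector H_k per state, each as long as noisy_spectrum.
    weights: one non-negative weight w_k per state (assumed to be posteriors).
    """
    if len(filters) != len(weights):
        raise ValueError("need one weight per filter")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    norm = [w / total for w in weights]  # make the weights sum to one
    return [
        sum(w * h[f] for w, h in zip(norm, filters)) * y
        for f, y in enumerate(noisy_spectrum)
    ]
```

In this form the two quantities the abstract calls challenging are explicit: the filter values are the entries of `filters`, and the filter weights are `weights`.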

Persian Speech Synthesis Using Hidden Markov Models

M.Sc. Thesis, Sharif University of Technology. Bahaadini, Sara (Author); Sameti, Hossein (Supervisor)

Abstract

Over the last ten years, research on Persian speech synthesis systems has been sparse and scattered. A comprehensive framework that properly implements and adapts statistical speech synthesis methods for Persian has not yet been developed. In this thesis, recent statistical parametric speech synthesis methods, including CLUSTERGEN, traditional HMM-based speech synthesis and its STRAIGHT version, are implemented and adapted for the Persian language. A CCR test is carried out to compare these methods with each other and with the unit selection method. Listeners score samples based on CMOS. The methods are ranked by averaging the CCR scores. The results show that STRAIGHT-based...

Persian Statistical Natural Language Understanding Based on Partially Annotated Corpus

M.Sc. Thesis, Sharif University of Technology. Jabbari, Fattaneh (Author); Sameti, Hossein (Supervisor)

Abstract

The spoken language understanding unit is one of the most important parts of a spoken dialogue system. The input of this unit is the output of the speech recognition unit. Its main function is to extract semantic information from the input utterances. There are two main types of approaches to this task: rule-based approaches and data-driven approaches. Today, data-driven approaches are of more interest because they are more flexible and robust than rule-based approaches. The main drawback of these methods is that they need a large amount of fully annotated or, in some cases, treebank data. Preparing such data is time-consuming and expensive. The goal of this thesis is...

Semantic Clustering of Persian Verbs

M.Sc. Thesis, Sharif University of Technology. Aminian, Maryam (Author); Sameti, Hossein (Supervisor)

Abstract

Semantic classification of words based on unsupervised learning methods is a challenging issue in computational lexical semantics. The goal of this field of study is to recognize words that belong to the same semantic class, i.e., that can take the same set of arguments. Among all word categories, the verb is known as one of the most important and is regarded as the central part of the sentence in certain linguistic theories such as case grammar and dependency grammar. Based on Levin’s idea, diathesis alternations and the similarity between these alternations are the clues for the semantic classification of verbs. This idea has been verified in languages such as English and German with promising results....

Robust Speech Recognition Based on Data Compensation and MDT Methods

M.Sc. Thesis, Sharif University of Technology. BabaAli, Bagher (Author); Sameti, Hossein (Supervisor)

Abstract

Automatic speech recognition performance degrades significantly when speech is affected by environmental noise. Nowadays, the major challenge is to achieve good robustness in adverse noisy conditions so that automatic speech recognizers can be used in real situations. Spectral subtraction (SS) is a well-known and effective approach; it was originally designed to improve the quality of the speech signal as judged by human listeners. SS techniques usually improve the quality and intelligibility of the speech signal, whereas speech recognition systems need compensation techniques that reduce the mismatch between noisy speech features and acoustic models trained on clean speech. Nevertheless, correlation can be expected...
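For readers unfamiliar with spectral subtraction, a minimal single-frame sketch of the power-domain variant is given below. The over-subtraction factor `alpha` and the spectral `floor` are common additions assumed here for illustration; the abstract does not say which SS variant or parameter values the thesis uses.

```python
def spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, floor=0.01):
    """Power spectral subtraction for one frame of magnitude spectra.

    noisy_mag: magnitudes |Y[f]| of the noisy frame.
    noise_mag: estimated noise magnitudes |N[f]| (e.g. from non-speech frames).
    alpha: over-subtraction factor applied to the noise power.
    floor: fraction of the noisy power kept as a lower bound, so that no
           bin ends up with negative power.
    """
    enhanced = []
    for y, n in zip(noisy_mag, noise_mag):
        power = y * y - alpha * n * n
        power = max(power, floor * y * y)
        enhanced.append(power ** 0.5)
    return enhanced
```

The enhanced magnitudes are normally recombined with the noisy phase before resynthesis; that step is omitted here.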

Speaker Adaptation in Eigen Voice Space for Statistical Parametric Speech Synthesis

M.Sc. Thesis, Sharif University of Technology. Shams, Boshra (Author); Sameti, Hossein (Supervisor)

Abstract

Recently, various speaker adaptation methods have been proposed for HMM-based speech synthesis. Adaptation techniques are important because they make it possible to design a system that generates high-quality speech with the target speaker's characteristics from a limited adaptation data set.
In this research, we focus on clustering-based adaptation and develop a novel method that uses eigenvoices to adapt to a new speaker. We employ this approach for the first time in HSMM-based speech synthesis systems; its goal is to reduce the number of parameters and the amount of adaptation data the system needs. In our proposed method, some speaker-dependent models are trained first. For each model we combine the...
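The general eigenvoice idea can be sketched with NumPy as below. This is an assumption-laden illustration, not the thesis's method: it assumes each speaker-dependent model is flattened into one supervector, finds eigenvoices by PCA, and places a new speaker by plain projection, whereas practical eigenvoice adaptation usually estimates the weights from adaptation data by maximum likelihood. The function names are hypothetical.

```python
import numpy as np

def train_eigenvoices(supervectors, num_eigenvoices):
    """PCA over speaker supervectors (one row per training speaker)."""
    X = np.asarray(supervectors, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal principal directions, strongest first.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:num_eigenvoices]

def adapt_speaker(target_supervector, mean, eigenvoices):
    """Project a target speaker onto the eigenvoice space.

    Returns the low-dimensional weights and the reconstructed supervector.
    """
    offset = np.asarray(target_supervector, dtype=float) - mean
    weights = eigenvoices @ offset
    return weights, mean + weights @ eigenvoices
```

Only `num_eigenvoices` weights have to be estimated for a new speaker instead of a full supervector, which is how this family of methods reduces the number of parameters and the amount of adaptation data.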

Training-Based Speech Enhancement Using Non-Gaussian Distributions

M.Sc. Thesis, Sharif University of Technology. Golrasan, Elham (Author); Sameti, Hossein (Supervisor)

Abstract

Statistical approaches (purely statistical and model-based) are the most efficient methods for single-channel speech enhancement. Despite this efficiency, speech enhancement remains a challenging problem. Recent research suggests that univariate non-Gaussian distributions are more appropriate for the speech signal in different domains. Statistical approaches have been modified on the basis of these univariate distributions, and better results have been reported as a consequence. The purpose of this thesis is speech enhancement based on the hidden Markov model using a multivariate non-Gaussian distribution. The results of the speech enhancement algorithm based on the hidden Markov model in the DCT and DFT domains...

Improving the Training Process of Understanding Unit in Spoken Dialog Systems Using Active Learning Methods

M.Sc. Thesis, Sharif University of Technology. Hadian, Hossein (Author); Sameti, Hossein (Supervisor)

Abstract

This thesis aims to reduce the need for labeled data in the Spoken Language Understanding (SLU) domain by means of active learning methods. This aim is motivated by the lack of labeled SLU datasets for the Persian language and by fairly high labeling costs. Active learning methods enable the learner to choose the most informative instances to be labeled and used for training, and avoid labeling uninformative or redundant instances. To model the SLU system, several statistical models, namely MLN (Markov Logic Networks), CRF (Conditional Random Fields), HMM (Hidden Markov Model) and HVS (Hidden Vector State), were reviewed, and CRF was finally chosen for its superior performance. The...
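One common way to pick "the most informative instances" is least-confidence sampling, sketched below. The abstract does not state which selection criterion the thesis uses, so this is only an illustrative assumption; for a sequence model such as a CRF the per-instance probabilities would typically be replaced by a sequence-level confidence. The function name is hypothetical.

```python
def select_for_labeling(posteriors, batch_size):
    """Pick the unlabeled instances the current model is least sure about.

    posteriors maps an instance id to its list of class probabilities.
    Uncertainty is 1 - max probability (least-confidence sampling).
    """
    uncertainty = {key: 1.0 - max(probs) for key, probs in posteriors.items()}
    ranked = sorted(uncertainty, key=uncertainty.get, reverse=True)
    return ranked[:batch_size]
```

In an active learning loop, the selected instances are sent to an annotator, added to the training set, and the model is retrained before the next selection round.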

Music Track Detection Using Audio Fingerprinting

M.Sc. Thesis, Sharif University of Technology. Yazdanian, Saeed (Author); Sameti, Hossein (Supervisor)

Abstract

Music information retrieval systems have many applications in music filtering and broadcast monitoring, owing to the huge amount of multimedia data available today. In these systems, the feature extraction method is called audio fingerprinting. The small size of fingerprints allows the systems to search efficiently among thousands or millions of songs. The input signal is usually just a couple of seconds long and degraded in several ways. The goal is to design a system that is robust to signal degradations and efficient to search. In this thesis, one of the basic systems is reviewed and improved in several ways. This system uses the spectrogram of signals to extract features and build an...
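As an illustration of how compact fingerprints can be derived from a spectrogram, the sketch below computes binary sub-fingerprints from band energies in the style of the Haitsma and Kalker scheme. The abstract does not name the basic system that the thesis reviews, so this particular choice is an assumption, and the function names are hypothetical. The input is taken to be a precomputed matrix of per-frame band energies.

```python
def sub_fingerprints(band_energies):
    """Turn a (frames x bands) energy matrix into one bit tuple per frame.

    Bit b of frame t is 1 when the energy difference between bands b and
    b + 1 increased relative to frame t - 1, and 0 otherwise.
    """
    prints = []
    for prev, cur in zip(band_energies, band_energies[1:]):
        bits = tuple(
            1 if (cur[b] - cur[b + 1]) - (prev[b] - prev[b + 1]) > 0 else 0
            for b in range(len(cur) - 1)
        )
        prints.append(bits)
    return prints

def bit_error_rate(prints_a, prints_b):
    """Fraction of differing bits between two equally sized fingerprints."""
    pairs = [(x, y) for fa, fb in zip(prints_a, prints_b) for x, y in zip(fa, fb)]
    return sum(x != y for x, y in pairs) / len(pairs)
```

Because only the signs of energy differences are kept, a uniform gain change leaves the bits unchanged, and a query can be matched to a stored song by looking for a low bit error rate.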

Normalization of Non-standard Texts for Persian Language Using Neural Networks

M.Sc. Thesis, Sharif University of Technology. Seyyedi, Javad (Author); Sameti, Hossein (Supervisor)

Abstract

The purpose of this research is to normalize non-standard Persian texts. We propose a method to transform texts with any non-standard structure into a formal, standard form. One of the major complications of text normalization is the large variety of non-standard structures and the fact that this diversity cannot be captured by a single structural pattern. Furthermore, the concept of text normalization has different definitions in different situations, and each of these settings needs a distinct normalization method. Supervised learning methods are not suitable for normalization due to the variety of both standard and non-standard texts as well as the absence of...

Design and Performance Improvement of a Spoken Term Detection System

M.Sc. Thesis, Sharif University of Technology. Ghadirinia, Marzieh (Author); Sameti, Hossein (Supervisor)

Abstract

The recent widespread use of video and radio data makes efficient speech information retrieval systems crucial. In the present work, our focus is on spoken term detection, which is one of the most important approaches to information retrieval. The present system includes two main steps; the first is speech processing by means of automatic speech recognition. In the recognition step, we use a large vocabulary. In all recent approaches, the main concern is to retrieve words that are out of vocabulary (OOV). The state-of-the-art way to tackle this problem is to exploit proxy keywords, which are in-vocabulary words that can be recognized in place of OOV words. Such proxies have...
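The proxy-keyword idea can be illustrated with a simple phonetic nearest-neighbor search. The sketch below ranks in-vocabulary words by Levenshtein distance between phone sequences; this is a deliberately simplified stand-in, since the abstract does not describe how the thesis generates proxies, and the lexicon format and function names are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def proxy_keywords(oov_phones, lexicon, top_n=3):
    """Rank in-vocabulary words by phonetic closeness to an OOV term.

    lexicon maps each in-vocabulary word to its phone sequence.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    ranked = sorted(
        lexicon, key=lambda w: (edit_distance(oov_phones, lexicon[w]), w)
    )
    return ranked[:top_n]
```

The returned words are then searched for in the recognizer output in place of the OOV term, which the recognizer itself can never produce.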

Discriminative Articulatory Models for Spoken Term Detection in Low-Resource Conditions

M.Sc. Thesis, Sharif University of Technology. Gomar, Zahra (Author); Sameti, Hossein (Supervisor)

Abstract

This thesis focuses on spoken term detection based on speech recognition in low-resource conditions. A spoken term detection system is composed of two parts: speech recognition and search. In the search stage, the proxy-word method is used as the basic approach to overcome the problem of OOV words. The main challenges in the low-resource context are the poorly trained acoustic and language models and the small lexicon used in speech recognition. A small lexicon increases the number of OOV words. In this thesis, two innovations are proposed to improve the basic system. The first is training a bottleneck neural network for extracting the articulatory features of...

Improving Speech Signal Models for Statistical Parametric Speech Synthesis

Ph.D. Dissertation, Sharif University of Technology. Khorram, Soheil (Author); Sameti, Hossein (Supervisor)

Abstract

Statistical parametric speech synthesis (SPSS) has dominated the speech synthesis research area over the last decade due to its remarkable advantages, such as high intelligibility and flexibility. Decision tree-clustered context-dependent hidden semi-Markov models are typically used in SPSS to represent probability densities of acoustic features given contextual factors. This research addresses four major limitations of this decision tree-based structure: (a) The decision tree structure lacks adequate context generalization; (b) It is unable to express complex context dependencies; (c) Parameters generated from this structure exhibit sudden transitions between adjacent states; (d) This...

High-Performance Keyword Spotting System for Persian Language

M.Sc. Thesis, Sharif University of Technology. Ghorbani, Shahram (Author); Sameti, Hossein (Supervisor)

Abstract

Keyword spotting with high speed and accuracy is an important subject within the speech processing domain, especially when dealing with various transmission channels. In this research, discriminative keyword spotting methods are compared with HMM-based approaches. We employ the discriminative approaches as our baseline methods due to their higher accuracy. The drawback of the conventional discriminative methods is their high computational cost and long execution time. The discriminative approach consists of two steps: feature extraction and classification. We propose four ideas to improve the performance of the baseline method. To improve the speed of the process, in feature...

Telephony Text-Independent Speaker Verification in Total Variability Space

M.Sc. Thesis, Sharif University of Technology. Mirian, Alireza (Author); Sameti, Hossein (Supervisor)

Abstract

Given two speech segments, the task of speaker verification is to determine whether or not both were uttered by the same person. Most recent approaches to speaker verification are based on the Total Variability Space, which results from applying factor analysis to the GMM mean supervector space. The representation of a speech segment of arbitrary duration in this space is called an i-vector.
In this thesis, the basics of speaker verification are first described and i-vector approaches are explained in more detail. Then, a method for improving the accuracy of Cosine Similarity Scoring is proposed, which normalizes the raw score using the score of the test utterance against a model- and...
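The raw Cosine Similarity Scoring step referred to above can be sketched in a few lines. The example treats i-vectors as plain lists of floats and uses a caller-supplied decision threshold; both are assumptions for illustration, and the score normalization proposed in the thesis is not reproduced because the abstract is truncated before describing it.

```python
import math

def cosine_score(ivector_a, ivector_b):
    """Cosine of the angle between two i-vectors (raw verification score)."""
    dot = sum(x * y for x, y in zip(ivector_a, ivector_b))
    norm_a = math.sqrt(sum(x * x for x in ivector_a))
    norm_b = math.sqrt(sum(y * y for y in ivector_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("i-vectors must be non-zero")
    return dot / (norm_a * norm_b)

def same_speaker(ivector_a, ivector_b, threshold):
    """Accept the trial when the raw cosine score reaches the threshold."""
    return cosine_score(ivector_a, ivector_b) >= threshold
```

The score depends only on the direction of the two vectors, not on their lengths, so scaling either i-vector leaves the decision unchanged.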