Design and Improvement of Sequence-level Objective Functions for DNN-based Large Vocabulary Continuous Speech Recognition

, Ph.D. Dissertation Sharif University of Technology Hadian, Hossein (Author) ; Sameti, Hossein (Supervisor)

Abstract

This thesis focuses on the problem of large vocabulary continuous speech recognition (LVCSR). Numerous research results in recent years have proved the effectiveness of deep neural networks (DNNs) for LVCSR. As a result, many methods have been proposed to incorporate DNNs in LVCSR. These methods can be viewed in terms of the objective functions used for training DNNs. A frame-level objective function is one that is defined on frames locally, whereas a sequence-level objective function is defined on whole sequences. Since speech recognition is essentially a sequential problem, here we focus on designing and improving sequence-level objective functions for DNNs. The main proposed...

Reducing speech recognition costs: By compressing the input data

, Article IS'2012 - 2012 6th IEEE International Conference Intelligent Systems, Proceedings ; 2012 , Pages 102-107 ; 9781467327824 (ISBN) Halavati, R ; Shouraki, S. B ; Sharif University of Technology

2012

Abstract

One of the key constraints of using embedded speech recognition modules is the required computational power. To decrease this requirement, we propose an algorithm that clusters the speech signal before passing it to the recognition units. The algorithm is based on agglomerative clustering and produces a sequence of compressed frames, optimized for recognition. Our experimental results indicate that the proposed method yields an average frame rate of 40 frames per second on medium to large vocabulary isolated word recognition tasks without loss of recognition accuracy, which results in up to 60% faster recognition compared to the usual 100 fps fixed frame rate sampling. This value is quite...

The effect of phase information in speech enhancement and speech recognition

, Article 2012 11th International Conference on Information Science, Signal Processing and their Applications, ISSPA 2012, 2 July 2012 through 5 July 2012 ; 2012 , Pages 1446-1447 ; 9781467303828 (ISBN) Langarani, M. S. E ; Veisi, H ; Sameti, H ; Sharif University of Technology

2012

Abstract

The majority of speech enhancement methods perform noise removal in the spectral domain and construct the enhanced speech signal from the estimated magnitude of clean speech and the phase of the noisy speech. In this paper, we show that by incorporating the phase information in the enhancement process, the quality and intelligibility of the speech signal are improved. In our investigations, the minimum mean-square error (MMSE) short-time spectral amplitude and MMSE log-spectral amplitude methods are used to estimate the magnitude spectrum of the speech signal. By conducting six classes of experiments, it is shown that by taking the phase information into account, overall SNR and PESQ measures are improved. In...

A novel approach to HMM-based speech recognition system using particle swarm optimization

, Article BIC-TA 2009 - Proceedings, 2009 4th International Conference on Bio-Inspired Computing: Theories and Applications, 16 October 2009 through 19 October 2009 ; 2009 , Pages 296-301 ; 9781424438655 (ISBN) Najkar, N ; Razzazi, F ; Sameti, H ; Sharif University of Technology

Abstract

The core of HMM-based speech recognition systems is the Viterbi algorithm, which uses dynamic programming to find the best alignment between the input speech and a given speech model. In this paper, dynamic programming is replaced by a search method based on the particle swarm optimization algorithm. The main idea is to generate an initial population of segmentation vectors in the solution search space and to improve the location of segments with an updating algorithm. Two methods are introduced for the representation of each particle and the movement structure. The results show that the effect of these factors is noticeable in finding the global optimum while...
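For reference, the dynamic-programming baseline that this abstract refers to can be sketched as follows. This is a generic textbook Viterbi decoder in the log domain, not the authors' code and not their particle swarm method; the function name `viterbi` and the argument names `log_init`, `log_trans`, and `log_emit` are illustrative assumptions.

```python
def viterbi(log_init, log_trans, log_emit):
    """Generic log-domain Viterbi decoder (illustrative sketch).

    log_init[s]      -- log probability of starting in state s
    log_trans[p][s]  -- log probability of moving from state p to state s
    log_emit[t][s]   -- log likelihood of frame t under state s
    Returns (best_log_prob, best_state_path).
    """
    n_states = len(log_init)
    # Scores of the best partial paths ending in each state at frame 0.
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t

    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Pick the predecessor that maximizes the path score into s.
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path
```

The cost of this exhaustive recursion grows with the number of frames times the square of the number of states, which is the kind of search that a population-based method would try to approximate.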

Compensation of channel and noise distortions combining maximum likelihood based spectral subtraction and Normalization

, Article 2007 IEEE International Conference on Signal Processing and Communications, ICSPC 2007, Dubai, 14 November 2007 through 27 November 2007 ; 2007 , Pages 508-511 ; 9781424412365 (ISBN) Safayani, M ; Babaali, B ; Manzuri Shalmani, M. T ; Sameti, H ; Khaleghi, S ; Sharif University of Technology

2007

Abstract

Channel distortion may dramatically degrade speech recognition performance in a distant-talking environment. In their recent work [1], the authors proposed a novel spectral subtraction method, which they named maximum likelihood based spectral subtraction (MLBSS). They indicated that recognition performance could be improved dramatically by adjusting filter parameters based on recognition results. Previous results show the effectiveness of this method in dealing with additive distortion. In this paper we propose an approach for increasing the robustness of this method against channel distortion in distant-talking environments. We add Cepstral Mean Normalization (CMN) in designing the MLBSS filter and show that by...
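As background on the normalization step named above: a stationary channel acts as a convolution in time, which becomes an additive offset in the cepstral domain, so subtracting the per-utterance mean of each cepstral coefficient removes it. The sketch below shows plain per-utterance CMN only; it is not the MLBSS filter design from the paper, and the function name `cepstral_mean_normalize` is an assumption made for illustration.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over an utterance (illustrative sketch).

    frames -- list of cepstral feature vectors, one list of floats per frame,
              all of the same length.
    Returns a new list of mean-normalized frames.
    """
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral coefficient across all frames of the utterance.
    means = [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]
    # A constant channel offset is part of this mean, so subtracting it
    # removes the offset from every frame.
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]
```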

A Soft Spectrographic Mask Estimation for Speech Recognition

, M.Sc. Thesis Sharif University of Technology Esmaeelzadeh, Vahid (Author) ; Sameti, Hossein (Supervisor)

Abstract

Nowadays, the robustness of Automatic Speech Recognition (ASR) systems against various noises is a major challenge. This thesis pursues missing-feature speech recognition approaches to achieve robust ASR systems. In these approaches, low SNR regions of a spectrogram are considered to be "missing" or "unreliable" and are removed from the spectrogram. Noise compensation is carried out by either estimating the missing regions from the remaining regions in some manner prior to recognition, or by performing recognition directly on incomplete spectrograms. These techniques clearly require a "spectrographic mask" which accurately labels the reliable and unreliable regions...

Automatic Speech Recognition System for Pilot-Air Traffic Service Units Communications

, M.Sc. Thesis Sharif University of Technology Azadmanesh, Mahsa (Author) ; Bahrani, Mohammad (Supervisor) ; Baba Ali, Bagher (Co-Advisor) ; Pazooki, Farshad (Co-Advisor)

Abstract

Currently, in the Islamic Republic of Iran, after aviation accidents and incidents, conversations between pilots and air traffic controllers are re-examined by the State Air Transport Organization of the Islamic Republic of Iran and turned into text. The Automatic Recognition System for Pilot-Air Traffic Service Units’ Communication helps in the implementation of speech recognition. Reducing the time and cost of converting conversations into text and creating an aviation database in the country are other uses of this system. In this research, after collecting and refining actual conversations between pilots and air traffic controllers and examining seven methods, we design a system that...

Language Modeling for Persian using Recurrent Neural Networks

, M.Sc. Thesis Sharif University of Technology Pourbagheri, Mohammad (Author) ; Sameti, Hossein (Supervisor)

Abstract

In recent years, neural networks have been used for language modeling in tasks related to natural language processing. In these models, various neural network structures have been used, and recurrent neural networks (RNNs) have achieved good results in these tasks. Since RNNs are not limited to a fixed number of words for predicting the next word, they have achieved better results than feedforward networks. However, these networks have difficulty learning long sequences, and long short-term memory (LSTM) networks have been presented to solve this problem. In this research, language models are built for the Persian language using RNNs and LSTMs, and are compared with n-gram-based models. For...

Discriminative Articulatory Models for Spoken Term Detection in Low-Resource Conditions

, M.Sc. Thesis Sharif University of Technology Gomar, Zahra (Author) ; Sameti, Hossein (Supervisor)

Abstract

This thesis focuses on spoken term detection based on speech recognition in low-resource conditions. A spoken term detection system is composed of two parts: speech recognition and search. In the search for words, the proxy-words method is used as a baseline approach to overcome the problem of OOV words. The main challenge in this thesis, in the context of low resources, is the poor training of acoustic and language models and the small lexicon in speech recognition. A small lexicon increases the number of OOV words. In this thesis, two innovations are proposed to improve the baseline system. The first is training a bottleneck neural network for extracting the articulatory features of...

A Speech Driven Web Browser

, M.Sc. Thesis Sharif University of Technology Rashidi Fard, Amin (Author) ; Vosoughi Vahdat, Bijan (Supervisor)

Abstract

Generally speaking, a web browser is a software application for surfing the World Wide Web. A user with a web browser can request web pages on the Internet. The request is sent to a web server and analyzed, and the result is shown to the end user by the web browser GUI. A web browser has different parts such as an HTML parser, renderer, browser engine and GUI. The GUI is one of the most important parts of a web browser, because end users interact with it. The classical GUI for surfing has been used on various platforms, such as PC and laptop operating systems. Because of technological advances and the introduction of tablets and other touch screen devices, i.e., smart...

Automatic Concept Extraction to Improve the Recognition Performance for Sequential Patterns

, Ph.D. Dissertation Sharif University of Technology Halavati, Ramin (Author) ; Bagheri Shouraki, Saeid (Supervisor)

Abstract

In this dissertation, we introduced a Fuzzy based representation and comparison method for sequential patterns such as speech and online handwriting. The new model, called Fuzzy Elastic Matching Machine (FEMM), is simpler than traditional HMM based approaches and is not based on the common statistical assumptions of HMM systems. The model was tested on isolated word and phoneme recognition tasks in speech recognition domain and isolated letter recognition in Persian handwriting recognition. We showed that this method is faster than traditional HMM based models and more robust to noise. To train the model, we introduced a Symbiogenesis-based evolutionary training algorithm. This algorithm...

Evaluation of Performance and Power Improvement Methods for Inference in Deep Neural Network-based Speech-to-Text Conversion on Mobile Devices

, M.Sc. Thesis Sharif University of Technology Katebi, Hossein (Author) ; Goudarzi, Maziar (Supervisor)

Abstract

Automatic Speech Recognition (ASR) systems are a significant part of Personal Assistants in mobile phones. But because of the time-dependent nature of ASR systems, they are computation and memory-intensive tasks. On the other hand, mobile devices utilize a Low-Power design to extend battery life and improve user experience, making them incompatible with heavy-loaded tasks such as ASR systems. For instance, if we run an inference with a 60 seconds audio file on a well-known open-sourced Speech Recognition System named DeepSpeech, it will only take 49 seconds for a desktop PC to generate the results. Still, a mobile phone with ARM64 architecture with the same input file will take 92 seconds to...
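The timings quoted in this abstract can be expressed as a real-time factor (RTF), i.e. processing time divided by audio duration, a common way to report ASR inference speed. The snippet below is only a worked calculation on the two figures given above; the function and variable names are illustrative assumptions and are not taken from the thesis.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Return processing time divided by audio duration (illustrative sketch).

    An RTF below 1 means the audio is transcribed faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures quoted in the abstract: a 60-second audio file.
desktop_rtf = real_time_factor(49, 60)  # desktop PC
mobile_rtf = real_time_factor(92, 60)   # ARM64 mobile phone
```

On these figures the desktop runs slightly faster than real time, while the mobile phone needs about one and a half times the audio duration.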

SR-NBS: A fast sparse representation based N-best class selector for robust phoneme classification

, Article Engineering Applications of Artificial Intelligence ; Vol. 28 , 2014 , pp. 155-164 Saeb, A ; Razzazi, F ; Babaie-Zadeh, M ; Sharif University of Technology

Abstract

Although exemplar based approaches have shown good accuracy in classification problems, some limitations are observed in the accuracy of exemplar based automatic speech recognition (ASR) applications. The main limitation of these algorithms is their high computational complexity which makes them difficult to extend to ASR applications. In this paper, an N-best class selector is introduced based on sparse representation (SR) and a tree search strategy. In this approach, the classification is fulfilled in three steps. At first, the set of similar training samples for the specific test sample is selected by k-dimensional (KD) tree search algorithm. Then, an SR based N-best class selector is...

HMM-based Persian speech synthesis using limited adaptation data

, Article International Conference on Signal Processing Proceedings, ICSP ; Volume 1 , 2012 , Pages 585-589 ; 9781467321945 (ISBN) Bahmaninezhad, F ; Sameti, H ; Khorram, S ; Sharif University of Technology

2012

Abstract

Speech synthesis systems provided for the Persian language so far need various large-scale speech corpora to synthesize the voices of several target speakers. Accordingly, synthesizing speech with a small amount of data seems to be essential in Persian. Taking advantage of speaker adaptation in speech synthesis systems makes it possible to generate speech with remarkable quality when the data of the speaker are limited. Here we applied this method for the first time in Persian. This paper describes speaker adaptation based on Hidden Markov Models (HMMs) in a Persian speech synthesis system for the FARsi Speech DATabase (FARSDAT). In this regard, we prepared the whole FARSDAT, then for...

Automatic noise recognition based on neural network using LPC and MFCC feature parameters

, Article 2012 Federated Conference on Computer Science and Information Systems, FedCSIS 2012, 9 September 2012 through 12 September 2012 ; 2012 , Pages 69-73 ; 9781467307086 (ISBN) Haghmaram, R ; Aroudi, A ; Ghezel, M. H ; Veisi, H ; Sharif University of Technology

2012

Abstract

This paper studies the automatic noise recognition problem based on RBF and MLP neural network classifiers using linear predictive and Mel-frequency cepstral coefficients (LPC and MFCC). We first briefly review the architecture of each network as an automatic noise recognition (ANR) approach, then compare them to each other and investigate factors and criteria that influence final recognition performance. The proposed networks are evaluated on 15 stationary and non-stationary types of noise with a frame length of 20 ms in terms of correct classification rate. The results demonstrate that the MLP network using LPCs is a precise ANR with an accuracy rate of 99.9%, while the RBF network with MFCCs...

Optimum detection and location estimation of target lines in the range-time space of a search radar

, Article Aerospace Science and Technology ; Volume 15, Issue 8 , 2011 , Pages 627-634 ; 12709638 (ISSN) Moqiseh, A ; Sharify, S ; Nayebi, M. M ; Sharif University of Technology

2011

Abstract

The average likelihood ratio detector is derived as the optimum detector for detecting a target line with unknown normal parameters in the range-time data space of a search radar, which is corrupted by Gaussian noise. The receiver operating characteristic of this optimum detector is derived to evaluate its performance improvement in comparison with the Hough detector, which uses the return signal of several successive scans to achieve a non-coherent integration improvement and get a better performance than the conventional detector. This comparison, which is done through analytic derivations and also through simulation results, shows that the average likelihood ratio detector has a better...

Isolated word recognition based on intelligent segmentation by using hybrid HTD-HMM

, Article International Conference on Circuits, Systems, Signal and Telecommunications - Proceedings, 21 October 2010 through 23 October 2010 ; October , 2011 , Pages 38-41 ; 9789604742714 (ISBN) Kazemi, A. R ; Ehsandoust, B. B ; Rezazadeh, C. A ; Ghaemmaghami, D. S ; Sharif University of Technology

2011

Abstract

In recent years, IWR (Isolated Word Recognition) has been one of the main concerns of speech processing. The challenging problems in this field appear when the database becomes very large or when it contains many similarly pronounced words. This paper introduces a general solution for a traditional problem in isolated similarly pronounced word recognition, especially in large databases. One of the important problems of traditional IWR methods lies in their segmentation algorithms, which lack efficiency for the following reasons: first, equal segmentation is not at all intelligent and, as a result, cannot produce accurate results; besides, utilizing...

Non-speaker information reduction from Cosine Similarity Scoring in i-vector based speaker verification

, Article Computers and Electrical Engineering ; Volume 48 , November , 2015 , Pages 226–238 ; 00457906 (ISSN) Zeinali, H ; Mirian, A ; Sameti, H ; BabaAli, B ; Sharif University of Technology

Elsevier Ltd 2015

Abstract

Cosine similarity and Probabilistic Linear Discriminant Analysis (PLDA) in i-vector space are two state-of-the-art scoring methods in the speaker verification field. While PLDA usually gives better accuracy, Cosine Similarity Scoring (CSS) remains a widely used method due to its simplicity and acceptable performance. In this domain, several channel compensation and score normalization methods have been proposed to improve the performance. We investigate non-speaker information in the cosine similarity metric and propose a new approach to remove it from the decision making process. I-vectors hold a large amount of non-speaker information such as channel effects, language, and phonetic content. This type...
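For readers unfamiliar with the baseline this abstract builds on, plain cosine similarity scoring compares an enrollment vector with a test vector and accepts the trial when the score reaches a threshold. The sketch below shows only that generic baseline, not the non-speaker information removal proposed in the paper; the names `cosine_score` and `accept` and the threshold argument are assumptions made for illustration.

```python
import math


def cosine_score(enroll_vector, test_vector):
    """Cosine of the angle between two equal-length vectors (illustrative sketch)."""
    if len(enroll_vector) != len(test_vector):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(enroll_vector, test_vector))
    norm = (math.sqrt(sum(a * a for a in enroll_vector))
            * math.sqrt(sum(b * b for b in test_vector)))
    if norm == 0:
        raise ValueError("vectors must be non-zero")
    return dot / norm


def accept(enroll_vector, test_vector, threshold):
    """Accept the claimed identity when the score reaches the threshold."""
    return cosine_score(enroll_vector, test_vector) >= threshold
```

Because the score depends only on the angle between the vectors, anything that rotates both vectors in a similar direction, such as shared channel or language effects, raises the score regardless of speaker identity.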

A new bigram-PLSA language model for speech recognition

, Article Eurasip Journal on Advances in Signal Processing ; Volume 2010 , July , 2010 ; 16876172 (ISSN) Bahrani, M ; Sameti, H ; Sharif University of Technology

2010

Abstract

A novel method for combining the bigram model and Probabilistic Latent Semantic Analysis (PLSA) is introduced for language modeling. The motivation behind this idea is the relaxation of the bag-of-words assumption fundamentally present in latent topic models including the PLSA model. An EM-based parameter estimation technique for the proposed model is presented in this paper. Previous attempts to incorporate word order in the PLSA model are surveyed and compared with our newly proposed model both in theory and by experimental evaluation. The perplexity measure is employed to compare the effectiveness of recently introduced models with the newly proposed model. Furthermore, experiments are designed and...
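The perplexity measure mentioned in this abstract is the exponential of the average negative log probability that a language model assigns to each word of a test text, so lower values indicate a better model. The sketch below computes it from a list of per-word natural-log probabilities; it is a generic definition, not the bigram-PLSA model itself, and the function name `perplexity` is an illustrative assumption.

```python
import math


def perplexity(word_log_probs):
    """Perplexity from per-word natural-log probabilities (illustrative sketch).

    word_log_probs -- log P(word | history) for every word in the test text,
                      as produced by whatever language model is being evaluated.
    """
    if not word_log_probs:
        raise ValueError("need at least one word")
    avg_neg_log_prob = -sum(word_log_probs) / len(word_log_probs)
    return math.exp(avg_neg_log_prob)
```

A model that assigns every word a probability of 1/V has perplexity V, which gives the measure its usual reading as an effective branching factor.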

i-vector/HMM based text-dependent speaker verification system for RedDots challenge

, Article 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016, 8 September 2016 through 16 September 2016 ; Volume 08-12-September-2016 , 2016 , Pages 440-444 ; 2308457X (ISSN) Zeinali, H ; Sameti, H ; Burget, L ; Černocký, J. H ; Maghsoodi, N ; Sharif University of Technology

International Speech and Communication Association 2016

Abstract

Recently, a new data collection was initiated within the RedDots project in order to evaluate text-dependent and text-prompted speaker recognition technology on data from a wider speaker population and with more realistic noise, channel and phonetic variability. This paper analyses our systems built for the RedDots challenge, the effort to collect and compare the initial results on this new evaluation data set obtained at different sites. We use our recently introduced HMM-based i-vector approach, where, instead of the traditional GMM, a set of phone-specific HMMs is used to collect the sufficient statistics for i-vector extraction. Our systems are trained in a completely phrase-independent way on...