Incorporating a novel confidence scoring method in a Persian spoken dialogue system

, Article SPA 2011 - Signal Processing: Algorithms, Architectures, Arrangements, and Applications - Conference Proceedings, 29 September 2011 through 30 September 2011, Poznan ; September , 2011 , Pages 74-78 ; 9781457714863 (ISBN) Sakhaee, E ; Sameti, H ; Babaali, B ; Sharif University of Technology

2011

Abstract

Reliability assessment of phonemes, syllables, words, concepts or utterances has become a key feature of Automatic Speech Recognition (ASR) engines for deciding whether to accept or reject a hypothesis. In this paper, we propose utterance-level confidence annotation based on a combination of features extracted from multiple knowledge sources in the Persian language. The experiment was conducted first to examine the performance of individual features, then to combine them using statistical data analysis and density estimation methods to assign a confidence score to utterances. Using the data collected from a Persian spoken dialogue system, we show that combining features from independent...

The integration of principal component analysis and cepstral mean subtraction in parallel model combination for robust speech recognition

, Article Digital Signal Processing: A Review Journal ; Volume 21, Issue 1 , 2011 , Pages 36-53 ; 10512004 (ISSN) Veisi, H ; Sameti, H ; Sharif University of Technology

Abstract

This paper addresses the problem of automatic speech recognition in real applications in which the speech signal is altered by various noises. Feature compensation and model compensation robustness methods are studied. Parallel model combination (PMC) and its recent advances are reviewed and a novel algorithm called PC-PMC is proposed. This algorithm combines the normalization ability of cepstral mean subtraction (CMS) and the compression and de-correlation capability of principal component analysis (PCA) with the PMC model transformation method. The PC-PMC algorithm takes advantage of the additive noise compensation ability of PMC and the convolutional noise removal capability of CMS and PCA. In...
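As a rough illustration of the CMS step mentioned in the abstract above, the following sketch subtracts the per-coefficient mean over one utterance. It is not the PC-PMC algorithm itself; the function name and the toy feature values are assumptions made for this example. The point it shows is that a constant offset in the cepstral domain (the usual model of a stationary convolutional channel) disappears after mean subtraction.

```python
from typing import List


def cepstral_mean_subtraction(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-coefficient mean computed over all frames of one utterance."""
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Mean of each cepstral coefficient across time.
    means = [sum(frame[k] for frame in frames) / n_frames for k in range(n_coeffs)]
    return [[frame[k] - means[k] for k in range(n_coeffs)] for frame in frames]


# Toy cepstral features (hypothetical values): 3 frames, 2 coefficients each.
clean = [[1.0, -2.0], [3.0, 0.0], [2.0, 2.0]]
# A stationary channel adds the same offset to every frame in the cepstral domain.
channel_offset = [0.5, -1.5]
distorted = [[c + h for c, h in zip(frame, channel_offset)] for frame in clean]

normalized_clean = cepstral_mean_subtraction(clean)
normalized_distorted = cepstral_mean_subtraction(distorted)
```

After normalization the two feature sequences coincide, which is why CMS is commonly used against convolutional distortion while additive noise needs a separate treatment such as PMC.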

Creating a corpus for automatic punctuation prediction in persian texts

, Article 2017 25th Iranian Conference on Electrical Engineering, ICEE 2017, 2 May 2017 through 4 May 2017 ; 2017 , Pages 1537-1542 ; 9781509059638 (ISBN) Hosseini, S. M ; Sameti, H ; Sharif University of Technology

Abstract

We present a novel corpus for automatic punctuation prediction in Persian texts. Punctuation prediction is an important task in automatic speech recognition (ASR). The output of ASR systems is typically a raw sequence of words with no punctuation marks; this makes the text difficult or even impossible to make sense of for humans and also for any text processing unit. In this work, we have assembled a state-of-the-art Persian corpus to train and test a punctuation prediction model. To the best of our knowledge, this is the first ever corpus specifically designed for punctuation prediction in Persian texts. The corpus is a modification of a manually part-of-speech (POS) tagged Persian one,...

Phone duration modeling for LVCSR using neural networks

, Article 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, 20 August 2017 through 24 August 2017 ; Volume 2017-August , 2017 , Pages 518-522 ; 2308457X (ISSN) Hadian, H ; Povey, D ; Sameti, H ; Khudanpur, S ; Amazon Alexa; Apple; DiDi; et al.; Furhat Robotics; Microsoft ; Sharif University of Technology

International Speech Communication Association 2017

Abstract

We describe our work on incorporating probabilities of phone durations, learned by a neural net, into an ASR system. Phone durations are incorporated via lattice rescoring. The input features are derived from the phone identities of a context window of phones, plus the durations of preceding phones within that window. Unlike some previous work, our network outputs the probability of different durations (in frames) directly, up to a fixed limit. We evaluate this method on several large vocabulary tasks, and while we consistently see improvements in Word Error Rates, the improvements are smaller when the lattices are generated with neural net based acoustic models. Copyright © 2017 ISCA

Improving LF-MMI using unconstrained supervisions for ASR

, Article 2018 IEEE Spoken Language Technology Workshop, SLT 2018, 18 December 2018 through 21 December 2018 ; 2019 , Pages 43-47 ; 9781538643341 (ISBN) Hadian, H ; Povey, D ; Sameti, H ; Trmal, J ; Khudanpur, S ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2019

Abstract

We present our work on improving the numerator graph for discriminative training using the lattice-free maximum mutual information (MMI) criterion. Specifically, we propose a scheme for creating unconstrained numerator graphs by removing time constraints from the baseline numerator graphs. This leads to much smaller graphs and therefore faster preparation of training supervisions. By testing the proposed unconstrained supervisions using factorized time-delay neural network (TDNN) models, we observe 0.5% to 2.6% relative improvement over the state-of-the-art word error rates on various large-vocabulary speech recognition databases. © 2018 IEEE
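For readers unfamiliar with how a relative improvement in word error rate (WER) is computed, the sketch below shows the arithmetic. WER is taken here as the word-level edit distance divided by the reference length, and the relative improvement is the WER drop divided by the baseline WER. The function names and all numbers in the example are illustrative assumptions and are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which the new WER is lower than the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Made-up example: one substitution and one deletion against a six-word reference.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
# Made-up example: a baseline WER of 10.0% reduced to 9.8% is a 2% relative improvement.
example_gain = relative_improvement(0.100, 0.098)
```

Note that a relative improvement of a few percent corresponds to a much smaller absolute change in WER, so the two should not be confused when comparing systems.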

A POS-based fuzzy word clustering algorithm for continuous speech recognition systems

, Article 2007 9th International Symposium on Signal Processing and its Applications, ISSPA 2007, Sharjah, 12 February 2007 through 15 February 2007 ; 2007 ; 1424407796 (ISBN); 9781424407798 (ISBN) Momtazi, S ; Sameti, H ; Bahrani, M ; Hafezi, N ; Sharif University of Technology

2007

Abstract

Word-based n-gram language models are prevalent in continuous speech recognition systems. To use this type of language model, we must extract it from large corpora. Since Persian corpora are not rich, the extracted language models are not reliable. For this reason, most researchers extract class n-grams instead of word n-grams. In this research, a new idea for fuzzy word clustering is presented in which each word can be assigned to more than one class. The fuzzy c-means algorithm is used as our clustering method, and we have examined its various parameters. Finally, this algorithm was applied to the 20000 most frequent Persian words extracted from "Persian Text...
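The idea that a word may belong to more than one class can be illustrated with the standard fuzzy c-means updates. The sketch below is a generic textbook version, not the authors' implementation: the feature vectors, the fuzzifier value `m = 2.0`, the iteration count, and all function names are assumptions chosen for this example.

```python
import math
from typing import List, Sequence, Tuple


def memberships(point: Sequence[float], centers: List[Sequence[float]], m: float = 2.0) -> List[float]:
    """Degree of membership of one point in every cluster (values sum to 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to those centers.
    zero = [i for i, d in enumerate(dists) if d == 0.0]
    if zero:
        return [1.0 / len(zero) if i in zero else 0.0 for i in range(len(centers))]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


def update_centers(points, u, old_centers, m: float = 2.0) -> List[List[float]]:
    """Each center becomes the mean of all points weighted by membership ** m."""
    dim = len(points[0])
    centers = []
    for i, old in enumerate(old_centers):
        weights = [row[i] ** m for row in u]
        total = sum(weights)
        if total == 0.0:
            centers.append(list(old))  # no point supports this cluster; keep it unchanged
            continue
        centers.append([sum(w * p[k] for w, p in zip(weights, points)) / total for k in range(dim)])
    return centers


def fuzzy_c_means(points, centers, m: float = 2.0, iterations: int = 20) -> Tuple[list, list]:
    """Alternate membership and center updates for a fixed number of iterations."""
    for _ in range(iterations):
        u = [memberships(p, centers, m) for p in points]
        centers = update_centers(points, u, centers, m)
    return centers, [memberships(p, centers, m) for p in points]


# Hypothetical 1-D "word features"; real features would be derived from POS statistics.
toy_points = [[0.0], [0.1], [0.5], [0.9], [1.0]]
final_centers, final_u = fuzzy_c_means(toy_points, centers=[[0.2], [0.8]])
```

In this toy run the middle point ends up with a substantial membership in both clusters, which is the behavior that lets an ambiguous word contribute to more than one class n-gram.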

Building and incorporating language models for Persian continuous speech recognition systems

, Article 5th International Conference on Language Resources and Evaluation, LREC 2006, 22 May 2006 through 28 May 2006 ; 2006 , Pages 2590-2593 Bahrani, M ; Sameti, H ; Hafezi, N ; Movasagh, H ; Sharif University of Technology

European Language Resources Association (ELRA) 2006

Abstract

In this paper, building statistical language models for the Persian language using a corpus and incorporating them into a Persian continuous speech recognition (CSR) system are described. We used the Persian Text Corpus for building the language models. First, we preprocessed the texts of the corpus by correcting the different orthographies of words. Also, the number of POS tags was decreased by clustering POS tags manually. Then we extracted word-based monogram and POS-based bigram and trigram language models from the corpus. We also present the procedure of incorporating language models in a Persian CSR system. By using the language models, a 27.4% reduction in word error rate was achieved in the best case.

Computation of Confidence Measure for Detection of out of Vocabulary Words in a Continuous Speech Recognition System

, M.Sc. Thesis Sharif University of Technology Sakhaee, Elham (Author) ; Sameti, Hossein (Supervisor)

Abstract

Automatic Speech Recognition (ASR) engines are highly sensitive to conditions such as noise, transmission line quality, etc. Thus, in any real-world application, ASR systems should be able to automatically assess the reliability or probability of correctness of every decision made by the system. One technique to increase the intelligence of an ASR system is to compute a score, called a “confidence measure”, to indicate the reliability of any recognition decision made by the system. This score can be computed at any required level, such as phonemes, syllables, words or even the whole utterance. Thus a robust and accurate confidence measure results in better detection of recognition errors, Out of Vocabulary...

A fast phoneme recognition system based on sparse representation of test utterances

, Article 2014 4th Joint Workshop on Hands-Free Speech Communication and Microphone Arrays, HSCMA 2014 ; 2014 , p. 32-36 Saeb, A ; Razzazi, F ; Babaei-Zadeh, M ; Sharif University of Technology

Abstract

In this paper, a fast phoneme recognition system is introduced based on sparse representation. In this approach, the phoneme recognition is fulfilled by Viterbi decoding on support vector machines (SVM) output probability estimates. The candidate classes for classification are adaptively pruned by a k-dimensional (KD) tree search followed by a sparse representation (SR) based class selector with adaptive number of classes. We applied the proposed approach to introduce a phoneme recognition system and compared it with some well-known phoneme recognition systems according to accuracy and complexity issues. By this approach, we obtain competitive phoneme error rate with promising computational...

Speaker models reduction for optimized telephony text-prompted speaker verification

, Article Canadian Conference on Electrical and Computer Engineering, 3 May 2015 through 6 May 2015 ; Volume 2015-June, Issue June , May , 2015 , Pages 1470-1474 ; 08407789 (ISSN) Kalantari, E ; Sameti, H ; Zeinali, H ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2015

Abstract

In this article, a new scheme is proposed to use the mean supervector in a text-prompted speaker verification system. In this scheme, for each month name a subsystem is constructed and a final score based on the passphrase is computed by combining the scores of these subsystems. Results from the telephony dataset of Persian month names show that the proposed method significantly reduces EER in comparison with the state-of-the-art State-GMM-MAP method. Furthermore, it is shown that based on training set and testing set we can use 12 models per speaker instead of 220. Therefore, this scheme reduces EER and computational burden. In addition, the use of HMM instead of GMM as words' model improves...

Segmental HMM-based part-of-speech tagger

, Article 2010 International Conference on Audio, Language and Image Processing, ICALIP 2010, Shanghai, 23 November 2010 through 25 November 2010 ; 2010 , Pages 52-56 ; 9781424458653 (ISBN) Bokaei, M. H ; Sameti, H ; Bahrani, M ; Babaali, B ; Sharif University of Technology

2010

Abstract

This paper presents a solution to the problem of using an HMM-based POS tagger in languages where a word can be composed of several tokens. The Viterbi algorithm is modified to support segments of words within a model state. In other words, the proposed system has a built-in tokenizer that indicates word boundaries as well as the corresponding tag sequence.

A novel fuzzy approach to speech recognition

, Article Proceedings - HIS'04: 4th International Conference on Hybrid Intelligent Systems, Kitakyushu, 5 December 2004 through 8 December 2004 ; 2005 , Pages 340-345 ; 0769522912 (ISBN) Halavati, R ; Shouraki, S.B ; Eshraghi, M ; Alemzadeh, M ; Ziaie, P ; Ishikawa M ; Hashimoto S ; Paprzycki M ; Barakova E ; Yoshida K ; Koppen M ; Corne D.M ; Abraham A ; Sharif University of Technology

2005

Abstract

This paper presents a novel approach to speech recognition using fuzzy modeling. The task begins with conversion of the speech spectrogram into a linguistic description based on arbitrary colors and lengths. Phonemes are also described using these fuzzy measures, and recognition is done by normal fuzzy reasoning, while a genetic algorithm optimizes phoneme definitions so as to classify samples into correct phonemes. The method is tested over a standard speech database and the results are presented. © 2005 IEEE

Evolution of speech recognizer agents by artificial life

, Article Wec 05: Fourth World Enformatika Conference, Istanbul, 24 June 2005 through 26 June 2005 ; Volume 6 , 2005 , Pages 237-240 ; 9759845857 (ISBN) Halavati, R ; Bagheri Shouraki, S ; Harati Zadeh, S ; Lucas, C ; Ardil C ; Sharif University of Technology

2005

Abstract

Artificial Life can be used as an agent training approach in large state spaces. This paper presents an artificial life method to increase the training speed of some speech recognizer agents which were previously trained by genetic algorithms. Using this approach, vertical training (genetic mutations and selection) is combined with horizontal training (individual learning through reinforcement learning) and results in a much faster evolution than a simple genetic algorithm. The approach is tested and a comparison with GA cases on a standard speech database is presented. COPYRIGHT © ENFORMATIKA

An interactive tool for extracting human knowledge in speech recognition

, Article WSEAS Transactions on Computers ; Volume 4, Issue 2 , 2005 , Pages 276-279 ; 11092750 (ISSN) Ghiathi, S. K. A ; Bagheri Shouraki, S ; Sharif University of Technology

2005

Abstract

Conventional features for speech recognition have not been evaluated in terms of their importance in human speech recognition. In this paper, a method for extracting important features in an interactive process is introduced. This method can be used as an aid for experts in an ASR expert system. As an application of our method, it is also shown how an expert might find the distinguishing features between "m" and "n". As another use, it is illustrated how our method could be used to check the sufficiency of information in the quantized filter-bank for speech recognition.

A novel approach to HMM-based speech recognition systems using particle swarm optimization

, Article Mathematical and Computer Modelling ; Volume 52, Issue 11-12 , 2010 , Pages 1910-1920 ; 08957177 (ISSN) Najkar, N ; Razzazi, F ; Sameti, H ; Sharif University of Technology

2010

Abstract

The core of HMM-based speech recognition systems is the Viterbi algorithm. The Viterbi algorithm uses dynamic programming to find the best alignment between the input speech and a given speech model. In this paper, dynamic programming is replaced by a search method which is based on the particle swarm optimization algorithm. The major idea is focused on generating an initial population of segmentation vectors in the solution search space and improving the location of segments by an updating algorithm. Several methods are introduced and evaluated for the representation of particles and their corresponding movement structures. In addition, two segmentation strategies are explored. The first...
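The dynamic-programming baseline that the abstract refers to can be sketched as a standard log-domain Viterbi decoder. This is the textbook algorithm, not the particle swarm method proposed in the paper; the two-state left-to-right model and its probabilities below are assumptions made up for the example.

```python
import math
from typing import List, Tuple

NEG_INF = float("-inf")


def viterbi(log_init: List[float],
            log_trans: List[List[float]],
            log_emit: List[List[float]]) -> Tuple[List[int], float]:
    """Return the most likely state sequence and its log score.

    log_emit[t][s] is the log-likelihood of frame t under state s.
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at time t.
            best_prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emit[t][s])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical 2-state left-to-right model and 4 frames of emission likelihoods.
log_init = [0.0, NEG_INF]  # always start in state 0
log_trans = [[math.log(0.5), math.log(0.5)],
             [NEG_INF, 0.0]]  # state 1 cannot go back to state 0
log_emit = [[math.log(0.9), math.log(0.1)],
            [math.log(0.9), math.log(0.1)],
            [math.log(0.1), math.log(0.9)],
            [math.log(0.1), math.log(0.9)]]

best_path, best_score = viterbi(log_init, log_trans, log_emit)
```

The returned path is an alignment of frames to states, that is, a segmentation of the utterance; this is the quantity a population-based search would have to approximate with candidate segmentation vectors.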

Noise robust speech recognition using deep belief networks

, Article International Journal of Computational Intelligence and Applications ; Volume 15, Issue 1 , 2016 ; 14690268 (ISSN) Farahat, M ; Halavati, R ; Sharif University of Technology

World Scientific Publishing Co 2016

Abstract

Most current speech recognition systems use Hidden Markov Models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. In these systems acoustic inputs are represented by Mel Frequency Cepstral Coefficients temporal spectrogram known as frames. But MFCC is not robust to noise. Consequently, with different train and test conditions the accuracy of speech recognition systems decreases. On the other hand, using MFCCs of larger window of frames in GMMs needs more computational power. In this paper, Deep Belief Networks...

Noise reduction algorithm for robust speech recognition using MLP neural network

, Article PACIIA 2009 - 2009 2nd Asia-Pacific Conference on Computational Intelligence and Industrial Applications, 28 November 2009 through 29 November 2009 ; Volume 1 , 2009 , Pages 377-380 ; 9781424446070 (ISBN) Ghaemmaghami, M. P ; Razzazi, F ; Sameti, H ; Dabbaghchian, S ; BabaAli, B ; Sharif University of Technology

Abstract

We propose an efficient and effective nonlinear feature domain noise suppression algorithm, motivated by the minimum mean square error (MMSE) optimization criterion. A Multi Layer Perceptron (MLP) neural network in the log spectral domain minimizes the difference between noisy and clean speech. By using this method as a pre-processing stage of a speech recognition system, the recognition rate in noisy environments is improved. We can extend the application of the system to different environments with different noises without re-training it. We need only to train the preprocessing stage with a small portion of noisy data which is created by artificially adding different types of noises from the...

Robust speech recognition using MLP neural network in log-spectral domain

, Article IEEE International Symposium on Signal Processing and Information Technology, ISSPIT 2009, 14 December 2009 through 16 December 2009, Ajman ; 2009 , Pages 467-472 ; 9781424459506 (ISBN) Ghaemmaghami, M. P ; Sameti, H ; Razzazi, F ; BabaAli, B ; Dabbaghchian, S ; Sharif University of Technology

Abstract

In this paper, we have proposed an efficient and effective nonlinear feature domain noise suppression algorithm, motivated by the minimum mean square error (MMSE) optimization criterion. A Multi Layer Perceptron (MLP) neural network in the log spectral domain has been employed to minimize the difference between noisy and clean speech. By using this method as a pre-processing stage of a speech recognition system, the recognition rate in noisy environments has been improved. We extended the application of the system to different environments with different noises without retraining the HMM model. We trained the feature extraction stage with a small portion of noisy data which was created by...

End-to-end speech recognition using lattice-free MMI

, Article 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018, 2 September 2018 through 6 September 2018 ; Volume 2018-September , 2018 , Pages 12-16 ; 2308457X (ISSN) Hadian, H ; Sameti, H ; Povey, D ; Khudanpur, S ; Sharif University of Technology

International Speech Communication Association 2018

Abstract

We present our work on end-to-end training of acoustic models using the lattice-free maximum mutual information (LF-MMI) objective function in the context of hidden Markov models. By end-to-end training, we mean flat-start training of a single DNN in one stage without using any previously trained models, forced alignments, or building state-tying decision trees. We use full biphones to enable context-dependent modeling without trees, and show that our end-to-end LF-MMI approach can achieve comparable results to regular LF-MMI on well-known large vocabulary tasks. We also compare with other end-to-end methods such as CTC in character-based and lexicon-free settings and show 5 to 25 percent...

Using ASR methods for OCR

, Article 15th IAPR International Conference on Document Analysis and Recognition, ICDAR 2019, 20 September 2019 through 25 September 2019 ; 2019 , Pages 663-668 ; 15205363 (ISSN); 9781728128610 (ISBN) Arora, A ; Garcia, P ; Watanabe, S ; Manohar, V ; Shao, Y ; Khudanpur, S ; Chang, C. C ; Rekabdar, B ; Babaali, B ; Povey, D ; Etter, D ; Raj, D ; Hadian, H ; Trmal, J ; Sharif University of Technology

IEEE Computer Society 2019

Abstract

Hybrid deep neural network hidden Markov models (DNN-HMM) have achieved impressive results on large vocabulary continuous speech recognition (LVCSR) tasks. However, recent approaches using DNN-HMM models have not been explored much for text recognition. Inspired by the current work in automatic speech recognition (ASR) and machine translation, we present an open vocabulary sub-word text recognition system. The sub-word lexicon and sub-word language model (LM) help in overcoming the challenge of recognizing out of vocabulary (OOV) words, and a time delay neural network (TDNN) and convolution neural network (CNN) based DNN-HMM optical model (OM) efficiently models the sequence dependency in the...