Phone duration modeling for LVCSR using neural networks

, Article 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017, 20 August 2017 through 24 August 2017 ; Volume 2017-August , 2017 , Pages 518-522 ; 2308457X (ISSN) Hadian, H ; Povey, D ; Sameti, H ; Khudanpur, S ; Amazon Alexa; Apple; DiDi; et al.; Furhat Robotics; Microsoft ; Sharif University of Technology

International Speech Communication Association 2017

Abstract

We describe our work on incorporating probabilities of phone durations, learned by a neural net, into an ASR system. Phone durations are incorporated via lattice rescoring. The input features are derived from the phone identities of a context window of phones, plus the durations of preceding phones within that window. Unlike some previous work, our network outputs the probability of different durations (in frames) directly, up to a fixed limit. We evaluate this method on several large vocabulary tasks, and while we consistently see improvements in Word Error Rates, the improvements are smaller when the lattices are generated with neural net based acoustic models. Copyright © 2017 ISCA
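The idea of scoring durations directly and using them in rescoring can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the one-logit-per-duration layout, the choice to let all durations beyond the limit share the last class, and the `weight` parameter are all assumptions made for illustration.

```python
import math

def duration_log_probs(logits):
    """Log-softmax over duration classes 1..max_dur (one logit per class)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def duration_score(logits, duration_frames):
    """Log-probability of an observed duration in frames.

    Assumption: durations above the fixed limit are clipped to the last class.
    """
    max_dur = len(logits)
    clipped = min(max(duration_frames, 1), max_dur)
    return duration_log_probs(logits)[clipped - 1]

def rescore_path(base_score, phone_logits, durations, weight=1.0):
    """Add weighted duration log-probs to an existing path score (hypothetical helper)."""
    return base_score + weight * sum(
        duration_score(logits, d) for logits, d in zip(phone_logits, durations)
    )
```

Here `base_score` stands in for whatever acoustic and language model score a lattice path already carries; a real system would tune `weight` on held-out data.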

Flat-start single-stage discriminatively trained HMM-based models for ASR

, Article IEEE/ACM Transactions on Audio Speech and Language Processing ; Volume 26, Issue 11 , 2018 , Pages 1949-1961 ; 23299290 (ISSN) Hadian, H ; Sameti, H ; Povey, D ; Khudanpur, S ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2018

Abstract

In recent years, end-to-end approaches to automatic speech recognition have received considerable attention as they are much faster in terms of preparing resources. However, conventional multistage approaches, which rely on a pipeline of training hidden Markov model (HMM)-GMM models and tree-building steps, still give the state-of-the-art results on most databases. In this study, we investigate flat-start one-stage training of neural networks using the lattice-free maximum mutual information (LF-MMI) objective function with HMMs for large vocabulary continuous speech recognition. We thoroughly look into different issues that arise in such a setup and propose a standalone system, which achieves...

End-to-end speech recognition using lattice-free MMI

, Article 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018, 2 September 2018 through 6 September 2018 ; Volume 2018-September , 2018 , Pages 12-16 ; 2308457X (ISSN) Hadian, H ; Sameti, H ; Povey, D ; Khudanpur, S ; Sharif University of Technology

International Speech Communication Association 2018

Abstract

We present our work on end-to-end training of acoustic models using the lattice-free maximum mutual information (LF-MMI) objective function in the context of hidden Markov models. By end-to-end training, we mean flat-start training of a single DNN in one stage without using any previously trained models, forced alignments, or building state-tying decision trees. We use full biphones to enable context-dependent modeling without trees, and show that our end-to-end LF-MMI approach can achieve comparable results to regular LF-MMI on well-known large vocabulary tasks. We also compare with other end-to-end methods such as CTC in character-based and lexicon-free settings and show 5 to 25 percent...

Evaluation of a novel fuzzy sequential pattern recognition tool (fuzzy elastic matching machine) and its applications in speech and handwriting recognition

, Article Applied Soft Computing Journal ; Volume 62 , January , 2018 , Pages 315-327 ; 15684946 (ISSN) Shahmoradi, S ; Bagheri Shouraki, S ; Sharif University of Technology

Elsevier Ltd 2018

Abstract

Sequential pattern recognition has long been an important topic of soft computing research with a wide area of applications including speech and handwriting recognition. In this paper, the performance of a novel fuzzy sequential pattern recognition tool named “Fuzzy Elastic Matching Machine” has been investigated. This tool overcomes the shortcomings of the HMM including its inflexible mathematical structure and inconsistent mathematical assumptions with imprecise input data. To do so, “Fuzzy Elastic Pattern” was introduced as the basic element of FEMM. It models the elasticity property of input data using fuzzy vectors. A sequential pattern such as a word in speech or a piece of writing is...

Learning of tree-structured Gaussian graphical models on distributed data under communication constraints

, Article IEEE Transactions on Signal Processing ; Volume 67, Issue 1 , 2019 , Pages 17-28 ; 1053587X (ISSN) Tavassolipour, M ; Motahari, S. A ; Manzuri Shalmani, M. T ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2019

Abstract

In this paper, learning of tree-structured Gaussian graphical models from distributed data is addressed. In our model, samples are stored in a set of distributed machines where each machine has access to only a subset of features. A central machine is then responsible for learning the structure based on received messages from the other nodes. We present a set of communication-efficient strategies, which are theoretically proved to convey sufficient information for reliable learning of the structure. In particular, our analyses show that even if each machine sends only the signs of its local data samples to the central node, the tree structure can still be recovered with high accuracy. Our...

Acoustic modeling from frequency-domain representations of speech

, Article Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2 September 2018 through 6 September 2018 ; Volume 2018-September , 2018 , Pages 1596-1600 ; 2308457X (ISSN) Ghahremani, P ; Hadian, H ; Lv, H ; Povey, D ; Khudanpur, S ; Sharif University of Technology

International Speech Communication Association 2018

Abstract

In recent years, different studies have proposed new methods for DNN-based feature extraction and joint acoustic model training and feature learning from raw waveform for large vocabulary speech recognition. However, conventional pre-processed methods such as MFCC and PLP are still preferred in the state-of-the-art speech recognition systems as they are perceived to be more robust. Besides, the raw waveform methods - most of which are based on the time-domain signal - do not significantly outperform the conventional methods. In this paper, we propose a frequency-domain feature-learning layer which can allow acoustic model training directly from the waveform. The main distinctions from...

Frame-based face emotion recognition using linear discriminant analysis

, Article 3rd Iranian Conference on Signal Processing and Intelligent Systems, ICSPIS 2017, 20 December 2017 through 21 December 2017 ; Volume 2017-December , December , 2018 , Pages 141-146 ; 9781538649725 (ISBN) Otroshi Shahreza, H ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2018

Abstract

In this paper, a frame-based method with a reference frame was proposed to recognize six basic facial emotions (anger, disgust, fear, happy, sadness and surprise) as well as the neutral face. By using face landmarks, a fast algorithm was used to calculate an appropriate descriptor for each frame. Furthermore, Linear Discriminant Analysis (LDA) was used to reduce the dimension of the defined descriptors and to classify them. The LDA problem was solved using the least squares solution and the Ledoit-Wolf lemma. The proposed method was also compared with several studies on the CK+ dataset and had the best accuracy among them. To generalize the proposed method beyond the CK+ dataset, a landmark detector was needed....

Spoken CAPTCHA: a CAPTCHA system for blind users

, Article 2009 Second ISECS International Colloquium on Computing, Communication, Control, and Management, CCCM 2009, Sanya, 8 August 2009 through 9 August 2009 ; Volume 1 , 2009 , Pages 221-224 ; 9781424442461 (ISBN) Shirali Shahreza, S ; Abolhassani, H ; Sameti, H ; Shirali Shahreza, M. H ; Yangzhou University; Guangdong University of Business Studies; Wuhan Institute of Technology; IEEE SMC TC on Education Technology and Training; IEEE Technology Management Council ; Sharif University of Technology

2009

Abstract

Today, the Internet is used to offer different services to users. Most of these services are designed for human users, but unfortunately some computer programs are designed to abuse these services. CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) systems are designed to automatically distinguish between human users and computer programs and to block such programs. Most current CAPTCHA methods use visual patterns, and hence blind users cannot use them. In this paper, we propose a new CAPTCHA method which is designed for blind people. In this method, a small sound clip is played for the user and he/she is asked to say a word. Then the user...

Estimation of current-induced scour depth around pile groups using neural network and adaptive neuro-fuzzy inference system

, Article Applied Soft Computing Journal ; Volume 9, Issue 2 , 2009 , Pages 746-755 ; 15684946 (ISSN) Zounemat Kermani, M ; Beheshti, A. A ; Ataie Ashtiani, B ; Sabbagh Yazdi, S. R ; Sharif University of Technology

2009

Abstract

The process of local scour around bridge piers is fundamentally complex due to the three-dimensional flow patterns interacting with bed materials. For geotechnical and economical reasons, multiple pile bridge piers have become more and more popular in bridge design. Although many studies have been carried out to develop relationships for the maximum scour depth at pile groups under clear-water scour condition, existing methods do not always produce reasonable results for scour predictions. It is partly due to the complexity of the phenomenon involved and partly because of limitations of the traditional analytical tool of statistical regression. This paper addresses the latter part and...

Design and implementation of vector quantizer for a 600 bps vocoder based on MELP

, Article 11th International Conference on Advanced Communication Technology, ICACT 2009, Phoenix Park, 15 February 2009 through 18 February 2009 ; Volume 2 , 2009 , Pages 1487-1490 ; 17389445 (ISSN); 9788955191387 (ISBN) Khalili, F ; Ardebilipour, M ; Sameti, H ; IEEE Communications Society, IEEE ComSoc; IEEE Region 10 and IEEE Daejeon Section; Korean Institute of Communication Sciences, KICS; IEEK Communications Society, IEEK ComSoc; Korean Institute of Information Scientists and Engineers, KIISE; et al ; Sharif University of Technology

2009

Abstract

This paper describes vector quantization of the parameters of a 600 bps speech coder based on the Mixed Excitation Linear Prediction (MELP) model, which was accepted as a standard for communication on narrow-band HF channels. MELP speech coders are robust in difficult background noise environments and are intended mostly for military communications. To reduce the bit rate, a joint multi-frame vector quantization is developed that takes advantage of the inherent inter-frame redundancy of the MELP parameters. By grouping the parameters of 4 frames into a multi-frame and using vector quantization, the bit rate is reduced by a factor of 4 and the output speech is still intelligible.
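The multi-frame grouping described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the codebook is taken as given (the abstract does not say how it is trained), squared error is assumed as the distortion measure, and `stack_frames`, `nearest_codeword` and `encode` are hypothetical names.

```python
def stack_frames(frames):
    """Concatenate per-frame parameter vectors into one multi-frame vector."""
    return [value for frame in frames for value in frame]

def nearest_codeword(vector, codebook):
    """Return the index of the codeword with the smallest squared error."""
    def sq_err(codeword):
        return sum((a - b) ** 2 for a, b in zip(vector, codeword))
    return min(range(len(codebook)), key=lambda i: sq_err(codebook[i]))

def encode(frames, codebook, frames_per_block=4):
    """Quantize each group of `frames_per_block` frames to one codebook index.

    A trailing group with fewer than `frames_per_block` frames is dropped.
    """
    indices = []
    for start in range(0, len(frames) - frames_per_block + 1, frames_per_block):
        block = stack_frames(frames[start:start + frames_per_block])
        indices.append(nearest_codeword(block, codebook))
    return indices
```

Because one index is transmitted per group of four frames rather than one per frame, redundancy between neighbouring frames is absorbed by the codebook instead of being sent repeatedly.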

Likelihood-maximizing-based multiband spectral subtraction for robust speech recognition

, Article Eurasip Journal on Advances in Signal Processing ; Volume 2009 , 2009 ; 16876172 (ISSN) Babaali, B ; Sameti, H ; Safayani, M ; Sharif University of Technology

2009

Abstract

Automatic speech recognition performance degrades significantly when speech is affected by environmental noise. Nowadays, the major challenge is to achieve good robustness in adverse noisy conditions so that automatic speech recognizers can be used in real situations. Spectral subtraction (SS) is a well-known and effective approach; it was originally designed for improving the quality of speech signal judged by human listeners. SS techniques usually improve the quality and intelligibility of speech signal while speech recognition systems need compensation techniques to reduce mismatch between noisy speech features and clean trained acoustic model. Nevertheless, correlation can be expected...

Speaker recognition with random digit strings using uncertainty normalized HMM-Based i-Vectors

, Article IEEE/ACM Transactions on Audio Speech and Language Processing ; Volume 27, Issue 11 , 2019 , Pages 1815-1825 ; 23299290 (ISSN) Maghsoodi, N ; Sameti, H ; Zeinali, H ; Stafylakis, T ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2019

Abstract

In this paper, we combine Hidden Markov Models (HMMs) with i-vector extractors to address the problem of text-dependent speaker recognition with random digit strings. We employ digit-specific HMMs to segment the utterances into digits, to perform frame alignment to HMM states and to extract Baum-Welch statistics. By making use of the natural partition of input features into digits, we train digit-specific i-vector extractors on top of each HMM and we extract well-localized i-vectors, each modelling merely the phonetic content corresponding to a single digit. We then examine ways to perform channel and uncertainty compensation, and we propose a novel method for using the uncertainty in the...

Using ASR methods for OCR

, Article 15th IAPR International Conference on Document Analysis and Recognition, ICDAR 2019, 20 September 2019 through 25 September 2019 ; 2019 , Pages 663-668 ; 15205363 (ISSN); 9781728128610 (ISBN) Arora, A ; Garcia, P ; Watanabe, S ; Manohar, V ; Shao, Y ; Khudanpur, S ; Chang, C. C ; Rekabdar, B ; Babaali, B ; Povey, D ; Etter, D ; Raj, D ; Hadian, H ; Trmal, J ; Sharif University of Technology

IEEE Computer Society 2019

Abstract

Hybrid deep neural network hidden Markov models (DNN-HMM) have achieved impressive results on large vocabulary continuous speech recognition (LVCSR) tasks. However, recent approaches using DNN-HMM models have not been explored much for text recognition. Inspired by current work in automatic speech recognition (ASR) and machine translation, we present an open vocabulary sub-word text recognition system. The sub-word lexicon and sub-word language model (LM) help in overcoming the challenge of recognizing out of vocabulary (OOV) words, and a time delay neural network (TDNN) and convolutional neural network (CNN) based DNN-HMM optical model (OM) efficiently models the sequence dependency in the...

An efficient real-time voice activity detection algorithm using Teager energy to energy ratio

, Article 27th Iranian Conference on Electrical Engineering, ICEE 2019, 30 April 2019 through 2 May 2019 ; 2019 , Pages 1420-1424 ; 9781728115085 (ISBN) Hadi, M ; Pakravan, M. R ; Razavi, M. M ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2019

Abstract

We define a new feature called Teager Energy to Energy and mathematically show that it provides distinguished values for pure tone and white noise signals. We then employ the Teager Energy to Energy feature to propose an efficient procedure for voice activity detection and use simulation results to evaluate its performance in different noisy environments. Furthermore, we experimentally demonstrate the performance of the proposed voice activity detection technique in a real-time voice processing embedded system. Experimental and simulation results show that the introduced procedure provides more reliable results with a reasonable amount of computational complexity in comparison with its...
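A minimal sketch of how such a feature might behave on the two signal types named above. The abstract does not give the exact definition, so this assumes the feature is the mean of the standard discrete Teager energy operator divided by the mean signal energy over one frame; the function names are illustrative.

```python
import math
import random

def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def teager_to_energy_ratio(x):
    """Mean Teager energy divided by mean signal energy over one frame."""
    psi = teager_energy(x)
    energy = sum(v * v for v in x) / len(x)
    return (sum(psi) / len(psi)) / energy

# Pure tone with 8 full periods in the frame: the operator yields
# sin(omega)^2 at every sample, so the ratio is 2 * sin(omega)^2.
omega = 2 * math.pi * 8 / 256
tone = [math.cos(omega * n) for n in range(256)]
tone_ratio = teager_to_energy_ratio(tone)

# White Gaussian noise: neighbouring samples are uncorrelated, so the
# cross term averages out and the ratio comes out close to 1.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(20000)]
noise_ratio = teager_to_energy_ratio(noise)
```

Under this assumed definition, a low-frequency tone gives a ratio well below 1 while white noise gives a ratio near 1, which is one way the two signal types could be told apart.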

Statistical association mapping of population-structured genetic data

, Article IEEE/ACM Transactions on Computational Biology and Bioinformatics ; Volume 16, Issue 2 , 2019 , Pages 636-649 ; 15455963 (ISSN) Najafi, A ; Janghorbani, S ; Motahari, A ; Fatemizadeh, E ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2019

Abstract

Association mapping of genetic diseases has attracted extensive research interest during the recent years. However, most of the methodologies introduced so far suffer from spurious inference of the associated sites due to population inhomogeneities. In this paper, we introduce a statistical framework to compensate for this shortcoming by equipping the current methodologies with a state-of-the-art clustering algorithm being widely used in population genetics applications. The proposed framework jointly infers the disease-associated factors and the hidden population structures. In this regard, a Markov Chain-Monte Carlo (MCMC) procedure has been employed to assess the posterior probability...

Improving LF-MMI using unconstrained supervisions for ASR

, Article 2018 IEEE Spoken Language Technology Workshop, SLT 2018, 18 December 2018 through 21 December 2018 ; 2019 , Pages 43-47 ; 9781538643341 (ISBN) Hadian, H ; Povey, D ; Sameti, H ; Trmal, J ; Khudanpur, S ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2019

Abstract

We present our work on improving the numerator graph for discriminative training using the lattice-free maximum mutual information (MMI) criterion. Specifically, we propose a scheme for creating unconstrained numerator graphs by removing time constraints from the baseline numerator graphs. This leads to much smaller graphs and therefore faster preparation of training supervisions. By testing the proposed unconstrained supervisions using factorized time-delay neural network (TDNN) models, we observe 0.5% to 2.6% relative improvement over the state-of-the-art word error rates on various large-vocabulary speech recognition databases. © 2018 IEEE

Replay spoofing countermeasure using autoencoder and siamese networks on ASVspoof 2019 challenge

, Article Computer Speech and Language ; Volume 64 , 2020 Adiban, M ; Sameti, H ; Shehnepoor, S ; Sharif University of Technology

Academic Press 2020

Abstract

Automatic Speaker Verification (ASV) is authentication of individuals by analyzing their speech signals. Different synthetic approaches allow spoofing to deceive ASV systems (ASVs), whether using techniques to imitate a voice or reconstruct the features. Attackers defeat ASVs using four general techniques: impersonation, speech synthesis, voice conversion, and replay. The last technique is considered a common and high-potential tool for spoofing purposes since replay attacks are more accessible and require no technical knowledge from adversaries. In this study, we introduce a novel replay spoofing countermeasure for ASVs. Accordingly, we use the Constant Q Cepstral Coefficient (CQCC)...

A new word clustering method for building n-gram language models in continuous speech recognition systems

, Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 18 June 2008 through 20 June 2008, Wroclaw ; Volume 5027 LNAI , 2008 , Pages 286-293 ; 03029743 (ISSN) ; 354069045X (ISBN); 9783540690450 (ISBN) Bahrani, M ; Sameti, H ; Hafezi, N ; Momtazi, S ; Sharif University of Technology

2008

Abstract

In this paper, a new method for automatic word clustering is presented. We used this method for building n-gram language models for Persian continuous speech recognition (CSR) systems. In this method, each word is specified by a feature vector that represents the statistics of the parts of speech (POS) of that word. The feature vectors are clustered by the k-means algorithm. This method reduces time complexity, which is a drawback of other automatic clustering methods. Also, the problem of high perplexity in manual clustering methods is abated. The experimental results are based on the "Persian Text Corpus", which contains about 9 million words. The extracted language models are evaluated...