Search for: speech-recognition
Total 131 records

    Far-field continuous speech recognition system based on speaker localization and sub-band beamforming

    , Article 6th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2008, Doha, 31 March 2008 through 4 April 2008 ; 2008 , Pages 495-500 ; 9781424419685 (ISBN) Asaei, A ; Taghizadeh, M. J ; Sameti, H ; Sharif University of Technology
    2008
    Abstract
    This paper proposes a Distant Speech Recognition system based on a novel Speaker Localization and Beamforming (SRLB) algorithm. To localize the speaker, an algorithm based on Steered Response Power that utilizes the harmonic structure of the speech signal is proposed. This new scheme can also verify speakers by their fundamental-frequency variation; therefore it can be utilized in the design of a speech recognition system for verified speakers. The performance of the Farsi speech recognition engine is then evaluated under adverse conditions of noise and reverberation. Simulation results and tests on real data show that by utilizing the proposed localization scheme, recognition accuracy... 

    Introducing a framework to create telephony speech databases from direct ones

    , Article 14th International Conference on Systems Signals and Image Processing, IWSSIP 2007 and 6th EURASIP Conference Focused on Speech and Image Processing, Multimedia Communications and Services, EC-SIPMCS 2007, Maribor, 27 June 2007 through 30 June 2007 ; November , 2007 , Pages 327-330 ; 9789612480295 (ISBN) Momtazi, S ; Sameti, H ; Vaisipour, S ; Tefagh, M ; Sharif University of Technology
    2007
    Abstract
    A comprehensive speech database is one of the most important tools for developing speech recognition systems; such tools are necessary for telephony recognition, too. Although adequate databases for direct speech recognizers exist, there is no appropriate database for telephony speech recognizers. Most methods suggested for solving this problem are based on building new databases, which tends to consume much time and many resources; alternatively, a filter simulating circuit-switch behavior is used to transform direct databases into telephony ones, but in this case the resulting databases differ considerably from real telephony databases. In this paper we introduce a framework for creating telephony speech... 

    Using Audio Speech Recognition Techniques in Augmented Reality Environment

    , M.Sc. Thesis Sharif University of Technology Mirzaei, Mohammad Reza (Author) ; Ghorshi, Alireza (Supervisor) ; Mortazavi, Mohammad (Supervisor)
    Abstract
    Recently, many studies have shown that Augmented Reality (AR) and Automatic Speech Recognition (ASR) can help people with disabilities. In this thesis we examine the possibility of combining AR and ASR technologies to implement a new system for helping deaf people. This system can instantly capture a narrator's speech, convert it into readable text, and show it directly on an AR display. Also, with this system, people do not need to learn sign language to communicate with deaf people. To improve the accuracy of the system, we use Audio-Visual Speech Recognition (AVSR) as a backup for the ASR engine in noisy environments. AVSR is one of the advances in ASR technology that combines audio, video and facial... 

    Language Modeling Using Recurrent Neural Networks

    , M.Sc. Thesis Sharif University of Technology Rahimi, Adel (Author) ; Sameti, Hossein (Supervisor)
    Abstract
    This thesis examines the differences and similarities between the two famous RNN blocks, the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). It measures different aspects such as computational complexity, Word Error Rate (WER), and subjective human evaluation in the task of text generation. In the computational-complexity experiment, results show that the LSTM takes more time to compute than the GRU. In the next experiment, the GRU slightly outperforms the LSTM in terms of WER, but the perplexity of the language models tested was the same. This shows that slight differences in perplexity do not drastically change the WER. That said, the results suggest that the... 
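    The computational gap reported in this abstract follows directly from the blocks' gate counts: an LSTM has four input/recurrent transforms where a GRU has three. A minimal sketch of the parameter-count comparison (layer sizes here are hypothetical, not taken from the thesis):

    ```python
    # Compare LSTM vs. GRU parameter counts for the same layer shape.
    # LSTM: 4 transforms (input/forget/output gates + cell candidate);
    # GRU: 3 transforms (update/reset gates + hidden candidate), so the
    # GRU carries roughly 3/4 of the LSTM's weights per layer.

    def lstm_params(input_dim, hidden_dim):
        # Each transform maps [x_t; h_{t-1}] -> hidden_dim, plus a bias.
        return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

    def gru_params(input_dim, hidden_dim):
        return 3 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

    if __name__ == "__main__":
        d, h = 128, 256  # illustrative sizes only
        print(lstm_params(d, h))  # 394240
        print(gru_params(d, h))   # 295680
    ```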

    Using Discriminative Training Approaches for Large Vocabulary Isolated Word Recognition

    , M.Sc. Thesis Sharif University of Technology Osati, Majid (Author) ; Sameti, Hossein (Supervisor)
    Abstract
    In this study, the isolated word recognition problem is studied at large scale and different acoustic models are employed to solve it. Acoustic models based on our proposed discriminative training approach are compared with those trained by other available methods. Acoustic models are built and trained based on HMM-GMM, HMM-subspace GMM, and HMM-DNN using different training criteria such as Maximum Mutual Information (MMI), boosted MMI, Minimum Phoneme Error (MPE), and state-level Minimum Bayesian Risk (sMBR). Using these discriminative approaches improved the speech recognition systems. Boosted MMI with a boosting factor of 0.3 for HMM-DNN has resulted in a Word Error Rate... 

    Using Structural Language Modeling in Continuous Speech Recognition Systems

    , M.Sc. Thesis Sharif University of Technology SheikhShab, Golnar (Author) ; Sameti, Hossein (Supervisor)
    Abstract
    The language model is one of the most important parts of an automatic speech recognition system; it incorporates knowledge of natural language into the system to improve its accuracy. The language model conventionally used in recognition systems is the n-gram model, which is usually estimated from a large corpus by the relative-frequency method. The n-gram model approximates the probability of a word sequence by multiplying its n-gram probabilities and thus does not take long-distance dependencies into account, so syntactic language models could be of interest. In this research, after examining different syntactic language models, a method for re-estimating the n-gram model, introduced by Stolcke in 1994, was... 
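    The relative-frequency n-gram estimation described in this abstract can be sketched in a few lines; a toy bigram model (the corpus and token conventions below are illustrative only, not the thesis's setup):

    ```python
    from collections import Counter

    def train_bigram(sentences):
        # Relative-frequency (maximum-likelihood) estimate:
        # P(w2 | w1) = count(w1 w2) / count(w1).
        unigrams, bigrams = Counter(), Counter()
        for s in sentences:
            toks = ["<s>"] + s.split() + ["</s>"]
            unigrams.update(toks[:-1])
            bigrams.update(zip(toks, toks[1:]))
        return lambda w1, w2: (bigrams[(w1, w2)] / unigrams[w1]
                               if unigrams[w1] else 0.0)

    def sentence_prob(p, sentence):
        # The n-gram model approximates P(w_1..w_n) as a product of
        # local terms -- exactly why long-distance dependencies are lost.
        toks = ["<s>"] + sentence.split() + ["</s>"]
        prob = 1.0
        for w1, w2 in zip(toks, toks[1:]):
            prob *= p(w1, w2)
        return prob
    ```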

    Deep Learning for Speech Recognition

    , M.Sc. Thesis Sharif University of Technology Azadi Yazdi, Saman (Author) ; Sameti, Hossein (Supervisor)
    Abstract
    Speech recognition is one of the earliest goals of speech processing. Our goal in this thesis is to use deep learning for speech recognition. In recent years, little improvement in speech recognition accuracy has been reported. Deep learning is a learning approach that has brought improvements to many machine learning tasks. Following the improvements deep learning has produced in English speech recognition, in this thesis we try to improve accuracy over common and new recognition methods for the Persian language.
    First the overall structure of a typical speech recognition system is introduced. For this purpose, the modules of a speech recognition system are introduced. Deep multilayer... 

    SFAVD: Sharif farsi audio visual database

    , Article IKT 2013 - 2013 5th Conference on Information and Knowledge Technology, Shiraz, Iran ; 2013 , Pages 417-421 ; 9781467364904 (ISBN) Naraghi, Z ; Jamzad, M ; Sharif University of Technology
    2013
    Abstract
    With the increasing use of computers in everyday life, improved communication between machines and humans is needed. To communicate properly with a human face rendered in a graphical environment, audio-visual projects such as lip reading, audio-visual speech recognition, and lip synthesis are needed. The lack of a complete audio-visual database for these applications in the Farsi language led us to provide a new, complete Farsi database for this purpose, called SFAVD. It is a unique audio-visual database which, in addition to considering Farsi conceptual and speech structure, considers the influence of speech on lip movements. This database is created for... 

    Fundamental frequency estimation using modified higher order moments and multiple windows

    , Article Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH ; 2011 , Pages 1965-1968 ; 19909772 (ISSN) Pawi, A ; Vaseghi, S ; Milner, B ; Ghorshi, S ; Sharif University of Technology
    2011
    Abstract
    This paper proposes a set of higher-order modified moments for estimating the fundamental frequency of speech and explores the impact of the speech window length on pitch estimation error. The pitch extraction methods are evaluated over a range of noise types and SNRs. For calculating errors, pitch reference values are obtained from manually corrected estimates of the periods in laryngograph signals. The results obtained for the 3rd- and 4th-order modified moments compare well with methods based on correlation and magnitude-difference criteria and with the YIN method, with improved pitch accuracy and fewer large errors.
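    The correlation baseline against which the modified moments are compared can be sketched as a plain autocorrelation pitch estimator; window length, sampling rate, and search range below are illustrative, not the paper's settings:

    ```python
    import math

    def autocorr_pitch(x, fs, fmin=60.0, fmax=400.0):
        # Classic autocorrelation pitch estimate: pick the lag in the
        # plausible pitch range that maximizes r(k) = sum x[n] * x[n+k].
        # Higher-order moment variants replace the product x[n]*x[n+k]
        # with higher powers to sharpen the peak.
        lo, hi = int(fs / fmax), int(fs / fmin)
        best_lag, best_r = lo, float("-inf")
        for k in range(lo, min(hi, len(x) - 1) + 1):
            r = sum(x[n] * x[n + k] for n in range(len(x) - k))
            if r > best_r:
                best_r, best_lag = r, k
        return fs / best_lag

    if __name__ == "__main__":
        fs, f0 = 8000, 200.0
        x = [math.sin(2 * math.pi * f0 * n / fs) for n in range(400)]
        print(autocorr_pitch(x, fs))  # 200.0 for this clean tone
    ```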

    Utilizing intelligent segmentation in isolated word recognition using a hybrid HTD-HMM

    , Article International Conference on Circuits, Systems, Signal and Telecommunications - Proceedings, 21 October 2010 through 23 October 2010 ; October , 2011 , Pages 42-49 ; 9789604742714 (ISBN) Kazemi, R ; Sereshkeh, A. R ; Ehsandoust, B ; Sharif University of Technology
    2011
    Abstract
    Isolated Word Recognition (IWR) is becoming increasingly attractive due to improvements in speech recognition techniques. However, the accuracy of IWR suffers when large databases or words with similar pronunciations are used. A criterion for accurate speech recognition is suitable segmentation; however, the traditional method, equal segmentation, does not produce the most accurate results, and manual, event-based segmentation is not feasible for large databases. In this paper, we introduce an intelligent segmentation based on Hierarchical Temporal Decomposition (HTD). Based on this method, a temporal decomposition (TD) algorithm can be used to... 

    Pitch extraction using dyadic wavelet transform and modified higher order moment

    , Article International Conference on Communication Technology Proceedings, ICCT, 11 November 2010 through 14 November 2010, Nanjing ; 2010 , Pages 833-836 ; 9781424468690 (ISBN) Choupan, J ; Ghorshi, S ; Mortazavi, M ; Sepehrband, F ; Sharif University of Technology
    2010
    Abstract
    Pitch detection is the process of determining the period of vocal-cord closure or, in other words, the duration of one glottal closed, open, and return phase. The dyadic wavelet transform (DyWT) and the modified higher-order moment, which is based on the autocorrelation function, are two pitch detection methods. DyWT is an efficient pitch detection method, but it is less accurate than the modified higher-order moment; the modified higher-order moment, on the other hand, has high computational complexity and is time-consuming. In this paper, we propose a pitch detection method based on DyWT which uses the modified higher-order moment. The modified higher-order moment is applied only in some... 

    Divided POMDP method for complex menu problems in spoken dialogue systems

    , Article 2010 IEEE Workshop on Spoken Language Technology, SLT 2010 - Proceedings, 12 December 2010 through 15 December 2010 ; 2010 , Pages 484-489 ; 9781424479030 (ISBN) Habibi, M ; Rahbar, S ; Sameti, H ; The Institute of Electrical and Electronics Engineers (IEEE); IEEE Signal Processing Society ; Sharif University of Technology
    2010
    Abstract
    In this paper, a problem in spoken dialogue systems, namely the menu problem, is introduced and solved with a POMDP model. To overcome the large size of the menu problem, a new method for achieving an optimal policy, called the divided POMDP method, is introduced. Conditions under which a problem can be solved by the proposed method are specified, and the problem properties leading to these conditions are presented. The proposed method is evaluated on a typical menu problem with different menu sizes, and it is shown to be superior to conventional methods such as FRTDP for the problems it is capable of solving. Moreover, it converges faster to an optimal policy.

    Union of low-rank subspaces detector

    , Article IET Signal Processing ; Volume 10, Issue 1 , 2016 , Pages 55-62 ; 17519675 (ISSN) Joneidi, M ; Ahmadi, P ; Sadeghi, M ; Rahnavard, N ; Sharif University of Technology
    Institution of Engineering and Technology 
    Abstract
    The problem of signal detection using a flexible and general model is considered. Owing to its applicability and flexibility, sparse signal representation and approximation has attracted a lot of attention in many signal processing areas. In this study, the authors propose a new detection method based on sparse decomposition in a union-of-subspaces model. Their proposed detector uses a dictionary that can be interpreted as a bank of matched subspaces, which improves detection performance, as it generalises matched-subspace detectors. The low-rank assumption for the desired signals implies that the representations of these signals in terms of some proper bases would be sparse. Their... 

    Persian large vocabulary name recognition system (FarsName)

    , Article 2017 25th Iranian Conference on Electrical Engineering, ICEE 2017, 2 May 2017 through 4 May 2017 ; 2017 , Pages 1580-1583 ; 9781509059638 (ISBN) Hajitabar, A ; Sameti, H ; Hadian, H ; Safari, A ; Sharif University of Technology
    Abstract
    There has been no isolated-word recognition database for the Persian language so far. In this paper we introduce the FarsName dataset, which contains 20 thousand isolated-word Persian utterances spoken by 226 speakers from all regions of the country, each saying an average of 88 Persian names. There are a total of 5235 unique names in this dataset. Various cell phone brands were used to record the dataset, which indicates the high diversity of its utterances. We have been able to achieve a 10.34% WER on this set using Kaldi. This is a very good performance considering that the recording environments were normal and potentially noisy. © 2017 IEEE

    Speech activity detection using deep neural networks

    , Article 2017 25th Iranian Conference on Electrical Engineering, ICEE 2017, 2 May 2017 through 4 May 2017 ; 2017 , Pages 1564-1568 ; 9781509059638 (ISBN) Shahsavari, S ; Sameti, H ; Hadian, H ; Sharif University of Technology
    Abstract
    In this paper, we introduce a new dataset for Speech Activity Detection (SAD) and evaluate certain common methods such as GMM, DNN, and RNN on it. We collected our dataset with a semi-supervised approach, using subtitled movies, with a labeling accuracy of 95%. This semi-automatic method can help us collect huge amounts of labeled audio data with very high diversity in language, speaker, and channel. We model SAD as a classification task with two classes, speech and non-speech. When using GMMs for this problem, we use two separate mixtures to model speech and non-speech. In the case of neural networks, we use a softmax layer at the end of the network, with two neurons which represent speech and... 
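    The two-neuron softmax output described in this abstract reduces, per frame, to comparing the two class probabilities; a minimal sketch (the logit values are hypothetical network outputs, not from the paper's models):

    ```python
    import math

    def softmax(logits):
        # Numerically stable softmax over the final-layer logits.
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        s = sum(exps)
        return [e / s for e in exps]

    def classify_frame(speech_logit, nonspeech_logit, threshold=0.5):
        # Frame-level speech/non-speech decision from the two output
        # neurons: True means the frame is labeled as speech.
        p_speech, _ = softmax([speech_logit, nonspeech_logit])
        return p_speech >= threshold
    ```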

    Frame-based face emotion recognition using linear discriminant analysis

    , Article 3rd Iranian Conference on Signal Processing and Intelligent Systems, ICSPIS 2017, 20 December 2017 through 21 December 2017 ; Volume 2017-December , December , 2018 , Pages 141-146 ; 9781538649725 (ISBN) Otroshi Shahreza, H ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2018
    Abstract
    In this paper, a frame-based method with a reference frame is proposed to recognize the six basic facial emotions (anger, disgust, fear, happiness, sadness, and surprise) as well as the neutral face. Using face landmarks, a fast algorithm calculates an appropriate descriptor for each frame. Furthermore, Linear Discriminant Analysis (LDA) is used to reduce the dimension of the descriptors and to classify them. The LDA problem is solved using the least-squares solution and the Ledoit-Wolf lemma. The proposed method is compared with several studies on the CK+ dataset and achieves the best accuracy among them. To generalize the proposed method beyond the CK+ dataset, a landmark detector was needed.... 

    Nevisa, a Persian continuous speech recognition system

    , Article 13th International Computer Society of Iran Computer Conference on Advances in Computer Science and Engineering, CSICC 2008, Kish Island, 9 March 2008 through 11 March 2008 ; Volume 6 CCIS , 2008 , Pages 485-492 ; 18650929 (ISSN); 3540899847 (ISBN); 9783540899846 (ISBN) Sameti, H ; Veisi, H ; Bahrani, M ; Babaali, B ; Hosseinzadeh, K ; Sharif University of Technology
    2008
    Abstract
    In this paper we review the Nevisa Persian speech recognition engine. Nevisa is an HMM-based, large-vocabulary, speaker-independent continuous speech recognition system. Like most successful recognition systems, it uses MFCCs, with some modifications, as speech signal features. It also utilizes a VAD based on signal energy and zero-crossing rate. The maximum-likelihood estimation criterion, at the core of which are the classical segmental k-means and Baum-Welch algorithms, is used for training the acoustic models. The system is based on phoneme modeling and uses synchronous beam search over a lexicon tree for decoding the acoustic utterances. Language modeling for Persian has been... 
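    The time-synchronous beam search mentioned in this abstract can be sketched as a Viterbi pass with pruning; the toy HMM, log-probabilities, and beam width below are illustrative only, not Nevisa's decoder:

    ```python
    import math

    def viterbi_beam(obs_logprobs, trans_logprobs, init_logprobs, beam=3):
        # Time-synchronous Viterbi search: at each frame keep only the
        # `beam` best partial hypotheses, as a lexicon-tree decoder would.
        # obs_logprobs[t][s] = log P(o_t | state s); trans/init likewise.
        n_states = len(init_logprobs)
        hyps = {s: init_logprobs[s] + obs_logprobs[0][s]
                for s in range(n_states)}
        for t in range(1, len(obs_logprobs)):
            nxt = {}
            for s_prev, score in hyps.items():
                for s in range(n_states):
                    cand = (score + trans_logprobs[s_prev][s]
                            + obs_logprobs[t][s])
                    if cand > nxt.get(s, float("-inf")):
                        nxt[s] = cand
            # Prune to the beam's best states before the next frame.
            hyps = dict(sorted(nxt.items(), key=lambda kv: -kv[1])[:beam])
        return max(hyps, key=hyps.get)  # best final state
    ```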

    An efficient multi-band spectral subtraction method for robust speech recognition

    , Article 2007 9th International Symposium on Signal Processing and its Applications, ISSPA 2007, Sharjah, 12 February 2007 through 15 February 2007 ; 2007 ; 1424407796 (ISBN); 9781424407798 (ISBN) Safayani, M ; Sameti, H ; Babaali, B ; Manzuri Shalmani, M. T ; Sharif University of Technology
    2007
    Abstract
    In this paper we present a novel approach for adjusting the coefficients of a multi-band spectral subtraction filter based on the results of a speech recognition system. Currently, most speech enhancement techniques are designed according to waveform-level criteria such as maximizing SNR or minimizing signal error. However, improvement in these criteria does not necessarily increase speech recognition performance; it does so only if these methods generate a sequence of features that increases the likelihood of the correct transcription relative to incorrect competing hypotheses. Here we use an utterance with a known transcription and... 
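    The single-band form of the spectral subtraction filter being tuned can be sketched as follows; the naive DFT, the over-subtraction factor `alpha`, and the spectral floor are illustrative assumptions, not the paper's coefficients:

    ```python
    import cmath

    def dft(x):
        # Naive O(N^2) DFT, adequate for a short illustrative frame.
        N = len(x)
        return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                    for n in range(N)) for k in range(N)]

    def idft(X):
        N = len(X)
        return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N)
                    for k in range(N)).real / N for n in range(N)]

    def spectral_subtract(frame, noise_mag, alpha=1.0, floor=0.02):
        # Per-bin magnitude subtraction with a spectral floor; a
        # multi-band variant uses a different alpha per frequency band,
        # which is the kind of coefficient the paper tunes from
        # recognizer feedback.
        X = dft(frame)
        out = []
        for k, Xk in enumerate(X):
            mag = max(abs(Xk) - alpha * noise_mag[k], floor * abs(Xk))
            out.append(cmath.rect(mag, cmath.phase(Xk)))
        return idft(out)
    ```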

    Coevolution of input sensors and recognition system to design a very low computation isolated-word speech recognition system

    , Article Scientia Iranica ; Volume 14, Issue 6 , 2007 , Pages 625-630 ; 10263098 (ISSN) Halavati, R ; Shouraki, S. B ; Sharif University of Technology
    Sharif University of Technology  2007
    Abstract
    Appropriate sensors are a crucial necessity for the success of recognition systems. Nature has always coevolved sensors and recognition systems, and the same can be done in artificially intelligent systems. To obtain a very fast isolated-word speech recognition system for a small embedded speech recognizer, an evolutionary approach has been used to create the required sensors together with appropriate recognition structures. The input sensors are designed and evolved with inspiration from the human auditory system, and classification is done by artificial neural networks. The resulting system is compared with a widely used speech recognition system, and the results are quite satisfactory. ©... 

    A robust voice activity detection based on wavelet transform

    , Article 2nd International Conference on Electrical Engineering, ICEE, Lahore, 25 March 2008 through 26 March 2008 ; 2008 ; 9781424422937 (ISBN) Aghajani, K ; Manzuri, M. T ; Karami, M ; Tayebi, H ; Sharif University of Technology
    2008
    Abstract
    Voice activity detection (VAD) is an important step in speech processing systems such as speech recognition, speech enhancement, noise estimation, and speech compression. In this paper a new voice activity detection algorithm based on the wavelet transform is proposed. In this algorithm we use the energy in each sub-band, and we extract a feature vector from these values by two methods. Experimental results demonstrate its advantage over other VAD methods. ©2008 IEEE
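    The per-sub-band energies used as VAD features in this abstract can be sketched with a simple Haar wavelet decomposition; the choice of the Haar wavelet and the number of levels are illustrative assumptions, not the paper's configuration:

    ```python
    def haar_subband_energies(x, levels=3):
        # One-dimensional Haar wavelet decomposition: the feature vector
        # is the energy of the detail band at each level plus the final
        # approximation band, mirroring the per-sub-band energies above.
        energies = []
        approx = list(x)
        for _ in range(levels):
            if len(approx) < 2:
                break
            a = [(approx[i] + approx[i + 1]) / 2
                 for i in range(0, len(approx) - 1, 2)]
            d = [(approx[i] - approx[i + 1]) / 2
                 for i in range(0, len(approx) - 1, 2)]
            energies.append(sum(v * v for v in d))  # detail-band energy
            approx = a
        energies.append(sum(v * v for v in approx))  # approximation energy
        return energies
    ```

    A constant (silent, DC-only) frame concentrates all energy in the approximation band, while speech-like transients raise the detail-band energies, which is what makes these values usable as VAD features.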