Search for: speech-recognition
Total 131 records

    SR-NBS: A fast sparse representation based N-best class selector for robust phoneme classification

    , Article Engineering Applications of Artificial Intelligence ; Vol. 28 , 2014 , pp. 155-164 Saeb, A ; Razzazi, F ; Babaie-Zadeh, M ; Sharif University of Technology
    Abstract
    Although exemplar-based approaches have shown good accuracy in classification problems, some limitations are observed in the accuracy of exemplar-based automatic speech recognition (ASR) applications. The main limitation of these algorithms is their high computational complexity, which makes them difficult to extend to ASR applications. In this paper, an N-best class selector is introduced based on sparse representation (SR) and a tree search strategy. In this approach, the classification is fulfilled in three steps. At first, the set of training samples similar to the specific test sample is selected by a k-dimensional (KD) tree search algorithm. Then, an SR-based N-best class selector is... 
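    As a rough illustration of the pruning idea described above, the sketch below builds a KD-tree over training exemplars, keeps only the nearest neighbours of a test vector, and ranks candidate classes by class-wise reconstruction residual. All data, dimensions and the least-squares stand-in for a sparse solver are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of KD-tree candidate pruning followed by a sparse-representation
# (SRC-style) N-best class selector. Hypothetical data/feature names; the paper's
# exact solver and pruning rules are not reproduced here.
import numpy as np
from scipy.spatial import cKDTree

def nbest_classes(test_vec, train_feats, train_labels, k_neighbors=50, n_best=3):
    """Return the N best candidate classes for one test feature vector."""
    tree = cKDTree(train_feats)                      # KD-tree over training exemplars
    _, idx = tree.query(test_vec, k=k_neighbors)     # nearest exemplars to the test sample
    A = train_feats[idx].T                           # dictionary of selected exemplars (d x k)
    y = test_vec
    # Stand-in for a sparse solver (e.g. OMP / l1): least-squares coefficients over the
    # pruned dictionary; with a small k this already concentrates weight on few atoms.
    x, *_ = np.linalg.lstsq(A, y, rcond=None)
    labels = train_labels[idx]
    residuals = {}
    for c in np.unique(labels):
        mask = labels == c
        # class-wise reconstruction residual, as in sparse-representation classification
        residuals[c] = np.linalg.norm(y - A[:, mask] @ x[mask])
    ranked = sorted(residuals, key=residuals.get)
    return ranked[:n_best]

# toy usage with random "phoneme" features
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 39))                   # e.g. 39-dim MFCC+delta exemplars
labels = rng.integers(0, 10, size=500)
print(nbest_classes(feats[0] + 0.01 * rng.normal(size=39), feats, labels))
```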

    Audio-visual speech recognition techniques in augmented reality environments

    , Article Visual Computer ; Vol. 30, issue. 3 , March , 2014 , pp. 245-257 ; ISSN: 01782789 Mirzaei, M. R ; Ghorshi, S ; Mortazavi, M ; Sharif University of Technology
    Abstract
    Many recent studies show that Augmented Reality (AR) and Automatic Speech Recognition (ASR) technologies can be used to help people with disabilities. Many of these studies, however, have been carried out only within their own specialized fields. Audio-Visual Speech Recognition (AVSR) is one of the advances in ASR technology that combines audio, video, and facial expressions to capture a narrator's voice. In this paper, we combine AR and AVSR technologies to build a new system to help deaf and hard-of-hearing people. Our proposed system can capture a narrator's speech instantly, convert it into readable text, and show the text directly on an AR display. Therefore, in this system, deaf people can read the... 

    An evolutionary decoding method for HMM-based continuous speech recognition systems using particle swarm optimization

    , Article Pattern Analysis and Applications ; Vol. 17, issue. 2 , 2014 , pp. 327-339 Najkar, N ; Razzazi, F ; Sameti, H ; Sharif University of Technology
    Abstract
    The main recognition procedure in modern HMM-based continuous speech recognition systems is the Viterbi algorithm. The Viterbi algorithm finds the best acoustic sequence for the input speech in the search space using dynamic programming. In this paper, dynamic programming is replaced by a search method based on particle swarm optimization. The main idea is to generate the initial population of particles as speech segmentation vectors. The particles move toward the best segmentation through an update rule applied over the iterations. In this paper, a new particle representation and recognition process is introduced that is consistent with the nature of continuous... 
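    A minimal sketch of the decoding idea follows, assuming particles encode segmentation boundary vectors and the fitness is a toy placeholder (segment homogeneity) rather than HMM acoustic likelihoods; all parameter values are illustrative only.

```python
# Minimal sketch of PSO search over speech segmentation vectors, standing in for the
# paper's decoder. The fitness (negative within-segment variance of toy features)
# is a placeholder for HMM state/acoustic likelihoods.
import numpy as np

rng = np.random.default_rng(1)
T, n_segments, n_particles, iters = 200, 5, 20, 50
feats = rng.normal(size=(T, 13))                     # toy frame features

def fitness(bounds):
    """Score a segmentation: here, reward homogeneous segments."""
    edges = np.concatenate(([0], np.sort(bounds).astype(int), [T]))
    score = 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        if b - a < 2:
            return -np.inf                           # reject degenerate segments
        score -= feats[a:b].var(axis=0).sum()
    return score

# initialize particles as random interior boundaries, plus velocities
pos = np.sort(rng.integers(5, T - 5, size=(n_particles, n_segments - 1)), axis=1).astype(float)
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_val.argmax()].copy()

for _ in range(iters):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)  # standard PSO update
    pos = np.clip(pos + vel, 1, T - 1)
    vals = np.array([fitness(p) for p in pos])
    improved = vals > pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmax()].copy()

print("best segmentation boundaries:", np.sort(gbest).astype(int))
```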

    A fast phoneme recognition system based on sparse representation of test utterances

    , Article 2014 4th Joint Workshop on Hands-Free Speech Communication and Microphone Arrays, HSCMA 2014 ; 2014 , p. 32-36 Saeb, A ; Razzazi, F ; Babaei-Zadeh, M ; Sharif University of Technology
    Abstract
    In this paper, a fast phoneme recognition system based on sparse representation is introduced. In this approach, phoneme recognition is fulfilled by Viterbi decoding on support vector machine (SVM) output probability estimates. The candidate classes for classification are adaptively pruned by a k-dimensional (KD) tree search followed by a sparse representation (SR) based class selector with an adaptive number of classes. We applied the proposed approach to build a phoneme recognition system and compared it with some well-known phoneme recognition systems in terms of accuracy and complexity. With this approach, we obtain a competitive phoneme error rate with promising computational... 
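    The Viterbi-over-posteriors step can be illustrated as below; the frame posteriors, transition matrix and class count are toy assumptions, and the KD-tree/SR pruning stage from the abstract is omitted.

```python
# Minimal sketch of Viterbi decoding over per-frame class posteriors (such as SVM
# probability estimates), assuming a simple phoneme transition matrix.
import numpy as np

def viterbi(log_post, log_trans, log_prior):
    """log_post: (T, C) frame log-posteriors; returns the best class sequence."""
    T, C = log_post.shape
    delta = log_prior + log_post[0]
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # (previous class, current class)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_post[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# toy usage: 3 classes, random posteriors, sticky transitions
rng = np.random.default_rng(2)
post = rng.dirichlet(np.ones(3), size=50)
trans = np.full((3, 3), 0.05); np.fill_diagonal(trans, 0.9)
print(viterbi(np.log(post), np.log(trans), np.log(np.full(3, 1 / 3))))
```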

    SFAVD: Sharif farsi audio visual database

    , Article IKT 2013 - 2013 5th Conference on Information and Knowledge Technology, Shiraz, Iran ; 2013 , Pages 417-421 ; 9781467364904 (ISBN) Naraghi, Z ; Jamzad, M ; Sharif University of Technology
    2013
    Abstract
    With the increasing use of computers in everyday life, improved communication between machines and humans is needed. Making such communication natural, and understanding a human face rendered in a graphical environment, requires audio-visual applications such as lip reading, audio-visual speech recognition and lip animation, and these in turn require suitable data. The lack of a complete audio-visual database for such applications in the Farsi language led us to build a new, complete Farsi database for this purpose, called SFAVD. It is a unique audio-visual database that, in addition to covering Farsi conceptual and speech structure, considers the influence of speech on lip movements. This database is created for... 

    Speech enhancement using hidden Markov models in Mel-frequency domain

    , Article Speech Communication ; Volume 55, Issue 2 , 2013 , Pages 205-220 ; 01676393 (ISSN) Veisi, H ; Sameti, H ; Sharif University of Technology
    2013
    Abstract
    A hidden Markov model (HMM)-based minimum mean square error speech enhancement method in the Mel-frequency domain is focused on, and a parallel cepstral and spectral (PCS) modeling is proposed. Both Mel-frequency spectral (MFS) and Mel-frequency cepstral (MFC) features are studied and evaluated for speech enhancement. To estimate the clean speech waveform from a noisy signal, an inversion from the Mel-frequency domain to the spectral domain is required, which introduces distortion artifacts into the spectrum estimation and the filtering. To reduce the corrupting effects of the inversion, the PCS modeling is proposed. This method performs concurrent modeling in both the cepstral and magnitude spectral... 

    HMM-based Persian speech synthesis using limited adaptation data

    , Article International Conference on Signal Processing Proceedings, ICSP ; Volume 1 , 2012 , Pages 585-589 ; 9781467321945 (ISBN) Bahmaninezhad, F ; Sameti, H ; Khorram, S ; Sharif University of Technology
    2012
    Abstract
    Speech synthesis systems provided for the Persian language so far need various large-scale speech corpora to synthesize several target speakers' voices. Accordingly, synthesizing speech with a small amount of data seems essential for Persian. Taking advantage of speaker adaptation in speech synthesis systems makes it possible to generate speech of remarkable quality when the data of the target speaker are limited. Here we apply this method to Persian for the first time. This paper describes speaker adaptation based on Hidden Markov Models (HMMs) in a Persian speech synthesis system for the FARsi Speech DATabase (FARSDAT). In this regard, we prepared the whole FARSDAT, then for... 

    Automatic noise recognition based on neural network using LPC and MFCC feature parameters

    , Article 2012 Federated Conference on Computer Science and Information Systems, FedCSIS 2012, 9 September 2012 through 12 September 2012 ; 2012 , Pages 69-73 ; 9781467307086 (ISBN) Haghmaram, R ; Aroudi, A ; Ghezel, M. H ; Veisi, H ; Sharif University of Technology
    2012
    Abstract
    This paper studies the automatic noise recognition problem based on RBF and MLP neural network classifiers using linear predictive and Mel-frequency cepstral coefficients (LPC and MFCC). We first briefly review the architecture of each network as an automatic noise recognition (ANR) approach, then compare them to each other and investigate the factors and criteria that influence final recognition performance. The proposed networks are evaluated on 15 stationary and non-stationary noise types with a frame length of 20 ms in terms of correct classification rate. The results demonstrate that the MLP network using LPCs is a precise ANR with an accuracy rate of 99.9%, while the RBF network with MFCCs... 
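    A hedged sketch of the MLP branch of such an ANR system follows, assuming frame-level feature vectors are already available (random placeholders stand in for 20 ms MFCC/LPC frames); the hyperparameters are illustrative, not the paper's.

```python
# Minimal sketch of an MLP-based noise-type classifier over frame features, assuming
# MFCC (or LPC) vectors have already been extracted (random placeholders below).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n_types, frames_per_type = 15, 200                    # 15 noise types, as in the paper
X = np.vstack([rng.normal(loc=i, scale=1.0, size=(frames_per_type, 13))
               for i in range(n_types)])              # stand-in for 13-dim MFCC frames
y = np.repeat(np.arange(n_types), frames_per_type)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
print("correct classification rate:", clf.score(X_te, y_te))
```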

    Reducing speech recognition costs: By compressing the input data

    , Article IS'2012 - 2012 6th IEEE International Conference Intelligent Systems, Proceedings ; 2012 , Pages 102-107 ; 9781467327824 (ISBN) Halavati, R ; Shouraki, S. B ; Sharif University of Technology
    2012
    Abstract
    One of the key constraints on using embedded speech recognition modules is the required computational power. To reduce this requirement, we propose an algorithm that clusters the speech signal before passing it to the recognition units. The algorithm is based on agglomerative clustering and produces a sequence of compressed frames optimized for recognition. Our experimental results indicate that the proposed method yields an average frame rate of 40 frames per second on medium-to-large-vocabulary isolated word recognition tasks without loss of recognition accuracy, which results in up to 60% faster recognition compared to the usual 100 fps fixed frame rate sampling. This value is quite... 
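    A minimal sketch of the compression idea, under the assumption that merging adjacent, similar frames approximates the agglomerative clustering step; the distance threshold and toy features are placeholders.

```python
# Minimal sketch of compressing a frame sequence by greedily merging adjacent,
# similar frames (an agglomerative-style pass), standing in for the paper's method.
import numpy as np

def compress_frames(frames, threshold=0.5):
    """Merge consecutive frames whose distance to the running cluster mean is small."""
    compressed, counts = [frames[0].copy()], [1]
    for f in frames[1:]:
        if np.linalg.norm(f - compressed[-1]) < threshold:
            # merge into the current cluster by updating its running mean
            counts[-1] += 1
            compressed[-1] += (f - compressed[-1]) / counts[-1]
        else:
            compressed.append(f.copy())
            counts.append(1)
    return np.array(compressed)

rng = np.random.default_rng(4)
# toy signal: 100 frames that drift slowly, so many neighbours are near-duplicates
frames = np.cumsum(rng.normal(scale=0.1, size=(100, 13)), axis=0)
out = compress_frames(frames, threshold=0.5)
print(f"compressed {len(frames)} frames to {len(out)}")
```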

    Cepstral-domain HMM-based speech enhancement using vector Taylor series and parallel model combination

    , Article 2012 11th International Conference on Information Science, Signal Processing and their Applications, ISSPA 2012, 2 July 2012 through 5 July 2012 ; July , 2012 , Pages 298-303 ; 9781467303828 (ISBN) Veisi, H ; Sameti, H ; Sharif University of Technology
    2012
    Abstract
    The speech enhancement problem using hidden Markov models (HMM) and the minimum mean square error (MMSE) criterion in the cepstral domain is studied. This noise reduction approach can be considered as weighted-sum filtering of the noisy speech signal, in which the filter weights are estimated using the HMM of the noisy speech. To obtain an accurate estimate of the noisy speech HMM, vector Taylor series (VTS) is proposed and compared with the parallel model combination (PMC) technique. Furthermore, the proposed cepstral-domain HMM-based speech enhancement systems are compared with the renowned autoregressive HMM (AR-HMM) approach. The evaluation results confirm the superiority of the cepstral-domain approach in comparison... 
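    The weighted-sum filtering view can be sketched as follows, assuming per-state Wiener-like gains and toy state posteriors in place of the paper's HMM machinery (the VTS/PMC estimation is not shown).

```python
# Minimal sketch of the HMM/MMSE view of enhancement as weighted-sum filtering:
# the clean-spectrum estimate is a posterior-weighted sum of per-state Wiener-like
# filters applied to the noisy spectrum. State posteriors and gains are toy values.
import numpy as np

rng = np.random.default_rng(5)
n_states, n_bins, n_frames = 4, 64, 30
noisy_power = rng.gamma(2.0, size=(n_frames, n_bins))         # |Y|^2 per frame/bin
clean_psd = rng.gamma(2.0, size=(n_states, n_bins))           # per-state clean PSD models
noise_psd = np.full(n_bins, 1.0)                              # stationary noise PSD estimate

# per-state Wiener gains derived from the (model) clean and noise PSDs
gains = clean_psd / (clean_psd + noise_psd)                   # (states, bins)

# stand-in for HMM state posteriors p(state | noisy frame); normally from forward-backward
post = rng.dirichlet(np.ones(n_states), size=n_frames)        # (frames, states)

# MMSE estimate: posterior-weighted sum of the filtered noisy spectrum
clean_est = (post @ gains) * noisy_power                      # (frames, bins)
print(clean_est.shape)
```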

    The effect of phase information in speech enhancement and speech recognition

    , Article 2012 11th International Conference on Information Science, Signal Processing and their Applications, ISSPA 2012, 2 July 2012 through 5 July 2012 ; 2012 , Pages 1446-1447 ; 9781467303828 (ISBN) Langarani, M. S. E ; Veisi, H ; Sameti, H ; Sharif University of Technology
    2012
    Abstract
    The majority of speech enhancement methods perform noise removal in the spectral domain and construct the enhanced speech signal from the estimated magnitude of the clean speech and the phase of the noisy speech. In this paper, we show that by incorporating the phase information in the enhancement process, the quality and intelligibility of the speech signal are improved. In our investigations, the minimum mean-square error short-time spectral amplitude (MMSE-STSA) and MMSE log-spectral amplitude methods are used to estimate the magnitude spectrum of the speech signal. By conducting six classes of experiments, it is shown that by taking the phase information into account, the overall SNR and PESQ measures are improved. In... 
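    The role of phase can be illustrated with a toy oracle experiment: combine the clean magnitude spectrum with either the clean or the noisy phase and compare reconstruction SNR. This is not the MMSE-STSA / log-spectral amplitude pipeline of the paper, just a sketch of why phase matters.

```python
# Minimal sketch of why phase matters: rebuild a signal from its clean magnitude
# combined with either the clean phase or the noisy phase, and compare SNR.
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.3 * np.random.default_rng(6).normal(size=clean.size)

_, _, C = stft(clean, fs=fs, nperseg=256)
_, _, N = stft(noisy, fs=fs, nperseg=256)

def snr(ref, est):
    n = min(ref.size, est.size)
    return 10 * np.log10(np.sum(ref[:n] ** 2) / np.sum((ref[:n] - est[:n]) ** 2))

# same (oracle) magnitude, different phase sources
_, with_clean_phase = istft(np.abs(C) * np.exp(1j * np.angle(C)), fs=fs, nperseg=256)
_, with_noisy_phase = istft(np.abs(C) * np.exp(1j * np.angle(N)), fs=fs, nperseg=256)
print("clean phase SNR (dB):", round(snr(clean, with_clean_phase), 1))
print("noisy phase SNR (dB):", round(snr(clean, with_noisy_phase), 1))
```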

    Combining augmented reality and speech technologies to help deaf and hard of hearing people

    , Article Proceedings - 2012 14th Symposium on Virtual and Augmented Reality, SVR 2012 ; 2012 , Pages 174-181 ; 9780769547251 (ISBN) Mirzaei, M. R ; Ghorshi, S ; Mortazavi, M ; Sharif University of Technology
    2012
    Abstract
    Augmented Reality (AR), Automatic Speech Recognition (ASR) and Text-to-Speech Synthesis (TTS) can be used to help people with disabilities. In this paper, we combine these technologies to build a new system for helping deaf people. This system can take the narrator's speech, convert it into readable text and show it directly on an AR display. To improve the accuracy of the system, we use Audio-Visual Speech Recognition (AVSR) as a backup for the ASR engine in noisy environments. In addition, we use the TTS system to make our system more usable for deaf people. The results of testing the system show that its accuracy is over 85 percent on average in different places. Also, the result of a... 

    Using augmented reality and automatic speech recognition techniques to help deaf and hard of hearing people

    , Article ACM International Conference Proceeding Series ; 2012 ; 9781450312431 (ISBN) Mirzaei, M. R ; Ghorshi, S ; Mortazavi, M ; Sharif University of Technology
    2012
    Abstract
    Recently, many studies have shown that Augmented Reality (AR) and Automatic Speech Recognition (ASR) can help people with disabilities. In this paper, we implement an innovative system for helping deaf people by combining AR, ASR and AVSR technologies. This system can instantly capture a narrator's speech, convert it into readable text and show it directly on an AR display. We show that our system's accuracy exceeds 85 percent on average when different ASR engines are used alongside an AVSR engine in different noisy environments. We also show in a survey that, on average, more than 90 percent of deaf people need such a system as an assistant on portable devices, compared with using only text or only sign language... 

    Support vector data description for spoken digit recognition

    , Article BIOSIGNALS 2012 - Proceedings of the International Conference on Bio-Inspired Systems and Signal Processing ; 2012 , Pages 32-37 ; 9789898425898 (ISBN) Tavanaei, A ; Ghasemi, A ; Tavanaei, M ; Sameti, H ; Manzuri, M. T ; Inst. Syst. Technol. Inf., Control Commun. (INSTICC) ; Sharif University of Technology
    2012
    Abstract
    A classifier based on Support Vector Data Description (SVDD) is proposed for spoken digit recognition. We use the Mel Frequency Discrete Wavelet Coefficients (MFDWC) and the Mel Frequency Cepstral Coefficients (MFCC) as the feature vectors. The proposed classifier is compared to the HMM; the results are promising and show that the HMM and SVDD classifiers have equal accuracy rates. The performance of the proposed features and the SVDD classifier with several kernel functions is evaluated and compared on clean and noisy speech. Owing to the multi-resolution and localization properties of the Wavelet Transform (WT) and the use of SVDD, experiments on the spoken digit recognition systems based on MFDWC features and... 
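    A hedged sketch of the SVDD-style decision rule follows, using scikit-learn's one-class SVM with an RBF kernel as a stand-in for SVDD and random vectors as placeholders for MFCC/MFDWC features.

```python
# Minimal sketch of an SVDD-style digit classifier: one boundary per class, assign a
# test vector to the class whose boundary scores it highest.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)
n_digits, per_class = 10, 100
X = np.vstack([rng.normal(loc=3 * d, size=(per_class, 13)) for d in range(n_digits)])
y = np.repeat(np.arange(n_digits), per_class)

# one data-description model per digit class
models = {d: OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X[y == d])
          for d in range(n_digits)}

def classify(x):
    scores = {d: m.decision_function(x.reshape(1, -1))[0] for d, m in models.items()}
    return max(scores, key=scores.get)

test = rng.normal(loc=3 * 7, size=13)                 # a sample drawn near digit "7"
print("predicted digit:", classify(test))
```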

    Hidden-Markov-model-based voice activity detector with high speech detection rate for speech enhancement

    , Article IET Signal Processing ; Volume 6, Issue 1 , February , 2012 , Pages 54-63 ; 17519675 (ISSN) Veisi, H ; Sameti, H ; Sharif University of Technology
    2012
    Abstract
    A new voice activity detection (VAD) algorithm with soft decision output in the Mel-frequency domain is developed based on hidden Markov models (HMM) and is incorporated in an HMM-based speech enhancement system. The proposed VAD uses a two-state ergodic HMM representing speech presence and speech absence. The states are constructed from the noisy speech and noise HMMs used in the speech enhancement system. This composite model provides robust detection of speech segments in the presence of noise and obviates the need for extra modeling in HMM-based speech enhancement applications. As the main purpose of the proposed VAD is to detect speech segments accurately, a hang-over mechanism is proposed and... 
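    A minimal sketch of a two-state soft-decision VAD with a hang-over rule, assuming toy energy-based emission likelihoods instead of the Mel-domain HMMs used in the paper.

```python
# Minimal sketch of a two-state (speech / non-speech) ergodic-HMM VAD with a soft
# output and a simple hang-over rule. Emission likelihoods are toy energy-based values.
import numpy as np

rng = np.random.default_rng(8)
T = 120
energy = np.concatenate([rng.normal(0, 1, 40), rng.normal(5, 1, 40), rng.normal(0, 1, 40)])

# emission likelihoods for the noise state (mean 0) and speech state (mean 5)
def gauss(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

lik = np.stack([gauss(energy, 0.0), gauss(energy, 5.0)], axis=1)    # (T, 2)
A = np.array([[0.95, 0.05], [0.05, 0.95]])                          # sticky transitions

# forward recursion with per-frame normalization -> soft speech posterior
alpha = np.array([0.5, 0.5]) * lik[0]
alpha /= alpha.sum()
soft = [alpha[1]]
for t in range(1, T):
    alpha = (alpha @ A) * lik[t]
    alpha /= alpha.sum()
    soft.append(alpha[1])
soft = np.array(soft)

# hang-over: once speech is detected, hold the decision for a few extra frames
hard, hold, hang = np.zeros(T, dtype=int), 0, 8
for t in range(T):
    if soft[t] > 0.5:
        hard[t], hold = 1, hang
    elif hold > 0:
        hard[t], hold = 1, hold - 1
print("detected speech frames:", int(hard.sum()))
```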

    Persian language understanding using a two-step extended hidden vector state parser

    , Article IEEE International Workshop on Machine Learning for Signal Processing, 18 September 2011 through 21 September 2011 ; September , 2011 , Page(s): 1 - 6 ; 9781457716232 (ISBN) Jabbari, F ; Sameti, H ; Hadi Bokaei, M ; Sharif University of Technology
    Abstract
    The key element of a spoken dialogue system is the spoken language understanding (SLU) unit. Hidden Vector State (HVS) is one of the most popular statistical approaches employed to implement the SLU unit. This paper presents a two-step approach for Persian language understanding. First, a goal detector is used to identify the main goal of the input utterance. Second, after restricting the search space for semantic tagging, an extended hidden vector state (EHVS) parser is used to extract the remaining semantics in each subspace. This mainly improves the performance of the semantic tagger, while reducing the model complexity and training time. Moreover, the need for a large amount of data will be... 

    Speaker phone mode classification using Gaussian mixture models

    , Article SPA 2011 - Signal Processing: Algorithms, Architectures, Arrangements, and Applications - Conference Proceedings, 29 September 2011 through 30 September 2011 ; September , 2011 , Pages 112-117 ; 9781457714863 (ISBN) Eghbal Zadeh, H ; Sobhan Manesh, F ; Sameti, H ; BabaAli, B ; Sharif University of Technology
    2011
    Abstract
    This study focuses on the classification of phone speakerphone modes using Gaussian mixture models (GMMs). In this regard, speech data in both the enabled and disabled speakerphone modes of cell phones and telephones were collected, processed and classified into two different categories. Different GMM mixture counts (1 to 4) and wave file sizes of 10, 20, 40 and 80 kB were tested in order to find the optimal conditions for classification. The GMM method attained an 87.99% correct classification rate on the test data. This classification is important for speech-enabled IVR (interactive voice response) systems [1], dialogue systems and many other speech processing systems, in the sense that it could help to load an optimal model for increasing system... 
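    The classification step can be sketched as below, assuming one GMM per speakerphone mode and synthetic frame features in place of the collected telephone data.

```python
# Minimal sketch of two-class GMM classification (speakerphone enabled vs. disabled),
# assuming per-frame features have been extracted (random placeholders below).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(9)
enabled = rng.normal(loc=0.0, size=(400, 12))          # stand-in features, mode A
disabled = rng.normal(loc=1.5, size=(400, 12))         # stand-in features, mode B

# one GMM per mode, with a small mixture count as in the paper (1 to 4 components)
gmm_on = GaussianMixture(n_components=2, random_state=0).fit(enabled)
gmm_off = GaussianMixture(n_components=2, random_state=0).fit(disabled)

def classify(frames):
    """Pick the mode whose GMM gives the higher average log-likelihood."""
    return "enabled" if gmm_on.score(frames) > gmm_off.score(frames) else "disabled"

test = rng.normal(loc=1.5, size=(50, 12))              # frames from a "disabled" recording
print(classify(test))
```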

    Filter-bank design based on dependencies between frequency components and phoneme characteristics

    , Article European Signal Processing Conference, 29 August 2011 through 2 September 2011 ; Septembe , 2011 , Pages 2142-2145 ; 22195491 (ISSN) Mohammadi, S. H ; Sameti, H ; Tavanaei, A ; Soltani Farani, A ; Sharif University of Technology
    2011
    Abstract
    Mel-frequency cepstral coefficients are widely used for feature extraction in speech recognition systems. These features use Mel-scaled filters. A new filter-bank based on dependencies between frequency components and phoneme characteristics is proposed. The F-ratio and mutual information are used for this purpose. A new filter-bank is designed in which the frequency resolution of the sub-band filters is inversely proportional to the computed dependency values. This new filter-bank is used instead of the Mel-scaled filters for feature extraction. A phoneme recognition experiment on the FARSDAT Persian language database showed that features extracted using the proposed filter-bank reach a higher accuracy (63.92%)... 
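    A hedged sketch of placing triangular filters according to a per-frequency dependency profile (a synthetic stand-in for the F-ratio / mutual-information values), analogous to how Mel filters follow the Mel scale; the paper's exact mapping is not reproduced.

```python
# Minimal sketch of building a triangular filter-bank whose filter placement follows a
# per-frequency "dependency" profile: regions with larger dependency values receive
# more, narrower filters. The dependency curve below is a toy assumption.
import numpy as np

n_fft, n_filters, sr = 512, 20, 16000
freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)

# toy dependency profile: higher values in low/mid frequencies
dependency = 1.0 / (1.0 + freqs / 2000.0)

# place filter edges by inverting the cumulative dependency
cdf = np.cumsum(dependency)
cdf /= cdf[-1]
targets = np.linspace(0, 1, n_filters + 2)
edges = np.interp(targets, cdf, freqs)                # filter edge frequencies in Hz
edge_bins = np.round(edges / (sr / 2) * (n_fft // 2)).astype(int)

fbank = np.zeros((n_filters, n_fft // 2 + 1))
for i in range(n_filters):
    lo, c, hi = edge_bins[i], edge_bins[i + 1], edge_bins[i + 2]
    fbank[i, lo:c + 1] = np.linspace(0, 1, c - lo + 1)        # rising slope
    fbank[i, c:hi + 1] = np.linspace(1, 0, hi - c + 1)        # falling slope
print(fbank.shape)
```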

    Optimum detection and location estimation of target lines in the range-time space of a search radar

    , Article Aerospace Science and Technology ; Volume 15, Issue 8 , 2011 , Pages 627-634 ; 12709638 (ISSN) Moqiseh, A ; Sharify, S ; Nayebi, M. M ; Sharif University of Technology
    2011
    Abstract
    The average likelihood ratio detector is derived as the optimum detector for detecting a target line with unknown normal parameters in the range-time data space of a search radar, which is corrupted by Gaussian noise. The receiver operating characteristic of this optimum detector is derived to evaluate its performance improvement in comparison with the Hough detector, which uses the return signal of several successive scans to achieve a non-coherent integration improvement and obtain better performance than the conventional detector. This comparison, which is done through analytic derivations and also through simulation results, shows that the average likelihood ratio detector has a better... 

    Fundamental frequency estimation using modified higher order moments and multiple windows

    , Article Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH ; 2011 , Pages 1965-1968 ; 19909772 (ISSN) Pawi, A ; Vaseghi, S ; Milner, B ; Ghorshi, S ; Sharif University of Technology
    2011
    Abstract
    This paper proposes a set of higher-order modified moments for estimating the fundamental frequency of speech and explores the impact of the speech window length on pitch estimation error. The pitch extraction methods are evaluated across a range of noise types and SNRs. For the calculation of errors, pitch reference values are computed from manually-corrected estimates of the periods obtained from laryngograph signals. The results obtained for the 3rd- and 4th-order modified moments compare well with methods based on correlation and magnitude difference criteria and with the YIN method, with improved pitch accuracy and fewer large errors.
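    A rough, lag-domain sketch in the spirit of such estimators, using a generalized order-p magnitude difference criterion as a stand-in; the paper's modified-moment definitions are not reproduced here.

```python
# Minimal sketch of lag-domain pitch estimation using a generalized (order-p) average
# magnitude difference criterion, a hedged stand-in for modified higher-order moments.
import numpy as np

def estimate_f0(x, fs, f_lo=60.0, f_hi=400.0, p=2):
    """Pick the lag minimizing the mean |x[n] - x[n+lag]|**p over the pitch range."""
    lags = np.arange(int(fs / f_hi), int(fs / f_lo) + 1)
    costs = [np.mean(np.abs(x[:-lag] - x[lag:]) ** p) for lag in lags]
    return fs / lags[int(np.argmin(costs))]

fs = 8000
t = np.arange(int(0.04 * fs)) / fs                    # a 40 ms analysis window
x = np.sin(2 * np.pi * 200 * t) + 0.1 * np.random.default_rng(10).normal(size=t.size)
print("estimated F0 (Hz):", round(estimate_f0(x, fs), 1))
```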