
Speech enhancement using hidden Markov models in Mel-frequency domain

Article, Speech Communication; Volume 55, Issue 2, 2013, Pages 205-220; 01676393 (ISSN); Veisi, H; Sameti, H; Sharif University of Technology

2013

Abstract

This work focuses on a hidden Markov model (HMM)-based minimum mean square error speech enhancement method in the Mel-frequency domain and proposes parallel cepstral and spectral (PCS) modeling. Both Mel-frequency spectral (MFS) and Mel-frequency cepstral (MFC) features are studied and evaluated for speech enhancement. To estimate the clean speech waveform from a noisy signal, an inversion from the Mel-frequency domain to the spectral domain is required, which introduces distortion artifacts in the spectrum estimation and the filtering. To reduce the corrupting effects of the inversion, PCS modeling is proposed. This method performs concurrent modeling in both cepstral and magnitude spectral...

HMM-based Persian speech synthesis using limited adaptation data

Article, International Conference on Signal Processing Proceedings, ICSP; Volume 1, 2012, Pages 585-589; 9781467321945 (ISBN); Bahmaninezhad, F; Sameti, H; Khorram, S; Sharif University of Technology

2012

Abstract

Speech synthesis systems provided for the Persian language so far need various large-scale speech corpora to synthesize several target speakers' voices. Accordingly, synthesizing speech with a small amount of data seems to be essential in Persian. Taking advantage of speaker adaptation in speech synthesis systems makes it possible to generate speech with remarkable quality when the data of the speaker are limited. Here we apply this method to Persian for the first time. This paper describes speaker adaptation based on Hidden Markov Models (HMMs) in a Persian speech synthesis system for the FARsi Speech DATabase (FARSDAT). In this regard, we prepared the whole FARSDAT, then for...

Automatic noise recognition based on neural network using LPC and MFCC feature parameters

Article, 2012 Federated Conference on Computer Science and Information Systems, FedCSIS 2012, 9 September 2012 through 12 September 2012; 2012, Pages 69-73; 9781467307086 (ISBN); Haghmaram, R; Aroudi, A; Ghezel, M. H; Veisi, H; Sharif University of Technology

2012

Abstract

This paper studies the automatic noise recognition problem based on RBF and MLP neural network classifiers using linear predictive and Mel-frequency cepstral coefficients (LPC and MFCC). We first briefly review the architecture of each network as an automatic noise recognition (ANR) approach, then compare them to each other and investigate the factors and criteria that influence final recognition performance. The proposed networks are evaluated on 15 stationary and non-stationary types of noise with a frame length of 20 ms in terms of correct classification rate. The results demonstrate that the MLP network using LPCs is a precise ANR with an accuracy rate of 99.9%, while the RBF network with MFCCs...

Reducing speech recognition costs: By compressing the input data

Article, IS'2012 - 2012 6th IEEE International Conference Intelligent Systems, Proceedings; 2012, Pages 102-107; 9781467327824 (ISBN); Halavati, R; Shouraki, S. B; Sharif University of Technology

2012

Abstract

One of the key constraints of using embedded speech recognition modules is the required computational power. To decrease this requirement, we propose an algorithm that clusters the speech signal before passing it to the recognition units. The algorithm is based on agglomerative clustering and produces a sequence of compressed frames, optimized for recognition. Our experimental results indicate that the proposed method yields an average frame rate of 40 frames per second on medium to large vocabulary isolated word recognition tasks without loss of recognition accuracy, which results in up to 60% faster recognition compared to the usual 100 fps fixed frame rate sampling. This value is quite...
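
The abstract above does not give the details of the clustering step, so the following is only a minimal sketch of one way agglomerative clustering could compress a frame sequence. It assumes (our assumption, not the paper's) that only temporally adjacent frames are merged, that the distance is squared Euclidean, and that merging stops at a fixed target count; `compress_frames` and `sq_dist` are hypothetical names.

```python
from typing import List, Tuple

Frame = List[float]


def sq_dist(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def compress_frames(frames: List[Frame], target_count: int) -> List[Tuple[Frame, int]]:
    """Greedily merge the closest pair of adjacent clusters.

    Returns a list of (mean vector, number of original frames) pairs,
    in temporal order, with at most max(1, target_count) entries
    whenever the input is non-empty.
    """
    # Every frame starts as its own cluster of size 1.
    clusters = [(list(f), 1) for f in frames]
    while len(clusters) > max(1, target_count):
        # Index of the adjacent pair whose means are closest (assumed criterion).
        i = min(
            range(len(clusters) - 1),
            key=lambda k: sq_dist(clusters[k][0], clusters[k + 1][0]),
        )
        (m1, n1), (m2, n2) = clusters[i], clusters[i + 1]
        # The merged mean is the size-weighted average of the two means.
        merged = [(a * n1 + b * n2) / (n1 + n2) for a, b in zip(m1, m2)]
        clusters[i:i + 2] = [(merged, n1 + n2)]
    return clusters
```

Restricting merges to neighbours keeps the output a time-ordered sequence, which a recognizer needs; a real system would presumably choose the stopping rule from the data rather than a fixed count.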

Cepstral-domain HMM-based speech enhancement using vector Taylor series and parallel model combination

Article, 2012 11th International Conference on Information Science, Signal Processing and their Applications, ISSPA 2012, 2 July 2012 through 5 July 2012; July 2012, Pages 298-303; 9781467303828 (ISBN); Veisi, H; Sameti, H; Sharif University of Technology

2012

Abstract

The speech enhancement problem using a hidden Markov model (HMM) and minimum mean square error (MMSE) in the cepstral domain is studied. This noise reduction approach can be considered as weighted-sum filtering of the noisy speech signal in which the filter weights are estimated using the HMM of noisy speech. To have an accurate estimation of the noisy speech HMM, vector Taylor series (VTS) is proposed and compared with the parallel model combination (PMC) technique. Furthermore, the proposed cepstral-domain HMM-based speech enhancement systems are compared with the renowned autoregressive HMM (AR-HMM) approach. The evaluation results confirm the superiority of the cepstral domain approach in comparison...

The effect of phase information in speech enhancement and speech recognition

Article, 2012 11th International Conference on Information Science, Signal Processing and their Applications, ISSPA 2012, 2 July 2012 through 5 July 2012; 2012, Pages 1446-1447; 9781467303828 (ISBN); Langarani, M. S. E; Veisi, H; Sameti, H; Sharif University of Technology

2012

Abstract

The majority of speech enhancement methods perform noise removal in the spectral domain and construct the enhanced speech signal from the estimated magnitude of clean speech and the phase of the noisy speech. In this paper, we show that by incorporating the phase information in the enhancement process, the quality and intelligibility of the speech signal are improved. In our investigations, the minimum mean-square error (MMSE) short-time spectral amplitude and MMSE log-spectral amplitude methods are used to estimate the magnitude spectrum of the speech signal. By conducting six classes of experiments, it is shown that by taking the phase information into account, overall SNR and PESQ measures are improved. In...
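
The conventional reconstruction mentioned in the first sentence of this abstract (estimated clean magnitude combined with the noisy phase) can be sketched as follows. This illustrates only that baseline step, not the paper's phase-aware method; `recombine` is a hypothetical name, and the magnitude estimates are assumed to come from some external estimator such as an MMSE amplitude estimator.

```python
import cmath
from typing import List


def recombine(est_magnitudes: List[float], noisy_spectrum: List[complex]) -> List[complex]:
    """Pair each estimated clean magnitude with the phase of the matching noisy bin."""
    return [
        cmath.rect(mag, cmath.phase(noisy_bin))
        for mag, noisy_bin in zip(est_magnitudes, noisy_spectrum)
    ]
```

Each output bin has the estimated magnitude and exactly the noisy phase, so any phase error in the noisy signal is carried into the enhanced signal unchanged; that is the limitation the abstract sets out to examine.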

Combining augmented reality and speech technologies to help deaf and hard of hearing people

Article, Proceedings - 2012 14th Symposium on Virtual and Augmented Reality, SVR 2012; 2012, Pages 174-181; 9780769547251 (ISBN); Mirzaei, M. R; Ghorshi, S; Mortazavi, M; Sharif University of Technology

2012

Abstract

Augmented Reality (AR), Automatic Speech Recognition (ASR) and Text-to-Speech Synthesis (TTS) can be used to help people with disabilities. In this paper, we combine these technologies to make a new system for helping deaf people. This system can take the narrator's speech, convert it into readable text, and show it directly on an AR display. To improve the accuracy of the system, we use Audio-Visual Speech Recognition (AVSR) as a backup for the ASR engine in noisy environments. In addition, we use the TTS system to make our system more usable for deaf people. The results of testing the system show that its accuracy is over 85 percent on average in different places. Also, the result of a...

Using augmented reality and automatic speech recognition techniques to help deaf and hard of hearing people

Article, ACM International Conference Proceeding Series; 2012; 9781450312431 (ISBN); Mirzaei, M. R; Ghorshi, S; Mortazavi, M; Sharif University of Technology

2012

Abstract

Recently, many studies have shown that Augmented Reality (AR) and Automatic Speech Recognition (ASR) can help people with disabilities. In this paper we implement an innovative system for helping deaf people by combining AR, ASR, and AVSR technologies. This system can instantly take the narrator's speech, convert it into readable text, and show it directly on an AR display. We show that our system's accuracy exceeds 85 percent on average when different ASR engines are used alongside an AVSR engine in different noisy environments. We also show in a survey that more than 90 percent of deaf people on average need such a system as an assistant in portable devices, alongside using only text or only sign-language...

Support vector data description for spoken digit recognition

Article, BIOSIGNALS 2012 - Proceedings of the International Conference on Bio-Inspired Systems and Signal Processing; 2012, Pages 32-37; 9789898425898 (ISBN); Tavanaei, A; Ghasemi, A; Tavanaei, M; Sameti, H; Manzuri, M. T; Inst. Syst. Technol. Inf., Control Commun. (INSTICC); Sharif University of Technology

2012

Abstract

A classifier based on Support Vector Data Description (SVDD) is proposed for spoken digit recognition. We use the Mel Frequency Discrete Wavelet Coefficients (MFDWC) and the Mel Frequency Cepstral Coefficients (MFCC) as the feature vectors. The proposed classifier is compared to the HMM; the results are promising, and we show that the HMM and SVDD classifiers have equal accuracy rates. The performance of the proposed features and the SVDD classifier with several kernel functions is evaluated and compared in clean and noisy speech. Because of multi-resolution and localization of the Wavelet Transform (WT) and using SVDD, experiments on the spoken digit recognition systems based on MFDWC features and...

Hidden-Markov-model-based voice activity detector with high speech detection rate for speech enhancement

Article, IET Signal Processing; Volume 6, Issue 1, February 2012, Pages 54-63; 17519675 (ISSN); Veisi, H; Sameti, H; Sharif University of Technology

2012

Abstract

A new voice activity detection (VAD) algorithm with soft decision output in the Mel-frequency domain is developed based on a hidden Markov model (HMM) and is incorporated in an HMM-based speech enhancement system. The proposed VAD uses a two-state ergodic HMM representing speech presence and speech absence. The states are constructed from the noisy speech and noise HMMs used in the speech enhancement system. This composite model provides a robust detection of speech segments in the presence of noise and obviates the need for extra modeling in HMM-based speech enhancement applications. As the main purpose of the proposed VAD is to detect speech segments accurately, a hang-over mechanism is proposed and...

Persian language understanding using a two-step extended hidden vector state parser

Article, IEEE International Workshop on Machine Learning for Signal Processing, 18 September 2011 through 21 September 2011; September 2011, Pages 1-6; 9781457716232 (ISBN); Jabbari, F; Sameti, H; Hadi Bokaei, M; Sharif University of Technology

Abstract

The key element of a spoken dialogue system is a spoken language understanding (SLU) unit. Hidden Vector State (HVS) is one of the most popular statistical approaches employed to implement the SLU unit. This paper presents a two-step approach for Persian language understanding. First, a goal detector is used to identify the main goal of the input utterance. Second, after restricting the search space for semantic tagging, an extended hidden vector state (EHVS) parser is used to extract the remaining semantics in each subspace. This will mainly improve the performance of the semantic tagger, while reducing the model complexity and training time. Moreover, the need for a large amount of data will be...

Speaker phone mode classification using Gaussian mixture models

Article, SPA 2011 - Signal Processing: Algorithms, Architectures, Arrangements, and Applications - Conference Proceedings, 29 September 2011 through 30 September 2011; September 2011, Pages 112-117; 9781457714863 (ISBN); Eghbal Zadeh, H; Sobhan Manesh, F; Sameti, H; BabaAli, B; Sharif University of Technology

2011

Abstract

This study focuses on classifying the speaker mode of phones using Gaussian mixture models (GMM). In this regard, speech data in both enabled and disabled speaker modes of cell phones and telephones were collected, processed and classified into two different categories. Different mixture numbers (1 to 4) of the GMM and wave file sizes of 10, 20, 40 and 80 kb were tested in order to obtain an optimal condition for classification. The GMM method attained an 87.99% correct classification rate on test data. This classification is important for speech-enabled IVR systems [1], dialog systems and many systems in speech processing in the sense that it could help to load an optimum model for increasing system...

Filter-bank design based on dependencies between frequency components and phoneme characteristics

Article, European Signal Processing Conference, 29 August 2011 through 2 September 2011; September 2011, Pages 2142-2145; 22195491 (ISSN); Mohammadi, S. H; Sameti, H; Tavanaei, A; Soltani Farani, A; Sharif University of Technology

2011

Abstract

Mel-frequency cepstral coefficients are widely used for feature extraction in speech recognition systems. These features use Mel-scaled filters. A new filter-bank based on dependencies between frequency components and phoneme characteristics is proposed. F-ratio and mutual information are used for this purpose. A new filter-bank is designed in which the frequency resolution of sub-band filters is inversely proportional to the computed dependency values. This new filter-bank is used instead of Mel-scaled filters for feature extraction. A phoneme recognition experiment on the FARSDAT Persian language database showed that features extracted using the proposed filter-bank reach higher accuracy (63.92%)...
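
The abstract names the F-ratio as one dependency measure without defining it. A common textbook form, assumed here rather than taken from the paper, is the variance of the per-class means divided by the average within-class variance, computed for one frequency component across phoneme classes. `f_ratio` is a hypothetical name, and population variances are an arbitrary choice for this sketch.

```python
from statistics import fmean, pvariance
from typing import Dict, List


def f_ratio(values_by_class: Dict[str, List[float]]) -> float:
    """Between-class variance of the class means over the mean within-class variance.

    Raises ZeroDivisionError if every class has zero internal variance.
    """
    class_means = [fmean(v) for v in values_by_class.values()]
    between = pvariance(class_means)
    within = fmean([pvariance(v) for v in values_by_class.values()])
    return between / within
```

A larger value means the component separates the classes more clearly, which is the kind of dependency score the proposed filter-bank design is said to use when allocating frequency resolution.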

Optimum detection and location estimation of target lines in the range-time space of a search radar

Article, Aerospace Science and Technology; Volume 15, Issue 8, 2011, Pages 627-634; 12709638 (ISSN); Moqiseh, A; Sharify, S; Nayebi, M. M; Sharif University of Technology

2011

Abstract

The average likelihood ratio detector is derived as the optimum detector for detecting a target line with unknown normal parameters in the range-time data space of a search radar, which is corrupted by Gaussian noise. The receiver operating characteristic of this optimum detector is derived to evaluate its performance improvement in comparison with the Hough detector, which uses the return signal of several successive scans to achieve a non-coherent integration improvement and get a better performance than the conventional detector. This comparison, which is done through analytic derivations and also through simulation results, shows that the average likelihood ratio detector has a better...

Fundamental frequency estimation using modified higher order moments and multiple windows

Article, Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH; 2011, Pages 1965-1968; 19909772 (ISSN); Pawi, A; Vaseghi, S; Milner, B; Ghorshi, S; Sharif University of Technology

2011

Abstract

This paper proposes a set of higher-order modified moments for estimation of the fundamental frequency of speech and explores the impact of the speech window length on pitch estimation error. The pitch extraction methods are evaluated in a range of noise types and SNRs. For calculation of errors, pitch reference values are calculated from manually-corrected estimates of the periods obtained from laryngograph signals. The results obtained for the 3rd and 4th order modified moments compare well with methods based on correlation and magnitude difference criteria and the YIN method, with improved pitch accuracy and fewer large errors.

Incorporating a novel confidence scoring method in a Persian spoken dialogue system

Article, SPA 2011 - Signal Processing: Algorithms, Architectures, Arrangements, and Applications - Conference Proceedings, 29 September 2011 through 30 September 2011, Poznan; September 2011, Pages 74-78; 9781457714863 (ISBN); Sakhaee, E; Sameti, H; Babaali, B; Sharif University of Technology

2011

Abstract

Reliability assessment of phonemes, syllables, words, concepts or utterances has become the key feature of Automatic Speech Recognition (ASR) engines in order to make a decision to accept or reject a hypothesis. In this paper, we propose utterance-level confidence annotation based on a combination of features extracted from multiple knowledge sources in the Persian language. The experiment was conducted first to examine the performance of individual features, then to combine them using statistical data analysis and density estimation methods to assign a confidence score to utterances. Using the data collected from a Persian spoken dialogue system, we show that combining features from independent...

Towards MPEG4 compatible face representation via hierarchical clustering-based facial feature extraction

Article, ISCI 2011 - 2011 IEEE Symposium on Computers and Informatics; 2011, Pages 436-441; 9781612846903 (ISBN); Ghahari, A; Mosleh, M; Sharif University of Technology

Abstract

Multi-view imaging and display systems have taken a divide-and-conquer approach to 3D sensing and visualization. We aim to make automatic feature extraction, and natural 3D feature construction from 2D features detected on a pair of frontal and profile view face images, more reliable and robust. We propose several heuristic algorithms to minimize possible errors introduced by the prevalent imperfect orthogonal condition and non-coherent luminance, addressing the problems incurred by illumination discrepancies on common surface points when accommodating multiple views. In our approach, we first extract the 2D features that are visible to both cameras in both views. Then, we estimate the...

Mel-scaled Discrete Wavelet Transform and dynamic features for the Persian phoneme recognition

Article, 2011 International Symposium on Artificial Intelligence and Signal Processing, AISP 2011, 15 June 2011 through 16 June 2011; June 2011, Pages 138-140; 9781424498345 (ISBN); Tavanaei, A; Manzuri, M. T; Sameti, H; Sharif University of Technology

2011

Abstract

In this paper we use a feature vector consisting of the Mel Frequency Discrete Wavelet Coefficients (MFDWCs) to recognize spoken phonemes in the Persian language. The purpose of using wavelets in feature extraction is to benefit from their multi-resolution analysis and localization property in the time and frequency domains. The MFDWCs are obtained by applying the Discrete Wavelet Transform (DWT) to the Mel-scaled log filter bank energies of a speech frame. Feature vectors are used for HMM-based phoneme recognition on a portion of the FarsDat Persian language database consisting of 35 hours of recorded data for training and 15 hours for testing. We evaluate the performance of the new features for clean speech...
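
The MFDWC step described above (a DWT applied to the Mel-scaled log filter bank energies of one frame) can be sketched as below. The abstract does not say which wavelet or how many decomposition levels are used, so this sketch assumes an orthonormal Haar wavelet, a full decomposition, and a power-of-two number of filters; `haar_dwt` and `mfdwc` are hypothetical names, and the filter bank energies are taken as given.

```python
import math
from typing import List


def haar_dwt(signal: List[float]) -> List[float]:
    """Full multi-level orthonormal Haar DWT; length must be a power of two."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    root2 = math.sqrt(2.0)
    approx = list(signal)
    details: List[float] = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / root2 for a, b in pairs]
        approx = [(a + b) / root2 for a, b in pairs]
        # Coarser-scale details are placed in front of finer-scale ones.
        details = detail + details
    return approx + details


def mfdwc(filterbank_energies: List[float]) -> List[float]:
    """Log-compress positive Mel filter bank energies, then apply the Haar DWT."""
    return haar_dwt([math.log(e) for e in filterbank_energies])
```

Unlike the DCT used for MFCCs, each wavelet coefficient depends only on a local group of filter bank channels, which is the localization property the abstract refers to.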

Utilizing intelligent segmentation in isolated word recognition using a hybrid HTD-HMM

Article, International Conference on Circuits, Systems, Signal and Telecommunications - Proceedings, 21 October 2010 through 23 October 2010; October 2011, Pages 42-49; 9789604742714 (ISBN); Kazemi, R; Sereshkeh, A. R; Ehsandoust, B; Sharif University of Technology

2011

Abstract

Isolated Word Recognition (IWR) is becoming increasingly attractive due to the improvement of speech recognition techniques. However, the accuracy of IWR suffers when large databases or words with similar pronunciation are used. The criterion for accurate speech recognition is suitable segmentation. However, the traditional method of segmentation, equal segmentation, does not produce the most accurate result. Furthermore, utilizing manual segmentation based on events is not possible in large databases. In this paper, we introduce an intelligent segmentation based on Hierarchical Temporal Decomposition (HTD). Based on this method, a temporal decomposition (TD) algorithm can be used to...

Isolated word recognition based on intelligent segmentation by using hybrid HTD-HMM

Article, International Conference on Circuits, Systems, Signal and Telecommunications - Proceedings, 21 October 2010 through 23 October 2010; October 2011, Pages 38-41; 9789604742714 (ISBN); Kazemi, A. R; Ehsandoust, B. B; Rezazadeh, C. A; Ghaemmaghami, D. S; Sharif University of Technology

2011

Abstract

In recent years, IWR (Isolated Word Recognition) has been one of the main concerns of speech processing. The challenging problems in this field appear when the database becomes very large or contains many similarly pronounced words. This paper introduces a general solution for a traditional problem in the recognition of isolated, similarly pronounced words, especially in large databases. One important problem of traditional IWR lies in its segmentation algorithm; these methods lack efficiency for the following reasons. First, equal segmentation is not at all intelligent and, as a result, cannot produce accurate results; besides, utilizing...