- Type of Document: M.Sc. Thesis
- Language: Farsi
- Document No: 57725 (05)
- University: Sharif University of Technology
- Department: Electrical Engineering
- Advisor(s): Babaiezadeh Malmiri, Massoud
- Abstract:
- Blind Source Separation (BSS) aims to recover multiple source signals from their mixtures without prior knowledge of the sources or the mixing process. This thesis focuses on speech separation in convolutive mixtures, where the goal is to recover speech signals from mixtures that are linear combinations of filtered versions of the original signals. Most recent methods for blind source separation of both instantaneous and convolutive mixtures transform the time-domain mixed signals into the time-frequency domain using the Short-Time Fourier Transform (STFT) and perform the separation in that domain. Broadly, separation follows one of two main approaches. The first designs and applies beamforming filters to separate the mixed signals, either exploiting the reception of signals from specific directions or applying instantaneous filtering in each frequency bin to achieve statistical independence among the recovered signals. The second designs mask filters for the STFT matrices of the mixed signals, assigning each time-frequency point to a specific source. This thesis employs a sinusoidal speech model as the framework for analyzing the mixed signals in the time-frequency domain, synthesizing the separated speech signals, and converting them back to the time domain. Two main structures are proposed for separating speech mixtures. The first develops a decision criterion or classification model that allocates each time-frequency point to one of the speakers; these criteria and models are built from information extracted from speaker-dominant frames and from the estimated spatial information of the speakers. The allocated time-frequency points are then used to generate time-varying and time-evolving sinusoids for synthesizing the recovered speech. The second structure designs convolutive demixing filters for each speaker in every frequency bin, so that the separated speech signals are obtained as linear combinations of filtered mixed signals; the demixing filters are estimated using only speaker-dominant frames as training data to learn the filter coefficients. According to the Perceptual Evaluation of Speech Quality (PESQ) metric, the first proposed structure achieves an average separation score of 2.37 on determined convolutive mixtures, compared to 2.43 for the new robust Global and Local Simplex Separation (GLOSS) method, and a score of 1.99 on underdetermined convolutive mixtures, compared to 2.14 for the computationally intensive method based on multi-frame Full-rank spatial Covariance matrix Analysis (mf-FCA), while requiring less than 0.01 times the processing time of mf-FCA. Additionally, the second proposed structure separates determined convolutive mixtures with a score of 2.45, compared to 2.2 for GLOSS.
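- As a rough illustration of the mask-based family of methods described in the abstract (not the thesis's specific decision criterion), the sketch below transforms a mixture with the STFT, assigns each time-frequency point to the speaker whose magnitude template is dominant there, and resynthesizes each estimate with the inverse STFT. The per-speaker `anchors` templates, the sampling rate, and the window length are assumptions made for illustration; the thesis instead derives its assignment rules from speaker-dominant frames and estimated spatial information.

```python
# Minimal sketch of STFT-domain binary-mask separation (illustrative only).
import numpy as np
from scipy.signal import stft, istft

def separate_by_binary_masks(mixture, anchors, fs=16000, nperseg=1024):
    """mixture: 1-D mixed signal.
    anchors: list of rough per-speaker magnitude templates with the same
    shape as the mixture STFT (assumed to be given by some earlier stage)."""
    _, _, X = stft(mixture, fs=fs, nperseg=nperseg)          # mixture STFT
    scores = np.stack([np.abs(A) for A in anchors])          # per-speaker evidence
    winner = np.argmax(scores, axis=0)                       # dominant speaker per T-F point
    estimates = []
    for k in range(len(anchors)):
        mask = (winner == k).astype(float)                   # binary T-F mask for speaker k
        _, x_hat = istft(mask * X, fs=fs, nperseg=nperseg)   # back to the time domain
        estimates.append(x_hat)
    return estimates
```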
- Keywords:
- Blind Source Separation (BSS) ; Convolutive Mixtures ; Online Speaker Diarization ; Time-Frequency Transform ; Short-Time Fourier Transform ; Blind Speech Separation ; Sinusoidal Speech Model ; Band-to-Band Filters
Contents
- Introduction
- A Review of Blind Source Separation Methods
- The Sinusoidal Speech Model
- Proposed Structure for Speech Separation Using the Sinusoidal Speech Model
- Proposed Structure for Speech Separation with Band-to-Band Filters
- Conclusions and Suggestions
- References