A possibilistic approach for building statistical language models

Momtazi, S ; Sharif University of Technology

  1. Type of Document: Article
  2. DOI: 10.1109/ISDA.2009.197
  3. Abstract:
  4. Class-based n-gram language models are the ones most frequently used in continuous speech recognition systems, especially for languages for which no richly annotated corpora are available. Various word clustering algorithms have been proposed to build such class-based models. In this work, we discuss the superiority of soft approaches to class construction, whereby each word can be assigned to more than one class. We also propose a new method for possibilistic word clustering, using the possibilistic C-means algorithm as the clustering method. Various parameters of this algorithm are investigated, e.g., centroid initialization, distance measure, and the words' feature vectors. In the experiments reported here, this algorithm is applied to the 20,000 most frequent Persian words, and the language model built from the resulting clusters is evaluated based on its perplexity and on the accuracy of a continuous speech recognition system. Our results indicate a 10% reduction in perplexity and a 4% reduction in word error rate. © 2009 IEEE
  5. Keywords:
  6. Class-based ; Clustering methods ; Continuous speech ; Distance measure ; Feature vectors ; Language model ; N-gram language models ; Persians ; Possibilistic ; Possibilistic approach ; Statistical language models ; Word clustering ; Word error rate ; Computational linguistics ; Continuous speech recognition ; Intelligent systems ; Packet networks ; Query languages ; Clustering algorithms
  7. Source: ISDA 2009 - 9th International Conference on Intelligent Systems Design and Applications, 30 November 2009 through 2 December 2009, Pisa ; 2009 , Pages 1014-1018 ; 9780769538723 (ISBN)
  8. URL: http://ieeexplore.ieee.org/document/5364438/?reload=true
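
The abstract names possibilistic C-means as the clustering method but does not give its update rules. As a rough illustration only, the sketch below follows the standard textbook formulation of possibilistic C-means, in which each word vector receives a typicality in (0, 1] for every cluster and the typicalities for one word need not sum to 1 (this is what lets a word belong to several classes at once). The function names, the squared Euclidean distance, the fuzzifier `m`, and the per-cluster scale values `etas` are all assumptions made for this example and are not taken from the paper, which investigates several distance measures and feature vectors.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pcm_typicalities(points, centroids, etas, m=2.0):
    """Typicality of every point to every cluster.

    Standard possibilistic C-means rule (assumed here, not from the paper):
        u[i][j] = 1 / (1 + (d_ij^2 / eta_i) ** (1 / (m - 1)))
    where d_ij^2 is the squared distance from point j to centroid i and
    eta_i > 0 is a per-cluster scale parameter.
    """
    exponent = 1.0 / (m - 1.0)
    return [
        [1.0 / (1.0 + (squared_euclidean(x, c) / eta) ** exponent) for x in points]
        for c, eta in zip(centroids, etas)
    ]


def pcm_update_centroids(points, typicalities, m=2.0):
    """Recompute each centroid as a mean of the points weighted by u ** m."""
    dim = len(points[0])
    new_centroids = []
    for row in typicalities:
        weights = [u ** m for u in row]
        total = sum(weights)  # always > 0 because every typicality is > 0
        new_centroids.append(
            [sum(w * x[k] for w, x in zip(weights, points)) / total for k in range(dim)]
        )
    return new_centroids


def pcm(points, centroids, etas, m=2.0, iterations=20):
    """Alternate typicality and centroid updates for a fixed number of steps."""
    for _ in range(iterations):
        u = pcm_typicalities(points, centroids, etas, m)
        centroids = pcm_update_centroids(points, u, m)
    return centroids, pcm_typicalities(points, centroids, etas, m)


if __name__ == "__main__":
    # Toy 2-D "word feature vectors"; real features would be much larger.
    words = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [0.5, 0.5]]
    start = [[0.0, 0.0], [1.0, 1.0]]
    final_centroids, u = pcm(words, start, etas=[0.5, 0.5])
    for i, row in enumerate(u):
        print("cluster", i, [round(v, 3) for v in row])
```

In this formulation a point lying exactly on a centroid has typicality 1, and a point whose squared distance equals `eta_i` has typicality 0.5, so `etas` controls how quickly membership falls off. How the scale values and initial centroids are chosen strongly affects the result; the abstract lists centroid initialization among the parameters the authors studied.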