Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science (M.Sc.) in Computer Engineering, Artificial Intelligence

Hosseini, Mohammad Saleh | 2016
  1. Type of Document: M.Sc. Thesis
  2. Language: Farsi
  3. Document No: 48048 (19)
  4. University: Sharif University of Technology
  5. Department: Computer Engineering
  6. Advisor(s): Sameti, Hossein
  7. Abstract: Punctuation marks are an important part of text in every language, and omitting them makes a text ambiguous. The output of an automatic speech recognition (ASR) system is typically a raw sequence of words with no punctuation marks. This makes the text difficult or even impossible for humans to make sense of, and it hinders any further text processing. The goal of this thesis is automatic punctuation insertion in Persian texts that lack punctuation marks. To the best of our knowledge, this is the first work on this task for the Persian language. To this end, we first assembled a state-of-the-art corpus for training and testing punctuation prediction models. The corpus was prepared by modifying a manually part-of-speech-tagged Persian corpus, and it is the first corpus designed specifically for punctuation prediction in Persian texts. The final corpus contains nearly 2.3 million words and 221 thousand punctuation marks. For punctuation prediction, we used a conditional random field (CRF) model. Our main contribution is the use of the Ezafe feature. Our experiments show a significant improvement when this feature is used: the model reaches a micro-averaged F1 score of 63.11%, a relative improvement of 1.86% over the same model without the Ezafe feature. To evaluate the model on ASR output, we read selected texts to an ASR system with a word error rate of 12% and passed its output to our model for automatic punctuation. In this setting as well, the Ezafe feature raises the micro-averaged F1 score from 55.07% to 56.87%, a relative improvement of 3.27%.
  8. Keywords: Natural Language Processing ; Ezafe ; Conditional Random Fields (CRF) ; Automatic Speech Recognition ; Persian Texts ; Corpus ; Punctuation Prediction ; Corpus Collection
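
The relative improvements quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is not taken from the thesis: it assumes that "relative improvement" means (new - old) / old, and the function names `relative_improvement` and `implied_baseline` are hypothetical. Under that assumption, the ASR figures (55.07% to 56.87%) reproduce the reported 3.27%, and the 63.11% score with a 1.86% relative gain implies a baseline without the Ezafe feature that the abstract does not state directly.

```python
def relative_improvement(old: float, new: float) -> float:
    """Relative improvement in percent, assuming (new - old) / old."""
    return (new - old) / old * 100.0


def implied_baseline(new: float, rel_percent: float) -> float:
    """Baseline score implied by a new score and its relative gain in percent."""
    return new / (1.0 + rel_percent / 100.0)


# ASR setting: both micro-averaged F1 scores are given in the abstract.
asr_gain = relative_improvement(55.07, 56.87)
print(f"ASR relative improvement: {asr_gain:.2f}%")

# Clean-text setting: only the final score and the relative gain are given,
# so the baseline below is an inference, not a figure from the thesis.
clean_baseline = implied_baseline(63.11, 1.86)
print(f"Implied clean-text baseline F1: {clean_baseline:.2f}%")
```

Because the inputs are rounded to two decimals, the derived baseline is only approximate.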