Creating a corpus for automatic punctuation prediction in Persian texts

Hosseini, S. M ; Sharif University of Technology

  1. Type of Document: Article
  2. DOI: 10.1109/IranianCEE.2017.7985288
  3. Abstract:
  4. We present a novel corpus for automatic punctuation prediction in Persian texts. Punctuation prediction is an important task in automatic speech recognition (ASR). The output of ASR systems is typically a raw sequence of words with no punctuation marks, which makes the text difficult or even impossible to understand, both for human readers and for any text processing unit. In this work, we have assembled a state-of-the-art Persian corpus to train and test a punctuation prediction model. To the best of our knowledge, this is the first corpus specifically designed for punctuation prediction in Persian texts. The corpus is derived from an existing manually part-of-speech (POS) tagged Persian corpus of almost 2.6 million words, including punctuation marks. We have made a number of improvements to the existing corpus so that it facilitates experimental studies on Persian punctuation prediction: (1) replacing 3175 word types with their correct forms, (2) normalizing the words (e.g. replacing kashida with hyphen), (3) correcting 451 and 192 words with incorrect DELM and DEFAULT tags, respectively, (4) investigating 17 word types to correct the punctuation around them, and (5) making numerous corrections to the punctuation marks. The final corpus contains nearly 2.3 million words and 221 thousand punctuation marks. Finally, we have trained and tested a conditional random field (CRF) model that achieves a micro-averaged F1-score of 60.69% in our preliminary experiments. © 2017 IEEE
  5. Keywords:
  6. Corpus collection ; Natural language processing ; Persian texts ; Punctuation prediction ; Forecasting ; Linguistics ; Natural language processing systems ; Random processes ; Text processing ; Automatic speech recognition ; Conditional random field ; F1 scores ; Part of speech ; Persians ; Prediction model ; Punctuation marks ; State of the art ; Speech recognition
  7. Source: 2017 25th Iranian Conference on Electrical Engineering, ICEE 2017, 2 May 2017 through 4 May 2017 ; 2017 , Pages 1537-1542 ; 9781509059638 (ISBN)
  8. URL: https://ieeexplore.ieee.org/document/7985288
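
The abstract reports a micro-averaged F1-score for punctuation prediction but does not describe the labeling scheme or the scoring procedure. The sketch below shows one common way to set up such an evaluation. It is an illustration under stated assumptions, not the authors' method: the label set `PUNCT`, the "O" label for "no punctuation", and the helper names `to_labels` and `micro_f1` are all hypothetical, and it assumes each word is labeled with the punctuation mark that follows it.

```python
# Hypothetical label set; the marks actually covered by the corpus are not listed in the abstract.
PUNCT = {".", "،", "؟", "!", ":", "؛"}
NONE_LABEL = "O"  # assumed label meaning "no punctuation after this word"


def to_labels(tokens):
    """Turn a punctuated token list into (words, labels).

    Each word gets the punctuation mark that follows it, or NONE_LABEL.
    Simplification: if several marks follow one word, the last one wins,
    and marks before the first word are dropped.
    """
    words, labels = [], []
    for tok in tokens:
        if tok in PUNCT:
            if labels:
                labels[-1] = tok
        else:
            words.append(tok)
            labels.append(NONE_LABEL)
    return words, labels


def micro_f1(gold, pred, none_label=NONE_LABEL):
    """Micro-averaged F1 over punctuation labels, excluding none_label.

    Counts are pooled across all punctuation classes before computing
    precision and recall, which is what "micro-averaged" means.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != none_label and p == g:
            tp += 1
        else:
            if p != none_label:
                fp += 1  # predicted a mark that is not in the gold data here
            if g != none_label:
                fn += 1  # missed (or mislabeled) a gold mark
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, `micro_f1(["O", "،", "O", "."], ["O", "،", ".", "O"])` pools one true positive, one false positive, and one false negative, giving a precision and recall of 0.5 each and therefore an F1 of 0.5.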