Persian Statistical Natural Language Understanding Based on Partially Annotated Corpus

Jabbari, Fattaneh; Sameti, Hossein

Jabbari, Fattaneh | 2011

Type of Document: M.Sc. Thesis
Language: Farsi
Document No: 43086 (19)
University: Sharif University of Technology
Department: Computer Engineering
Advisor(s): Sameti, Hossein
Abstract:
The spoken language understanding unit is one of the most important parts of a spoken dialogue system. Its input is the output of the speech recognition unit, and its main function is to extract semantic information from the input utterances. There are two main approaches to this task: rule-based and data-driven. Data-driven approaches currently attract more interest because they are more flexible and robust than rule-based ones. Their main drawback is that they need a large amount of fully annotated data or, in some cases, treebank data, which is time-consuming and expensive to prepare. The goal of this thesis is to propose a flexible, data-driven method that does not require grammar rule extraction. It also does not need fully annotated data; instead, it uses a partially annotated corpus and is able to capture hierarchical and long-range dependencies. A partially annotated corpus is a set of sentences in which the keywords of each sentence are used for annotation. To this end, two graphical models, Hidden Vector State (HVS) and Extended Hidden Vector State (EHVS), are implemented for Persian language understanding. Because of the data sparseness problem, these models need a large amount of data. To address this, a two-step EHVS tagger is proposed and implemented. Two further methods, semi-supervised learning and active learning, are also proposed in order to make use of unannotated data, improve the model, and reduce the cost of annotation. Experiments with the proposed methods are carried out on the University Information Kiosk. The accuracy of the two-step EHVS tagger improves by 40.89% compared to HVS and by 28.2% compared to EHVS. Applying semi-supervised learning with unlabeled samples increases the accuracy from 43.30% to 55.67% in the best case. Employing active learning increases the accuracy of EHVS from 43.30% to 58.76% in the best case. The experimental results demonstrate the effectiveness and feasibility of the proposed approaches.
Keywords:
Spoken Language Understanding ; Semi-Supervised Learning ; Active Learning ; Data Driven Method ; Hidden Vector State ; Extended Hidden Vector State
