Temporal relation classification in Persian and English contexts

, Article International Conference Recent Advances in Natural Language Processing, RANLP, Hissar ; September , 2013 , Pages 261-269 ; 13138502 (ISSN) Torbati, M. E ; Ghassem-Sani, G ; Mirroshandel, S. A ; Yaghoobzadeh, Y ; Hosseini, N. K ; Sharif University of Technology

2013

Abstract

This paper introduces the first pattern-based Persian Temporal Relation Classifier (PTRC), which finds the type of temporal relation between pairs of events in Persian texts. The proposed system uses support vector machines (SVMs) equipped with combinations of simple, convolution tree, and string subsequence kernels (SSK). In order to evaluate the algorithm, we have developed a Persian TimeBank (PTB) corpus. PTRC not only increases the performance of the classification by applying new features and SSK, but also alleviates the probable adverse effects of the Free Word Orderness (FWO) of Persian on temporal relation classification. We have also applied our proposed algorithm to two standard...
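
As a rough illustration of the string subsequence kernel (SSK) named in this abstract, the Python sketch below follows the standard textbook recursion for SSK and then forms a weighted sum with another kernel value. It is not the authors' implementation: the function names `ssk`, `normalized_ssk`, and `combined_kernel`, the decay parameter `lam`, and the weight `alpha` are all assumptions made for the example.

```python
from functools import lru_cache
import math


def ssk(s, t, n, lam):
    """String subsequence kernel of order n with gap-decay factor lam.

    s and t may be strings or sequences of tokens. Every common
    subsequence of length n contributes lam ** (span in s + span in t).
    """

    @lru_cache(maxsize=None)
    def k_prime(i, ls, lt):
        # Auxiliary kernel K'_i on the prefixes s[:ls] and t[:lt].
        if i == 0:
            return 1.0
        if min(ls, lt) < i:
            return 0.0
        x = s[ls - 1]
        total = lam * k_prime(i, ls - 1, lt)
        for j in range(lt):
            if t[j] == x:
                total += k_prime(i - 1, ls - 1, j) * lam ** (lt - j + 1)
        return total

    @lru_cache(maxsize=None)
    def k(ls, lt):
        # Kernel K_n on the prefixes s[:ls] and t[:lt].
        if min(ls, lt) < n:
            return 0.0
        x = s[ls - 1]
        total = k(ls - 1, lt)
        for j in range(lt):
            if t[j] == x:
                total += k_prime(n - 1, ls - 1, j) * lam ** 2
        return total

    return k(len(s), len(t))


def normalized_ssk(s, t, n, lam):
    """SSK scaled so that identical inputs score 1.0."""
    denom = math.sqrt(ssk(s, s, n, lam) * ssk(t, t, n, lam))
    return ssk(s, t, n, lam) / denom if denom else 0.0


def combined_kernel(k_simple, k_ssk, alpha=0.5):
    """Convex combination of two kernel values (hypothetical weighting)."""
    return alpha * k_simple + (1 - alpha) * k_ssk


print(ssk("cat", "car", 2, 0.5))  # 0.0625: only "ca" is shared
```

A weighted sum of valid kernels is itself a valid kernel, which is why a combination of this kind can be passed to an SVM as a single precomputed kernel.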

History based unsupervised data oriented parsing

, Article International Conference Recent Advances in Natural Language Processing, RANLP ; September , 2013 , Pages 453-459 ; 13138502 (ISSN) Mesgar, M ; Ghasem Sani, G ; Sharif University of Technology

2013

Abstract

Grammar induction is a basic step in natural language processing. Based on the volume of information that is used by different methods, we can distinguish three types of grammar induction methods: supervised, unsupervised, and semi-supervised. Supervised and semi-supervised methods require large treebanks, which may not currently exist for many languages. Accordingly, many researchers have focused on unsupervised methods. Unsupervised Data Oriented Parsing (UDOP) is currently the state of the art in unsupervised grammar induction. In this paper, we show that the performance of UDOP in free word order languages such as Persian is inferior to that of fixed order languages such as English. We...

Unsupervised induction of Persian semantic verb classes based on syntactic information

, Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Warsaw ; Volume 7912 LNCS , June , 2013 , Pages 112-124 ; 03029743 (ISSN) ; 9783642386336 (ISBN) Aminian, M ; Rasooli, M. S ; Sameti, H ; Sharif University of Technology

2013

Abstract

Automatic induction of semantic verb classes is one of the most challenging tasks in computational lexical semantics with a wide variety of applications in natural language processing. The large number of Persian speakers and the lack of such semantic classes for Persian verbs have motivated us to use unsupervised algorithms for Persian verb clustering. In this paper, we have done experiments on inducing the semantic classes of Persian verbs based on Levin's theory for verb classes. Syntactic information extracted from dependency trees is used as base features for clustering the verbs. Since there has been no manual classification of Persian verbs prior to this paper, we have prepared a...

Formal verification of temporal questions in the context of query-answering text summarization

, Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 28 May 2012 through 30 May 2012 ; Volume 7310 LNAI , May , 2012 , Pages 350-355 ; 03029743 (ISSN) ; 9783642303524 (ISBN) Mostafazadeh, N ; Bakhshandeh Babarsad, O ; Ghassem Sani, G ; Sharif University of Technology

2012

Abstract

This paper presents a novel method for answering complex temporal ordering questions in the context of event- and query-based text summarization. This task is accomplished by precisely mapping the problem of "query-based summarization of temporal ordering questions" in the field of Natural Language Processing to "verifying a finite state model against a temporal formula" in the realm of Model Checking. This mapping requires specific definitions, structures, and procedures. The output of this new approach is, promisingly, a readable and informative summary that satisfies the user's needs.

Temporal relation extraction using expectation maximization

, Article International Conference Recent Advances in Natural Language Processing, RANLP ; 2011 , Pages 218-225 ; 13138502 (ISSN) Mirroshandel, S. A ; Ghassem-Sani, G ; Sharif University of Technology

Abstract

Accurately determining temporal relations between events is an important task for several natural language processing applications such as Question Answering, Summarization, and Information Extraction. Since current supervised methods require large corpora, which do not exist for many languages, we have focused on approaches that use as little supervision as possible. This paper presents a fully generative model for temporal relation extraction based on the expectation maximization (EM) algorithm. Our experiments show that the performance of the proposed algorithm, given how little supervision it uses, is considerable in temporal relation learning.

Persian word embedding evaluation benchmarks

, Article 26th Iranian Conference on Electrical Engineering, ICEE 2018, 8 May 2018 through 10 May 2018 ; 2018 , Pages 1583-1588 ; 9781538649169 (ISBN) Zahedi, M. S ; Bokaei, M. H ; Shoeleh, F ; Yadollahi, M. M ; Doostmohammadi, E ; Farhoodi, M ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2018

Abstract

Recently, there has been renewed interest in semantic word representation, also called word embedding, in a wide variety of natural language processing tasks requiring sophisticated semantic and syntactic information. The quality of word embedding methods is usually evaluated on English-language benchmarks. Nevertheless, only a few studies analyze word embedding for low-resource languages such as Persian. In this paper, we perform such an extensive word embedding evaluation in the Persian language based on a set of lexical semantics tasks, namely analogy, concept categorization, and word semantic relatedness. For these evaluation tasks, we provide three benchmark data sets to show the...
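
To make the analogy task mentioned in this abstract concrete, here is a minimal Python sketch of the commonly used vector-offset method: for a question "a is to b as c is to ?", it returns the vocabulary word whose vector is closest (by cosine similarity) to b - a + c. This is an assumption about how such a benchmark is typically scored, not a description of the paper's own procedure; the `toy` embedding table and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy(emb, a, b, c):
    """Answer 'a : b :: c : ?' with the vector-offset method."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    # The three question words are excluded from the candidate answers.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(1 for a, b, c, d in questions if solve_analogy(emb, a, b, c) == d)
    return correct / len(questions)


# Hypothetical two-dimensional embeddings, chosen by hand for illustration.
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -2.0],
}

print(solve_analogy(toy, "man", "woman", "king"))  # queen
```

With real embeddings, `emb` would map each vocabulary word to its trained vector and `questions` would come from the benchmark file.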

Nevisa, a Persian continuous speech recognition system

, Article 13th International Computer Society of Iran Computer Conference on Advances in Computer Science and Engineering, CSICC 2008, Kish Island, 9 March 2008 through 11 March 2008 ; Volume 6 CCIS , 2008 , Pages 485-492 ; 18650929 (ISSN); 3540899847 (ISBN); 9783540899846 (ISBN) Sameti, H ; Veisi, H ; Bahrani, M ; Babaali, B ; Hosseinzadeh, K ; Sharif University of Technology

2008

Abstract

In this paper, we review the Nevisa Persian speech recognition engine. Nevisa is an HMM-based, large-vocabulary, speaker-independent continuous speech recognition system. As in most successful recognition systems, MFCC with some modifications is used as the speech signal features. It also utilizes a VAD based on signal energy and zero-crossing rate. The maximum likelihood estimation criterion, the core of which is the classical segmental k-means and Baum-Welch algorithms, is used for training the acoustic models. The system is based on phoneme modeling and utilizes synchronous beam search based on a lexicon tree for decoding the acoustic utterances. Language modeling for Persian has been...
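
The abstract mentions a voice activity detector (VAD) based on signal energy and zero-crossing rate. The Python sketch below shows one simple way those two features can be computed per frame and thresholded. It is only an illustration, not Nevisa's actual detector: the decision rule, the frame length, and both thresholds are assumptions.

```python
import math


def frame_features(samples, frame_len):
    """Return (mean energy, zero-crossing rate) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # A crossing is a sign change between two consecutive samples.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        feats.append((energy, crossings / (frame_len - 1)))
    return feats


def detect_speech(samples, frame_len=160, energy_thresh=0.01, zcr_thresh=0.25):
    """Flag each frame as speech (True) or non-speech (False).

    Assumed rule: a frame counts as speech if its energy is high, or if it
    has moderate energy together with a high zero-crossing rate (a crude
    stand-in for unvoiced sounds). Both thresholds are hypothetical.
    """
    flags = []
    for energy, zcr in frame_features(samples, frame_len):
        loud = energy > energy_thresh
        noisy_but_audible = energy > 0.1 * energy_thresh and zcr > zcr_thresh
        flags.append(loud or noisy_but_audible)
    return flags


# Two frames of silence followed by two frames of a 200 Hz tone at 8 kHz.
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(320)]
print(detect_speech(silence + tone))  # [False, False, True, True]
```

In practice the thresholds would be tuned on recorded data or adapted to the background noise level.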

Unsupervised grammar induction using a parent based constituent context model

, Article 18th European Conference on Artificial Intelligence, ECAI 2008, 21 July 2008 through 25 July 2008 ; Volume 178 , 2008 , Pages 293-297 ; 09226389 (ISSN); 978158603891 (ISBN) Mirroshandel, S. A ; Ghassem Sani, G ; Sharif University of Technology

IOS Press 2008

Abstract

Grammar induction is one of the attractive research areas of natural language processing. Since both supervised and, to some extent, semi-supervised grammar induction methods require large treebanks, and for many languages such treebanks do not currently exist, we focused our attention on unsupervised approaches. Constituent Context Model (CCM) seems to be the state of the art in unsupervised grammar induction. In this paper, we show that the performance of CCM in free word order languages (FWOLs) such as Persian is inferior to that of fixed order languages such as English. We also introduce a novel approach, called parent-based constituent context model (PCCM), and show that by using some...

A geographical question answering system

, Article 3rd International Conference on Web Information Systems and Technologies, Webist 2007, Barcelona, 3 March 2007 through 6 March 2007 ; Volume WIA , 2007 , Pages 308-314 Behrangi, E ; Ghasemzadeh, H ; Sheykh Esmaili, K ; Minaei Bidgoli, B ; Sharif University of Technology

2007

Abstract

Question Answering systems are one of the hot topics in the context of information retrieval. In this paper, we develop an open-domain Question Answering system for spatial queries. We use Google to gather raw data from the Web; then, over a few iterations, the density of potential answers is increased. Finally, based on a couple of evaluators, the best answers are selected and returned to the user. Our proposed algorithm uses fuzzy methods to be more precise. Some experiments have been designed in order to evaluate the performance of our algorithm, and the results are promising. We also describe how this algorithm can be applied to other types of questions.

Unsupervised grammar induction using history based approach

, Article Computer Speech and Language ; Volume 20, Issue 4 , 2006 , Pages 644-658 ; 08852308 (ISSN) Feili, H ; Ghassem Sani, G ; Sharif University of Technology

2006

Abstract

Grammar induction, also known as grammar inference, is one of the most important research areas in the domain of natural language processing. The availability of large corpora has encouraged many researchers to use statistical methods for grammar induction. This problem can be divided into three categories, supervised, semi-supervised, and unsupervised, based on the type of data set required for the training phase. Most current inductive methods are supervised and need a bracketed data set for their training phase, but the lack of this kind of data set in many languages encouraged us to focus on unsupervised approaches. Here, we introduce a novel approach, which we call...

Persian/Arabic text font estimation using dots

, Article 6th IEEE International Symposium on Signal Processing and Information Technology, ISSPIT 2006, Vancouver, BC, 27 August 2006 through 30 August 2006 ; 2006 , Pages 420-425 ; 0780397541 (ISBN); 9780780397545 (ISBN) Shirali Shahreza, M. H ; Shirali Shahreza, S ; Sharif University of Technology

Institute of Electrical and Electronics Engineers Inc 2006

Abstract

Nowadays, computers are used in many aspects of human life. One consequence of this is the electronic document. Computers can't understand written documents, so we need to convert written documents to electronic documents in order to be able to process them with computers. One of the common methods for converting written text to electronic text is Optical Character Recognition (OCR). A lot of work has been done on English OCR, but Persian/Arabic OCR is still under development. A phase which is commonly used in the recognition part of an OCR system is estimating the font size of the text. Usually, when the font size of the text is found, the pen width is calculated. The pen width can be used for character...

The ODYSSEY tool-set for system-level synthesis of object-oriented models

, Article 5th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2005, Samos, 18 July 2005 through 20 July 2005 ; Volume 3553 , 2005 , Pages 394-403 ; 03029743 (ISSN) Goudarzi, M ; Hessabi, S ; Sharif University of Technology

Springer Verlag 2005

Abstract

We describe the implementation of design automation tools that we have developed to automate system-level design using our ODYSSEY methodology, which advocates object-oriented (OO) modeling of the embedded system and ASIP-based implementation of it. Two flows are automated: one synthesizes an ASIP from a given C++ class library, and the other compiles a given C++ application to run on the ASIP that corresponds to the class library used in the application. These correspond, respectively, to hardware and software generation for the embedded system, while the hardware-software interface is also automatically synthesized. This implementation also demonstrates three other advantages: firstly, the...

A new linguistic steganography scheme based on lexical substitution

, Article 2014 11th International ISC Conference on Information Security and Cryptology, ISCISC 2014 ; 2014 , Pages 155-160 ; 9781479953837 (ISBN) Yajam, H. A ; Mousavi, A. S ; Amirmazlaghani, M ; Sharif University of Technology

Abstract

Recent studies in the field of text steganography show a promising future for linguistic-driven stegosystems. One of the most common techniques in this field is known as lexical substitution, which provides the requirements for security and payload capacity. However, the existing lexical substitution schemes need an enormous amount of shared data between sender and receiver, which acts as the stego key. In this paper, we propose a novel encoding method to overcome this problem. Our proposed approach preserves the good properties of lexical substitution schemes while providing short stego keys and significant robustness against active adversary attacks. We demonstrate high efficiency...

Towards unsupervised learning of temporal relations between events

, Article Journal of Artificial Intelligence Research ; Volume 45 , 2012 , Pages 125-163 ; 10769757 (ISSN) Mirroshandel, S. A ; Ghassem Sani, G ; Sharif University of Technology

2012

Abstract

Automatic extraction of temporal relations between event pairs is an important task for several natural language processing applications such as Question Answering, Information Extraction, and Summarization. Since most existing methods are supervised and require large corpora, which do not exist for many languages, we have concentrated our efforts on reducing the need for annotated data as much as possible. This paper presents two different algorithms towards this goal. The first algorithm is a weakly supervised machine learning approach for classification of temporal relations between events. In the first stage, the algorithm learns a general classifier from an annotated corpus. Then,...

ISO-TimeML event extraction in Persian text

, Article 24th International Conference on Computational Linguistics - Proceedings of COLING 2012: Technical Papers, 8 December 2012 through 15 December 2012 ; December , 2012 , Pages 2931-2944 Yaghoobzadeh, Y ; Ghassem-Sani, G ; Mirroshandel, S. A ; Eshaghzadeh, M ; Sharif University of Technology

2012

Abstract

Recognizing TimeML events and identifying their attributes are important tasks in natural language processing (NLP). Several NLP applications like question answering, information retrieval, summarization, and temporal information extraction need to have some knowledge about the events of the input documents. Existing methods developed for this task are restricted to a limited number of languages, and for many other languages, including Persian, there has not been any effort yet. In this paper, we introduce two different approaches for automatic event recognition and classification in Persian. For this purpose, a corpus of events has been built based on a specific version of ISO-TimeML for Persian....

Exploiting multiview properties in semi-supervised video classification

, Article 2012 6th International Symposium on Telecommunications, IST 2012 ; 2012 , Pages 837-842 ; 9781467320733 (ISBN) Karimian, M ; Tavassolipour, M ; Kasaei, S ; Sharif University of Technology

Abstract

In large databases, obtaining labeled training data for classification is often prohibitively expensive. Semi-supervised algorithms are employed to tackle this lack of labeled training data. Video databases are the epitome of such a scenario, which is why semi-supervised learning has found its niche in this domain. Graph-based methods are a promising platform for semi-supervised video classification. Based on the multiview characteristic of video data, different features have been proposed (such as SIFT, STIP and MFCC) which can be utilized to build a graph. In this paper, we have proposed a new classification method which fuses the results of manifold regularization over different graphs. Our...

PEN: Parallel English-Persian news corpus

, Article Proceedings of the 2011 International Conference on Artificial Intelligence, ICAI 2011, 18 July 2011 through 21 July 2011 ; Volume 2 , July , 2011 , Pages 523-528 ; 9781601321855 (ISBN) Farajian, M. A ; ICAI 2011

2011

Abstract

Parallel corpora are necessary resources in many multilingual natural language processing applications, including machine translation and cross-lingual information retrieval. Manual preparation of a large-scale parallel corpus is a very time-consuming and costly procedure. In this paper, the work towards building a sentence-level aligned English-Persian corpus in a semi-automated manner is presented. The design of the corpus and the collection and alignment process of the sentences are described. Two statistical similarity measures were used to find the similarities of sentence pairs. To verify the alignment process automatically, Google Translator was used. The corpus is based on news...

Collecting positive instances of "instance-of" relationship in the Persian language

, Article ICECT 2010 - Proceedings of the 2010 2nd International Conference on Electronic Computer Technology, 7 May 2010 through 10 May 2010, Kuala Lumpur ; May , 2010 , Pages 46-49 ; 9781424474059 (ISBN) Rastegari, Y ; Abolhassani, H ; Zibanezhad, B ; Sayadiharikandeh, M ; Sharif University of Technology

2010

Abstract

Fetching lexico-syntactic patterns from text relies on pairs of words (positive instances) that represent the target relation, and on finding their simultaneous occurrence in a text corpus. Thanks to the WordNet thesaurus (which contains the semantic relationships between words), collecting positive instances is easy for English. In non-English languages, it is hard to collect a large number of positive instances in various contexts. We investigated some new ideas for collecting them in the Persian language, and finally ran the best one and collected approximately 6,000 positive instances.

Temporal relations learning with a bootstrapped cross-document classifier

, Article Frontiers in Artificial Intelligence and Applications ; Volume 215 , 2010 , Pages 829-834 ; 09226389 (ISSN) ; 9781607506058 (ISBN) Mirroshandel, S. A ; Ghassem Sani, G ; Sharif University of Technology

IOS Press 2010

Abstract

The ability to accurately classify temporal relations between events is important for a large number of natural language processing applications such as Question Answering (QA), Summarization, and Information Extraction. This paper presents a weakly-supervised machine learning approach for classification of temporal relations between events. In the first stage, the algorithm learns a general classifier from an annotated corpus. Then, it applies the hypothesis of "one type of temporal relation per discourse" and expands the scope of "discourse" from a single document to a cluster of topically-related documents. By combining the global information of such a cluster with local decisions...

Creating a corpus for automatic punctuation prediction in Persian texts

, Article 2017 25th Iranian Conference on Electrical Engineering, ICEE 2017, 2 May 2017 through 4 May 2017 ; 2017 , Pages 1537-1542 ; 9781509059638 (ISBN) Hosseini, S. M ; Sameti, H ; Sharif University of Technology

Abstract

We present a novel corpus for automatic punctuation prediction in Persian texts. Punctuation prediction is an important task in automatic speech recognition (ASR). The output of ASR systems is typically a raw sequence of words with no punctuation marks; this makes the text difficult or even impossible to make sense of, both for humans and for any text processing unit. In this work, we have assembled a state-of-the-art Persian corpus to train and test a punctuation prediction model. To the best of our knowledge, this is the first ever corpus specifically designed for punctuation prediction in Persian texts. The corpus is a modification of a manually part-of-speech (POS) tagged Persian one,...