Search for: text-processing
Total 29 records

    Text steganography by changing words spelling

    , Article 2008 10th International Conference on Advanced Communication Technology, Phoenix Park, 17 February 2008 through 20 February 2008 ; Volume 3 , 2008 , Pages 1912-1913 ; 17389445 (ISSN); 9788955191356 (ISBN) Shirali Shahreza, M ; Sharif University of Technology
    2008
    Abstract
    One of the important issues in security fields is the hidden exchange of information. There are different methods for this purpose such as cryptography and steganography. Steganography is a method of hiding data within a cover medium so that other individuals fail to realize its existence. In this paper a new method for steganography in English texts is proposed. In this method the US and UK spellings of words are substituted in order to hide data in an English text. For example "color" has different spellings in the UK (colour) and the US (color). Therefore the data can be hidden in the text by substituting these words.
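
    The substitution idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word list, the bit convention (US spelling for 0, UK spelling for 1), and the function names `embed` and `extract` are all assumptions made for illustration, and the cover text is assumed to be lowercase, whitespace-separated, and free of punctuation.

```python
# Hypothetical word list; the abstract does not give the paper's dictionary.
US_TO_UK = {
    "color": "colour",
    "center": "centre",
    "favorite": "favourite",
    "organize": "organise",
}
UK_TO_US = {uk: us for us, uk in US_TO_UK.items()}


def embed(cover_text, bits):
    """Hide a bit string by picking the US spelling for '0' and the UK
    spelling for '1' (an assumed convention) at each substitutable word."""
    remaining = list(bits)
    out = []
    for word in cover_text.split():
        key = word.lower()
        # Normalize to the US form if the word has two spellings.
        us = key if key in US_TO_UK else UK_TO_US.get(key)
        if us is not None and remaining:
            bit = remaining.pop(0)
            word = US_TO_UK[us] if bit == "1" else us
        out.append(word)
    if remaining:
        raise ValueError("cover text has too few substitutable words")
    return " ".join(out)


def extract(stego_text, n_bits):
    """Read back the first n_bits by checking which spelling was used."""
    bits = []
    for word in stego_text.split():
        key = word.lower()
        if key in US_TO_UK:
            bits.append("0")
        elif key in UK_TO_US:
            bits.append("1")
        if len(bits) == n_bits:
            break
    return "".join(bits)


stego = embed("my favorite color is near the center", "101")
print(stego)              # my favourite color is near the centre
print(extract(stego, 3))  # 101
```

    The capacity of such a scheme is one bit per word that has both spellings, so a long cover text is needed for even a short message.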

    Event Extraction in Persian Texts By Learning Methods

    , M.Sc. Thesis Sharif University of Technology Ershad, Mehdi (Author) ; Ghasem Sani, Gholamreza (Supervisor)
    Abstract
    Event extraction from text is one of the main challenges of Natural Language Processing. It is a necessary component of question answering, summarization and information extraction systems. The purpose of this project has been the design and implementation of different statistical methods for event extraction in Persian, as well as correcting and expanding an existing corpus named PresTimeBank. The new system is composed of a preliminary rule-based module that annotates events and finds their features based on a predefined set of rules. The result of this stage is then revised in a subsequent manual annotation process. The output is a corpus that is compliant with the ISO...

    Ezafe Recognition Using Dependency Parsing

    , M.Sc. Thesis Sharif University of Technology Nassajian, Minoo (Author) ; Bahrani, Mohammad (Supervisor) ; Shojaei, Razieh (Co-Supervisor)
    Abstract
    Ezafe is regarded as one of the most controversial and challenging issues in different Persian Language Processing (NLP) fields. It is recognized and pronounced but usually not written. This results in a high degree of ambiguity in Persian texts. Dependency grammar plays a significant role in optimization problems. So, to recognize the position of Ezafe in a sentence, this grammar is used in the current study. This method helps speed up computer operations and uses little memory. Within this framework, we first take a close look at Ezafe distribution in Persian text. We use the Uppsala Persian Dependency Corpus (2015) to analyze parsed sentences. The Ezafe constructions under study include...

    Multi-modal distance metric learning: A bayesian non-parametric approach

    , Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6 September 2014 through 12 September 2014 ; Volume 8927 , September , 2015 , Pages 63-77 ; 03029743 (ISSN) ; 9783319161983 (ISBN) Babagholami Mohamadabadi, B ; Roostaiyan, S. M ; Zarghami, A ; Baghshah, M. S ; Rother, C ; Agapito, L ; Bronstein, M. M ; Sharif University of Technology
    Springer Verlag  2015
    Abstract
    In many real-world applications (e.g. social media applications), data usually consists of diverse input modalities that originate from various heterogeneous sources. Learning a similarity measure for such data is of great importance for a vast number of applications such as classification, clustering, retrieval, etc. Defining an appropriate distance metric between data points with multiple modalities is a key challenge that has a great impact on the performance of many multimedia applications. Existing approaches for multi-modal distance metric learning only offer point estimation of the distance matrix and/or latent features, and can therefore be unreliable when the number of training...

    Dynamic classifier selection using clustering for spam detection

    , Article 2009 IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2009, Nashville, TN, 30 March 2009 through 2 April 2009 ; 2009 , Pages 84-88 ; 9781424427659 (ISBN) Famil saeedian, M ; Beigy, H ; Sharif University of Technology
    2009
    Abstract
    Most email users have encountered spam problems, which have been addressed as a text classification or categorization problem. In this paper, we propose a novel spam detection method that uses an ensemble of classifiers based on clustering and selection techniques. There is diversity in the genre of email content, and this method can find different topics in emails by clustering. It first computes disjoint clusters of emails, and then a classifier is trained on each cluster. When a new email arrives, its cluster is identified. The classifier of the identified cluster is selected to classify the new email. Our method can extract many kinds of topics in emails. The evaluation shows that the...

    Spam Detection using Dynamic Weighted Voting based on Clustering

    , Article 2008 2nd International Symposium on Intelligent Information Technology Application, IITA 2008, Shanghai, 21 December 2008 through 22 December 2008 ; Volume 2 , January , 2008 , Pages 122-126 ; 9780769534978 (ISBN) Famil Saeedian, M ; Beigy, H ; Sharif University of Technology
    2008
    Abstract
    In the last decade spam detection has been addressed as a text classification or categorization problem. In this paper we propose a new dynamic weighted voting method based on the combination of clustering and weighted voting, and apply it to the task of spam filtering. In order to classify a new sample, it is first compared with all cluster centroids and its similarity to each cluster is identified; classifiers in the vicinity of the input sample obtain greater weight for the final decision of the ensemble. The evaluation shows that the algorithm outperforms pure SVM. © 2008 IEEE
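
    The voting step described in this abstract can be sketched roughly as follows. This is an illustrative assumption rather than the paper's algorithm: the similarity function (1 / (1 + Euclidean distance)), the label encoding (+1 for spam, -1 for legitimate mail), and the name `weighted_vote` are hypothetical, and the per-cluster classifiers are stand-in callables rather than trained models.

```python
import math


def weighted_vote(x, centroids, classifiers):
    """Combine per-cluster classifiers, weighting each by how close its
    cluster centroid is to the input sample x.

    centroids   -- one feature vector per cluster
    classifiers -- one callable per cluster, returning +1 (spam) or -1
    """
    # Assumed similarity: decreases with Euclidean distance to the centroid.
    weights = [1.0 / (1.0 + math.dist(x, c)) for c in centroids]
    score = sum(w * clf(x) for w, clf in zip(weights, classifiers))
    return 1 if score >= 0 else -1


# Toy usage with two clusters and constant stand-in classifiers.
centroids = [(0.0, 0.0), (10.0, 10.0)]
classifiers = [lambda x: 1, lambda x: -1]
print(weighted_vote((1.0, 1.0), centroids, classifiers))  # 1
print(weighted_vote((9.0, 9.0), centroids, classifiers))  # -1
```

    With this weighting, a classifier trained on a distant cluster still votes, but its influence shrinks as the sample moves away from that cluster's centroid.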

    Page segmentation of Persian/Arabic printed text using ink spread effect

    , Article 2006 SICE-ICASE International Joint Conference, Busan, 18 October 2006 through 21 October 2006 ; 2006 , Pages 259-262 ; 8995003855 (ISBN); 9788995003855 (ISBN) Shirali Shahreza, S ; Manzuri Shalmani, M. T ; ShiraliShahreza, M. H ; Sharif University of Technology
    2006
    Abstract
    Nowadays, OCR (Optical Character Recognition) is widely used for converting written documents to digital documents. One of the OCR phases is page segmentation. In page segmentation, text regions must be found in the input image. In addition, text parts like text columns must be separated. In this paper, a new method for segmenting Persian/Arabic printed text is proposed. This method is based on the Ink Spread Effect, a new idea that has particular features. The main features of Persian/Arabic scripts are considered in designing this method. This method is skew resistant and can segment text within frames and tables or regions with a gray background. © 2006 ICASE

    A new segmentation technique for multi font Farsi/Arabic texts

    , Article 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '05, Philadelphia, PA, 18 March 2005 through 23 March 2005 ; Volume II , 2005 , Pages II757-II760 ; 15206149 (ISSN); 0780388747 (ISBN); 9780780388741 (ISBN) Omidyeganeh, M ; Nayeb, K ; Azmi, R ; Javadtalab, A ; Sharif University of Technology
    2005
    Abstract
    Segmentation is a very important stage of Farsi/Arabic character recognition systems. A new segmentation algorithm for multi-font Farsi/Arabic texts, based on the conditional labeling of the up contour and down contour, is presented. A pre-processing technique is used to adjust the local base line for each subword. This algorithm uses an adaptive base line for each subword to improve the segmentation results. In addition to the up and down contours, the algorithm also takes advantage of their curvatures. The algorithm was tested on a data set of printed Farsi texts containing 22236 characters in 18 different fonts. 97% of characters were correctly segmented. © 2005 IEEE

    An Investigation into Reduplication and Lexicalization in Persian

    , M.Sc. Thesis Sharif University of Technology Gili, Maryam (Author) ; Eslami, Moharram (Supervisor) ; Khosravizadeh, Parvaneh (Supervisor)
    Abstract
    This study aims at classifying outputs of Reduplication (Total and Partial Reduplication) in Persian, in order to demonstrate the types and processes which lead to the lexicalization of some reduplicated forms and assign them lexical position in lexicography as individual lexical entries. There is distinct phonological, morphological and semantic evidence behind the lexicalization of some types of reduplicated items in Persian. About 269 totally reduplicated words and 293 partially reduplicated words have been collected and analyzed according to the said morphological, phonological and semantic considerations. Partial and Total Reduplication process is one of the productive...

    Formal verification of temporal questions in the context of query-answering text summarization

    , Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 28 May 2012 through 30 May 2012 ; Volume 7310 LNAI , May , 2012 , Pages 350-355 ; 03029743 (ISSN) ; 9783642303524 (ISBN) Mostafazadeh, N ; Bakhshandeh Babarsad, O ; Ghassem Sani, G ; Sharif University of Technology
    2012
    Abstract
    This paper presents a novel method for answering complex temporal ordering questions in the context of an event and query-based text summarization. This task is accomplished by precisely mapping the problem of "query-based summarization of temporal ordering questions" in the field of Natural Language Processing to "verifying a finite state model against a temporal formula" in the realm of Model Checking. This mapping requires specific definitions, structures, and procedures. The output of this new approach is, promisingly, a readable and informative summary satisfying the user's needs.

    Persian/arabic text font estimation using dots

    , Article 6th IEEE International Symposium on Signal Processing and Information Technology, ISSPIT 2006, Vancouver, BC, 27 August 2006 through 30 August 2006 ; 2006 , Pages 420-425 ; 0780397541 (ISBN); 9780780397545 (ISBN) Shirali Shahreza, M. H ; Shirali Shahreza, S ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2006
    Abstract
    Nowadays, computers are used in many aspects of human life. One consequence is electronic documents. Computers can't understand written documents, so we need to convert written documents to electronic documents in order to be able to process them with computers. One of the common methods for converting written texts to electronic text is Optical Character Recognition (OCR). A lot of work has been done on English OCR, but Persian/Arabic OCR is still under development. A phase which is commonly used in the recognition part of an OCR system is estimating the font size of the text. Usually when the font size of the text is found, the pen width is calculated. The pen width can be used for character...

    A new approach to persian/arabic text steganography

    , Article 5th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2006. In conjunction with 1st IEEE/ACIS International Workshop on Component-Based Software Engineering, Software Architecture and Reuse, COMSAR 2006, Honolulu, HI, 10 July 2006 through 12 July 2006 ; Volume 2006 , 2006 , Pages 310-315 ; 0769526136 (ISBN); 9780769526133 (ISBN) Shirali Shahreza, M. H ; Shirali Shahreza, M ; Sharif University of Technology
    2006
    Abstract
    Conveying information secretly and establishing hidden relationships have long been of interest. Text documents have been widely used for a very long time. Therefore, we have witnessed different methods of hiding information in texts (text steganography) from the past to the present. In this paper we introduce a new approach for steganography in Persian and Arabic texts. Considering the existence of many points in Persian and Arabic phrases, in this approach we hide information in the texts by vertical displacement of the points. This approach can be categorized under feature coding methods. This method can be used for Persian/Arabic watermarking. Our method has been...

    A novel algorithm for using GA in concept weighting for text mining

    , Article WSEAS Transactions on Computers ; Volume 5, Issue 12 , 2006 , Pages 2992-2999 ; 11092750 (ISSN) Zaefarian, R ; Akhgar, B ; Siddiqi, J. I ; Zaefarian, G ; Gruzdz, A ; Ihnatowicz, A ; Sharif University of Technology
    2006
    Abstract
    The importance of good weighting methodology in information retrieval methods - the method that affects the most useful features of a document or query representative - is examined. Many information retrieval systems regard feature weighting as being of minor importance compared with finding the features, and the experiments confirm this. There are different methods for term weighting, such as TF*IDF and Information Gain Ratio, which have been used in information retrieval systems; the paper provides a brief review of the related literature. This paper explores using GA for concept weighting, which is a novel application to the field of text mining. It proposes...

    Designing a deep neural network model for finding semantic similarity between short persian texts using a parallel corpus

    , Article 7th International Conference on Web Research, ICWR 2021, 19 May 2021 through 20 May 2021 ; 2021 , Pages 91-96 ; 9781665404266 (ISBN) Hosseini Moghadam Emami, Z. S ; Tabatabayiseifi, S ; Izadi, M ; Tavakoli, M ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2021
    Abstract
    Text processing, as one of the main issues in the field of artificial intelligence, has received a lot of attention in recent decades. Numerous methods and algorithms are proposed to address the task of semantic textual similarity which is one of the sub-branches of text processing. Due to the special features of the Persian language and its non-standard writing system, finding semantic similarity is an even more challenging task in Persian. On the other hand, producing a proper corpus that can be used for training a model for finding semantic similarities, is of great importance. In this study, the main purpose is to propose a method for measuring the semantic similarity between short... 

    Normalization of Non-standard Texts for Persian language Using Neural Networks

    , M.Sc. Thesis Sharif University of Technology Seyyedi, Javad (Author) ; Sameti, Hossein (Supervisor)
    Abstract
    The purpose of this research is to normalize non-standard Persian texts. We propose a method to transform texts with any non-standard structure into a formal and standard form. One of the major complications of text normalization is the large variety of non-standard structures, and the fact that this diversity cannot be classified in one constructional pattern. Furthermore, the concept of text normalization has different definitions in different situations, and each of these settings needs a distinct normalization method. Supervised learning methods are not suitable for normalization due to the variety of both standard and non-standard texts as well as the absence of...

    Performance Evaluation and Improvement of Duplicate Question Detection in Developers’ Online Q&A Community

    , M.Sc. Thesis Sharif University of Technology Daliri, Majid (Author) ; Habibi, Jafar (Supervisor)
    Abstract
    In this research, we study one of the challenges in the field of software engineering, namely the detection of duplicate questions in Stackoverflow, the Q&A community of programmers. The work done in this area has problems such as complexity and reduced performance over time. The proposed solution is based on machine learning and modern representation learning methods. Representation is done with two approaches, domain-specific learning and transfer learning. Fasttext and GloVe are the two word embeddings used in domain-specific learning, while in transfer learning the embedding of the universal sentence encoder has been used. Support vector machine and multilayer perceptron used as...

    Diagnosis of coronary artery disease using cost-sensitive algorithms

    , Article Proceedings - 12th IEEE International Conference on Data Mining Workshops, ICDMW 2012 ; 2012 , Pages 9-16 ; 9780769549255 (ISBN) Alizadehsani, R ; Hosseini, M. J ; Sani, Z. A ; Ghandeharioun, A ; Boghrati, R ; Sharif University of Technology
    2012
    Abstract
    Cardiovascular diseases are among the main causes of death the world over, and coronary artery disease (CAD) is a major type. This disease occurs when the diameter narrowing of one of the left anterior descending, left circumflex, or right coronary arteries is equal to or greater than 50 percent. Angiography is the principal diagnostic modality for the stenosis of heart vessels; however, because of its complications and costs, researchers are looking for alternative methods such as data mining. This study applies data mining algorithms to the Z-Alizadeh Sani dataset which has been collected from 303 random visitors to Tehran's Shaheed Rajaei Cardiovascular, Medical and Research...

    Bug localization using revision log analysis and open bug repository text categorization

    , Article 6th International IFIP WG 2.13 Conference on Open Source Systems, OSS 2010, Notre Dame, IN, 30 May 2010 through 2 June 2010 ; Volume 319 AICT , 2010 , Pages 188-199 ; 18684238 (ISSN) ; 9783642132438 (ISBN) Moin, A. H ; Khansari, M ; Sharif University of Technology
    2010
    Abstract
    In this paper, we present a new approach to localize a bug in the software source file hierarchy. The proposed approach uses log files of the revision control system and bug reports information in open bug repository of open source projects to train a Support Vector Machine (SVM) classifier. Our approach employs textual information in summary and description of bugs reported to the bug repository, in order to form machine learning features. The class labels are revision paths of fixed issues, as recorded in the log file of the revision control system. Given an unseen bug instance, the trained classifier can predict which part of the software source file hierarchy (revision path) is more... 

    Persian text classification based on topic models

    , Article 24th Iranian Conference on Electrical Engineering, ICEE 2016, 10 May 2016 through 12 May 2016 ; 2016 , Pages 86-91 ; 9781467387897 (ISBN) Ahmadi, P ; Tabandeh, M ; Gholampour, I ; Sharif University of Technology
    Institute of Electrical and Electronics Engineers Inc  2016
    Abstract
    With the extensive growth in information, text classification, as one of the text mining methods, plays a vital role in organizing and managing information. Most text classification methods represent a document collection as a Bag of Words (BOW) model and then use the histogram of words as the classification features. But in this way, the number of features is very large; therefore text classification faces serious computational cost problems. Moreover, the BOW representation is unable to recognize semantic relations between words. Recently, topic-model approaches have been successfully applied to text classification to overcome the problems of BOW. Our main goal in this paper...

    Creating a corpus for automatic punctuation prediction in persian texts

    , Article 2017 25th Iranian Conference on Electrical Engineering, ICEE 2017, 2 May 2017 through 4 May 2017 ; 2017 , Pages 1537-1542 ; 9781509059638 (ISBN) Hosseini, S. M ; Sameti, H ; Sharif University of Technology
    Abstract
    We present a novel corpus for automatic punctuation prediction in Persian texts. Punctuation prediction is an important task in automatic speech recognition (ASR). The output of ASR systems is typically a raw sequence of words with no punctuation marks; this makes the text difficult or even impossible to make sense of for humans and also for any text processing unit. In this work, we have assembled a state-of-the-art Persian corpus to train and test a punctuation prediction model. To the best of our knowledge, this is the first ever corpus specifically designed for punctuation prediction in Persian texts. The corpus is a modification of a manually part-of-speech (POS) tagged Persian one,...