Personal Name Disambiguation in Persian Written News

Saneei, Sara; Sameti, Hossein


Saneei, Sara | 2021


Type of Document: M.Sc. Thesis
Language: Farsi
Document No: 53705 (31)
University: Sharif University of Technology
Department: Languages and Linguistics Center
Advisor(s): Sameti, Hossein
Abstract:
Many different personal names are mentioned in everyday news, but news agencies do not distinguish between entities that share the same name. As a result, a search for an ambiguous name can return irrelevant news. Personal name disambiguation in news seeks to partition a large collection of news items into distinct classes, each of which belongs to a single real-world entity. In this thesis, which to the researcher's knowledge is the first of its kind, at least for Persian, the researcher used the FarsiYar News Dataset, specifically 50,000 news items from the FarsNews dataset published in the year 1397 (Solar Hijri calendar). First, a database was built from these news data, and the unstructured news items were then transformed into a weighted graph data structure so that they could be processed accurately. For data preparation, each news item was represented by a node in a news-news graph in which an edge indicates that two news items share named entities (NEs), and the number of shared NEs is the weight of the edge. The lemmas of the nouns in each news body became the node attributes in the graph. Three approaches were then chosen. The first was graph node embedding, used in both a shallow learning mode and a deep neural network mode. For shallow embedding learning, Deepwalk and Node2vec were used; for deep embedding learning, GraphSAGE and a version of the heterogeneous graph convolutional network (HGCN) were used. The second approach used pre-trained representations from ParsBert, which was trained on a Persian news corpus. The news-news graph was given as input to each of the algorithms, and a representation was extracted for each news item. The DBSCAN clustering method was then used to estimate the number of clusters, which was provided as input to hierarchical agglomerative clustering (HAC). Applying HAC to the news items with the number of clusters determined by DBSCAN produced the final clusters.
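The news-news graph construction described above can be illustrated with a minimal sketch. It assumes that named entities have already been extracted for each news item; the function name build_news_graph and the toy news ids and entities are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from itertools import combinations


def build_news_graph(entities_by_news):
    """Return {(news_a, news_b): weight}, where weight is the number of
    named entities shared by the two news items (hypothetical helper)."""
    # Inverted index: named entity -> set of news ids that mention it.
    news_by_entity = defaultdict(set)
    for news_id, entities in entities_by_news.items():
        for entity in set(entities):
            news_by_entity[entity].add(news_id)

    # Every pair of news items sharing an entity gains one unit of edge weight.
    edges = defaultdict(int)
    for news_ids in news_by_entity.values():
        for a, b in combinations(sorted(news_ids), 2):
            edges[(a, b)] += 1
    return dict(edges)


# Toy input (invented for illustration): news id -> set of named entities.
toy_news = {
    "n1": {"Ali Karimi", "Tehran", "Persepolis"},
    "n2": {"Ali Karimi", "Persepolis"},
    "n3": {"Ali Karimi", "Majlis"},
    "n4": {"Isfahan"},
}
graph = build_news_graph(toy_news)
print(graph)
```

In this toy input, n1 and n2 share two entities and so get an edge of weight 2, while n4 shares nothing and stays isolated. Node attributes (noun lemmas) are omitted here for brevity.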
In the third approach, which is the proposed one, the clustering outputs of five methods (Node2vec, GraphSAGE, ParsBert, HGCN using Word2vec, and HGCN using ParsBert) formed a feature vector of dimension 5, which was given as input to DBSCAN to estimate the number of clusters again. HAC then produced the clustering results. About 1,400 selected news items were manually labeled by two annotators. The proposed approach improved the mean F-score, computed with the B-cubed metric over 30 randomly selected author names, from 0.69 to 0.75. When this approach was applied to 6 ambiguous names in Persian news, the F-score reached 0.73 on average.
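For readers unfamiliar with the B-cubed metric mentioned above, the following is a minimal sketch of one common formulation: per-item precision and recall are averaged over all items, and F is their harmonic mean. The function name b_cubed and the toy labels are hypothetical, and the thesis may use a different averaging variant.

```python
def b_cubed(predicted, gold):
    """Return (precision, recall, f_score) for two labelings given as
    dicts mapping item -> cluster label (hypothetical helper)."""
    items = list(gold)
    precision_sum = 0.0
    recall_sum = 0.0
    for i in items:
        # Items placed in the same predicted cluster / gold class as item i.
        same_pred = [j for j in items if predicted[j] == predicted[i]]
        same_gold = [j for j in items if gold[j] == gold[i]]
        # Items that share both the predicted cluster and the gold class.
        correct = sum(1 for j in same_pred if gold[j] == gold[i])
        precision_sum += correct / len(same_pred)
        recall_sum += correct / len(same_gold)
    precision = precision_sum / len(items)
    recall = recall_sum / len(items)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy labels (invented): two real persons X and Y, two predicted clusters.
gold = {"a": "X", "b": "X", "c": "X", "d": "Y", "e": "Y"}
predicted = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2}
p, r, f = b_cubed(predicted, gold)
print(round(p, 3), round(r, 3), round(f, 3))
```

Because every item always matches itself, both averages are strictly positive and the harmonic mean is well defined.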
Keywords:
Graph Neural Network ; Text Mining ; Graph Embedding ; Graph Node Embedding ; Personal Name Disambiguation
