Ranking Farsi Web Pages using Deep Neural Networks

Zinvandi, Erfan; Behrouzi, Hamid; Mohammadzadeh, Narjes Alhoda; Kazemi, Reza

Zinvandi, Erfan | 2024

Type of Document: M.Sc. Thesis
Language: Farsi
Document No: 56961 (05)
University: Sharif University of Technology
Department: Electrical Engineering
Advisor(s): Behrouzi, Hamid; Mohammadzadeh, Narjes Alhoda; Kazemi, Reza
Abstract:
The purpose of ranking Persian web pages is to retrieve as many documents relevant to a Persian-speaking user's search query as possible while returning as few documents from the web as possible. Information retrieval is one of the key problems in search engines. In this study, billions of documents were collected from Persian web pages and, because of infrastructure limitations, a few hundred million of them were indexed in a database such as Elastic. Given a user's actual query, the relevant documents must then be retrieved from the indexed documents. This required a large Persian language model. Existing large language models for Persian were not usable for two reasons: 1) the documents were longer than 512 tokens, and 2) because the model was to be used operationally to retrieve relevant documents in the search engine, the output dimensions of the existing models did not allow timely responses in an operational environment. A base language model built on the BigBird architecture was therefore trained to address these challenges. Since the goal of this model was text retrieval, it was fine-tuned on a dataset dedicated to text information retrieval, consisting of approximately 2.5 million queries and 7 million corresponding documents. After training, the model was used to compute vector representations of the documents on the web, which were indexed in a vector database (Vespa). Approximately 180 million documents have been indexed in vector form in this database, and the collection is continuously growing. Given the dynamic nature of the web and the way the concepts in user queries evolve over time, the trained model must be updated continuously. If the document model and the query model were changed together, the stored document vectors would also have to be recomputed continuously, which is not possible in an operational environment.
To make such updates feasible, the query model is separated from the document model and only the query model is retrained. The work described in this study was conducted for the first time in Iran and was carried out at a level comparable to search engines worldwide. Currently, the Zarebin search engine has a reasonable capability to compete with major search engines such as Google.
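The idea of retraining only the query side can be illustrated with a minimal sketch. It is not the thesis's implementation: a toy linear map stands in for the BigBird-based query encoder, a plain dictionary stands in for the Vespa index, and every name (`index`, `W`, `train_query_encoder`, the two-dimensional vectors, the pairwise logistic loss) is an assumption chosen for illustration. It only shows that rankings can change while the stored document vectors are left untouched.

```python
import copy
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, v):
    return [dot(row, v) for row in W]

def rank(query_vec, index):
    """Return document ids sorted by descending dot-product score."""
    return sorted(index, key=lambda d: dot(query_vec, index[d]), reverse=True)

def train_query_encoder(W, triples, index, lr=0.1, epochs=200):
    """Update only the query-side weights W; `index` is read, never written.

    Each triple is (query_features, relevant_doc_id, irrelevant_doc_id).
    Loss per triple: log(1 + exp(-margin)), margin = q.pos - q.neg.
    """
    for _ in range(epochs):
        for q_feat, pos_id, neg_id in triples:
            q = matvec(W, q_feat)
            pos, neg = index[pos_id], index[neg_id]
            margin = dot(q, pos) - dot(q, neg)
            g = -1.0 / (1.0 + math.exp(margin))  # d(loss)/d(margin)
            for i in range(len(W)):
                for j in range(len(q_feat)):
                    W[i][j] -= lr * g * (pos[i] - neg[i]) * q_feat[j]
    return W

# Hypothetical frozen document vectors (stand-in for the vector database).
index = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
index_snapshot = copy.deepcopy(index)

# Hypothetical query encoder that initially prefers the wrong document.
W = [[0.0, 1.0], [1.0, 0.0]]
query_features = [1.0, 0.0]

before = rank(matvec(W, query_features), index)
train_query_encoder(W, [(query_features, "d1", "d2")], index)
after = rank(matvec(W, query_features), index)

print(before, after, index == index_snapshot)
```

In this sketch the ranking for the query flips from `d2` first to `d1` first, while the document vectors compare equal to their snapshot, so nothing in the index would need to be re-encoded.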
Keywords:
Deep Learning; Transformers; Web Search Engine; Information Retrieval; Vector Search; Semantic Textual Similarity; Text Information Retrieval
