Multi-modal Keyword Extraction from Video Clip and its Description

Alizadeh Aghmashhadi, Farahmand; Behroozi, Hamid; Asgari, Ehsaneddin

Alizadeh Aghmashhadi, Farahmand | 2024

Type of Document: M.Sc. Thesis
Language: Farsi
Document No: 57461 (05)
University: Sharif University of Technology
Department: Electrical Engineering
Advisor(s): Behroozi, Hamid; Asgari, Ehsaneddin
Abstract:
Keyword prediction is a long-standing task in natural language processing. It was originally applied mainly to textual content, but the rapid growth of multimedia content and its use on social networks has increased the need to extract and automatically generate appropriate keywords for videos as well. Suitable keywords significantly improve content accessibility, visibility, and classification. With the spread of generative models, keyword prediction can also be formulated as a text generation task. Existing solutions have mostly focused on English-language content and usually perform poorly, or are unusable, on Persian. Therefore, a dataset of short Persian-language videos has been collected. It includes short video clips along with their titles, descriptions, tags, keyframes, a description of each frame, audio content, and corresponding transcripts. To model the problem, open-source multimodal models, including PaliGemma, IDEFICS, and Qwen2-VL, were applied to this dataset in two modes: zero-shot and LoRA fine-tuned. Keyword evaluation is usually based on exact matching at the lexical level. In this research, the models are evaluated from three aspects: reference agreement, diversity, and faithfulness, at both the lexical and semantic levels. For the test set, several videos were manually labeled, and model performance was evaluated against these labels. The results show that the fine-tuned models can generate relatively suitable Persian keywords from the information available in the text and images. These models can be used to improve search and content retrieval, in recommendation systems, or in keyword generation applications for content creators.
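The lexical exact-match evaluation mentioned in the abstract can be illustrated with a minimal sketch. This record does not specify the thesis's actual metrics or normalization, so everything here is an assumption for illustration: the function name `exact_match_prf` is hypothetical, and the normalization (lowercasing and collapsing whitespace) is only one plausible choice. It scores a predicted keyword set against a reference set by precision, recall, and F1.

```python
def exact_match_prf(predicted, reference):
    """Precision, recall, and F1 of predicted keywords against reference
    keywords, counting only exact matches after simple normalization.

    Illustrative assumption: normalization lowercases and collapses
    whitespace; the thesis may normalize differently.
    """
    def normalize(keyword):
        return " ".join(keyword.strip().lower().split())

    # Use sets so duplicate keywords are counted once.
    pred = {normalize(k) for k in predicted if k.strip()}
    ref = {normalize(k) for k in reference if k.strip()}

    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    predicted = ["Machine Learning", "video", "nlp"]
    reference = ["machine learning", "NLP", "keyword generation", "persian"]
    print(exact_match_prf(predicted, reference))
```

A metric of this kind gives no credit to a predicted keyword that is a synonym or close paraphrase of a reference keyword, which is one reason to complement it with evaluation at the semantic level.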
Keywords:
Keyword Extraction ; Natural Language Processing ; Machine Learning ; Keyword Generation
