- Type of Document: M.Sc. Thesis
- Language: Farsi
- Document No: 57953 (19)
- University: Sharif University of Technology
- Department: Computer Engineering
- Advisor(s): Soleymani Baghshah, Mahdieh
- Abstract:
- Vision-language models like CLIP have demonstrated a remarkable ability to extract transferable features for downstream tasks. These features are particularly valuable for tasks such as image classification, captioning, and multimodal retrieval. However, the training process of these models is often based on a coarse-grained contrastive loss between the global embeddings of images and texts. While this approach improves overall alignment, it may overlook the compositional structure and complex relationships present in both modalities. This issue is especially noticeable when image-text pairs consist of multiple components and intricate relationships. Recent studies have shown that vision-language models struggle with compositional understanding, such as aligning attributes with objects and identifying relationships between them. These shortcomings can lead to misinterpretation of compositional content and reduced accuracy in tasks relying on these models. While some recent approaches have attempted to address these challenges by improving text-image alignment, they often fail either to accurately identify meaningful components or to achieve precise alignment between those components. To address these limitations, we propose a compositional alignment method. This approach leverages weak supervision in the form of text-image pairs to establish a more precise mapping between image and text components. Our method uses hierarchical analysis of components to enhance the model's accuracy in identifying objects, their attributes, and their relationships. Experimental results demonstrate that this method improves the compositional understanding of the model and yields greater accuracy in vision-language tasks. For instance, our compositional alignment improved the text-to-image retrieval accuracy of the CLIP model by 6.27%.
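The coarse-grained objective the abstract critiques can be illustrated with a minimal sketch of the symmetric contrastive (InfoNCE) loss used by CLIP-style training. This is not the thesis code: it operates on one global embedding per image and per caption (exactly the coarse granularity at issue), and the function name and temperature value are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over global embeddings.

    Coarse-grained: one vector per image and one per caption, with no
    component-level (object/attribute/relation) terms. img_emb and
    txt_emb are (B, D) arrays; matched pairs share the same row index.
    """
    # L2-normalise so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (B, B) similarity matrix
    labels = np.arange(len(img))            # matched pair on the diagonal

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Because each caption is compressed into a single vector before the loss is computed, swapping an attribute between two objects in the caption barely moves the global embedding, which is why this objective alone gives weak compositional supervision.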
- Keywords:
- Vision-Language Models ; Weakly Supervised Learning ; Compositional Alignment ; Entity Relationship Identification ; Compositional Understanding ; Contrastive Loss Function ; Multimodal Information Retrieval
Contents
- Introduction
- Preliminary Concepts
- Related Work
- Proposed Method
- Experiments and Empirical Results
- Conclusion and Future Work
- References
- Glossary