Compositional Generalization in Visual-Language Models

Abdollahi Alibeik, Ali; Soleymani Baghshah, Mahdieh

Abdollahi Alibeik, Ali | 2025

Type of Document: M.Sc. Thesis
Language: Farsi
Document No: 57953 (19)
University: Sharif University of Technology
Department: Computer Engineering
Advisor(s): Soleymani Baghshah, Mahdieh
Abstract:
Vision-language models such as CLIP have demonstrated a remarkable ability to extract transferable features for downstream tasks. These features are particularly valuable for image classification, captioning, and multimodal retrieval. However, these models are typically trained with a coarse-grained contrastive loss between the global embeddings of images and texts. While this approach improves overall alignment, it can overlook the compositional structure and complex relationships present in both modalities. The issue is especially noticeable when an image-text pair contains multiple components with intricate relationships among them. Recent studies have shown that vision-language models struggle with compositional understanding, such as binding attributes to the correct objects and identifying the relationships between objects. These shortcomings can lead to misinterpretation of compositional content and reduced accuracy in tasks that rely on these models. Some recent approaches have tried to address these challenges by improving text-image alignment, but they often fail either to identify meaningful components accurately or to align those components precisely. To address these limitations, we propose a compositional alignment method. The method uses weak supervision in the form of text-image pairs to establish a more precise mapping between image components and text components. It applies a hierarchical analysis of components to improve the model's accuracy in identifying objects, their attributes, and their relationships. Experimental results show that the method improves the model's compositional understanding and yields higher accuracy in vision-language tasks. For instance, our compositional alignment improved the text-to-image retrieval accuracy of the CLIP model by 6.27%.
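To make the "coarse-grained contrastive loss between global embeddings" concrete, the following is a minimal pure-Python sketch of a CLIP-style symmetric contrastive objective. It is an illustration only and is not the compositional alignment method proposed in the thesis. The function names and the fixed `temperature=0.07` are assumptions made for this example (CLIP itself learns its temperature during training).

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def diagonal_log_softmax(logits):
    """For each row i, return the log-softmax value at column i."""
    out = []
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        out.append(row[i] - lse)
    return out

def global_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over one global embedding per image and caption.

    Pair i (image_embs[i], text_embs[i]) is the positive; every other
    pairing in the batch acts as a negative. temperature is a hypothetical
    fixed value chosen for this sketch.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine similarity of every image with every caption, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    image_to_text = -sum(diagonal_log_softmax(logits)) / len(imgs)
    text_to_image = -sum(diagonal_log_softmax(logits_t)) / len(txts)
    return 0.5 * (image_to_text + text_to_image)

if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    captions_matched = [[1.0, 0.0], [0.0, 1.0]]
    captions_swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(global_contrastive_loss(images, captions_matched))
    print(global_contrastive_loss(images, captions_swapped))
```

Because each image and each caption is reduced to a single vector before comparison, the loss only rewards matching whole images to whole captions. Nothing in it checks whether a particular attribute is bound to the right object, which is the gap the abstract describes.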
Keywords:
Vision-Language Models; Weakly Supervised Learning; Compositional Alignment; Entity Relationship Identification; Compositional Understanding; Contrastive Loss Function; Multimodal Information Retrieval

Table of Contents

Introduction
- Introduction
- Problem Definition
- Importance of the Topic
- Research Objectives
- Contributions and Novelties of the Research
- Thesis Structure
- Summary
Preliminaries
- Introduction
- Contrastive Learning
- Transformer
- Vision Transformers
  - Structure and Operation
- Vision-Language Models
  - Structure of Vision-Language Models
  - Training of Vision-Language Models
- The YOLO Model
  - YOLO Architecture
  - YOLO Operation
- Introduction to SpaCy
- Summary
Related Work
- Foundational Vision-Language Models
  - The CLIP Model
  - The ALIGN Model
- Review of Efficient Training Methods for Vision-Language Models
  - Data-Efficient Models
  - Parameter-Efficient Models
- Fine-Grained Alignment in Vision-Language Models
  - Implicit Alignment of Text-Image Components
  - Direct Alignment of Text-Image Component Representations
- Summary
Proposed Method
- Preprocessing
- Model Architecture
- Training Objectives
- Inference
- Summary
Experiments and Empirical Results
- Datasets Used
  - The Visual Genome Dataset
  - The MSCOCO Dataset
  - The Flickr30K Dataset
- Experimental Setup
- Evaluation Metrics
  - Rank-Based Recall (Recall@K)
  - Classification Accuracy
- Zero-Shot Image-Text Retrieval
  - Results and Analysis
- Evaluation of the Model on Compositional Benchmarks
  - The ARO Benchmark for Evaluating Understanding of Attributes and Relations
  - The SVO-Probes Benchmark for Evaluating Understanding of Attributes and Relations
  - Analysis of the Proposed Model's Performance on Compositional Benchmarks
- Evaluation of the Proposed Method on Zero-Shot Classification
- Hyperparameter Tuning
- Ablation Study
  - Analysis of Loss Function Components
  - Analysis of Network Architecture
  - Analysis of the Number of Network Layers
- Visualization
  - Analysis of Similarity Matrices
  - Analysis of Differences Between the Proposed Model and CLIP
- Summary
Conclusion and Future Work
- Conclusion
- Strengths and Weaknesses
- Future Work
References
Glossary
