Department of Computer Science and Operations Research

Predoc III Presentation by Le Zhang

Hello everyone,

You are all cordially invited to attend Le Zhang's Predoc III project presentation on August 26 at 1:30 p.m. (hybrid format).

Title: Towards Language-Compatible Visual Representation Learning

Date: Tuesday, August 26, at 1:30 p.m.

Location: Auditorium 1, Mila, 6650 (2nd floor)

Jury

Chair: Aaron Courville
Supervisor: Aishwarya Agrawal
Member: Siva Reddy

Abstract

The rapid progress of vision-language models (VLMs) has been driven by large-scale contrastive training, yet these models struggle with *compositionality* (understanding how objects, attributes, and relations combine) due to noisy data and shallow global objectives. This report advocates for *language-compatible visual representations* that mirror linguistic structure, enabling stronger text-to-image generation, more capable multimodal LLMs, and richer reasoning. Instead of retraining foundation models, we pursue post-training alignment through refined objectives, synthetic data, and lightweight mappings between frozen encoders. We explore three solutions: a refined contrastive framework, *VisMin* for systematic compositional evaluation, and the *SAIL framework* for fine-grained alignment, together pointing toward VLMs with deeper and more structured understanding.