Predoc III Presentation by Le Zhang
Hello everyone,
You are all cordially invited to attend the Predoc III project presentation by Le Zhang on August 26 at 1:30 p.m. (hybrid mode).
Title: Towards Language-Compatible Visual Representation Learning
Date: Tuesday, August 26, at 1:30 p.m.
Location: Auditorium 1, Mila, 6650 (2nd floor)
Jury
| Chair | Aaron Courville |
| Supervisor | Aishwarya Agrawal |
| Member | Siva Reddy |
Abstract
The rapid progress of vision-language models (VLMs) has been driven by large-scale contrastive training, yet they struggle with *compositionality* (understanding how objects, attributes, and relations combine) due to noisy data and shallow global objectives. This report advocates for *language-compatible visual representations* that mirror linguistic structure, enabling stronger text-to-image generation, more capable multimodal LLMs, and richer reasoning. Instead of retraining foundation models, we pursue post-training alignment through refined objectives, synthetic data, and lightweight mappings between frozen encoders. We explore three solutions: a refined contrastive framework, *VisMin* for systematic compositional evaluation, and the *SAIL framework* for fine-grained alignment, together pointing toward VLMs with deeper and more structured understanding.