Department of Computer Science and Operations Research


Oscar Mañas' Predoc III Presentation

Dear all,

We are happy to invite you to Oscar Mañas' Predoc III defense on September 7th at 10:30am. (Hybrid event.)



Title: Advancing Vision-Language Learning Using Large Language Models

Date: September 7th, 2023, 10:30am to 1:00pm EST

Location: Auditorium 2 - 6650 rue Saint Urbain

 

Jury

President: Courville, Aaron
Research advisor: Agrawal, Aishwarya
Member: Reddy, Siva

 

Abstract

Large language models (LLMs) exhibit impressive capabilities beyond language modeling, such as encyclopedic knowledge, in-context learning, and basic reasoning skills. This language capability can be seen not only as a human-machine interface, but also as a reasoning and generalization engine.

In this thesis, we explore how we can leverage these LLM capabilities for multimodal vision-language (VL) learning. We believe that, like humans and other animals, AI systems should have a holistic understanding of the world around them. This means interacting with and processing multiple
modalities, such as describing images in natural language, answering natural language questions about images, generating images from natural language descriptions, etc. The key challenge, however, lies in effectively combining vision and language, given the fundamentally different nature of
both modalities: language is an abstract human construct, whereas vision is a low-level perceptual modality.

To this end, in this thesis, we aim to address the following questions:

  1. How to connect LLMs with the visual modality in a parameter- and data-efficient way? As a first step, we developed one of the first VL models capable of rapid task adaptation from just a handful of multimodal in-domain examples (an illustrative sketch of such a frozen-backbone setup follows this list).
  2. How can we robustly evaluate natural language outputs generated by VL models? As a first step, we proposed an LLM-assisted evaluation metric for the task of answering questions about images.
  3. How to improve the quality of text-conditioned image generation? We are currently exploring ways of applying LLMs to refine user-provided text prompts in order to generate images which are more aligned with human intent.
  4. What is the optimal encoding/representation of visual information to be ingested by LLMs? First, we plan to investigate conditioning vision encoders on task-specific information to extract more relevant visual features for LLMs. Second, we will explore using more abstract forms of representation such as dense textual descriptions and scene graphs.
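For illustration only (this is not the specific model being presented), here is a minimal sketch of one common parameter-efficient way to connect a frozen vision encoder to a frozen LLM: a small trainable mapping network projects pooled image features into a sequence of "soft visual tokens" in the LLM's embedding space, and only this mapper is trained. The class name VisualPrefixMapper and all dimensions below are hypothetical placeholders.

    import torch
    import torch.nn as nn

    # Hypothetical dimensions; real values depend on the chosen backbones.
    VISION_DIM = 768        # e.g. output dim of a frozen image encoder
    LLM_DIM = 2048          # e.g. token-embedding dim of a frozen decoder-only LLM
    NUM_VISUAL_TOKENS = 32  # length of the visual prefix fed to the LLM

    class VisualPrefixMapper(nn.Module):
        """Small trainable network mapping frozen vision features to a
        sequence of soft visual tokens in the LLM embedding space.
        Only this mapper is trained; both backbones stay frozen."""
        def __init__(self):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(VISION_DIM, LLM_DIM),
                nn.GELU(),
                nn.Linear(LLM_DIM, NUM_VISUAL_TOKENS * LLM_DIM),
            )

        def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
            # vision_features: (batch, VISION_DIM) pooled image embedding
            prefix = self.proj(vision_features)
            return prefix.view(-1, NUM_VISUAL_TOKENS, LLM_DIM)

    # Usage: prepend the visual prefix to the text token embeddings before
    # feeding them to the frozen LLM (details depend on the LLM interface).
    mapper = VisualPrefixMapper()
    image_emb = torch.randn(4, VISION_DIM)   # stand-in for frozen encoder output
    visual_prefix = mapper(image_emb)        # shape: (4, 32, LLM_DIM)
    print(visual_prefix.shape)

Because only the mapper's parameters are updated, such a setup keeps training cost low and preserves the LLM's in-context learning ability, which is what enables rapid adaptation from a handful of multimodal examples.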