Department of Computer Science and Operations Research

Leo Feng's Predoc III Presentation

Dear all,

We are happy to invite you to Leo Feng's Predoc III presentation on Wednesday, December 20th, at 9:30 am (hybrid mode).

Title: Memory Efficient Attention-based Models

Date: December 20th, 2023 - 9:30 to 11:30am

Location: Auditorium 2 - 6650 rue Saint Urbain

 

Jury

President: Bacon, Pierre-Luc
Research supervisor: Bengio, Yoshua
Member: Mitliagkas, Ioannis

 

Abstract

Attention mechanisms, such as those at the core of the Transformer, have grown in popularity in recent years and are a vital part of recent AI successes. However, computing attention in Transformers is memory-intensive, scaling quadratically with the number of tokens. Meanwhile, modern hardware is often memory-constrained in practice. With the rapid growth of low-memory/compute domains (e.g., IoT devices), there is a strong incentive to design more memory-efficient attention mechanisms. In this predoc, we tackle this efficiency issue by introducing a series of works that design memory-efficient attention modules and architectures without modality-specific components.
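For illustration only (this sketch is not part of the presentation), the quadratic scaling mentioned above comes from materializing the attention score matrix. A minimal NumPy sketch of standard scaled dot-product attention, with illustrative sizes:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; q: (n_q, d), k and v: (n_ctx, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_q, n_ctx) score matrix
    return softmax(scores, axis=-1) @ v       # (n_q, d)

# In self-attention, the queries and keys are the same n tokens, so the
# score matrix is (n, n): memory grows quadratically with the token count.
n, d = 1024, 64
tokens = np.random.default_rng(0).normal(size=(n, d))
out = attention(tokens, tokens, tokens)       # materializes an (n, n) matrix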

The models we consider consist of (1) an encoder that encodes context tokens and (2) a retrieval mechanism that retrieves information from the encodings to make downstream predictions. In the first work, we design a model that reduces the set of tokens to a fixed-size latent bottleneck. As a result, the model only requires constant computation per prediction and linear memory for encoding the context tokens. In the second work, we introduce the Constant Memory Attention Block (CMAB), a novel attention variant that requires only constant memory to compute its output and constant computation to update its output given new context tokens. We show that by leveraging CMABs, we can design models that require only constant memory and perform constant-computation updates. Lastly, we detail ongoing work on a tree-based retrieval mechanism for efficient retrieval when making predictions.
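As a rough, hypothetical sketch of the latent-bottleneck idea summarized above (in the spirit of the first work; the actual models and the CMAB architecture differ in their details, and all names and sizes here are illustrative), cross-attending from a fixed number of latent vectors to the context keeps encoding linear in the number of context tokens and makes each downstream prediction constant-cost:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d, n_ctx, n_latent = 64, 10_000, 128          # illustrative sizes, not from the works

context = rng.normal(size=(n_ctx, d))         # encoded context tokens
latents = rng.normal(size=(n_latent, d))      # fixed-size latent bottleneck (learned in practice)

# Cross-attention from the latents to the context: the score matrix is
# (n_latent, n_ctx), so encoding the context is linear in n_ctx, not quadratic.
summary = attention(latents, context, context)   # (n_latent, d)

# Downstream predictions attend only to the fixed-size summary, so each
# prediction costs a constant amount of memory and computation.
query = rng.normal(size=(1, d))
features = attention(query, summary, summary)    # (1, d)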