Predoc III Presentation - Zhixuan Lin

Dear all,

We are happy to invite you to the Predoc III evaluation of Zhixuan Lin on December 3rd at 9 am (hybrid mode).

Title: Advancing Long-Context Sequence Modeling

Date: December 3rd at 9:00 am

Location: Auditorium 2 (Mila 6650)

 

Jury

President: Bang Liu
Director: Aaron Courville
Regular member: Sarath Chandar

 

Abstract

 

Most modern sequence models fall into two major categories: Transformer-based models and the recently revived recurrent sequence models. Both types of models have pros and cons. Transformers have superior long-context capabilities but suffer from quadratic complexity with respect to the context length, and they extrapolate poorly beyond the training context length. Recurrent sequence models, though they still underperform the Transformer in handling long contexts, enjoy linear complexity and have demonstrated strong performance on short-context tasks and better length extrapolation. Drawing on insights from both model types, we aim to deepen our understanding of long-context sequence modeling and develop next-generation sequence models.

As a first step towards this goal, we propose the Forgetting Transformer, a Transformer variant that incorporates arguably the most crucial component of recurrent sequence models: the forget gate. We show that the Forgetting Transformer outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. It also retains the ability of the Transformer to perform accurate long-context retrieval and achieves perfect accuracy in a simplified needle-in-the-haystack test, where all the tested recurrent sequence models fail. Finally, we show that the Forgetting Transformer can be implemented in a hardware-aware way with a simple modification to the FlashAttention algorithm.

To conclude, we outline potential future directions, including utilizing the forget gate in the Forgetting Transformer for efficient training and inference, and developing recurrent sequence models with better long-context modeling abilities.
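
For readers less familiar with the idea, the following is a minimal sketch of how a forget gate can be folded into standard softmax attention: each position emits a forget-gate value in (0, 1), and the cumulative sum of its logarithm is added to the attention logits before the softmax, so that keys further in the past are progressively down-weighted. This is a naive, quadratic PyTorch reference written under our own assumptions; the function name forgetting_attention and the parametrization of log_f are illustrative, and it is not the hardware-aware, FlashAttention-based implementation mentioned in the abstract.

    import torch
    import torch.nn.functional as F

    def forgetting_attention(q, k, v, log_f):
        # q, k, v: (batch, seq_len, dim); log_f: (batch, seq_len), the log of a
        # per-position forget gate in (0, 1), e.g. F.logsigmoid of a learned
        # projection of the input (this parametrization is an assumption).
        b, t, d = q.shape
        scores = q @ k.transpose(-1, -2) / d ** 0.5   # raw attention logits, (b, t, t)
        cum = torch.cumsum(log_f, dim=-1)             # cumulative log forget gates, (b, t)
        # bias[i, j] = sum of log f_l for l = j+1..i, which is <= 0,
        # so keys further in the past receive smaller attention weights
        bias = cum.unsqueeze(-1) - cum.unsqueeze(-2)  # (b, t, t)
        causal = torch.tril(torch.ones(t, t, dtype=torch.bool, device=q.device))
        scores = (scores + bias).masked_fill(~causal, float("-inf"))
        return F.softmax(scores, dim=-1) @ v          # (b, t, dim)

Note that with log_f fixed at zero (forget gates equal to 1), this reduces to standard causal attention, which is one way to see why such a model can retain the Transformer's retrieval abilities while learning to down-weight stale context.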