Predoc III Presentation - Benjamin Thérien
Hello everyone,
You are invited to attend Benjamin Thérien's Predoc III exam on Wednesday, April 2, at 12:30 PM.
Title: Continual and Meta-learned Algorithms for Foundation Model Pre-training
Date: Wednesday, April 2, 12:30 PM
Location: Mila Auditorium 2
Jury
President | Aaron Courville
Supervisor | Irina Rish
Co-supervisor | Eugene Belilovsky
Member | Sarath Chandar
Abstract
Scaling laws reliably predict that language modeling performance improves as we scale training data and model parameters. Under this guidance, practitioners have scaled models to enormous sizes and pre-trained them on vast amounts of data. These foundation models are then post-trained to elicit the most desirable behaviors learned during pre-training, leading, most recently, to exceptional mathematical reasoning. Training the largest foundation models already requires massive processing power, and given the need to keep models up to date and to continue improving them, there is little reason to believe that the demand for large-scale pre-training will stagnate in the near future. As such, techniques that can reduce pre-training costs are of interest to our community. In this thesis, we take two approaches to reducing pre-training costs: improving optimization algorithms for pre-training through meta-learning, and updating pre-trained models at relatively low cost through continual pre-training.
In the first part of the presentation, we explore the idea of meta-learning neural network optimizers under the maximal update parameterization. By meta-learning the optimizer on tasks under this parameterization, we not only enable generalization to larger versions of the meta-training tasks, but also to much larger versions of unseen tasks. These findings address a fundamental flaw of the largest learned optimizers trained today: their failure to generalize to tasks far larger in total parameter count than those seen during meta-training. Our research therefore marks an important step towards leveraging data-driven optimization algorithms for pre-training.
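For readers unfamiliar with learned optimizers, the short Python sketch below illustrates the general idea only: a tiny per-parameter optimizer network is meta-trained by backpropagating through short unrolls of an inner optimization on a toy quadratic task. It is not the method presented in the thesis; the maximal update parameterization and language-model meta-training tasks are omitted, and every name, task, and constant here is an illustrative assumption.

# Minimal sketch (illustrative only) of meta-training a learned optimizer
# by unrolled differentiation on a toy quadratic task.
import torch
import torch.nn as nn

class LearnedOptimizer(nn.Module):
    """Tiny MLP mapping per-parameter features (gradient, momentum) to an update."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, grad, mom):
        feats = torch.stack([grad, mom], dim=-1)   # per-parameter features, shape (..., 2)
        return 0.01 * self.net(feats).squeeze(-1)  # small output scale keeps early updates stable

def inner_loss(w, A, b):
    # Toy inner "task": minimize ||A w - b||^2
    return ((A @ w - b) ** 2).mean()

opt_net = LearnedOptimizer()
meta_opt = torch.optim.Adam(opt_net.parameters(), lr=1e-3)

for meta_step in range(200):                 # meta-training loop over sampled tasks
    A, b = torch.randn(16, 8), torch.randn(16)
    w = torch.zeros(8, requires_grad=True)   # inner parameters
    mom = torch.zeros(8)
    meta_loss = 0.0
    for t in range(10):                      # short unroll of the inner optimization
        loss = inner_loss(w, A, b)
        grad, = torch.autograd.grad(loss, w, create_graph=True)
        mom = 0.9 * mom + grad
        w = w - opt_net(grad, mom)           # apply the learned update rule
        meta_loss = meta_loss + inner_loss(w, A, b)
    meta_opt.zero_grad()
    meta_loss.backward()                     # backprop through the unroll into opt_net
    meta_opt.step()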
In the second part of the presentation, we explore the continual pre-training of large language models. Taking a dense decoder-only transformer pre-trained on a large English-language corpus, we extend its capabilities to newer English data and to German data. We find that replaying previous data and carefully handling the learning rate schedule suffice to match the performance of full re-training. These findings are then extended to sparsely-gated Mixture-of-Experts (MoE) transformers, where maintaining a balanced routing load on previous data is a critical consideration for a deployable model.
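As a rough illustration only, the Python sketch below shows the two ingredients the abstract highlights for continual pre-training, mixing a small fraction of replayed old data into each batch and re-warming then re-decaying the learning rate, plus a simple routing-load check for the MoE case. The replay fraction, schedule constants, and helper names are assumptions for illustration, not the settings used in the thesis.

# Minimal sketch (illustrative only) of replay mixing, learning-rate re-warming,
# and a routing-load check for continual pre-training.
import math
import random

REPLAY_FRACTION = 0.05        # assumed: 5% of each batch drawn from the old corpus
WARMUP_STEPS = 1_000
TOTAL_STEPS = 100_000
MAX_LR, MIN_LR = 3e-4, 3e-5   # assumed schedule endpoints

def lr_at(step):
    """Linear re-warmup followed by cosine decay, restarted for the new data."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

def mixed_batch(new_corpus, old_corpus, batch_size):
    """Draw a batch that replays a small fraction of previously seen data."""
    n_replay = int(REPLAY_FRACTION * batch_size)
    return random.sample(old_corpus, n_replay) + random.sample(new_corpus, batch_size - n_replay)

def routing_load(expert_indices, num_experts):
    """Fraction of tokens routed to each expert; a deployable MoE should keep
    this roughly uniform on previous data after continual pre-training."""
    counts = [0] * num_experts
    for e in expert_indices:
        counts[e] += 1
    total = max(1, len(expert_indices))
    return [c / total for c in counts]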