Predoc III Presentation - Benjamin Thérien
Hello everyone,
You are invited to attend Benjamin Thérien's Predoc III exam on Wednesday, April 2, at 12:30 PM.
Title: Continual and Meta-learned Algorithms for Foundation Model Pre-training
Date: Wednesday, April 2, 12:30 PM
Location: Mila Auditorium 2
Jury
President | Aaron Courville
Supervisor | Irina Rish
Co-supervisor | Eugene Belilovsky
Member | Sarath Chandar
Abstract
Scaling laws reliably predict that language modeling performance improves as we scale training data and model parameters. Under this guidance, practitioners have scaled models to enormous sizes and pre-trained them on vast amounts of data. These foundation models are then post-trained to elicit the most desirable behaviors learned during pre-training, leading, most recently, to exceptional mathematical reasoning. Training the largest foundation models already requires massive processing power, and given the need to keep models up to date and to continue improving them, there is little reason to believe that the demand for large-scale pre-training will stagnate in the near future. As such, techniques that can reduce pre-training costs are of interest to our community. In this thesis, we take two approaches to reducing pre-training costs: improving optimization algorithms for pre-training through meta-learning, and updating pre-trained models at relatively low cost through continual pre-training.
In the first part of the presentation, we explore the idea of meta-learning neural network optimizers under the maximal update parameterization. By meta-learning the optimizer on tasks under this parameterization, we not only enable generalization to larger versions of the meta-training tasks, but also to much larger versions of unseen tasks. These findings address a fundamental flaw of the largest learned optimizers trained today: their failure to generalize to tasks far larger in total parameter count than those seen during meta-training. Our research therefore marks an important step towards leveraging data-driven optimization algorithms for pre-training.
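For readers unfamiliar with learned optimizers, the short Python sketch below illustrates the general idea only: a tiny per-parameter optimizer network is meta-trained by backpropagating through short unrolls of an inner optimization on a toy quadratic task. It is not the method presented in the thesis; the maximal update parameterization and language-model meta-training tasks are omitted, and every name, task, and constant here is an illustrative assumption.

# Minimal sketch (illustrative only) of meta-training a learned optimizer
# by unrolled differentiation on a toy quadratic task.
import torch
import torch.nn as nn

class LearnedOptimizer(nn.Module):
    """Tiny MLP mapping per-parameter features (gradient, momentum) to an update."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, grad, mom):
        feats = torch.stack([grad, mom], dim=-1)   # per-parameter features, shape (..., 2)
        return 0.01 * self.net(feats).squeeze(-1)  # small output scale keeps early updates stable

def inner_loss(w, A, b):
    # Toy inner "task": minimize ||A w - b||^2
    return ((A @ w - b) ** 2).mean()

opt_net = LearnedOptimizer()
meta_opt = torch.optim.Adam(opt_net.parameters(), lr=1e-3)

for meta_step in range(200):                 # meta-training loop over sampled tasks
    A, b = torch.randn(16, 8), torch.randn(16)
    w = torch.zeros(8, requires_grad=True)   # inner parameters
    mom = torch.zeros(8)
    meta_loss = 0.0
    for t in range(10):                      # short unroll of the inner optimization
        loss = inner_loss(w, A, b)
        grad, = torch.autograd.grad(loss, w, create_graph=True)
        mom = 0.9 * mom + grad
        w = w - opt_net(grad, mom)           # apply the learned update rule
        meta_loss = meta_loss + inner_loss(w, A, b)
    meta_opt.zero_grad()
    meta_loss.backward()                     # backprop through the unroll into opt_net
    meta_opt.step()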
In the second part of the presentation, we explore the continual pre-training of large language models. Taking a dense decoder-only transformer pre-trained on a large English-language corpus, we extend its capabilities to newer English data and to German data. We find that replaying previous data and carefully handling the learning rate schedule suffice to match the performance of full re-training. These findings are then extended to sparsely-gated Mixture-of-Experts (MoE) transformers, where maintaining a balanced routing load on previous data is a critical consideration for a deployable model.
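As a rough illustration only, the Python sketch below shows the two ingredients the abstract highlights for continual pre-training, mixing a small fraction of replayed old data into each batch and re-warming then re-decaying the learning rate, plus a simple routing-load check for the MoE case. The replay fraction, schedule constants, and helper names are assumptions for illustration, not the settings used in the thesis.

# Minimal sketch (illustrative only) of replay mixing, learning-rate re-warming,
# and a routing-load check for continual pre-training.
import math
import random

REPLAY_FRACTION = 0.05        # assumed: 5% of each batch drawn from the old corpus
WARMUP_STEPS = 1_000
TOTAL_STEPS = 100_000
MAX_LR, MIN_LR = 3e-4, 3e-5   # assumed schedule endpoints

def lr_at(step):
    """Linear re-warmup followed by cosine decay, restarted for the new data."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

def mixed_batch(new_corpus, old_corpus, batch_size):
    """Draw a batch that replays a small fraction of previously seen data."""
    n_replay = int(REPLAY_FRACTION * batch_size)
    return random.sample(old_corpus, n_replay) + random.sample(new_corpus, batch_size - n_replay)

def routing_load(expert_indices, num_experts):
    """Fraction of tokens routed to each expert; a deployable MoE should keep
    this roughly uniform on previous data after continual pre-training."""
    counts = [0] * num_experts
    for e in expert_indices:
        counts[e] += 1
    total = max(1, len(expert_indices))
    return [c / total for c in counts]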