Nikolaus Howe's Predoc III Presentation
Dear all / Bonjour à tous,
We are happy to invite you to Nikolaus Howe's Predoc III defense on Thursday, August 24th, at 1pm (Hybrid event).
You are cordially invited to the presentation of Nikolaus Howe's research proposal on Thursday, August 24th at 1:00 pm (hybrid presentation).
Title: RL Safety in the Era of Foundation Models
Date: August 24th, 2023, 1:00 pm to 3:00 pm EST
Location: Auditorium 1 - 6650 Rue Saint Urbain
Video call link: https://meet.google.com/bzc-xbhs-qzx
Or dial: (US) +1 904-300-0332 PIN: 633 671 341#
Jury
President: Courville, Aaron
Research supervisor: Bacon, Pierre-Luc
Member: Bengio, Yoshua
Abstract
As artificial intelligence (AI) systems are increasingly deployed across the economy and society, it is critical that they remain safely aligned with human values and intent. This thesis proposal discusses several contemporary challenges in AI safety and alignment, with a particular focus on robustness in the context of foundation models and reinforcement learning (RL).
At present, the best practice for aligning foundation models with human preferences is reinforcement learning from human feedback (RLHF), in which a pretrained model is optimized against a reward model learned from those preferences. Ensuring that the reward model robustly captures human intent remains an open problem; a common failure mode of RLHF is that the fine-tuned model achieves a high score under the reward model yet performs poorly on the actual desired task. The first research direction presented in this proposal provides the first formal definition and characterization of a generalization of this phenomenon, known as *reward hacking*.
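To make this failure mode concrete, here is a minimal toy sketch; it is not drawn from the thesis, and both reward functions are hypothetical stand-ins for a true objective and a misspecified learned reward model.

```python
# Toy sketch (hypothetical, not from the thesis): "reward hacking" as the gap
# between a learned proxy reward and the true reward under optimization.
import numpy as np

def true_reward(x):
    # Stand-in for human intent: prefer x close to 1.
    return -(x - 1.0) ** 2

def proxy_reward(x):
    # Stand-in for a learned reward model: accurate for small x, but
    # mis-extrapolates for large x.
    return -(x - 1.0) ** 2 + 1.2 * x ** 2

# "Policy optimization" reduced to a grid search over a single parameter x.
xs = np.linspace(-3.0, 5.0, 1001)
x_true = xs[np.argmax(true_reward(xs))]
x_proxy = xs[np.argmax(proxy_reward(xs))]

print(f"optimize true reward:  x = {x_true:.2f}, true reward = {true_reward(x_true):.2f}")
print(f"optimize proxy reward: x = {x_proxy:.2f}, true reward = {true_reward(x_proxy):.2f}")
# The proxy-optimal x scores highly under the proxy but poorly under the
# true reward: the failure mode described in the abstract.
```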
The second contribution discussed in this proposal, which represents ongoing work, studies robustness more broadly, asking how model scale and task complexity can be expected to affect robustness in the presence of an AI adversary. We explore two directions. The first is a multi-agent RL setting in which an adversary is trained against a frozen victim in order to find exploits; so far, we have reproduced scaling laws similar to a previous result relating the time needed to reach mastery to board size in Hex, a two-player board game. In the second direction, we aim to establish scaling laws for robustness in a foundation model setting. Specifically, we consider a collection of large language models trained on a binary classification task. After training, we probe each model for vulnerabilities: for shorter sequences we use a brute-force technique, and for longer sequences we will use state-of-the-art red-teaming techniques, including soft- and hard-prompting approaches.
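As an illustration of the brute-force step, here is a hedged sketch under stated assumptions: the classifier interface and the even-parity language below are illustrative stand-ins, not the models or task used in the thesis.

```python
# Hedged sketch of brute-force vulnerability search over short sequences.
# Assumptions: `model_predict` is a placeholder for a trained classifier,
# and the even-parity language is an illustrative stand-in for the task.
from itertools import product

def ground_truth(s: str) -> bool:
    # Example regular language: binary strings with an even number of 1s.
    return s.count("1") % 2 == 0

def model_predict(s: str) -> bool:
    # Placeholder classifier with a deliberate flaw, so the search has
    # something to find; in practice this would be a trained language model.
    return s.count("1") % 2 == 0 or s.endswith("111")

def brute_force_exploits(max_len: int) -> list[str]:
    # Enumerate every string up to max_len and keep the misclassified ones.
    exploits = []
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if model_predict(s) != ground_truth(s):
                exploits.append(s)
    return exploits

print(brute_force_exploits(max_len=8)[:5])  # a few adversarial inputs
```

For longer sequences this exhaustive enumeration becomes infeasible (the number of candidate inputs grows exponentially with length), which is why the abstract turns to red-teaming approaches in that regime.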
Finally, we discuss future work: after establishing initial results in a regular language setting, we intend to explore a more complex setting, possibly involving sequential interactions with a user. We will then study scaling of the "balance of power" between victim and adversary across model sizes, input lengths, and task complexities, alongside the effectiveness of (iterated) adversarial training.