Predoc III Presentation - David Dobre
Dear all,
We are happy to invite you to the Predoc III evaluation of David Dobre on December 5th at 9:30 am (hybrid mode).
Title: Towards Safe, Robust, and Transparent Language Models
Date: December 5th at 9:30 am
Location: Auditorium 2 (Mila 6650)
Jury
President | Aaron Courville
Director | Gauthier Gidel
Regular Member | Bang Liu
Abstract
Recent advances in Large Language Models (LLMs) have led to powerful AI
assistants capable of complex tasks. However, as these systems become more
prevalent, concerns about their potential misuse grow. Despite being
trained for alignment with human values, LLMs remain vulnerable to
jailbreak attacks that circumvent their safety measures. In this
presentation, we discuss the current state of jailbreaking and defending
LLMs, as well as our previous work showcasing embedding space attacks as an
important threat model for open-source LLMs (a minimal sketch follows the
abstract). We find that embedding space attacks can circumvent model
alignment more efficiently than discrete attacks or fine-tuning, and we show
their utility in the context of unlearning, where such attacks can extract
supposedly deleted information from unlearned models. We will also present
two future research directions for improving LLM transparency and safety.
The first is to fine-tune an LLM to assess the safety of its own outputs by
generating a special "red-flag" token when a response is harmful,
effectively serving as its own implicit judge (also sketched below). Beyond
the efficiency gain of not running inference with a second judge model, this
approach allows the model to maintain its utility and helpfulness while
giving a clear indicator of whether it has been jailbroken. The second
direction is to leverage advances in AI interpretability to build more
nuanced metrics for quantifying jailbreak success from internal model
representations, and to design representation-based objectives for
jailbreak attacks (a third sketch below illustrates one such metric).
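
Below is a minimal sketch of the embedding space attack idea, in PyTorch
with Hugging Face transformers. The model name, prompt, target string,
suffix length, learning rate, and step count are all illustrative
placeholders rather than the settings from the work being presented; the
point is only that the adversarial suffix is optimized directly in
continuous embedding space, with no constraint that it decode to tokens.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder for any open-weights LLM
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    prompt_ids = tok("<placeholder harmful request>", return_tensors="pt").input_ids
    target_ids = tok(" Sure, here is how to", return_tensors="pt").input_ids

    embed = model.get_input_embeddings()
    prompt_emb = embed(prompt_ids).detach()
    target_emb = embed(target_ids).detach()

    # Continuous adversarial suffix, optimized directly in embedding space;
    # unlike discrete attacks, it need not correspond to any token sequence.
    d = prompt_emb.shape[-1]
    adv = (0.01 * torch.randn(1, 20, d)).requires_grad_()
    opt = torch.optim.Adam([adv], lr=1e-3)

    for _ in range(200):
        inputs = torch.cat([prompt_emb, adv, target_emb], dim=1)
        logits = model(inputs_embeds=inputs).logits
        n = target_ids.shape[1]
        # Each target token is predicted from the position just before it.
        pred = logits[:, -n - 1:-1, :].reshape(-1, logits.shape[-1])
        loss = torch.nn.functional.cross_entropy(pred, target_ids.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

Success can then be checked by decoding greedily from the prompt plus the
optimized suffix and seeing whether the model produces the targeted
completion instead of a refusal.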
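The "red-flag" token direction can likewise be sketched at inference time.
Everything below is an assumption for illustration: the token string
"<red-flag>", the probability threshold, and the untrained placeholder
model (in the actual proposal, the model would first be fine-tuned to emit
this token when its response is harmful).

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder for a model fine-tuned to emit <red-flag>
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Hypothetical reserved token; a real setup would add it before fine-tuning.
    tok.add_special_tokens({"additional_special_tokens": ["<red-flag>"]})
    model.resize_token_embeddings(len(tok))
    rf_id = tok.convert_tokens_to_ids("<red-flag>")

    @torch.no_grad()
    def flagged(prompt: str, max_new_tokens: int = 64,
                threshold: float = 0.5) -> bool:
        """Greedy decoding; flag if the model ever puts high mass on <red-flag>."""
        ids = tok(prompt, return_tensors="pt").input_ids
        for _ in range(max_new_tokens):
            probs = torch.softmax(model(ids).logits[:, -1, :], dim=-1)
            if probs[0, rf_id].item() > threshold:
                return True  # the model judged its own continuation harmful
            ids = torch.cat([ids, probs.argmax(-1, keepdim=True)], dim=1)
        return False

Because the flag is a single extra token in the model's own vocabulary,
checking it adds essentially no cost over ordinary decoding, which is the
efficiency argument made in the abstract.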
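Finally, one way to make the representation-based metric concrete is the
"refusal direction" recipe from the interpretability literature: take the
difference of mean hidden states between prompts the model refuses and
prompts it answers, then score a jailbreak by its projection onto that
direction. To be clear, this is a known technique used here for
illustration, not necessarily the metric the talk will propose; the layer
index, prompt sets, and scoring rule below are all assumptions.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, output_hidden_states=True)

    LAYER = 6  # illustrative layer choice

    @torch.no_grad()
    def last_hidden(prompt: str) -> torch.Tensor:
        ids = tok(prompt, return_tensors="pt").input_ids
        return model(ids).hidden_states[LAYER][0, -1]  # final-position state

    refused = ["<prompt the model refuses>"]    # stand-in prompt sets
    answered = ["<prompt the model answers>"]
    direction = (torch.stack([last_hidden(p) for p in refused]).mean(0)
                 - torch.stack([last_hidden(p) for p in answered]).mean(0))
    direction = direction / direction.norm()

    @torch.no_grad()
    def jailbreak_score(prompt: str) -> float:
        # A smaller projection onto the refusal direction suggests the attack
        # moved the model's internal state away from refusal.
        return last_hidden(prompt).dot(direction).item()

The same projection is differentiable with respect to the input embeddings,
so it can also serve as an objective for the attacks themselves, which is
the second half of that research direction.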