Predoc III Presentation - Moksh Jain
Dear all / Bonjour à tous,
We are happy to invite you to the Predoc III evaluation of Moksh Jain on
November 28th at 9:30 am (hybrid mode).
Vous êtes cordialement invité.e.s à l'évaluation du Predoc III de Moksh
Jain, le 28 novembre à 9h30 (mode hybride).
Title: Scientific Discovery Through a Probabilistic Lens
Date: November 28th at 9:30 am
Location: Auditorium 2, MILA + Zoom link
Jury
President | Pierre-Luc Bacon
Director | Yoshua Bengio
Regular Member | Simon Lacoste-Julien
Abstract
The ability to produce novel scientific knowledge by reasoning about
empirical observations and experiments is a cornerstone of human cognition.
Recent advances in the scale and sophistication of experimental apparatus,
together with pressing societal needs, call for computational tools that can
accelerate scientific discovery through effective and scalable hypothesis
generation and experimental design. A key challenge is developing algorithms
that can handle the uncertainty inherent in scientific discovery, which
arises from underspecified objectives and the large space of possible
experiments and hypotheses. We tackle this problem from a probabilistic
perspective and develop principled probabilistic tools that account for this
uncertainty in order to augment key steps of the scientific discovery
process.
We first consider the problem of experimental design in drug discovery –
specifically, the problem of generating large candidate libraries of
molecules to screen for target properties, where diversity of candidates is
as critical as their effectiveness. Through a probabilistic lens,
generating diverse candidates can be framed as sampling from a target
distribution defined by the utility of the candidate (e.g., acquisition
functions). We use generative flow networks (GFlowNets) to train a sampler
for generating candidates and incorporate it into an active learning loop.
Our approach significantly improves both the diversity and effectiveness of
generated candidates, as validated by preliminary wet-lab evaluations.
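To make this framing concrete, the toy Python sketch below draws a batch of
candidates with probability proportional to a utility score. The candidate
space and the utility function are invented for illustration, and sampling
is done here by direct enumeration; in the actual approach the candidates
are molecules, the utility comes from an acquisition function, and a
GFlowNet amortizes the sampling.

    import random

    # Illustrative only: a made-up scalar candidate space and utility score.
    def utility(x):
        # Hypothetical acquisition value; higher means a more promising candidate.
        return 1.0 + 0.5 * (x - 0.5) ** 2

    candidates = [i / 99 for i in range(100)]   # stand-in for a molecule library
    weights = [utility(x) for x in candidates]  # unnormalized target distribution

    # One round of candidate generation: a diverse batch sampled in proportion
    # to utility, rather than only the single highest-utility candidate.
    batch = random.choices(candidates, weights=weights, k=16)
    print(sorted(round(x, 2) for x in batch))

In the active learning loop, such a batch would be screened, the surrogate
model and acquisition function updated, and the sampler retrained before the
next round.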
Next, we study the problem in the presence of multiple objectives and frame
it as sampling from a conditional distribution. With this framing, we
develop multi-objective GFlowNets and demonstrate their ability to generate
diverse Pareto-optimal candidates.
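As a rough sketch of the conditional framing (the two objectives and the
linear scalarization below are stand-ins, not the actual multi-objective
GFlowNet objective), conditioning the sampler on a preference vector w and
sampling in proportion to the scalarized reward yields candidates from
different regions of the trade-off:

    import random

    def r1(x):
        # First toy objective: favors small x.
        return 1.0 - x

    def r2(x):
        # Second toy objective: favors large x.
        return x

    def sample_conditional(w, k=8):
        # Sample candidates with probability proportional to the
        # preference-weighted reward w[0]*r1(x) + w[1]*r2(x),
        # i.e. a conditional target distribution.
        xs = [i / 99 for i in range(100)]
        weights = [w[0] * r1(x) + w[1] * r2(x) for x in xs]
        return random.choices(xs, weights=weights, k=k)

    # Sweeping the preference vector moves the sampled batch along the
    # trade-off between the two objectives.
    for w in [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]:
        print(w, sorted(round(x, 2) for x in sample_conditional(w)))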
A natural benefit of the probabilistic perspective is the ability to
seamlessly incorporate prior knowledge. Foundation models such as language
models and diffusion models trained on web-scale data capture knowledge
about the world that can serve as rich priors for designing experiments
and proposing novel hypotheses. However, performing inference with this
knowledge is challenging. To address this, we leverage GFlowNet objectives
to fine-tune language models and diffusion models and demonstrate the
effectiveness of our approach in solving intractable inference problems in
language modeling, such as infilling.
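As a toy illustration of the underlying inference problem (the miniature
bigram "language model" and the sequences below are invented for
illustration), infilling can be read as sampling from the posterior over a
missing span given its prefix and suffix. Here the posterior is computed by
exact enumeration; at the scale of a real language model this enumeration
becomes intractable, which is the gap the GFlowNet fine-tuning is meant to
close with amortized sampling.

    import itertools

    # A hand-made bigram "language model" over a tiny vocabulary.
    vocab = ["the", "cat", "dog", "sat", "ran"]
    bigram = {("the", "cat"): 0.4, ("the", "dog"): 0.4, ("cat", "sat"): 0.6,
              ("dog", "ran"): 0.6, ("cat", "ran"): 0.1, ("dog", "sat"): 0.1,
              ("sat", "down"): 1.0, ("ran", "down"): 1.0}

    def joint(tokens):
        # Joint probability of a token sequence under the bigram model.
        p = 1.0
        for a, b in zip(tokens, tokens[1:]):
            p *= bigram.get((a, b), 1e-6)
        return p

    # Infilling: the posterior over the missing middle is proportional to the
    # joint probability of prefix + middle + suffix.
    prefix, suffix = ["the"], ["down"]
    posterior = {m: joint(prefix + list(m) + suffix)
                 for m in itertools.product(vocab, repeat=2)}
    z = sum(posterior.values())
    for middle, p in sorted(posterior.items(), key=lambda kv: -kv[1])[:3]:
        print(middle, round(p / z, 3))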
As a future direction, we discuss how this probabilistic machinery can be
used to tackle scientific inverse problems with diffusion priors, such as
protein scaffolding. Finally, we outline how the probabilistic perspective
can be extended to systems that can generate novel hypotheses in the form
of conjectures in the domain of formal mathematics.