Predoc III Presentation by Pascal Junior Tikeng Notsawo
Hello everyone,
You are all cordially invited to attend the predoc III project presentation of Pascal Junior Tikeng Notsawo, on Thursday, September 4 at 2 p.m. (hybrid mode).
Title: Toward Understanding Grokking: The Interplay of Regularization, Data Structure, and Optimization Dynamics
Date: Thursday, September 4 at 2 p.m.
Location: Auditorium 1, MILA
Jury
| Role | Name |
| --- | --- |
| President | Gauthier Gidel |
| Supervisor | Irina Rish |
| Co-supervisor | Guillaume Rabusseau |
| Co-supervisor | Guillaume Dumas |
| Member | Ioannis Mitliagkas |
Abstract
The remarkable ability of deep learning models to generalize in over-parameterized regimes remains a central mystery in machine learning. A particularly challenging case of this problem is grokking, a phenomenon of delayed generalization that occurs after overfitting when optimizing artificial neural networks with gradient-based methods. This thesis presents a comprehensive study of grokking, exploring the conditions under which it occurs, its underlying mechanisms, and methods for its early prediction. We first extend the study of grokking beyond simple algorithmic tasks to the general setting of finite-dimensional algebras, revealing how the algebraic structure of a problem influences learning difficulty and the emergence of generalization. We then challenge the conventional understanding of grokking's reliance on $\ell_2$ regularization by demonstrating that the grokking step scales proportionally to $1/(\alpha \beta)$ when minimizing composite objectives of the form $f=g+\beta h$ using gradient descent with learning rate $\alpha$, where $g$ is the training error and $h$ is any appropriately chosen regularizer that enforces an inductive bias toward generalization (e.g., sparsity, low-rankness). We also show that the commonly used $\ell_2$-norm is not a reliable proxy for explaining grokking. Finally, to address the high computational cost of observing grokking, we propose a novel low-cost method for early prediction. We demonstrate that the spectral signature of the training loss in the early phases of optimization contains predictive signals whose characteristics correlate with a model's eventual ability to generalize. Together, these findings expand our understanding of grokking from a specific phenomenon to a universal dynamic in deep learning, driven by the interplay of optimization, regularization, and the intrinsic structure of the task.
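As a rough intuition for the $1/(\alpha \beta)$ scaling (a sketch of the heuristic, not the thesis's actual derivation): under gradient descent on the composite objective, the regularizer's contribution to each update is scaled by both the learning rate and the regularization strength,

```latex
\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)
             = \theta_t - \alpha \nabla g(\theta_t) - \alpha \beta \nabla h(\theta_t).
```

Once the model has overfit and the training error is essentially minimized ($\nabla g(\theta_t) \approx 0$), the remaining dynamics are approximately $\theta_{t+1} \approx \theta_t - \alpha \beta \nabla h(\theta_t)$, so the number of steps needed for $h$ to move the weights a fixed distance toward the generalizing solution scales inversely with the product $\alpha \beta$.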
We also outline a concrete roadmap for future research, including plans to apply these insights to the broader context of language modeling and the evaluation of adversarial robustness.