Institut de Mathématiques de Bordeaux

Abstract: I will discuss optimization methods that leverage the linear minimization oracle (LMO) over a norm ball and their application to training very large neural networks. In recent work, my coauthors and I proposed a new family of stochastic algorithms that use the LMO to adapt to the geometry of the problem. The resulting update rule unifies several existing optimization methods under a single framework, including spectral methods such as Muon, for which we prove rigorous convergence results. Furthermore, we propose an explicit choice of norm for deep architectures which, as a side benefit, guarantees that hyperparameters such as the learning rate transfer across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam.
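
For readers unfamiliar with the LMO, the following is a minimal illustrative sketch, not the speaker's implementation. The LMO over a norm ball of radius rho returns the point of the ball that minimizes the inner product with a given gradient; for the l-infinity ball this is a scaled negative sign vector, and for the Euclidean ball it is the scaled negative normalized gradient. The function names (`lmo_linf`, `lmo_l2`, `lmo_step`), the plain-list parameter representation, and the simple additive update are all assumptions made for illustration; the exact update rule, the use of momentum, and the spectral-norm case relevant to Muon are left to the talk.

```python
import math


def lmo_linf(grad, radius=1.0):
    """LMO over the l-infinity ball: argmin over max|x_i| <= radius of <grad, x>."""
    # Each coordinate moves independently to the boundary opposite the gradient sign.
    return [-radius * math.copysign(1.0, g) if g != 0 else 0.0 for g in grad]


def lmo_l2(grad, radius=1.0):
    """LMO over the Euclidean ball: argmin over ||x||_2 <= radius of <grad, x>."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # Any point of the ball is a minimizer; return the origin by convention.
        return [0.0] * len(grad)
    return [-radius * g / norm for g in grad]


def lmo_step(params, grad, lmo, lr, radius=1.0):
    """Hypothetical update: move the parameters along the LMO direction."""
    direction = lmo(grad, radius)
    return [p + lr * d for p, d in zip(params, direction)]


if __name__ == "__main__":
    params = [1.0, 1.0]
    grad = [3.0, 4.0]
    print(lmo_step(params, grad, lmo_linf, lr=0.5))
    print(lmo_step(params, grad, lmo_l2, lr=0.5))
```

Changing the norm changes only the `lmo` argument, which is the sense in which a single update rule can cover several methods: the l-infinity ball yields a sign-based step, while the Euclidean ball yields a normalized-gradient step.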

(Maths-IA) Training Neural Networks at Any Scale with Scion