Wilson Wu

114

Ambiguous out-of-distribution generalization on an algorithmic task

Introduction: It's now well known that simple neural network models often "grok" algorithmic tasks. That is, when trained for many epochs on a subset of the full input space, the model quickly attains perfect train accuracy and then, much later, near-perfect test accuracy. In the former phase, the model memorizes...

84 · Feb 13, 2025

The slingshot helps with learning

The slingshot effect is a late-stage training anomaly found in various adaptive gradient optimization methods. In particular, slingshots are present with AdamW, the optimizer most widely used for modern transformer training. The original slingshot paper observes that slingshots tend to occur alongside grokking, a phenomenon in which neural networks trained...

33 · Oct 31, 2024
