Review

It seems gradient descent methods haven't been using the relevant mathematical bounds so far. Google has released AutoBound as an open-source library.

Here is what I consider the money shot of the article (note that it's a log plot):

[Figure: Performance of SafeRate when used to train a single-hidden-layer neural network on a subset of the MNIST dataset, in the full-batch setting.]

Hopefully, they are just overfitting on MNIST. Otherwise, it pattern-matches to a huge advance. Their repo implies that with float64 this scales to larger neural networks. LLMs seem to reliably get new capabilities with lower loss, at least.

What do you think?

Here are the related technical details, quoted from the announcement (I sketch the two key ideas in code right after the quoted passages):

Optimizers that use upper bounds in this way are called majorization-minimization (MM) optimizers. Applied to one-dimensional logistic regression, AutoBound rederives an MM optimizer first published in 2009. Applied to more complex problems, AutoBound derives novel MM optimizers that would be difficult to derive by hand.

We can use a similar idea to take an existing optimizer such as Adam and convert it to a hyperparameter-free optimizer that is guaranteed to monotonically reduce the loss (in the full-batch setting). The resulting optimizer uses the same update direction as the original optimizer, but modifies the learning rate by minimizing a one-dimensional quadratic upper bound derived by AutoBound. We refer to the resulting meta-optimizer as SafeRate.

Using SafeRate, we can create more robust variants of existing optimizers, at the cost of a single additional forward pass that increases the wall time for each step by a small factor (about 2x slower in the example above).
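Two sketches of my own to make the quoted ideas concrete. Neither is from the announcement, and neither uses AutoBound's actual API; all function names here are mine.

First, the MM idea for 1-D logistic regression, using the classic hand-derived curvature bound (the logistic loss's second derivative in w is at most mean(x**2)/4) in place of an AutoBound-derived bound. This may differ in detail from the 2009 optimizer referenced above:

```python
import jax
import jax.numpy as jnp

def logistic_loss(w, x, y):
    # Binary labels y in {0, 1}; the model probability is sigmoid(w * x).
    logits = w * x
    return jnp.mean(jax.nn.softplus(logits) - y * logits)

def mm_step(w, x, y):
    # The second derivative of the loss in w is at most c = mean(x**2) / 4,
    # so q(v) = L(w) + L'(w) (v - w) + (c / 2) (v - w)**2 is a global upper
    # bound on L that touches it at v = w.  Minimizing q gives the update
    # below, which cannot increase the full-batch loss.
    c = jnp.mean(x ** 2) / 4.0
    g = jax.grad(logistic_loss)(w, x, y)
    return w - g / c
```

Second, the SafeRate idea as I understand it: take the update direction d from a base optimizer such as Adam, obtain a quadratic upper bound on the one-dimensional function of eta given by L(theta - eta * d) over a trust region [0, eta_max], and pick the eta that minimizes the bound in closed form. Here `quadratic_upper_bound` is a hypothetical stand-in for whatever AutoBound returns; I have not checked the library's real interface:

```python
def safe_rate_step(theta, d, loss_fn, quadratic_upper_bound, eta_max=1.0):
    # Hypothetical helper: returns (b, c) such that, for all eta in [0, eta_max],
    #   loss_fn(theta - eta * d) <= loss_fn(theta) + b * eta + c * eta ** 2.
    b, c = quadratic_upper_bound(loss_fn, theta, d, eta_max)
    b, c = float(b), float(c)

    # Closed-form minimizer of the quadratic bound over the trust region.
    if c > 0.0:
        eta = min(max(-b / (2.0 * c), 0.0), eta_max)
    else:
        eta = eta_max

    # The bound equals the loss at eta = 0, so the chosen step cannot
    # increase the full-batch loss.
    return theta - eta * d
```

The learning rate here is computed rather than tuned, which is what makes the resulting optimizer hyperparameter-free in the sense the announcement describes.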

This seems novel for neural network training, or am I missing something that Bayesian neural net people have already been doing?

Comments

These kinds of 'twist on a known optimizer' papers are pretty common, and they mostly don't amount to much. E.g., the only difference between Adam and "SafeRate[Adam direction]" is that they used their second-order method to automatically tune the learning rate of the Adam optimizer. Such automatic hyperparameter tuning has been a thing for a long time. E.g., here's a paper from ~30 years ago.

Also note that Adam pretty much keeps up with SafeRate in the above plot until the loss drops to an extremely low value, very far beyond what any plausible AGI training run will reach. SafeRate's advantage isn't supposed to be 'make loss go down harder'; it's supposed to be 'more stable optimization process', which is exactly what you see in the plot above.

That's not to say SafeRate is worthless. The fact that they can do second-order hyperparameter tuning with only a second forward pass, rather than another pair of forward and backward passes, is somewhat interesting. It may also make large language model training more stable, which I understand is a real issue when tuning such training processes. However, it's extremely unlikely IMO to be some "multiple OOM jump" in training efficiency.

"in the full-batch setting."

uh, yeah, no shit Adam hits a floor on the loss in this context. The entire point of Adam is to keep a running estimate of the gradients' second moment and scale the update so it takes constant-ish step sizes. What this means in the full-batch setting is that once Adam gets close to a local minimum, it will just oscillate around that minimum, never going further down, because it insists on dividing the gradient by its running root-mean-square. None of this matters for networks of practical size, because they never actually get close to anything like a local minimum.
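For reference, the standard Adam update is (with \hat m_t and \hat v_t the bias-corrected running moments of the gradient g_t):

```latex
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \qquad
\theta_{t+1} = \theta_t - \alpha \, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}
```

Once the full-batch gradient is small but keeps pointing in the same direction, \hat m_t / \sqrt{\hat v_t} sits near plus or minus one, so the step size stays near \alpha no matter how close \theta_t is to the minimum, which is exactly the oscillation described above.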

"Hopefully, they are just overfitting on MNIST. Otherwise, it pattern-matches to a huge advance."

famous words

(Note: I hid this post from logged-out users since it seemed capabilities-y.)

oh nice! I'm not sure that requiring login is enough to make me feel comfy, but it's certainly better than nothing. A karma threshold or something might make sense?

What is the purpose, beyond mere symbolism, of hiding this post from logged-out users when the relevant data is available, in far more detail, on Google's official AI blog?

I just don't want to be the ones helping things like this go viral. I would post more news here if I had a solid sense of who was benefiting from my news-gathering. I'd like to be able to make posts visible only to some specific group; I still wouldn't be posting anything not already public, and my taste is somewhat iffy, but for related reasons I haven't done more newsposts of this kind than I already have.

Symbolism is coordination. Not contributing to destroying the world with your own hands, even if you can't stop others from doing it, is a good norm. It works through iterations of doing concerning things at least a little bit less than others do.

Obviously fine. I posted here to get something better than my single point estimate of what's up with this thing.