These kinds of 'twist on known optimizers' papers are pretty common, and they mostly don't amount to much. E.g., the only difference between Adam and "SafeRate[Adam direction]" is that they used their second-order method to automatically tune the learning rate of the Adam optimizer. Automatic hyperparameter tuning like that has been a thing for a long time. E.g., here's a paper from ~30 years ago.
Also note that Adam pretty much keeps up with SafeRate in the above plot until the loss drops to ~, which is extremely low and very far beyond what any plausible AGI training run will reach. SafeRate's advantage isn't supposed to be 'make the loss go down harder'; it's supposed to be 'a more stable optimization process', which is exactly what you see in the plot above.
That's not to say SafeRate is worthless. The fact that they can do second-order hyperparameter tuning with only a second forward pass, rather than another full forward-and-backward pass, is somewhat interesting. It may also make large language model training more stable, which I understand to be a real pain point when tuning such training runs. However, it's extremely unlikely, IMO, to be some "multiple-OOM jump" in training efficiency.
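To make the "second-order tuning from one extra forward pass" point concrete, here's a minimal sketch in plain NumPy. This is my own illustration, not the paper's SafeRate algorithm: SafeRate uses AutoBound to compute a certified quadratic upper bound on the loss along the update direction, whereas this just fits an interpolating quadratic from the current loss, the directional derivative you already have from the backward pass, and one probe evaluation. The names (`tuned_step`, `loss_fn`) and the probe step `eta0` are assumptions for the sketch.

```python
# Hedged sketch (my own illustration, not the paper's SafeRate algorithm):
# a second-order step-size choice along a fixed update direction d, using only
# one extra forward pass. We already know loss(w) and the gradient g from the
# usual forward/backward pass; one more forward pass at a probe step eta0 pins
# down a quadratic model in eta, whose minimizer we use as the learning rate.
import numpy as np

def tuned_step(loss_fn, w, g, d, eta0=0.1, eta_max=1.0):
    """Pick a step size along direction d (e.g. the Adam update direction)."""
    f0 = loss_fn(w)                    # already computed in the main pass
    slope = float(np.dot(g, -d))       # directional derivative along -d
    f1 = loss_fn(w - eta0 * d)         # the single extra forward pass
    # Fit f(w - eta*d) ~= f0 + slope*eta + 0.5*curv*eta**2 through (eta0, f1).
    curv = 2.0 * (f1 - f0 - slope * eta0) / eta0**2
    if curv > 0 and slope < 0:
        eta = min(-slope / curv, eta_max)  # minimizer of the quadratic model
    else:
        eta = eta0                         # fall back to the probe step
    return w - eta * d

# Toy usage: quadratic loss, with the gradient itself as the direction d.
loss = lambda w: 0.5 * np.sum(w**2)
w = np.array([3.0, -2.0])
for _ in range(5):
    g = w                  # gradient of the toy loss
    w = tuned_step(loss, w, g, d=g)
print(loss(w))             # drops quickly with no hand-tuned learning rate
```

The point is just that the per-step overhead is one forward pass, not a second backward pass; the paper's contribution is getting a safe (bounding, rather than interpolating) quadratic out of that one pass.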
in the full-batch setting.
uh, yeah, no shit Adam hits a floor on the loss in this context. The entire point of Adam is to keep a running estimate of the gradient's second moment and rescale the update so it takes constant-ish step sizes. What this means in the full-batch setting is that once Adam gets close to a local minimum, it will just oscillate around that minimum, never going further down, because it insists on dividing the step by the root-mean-square gradient. None of this matters for networks of practical size, because they never actually get close to anything like a local minimum.
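A quick toy demonstration of that oscillation, assuming a textbook Adam update on a 1-D quadratic with deterministic (full-batch) gradients; the hyperparameters and the comparison against plain gradient descent are my own choices for illustration:

```python
# Hedged illustration (not from the paper): full-batch Adam on f(w) = 0.5*w**2.
# Because Adam divides by sqrt(v), a running second moment of the gradient,
# its step size stays roughly lr even as the gradient shrinks, so it ends up
# oscillating around the minimum instead of settling into it.
import numpy as np

def grad(w):          # d/dw of f(w) = 0.5 * w**2
    return w

def adam(w, steps=2000, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)      # bias-corrected first moment
        v_hat = v / (1 - b2 ** t)      # bias-corrected second moment
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

def gd(w, steps=2000, lr=0.1):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

print("Adam final loss:", 0.5 * adam(1.0) ** 2)  # plateaus, roughly at the lr**2 scale
print("GD   final loss:", 0.5 * gd(1.0) ** 2)    # keeps shrinking geometrically toward 0
```

With a decaying learning rate, or with the stochastic gradient noise of a realistic training run, this floor is not what limits Adam in practice; it only shows up cleanly in the full-batch, near-minimum regime the plot is probing.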
Hopefully, they are just overfitting on MNIST. Otherwise, it pattern-matches to a huge advance.
famous words
oh nice! I'm not sure that requiring login is enough to make me feel comfy, but it's certainly better than nothing. A karma threshold or something might make sense?
What is the purpose, beyond mere symbolism, of hiding this post from logged-out users when the relevant data is available, in far more detail, on Google's official AI blog?
I just don't want to be one of the people helping things like this go viral. I would post more news here if I had a solid sense of who was benefiting from my news-gathering. I'd like to be able to make posts visible only to some specific group; I still wouldn't be posting anything not already public, and my taste is somewhat iffy, but it's for related reasons that I haven't done more newsposts of this kind than I have.
Symbolism is coordination. Not contributing to destroying the world with your own hands, even if you can't stop others from doing it, is a good norm. The norm builds up through iterations of doing concerning things at least a little bit less than others do.
Obviously fine. I posted here to get something better than my single point estimate of what's up with this thing.
It seems like gradient descent methods haven't been using the relevant mathematical bounds so far. Google has released AutoBound as an open-source library.
Here is what I consider a money shot of the article (notice it's a log-plot):
Hopefully, they are just overfitting on MNIST. Otherwise, it pattern-matches to a huge advance. Their repo implies that with float64 this scales to larger neural networks. LLMs seem to reliably get new capabilities with lower loss, at least.
What do you think?
Here are related technical details:
This seems novel for neural network training, or am I missing something that Bayesian neural net people have already been doing?