Architecture-aware optimisation: train ImageNet and more without hyperparameters
A deep learning system is composed of many interrelated components: architecture, data, loss function and gradients. There is structure in how these components interact, yet the most popular optimisers (e.g. Adam and SGD) do not exploit this information. This means there are leftover degrees of freedom...
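In practice, the main leftover degree of freedom is the learning rate. Here is a minimal sketch of the familiar routine it forces on you: sweep the learning rate and keep the best run, then redo the sweep whenever the architecture, dataset or batch size changes. The tiny MLP and random data are just stand-ins for illustration.

```python
import torch
import torch.nn as nn

# Stand-ins for "architecture" and "data": a small MLP on synthetic inputs.
def make_model():
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

x, y = torch.randn(512, 32), torch.randint(0, 10, (512,))
loss_fn = nn.CrossEntropyLoss()

# The optimiser knows nothing about the architecture, so the learning rate
# is a free parameter that has to be re-swept whenever the model, dataset
# or batch size changes.
best = (None, float("inf"))
for lr in [1e-3, 1e-2, 1e-1, 1.0]:
    model = make_model()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(100):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    if loss.item() < best[1]:
        best = (lr, loss.item())
print(f"best lr: {best[0]}, final loss: {best[1]:.3f}")
```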
AGD can train any architecture, dataset and batch size combination (as far as we have tested) out of the box. I would argue this is a qualitative change from current methods, where you have to find the right learning rate for every batch size, architecture and dataset combination in order to converge in optimal or near-optimal time. I think this is a reasonable interpretation of "train ImageNet without hyperparameters". That said, there is a stronger sense of "hyperparameter-free" in which the optimal batch size and architecture size would also be determined, fixing the compute-optimal scaling. And an even stronger sense still, in which the architecture type itself is selected.
In other words, we have the following hierarchy of lack...
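To make the weakest sense concrete (the one AGD delivers today), here is a rough sketch of the flavour of update involved: each weight matrix gets a step whose size is set by its own dimensions, the network depth and a per-layer gradient normalisation, rather than by a tuned learning rate. This is only an illustration, not the exact AGD algorithm; in particular, real AGD also computes its global step size automatically from a gradient summary, whereas the `gain` constant and the exact scaling below are simplifications of mine.

```python
import math
import torch
import torch.nn as nn

@torch.no_grad()
def agd_like_step(model: nn.Module, gain: float = 1.0):
    """One layer-wise, architecture-aware update in the spirit of AGD.

    Simplified sketch: the step for each weight matrix is normalised per
    layer and scaled by the layer's width ratio and the network depth.
    (Actual AGD also sets the global step size automatically from a
    gradient summary; here it is a fixed constant for illustration.)
    """
    layers = [m for m in model.modules() if isinstance(m, nn.Linear)]
    depth = len(layers)
    for layer in layers:
        W, g = layer.weight, layer.weight.grad
        if g is None:
            continue
        d_out, d_in = W.shape
        scale = math.sqrt(d_out / d_in)   # dimension-aware scaling
        g_norm = g.norm()
        if g_norm > 0:
            # Normalised, depth-divided step: the magnitude comes from the
            # architecture (depth, widths), not from a tuned learning rate.
            W -= (gain / depth) * scale * g / g_norm

# usage: loss.backward(); agd_like_step(model); model.zero_grad()
```

The point of the sketch is the shape of the rule, not the constants: because the update is expressed in terms of quantities the architecture already provides, there is nothing left for the practitioner to sweep.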