Architecture-aware optimisation: train ImageNet and more without hyperparameters
A deep learning system is composed of many interrelated components: architecture, data, loss function and gradients. There is structure in how these components interact, yet the most popular optimisers (e.g. Adam and SGD) do not exploit this information. This means there are leftover degrees of freedom...
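In practice, the main leftover degree of freedom is the learning rate. Here is a minimal sketch of the familiar routine it forces on you: sweep the learning rate and keep the best run, then redo the sweep whenever the architecture, dataset or batch size changes. The tiny MLP and random data are just stand-ins for illustration.

```python
import torch
import torch.nn as nn

# Stand-ins for "architecture" and "data": a small MLP on synthetic inputs.
def make_model():
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

x, y = torch.randn(512, 32), torch.randint(0, 10, (512,))
loss_fn = nn.CrossEntropyLoss()

# The optimiser knows nothing about the architecture, so the learning rate
# is a free parameter that has to be re-swept whenever the model, dataset
# or batch size changes.
best = (None, float("inf"))
for lr in [1e-3, 1e-2, 1e-1, 1.0]:
    model = make_model()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(100):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    if loss.item() < best[1]:
        best = (lr, loss.item())
print(f"best lr: {best[0]}, final loss: {best[1]:.3f}")
```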
AGD can train any architecture, dataset and batch size combination (as far as we have tested) out of the box. I would argue this is a qualitative change from current methods, where you have to find the right learning rate for every batch size, architecture and dataset combination in order to converge in optimal or near-optimal time. I think this is a reasonable interpretation of "train ImageNet without hyperparameters". That said, there is a stronger sense of "hyperparameter-free" in which the optimal batch size and architecture size would also be determined, fixing the compute-optimal scaling. And an even stronger sense still, in which the architecture type itself is selected.
In other words, we have the following hierarchy of lack...
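To make the weakest sense concrete (the one AGD delivers today), here is a rough sketch of the flavour of update involved: each weight matrix gets a step whose size is set by its own dimensions, the network depth and a per-layer gradient normalisation, rather than by a tuned learning rate. This is only an illustration, not the exact AGD algorithm; in particular, real AGD also computes its global step size automatically from a gradient summary, whereas the `gain` constant and the exact scaling below are simplifications of mine.

```python
import math
import torch
import torch.nn as nn

@torch.no_grad()
def agd_like_step(model: nn.Module, gain: float = 1.0):
    """One layer-wise, architecture-aware update in the spirit of AGD.

    Simplified sketch: the step for each weight matrix is normalised per
    layer and scaled by the layer's width ratio and the network depth.
    (Actual AGD also sets the global step size automatically from a
    gradient summary; here it is a fixed constant for illustration.)
    """
    layers = [m for m in model.modules() if isinstance(m, nn.Linear)]
    depth = len(layers)
    for layer in layers:
        W, g = layer.weight, layer.weight.grad
        if g is None:
            continue
        d_out, d_in = W.shape
        scale = math.sqrt(d_out / d_in)   # dimension-aware scaling
        g_norm = g.norm()
        if g_norm > 0:
            # Normalised, depth-divided step: the magnitude comes from the
            # architecture (depth, widths), not from a tuned learning rate.
            W -= (gain / depth) * scale * g / g_norm

# usage: loss.backward(); agd_like_step(model); model.zero_grad()
```

The point of the sketch is the shape of the rule, not the constants: because the update is expressed in terms of quantities the architecture already provides, there is nothing left for the practitioner to sweep.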