All of Chris Mingard's Comments + Replies

AGD can train any architecture, dataset and batch size combination (as far as we have tested), out-of-the-box. I would argue that this is a qualitative change from current methods, where you have to find the right learning rate for every batch size, architecture and dataset combination in order to converge in an optimal or near-optimal time. I think this is a reasonable interpretation of "train ImageNet without hyperparameters". That said, there is a stronger sense of "hyperparameter-free" where the optimum batch size and architecture size would decide ...

There is an extensive discussion about feature learning in relation to the aforementioned Mingard et al. result in the comments of this post. The conclusion of the discussion was that feature learning is uncoupled from inductive bias for infinite-width (and, under further conditions, finite-width) neural networks when trained by a random-sampling process (essentially how NNGPs work).
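As a rough illustration of what "trained by a random-sampling process" means, here is a minimal sketch: draw parameters from the prior, keep only the draws that fit the training data, and look at which functions survive. The tiny perceptron with a bias term, the Gaussian prior, the 3-bit inputs and the training set are all assumptions made for illustration; this is a finite toy model, not an NNGP and not the setup of any of the papers discussed.

```python
import random
from collections import Counter
from itertools import product


def sample_function(rng, inputs):
    """Draw perceptron parameters from a Gaussian prior (an assumption) and
    return the resulting function as a tuple of 0/1 outputs, one per input."""
    n = len(inputs[0])
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = rng.gauss(0.0, 1.0)
    return tuple(int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0) for x in inputs)


def random_sampling_posterior(train, inputs, num_samples=20000, seed=0):
    """Rejection sampling: keep only the sampled functions that fit every
    training pair, and count how often each surviving function appears."""
    rng = random.Random(seed)
    idx = {x: i for i, x in enumerate(inputs)}
    posterior = Counter()
    for _ in range(num_samples):
        f = sample_function(rng, inputs)
        if all(f[idx[x]] == y for x, y in train):
            posterior[f] += 1
    return posterior


if __name__ == "__main__":
    inputs = list(product((0, 1), repeat=3))
    # Hypothetical training set, consistent with the target "output = first bit".
    train = [((0, 0, 0), 0), ((1, 1, 1), 1), ((1, 0, 0), 1), ((0, 1, 1), 0)]
    posterior = random_sampling_posterior(train, inputs)
    total = sum(posterior.values())
    print("accepted samples:", total)
    for f, count in posterior.most_common(3):
        print(f, count / total)
```

The relative frequencies of the surviving functions are an estimate of the posterior over functions under this prior; no gradient step is taken anywhere.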

The open question is whether the probability distributions over functions after each layer are the same whether you train with SGD or random sampling. Given how the posteriors of ...

Good guess ;)

Haha some things are pretty obvious - it's always really nice to get a very different perspective on an idea, thank you for continuing the conversation!

I see -- so you're saying that even though the distribution of output functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of intermediate functions might change. If true, this would mean that feature learning and inductive bias are 'uncoupled' for infinite randomly-sampled nets

That is exactly what I'm saying. I ...

2interstice
I just came across this paper which derives an expression for the posterior distribution of the weights in each layer in the infinite-width limit. The result: the distribution is unchanged from the prior in every layer but the last. So it indeed seems that there is no feature learning in this limit.

By hypothesis, all three methods will let us fit the target function. You seem to be saying [I think, correct me if I'm wrong] that all three methods should have the same inductive bias as well.

Not exactly the same - inductive biases are known to depend on width. I believe that wider networks are typically better, although I know of some counterexamples.

They're clearly different in some respects -- (C) can do transfer learning but (A) cannot

I think this is the main source of our disagreement. First of all, while the posterior of an N...

3interstice
Good guess ;)

I see -- so you're saying that even though the distribution of output functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of intermediate functions might change. If true, this would mean that feature learning and inductive bias are 'uncoupled' for infinite-width randomly-sampled nets. I think this is false, however -- that is, I think it's provable that the distribution of intermediate functions does not change in the infinite-width limit when you condition on the training data, even when conditioning over all layers. I can't find a reference offhand though, I'll report back if I find anything resolving this one way or another.

[Advance apologies if I haven't explained stuff well enough here. I think the important theme here is that we should maintain a way of thinking about the random sampling picture that is distinct from NNGPs.]

Right, this is an even better argument that NNGPs/random-sampled nets don't learn features.

Ah I see I need to explain myself further - the following is very counterintuitive but I think it's right. Learning features involves the movement of weights in the early layers, by definition. The claim I am making is that the reason why feature learning is good ...

3interstice
Yes, I think so. Let's go over the 'thin network' example -- we want to learn some function which can be represented by a thin network. But let's say a randomly-initialized thin network's intermediate functions won't be able to fit the function -- that is (with high probability over the random initialization) we won't be able to fit the function just by changing the parameters of the last layer. It seems there are a few ways we can alter the network to make fitting possible:

(A) Expand the network's width until (with high probability) it's possible to fit the function by only altering the last layer
(B) Keeping the width the same, re-sample the parameters in all layers until we find a setting that can fit the function
(C) Keeping the width the same, train the network with SGD

By hypothesis, all three methods will let us fit the target function. You seem to be saying [I think, correct me if I'm wrong] that all three methods should have the same inductive bias as well. I just don't see any reason this should be the case -- on the face of it, I would guess that all three have different inductive biases (though A and B might be similar). They're clearly different in some respects -- (C) can do transfer learning but (A) cannot (B is unclear).

My intuition here is that SGD-trained nets can learn functions non-linearly while NTK/GP can only do so linearly. So in the car detector example, SGD is able to develop a neuron detecting cars through some as-yet unclear 'feature learning' mechanism. The NTK/GP can do so as well, sort of, since they're universal function approximators. However, the way they do this is by taking a giant linear combination of random functions which is able to function identically to a car detector on the data points given. It seems like this might be more fragile/generalize worse than the neurons produced by SGD. Though that is admittedly somewhat conjectural at this stage, since we don't really have a great understanding of how feature learning in

I 100% agree that Kolmogorov complexity is not the best measure of complexity here - and I would refer anyone to your and Joar's comments at https://www.lesswrong.com/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of for an excellent discussion of this. I am aware that Kolmogorov complexity is defined wrt a UTM, and I should have offered clarification in the blog that a lot of steps were used to make the link between Kolmogorov complexity and these types of input-output maps, and state that we only talk about Kolmogorov comple...

[First thank you for your comments and observations - it's always interesting to read pushback]

First, I think my point about using the GP to measure the volume occupied by functions local to where SGD-trained networks are initialised is important. We are not really comparing NNs to NNGPs (well, technically we are, but we are interpreting what the NNGP does differently). We are trying to argue that SGD acts as a random sampler - it will find functions with probability proportional to the volume of those functions local to where the optimiser is in parameter-...

3interstice
And thanks for engaging with my random blog comments! TBC, I think you guys are definitely on the right track in trying to relate SGD to function simplicity, and the empirical work you've done fleshing out that picture is great. I just think it could be even better if it was based around a better SGD scaling limit ;)

Right, this is an even better argument that NNGPs/random-sampled nets don't learn features.

I think this only applies to NNGP/random-sampled nets, not SGD-trained nets. To apply to SGD-trained nets, you'd need to show that the new features learned by SGD have the same distribution as the features found in an infinitely-wide random net, but I don't think this is the case. By illustration, some SGD-trained nets can develop expressive neurons like 'car detector', enabling them to fit the data with a relatively small number of such neurons. If you used an NNGP to learn the same thing, you wouldn't get a single 'car detector' neuron, but rather some huge linear combination of high-frequency features that can approximate the cars seen in the dataset. I think this would probably generalize worse than the network with an actual 'car detector' (this isn't empirical evidence of course, but I think what we know about SGD-trained nets and the NNGP strongly suggests a picture like this).

Interesting, haven't seen this before. Just skimming the paper, it sounds like the very small learning rate + added white noise might result in different limiting behavior from usual SGD. Generally it seems that there are a lot of different possible limits one can take; empirically SGD-trained nets do seem to have 'feature learning' so I'm skeptical of limits that don't have that (I assume they don't have them for theoretical reasons, anyway. Would be interesting to actually examine the features found in networks trained like this, and to see if they can do transfer learning at all).

re: 'colored noise', not sure to what extent this matters. I think a more likely source of discrepancy

Check out https://arxiv.org/pdf/1909.11522.pdf where we do some similar analysis of perceptrons but in higher dimensions. Theorem 4.1 shows that there is an anti-entropy bias - in other words, functions with either mostly 0s or mostly 1s are exponentially more likely to show up than expected under a uniform prior - which holds for perceptrons of any dimension. This proves a (fairly trivial) bias towards simple functions, although it doesn't say anything about why a function like 010101010101... appears more frequently than other functions in the maximum-entropy class.
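As a rough illustration of the anti-entropy bias described above (a sketch, not the setup or proof of Theorem 4.1), one can sample random perceptrons on the Boolean hypercube and count how many inputs each one maps to 1. The bias-free perceptron f(x) = 1[w·x > 0], the Gaussian weights and the names `sample_T_counts` and `uniform_prior_prob` are assumptions chosen for this example.

```python
import random
from collections import Counter
from itertools import product
from math import comb


def sample_T_counts(n=5, num_samples=5000, seed=0):
    """Sample bias-free perceptrons f(x) = 1[w.x > 0] on {0,1}^n and count
    how often each value of T (the number of inputs mapped to 1) occurs."""
    rng = random.Random(seed)
    inputs = list(product((0, 1), repeat=n))
    counts = Counter()
    for _ in range(num_samples):
        w = [rng.gauss(0.0, 1.0) for _ in range(n)]
        t = sum(1 for x in inputs if sum(wi * xi for wi, xi in zip(w, x)) > 0)
        counts[t] += 1
    return counts


def uniform_prior_prob(n, t):
    """P(T = t) if every Boolean function on {0,1}^n were equally likely,
    i.e. each of the 2^n outputs is an independent fair coin flip."""
    m = 2 ** n
    return comb(m, t) / 2 ** m


if __name__ == "__main__":
    n, num_samples = 5, 5000
    counts = sample_T_counts(n, num_samples)
    # Compare the all-zeros function (T = 0), a balanced T and the largest
    # reachable T (the all-zero input is never mapped to 1 without a bias).
    for t in (0, 2 ** (n - 1), 2 ** n - 1):
        print(t, counts[t] / num_samples, uniform_prior_prob(n, t))
```

In this toy setting T = 0 occurs exactly when every weight is non-positive, so its probability is 2^-n, whereas under a uniform prior over functions it would be 2^-(2^n). The sampled frequencies should reflect that gap.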

I agree that "large volume-->simple" is what is shown by the evidence in the papers, as opposed to "simple--> large volume", which is in fact a claim we do not make anywhere (if we do accidentally, please let me know and I will fix it) - see https://arxiv.org/abs/1910.00971 for more detail on this, or Joar Skalse's comments on https://www.alignmentforum.org/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of, where he discusses functions which don't obey this rule - such as the identity function, which has small volume a...

3interstice
Yeah, I didn't mean to imply that you guys said 'simple --> large volume' anywhere. I just think it's a point worth emphasizing, especially around here where I think people will imagine "Solomonoff Induction-like" when they hear about a "bias towards simple functions".

But in the infinite-width setting, Bayesian inference in general is given by a GP limit, right? Initialization doesn't matter. This means that the arguments for lack of feature learning still go through. It's technically possible that there could be feature learning in finite-width randomly-sampled networks, but it seems strange that finiteness would help here (and any such learning would be experimentally inaccessible). This is a major reason that I'm skeptical of the "SGD as a random sampler" picture.

I think a lot of the points you raise here have good answers at https://www.alignmentforum.org/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of - see in particular replies by Joar Skalse (the author of that post). You say that you don't think it surprising that the posteriors of NNs are similar to NNGPs on the data on which they were trained to fit - I think this statement is only unsurprising if you assume that SGD is not playing a particularly big role in the inductive bias (for small/medium scale datasets and architectures...

1interstice
I think we basically agree on the state of the empirical evidence -- the question is just whether NTK/GP/random-sampling methods will continue to match the performance of SGD-trained nets on more complex problems, or if they'll break down, ultimately being a first-order approximation to some more complex dynamics. I think the latter is more likely, mostly based on the lack of feature learning in NTK/GP/random limits. re: the architecture being the source of inductive bias -- I certainly think this is true in the sense that architecture choice will have a bigger effect on generalization than hyperparameters, or the choice of which local optimizer to use. But I do think that using a local optimizer at all, as opposed to randomly sampling parameters, is likely to have a large effect.