...There is an extensive discussion about feature learning in relation to the aforementioned Mingard et al. result in the comments of this post. The conclusion of the discussion was that feature learning is uncoupled from inductive bias for infinite-width (and, with further conditions, finite-width) neural networks when trained by a random-sampling process (essentially how NNGPs work).
The open question is whether the probability distributions over functions after each layer are the same whether you train with SGD or random sampling. Given how the posteriors of ...
Good guess ;)
Haha some things are pretty obvious - it's always really nice to get a very different perspective on an idea, thank you for continuing the conversation!
I see -- so you're saying that even though the distribution of output functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of intermediate functions might change. If true, this would mean that feature learning and inductive bias are 'uncoupled' for infinite randomly-sampled nets.
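Here is a rough finite-width sketch of the distinction I have in mind (my own toy rejection-sampling setup, not anything from the papers; width, data and sample counts are arbitrary): condition on fitting a tiny training set either by resampling only the readout weights over one fixed random feature map, or by resampling every layer. The predictive distribution at a test point should come out roughly similar at large width, while the distribution over intermediate features is obviously different (a point mass in the first case, the prior in the second).

```python
# Rough finite-width sketch (random sampling only, no SGD). All sizes arbitrary.
import numpy as np

rng = np.random.default_rng(0)
width, d = 512, 5
X_train = rng.normal(size=(3, d))
y_train = np.array([1, -1, 1])               # tiny binary training set
x_test = rng.normal(size=(1, d))

def sample_W1():
    return rng.normal(size=(d, width))

def hidden(W1, X):
    return np.maximum(X @ W1, 0) / np.sqrt(width)   # ReLU features

def signs(W1, w2, X):
    return np.sign(hidden(W1, X) @ w2)

def predictive(fixed_W1, n=20_000):
    """Estimate P(test label = +1 | net fits the training set) by rejection sampling."""
    hits = total = 0
    for _ in range(n):
        W1 = fixed_W1 if fixed_W1 is not None else sample_W1()
        w2 = rng.normal(size=width)
        if np.array_equal(signs(W1, w2, X_train), y_train):
            total += 1
            hits += signs(W1, w2, x_test)[0] > 0
    return hits / total, total

W1_fixed = sample_W1()
p_last, n_last = predictive(W1_fixed)   # features frozen at one random draw
p_full, n_full = predictive(None)       # every layer resampled each time
print(f"last-layer-only sampling: P(test=+1 | fit) = {p_last:.3f} ({n_last} accepted)")
print(f"all-layers sampling:      P(test=+1 | fit) = {p_full:.3f} ({n_full} accepted)")
```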
That is exactly what I'm saying. I ...
By hypothesis, all three methods will let us fit the target function. You seem to be saying [I think, correct me if I'm wrong] that all three methods should have the same inductive bias as well.
Not exactly the same - it is known that inductive biases depend on width. I believe that wider networks are typically better, although I know of some counterexamples.
They're clearly different in some respects -- (C) can do transfer learning but (A) cannot.
I think this is the main source of our disagreement. First of all, while the posterior of an N...
[Advance apologies if I haven't explained stuff well enough here. I think the important theme is that we should maintain a way of thinking about the random sampling picture that is distinct from NNGPs.]
Right, this is an even better argument that NNGPs/random-sampled nets don't learn features.
Ah, I see I need to explain myself further - the following is very counterintuitive but I think it's right. Learning features involves the movement of weights in the early layers, by definition. The claim I am making is that the reason why feature learning is good ...
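As a crude concrete version of that definition, here is a small sketch that just tracks how far each layer's weights move during training - large relative movement in the early layers is the (definitional) signature of feature learning in the sense above. The architecture, data and optimiser settings are placeholder choices.

```python
# Minimal diagnostic: train a small MLP and report each layer's relative weight
# change; early-layer movement is the "feature learning" being referred to.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 20)
y = (X[:, :2].prod(dim=1, keepdim=True) > 0).float()   # toy binary target

net = nn.Sequential(
    nn.Linear(20, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),
)
init = {name: p.detach().clone() for name, p in net.named_parameters()}

opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(3000):
    opt.zero_grad()
    loss_fn(net(X), y).backward()
    opt.step()

for name, p in net.named_parameters():
    if "weight" in name:
        rel = (p - init[name]).norm() / init[name].norm()
        print(f"{name}: relative weight change = {rel.item():.3f}")
```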
I 100% agree that Kolmogorov complexity is not the best measure of complexity here - and I would refer anyone to your and Joar's comments at https://www.lesswrong.com/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of for an excellent discussion of this. I am aware that Kolmogorov complexity is defined wrt a UTM, and I should have clarified in the blog that a lot of steps were used to make the link between Kolmogorov complexity and these types of input-output maps, and stated that we only talk about Kolmogorov comple...
[First, thank you for your comments and observations - it's always interesting to read pushback.]
First, I think my point about using the GP to measure the volume occupied by functions local to where SGD-trained networks are initialised is important. We are not really comparing NNs to NNGPs (well, technically we are, but we are interpreting what the NNGP does differently). We are trying to argue that SGD acts as a random sampler - it will find functions with probability proportional to the volume of those functions local to where the optimiser is in parameter-...
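For concreteness, here is a toy version of the comparison I have in mind (not the setup from the papers): on 3-bit inputs, estimate how often a randomly sampled network that happens to fit the training data implements each function (a crude proxy for volume), and how often gradient descent runs from random initialisations end up at each function. Sizes, sample counts and the learning rate are arbitrary illustrative choices.

```python
# Toy comparison: random-sampling frequency vs gradient-descent frequency of
# functions on 3 Boolean inputs, conditioned on fitting a 5-point training set.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
X_all = np.array([[int(b) for b in f"{i:03b}"] for i in range(8)], dtype=float)
y_all = (X_all.sum(axis=1) >= 2).astype(int)     # toy target on 3 bits
train_idx = np.arange(5)                         # train on 5 of the 8 inputs

def labels(W1, b1, w2, b2, X):
    return (np.tanh(X @ W1 + b1) @ w2 + b2 > 0).astype(int)

def sample_params():
    return (rng.normal(size=(3, 8)), rng.normal(size=8),
            rng.normal(size=8), rng.normal())

def fits_train(p):
    return np.array_equal(labels(*p, X_all[train_idx]), y_all[train_idx])

def fn_string(p):
    return "".join(map(str, labels(*p, X_all)))   # the function on all 8 inputs

# (i) random sampling, keeping only parameter draws that fit the training data
sampled = Counter()
for _ in range(100_000):
    p = sample_params()
    if fits_train(p):
        sampled[fn_string(p)] += 1

# (ii) full-batch gradient descent on a logistic loss, from random inits
def run_gd(steps=1000, lr=1.0):
    W1, b1, w2, b2 = sample_params()
    Xtr, ytr = X_all[train_idx], y_all[train_idx]
    for _ in range(steps):
        h = np.tanh(Xtr @ W1 + b1)
        g = (1 / (1 + np.exp(-(h @ w2 + b2))) - ytr) / len(ytr)   # dLoss/dlogit
        gh = np.outer(g, w2) * (1 - h ** 2)
        W1 -= lr * (Xtr.T @ gh); b1 -= lr * gh.sum(axis=0)
        w2 -= lr * (h.T @ g);    b2 -= lr * g.sum()
    return (W1, b1, w2, b2)

trained = Counter()
for _ in range(300):
    p = run_gd()
    if fits_train(p):
        trained[fn_string(p)] += 1

print("function on all 8 inputs: random-sampling freq vs gradient-descent freq")
for f, n in sampled.most_common(5):
    print(f"{f}: {n / sum(sampled.values()):.3f} vs "
          f"{trained[f] / max(sum(trained.values()), 1):.3f}")
```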
Check out https://arxiv.org/pdf/1909.11522.pdf where we do some similar analysis of perceptrons but in higher dimensions. Theorem 4.1 shows that there is an anti-entropy bias - in other words, functions with either mostly 0s or mostly 1s are exponentially more likely to show up than expected under a uniform prior - which holds for perceptrons of any dimension. This proves a (fairly trivial) bias towards simple functions, although it doesn't say anything about why a function like 010101010101... appears more frequently than other functions in the maximum-entropy class.
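If you want to see the effect empirically, a quick sketch (not the paper's exact Theorem 4.1 setup) is to sample random perceptrons on d Boolean inputs and tally how many of the 2^d inputs each one maps to 1. Under a uniform prior over Boolean functions that count would be Binomial(2^d, 1/2); random perceptrons instead pile up on the mostly-0 and mostly-1 classes. The choice of d, the weight distribution and the sample count below are mine, purely for illustration.

```python
# Empirical sketch of the anti-entropy bias for random perceptrons.
import numpy as np
from math import comb

rng = np.random.default_rng(0)
d = 7
X = np.array([[int(b) for b in f"{i:0{d}b}"] for i in range(2 ** d)], dtype=float)

n_samples = 50_000
W = rng.normal(size=(n_samples, d))        # random perceptron weights
b = rng.normal(size=(n_samples, 1))        # random biases
outputs = (W @ X.T + b > 0)                # shape (n_samples, 2^d)

ones_per_function = outputs.sum(axis=1)    # "entropy class" T = number of 1s
freq = np.bincount(ones_per_function, minlength=2 ** d + 1) / n_samples
uniform = [comb(2 ** d, k) / 2 ** (2 ** d) for k in range(2 ** d + 1)]

for k in (0, 1, 2 ** (d - 1), 2 ** d - 1, 2 ** d):
    print(f"T={k:3d}: perceptron freq={freq[k]:.4f}, uniform-prior freq={uniform[k]:.2e}")
```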
I agree that "large volume --> simple" is what is shown by the evidence in the papers, as opposed to "simple --> large volume", which is in fact not a claim we make anywhere (if we do accidentally, please let me know and I will fix it) - see https://arxiv.org/abs/1910.00971 for more detail on this, or Joar Skalse's comments on https://www.alignmentforum.org/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of, where he discusses functions which don't obey this rule - such as the identity function, which has small volume a...
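To illustrate the direction of the implication with a toy experiment (my own setup, and a deliberately crude complexity proxy): sample small random networks on all 2^7 Boolean inputs, record how often each output string appears, and compare a function's frequency with the zlib-compressed length of its output string. The frequent functions come out simple, but a function with a short description (such as 7-bit parity, a two-line program) can still have essentially zero frequency.

```python
# Crude illustration: high sampling probability -> simple, but not the converse.
import zlib
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
d, N = 7, 50_000
X = np.array([[int(b) for b in f"{i:0{d}b}"] for i in range(2 ** d)], dtype=float)

def random_function():
    W1 = rng.normal(size=(d, 32)); b1 = rng.normal(size=32)
    w2 = rng.normal(size=32);      b2 = rng.normal()
    out = (np.tanh(X @ W1 + b1) @ w2 + b2 > 0).astype(int)
    return "".join(map(str, out))

def complexity(f):
    return len(zlib.compress(f.encode()))    # crude stand-in for description length

freq = Counter(random_function() for _ in range(N))

print("top functions by sampling frequency (and their complexity proxy):")
for f, n in freq.most_common(3):
    print(f"freq={n / N:.3f}, complexity proxy={complexity(f)}")

# A function with a short description need not have large volume:
# 7-bit parity essentially never shows up under this sampling.
parity = "".join(str(bin(i).count('1') % 2) for i in range(2 ** d))
print(f"parity: freq={freq[parity] / N:.5f}")
```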
I think a lot of the points you raise here have good answers at https://www.alignmentforum.org/posts/YSFJosoHYFyXjoYWa/why-neural-networks-generalise-and-why-they-are-kind-of - see in particular the replies by Joar Skalse (the author of that post). You say that you don't think it surprising that the posteriors of NNs are similar to NNGPs on the data they were trained to fit - I think this is only unsurprising if you assume that SGD is not playing a particularly big role in the inductive bias (for small/medium scale datasets and architectures...
AGD can train any architecture, dataset and batch size combination (as far as we have tested) out of the box. I would argue that this is a qualitative change from current methods, where you have to find the right learning rate for every batch size, architecture and dataset combination in order to converge in an optimal or near-optimal time. I think this is a reasonable interpretation of "train ImageNet without hyperparameters". That said, there is a stronger sense of "hyperparameter-free" where the optimum batch size and architecture size would decide ...
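For contrast, this is the kind of tuning loop that out-of-the-box training is meant to remove (a generic sketch of a learning-rate sweep with plain SGD, not AGD itself; the model, data and grid values are placeholders):

```python
# Generic learning-rate sweep per (width, batch size) combination.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
data = TensorDataset(torch.randn(512, 32), torch.randint(0, 2, (512,)))

def make_model(width):
    return nn.Sequential(nn.Linear(32, width), nn.ReLU(), nn.Linear(width, 2))

def final_loss(width, batch_size, lr, epochs=5):
    model = make_model(width)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
    return loss.item()

# With plain SGD the best learning rate typically shifts as width and batch size
# change, so the sweep has to be redone for every combination.
for width in (64, 1024):
    for batch_size in (16, 256):
        best = min((final_loss(width, batch_size, lr), lr)
                   for lr in (1e-3, 1e-2, 1e-1, 1.0))
        print(f"width={width}, batch={batch_size}: best lr={best[1]}, loss={best[0]:.3f}")
```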