Thanks for writing this out Joar, it is a good exercise in clarification for all of us.
Perhaps a boring comment, but I do want to push back on the title ever so slightly: imo it should be My Criticism of SLT Proponents, i.e. people (like me) who have interpreted some aspects in perhaps an erroneous fashion (according to you).
Sumio Watanabe is incredibly careful to provide highly precise mathematical statements with rigorous proofs and at no point does he make claims about the kind of "real world deep learning" phenomena being discussed here. The only sense...
I would argue that the title is sufficiently ambiguous as to what is being claimed, and actually the point of contention in (ii) was discussed in the comments there too. I could have changed it to Why Neural Networks can obey Occam's Razor, but I think this obscures the main point. Regular linear regression could also obey Occam's razor (i.e. "simpler" models are possible) if you set high-order coefficients to 0, but the posterior of such models does not concentrate on those points in parameter space.
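To make that contrast concrete, here is a rough sketch using the standard Laplace / BIC approximation for a regular model (my notation, not taken from the post): for a model with $d$ parameters and non-degenerate Fisher information at the truth,
$$p(w \mid D_n) \approx \mathcal{N}\!\left(\hat{w}_n,\ \tfrac{1}{n} I(w_0)^{-1}\right), \qquad F_n \approx n L_n(\hat{w}_n) + \frac{d}{2}\log n,$$
so the posterior spreads at the same $1/\sqrt{n}$ rate in every direction around the MLE and pays the same $\frac{d}{2}\log n$ complexity penalty regardless of how many high-order coefficients sit near zero. Nothing pushes the posterior towards the "simpler" submodels unless the truth actually lies on them.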
At the time of writing, basically nobody knew anyt...
Good question! The proof of the exact symmetries of this setup, i.e. the precise form of $W_0$, is highly dependent on the ReLU. However, the general phenomenon I am discussing is applicable well beyond ReLU to other non-linearities. I think there are two main components to this:
Edit: Originally the sequence was going to contain a post about SLT for Alignment, but this can now be found here instead, where a new research agenda, Developmental Interpretability, is introduced. I have also now included references to the lectures from the recent SLT for Alignment Workshop in June 2023.
Only in the illegal ways, unfortunately. Perhaps your university has access?
"Discontinuity" might suggest that this happens fast. Yet, e.g. in work on grokking, it actually turns out that these "sudden changes" happen over a majority of the training time (often, the x-axis is on a logarithmic scale). Is this compatible, or would this suggest that phenomena like grokking aren't related to the phase transitions predicted by SLT?
This is a great question and something that came up at the recent summit. We would definitely say that the model is in two different phases before and after grokking (i.e. when the test error is flat), but it...
However, is it correct that we need the "underlying truth" to study symmetries that come from other degeneracies of the Fisher information matrix? After all, this matrix involves the true distribution in its definition. The same holds for the Hessian of the KL divergence.
The definition of the Fisher information matrix does not refer to the truth whatsoever. (Note that in the definition I provide I am assuming the supervised learning case where we know the input distribution $q(x)$, meaning the model is $p(y|x,w)q(x)$, which ...
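For reference, the definition I have in mind in the supervised setting (a sketch in the notation used in the sequence, so treat the exact symbols as illustrative) is
$$I(w)_{jk} = \int \frac{\partial \log p(y \mid x, w)}{\partial w_j}\, \frac{\partial \log p(y \mid x, w)}{\partial w_k}\; p(y \mid x, w)\, q(x)\; dy\, dx,$$
which involves only the model $p(y \mid x, w)$ and the known input distribution $q(x)$, and never the true conditional distribution $q(y \mid x)$.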
Now, for the KL-divergence, the situation seems more extreme: the zeroes are also, at the same time, the minima of $K(w)$, and thus, the derivative disappears at every point in the set $W_0$. This suggests every point in $W_0$ is singular. Is this correct?
Correct! So, the point is that things get interesting when $W_0$ is more than just a single point (which is the regular case). In essence, singularities are local minima of $K(w)$. In the non-realisable case this means they are zeroes of the minimum-loss level set. In fact we can abuse ...
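To spell out the step in the question above (my sketch of the reasoning, not a quote from the post): since $K(w) \geq 0$ and, in the realisable case, $K(w) = 0$ exactly on $W_0$, every point of $W_0$ is a global minimum, so
$$\nabla K(w) = 0 \quad \text{for all } w \in W_0.$$
When $W_0$ is a positive-dimensional set rather than a collection of isolated points, the Hessian of $K$ is also degenerate along the directions that stay inside $W_0$, and this degeneracy is exactly what makes these points singular in the SLT sense.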
Can you tell me more about why it is a measure of posterior concentration?
...
Are you claiming that most of that work happens in a very localized way, within a small parameter region?
Given a small neighbourhood $W \subset \mathcal{W}$, the free energy is $F_n(W) = -\log \int_W \varphi(w)\, e^{-n L_n(w)}\, dw$ and measures the posterior concentration in $W$ since
$$e^{-F_n(W)} = \int_W \varphi(w)\, e^{-n L_n(w)}\, dw,$$
where the inner term is the posterior, modulo its normalisation constant $Z_n$. The key here is that if we are comparing different regions of parameter space $W_1$ and $W_2$, then the free energy doesn't care about t...
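To spell out why this measures posterior concentration (a quick derivation in the same notation, as a sketch): with $Z_n = \int_{\mathcal{W}} \varphi(w)\, e^{-n L_n(w)}\, dw$ the full normalisation constant,
$$P(w \in W \mid D_n) = \frac{\int_W \varphi(w)\, e^{-n L_n(w)}\, dw}{Z_n} = \frac{e^{-F_n(W)}}{Z_n}, \qquad \frac{P(w \in W_1 \mid D_n)}{P(w \in W_2 \mid D_n)} = e^{-\left(F_n(W_1) - F_n(W_2)\right)},$$
so a lower local free energy means exponentially more posterior mass, and $Z_n$ cancels whenever two regions are compared.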
Thanks for the comment Leon! Indeed, in writing a post like this, there are always tradeoffs in which pieces of technicality to dive into and which to leave sufficiently vague so as to not distract from the main points. But these are all absolutely fair questions so I will do my best to answer them (and make some clarifying edits to the post, too). In general I would refer you to my thesis where the setup is more rigorously explained.
...Should I think of this as being equal to , and would you call this quantity ? I was a bit confu
Thanks for writing that, I look forward to reading.
As for nomenclature, I did not define it - the sequence is called Distilling SLT, and this is the definition offered by Watanabe. But to add some weight to it, the point is that in the Bayesian setting, the predictive distribution is a reasonable object to study from the point of view of generalisation, because it says: "what is the probability of this output given this input and given the data (via the posterior)". The Bayes training loss (which I haven't delved into in this post) is the e...
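For concreteness, the objects I have in mind here are (a sketch of the standard definitions, in the notation I use in the sequence): the Bayes predictive distribution
$$p(y \mid x, D_n) = \int p(y \mid x, w)\, p(w \mid D_n)\, dw,$$
the Bayes generalisation loss $G_n = -\mathbb{E}_{q(x,y)}\!\left[\log p(y \mid x, D_n)\right]$, and the Bayes training loss $T_n = -\frac{1}{n}\sum_{i=1}^n \log p(y_i \mid x_i, D_n)$.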
With all due respect, I think you are misrepresenting what I am saying here. The sentence after your quote ends is
its relation to SGD dynamics is certainly an open question.
What is proven by Watanabe is that the Bayesian generalisation error, as I described in detail in the post, strongly depends on the singularity structure of the minima of $K(w)$, as measured by the RLCT $\lambda$. This fact is proven in [Wat13] and explained in more detail in [Wat18]. As I elaborate on in the post, translating this statement into the SGD / frequentist setting is a...
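To state the shape of the result I am pointing at (roughly, and hedged since I am quoting from memory): in the realisable case the expected Bayes generalisation loss satisfies
$$\mathbb{E}\left[G_n\right] = L_0 + \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right),$$
where $L_0$ is the unavoidable loss of the true distribution and $\lambda$ is the RLCT. For a regular model $\lambda = d/2$, so singular models, which can have much smaller $\lambda$, generalise better in this Bayesian sense.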
What is your understanding? It is indeed a deep mathematical theory, but it is not convoluted. Watanabe proves the FEF, and shows the RLCT is the natural generalisation of complexity in this setting. There is a long history of deep/complicated mathematics, with natural (and beautiful) theorems at the core, being pivotal to describing real world phenomena.
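For reference, the free energy formula I mean is, stated roughly,
$$F_n = n L_n(w_0) + \lambda \log n - (m - 1)\log\log n + O_p(1),$$
where $w_0$ is an optimal parameter, $\lambda$ is the RLCT and $m$ its multiplicity. In the regular case this collapses to the familiar BIC-style $\frac{d}{2}\log n$ penalty, which is the sense in which the RLCT is the natural generalisation of parameter-count complexity.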
The point of the posts is not to argue that we can prove why particular architectures perform better than others (yet). This field has had, comparatively, very little work done on it yet within...
The way I've structured the sequence means these points are interspersed throughout the broader narrative, but it's a great question, so I'll provide a brief summary here, and as they are released I will link to the relevant sections in this comment.
If a model is singular, then Watanabe’s Free Energy Formula (FEF) can have big implications for the geometry of the loss landscape. Whether or not a particular neural network model is singular does indeed depend on its activation function, amongst other structures in its architecture.
In DSLT3 I will outline the ways simple two-layer feedforward ReLU neural networks are singular models (i.e. I will show the symmetries in parameter space that produce the same input-output function), which is generalisable to deeper feedforward ReLU networks. There I will also ...
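As a taste of the kind of symmetry involved (a standard example rather than the full characterisation coming in DSLT3): since $\operatorname{ReLU}(\alpha z) = \alpha \operatorname{ReLU}(z)$ for any $\alpha > 0$, a two-layer network $f(x, w) = \sum_j c_j \operatorname{ReLU}(a_j x + b_j)$ satisfies
$$\sum_j c_j \operatorname{ReLU}(a_j x + b_j) = \sum_j \frac{c_j}{\alpha_j} \operatorname{ReLU}(\alpha_j a_j x + \alpha_j b_j) \quad \text{for all } \alpha_j > 0,$$
so every parameter lies on a positive-dimensional set of parameters implementing exactly the same input-output function, which already forces the Fisher information matrix to be degenerate there.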
Ah! Thanks for that - it seems the general playlist organising them has splintered a bit, so here is the channel containing the lectures, the structure of which is explained here. I'll update this post accordingly.