(...) the term technical is a red flag for me, as it is many times used not for the routine business of implementing ideas but for the parts, ideas and all, which are just hard to understand and many times contain the main novelties.
- Saharon Shelah
As a true-born Dutchman I endorse Crocker's rules.
For most of my writing, see my short-forms (new shortform, old shortform)
Twitter: @FellowHominid
Personal website: https://sites.google.com/view/afdago/home
It also suggests that there might be some sort of conservation law of pain for agents.
Conservation of Pain, if you will.
Sure! I'll try and say some relevant things below. In general, I suggest looking at Liam Carroll's distillation rather than Watanabe's book (which is quite heavy going, but good as a reference text). There are also some links below that may prove helpful.
The empirical loss and its second derivative are statistical estimators of the population loss and its second derivative. Ultimately the latter controls the properties of the former (though the relation between the second derivative of the empirical loss and the second derivative of the population loss is a little subtle).
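A minimal sketch of what "statistical estimator" means here (the toy model and numbers are my own illustration, not from the comment): the empirical loss is a Monte Carlo estimate of the population loss, and it tightens as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: data x ~ N(0, 1), scalar parameter w,
# per-sample loss l(w, x) = (x - w)^2.
# Population loss: E[(x - w)^2] = 1 + w^2, minimised at w = 0.
def population_loss(w):
    return 1.0 + w ** 2

def empirical_loss(w, xs):
    # Sample average of the per-sample loss: an estimator of the population loss.
    return np.mean((xs - w) ** 2)

w = 0.3
for n in [10, 1_000, 100_000]:
    xs = rng.normal(size=n)
    print(n, empirical_loss(w, xs), population_loss(w))
```

The gap between the two columns shrinks like 1/sqrt(N), which is why asymptotic (large-N) statements are the natural language here.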
The [matrix of] second derivatives of the population loss at a minimum is called the Fisher information metric. It's always degenerate [i.e. singular] for any statistical model with hidden states or hierarchical structure. Analyses that don't take this into account are inherently flawed.
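To make "degenerate" concrete, here is a toy two-parameter model of my own choosing (not from the comment) whose minima form a singular set; the Hessian of the population loss at a minimum has a zero eigenvalue:

```python
import numpy as np

# Toy model f(x; a, b) = a*b*x with x ~ N(0, 1), true function zero.
# Population loss: K(a, b) = E[(a*b*x)^2] = (a*b)^2.
# The minimum set is {a*b = 0}: two crossing lines -- a singularity.
def K(a, b):
    return (a * b) ** 2

def hessian(f, a, b, h=1e-4):
    # 2x2 Hessian by central finite differences.
    def d2(i, j):
        e = [np.array([h, 0.0]), np.array([0.0, h])]
        p = np.array([a, b])
        return (f(*(p + e[i] + e[j])) - f(*(p + e[i] - e[j]))
                - f(*(p - e[i] + e[j])) + f(*(p - e[i] - e[j]))) / (4 * h * h)
    return np.array([[d2(0, 0), d2(0, 1)], [d2(1, 0), d2(1, 1)]])

H = hessian(K, 0.0, 1.0)       # a point on the minimum set {a*b = 0}
print(np.linalg.eigvalsh(H))   # one eigenvalue is (numerically) zero
```

The zero eigenvalue means the quadratic (Laplace/BIC-style) approximation around the minimum breaks down, which is exactly the situation SLT is built to handle.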
SLT tells us that the local geometry around the minimum nevertheless controls the learning and generalization behaviour of any Bayesian learner for large N. N doesn't have to be that large, though: empirically, the asymptotic behaviour that SLT predicts already kicks in around N = 200.
In some sense, SLT says that the broad-basin intuition is broadly correct, but this needs to be heavily caveated. Our low-dimensional intuition for broad basins is misleading. For singular statistical models (again, everything used in ML is highly singular) the local geometry around the minima in high dimensions is very weird.
Maybe you've heard of the behaviour of the volume of a sphere in high dimensions: most of it is contained in a thin shell near the surface. I like to think of the local geometry as some sort of fractal sea urchin. Maybe you like that picture, maybe you don't, but it doesn't matter: SLT gives actual math that is provably the right thing for a Bayesian learner.
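The shell fact is a one-liner to check numerically (my own quick illustration): the fraction of a d-ball's volume lying in the outer 1% shell is 1 - 0.99^d, which goes to 1 as d grows.

```python
# Fraction of a d-dimensional ball's volume in the outer 1% shell.
# Volume scales as r^d, so the inner 99%-radius ball holds 0.99**d of it.
for d in [3, 100, 1000]:
    print(d, 1 - 0.99 ** d)
```

Already at d = 1000 (tiny by neural-network standards) essentially all of the volume is in the shell, which is why low-dimensional pictures of "basins" mislead.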
[Real ML practice isn't Bayesian learning though? Yes, this is true. Nevertheless, there is both empirical and mathematical evidence that the Bayesian quantities are still highly relevant for actual learning.]
SLT says that the Bayesian posterior is controlled by the local geometry of the minimum. The dominant factor for N ≳ 200 is the fractal dimension of the minimum. This is the RLCT, and it is the most important quantity in SLT.
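For reference, the asymptotic behind this claim is Watanabe's free-energy expansion (this is the standard statement from the SLT literature, not something stated in the comment), where $\lambda$ is the RLCT and $m$ its multiplicity:

```latex
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log \log n + O_p(1)
```

Compare the regular (BIC) case, where the coefficient of $\log n$ is $d/2$; for singular models $\lambda \le d/2$, so the posterior concentrates preferentially on the more singular minima.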
There are some misconceptions about the RLCT floating around. One way to think about it is as an 'effective fractal dimension', but one has to be careful with this. There is a notion of effective dimension in the standard ML literature where one takes the parameter count and mods out parameters that don't do anything (because of symmetries). The RLCT picks up on symmetries, but it is not just that: it also picks up on how degenerate directions in the Fisher information metric are, i.e. how broad the basin is in that direction.
Let's consider a maximally simple example to get some intuition. Let the population loss function be $K(w) = w^{2k}$. The number of parameters is $d = 1$ and the minimum is at $w = 0$.
For $k = 1$ the minimum is nondegenerate (the second derivative is nonzero). In this case the RLCT is half the dimension. In our case the dimension is just $d = 1$, so $\lambda = 1/2$.
For $k > 1$ the minimum is degenerate (the second derivative is zero). Analyses based on studying second derivatives will not see the difference between different values of $k$, but in fact the local geometry is vastly different: the higher $k$ is, the broader the basin around the minimum. The RLCT for $K(w) = w^{2k}$ is $\lambda = 1/(2k)$. This means: the lower the RLCT, the 'broader' the basin is.
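One way to see where $\lambda = 1/(2k)$ comes from (a numerical sketch of my own, assuming the loss $K(w) = w^{2k}$ on $[-1, 1]$): the volume of the sub-$\epsilon$ level set scales as $\epsilon^{\lambda}$, so the RLCT can be read off as the slope of log-volume against log-$\epsilon$.

```python
import numpy as np

# For K(w) = w^(2k) on [-1, 1]: V(eps) = Vol{w : K(w) < eps} = 2 * eps**(1/(2k)),
# so log V(eps) is linear in log eps with slope lambda = 1/(2k).
def volume(eps, k, n_grid=2_000_001):
    w = np.linspace(-1, 1, n_grid)
    return 2.0 * np.mean(w ** (2 * k) < eps)  # fraction of [-1,1] below eps

for k in [1, 2, 4]:
    eps = np.array([1e-4, 1e-2])
    v = np.array([volume(e, k) for e in eps])
    slope = (np.log(v[1]) - np.log(v[0])) / (np.log(eps[1]) - np.log(eps[0]))
    print(k, slope)  # approximately 1/(2k)
```

The broader (more degenerate) the basin, the more volume sits at a given loss level, and the smaller the slope $\lambda$: this volume-scaling exponent, not the raw parameter count, is what dominates the posterior.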
Okay, so far this only recapitulates the broad-basin story. But there are some important points.
This is all answered very elegantly by singular learning theory.
You seem to have a strong math background! I really encourage you to take the time to really study the details of SLT. :-)
I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.
The story that symmetries mean that the parameter-to-function map is not injective is true but already well-understood outside of SLT. It is a common misconception that this is what SLT amounts to.
To be sure - generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training.
The issue of the true distribution not being contained in the model is called 'unrealizability' in Bayesian statistics. It is dealt with in Watanabe's second, 'green' book. Unrealizability is key to the most important insight of SLT, contained in the last sections of the second-to-last chapter of the green book: algorithmic development during training through phase transitions in the free energy.
I don't have the time to recap this story here.
All proofs are contained in Watanabe's standard text, see here.
Did I just say SLT is the Newtonian gravity of deep learning? Hubris of the highest order!
But also yes... I think I am saying that
EDIT: no hype about future work. Wait and see! :)
I remember hearing about the paper from a friend and thinking it couldn't possibly be true in a non-trivial sense. To someone with even a modicum of experience in logic, a computable procedure assigning probabilities to arbitrary logical statements in a natural way is surely going to hit a no-go diagonalization barrier.
Logical Inductors get around the diagonalization barrier in a very clever way. I won't spoil how they do it here. I recommend the interested reader watch Andrew Critch's talk on Logical Induction.
It was the main thing that convinced me that MIRI != clowns but was doing substantial research.
The Logical Induction paper has a fairly thorough discussion of previous work. Relevant previous work to mention is de Finetti's on betting and probability, previous work by MIRI & associates (Herreshof, Taylor, Christiano, Yudkowsky...), the work of Shafer-Vovk on financial interpretations of probability & Shafer's work on aggregation of experts. There is also a field which doesn't have a clear name that studies various forms of expert aggregation. Overall, my best judgement is that nobody else was close before Garrabrant.
Actually, since we're on the subject of scientific discoveries
Singular Learning Theory is another way of "talking about the breadth of optima" in the same sense that Newton's Universal Law of Gravitation is another way of "talking about Things Falling Down".
Military nerds, correct me if I'm wrong, but I think the answer might be the following. I'm not a pilot, etc. etc.
Stealth can be a bit of a misleading term. F-35s aren't actually 'stealth aircraft' - they are low-observable aircraft. You can detect F-35s with long-wave radar.
The problem isn't knowing that there is an F-35; it's getting a weapons-grade lock on it. That is much harder, and your grainy GPT-interpreted photo isn't close to enough for a missile, I think. You mentioned this already as a possibility.
The Ukrainians pioneered something similar for audio which is used to detect missiles & drones entering Ukrainian airspace.