Alexander Gietelink Oldenziel

(...) the term technical is a red flag for me, as it is many times used not for the routine business of implementing ideas but for the parts, ideas and all, which are just hard to understand and many times contain the main novelties.
                                                                                                           - Saharon Shelah

 

As a true-born Dutchman I endorse Crocker's rules.

For most of my writing, see my shortforms (new shortform, old shortform)

Twitter: @FellowHominid

Personal website: https://sites.google.com/view/afdago/home

Sequences

Singular Learning Theory

Wiki Contributions

Comments

Sure! I'll try and say some relevant things below. In general, I suggest looking at Liam Carroll's distillation rather than Watanabe's book (which is quite heavy going, but good as a reference text). There are also some links below that may prove helpful.

The empirical loss and its second derivative are statistical estimators of the population loss and its second derivative. Ultimately the latter controls the properties of the former (though the relation between the second derivative of the empirical loss and the second derivative of the population loss is a little subtle).

The [matrix of] second derivatives of the population loss at the minima is called the Fisher information metric. It's always degenerate [i.e. singular] for any statistical model with hidden states or hierarchical structure. Analyses that don't take this into account are inherently flawed.
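
To make the degeneracy concrete, here is a minimal sketch (my own toy example, not from the comment above): a two-parameter regression model f(x; a, b) = a·b·x in which only the product ab matters, so the Fisher information matrix always has a zero eigenvalue.

```python
# Toy illustration (not from the original comment): the model
# f(x; a, b) = a * b * x is overparameterized, since only the product a*b
# matters. Its Fisher information matrix is therefore degenerate (singular).
import numpy as np

def fisher_information(a, b, xs, noise_var=1.0):
    """Average per-sample Fisher information for Gaussian regression
    y = a*b*x + noise: F = (1/noise_var) * mean[ grad_f grad_f^T ]."""
    grads = np.stack([b * xs, a * xs])            # d f / d(a, b), shape (2, n)
    return grads @ grads.T / (noise_var * len(xs))

xs = np.linspace(-1.0, 1.0, 100)
F = fisher_information(a=0.7, b=-1.3, xs=xs)
print(F)
print("determinant:", np.linalg.det(F))       # ~0: the matrix is singular
print("eigenvalues:", np.linalg.eigvalsh(F))  # one eigenvalue is (numerically) zero
```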

SLT tells us that the local geometry around the minimum nevertheless controls the learning and generalization behaviour of any Bayesian learner for large N. N doesn't have to be that large though: empirically, the asymptotic behaviour that SLT predicts already kicks in around N = 200.

In some sense, SLT says that the broad-basin intuition is broadly correct, but this needs to be heavily caveated. Our low-dimensional intuition for broad basins is misleading. For singular statistical models (again, everything used in ML is highly singular) the local geometry around the minima in high dimensions is very weird.

Maybe you've heard of the behaviour of the volume of a sphere in high dimensions: most of it is concentrated near the shell. I like to think of the local geometry as some sort of fractal sea urchin. Maybe you like that picture, maybe you don't, but it doesn't matter. SLT gives actual math that is provably the right thing for a Bayesian learner.

[Real ML practice isn't Bayesian learning though? Yes, this is true. Nevertheless, there is both empirical and mathematical evidence that the Bayesian quantities are still highly relevant for actual learning.]

SLT says that the Bayesian posterior is controlled by the local geometry of the minimum. The dominant factor for N ≳ 200 is the fractal dimension of the minimum. This is the RLCT and it is the most important quantity of SLT.

There are some misconceptions about the RLCT floating around. One way to think about it is as an 'effective fractal dimension', but one has to be careful about this. There is a notion of effective dimension in the standard ML literature where one takes the parameter count and mods out parameters that don't do anything (because of symmetries). The RLCT picks up on symmetries but it is not just that. It picks up on how degenerate directions in the Fisher information metric are ≈ how broad the basin is in that direction.

Let's consider a maximally simple example to get some intuition. Let the population loss function be L(w) = w^{2k}. The number of parameters is d = 1 and the minimum is at w = 0.

For k = 1 the minimum is nondegenerate (the second derivative is nonzero). In this case the RLCT is half the dimension. In our case the dimension is just d = 1, so λ = 1/2.

For k ≥ 2 the minimum is degenerate (the second derivative is zero). Analyses based on studying the second derivatives will not see the difference between different values of k, but in fact the local geometry is vastly different. The higher k is, the broader the basin around the minimum. The RLCT for w^{2k} is λ = 1/(2k). This means: the lower λ is, the 'broader' the basin is.
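
As a quick numerical sanity check (my own sketch, using the volume-scaling characterization of the RLCT), one can recover λ = 1/(2k) by measuring how the volume of the sub-ε level set of L(w) = w^{2k} shrinks with ε:

```python
# Numerical check (my own sketch): for L(w) = w^(2k) on [-1, 1], the volume
# V(eps) = |{w : L(w) < eps}| scales like eps^(1/(2k)), so the fitted
# log-log slope recovers the RLCT lambda = 1/(2k).
import numpy as np

def fitted_volume_exponent(k, n_samples=2_000_000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1.0, 1.0, n_samples)
    loss = w ** (2 * k)
    eps = np.logspace(-6, -2, 20)
    vol = np.array([np.mean(loss < e) for e in eps])   # Monte Carlo volume estimate
    slope, _ = np.polyfit(np.log(eps), np.log(vol), 1)
    return slope

for k in (1, 2, 3):
    print(f"k={k}: fitted exponent ≈ {fitted_volume_exponent(k):.3f}, "
          f"1/(2k) = {1 / (2 * k):.3f}")
```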

Okay, so far this only recapitulates the broad basin story. But there are some important points:

  • this is an actual quantity that can be estimated at scale for real networks and that provably dominates the learning behaviour for moderately large N.
  • SLT says that the minima with low RLCT will be preferred. It even says how much they will be preferred. There is a tradeoff between lower-RLCT minima with moderate loss ('simpler solutions') and minima with higher RLCT but lower loss; as N grows, the balance shifts towards the lower-loss minima (see the sketch after this list). This means that the RLCT is actually 'the right notion of model complexity/simplicity' in the parameterized Bayesian setting. This is too much to recap in this comment but I refer you to Hoogland & van Wingerden's post here. This is also the start of the phase transition story, which I regard as the principal insight of SLT.
  • The RLCT doesn't just pick up on basin broadness. It also picks up on more elaborate singular structure, e.g. a crossing-valley type minimum such as L(w_1, w_2) = w_1^2 w_2^2. I won't tell you the answer but you can calculate it yourself using Shaowei Lin's cheat sheet. This is key - actual neural networks have highly, highly singular structure that determines the RLCT.
  • The RLCT is the most important quantity in SLT, but SLT is not just about the RLCT. For instance, the second most important quantity, the 'singular fluctuation', is also quite important. It has a strong influence on generalization behaviour and is the largest factor in the variance of trained models. It controls approximations to Bayesian learning, like the way neural networks are actually trained.
  • We've seen that studying the directions defined by the matrix of second derivatives is fundamentally flawed because neural networks are highly singular. Still, there is something non-crazy about studying these directions. There is upcoming work, which I can't discuss in detail yet, that explains to a large degree how to correct this naive picture, both mathematically and empirically.
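
A rough sketch of the tradeoff mentioned in the second bullet above, using the leading-order free-energy formula F(N) ≈ N·L + λ·log N (constants and lower-order terms dropped); the two minima and their numbers are made up purely for illustration:

```python
# Sketch of the loss-vs-RLCT tradeoff, using the leading-order free energy
# F(N) ≈ N * L + lambda * log(N) for a minimum with loss L and RLCT lambda.
# The two minima below are hypothetical, chosen only to show the crossover.
import numpy as np

def free_energy(N, loss, rlct):
    return N * loss + rlct * np.log(N)

simple_min  = dict(loss=0.10, rlct=2.0)   # low RLCT ("simple"), moderate loss
complex_min = dict(loss=0.08, rlct=10.0)  # higher RLCT, lower loss

for N in (100, 500, 1000, 5000):
    fs = free_energy(N, **simple_min)
    fc = free_energy(N, **complex_min)
    preferred = "simple" if fs < fc else "complex"
    print(f"N={N:5d}: F_simple={fs:7.1f}  F_complex={fc:7.1f}  -> {preferred}")
# For small N the posterior prefers the simpler (lower-RLCT) minimum; as N
# grows the lower-loss minimum takes over -- a phase transition in N.
```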

This is all answered very elegantly by singular learning theory.

You seem to have a strong math background! I really encourage you to take the time and really study the details of SLT. :-)

I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.

The story that symmetries mean that the parameter-to-function map is not injective is true but already well-understood outside of SLT. It is a common misconception that this is what SLT amounts to.

To be sure - generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training.

The issue of the true distribution not being contained in the model is called 'unrealizability' in Bayesian statistics. It is dealt with in Watanabe's second 'green' book. Nonrealizability is key to the most important insight of SLT contained in the last sections of the second to last chapter of the green book: algorithmic development during training through phase transitions in the free energy.

I don't have the time to recap this story here.

Did I just say SLT is the Newtonian gravity of deep learning? Hubris of the highest order!

But also yes... I think I am saying that

  • Singular Learning Theory is the first highly accurate model of breadth of optima.
    • SLT tells us to look at a quantity Watanabe calls λ, which has the highly technical name 'real log canonical threshold' (RLCT). He proves several equivalent ways to describe it, one of which is as the (fractal) volume scaling dimension around the optima.
    • By computing simple examples (see Shaowei's guide in the links below) you can check for yourself how the RLCT picks up on basin broadness.
    • The RLCT is the first-order term for in-distribution generalization error and also for Bayesian learning (technically the 'Bayesian free energy'). This justifies the name 'learning coefficient' for λ. I emphasize that these are mathematically precise statements that have complete proofs, not conjectures or intuitions.
    • Knowing a little SLT will inoculate you against many wrong theories of deep learning that abound in the literature. I won't be going into it, but suffice to say that any paper assuming that the Fisher information metric is regular for deep neural networks, or any kind of hierarchical structure, is fundamentally flawed. And you can be sure this assumption is sneaked in all over the place. For instance, this is almost always the case when people talk about Laplace approximation (see the sketch after this list).
  • It's one of the most computationally applicable ones we have? Yes. SLT quantities like the RLCT can be analytically computed for many statistical models of interest, correctly predict phase transitions in toy neural networks, and can be estimated at scale.
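
To illustrate the Laplace-approximation point from the list above, here is a sketch (my own one-parameter toy, L(w) = w^4): the Bayesian evidence Z(n) = ∫ exp(-n·L(w)) dw scales as n^(-1/4), i.e. with exponent -λ, rather than the n^(-1/2) that a regular (Gaussian/Laplace) analysis would predict; indeed the Laplace approximation cannot even be formed here, since the second derivative at the minimum is exactly zero.

```python
# Sketch (my own toy example): evidence scaling at a degenerate minimum.
# For L(w) = w^4, Z(n) = ∫ exp(-n * L(w)) dw ~ n^(-1/4) = n^(-lambda),
# whereas a regular (Laplace) analysis would give n^(-d/2) = n^(-1/2).
import numpy as np
from scipy.integrate import quad

def log_evidence(n, k=2):
    # Z(n) = integral of exp(-n * w^(2k)) over w in [-1, 1], by quadrature
    val, _ = quad(lambda w: np.exp(-n * w ** (2 * k)), -1.0, 1.0)
    return np.log(val)

ns = np.logspace(2, 5, 10)
slope, _ = np.polyfit(np.log(ns), [log_evidence(n) for n in ns], 1)
print(f"fitted exponent ≈ {slope:.3f}  "
      f"(RLCT prediction: -1/4; regular/Laplace scaling: -1/2)")
```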

EDIT: no hype about future work. Wait and see ! :)

Answer by Alexander Gietelink Oldenziel, Apr 25, 2024
  • Scott Garrabrant's discovery of Logical Inductors. 

I remember hearing about the paper from a friend and thinking it couldn't possibly be true in a non-trivial sense. To someone with even a modicum of experience in logic, a computable procedure assigning probabilities to arbitrary logical statements in a natural way is surely going to hit a no-go diagonalization barrier.

Logical Inductors get around the diagonalization barrier in a very clever way. I won't spoil how they do it here. I recommend the interested reader watch Andrew Critch's talk on Logical Induction.

It was the main thing that convinced me that MIRI != clowns but was doing substantial research.

The Logical Induction paper has a fairly thorough discussion of previous work. Relevant previous work to mention is de Finetti's work on betting and probability, previous work by MIRI & associates (Herreshoff, Taylor, Christiano, Yudkowsky...), the work of Shafer-Vovk on financial interpretations of probability, and Shafer's work on aggregation of experts. There is also a field, which doesn't have a clear name, that studies various forms of expert aggregation. Overall, my best judgement is that nobody else was close before Garrabrant.

  • The Antikythera artifact: a Hellenistic Computer.  
    • You probably learned heliocentrism = good, geocentrism = bad, Copernicus-Kepler-Newton = good, epicycles = bad. But geocentric models and heliocentric models are equivalent; it's just that Kepler & Newton's laws are best expressed in a heliocentric frame. However, the raw observations are actually made in a geocentric frame. Geocentric models stay closer to the data in some sense.
    • Epicyclic theory is now considered bad, an example of people refusing to see the light of scientific revolution. But actually, it was an enormous innovation. Using high-precision gearing, epicycles could actually be implemented on a (Hellenistic) computer, implicitly doing Fourier analysis to predict the motion of the planets. Astounding.
    • A Roman author (Pliny the Elder?) describes a similar device in the possession of Archimedes. It seems likely that Archimedes or a close contemporary (or contemporaries) designed the artifact and that several were made in Rhodes.

Actually, since we're on the subject of scientific discoveries:

  • Discovery & description of the complete Antikythera mechanism. The actual artifact that was found is just a rusty piece of bronze. Nobody knew how it worked. There were several sequential discoveries over multiple decades that eventually led to the complete solution of the mechanism. The final pieces were found just a few years ago. An astounding scientific achievement. Here is an amazing documentary on the subject:

Singular Learning Theory is another way of "talking about the breadth of optima" in the same sense that Newton's Universal Law of Gravitation is another way of "talking about Things Falling Down". 

Yes, beautiful example! Van Leeuwenhoek was the one-man ASML of the 17th century. In this case, we actually have evidence of the counterfactual impact, as other lensmakers trailed van Leeuwenhoek by many decades.


It's plausible that high-precision measurement and fabrication is the key bottleneck in most technological and scientific progress - it's difficult to oversell the importance of van Leeuwenhoek.

Antonie van Leeuwenhoek made more than 500 optical lenses. He also created at least 25 single-lens microscopes, of differing types, of which only nine have survived. These microscopes were made of silver or copper frames, holding hand-made lenses. Those that have survived are capable of magnification up to 275 times. It is suspected that Van Leeuwenhoek possessed some microscopes that could magnify up to 500 times. Although he has been widely regarded as a dilettante or amateur, his scientific research was of remarkably high quality.[39]

The single-lens microscopes of Van Leeuwenhoek were relatively small devices, the largest being about 5 cm long.[40][41] They are used by placing the lens very close in front of the eye. The other side of the microscope had a pin, where the sample was attached in order to stay close to the lens. There were also three screws to move the pin and the sample along three axes: one axis to change the focus, and the two other axes to navigate through the sample.

Van Leeuwenhoek maintained throughout his life that there are aspects of microscope construction "which I only keep for myself", in particular his most critical secret of how he made the lenses.[42] For many years no one was able to reconstruct Van Leeuwenhoek's design techniques, but in 1957, C. L. Stong used thin glass thread fusing instead of polishing, and successfully created some working samples of a Van Leeuwenhoek design microscope.[43] Such a method was also discovered independently by A. Mosolov and A. Belkin at the Russian Novosibirsk State Medical Institute.[44] In May 2021 researchers in the Netherlands published a non-destructive neutron tomography study of a Leeuwenhoek microscope.[22] One image in particular shows a Stong/Mosolov-type spherical lens with a single short glass stem attached (Fig. 4). Such lenses are created by pulling an extremely thin glass filament, breaking the filament, and briefly fusing the filament end. The nuclear tomography article notes this lens creation method was first devised by Robert Hooke rather than Leeuwenhoek, which is ironic given Hooke's subsequent surprise at Leeuwenhoek's findings.

Here are some reflections I wrote on the work of Grothendieck and relations with his contemporaries & predecessors. 

Take it with a grain of salt - it is probably too deflationary of Grothendieck's work. Mythical narratives are common in certain mathematical circles where Grothendieck is held to be a Christ-like figure; I pushed back on that a little. Nevertheless, it would probably not be an exaggeration to say that Grothendieck's purely scientific contributions [as opposed to real-life consequences] were comparable to those of Einstein.
