Disclaimer: I'm by no means an expert on singular learning theory and what I present below is a simplification that experts might not endorse. Still, I think it might be more comprehensible for a general audience than going into digressions about blowing up singularities and birational invariants.
Here is my current understanding of what singular learning theory is about in a simplified (though perhaps more realistic?) discrete setting.
Suppose you represent a neural network architecture as a map $f: W \to \mathcal{F}$, where $W$ is the set of all possible parameters of $f$ (seen as floating point numbers, say) and $\mathcal{F}$ is the set of all possible computable functions from the input space to the output space you're considering. In thermodynamic terms, we could identify the elements $w \in W$ as "microstates" and the corresponding functions $f(w)$ that the NN architecture maps them to as "macrostates".
Furthermore, suppose that $\mathcal{F}$ comes together with a loss function $L: \mathcal{F} \to \mathbb{R}$ evaluating how good or bad a particular function is. Assume you optimize $L$ using something like stochastic gradient descent on the composite function $L \circ f$ with a particular learning rate.
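To pin down the objects involved, here is a minimal toy sketch of this discrete setting. The three-bit parameter space, the XOR "architecture", and the disagreement loss are all illustrative choices of mine rather than anything the theory prescribes; the only point is that $f$ collapses many microstates onto a few macrostates.

```python
from itertools import product

# Toy discrete setting: each "parameter" w is a tuple of 3 bits, and the
# "architecture" f maps w to the Boolean function it implements, represented
# by its truth table on a single input bit.

W = list(product([0, 1], repeat=3))          # microstates: all 8 parameter settings

def f(w):
    # The implemented function depends only on w[0] (it maps x to x XOR w[0]);
    # w[1] and w[2] can vary freely without changing the macrostate.
    return (0 ^ w[0], 1 ^ w[0])              # truth table on inputs 0 and 1

def loss(g, target=(0, 1)):
    # Loss on macrostates: number of inputs where g disagrees with a target function.
    return sum(a != b for a, b in zip(g, target))

fibers = {}
for w in W:
    fibers.setdefault(f(w), []).append(w)

for g, ws in fibers.items():
    print("function", g, "| fiber size", len(ws), "| loss", loss(g))
```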
Then, in general, we have the following results:
- SGD defines a Markov chain structure on the space $W$ whose stationary distribution is proportional to $e^{-\beta L(f(w))}$ on parameters $w$, for some positive constant $\beta$ that depends on the learning rate. This is just a basic fact about the Langevin dynamics that SGD would induce in such a system.
- In general $f$ is not injective, and we can define the "$f$-complexity" of any function $g \in \operatorname{im}(f)$ as $c_f(g) = \log_2 |W| - \log_2 |f^{-1}(g)|$. Then, the probability that we arrive at the macrostate $g$ is going to be proportional to $2^{-c_f(g)} e^{-\beta L(g)}$ (a small numerical check of these first two points follows this list).
- When $L$ is some kind of negative log-likelihood, this approximates Solomonoff induction in a tempered Bayes paradigm (we raise likelihood ratios to a power $\beta$) insofar as the $f$-complexity $c_f(g)$ is a good approximation for the Kolmogorov complexity of the function $g$, which will happen if the function approximator defined by $f$ is sufficiently well-behaved.
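Here is a small numerical check of the first two points, under toy assumptions of my own (a random non-injective map $f$ from 1024 microstates onto 8 macrostates, and a random loss): pushing the Boltzmann distribution on microstates forward through $f$ gives exactly the $2^{-c_f(g)} e^{-\beta L(g)}$ weighting on macrostates, since the $\log_2 |W|$ term in $c_f$ drops out in the normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, n_funcs, beta = 1024, 8, 2.0         # |W|, |im(f)|, inverse temperature

f = rng.integers(0, n_funcs, size=n_params)    # f: W -> {0, ..., n_funcs - 1}, not injective
L = rng.random(n_funcs)                        # loss on macrostates

# Stationary distribution of the chain on microstates: p(w) proportional to exp(-beta * L(f(w))).
p_micro = np.exp(-beta * L[f])
p_micro /= p_micro.sum()

# Push forward to macrostates.
p_macro = np.array([p_micro[f == g].sum() for g in range(n_funcs)])

# Prediction: p(g) proportional to 2^{-c_f(g)} * exp(-beta * L(g)),
# where c_f(g) = log2|W| - log2|f^{-1}(g)|.
fiber = np.array([(f == g).sum() for g in range(n_funcs)])
c_f = np.log2(n_params) - np.log2(fiber)
pred = 2.0 ** (-c_f) * np.exp(-beta * L)
pred /= pred.sum()

print(np.allclose(p_macro, pred))              # True: the f-complexity acts as a 2^{-c_f} prior
```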
The intuition for why we would expect the third point above to be true in practice has to do with the nature of the function approximator $f$. When $c_f(g)$ is small, it probably means that we only need a small number of bits of information on top of the definition of $f$ itself to define $g$, because "many" of the possible parameter values for $f$ are implementing the function $g$. So $g$ is probably a simple function.
On the other hand, if $g$ is a simple function and $f$ is sufficiently flexible as a function approximator, we can probably implement the functionality of $g$ using only a small number of the bits in the domain of $f$ (i.e. of the parameter vector), which leaves us the rest of the bits to vary as we wish. This makes $|f^{-1}(g)|$ quite large, and by extension the complexity $c_f(g)$ quite small.
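As a concrete, made-up instance of this bit-counting argument: if the parameter vector has $k$ bits and the behavior of $g$ is pinned down by only $m$ of them, the remaining $k - m$ bits are free to vary, so the fiber has $2^{k-m}$ elements and $c_f(g) = m$.

```python
import math
from itertools import product

k, m = 10, 3                       # k parameter bits in total; only the first m determine g

def f(w):
    # The implemented function depends only on the first m bits; the rest are slack.
    return w[:m]

W = list(product([0, 1], repeat=k))
g = (0, 1, 1)                      # some fixed simple macrostate
fiber_size = sum(1 for w in W if f(w) == g)
c_f = math.log2(len(W)) - math.log2(fiber_size)
print(fiber_size, c_f)             # 2**(k - m) = 128 parameter settings implement g, so c_f(g) = m = 3
```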
The vague concept of "flexibility" mentioned in the paragraph above requires $f$ to have singularities of many effective dimensions, as this is just another way of saying that the image of $f$ has to contain functions with a wide range of $f$-complexities. If $f$ is a one-to-one function, this clean version of the theory no longer works, though if $f$ is still "close" to being singular (for instance, because many of the functions in its image are very similar) then we can still recover results like the one I mentioned above. The basic insights remain the same in this setting.
I'm wondering what singular learning theory experts have to say about this simplification of their theory. Is this explanation missing some important details that are visible in the full theory? Does the full theory make some predictions that this simplified story does not make?
I don't think this is something that requires explanation, though. If you take an arbitrary geometric object in maths, a good definition of its singular points is "points where the tangent space has higher dimension than expected". If the object in question is the minimum set of a loss function and the tangent space at a point has higher dimension than expected, that intuitively means that locally there are more directions you can move along without changing the loss, which probably suggests there are also more directions you can move along without changing the function being implemented at all. So the function being implemented is simple, and the rest of the argument works as I outline it in the post.
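To make that concrete with a standard two-parameter toy model (my choice of example, not anything from the discussion above): take $L(a, b) = (ab)^2$, whose minimum set $\{ab = 0\}$ is the union of the two coordinate axes. At a generic minimum the Hessian has one zero eigenvalue, but at the crossing point $(0, 0)$, the singular point of the minimum set, every direction is flat.

```python
import numpy as np

def loss(a, b):
    # Toy loss whose minimum set {ab = 0} is the union of the two coordinate axes.
    return (a * b) ** 2

def hessian(a, b):
    # Analytic Hessian of (ab)^2.
    return np.array([[2 * b**2, 4 * a * b],
                     [4 * a * b, 2 * a**2]])

for a, b in [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]:
    flat_directions = 2 - np.linalg.matrix_rank(hessian(a, b))
    print((a, b), "loss:", loss(a, b), "flat directions:", flat_directions)
# Generic points of the minimum set have 1 flat direction; the singular point (0, 0),
# where the two branches cross, has 2, i.e. a larger tangent space than expected.
```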
I think I understand what you and Jesse are getting at, though: there's a particular behavior that only becomes visible in the smooth or analytic setting, which is that minima of the loss function that are more singular become more dominant as $n \to \infty$ in the Boltzmann integral, as opposed to maintaining just the same dominance factor of $e^{-O(d)}$. You don't see this in the discrete case because there's a finite nonzero gap in loss between first-best and second-best fits, and so the second-best fits are exponentially punished in the limit and become irrelevant, while in the singular case any first-best fit has some second-best "space" surrounding it whose volume is more concentrated towards the singularity point.
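A quick numerical illustration of that dominance effect, with one-dimensional toy losses of my own choosing: $L(x) = x^2$ and $L(x) = x^4$ both attain the same minimum value of zero, but the Boltzmann weight of the more degenerate minimum shrinks like $n^{-1/4}$ rather than $n^{-1/2}$, so its share of the integral grows without bound as $n \to \infty$, whereas in the discrete case two exact fits would keep the same relative weight forever.

```python
import numpy as np
from scipy.integrate import quad

# Two minima with the same minimal loss 0: a regular one (x^2) and a more
# degenerate, "more singular" one (x^4).  Their Boltzmann weights scale as
# n^{-1/2} and n^{-1/4} respectively, so the degenerate one dominates as n grows.
for n in [10, 100, 1000, 10000]:
    z_regular, _ = quad(lambda x: np.exp(-n * x**2), -1.0, 1.0)
    z_singular, _ = quad(lambda x: np.exp(-n * x**4), -1.0, 1.0)
    print(n, z_singular / z_regular)           # ratio grows roughly like n^{1/4}
```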
While I understand that, I'm not too sure what predictions you would make about the behavior of neural networks on the basis of this observation. For instance, if this smooth behavior is really essential to the generalization of NNs, wouldn't we predict that generalization would become worse as people switch to lower precision floating point numbers? I don't think that prediction would have held up very well if someone had made it 5 years ago.