When you say "networks" do you mean simple multi-layer perceptrons?
Or is this a more general description which includes stuff like:
transformers, state-space models, diffusion models, recursive looping models, reservoir computing models, next-generation reservoir computing models, spiking neural nets, Kolmogorov-Arnold networks, FunSearch, etc.
Anything where you fit parametrised functions to data. So, all of these, except maybe FunSearch? I haven't looked into what that actually does, but at a quick google it sounds more like an optimisation method than an architecture. Not sure learning theory will be very useful for thinking about that.
You can think of the learning coefficient as a sort of 'effective parameter count' in a generalised version of the Bayesian Information Criterion. Unlike the BIC, this generalisation is also applicable to architectures where many parameter configurations can result in the same function, like the architectures used in deep learning.
This is why models with neural network style architectures can e.g. generalise past the training data even when they have more parameters than training data points. People used to think this made no sense, because they had BIC-based intuitions that said you'd inevitably overfit. But the BIC isn't actually applicable to these architectures. You need the more general form, the WBIC, which has the learning coefficient in the formula in place of the parameter count.
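Roughly, and from memory rather than from anything in this thread (so treat the exact lower-order terms with caution): the negative log of the Bayesian evidence expands as
$$-\log Z(n) \approx n L_n + \frac{d}{2}\log n \quad \text{(regular models, the BIC)},$$
$$-\log Z(n) \approx n L_n + \lambda \log n - (m-1)\log\log n \quad \text{(singular models; this is the expansion the WBIC tracks)},$$
where $L_n$ is the training negative log-likelihood at the optimum and $m$ is a multiplicity. The parameter-count term $\frac{d}{2}$ gets replaced by the learning coefficient λ, which for singular architectures can be much smaller.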
You may find this interesting: "On the Covariance-Hessian Relation in Evolution Strategies":
https://arxiv.org/pdf/1806.03674
It makes a lot of assumptions, but as I understand it, if you:

a. sample points near the minimum,[1]
b. select only the lowest-loss point from that sample and save it,
c. repeat that process many times, and
d. compute the covariance matrix of the selected points,

then the covariance matrix will converge to the inverse of the Hessian, assuming the loss landscape is quadratic. Since the inverse of a matrix has the same rank, you could probably just use this covariance matrix to bound the local learning coefficient.

Though since a covariance matrix of n sample points has rank at most n−1, you would need to sample and evaluate roughly d/2 points. The process seems pretty parallelisable, though. (A rough code sketch of the procedure is below, after the footnote.)
[1] Specifically, using an isotropic, unit-variance normal distribution centered at the minimum.
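Not from the paper, just a minimal JAX sketch of steps a–d as I understand them, assuming a differentiable `loss_fn` that takes a flat parameter vector, and a parameter vector `w_min` at the minimum (both hypothetical names):

```python
import jax
import jax.numpy as jnp

def selection_covariance(loss_fn, w_min, n_rounds=1000, batch=32, seed=0):
    """Sketch of steps a-d: sample isotropic unit-variance Gaussian points around
    the minimum, keep the lowest-loss point of each batch, repeat, and return the
    covariance matrix of the kept points."""
    d = w_min.shape[0]
    keys = jax.random.split(jax.random.PRNGKey(seed), n_rounds)

    def one_round(key):
        samples = w_min + jax.random.normal(key, (batch, d))  # step a
        losses = jax.vmap(loss_fn)(samples)                   # evaluate the batch
        return samples[jnp.argmin(losses)]                    # step b

    selected = jax.vmap(one_round)(keys)                      # step c
    centered = selected - selected.mean(axis=0)
    return centered.T @ centered / (n_rounds - 1)             # step d
```

Under the paper's assumptions, the result should be proportional to the inverse Hessian, so its numerical rank could stand in for the Hessian rank.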
Why would we want or need to do this, instead of just calculating the top/bottom Hessian eigenvalues?
Isn't calculating the Hessian for large statistical models kind of hard? And aren't second derivatives prone to numerical errors?
Agree that this is only valuable if sampling on the loss landscape is easier or more robust than calculating the Hessian.
Getting the Hessian eigenvalues does not require calculating the full Hessian. You can use Jacobian-vector-product methods, e.g. in JAX, to compute Hessian-vector products; the Hessian itself never has to be explicitly represented in memory.
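For illustration (a minimal sketch, not code from anyone in this thread): the Hessian-vector product is the JVP of the gradient, and e.g. power iteration on it gives the largest-magnitude eigenvalue without ever materialising the Hessian.

```python
import jax
import jax.numpy as jnp

def hvp(loss_fn, w, v):
    """Hessian-vector product via forward-over-reverse: the JVP of grad(loss_fn)."""
    return jax.jvp(jax.grad(loss_fn), (w,), (v,))[1]

def top_eigenvalue(loss_fn, w, n_iters=100, seed=0):
    """Power iteration using only Hessian-vector products.
    Assumes the largest-magnitude eigenvalue is well separated."""
    v = jax.random.normal(jax.random.PRNGKey(seed), w.shape)
    v = v / jnp.linalg.norm(v)
    eig = 0.0
    for _ in range(n_iters):
        hv = hvp(loss_fn, w, v)
        eig = jnp.vdot(v, hv)                  # Rayleigh quotient estimate
        v = hv / (jnp.linalg.norm(hv) + 1e-12)
    return eig
```

The same matrix-free `hvp` can also be handed to Lanczos-style routines (e.g. scipy's sparse eigensolvers) to get many extremal eigenvalues at once.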
And even assuming the estimator for the Hessian pseudoinverse is cheap and precise, you'd still need to get its rank anyway, which would by default be just as expensive as getting the rank of the Hessian.
That makes sense, I guess it just comes down to an empirical question of which is easier.
Question about what you said earlier: How can you use the top/bottom eigenvalues to estimate the rank of the Hessian? I'm not as familiar with this so any pointers would be appreciated!
The rank of a (symmetric) matrix = the number of non-zero eigenvalues of the matrix! So you can either use the top eigenvalues to count the non-zeros, or you can use the fact that a d×d matrix always has d eigenvalues and determine the number of non-zero eigenvalues by counting the zero eigenvalues at the bottom of the spectrum.
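A toy example of what I mean, with a hypothetical numerical tolerance standing in for "zero":

```python
import jax.numpy as jnp

H = jnp.array([[2.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0]])             # rank-2 toy "Hessian"
eigs = jnp.linalg.eigvalsh(H)                # fine to compute directly for a tiny matrix
tol = 1e-8
rank = int(jnp.sum(jnp.abs(eigs) > tol))     # count the non-zeros from the top...
nullity = int(jnp.sum(jnp.abs(eigs) <= tol)) # ...or count the zeros at the bottom
assert rank + nullity == H.shape[0]          # a d x d matrix has d eigenvalues
```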
Also, for more detail on the "getting Hessian eigenvalues without calculating the full Hessian" thing, I'd really recommend John's explanation in this linear algebra lecture he recorded.
Thanks for this! I misinterpreted Lucius as saying "use the single highest and single lowest eigenvalues to estimate the rank of a matrix" which I didn't think was possible.
Counting the number of non-zero eigenvalues makes a lot more sense!
TL;DR: In a neural network with $d$ parameters, the (local) learning coefficient $\lambda$ can be upper and lower bounded by the rank of the network's Hessian, $d_1$:
$$\frac{d_1}{2} \;\leq\; \lambda \;\leq\; \frac{d_1}{2} + \frac{d-d_1}{4}.$$
The lower bound is a known result. The upper bound is a claim by me, and this post contains the proof for it.[1] If you find any problems, do point them out.
Edit 16.08.2024: The original version of this post had a three in the denominator of the upper bound. Dmitry Vaintrob spotted an improvement to make it a four.
Introduction
The learning coefficient λ is a measure of loss basin volume and model complexity. You can think of it sort of like an effective parameter count of the neural network. Simpler models that do less stuff have smaller λ.
Calculating λ for real networks people actually use is a pain. My hope is that these bounds help make estimating it a bit easier.
In a network with $d$ parameters, the learning coefficient is always a number
$$0 \;\leq\; \lambda \;\leq\; \frac{d}{2}.$$
An existing result in the literature says that if you've calculated the rank of the network's Hessian, $d_1$,[2] you get a tighter lower bound
$$\frac{d_1}{2} \;\leq\; \lambda.$$
I claim that we can also get a tighter upper bound
$$\lambda \;\leq\; \frac{d_1}{2} + \frac{d-d_1}{4},$$
where $d-d_1$ will be the dimension of the Hessian kernel, meaning the number of zero eigenvalues it has.[3]
This is neat because it means we can get some idea of how large λ is using only standard linear algebra methods. All we need to know is how many zero eigenvalues the Hessian has.[4] Singular Learning Theory introductions often stress that just looking at the Hessian isn’t enough to measure basin volume correctly. But here we see that if you do it right, the Hessian eigenspectrum can give you a pretty good idea of how large λ is. Especially if there aren't that many zero eigenvalues.
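To illustrate what that looks like in practice, here is a minimal sketch (not code from the post; `tol` is a hypothetical numerical tolerance standing in for the noise scale mentioned in footnote [4]):

```python
import jax.numpy as jnp

def lambda_bounds(hessian_eigvals, tol=1e-6):
    """Bounds on the (local) learning coefficient from a Hessian eigenspectrum:
    d1/2 <= lambda <= d1/2 + (d - d1)/4, with d1 the numerical rank."""
    d = hessian_eigvals.shape[0]
    d1 = int(jnp.sum(jnp.abs(hessian_eigvals) > tol))
    return d1 / 2, d1 / 2 + (d - d1) / 4
```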
Intuitively, the lower bound works because a direction in the parameters w that isn't free to vary to second order in the Taylor expansion won't become any more free to vary if you pile on a bunch of higher-order terms. The second-order term strictly dominates the higher-order ones; they can't cancel it out.
Qualitatively speaking, the upper bound works for the same reason. The higher-order terms in the Taylor expansion of the loss can only matter so much. The Hessian is the leading term, so it can impact λ the most, adding $\frac{1}{2}$ for each direction in which the Hessian is non-degenerate. The remaining terms, of order $O(w^3)$ and above, can add at most $\frac{1}{4}$ for each of the remaining directions.
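A toy example (mine, not from the post) to make the numbers concrete: take a two-parameter model whose loss near the optimum is
$$K(w) = w_1^2 + w_2^4.$$
Here $d = 2$, the Hessian at the optimum is $\operatorname{diag}(2, 0)$, so $d_1 = 1$. If I recall the standard result for such normal-crossing losses correctly, $\lambda = \frac{1}{2} + \frac{1}{4} = \frac{3}{4}$, which exactly saturates the upper bound $\frac{d_1}{2} + \frac{d-d_1}{4}$.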
The proof for the upper bound will just be a small modification of the proof for theorem 7.2 on pages 220 and 221 of Algebraic Geometry and Statistical Learning Theory. Maybe read that first if you want more technical context.
Some words on notation
In the following, I’ll mostly stick to the notation and conventions of the book Algebraic Geometry and Statistical Learning Theory. You can read about all the definitions there. I'm too lazy to reproduce them all.
To give some very rough context, K(w) is sort of like the 'loss' at parameter configuration w, φ(w) is our prior over parameters, and Z(n) is the partition function after updating on n data points.[5]
Theorem:
Let $W \subset \mathbb{R}^d$ be the set of parameters of the model. If there exists an open set $U \subset W$ such that
$$\{w \in U : K(w) = 0,\ \varphi(w) > 0\}$$
is not an empty set, and we define $d_1 = \operatorname{rank}(H)$ as the rank of the Hessian $H$ at a $w_0 \in U$,
$$H_{i,j} = \left.\frac{\partial^2 K(w)}{\partial w_i \partial w_j}\right|_{w=w_0},$$
with $w_i, w_j$ elements of some orthonormal basis $\{w_1, \dots, w_d\}$ of $\mathbb{R}^d$, then
$$\lambda \;\leq\; \frac{d_1}{2} + \frac{d-d_1}{4}.$$
Proof:
We can assume $w_0 = 0$ without loss of generality. If $\epsilon_1, \epsilon_2$ are sufficiently small constants,
$$Z(n) = \int \exp(-nK(w))\,\varphi(w)\,dw \;\geq\; \int_{|w^{(1)}| \leq \epsilon_1,\, |w^{(2)}| \leq \epsilon_2} \exp\{-nK(w)\}\,\varphi(w)\,dw.$$
Here, $w^{(1)} \in W/\ker(H)$ and $w^{(2)} \in \ker(H)$.
If we pick $\{w_1, \dots, w_d\}$ to be the Hessian eigenbasis, then for sufficiently small $|w| > 0$,
$$K(w) = \frac{1}{2}\sum_{i=1}^{d_1} H_{i,i}\, w^{(1)}_i w^{(1)}_i + O\!\left(|w^{(1)}|^2 |w^{(2)}|\right) + O\!\left(|w^{(2)}|^2 |w^{(1)}|\right) + O\!\left(|w|^4\right).$$
There is no $O(|w^{(2)}|^3)$ term because we are at a local optimum, so no direction can have a leading term that is an odd power.
For sufficiently large n, transforming Z(n) by $w'^{(1)} = n^{\frac{1}{2}} w^{(1)}$, $w'^{(2)} = n^{\frac{1}{4}} w^{(2)}$ yields
$$Z(n) \;\geq\; n^{-\frac{d_1}{2}}\, n^{-\frac{d-d_1}{4}} \int_{|w'^{(1)}| \leq 1,\, |w'^{(2)}| \leq 1} \exp\!\left\{-nK\!\left(n^{-\frac{1}{2}}w'^{(1)},\, n^{-\frac{1}{4}}w'^{(2)}\right)\right\}\varphi\!\left(n^{-\frac{1}{2}}w'^{(1)},\, n^{-\frac{1}{4}}w'^{(2)}\right)dw'.$$
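Spelling out the bookkeeping behind this substitution (no new assumptions, just tracking powers of n): each term of $nK(w)$ rescales as
$$
\begin{aligned}
n \cdot \tfrac{1}{2}\textstyle\sum_{i=1}^{d_1} H_{i,i}\, w^{(1)}_i w^{(1)}_i &= \tfrac{1}{2}\textstyle\sum_{i=1}^{d_1} H_{i,i}\, w'^{(1)}_i w'^{(1)}_i,\\
n \cdot O\!\left(|w^{(1)}|^2 |w^{(2)}|\right) &= n^{-\frac{1}{4}}\, O\!\left(|w'^{(1)}|^2 |w'^{(2)}|\right),\\
n \cdot O\!\left(|w^{(2)}|^2 |w^{(1)}|\right) &= O\!\left(|w'^{(2)}|^2 |w'^{(1)}|\right),\\
n \cdot O\!\left(|w|^4\right) &= O\!\left(|w'|^4\right).
\end{aligned}
$$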
Rearranging and substituting gives
$$Z(n)\, n^{\frac{d_1}{2} + \frac{d-d_1}{4}} \;\geq\; \int_{|w'| \leq 1} \exp\!\left\{-\frac{1}{2}\sum_{i=1}^{d_1} H_{i,i}\, w'^{(1)}_i w'^{(1)}_i + n^{-\frac{1}{4}} O\!\left(|w'^{(1)}|^2 |w'^{(2)}|\right) + O\!\left(|w'^{(2)}|^2 |w'^{(1)}|\right) + O\!\left(|w'|^4\right)\right\}\varphi\!\left(n^{-\frac{1}{2}}w'^{(1)},\, n^{-\frac{1}{4}}w'^{(2)}\right)dw'.$$
The right-hand side will converge to a positive constant $c > 0$ as $n \to \infty$, since the $n^{-\frac{1}{4}}$ term and every part of $O(|w'|^4)$ containing $w'^{(1)}$ are suppressed by negative powers of n, and $\varphi\!\left(n^{-\frac{1}{2}}w'^{(1)}, n^{-\frac{1}{4}}w'^{(2)}\right) \to \varphi(0)$:
$$\int_{|w'| \leq 1} \exp\!\left\{-\frac{1}{2}\sum_{i=1}^{d_1} H_{i,i}\, w'^{(1)}_i w'^{(1)}_i + O\!\left(|w'^{(2)}|^2 |w'^{(1)}|\right) + O\!\left(|w'^{(2)}|^4\right)\right\}\varphi(0)\, dw' = c\,.$$
Thus, we have
$$\lim_{n\to\infty} Z(n)\, n^{\frac{d_1}{2} + \frac{d-d_1}{4}} \;\geq\; c\,. \qquad (1)$$
By Theorem 7.1(1), page 218 of Algebraic Geometry and Statistical Learning Theory,[6] we also know
$$\lim_{n\to\infty} Z(n)\, \frac{n^{\lambda}}{(\log n)^{m-1}} = c^*\,, \qquad (2)$$
where $c^* > 0$ is some positive constant and m is the order of the largest pole of
$$\zeta(z) = \int K(w)^z\, \varphi(w)\, dw.$$
Putting (1) and (2) together, we get
$$\lambda \;\leq\; \frac{d_1}{2} + \frac{d-d_1}{4}.$$
Future work
I think it is possible to prove progressively tighter upper and lower bounds for the learning coefficient λ using information from progressively higher-order terms in the Taylor expansion of the loss.
For example, if the third and fourth-order terms involving $w^{(2)}$ happened to all be zero as well for a particular model, we could've transformed $w'^{(2)} = n^{\frac{1}{6}} w^{(2)}$ in the proof instead, and obtained the even tighter upper bound $\lambda \leq \frac{d_1}{2} + \frac{d-d_1}{6}$.
I have some early results for an upper bound formula like this, though right now it only works under specific conditions.[7]
Since the higher-dimensional tensors in these $O(|w|^n)$ terms can be very computationally intensive to deal with, the real challenge in making these progressively stricter bounds for λ practically useful would be finding tricks to make handling those tensors cheaper. A SPAR team I'm mentoring is currently trying out some ideas on that front on real networks.
Acknowledgments
Credit to Dmitry Vaintrob for spotting the improvement that tightened the upper bound from $\frac{d_1}{2} + \frac{d-d_1}{3}$ in the original version of this post to $\frac{d_1}{2} + \frac{d-d_1}{4}$. Thanks to Daniel Murfet for discussion and feedback on the proof idea.
[1] I couldn't find the upper bound in the literature and Daniel Murfet hadn't heard of it either, which is why I wrote this. But it could be around somewhere and I just missed it.
[2] For the learning coefficient, this will be the smallest Hessian rank out of all points in the set of minimum loss. For the local learning coefficient, it'll be the smallest Hessian rank out of all minimum-loss points in the local neighbourhood we specify.
[3] This will hold for the Hessian rank at any point with minimum loss, though the upper bound from the point with the smallest Hessian rank will of course be the tightest.
[4] Or, in practice, how many Hessian eigenvalues are numerically small enough to not matter at the noise scale we choose.
[5] I.e. the integral of our Bayesian posterior over the parameters after updating on n data points.
[6] Careful: at least my version of the book has a typo here. I think the fraction should be the other way around from the way it's written in the book.
[7] Only for MSE loss, and only if we are at a global optimum where we match the output labels exactly.