[minor]
I'm a bit confused or I think you need some additional caveats in the intro. I would have said that Bayesian statistics with typical models is well understood as bias toward short description length but there are important caveats in the neural network case, at least conceptually.
(That said, minimum description length is off by a constant factor, but this doesn't seem to be what you're getting at.)
You say:
TLDR: The simplicity bias in Bayesian statistics is not just a bias towards short description length.
And
while it is true that the fewer parameters you use the better, the true complexity measure which appears in the mathematical theory of Bayesian statistics (that is, singular learning theory) is more exotic
But later you say:
The relation between description length and simplicity biases in Bayesian statistics is well-known, but is a phenomena that is confined to regular models and this class of models does not include neural networks.
Maybe I can clarify a few points here:
- It is not known whether the inductive bias of neural network training contains a preference for run-time error-correction. The phenomenon of "backup heads" observed in transformers seems like a good candidate. Can you think of others?
I've heard thirdhand (?) of a transformer whose sublayers will dampen their outputs when is added to that sublayer's input. IE there might be a "target" amount of to have in the residual stream after that sublayer, and the sublayer itself somehow responds to ensure that happens?
If there was some abnormality and there was already a bunch of present, then the sublayer "error corrects" by shrinking its output.
I'm surprised to see an application of the Banach fixed-point theorem as an example of something that's too implicit from the perspective of a computer scientist. After all, real quantities can only be represented in a computer as a sequence of approximations — and that's exactly what the theorem provides.
I would have expected you to use, say, the Brouwer fixed-point theorem instead, because Brouwer fixed points can't be computed to arbitrary precision in general.
(I come from a mathematical background, fwiw.)
This comment of mine is a bit cheeky, since there are plenty of theoretical computer scientists who think about characterising terms as fixed points, and logic programming is a whole discipline that is about characterising the problem rather than constructing a solution, but broadly speaking I think it is true among less theoretically-minded folks that "program" means "thing constructed step by step from atomic pieces".
The folklore relating the simplicity bias in Bayesian statistics to description length is incomplete: while it is true that the fewer parameters you use the better, the true complexity measure which appears in the mathematical theory of Bayesian statistics (that is, singular learning theory) is more exotic. The content of this complexity measure remains quite mysterious, but in this note we point out that in a particular setting it includes a bias towards runtime error-correction. This suggests caution when reasoning about the role of inductive biases in neural network training.
Acknowledgements. Thanks to Jesse Hoogland, Liam Carroll, Rumi Salazar and Simon Pepin Lehalleur for comments.
1. Background
1.1 Relevance to Deep Learning
Consider the problem of solving an ordinary differential equation. A constructive proof involves actually writing down a solution, or an algorithm that in finite time will produce a solution. The Picard-Lindelöf theorem proves that a solution to a broad class of initial value problems exists, but the proof is not constructive: it sets up a contraction mapping on a complete metric space and appeals to the Banach fixed point theorem.
While the Picard-Lindelöf theorem uniquely characterises the solution as the fixed point of a contraction mapping, and gives an iterative process for approximating the solution, it does not construct the solution. However, a construction is not necessary for many of the applications of Picard-Lindelöf (in differential geometry, topology and many parts of analysis). This mode of reasoning about mathematical objects, where it suffices to have characterised[1] them by (universal) properties, is pervasive in modern mathematics (in the above example, the characterising property is the differential equation, or its associated contraction mapping). However, it may seem quite alien to a computer scientist or programmer, who for historical reasons tends to think that there is only one mode of reasoning about mathematical objects: the study of a construction.
In an era where programs are increasingly the product of gradient descent rather than human construction, this attitude is untenable. We may have to accept a mode of reasoning about learned programs based on understanding the nature of the problems to which they are a solution, and the iterative processes that produce them. To understand the implicit algorithms learned by neural networks, it may be necessary from this perspective to understand the inductive biases of the learning process that produces them.
We do not currently have a good understanding of these matters. If we understood these inductive biases better, it could conceivably help us in the context of AI alignment to answer questions like "how likely is deceptive alignment?", "how likely is consequentialism?" and "what goals are instrumentally convergent?".
This note is about the inductive biases of the Bayesian learning process (conditioned on more samples, the posterior increasingly localises around true parameters). Since Bayesian statistics is both fundamental and theoretically tractable, this seems potentially useful for understanding the inductive biases of neural network training. However it is worth noting that the relation between these is not understood at present.
1.2 Singular Learning Theory
The asymptotic expansion of the Bayesian free energy, or "free energy formula", proven by Watanabe in Singular Learning Theory (SLT), introduces the learning coefficient λ as a measure of complexity that is traded off against model accuracy in the process of model selection.
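For orientation, the expansion in question (stated informally, in notation not otherwise used in this note: $L_n$ is the empirical negative log likelihood, $w_0$ a true parameter and $m$ its multiplicity) reads:

```latex
F_n \;=\; -\log \int_W e^{-n L_n(w)}\, \varphi(w)\, dw
    \;=\; n L_n(w_0) \;+\; \lambda \log n \;-\; (m-1)\log\log n \;+\; O_P(1).
```

Minimising $F_n$ thus trades off the accuracy term $nL_n(w_0)$ against the complexity penalty $\lambda \log n$, so regions of parameter space with smaller λ receive more posterior mass.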
In models that are regular or minimally singular the learning coefficient is easy to understand: it is the effective number of parameters in the model, in a suitable sense. More precisely, if a parameter is irrelevant, in the sense that varying it does not change the prediction, this leads to a decrease of 1/2 in the learning coefficient, and thus 2λ agrees with an effective parameter count in minimally singular models.[2] Therefore, in these cases, the tradeoff between model accuracy and complexity embodied in the minimisation of the free energy is familiar.
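One way to make the effective-parameter-count intuition concrete is the volume-scaling characterisation of the learning coefficient: the volume of the set of parameters with loss below ε scales like ε^λ (up to log factors) as ε→0. Here is a minimal numerical sketch of this characterisation; the quadratic toy losses and the Monte Carlo estimator are our illustrative choices, not from the text:

```python
import numpy as np

def volume_scaling_exponent(K, n_samples=400_000, seed=0):
    """Estimate lambda from V(eps) ~ eps^lambda, where V(eps) is the
    volume of {w in [-1,1]^2 : K(w) < eps}."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1.0, 1.0, size=(n_samples, 2))
    k = K(w)
    eps = np.logspace(-3, -1, 8)
    vols = np.array([(k < e).mean() for e in eps])
    # the slope of log V(eps) against log eps estimates the exponent
    slope, _ = np.polyfit(np.log(eps), np.log(vols), 1)
    return slope

# One relevant parameter: K = w1^2 (the w2 direction is irrelevant).
lam1 = volume_scaling_exponent(lambda w: w[:, 0] ** 2)
# Two relevant parameters: K = w1^2 + w2^2.
lam2 = volume_scaling_exponent(lambda w: (w ** 2).sum(axis=1))

print(lam1)  # close to 0.5: the single used parameter costs 1/2
print(lam2)  # close to 1.0: two used parameters cost 1/2 each
```

The recovered exponents are λ = 1/2 and λ = 1 respectively, matching the "each used parameter costs 1/2" heuristic for minimally singular models.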
However, in more degenerate models such as neural networks, this connection between the learning coefficient and parameter counting breaks down. In particular, the learning coefficient depends on the data distribution. This is not an obstacle to theoretically deriving or empirically estimating the learning coefficient, but it does mean that we may lack good intuitions for what this (positive, rational) number is measuring.
1.3 Minimum Description Length
There is a circle of ideas containing Occam's razor, Kolmogorov complexity and the Minimum Description Length (MDL) which strongly informs the intuitions in the machine learning community about the meaning of "simplicity" and its role in determining the inductive biases of neural network training. However it is important to note that the mathematical hypotheses required for the attractive coherence among these ideas do not apply to neural networks.
A good reference for the classical treatment of the MDL is
For the relation of the MDL to Bayesian model selection and the asymptotic expansion of the free energy (in the minimally singular case) see
There has been some attempt in recent years to address the fact that the classical treatment of MDL is not applicable to singular models like neural networks:
It seems that using SLT one could give a generally correct treatment of MDL. However, until such results are established, one should not presume any fundamental theoretical connection between the simplicity bias of Bayesian statistics of neural networks and any simple notion of "description length".
2. Turing Machines With Errors
In this section we give an example where the complexity measure in the free energy asymptotic for singular statistical models (the learning coefficient) is sensitive to algorithmic structure that goes beyond the number of effective parameters.
The example is derived from a literature that has attempted to bridge semantics of logic with statistical learning theory, based on ground-breaking work of Ehrhard-Regnier on differential linear logic. We have tried to make the presentation here self-contained, but the reader can find more information in the following references:
We consider the problem of predicting the outputs of a computable generating process by finding codes for a Universal Turing Machine (UTM). We have in mind a standard kind of UTM U with a description tape (where we write the code which specifies the machine to be simulated), a state tape (where the state of the simulated machine is written) and a work tape (containing the state of the tape of the simulated machine). Some input sequence x∈Σ∗ is written to the work tape, a code c is written to the description tape, an initial state is written to the state tape, and then the UTM proceeds to simulate the Turing Machine M with code c until it halts with the output M(x) on the work tape, if it halts.
We can consider this process as actually instantiated in a machine or, more abstractly, in an automaton. The role of error in such processes is very interesting and leads ultimately to the problem of run-time error correction in modern computing: every step of communication or processing in a computer involves some probability of error and, although this probability is small, the large number of steps and the size of the messages involved mean that some error correction may be necessary for meaningful computation to take place.
We suppose that some computable process in the environment is outputting strings y∈Σ∗ according to some true distribution q(y|x), with each symbol having some small probability of deviating from the corresponding symbol of the output T(x) of a TM T. The problem of statistical inference is to figure out which TM it is, from a given set of input-output pairs Dn = {(x1,y1),…,(xn,yn)}. Of course there will be (infinitely) many TMs consistent with the set of samples Dn, so we expect that the result of statistical inference will be a probability distribution over codes. This is what we call the Bayesian posterior p(c|Dn). We might bias this towards shorter machines by putting a prior φ(c) over the space of models (codes).
This is an old problem, but we consider it from a new angle. Rather than considering a discrete space of codes with a special prior, we extend our set of allowed models of the generating process beyond TMs (as represented by their codes) to include codes with error channels. In just the same way that each "organ'' of the automata[3] in von Neumann's work may have some probability of error, we allow each symbol ci of the code c of M to have some probability of error when it is read by the UTM. The specification of the allowed distribution of errors for each symbol in c defines a point in the space W of codes-with-error-channels, which we take as the parameter space of our new extended statistical model.
Given one of these codes-with-error-channels w∈W and an input sequence x∈Σ∗ the contents of the work tape of U after some given number of steps depends on the errors encountered during the execution of the UTM; that is, it is a probability distribution p(y|x,w). This is our model. Those parameters w∈W which lead to final work tapes close to the true output y, in the sense that the KL divergence between the two probability distributions is small, are more highly preferred by the Bayesian posterior p(w|Dn), which is now defined on the extended space of models W.
2.1 Error as a Probe of Structure
Of course, if the generating process T is computable then its distribution of outputs q(y|x) can be realised by a model p(y|x,w) where w has no errors (for example, the code of T). So what is there to be gained by considering this larger space of models?
The distribution over outputs that results from a given distribution of possible errors in reading a symbol ci of the code c reflects the way that the symbol is used in the computation. While this is clear informally (to understand how something works, try perturbing one of its components and see what happens) this can be given a formal interpretation in the context of differential linear logic, where there is a connection between this sensitivity analysis and the Taylor series expansion of a proof. The simplest example is that if the introduction of errors in ci does not affect the output distribution at all then it is reasonable to conclude that ci is irrelevant to the computation. This already suggests a natural connection between the local learning coefficient λ(c) and the program length of c whereby the posterior p(w|Dn) tends to concentrate around shorter programs which predict the dataset Dn to the same degree of accuracy. We will turn to more interesting examples soon.
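The irrelevance of an unused symbol is easy to see in a toy setting (our construction, for illustration only): consider a hypothetical "interpreter" whose code is three bits and which on input x outputs bit x of the code, so that bit 2 is dead code. Reading each bit through an error channel and enumerating the error patterns exactly shows that perturbing the dead bit leaves the output distribution unchanged, while perturbing a used bit does not:

```python
from itertools import product

def p_output(code, x, flip_probs):
    """Exact output distribution p(y|x,w) for a toy 'lookup table' machine:
    the machine outputs bit x of the 3-bit code; bit 2 is dead code.
    Code bit i is misread (flipped) with probability flip_probs[i]."""
    dist = {0: 0.0, 1: 0.0}
    for errs in product([0, 1], repeat=len(code)):
        prob = 1.0
        for e, q in zip(errs, flip_probs):
            prob *= q if e else (1.0 - q)
        read = [b ^ e for b, e in zip(code, errs)]
        dist[read[x]] += prob  # output is the (possibly misread) bit x
    return dist

code = (1, 0, 1)
clean = p_output(code, x=0, flip_probs=[0.0, 0.0, 0.0])
noisy_dead = p_output(code, x=0, flip_probs=[0.0, 0.0, 0.1])  # perturb the unused bit
noisy_used = p_output(code, x=0, flip_probs=[0.1, 0.0, 0.0])  # perturb a used bit

print(clean)       # {0: 0.0, 1: 1.0}
print(noisy_dead)  # unchanged: bit 2 is irrelevant to the computation
print(noisy_used)  # {0: 0.1, 1: 0.9}: bit 0 is relevant on input 0
```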
The upshot is that given a TM M with code c, the way that the probability distribution p(y|x,w) varies as we vary w near c∈W (noting that p(M(x)|x,c)=1) encodes a lot of information about the structure of the algorithm M when executed on the input x, via the effects of perturbing each one of the bits in the code c. This variation in turn is reflected in the local geometry of the function

$$K(w) = \operatorname{KL}\big(q(x,y)\,\|\,p(x,y|w)\big)$$

near c, of which the local learning coefficient λ(c) is a scalar invariant. The conclusion is that the algebraic geometry of the function germ (K,c) should be expected to reflect the internal structure of the TM M to some degree. At the moment it remains unclear how strong this connection between the geometry of K and the internal structure of M actually is.
2.2 The Example
We fix an input x∈Σ∗ for which the true output is T(x)∈Σ. For simplicity we assume this is a single symbol and that we judge models p(y|x,w) on their predictions for the contents of this single tape square. The contribution of the data sample x to the KL divergence K(w) can be shown to be equivalent[4] to ∫q(x)H(x,w)dx where
$$H(x,w) = \sum_{\sigma\in\Sigma}\big(\delta_{\sigma,T(x)} - p(\sigma|x,w)\big)^2$$

which is a polynomial in the coefficients specifying the probability distributions that make up w. Here δσ,T(x) is the Kronecker delta, which is 1 if σ=T(x) and zero otherwise.
Let us now consider the local behaviour of K (equivalently H) at a point of parameter space which represents a TM M with code c. We fix a single symbol of this code and a perturbation of this code in some direction u. That is, we consider a parameter w=w(ε) which agrees with c except that in the ith position we replace the symbol ci by the distribution ci+εu. For concreteness let us take the original symbol of the code to be ci=0– and take u to be the formal linear combination of symbols 1–−0–, so that for small ε, ci+εu is a probability distribution representing a small chance of reading a 1– instead of a 0–. Then we can expand H(x,w) in powers of ε
$$H(x,w) = h_0 + h_1\varepsilon + h_2\varepsilon^2 + \cdots$$

where each hi=hi(x) is a function of x only (since we assume there is no uncertainty in the other positions of c). Taking ε→0 we obtain H(x,c), which we may assume is zero; that is, we assume that M gives the correct answer M(x)=T(x) for the fixed input x. Thus h0=0 and we have
$$H(x,w) = \sum_{i>0} h_i \varepsilon^i.$$

We are interested in the order of this polynomial, that is, the least i such that hi≠0.
To go further we have to think in more detail about how the UTM U computes the probability p(σ|x,w) that is, how the uncertainty in the code w is propagated to uncertainty in the output. Since we run the simulated machine for some fixed number of steps, the UTM itself is run for some number of steps, and thus makes use of exactly l samples from the ith symbol of the code for some l that is independent of x. The possible trajectories of U are thus dictated by the sequences of possible results μ=(μ1,…,μl)∈Σl and to each we assign a probability
$$P(\mu) = P(\mu_1)\cdots P(\mu_l)$$

In our example, the distribution in w at the ith position is
$$c_i + \varepsilon u = \underline{0} + \varepsilon(\underline{1} - \underline{0}) = (1-\varepsilon)\,\underline{0} + \varepsilon\,\underline{1}$$

so that we need only consider sequences in which the samples 0–,1– appear, and
$$P(\mu) = (1-\varepsilon)^a \varepsilon^b = \Big[\sum_{j=0}^{a}\binom{a}{j}(-1)^j\varepsilon^j\Big]\varepsilon^b = \sum_{j=0}^{a}\binom{a}{j}(-1)^j\varepsilon^{b+j}$$

where a=a(μ) is the number of times 0– is sampled in μ and a+b=l.
One can compute that when the UTM U is run as above, the resulting probability distribution over outputs is
$$p(\sigma|x,w) = \sum_{\mu} \delta_{\sigma, U(\mu,x)}\, P(\mu)$$

where the sum is over all executions of U making use of a sequence of samples μ that, on input x, leads to the output symbol U(μ,x) equal to σ. Here U(μ,x)∈Σ is defined by running the UTM and intervening in any step where the UTM goes to read from the ith symbol of the code, so that on the rth read, for 1≤r≤l, the UTM sees the symbol μr.
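This sum over read-sequences can be computed exactly by enumeration when l is small. The sketch below (with a hypothetical simulated machine that majority-votes over l=3 reads of the perturbed symbol; this choice of U is ours, not from the text) tabulates p(σ|x,w) directly:

```python
from itertools import product

def output_dist(U, x, l, eps):
    """p(sigma|x,w) = sum over read-sequences mu of delta_{sigma, U(mu,x)} * P(mu),
    where each of the l reads of the perturbed symbol returns 1 with probability
    eps and 0 otherwise (the distribution (1 - eps) 0 + eps 1)."""
    p = {}
    for mu in product([0, 1], repeat=l):
        b = sum(mu)                             # number of errors b(mu)
        P_mu = (1 - eps) ** (l - b) * eps ** b  # P(mu) = P(mu_1) ... P(mu_l)
        sigma = U(mu, x)
        p[sigma] = p.get(sigma, 0.0) + P_mu
    return p

# Hypothetical simulated machine: reads the perturbed symbol three times
# and outputs the majority value (the input x plays no role here).
def majority(mu, x):
    return int(sum(mu) >= 2)

dist = output_dist(majority, x=None, l=3, eps=0.01)
print(dist[1])  # probability of the wrong answer: 3*eps^2 - 2*eps^3
```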
Combining the above equations
$$h_1 = \frac{\partial}{\partial\varepsilon}H(x,w)\Big|_{\varepsilon=0} = \sum_{\sigma\in\Sigma}\frac{\partial}{\partial\varepsilon}\big(\delta_{\sigma,T(x)} - p(\sigma|x,w)\big)^2\Big|_{\varepsilon=0} = -\sum_{\sigma\in\Sigma} 2\big(\delta_{\sigma,T(x)} - p(\sigma|x,w)\big)\Big|_{\varepsilon=0}\,\frac{\partial}{\partial\varepsilon}p(\sigma|x,w)\Big|_{\varepsilon=0}$$

which is zero since when ε=0 we have w=c and p(T(x)|x,c)=1, so the first factor in every summand vanishes. Thus h1=0. For t≥2
$$h_t = \frac{1}{t!}\frac{\partial^t}{\partial\varepsilon^t}H(x,w)\Big|_{\varepsilon=0} = \frac{1}{t!}\sum_{\sigma\in\Sigma}\frac{\partial^t}{\partial\varepsilon^t}\big(\delta_{\sigma,T(x)} - p(\sigma|x,w)\big)^2\Big|_{\varepsilon=0} = -\frac{2}{t!}\sum_{\sigma\in\Sigma}\frac{\partial^{t-1}}{\partial\varepsilon^{t-1}}\Big\{\big(\delta_{\sigma,T(x)} - p(\sigma|x,w)\big)\frac{\partial}{\partial\varepsilon}p(\sigma|x,w)\Big\}\Big|_{\varepsilon=0} = \frac{2}{t!}\sum_{\sigma\in\Sigma}\sum_{s=1}^{t-1}\binom{t-1}{s}\Big\{\frac{\partial^s}{\partial\varepsilon^s}p(\sigma|x,w)\,\frac{\partial^{t-s}}{\partial\varepsilon^{t-s}}p(\sigma|x,w)\Big\}\Big|_{\varepsilon=0}.$$

For s≥1
$$\frac{\partial^s}{\partial\varepsilon^s}p(\sigma|x,w)\Big|_{\varepsilon=0} = \sum_{\mu}\delta_{\sigma,U(\mu,x)}\frac{\partial^s}{\partial\varepsilon^s}P(\mu)\Big|_{\varepsilon=0} = \sum_{\mu}\delta_{\sigma,U(\mu,x)}\sum_{j=0}^{a}\binom{a}{j}(-1)^j\frac{\partial^s}{\partial\varepsilon^s}\varepsilon^{b+j}\Big|_{\varepsilon=0}.$$
Recall that b=b(μ) is l−a(μ) where a(μ) is the number of 0–'s appearing in the sequence μ. Thus b(μ) is the number of 1–'s sampled in the sequence μ, that is, the number of errors. The derivative is nonzero only when s=b+j, which for 0≤j≤a is only possible (given that s≥1 and 0≤b(μ)≤l) if b≤s≤l. The only "paths" or sequences μ that contribute to the s-fold derivative are ones which contain ≤s errors. Substituting in the above yields

$$\frac{\partial^s}{\partial\varepsilon^s}p(\sigma|x,w)\Big|_{\varepsilon=0} = s!\sum_{\mu\,:\,b(\mu)\le s}(-1)^{s-b(\mu)}\,\delta_{\sigma,U(\mu,x)}\binom{l-b(\mu)}{l-s}$$

where we have used that

$$\binom{a(\mu)}{s-b(\mu)} = \binom{l-b(\mu)}{l-s}.$$

We can use this formula to compute the coefficient ht.
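The closed form for the s-fold derivative can be checked mechanically in a small case, by expanding P(μ) as a polynomial in ε and comparing the coefficient of ε^s against the formula. The majority-vote reading pattern below is our illustrative choice of U:

```python
from itertools import product
from math import comb, factorial

l, s = 3, 2
sigma = 1
U = lambda mu: int(sum(mu) >= 2)   # hypothetical machine: majority vote over l reads

# p(sigma|x,w) as a polynomial in eps: each contributing mu adds
# (1-eps)^(l-b) * eps^b, expanded with the binomial theorem.
coeffs = [0] * (l + 1)
for mu in product([0, 1], repeat=l):
    if U(mu) != sigma:
        continue
    b = sum(mu)
    a = l - b
    for j in range(a + 1):
        coeffs[b + j] += comb(a, j) * (-1) ** j

# direct s-th derivative at eps = 0: s! times the coefficient of eps^s
direct = factorial(s) * coeffs[s]

# closed form: s! * sum_{mu : b(mu) <= s} (-1)^(s - b) * delta * C(l - b, l - s)
closed = factorial(s) * sum(
    (-1) ** (s - sum(mu)) * comb(l - sum(mu), l - s)
    for mu in product([0, 1], repeat=l)
    if sum(mu) <= s and U(mu) == sigma)

print(direct, closed)  # both equal 6 for this machine
```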
Definition. We say that the Turing Machine M with code c is robust to s errors on input x in position i if, for any execution path μ of the UTM U initialised with c on the description tape and x on the work tape involving b(μ)≤s errors, U(μ,x)=M(x).
Lemma. If M is robust to s errors on input x in position i then ht=0 for t≤s+1. That is, the order of H(x,w) in ε is strictly greater than s+1.

Proof. We set $A_{\sigma,j} = \frac{\partial^j}{\partial\varepsilon^j}p(\sigma|x,w)\big|_{\varepsilon=0}$. If M is robust to s errors in the stated sense, then it is clear from the above that Aσ,j=0 when σ≠M(x) and 1≤j≤s, since it is never true that U(μ,x)=σ when μ is such that b(μ)≤j. In the case where σ=M(x) and 1≤j≤s

$$A_{\sigma,j} = j!\sum_{b\le j}\ \sum_{b(\mu)=b}(-1)^{j-b}\binom{a}{j-b} = j!\sum_{b\le j}(-1)^{j-b}\binom{l}{b}\binom{l-b}{j-b} = \frac{l!}{(l-j)!}\sum_{b\le j}(-1)^{j-b}\frac{j!}{b!\,(j-b)!} = \frac{l!}{(l-j)!}\sum_{b\le j}(-1)^{j-b}\binom{j}{b} = 0,$$

since there are $\binom{l}{b}$ sequences μ with b(μ)=b (each with δσ,U(μ,x)=1 by robustness), and $\sum_{b=0}^{j}(-1)^{j-b}\binom{j}{b} = (1-1)^j = 0$ because j≥1.

Hence for any σ∈Σ and any 1≤j≤s we have Aσ,j=0. Thus for t≤s+1

$$h_t = \frac{2}{t!}\sum_{\sigma\in\Sigma}\sum_{j=1}^{t-1}\binom{t-1}{j}A_{\sigma,j}A_{\sigma,t-j} = \frac{2}{t!}\sum_{j=1}^{t-1}\binom{t-1}{j}A_{M(x),j}A_{M(x),t-j} = 0$$

since every summand involves a term AM(x),j with 1≤j≤s. □
Example. Since any TM is robust to zero errors, we always have h1=0. If M is robust to one error on input x then h2=0, so the polynomial H(x,w) is at least cubic in ε.
If M is robust to s errors on every input x then H(w)=∫H(x,w)dx has order >s+1 in ε, from which we infer that K(w) has order >s+1 in ε locally at c. If 0–,1– are the only allowed symbols in the code at this position, then the local learning coefficient satisfies λ(c) ≤ 1/(s+2).
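The lemma can be verified directly in a toy setting (our construction, with our choice of reading patterns): take Σ={0,1} with true output 0–, so that H = 2·p(1–|x,w)² and the order of H is twice the order of the wrong-answer probability. A machine that takes a single read is robust to s=0 errors and gives order 2, while majority voting over three reads is robust to s=1 errors and pushes the order to 4; in both cases the order strictly exceeds s+1, as the lemma predicts:

```python
from itertools import product
from math import comb

def wrong_answer_poly(U, l):
    """Coefficients in eps of p(1|x,w): the true output is 0, so sigma = 1
    is the wrong answer. Enumerates read-sequences mu in {0,1}^l."""
    coeffs = [0] * (l + 1)
    for mu in product([0, 1], repeat=l):
        if U(mu) != 1:
            continue
        b = sum(mu)
        for j in range(l - b + 1):
            coeffs[b + j] += comb(l - b, j) * (-1) ** j
    return coeffs

def order_of_H(p1):
    """H = 2 * p1^2, so the order of H is the least power of eps
    with a nonzero coefficient in the squared polynomial."""
    sq = [0] * (2 * len(p1) - 1)
    for i, a in enumerate(p1):
        for k, b in enumerate(p1):
            sq[i + k] += 2 * a * b
    return next(i for i, c in enumerate(sq) if c != 0)

# Not robust (s = 0): the output is a single read of the symbol.
print(order_of_H(wrong_answer_poly(lambda mu: mu[0], 1)))              # 2
# Robust to one error (s = 1): majority vote over three reads.
print(order_of_H(wrong_answer_poly(lambda mu: int(sum(mu) >= 2), 3)))  # 4
```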
2.3 Summary
The example illustrates one simple way in which the structure of a TM M with code c (in this case, the capability to recover from read errors in its specification during run-time) influences the geometry of the function germ (K,c). We expect there are many further forms of degeneracy, which for example involve sophisticated interactions between bits of the code.
In a minimally singular model every used parameter costs 1/2 (in the sense that it increases λ by that amount) and so our intuition might suggest that in the current setting every used bit in the code should cost 1/2. It is true that every used bit costs at most 1/2, but we have just seen that not all bits are created equal: a bit which is "error-corrected", in the sense that when the UTM executes c it is robust to ≤s errors in that bit, only costs 1/(s+2) < 1/2. The principle of minimisation of free energy therefore suggests that, all else being equal, the Bayesian posterior will prefer codes where the bits are error-corrected in this way. That is, two codes of equal length and matching the outputs of the generating process equally well may nonetheless have neighbourhoods assigned different probability by the posterior, if one of them has error-correction and the other does not. Thus "simplicity" (in the sense of λ) includes robustness to errors.
We note that it is straightforward to provide a TM with run-time error-correction at the cost of execution time, by running the program multiple times and applying a majority vote to determine the answer. More sophisticated schemes are possible, and there is a literature that has worked out various ways of doing this.
3. Conclusion
The relation between description length and simplicity biases in Bayesian statistics is well-known, but it is a phenomenon confined to regular models, and this class of models does not include neural networks. We do not yet possess conceptually simple general intuitions about the inductive biases of Bayesian statistics. In this note we have exhibited one case in which the simplicity bias is more exotic.
3.1 Questions and Open Problems
The analysis in this note is preliminary. The set of people working on ideas like this is small, and if you have relevant background in mathematics or ML you could probably figure out something useful. Notes:
[1] Simon Pepin Lehalleur says: "defined implicitly" is another common way to get at the same idea. Statistics is in some sense all about implicit definitions ("statistics is the inverse problem to probability").

[2] In the case where the KL divergence or loss function is locally, after a change of variables, a sum of squares $\sum_{i=1}^{d'} w_i^2$ in $\mathbb{R}^d$, then since changing each of the wi for 1≤i≤d′ increases the loss, we refer to these parameters as "relevant", whereas changing each of the wi for i>d′ does not change the loss, so we refer to these parameters as "irrelevant". More precisely, there is a finite range within which the irrelevant parameters can be varied without changing the loss.

[3] By an "organ" von Neumann means a fundamental unit, from which more complicated computations can be built.

[4] In the sense that there exist c1,c2>0 such that with H(w)=∫H(x,w)dx we have c1H(w)≤K(w)≤c2H(w). In particular the local learning coefficients of K and H agree.