AI notkilleveryoneism researcher, focused on interpretability.
Personal account, opinions are my own.
I have signed no contracts or agreements whose existence I cannot mention.
for a large enough (overparameterized) architecture - in other words it can be measured by the
The sentence seems cut off.
Sure. But what’s interesting to me here is the implication that, if you restrict yourself to programs below some maximum length, weighing them uniformly apparently works perfectly fine and barely differs from Solomonoff induction at all.
This resolves a remaining confusion I had about the connection between old school information theory and SLT. It apparently shows that a uniform prior over parameters (programs) of some fixed size parameter space is basically fine, actually, in that it fits together with what algorithmic information theory says about inductive inference.
Yes, my point here is mainly that the exponential decay seems almost baked into the setup even if we don't explicitly set it up that way, not that the decay is very notably stronger than it looks at first glance.
Given how many words have been spilled arguing over the philosophical validity of putting the decay with program length into the prior, this seems kind of important?
Why aren’t there 2^{1000} less programs with such dead code and a total length below 10^{90} for p_2, compared to p_1?
Does the Solomonoff Prior Double-Count Simplicity?
Question: I've noticed what seems like a feature of the Solomonoff prior that I haven't seen discussed in any intros I've read. The prior is usually described as favoring simple programs through its exponential weighting term, but aren't simpler programs already exponentially favored in it just through multiplicity alone, before we even apply that weighting?
Consider Solomonoff induction applied to forecasting e.g. a video feed of a whirlpool, represented as a bit string . The prior probability for any such string is given by:
where ranges over programs for a prefix-free Universal Turing Machine.
Observation: If we have a simple one kilobit program that outputs prediction , we can construct nearly different two kilobit programs that also output by appending arbitrary "dead code" that never executes.
For example:
DEADCODE="[arbitrary 1 kilobit string]"
[original 1 kilobit program ]
EOF
Where programs aren't allowed to have anything follow EOF, to ensure we satisfy the prefix free requirement.
If we compare against another two kilobit program outputting a different prediction , the prediction from would get more contributions in the sum, where is the very small number of bits we need to delimit the DEADCODE garbage string. So we're automatically giving ca. higher probability – even before applying the length penalty . has less 'burdensome details', so it has more functionally equivalent implementations. Its predictions seem to be exponentially favored in proportion to its length already due to this multiplicity alone.
So, if we chose a different prior than the Solomonoff prior which just assigned uniform probability to all programs below some very large cutoff, say bytes:
and then followed the exponential decay of the Solomonoff prior for programs longer than bytes, wouldn't that prior act barely differently than the Solomonoff prior in practice? It’s still exponentially preferring predictions with shorter minimum message length.[1]
Am I missing something here?
Context for the question: Multiplicity of implementation is how simpler hypotheses are favored in Singular Learning Theory despite the prior over neural network weights usually being uniform. I'm trying to understand how those SLT statements about neural networks generalising relate to algorithmic information theory statements about Turing machines, and Jaynes-style pictures of probability theory.
At a very brief skim, it doesn't look like the problem classes this paper looks at are problem classes I'd care about much. Seems like a case of scoping everything broadly enough that something in the defined problem class ends up very hard.
Yes, that's right.
EDIT: Sorry, misunderstood your question at first.
Even if , all those subspaces will have some nonzero overlap with the activation vectors of the active subnets. The subspaces of the different small networks in the residual stream aren't orthogonal.
You can complain that you don't know how to execute physics equations
I'm confused, in what sense don't we know how to do this? Lattice quantum field theory simulations work fine.
Agreed. I do value methods being architecture independent, but mostly just because of this:
At scale, different architectures trained on the same data seem to converge to learning similar algorithms to some extent. I care about decomposing and understanding these algorithms, independent of the architecture they happen to be implemented on. If a mech interp method is formulated in a mostly architecture independent manner, I take that as a weakly promising sign that it's actually finding the structure of the learned algorithm, instead of structure related to the implmentation on one particular architecture.