Lucius Bushnaq

AI notkilleveryoneism researcher, focused on interpretability. 

Personal account, opinions are my own. 

I have signed no contracts or agreements whose existence I cannot mention.

Comments

Agreed. I do value methods being architecture independent, but mostly just because of this: 

and maybe a sign that a method is principled

At scale, different architectures trained on the same data seem to converge to learning similar algorithms to some extent. I care about decomposing and understanding these algorithms, independent of the architecture they happen to be implemented on. If a mech interp method is formulated in a mostly architecture-independent manner, I take that as a weakly promising sign that it's actually finding the structure of the learned algorithm, instead of structure related to the implementation on one particular architecture.

for a large enough (overparameterized) architecture - in other words it can be measured by the 

The sentence seems cut off.

Sure. But what’s interesting to me here is the implication that, if you restrict yourself to programs below some maximum length, weighing them uniformly apparently works perfectly fine and barely differs from Solomonoff induction at all.

This resolves a remaining confusion I had about the connection between old-school information theory and SLT. It apparently shows that a uniform prior over the parameters (programs) of some fixed-size parameter space is basically fine, actually, in that it fits together with what algorithmic information theory says about inductive inference.
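To make that connection concrete (the notation here is mine, not from the thread): write \ell_{\min}(x) for the length of the shortest program printing x, L for the size cutoff, and c for the handful of bits needed to delimit padding. Using the dead-code construction from the question further down, any padding string of up to L - \ell_{\min}(x) - c bits gives another program for x, so very roughly

\#\{p : U(p) = x,\ \ell(p) \le L\} \;\gtrsim\; \sum_{k=0}^{L - \ell_{\min}(x) - c} 2^{k} \;\approx\; 2^{\,L - \ell_{\min}(x) - c},

and a prior that is uniform over all programs of length at most L therefore gives

-\log_2 P_{\text{uniform}\le L}(x) \;\approx\; \ell_{\min}(x) + \text{const},

the same leading-order preference for short minimum message length that the Solomonoff prior has.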

Yes, my point here is mainly that the exponential decay seems almost baked into the setup even if we don't explicitly set it up that way, not that the decay is very notably stronger than it looks at first glance.

Given how many words have been spilled arguing over the philosophical validity of putting the decay with program length into the prior, this seems kind of important?

Why aren’t there 2^{1000} fewer programs with such dead code and a total length below 10^{90} for p_2, compared to p_1?

Does the Solomonoff Prior Double-Count Simplicity?

Question: I've noticed what seems like a feature of the Solomonoff prior that I haven't seen discussed in any intros I've read. The prior is usually described as favoring simple programs through its exponential weighting term, but aren't simpler programs already exponentially favored in it just through multiplicity alone, before we even apply that weighting?

Consider Solomonoff induction applied to forecasting e.g. a video feed of a whirlpool, represented as a bit string x. The prior probability for any such string is given by:

P(x) = \sum_{p:\, U(p) = x} 2^{-\ell(p)}

where p ranges over programs for a prefix-free Universal Turing Machine U, and \ell(p) is the length of p in bits.

Observation: If we have a simple one kilobit program p_1 that outputs prediction x_1, we can construct nearly 2^{1000} different two kilobit programs that also output x_1 by appending arbitrary "dead code" that never executes.

For example:
DEADCODE="[arbitrary 1 kilobit string]"
[original 1 kilobit program p_1]
EOF
Where programs aren't allowed to have anything follow EOF, to ensure we satisfy the prefix-free requirement.

If we compare p_1 against another two kilobit program p_2 outputting a different prediction x_2, the prediction x_1 from p_1 would get ca. 2^{1000-c} more contributions in the sum, where c is the very small number of bits we need to delimit the DEADCODE garbage string. So we're automatically giving x_1 ca. 2^{1000-c} higher probability – even before applying the length penalty 2^{-\ell(p)}. p_1 has less 'burdensome details', so it has more functionally equivalent implementations. Its predictions seem to be exponentially favored in proportion to its length \ell(p_1) already, due to this multiplicity alone.
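To make the count explicit (my arithmetic, using the same numbers as above): a two kilobit padded copy of p_1 consists of the 1000 bits of p_1, c delimiter bits, and 1000 - c bits of arbitrary DEADCODE, so there are about 2^{1000-c} such programs, each contributing 2^{-2000} under the explicit weighting:

\underbrace{2^{1000-c} \cdot 2^{-2000}}_{\text{padded copies of } p_1} = 2^{-1000-c} \quad \text{vs.} \quad \underbrace{2^{-2000}}_{p_2 \text{ alone}}.

So when we compare programs of the same two kilobit length, x_1 comes out ahead of x_2 by the factor 2^{1000-c} purely from the count of functionally equivalent programs, before the length penalty distinguishes anything.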

So, if we chose a different prior than the Solomonoff prior, one which just assigned uniform probability to all programs below some very large cutoff, say 10^{90} bytes:

P(p) \propto 1 \quad \text{for } \ell(p) \le 10^{90} \text{ bytes}

and then followed the exponential decay of the Solomonoff prior for programs longer than 10^{90} bytes, wouldn't that prior act barely differently than the Solomonoff prior in practice? It’s still exponentially preferring predictions with shorter minimum message length.[1]
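Here is a toy numerical check of this proposal, a sketch I wrote for this comment rather than code from the thread; the machine model is a deliberately crude stand-in where every program for an output consists of a minimal payload, a c-bit delimiter, and arbitrary dead code.

from fractions import Fraction

# Toy model (an assumption for illustration, not a real UTM): the number of
# length-n programs that output x is 2^(n - len_min - c) for n >= len_min + c,
# where len_min is the length of x's shortest program.

C = 2          # delimiter overhead in bits
CUTOFF = 40    # maximum program length considered, in bits
LEN_MIN = {"x1": 8, "x2": 18}  # minimal program lengths for two outputs

def program_count(len_min, n):
    # Number of length-n programs in the toy model that output this string.
    k = n - len_min - C
    return 2**k if k >= 0 else 0

def mass_uniform(len_min):
    # Total prior mass under a uniform prior over all programs of length <= CUTOFF.
    total_programs = sum(2**n for n in range(CUTOFF + 1))
    hits = sum(program_count(len_min, n) for n in range(CUTOFF + 1))
    return Fraction(hits, total_programs)

def mass_solomonoff(len_min):
    # Total prior mass under the 2^-length weighting, truncated at CUTOFF.
    return sum(Fraction(program_count(len_min, n), 2**n) for n in range(CUTOFF + 1))

for name, prior in [("uniform-below-cutoff", mass_uniform),
                    ("2^-length weighting", mass_solomonoff)]:
    ratio = prior(LEN_MIN["x1"]) / prior(LEN_MIN["x2"])
    print(f"{name:>22}: P(x1)/P(x2) ~ {float(ratio):.1f}  (2^(18-8) = {2**10})")

With these toy numbers both priors prefer x1 over x2 by roughly 2^{10}: the uniform-below-cutoff prior gets the whole preference from program counts, and the explicit 2^{-length} weighting only changes the answer by a small factor (the ratio of available padding lengths, 31/21 here).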

Am I missing something here?
 

  1. ^

    Context for the question: Multiplicity of implementation is how simpler hypotheses are favored in Singular Learning Theory despite the prior over neural network weights usually being uniform. I'm trying to understand how those SLT statements about neural networks generalising relate to algorithmic information theory statements about Turing machines, and Jaynes-style pictures of probability theory.


At a very brief skim, it doesn't look like the problem classes this paper looks at are problem classes I'd care about much. Seems like a case of scoping everything broadly enough that something in the defined problem class ends up very hard. 

EDIT: Sorry, misunderstood your question at first.

Even if that's the case, all those subspaces will have some nonzero overlap with the activation vectors of the active subnets. The subspaces of the different small networks in the residual stream aren't orthogonal.
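For intuition, a toy numerical sketch of mine (not the specific setup from the parent comment): two random low-dimensional subspaces of a wider residual stream are almost never exactly orthogonal, so an activation vector lying in one subnet's subspace typically has a small but nonzero projection onto another subnet's subspace.

import numpy as np

rng = np.random.default_rng(0)

d_model = 512  # residual stream width (illustrative choice)
d_sub = 16     # dimension of each small network's subspace (illustrative)

def random_subspace(d, k):
    # Orthonormal basis (d x k) for a random k-dimensional subspace of R^d.
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q

A = random_subspace(d_model, d_sub)  # subspace one subnet reads from
B = random_subspace(d_model, d_sub)  # subspace another active subnet writes to

# Activation vector of the other subnet: a unit vector inside B.
v = B @ rng.standard_normal(d_sub)
v /= np.linalg.norm(v)

# Fraction of v's norm that lands in subspace A (0 would mean orthogonal).
print("norm of v's projection onto A:", round(float(np.linalg.norm(A.T @ v)), 4))

For d_sub = 16 and d_model = 512 the typical overlap comes out around sqrt(16/512) ≈ 0.18 rather than zero, which is the point: the cross-contributions don't vanish, they just shrink as the stream gets wider.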

You can complain that you don't know how to execute physics equations

I'm confused, in what sense don't we know how to do this? Lattice quantum field theory simulations work fine. 
