I think that for each NN architecture+prior+task/loss, conditioning the initialization prior on train data (or doing some other bayesian thing) is typically a completely different learning algorithm from (S)GD-learning, because local learning is a very different thing, which is one reason I doubt the story in the slides as an explanation of generalization in deep learning[1].[2] But setting this aside (though I will touch on it again briefly in the last point I make below), I agree it would be cool to have a story connecting the parametrized bayesian thing to something like Solomonoff induction. Here's an outline of an attempt to give a more precise story extending the one in Lucius's slides, with a few afaik-open problems:
Separately from the above bridge attempt, it is not at all obvious to me that parametrized bayes in fact has such good generalization behavior (i.e., "at least as good as deep learning", whatever that means, let's say)[4]; here are some messages on this topic I sent to [the group chat in which the posted discussion happened] later:
"i'd be interested in hearing your reasons to think that NN-parametrized bayesian inference with a prior given by canonical initialization randomization (or some other reasonable prior) generalizes well (for eg canonical ML tasks or boolean functions), if you think it does — this isn't so clear to me at all
some practical SGD-NNs generalize decently, but that's imo a sufficiently different learning process to give little evidence about the bayesian case (but i'm open to further discussion of this). i have some vague sense that the bayesian thing should be better than SGD, but idk if i actually have good reason to believe this?
i assume that there are some other practical ML things inspired by bayes which generalize decently but it seems plausible that those are still pretty local so pretty far from actual bayes and maybe even closer to SGD than to bayes, tho idk what i should precisely mean by that. but eg it seems plausible from 3 min of thinking that some MCMC (eg SGLD) setup with a non-galactic amount of time on a NN of practical size would basically walk from init to a local likelihood max and not escape it in time, which sounds a lot more like SGD than like bayes (but idk maybe some step size scheduling makes the mixing time non-galactic in some interesting case somehow, or if it doesn't actually do that maybe it can give a fine approximation of the posterior in some other practical sense anyway? seems tough). i haven't thought about variational inference much tho — maybe there's something practical which is more like bayes here and we could get some relevant evidence from that
maybe there's some obvious answer and i'm being stupid here, idk :)
one could also directly appeal to the uniformly random program analogy but the current version of that imo doesn't remotely constitute sufficiently good reason to think that bayesian NNs generalize well on its own"
(edit: this comment suggests https://arxiv.org/pdf/2002.02405 as evidence that bayes-NNs generalize worse than SGD-NNs. but idk — I haven't looked at the paper yet — ie no endorsement of it one way or the other from me atm)
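(For reference, the SGLD update alluded to in the quoted messages is, in its standard form (Welling & Teh 2011), an SGD step on the log-posterior plus injected Gaussian noise:

\Delta\theta_t \;=\; \frac{\epsilon_t}{2}\Bigl(\nabla_\theta \log p(\theta_t) \;+\; \frac{N}{n}\sum_{i=1}^{n} \nabla_\theta \log p(x_i \mid \theta_t)\Bigr) \;+\; \eta_t, \qquad \eta_t \sim \mathcal{N}(0,\, \epsilon_t I),

where the sum is over a minibatch of n of the N data points. So whether this behaves "like bayes" or "like SGD" comes down to whether the chain actually mixes to the posterior in a non-galactic number of steps.)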
to the extent that deep learning in fact exhibits good generalization, which is probably a very small extent compared to sth like Solomonoff induction, and this has to do with some stuff I talked about in my messages in the post above; but I digress ↩︎
I also think that different architecture+prior+task/loss choices probably give many substantially-differently-behaved learning setups, deserving somewhat separate explanations of generalization, for both bayes and SGD. ↩︎
edit: Instead of doing this thing with circuits, you could get an alternative "principled generalization baseline/ceiling" from doing the same thing with programs instead (i.e., have a complexity prior on turing machines and condition it on seen input-output pairs), which I think ends up being equivalent (up to a probably-in-some-sense-small term) to using the kolmogorov complexities of these functions (thought of "extensionally" as strings, ie just listing outputs in some canonical order (different choices of canonical order should again give the same complexities (up to a probably-in-some-sense-small term))). While this is probably a more standard choice historically, it seems worse for our purposes given that (1) it would probably be strictly harder to build a bridge from NNs to it (and there probably just isn't any NNs <-> programs bridge which is as precise as the NNs <-> circuits bridge we might hope to build, given that NNs are already circuity things and it's easy to have a small program for a function without having a small circuit for it (as the small program could run for a long time)), and (2) it's imo plausible that some variant of the circuit prior is "philosophically/physically more correct" than the program prior, though this is less clear than the first point. ↩︎
to be clear: I'm not claiming it doesn't have good generalization behavior — instead, I lack good evidence/reason to think it does or doesn't and feel like I don't know ↩︎
Background
Lucius: I recently held a small talk presenting an idea for how and why deep learning generalises. It tried to reduce concepts from Singular Learning Theory back to basic algorithmic information theory to sketch a unified picture that starts with Solomonoff induction and, with a lot of hand-waving, derives that under some assumptions, just fitting a big function to your data using a local optimisation method like stochastic gradient descent maybe, sorta, kind of, amounts to a cheap bargain bin approximation of running Solomonoff induction on that data.
Lucius: The parametrised function we fit to the data has to be the sort that can act like a vaguely passable approximation of a compute-limited Universal Turing machine, with parameter configurations playing the role of programs.
Lucius: In a Universal Turing machine, 'simpler' hypotheses with fewer ‘burdensome details’ will be implemented by exponentially more equivalent programs. So, simpler hypotheses take up exponentially more volume in program space. Thus, if we start with a uniform prior over all programs of some fixed length l, we’re effectively implementing an approximation of the Solomonoff prior, giving exponentially more weight to simpler hypotheses.[1]
Lucius: Similarly, in a good parametrised function, ‘simpler’ fits to the data with fewer ‘burdensome details’ will correspond to exponentially more equivalent parameter settings than complicated ones. So, simple functions will take up an exponentially larger volume in parameter space than complicated functions, and a uniform prior over parameter space will exponentially favour them before we even see any data.
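(A rough way to write down these two counting arguments, not from the slides: write K(h) for the length of the shortest program implementing hypothesis h, F_θ for the function implemented by parameter setting θ, and assume the usual padding trick of appending unread bits to a program. Under a uniform prior over length-l programs,

P(h) \;=\; \frac{\#\{\,p \in \{0,1\}^l : p \text{ implements } h\,\}}{2^l} \;\gtrsim\; \frac{2^{\,l - K(h)}}{2^l} \;=\; 2^{-K(h)},

since a shortest program for h can be padded out to length l in roughly 2^(l−K(h)) ways. Likewise, under a uniform prior over a parameter space Θ,

P(f) \;=\; \frac{\operatorname{Vol}\{\theta \in \Theta : F_\theta = f\}}{\operatorname{Vol}(\Theta)},

so functions realised by exponentially larger regions of parameter space get exponentially more prior weight before any data is seen.)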
Lucius: Later, I posted the slides for the talk in a group chat because someone asked for them. I did not include a GIANT DISCLAIMER that these slides are NOT RIGOROUS and present an OVERSIMPLIFIED STORY to help the audience follow the thread of the explanation better. They should NOT BE PARSED without ACCOMPANYING COMMENTARY.
Lucius: This was very stupid of me.
Lucius: Kaarel and Dmitry promptly jumped in to critique the slides, and I replied to their critiques. Afterwards, we all realised that the ensuing discussion basically acts as the commentary the slides were missing, and also touches on many other ideas regarding the nature of intelligence that may be of interest to some.
Lucius: So we figured we'd post the discussion here, somewhat edited, along with the slides.
Slides
DISCLAIMER: THESE OVERSIMPLIFIED SLIDES present an OVERSIMPLIFIED STORY because they were just PROPS FOR A TALK. Their contents should be taken IN THE CONTEXT of the CRITIQUE AND DISCUSSION BELOW and probably seasoned with LARGE QUANTITIES OF SALT even then.
Slides: Deep Learning is cheap Solomonoff Induction
Discussion
Lucius: Slides Jake asked for.
Kaarel: The idea of a wheel is simple, and humans can think about stuff and invent wheels, but local learning (evolution) struggles to invent wheels.
Kaarel: More generally, most mathematical concepts probably could not be invented by SGD (at any scale that fits in this universe) (and mathematical concepts are useful for prediction).
Jake:
Kaarel: Yes, yes
Jake: Yes yes to u
Jake: I’m being facetious.
Jake: No macro wheels, the point stands.
Jake: Indeed more the fool evolution for inventing wheels in some places but not using them everywhere where the logical inference is small. But evolution isn’t taking steps in logical inference space.
Kaarel: Humans invent mathematical concepts largely not by doing gradient descent in concept-space, but by thinking about what concept to make. It's maybe more like solving a constraint satisfaction problem where you can't find a solution by local search but can find a solution by thinking.
Kaarel: (i'm not saying that concepts cannot be invented by a thing arising from SGD doing some thinking about which concepts to make — that's plausible with the right setup, though it could require something beyond what's currently in common use/consideration. certainly evolution is a pretty dumb pretty local thing which sorta built a thing which can make concepts)
Kaarel: A neural net struggling to learn which subset of inputs was XOR-ed despite it being a simple linear algebra problem to determine which subset was XOR-ed (and probably not too hard for a human to think of trying that to solve the same prediction problem) is a good example to have in mind here.
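(To make the "simple linear algebra" concrete: with enough labelled samples, the hidden subset is the solution of a linear system over GF(2), recoverable by Gaussian elimination. The sketch below is an illustration of that observation with made-up names and sizes, not code from the discussion.)

```python
# Minimal sketch: recovering which subset of input bits is XOR-ed together
# is just solving a linear system over GF(2).
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 200                       # input dimension, number of labelled samples
secret = rng.integers(0, 2, n)       # hidden subset, as a 0/1 indicator vector
X = rng.integers(0, 2, (m, n))       # random boolean inputs
y = (X @ secret) % 2                 # labels: parity of the bits in the hidden subset

# Gaussian elimination over GF(2) on the augmented matrix [X | y], solving X w = y (mod 2).
A = np.concatenate([X, y[:, None]], axis=1)
row = 0
for col in range(n):
    pivot = next((r for r in range(row, m) if A[r, col]), None)
    if pivot is None:
        continue                     # no pivot in this column (rare for m >> n)
    A[[row, pivot]] = A[[pivot, row]]
    for r in range(m):
        if r != row and A[r, col]:
            A[r] = (A[r] + A[row]) % 2
    row += 1

# Read off the solution: each reduced row assigns the right-hand side to its pivot column.
w = np.zeros(n, dtype=int)
for r in range(row):
    pivot_cols = np.flatnonzero(A[r, :n])
    if pivot_cols.size:
        w[pivot_cols[0]] = A[r, n]

print(np.array_equal(w, secret))     # True (with m >> n the system has full column rank w.h.p.)
```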
Lucius: I think you may be parsing these slides differently from what I tried to say with them in the talk. I don't think humans or AGIs being able to find things evolution and neural network training can't is evidence against what I'm trying to say. Moving locally in some hypothesis space on a second-to-second basis doesn't mean you're well described as moving locally in a fitness landscape over long time horizons.
Lucius: If I want to build a car, but don't know how, this model does not say I have to think through random car designs and update until I find one that seems to work. That's sort of what this model suggests happens a lot on a second-by-second basis, but over longer time scales, your current position in search space influences what data you'll even try to fit next and with what sort of function. So you might move to making car designs in some platonic realm where wheels just turn without engines, and if some design works in that space, you might go back to fitting in the original space with a new starting point and new data to fit based on that platonic design.
Lucius: According to this idea, the rules for how to pick what to fit next might actually be really dumb. At least at first. You might memorise what rule worked well on a previous occasion that pattern matches this occasion, for example (where the pattern match is just a dumb function fit, like everything else).
Lucius: Eventually, your rules for what to think about next might get better. But that might be through normal updating that works just like everything else in this setup. Memorise, then try to find more general rules by fitting the memorisations with some functions.
Lucius: Through this process, you might come to acquire rules for what to think about and fit functions to that are quite advanced indeed. Rules like formal logical thinking, that have your search trajectory move in ways very unlike what we'd usually associate with local updating.
Lucius: I don't think these very fancy rules necessarily contain some sort of core insight required for generality though. I think your mind works ok and is reasonably general even before it gets the very fancy rules. I'm not sure there's any kind of core pattern that unites all the fancy rules either. I think they're maybe just good rules you can use.
Lucius: So I guess the difference in (lightly held) opinion here might be this part
The idea in the slides does say that local search in 'concept space' is probably pretty central for doing things. There are also rules for what kind of concept space to search in and when to fit what data, but these are more a collection of heuristics learned over time, and the very basics of general intelligence maybe work ok even when these rules are still quite basic.
Kaarel: Some thoughts about intelligence in sorta-continuation of the above discussion. This isn't a direct response, but some considerations that I think are probably relevant + true/good:
Kaarel: Some sorta meta commentary: I think that "what's the science algorithm?"/"how does problem-solving/thinking work?" is sometimes asked like one is asking for some sort of final formula for thinking. This is imo like asking "what's the ultimate technology?" or "what's the formula for mathematics?". There's no such thing as 'a formula for mathematics' — there are infinitely many cool things to be figured out about mathematical objects. I think there are also infinitely many cool things to be figured out about how to think well, and infinitely many cool structures to incorporate into our thinking.
Kaarel: I would guess that in our circles, this imo mistake is downstream of something like this mistake by Yudkowsky, but idk.
Kaarel: Of course, non-ultimate variants of these questions, as in "let's figure out more about intelligence", are imo centrally interesting. I just don't think these questions are the sorts of things to be 'solved' — they are the sorts of things to be indefinitely figured out better (it's imo probably even appropriate to think of oneself as always only understanding a finite fragment of an infinite thing here, like in math)
Dmitry: I finally read Lucius' slide show. I like it a lot.
Dmitry: A few comments however:
Dmitry: 1. I would say that SLT isn't about the Bayesian prior (that's just "information geometry" or whatever). It's rather an extra few epicycles: giving more details on what the prior looks like geometrically.
Dmitry: It's just not true that programs correctly sample the "Bayesian" prior. Findability/SGD is a massive part of what is actually learnable. This (not SLT/Bayesianism) is what explains modularity in particular.
Dmitry: I'm planning to make a post about this, but there are lots of ways to see it. There's the parity problem, there's noticing that harder problems show up in LLM programs (which doesn't mean they're not solvable, just that they're not solvable in the most compressed / "Solomonoff-preferred" way), it's evident in the algorithm learned for composition in associative groups, etc.
Dmitry: I think studying the Bayesian/informational prior is still very valuable and a great correction to what people talk about when they talk about priors (it's just that SGD and learnability are even more important).
Dmitry: I think Kaarel actually said more or less the same thing. Basically: the SGD prior says that NNs prefer the "most efficient, including degeneracy, way of solving a problem". This is not true. In fact, as you both agree, NNs *locally* prefer the "most efficient, including degeneracy, way of solving the problem in a way that's a perturbation of a previous solution". This means that (with high probability) there needs to be a "learning story" connecting an instance of the algorithm that's ultimately learned to a generic starting point. The solution predicted by the "true" Bayesian prior is not in fact learnable in any interesting real-world context.
Dmitry: One way to motivate this is, once again, that the most efficient way of learning parity isn't in fact learnable (and this is provably the case). It's still possible to train a NN with a more interesting/complex loss to solve the parity problem: e.g. if the NN has a concept of language and logic, in a suitable sense, it can solve parity in polynomial time. It's just that, provably (in a certain sense), it's not going to be finding the optimal "Bayesian-preferred" solution, but rather routing around the un-findability of this solution and going through some more elaborate SGD path to find a less degenerate/natural solution that has the correct I/O behavior.
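(One standard way to cash out "provably": for the family of all 2^n parities on n bits, the gradient of the loss is nearly the same whichever parity is the target, so gradient-based training gets essentially no signal about which one it is supposed to learn. Roughly, in the style of the gradient-variance bounds for orthogonal target families in Shalev-Shwartz, Shamir & Shammah (2017),

\mathbb{E}_S \,\bigl\| \nabla_\theta L_S(\theta) \;-\; \mathbb{E}_{S'} \nabla_\theta L_{S'}(\theta) \bigr\|^2 \;\lesssim\; \frac{G(\theta)^2}{2^n},

where L_S is the loss against the parity on subset S and G(θ) bounds the relevant gradient norms, so distinguishing the exponentially many candidate parities from gradients alone requires exponentially many steps or exponentially fine precision.)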
Dmitry: I think in practice, a lot of the time this means that the generalizing solution found will just be more modular than the optimal solution (this is the case for associative groups). But more generally, training stories can be more complicated, since you can use GD to "edit" or post-process a highly modular but bad algorithm into a better but less parallel algorithm. Trying to model this is the discussion about priors that Kaarel described, which we had when he was visiting.
Dmitry: Re: Kaarel's "structure of intelligence", I think I largely agree. But I think the difference between us is that I am somewhat strongly in the "interp is possible" camp. In physics, you get complex and interesting behavior from taking simple "interaction" behaviors at different characteristic scales and putting them together in a giant superposition. I think NNs follow the same general framework, and it's plausible that we'll be "done" with interp (at least well enough to do things like guarantee non-deception from low-level behaviors) once we have suitably good notions of "interaction", "characteristic scale", and "putting together".
Dmitry: I also think that there are ways in which we can get useful-for-safety interp while our understanding is still bad, so long as we get some partial completeness: e.g. (mostly) completely characterize what can happen at a particular characteristic scale.
Dmitry: This makes me particularly excited about this Sam Marks paper, since while I don't think that "find all mechanisms of this shape" is a viable path, it would nevertheless be quite exciting and plausible to say something like "a large component of NN behavior is described by a superposition of diagrams of roughly this complexity" (this could be shown by looking at a random phrase and producing a collection of diagrams that explain it). This would tell us something nontrivial about the complexity and branchiness of LLM intelligence, without actually explaining it in a way that is reverse-engineerable (same as people do for physics).
Lucius:
I agree with this. These slides are giving a simplified 0-th order toy story to aid understanding. They're really not complete without the accompanying talk.
Since this has now happened a second time: no, seriously, these are just props from a small Apollo-internal talk that Jake asked for. They're not an essay.
Lucius:
I feel like there's some miscommunication here. Let me back up and try to put my words more in the context that produced them.
In the year 2020, if you had asked me to write down a practical, real general intelligence algorithm, I would have told you that I had no clue how to do that. Not the best possible algorithm. Just any algorithm at all. Part of why I was unable to do that might have been a lack of knowledge about what you call many interesting thinking-structures. But it felt to me like there was a much graver and deeper problem there than just insufficient knowledge about a vast and rich vista of tools for thinking. I certainly did not understand a lot of the little tricks and ingredients that maybe help with generality to the point that I could formalise them as code. But I felt that there was something even more central there that I was far more fundamentally confused about: How do you make an algorithm that generalises at all, ever, even a little bit, from old data to new data?
Even before deep learning, this felt like maybe the most core question here to me. I wouldn’t have put it in those terms back then, though; I was a teenager without the math vocab to express this. It's kind of in the name: general intelligence. When observing my own thoughts while solving problems, or thinking about how the brain learned over the long term, I felt like this machinery had to somehow, constantly and at every step, navigate a vast number of potentially correct local fits to the data and think only about the ones that generalised to new data. Learning about Solomonoff induction at least clarified how picking a good hypothesis out of many that fit the data was possible at all even in mathematical principle. But my brain was clearly purposefully picking out simple hypotheses, rather than tracking every hypothesis possible. At every moment of conscious perception, the thoughts coming into my conscious awareness seemed to me to be completely undetermined by previous thoughts. There was some seeming ability there to always continue the line of fuzzy logic such that the overall chain was likely to generalise and reach the target, when exponentially many other continuations would have failed. It felt to me like however that trick worked, it might also be the trick through which I could learn to walk or hold a spoon, since this process seemed to have the same generalisation-shaped hole in the middle.
So it used to feel to me that there was an utterly crucial core insight missing here, about how to get any generality whatsoever. I think I was not alone in this. I very much had the impression that Yudkowsky and probably MIRI more generally thought about it in a similar way, for example. Among many others.
I was so confused the first time I saw an explainer video on deep learning. I had heard that they got computers to learn now, so I thought they must have figured out how to solve the generalisation problem. I kept rewatching the video looking for the part where the generalisation trick happens, and came up empty.
Maybe this doesn't feel like such a central question to slightly younger people who grew up more used to the observed fact that deep learning works. Or maybe you just never made the mistake I and perhaps many others made for so many years when thinking about function fits and the Solomonoff prior, so you never became deeply confused in the first place. But to me, this question of generalisation was the question. All the other stuff that might or might not go into general intelligence could be rich and vast and poorly understood, but it never felt completely mysterious and utterly indispensable to AGI to me the way the generalisation problem did.
If you never thought this was all that mysterious and it seemed explicable with the math we already had, Bayes points to you. In that case, I could see how you'd be confused or annoyed by people like me walking around ranting about this basic fact like it's some grand revelation and we understand AGI now. But if you put yourself in my shoes, can you see how slowly resolving this very basic confusion over the last 2-3 years would feel like progress to me?
Kaarel: Ok i'm on board with there being some progress here :)
Bridging NN SGD and Solomonoff induction (from Oct 2024)
Kaarel: Dmitry and I had a few conversations about the stuff he and Louis Jaburi (et al.) are figuring out about how NNs learn group operations, in which we discussed how it would be cool to try to spell out an analogy between the kind of structure-selection seen empirically for learning group operations and Solomonoff induction, and how it might be helpful to have some explicit story like that written up for people to have a crisp example to consider when thinking about how NN learning is or isn't like Solomonoff induction.
Lucius: I dunno about using that as the specific example, but I agree that spelling out the story as we currently see it more explicitly for everyone might be a good idea.
Lucius: I’d love to compare/contrast:
Lucius: We can’t actually run that empirically for most realistic problems for the first two options, of course. So either we keep those to abstract discussion, or we find a really neat example where you can do it on paper.
Kaarel: To say a bit more about the group operation Solomonoff thing: Dmitry was telling me that it's looking plausible that one could develop a good description whose broad strokes are: "There's a prior on circuits. These circuits are of some known kind: mapping both inputs into some 'small' algebraic object, multiplying there, and mapping back; each such circuit is only somewhat better than random at getting the right output group element, but the signal beats the noise when sufficiently many circuits are added. The output of a trained NN on an input (g, h) is a linear combination of outputs of random circuits on (g, h) drawn from this prior. (These circuits get 'drawn' during NN learning.) And by this I of course don't just mean to state a fact about the input-output behavior, but to also claim that the NN is structured this way."
Kaarel: Given my surface understanding it seems plausible to me that with some significant work the prior could also be explicitly described here.
Kaarel: I guess the thought is that the trained NN looks a bit like a subroutine of a Solomonoff inductor here: "The NN has a bunch of small computations it knows how to execute. To predict an output, it executes all of them and aggregates their outputs. This is like the part of Solomonoff induction that happens after algorithms which have previously made a wrong prediction are eliminated: you run all remaining algorithms and aggregate their predictions for the next bit."
Kaarel: And NN training is when the small patterns that are to be used get selected. That's like the subroutine of a Solomonoff inductor that selects which algorithms get their opinions about the next bit taken into account. So we have some analogy between [NN training + inference] and solomonoff induction here.
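(A toy sketch of this eliminate-then-aggregate shape, with made-up predictors standing in for the NN's small circuits; it is only meant to illustrate the analogy, not the actual group-operation setup:)

```python
# Toy sketch: keep a pool of simple predictors, drop the ones that have already
# made a wrong prediction on the history, and predict the next bit by a
# weight-proportional vote of the survivors.
from typing import Callable, List, Tuple

Predictor = Callable[[List[int]], int]   # maps a history of bits to a predicted next bit

def predict_next(history: List[int],
                 pool: List[Tuple[Predictor, float]]) -> Tuple[int, List[Tuple[Predictor, float]]]:
    # Keep only predictors consistent with every prefix of the history seen so far.
    survivors = [(f, w) for f, w in pool
                 if all(f(history[:t]) == history[t] for t in range(len(history)))]
    # Aggregate: weighted vote of the survivors on the next bit.
    vote = sum(w * (1 if f(history) == 1 else -1) for f, w in survivors)
    return (1 if vote > 0 else 0), survivors

# Made-up predictors standing in for the 'small computations'.
always_one: Predictor = lambda h: 1
alternate: Predictor = lambda h: len(h) % 2      # predicts 0, 1, 0, 1, ...
copy_last: Predictor = lambda h: h[-1] if h else 0

pool = [(always_one, 0.5), (alternate, 0.25), (copy_last, 0.25)]
print(predict_next([0, 1, 0, 1], pool)[0])       # 0: only 'alternate' survives, and it predicts 0
```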
Kaarel: Of course, the above story is naive in that really there are interaction terms between the circuits — a more sophisticated picture would talk about the ecology of circuits. Concretely, e.g. if one looks at the fraction of circuits which make an error at input (g, h) and calculates the variance of this quantity as one varies the input, then it's plausible this variance will be smaller than what one would predict if the circuits were drawn independently, because 'the circuits are selected to cover each other's weaknesses'. But the hope is that this kind of thing can be thought of as a higher-order correction on top of the above first-order story.
Kaarel: Anyway, I like the themes of
Acknowledgements
Lucius: Thanks to Alexander Odenziel, David Quarel, Matthias Dellago and probably a bunch of people my brain failed to track for discussions and insight that helped put the ideas in these slides together. I also don't claim any originality here. A lot of this is arguably latent in Solomonoff's original work already, and it would not surprise me much if many people in algorithmic information theory or elsewhere put something like this story together at some point over the years. It's also not the first time Singular Learning Theory has been linked to algorithmic information theory in some way. As pointed out by Daniel Murfet here, Clift-Murfet-Wallbridge used a version of the padding argument to show that the learning coefficient is (in a sense) a lower bound on the Kolmogorov complexity in the setting of noisy Turing machines, and a follow-up result of that was treated in Thomas Waring's master's thesis.
Kaarel: I should also thank Sam Eisenstat, Tsvi Benson-Tilsen, Jake Mendel, Simon Skade, and Jessica Taylor for discussions + many of the ideas in my messages.
If you're used to seeing the Solomonoff prior formulated as summing over all programs, with each program exponentially weighted by its length, the way it's written down here: That's for prefix Turing machines. There's an equivalent definition for plain Turing machines where you just take a uniform distribution over all programs of length l, then take the limit l→∞.
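(Schematically, and glossing over the usual technicalities about halting and semimeasures, the two formulations contrasted here are: for a universal prefix machine U,

M(x) \;=\; \sum_{p \,:\, U(p) \text{ outputs a string starting with } x} 2^{-|p|},

while the plain-machine version instead takes the uniform distribution over all programs of length l and lets l grow:

M(x) \;=\; \lim_{l \to \infty} \frac{\bigl|\{\, p \in \{0,1\}^l : U(p) \text{ outputs a string starting with } x \,\}\bigr|}{2^l}.)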