thoughts on the following:
I think that for each NN architecture+prior+task/loss, conditioning the initialization prior on train data (or doing some other bayesian thing) typically gives basically a completely different learning algorithm from (S)GD-learning, because local learning is a very different thing, which is one reason I doubt the story in the slides as an explanation of generalization in deep learning[1].[2] But setting this aside (though I will touch on it again briefly in the last point I make below), I agree it would be cool to have a story connecting the parametrized bayesian thing to something like Solomonoff induction. Here's an outline of an attempt to give a more precise story extending the one in Lucius's slides, with a few afaik-open problems:
Separately from the above bridge attempt, it is not at all obvious to me that parametrized bayes in fact has such good generalization behavior (i.e., "at least as good as deep learning", whatever that means, let's say)[4]; here are some messages on this topic I sent to [the group chat in which the posted discussion happened] later:
"i'd be interested in hearing your reasons to think that NN-parametrized bayesian inference with a prior given by canonical initialization randomization (or some other reasonable prior) generalizes well (for eg canonical ML tasks or boolean functions), if you think it does — this isn't so clear to me at all
some practical SGD-NNs generalize decently, but that's imo a sufficiently different learning process to give little evidence about the bayesian case (but i'm open to further discussion of this). i have some vague sense that the bayesian thing should be better than SGD, but idk if i actually have good reason to believe this?
i assume that there are some other practical ML things inspired by bayes which generalize decently but it seems plausible that those are still pretty local so pretty far from actual bayes and maybe even closer to SGD than to bayes, tho idk what i should precisely mean by that. but eg it seems plausible from 3 min of thinking that some MCMC (eg SGLD) setup with a non-galactic amount of time on a NN of practical size would basically walk from init to a local likelihood max and not escape it in time, which sounds a lot more like SGD than like bayes (but idk maybe some step size scheduling makes the mixing time non-galactic in some interesting case somehow, or if it doesn't actually do that maybe it can give a fine approximation of the posterior in some other practical sense anyway? seems tough). i haven't thought about variational inference much tho — maybe there's something practical which is more like bayes here and we could get some relevant evidence from that
maybe there's some obvious answer and i'm being stupid here, idk :)
one could also directly appeal to the uniformly random program analogy but the current version of that imo doesn't remotely constitute sufficiently good reason to think that bayesian NNs generalize well on its own"
(edit: this comment suggests https://arxiv.org/pdf/2002.02405 as evidence that bayes-NNs generalize worse than SGD-NNs. but idk — I haven't looked at the paper yet — ie no endorsement of it one way or the other from me atm)
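To make the SGLD point in those messages concrete, here's a minimal toy sketch (not from the discussion above; the tiny architecture, the standard-normal "initialization" prior, and all constants are made up for illustration). It uses full-batch gradients, so strictly it compares plain GD with unadjusted Langevin dynamics rather than SGD with SGLD, but the relevant feature is the same: the Langevin update is just the gradient step on the log-posterior plus Gaussian noise scaled with the step size, so with small steps and a non-galactic number of them it plausibly just settles near whichever mode it reaches first rather than mixing over the posterior.

```python
# Toy sketch: GD vs Langevin dynamics on the log-posterior of a tiny NN.
# All choices (architecture, prior, data, step size) are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(0)

# toy 1d regression data
X = rng.uniform(-2, 2, size=(64, 1))
y = np.sin(2 * X) + 0.1 * rng.normal(size=X.shape)

H = 16  # hidden width; parameters are flattened into one vector of length 3*H + 1

def unpack(theta):
    W1 = theta[:H].reshape(1, H)
    b1 = theta[H:2 * H]
    W2 = theta[2 * H:3 * H].reshape(H, 1)
    b2 = theta[3 * H:]
    return W1, b1, W2, b2

def neg_log_post(theta, noise_var=0.1, prior_var=1.0):
    W1, b1, W2, b2 = unpack(theta)
    pred = np.tanh(X @ W1 + b1) @ W2 + b2
    nll = np.sum((pred - y) ** 2) / (2 * noise_var)   # Gaussian likelihood
    nlp = np.sum(theta ** 2) / (2 * prior_var)        # Gaussian "initialization" prior
    return nll + nlp

def grad(theta, eps=1e-5):
    # finite-difference gradient: slow, but keeps the sketch dependency-free
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (neg_log_post(theta + d) - neg_log_post(theta - d)) / (2 * eps)
    return g

theta_gd = rng.normal(size=3 * H + 1)   # a draw from the (standard normal) init prior
theta_lang = theta_gd.copy()
step = 1e-4

for t in range(2000):
    theta_gd -= step * grad(theta_gd)                                    # gradient descent
    theta_lang -= step * grad(theta_lang)                                # Langevin: same drift...
    theta_lang += np.sqrt(2 * step) * rng.normal(size=theta_lang.shape)  # ...plus injected noise

print("GD       neg log posterior:", round(float(neg_log_post(theta_gd)), 2))
print("Langevin neg log posterior:", round(float(neg_log_post(theta_lang)), 2))
```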
to the extent that deep learning in fact exhibits good generalization, which is probably a very small extent compared to sth like Solomonoff induction, and this has to do with some stuff I talked about in my messages in the post above; but I digress ↩︎
I also think that different architecture+prior+task/loss choices probably give many substantially-differently-behaved learning setups, deserving somewhat separate explanations of generalization, for both bayes and SGD. ↩︎
edit: Instead of doing this thing with circuits, you could get an alternative "principled generalization baseline/ceiling" from doing the same thing with programs instead (i.e., have a complexity prior on turing machines and condition it on seen input-output pairs), which I think ends up being equivalent (up to a probably-in-some-sense-small term) to using the kolmogorov complexities of these functions (thought of "extensionally" as strings, ie just listing outputs in some canonical order (different choices of canonical order should again give the same complexities (up to a probably-in-some-sense-small term))). While this is probably a more standard choice historically, it seems worse for our purposes given that (1) it would probably be strictly harder to build a bridge from NNs to it (and there probably just isn't any NNs <-> programs bridge which is as precise as the NNs <-> circuits bridge we might hope to build, given that NNs are already circuity things and it's easy to have a small program for a function without having a small circuit for it (as the small program could run for a long time)), and (2) it's imo plausible that some variant of the circuit prior is "philosophically/physically more correct" than the program prior, though this is less clear than the first point. ↩︎
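As a concrete (if absurdly scaled-down) illustration of the kind of principled baseline meant in the footnote above, here's a sketch of conditioning a 2^(-size) complexity prior on a few seen input-output pairs and reading off a posterior predictive for an unseen input. Everything here is an arbitrary choice for illustration: it uses boolean formulas (NOT/AND/OR over 3 variables) rather than general circuits, a made-up training set, and a small size cutoff.

```python
# Sketch: complexity prior over small boolean formulas, conditioned on seen IO pairs.
N_VARS = 3

def enumerate_formulas(max_size):
    """Yield (formula_as_function, size) pairs for formulas with up to max_size gates."""
    level = {0: [(lambda x, i=i: x[i], 0) for i in range(N_VARS)]}  # size 0: the variables
    yield from level[0]
    for size in range(1, max_size + 1):
        level[size] = []
        # NOT of any formula with one fewer gate
        for f, _ in level[size - 1]:
            g = (lambda x, f=f: 1 - f(x), size)
            level[size].append(g)
            yield g
        # AND / OR of two smaller formulas whose gate counts sum to size - 1
        for s1 in range(size):
            s2 = size - 1 - s1
            for f1, _ in level[s1]:
                for f2, _ in level[s2]:
                    for op in (lambda a, b: a & b, lambda a, b: a | b):
                        g = (lambda x, f1=f1, f2=f2, op=op: op(f1(x), f2(x)), size)
                        level[size].append(g)
                        yield g

# seen input-output pairs of some unknown boolean function of 3 variables (made up)
train = [((0, 0, 0), 0), ((0, 1, 0), 0), ((1, 1, 0), 1), ((1, 0, 1), 1)]
test_input = (1, 1, 1)

# posterior predictive: sum the 2^-size prior weight of every formula consistent
# with the data, split by what the formula outputs on the unseen input
posterior_mass = {0: 0.0, 1: 0.0}
for f, size in enumerate_formulas(max_size=4):
    if all(f(x) == y for x, y in train):
        posterior_mass[f(test_input)] += 2.0 ** (-size)

total = sum(posterior_mass.values())
print({out: mass / total for out, mass in posterior_mass.items()})
```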
to be clear: I'm not claiming it doesn't have good generalization behavior — instead, I lack good evidence/reason to think it does or doesn't and feel like I don't know ↩︎
you say "Human ingenuity is irrelevant. Lots of people believe they know the one last piece of the puzzle to get AGI, but I increasingly expect the missing pieces to be too alien for most researchers to stumble upon just by thinking about things without doing compute-intensive experiments." and you link https://tsvibt.blogspot.com/2024/04/koan-divining-alien-datastructures-from.html for "too alien for most researchers to stumble upon just by thinking about things without doing compute-intensive experiments"
i feel like that post and that statement are in contradiction/tension or at best orthogonal
there's imo probably not any (even-nearly-implementable) ceiling for basically any rich (thinking-)skill at all[1] — no cognitive system will ever be well-thought-of as getting close to a ceiling at such a skill — it's always possible to do any rich skill very much better (I mean these things for finite minds in general, but also when restricting the scope to current humans)
(that said, (1) of course, it is common for people to become better at particular skills up to some time and to become worse later, but i think this has nothing to do with having reached some principled ceiling; (2) also, we could perhaps eg try to talk about 'the artifact that takes at most n bits to specify (in some specification-language) which figures out k units of math the quickest (for some sufficiently large k compared to n)', but even if we could make sense of that, it wouldn't be right to think of it as being at some math skill ceiling to begin with, because it will probably very quickly change very much about its thinking (i.e. reprogram itself, imo plausibly indefinitely many times, including indefinitely many times in important ways, until the heat death of the universe or whatever); (3) i admit that there can be some purposes for which there is an appropriate way to measure goodness at some rich skill with a score in [0, 1], and for such a purpose potential goodness at even a rich skill is of course appropriate to consider bounded and optimal performance might be rightly said to be approachable, but this somehow feels not-that-relevant in the present context)
i'll try to get away with not being very clear about what i mean by a 'rich (thinking-)skill' except that it has to do with having a rich domain (the domain either effectively presenting any sufficiently rich set of mathematical questions as problems or relating richly to humans, or in particular just to yourself, usually suffices) and i would include all the examples you give ↩︎
Having found myself repeating the same points/claims in various conversations about what NN learning is like (especially around singular learning theory), I figured it's worth writing some of them down. My typical confidence in a claim below is like 95%[1]. I'm not claiming anything here is significantly novel. The claims/points:
(* but it looks to me like learning theory is unfortunately hard to make relevant to ai alignment[9])
these thoughts are sorta joint with Jake Mendel and Dmitry Vaintrob (though i'm making no claim about whether they'd endorse the claims). also thank you for discussions: Sam Eisenstat, Clem von Stengel, Lucius Bushnaq, Zach Furman, Alexander Gietelink Oldenziel, Kirke Joamets
with the important caveat that, especially for claims involving 'circuits'/'structures', I think it's plausible they are made in a frame which will soon be superseded or at least significantly improved/clarified/better-articulated, so it's a 95% given a frame which is probably silly ↩︎
train loss in very overparametrized cases is an exception. in this case it might be interesting to note that optima will also be off at infinity if you're using cross-entropy loss, https://arxiv.org/pdf/2006.06657 ↩︎
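Here's a quick numerical illustration of the at-infinity point (not from the linked paper; tiny made-up separable data): with cross-entropy/logistic loss, scaling up any weight vector that fits the data keeps decreasing the loss, so there is no finite minimizer.

```python
import numpy as np

X = np.array([-2.0, -1.0, 1.0, 2.0])    # 1d inputs, linearly separable at 0
s = np.array([-1.0, -1.0, 1.0, 1.0])    # labels written as +/-1

def cross_entropy(w):
    # logistic / cross-entropy loss in margin form: mean of log(1 + exp(-s * w * x))
    return np.mean(np.log1p(np.exp(-s * w * X)))

for w in [1.0, 10.0, 100.0, 1000.0]:
    print(f"w = {w:>6}: loss = {cross_entropy(w):.3e}")
# the loss keeps decreasing toward 0 as the weight is scaled up, so the minimum
# is "off at infinity" rather than at any finite parameter value
```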
also, gradient descent is very far from doing optimal learning in some solomonoff sense — though it can be fruitful to try to draw analogies between the two — and it is also very far from being the best possible practical learning algorithm ↩︎
by it being a law of large numbers phenomenon, i mean sth like: there are a bunch of structures/circuits/pattern-completers that could be learned, and each one gets learned with a certain probability (or maybe a roughly given total number of these structures gets learned), and loss is roughly some aggregation of indicators for whether each structure gets learned — an aggregation to which the law of large numbers applies ↩︎
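A tiny simulation of this picture (with made-up per-structure probabilities and loss contributions): if the loss is a sum of independent per-structure contributions, each present with some probability, then as the number of potential structures grows, the total loss concentrates across runs, which is the law-of-large-numbers behavior meant here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs = 1000  # independent "training runs"

for n_structures in [10, 100, 1000, 10000]:
    p_learned = rng.uniform(0.2, 0.9, size=n_structures)       # per-structure learn probability
    loss_if_missed = rng.exponential(1.0, size=n_structures)   # loss contribution if not learned
    learned = rng.random((n_runs, n_structures)) < p_learned   # which structures each run learns
    total_loss = np.where(learned, 0.0, loss_if_missed).sum(axis=1)
    rel_spread = total_loss.std() / total_loss.mean()
    print(f"{n_structures:>5} structures: relative spread of total loss across runs = {rel_spread:.3f}")
# the relative spread shrinks roughly like 1/sqrt(n_structures), so with many
# independent structures the loss looks nearly deterministic across runs
```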
to say more: any concept/thinking-structure in general has to be invented somehow — there in some sense has to be a 'sensible path' to that concept — but any local learning process is much more limited than that still — now we're forced to have a path in some (naively seen) space of possible concepts/thinking-structures, which is a major restriction. eg you might find the right definition in mathematics by looking for a thing satisfying certain constraints (eg you might want the definition to fit into theorems characterizing something you want to characterize), and many such definitions will not be findable by doing sth like gradient descent on definitions ↩︎
ok, (given an architecture and a loss,) technically each point in the loss landscape will in fact have a different local neighborhood, so in some sense we know that the probability of getting to a point is a function of its neighborhood alone, but what i'm claiming is that it is not nicely/usefully a function of its neighborhood alone. to the extent that stuff about this probability can be nicely deduced from some aspect of the neighborhood, that's probably 'logically downstream' of that aspect of the neighborhood implying something about nice paths to the point. ↩︎
also note that the points one ends up at in LLM training are not local minima — LLMs aren't trained to convergence ↩︎
i think identifying and very clearly understanding any toy example where this shows up would plausibly be better than anything else published in interp this year. the leap complexity paper does something a bit like this but doesn't really do this ↩︎
i feel like i should clarify here though that i think basically all existing alignment research fails to relate much to ai alignment. but then i feel like i should further clarify that i think each particular thing sucks at relating to alignment after having thought about how that particular thing could help, not (directly) from some general vague sense of pessimism. i should also say that if i didn't think interp sucked at relating to alignment, i'd think learning theory sucks less at relating to alignment (ie, not less than interp but less than i currently think it does). but then i feel like i should further say that fortunately you can just think about whether learning theory relates to alignment directly yourself :) ↩︎
a thing i think is probably happening and significant in such cases: developing good 'concepts/ideas' to handle a problem, 'getting a feel for what's going on in a (conceptual) situation'
a plausibly analogous thing in humanity(-seen-as-a-single-thinker): humanity states a conjecture in mathematics, spends centuries playing around with related things (tho paying some attention to that conjecture), building up mathematical machinery/understanding, until a proof of the conjecture almost just falls out of the machinery/understanding
I find it surprising/confusing/confused/jarring that you speak of models-in-the-sense-of-mathematical-logic=:L-models as the same thing as (or as a precise version of) models-as-conceptions-of-situations=:C-models. To explain why these look to me like two pretty much entirely distinct meanings of the word 'model', let me start by giving some first brushes of a picture of C-models. When one employs a C-model, one likens a situation/object/etc of interest to a situation/object/etc that is already understood (perhaps a mathematical/abstract one), that one expects to be better able to work/play with. For example, when one has data about sun angles at a location throughout the day and one is tasked with figuring out the distance from that location to the north pole, one translates the question to a question about 3d space with a stationary point sun and a rotating sphere and an unknown point on the sphere and so on. (I'm not claiming a thinker is aware of making such a translation when they make it.) Employing a C-model is making an analogy. From inside a thinker, the objects/situations on each side of the analogy look like... well, things/situations; from outside a thinker, both sides are thinking-elements.[1] (I think there's a large GOFAI subliterature trying to make this kind of picture precise but I'm not that familiar with it; here are two papers that I've only skimmed: https://www.qrg.northwestern.edu/papers/Files/smeff2(searchable).pdf , https://api.lib.kyushu-u.ac.jp/opac_download_md/3070/76.ps.tar.pdf .)
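(To spell that example out as a computation, here's a minimal sketch with invented measurements, assuming an equinox for simplicity: in the toy 3d picture, the maximum sun elevation over the day equals 90° minus the latitude, so the colatitude, and hence the surface distance to the pole, can be read off directly.)

```python
import math

EARTH_RADIUS_KM = 6371.0

# hypothetical sun-elevation measurements (degrees) over one day
elevations_deg = [12.0, 25.0, 38.0, 44.0, 43.0, 31.0, 17.0]

# at an equinox, max elevation = 90 deg - latitude, so the angle at the Earth's
# center between the location and the pole (the colatitude) equals the max elevation
colatitude_rad = math.radians(max(elevations_deg))
distance_to_pole_km = EARTH_RADIUS_KM * colatitude_rad
print(f"estimated distance to the north pole: {distance_to_pole_km:.0f} km")
```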
I'm not that happy with the above picture of C-models, but I think that it seeming like an even sorta reasonable candidate picture might be sufficient to see how C-models and L-models are very different, so I'll continue in that hope. I'll assume we're already on the same page about what an L-model is ( https://en.wikipedia.org/wiki/Model_theory ). Here are some ways in which C-models and L-models differ that imo together make them very different things:
(But maybe I'm grouping like a thousand different things together unnaturally under C-models and you have some single thing or a subset in mind that is in fact closer to L-models?)
All this said, I don't want to claim that no helpful analogy could be made between C-models and L-models. Indeed, I think there is the following important analogy between C-models and L-models:
(Acknowledgments. I'd like to thank Dmitry Vaintrob and Sam Eisenstat for related conversations.)
This is complicated a bit by a thinker also commonly looking at the C-model partly as if from the outside — in particular, when a thinker critiques the C-model to come up with a better one. For example, you might notice that the situation of interest has some property that the toy situation you are analogizing it to lacks, and then try to fix that. For example, to guess the density of twin primes, you might start from a naive analogy to a probabilistic situation where each 'prime' p has probability (p-1)/p of not dividing each 'number' independently at random, but then realize that your analogy is lacking because really p not dividing n makes it a bit less likely that p doesn't divide n+2, and adjust your analogy. This involves a mental move that also looks at the analogy 'from the outside' a bit.
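Here's a small numerical check of that adjustment (with an arbitrary cutoff N): the naive independence heuristic undercounts twin primes, and multiplying in the per-prime correction factor p(p-2)/(p-1)^2 for odd p, together with a factor of 2 for p = 2 (n and n+2 always have the same parity), brings the estimate roughly in line with the actual count.

```python
import math

N = 200_000

# simple sieve of Eratosthenes
is_prime = [True] * N
is_prime[0] = is_prime[1] = False
for i in range(2, int(N ** 0.5) + 1):
    if is_prime[i]:
        for j in range(i * i, N, i):
            is_prime[j] = False
primes = [i for i in range(N) if is_prime[i]]

# actual count of twin prime pairs (n, n+2) with n+2 < N
actual = sum(1 for n in primes if n + 2 < N and is_prime[n + 2])

# naive heuristic: treat "prime p divides n" and "p divides n+2" as independent,
# so the chance that both n and n+2 are prime is ~ 1/(ln n)^2
naive = sum(1 / math.log(n) ** 2 for n in range(3, N))

# corrected heuristic: for odd p, knowing p doesn't divide n makes it a bit less
# likely that p doesn't divide n+2; the per-prime correction is p(p-2)/(p-1)^2,
# and for p = 2 the factor is 2 (n and n+2 have the same parity)
correction = 2.0
for p in primes[1:]:
    correction *= p * (p - 2) / (p - 1) ** 2

print(f"twin prime pairs below {N}: {actual}")
print(f"naive estimate:             {naive:.0f}")
print(f"corrected estimate:         {correction * naive:.0f}")
```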
That said, the hypothetical you give is cool and I agree the two principles decouple there! (I intuitively want to save that case by saying the COM is only stationary in a covering space where the train has in fact moved a bunch by the time it stops, but idk how to make this make sense for a different arrangement of portals.) I guess another thing that seems a bit compelling for the two decoupling is that conservation of angular momentum is analogous to conservation of momentum but there's no angular analogue to the center of mass (that's rotating uniformly, anyway). I guess another thing that's a bit compelling is that there's no nice notion of a center of energy once we view spacetime as being curved ( https://physics.stackexchange.com/a/269273 ). I think I've become convinced that conservation of momentum is a significantly bigger principle :). But still, the two seem equivalent to me before one gets to general relativity. (I guess this actually depends a bit on what the proof of 12.72 is like — in particular, if that proof basically uses the conservation of momentum, then I'd be more happy to say that the two aren't equivalent already for relativity/fields.)
here's a picture from https://hansandcassady.org/David%20J.%20Griffiths-Introduction%20to%20Electrodynamics-Addison-Wesley%20(2012).pdf :
Given 12.72, uniform motion of the center of energy is equivalent to conservation of momentum, right? P is const <=> dR_e/dt is const.
(I'm guessing 12.72 is in fact correct here, but I guess we can doubt it — I haven't thought much about how to prove it when fields and relativistic and quantum things are involved. From a cursory look at his comment, Lubos Motl seems to consider it invalid lol ( in https://physics.stackexchange.com/a/3200 ).)
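For concreteness, here's the little argument I have in mind, under the assumption (which I haven't checked against the book) that 12.72 says the total momentum of a localized system equals its total energy over c^2 times the velocity of its center of energy:

```latex
% assumed form of eq. 12.72 (not checked against the book): for a localized system
% with conserved total energy E and center of energy R_e,
\[
  \mathbf{P} \;=\; \frac{E}{c^{2}}\,\frac{d\mathbf{R}_e}{dt} .
\]
% since E/c^2 is a nonzero constant, differentiating gives
\[
  \frac{d\mathbf{P}}{dt} \;=\; \frac{E}{c^{2}}\,\frac{d^{2}\mathbf{R}_e}{dt^{2}} ,
\]
% so P is constant iff dR_e/dt is constant: given 12.72 and energy conservation,
% conservation of momentum is equivalent to uniform motion of the center of energy.
```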
I didn't express this clearly, but yea I meant no pretraining on human text at all, and also nothing computer-generated which "uses human mathematical ideas" (beyond what is in base ZFC), but I'd probably allow something like the synthetic data generation used for AlphaGeometry (Fig. 3) except in base ZFC and giving away very little human math inside the deduction engine. I agree this would be very crazy to see. The version with pretraining on non-mathy text is also interesting and would still be totally crazy to see. I agree it would probably imply your "come up with interesting math concepts". But I wouldn't be surprised if like >20% of the people on LW who think A[G/S]I happens in like 2-3 years thought that my thing could totally happen in 2025 if the labs were aiming for it (though they might not expect the labs to aim for it), with your things plausibly happening later. E.g. maybe such a person would think "AlphaProof is already mostly RL/search and one could replicate its performance soon without human data, and anyway, AlphaGeometry already pretty much did this for geometry (and AlphaZero did it for chess)" and "some RL+search+self-play thing could get to solving major open problems in math in 2 years, and plausibly at that point human data isn't so critical, and IMO problems are easier than major open problems, so plausibly some such thing gets to IMO problems in 1 year". But also idk maybe this doesn't hang together enough for such people to exist. I wonder if one can use this kind of idea to get a different operationalization with parties interested in taking each side though. Like, maybe whether such a system would prove Cantor's theorem (stated in base ZFC) (imo this would still be pretty crazy to see)? Or whether such a system would get to IMO combos relying moderately less on human data?