All possible encoding schemes / universal priors differ from each other by at most a finite prefix (equivalently, the description lengths they assign differ by at most an additive constant). You might think this doesn't achieve much, since the length of that prefix can in principle be arbitrarily large; but in practice, the length of the prefix (or rather, the prior itself) is constrained by a system's physical implementation. There are some encoding schemes which neither you nor any other physical entity will ever be able to implement, and so for the purposes of description length minimization these are off the table. And of the encoding schemes that remain on the table, virtually all of them will behave identically with respect to the description lengths they assign to "natural" versus "unnatural" optimization criteria.
It looks to me like the "updatelessness trick" you describe (essentially, behaving as though certain non-local branches of the decision tree are still counterfactually relevant even though they are not) recovers most of the behavior we'd see under VNM anyway, and so I don't think I understand your confusion re: the VNM axioms. (Though note that I currently don't see an obvious way to use that trick to avoid the usual money pump against intransitivity.)
E.g. can you give me a case in which (a) we have an agent that exhibits preferences against whose naive implementation there exists some kind of money pump (not necessarily a repeatable one), (b) the agent can implement the updatelessness trick in order to avoid the money pump without modifying their preferences, and yet (c) the agent is not then representable as having modified their preferences in the relevant way?
I think I might be missing something, because the argument you attribute to Dávid still looks wrong to me. You say:
The entropy of the simulators’ distribution need not be more than the entropy of the (square of the) wave function in any relevant sense. Despite the fact that subjective entropy may be huge, physical entropy is still low (because the simulations happen on a high-amplitude ridge of the wave function, after all).
Doesn't this argument imply that the supermajority of simulations within the simulators' subjective distribution over universe histories are not instantiated anywhere within the quantum multiverse?
I think it does. And if you accept this, then you should also expect, a priori, that the simulations instantiated by the simulators will not be indistinguishable from physical reality, because such simulations comprise a vanishingly small proportion of the simulators' subjective probability distribution over universe histories. (The only way out would be if, for some reason, the simulators' choice of which histories to instantiate were biased towards histories that correspond to other "high-amplitude ridges" of the wave function; but that makes no sense, because any such bias should have already been encoded within the simulators' subjective distribution over universe histories.)
What this in turn means, however, is that prior to observation, a Solomonoff inductor (SI) must spread out much of its own subjective probability mass across hypotheses that predict finding itself within a noticeably simulated environment. Those are among the possibilities it must take into account; meaning, if you stipulate that it doesn't find itself in an environment corresponding to any of those hypotheses, you've ruled out all of the "high-amplitude ridges" corresponding to instantiated simulations in the cross-entropy of the simulators' subjective distribution and reality's distribution.
We can make this very stark: suppose our SI finds itself in an environment which, according to its prior over the quantum multiverse, corresponds to one high-amplitude ridge of the physical wave function, and zero high-amplitude ridges containing simulators that happened to instantiate that exact environment (either because no branches of the quantum multiverse happened to give rise to simulators that would have instantiated that environment, or because the environment in question simply wasn't a member of any simulators' subjective distributions over reality to begin with). Then the SI would immediately (correctly) conclude that it cannot be in a simulation.
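To make the last step explicit (notation mine; this is just a restatement of the above in Bayesian terms, not an additional claim): writing $\mathrm{Sim}$ for the set of simulation hypotheses in the SI's hypothesis class, the posterior probability of being simulated is

$$
P(\text{simulated} \mid o) \;=\; \frac{\sum_{h \in \mathrm{Sim}} 2^{-\ell(h)}\, P(o \mid h)}{\sum_{h} 2^{-\ell(h)}\, P(o \mid h)},
$$

which is exactly zero under the stipulation above, since every $h \in \mathrm{Sim}$ assigns zero probability to the environment the SI actually observes.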
Now, of course, the argument as I've presented it here relies heavily on the idea of our SI being an SI, in such a way that it's not clear how exactly the argument carries over to the logically non-omniscient case. In particular, it relies on the SI being capable of discerning differences between very good simulations and perfect simulations, a feat which bounded reasoners cannot replicate; and it relies on the notion that our inability as bounded reasoners to distinguish between hypotheses at this level of granularity is best modeled, in the SI case, by stipulating that the SI's actual observations are in fact consistent with its being instantiated within a base-level, high-amplitude ridge of the physical wave function. That is, our subjective inability to tell whether we're in a simulation should be viewed as analogous to an SI being unable to tell whether it's in a simulation because its observations genuinely fail to distinguish the two. I think this is the relevant analogy, but I'm open to being told (by you or by Dávid) why I'm wrong.
The AI has a similarly hard time to the simulators figuring out what's a plausible configuration to arise from the big bang. Like the simulators have an entropy N distribution of possible AIs, the AI itself also has an entropy N distribution for that. So its probability that it's in a real Everett branch is not p, but p times 2^-N, as it has only a 2^-N prior probability that the kind of world it observes is the kind of thing that can come up in a real Everett branch. So it's balanced out with the simulation hypothesis, and as long as the simulators are spending more planets, that hypothesis wins.
If I imagine the AI as a Solomonoff inductor, this argument looks straightforwardly wrong to me: of the programs that reproduce the AI's observations (or assign high probability to them, in the setting where programs produce probabilistic predictions of observations), some will do so by modeling a branching quantum multiverse and sampling appropriately from one of its branches, and some will do so by modeling a branching quantum multiverse, sampling from a branch that contains an intergalactic spacefaring civilization, locating a specific simulation within that branch, and sampling appropriately from within that simulation. Programs of the second kind will naturally have higher description complexity than programs of the first kind: both kinds feature a prefix that computes and samples from the quantum multiverse, but only the second kind carries out the additional step of locating and sampling from a nested simulation.
(You might object on the grounds that there are more programs of the second kind than of the first kind, and that the probability the AI is in a simulation at all requires summing over all such programs. But this has to be balanced against the fact that most, if not all, of these programs will be sampling from branches much later in time than programs of the first kind, and will hence be sampling from a quantum multiverse with exponentially more branches; and not all of those branches will contain spacefaring civilizations, or spacefaring civilizations interested in running ancestor simulations, or spacefaring civilizations interested in running ancestor simulations who happen to be running a simulation that exactly reproduces the AI's observations. So this counter-counterargument doesn't work, either.)
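In crude bookkeeping terms (my notation, not anything from the original exchange; treat this as a sketch of the above rather than a new claim):

$$
\ell(p_{\text{base}}) \;\approx\; \ell_{\text{QM}} + \ell_{\text{branch}},
\qquad
\ell(p_{\text{sim}}) \;\approx\; \ell_{\text{QM}} + \ell_{\text{branch}'} + \ell_{\text{sim}},
$$

where $\ell_{\text{QM}}$ is the length of the shared prefix that computes and samples from the quantum multiverse, $\ell_{\text{branch}}$ (resp. $\ell_{\text{branch}'}$) is the cost of pointing at the relevant branch, and $\ell_{\text{sim}}$ is the additional cost of locating and sampling from the nested simulation. Each program of the second kind therefore receives prior weight smaller by a factor of roughly $2^{-\ell_{\text{sim}}}$ (and typically more, since $\ell_{\text{branch}'} \ge \ell_{\text{branch}}$ when the simulation-containing branch lies much later in time and is one of exponentially many); summing over programs of the second kind only cancels this penalty if there are on the order of $2^{\ell_{\text{sim}}}$ such programs that all reproduce the AI's observations, which is what the parenthetical above denies.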
These two kinds of “learning” are not synonymous. Adaptive systems “learn” things, but they don’t necessarily “learn about” things; they don’t necessarily have an internal map of the external territory. (Yes, the active inference folks will bullshit about how any adaptive system must have a map of the territory, but their math does not substantively support that interpretation.) The internal heuristics or behaviors “learned” by an adaptive system are not necessarily “about” any particular external thing, and don’t necessarily represent any particular external thing.
I think I am confused both about whether I think this is true, and about how to interpret it in such a way that it might be true. Could you go into more detail on what it means for a learner to learn something without there being some representational semantics that could be used to interpret what it's learned, even if the learner itself doesn't explicitly represent those semantics? Or is the lack of explicit representation actually the core substance of the claim here?
It seems the SOTA for training LLMs has (predictably) pivoted away from pure scaling of compute + data, and towards RL-style learning based on (synthetic?) reasoning traces (mainly CoT, in the case of o1). AFAICT, this basically obviates safety arguments that relied on "imitation" as a key source of good behavior, since now additional optimization pressure is being applied towards correct prediction rather than pure imitation.
Strictly speaking, this seems very unlikely, since we know that e.g. CoT increases the expressive power of Transformers.
Ah, yeah, I can see how I might've been unclear there. I was implicitly taking CoT into account when I talked about the "base distribution" of the model's outputs, as it's essentially ubiquitous across these kinds of scaffolding projects. I agree that if you take a non-recurrent model's O(1) output and equip it with a form of recurrent state that you permit to continue for O(n) iterations, that will produce a qualitatively different distribution of outputs than the O(1) distribution.
In that sense, I readily admit CoT into the class of improvements I earlier characterized as "shifted distribution". I just don't think this gets you very far in terms of the overarching problem, since the recurrent O(n) distribution is the one whose output I find unimpressive, and the method that was used to obtain it from the (even less impressive) O(1) distribution is a one-time trick.[1]
And also intuitively, I expect, for example, that Sakana's agent would be quite a bit worse without access to Semantic search for comparing idea novelty; and that it would probably be quite a bit better if it could e.g. retrieve embeddings of full paragraphs from papers, etc.
I also agree that another way to obtain a higher quality output distribution is to load relevant context from elsewhere. This once more seems to me like something of a red herring when it comes to the overarching question of how to get an LLM to produce human- or superhuman-level research; you can load its context with research humans have already done, but this is again a one-time trick, and not one that seems like it would enable novel research built atop the human-written research unless the base model possesses a baseline level of creativity and insight, etc.[2]
If you don't already share (or at least understand) a good chunk of my intuitions here, the above probably sounds at least a little like I'm carving out special exceptions: conceding each point individually, while maintaining that they bear little on my core thesis. To address that, let me attempt to put a finger on some of the core intuitions I'm bringing to the table:
On my model of (good) de novo scientific research, a lot of the key cognitive work occurs during what you might call "generation" and "synthesis", where "generation" involves coming up with hypotheses that merit testing, picking the most promising of those, and designing a robust experiment that yields insight; "synthesis" then consists of interpreting the experimental results so as to figure out the right takeaway (which very rarely ought to look like "we confirmed/disconfirmed the starting hypothesis").
Neither of these steps is easily transmissible, since they hinge very tightly on a given individual's research ability and intellectual "taste"; and neither of them tends to end up very well described in the writeups and papers that are released afterwards. This is hard stuff even for very bright humans, which implies to me that it requires a very high quality of thought to manage consistently. And it's these steps that I don't think scaffolding can help much with; I think the model has to be smart enough, at baseline, that its landscape of cognitive reachability contains these kinds of insights, before they can be elicited via an external method like scaffolding.[3]
I'm not sure whether you could theoretically obtain greater benefits from allowing more than O(n) iterations, but either way you'd start to bump up against context window limitations fairly quickly. ↩︎
Consider the extreme case where we prompt the model with (among other things) a fully fleshed out solution to the AI alignment problem, before asking it to propose a workable solution to the AI alignment problem; it seems clear enough that in this case, almost all of the relevant cognitive work happened before the model even received its prompt. ↩︎
I'm uncertain-leaning-yes on the question of whether you can get to a sufficiently "smart" base model via mere continued scaling of parameter count and data size; but that connects back to the original topic of whether said "smart" model would need to be capable of goal-directed thinking, on which I think I agree with Jeremy that it would; much of my model of good de novo research, described above, seems to me to draw on the same capabilities that characterize general-purpose goal-direction. ↩︎
And I suspect we probably can, given scaffolds like https://sakana.ai/ai-scientist/ and its likely improvements (especially if done carefully, e.g. integrating something like Redwood's control agenda, etc.). I'd be curious where you'd disagree (since I expect you probably would) - e.g. do you expect the AI scientists become x-risky before they're (roughly) human-level at safety research, or they never scale to human-level, etc.?
Jeremy's response looks to me like it mostly addresses the first branch of your disjunction (AI becomes x-risky before reaching human-level capabilities), so let me address the second:
I am unimpressed by the output of the AI scientist. (To be clear, this is not the same thing as being unimpressed by the work put into it by its developers; it looks to me like they did a great job.) Mostly, however, the output looks to me basically like what I would have predicted, on my prior model of how scaffolding interacts with base models, which goes something like this:
A given model has some base distribution on the cognitive quality of its outputs, which is why resampling can sometimes produce better or worse responses to inputs. What scaffolding does is to essentially act as a more sophisticated form of sampling based on redundancy: having the model check its own output, respond to that output, etc. This can be very crudely viewed as an error correction process that drives down the probability that a "mistake" at some early token ends up propagating throughout the entirety of the scaffolding process and unduly influencing the output, which biases the quality distribution of outputs away from the lower tail and towards the upper tail.
The key moving piece on my model, however, is that all of this is still a function of the base distribution; a rough analogy here would be to best-of-n sampling. And the problem with best-of-n sampling, which looks to me like it carries over to more complicated scaffolding, is that as n increases, the mean of the resulting distribution increases only sublinearly in n (roughly logarithmically, depending on the tail of the base distribution), while the variance decreases at a similar rate (but even this is misleading, since the resulting distribution will have negative skew, meaning variance decreases more rapidly in the upper tail than in the lower tail).
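As a quick illustration of the best-of-n point (a minimal sketch, using a standard normal as a stand-in for the base quality distribution; the exact growth rate of the mean depends on the tail of whatever the real distribution is):

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n_stats(n, trials=10_000):
    """Mean and spread of the best of n i.i.d. draws from a stand-in quality distribution."""
    samples = rng.normal(size=(trials, n))   # trials independent batches of n samples each
    best = samples.max(axis=1)               # best-of-n selection within each batch
    return best.mean(), best.std()

for n in [1, 4, 16, 64, 256, 1024]:
    mean, std = best_of_n_stats(n)
    print(f"n={n:4d}  mean(best)={mean:.2f}  std(best)={std:.2f}")
```

The mean of the selected sample climbs only slowly with n (roughly like sqrt(2 ln n) for a Gaussian base) while its spread shrinks, which is the sense in which piling on more selection buys rapidly diminishing returns compared to shifting the base distribution itself.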
Anyway, the upshot of all of this is that scaffolding cannot elicit capabilities that were not already present (in some strong sense) in the base model; meaning, if the base models in question are strongly subhuman at something like scientific research (as they presently look to me to still be), scaffolding will not bridge that gap for them. The only thing that can close that gap without unreasonably large amounts of scaffolding, where "unreasonable" here means something a complexity theorist would consider unreasonable, is a shifted base distribution. And that corresponds to the kind of "useful [superhuman] capabilities" Jeremy is worried about.
I'm interested! Also curious as to how this is implemented; are you using retrieval-augmented generation, and if so, with what embeddings?
Your phrasing here is vague and somewhat convoluted, so I have difficulty telling whether what you say is merely misleading or outright false. Regardless:
If you have UTM1 and UTM2, there is a constant-length prefix P such that UTM1, given P prepended to some further bitstring as input, will compute whatever UTM2 computes given only that bitstring as input; we can say of P that it "encodes" UTM2 relative to UTM1. This being the case, the shortest encoding of any given function under UTM1 is at most len(P) longer than its shortest encoding under UTM2, because whenever that function would otherwise be encoded in UTM1 by a bitstring longer than len(P) + [the length of the shortest bitstring encoding the function in UTM2], the P-prefixed version of UTM2's shortest encoding is already a shorter encoding of that function in UTM1.
One of the consequences of this, however, is that this prefix-based encoding method is only an improvement over UTM1's native encoding for functions whose prefix-free encodings (i.e. encodings that cannot be partitioned into substrings such that one of the substrings encodes another UTM) in UTM1 and UTM2 differ in length by more than len(P). And, since len(P) is a measure of UTM2's complexity relative to UTM1, it follows directly that if UTM2's "coding scheme" is such that some function's prefix-free encoding in UTM2 is shorter than its prefix-free encoding in UTM1 by some large constant (say, ~2^10^80), then len(P) itself must be on the order of 2^10^80; in other words, UTM2 must have an astronomical complexity relative to UTM1.
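In symbols (a compressed restatement of the two preceding paragraphs, with $K_U(f)$ denoting the length of the shortest bitstring encoding $f$ under UTM $U$):

$$
K_{U_1}(f) \;\le\; K_{U_2}(f) + \operatorname{len}(P)
\quad\Longrightarrow\quad
\operatorname{len}(P) \;\ge\; K_{U_1}(f) - K_{U_2}(f),
$$

so, granting that the criterion really does have UTM1-complexity on the order of 2^10^80 (i.e. there is no shortcut for UTM1 even by way of emulating some other machine), any UTM2 that encodes it in far fewer bits must itself cost on the order of 2^10^80 bits to specify relative to UTM1.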
Any physically realizable universal computational system can play the role of UTM1 in the above analysis. If you have some behavioral policy that is e.g. deontological in nature, that policy can in principle be recast as an optimization criterion over universe histories; however, this criterion will in all likelihood have a prefix-free description in UTM1 of length ~2^10^80. And, crucially, there will be no UTM2 in whose encoding scheme the criterion in question has a prefix-free description of much less than ~2^10^80 without that UTM2 itself having a description complexity of ~2^10^80 relative to UTM1; meaning, there is no physically realizable system that can implement UTM2.