Preamble: Delta vs Crux
I don’t natively think in terms of cruxes. But there’s a similar concept which is more natural for me, which I’ll call a delta.
Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it’s cloudy today, that means the “weather” variable in my program at a particular time[1] takes on the value “cloudy”. Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution - in other words, we might have very different beliefs about lots of stuff in the world.
If your model and my model differ in that way, and we’re trying to discuss our different beliefs, then the obvious useful thing-to-do is figure out where that one-parameter difference is.
That’s a delta: one or a few relatively “small”/local differences in belief, which when propagated through our models account for most of the differences in our beliefs.
For those familiar with Pearl-style causal models: think of a delta as one or a few do() operations which suffice to make my model basically match somebody else’s model, or vice versa.
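As a rough illustration, here is a minimal Python sketch of a delta. Everything in it is hypothetical: the toy "world model" `run_model`, its `growth` parameter, and the update rule are made up purely so that one parameter (5 versus 0.3, as in the example above) feeds into several downstream variables. Overriding that one parameter plays the role of the do() operation.

```python
def run_model(growth, steps=5):
    """Toy world model in which every downstream variable depends on `growth`.

    The update rule is an arbitrary illustration, not a claim about any
    real model of the world.
    """
    level = 1.0
    history = []
    for _ in range(steps):
        level = level * growth  # the one parameter we disagree about
        history.append(level)
    return {
        "final_level": level,
        "ever_exceeds_100": any(v > 100 for v in history),
        "trend": "up" if history[-1] > history[0] else "down",
    }


my_beliefs = run_model(growth=5)      # I think the parameter is 5
your_beliefs = run_model(growth=0.3)  # you think it is 0.3

# The programs are identical, yet many downstream beliefs differ.
disagreements = [k for k in my_beliefs if my_beliefs[k] != your_beliefs[k]]
print("We disagree about:", disagreements)

# Applying the delta: intervene on that single parameter in my model.
my_beliefs_after_delta = run_model(growth=0.3)
print("Models match after the delta:", my_beliefs_after_delta == your_beliefs)
```

The point of the sketch is only that arguing about each entry in `disagreements` separately would be far less efficient than locating the single upstream parameter.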
This post is about my current best guesses at the delta between my AI models and Yudkowsky's AI models. When I apply the delta outlined here to my models, and propagate the implications, my models basically look like Yudkowsky’s as far as I can tell. That said, note that this is not an attempt to pass Eliezer's Intellectual Turing Test; I'll still be using my own usual frames.
This post might turn into a sequence if there's interest; I already have another one written for Christiano, and people are welcome to suggest others they'd be interested in.
My AI Model Delta Compared To Yudkowsky
Best guess: Eliezer basically rejects the natural abstraction hypothesis. He mostly expects AI to use internal ontologies fundamentally alien to the ontologies of humans, at least in the places which matter. Lethality #33 lays it out succinctly:
33. The AI does not think like you do, the AI doesn't have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien - nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.
What do my models look like if I propagate that delta? In worlds where natural abstraction basically fails, we are thoroughly and utterly fucked, and a 99% probability of doom strikes me as entirely reasonable and justified.
Here’s one oversimplified doom argument/story in a world where natural abstraction fails hard:
1. Humanity is going to build superhuman goal-optimizing agents. (’Cause, like, obviously somebody’s going to do that, there’s no shortage of capabilities researchers loudly advertising that they’re aiming to do that exact thing.) These will be so vastly more powerful than humans that we have basically-zero bargaining power except insofar as AIs are aligned to our interests.
2. We’re assuming natural abstraction basically fails, so those AI systems will have fundamentally alien internal ontologies. For purposes of this overcompressed version of the argument, we’ll assume a very extreme failure of natural abstraction, such that human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all. (For instance, maybe a faithful and robust translation would be so long in the system’s “internal language” that the translation wouldn’t fit in the system.)
3. Then:
    - Obviously full value alignment is out.
    - Robust and faithful instruction following or “do what I mean” is out; the meaning of human words/concepts can’t be robustly and faithfully represented in the system’s internal ontology at all.
    - Corrigibility is out, unless (here lies one of Eliezer’s hopes) corrigibility turns out to be such a natural concept that it can faithfully and robustly translate even into the ontology of a very alien AI.
    - Insofar as an AI cares-as-a-terminal-goal about keeping humans around, it will care about its own alien conception of “humans” which does not match ours, and will happily replace us with less resource-intensive (or otherwise preferable) things which we would not consider “human”.
    - Interpretability is, at best, some weak correlative heuristics which won’t generalize well. The lack of 99% reliability in mechinterp is not just because our current methods are primitive.
    - Etc, etc. All of the technical alignment hopes are out, unless we posit some objective natural enough that it can be faithfully and robustly translated into the AI’s internal ontology despite the alien-ness.
4. It’s not like this gets any better over time; if anything, AIs’ internal ontologies just keep getting more alien as their power level ramps up.
5. … so we die as soon as one of these superhuman goal-optimizing agents applies enough optimization pressure to the world and the faithfulness/robustness of the translation fails. (Actually, Eliezer expects, we’re likely to die of easier problems before then, but even if our species’ competence is far higher than currently seems, the translation problem would kill us.)
6. As an added bonus, the AIs will know all this (’cause, y’know, they’re smart), will therefore know that divergence between their goals and humans’ goals is inevitable (because their goals are in fundamentally alien ontologies and therefore will diverge out-of-distribution), and will therefore be incentivized to strategically hide their long-term intentions until it’s time for the humans to go.
Note that the “oversimplification” of the argument mostly happened at step 2; the actual expectation here would be that a faithful and robust translation of human concepts is long in the AI’s internal language, which means we would need very high precision in order to instill the translation. But that gets into a whole other long discussion.
By contrast, in a world where natural abstraction basically works, the bulk of human concepts can be faithfully and robustly translated into the internal ontology of a strong AI (and the translation isn't super-long). So, all those technical alignment possibilities are back on the table.
That hopefully gives a rough idea of how my models change when I flip the natural abstraction bit. It accounts for most of the currently-known-to-me places where my models diverge from Eliezer’s. I put nontrivial weight (maybe about 10-20%) on the hypothesis that Eliezer is basically correct on this delta, though it’s not my median expectation.
[1] particular time = particular point in the unrolled execution of the program
(Just to be clear: yes, I know what training and test sets are, as well as dev sets/validation sets. You might notice I actually used the phrase "validation set" in my earlier reply to you, so it's not a matter of guessing someone's password—I'm quite familiar with these concepts, as someone who's implemented ML models myself.)
Generally speaking, training, validation, and test datasets are all sourced the same way—in fact, sometimes they're literally sourced from the same dataset, and the delineation between train/dev/test is introduced during training itself, by arbitrarily carving up the original dataset into smaller sets of appropriate size. This may capture the idea of "IID" you seem to appeal to elsewhere in your comment—that it's possible to test the model's generalization performance on some held-out subset of data from the same source(s) it was trained on.
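For concreteness, a minimal sketch of that kind of carving-up in Python. The function name `split_dataset`, the 80/10/10 proportions, and the integer stand-in data are assumptions made for illustration; real pipelines differ in the details.

```python
import random


def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle one source dataset, then carve it into train/val/test.

    All three splits come from the same source, which is what makes
    held-out performance an "IID" estimate.
    """
    rng = random.Random(seed)
    shuffled = list(examples)  # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


data = list(range(100))  # hypothetical stand-in for real examples
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))
```

Whatever regularities hold across all of `data` hold in all three splits by construction, which is what the next paragraph turns on.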
In ML terms, what the thought experiment points to is a form of underlying distributional shift, one that isn't (and can't be) captured by "IID" validation or test datasets. The QFT model in particular highlights the extent to which your training process, however broad or inclusive from a parochial human standpoint, contains many incidental distributional correlates to your training signal which (1) exist in all of your data, including any you might hope to rely on to validate your model's generalization performance, and (2) cease to correlate off-distribution, during deployment.
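A toy sketch of points (1) and (2), in Python. This is not a model of QFT or of any real training process: the feature names `signal` and `correlate`, the deterministic data generator, and the "pick the single best feature" learner are all hypothetical, chosen only to show a correlate that is perfect in every split drawn from the training source and uninformative in deployment.

```python
def make_examples(n, correlate_tracks_label):
    """Build n deterministic toy examples.

    `signal` is the intended feature and agrees with the label 80% of the
    time. `correlate` is incidental: it matches the label perfectly in the
    training source, and only half the time in deployment.
    """
    examples = []
    for i in range(n):
        label = i % 2
        signal = label if i % 5 != 0 else 1 - label
        correlate = label if correlate_tracks_label else (i // 2) % 2
        examples.append({"signal": signal, "correlate": correlate, "label": label})
    return examples


def accuracy(feature, examples):
    """Fraction of examples where predicting `feature` equals the label."""
    return sum(ex[feature] == ex["label"] for ex in examples) / len(examples)


source = make_examples(100, correlate_tracks_label=True)
train, val = source[:80], source[80:]  # held-out data from the same source
deploy = make_examples(100, correlate_tracks_label=False)

# The "learner" keeps whichever single feature fits the training data best.
chosen = max(["signal", "correlate"], key=lambda f: accuracy(f, train))

val_acc = accuracy(chosen, val)
deploy_acc = accuracy(chosen, deploy)
print(chosen, val_acc, deploy_acc)
```

Validation cannot flag the problem here, because the validation examples share the same incidental correlation as the training examples; the failure only shows up once the correlation breaks.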
This can be caused by what you call "omniscience", but it need not; there are other, more plausible distributional differences that might be picked up on by other kinds of models. But QFT is (as far as our current understanding of physics goes) very close to the base ontology of our universe, and so what is inferrable using QFT is naturally going to be very different from what is inferrable using some other (less powerful) ontology. QFT is a very powerful ontology!
If you want to call that "omniscience", you can, although note that strictly speaking the model is still just working from inferences from training data. It's just that, if you feed enough data to a model that can hold entire swaths of the physical universe inside of its metaphorical "head", pretty soon hypotheses that involve the actual state of that universe will begin to outperform hypotheses that don't, and which instead use some kind of lossy approximation of that state involving intermediary concepts like "intent", "belief", "agent", "subjective state", etc.
You're close; I'd say the concern is slightly worse than that. It's that the "future data" never actually comes into existence, at any point. So the source of distributional shift isn't just "the data is generated at the wrong time", it's "the data never gets externally generated to begin with, and you (the model) have to work with predictions of what the data counterfactually would have been, had it been generated".
(This would be the case e.g. with any concept of "human approval" that came from a literal physical human or group of humans during training, and not after the system was deployed "in the wild".)
The problem is that "vanilla" abstractions are not the most predictively useful possible abstractions, if you've got access to better ones. And models whose ambient hypothesis space is broad enough to include better abstractions (from the standpoint of predictive accuracy) will gravitate towards those, as is incentivized by the outer form of the training task. QFT is the extreme example of a "better abstraction", but in principle (if the natural abstraction hypothesis fails) there will be all sorts and shapes of abstractions, and some of them will be available to us, and some of them will be available to the model, and these sets will not fully overlap—which is a concern in worlds where different abstractions lead to different generalization properties.