Preamble: Delta vs Crux
I don’t natively think in terms of cruxes. But there’s a similar concept which is more natural for me, which I’ll call a delta.
Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it’s cloudy today, that means the “weather” variable in my program at a particular time[1] takes on the value “cloudy”. Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution - in other words, we might have very different beliefs about lots of stuff in the world.
If your model and my model differ in that way, and we’re trying to discuss our different beliefs, then the obvious useful thing-to-do is figure out where that one-parameter difference is.
That’s a delta: one or a few relatively “small”/local differences in belief, which when propagated through our models account for most of the differences in our beliefs.
For those familiar with Pearl-style causal models: think of a delta as one or a few do() operations which suffice to make my model basically match somebody else’s model, or vice versa.
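As a toy sketch of the idea (not anything from a real model): the snippet below runs the same tiny "program" under two values of one parameter, lists the variables on which the runs disagree, and then applies a do()-style override to make one model match the other. The variable names (`k`, `growth`, `pressure`, `outcome`), the formulas, and the helper functions are all invented for illustration; only the values 5 and 0.3 come from the example above. Because `k` is a root parameter here, intervening on it amounts to simply setting its value.

```python
# Toy illustration only: every variable and formula here is invented for this sketch.

def run_model(params):
    """Run a tiny 'world model' program and return every variable it computes."""
    k = params["k"]                  # the single parameter we disagree on
    growth = 2.0 * k                 # downstream variable
    pressure = growth ** 2 + 1.0     # further downstream variable
    outcome = "bad" if pressure > 10.0 else "fine"  # downstream belief
    return {"k": k, "growth": growth, "pressure": pressure, "outcome": outcome}


def do(params, **overrides):
    """Pearl-style intervention on root parameters: force them to given values."""
    return {**params, **overrides}


def differing_variables(a, b):
    """Names of the variables on which two model runs disagree."""
    return sorted(name for name in a if a[name] != b[name])


my_params = {"k": 5.0}
your_params = {"k": 0.3}

mine = run_model(my_params)
yours = run_model(your_params)

# One local difference shows up as disagreement on every downstream variable.
disagreements = differing_variables(mine, yours)

# Applying the delta (a single do() operation) makes my model match yours.
mine_after_delta = run_model(do(my_params, k=0.3))
models_match = mine_after_delta == yours
```

In this sketch the two runs disagree on all four variables, yet a single override is enough to reconcile them, which is the sense in which the delta is "small" even when the downstream beliefs are far apart.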
This post is about my current best guesses at the delta between my AI models and Yudkowsky's AI models. When I apply the delta outlined here to my models, and propagate the implications, my models basically look like Yudkowsky’s as far as I can tell. That said, note that this is not an attempt to pass Eliezer's Intellectual Turing Test; I'll still be using my own usual frames.
This post might turn into a sequence if there's interest; I already have another one written for Christiano, and people are welcome to suggest others they'd be interested in.
My AI Model Delta Compared To Yudkowsky
Best guess: Eliezer basically rejects the natural abstraction hypothesis. He mostly expects AI to use internal ontologies fundamentally alien to the ontologies of humans, at least in the places which matter. Lethality #33 lays it out succinctly:
33. The AI does not think like you do, the AI doesn't have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien - nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.
What do my models look like if I propagate that delta? In worlds where natural abstraction basically fails, we are thoroughly and utterly fucked, and a 99% probability of doom strikes me as entirely reasonable and justified.
Here’s one oversimplified doom argument/story in a world where natural abstraction fails hard:
- Humanity is going to build superhuman goal-optimizing agents. (‘Cause, like, obviously somebody’s going to do that, there’s no shortage of capabilities researchers loudly advertising that they’re aiming to do that exact thing.) These will be so vastly more powerful than humans that we have basically-zero bargaining power except insofar as AIs are aligned to our interests.
- We’re assuming natural abstraction basically fails, so those AI systems will have fundamentally alien internal ontologies. For purposes of this overcompressed version of the argument, we’ll assume a very extreme failure of natural abstraction, such that human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all. (For instance, maybe a faithful and robust translation would be so long in the system’s “internal language” that the translation wouldn’t fit in the system.)
- Then:
  - Obviously full value alignment is out.
  - Robust and faithful instruction following or “do what I mean” is out; the meaning of human words/concepts can’t be robustly and faithfully represented in the system’s internal ontology at all.
  - Corrigibility is out, unless (here lies one of Eliezer’s hopes) corrigibility turns out to be such a natural concept that it can faithfully and robustly translate even into the ontology of a very alien AI.
  - Insofar as an AI cares-as-a-terminal-goal about keeping humans around, it will care about its own alien conception of “humans” which does not match ours, and will happily replace us with less resource-intensive (or otherwise preferable) things which we would not consider “human”.
  - Interpretability is, at best, some weak correlative heuristics which won’t generalize well. The lack of 99% reliability in mechinterp is not just because our current methods are primitive.
  - Etc, etc. All of the technical alignment hopes are out, unless we posit some objective natural enough that it can be faithfully and robustly translated into the AI’s internal ontology despite the alien-ness.
- It’s not like this gets any better over time; if anything, AIs’ internal ontologies just keep getting more alien as their power level ramps up.
- … so we die as soon as one of these superhuman goal-optimizing agents applies enough optimization pressure to the world and the faithfulness/robustness of the translation fails. (Actually, Eliezer expects, we’re likely to die of easier problems before then, but even if our species’ competence is far higher than currently seems, the translation problem would kill us.)
- As an added bonus, the AIs will know all this (‘cause, y’know, they’re smart), will therefore know that divergence between their goals and humans’ goals is inevitable (because their goals are in fundamentally alien ontologies and therefore will diverge out-of-distribution), and will therefore be incentivized to strategically hide their long-term intentions until it’s time for the humans to go.
Note that the “oversimplification” of the argument mostly happened at step 2, the second bullet above, which assumes a very extreme failure of natural abstraction. The actual expectation here would be that a faithful and robust translation of human concepts is long in the AI’s internal language, which means we would need very high precision in order to instill the translation. But that gets into a whole other long discussion.
By contrast, in a world where natural abstraction basically works, the bulk of human concepts can be faithfully and robustly translated into the internal ontology of a strong AI (and the translation isn't super-long). So, all those technical alignment possibilities are back on the table.
That hopefully gives a rough idea of how my models change when I flip the natural abstraction bit. It accounts for most of the currently-known-to-me places where my models diverge from Eliezer’s. I put nontrivial weight (maybe about 10-20%) on the hypothesis that Eliezer is basically correct on this delta, though it’s not my median expectation.
[1] particular time = particular point in the unrolled execution of the program
If I've understood you correctly, you consider your only major delta with Eliezer Yudkowsky to be whether or not natural abstractions basically always work, or, to put it in different terms, whether they reliably exist in a harnessable form. Is that a fair restatement?
If so, I'm (specifically) a little surprised that that's all. I would have expected whatever reasoning the two of you did differently or whatever evidence the two of you weighted differently (or whatever else) would have also given you some other (likely harder to pin down) generative-disagreements (else maybe it's just really narrow really strong evidence that one of you saw and the other didn't???).
Maybe that's just second-order though. But I would still like to hear what the delta between NADoom!John and EY still is, if there is one. If there isn't, that's surprising, too, and I'd be at least a little tempted to see what pairs of well-regarded alignment researchers still seem to agree on (and then if there are nonobvious commonalities there).
Also, to step back from the delta a bit here -
For all we know, all our existing training methods are really good at producing AIs with alien ontologies, and there's some really weird unexpected procedure you need to follow that does produce nice ontology-sharing aligned-by-default AIs. I wouldn't call it likely, but if we feel up to positing that possibility at all, we should also be willing to posit the reverse.