Preamble: Delta vs Crux
I don’t natively think in terms of cruxes. But there’s a similar concept which is more natural for me, which I’ll call a delta.
Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it’s cloudy today, that means the “weather” variable in my program at a particular time[1] takes on the value “cloudy”. Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution - in other words, we might have very different beliefs about lots of stuff in the world.
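To make that concrete, here is a minimal toy sketch (the model, the parameter name, and every number are invented purely for illustration): two "programs" identical except for one parameter, whose downstream variables end up disagreeing everywhere.

```python
# Toy illustration: two world-models that share all structure but differ in one
# parameter, plus the downstream "beliefs" that diverge as a result.
# The parameter name and all numbers are invented for illustration.

def world_model(cloud_seeding_rate):
    """A tiny 'program' whose downstream variables all depend on one parameter."""
    p_cloudy = min(1.0, 0.1 + 0.1 * cloud_seeding_rate)   # belief about the weather
    p_umbrella_needed = 0.9 * p_cloudy                     # propagated belief
    p_picnic_ruined = 0.5 * p_umbrella_needed              # propagated further
    return {"cloudy": p_cloudy,
            "umbrella_needed": p_umbrella_needed,
            "picnic_ruined": p_picnic_ruined}

my_beliefs = world_model(cloud_seeding_rate=5.0)    # I think the parameter is 5
your_beliefs = world_model(cloud_seeding_rate=0.3)  # you think it's 0.3

for var in my_beliefs:
    print(f"{var}: mine={my_beliefs[var]:.2f}, yours={your_beliefs[var]:.2f}")
# One small parameter difference propagates into different values for every
# downstream variable, i.e. very different beliefs about lots of stuff.
```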
If your model and my model differ in that way, and we’re trying to discuss our different beliefs, then the obvious useful thing-to-do is figure out where that one-parameter difference is.
That’s a delta: one or a few relatively “small”/local differences in belief, which when propagated through our models account for most of the differences in our beliefs.
For those familiar with Pearl-style causal models: think of a delta as one or a few do() operations which suffice to make my model basically match somebody else’s model, or vice versa.
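And a correspondingly minimal sketch of that do()-style framing, treating the intervention as a parameter override rather than Pearl's variable-level graph surgery (a loose analogy; again, everything here is illustrative):

```python
# do()-style framing of a delta: one override which forces a piece of my model
# to your value, after which my predictions match yours.  This is a loose,
# parameter-level stand-in for Pearl's variable-level do(); purely illustrative.

def my_model(x, param=5.0):
    return param * x            # my prediction for input x

def your_model(x, param=0.3):
    return param * x            # your prediction for input x

def do(model, **overrides):
    """Return a copy of `model` with the given arguments forcibly set."""
    def patched(x, **kwargs):
        kwargs.update(overrides)    # the do()-style override
        return model(x, **kwargs)
    return patched

patched_me = do(my_model, param=0.3)
assert all(patched_me(x) == your_model(x) for x in range(10))
# If one (or a few) such overrides suffice to make my model reproduce your
# predictions, that small set of overrides is the delta between us.
```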
This post is about my current best guesses at the delta between my AI models and Yudkowsky's AI models. When I apply the delta outlined here to my models, and propagate the implications, my models basically look like Yudkowsky’s as far as I can tell. That said, note that this is not an attempt to pass Eliezer's Intellectual Turing Test; I'll still be using my own usual frames.
This post might turn into a sequence if there's interest; I already have another one written for Christiano, and people are welcome to suggest others they'd be interested in.
My AI Model Delta Compared To Yudkowsky
Best guess: Eliezer basically rejects the natural abstraction hypothesis. He mostly expects AI to use internal ontologies fundamentally alien to the ontologies of humans, at least in the places which matter. Lethality #33 lays it out succinctly:
> 33. The AI does not think like you do, the AI doesn't have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien - nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.
What do my models look like if I propagate that delta? In worlds where natural abstraction basically fails, we are thoroughly and utterly fucked, and a 99% probability of doom strikes me as entirely reasonable and justified.
Here’s one oversimplified doom argument/story in a world where natural abstraction fails hard:
- Humanity is going to build superhuman goal-optimizing agents. (‘Cause, like, obviously somebody’s going to do that, there’s no shortage of capabilities researchers loudly advertising that they’re aiming to do that exact thing.) These will be so vastly more powerful than humans that we have basically-zero bargaining power except insofar as AIs are aligned to our interests.
- We’re assuming natural abstraction basically fails, so those AI systems will have fundamentally alien internal ontologies. For purposes of this overcompressed version of the argument, we’ll assume a very extreme failure of natural abstraction, such that human concepts cannot be faithfully and robustly translated into the system’s internal ontology at all. (For instance, maybe a faithful and robust translation would be so long in the system’s “internal language” that the translation wouldn’t fit in the system.)
- Then:
  - Obviously full value alignment is out.
  - Robust and faithful instruction following or “do what I mean” is out; the meaning of human words/concepts can’t be robustly and faithfully represented in the system’s internal ontology at all.
  - Corrigibility is out, unless (here lies one of Eliezer’s hopes) corrigibility turns out to be such a natural concept that it can faithfully and robustly translate even into the ontology of a very alien AI.
  - Insofar as an AI cares-as-a-terminal-goal about keeping humans around, it will care about its own alien conception of “humans” which does not match ours, and will happily replace us with less resource-intensive (or otherwise preferable) things which we would not consider “human”.
  - Interpretability is, at best, some weak correlative heuristics which won’t generalize well. The lack of 99% reliability in mechinterp is not just because our current methods are primitive.
  - Etc, etc. All of the technical alignment hopes are out, unless we posit some objective natural enough that it can be faithfully and robustly translated into the AI’s internal ontology despite the alien-ness.
- It’s not like this gets any better over time; if anything, AIs’ internal ontologies just keep getting more alien as their power level ramps up.
- … so we die as soon as one of these superhuman goal-optimizing agents applies enough optimization pressure to the world and the faithfulness/robustness of the translation fails. (Actually, Eliezer expects we’re likely to die of easier problems before then; but even if our species’ competence turns out to be far higher than it currently seems, the translation problem alone would still kill us.)
- As an added bonus, the AIs will know all this (‘cause, y’know, they’re smart), will therefore know that divergence between their goals and humans’ goals is inevitable (because their goals are in fundamentally alien ontologies and therefore will diverge out-of-distribution), and will therefore be incentivized to strategically hide their long-term intentions until it’s time for the humans to go.
Note that the “oversimplification” of the argument mostly happened at the second bullet; the actual expectation here is that a faithful and robust translation of human concepts would be long in the AI’s internal language, which means we would need very high precision in order to instill the translation. But that gets into a whole other long discussion.
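To gesture at why a long translation demands high precision, here is a back-of-the-envelope sketch under an assumption introduced purely for illustration (that the translation has N independent "components", each instilled correctly with probability p):

```python
# Back-of-the-envelope sketch, under an illustrative independence assumption:
# if the translation has N independent "components", each instilled correctly
# with probability p, the whole thing is faithful with probability p**N.

for n_components in (10, 100, 1_000, 10_000):
    for p in (0.99, 0.999, 0.9999):
        print(f"N={n_components:>6}, per-component accuracy {p}: "
              f"whole-translation success ~ {p**n_components:.3g}")
# Even 99.99% per-component accuracy gives only ~37% success at N = 10,000,
# which is the sense in which a long translation demands very high precision.
```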
By contrast, in a world where natural abstraction basically works, the bulk of human concepts can be faithfully and robustly translated into the internal ontology of a strong AI (and the translation isn't super-long). So, all those technical alignment possibilities are back on the table.
That hopefully gives a rough idea of how my models change when I flip the natural abstraction bit. It accounts for most of the currently-known-to-me places where my models diverge from Eliezer’s. I put nontrivial weight (maybe about 10-20%) on the hypothesis that Eliezer is basically correct on this delta, though it’s not my median expectation.
[1] particular time = particular point in the unrolled execution of the program
Appendix: Replies On The QFT-Based Model Example
Because the human isn't going to constantly be present for everything the system does after it's deployed (unless for some reason it's not deployed).
Quantum fields are useful for an endless variety of things, from modeling genius philosophers to predicting lottery numbers. If your next-token prediction task involves any physically instantiated system, a model that uses QFT will be able to predict that system's time-evolution with alacrity.
(Yes, this is computationally intractable, but we're already in full-on hypothetical land with the QFT-based model to begin with. Remember, this is an exercise in showing what happens in the worst-case scenario for alignment, where the model's native ontology completely diverges from our own.)
So we need not assume that predicting "the genius philosopher" is a core task. It's enough to assume that the model is capable of it, among other things—which a QFT-based model certainly would be. Which, not so coincidentally, brings us to your next question:
Consider how, during training, the human overseer (or genius philosopher, if you prefer) would have been pointed out to the model. We don't have reliable access to its internal world-model, and even if we did we'd see blobs of amplitude and not much else. There's no means, in that setting, of picking out the human and telling the model to unambiguously defer to that human.
What must happen instead, then, is something like next-token prediction: we perform gradient descent (or some other optimization method; it doesn't really matter for the purposes of our story) on the model's outputs, rewarding it when its outputs happen to match those of the human. The hope is that this will lead, in the limit, to the matching no longer occurring by happenstance—that if we train for long enough and in a varied enough set of situations, the best way for the model to produce outputs that track those of the human is to model that human, even in its QFT ontology.
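Here is a minimal sketch of that training story (the "overseer", the model class, and every number are invented for illustration): the optimization only ever sees agreement between outputs, and nothing ever points at the overseer inside the model's internals.

```python
# Minimal sketch of the training story above; the "overseer", the model class,
# and all numbers here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

def overseer_label(x):
    """Stand-in for the human overseer's judgment on a batch of situations x."""
    return (x @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)

# A tiny model (logistic regression), trained purely to match the overseer's
# outputs via gradient descent on a per-output agreement loss.
w = np.zeros(3)
for step in range(2_000):
    x = rng.normal(size=(64, 3))            # training situations
    y = overseer_label(x)                   # overseer present, giving feedback
    p = 1.0 / (1.0 + np.exp(-(x @ w)))      # model's predictions
    w -= 0.1 * x.T @ (p - y) / len(y)       # gradient step toward agreement

x_new = rng.normal(size=(5, 3))             # "deployment": overseer absent
print(np.round(1.0 / (1.0 + np.exp(-(x_new @ w)))))
print(overseer_label(x_new))
# On-distribution agreement is all the training signal ever checked; it says
# nothing direct about what internal referent the model ended up using.
```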
But do we know for a fact that this will be the case? Even if it is, what happens when the overseer isn't present to provide their actual feedback, a situation that never arose during training? What becomes the model's referent then? We'd like to deploy it without an overseer, or in situations too complex for an overseer to understand. And whether the model's behavior in those situations conforms to what the overseer would ideally want depends on what kinds of behind-the-scenes extrapolation the model is doing—which, if the model's native ontology is something in which "human philosophers" are not basic objects, is liable to look very weird indeed.
Sort of, yes—but I'd call it "malgeneralization" rather than "misgeneralization". It's not failing to generalize, it's just not generalizing the way you'd want it to.
Depends on what you mean by "progress", and "out-of-distribution". A powerful QFT-based model can make perfectly accurate predictions in any scenario you care to put it in, so it's not like you'll observe it getting things wrong. What experiments, and experimental outcomes, are you imagining here, such that those outcomes would provide evidence of "progress on out-of-distribution generalization", when fundamentally the issue is expected to arise in situations where the experimenters are themselves absent (and which—crucially—is not a condition you can replicate as part of an experimental setup)?