I like this post and agree that there are different threat models one might categorize broadly under "inner alignment". Before reading this I hadn't reflected on the relationship between them.
Some random thoughts (after an in-person discussion with Erik):
Thanks for the comments!
One can define deception as a type of distributional shift. [...]
I technically agree with what you're saying here, but one of the implicit claims I'm trying to make in this post is that this is not a good way to think about deception. Specifically, I expect solutions to deception to look quite different from solutions to (large) distributional shift. Curious if you disagree with that.
Overall I agree that solutions to deception look different from solutions to other kinds of distributional shift. (There are probably also different solutions to different kinds of large distributional shift, e.g. solutions for capability generalization vs. solutions for goal generalization.)
I do think one could claim that some general solutions to distributional shift would also solve deceptiveness. E.g., the consensus algorithm works for any kind of distributional shift, but it should presumably also avoid deceptiveness (in the sense that it would not go ahead and suddenly start maximizing some different goal function, but instead would query the human first). Stuart Armstrong might claim a similar thing about concept extrapolation?
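To gesture at the shape of that idea, here's a minimal toy sketch in Python (my own illustration of the "defer to a human off-distribution" pattern, not Cohen's actual consensus algorithm; all names are made up): keep an ensemble of models, act autonomously only when they unanimously agree, and query the human otherwise.

```python
from typing import Callable, List

def consensus_act(models: List[Callable[[str], str]],
                  observation: str,
                  ask_human: Callable[[str], str]) -> str:
    """Act autonomously only if every model in the ensemble proposes the
    same action; otherwise defer to the human. Off-distribution inputs
    (including "now I could defect" situations) are exactly where the
    models are most likely to disagree, so the system falls back to the
    human there."""
    proposals = {model(observation) for model in models}
    if len(proposals) == 1:
        return proposals.pop()      # unanimous: act without oversight
    return ask_human(observation)   # disagreement: query the human instead
```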
I personally think it is probably best to just try to work on deceptiveness directly instead of solving some more general problem and hoping non-deceptiveness is a side effect. It is probably harder to find a general solution than to solve only deceptiveness. Though maybe this depends on one's beliefs about what is easy or hard to do with deep learning.
Distributional shift: The worry is precisely that capabilities will generalize better than goals across the distributional shift. If capabilities didn't generalize, we'd be fine. But as the CoinRun agent exemplifies, you can get AIs that capably pursue a different objective after a distributional shift than the one you were hoping for. One difference from deception is that models which become incompetent after a distributional shift are in fact quite plausible. But to the extent that we think we'll get goal misgeneralization specifically, the underlying worry again seems to be that capabilities will be robust while alignment will not.
One thing to flag: even if, for any given model, the probability of capabilities generalizing is very low, the total probability of doom can still be high, since there might be many tries at getting models that generalize well across distributional shifts, whereas the selection pressure toward alignment robustness is comparatively weaker. You can imagine a 2x2 grid of capabilities vs. alignment generalization across distributional shift:
Capabilities don't generalize, alignment doesn't: irrelevant
Capabilities don't generalize, alignment does: irrelevant
Capabilities generalize, alignment doesn't: potentially very dangerous, especially if power-seeking. The agent (or the agent and its friends) acquires more power and may attempt a takeover.
Capabilities generalize, alignment does: good, but not clearly great. By default I wouldn't expect it to be power-seeking (unless you're deliberately creating a sovereign), so it only has as much power as humans allow it to have. So the AI risks being outcompeted by its more nefarious peers.
Fantastic post!
But for now, this doesn't lead to permanent improvements to the language model.
LMs like RETRO or WebGPT seem to be an answer to this. They have an external database that essentially improves the model's capabilities permanently. If you have a sufficiently powerful LM that can surf the web and extract new relevant info, you can get continuous capability gains without needing to actually retrain the model with gradient descent. I would personally call this permanent.
Note that the database doesn't have to be a curated external source; it can also be the model's own high-quality prompts and interactions with users, which get stored and then reused when that makes sense.
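To make the retrieval point concrete, here's a minimal sketch of the general pattern (nothing like the actual RETRO or WebGPT architectures; `embed` and `generate` are hypothetical stand-ins for an embedding model and an LM inference call): new knowledge is written into an external store and prepended to prompts, with no gradient update involved.

```python
import numpy as np

class RetrievalAugmentedLM:
    def __init__(self, embed, generate):
        self.embed = embed          # text -> np.ndarray (hypothetical)
        self.generate = generate    # prompt -> text (hypothetical)
        self.docs, self.vecs = [], []

    def add_document(self, text: str) -> None:
        """The "permanent" improvement: store the document, no retraining."""
        self.docs.append(text)
        self.vecs.append(self.embed(text))

    def answer(self, query: str, k: int = 3) -> str:
        """Retrieve the k most similar stored documents and prepend them."""
        q = self.embed(query)
        sims = [float(q @ v) for v in self.vecs]
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        context = "\n".join(self.docs[i] for i in top)
        return self.generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```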
TL;DR: This is an attempt to disentangle some concepts that I used to conflate too much as just "inner alignment". This will be old news to some, but might be helpful for people who feel confused about how deception, distributional shift, and "sharp left turns" are related. I first discuss them as entirely separate threat models, and then talk about how they're all aspects of "capabilities are more robust than alignment".
Here are three different threat models for how an AI system could very suddenly do catastrophic things:
Note that these can be formulated as entirely distinct scenarios. For example, deception requires neither a distributional shift[1] nor capability gains; instead, the sudden change in model behavior occurs because the AI was "let out of the box" during deployment. Conversely, in the distributional shift scenario, the model might not be deceptive during training, etc. (One way to think about this is that they rely on changes along different axes of the training/deployment dichotomy.)
Examples
I don't think we have any empirical examples of deception in AI systems, though there are thought experiments. We do see somewhat similar phenomena in interactions between humans, basically whenever someone pretends to have a different goal than the one they actually have in order to gain influence.
To be clear, here's one thing that is not an example of deception in the sense in which I'm using the word: an AI does things during training that only look good to humans even though they actually aren't, and then continues to do those things in deployment. To me, this seems like a totally different failure mode, but I've also seen this called "deception" (e.g. "Goodhart deception" in this post), thus the clarification.
We do have experimental evidence for goal misgeneralization under distributional shift (the second scenario above). A well-known example is the CoinRun agent from Goal misgeneralization in Deep RL; more recently, DeepMind published many more examples.
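To make the CoinRun failure mode concrete, here's a toy sketch (my own illustration, not the actual Procgen environment): during training the coin always sits at the right end of the level, so the proxy policy "always move right" gets full reward; after a distributional shift that moves the coin, the policy still competently traverses the level but ignores the coin.

```python
def run_episode(coin_pos: int, level_length: int = 10) -> bool:
    """Toy 1D level: the agent is rewarded iff it ends the episode on the
    coin. The learned proxy policy just walks right until it hits the wall."""
    agent_pos = 0
    for _ in range(level_length):
        agent_pos = min(agent_pos + 1, level_length - 1)  # proxy: always go right
    return agent_pos == coin_pos

# Training distribution: the coin is always at the rightmost cell.
print(run_episode(coin_pos=9))  # True  (capabilities and "goal" both look fine)
# Deployment: the coin has moved; the agent still capably runs right.
print(run_episode(coin_pos=4))  # False (capabilities generalize, the goal doesn't)
```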
A classic example of sudden capability gains is the history of human evolution. Relatively small changes in the human brain compared to other primates made cultural evolution feasible, which allowed humans to improve from a source other than biological evolutionary pressure. The consequence was extremely quick capability gains for humanity (compared to evolutionary timescales). This example contains both the "threshold mechanism", where a small change to cognitive architectures has big effects, and the "learning from another source" mechanism, with the former enabling the latter.
In ML, grokking might be an example of the "threshold mechanism" for sudden capability gains: a comparatively small number of gradient steps can massively improve generalization beyond the training distribution. An example of learning from something other than gradients is in-context learning in language models (e.g. you can give an LM information in the prompt and it can use that information). But for now, this doesn't lead to permanent improvements to the language model.
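As a minimal illustration of that last point (with `generate` as a hypothetical stand-in for an LM inference call): all of the "learning" lives in the prompt and disappears with the context window; the weights are never updated.

```python
def answer_with_context(generate, fact: str, question: str) -> str:
    # In-context learning: the new information is supplied in the prompt,
    # the model's weights are untouched, and the information is gone once
    # the context window is.
    prompt = (
        "Use the following information to answer the question.\n"
        f"Information: {fact}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return generate(prompt)
```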
Relations between these concepts
I used to conflate deception, distributional shift, and sharp left turns as "inner alignment" in a way that I now think wasn't helpful. But on the other hand, these do feel related, so what makes them similar?
One obvious aspect is that these could all lead to very sudden failures (as opposed to a pure "going out with a whimper" scenario). In each case, the AI might behave fine for a while—not just in terms of "looking fine" to human observers, but even under some ideal outer alignment solution. Then something changes, typically quite suddenly, and the AI behaves very differently (and likely badly in a way that would be obvious to us). The reason these scenarios are dangerous is thus that the AI could make high-stakes decisions, to use Paul's framing. I think this is the sense in which they all feel related to inner alignment.
A more interesting (but also more hand-wavy) point is that all three are in some sense about capabilities being more robust than alignment:
A note on distributional shift
I've specifically been talking about goal misgeneralization under distributional shift. Here, we mainly care about avoiding catastrophic generalization, i.e. avoiding competent pursuit of a misaligned goal. There's a different type of distributional shift that's important in alignment, where we care about actually good generalization rather than merely non-catastrophic generalization. For example, for ELK we want to generalize from questions that humans can answer to questions we can't answer. But we don't just want to generalize in a way that doesn't kill us if we read the answer; we want to generalize in the way that gives correct answers. This seems distinct from what I've discussed in this post in some important ways, though there are definitely parallels (e.g. the generalization we want not being natural enough that we're guaranteed to get it automatically).
Thanks to Leon Lang for feedback on a draft of this post!
A tiny distributional shift would be enough to let the model determine whether it's in training or deployment, and in principle it could even be deceptive without any expected distributional shift, by defecting randomly.