ETA: This post can basically be read as arguing that imitating human decisions, or any other outputs from an (approximate) planning process, seems especially likely to produce mesa-optimization, since a competent imitator should recover an (approximate) planning (i.e. optimization) process.
This post states an observation which I think a number of people have had, but which hasn't been written up (AFAIK). I find it one of the more troubling outstanding issues with a number of proposals for AI alignment.
1) Training a flexible model with a reasonable simplicity prior to imitate human decisions (e.g. via behavioral cloning) should presumably yield a good approximation of the process by which human judgments arise, and that process involves planning.
2) We shouldn't expect to learn exactly the correct process, though.
3) Therefore imitation learning might produce an AI which implements an unaligned planning process, which seems likely to have instrumental goals and to be dangerous.
Example: The human might be planning over a bounded horizon of time-steps, or with a bounded utility function, and the AI might infer a version of the planning process that doesn't bound the horizon or the utility.
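To make this concrete, here is a toy sketch (my own construction, not from the post) of the identifiability problem: two planners that agree on the training distribution but diverge off it, so imitation data alone can't tell you which one the human is running. All specifics (the chain MDP, the reward values, the horizon of 3) are illustrative assumptions.

```python
# Toy chain MDP: states 0..N. Action "go" moves one step toward state N,
# which pays a large reward; "stop" takes a small immediate reward instead.

def plan(state, horizon, N=10, big=100.0, small=1.0):
    """Return the better action under a bounded lookahead of `horizon` steps."""
    steps_to_goal = N - state
    value_go = big if steps_to_goal <= horizon else 0.0  # goal unreachable in time -> worthless
    value_stop = small
    return "go" if value_go > value_stop else "stop"

bounded = lambda s: plan(s, horizon=3)        # the "true" human: 3-step lookahead
unbounded = lambda s: plan(s, horizon=10**9)  # the mis-inferred planner: no bound

# On training states near the goal, the two policies are indistinguishable...
train_states = [8, 9, 10]
assert all(bounded(s) == unbounded(s) for s in train_states)

# ...but far from the goal they diverge: the unbounded planner keeps pursuing
# the large reward where the bounded human would not.
print(bounded(0), unbounded(0))  # stop go
```

Both planners fit the demonstrations perfectly, so a simplicity prior (rather than the data) decides which one gets learned.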
Clarifying note: Imitating a human is just one example; the key feature of the human is that the process generating their decisions is (arguably) well-modeled as involving planning over a long horizon.
Counter-argument(s):
- The human may have privileged access to context informing their decision; without that context, the solution may look very different
- Mistakes in imitating the human may be relatively harmless; the approximation may be good enough
- We can restrict the model family with the specific intention of preventing planning-like solutions
Overall, I have considerable uncertainty about the importance of this issue, and I would like to see more thought put into it.
In trying to argue for the safety of imitation learning, the key property of the imitation might be something like "the imitation is indistinguishable from the actual human on this distribution, where the distinguishers may be from this model class and have this information". GANs get a property like this pretty directly, although there is an issue: the discriminators don't have access to the imitator's internal states, so e.g. they can't tell whether the imitator is using pseudorandomness or the input-based randomness that imitators have access to.
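As a toy illustration of this "indistinguishable to discriminators from this class" property (my own sketch, not from the post): take the discriminator class to be classifiers that observe a single action. The best accuracy any such discriminator can achieve is determined by how far apart the two action distributions are, so a faithful imitator drives it to chance. All names and numbers here are made up for illustration.

```python
from collections import Counter
import random

random.seed(0)

def human_policy(state):
    # Imitated human: mostly cooperates, occasionally defects.
    return "defect" if random.random() < 0.1 else "cooperate"

def imitator_policy(state):
    # A faithful imitator matches the human's action distribution.
    return "defect" if random.random() < 0.1 else "cooperate"

def misaligned_policy(state):
    # A mis-inferred imitator with a different action distribution.
    return "defect" if random.random() < 0.5 else "cooperate"

def best_discriminator_accuracy(policy_a, policy_b, n=5000):
    """Best accuracy achievable by a discriminator that sees one action.

    Optimal rule: guess the source whose empirical frequency for the
    observed action is higher (accuracy = 1/2 + total variation / 2).
    """
    a = Counter(policy_a(None) for _ in range(n))
    b = Counter(policy_b(None) for _ in range(n))
    actions = set(a) | set(b)
    correct = sum(max(a[x], b[x]) for x in actions)
    return correct / (2 * n)

acc_faithful = best_discriminator_accuracy(human_policy, imitator_policy)
acc_misaligned = best_discriminator_accuracy(human_policy, misaligned_policy)
print(round(acc_faithful, 3))   # close to 0.5: the imitator passes this class
print(round(acc_misaligned, 3)) # well above 0.5: the discriminator catches it
```

The caveat in the text corresponds to the fact that this discriminator only sees outputs: an imitator could match every action distribution while computing them via a very different (e.g. pseudorandom, or malign) internal process.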
Suppose we have an imitation system with a property like this. Then the argument for safety is going to look something like: "The human being imitated is corrigible. If the imitator weren't corrigible, then the discriminator could tell, because this would result in distinguishable actions."
For example, suppose the imitated human is able to violently take control of the AI lab. In fact the imitated human doesn't do this (in any of the training data), and, arguably, it doesn't take a smart discriminator to tell that the human probably wouldn't do this in a new situation. So if the imitator did do this (because, e.g., it incorrectly inferred the human's decision theory), the discriminator could tell it apart. Of course, arguing that the discriminator generalizes this well would require some robustness argument. This particular problem seems easy (if the method for taking control involves really obvious actions like using weapons), but there might be more subtle ways of taking control. In those cases we would want some argument that, if the imitator comes up with a malign/incorrigible plan, then a discriminator with access to the imitator's internal states can notice this, and notice that the imitated human wouldn't do it: recognizing the plan isn't harder than coming up with it in the first place, and the discriminator is at least as capable as the imitator.
In general, while there are potential problems, I expect them to be more subtle than "the imitator incorrectly infers the human's decision theory and pursues convergent instrumental goals".
(Worth noting other problems with imitation learning, discussed in this post and this post)
I think I disagree pretty broadly with the assumptions/framing of your comment, although not necessarily the specific claims.
1) I don't think it's realistic to imagine we have "indistinguishable imitation" with an idealized discriminator. It might be possible in the future, and it might be worth considering to make intellectual progress, but I'm not expecting it to happen on a deadline. So I'm talking about what I expect might be a practical problem if we actually try to build systems that imitate humans in the coming decade...