ETA: This post can basically be read as arguing that imitating human decisions, or any other outputs from an (approximate) planning process, seems especially likely to produce mesa-optimization, since a competent imitator should recover an (approximate) planning (i.e. optimization) process.
This post states an observation which I think a number of people have had, but which hasn't been written up (AFAIK). I find it one of the more troubling outstanding issues with a number of proposals for AI alignment.
1) Training a flexible model with a reasonable simplicity prior to imitate human decisions (e.g. via behavioral cloning) should presumably yield a good approximation of the process by which human judgments arise, and that process involves planning.
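To make "behavioral cloning" concrete, here is a minimal sketch of it as supervised learning on (state, action) pairs. The gridworld states and actions are made up for illustration; a real implementation would fit a neural network by maximizing the log-likelihood of the demonstrated actions, of which this frequency table is the tabular special case.

```python
from collections import Counter, defaultdict

def behavioral_clone(demonstrations):
    """Fit a tabular policy by supervised learning on (state, action) pairs.

    Tabular special case of behavioral cloning: the max-likelihood policy
    simply picks the most frequent demonstrated action in each state.
    """
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

# Hypothetical demonstrations from a human "planner" in a toy gridworld:
demos = [("start", "right"), ("start", "right"), ("start", "up"),
         ("middle", "right"), ("goal", "stay")]
policy = behavioral_clone(demos)
print(policy["start"])  # "right", the modal human action in that state
```

The point of the post is that a table like this cannot generalize, so a competent imitator with a simplicity prior is pushed toward recovering the planning process that generated the actions instead.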
2) We shouldn't expect to learn exactly the correct process, though.
3) Therefore imitation learning might produce an AI which implements an unaligned planning process, which seems likely to have instrumental goals, and be dangerous.
Example: The human might be planning over a bounded horizon of time-steps, or with a bounded utility function, while the AI infers a version of the planning process with an unbounded horizon or unbounded utility.
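A toy illustration of why the inferred horizon matters. In this made-up deterministic MDP (the states, actions, and rewards are all hypothetical), a horizon-1 planner and a horizon-3 planner choose different actions from the same start state, so an imitator that mis-infers the horizon reproduces a different policy:

```python
def plan(state, horizon, transitions):
    """Finite-horizon value iteration: return (best_value, best_action).

    transitions maps state -> {action: (next_state, reward)}.
    """
    if horizon == 0 or state not in transitions:
        return 0.0, None  # terminal state or out of planning budget
    best_value, best_action = float("-inf"), None
    for action, (next_state, reward) in transitions[state].items():
        value = reward + plan(next_state, horizon - 1, transitions)[0]
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action

# Hypothetical MDP: "invest" pays off only after a delay.
T = {
    "s0": {"cash": ("done", 1.0), "invest": ("s1", 0.0)},
    "s1": {"wait": ("s2", 0.0)},
    "s2": {"collect": ("done", 10.0)},
}
print(plan("s0", 1, T)[1])  # "cash": the payoff of "invest" is beyond horizon 1
print(plan("s0", 3, T)[1])  # "invest": a longer horizon reverses the choice
```

If the human demonstrator were the horizon-1 planner, an imitator that fits the same rewards with a longer (or unbounded) horizon would systematically take the "invest"-style actions the human never would.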
Clarifying note: Imitating a human is just one example; the key feature of the human is that the process generating their decisions is (arguably) well-modeled as involving planning over a long horizon.
Counter-argument(s):
- The human may have privileged access to context informing their decision; without that context, the solution may look very different
- Mistakes in imitating the human may be relatively harmless; the approximation may be good enough
- We can restrict the model family with the specific intention of preventing planning-like solutions
Overall, I remain quite uncertain about how serious this issue is, and I would like to see more thought about it.
Suppose you take a terabyte of data on human decisions and actions. You search for the shortest program that outputs the data, then see what it outputs afterwards. The shortest program that outputs the data might look like a simulation of the universe with an arrow pointing to a particular hard drive; the "imitator" then guesses what file comes next on that disk.
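A toy stand-in for this "shortest program that outputs the data" search, under a drastic simplifying assumption: the only "programs" allowed are seed strings repeated forever, and description length is just the seed length. Real shortest-program (Solomonoff-style) search over a universal language is uncomputable; this sketch only shows the shape of the idea, i.e. that the minimal consistent program is then run forward to predict what comes next.

```python
from itertools import count, product

def shortest_program(data, alphabet):
    """Find the shortest 'program' (here: a seed string repeated forever)
    whose output begins with the observed data, then run it forward one
    step to predict the next symbol.
    """
    for length in count(1):  # enumerate programs in order of length
        for seed in product(alphabet, repeat=length):
            stream = "".join(seed) * (len(data) // length + 2)
            if stream.startswith(data):
                return "".join(seed), stream[len(data)]

print(shortest_program("abab", "ab"))  # ('ab', 'a'): predicts the pattern continues
```

The analogy to the comment above: the seed plays the role of the universe-simulation, and the predicted next symbol plays the role of the imitator's guess at the next file on the drive.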
One problem for imitation learning is the difficulty of pointing out the human and separating them from the environment. The details of the human's decisions might depend on what they had for lunch. (Of course, multiple different decisions might be good enough. But this illustrates that "imitate a human" isn't a clear-cut procedure. And you have to be sure that the virtual lunch doesn't contain virtual mind-control nanobots. ;-)
You could feed a load of data about humans into a search for short programs that produce the same data. Hopefully the resulting model is some approximation of the universe, and you have some way of cutting a human out of the model and putting them into a virtual box.
Alternatively you could use nanotech for mind uploading, and get a virtual human in a box.
If we have lots of compute and not much time, then uploading a team of AI researchers to really solve friendly AI is a good idea.
If we have a good enough understanding of "imitation learning", and no nanotech, we might be able to get an AI to guess the researchers' mental states given observational data.
An imitation of a human might be a super-fast intelligence, with a lot of compute, but it won't be qualitatively super-intelligent.