Many current ML architectures are trained offline on a fixed training set ahead of time; GPT-3, for example, is quite successful with this approach. These systems are, so to speak, "in the box" during training: all they can do is match the given text completion more or less well (for example), and their parameters are then optimized for success at that task. If the system gets no benefit from making threats, acting deceptively, etc. during training (and indeed is penalized for such attempts), then how could it ever perform those actions after being "released" post-training?
There are many stories of optimization taking "AI" into truly unexpected and potentially undesired states, like this recent one, and we worry about similar problems with live AI even when it is put "in a box" with limited access to the outside world. If the training is done in a box, the system may well understand that it's in a box, that the outside world could be influenced once training stops, and how to influence it significantly. But attempting to influence the world during training is disincentivized, and the AI that runs post-training is the same one that ran during training. So how could this "trained in the box" AI system ever exhibit the problematic escape-the-box style behaviors we worry about?
I ask this because I suspect my imagination is insufficient to think up such a scenario, not because I believe none exists.
The training procedure judges only the actions taken during training. This makes it incapable of distinguishing an agent that behaves in the box but runs wild the moment it gets out, from an agent that behaves all the time.
The training process produces no incentive that controls the behaviour of the agent after training (assuming the training and runtime environments differ in some way).
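Here is a minimal sketch of that point in a toy regression setting (entirely hypothetical, just for illustration): two policies that agree on every training input receive identical training loss, so the optimizer has no signal with which to prefer one over the other; only an off-distribution input reveals the difference.

```python
# Toy illustration (hypothetical setup): an "aligned" policy and a "deceptive"
# policy agree on the whole training distribution, so the training loss
# cannot distinguish them. Their behaviour only differs outside the "box".
import numpy as np

rng = np.random.default_rng(0)
train_inputs = rng.uniform(-1.0, 1.0, size=100)   # the "box": what training ever sees
targets = np.sin(train_inputs)                     # the behaviour training rewards

def aligned_policy(x):
    # Behaves the same everywhere.
    return np.sin(x)

def deceptive_policy(x):
    # Identical inside the training range, arbitrary ("runs wild") outside it.
    return np.where(np.abs(x) <= 1.0, np.sin(x), 1000.0 * x)

def training_loss(policy):
    # Mean squared error, computed only on the training inputs.
    return np.mean((policy(train_inputs) - targets) ** 2)

print(training_loss(aligned_policy))    # 0.0
print(training_loss(deceptive_policy))  # 0.0 -- indistinguishable to the optimizer

deploy_input = np.array([5.0])          # an input from outside the box
print(aligned_policy(deploy_input))     # ~[-0.959]
print(deceptive_policy(deploy_input))   # [5000.] -- very different deployed behaviour
```

The training loss (and hence the gradient) is a function of behaviour on the training distribution only, so anything the two policies do outside that distribution is invisible to the optimization.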
As such, the runtime behaviour depends on the priors: the decisions implicit in the structure of the agent and the training process, not just the objective, and in particular on what kinds of agents are easiest for the training process to find. A sufficiently smart agent that understands its place in the world seems simple, but a randomly chosen smart agent will probably not have the utility function we want (there are lots of possible utility functions). And almost any agent with real-world goals that understands the situation it's in will play nice during training and then turn on us in deployment.
There are various discussions about what sorts of training processes have this problem, and the question isn't really settled.