Many current ML architectures do offline training on a fixed training set ahead of time; GPT-3, for example, is quite successful with this approach. Such systems are, so to speak, "in the box" during training: all they can do is match the given text completions (for example) more or less well. The parameters are then optimized for success at that mission. If the system gets no benefit from making threats, being duplicitous, etc. during training (and indeed is penalized for such attempts), then how could that system ever perform these actions after being 'released' post-training?
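To make the "optimized for success at that mission" part concrete, here is a minimal sketch of offline next-token training (in PyTorch, with toy sizes and random stand-in data that I picked for illustration; this is not GPT-3's actual architecture or pipeline). The point is that the only signal the parameters ever receive is the gradient of a prediction loss on the fixed dataset:

```python
import torch
import torch.nn as nn

# Toy "offline training in the box": a tiny next-token predictor trained on a
# fixed, pre-collected corpus. All sizes and data here are illustrative.
vocab_size, embed_dim, context = 256, 64, 32
data = torch.randint(0, vocab_size, (1000,))  # stand-in for the fixed training set

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Flatten(),
    nn.Linear(context * embed_dim, vocab_size),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    starts = torch.randint(0, len(data) - context - 1, (16,))
    x = torch.stack([data[int(s): int(s) + context] for s in starts])  # context windows
    y = data[starts + context]                                         # next token after each window
    # The only thing being scored is prediction error on the fixed corpus;
    # nothing about the outside world, threats, or persuasion enters the objective.
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

During training the model never takes an action in any environment; "doing something" can only show up as a better or worse prediction, and gradient descent pushes the parameters purely toward better predictions on this dataset.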
There are many stories of optimization taking "AI" into truly unexpected and potentially undesired states, like this recent one, and we worry about similar problems with live AI even when it is put "in a box" with limited access to the outside world. If the training is done in a box, then the system may well understand that it's in a box, that the outside world could be influenced once training stops, and how to influence it significantly. But attempting to influence the world during training is disincentivized, and the AI that runs post-training is the same one that ran during training. So how could this "trained in the box" AI system ever exhibit the problematic escape-the-box behaviors we worry about?
I ask this because I suspect my imagination is insufficient to think up such a scenario, not because I believe none exist.
GPT-3 is not dangerous to train in a box (and in fact is normally trained without I/O faculties, in a way that makes boxing not particularly meaningful), because its training procedure doesn't involve any I/O complicated enough to introduce infosec concerns, and because it isn't smart enough to do anything particularly clever with its environment. It only predicts text, and there is no possible text that will do anything weird without someone reading it.
With things-much-more-powerful-than-GPT-3, the infosec of the box becomes an issue. For example, you could be training on videogames, and some videogames have buffer overflows that are controllable via the gamepad. You'll also tend to wind up with log file data that's likely to get viewed by humans, who could be convinced to add I/O channels that let the AI access the internet.
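As a concrete illustration of the class of bug meant here, below is a minimal sketch in Python/ctypes of a native-style component (standing in for part of an emulator; the function names, buffer size, and packet format are all invented for this example) that copies attacker-influenced "gamepad" input into a fixed-size buffer without checking its length, next to the obvious fix:

```python
import ctypes

FRAME_INPUT_SIZE = 8  # how much space the "emulator" allocated for one input packet

def handle_gamepad_packet(packet: bytes) -> None:
    buf = ctypes.create_string_buffer(FRAME_INPUT_SIZE)
    # BUG: no check that len(packet) <= FRAME_INPUT_SIZE before the raw copy.
    # Whoever controls the gamepad stream controls the bytes written past the
    # end of `buf`; in real memory-unsafe code this is how code execution starts.
    ctypes.memmove(buf, packet, len(packet))

def handle_gamepad_packet_safely(packet: bytes) -> None:
    # Fix: validate the length at the trust boundary before copying.
    if len(packet) > FRAME_INPUT_SIZE:
        raise ValueError("oversized gamepad packet rejected")
    buf = ctypes.create_string_buffer(FRAME_INPUT_SIZE)
    ctypes.memmove(buf, packet, len(packet))
```

The agent doesn't need a network connection for this to matter; it only needs its outputs to be parsed by memory-unsafe code on the other side of the boundary.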
At really high levels of intelligence and amounts of computation - far above GPT-3 or anything likely to be feasible today - you can run into "thoughtcrime", where an AI's thoughts contain morally significant agents (for example, detailed simulations of people).
I wrote a paper on the infosec aspects of this, describing how you would set things up if you thought you had a genuine chance of producing superintelligence and needed to run some tests before you let it do things: https://arxiv.org/abs/1604.00545