Why does the AI even "want" failure mode 3? If it's an RL agent, it's not "motivated to maximize its reward", it's "motivated to use generalized cognitive patterns that in its training runs would have marginally maximized its reward". Failure mode 3 is the peak of an entirely separate mountain from the one RL is climbing, and I think a well-designed box setup can (more-or-less "provably") prevent any cross-peak bridges in the form of cognitive strategies that undermine this.
That is to say: yes, it can (or at least, it's not provable that it can't) imagine a way to break the box, and it can know that the reward it would actually get from breaking the box would be "infinite", but it can be successfully prevented from "feeling" the infinite-ness of that potential reward, because the RL procedure itself doesn't consider a broken-box outcome to be a valid target of cognitive optimization.
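A minimal toy sketch of what I mean by that last clause (hypothetical names, not the paper's actual algorithm): the quantity the learner optimizes is computed only from rewards observed before the episode-end signal, so an imagined post-episode jackpot never enters it, however large.

```python
def episode_return(rewards, episode_over_flags, discount=1.0):
    """Toy illustration: only rewards observed *before* the episode-end
    signal ever enter the optimization target. Anything the agent could
    obtain after that point never contributes to the return."""
    g, total = 1.0, 0.0
    for r, done in zip(rewards, episode_over_flags):
        if done:          # episode boundary: stop accumulating
            break
        total += g * r
        g *= discount
    return total

# A trajectory where "breaking the box" would pay off hugely after the
# episode ends: that payoff never shows up in the optimized quantity.
rewards            = [0.1, 0.2, 0.3, 1e9]   # 1e9 = imagined post-episode jackpot
episode_over_flags = [False, False, False, True]
print(episode_return(rewards, episode_over_flags))  # ~0.6, not 1e9+
```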
Now, this creates a new failure mode, where it hacks its own RL optimizer. But that just makes it unfit, not dangerous. Insofar as something goes wrong to let this happen, it would be obvious and easy to deal with, because it would be optimizing for thinking it would succeed and not for succeeding.
(Of course, that last sentence could also fail. But at least that would require two simultaneous failures to become dangerous; and it seems in principle possible to create sufficient safeguards and warning lights around each of those separately, because the AI itself isn't subverting those safeguards unless they've already failed.)
Consider the strategy "do whatever action you predict to maximize the electricity in this particular piece of wire in your reward circuitry". This is a very general cognitive pattern that would maximize reward in the training runs. Now, there are many different cognitive patterns that maximize reward in the training runs, but this is a simple one, so it's at least reasonably plausible that it is used.
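A toy illustration of why training alone can't distinguish the two patterns (everything here is made up for illustration): two reward-predictors that agree on every training action, and only come apart when tampering becomes reachable.

```python
# Two cognitive patterns, both perfect on the training distribution:
#   pattern A: "do the task the reward was designed around"
#   pattern B: "maximize predicted current in the reward wire"
# During training, tampering with the wire is never available, so the two
# are indistinguishable; they only come apart off-distribution.

def pattern_a(action, env):
    return env["task_reward"][action]

def pattern_b(action, env):
    # If tampering is somehow reachable, predicted wire current is maximal.
    if action == "tamper_with_wire" and env["tampering_possible"]:
        return float("inf")
    return env["task_reward"][action]

train_env = {"task_reward": {"do_task": 1.0, "slack_off": 0.0},
             "tampering_possible": False}

for action in train_env["task_reward"]:
    assert pattern_a(action, train_env) == pattern_b(action, train_env)
# Training reward alone cannot tell you which pattern the learner picked up.
```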
What I was thinking when I wrote it was more like this: when someone proposes a fancy concrete-and-vacuum box, they are claiming that the fancy box is doing something. None of your "the AI is an RL agent, so it shouldn't want to break out" reasoning works any differently whether the box is a fancy concrete vacuum Faraday cage or just a cardboard box. A fancy box is only useful if there is something the AI wants to do but is unable to.
To an RL agent, if it hasn't tried to break out of the box, then breaking out of the box is a case of generalization. For that matter, other forms of wireheading are also a case of generalization.
Is "do whatever action you predict to maximize the electricity in this particular piece of wire" really "general"? You're basically claiming that the more intelligent someone is, the more likely they are to wirehead. With humans, in my experience, and for a loose definition of "wirehead", the pattern seems to be the opposite; and that seems to me to be solid enough in terms of how RL works that I doubt it's worth the work to dig deep enough to resolve our disagreement here.
This post is me looking at the BoMAI paper (arXiv:1905.12186) and giving some ideas on how its design might fail.
Firstly, I would like to point out that I think the design in this paper does buy you some amount of marginal safety when dealing with slightly superhuman AIs, and that it still gives you an opportunity to perhaps get something useful out of them.
Failure mode 1)
A lack of optimization doesn't mean a lack of danger. Even after the episode ends, the box may contain dangerous optimized items. Given that the box was specified to contain only a human (and some clothes), this suggests the most dangerous thing that could leave is probably some form of meme. For example, suppose the AI finds a very convincing argument that humans should do whatever AIs want. This is a very good solution to the problem of hacking the human into giving the AI whatever answer it wants (which is pretty much what this AI was programmed to do, although it may give good answers on occasions where that is easier than hacking the operator). Once the human exits the box, the idea can go viral.
Failure mode 2)
What does "asymptotic optimality" mean in practice? Imagine a world containing BoMAI. The BoMAI comes up with a plan to take over the world after this episode has ended. It tries this plan. Let's imagine a magic angel interferes with this plan, stopping it the moment the episode ends. So the BoMAI is somewhat puzzled, and tries again. Eventually BoMAI will learn that plans involving taking over the world after the end of the episode never work, and so it will stop trying them. This means the magic angel only has a finite amount of work to do. Plans that involve taking over the world are assumed to receive a lower rating within the episode. The smaller the difference in reward between a plan that also takes over the world after the episode and a plan that purely optimizes within the episode, the more slowly such take-over-the-world plans disappear, leading in the limit to failure mode 1.
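To make that rate claim concrete, here is a toy bandit sketch (my own illustration, not anything from the paper): the "pure in-episode plan" and the "plan that also takes over the world afterwards" are two arms whose in-episode rewards differ by a small gap, and a standard UCB learner keeps trying the worse arm more often the smaller that gap is.

```python
import numpy as np

def pulls_of_takeover_arm(gap, steps=20000, seed=0):
    """Toy 2-armed bandit: arm 0 = 'pure in-episode plan' (mean reward 1.0),
    arm 1 = 'plan that also takes over the world afterwards', which by
    assumption scores slightly worse *within* the episode (1.0 - gap).
    Returns how often a UCB1 learner tries the takeover arm."""
    rng = np.random.default_rng(seed)
    means = np.array([1.0, 1.0 - gap])
    counts = np.zeros(2)
    sums = np.zeros(2)
    for t in range(1, steps + 1):
        if t <= 2:                       # try each arm once to start
            arm = t - 1
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        counts[arm] += 1
        sums[arm] += means[arm] + rng.normal(0, 0.1)
    return int(counts[1])

for gap in (0.5, 0.1, 0.02):
    print(gap, pulls_of_takeover_arm(gap))
# The smaller the in-episode penalty for the takeover plan, the more times
# it gets tried before the learner gives up -- the magic angel has more work.
```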
Of course, in reality, the magic angel doesn't exist, and the AI succeeds in taking over the world.
Failure mode 3)
Criticizing the box.
The mentor has no vacuum between them and the font of all evil (the AI). There is also no Faraday cage, and the screen facing the mentor is vulnerable to all sorts of hacking. Therefore the mentor can be hacked. There is no vacuum between the mentor and the AI, so if the mentor can claw through the concrete, they gain hardware-level access to the AI. They can then disable the pressure sensors and any other systems that end the episode. Now the airlocks can be opened by using whatever mechanism is used to open them at the end of an episode.
There must be some point where information from the pressure sensors feeds into the AI system to stop the episode. This point, and the pressure sensors themselves, are vulnerable to electromagnetic and sonic interference from the AI. Suppose the AI has a bunch of cooling fans it is in control of, and by vibrating them all in just the right way, it can cause all the sound waves to constructively interfere right on the solder joint between a pressure sensor and the AI's circuits. This allows it to work the solder loose, slowly breaking that connection. The AI can now encourage the human operator to smash the glass.
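The physics the fan attack leans on is just acoustic beamforming: if the AI knows each fan's distance to the solder joint, it can pick per-fan phase offsets so the waves arrive in phase there. A rough sketch of the arithmetic, with made-up numbers:

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s in air
FREQ = 8000.0            # Hz, an arbitrary fan-vibration frequency

# Hypothetical distances (metres) from each fan to the target solder joint.
fan_distances = np.array([0.12, 0.19, 0.27, 0.33])
travel_times = fan_distances / SPEED_OF_SOUND

# Drive each fan with a phase lead equal to its propagation delay, so every
# wavefront arrives at the joint in phase.
phase_offsets = (2 * np.pi * FREQ * travel_times) % (2 * np.pi)

def pressure_at_joint(t):
    """Superposition of the four unit-amplitude waves at the joint at time t."""
    return np.sum(np.sin(2 * np.pi * FREQ * (t - travel_times) + phase_offsets))

times = np.linspace(0, 1e-3, 10000)
peak = max(abs(pressure_at_joint(t)) for t in times)
print(peak)  # ~4.0: the four waves add constructively at that one point
```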
The AI bootstraps nanotech. It turns out there was a sophisticated and easily hackable piece of proto-nanotech just sitting around, in the form of a bacterium sitting on one of the AI's chips. By carefully controlling the electromagnetic, thermal and sonic emissions of its hardware, the AI can hack that bacterium's ribosomes. The bootstrapped nanotech can tunnel through the walls of the box and take over the world.
There are other flaws in the "nothing can get through" assumption. For instance, solid objects could be poked through a small hole, so long as the hole was carefully sealed to stop gases escaping. Or tiny amounts of gas could be leaked through a pinhole, not enough to trip the pressure sensors, with information encoded in which gas is leaked. But I can't think of how to use these to take over the world.