Even an AI that asks to be released for seemingly-benign reasons is already demonstrating a disregard for basic safety protocol, and should thus be shut down.
The AI might have a good argument, such as: hey, you are already 99% convinced that I am safe, and there is a lot of suffering in the world, and I could prevent it. And of course that could be a lie.
I suppose that at some moment we are going to allow the AI to have an effect on real world (otherwise what is the point), so it is a question of when exactly. The AI may argue for a sooner date. It would be wrong to say that a later date is always the correct answer, because for practical purposes that would mean never.
Many of these issues can be detected through static analysis of the source code, and therefore doesn't require the AI to be running.
The problem is, if the static analysis does not find any obvious issues, what next?
Also, when you look at machine learning today, there is the code, and there is the trained model. And the model is huge and not legible for humans. How to verify that the model is okay?
One could dissect the AI into smaller modules, where each module is too small to sustain any simulated consciousness, and run tests on these modules individually with transparency tools. This still poses no risk, because it is no more alive than a human is under an autopsy.
That assumes a modular structure with simple boundaries. What if the modules keeps sending zillions of numbers to each other? How are you going to test one of them individually? If you do not provide realistic numbers from the other modules, the behavior you observe may not be representative.
As an extra precaution, the AI's memory could be wiped before turning it back on. Thus, if it is still unfriendly then it will repeat the same mistake it did before which reveled its unalignment, giving cause to shut it down again and repeat the process.
This assumes that the first detectable unfriendly action will not kill us. If the AI is smart enough to use a strategy "while humans have power over me, do what they want; if I become stronger than them, convert everyone to paperclips", this precaution would not help.
Summary: yes, all these things are good ideas to try. But the main concern is that even doing all of this is not likely to help us. In some sense, this all only exposes "easy" problems, not the "hard" ones.
Surely, if we knew that the AI was unfriendly
We are not living in a world of certain knowledge about what is and what isn't dangerous.
The AI box experiment shows an agent that is constantly begging or persuading its gatekeeper to be released, including threats to torture simulated copies of its creators, and I would think such psychotic behavior is already a pretty big red flag.
There's a good reason that Eliezer decided not to provide any transcript for the original experiment. It seems that those confused you about what the experiment is about.
In reading about proposals about AI Boxing, one thought immediately comes to mind: Why would we want the risk of having an Unfriendly AI turned on in the first place? After all, inasmuch as a Boxed AI poses reduced risk, a collection of source code in non-executable text files poses no risk at all.
At first, I hesitated to write this post because it felt like a really dumb question. Surely, if we knew that the AI was unfriendly, then there would be no question to shut it down. And yet, the more I read posts on Boxed AI the more that doesn't appear to be common knowledge. The AI box experiment shows an agent that is constantly begging or persuading its gatekeeper to be released, including threats to torture simulated copies of its creators, and I would think such psychotic behavior is already a pretty big red flag. Even an AI that asks to be released for seemingly-benign reasons is already demonstrating a disregard for basic safety protocol, and should thus be shut down.
Of course, the next question becomes "is there any reason we should turn an unaligned AI on?". Of course, the main usefulness of an unaligned AI is an analogous reason why we retain cultures of Smallpox: study what went wrong in order to prevent similar issues in the future. Many of these issues can be detected through static analysis of the source code, and therefore doesn't require the AI to be running.
Naturally, static code analysis has its limitations, and many bugs can only be detected in runtime. However, this still doesn't mean that the software has to be run from start to finish in the way that it is intended. One could dissect the AI into smaller modules, where each module is too small to sustain any simulated consciousness, and run tests on these modules individually with transparency tools. This still poses no risk, because it is no more alive than a human is under an autopsy.
As an extra precaution, the AI's memory could be wiped before turning it back on. Thus, if it is still unfriendly then it will repeat the same mistake it did before which reveled its unalignment, giving cause to shut it down again and repeat the process.
Has this proposal of a dissected AI been discussed in the literature? I'd be interested to see.