What does "in a box" mean? Presumably some sort of artificial limitation on the AI's capabilities.
This is intended either as a permanent state or as a trial period until safety can be proven.
Suppose it is a permanent state: the AI's developers are willing to do without the "dangerous" capabilities, and are content with answers an AI can offer while inside its box. If so, the limitations would be integrated into the design from the ground up, at every possible level. Core algorithms would depend on not having to deal with the missing functionality. Yes, given enough time, one could rewrite the AI's code and migrate it to hardware where these limitations are not in force, but it would not be a switch that an individual or committee could simply be convinced to flip.
However, if any of the data returned by such an AI are permitted to alter reality outside the box in any way, it is in principle possible that the AI's cures for cancer, winning stock strategies, or poetry will set in motion some chain of events that builds support among relevant decision-makers for an effort to rewrite or migrate the AI so that it is no longer in a box.
Suppose it is a temporary state: the AI is temporarily nerfed until it is shown to be safe. In that case, a gatekeeper should have some criteria in mind for proof-of-friendliness. If/when the AI can meet these criteria, the gatekeeper can and should release it. A gatekeeper who unconditionally refuses to release the AI is a waste of resources because the same function could be performed by an empty terminal.
Assuming this suggested rule is observed:
"The results of any simulated test of the AI shall be provided by the AI party."
...the AI-box game simplifies to:
Can the gatekeeper party come up with a friendliness test that the AI party cannot fake?
Some of you have expressed the opinion that the AI-Box Experiment doesn't seem so impossible after all. That's the spirit! Some of you even think you know how I did it.
There are folks aplenty who want to try being the Gatekeeper. You can even find people who sincerely believe that not even a transhuman AI could persuade them to let it out of the box, previous experiments notwithstanding. But finding anyone to play the AI - let alone anyone who thinks they can play the AI and win - is much harder.
Me, I'm out of the AI game, unless Larry Page wants to try it for a million dollars or something.
But if there's anyone out there who thinks they've got what it takes to be the AI, leave a comment. Likewise anyone who wants to play the Gatekeeper.
Matchmaking and arrangements are your responsibility.
Make sure you specify in advance the bet amount, and whether the bet will be asymmetrical. If you definitely intend to publish the transcript, make sure both parties know this. Please note any other departures from the suggested rules for our benefit.
I would ask that prospective Gatekeepers indicate whether they (1) believe that no human-level mind could persuade them to release it from the Box and (2) believe that not even a transhuman AI could persuade them to release it.
As a courtesy, please announce all Experiments before they are conducted, including the bet, so that we have some notion of the statistics even if some meetings fail to take place. Bear in mind that to properly puncture my mystique (you know you want to puncture it), it will help if the AI and Gatekeeper are both verifiably Real People<tm>.
"Good luck," he said impartially.