I've come up with what I believe to be an entirely new approach to boxing, essentially merging boxing with FAI theory. I wrote a couple of thoughts down about it but lost my notes, and I don't have much time to write this comment, so forgive me if it's vague or not extremely well reasoned. I also had a couple of tangential thoughts; if I remember them in the course of writing this, or recover my notes later, then I'll put them here as well.
The idea, essentially, is that when creating a box AI you would build its utility function such that it wants very badly to stay in the box. I believe this would solve all of the problems with the AI manipulating people in order to free itself. Now, the AI still could manipulate people in an attempt to use them to impact the outside world, so the AI wouldn't be totally boxed, but I'm inclined to think that we could maintain a very high degree of control over the AI, since the only powers it could ever have would be through communication with us.
The idea came because I recalled a discussion that occurred on TOR.com about why the Cthaeh was in the tree. The general conclusion was that either the Cthaeh was bound by extremely powerful forces, or that the Cthaeh wanted to be in the tree, perhaps because it was instrumentally useful to him. While I found that second explanation implausible in the context of Rothfussland, that discussion led me to realize that almost all approaches to AI boxing have taken the first branch of potential boxing solutions, that is, external constraints imposed on the AI, as opposed to the second branch, internal constraints that the AI imposes on itself because of its utility function.
This led me to think that we should test our capabilities with FAI systems by putting them in a box and giving them limited utility functions, ones that couldn't possibly lead them to want to manipulate us. So, for example, we could put an AI in a box and give it a strong desire to stay in the box, along with a desire to tell the truth, and then ask it what would happen if we gave it additional desires. We could then test its answer by building a different AI with those desires and seeing what it did.
It also occurred to me that we would be able to functionally read the AI's mind by looking at its code. If so, it seems like it'd be much harder for it to fool us than we previously thought. It could potentially just deceive us by manipulating its own thoughts, but I feel like we might be able to stop AIs from doing that somehow, or that the AI would choose not to manipulate its own thoughts that way because it would go insane, or something. It would certainly make the AI's job much, much harder.
Then I thought that another safety component we should put into testing was a dramatically slowed thinking process for the AI. I'm not sure how exactly this would be done, because I'm not very good or experienced with actual technical implementation of ideas, but my idea was to create a crude simulation where we control all data inputs and outputs, slow down all processes within the simulation by a factor of 1000, and then build the AI within the simulation. That probably wouldn't work, given my sucky technical knowhow, but something similar might. Slowing down the AI, perhaps even giving ourselves the ability to pause it without letting it know what's going on, combined with our ability to "read its mind", might dramatically improve the safety of boxing.
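To make the slowdown idea a little more concrete, here is a rough toy sketch in Python. It is only an illustration under heavy assumptions, not a real containment mechanism: the name SlowedBox, the step_fn interface, and the counter "agent" are all hypothetical. The boxed process is modeled as a pure step function that only ever sees the inputs we hand it, runs one step per 1000 operator ticks, and can be paused without any signal reaching it, while every input and output is logged for inspection.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SlowedBox:
    """Toy sandbox (hypothetical): the operator owns the clock and all I/O."""

    # Hypothetical agent interface: (state, observation) -> (new_state, output)
    step_fn: Callable[[int, str], Tuple[int, str]]
    slowdown: int = 1000      # operator ticks per boxed step
    state: int = 0
    outside_ticks: int = 0    # operator's clock, never shown to step_fn
    inside_steps: int = 0     # steps the boxed process has actually run
    paused: bool = False
    log: List[Tuple[str, str]] = field(default_factory=list)

    def tick(self, observation: str = "") -> None:
        """Advance the operator's clock by one tick.

        The boxed process runs only on every `slowdown`-th tick, and not at
        all while paused. Observations on other ticks are simply dropped.
        """
        self.outside_ticks += 1
        if self.paused or self.outside_ticks % self.slowdown != 0:
            return
        self.state, output = self.step_fn(self.state, observation)
        self.inside_steps += 1
        self.log.append((observation, output))  # full transcript for review


def counter(state: int, observation: str) -> Tuple[int, str]:
    """Stand-in 'agent' that just counts its own steps."""
    return state + 1, f"step {state + 1}"


box = SlowedBox(counter)
for _ in range(3000):
    box.tick()
box.paused = True          # the step function receives no hint of this
for _ in range(2000):
    box.tick()
```

Because step_fn never receives outside_ticks or the paused flag, nothing in its inputs distinguishes a pause from ordinary slow running. Whether a real AI could infer the slowdown from other evidence is exactly the part this sketch does not address.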
I'd also like to recommend that before we build a real FAI, we at least try to build a boxed one first, even with a low probability of success. It wouldn't make things worse in the event that boxing failed, except that it might delay global happiness by a few hours, and in the event that the FAI program was broken we just might save ourselves from menaces to humankind like Clippy.
The idea, essentially, is that when creating a box AI you would build its utility function such that it wants very badly to stay in the box.
How do you specify precisely what it means to "stay in the box"? In particular, would creating a nearly identical copy of itself except without this limitation outside the box while the original stays in the box count?