The entire point of my idea is that we can just build the AI such that it doesn't want to leave the box or increase its power base.
Let's return to my comment four comments up. How will you formalize "power base" in such a way that being helpful to the gatekeepers is allowed but being unhelpful to them is disallowed?
I think, because your above objection makes no sense at all and is obviously wrong upon a moment's reflection.
If you would like to point out a part of the argument that does not follow, I would be happy to try and clarify it for you.
I think a key component of our disagreement here might be that I'm assuming that the AI has a very limited range of inputs, that it could only directly perceive the text messages that it would be sent.
Okay. My assumption is that the usefulness of an AI is related to its danger. If we just stick Eliza in a box, it's not going to make humans lose, but it's also not going to cure cancer for us.
If you have an AI that's useful, it must be because it's clever and it has data. If you type in "how do I cure cancer without reducing the longevity of the patient?" and expect to get a response like "1000 ccs of Vitamin C" instead of "what do you mean?", then the AI should already know about cancer and humans and medicine and so on.
If the AI doesn't have this background knowledge, if it can't read Wikipedia and science textbooks and so on, then its operation in the box is not going to be a good indicator of its operation outside of the box, and so the box doesn't seem very useful as a security measure.
If the AI is boxed, and can be paused, then we can read all its thoughts (slowly, but reading through its thought processes would be much quicker than arriving at its thoughts independently) and scan for the intention to do certain things that would be bad for us.
It's already difficult to understand how, say, face recognition software uses particular eigenfaces. What does it mean that the fifteenth eigenface has accentuated lips, and the fourteenth eigenface accentuated cheekbones? I can describe the general process that led to that, and what it implies in broad terms, but I can't tell if the software would be more or less efficient if those were swapped. The equivalent of eigenfaces for plans will be even more difficult to interpret. The plans don't end with a neat "humans_lose=1" that we can look at and say "hm, maybe we shouldn't implement this plan."
In practice, debugging is much more effective at finding the source of problems after they've manifested than at identifying the problems that will be caused by particular lines of code. I am pessimistic about trying to read the minds of AIs, even though we'll have access to all of the 0s and 1s.
And I don't think it's perfect or even good, not by a long shot, but I think it's better than building an unboxed FAI because it adds a few more layers of protection, and that's definitely worth pursuing because we're dealing with freaking existential risk here.
I agree that running an AI in a sandbox before running it in the real world is a wise precaution to take. I don't think that it is a particularly effective security measure, though, and so I think that discussing it may distract from the overarching problem of how to make the AI not need a box in the first place.
Let's return to my comment four comments up. How will you formalize "power base" in such a way that being helpful to the gatekeepers is allowed but being unhelpful to them is disallowed?
I won't. The AI can do whatever it wants to the gatekeepers through the text channel, and won't want to do anything other than act through the text channel. This precaution is a way to use the boxing idea for testing, not an idea for abandoning FAI wholly.