Let's return to my comment four comments up. How will you formalize "power base" in such a way that being helpful to the gatekeepers is allowed but being unhelpful to them is disallowed?
I won't. The AI can do whatever it wants to the gatekeepers through the text channel, and won't want to do anything other than act through the text channel. This precaution is a way to use the boxing idea for testing, not an idea for abandoning FAI wholly.
If you would like to point out a part of the argument that does not follow, I would be happy to try to clarify it for you.
EY proved that an AI that wants to get out will get out. He did not prove that an AI that wants to stay in will get out.
Okay. My assumption is that the usefulness of an AI is related to its danger. If we just stick Eliza in a box, it's not going to make humans lose, but it's also not going to cure cancer for us. If you have an AI that's useful, it must be because it's clever and it has data. If you type in "how do I cure cancer without reducing the longevity of the patient?" and expect to get a response like "1000 ccs of Vitamin C" instead of "what do you mean?", then the AI should already know about cancer and humans and medicine and so on. If the AI doesn't have this background knowledge, if it can't read Wikipedia and science textbooks and so on, then its operation in the box is not going to be a good indicator of its operation outside of the box, and so the box doesn't seem very useful as a security measure.
I agree, the way that I'm proposing to do AI is very limited. I myself can't think of what questions might be safe. But some questions are safer than others and I find it hard to believe that literally every question we could ask would lead to dangerous outcomes, or that if we thought about it long and hard we couldn't come up with answers. I'm sort of shelving this as a subproject of this project, but one that seems feasible to me based on what I know.
Also, perhaps we could just ask it hundreds of hypothetical questions based on conditions that don't really exist, and then ask it a real question based on conditions that do exist, and trick it, or something.
It's already difficult to understand how, say, face recognition software uses particular eigenfaces. What does it mean that the fifteenth eigenface has accentuated lips, and the fourteenth eigenface accentuated cheekbones? I can describe the general process that led to that, and what it implies in broad terms, but I can't tell if the software would be more or less efficient if those were swapped. The equivalent of eigenfaces for plans will be even more difficult to interpret. The plans don't end with a neat "humans_lose=1" that we can look at and say "hm, maybe we shouldn't implement this plan."
In practice, debugging is much more effective at finding the source of problems after they've manifested, rather than identifying the problems that will be caused by particular lines of code. I am pessimistic about trying to read the minds of AIs, even though we'll have access to all of the 0s and 1s.
I think if the AI tags and sorts its instrumental and absolute goals it would be rather easy. I also think that if we'd built the AI then we'd have enough knowledge to read its mind. It wouldn't just magically appear; it would only do things in the way we'd told it to. It would probably be hard, but I think it would also probably be doable if we were very committed.
I could be wrong here because I've got no coding experience, just ideas from what I've read on this site.
I agree that running an AI in a sandbox before running it in the real world is a wise precaution to take. I don't think that it is a particularly effective security measure, though, and so think that discussing it may distract from the overarching problem of how to make the AI not need a box in the first place.
The risk of distraction is outweighed by the risk that this idea disappears forever, I think, since I've never seen it proposed elsewhere on this site.
EY proved that an AI that wants to get out will get out. He did not prove that an AI that wants to stay in will get out.
Well, he demonstrated that it can sometimes get out. But my claim was that "getting out" isn't the scary part; the scary part is "reshaping the world." My brain can reshape the world just fine while remaining in my skull and only communicating with my body through slow chemical wires, and so giving me the goal of "keep your brain in your skull" doesn't materially reduce my ability or desire to reshape the ...