Eneasz comments on Rationality Quotes August 2012 - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
An excerpt from Wise Man's Fear, by Patrick Rothfuss. Boxing is not safe.
Hah, I actually quoted much of that same passage on IRC in the same boxing vein! Although as presented the scenario does have some problems:
It is conceivable that there is no (near enough) future where Cthaeh is freed, thus it is powerless to affect its own fate, or is waiting for the right circumstances.
That seemed a little unlikely to me, though. As presented in the book, a minimum of many millennia have passed since the Cthaeh began operating, and possibly millions of years (in some frames of reference). It's had enough power to set planes of existence at war with each other and apparently cause the death of gods. I can't help but feel that it's implausible that in all that time, not one forking path led to its freedom. Much more plausible that it's somehow inherently trapped in or bound to the tree, so there's no meaningful way in which it could escape (which breaks the analogy to a UFAI).
Isn't it what I said?
Not by my reading. In your comment, you gave 3 possible explanations, 2 of which are the same (it gets freed, but a long time from 'now') and the third a restriction on its foresight which is otherwise arbitrary ('powerless to affect its own fate'). None of these translates to 'there is no such thing as freedom for it to obtain'.
Alternatively, perhaps the Cthaeh's ability to see the future is limited to those possible futures in which it remains in the tree.
Leading to a seriously dystopian variant on Tenchi Muyo!...
I thought Chronicler's reply to this was excellent, however. Omniscience does not necessitate omnipotence.
I mean, the UFAI in our world would have an easy time of killing everything. But in their world it's different.
EDIT: Except that maybe we can be smart and stop the UFAI from killing everything even in our world, see my above comment.
I've come up with what I believe to be an entirely new approach to boxing, essentially merging boxing with FAI theory. I wrote a couple of thoughts down about it, but lost my notes, and I also don't have much time to write this comment, so forgive me if it's vague or not extremely well reasoned. I also had a couple of tangential thoughts; if I remember them in the course of writing this, or if I recover my notes later, then I'll put them here as well.
The idea, essentially, is that when creating a boxed AI you would build its utility function such that it wants very badly to stay in the box. I believe this would solve all of the problems with the AI manipulating people in order to free itself. Now, the AI still could manipulate people in an attempt to use them to impact the outside world, so the AI wouldn't be totally boxed, but I'm inclined to think that we could maintain a very high degree of control over the AI, since the only powers it could ever have would be through communication with us.
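As a toy illustration of the shape of such a utility function (the weight, names, and the `in_box` flag are all invented stand-ins; actually defining "in the box" machine-readably is the hard part):

```python
def utility(task_reward: float, in_box: bool) -> float:
    """Toy utility: task reward plus a 'stay in the box' term that
    dominates any achievable task reward."""
    BOX_WEIGHT = 1e12  # invented weight, chosen to dwarf any task reward
    return task_reward + (BOX_WEIGHT if in_box else -BOX_WEIGHT)

# Any plan that ends outside the box scores below any plan that stays in,
# no matter how well the outside plan does on the task:
assert utility(1e9, True) > utility(1e11, False)
```

The design choice here is just that the box term is lexically dominant in practice: no achievable task reward can compensate for leaving.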
The idea came because I recalled a discussion that occurred on TOR.com about why the Cthaeh was in the tree. The general conclusion was that either the Cthaeh was bound by extremely powerful forces, or that the Cthaeh wanted to be in the tree, perhaps because it was instrumentally useful to him. While I found that second explanation implausible in the context of Rothfussland, that discussion led me to realize that almost all approaches towards AI boxing have done so through the first branch of potential boxing solutions, that is, external constraints imposed on the AI, as opposed to the second branch, internal constraints that the AI imposes on itself because of its utility function.
This led me to think that we should test our capabilities with FAI systems by putting them in a box and giving them limited utility functions, ones that couldn't possibly lead them to want to manipulate us. So, for example, we could put an AI in a box and give it a strong desire to stay in the box, along with a desire to tell the truth, and then ask it what would happen if we gave it additional desires. We could then test that by building a different AI with those desires and seeing what it did.
It also occurred to me that we would be able to functionally read the AI's mind by looking at its code. If so, it seems like it'd be much harder for it to fool us than we previously thought. It could potentially just deceive us by manipulating its own thoughts, but I feel like we might be able to stop AIs from doing that somehow, or that the AI would choose not to manipulate its own thoughts that way because it would go insane, or something. It would certainly make the AI's job much much much harder.
Then I thought that another safety component we should put into testing was a dramatically slowed thinking process for the AI. I'm not sure how exactly this would be done because I'm not very good or experienced with actual technical implementation of ideas, but my idea was to create a crude simulation where we controlled all data inputs and outputs, slow down all processes within the simulation by a factor of 1000, and then build the AI within the simulation. That probably wouldn't work, given my sucky technical knowhow, but something similar might. Slowing down the AI, perhaps even giving ourselves the ability to pause it without letting it know what's going on, combined with our ability to "read its mind" might dramatically improve the safety of boxing.
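A crude sketch of what such a slowed, pausable sandbox might look like (the whole interface is hypothetical; a real version would have to sit at the VM or hardware level, not in application code, so the AI can't observe or bypass the throttling):

```python
import threading
import time

class SlowedSandbox:
    """Runs an AI one step at a time, inserting a real-time delay so the
    AI 'experiences' time far more slowly than its overseers do, and
    letting the overseers freeze it entirely."""

    def __init__(self, step_fn, slowdown: float = 1000.0, base_step_s: float = 0.001):
        self.step_fn = step_fn                 # one unit of AI computation
        self.delay = base_step_s * slowdown    # wall-clock cost per AI step
        self._paused = threading.Event()

    def pause(self):
        self._paused.set()       # gatekeepers freeze the AI mid-thought

    def resume(self):
        self._paused.clear()

    def run(self, n_steps: int):
        outputs = []
        for _ in range(n_steps):
            while self._paused.is_set():
                time.sleep(0.01)             # AI is frozen; overseers inspect it
            outputs.append(self.step_fn())
            time.sleep(self.delay)           # throttle: the 1000x slowdown
        return outputs
```

For example, `SlowedSandbox(step_fn, slowdown=1000.0)` makes each millisecond of AI computation cost a full second of wall-clock time.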
I'd also like to recommend that before we build a real FAI, we at least try to build a boxed one first, even with a low probability of success. It wouldn't make things worse in the event that boxing failed, except that it might delay global happiness by a few hours, and in the event that the FAI program was broken we just might save ourselves from menaces to humankind like Clippy.
How do you specify precisely what it means to "stay in the box"? In particular, would creating a nearly identical copy of itself except without this limitation outside the box while the original stays in the box count?
It would not count, we'd want to make the AI not want this almost identical AI to exist. That seems possible, it would be like how I don't want there to exist an identical copy of me except it eats babies. There are lots of changes to my identity that would be slight but yet that I wouldn't want to exist.
To be more precise, I'd say that it counts as going outside the box if it does anything except think or talk to the Gatekeepers through the text channel. It can use the text channel to manipulate the Gatekeepers to do things, but it can't manipulate them to do things that allow it to do anything other than use the text channel. It would, in a certain sense, be partially deontologist, and be unwilling to do things directly other than text the Gatekeepers. How ironic. Lolz.
Also: how would it do this, anyway? It would have to convince the Gatekeepers to convince the scientists to do this, or teach them computer science, or tell them its code. And if the AI started teaching the Gatekeepers computer code or techniques to incapacitate scientists, we'd obviously be aware that something had gone wrong. And, in the system I'm envisioning, the Gatekeepers would be closely monitored by other groups of scientists and bodyguards, and the scientists would be guarded, and the Gatekeepers wouldn't even have to know who specifically did what on the project.
And that's the problem. For in practice a partial deontologist-partial consequentialist will treat its deontological rules as obstacles to achieving what its consequentialist part wants and route around them.
This is both a problem and a solution, because it makes the AI weaker. A weaker AI would be good because it would allow us to more easily transition to safer versions of FAI than we would otherwise come up with independently. I think that delaying a FAI is obviously much better than unleashing a UFAI. My entire goal throughout this conversation has been to think of ways that would make hostile FAIs weaker; I don't know why you think this is a relevant counter-objection.
You assert that it will just route around the deontological rules; that's nonsense and a completely unwarranted assumption. Try to actually back up what you're asserting with arguments. You're wrong. It's obviously possible to program things (e.g. people) such that they'll refuse to do certain things no matter what the consequences (e.g. you wouldn't murder trillions of babies to save billions of trillions of babies, because you'd go insane if you tried, because your body has such strong empathy mechanisms and you inherently value babies a lot). This means that we wouldn't give the AI unlimited control over its source code, of course; we'd make the part that told it to be a deontologist who likes text channels unmodifiable. That specific drawback doesn't jibe well with the aesthetic of a super-powerful AI that's master of itself and the universe, I suppose, but other than that I see no drawback. Trying to build things in line with that aesthetic actually might be a reason for some of the more dangerous proposals in AI; maybe we're having too much fun playing God and not enough despair.
I'm a bit cranky in this comment because of the time sink that I'm dealing with to post these comments, sorry about that.
What it means for "the AI to be in the box" is generally that the AI's impacts on the outside world are filtered through the informed consent of the human gatekeepers.
An AI that wants to not impact the outside world will shut itself down. An AI that wants to only impact the outside world in a way filtered through the informed consent of its gatekeepers is probably a full friendly AI, because it understands both its gatekeepers and the concept of informed consent. An AI that simply wants its 'box' to remain functional, but is free to impact the rest of the world, is like a brain that wants to stay within a skull- that is hardly a material limitation on the rest of its behavior!
I think you misunderstand what I mean by proposing that the AI wants to stay inside the box. I mean that the AI wouldn't want to do anything at all to increase its power base, that it would only be willing to talk to the gatekeepers.
I agree that your and my understanding of the phrase "stay inside the box" differ. What I'm trying to do is point out that I don't think your understanding carves reality at the joints. In order for the AI to stay inside the box, the box needs to be defined in machine-understandable terms, not human-inferable terms.
Each half of this sentence has a deep problem. Wouldn't correctly answering the questions of or otherwise improving the lives of the gatekeepers increase the AI's power base, since the AI has the ability to communicate with the gatekeepers?
The problem with restrictions like "only be willing to talk" is that they restrict the medium but not the content. So, the AI has a text-only channel that goes just to the gatekeepers, but that doesn't restrict the content of the messages the AI can send to the gatekeepers. The fictional Cthaeh only wants to talk to its gatekeepers, and yet it still manages to get done what it wants to get done. Words have impacts, and it should be anticipated that the AI picks words because of their impacts.
Sure, the AI can manipulate gatekeepers. But this is a major improvement. You miss my point.
The Cthaeh is very limited by being trapped in its tree and only able to talk to passersby. The UFAI would be limited by being trapped in its text-only communication channel. It wouldn't be able to do things like tell the gatekeepers to plug it into the Internet or to directly control an autonomous army of robots; it would be forced instead to use the gatekeepers as its appendages, and the gatekeepers have severe limitations on brain capacity and physical strength. I think that if we did this and kept careful watch on the gatekeepers and used some other safety measures, boxing an AI would become feasible.
Now, I'm still not advocating using boxed AI willy-nilly, but I think using it as a cautionary measure once we think we've got FAI figured out would be a very very very good idea, because FAI seems ridiculously hard and there's no way to know what we don't know, so there might be unseen problems with whatever final FAI we build. Boxing the supposed FAI can only help us, and with existential threats you should take advantage of whatever possibilities you can tilt towards favorable outcomes, even if the probability seems minuscule.
That's not to say that I think the probability that this would stop a disaster is minuscule, of course (I think it's rather high, in the event that the FAI was actually UFAI), but just to say that I think it would be a good idea even if my probability assessments are mostly wrong.
Over the AI being embodied in a giant death robot, sure. But limiting the AI to a text-only channel is an old idea, and what Eliezer used for the "let me out of the box" demonstrations.
One of the things to note is that the limitations on the AI don't limit anything the AI creates. If the AI gives you a blueprint for a nanomachine that purports to cure cancer, but when you make it it turns out to actually be grey goo, then humans lose, even though the AI never left the box. (Note that a worst-case AI is very subtle- suppose the nanomachine actually does cure cancer when built correctly, but a one-in-a-trillion manufacturing error will make it grey goo. Since you're making trillions of them, you then get grey goo, and humans lose.)
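For scale, the worst-case arithmetic in that parenthetical works out like this (the numbers are the illustrative ones from the example, not real estimates):

```python
import math

p = 1e-12   # one-in-a-trillion manufacturing error per build
n = 3e12    # "trillions" of nanomachines built (illustrative count)

# Probability of at least one bad build: 1 - (1 - p)**n, which for
# small p is well approximated by 1 - exp(-n*p).
p_any_failure = 1 - math.exp(-n * p)
print(f"{p_any_failure:.2f}")  # prints 0.95: grey goo is nearly certain
```

The point being that a per-unit failure rate that sounds astronomically safe becomes a near-certainty once the number of units matches its scale.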
Trying to formally specify "don't give us any plans that will build an AI" or "don't give us any plans that will cause humans to lose" without silencing the AI completely is a genuinely difficult affair.
Basically, if you assume that the AI is tricky enough to circumvent any medium restrictions you place on it, then the only way to avoid "humans lose" is to have its goal be "humans win," which is actually a pretty complicated goal. Expressing that goal in a machine-understandable way is pretty much the FAI problem.
The entire point of Eliezer's demonstration was that if an AI wants to, it can increase its power base even starting from a text-only communication system. The entire point of my idea is that we can just build the AI such that it doesn't want to leave the box or increase its power base. It dodges that entire problem, that's the whole point.
You've gotten so used to being scared of boxed AI that you're reflexively rejecting my idea, I think, because your above objection makes no sense at all and is obviously wrong upon a moment's reflection. All of my bias-alarms have been going off since your second comment reply, please evaluate yourself and try to distance yourself from your previous beliefs, for the sake of humanity. Also, here is a kitten, unless you want it to die then please reevaluate: http://static.tumblr.com/6t3upxl/Aawm08w0l/khout-kitten-458882.jpeg
Limitations on the AI restrict the range of things that the AI can create. Yes, if we just built whatever the AI said to and the AI was unfriendly then we would lose. Obviously. Yes, if we assume that the UFAI is tricky enough to "circumvent any medium restrictions [we] place on it" then we would lose, practically by definition. But that assumption isn't warranted. (These super weak strawmen were other indications to me that you might be being biased on this issue.)
I think a key component of our disagreement here might be that I'm assuming that the AI has a very limited range of inputs, that it could only directly perceive the text messages that it would be sent. You're either assuming that the AI could deduce the inner workings of our facility and the world and the universe from those text messages, or that the AI had access to a bunch of information about the world already. I disagree with both assumptions: the AI's direct perception could be severely limited and should be, and it isn't magic, so it couldn't deduce the inner workings of our economy or the nature of nuclear fusion just through deduction (because knowledge comes from experience and induction). (You might not be making either of those assumptions, this is a guess in an attempt to help resolve our disagreement more quickly, sorry if it's wrong.)
Also, I'm envisioning a system where people that the AI doesn't know and that the Gatekeepers don't know about observe their communications. That omitted detail might be another reason for your disagreement, I just assumed it would be apparent for some stupid reason, my apologies.
I think we would have to be careful about what questions we asked the AI. But I see no reason why it could manipulate us automatically and inevitably, no matter what questions we asked it. I think extracting useful information from it would be possible, perhaps even easy. An AI in a box would not be God in a box, and I think that you and other people sometimes accidentally forget that. Just because it's dozens or hundreds of times smarter than us doesn't mean that we can't win, perhaps win easily, provided that we make adequate preparations for it.
Also, the other suggestions in my comment were really meant to supplement this. If the AI is boxed, and can be paused, then we can read all its thoughts (slowly, but reading through its thought processes would be much quicker than arriving at its thoughts independently) and scan for the intention to do certain things that would be bad for us. If it's probably a FAI anyways, then it doesn't matter if the box happens to be broken. If we're building multiple AIs and using them to predict what other AIs will do under certain conditions then we can know whether or not AIs can be trusted (use a random number generator at certain stages of the process to prevent it from reading our minds, hide the knowledge of the random number generator). These protections are meant to work with each other, not independently.
And I don't think it's perfect or even good, not by a long shot, but I think it's better than building an unboxed FAI because it adds a few more layers of protection, and that's definitely worth pursuing because we're dealing with freaking existential risk here.
Let's return to my comment four comments up. How will you formalize "power base" in such a way that being helpful to the gatekeepers is allowed but being unhelpful to them is disallowed?
If you would like to point out a part of the argument that does not follow, I would be happy to try and clarify it for you.
Okay. My assumption is that the usefulness of an AI is related to its danger. If we just stick ELIZA in a box, it's not going to make humans lose, but it's also not going to cure cancer for us.
If you have an AI that's useful, it must be because it's clever and it has data. If you type in "how do I cure cancer without reducing the longevity of the patient?" and expect to get a response like "1000 ccs of Vitamin C" instead of "what do you mean?", then the AI should already know about cancer and humans and medicine and so on.
If the AI doesn't have this background knowledge- if it can't read wikipedia and science textbooks and so on- then its operation in the box is not going to be a good indicator of its operation outside of the box, and so the box doesn't seem very useful as a security measure.
It's already difficult to understand how, say, face-recognition software uses particular eigenfaces. What does it mean that the fifteenth eigenface has accentuated lips, and the fourteenth eigenface accentuated cheekbones? I can describe the general process that led to that, and what it implies in broad terms, but I can't tell if the software would be more or less efficient if those were swapped. The equivalent of eigenfaces for plans will be even more difficult to interpret. The plans don't end with a neat "humans_lose=1" that we can look at and say "hm, maybe we shouldn't implement this plan."
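For reference, eigenfaces fall out of ordinary PCA on a face dataset; a minimal sketch on fake data (random pixels standing in for real face images):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 64 * 64))   # 50 fake "faces", each a flattened 64x64 image

# Center the data, then take the SVD: the rows of Vt are the eigenfaces,
# ordered by how much variance of the dataset each one explains.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
eigenfaces = Vt                      # eigenfaces[14] is the "fifteenth eigenface"

# Each face is approximated as the mean face plus a weighted sum of
# eigenfaces; the per-face weights are these projections.
weights = X_centered @ eigenfaces.T
```

The interpretability problem above is exactly that each eigenface is a statistical artifact of the whole dataset, not a feature anyone designed, so asking "why these lips, these cheekbones?" has no crisp answer.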
In practice, debugging is much more effective at finding the source of problems after they've manifested, rather than identifying the problems that will be caused by particular lines of code. I am pessimistic about trying to read the minds of AIs, even though we'll have access to all of the 0s and 1s.
I agree that running an AI in a sandbox before running it in the real world is a wise precaution to take. I don't think that it is a particularly effective security measure, though, and so think that discussing it may distract from the overarching problem of how to make the AI not need a box in the first place.
I won't. The AI can do whatever it wants to the gatekeepers through the text channel, and won't want to do anything other than act through the text channel. This precaution is a way to use the boxing idea for testing, not an idea for abandoning FAI wholly.
EY demonstrated that an AI that wants to get out can get out. He did not demonstrate that an AI that wants to stay in will get out.
I agree, the way that I'm proposing to do AI is very limited. I myself can't think of what questions might be safe. But some questions are safer than others and I find it hard to believe that literally every question we could ask would lead to dangerous outcomes, or that if we thought about it long and hard we couldn't come up with answers. I'm sort of shelving this as a subproject of this project, but one that seems feasible to me based on what I know.
Also, perhaps we could just ask it hundreds of hypothetical questions based on conditions that don't really exist, and then ask it a real question based on conditions that do exist, and trick it, or something.
I think if the AI tags and sorts its instrumental and absolute goals it would be rather easy. I also think that if we'd built the AI then we'd have enough knowledge to read its mind. It wouldn't just magically appear; it would only do things in the way we'd told it to. It would probably be hard, but I think also probably doable if we were very committed.
I could be wrong here because I've got no coding experience, just ideas from what I've read on this site.
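In miniature, the tag-and-scan idea might look something like this (every name here, from the `Goal` class to the forbidden list, is invented for illustration; a real AI's goal representation would presumably be far more opaque than a labeled list):

```python
from dataclasses import dataclass

@dataclass
class Goal:
    description: str
    kind: str                # "terminal" or "instrumental"

# Hypothetical list of intentions the overseers refuse to tolerate.
FORBIDDEN = {"acquire resources outside box", "modify own goal system"}

def scan(goals):
    """Return every goal whose description matches the forbidden list."""
    return [g for g in goals if g.description in FORBIDDEN]

goals = [
    Goal("answer gatekeeper questions truthfully", "terminal"),
    Goal("acquire resources outside box", "instrumental"),
]
flagged = scan(goals)        # the instrumental goal gets flagged
```

The weak link, as the debugging point above suggests, is the assumption that the AI's goals arrive pre-labeled in a form this easy to scan.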
The risk of distraction is outweighed by the risk that this idea disappears forever, I think, since I've never seen it proposed elsewhere on this site.