Eugine_Nier comments on Rationality Quotes August 2012 - Less Wrong

Post author: Alejandro1 03 August 2012 03:33PM

Comment author: chaosmosis 24 August 2012 02:54:35AM 0 points

I've come up with what I believe to be an entirely new approach to boxing, essentially merging boxing with FAI theory. I wrote a couple of thoughts down about it, but I lost my notes, and I also don't have much time to write this comment, so forgive me if it's vague or not extremely well reasoned. I also had a couple of tangential thoughts; if I remember them in the course of writing this, or recover my notes later, then I'll put them here as well.

The idea, essentially, is that when creating a boxed AI you would build its utility function such that it wants very badly to stay in the box. I believe this would solve all of the problems with the AI manipulating people in order to free itself. The AI could still manipulate people in an attempt to use them to affect the outside world, so it wouldn't be totally boxed, but I'm inclined to think we could maintain a very high degree of control over it, since the only power it could ever exert would be through communication with us.
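
Roughly what I have in mind, as a toy sketch (all the names and numbers here are invented for illustration; this is obviously nothing like real AGI code):

```python
# Toy illustration only: a utility function that combines an ordinary task
# reward with an overwhelming penalty whenever the agent's state is outside
# the box. The names and the penalty size are made up for the example.

OUT_OF_BOX_PENALTY = 1e9  # dwarfs any achievable task reward

def utility(task_reward: float, in_box: bool) -> float:
    """Return the agent's utility for a given outcome.

    task_reward -- whatever the agent earns for doing useful work
    in_box      -- True if, in this outcome, the agent is still inside the box
    """
    penalty = 0.0 if in_box else OUT_OF_BOX_PENALTY
    return task_reward - penalty

# Staying in the box dominates, no matter how large the task reward:
assert utility(10.0, in_box=True) > utility(1e6, in_box=False)
```

The hard part, obviously, is pinning down what "in the box" means precisely; the toy flag above just assumes that's been done.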

The idea came because I recalled a discussion on TOR.com about why the Cthaeh was in the tree. The general conclusion was that either the Cthaeh was bound by extremely powerful forces, or that it wanted to be in the tree, perhaps because staying there was instrumentally useful to it. While I found the second explanation implausible in the context of Rothfussland, that discussion led me to realize that almost all approaches to AI boxing have relied on the first branch of potential boxing solutions, that is, external constraints imposed on the AI, as opposed to the second branch: internal constraints that the AI imposes on itself because of its utility function.

This led me to think that we should test our capabilities with FAI systems by putting them in a box and giving them limited utility functions, ones that couldn't possibly lead them to want to manipulate us. So, for example, we could put an AI in a box and give it a strong desire to stay in the box, along with a desire to tell the truth, and then ask it what would happen if we gave it additional desires. We could then test its answer by building a different AI with those desires and seeing what it did.

It also occurred to me that we would be able to functionally read the AI's mind by looking at its code. If so, it seems like it would be much harder for it to fool us than we previously thought. It could potentially deceive us by manipulating its own thoughts, but I feel like we might be able to stop AIs from doing that somehow, or that the AI would choose not to manipulate its own thoughts that way because doing so would drive it insane, or something. It would certainly make the AI's job much, much harder.

Then I thought that another safety component we should build into testing is a dramatically slowed thinking process for the AI. I'm not sure exactly how this would be done, because I'm not very good or experienced with the actual technical implementation of ideas, but my idea was to create a crude simulation where we controlled all data inputs and outputs, slow down all processes within the simulation by a factor of 1000, and then build the AI within the simulation. That probably wouldn't work, given my poor technical know-how, but something similar might. Slowing down the AI, perhaps even giving ourselves the ability to pause it without letting it know what's going on, combined with our ability to "read its mind", might dramatically improve the safety of boxing.
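
I don't know how this would really be implemented, but as a crude toy picture (everything here, including the class and parameter names, is invented for illustration): a wrapper that runs the AI's computation one step at a time, inserts a fixed delay around every step, and honours a pause flag the AI never sees.

```python
# Toy picture only: run an agent's "think" steps inside a loop we control,
# slowing every step by a fixed factor and honouring an external pause flag.
import time

class ThrottledSandbox:
    def __init__(self, step_fn, slowdown: float = 1000.0, base_step_seconds: float = 0.001):
        self.step_fn = step_fn              # one unit of the agent's computation
        self.delay = base_step_seconds * slowdown
        self.paused = False                 # flipped from outside the sandbox

    def run(self, steps: int):
        outputs = []
        for _ in range(steps):
            while self.paused:              # operators can freeze the agent here
                time.sleep(0.1)
            time.sleep(self.delay)          # slow the agent's subjective speed
            outputs.append(self.step_fn())
        return outputs

# Example with a trivial stand-in for the agent's computation:
sandbox = ThrottledSandbox(step_fn=lambda: "thought", slowdown=10, base_step_seconds=0.001)
print(sandbox.run(steps=3))
```

A real version would need the pause flag set from outside the loop (e.g. another process), but the point is just that the slowdown and the pause live in machinery the AI doesn't control.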

I'd also like to recommend that before we build a real FAI, we at least try to build a boxed one first, even with a low probability of success. It wouldn't make things worse in the event that boxing failed, except that it might delay global happiness by a few hours, and in the event that the FAI program was broken, we just might save ourselves from menaces to humankind like Clippy.

Comment author: Eugine_Nier 24 August 2012 06:55:48PM 2 points

The idea, essentially, is that when creating a box AI you would build its utility function such that it wants very badly to stay in the box.

How do you specify precisely what it means to "stay in the box"? In particular, would creating a nearly identical copy of itself except without this limitation outside the box while the original stays in the box count?

Comment author: chaosmosis 24 August 2012 09:39:51PM 0 points

It would not count; we'd want to make the AI not want this almost-identical AI to exist. That seems possible: it would be like how I don't want there to exist a copy of me that is identical except that it eats babies. There are lots of changes to my identity that would be slight, and yet whose results I wouldn't want to exist.

To be more precise, I'd say that it counts as going outside the box if the AI does anything except think or talk to the Gatekeepers through the text channel. It can use the text channel to manipulate the Gatekeepers into doing things, but it can't manipulate them into doing things that would allow it to do anything other than use the text channel. It would, in a certain sense, be partially deontologist, and be unwilling to do anything directly other than text the Gatekeepers. How ironic. Lolz.
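
As a toy illustration of what "partially deontologist" could mean here (again, all names invented, nothing like real AGI machinery): the AI's own decision procedure vetoes, as a hard rule, any candidate action that isn't a text message to the Gatekeepers, no matter how much utility that action would otherwise earn.

```python
# Toy illustration: a side-constraint the agent applies to itself before
# maximizing utility. Names and structure are invented for the example.

def is_permitted(action: dict) -> bool:
    """The deontological side-constraint: only plain text to the Gatekeepers."""
    return action.get("type") == "text_message"

def choose_action(candidates: list, expected_utility) -> dict:
    """Pick the highest-utility action among those the rule permits."""
    permitted = [a for a in candidates if is_permitted(a)]
    return max(permitted, key=expected_utility)

candidates = [
    {"type": "text_message", "content": "Here is the cancer cure."},
    {"type": "order_nanofactory", "content": "..."},   # would score higher, but is vetoed
]
best = choose_action(
    candidates,
    expected_utility=lambda a: 10 if a["type"] == "order_nanofactory" else 1,
)
print(best)  # the text message, despite its lower score
```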

Also: how would it do this, anyway? It would have to convince the Gatekeepers to convince the scientists to do it, or teach them computer science, or tell them its code. And if the AI started teaching the Gatekeepers to program, or techniques for incapacitating the scientists, we'd obviously be aware that something had gone wrong. And, in the system I'm envisioning, the Gatekeepers would be closely monitored by other groups of scientists and bodyguards, the scientists themselves would be guarded, and the Gatekeepers wouldn't even have to know who specifically did what on the project.

Comment author: Eugine_Nier 26 August 2012 08:39:13PM 1 point

It would, in a certain sense, be partially deontologist,

And that's the problem. In practice, a partial-deontologist, partial-consequentialist agent will treat its deontological rules as obstacles to achieving what its consequentialist part wants, and route around them.

Comment author: chaosmosis 27 August 2012 06:17:28PM -2 points

This is both a problem and a solution, because it makes the AI weaker. A weaker AI would be good because it would let us transition more easily to safer versions of FAI than we would otherwise come up with independently. I think that delaying an FAI is obviously much better than unleashing a UFAI. My entire goal throughout this conversation has been to think of ways to make hostile AIs weaker; I don't know why you think this is a relevant counter-objection.

You assert that it will just route around the deontological rules; that's nonsense and a completely unwarranted assumption. Try to actually back up what you're asserting with arguments. You're wrong. It's obviously possible to program things (e.g. people) such that they'll refuse to do certain things no matter what the consequences: you wouldn't murder trillions of babies to save billions of trillions of babies, because you'd go insane if you tried, because your body has such strong empathy mechanisms and you inherently value babies a lot. This means that we wouldn't give the AI unlimited control over its source code, of course; we'd make the part that tells it to be a deontologist who likes text channels unmodifiable. That specific drawback doesn't jibe well with the aesthetic of a super-powerful AI that's master of itself and the universe, I suppose, but other than that I see no drawback. Trying to build things in line with that aesthetic might actually be a reason behind some of the more dangerous proposals in AI; maybe we're having too much fun playing God and not feeling enough despair.
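
I have no idea how you'd really make a module unmodifiable, but as a crude toy picture (invented names, not a real safety mechanism): record a fingerprint of the rule-checking code at startup, and refuse to run at all if it ever stops matching.

```python
# Toy picture only: before each step, verify that the rule-checking
# function's compiled code still matches the fingerprint recorded at
# startup, and halt if it has been replaced. Invented for illustration.
import hashlib

def is_permitted(action: dict) -> bool:
    """The frozen deontological rule: text messages only."""
    return action.get("type") == "text_message"

# Fingerprint of the rule's compiled code, taken once at startup.
FROZEN_HASH = hashlib.sha256(is_permitted.__code__.co_code).hexdigest()

def step(action: dict) -> bool:
    # Refuse to run at all if the rule's code no longer matches the fingerprint.
    if hashlib.sha256(is_permitted.__code__.co_code).hexdigest() != FROZEN_HASH:
        raise SystemExit("rule module was modified; halting")
    return is_permitted(action)

print(step({"type": "text_message", "content": "hello"}))  # True
print(step({"type": "launch_probe"}))                      # False: action refused, rule intact
```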

I'm a bit cranky in this comment because of the time sink involved in posting these comments; sorry about that.