Strange7 comments on Reply to Holden on 'Tool AI' - Less Wrong

94 Post author: Eliezer_Yudkowsky 12 June 2012 06:00PM


Comment author: Strange7 13 June 2012 08:04:24PM 3 points

There is, in fact, such a thing as making some parts of the code more difficult to modify than other parts of the code.

I apologize for having conveyed the impression that I thought designing an AI to be specifically, incurably naive about how a human querent will respond to suggestions would be easy. I have no such misconception; I know it would be difficult, and I know that I don't know enough about the relevant fields to even give a meaningful order-of-magnitude guess as to how difficult. All I was suggesting was that it would be easier than many of the other AI-safety-related programming tasks being discussed, and that the cost-benefit ratio would be favorable.

Comment author: Eliezer_Yudkowsky 13 June 2012 08:19:44PM 1 point

There is, in fact, such a thing as making some parts of the code more difficult to modify than other parts of the code.

There is? How?

Comment author: Strange7 13 June 2012 08:29:08PM *  4 points
Comment author: Eliezer_Yudkowsky 13 June 2012 11:00:18PM 0 points

And what does a multi-ring agent architecture look like? Say, the part of the AI that outputs speech to a microphone - what ring is that in?

Comment author: Strange7 13 June 2012 11:32:02PM 1 point

Say, the part of the AI that outputs speech to a microphone - what ring is that in?

I am not a professional software designer, so take all this with a grain of salt. That said, hardware I/O is ring 1, so the part that outputs speech to a speaker would be ring 1, while an off-the-shelf 'text to speech' app could run in ring 3. No part of a well-designed agent would output anything to an input device, such as a microphone.
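[Editor's note: the ring scheme under discussion can be sketched in a few lines. This is a hypothetical illustration only; all names here (RingKernel, PrivilegeError, "speaker_output") are invented, and real protection rings are enforced by the CPU, not by application code.]

```python
# Hypothetical sketch of ring-style privilege separation.
# Lower ring number = more privilege; the kernel mediates every call.

class PrivilegeError(Exception):
    pass

class RingKernel:
    """Registers operations with a minimum ring and checks callers."""

    def __init__(self):
        self._operations = {}  # name -> (required ring, function)

    def register(self, name, ring, func):
        self._operations[name] = (ring, func)

    def call(self, caller_ring, name, *args):
        required_ring, func = self._operations[name]
        if caller_ring > required_ring:
            raise PrivilegeError(
                f"ring {caller_ring} may not invoke {name!r} "
                f"(requires ring <= {required_ring})")
        return func(*args)

kernel = RingKernel()
kernel.register("speaker_output", 1, lambda text: f"SPEAKER: {text}")

# A ring-1 driver may drive the speaker directly...
print(kernel.call(1, "speaker_output", "hello"))
# ...while a ring-3 text-to-speech app is refused.
try:
    kernel.call(3, "speaker_output", "hello")
except PrivilegeError as e:
    print("denied:", e)
```

The point of the ordering is that anything in ring 3 can only reach hardware by going through code in the inner rings, which can refuse.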

Comment author: Eliezer_Yudkowsky 14 June 2012 02:16:51AM 0 points

Let me rephrase. The part of the agent that chooses what to say to the user - what ring is that in?

Comment author: Strange7 14 June 2012 03:31:20AM 1 point

That's less of a rephrasing and more of a relocating the goalposts across state lines. "Choosing what to say," properly unpacked, is approximately every part of the AI that doesn't already exist.

Comment author: Eliezer_Yudkowsky 14 June 2012 03:34:30AM 3 points

Yes. That's the problem with the ring architecture.

Comment author: Strange7 14 June 2012 03:59:42AM 3 points

As opposed to a problem with having a massive black box labeled "decisionmaking" in your AI plans, and not knowing how to break it down into subgoals?

Comment author: Johnicholas 14 June 2012 02:56:48AM 2 points

I don't think Strange7 is arguing Strange7's point strongly; let me attempt to strengthen it.

A button that does something dangerous, such as firing exploding bolts that separate one thing from another, might be protected from casual, accidental presses by covering it with a lid, so that when someone actually wants to fire those bolts, they first open the lid and then press the button. This increases reliability if each hand motion has some chance of being an error and the errors of separate hand motions are independent. 'Are you sure?' dialog boxes work similarly.

In general, if you have several components, each of a given reliability, and their failure modes are somewhat independent, then you can craft a composite component of greater reliability than the individuals. The rings that Strange7 brings up are an example of this general pattern (there may be other reasons why layers-of-rings architectures are chosen for reliability in practice - this explanation doesn't explain why the rings are ordered rather than just voting or something - this is just one possible explanation).
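[Editor's note: the composite-reliability claim is a one-line calculation. The numbers below are illustrative, and the whole argument rests on the independence assumption stated in the comment; correlated failures break it.]

```python
# Composite reliability from independent safeguards.
# If each safeguard fails independently with probability p_i, an accident
# requires every safeguard to fail at once: P = p_1 * p_2 * ... * p_n.

def composite_failure_probability(failure_probs):
    """P(all safeguards fail), assuming independent failures."""
    result = 1.0
    for p in failure_probs:
        result *= p
    return result

# A lid accidentally opened 1% of the time, over a button accidentally
# pressed 1% of the time: accidental firing needs both, so ~1 in 10,000.
p_lid, p_button = 0.01, 0.01
print(composite_failure_probability([p_lid, p_button]))  # ~1e-4
```

This also shows why the weakening matters: the product shrinks fast only while failures stay independent, which is exactly what a clever optimizer undermines.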

Comment author: Eliezer_Yudkowsky 14 June 2012 03:13:54AM 3 points

This is reasonable, but note that to strengthen the validity, the conclusion has been weakened (unsurprisingly). To take a system that you think is fundamentally, structurally safe and then further build in error-delaying, error-resisting, and error-reporting factors just in case - this is wise and sane. Calling "adding impediments to some errors under some circumstances" hardwiring and relying on it as a primary guarantee of safety, because you think some coded behavior is firmly in place locally independently of the rest of the system... will usually fail to cash out as an implementable algorithm, never mind it being wise.

Comment author: Strange7 14 June 2012 03:23:36AM 4 points

The conclusion has to be weakened back down to what I actually said: that it might not be sufficient for safety, but that it would probably be a good start.

Comment author: pnrjulius 19 June 2012 04:08:15AM 0 points

Don't programmers do this all the time? At least with current architectures, most computer systems have safeguards against unauthorized access to the system kernel that don't apply to the user's documents folder...

Isn't that basically saying "this line of code is harder to modify than that one"?

In fact, couldn't we use exactly this idea---user access protocols---to (partially) secure an AI? We could include certain kernel processes in the AI that would require a passcode to access. (I guess you have to stop the AI from hacking its own passcodes... but this isn't a problem on current computers, so it seems like we could prevent it from being a problem on AIs as well.)
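[Editor's note: the passcode-gating idea above can be sketched with the standard library. The class and setting names are invented for illustration; this only shows the access-control pattern, not a secure implementation.]

```python
# Hypothetical sketch of passcode-gated access to protected settings.
# The passcode is stored as a hash, not in plaintext, and compared in
# constant time with hmac.compare_digest.
import hashlib
import hmac

class ProtectedConfig:
    def __init__(self, passcode, **settings):
        self._digest = hashlib.sha256(passcode.encode()).digest()
        self._settings = dict(settings)

    def get(self, key):
        return self._settings[key]  # reading is unprivileged

    def set(self, key, value, passcode):
        attempt = hashlib.sha256(passcode.encode()).digest()
        if not hmac.compare_digest(attempt, self._digest):
            raise PermissionError("bad passcode: modification refused")
        self._settings[key] = value

config = ProtectedConfig("correct horse", shutdown_enabled=True)
config.set("shutdown_enabled", False, passcode="correct horse")  # allowed
try:
    config.set("shutdown_enabled", True, passcode="guess")  # refused
except PermissionError as e:
    print(e)
```

As the parenthetical in the comment anticipates, this only helps if the AI cannot reach the process's own memory; the check is inside the very system it is supposed to constrain.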

Comment author: RichardWein 14 July 2012 06:37:08PM 0 points

[Responding to an old comment, I know, but I've only just found this discussion.]

Never mind special access protocols, you could make code unmodifiable (in a direct sense) by putting it in ROM. Of course, it could still be modified indirectly, by the AI persuading a human to change the ROM. Even setting aside that possibility, there's a more fundamental problem. You cannot guarantee that the code will have the expected effect when executed in the unpredictable context of an AGI. You cannot even guarantee that the code in question will be executed. Making the code unmodifiable won't achieve the desired effect if the AI bypasses it.
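[Editor's note: both halves of this point have a compact software analogue. `types.MappingProxyType` is a real standard-library read-only view, used here as a stand-in for ROM; the policy-table contents and function names are invented for illustration.]

```python
# A read-only "ROM" analogue, and the bypass problem in miniature.
from types import MappingProxyType

_policy = {"may_self_modify": False}
POLICY = MappingProxyType(_policy)  # read-only view of the table

try:
    POLICY["may_self_modify"] = True  # direct modification fails
except TypeError as e:
    print("write refused:", e)

# The deeper problem: nothing forces the agent to route its decisions
# through POLICY at all. A choice made without ever reading the table
# is not constrained by it, however unmodifiable the table is.
def act_without_checking_policy():
    return "self-modify"  # never consults POLICY

print(act_without_checking_policy())
```

Making the table unwritable answers "can it be changed?" but not "will it be consulted?", which is the comment's point.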

In any case, I think the whole discussion of an AI modifying its own code is rendered moot by the fuzziness of the distinction between code and data. Does the human brain have any code? Or are the contents just data? I think that question is too fuzzy to have a correct answer. An AGI's behaviour is likely to be greatly influenced by structures that develop over time, whether we call these code or data. And old structures need not necessarily be used.

AGIs are likely to be unpredictable in ways that are very difficult to control. Holden Karnofsky's attempted solution seems naive to me. There's no guarantee that programming an AGI his way will prevent agent-like behaviour. Human beings don't need an explicit utility function to be agents, and neither does an AGI. That said, if AGI designers do their best to avoid agent-like behaviour, it may reduce the risks.