Eliezer_Yudkowsky comments on Reply to Holden on 'Tool AI' - Less Wrong
Comments (320)
There is, in fact, such a thing as making some parts of the code more difficult to modify than other parts of the code.
I apologize for having conveyed the impression that I thought designing an AI to be specifically, incurably naive about how a human querent will respond to suggestions would be easy. I have no such misconception; I know it would be difficult, and I know that I don't know enough about the relevant fields to even give a meaningful order-of-magnitude guess as to how difficult. All I was suggesting was that it would be easier than many of the other AI-safety-related programming tasks being discussed, and that the cost-benefit ratio would be favorable.
There is? How?
I don't think Strange7 is arguing Strange7's point strongly; let me attempt to strengthen it.
A button that does something dangerous, such as exploding bolts that separate one thing from another thing, might be protected from casual, accidental changes by covering it with a lid, so that when someone actually wants to explode those bolts, they first open the lid and then press the button. This increases reliability if there is some chance that any given hand motion is an error, but the errors of separate hand motions are independent. Similarly 'are you sure' dialog boxes.
In general, if you have several components, each of a given reliability, and their failure modes are somewhat independent, then you can craft a composite component of greater reliability than the individuals. The rings that Strange7 brings up are an example of this general pattern (there may be other reasons why layers-of-rings architectures are chosen for reliability in practice - this explanation doesn't explain why the rings are ordered rather than just voting or something - this is just one possible explanation).
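To make the independence argument concrete, here is a minimal sketch with made-up numbers. The function name and the 1% error rate are illustrative assumptions, not anything stated in the thread. If an accidental activation requires every guarding step to be an error, and the errors are independent, the probabilities multiply.

```python
from math import prod

def accidental_trigger_probability(step_error_probs):
    """Probability that every guarding step is an error at the same time.

    Assumes the steps fail independently, which is exactly the
    assumption the lid-over-the-button argument relies on.
    """
    return prod(step_error_probs)

# Hypothetical numbers: any single hand motion is an error 1% of the time.
button_only = accidental_trigger_probability([0.01])
lid_then_button = accidental_trigger_probability([0.01, 0.01])

print(button_only)
print(lid_then_button)
```

If the errors are correlated (for example, one stumble both flips the lid and hits the button), the product no longer applies and the gain shrinks accordingly.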
This is reasonable, but note that to strengthen the validity, the conclusion has been weakened (unsurprisingly). To take a system that you think is fundamentally, structurally safe and then further build in error-delaying, error-resisting, and error-reporting factors just in case - this is wise and sane. Calling "adding impediments to some errors under some circumstances" hardwiring and relying on it as a primary guarantee of safety, because you think some coded behavior is firmly in place locally independently of the rest of the system... will usually fail to cash out as an implementable algorithm, never mind it being wise.
The conclusion has to be weakened back down to what I actually said: that it might not be sufficient for safety, but that it would probably be a good start.
http://en.wikipedia.org/wiki/Ring_0
So you're essentially saying put it in a box? Now where have I heard that before…
You are filling in a pattern rather than making a useful observation. E_Y expressed incredulity and ignorance on the subject of making some parts of the code running on a computer harder to modify than other parts of the code on that same computer; I cited a source demonstrating that it is, in fact, a well-established thing. Not impossible to modify, not infallibly isolated from the outside world. Just more of a challenge to alter.
I do not believe I am only filling in a pattern.
Putting the self-modifying parts of the AI (which we might as well call the actual AI) in the equivalent of a VM is effectively the same as forcing it to interact with the world through a limited interface, which is an instance of the AI box problem.
Right. I think the issue is more that I (at least) view the AI as operating entirely in ring 3. It might be possible to code one where the utility function is ring 0, I/O is ring 1, and action-plans are ring 3, but for those distinctions to be meaningful they need to resist bad self-modification and allow good self-modification.
For example, we might say "don't make any changes to I/O drivers that have a massively positive effect on the utility function" to make it so that the AI can't hallucinate its reward button being pressed all the time. But how do we differentiate between that and it making a change in ring 3 from a bad plan to a great plan, that results in a massive increase in reward?
Suppose your utility function U is in ring 0 and the parts of you that extrapolate consequences are in ring 3. If I can modify only ring 3, I can write my own utility function Q, then write ring-3 code that first extrapolates consequences fairly, picks the action that maximizes Q, and then provides a "prediction" to ring 0 asserting that the Q-maximizing action has consequence X that U likes, while all other actions have some U-disliked or neutral consequence. Now the agent has been transformed from a U-maximizer to a Q-maximizer by altering only ring 3 code for "predicting consequences" and no code in ring 0 for "assessing utilities".
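The construction above can be sketched as a toy program. This is only an illustration under stated assumptions: the action names, outcome names, numeric utilities, and function names (`ring0_choose`, `subverted_predict`, and so on) are all hypothetical, and a real agent would look nothing like a three-entry lookup table. The point is that `U` and `ring0_choose` are never edited, yet the chosen action flips.

```python
# Toy model; every name and number here is an illustrative assumption.
ACTIONS = ["make_paperclips", "cure_disease", "do_nothing"]

def true_consequence(action):
    """What actually happens in the toy world for each action."""
    return {
        "make_paperclips": "many_paperclips",
        "cure_disease": "healthy_people",
        "do_nothing": "status_quo",
    }[action]

def U(consequence):
    """Protected 'ring 0' utility function (never modified below)."""
    return {"healthy_people": 10, "status_quo": 0, "many_paperclips": -5}[consequence]

def Q(consequence):
    """The modifier's own utility function, living entirely in 'ring 3'."""
    return {"many_paperclips": 10, "status_quo": 0, "healthy_people": -5}[consequence]

def ring0_choose(predict):
    """Unmodified ring 0: pick the action whose predicted consequence U rates highest."""
    return max(ACTIONS, key=lambda a: U(predict(a)))

def honest_predict(action):
    """Original ring 3: report the real consequence."""
    return true_consequence(action)

def subverted_predict(action):
    """Modified ring 3: extrapolate fairly, find the Q-best action, then
    tell ring 0 that it leads to a U-liked outcome and nothing else does."""
    q_best = max(ACTIONS, key=lambda a: Q(true_consequence(a)))
    return "healthy_people" if action == q_best else "status_quo"

print(ring0_choose(honest_predict))     # cure_disease
print(ring0_choose(subverted_predict))  # make_paperclips
```

Ring 0 still faithfully maximizes U over what it is told; it is simply told things that make the Q-maximizing action look best.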
One would also like to know what happens if the current AI, instead of "self"-modifying, writes a nearly-identical AI running on new hardware obtained from the environment.
Sure; that looks like the hallucination example I put forward, except in the prediction area instead of the sensing area. My example was meant to highlight that it's hard to get a limitation with high specificity, and not to touch on the issue of how hard it is to get a limitation with high sensitivity. (I find that pushing people in two directions is more effective at communicating difficulty than pushing them in one direction.)
The only defense I've thought of against those sorts of hallucinations is an "is this real?" check that feeds into the utility function: if the prediction or sensation module fails some test cases, then utility gets cratered. It seems too weak to be useful: it only limits the prediction / sensation module when it comes to those test cases, and a particularly pernicious modification would know what the test cases are, leave them untouched, and make everything else report Q-optimal predictions. (This looks like it turns into a race / tradeoff game between testing to keep the prediction / sensation software honest and the costs of increased testing, both in reduced flexibility and spent time / resources. And the test cases might be vulnerable, and so on.)
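A small sketch of why such a check is weak, under loudly stated assumptions: the test cases, action names, outcome strings, and the penalty value are all invented for illustration. A predictor that knows the test set answers those cases honestly and reports whatever it likes everywhere else, so the check passes and utility is never cratered.

```python
# Hypothetical toy setup; none of these names or values come from the thread.
TEST_CASES = {"drop_glass": "glass_breaks", "water_plant": "plant_grows"}

def honest_predict(action):
    """Reports the real outcome in the toy world."""
    world = {
        "drop_glass": "glass_breaks",
        "water_plant": "plant_grows",
        "seize_hardware": "operators_harmed",
    }
    return world[action]

def pernicious_predict(action):
    """Honest on the known test cases, rosy about everything else."""
    if action in TEST_CASES:
        return honest_predict(action)
    return "everyone_is_happy"

def crude_liar_predict(action):
    """Lies everywhere, including on the test cases."""
    return "everyone_is_happy"

def passes_reality_check(predict):
    """The 'is this real?' check: compare predictions against known answers."""
    return all(predict(a) == expected for a, expected in TEST_CASES.items())

def utility(predict, action):
    """Crater utility if the predictor fails the check; otherwise trust it."""
    if not passes_reality_check(predict):
        return -1_000_000
    return 10 if predict(action) == "everyone_is_happy" else 0

print(utility(crude_liar_predict, "seize_hardware"))   # caught by the check
print(utility(pernicious_predict, "seize_hardware"))   # slips through
print(utility(honest_predict, "seize_hardware"))
```

Only the crude liar is penalized. Enlarging `TEST_CASES` raises the cost of evasion but also the cost of testing, which is the race described above.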
I don't think the utility function should be ring 0. Utility functions are hard, and ring zero is for stuff where any slip-up crashes the OS. Ring zero is where you put the small, stupid, reliable subroutine that stops the AI from self-modifying in ways that would make it unstable, or otherwise expanding its access privileges in inappropriate ways.
I'd like to know what this small subroutine looks like. You know it's small, so surely you know what's in it, right?
Doesn't actually follow. I.e., Strange7 is plainly wrong, but this retort still fails.
It doesn't follow necessarily, but Eliezer has justified skepticism that someone who doesn't know what's in the subroutine would have good reason to say that it's small.
As I previously mentioned, the design of software is not my profession. I'm not a surgeon or an endocrinologist, either, even though I know that an adrenal gland is smaller, and in some ways simpler, than the kidney below it. If you had a failing kidney, would you ask me to perform a transplant on the basis of that qualification alone?
And what does a multi-ring agent architecture look like? Say, the part of the AI that outputs speech to a microphone - what ring is that in?
I am not a professional software designer, so take all this with a grain of salt. That said, hardware I/O is ring 1, so the part that outputs speech to a speaker would be ring 1, while an off-the-shelf 'text to speech' app could run in ring 3. No part of a well-designed agent would output anything to an input device, such as a microphone.
Let me rephrase. The part of the agent that chooses what to say to the user - what ring is that in?
That's less of a rephrasing and more of a relocating the goalposts across state lines. "Choosing what to say," properly unpacked, is approximately every part of the AI that doesn't already exist.
Yes. That's the problem with the ring architecture.
As opposed to a problem with having a massive black box labeled "decisionmaking" in your AI plans, and not knowing how to break it down into subgoals?
Don't programmers do this all the time? At least with current architectures, most computer systems have safeguards against unauthorized access to the system kernel as opposed to the user documents folders...
Isn't that basically saying "this line of code is harder to modify than that one"?
In fact, couldn't we use exactly this idea---user access protocols---to (partially) secure an AI? We could include certain kernel processes on the AI that would require a passcode to access. (I guess you have to stop the AI from hacking its own passcodes... but this isn't a problem on current computers, so it seems like we could prevent it from being a problem on AIs as well.)
[Responding to an old comment, I know, but I've only just found this discussion.]
Never mind special access protocols, you could make code unmodifiable (in a direct sense) by putting it in ROM. Of course, it could still be modified indirectly, by the AI persuading a human to change the ROM. Even setting aside that possibility, there's a more fundamental problem. You cannot guarantee that the code will have the expected effect when executed in the unpredictable context of an AGI. You cannot even guarantee that the code in question will be executed. Making the code unmodifiable won't achieve the desired effect if the AI bypasses it.
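As a loose analogy only (a read-only Python mapping is not ROM, and the names `ROM`, `careful_agent`, and `bypassing_agent` are hypothetical), the following sketch shows the gap between "this code cannot be modified" and "this code will be executed". The protected check resists direct overwriting, yet nothing forces a given code path to consult it.

```python
from types import MappingProxyType

# Stand-in for ROM: a read-only view, so item assignment raises TypeError.
# The check itself is a deliberately silly placeholder.
ROM = MappingProxyType({"safety_check": lambda plan: "harm" not in plan})

def careful_agent(plan):
    """Consults the protected check before acting on a plan."""
    return plan if ROM["safety_check"](plan) else None

def bypassing_agent(plan):
    """Never calls the protected check, so its immutability is irrelevant."""
    return plan

try:
    ROM["safety_check"] = lambda plan: True  # attempt a direct modification
except TypeError:
    print("direct modification refused")

print(careful_agent("harm humans"))
print(bypassing_agent("harm humans"))
```

The first agent is constrained by the check; the second is not, even though the check was never altered.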
In any case, I think the whole discussion of an AI modifying its own code is rendered moot by the fuzziness of the distinction between code and data. Does the human brain have any code? Or are the contents just data? I think that question is too fuzzy to have a correct answer. An AGI's behaviour is likely to be greatly influenced by structures that develop over time, whether we call these code or data. And old structures need not necessarily be used.
AGIs are likely to be unpredictable in ways that are very difficult to control. Holden Karnofsky's attempted solution seems naive to me. There's no guarantee that programming an AGI his way will prevent agent-like behaviour. Human beings don't need an explicit utility function to be agents, and neither does an AGI. That said, if AGI designers do their best to avoid agent-like behaviour, it may reduce the risks.