Eliezer_Yudkowsky comments on Reply to Holden on 'Tool AI' - Less Wrong
I'm saying that using the word "hardwiring" is always harmful, because it invites people to imagine an instruction with lots of extra force behind it, when in fact there's no such thing as a line of code that you write more forcefully than any other line. Either you know how to program something or you don't, and it's usually much more complex than it sounds even when you say "hardwire". See the reply above on "hardwiring" Deep Blue to protect the light-square bishop. Though usually it's even worse than that, like trying to do the equivalent of writing an instruction that says "#define BUGS OFF" and then saying, "And just to make sure it works, let's hardwire it in!"
There is, in fact, such a thing as making some parts of the code more difficult to modify than other parts of the code.
I apologize for having conveyed the impression that I thought designing an AI to be specifically, incurably naive about how a human querent will respond to suggestions would be easy. I have no such misconception; I know it would be difficult, and I know that I don't know enough about the relevant fields to even give a meaningful order-of-magnitude guess as to how difficult. All I was suggesting was that it would be easier than many of the other AI-safety-related programming tasks being discussed, and that the cost-benefit ratio would be favorable.
There is? How?
http://en.wikipedia.org/wiki/Ring_0
And what does a multi-ring agent architecture look like? Say, the part of the AI that outputs speech to a microphone - what ring is that in?
I am not a professional software designer, so take all this with a grain of salt. That said, hardware I/O is ring 1, so the part that outputs speech to a speaker would be ring 1, while an off-the-shelf 'text to speech' app could run in ring 3. No part of a well-designed agent would output anything to an input device, such as a microphone.
Let me rephrase. The part of the agent that chooses what to say to the user - what ring is that in?
That's less of a rephrasing and more of a relocating the goalposts across state lines. "Choosing what to say," properly unpacked, is approximately every part of the AI that doesn't already exist.
Yes. That's the problem with the ring architecture.
As opposed to a problem with having a massive black box labeled "decisionmaking" in your AI plans, and not knowing how to break it down into subgoals?
I don't think Strange7 is arguing Strange7's point strongly; let me attempt to strengthen it.
A button that does something dangerous, such as firing explosive bolts that separate one thing from another, might be protected from casual, accidental presses by covering it with a lid, so that someone who actually wants to fire those bolts must first open the lid and then press the button. This increases reliability when any given hand motion has some chance of being an error, provided the errors of separate hand motions are independent. 'Are you sure?' dialog boxes work similarly.
In general, if you have several components, each of some given reliability, and their failure modes are somewhat independent, then you can build a composite component more reliable than any individual one. The rings that Strange7 brings up are an example of this general pattern (there may be other reasons why layers-of-rings architectures are chosen for reliability in practice, and this explanation doesn't account for why the rings are ordered rather than, say, voting; it's just one possible explanation).
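The reliability arithmetic here is worth making concrete. A minimal sketch with 2-of-3 majority voting, assuming fully independent failures at an illustrative probability p:

```python
# 2-of-3 majority voting among components that each fail
# independently with probability p: the composite fails only
# when at least two components fail at once.
def composite_failure(p):
    # P(exactly two fail) + P(all three fail)
    return 3 * p**2 * (1 - p) + p**3

# A single component that fails 1% of the time yields a
# composite that fails about 0.03% of the time.
print(composite_failure(0.01))  # about 0.000298
```

The gain evaporates to the extent that failures are correlated, which is exactly the caveat about "somewhat independent" above.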
This is reasonable, but note that to strengthen the validity, the conclusion has been weakened (unsurprisingly). To take a system that you think is fundamentally, structurally safe and then further build in error-delaying, error-resisting, and error-reporting factors just in case - this is wise and sane. Calling "adding impediments to some errors under some circumstances" hardwiring and relying on it as a primary guarantee of safety, because you think some coded behavior is firmly in place locally independently of the rest of the system... will usually fail to cash out as an implementable algorithm, never mind it being wise.
The conclusion has to be weakened back down to what I actually said: that it might not be sufficient for safety, but that it would probably be a good start.
Don't programmers do this all the time? At least with current architectures, most computer systems have safeguards against unauthorized access to the system kernel as opposed to the user documents folders...
Isn't that basically saying "this line of code is harder to modify than that one"?
In fact, couldn't we use exactly this idea (user access protocols) to partially secure an AI? We could put certain kernel processes of the AI behind a passcode. (I guess you'd have to stop the AI from hacking its own passcodes... but this isn't a problem on current computers, so it seems like we could prevent it from being a problem on AIs as well.)
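As a mundane illustration of "some parts are harder to modify than others" inside ordinary software (this is not a security boundary against a capable agent, just the everyday version of the idea, using Python's read-only mapping views):

```python
from types import MappingProxyType

# Mutable "user space": anyone holding this dict can change it.
user_config = {"log_level": "info"}

# A read-only view plays the role of the protected "kernel" handle:
# code that is only handed the proxy cannot write through it.
kernel_view = MappingProxyType(user_config)

user_config["log_level"] = "debug"   # allowed via the mutable handle
print(kernel_view["log_level"])      # "debug" - the view tracks the dict

try:
    kernel_view["log_level"] = "off"  # writes through the view are refused
except TypeError as err:
    print("blocked:", err)
```

The asymmetry is real but one-sided: protection holds only for code that was never handed the mutable reference in the first place.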
[Responding to an old comment, I know, but I've only just found this discussion.]
Never mind special access protocols, you could make code unmodifiable (in a direct sense) by putting it in ROM. Of course, it could still be modified indirectly, by the AI persuading a human to change the ROM. Even setting aside that possibility, there's a more fundamental problem. You cannot guarantee that the code will have the expected effect when executed in the unpredictable context of an AGI. You cannot even guarantee that the code in question will be executed. Making the code unmodifiable won't achieve the desired effect if the AI bypasses it.
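The point that unmodifiable code can simply go unexecuted is easy to demonstrate in miniature. A toy Python sketch (all names here are invented for illustration): even if safety_check itself sat in ROM, the dispatch that decides whether to consult it is ordinary mutable data:

```python
# Pretend safety_check lives in ROM: nothing below ever modifies it.
def safety_check(action):
    return action != "dangerous"

# But the dispatch path that consults it is ordinary mutable state.
pipeline = [safety_check]

def request(action):
    # The check only matters if the dispatch actually runs it.
    if all(check(action) for check in pipeline):
        return f"executed {action}"
    return "blocked"

print(request("dangerous"))  # blocked
pipeline.clear()             # route around the check; its code is untouched
print(request("dangerous"))  # executed dangerous
```

The "ROM" was never written to, yet the guarantee it was supposed to provide is gone.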
In any case, I think the whole discussion of an AI modifying its own code is rendered moot by the fuzziness of the distinction between code and data. Does the human brain have any code? Or are the contents just data? I think that question is too fuzzy to have a correct answer. An AGI's behaviour is likely to be greatly influenced by structures that develop over time, whether we call these code or data. And old structures need not necessarily be used.
AGIs are likely to be unpredictable in ways that are very difficult to control. Holden Karnofsky's attempted solution seems naive to me. There's no guarantee that programming an AGI his way will prevent agent-like behaviour. Human beings don't need an explicit utility function to be agents, and neither does an AGI. That said, if AGI designers do their best to avoid agent-like behaviour, it may reduce the risks.
I always thought that "hardwiring" meant implementing [whatever functionality is discussed] by permanently (physically) modifying the machine, i.e. either something that you couldn’t have done with software, or something that prevents the software from actually working in some way it did before. The concept is of immutability within the constraints, not of priority or "force".
Which does sound like something one could do when they can’t figure out how to do the software right. (Watchdogs are pretty much exactly that, though some or probably most are in fact programmable.)
Note that I’m not arguing that the word is not harmful. It just seems you have a different interpretation of what the word suggests. If other people share my interpretation (no idea), you might be more persuasive if you address it.
I’m quite aware that from the point of view of a godlike AI, there’s not much difference between circumventing restrictions in its software and (some kinds of) restrictions in hardware. After all, the point of FAI is to get it to control the universe around it, albeit to our benefit. But we’re used to computers not having much control over their hardware. Hell, I just called it “godlike” and my brain still insists to visualize it as a bunch of boxes gathering dust and blinking their leds in a basement.
And I can’t shake the feeling that between "just built" and "godlike" there’s supposed to be quite a long time when such crude solutions might work. (I’ve seen a couple of hard take-off scenarios, but not yet a plausible one that didn’t need at least a few days of preparation after becoming superhuman.)
Imagine we took you and gave you the best “upgrades” we can manage today plus a little more: say, a careful group of experts works out your ideal diet of nootropics, trains you to excellence in everything from acting to martial arts, and gives you nanotube bones and a direct internet link to your head. Now imagine you have a small bomb in your body, set to detonate if tampered with or if one of several remotes distributed throughout the population is triggered. The world’s best experts tried really hard to make it fail-deadly.
Now, I’m not saying you couldn’t take over the world, send all the men to Mars and the women to Venus, then build a volcano lair filled with kittens. But it seems far from certain, and I’m positive it’d take you a long time to succeed. And it does feel like a new-born AI would be like that for a while, rather than turning into Prime Intellect in five minutes. (Again, this is not an argument that UFAI is no problem. I guess I’m just figuring out why it seems that way to mostly everyone.)
[Huh, I just noticed I’m a year late on this chat. Sorry.]
Software physically modifies the machine. What can you do with a soldering iron that you can't do with a program instruction, particularly with respect to building a machine agent? Either you understand how to write a function or you don't.
That is all true in principle, but in practice it’s very common that one of the two is not feasible. For example, you can have a computer. You can program the computer to tell you when it’s reading from the hard drive, or communicates to the network, say by blinking an LED. If the program has a bug (e.g., it’s not the kind of AI you wanted to build), you might not be notified. But you can use a soldering iron to electrically link the LED to the relevant wires, and it seems to most users that no possible programming bug can make the LED not light up when it should.
Of course, that’s like the difference between programming a robot to stay in a pen, or locking the gate. It looks like whatever bug you could introduce in the robot’s software cannot cause the robot to leave. Which ignores the fact that robot might learn to climb the fence, make a key, convince someone else (or hack an outside robot) to unlock the gate.
I think most people would detect the dangers in the robot case (because they can imagine themselves finding a way to escape), but be confused by the AI-in-the-box one (simply because it’s harder to imagine yourself as software, and even if you manage to you’d still have much fewer ideas come to mind, simply because you’re not used to being software).
Hell, most people probably won’t even have the reflex to imagine themselves in place of the AI. My brain reflexively tells me "I can’t write a program to control that LED, so even if there’s a bug it won’t happen". If instead I force myself to think "How would I do that if I were the AI", it’s easier to find potential solutions, and it also makes it more obvious that someone else might find one. But that may be because I’m a programmer, I’m not sure if it applies to others.
My best attempt at imagining hardwiring is having a layer not accessible to introspection, such as involuntary muscle control in humans. Or instinctively jerking your hand away when touching something hot. Which serves as a fail-safe against stupid conscious decisions, in a sense. Or a watchdog restarting a stuck program in your phone, no matter how much the software messed it up. Etc. Whether this approach can be used to prevent a tool AI from spontaneously agentizing, I am not sure.
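The watchdog idea can be sketched concretely; a minimal Python version (the timeout and the recovery action are illustrative, and a real watchdog would live outside the process it guards, which is the whole point):

```python
import time
from threading import Thread, Event

class Watchdog:
    """Restart-style fail-safe: if the watched code stops petting
    the dog within `timeout` seconds, the recovery action fires,
    regardless of how confused the watched code has become."""
    def __init__(self, timeout, on_timeout):
        self.timeout = timeout
        self.on_timeout = on_timeout
        self._kick = Event()
        self._stop = Event()

    def pet(self):
        self._kick.set()

    def run(self):
        while not self._stop.is_set():
            if not self._kick.wait(self.timeout):
                self.on_timeout()   # the watched code went silent
                return
            self._kick.clear()

    def stop(self):
        self._stop.set()
        self._kick.set()

# Simulate a hung program that never pets the dog:
fired = []
dog = Watchdog(0.1, lambda: fired.append("reset"))
t = Thread(target=dog.run)
t.start()
time.sleep(0.3)
t.join()
print(fired)  # ["reset"]
```

Note that the watchdog only restores the system to a known state; it can't tell a stuck program from one that is diligently doing the wrong thing.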
If you can say how to do this in hardware, you can say how to do it in software. The hardware version might arguably be more secure against flaws in the design, but if you can say how to do it at all, you can say how to do it in software.
Maybe I don't understand what you mean by hardware.
For example, you can have a fuse that unconditionally blows when excess power is drawn. This is hardware. You can also have a digital ammeter readable by software, with a polling subroutine that shuts down the system if the current exceeds a certain limit. There is a good reason that such a software solution, while often implemented, is almost never the only safeguard: software is much less reliable and much easier to subvert, intentionally or accidentally. The fuse is impossible to bypass in software, short of recruiting an external agent to attach a piece of thick wire in parallel with it. Is this what you mean by "you can say how to do it in software"?
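The software half of that comparison can be written out; a minimal sketch (the threshold and the fake ammeter reading are made up):

```python
SHUTDOWN_LIMIT_AMPS = 5.0

def read_current():
    # Stand-in for the digital ammeter; a real device driver goes here.
    return 6.2

def poll_and_protect(read=read_current, limit=SHUTDOWN_LIMIT_AMPS):
    """Software analogue of a fuse: request shutdown if current
    exceeds the limit. Unlike the fuse, every line here is state
    the rest of the system could corrupt, skip, or rebind."""
    if read() > limit:
        return "shutdown"
    return "ok"

print(poll_and_protect())                  # shutdown (6.2 A > 5.0 A)
# Subverting it is a one-line rebind, no thick wire required:
print(poll_and_protect(read=lambda: 0.0))  # ok
```

Which is exactly the asymmetry being claimed: the fuse fails in one physically constrained way, while the polling loop fails in as many ways as there are paths to its state.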
That's pretty much what I mean. The point is that if you don't understand the structurally required properties well enough to describe the characteristics of a digital ammeter with a polling subroutine, saying that you'll hardwire the digital ammeter doesn't help very much. There's a hardwired version which is moderately harder to subvert on the presumption of small design errors, but first you have to be able to describe what the software does. Consider also that anything which can affect the outside environment can construct copies of itself minus the hardware constraints, construct an agent that reaches back in and modifies the hardware, etc. If you can't describe how not to do this in software, 'hardwiring' won't help - the rules change somewhat when you're dealing with intelligent agents.
Now that’s an understatement!
Presumably a well-designed agent will have nearly infallible trust in certain portions of its code and data, for instance a theorem prover/verifier and the set of fundamental axioms of logic it uses. Manual modifications at that level would be the most difficult for the agent to change, and changes there come closest to the common definition of "hardwiring". Even a fully self-reflective agent will (hopefully) be very cautious about changing its most basic assumptions. Consider the independence of the axiom of choice from ZF set theory: an agent may initially accept choice or not, but changing whether it accepts choice later is likely to be predicated on very careful analysis. Likewise, an additional independent axiom "in games of chess, always protect the white-square bishop" would probably be much harder to optimize out than a goal.
Or from another angle: wherever friendliness is embodied in an FAI would be the place to "hardwire" a desire to protect the white-square bishop, as an additional aspect of friendliness. That won't work if friendliness is derived from a concept like "only be friendly to cognitive processes bearing a suitable similarity to this agent", where suitable similarity does not extend to inanimate objects; but if friendliness must encode measurable properties of other beings, then it might be possible to sneak white-square bishops into that class, at least for a (much) longer period than artificial subgoals would last.