Eliezer_Yudkowsky comments on Reply to Holden on 'Tool AI' - Less Wrong

94 Post author: Eliezer_Yudkowsky 12 June 2012 06:00PM

Comments (348)

Comment author: Eliezer_Yudkowsky 13 June 2012 03:02:18AM 10 points [-]

Delete the word "hardwiring" from your vocabulary. You can't do it with wires, and saying it doesn't accomplish any magic.

Comment author: cousin_it 13 June 2012 01:41:39PM *  6 points [-]

I think there is an interpretation of "hardwiring" that makes sense when talking about AI. For example, say you have a chess program. You can make a patch for it that says "if my light-squared bishop is threatened, getting it out of danger is highest priority, second only to getting the king out of check". Moreover, even for very complex chess programs, I would expect that patch to be pretty simple, compared to the whole program.

Maybe a general AI will necessarily have an architecture that makes such patches impossible or ineffective. Then again, maybe not. You could argue that an AI would work around any limitations imposed by patches, but I don't see why a computer program with an ugly patch would magically acquire a desire to behave as if it didn't have the patch, and converge to maximizing expected utility or something. In any case I'd like to see a more precise version of that argument.

ETA: I share your concern about the use of "hardwiring" to sweep complexity under the rug. But saying that AIs can do one magical thing (understand human desires) but not another magical thing (whatever is supposed to be "hardwired") seems a little weird to me.

Comment author: Eliezer_Yudkowsky 13 June 2012 06:19:29PM 9 points [-]

Yeah, well, hardwiring the AI to understand human desires wouldn't be goddamned trivial either, I just decided not to go down that particular road, mostly because I'd said it before and Holden had apparently read at least some of it.

Getting the light-square bishop out of danger as highest priority...

1) Do I assume the opponent assigns symmetric value to attacking the light-square bishop?

2) Or that the opponent actually values checkmates only, but knows that I value the light-square bishop myself and plan forks and skewers accordingly?

3) Or that the opponent has no idea why I'm doing what I'm doing?

4) Or that the opponent will figure it out eventually, but maybe not in the first game?

5) What about the complicated static-position evaluator? Do I have to retrain all of it, and possibly design new custom heuristics, now that the value of a position isn't "leads to checkmate" but rather "leads to checkmate + 25% leads to bishop being captured"?

Adding this to Deep Blue is not remotely as trivial as it sounds in English. Even to add it in a half-assed way, you have to at least answer question 1, because the entire non-brute-force search-tree pruning mechanism depends on guessing which branches the opponent will prune. Look up alpha-beta search to start seeing why everything becomes more interesting when position-values are no longer being determined symmetrically.
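
As a rough illustration of the alpha-beta point (a toy sketch, not Deep Blue's actual search), consider the function below. The name `alphabeta`, the `visited` list, and the nested-list tree format are all assumptions made up for this example. A leaf is a single number that both sides are assumed to agree on, with one side maximizing it and the other minimizing it.

```python
import math


def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf, visited=None):
    """Return the minimax value of a toy game tree, with alpha-beta pruning.

    A leaf is a number (the position's value to the maximizing player).
    An internal node is a list of child nodes.
    `visited` collects the leaves that were actually evaluated.
    """
    if visited is None:
        visited = []
    if not isinstance(node, list):
        visited.append(node)
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, best)
            if alpha >= beta:
                # The minimizer above already has a better option elsewhere.
                break
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, best)
        if alpha >= beta:
            # Cutoff: assumes the opponent minimizes this same number.
            break
    return best
```

On the tree `[[3, 5], [2, 9]]` with the maximizer to move, the search returns 3 and never evaluates the leaf 9: once the opponent can hold the second branch to 2, nothing else in that branch matters. That cutoff is only sound because the opponent is assumed to minimize the same number. If the opponent scores positions differently (questions 1 through 4 above), the skipped leaf might be exactly the one they head for.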

Comment author: cousin_it 13 June 2012 07:11:58PM *  10 points [-]

For what it's worth, the intended answers are 1) no 2) no 3) yes 4) no 5) the evaluation function and the opening book stay the same, there's just a bit of logic squished above them that kicks in only when the bishop is threatened, not on any move before that.

Yeah, game-theoretic considerations make the problem funny, but the intent wasn't to convert an almost-consistent utility maximizer into another almost-consistent utility maximizer with a different utility function that somehow values keeping the bishop safe. The intent was to add a hack that throws consistency to the wind, and observe that the AI doesn't rebel against the hack. After all, there's no law saying you must build only consistent AIs.

My guess is that's what most folks probably mean when they talk about "hardwiring" stuff into the AI. They don't mean changing the AI's utility function over the real world, they mean changing the AI's code so it's no longer best described as maximizing such a function. That might make the AI stupid in some respects and manipulable by humans, which may or may not be a bad thing :-) Of course your actual goals (whatever they are) would be better served by a genuine expected utility maximizer, but building that could be harder and more dangerous. Or at least that's how the reasoning is supposed to go, I think.
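
A minimal sketch of the kind of hack described in this comment, under stated assumptions: `patched_choice`, `engine_score`, and `saves_bishop` are hypothetical names, `engine_score` stands in for the untouched search and evaluation function, and `moves` is assumed to contain only legal moves. The point is only that the override sits on top of the engine and does nothing until the bishop is threatened.

```python
from typing import Callable, Iterable, TypeVar

Move = TypeVar("Move")


def patched_choice(
    moves: Iterable[Move],
    engine_score: Callable[[Move], float],
    in_check: bool,
    bishop_threatened: bool,
    saves_bishop: Callable[[Move], bool],
) -> Move:
    """Pick a move with the unmodified engine, plus one bolted-on rule.

    If the light-squared bishop is threatened (and the king is not in
    check), only moves that rescue the bishop are considered. Otherwise
    the engine's own ranking decides, exactly as before the patch.
    """
    candidates = list(moves)
    if bishop_threatened and not in_check:
        rescuing = [m for m in candidates if saves_bishop(m)]
        if rescuing:  # fall back to the engine if no rescue exists
            candidates = rescuing
    # The engine's scoring is reused unchanged; the patch only filters.
    return max(candidates, key=engine_score)
```

Nothing in the engine's scoring knows about the filter, so the search that produced those scores still assumes the bishop is an ordinary piece. That is the inconsistency the comment accepts on purpose, and also the opening an opponent could exploit.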

Comment author: Wei_Dai 14 June 2012 02:55:52AM 5 points [-]

The intent was to add a hack that throws consistency to the wind, and observe that the AI doesn't rebel against the hack.

Why doesn't the AI reason "if I remove this hack, I'll be more likely to win?" Because this is just a narrow chess AI and the programmer never gave it general reasoning abilities?

Comment author: private_messaging 26 June 2012 10:45:59AM *  1 point [-]

Why doesn't the AI reason "if I remove this hack, I'll be more likely to win?"

A more interesting question is why it (if made capable of such reflection) would not take this a small step further and ponder what happens if it removes the enemy's queen from its internal board, which would also make it more likely to win, given that its internal definition of "win" is defined in terms of the internal board.

Or why would anyone go through the bother of implementing a possibly irreducible notion of what 'win' really means in the real world, given that this would simultaneously waste computing power on unnecessary explorations and make the AI dangerous / uncontrollable?

Thing is, you don't need to imagine the world dying to avoid pursuing pointless, likely impossible accomplishments.

Comment author: cousin_it 14 June 2012 07:39:24AM *  0 points [-]

Yeah, because it's just a narrow real-world AI without philosophical tendencies... I'm actually not sure. A more precise argument would help, something like "all sufficiently powerful AIs will try to become or create consistent maximizers of expected utility, for such-and-such reasons".

Comment author: Vladimir_Nesov 14 June 2012 08:19:14AM *  4 points [-]

Does a pair of consistent optimizers with different goals have a tendency to become a consistent optimizer?

The problem with powerful non-optimizers seems to be that the "powerful" property already presupposes optimization power, and so at least one optimizer-like thing is present in the system. If it's powerful enough and is not contained, it's going to eat all the other tendencies of its environment, and so optimization for its goal will be all that remains. Unless there is another optimizer able to defend its non-conformity from the optimizer in question, in which case the two of them might constitute what counts as not-a-consistent-optimizer, maybe?

Comment author: Eliezer_Yudkowsky 13 June 2012 08:18:01PM 0 points [-]

Option 3? Doesn't work very well. You're assuming the opponent doesn't want to threaten the bishop, which means you yank it to a place where it would be safe if the opponent doesn't want to threaten it, but if the opponent clues in, it's then trivial for them to threaten the bishop again (to gain more advantage as you try to defend), which you weren't expecting them to do, because that's not how your search tree was structured. Kasparov would kick hell out of thus-hardwired Deep Blue as soon as he realized what was happening.

It's that whole "see the consequences of the math" thing...

Comment author: cousin_it 14 June 2012 07:46:51AM 14 points [-]

Either your comment is in violent agreement with mine ("that might make the AI stupid in some respects and manipulable by humans"), or I don't understand what you're trying to say...

Comment author: Eliezer_Yudkowsky 14 June 2012 09:23:08PM 4 points [-]

Probably violent agreement.

Comment author: Strange7 13 June 2012 05:10:54AM 6 points [-]

I was sorely tempted, upon being ordered to self-modify in such a way, to respond angrily. It implies a lack of respect for the integrity of those with whom you are trying to communicate. You could have said "taboo" instead of demanding a permanent loss.

Do you think it would be outright impossible to handicap an AI in such a way that it cannot conceive of a user interpreting its advice in any but the most straightforward way, and therefore eschews manipulative output? Do you think it would be useless as a safety feature? Do you think it would be unwise for some other reason, some unintended consequence? Or are you simply objecting to my phrasing?

Comment author: Johnicholas 13 June 2012 12:15:12PM 4 points [-]

The distinction between hardwiring and softwiring is, above the most physical, electronic aspects of computer design, a matter of policy - something in the programmer's mind and habits, not something out in the world that the programmer is manipulating. From the perspective of any particular version of the software, all of the program it is running is equally hard (or equally soft).

It may not be impossible to handicap an entity in some way analogous to your suggestion, but holding fiercely to the concept of hardwiring will not help you find it. Thinking about mechanisms that would accomplish the handicapping in an environment where everything is equally hardwired would be preferable.

There's some evidence that a chess AI's 'personality' (an emergent quality of its play) is related to a parameter of its evaluation function called 'contempt', which is something like (handwaving wildly) how easy the opponent is to manipulate. In general, AIs with higher contempt seek to win-or-lose more, and seek to draw less. What I'm trying to say is, your idea is not without merit, but it may have unanticipated consequences.

Comment author: Eliezer_Yudkowsky 13 June 2012 06:26:01PM 6 points [-]

I'm saying that using the word "hardwiring" is always harmful because people then imagine an instruction with lots of extra force, when in fact there's no such thing as a line of programming which you say much more forcefully than any other line. Either you know how to program something or you don't, and it's usually much more complex than it sounds even if you say "hardwire". See the reply above on "hardwiring" Deep Blue to protect the light-square bishop. Though usually it's even worse than this, like trying to do the equivalent of having an instruction that says "#define BUGS OFF" and then saying, "And just to make sure it works, let's hardwire it in!"

Comment author: Strange7 13 June 2012 08:04:24PM 3 points [-]

There is, in fact, such a thing as making some parts of the code more difficult to modify than other parts of the code.

I apologize for having conveyed the impression that I thought designing an AI to be specifically, incurably naive about how a human querent will respond to suggestions would be easy. I have no such misconception; I know it would be difficult, and I know that I don't know enough about the relevant fields to even give a meaningful order-of-magnitude guess as to how difficult. All I was suggesting was that it would be easier than many of the other AI-safety-related programming tasks being discussed, and that the cost-benefit ratio would be favorable.

Comment author: Eliezer_Yudkowsky 13 June 2012 08:19:44PM 1 point [-]

There is, in fact, such a thing as making some parts of the code more difficult to modify than other parts of the code.

There is? How?

Comment author: Strange7 13 June 2012 08:29:08PM *  4 points [-]
Comment author: Eliezer_Yudkowsky 13 June 2012 11:00:18PM 0 points [-]

And what does a multi-ring agent architecture look like? Say, the part of the AI that outputs speech to a microphone - what ring is that in?

Comment author: Strange7 13 June 2012 11:32:02PM 1 point [-]

Say, the part of the AI that outputs speech to a microphone - what ring is that in?

I am not a professional software designer, so take all this with a grain of salt. That said, hardware I/O is ring 1, so the part that outputs speech to a speaker would be ring 1, while an off-the-shelf 'text to speech' app could run in ring 3. No part of a well-designed agent would output anything to an input device, such as a microphone.

Comment author: Eliezer_Yudkowsky 14 June 2012 02:16:51AM 0 points [-]

Let me rephrase. The part of the agent that chooses what to say to the user - what ring is that in?

Comment author: Strange7 14 June 2012 03:31:20AM 1 point [-]

That's less of a rephrasing and more of a relocating the goalposts across state lines. "Choosing what to say," properly unpacked, is approximately every part of the AI that doesn't already exist.

Comment author: Johnicholas 14 June 2012 02:56:48AM 2 points [-]

I don't think Strange7 is arguing Strange7's point strongly; let me attempt to strengthen it.

A button that does something dangerous, such as exploding bolts that separate one thing from another thing, might be protected from casual, accidental changes by covering it with a lid, so that when someone actually wants to explode those bolts, they first open the lid and then press the button. This increases reliability if there is some chance that any given hand motion is an error, but the errors of separate hand motions are independent. Similarly 'are you sure' dialog boxes.

In general, if you have several components, each of a given reliability, and their failure modes are somewhat independent, then you can craft a composite component of greater reliability than the individuals. The rings that Strange7 brings up are an example of this general pattern (there may be other reasons why layers-of-rings architectures are chosen for reliability in practice - this explanation doesn't explain why the rings are ordered rather than just voting or something - this is just one possible explanation).
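
The arithmetic behind this can be sketched in a few lines. This is a toy model, not a reliability-engineering method: `composite_failure_probability` is a hypothetical helper, and the `common_cause` parameter is an added assumption standing in for "somewhat independent" (the chance that one shared cause defeats every safeguard at once).

```python
from math import prod
from typing import Sequence


def composite_failure_probability(
    failure_probs: Sequence[float],
    common_cause: float = 0.0,
) -> float:
    """Chance that every safeguard in the list fails together.

    With probability `common_cause`, a single shared cause defeats all
    safeguards. Otherwise each safeguard fails independently with its
    own probability, so the joint failure is the product.
    """
    independent_part = prod(failure_probs)
    return common_cause + (1.0 - common_cause) * independent_part
```

With a lid and a button that are each hit by accident 1% of the time, fully independent errors give a joint rate of roughly 1 in 10,000. A common cause of only 0.1% raises that to about 1 in 900, so the benefit depends heavily on how independent the failure modes really are.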

Comment author: Eliezer_Yudkowsky 14 June 2012 03:13:54AM 3 points [-]

This is reasonable, but note that to strengthen the validity, the conclusion has been weakened (unsurprisingly). To take a system that you think is fundamentally, structurally safe and then further build in error-delaying, error-resisting, and error-reporting factors just in case - this is wise and sane. Calling "adding impediments to some errors under some circumstances" hardwiring and relying on it as a primary guarantee of safety, because you think some coded behavior is firmly in place locally independently of the rest of the system... will usually fail to cash out as an implementable algorithm, never mind it being wise.

Comment author: Strange7 14 June 2012 03:23:36AM 4 points [-]

The conclusion has to be weakened back down to what I actually said: that it might not be sufficient for safety, but that it would probably be a good start.

Comment author: pnrjulius 19 June 2012 04:08:15AM 0 points [-]

Don't programmers do this all the time? At least with current architectures, most computer systems have safeguards against unauthorized access to the system kernel as opposed to the user documents folders...

Isn't that basically saying "this line of code is harder to modify than that one"?

In fact, couldn't we use exactly this idea---user access protocols---to (partially) secure an AI? We could include certain kernel processes on the AI that would require a passcode to access. (I guess you have to stop the AI from hacking its own passcodes... but this isn't a problem on current computers, so it seems like we could prevent it from being a problem on AIs as well.)

Comment author: RichardWein 14 July 2012 06:37:08PM 0 points [-]

[Responding to an old comment, I know, but I've only just found this discussion.]

Never mind special access protocols, you could make code unmodifiable (in a direct sense) by putting it in ROM. Of course, it could still be modified indirectly, by the AI persuading a human to change the ROM. Even setting aside that possibility, there's a more fundamental problem. You cannot guarantee that the code will have the expected effect when executed in the unpredictable context of an AGI. You cannot even guarantee that the code in question will be executed. Making the code unmodifiable won't achieve the desired effect if the AI bypasses it.

In any case, I think the whole discussion of an AI modifying its own code is rendered moot by the fuzziness of the distinction between code and data. Does the human brain have any code? Or are the contents just data? I think that question is too fuzzy to have a correct answer. An AGI's behaviour is likely to be greatly influenced by structures that develop over time, whether we call these code or data. And old structures need not necessarily be used.

AGIs are likely to be unpredictable in ways that are very difficult to control. Holden Karnofsky's attempted solution seems naive to me. There's no guarantee that programming an AGI his way will prevent agent-like behaviour. Human beings don't need an explicit utility function to be agents, and neither does an AGI. That said, if AGI designers do their best to avoid agent-like behaviour, it may reduce the risks.

Comment author: bogdanb 10 July 2013 10:03:04PM *  0 points [-]

I always thought that "hardwiring" meant implementing [whatever functionality is discussed] by permanently (physically) modifying the machine, i.e. either something that you couldn’t have done with software, or something that prevents the software from actually working in some way it did before. The concept is of immutability within the constraints, not of priority or "force".

Which does sound like something one could do when they can’t figure out how to do the software right. (Watchdogs are pretty much exactly that, though some or probably most are in fact programmable.)

Note that I’m not arguing that the word is not harmful. It just seemed you have a different interpretation of what that word suggests. If other people use my interpretation (no idea), you might be better at persuading them if you address that.

I’m quite aware that from the point of view of a godlike AI, there’s not much difference between circumventing restrictions in its software and (some kinds of) restrictions in hardware. After all, the point of FAI is to get it to control the universe around it, albeit to our benefit. But we’re used to computers not having much control over their hardware. Hell, I just called it “godlike” and my brain still insists on visualizing it as a bunch of boxes gathering dust and blinking their LEDs in a basement.

And I can’t shake the feeling that between "just built" and "godlike" there’s supposed to be quite a long time when such crude solutions might work. (I’ve seen a couple of hard take-off scenarios, but not yet a plausible one that didn’t need at least a few days of preparation after becoming superhuman.)

Imagine we took you, gave you the best "upgrades" we can do today plus a little bit (say, a careful group of experts figuring out your ideal diet of nootropics, training you to excellence in everything from acting to martial arts, and giving you nanotube bones and a direct internet link to your head). Now imagine you have a small bomb in your body, set to detonate if tampered with or if one of several remotes distributed throughout the population is triggered. The world’s best experts tried really hard to make it fail-deadly.

Now, I’m not saying you couldn’t take over the world, send all men to Mars and the women to Venus, then build a volcano lair filled with kittens. But it seems far from certain, and I’m positive it’d take you a long time to succeed. And it does feel that a new-born AI would be like that for a while rather than turn into Prime Intellect in five minutes. (Again, this is not an argument that UFAI is no problem. I guess I’m just figuring out why it seems that way to mostly everyone.)

[Huh, I just noticed I’m a year late on this chat. Sorry.]

Comment author: Eliezer_Yudkowsky 11 July 2013 12:08:35AM 4 points [-]

Software physically modifies the machine. What can you do with a soldering iron that you can't do with a program instruction, particularly with respect to building a machine agent? Either you understand how to write a function or you don't.

Comment author: bogdanb 11 July 2013 08:00:50PM *  1 point [-]

That is all true in principle, but in practice it’s very common that one of the two is not feasible. For example, you can have a computer. You can program the computer to tell you when it’s reading from the hard drive, or communicating over the network, say by blinking an LED. If the program has a bug (e.g., it’s not the kind of AI you wanted to build), you might not be notified. But you can use a soldering iron to electrically link the LED to the relevant wires, and it seems to most users that no possible programming bug can make the LED not light up when it should.

Of course, that’s like the difference between programming a robot to stay in a pen, or locking the gate. It looks like whatever bug you could introduce in the robot’s software cannot cause the robot to leave. Which ignores the fact that the robot might learn to climb the fence, make a key, or convince someone else (or hack an outside robot) to unlock the gate.

I think most people would detect the dangers in the robot case (because they can imagine themselves finding a way to escape), but be confused by the AI-in-the-box one (simply because it’s harder to imagine yourself as software, and even if you manage to, you’d still have far fewer ideas come to mind, simply because you’re not used to being software).

Hell, most people probably won’t even have the reflex to imagine themselves in place of the AI. My brain reflexively tells me "I can’t write a program to control that LED, so even if there’s a bug it won’t happen". If instead I force myself to think "How would I do that if I were the AI", it’s easier to find potential solutions, and it also makes it more obvious that someone else might find one. But that may be because I’m a programmer, I’m not sure if it applies to others.

Comment author: shminux 11 July 2013 12:32:31AM *  -1 points [-]

My best attempt at imagining hardwiring is having a layer not accessible to introspection, such as involuntary muscle control in humans. Or instinctively jerking your hand away when touching something hot. Which serves as a fail-safe against stupid conscious decisions, in a sense. Or a watchdog restarting a stuck program in your phone, no matter how much the software messed it up. Etc. Whether this approach can be used to prevent a tool AI from spontaneously agentizing, I am not sure.

Comment author: Eliezer_Yudkowsky 11 July 2013 01:22:59AM 2 points [-]

If you can say how to do this in hardware, you can say how to do it in software. The hardware version might arguably be more secure against flaws in the design, but if you can say how to do it at all, you can say how to do it in software.

Comment author: shminux 11 July 2013 05:19:23AM -1 points [-]

Maybe I don't understand what you mean by hardware.

For example, you can have a fuse that unconditionally blows when excess power is consumed. This is hardware. You can also have a digital amp meter readable by software, with a polling subroutine which shuts down the system if the current exceeds a certain limit. There is a good reason that such a software solution, while often implemented, is almost never the only safeguard: software is much less reliable and much easier to subvert, intentionally or accidentally. The fuse is impossible to bypass in software, short of accessing an external agent who would attach a piece of thick wire in parallel with the fuse. Is this what you mean by "you can say how to do it in software"?
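
For concreteness, a minimal sketch of the software half of that comparison. The names `poll_current`, `readings`, `limit_amps`, and `shut_down` are hypothetical, and a real polling subroutine would read a sensor on a timer rather than iterate over a list.

```python
from typing import Callable, Iterable


def poll_current(
    readings: Iterable[float],
    limit_amps: float,
    shut_down: Callable[[], None],
) -> bool:
    """Check each current reading and shut down on the first over-limit one.

    Returns True if a shutdown was triggered, False if every reading
    stayed at or below the limit.
    """
    for amps in readings:
        # This check protects the system only if this loop keeps running,
        # the readings are honest, and shut_down does what its name says.
        if amps > limit_amps:
            shut_down()
            return True
    return False
```

Each condition named in the comment inside the loop is something other software can break, which is the sense in which the fuse is harder to subvert. The sketch still has to be written before there is anything to harden: the limit and the shutdown action must be specified either way.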

Comment author: Eliezer_Yudkowsky 11 July 2013 07:28:27PM 2 points [-]

That's pretty much what I mean. The point is that if you don't understand the structurally required properties well enough to describe the characteristics of a digital amp meter with a polling subroutine, saying that you'll hardwire the digital amp meter doesn't help very much. There's a hardwired version which is moderately harder to subvert on the presumption of small design errors, but first you have to be able to describe what the software does. Consider also that anything which can affect the outside environment can construct copies of itself minus hardware constraints, construct an agent that reaches back in and modifies the hardware, etc. If you can't describe how not to do this in software, 'hardwiring' won't help - the rules change somewhat when you're dealing with intelligent agents.

Comment author: bogdanb 11 July 2013 08:08:51PM 0 points [-]

the rules change somewhat when you're dealing with intelligent agents.

Now that’s an understatement!

Comment author: Pentashagon 14 June 2012 07:03:39PM 0 points [-]

Presumably a well-designed agent will have nearly infallible trust in certain portions of its code and data, for instance a theorem prover/verifier and the set of fundamental axioms of logic it uses. Manual modifications at that level would be the most difficult for an agent to change, and changes to that would be the closest to the common definition of "hardwiring". Even a fully self-reflective agent will (hopefully) be very cautious about changing its most basic assumptions. Consider the independence of the axiom of choice from ZF set theory. An agent may initially accept choice or not but changing whether it accepts it later is likely to be predicated on very careful analysis. Likewise an additional independent axiom "in games of chess always protect the white-square bishop" would probably be much harder to optimize out than a goal.

Or, from another angle, wherever friendliness is embodied in a FAI would be the place to "hardwire" a desire to protect the white-square bishop as an additional aspect of friendliness. That won't work if friendliness is derived from a concept like "only be friendly to cognitive processes bearing a suitable similarity to this agent" where suitable similarity does not extend to inanimate objects, but if friendliness must encode measurable properties of other beings then it might be possible to sneak white-square bishops into that class, at least for a (much) longer period than artificial subgoals would last.