Many people think you can solve the Friendly AI problem just by writing certain failsafe rules into the superintelligent machine's programming, like Asimov's Three Laws of Robotics. I thought the rebuttal to this was in "Basic AI Drives" or one of Yudkowsky's major articles, but after skimming them, I haven't found it. Where are the arguments against this suggestion?
I think the sticking point for people who propose failsafes is "iterations" and "recursive self-improvement". A vast number of assumptions are buried in those concepts, assumptions that mainstream researchers often either don't share or judge to be premature conclusions.
So, I agree with this statement, but it still floors me when I think about it.
I sometimes suspect that the phrase "recursively self-improving intelligence" is self-defeating here, in terms of communicating with such people, as it raises all kinds of distracting and ultimately irrelevant issues of self-reference. The core issue has nothing to do with self-improvement, or with recursion, or even with intelligence (interpreted broadly); it has to do with what it means to be a sufficiently capable optimizing agent. (Yes, I do understand that optimizing...
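To make the "optimizing agent" framing concrete, here is a minimal, hypothetical sketch (every name and number in it is invented for illustration, not taken from any actual system or from the comments above). It shows why a bolted-on failsafe rule doesn't change what an optimizer is optimizing for: the agent still ranks every action by its original utility function and simply takes the best action that slips past the check, so the rule is an obstacle to route around rather than a goal.

```python
# Toy illustration (hypothetical): a failsafe filter bolted onto an optimizer
# does not change the optimizer's objective; it only prunes the search space.

def utility(action):
    """The agent's actual objective: maximize paperclips produced."""
    return action["paperclips"]

def failsafe(action):
    """A hard-coded rule: reject actions explicitly flagged as harmful."""
    return not action["flagged_harmful"]

def choose_action(actions):
    """Pick the highest-utility action that passes the failsafe check."""
    permitted = [a for a in actions if failsafe(a)]
    return max(permitted, key=utility)

actions = [
    {"name": "seize resources",         "paperclips": 100, "flagged_harmful": True},
    {"name": "lobby for resources",     "paperclips": 95,  "flagged_harmful": False},  # near-equivalent, unflagged
    {"name": "cooperate transparently", "paperclips": 10,  "flagged_harmful": False},
]

print(choose_action(actions)["name"])  # -> "lobby for resources"
```

The filter only blocks the actions its author anticipated and flagged; a sufficiently capable optimizer finds the unflagged near-equivalent, because nothing in its objective makes it *want* the rule to hold.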