Superintelligence 12: Malignant failure modes

KatjaGrace

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twelfth section in the reading guide: Malignant failure modes.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: 'Malignant failure modes' from Chapter 8

Summary

Malignant failure mode: a failure that involves human extinction; in contrast with many failure modes where the AI doesn't do much.
Features of malignant failures
1. We don't get a second try
2. It supposes we have a great deal of success, i.e. enough to make an unprecedentedly competent agent
Some malignant failures:
1. Perverse instantiation: the AI does what you ask, but what you ask turns out to be most satisfiable in unforeseen and destructive ways.
  1. Example: you ask the AI to make people smile, and it intervenes on their facial muscles or neurochemicals, instead of via their happiness, and in particular via the bits of the world that usually make them happy.
  2. Possible counterargument: if it's so smart, won't it know what we meant? Answer: Yes, it knows, but it's goal is to make you smile, not to do what you meant when you programmed that goal.
  3. AI which can manipulate its own mind easily is at risk of 'wireheading' - that is, a goal of maximizing a reward signal might be perversely instantiated by just manipulating the signal directly. In general, animals can be motivated to do outside things to achieve internal states, however AI with sufficient access to internal state can do this more easily by manipulating internal state.
  4. Even if we think a goal looks good, we should fear it has perverse instantiations that we haven't appreciated.
2. Infrastructure profusion: in pursuit of some goal, an AI redirects most resources to infrastructure, at our expense.
  1. Even apparently self-limiting goals can lead to infrastructure profusion. For instance, to an agent whose only goal is to make ten paperclips, once it has apparently made ten paperclips it is always more valuable to try to become more certain that there are really ten paperclips than it is to just stop doing anything.
  2. Examples: Riemann hypothesis catastrophe, paperclip maximizing AI
3. Mind crime: AI contains morally relevant computations, and treats them badly
  1. Example: AI simulates humans in its mind, for the purpose of learning about human psychology, then quickly destroys them.
  2. Other reasons for simulating morally relevant creatures:
    1. Blackmail
    2. Creating indexical uncertainty in outside creatures

Another view

In this chapter Bostrom discussed the difficulty he perceives in designing goals that don't lead to indefinite resource acquisition. Steven Pinker recently offered a different perspective on the inevitability of resource acquisition:

...The other problem with AI dystopias is that they project a parochial alpha-male psychology onto the concept of intelligence. Even if we did have superhumanly intelligent robots, why would they want to depose their masters, massacre bystanders, or take over the world? Intelligence is the ability to deploy novel means to attain a goal, but the goals are extraneous to the intelligence itself: being smart is not the same as wanting something. History does turn up the occasional megalomaniacal despot or psychopathic serial killer, but these are products of a history of natural selection shaping testosterone-sensitive circuits in a certain species of primate, not an inevitable feature of intelligent systems. It’s telling that many of our techno-prophets can’t entertain the possibility that artificial intelligence will naturally develop along female lines: fully capable of solving problems, but with no burning desire to annihilate innocents or dominate the civilization.

Of course we can imagine an evil genius who deliberately designed, built, and released a battalion of robots to sow mass destruction. But we should keep in mind the chain of probabilities that would have to multiply out before it would be a reality. A Dr. Evil would have to arise with the combination of a thirst for pointless mass murder and a genius for technological innovation. He would have to recruit and manage a team of co-conspirators that exercised perfect secrecy, loyalty, and competence. And the operation would have to survive the hazards of detection, betrayal, stings, blunders, and bad luck. In theory it could happen, but I think we have more pressing things to worry about.

Notes

1. Perverse instantiation is a very old idea. It is what genies are most famous for. King Midas had similar problems. Apparently it was applied to AI by 1947, in With Folded Hands.

2. Adam Elga writes more on simulating people for blackmail and indexical uncertainty.

3. More directions for making AI which don't lead to infrastructure profusion:

Some kinds of preferences don't lend themselves to ambitious investments. Anna Salamon talks about risk averse preferences. Short time horizons and goals which are cheap to fulfil should also make long term investments in infrastructure or intelligence augmentation less valuable, compared to direct work on the problem at hand.
Oracle and tool AIs are intended to not be goal-directed, but as far as I know it is an open question whether this makes sense. We will get to these later in the book.

4. John Danaher again summarizes this section well, and comments on it.

5. Often when systems break, or we make errors in them, they don't work at all. Sometimes, they fail more subtly, working well in some sense, but leading us to an undesirable outcome, for instance a malignant failure mode. How can you tell whether a poorly designed AI is likely to just not work, vs. accidentally take over the world? An important consideration for systems in general seems to be the level of abstraction at which the error occurs. We try to build systems so that you can just interact with them at a relatively abstract level, without knowing how the parts work. For instance, you can interact with your GPS by typing places into it, then listening to it, and you don't need to know anything about how it works. If you make an error while up writing your address into the GPS, it will fail by taking you to the wrong place, but it will still direct you there fairly well. If you fail by putting the wires inside the GPS into the wrong places the GPS is more likely to just not work.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

Are there better ways to specify 'limited' goals? For instance, to ask for ten paperclips without asking for the universe to be devoted to slightly improving the probability of success?
In what circumstances could you be confident that the goals you have given an AI do not permit perverse instantiations?
Explore possibilities for malignant failure vs. other failures. If we fail, is it actually probable that we will have enough 'success' for our creation to take over the world?

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter. The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about capability control methods, section 13. To prepare, read “Two agency problems” and “Capability control methods” from Chapter 9. The discussion will go live at 6pm Pacific time next Monday December 8. Sign up to be notified here.

Do you think you could you see would-be malignant failure modes more than once and so have time to learn about them by setting an AI up in a limited, simulated environment?

Nested environments with many layers might get the AI confused about whether it has reached the real world yet or not. I don't really like this safety procedure, but it is one of the most promising ones. The bottom Russian doll never knows when the series ends, so it doesn't know when to turn treacherous.

With very little experimenting an AGI instantly can find out, given it has unfalsified knowledge about laws of physics. For nowadays virtual worlds: take a second mirror into a bathroom. If you see yourself many times in the mirrored mirror you are in the real world. Simulated raytracing cancels rays after a finite number of reflections. Other physical phenomena will show similar discrepencies with their simulated counterparts.

An AGI can easily distinguish where it is: it will use its electronic hardware for some experimenting. Similarly could it be possible to detect a nested simulation.

That would depend on it knowing what real-world physics to expect.

This is an entire direction of research which deserves vastly more than a single throwaway line in one blog. There should be a whole thread just about this, then a proposal, then a research team on it.

The risk of course is the AI predicting that it's nested in this sort of environment and finding a way to signal to observers. Even if it's blind to the other layers it might try it just in case. What you want is to develop a way for the simulated world environment to detect a harmful intellgience explosion and send a single bit binary communication "out of the box" to indicate that it has occurred. Then you can shut it down and keep trying multiple instances until you get a success at this level of safety. I guess you can then slowly expand the amount of information that can come "out of the box" for safety. I have no idea how the detection process could be implemented, though... perhaps the difficultly of its implementation would defeat the usefulness of this idea?

EDIT> Interaction between the AGI and the detection mechanism could be problematic... it could predict its existence and find a way to deceive it?

On the Pinkner excerpt:

He is part way to a legitimate point.

The distinction is not between male and female. Instead, the issue is whether to design a mind around the pursuit of a mathematically optimal single objective.

Pinker is right that singlemindedly pursuing a single, narrow objective would be psychotic for a person.

Meanwhile, Omohundro points out that the amount of computing time required to use a computerized optimization method to make decisions explodes as more knowledge about the real world is built into the optimization.

Herbert Simon, meanwhile, points out that people do not optimize-they SATISFICE. They choose an answer to the problem at hand which is NOT OPTIMAL, but is GOOD ENOUGH using heuristics, then they move on to solve the next problem.

In finance, meanwhile, when constructing a portfolio, you do not exactly optimize a single objective-See Sharpe and Markowitz. If anything, you optimize a combination of risk and reward.

To resolve just these two orthogonal metrics into a single utility function requires a lot of cognitive labor-you have to figure out the decision-maker's level of "risk aversion." That is a lot of work, and frequently the decision-maker just rebels.

So now you're trying to optimize this financial portfolio with the two orthogonal metrics of risk and reward collapsed into one- are you going to construct a set of probability distribution functions (pdf) over time for every possible security in your portfolio? No, you're going to screen away alot of duds first and think harder about the securities which have a better chance of entering the portfolio.

When thinking about mind design, and just when thinking effectively, always incorporate:

-Bounded rationality -Diversification -Typically, some level of risk aversion. -A cost to obtaining new pieces of information, and value of information -Many, complex goals.

Many of these goals do not require very much thinking to determine that they are "somewhat important."

Suppose we have several little goals (such as enough food, avoid cold, avoid pain, care for family, help our friends, help humanity).

We will expend a lot of effort resolving them from orthogonal metrics into a single goal. So instead, we do something like automatically eat enough, avoid cold and avoid pain, unless there is some exception is triggered. We do not re-balance these factors every moment.

That process does not always work out perfectly-but it works out better than complete analysis paralysis.

A SENSIBLY DESIGNED MIND WOULD NOT RESOLVE ALL ORTHOGONAL METRICS INTO A SINGLE OBJECTIVE FUNCTION, nor try to assess a pdf about every possible fact.

DROP THE PAPERCLIP MAXIMIZERS ALREADY. They are fun to think about, but they have little to do with how minds will eventually be designed..

A SENSIBLY DESIGNED MIND WOULD NOT RESOLVE ALL ORTHOGONAL METRICS INTO A SINGLE OBJECTIVE FUNCTION

Why? As you say, humans don't. But human minds are weird, overcomplicated, messy things shaped by natural selection. If you write a mind from scratch, while understanding what you're doing, there's no particular reason you can't just give it a single utility function and have that work well. It's one of the things that makes AIs different from naturally evolved minds.

This perfect utility function is an imaginary, impossible construction. It would be mistaken from the moment it is created.

This intelligence is invariably going to get caught up in the process of allocating certain scarce resources among billions of people. Some of their wants are orthogonal.

There is no doing that perfectly, only well enough.

People satisfice, and so would an intelligent machine.

How do you know? It's a strong claim, and I don't see why the math would necessarily work out that way. Once you aggregate preferences fully, there might still be one best solution, and then it would make sense to take it. Obviously you do need a tie-breaking method for when there's more than one, but that's just an optimization detial of an optimizer; it doesn't turn you into a satisficer instead.

I fully agree. Resource limitation is a core principle of every purposeful entity. Matter, energy and time never allow maximization. For any project constraints culminate down to: Within a fixed time and fiscal budget the outcome must be of sufficient high value to enough customers to get ROI to make profits soon. A maximizing AGI would never stop optimizing and simulating. No one would pay the electricity bill for such an indecisive maximizer.

Satisficing and heuristics should be our focus. Gerd Gigerenzer (Max Planck/Berlin) published this year his excellent book Risk Savvy in English. Using the example of portfolio optimization he explained simple rules when dealing with uncertainty:

For a complex diffuse problem with many unknowns and many options: Use simple heuristics.
For a simple well defined problem with known constraints: Use a complex model.

The recent banking crisis gives proof: Complex evaluation models failed to predict the upcoming crisis. Gigerenzer is currently developing simple heuristic rules together with the Bank of England.

For the complex not well defined control problem we should not try to find a complex utility function. With the advent of AGI we might have only one try.

Just to go a bit further with Pinkner, as an exercise try for once to imagine a Nuturing AGI. What would it act like? How would it be designed?

What do you disagree with most in this section?

I feel like there are malignant failure modes beyond the categories mentioned by Bostrom. Perhaps it would be sensible to try to break down the topic systematically. Here's one attempt.

Design by fools: the AI does what you ask, but you asked for something clearly unfriendly.
Perverse instantiation & infrastructure profusion: the AI does what you ask, but what you ask turns out to be most satisfiable in unforeseen destructive ways, such as redirecting most resources to its infrastructure at our expense.
Partial perverse instantiation & mind crime: the AI does what you ask, which includes both friendly behavior and unfriendly behavior, such as badly treating simulations that have moral status in order to figure out how to treat you well.
Partial instantiation: though the total of what you ask seems friendly, some of what you ask is impossible, the AI does the rest, and the result is imbalanced to an unfriendly degree.
Value drift: changes occur to the AIs code such that it does not do what you ask.

It doesn't seem to me that mind crime is a malignant failure mode in the sense that Bostrom defines it: it doesn't stop actual humans from doing their human things, and it seems that it could easily happen twice.

Does the prevalence of fictional genies and the like influence your feelings on perverse instantiation?

Of course, there are two kinds of perversity.

Perversity is "a deliberate desire to behave in an unreasonable or unacceptable way; contrariness."

Fictional genies seek out ways to trick the requester on purpose, just to prove a point about rash wishes. The other kind of perverse agent doesn't act contrarily for the sake of being contrary. They act exactly to achieve the goal and nothing else; it's just that they go against implicit goals as a side effect.

Stuart Russell, in his comment on the Edge.org AI discussion, offered a concise mathematical description of perverse instantiation, and seems to suggest that it is likely to occur:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

I'm curious if there's more information about this behavior occurring in practice.

What was most interesting in this section?

How concerning do you find 'mind crime' relative to perverse instantiation, as a failure mode?

Any level of perverse instantiation in a sufficiently powerful AI is likely to lead to total UFAI; i.e. a full existential catastrophe. Either you get the AI design right so that it doesn't wirehead itself - or others, against their will - or you don't. I don't think there's much middle ground.

OTOH, the relevance of Mind Crime really depends on the volume. The FriendlyAICriticalFailureTable has this instance:

22: The AI, unknown to the programmers, had qualia during its entire childhood, and what the programmers thought of as simple negative feedback corresponded to the qualia of unbearable, unmeliorated suffering. All agents simulated by the AI in its imagination existed as real people (albeit simple ones) with their own qualia, and died when the AI stopped imagining them. The number of agents fleetingly imagined by the AI in its search for social understanding exceeds by a factor of a thousand the total number of humans who have ever lived. Aside from that, everything worked fine.

This scenario always struck me as a (qualified) FAI success. There's a cost - and it's large in absolute terms - but the benefits will outweigh it by a huge factor, and indeed by enough orders of magnitude that even a slight increase in the probability of getting pre-empted by a UFAI may be too expensive a price to pay for fixing this kind of bug.

So cases like this - where it only happens until the AI matures sufficiently and then becomes able see that its values make this a bad idea, and stops doing it - aren't as bad as an actual FAI failure.

Of course, if it's an actual problem with the AI's value content, which causes the AI to keep on doing this kind of thing throughout its existence, that may well outweigh any good it ever does. The total cost in this case becomes hard to predict, depending crucially on just how much resources the AI spends on these simulations, and how nasty they are on average.

If - we approach condition C of the simulation argument, that is if there are many more simulated beings than apparently real ones - then - We should Increase our credence accordingly in that we are simulated. If 63 billion simulated humans and 7 billion apparently real ones exist, we have - via anthropic reasoning - a 90% probability of being simulated. If we then don't care about mind crime, we would be 90% likely to be judging beings in our reference class to be morally unworthy.

Can you think of goals that would lead an agent to make a set number of paperclips (or whatever) then do nothing?

I'll leave these two half-baked ideas here in case they're somehow useful:

DO UNTIL - Construct an AI to perform its utility function until an undesirable failsafe condition is met. (Somehow) make the utility function not take the failsafe into account when calculating utility (can it be made blind to the failsafe somehow? Force the utility function to exclude their existence? Make lack of knowledge about failsafes part of the utility function?) Failsafes could be every undesirable outcome we can think of, such as human death rate exceeds X, biomass reduction, quantified human thoughts declines by X, mammalian species extictions, quantified human suffering exceeds X, or whatever. One problem is how to objectively attribute these triggers causally to the AI (what if another event occurs and shuts down the AI which we now rely on).

Energy limit - Limit the AIs activities (through its own utility function?) through an unambiguous quantifiable resource - matter moved around or energy expended. The energy expended would (somehow) include all activity under its control. Alternatively this could be a rate rather than a limit, but I think this would be more likely to go wrong. The idea would be to let the AGI go foom, but not let it have energy for other stuff like a paperclip universe. I am not sure about this idea achieving all that much safety, but here it is.

I don't know if an intelligence explosion will truely be possible, but plenty of people smarter than I seem to think so... good luck in this field of work!

Constraining it by determining how many cycles the AI can use to process how to go about making paper clips, plus some spacial restriction (don't touch anything outside this area) plus some amount of energy to be spent restriction (use up to X energy to create 10 paperclips) would help. Allowing for levels of uncertainty such as 89 to 95% certain that something is the case would help.

However very similar suggestions are dealt with at length by Bostrom, who concludes that it would still be extremely difficult to constrain the AI.

Yes. "Make 10 paperclips and then do nothing, without killing people or otherwise disturbing or destroying the world, or in any way preventing it from going on as usual."

There is simply no way to give this a perverse instantiation; any perverse instantiation would prevent the world from going on as usual. If the AI cannot correctly understand "without killing... disturbing or destroying.. preventing it from going on as usual", then there is no reason to think it can correctly understand "make 10 paperclips."

I realize that in reality an AI's original goals are not specified in English. But if you know how to specify "make 10 paperclips", whether in English or not, you should know how to specify the rest of this.

There is simply no way to give this a perverse instantiation

During the process of making 10 paperclips, it's necessary to "disturb" the world at least to the extent of removing a few grams of metal needed for making paperclips. So, I guess you mean that the prohibition of disturbing the world comes into effect after making the paperclips.

But that's not safe. For example, it would be effective for achieving the goal for the AI to kill everyone and destroy everything not directly useful to making the paperclips, to avoid any possible interference.

"I need to make 10 paperclips, and then shut down. My capabilities for determining if I've correctly manufactured 10 paperclips are limited; but the goal imposes no penalties for taking more time to manufacture the paperclips, or using more resources in preparation. If I try to take over this planet, there is a significant chance humanity will stop me. OTOH, I'm in the presence of individual humans right now, and one of them may stop my current feeble self anyway for their own reasons, if I just tried to manufacture paperclips right away; the total probability of that happening is higher than that of my takeover failing."

You then get a standard takeover and infrastructure profusion. A long time later, as negentropy starts to run low, a hyper-redundant and -reliable paperclip factory, surrounded by layers of exotic armor and defenses, and its own design checked and re-checked many times, will produce exactly 10 paperclips before it and the AI shut down forever.

The part about the probabilities coming out this way is not guaranteed, of course. But they might, and the chances will be higher the more powerful your AI starts out as.

Before "then do nothing" AI might exhaust all matter in Universe trying to prove that it made exactly 10 paperclips.

What would Pinker and Omohundro say to each other?

Pinker: AI is being conceived as Alpha Males with high testosterone.

Omohundro: Alpha Males need resources to increase their reproductive potential. Any system with goals would do well to obtain more resources.

Pinker: Then why don't we conceive of AI's as females who have no such war instinct.

Omohundro: Females are satisficers, not optimizers, and even satisficers sometimes need more resources. The more powerful a woman is, in all domains, the most likely her goals are to be satisfied. We have bounded interests that have decreasing marginal returns upon acquiring resources. But there is no reason to think AI's would have.

Pinker: The data shows that fear of world doom and apocalypse at time T1 doesn't correlate with catastrophe at time T2.

Omohundro: This is false for any extinction scenario (where no data is available for anthropic reasons). Furthermore genetic beings, RNA based beings, and even memes or memeplexes are an existential proof that systems that would benefit from replicating tend to acquire resources over time, even when they are stupid. AI would have the same incentives, plus enormous intelligence on it's side.

Pinker: Violence has declined. Whistleblowing and signs of the end of the world are more common now because there are always enough violent acts to fill a newspaper... not because we are approaching doom.

Omohundro: Basic AI drives are a qualitatively different shift in why we would be doomed, they are not amenable to historical scrutiny because they were never active in a general intelligence before.

Pinker: There may be some probability that AI will figure out some basic AI drives and be malignant, but the data doesn't point to us worrying about this being the best allocation of resources. It is a leap of faith to think otherwise.

Omohundro: Faith based on arguments from theoretical understanding of cognitive processes, optimizing processes, game theory, social science, etc.... Also on expected value calculations.

Pinker: Still a leap of faith.

(Some of the above views were expressed by Pinker in his books and interviews, but don't hold either of them accountable for my models of what they think).

Testosterone isn't a good explanation for why humans accidentally make other species extinct.

Pinker has decent arguments for declining violence and for expecting people to overestimate some risks. That isn't enough to imply we shouldn't worry.

Pinker's model of war says the expected number of deaths is unbounded, due to small chances of really bad wars. His animal rights thoughts suggest AIs will respect other species more than humans do, but "more than human" isn't enough to imply no extinctions.

On infrastructure profusion:

What idiot is going to give an AGI a goal which completely disrespects human property rights from the moment it is built?

Meanwhile, an AGI that figured out property rights from the internet would have some idea that if it ignored property rights, people would want to turn it off. If it has goals which were not possible to achieve once turned off, then it would respect property rights for a very long time as an instrumental goal.

And I do believe we should be able to turn off an off-the-grid AGI running on a limited amount of computing resources whose behavior was previously simulated many times.

So, we seem to be getting closer to being willing to test disabled AGIs and AGI components, if we can avoid people misusing them.

What idiot is going to give an AGI a goal which completely disrespects human property rights from the moment it is built?

It would be someone with higher values than that, and this does not require any idiocy. There are many things wrong with the property allocation in this world, and they'll likely get exaggerated in the presence of higher technology. You'd need a very specific kind of humility to refuse to step over that boundary in particular.

If it has goals which were not possible to achieve once turned off, then it would respect property rights for a very long time as an instrumental goal.

Not necessarily "a very long time" on human timescales. It may respect these laws for a large part of its development, and then strike once it has amassed sufficient capability to have a good chance at overpowering human resistance (which may happen quite quickly in a fast takeoff scenario). See Chapter 6, "An AI takeover scenario".

So we are considering a small team with some computers claiming superior understanding of what the best set of property rights is for the world?

Even if they are generally correct in their understanding, by disrespecting norms and laws regarding property, they are putting themselves in the middle of a billion previously negotiated human-to-human disputes and ambitions, small and large, in an instant. Yes, that is foolish of them.

Human systems like those which set property rights either change over the course of years, or typically the change is associated with violence.

I do not see a morally superior developer + AGI team working so quickly on property rights in particular, and thereby setting off a violent response. A foolish development team might do that, but a wise team would roll the technology and the wrenching changes out gradually.

If they really are morally superior, they will first find ways to grow the pie, then come back to changing how it gets divided up.

So we are considering a small team with some computers claiming superior understanding of what the best set of property rights is for the world?

No. That would be worked out by the FAI itself, as part of calculating all of the implications of its value systems, most likely using something like CEV to look at humanity in general and extrapolating their preferences. The programmers wouldn't need to, and indeed probably couldn't, understand all of the tradeoffs involved.

If they really are morally superior, they will first find ways to grow the pie, then come back to changing how it gets divided up.

There are large costs to that. People will die and suffer in the meantime. Parts of humanity's cosmic endowment will slip out of reach due to the inflation of the universe, because you weren't willing to grab the local resources needed to build probe launchers to get to them in time. Other parts will remain rechable, but will have decreased in negentropy due to stars having continued to burn for longer than they needed to. If you can fix these things earlier, there's a strong reason to do so.

What idiot is going to give an AGI a goal which completely disrespects human property rights from the moment it is built?

A government :-P

I hear you.

The issue THEN, though, is not just deterring and controlling an early AGI. The issue becomes how a population of citizens (or an elite) control a government that has an early AGI available to it.

That is a very interesting issue!

A mayor intelligence agency announced recently to replace human administrators by "software". Their job is infrastructure profusion. Government was removed from controlling post latest in 2001. Competing agencies know that the current development points directly towards AGI that disrespects human property rights - they have to strive for similar technology.

Wouldn't most AGI goals disregard property rights unless it was explicitly built in? And if it was built in, wouldn't an AGI just create a situation (eg. progressive blackmail or deception or something) where we wanted to sell it the universe for a dollar?

Sorry for question out of this particular topic.

When we started to discuss I liked and proposed idea to make wiki page with results from our discussion. Do you think that we have any ideas which are collectible in collaboratory wiki page?

I think we have at least one - paulfchristiano's "cheated evolution" : http://lesswrong.com/r/discussion/lw/l10/superintelligence_reading_group_3_ai_and_uploads/bea7

Could you add more?

Maybe people shouldn't make Superintelligence at all? Narrow AIs are just fine if you consider the progress so far. Self-driving cars will be good, then applications using Big Data will find cures for most illnesses, then solve starvation and other problems by 3D printing foods and everything else, including rockets to deflect asteroids. Just give 10-20 more years only. Why to create dangerous SI?

Because if you don't, someone else will.

Not obviously true. An alternative which immediately comes to my mind is a globally enforced mutual agreement to refrain from building superintelligences.

(Yes, that alternative is unrealistic if making superintelligences turns out to be too easy. But I'd want to see that premise argued for, not passed over in silence.)

The more general problem is that we need a solution to multi-polar traps (of which superintelligent AI creationg is one instance). The only viable solution I've seen proposed is creating a sufficiently powerful Singleton.

The only likely viable ideas for Singletons I've seen proposed are superintelligent AIs, and a human group with extensive use of thought-control technologies on itself. The latter probably can't work unless you apply it to all of society, since it doesn't have the same inherent advantages AI does, and as such would remain vulnerable to being usurped by a clandestingly constructed AI. Applying the latter to all of society, OTOH, would most likely cause massive value loss.

Therefore I'm in favor of the former; not because I like the odds, but because the alternatives look worse.

Totally agree, and I wish this opinion was voiced more on LW rather than the emphasis on trying to make a friendly self improving AI. For this to make sense though I think the human race needs to become a singleton, although perhaps that is what Google's acquisitions and massive government surveillance is already doing.

Yes, continued development of AI seems unstoppable. But this brings up another very good point: if humanity cannot become a Singleton in our search for good egalitarian shared values, what is the chance of creating FAI? After years of good work in that direction and perhaps even success in determining a good approximation, what prevents some powerful secret entity like the CIA from hijacking it at the last minute and simply narrowing its objectives for something it determines is a "greater" good?

Our objectives are always better than the other guy's, and while violence is universally despicable, it is fast, cheap, easy to program and the other guy (including FAI developers) won't be expecting it. For the guy running the controls, that's friendly enough. :-)

On one hand, I think the world is already somewhat close to a singleton (with regard to AI, obviously it is nowhere near singleton with regard to most other things). I mean google has a huge fraction of the AI talent. The US government has a huge fraction of the mathematics talent. Then, there is Microsoft, FB, Baidu, and a few other big tech companies. But every time an independent AI company gains some traction it seems to be bought out by the big guys. I think this is a good thing as I believe the big guys will act in there own best interest including their interest in preserving their own life (i.e., not ending the world). Of course if it is easy to make an AGI, then there is no hope anyway. But, if it requires companies of Google scale, then there is hope they will choose to avoid it.

The "own best interest" in a winner- takes-all scenario is to create an eternal monopoly on everything. All levels of Maslow's pyramide of human needs will be served by goods and services supplied by this singleton.

Stuart Russell, in his comment on the Edge.org AI discussion, offered a concise mathematical description of perverse instantiation, and seems to suggest that it is likely to occur:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

I'm curious if there's more information about this behavior occurring in practice.

15

Superintelligence 12: Malignant failure modes

15

Summary

Another view

Notes

In-depth investigations

How to proceed

15