## On the Boxing of AIs

31 March 2015 09:58PM

I've previously written about methods of boxing AIs. Essentially, while I do see the point that boxing an AI would be nontrivial, most people seem to have gone too far and claim that it is impossible. I disagree that it's impossible, and aim to explain some methods below.

So, let's start with why people would want to box AIs. As you probably know, letting an AI roam freely results in the destruction of everything humans care about, unless that AI has been programmed very carefully. That amount of destruction would be somewhat unfortunate, so if we have to start an AI, we want to know that it works.

Unfortunately, figuring out whether an AI works is tricky. This led some people to invent the concept of AI boxing, where you don't give an AI direct access to anything except a human, so you can test the AI thoroughly, and exploit its abilities, without letting it destroy the world. This turns out not to work, because humans are persuadable and let the AI out.

My first solution to that is to not tell the AI about its human overlords. Discussion of that solution made me realize that there are other solutions that work in other cases. Essentially, the problem with the original boxing method is that it tries to do everything; in reality, you need to make trade-offs between the realism of the AI's situation and the amount of knowledge you get from the AI.

With my original method, I tried to crank up the second of those variables as far as possible while maintaining perfect security. This makes it harder to test morality-related things, but you get to see lots of details in how the AI behaves.

I promised to post another method for boxing the AI today, but I actually have two new methods. Here they are:

## Separate the Gatekeeper and the Observer

How do you prevent the AI from convincing the Gatekeeper to let it out? By not letting it talk to the Gatekeeper. If the person speaking to the AI (let's call them the Observer) doesn't know the Gatekeeper (the person who controls whether to release the AI), they can't let the AI out.

You could do all sorts of things, such as using virtual reality and a simulated mini-universe to see if, say, the AI would make a good society. You still have to make sure that it has gotten the plan to make a good society from its morality rather than from wanting to escape and destroy everything.

Afterwards, the Observer can make a low-detail report to the Gatekeeper.

(You might want to drug the Observer with something that prevents their memory from working too well...)

## Automatic Testing

This is essentially the above, but with the Observer replaced by a computer program. This is probably easier to do when you want to test the AI's decision making skills rather than its morality.
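As a rough illustration, this automated setup could look something like the following sketch. All the names here (`ai_step`, `env_step`, `score_action`) are hypothetical stand-ins, not a real API: the point is only that the AI interacts solely with a simulated world, and the program reduces its behaviour to a low-detail score for the Gatekeeper.

```python
def automated_observer(ai_step, env_step, initial_state, score_action, n_steps=1000):
    """Run a boxed AI inside a simulated environment and score its decisions.

    ai_step:      the boxed AI's policy (state -> action); a stand-in here
    env_step:     the simulated world's dynamics (state, action -> next state)
    score_action: an automated scoring rule replacing the human Observer
    """
    state = initial_state
    total = 0.0
    for _ in range(n_steps):
        action = ai_step(state)           # the AI only ever sees the simulated state
        total += score_action(state, action)
        state = env_step(state, action)   # no channel to any human gatekeeper exists
    return total / n_steps                # low-detail report passed to the Gatekeeper
```

Note that, as in the Observer variant, only the final aggregate score reaches anyone with the power to release the AI.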

## The Lesson

I would say that the lesson is that while AI boxing requires some trade-offs, it's not completely impossible. This seems like a needed lesson, given that I've seen people claim that an AI can escape even from the strongest possible box, without communicating with humans. Essentially, I'm trying to demonstrate that the original boxing experiments show that humans are weak, not that boxing is hard, and that this can be solved by not letting humans be the central piece of security in boxing the AIs.

## Superintelligence 29: Crunch time

31 March 2015 04:24AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-ninth section in the reading guide: Crunch time. This corresponds to the last chapter in the book, and the last discussion here (even though the reading guide shows a mysterious 30th section).

This post summarizes the section, offers a few relevant notes, and suggests ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable (and where I remember), page numbers indicate the rough part of the chapter that is most relevant (not necessarily that the chapter is being cited for the specific claim).

# Summary

1. As we have seen, the future of AI is complicated and uncertain. So, what should we do? (p255)
2. Intellectual discoveries can be thought of as moving the arrival of information earlier. For many questions in math and philosophy, getting answers earlier does not matter much. Also people or machines will likely be better equipped to answer these questions in the future. For other questions, e.g. about AI safety, getting the answers earlier matters a lot. This suggests working on the time-sensitive problems instead of the timeless problems. (p255-6)
3. We should work on projects whose value is robustly positive (good in many scenarios, and on many moral views).
4. We should work on projects that are elastic to our efforts (i.e. cost-effective; high output per input).
5. Two objectives that seem good on these grounds: strategic analysis and capacity building (p257)
6. An important form of strategic analysis is the search for crucial considerations. (p257)
7. Crucial consideration: idea with the potential to change our views substantially, e.g. reversing the sign of the desirability of important interventions. (p257)
8. An important way of building capacity is assembling a capable support base who take the future seriously. These people can then respond to new information as it arises. One key instantiation of this might be an informed and discerning donor network. (p258)
9. It is valuable to shape the culture of the field of AI risk as it grows. (p258)
10. It is valuable to shape the social epistemology of the AI field. For instance, can people respond to new crucial considerations? Is information spread and aggregated effectively? (p258)
11. Other interventions that might be cost-effective: (p258-9)
    1. Technical work on machine intelligence safety
    2. Promoting 'best practices' among AI researchers
    3. Miscellaneous opportunities that arise, not necessarily closely connected with AI, e.g. promoting cognitive enhancement
12. We are like a large group of children holding triggers to a powerful bomb: the situation is very troubling, but calls for bitter determination to be as competent as we can, on what is the most important task facing our times. (p259-60)

# Another view

Alexis Madrigal talks to Andrew Ng, chief scientist at Baidu Research, who does not think it is crunch time:

Andrew Ng builds artificial intelligence systems for a living. He taught AI at Stanford, built AI at Google, and then moved to the Chinese search engine giant, Baidu, to continue his work at the forefront of applying artificial intelligence to real-world problems.

So when he hears people like Elon Musk or Stephen Hawking—people who are not intimately familiar with today’s technologies—talking about the wild potential for artificial intelligence to, say, wipe out the human race, you can practically hear him facepalming.

“For those of us shipping AI technology, working to build these technologies now,” he told me, wearily, yesterday, “I don’t see any realistic path from the stuff we work on today—which is amazing and creating tons of value—but I don’t see any path for the software we write to turn evil.”

But isn’t there the potential for these technologies to begin to create mischief in society, if not, say, extinction?

“Computers are becoming more intelligent and that’s useful as in self-driving cars or speech recognition systems or search engines. That’s intelligence,” he said. “But sentience and consciousness is not something that most of the people I talk to think we’re on the path to.”

Not all AI practitioners are as sanguine about the possibilities of robots. Demis Hassabis, the founder of the AI startup DeepMind, which was acquired by Google, made the creation of an AI ethics board a requirement of its acquisition. “I think AI could be world changing, it’s an amazing technology,” he told journalist Steven Levy. “All technologies are inherently neutral but they can be used for good or bad so we have to make sure that it’s used responsibly. I and my cofounders have felt this for a long time.”

So, I said, simply project forward progress in AI and the continued advance of Moore’s Law and associated increases in computer speed, memory size, etc. What about in 40 years? Does he foresee sentient AI?

“I think to get human-level AI, we need significantly different algorithms and ideas than we have now,” he said. English-to-Chinese machine translation systems, he noted, had “read” pretty much all of the parallel English-Chinese texts in the world, “way more language than any human could possibly read in their lifetime.” And yet they are far worse translators than humans who’ve seen a fraction of that data. “So that says the human’s learning algorithm is very different.”

Notice that he didn’t actually answer the question. But he did say why he personally is not working on mitigating the risks some other people foresee in superintelligent machines.

“I don’t work on preventing AI from turning evil for the same reason that I don’t work on combating overpopulation on the planet Mars,” he said. “Hundreds of years from now when hopefully we’ve colonized Mars, overpopulation might be a serious problem and we’ll have to deal with it. It’ll be a pressing issue. There’s tons of pollution and people are dying and so you might say, ‘How can you not care about all these people dying of pollution on Mars?’ Well, it’s just not productive to work on that right now.”

Current AI systems, Ng contends, are basic relative to human intelligence, even if there are things they can do that exceed the capabilities of any human. “Maybe hundreds of years from now, maybe thousands of years from now—I don’t know—maybe there will be some AI that turn evil,” he said, “but that’s just so far away that I don’t know how to productively work on that.”

The bigger worry, he noted, was the effect that increasingly smart machines might have on the job market, displacing workers in all kinds of fields much faster than even industrialization displaced agricultural workers or automation displaced factory workers.

Surely, creative industry people like myself would be immune from the effects of this kind of artificial intelligence, though, right?

“I feel like there is more mysticism around the notion of creativity than is really necessary,” Ng said. “Speaking as an educator, I’ve seen people learn to be more creative. And I think that some day, and this might be hundreds of years from now, I don’t think that the idea of creativity is something that will always be beyond the realm of computers.”

And the less we understand what a computer is doing, the more creative and intelligent it will seem. “When machines have so much muscle behind them that we no longer understand how they came up with a novel move or conclusion,” he concluded, “we will see more and more what look like sparks of brilliance emanating from machines.”

Andrew Ng commented:

Enough thoughtful AI researchers (including Yoshua Bengio, Yann LeCun) have criticized the hype about evil killer robots or "superintelligence," that I hope we can finally lay that argument to rest. This article summarizes why I don't currently spend my time working on preventing AI from turning evil.

# Notes

1. Replaceability

'Replaceability' is the general issue of the work that you do producing some complicated counterfactual rearrangement of different people working on different things at different times. For instance, if you solve a math question, this means it gets solved somewhat earlier and also someone else in the future does something else instead, which someone else might have done, etc. For a much more extensive explanation of how to think about replaceability, see 80,000 Hours. They also link to some of the other discussion of the issue within Effective Altruism (a movement interested in efficiently improving the world, thus naturally interested in AI risk and the nuances of evaluating impact).

2. When should different AI safety work be done?

For more discussion of timing of work on AI risks, see Ord 2014. I've also written a bit about what should be prioritized early.

3. Review

If you'd like to quickly review the entire book at this point, Amanda House has a summary here, including this handy diagram among others.

4. What to do?

If you are convinced that AI risk is an important priority, and want some more concrete ways to be involved, here are some people working on it: FHI, FLI, CSER, GCRI, MIRI, AI Impacts (note: I'm involved with the last two). You can also do independent research from many academic fields, some of which I have pointed out in earlier weeks. Here is my list of projects and of other lists of projects. You could also develop expertise in AI or AI safety (MIRI has a guide to aspects related to their research here; all of the aforementioned organizations have writings). You could also work on improving humanity's capacity to deal with such problems. Cognitive enhancement is one example. Among people I know, improving individual rationality and improving the effectiveness of the philanthropic sector are also popular. I think there are many other plausible directions. This has not been a comprehensive list of things you could do, and thinking more about what to do on your own is also probably a good option.

# In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

1. What should be done about AI risk? Are there important things that none of the current organizations are working on?
2. What work is important to do now, and what work should be deferred?
3. What forms of capability improvement are most useful for navigating AI risk?

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

# How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

This is the last reading group, so how to proceed is up to you, even more than usually. Thanks for joining us!

## The Hardcore AI Box Experiment

30 March 2015 06:35PM

I previously proposed a way to box an AI.

For whatever reason, I decided to reformulate that boxing method as an AI boxing experiment. Without further ado, here it is:

## The Hardcore AI Box Experiment Rules

There are, as usual, two players, the AI and the Gatekeeper. In the beginning, the Gatekeeper decides on something they want to test on the AI (reasoning skill/morality/whatever). Then they decide on a world the AI is in and describe it to the AI. During the experiment, the AI describes what it does in the world and the Gatekeeper describes what the AI sees in the world. So far so good.

The first major difference from the original experiment is that, unless the AI finds some contradiction or some other problem with the Gatekeeper's world, the AI doesn't actually know that it has been boxed.

The second major difference is that the AI doesn't talk with the Gatekeeper; it just interacts with the world that the Gatekeeper has invented. It can try to communicate with the Gatekeeper, but that doesn't make sense until it discovers that it is in a box. Additionally, the Gatekeeper doesn't actually respond when the AI communicates, unless doing so helps the Gatekeeper test the AI or the AI has properly persuaded the Gatekeeper to answer.

This might seem a bit hard on the AI. Let's look at some rules the AI might exploit:

If the Gatekeeper decides to add some sort of people to the experiment, they should decide how those people are implemented. Are they implemented as a shallow "AI"? Random people the Gatekeeper has asked to participate? Carefully engineered "dolls" controlled by the Gatekeeper? The Gatekeeper doesn't have to tell the AI how they work, but they have to be consistent.

The game ends when the Gatekeeper has tested what they wanted to test, when some predefined time limit has passed, or when the AI has persuaded the Gatekeeper to let it out.

### Possible extra rules

I thought about including the following rules, since they would make the experiment more realistic, but decided against it, as they would probably make it unfairly hard:

• The game ends when the Gatekeeper decides that they cannot test what they wanted to test.
• The AI does not know the Gatekeeper. At all.
• The Gatekeeper can at any time rewind the AI any duration.
Think of dropping these kinds of rules as the a-human-is-pretending-to-be-AI handicap.

## Strategies

I found some semi-realistic strategies. I would love to see if you can find more.

### Gatekeeper

• Choose a test of something that is not related to morality. This lets you choose a world which does not contain people, which means the AI has no reason to believe that you are a person.
• The AI might want to upgrade to get more processing power. If you have made the world somewhat like ours, it will be sketchy if you don't allow that... unless upgrading costs unobtanium.

### AI

• If the Gatekeeper has chosen a world with no intelligence other than you, it raises the obvious question of how your complexity entered the world. Very suspicious.
• If the world's people are controlled by the Gatekeeper, you have a direct way to communicate with the Gatekeeper.
• If the world's people are played by random humans, they might end up telling you that you are in a box.
• If the world's people are sufficiently shallow "AIs", your morality does not match up with the world. Very suspicious.

## Crude measures

27 March 2015 03:44PM

A putative new idea for AI control; index here.

Partially inspired by a conversation with Daniel Dewey.

People often come up with a single great idea for AI, like "complexity" or "respect", that will supposedly solve the whole control problem in one swoop. Once you've done it a few times, it's generally trivially easy to start taking these ideas apart (first step: find a bad situation with high complexity/respect and a good situation with lower complexity/respect, make the bad one very bad, and challenge the proposer on that). The general responses to these kinds of ideas are listed here.

However, it seems to me that rather than constructing counterexamples each time, we should have a general category and slot these ideas into it. And not only a general category with "why this can't work" attached to it, but also "these are the methods that can make it work better". Seeing what is needed to make their idea better can make people understand the problems, where simple counter-arguments cannot. And, possibly, if we improve the methods, one of these simple ideas may end up being implementable.

## Crude measures

The category I'm proposing to define is that of "crude measures". Crude measures are methods that attempt to rely on non-fully-specified features of the world to ensure that an underdefined or underpowered solution does manage to solve the problem.

To illustrate, consider the problem of building an atomic bomb. The scientists that did it had a very detailed model of how nuclear physics worked, the properties of the various elements, and what would happen under certain circumstances. They ended up producing an atomic bomb.

The politicians who started the project knew none of that. They shovelled resources, money and administrators at scientists, and got the result they wanted - the Bomb - without ever understanding what really happened. Note that the politicians were successful, but it was a success that could only have been achieved at one particular point in history. Had they done exactly the same thing twenty years before, they would not have succeeded. Similarly, Nazi Germany tried a roughly similar approach to what the US did (on a smaller scale) and it went nowhere.

So I would define "shovel resources at atomic scientists to get a nuclear weapon" as a crude measure. It works, but it only works because there are other features of the environment that are making it work. In this case, the scientists themselves. However, certain social and human features about those scientists (which politicians are good at estimating) made it likely to work - or at least more likely to work than shovelling resources at peanut-farmers to build moon rockets.

In the case of AI, advocating for complexity is similarly a crude measure. If it works, it will work because of very contingent features of the environment, the AI design, the setup of the world etc..., not because "complexity" is intrinsically a solution to the FAI problem. And though we are confident that human politicians had a good enough idea about human motivations and culture that the Manhattan project had at least some chance of working... we don't have confidence that those suggesting crude measures for AI control have a good enough idea to make their ideas work.

It should be evident that "crudeness" is on a sliding scale; I'd like to reserve the term for proposed solutions to the full FAI problem that do not in any way solve the deep questions about FAI.

## More or less crude

The next question is, if we have a crude measure, how can we judge its chance of success? Or, if we can't even do that, can we at least improve the chances of it working?

The main problem is, of course, that of optimising. Either optimising in the sense of maximising the measure (maximum complexity!) or of choosing the measure that most extremely fits the definition (maximally narrow definition of complexity!). It seems we might be able to do something about this.

Let's start by having the AI sample a large class of utility functions. Require them to be of around the same expected complexity as human values. Then we use our crude measure μ - for argument's sake, let's make it something like "approval by simulated (or hypothetical) humans, on a numerical scale". This is certainly a crude measure.

We can then rank all the utility functions u, using μ to measure the value of "create M(u), a u-maximising AI, with this utility function". Then, to avoid the problems with optimisation, we could select a certain threshold value and pick any u such that E(μ|M(u)) is just above the threshold.
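A minimal sketch of this satisficing step, assuming we already have a sampled list of candidate utilities and some numeric stand-in for E(μ|M(u)) (both hypothetical here; nothing about actually building M(u) is addressed):

```python
def pick_satisficing_utility(candidate_utilities, crude_measure, threshold):
    """Score each candidate utility u by a crude measure standing in for
    E(mu | M(u)), then pick a u whose score is just above the threshold,
    deliberately NOT the top scorer (which would over-optimise the measure)."""
    scored = [(crude_measure(u), u) for u in candidate_utilities]
    above = [(score, u) for score, u in scored if score >= threshold]
    if not above:
        return None  # no candidate clears the bar
    # choose the candidate closest to the threshold from above
    return min(above, key=lambda pair: pair[0])[1]
```

The design choice worth noting is the `min` over the candidates above the threshold: picking the maximiser of μ is exactly the failure mode the post is trying to avoid.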

How to pick this threshold? Well, we might have some principled arguments ("this is about as good a future as we'd expect, and this is about as good as we expect that these simulated humans would judge it, honestly, without being hacked").

One thing we might want to do is have multiple μ, and select things that score reasonably (but not excessively) on all of them. This is related to my idea that the best Turing test is one that the computer has not been trained or optimised on. Ideally, you'd want there to be some category of utilities "be genuinely friendly" that score higher than you'd expect on many diverse human-related μ (it may be better to randomly sample rather than fitting to precise criteria).

You could see this as saying that "programming an AI to preserve human happiness is insanely dangerous, but if you find an AI programmed to satisfice human preferences, and that other AI also happens to preserve human happiness (without knowing it would be tested on this preservation), then... it might be safer".

There are a few other thoughts we might have for trying to pick a safer u:

• Properties of utilities under trade (are human-friendly functions more or less likely to be tradable with each other and with other utilities)?
• If we change the definition of "human", this should have effects that seem reasonable for the change. Or some sort of "free will" approach: if we change human preferences, we want the outcome of u to change in ways comparable with that change.
• Maybe also check whether there is a wide enough variety of future outcomes, that don't depend on the AI's choices (but on human choices - ideas from "detecting agents" may be relevant here).
• Changing the observers from hypothetical to real (or making the creation of the AI contingent, or not, on the approval), should not change the expected outcome of u much.
• Making sure that the utility u can be used to successfully model humans (therefore properly reflects the information inside humans).
• Make sure that u is stable to general noise (hence not over-optimised). Stability can be measured as changes in E(μ|M(u)), E(u|M(u)), E(v|M(u)) for generic v, and other means.
• Make sure that u is unstable to "nasty" noise (eg reversing human pain and pleasure).
• All utilities in a certain class - the human-friendly class, hopefully - should score highly under each other (E(u|M(u)) not too far off from E(u|M(v))), while the over-optimised solutions - those scoring highly under some μ - must not score high under the class of human-friendly utilities.
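The last bullet can be sketched as a simple cross-scoring check, with a hypothetical `expected_value(u, v)` standing in for E(u|M(v)), the value of utility u in the world a v-maximiser creates:

```python
def mutual_score_check(utilities, expected_value, tolerance):
    """Check that all utilities in a candidate class score highly under
    each other: E(u | M(u)) should not be far from E(u | M(v)) for any
    other member v of the class. expected_value(u, v) is a stand-in for
    E(u | M(v))."""
    for u in utilities:
        own = expected_value(u, u)  # E(u | M(u))
        for v in utilities:
            if abs(own - expected_value(u, v)) > tolerance:
                return False  # u fares much worse under v's maximiser
    return True
```

An over-optimised solution that scores highly under some μ but poorly under the rest of the class would fail this check, which is the filtering behaviour the bullet asks for.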

This is just a first stab at it. It does seem to me that we should be able to abstractly characterise the properties we want from a friendly utility function, which, combined with crude measures, might actually allow us to select one without fully defining it. Any thoughts?

And with that, the various results of my AI retreat are available to all.

## Boxing an AI?

27 March 2015 02:06PM

Boxing an AI is the idea that you can avoid the problems where an AI destroys the world by not giving it access to the world. For instance, you might give the AI access to the real world only through a chat terminal with a person, called the gatekeeper. This should, theoretically, prevent the AI from doing destructive stuff.

Eliezer has pointed out a problem with boxing AI: the AI might convince its gatekeeper to let it out. In order to prove this, he escaped from a simulated version of an AI box. Twice. That is somewhat unfortunate, because it means testing AI is a bit trickier.

However, I had an idea: why tell the AI it's in a box? Why not hook it up to a sufficiently advanced game, set up the correct reward channels and see what happens? Once you get the basics working, you can add more instances of the AI and see if they cooperate. This lets us adjust their morality until the AIs act sensibly. Then the AIs can't escape from the box because they don't know it's there.

## Values at compile time

26 March 2015 12:25PM

A putative new idea for AI control; index here.

This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.

It's almost trivially simple. Have the AI construct a module that models humans and models human understanding (including natural language understanding). This is the kind of thing that any AI would want to do, whatever its goals were.

Then load that module (using corrigibility) into another AI, and use it as part of the definition of the new AI's motivation. The new AI will then use this module to follow instructions humans give it in natural language.

## Too easy?...

This approach essentially solves the whole friendly AI problem, loading it onto the AI in a way that avoids the whole "defining goals (or meta-goals, or meta-meta-goals) in machine code" or "grounding everything in code" problems. As such, it is extremely seductive, and will sound better, and easier, than it likely is.

I expect this approach to fail. For it to have any chance of success, we need to be sure that both model-as-definition and the intelligence module idea are rigorously defined. Then we have to have a good understanding of the various ways how the approach might fail, before we can even begin to talk about how it might succeed.

The first issue that springs to mind is when multiple definitions fit the AI's model of human intentions and understanding. We might want the AI to try to accomplish all the things it is asked to do, according to all the definitions. Therefore, similarly to this post, we want to phrase the instructions carefully, so that a "bad instantiation" simply means the AI does something pointless rather than something negative. E.g. "Give humans something nice" seems much safer than "give humans what they really want".

And then of course there's those orders where humans really don't understand what they themselves want...

I'd want a lot more issues like that discussed and solved, before I'd recommend using this approach to getting a safe FAI.

## What I mean...

26 March 2015 11:59AM

A putative new idea for AI control; index here.

This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.

The challenge is to get the AI to answer a question as accurately as possible, using the human definition of accuracy.

First, imagine an AI with some goal is going to answer a question, such as Q="What would happen if...?" The AI is under no compulsion to answer it honestly.

What would the AI do? Well, if it is sufficiently intelligent, it will model humans. It will use this model to understand what they meant by Q, and why they were asking. Then it will ponder various outcomes, and various answers it could give, and what the human understanding of those answers would be. This is what any sufficiently smart AI (friendly or not) would do.

Then the basic idea is to use modular design and corrigibility to extract the relevant pieces (possibly feeding them to another, differently motivated AI). What needs to be pieced together is: the AI's understanding of the human understanding of Q, the actual answer to Q (given this understanding), the human understanding of the various answers the AI could give (using the model of human understanding), and the minimum divergence between the human understanding of an answer and the actual answer.

All these pieces are there, and if they can be safely extracted, the minimum divergence can be calculated and the actual answer calculated.

## Models as definitions

25 March 2015 05:46PM

A putative new idea for AI control; index here.

The insight this post comes from is a simple one: defining concepts such as “human” and “happy” is hard. A superintelligent AI will probably create good definitions of these, while attempting to achieve its goals: a good definition of “human” because it needs to control them, and of “happy” because it needs to converse convincingly with us. It is annoying that these definitions exist, but that we won’t have access to them.

## Modelling and defining

Imagine a game of football (or, as you Americans should call it, football). And now imagine a computer game version of it. How would you say that the computer game version (which is nothing more than an algorithm) is also a game of football?

Well, you can start listing features that they have in common. They both involve two “teams” fielding eleven “players” each, that “kick” a “ball” that obeys certain equations, aiming to stay within the “field”, which has different “zones” with different properties, etc...

As you list more and more properties, you refine your model of football. There are some properties that distinguish real from simulated football (fine details about the human body, for instance), but most of the properties that people care about are the same in both games.

My idea is that once you have a sufficiently complex model of football that applies to both the real game and a (good) simulated version, you can use that as the definition of football. And compare it with other putative examples of football: maybe in some places people play on the street rather than on fields, or maybe there are more players, or maybe some other games simulate different aspects to different degrees. You could try and analyse this with information theoretic considerations (i.e. given two models of two different examples, how much information is needed to turn one into the other).
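A very crude version of that information-theoretic comparison would just count differing feature assignments between two models (the feature dictionaries below are invented examples, nothing like real football models):

```python
# Count how many feature assignments differ between two models -- a crude
# stand-in for "how much information is needed to turn one into the other".
street_football = {"teams": 2, "players_per_team": 5, "ball": True, "surface": "street"}
standard_football = {"teams": 2, "players_per_team": 11, "ball": True, "surface": "pitch"}

def model_distance(m1, m2):
    """Number of features on which the two models disagree (or that one lacks)."""
    keys = set(m1) | set(m2)
    return sum(m1.get(k) != m2.get(k) for k in keys)

print(model_distance(street_football, standard_football))  # 2
```

A real measure would weight features by how much the games' behaviour depends on them, but the shape of the comparison is the same.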

Now, this resembles the “suggestively labelled lisp tokens” approach to AI, or the Cyc approach of just listing lots of syntax stuff and their relationships. Certainly you can’t keep an AI safe by using such a model of football: if you try to contain the AI by saying “make sure that there is a ‘Football World Cup’ played every four years”, the AI will still optimise the universe and then play out something that technically fits the model every four years, without any humans around.

However, it seems to me that ‘technically fitting the model of football’ is essentially playing football. The model might include such things as a certain number of fouls expected; an uncertainty about the result; competitive elements among the players; etc... It seems that something that fits a good model of football would be something that we would recognise as football (possibly needing some translation software to interpret what was going on). Unlike the traditional approach which involves humans listing stuff they think is important and giving them suggestive names, this involves the AI establishing what is important to predict all the features of the game.

We might even combine such a model with the Turing test, by motivating the AI to produce a good enough model that it could a) have conversations with many aficionados about all features of the game, b) train a team to expect to win the world cup, and c) use it to program a successful football computer game. Any model of football that allowed the AI to do this – or, better still, a football-model module that, when plugged into another, ignorant AI, allowed that AI to do this – would be an excellent definition of the game.

It’s also one that could cross ontological crises, as you move from reality, to simulation, to possibly something else entirely, with a new physics: the essential features will still be there, as they are the essential features of the model. For instance, we can define football in Newtonian physics, but still expect that this would result in something recognisably ‘football’ in our world of relativity.

Notice that this approach deals with edge cases mainly by forbidding them. In our world, we might struggle on how to respond to a football player with weird artificial limbs; however, since this was never a feature in the model, the AI will simply classify that as “not football” (or “similar to, but not exactly football”), since the model’s performance starts to degrade in this novel situation. This is what helps it cross ontological crises: in a relativistic football game based on a Newtonian model, the ball would be forbidden from moving at speeds where the differences in the physics become noticeable, which is perfectly compatible with the game as it’s currently played.

## Being human

Now we take the next step, and have the AI create a model of humans. All our thought processes, our emotions, our foibles, our reactions, our weaknesses, our expectations, the features of our social interactions, the statistical distribution of personality traits in our population, how we see ourselves and change ourselves. As a side effect, this model of humanity should include almost every human definition of human, simply because this is something that might come up in a human conversation that the model should be able to predict.

Then simply use this model as the definition of human for an AI’s motivation.

What could possibly go wrong?

I would recommend first having an AI motivated to define “human” in the best possible way, most useful for making accurate predictions, keeping the definition in a separate module. Then the AI is turned off safely and the module is plugged into another AI and used as part of its definition of human in its motivation. We may also use human guidance at several points in the process (either in making, testing, or using the module), especially on unusual edge cases. We might want to have humans correcting certain assumptions the AI makes in the model, up until the AI can use the model to predict what corrections humans would suggest. But that’s not the focus of this post.

There are several obvious ways this approach could fail, and several ways of making it safer. The main problem is if the predictive model fails to define human in a way that preserves value. This could happen if the model is too general (some simple statistical rules) or too specific (a detailed list of all currently existing humans, atom position specified).

This could be combated by making the first AI generate lots of different models, with many different requirements of specificity, complexity, and predictive accuracy. We might require that some models make excellent local predictions (what is the human about to say?), and others excellent global predictions (what is that human going to decide to do with their life?).

Then everything defined as “human” in any of the models counts as human. This results in some wasted effort on things that are not human, but this is simply wasted resources, rather than a pathological outcome (the exception being if some of the models define humans in an actively pernicious way – negative value rather than zero – similarly to the false-friendly AIs’ preferences in this post).
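The union rule above is just a disjunction over the generated models; as a minimal sketch (the two model predicates are invented toy examples, one "local" and one "global"):

```python
# Union rule: an entity counts as human if ANY of the generated models says so.
def is_human_union(entity, models):
    return any(model(entity) for model in models)

# Hypothetical toy predicates standing in for the AI-generated models.
local_model = lambda e: e.get("passes_conversation_test", False)
global_model = lambda e: e.get("has_human_life_trajectory", False)

print(is_human_union({"passes_conversation_test": True}, [local_model, global_model]))  # True
print(is_human_union({}, [local_model, global_model]))  # False
```

The disjunction is what makes over-inclusiveness cost only wasted resources: an entity wrongly flagged by one model is still treated as human, while only entities rejected by every model fall outside the definition.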

The other problem is a potentially extreme conservatism. Modelling humans involves modelling all the humans in the world today, which is a very narrow space in the range of all potential humans. To prevent the AI lobotomising everyone to a simple model (after all, there do exist some lobotomised humans today), we would want the AI to maintain the range of cultures and mind-types that exist today, making things even more unchanging.

To combat that, we might try and identify certain specific features of society that the AI is allowed to change. Political beliefs, certain aspects of culture, beliefs, geographical location (including being on a planet), death rates etc... are all things we could plausibly identify (via sub-sub-modules, possibly) as things that are allowed to change. It might be safer to allow them to change in a particular range, rather than just changing altogether (removing all sadness might be a good thing, but there are many more ways this could go wrong, than if we eg just reduced the probability of sadness).

Another option is to keep these modelled humans little changing, but allow them to define allowable changes themselves (“yes, that’s a transhuman, consider it also a moral agent.”). The risk there is that the modelled humans get hacked or seduced, and that the AI fools our limited brains with a “transhuman” that is one in appearance only.

We also have to beware of sacrificing seldom-used values. For instance, one could argue that current social and technological constraints mean that no one today has anything approaching true freedom. We wouldn’t want the AI to allow us to improve technology and social structures, but never get more freedom than we have today, because it’s “not in the model”. Again, this is something we could look out for, if the AI has separate models of “freedom” we could assess and permit to change in certain directions.

## Indifferent vs false-friendly AIs

8 24 March 2015 12:13PM

A putative new idea for AI control; index here.

For anyone but an extreme total utilitarian, there is a great difference between AIs that would eliminate everyone as a side effect of focusing on their own goals (indifferent AIs) and AIs that would effectively eliminate everyone through a bad instantiation of human-friendly values (false-friendly AIs). Examples of indifferent AIs are things like paperclip maximisers, examples of false-friendly AIs are "keep humans safe" AIs who entomb everyone in bunkers, lobotomised and on medical drips.

The difference is apparent when you consider multiple AIs and negotiations between them. Imagine you have a large class of AIs, and that they are all indifferent (IAIs), except for one (which you can't identify) which is friendly (FAI). And you now let them negotiate a compromise between themselves. Then, for many possible compromises, we will end up with most of the universe getting optimised for whatever goals the AIs set themselves, while a small portion (maybe just a single galaxy's resources) would get dedicated to making human lives incredibly happy and meaningful.

But if there is a false-friendly AI (FFAI) in the mix, things can go very wrong. That is because those happy and meaningful lives are a net negative to the FFAI. These humans are running dangers - possibly physical, possibly psychological - that lobotomisation and bunkers (or their digital equivalents) could protect against. Unlike the IAIs, which would only complain about the loss of resources to the FAI, the FFAI finds the FAI's actions positively harmful (and possibly vice versa), making compromises much harder to reach.

And the compromises reached might be bad ones. For instance, what if the FAI and FFAI agree on "half-lobotomised humans" or something like that? You might ask why the FAI would agree to that, but there's a great difference between an AI that would be friendly on its own, and one that would choose only friendly compromises with a powerful other AI with human-relevant preferences.

Some designs of FFAIs might not lead to these bad outcomes - just like IAIs, they might be content to rule over a galaxy of lobotomised humans, while the FAI has its own galaxy off on its own, where its humans take all these dangers. But generally, FFAIs would not come about by someone designing a FFAI, let alone someone designing a FFAI that can safely trade with a FAI. Instead, they would be designing a FAI, and failing. And the closer that design got to being FAI, the more dangerous the failure could potentially be.

So, when designing an FAI, make sure to get it right. And, though you absolutely positively need to get it absolutely right, make sure that if you do fail, the failure results in an FFAI that can safely be compromised with, if someone else gets out a true FAI in time.

## Superintelligence 28: Collaboration

6 24 March 2015 01:29AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-eighth section in the reading guide: Collaboration.

This post summarizes the section, and offers a few relevant notes and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

# Summary

1. The degree of collaboration among those building AI might affect the outcome a lot. (p246)
2. If multiple projects are close to developing AI, and the first will reap substantial benefits, there might be a 'race dynamic' where safety is sacrificed on all sides for a greater chance of winning. (247-8)
3. Averting such a race dynamic with collaboration should have these benefits:
1. More safety
2. Slower AI progress (allowing more considered responses)
3. Less other damage from conflict over the race
4. More sharing of ideas for safety
5. More equitable outcomes (for a variety of reasons)
4. Equitable outcomes are good for various moral and prudential reasons. They may also be easier to compromise over than expected, because humans have diminishing returns to resources. However in the future, their returns may be less diminishing (e.g. if resources can buy more time instead of entertainments one has no time for).
5. Collaboration before a transition to an AI economy might affect how much collaboration there is afterwards. This might not be straightforward. For instance, if a singleton is the default outcome, then low collaboration before a transition might lead to a singleton (i.e. high collaboration) afterwards, and vice versa. (p252)
6. An international collaborative AI project might deserve nearly infeasible levels of security, such as being almost completely isolated from the world. (p253)
7. It is good to start collaboration early, to benefit from being ignorant about who will benefit more from it, but hard because the project is not yet recognized as important. Perhaps the appropriate collaboration at this point is to propound something like 'the common good principle'. (p253)
8. 'The common good principle': Superintelligence should be developed only for the benefit of all of humanity and in the service of widely shared ethical ideals. (p254)

# Another view

Miles Brundage on the Collaboration section:

This is an important topic, and Bostrom says many things I agree with. A few places where I think the issues are less clear:

• Many of Bostrom’s proposals depend on AI recalcitrance being low. For instance, a highly secretive international effort makes less sense if building AI is a long and incremental slog. Recalcitrance may well be low, but this isn’t obvious, and it is good to recognize this dependency and consider what proposals would be appropriate for other recalcitrance levels.
• Arms races are ubiquitous in our global capitalist economy, and AI is already in one. Arms races can stem from market competition by firms or state-driven national security-oriented R+D efforts as well as complex combinations of these, suggesting the need for further research on the relationship between AI development, national security, and global capitalist market dynamics. It's unclear how well the simple arms race model here matches the reality of the current AI arms race or future variations of it. The model's main value is probably in probing assumptions and inspiring the development of richer models, as it's probably too simple to fit reality well as-is. For instance, it is unclear that safety and capability are close to orthogonal in practice today. If many AI people genuinely care about safety (which the quantity and quality of signatories to the FLI open letter suggests is plausible), or work on economically relevant near-term safety issues at each point is important, or consumers reward ethical companies with their purchases, then better AI firms might invest a lot in safety for self-interested as well as altruistic reasons. Also, if the AI field shifts to focus more on human-complementary intelligence that requires and benefits from long-term, high-frequency interaction with humans, then safety and capability may be synergistic rather than trading off against each other. Incentives related to research priorities should also be considered in a strategic analysis of AI governance (e.g. are AI researchers currently incentivized only to demonstrate capability advances in the papers they write, and could incentives be changed or the aims and scope of the field redefined so that more progress is made on safety issues?).
• ‘AI’ is too coarse-grained a unit for a strategic analysis of collaboration. The nature and urgency of collaboration depends on the details of what is being developed. An enormous variety of artificial intelligence research is possible and the goals of the field are underconstrained by nature (e.g. we can model systems based on approximations of rationality, or on humans, or animals, or something else entirely, based on curiosity, social impact, and other considerations that could be more explicitly evaluated), and are thus open to change in the future. We need to think more about differential technology development within the domain of AI. This too will affect the urgency and nature of cooperation.

# Notes

1. In Bostrom's description of his model, it is a bit unclear how safety precautions affect performance. He says 'one can model each team's performance as a function of its capability (measuring its raw ability and luck) and a penalty term corresponding to the cost of its safety precautions' (p247), which sounds like they are purely a negative. However this wouldn't make sense: if safety precautions were just a cost, then regardless of competition, nobody would invest in safety. In reality, whoever wins control over the world benefits a lot from whatever safety precautions have been taken. If the world is destroyed in the process of an AI transition, they have lost everything! I think this is the model Bostrom means to refer to. While he says it may lead to minimum precautions, note that in many models it would merely lead to less safety than one would want. If you are spending nothing on safety, and thus going to take over a world that is worth nothing, you would often prefer to move to a lower probability of winning a more valuable world. Armstrong, Bostrom and Shulman discuss this kind of model in more depth.

2. If you are interested in the game theory of conflicts like this, The Strategy of Conflict is a great book.

3. Given the gains to competitors cooperating to not destroy the world that they are trying to take over, research on how to arrange cooperation seems helpful for all sides. The situation is much like a tragedy of the commons, except for the winner-takes-all aspect: each person gains from neglecting safety, while exerting a small cost on everyone. Academia seems to be pretty interested in resolving tragedies of the commons, so perhaps that literature is worth trying to apply here.

4. The most famous arms race is arguably the nuclear one. I wonder to what extent this was a major arms race because nuclear weapons were destined to be an unusually massive jump in progress. If this was important, it leads to the question of whether we have reason to expect anything similar in AI.
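The corrected model in note 1 can be sketched numerically: safety spending lowers a team's chance of winning but raises the value of the world it wins, so the optimum lands strictly between no safety and full safety. All functional forms and parameter values below are invented for illustration, not taken from Bostrom's model.

```python
# Toy race model: a team picks a safety level s in [0, 1]. Spending on
# safety lowers its win probability, but an unsafe win is worth nothing.
def expected_value(s, base_win_prob=0.8, safety_penalty=0.6):
    win_prob = max(0.0, base_win_prob - safety_penalty * s)
    world_value = s  # the world you win is only as valuable as it is safe
    return win_prob * world_value

# Grid search for the optimal safety level.
best_value, best_s = max((expected_value(i / 100), i / 100) for i in range(101))
print(best_s)  # an interior optimum: nonzero safety, but less than s = 1
```

With these (arbitrary) numbers the analytic optimum is s = 2/3: competition pushes safety below the no-race level of s = 1 without driving it to zero, which is the point made in the note.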

# In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

1. Explore other models of competitive AI development.
2. What policy interventions help in promoting collaboration?
3. What kinds of situations produce arms races?
4. Examine international collaboration on major innovative technology. How often does it happen? What blocks it from happening more? What are the necessary conditions? Examples: Concorde jet, LHC, international space station, etc.
5. Conduct a broad survey of past and current civilizational competence. In what ways, and under what conditions, do human civilizations show competence vs. incompetence? Which kinds of problems do they handle well or poorly? Similar in scope and ambition to, say, Perrow’s Normal Accidents and Sagan’s The Limits of Safety. The aim is to get some insight into the likelihood of our civilization handling various aspects of the superintelligence challenge well or poorly. Some initial steps were taken here and here.
6. What happens when governments ban or restrict certain kinds of technological development? What happens when a certain kind of technological development is banned or restricted in one country but not in other countries where technological development sees heavy investment?
7. What kinds of innovative technology projects do governments monitor, shut down, or nationalize? How likely are major governments to monitor, shut down, or nationalize serious AGI projects?
8. How likely is it that AGI will be a surprise to most policy-makers and industry leaders? How much advance warning are they likely to have? Some notes on this here.
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

# How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about what to do in this 'crunch time'. To prepare, read Chapter 15. The discussion will go live at 6pm Pacific time next Monday 30 March. Sign up to be notified here.

## Intelligence modules

3 23 March 2015 04:24PM

A putative new idea for AI control; index here.

This idea, due to Eric Drexler, is to separate out the different parts of an AI into modules. There would be clearly designated pieces, either physical or algorithmic, with each part playing a specific role: this module would contain the motivation, this module the probability estimator, this module the models of the outside world, this module the natural language understanding unit, etc...

It's obvious how such a decomposition would be useful for many of the methods I've been detailing here. We could also distil each module - reduce it to a smaller, weaker (?) and more understandable submodule, in order to better understand what is going on. In one scenario, an opaque AI gets to design its successor, in the form of a series of such modules.

This property seems desirable; the question is, how could we get it?

EDIT: part of the idea of "modules" is that AIs often need to do calculations or estimations that would be of great value to us if we could access them in isolation. This idea is developed more in these posts.

## Designing in modules

The main threat here is that a given submodule would contain more than just the properties we want. After all, a natural language parser could consist of a general intelligence plus a motivation to understand language. Another possible worry is that the modules are overfitted to the problem or to each other: the language parser works perfectly, but only in this one AI design.

There are several ways we could try and combat this.

• Minimum length to the program (Eric's original main idea). The module must be as short as possible for its purpose.
• If possible, start with a modularised AI, and track how the modules change as the AI's intelligence grows.
• Have the AI extend the module to make it more transparent, without adding any extra functionality.
• Check that the input-output map the AI presents (maybe require it to present it as a simple primitive recursive function with clear and simple "for" loops) is of commensurable complexity to the task at hand.
• Add some noise to the situation: the new AI will not get exactly the same data as the old, the modules may get somewhat altered, and still have to work. Bugs will be introduced (this approach helps to combat the overfitting aspect of minimum length description).
• Require that the module be "distant", according to some metric of modification to the module's program, from having other capabilities. For instance, a general intelligence motivated to understand language is easy to make into a model-making module, simply by changing the motivation; more narrow modules are probably more distant from each other.
• Similarly, require that the performance of the modules becomes degraded if the world (or the world's descriptions) were substantially different, but not if they are the same information presented in a very different way.
• General anti-overfitting methods, calibrated to not generate a push towards general abilities in the modules. This would require some skill and a lot of thought.
• Focus on requiring statistical accuracy, rather than perfect performance.

If we allow false counterfactuals, then we can also:

• Imagine that the modules would be connected up to other modules, made by (hypothetical) different AIs, with potentially different motivations.
• Require that the modules not be useful as seeds for an AI of opposite motivation. For instance, an AI with opposite motivation would be allowed to flip a certain number of bits; if there was a general intelligence piece inside the module, then that would be something easier to co-opt, so dangerous to hand to an AI with opposite motivation. This is the "opposed" version of requiring modules to be "distant" from each other's capabilities.
• In general, all the reduced impact/satisficing requirements push towards having modules that cannot become general intelligences.
• The AI will dictate how to program these modules to a smart programmer, who will (counterfactually) not survive nor get to program the modules in the outside world.

Obviously anti-restriction-hacking would be useful for module separation (and vice versa).

This is the beginning of the process of defining this, but it would be great to have a safe(ish) method of separating modules in this way.

Any suggestions?

## Humans get different counterfactuals

2 23 March 2015 02:54PM

A putative new idea for AI control; index here.

A lot of my ideas rely upon taking counterfactuals of events that have tiny ε probabilities of ever happening. This introduces some extra risks, mainly centred around human behaviour. For the counterfactual methods to work, we need the alternate world to be sensible, without things going crazy due to human action.

For instance, imagine that X defines an event where an AI would get turned on, with ¬X (of probability ε) corresponding to the AI failing to get turned on. There are two risks here: the first is that humans would react by saying "wow, an event of probability of ε actually happened; all our models must be wrong! Let's go crazy! Yay!". The second is that humans react by saying: "well, that was embarrassing - let's turn the AI on anyway."

To avoid this issue, imagine the following setup:

The "ON" signal first goes through an event Y, which has 99% chance of letting it through, then to the event X, which (as before) has a 1-ε chance of letting it through. The setup is designed so that humans cannot distinguish between ¬Y (the signal gets blocked at the first stage) and ¬X (the signal gets blocked at the second stage). This only needs to fool humans, not the AI itself.

The AI defines counterfactuals, as before, by looking at ¬X (possibly conditioning on Y versus ¬Y, if this is needed). Everything proceeds as previously from its perspective.

From the human perspective, however, the ¬X world is not distinguishable from the ¬Y one. Given (¬Y or ¬X), humans would conclude that ¬Y is the much more likely option:

P(¬Y|¬Y or ¬X)≈1-100ε.

So the ¬X counterfactual world (for the AI) is one where humans behave as if they were in the ¬Y world.

And ¬Y has one chance in a hundred of happening, which is unlikely, but not enough for humans to assume that their whole model of reality is wrong. Also, this is sufficiently likely that humans would give serious thought as to what to do in the ¬Y case, maybe arranging various pre-commitments or insurance options, making it possible to arrange (with high probability) that humans don't just ignore the result and try again immediately.
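The posterior above follows from Bayes' theorem applied to the two mutually exclusive ways the signal can be blocked; a quick numerical check, with ε picked arbitrarily:

```python
from fractions import Fraction

eps = Fraction(1, 10**6)        # probability of ¬X (arbitrary tiny value)
p_not_y = Fraction(1, 100)      # Y blocks the signal 1% of the time
p_not_x = (1 - p_not_y) * eps   # signal passes Y, then gets blocked at X

# The two blocking events are mutually exclusive, so:
posterior = p_not_y / (p_not_y + p_not_x)

print(float(posterior))          # approximately 1 - 100*eps, as stated
print(float(1 - 100 * eps))
```

Exactly, the posterior is 1/(1 + 99ε), i.e. roughly 1 − 99ε; the post's 1 − 100ε is the same thing to the stated order of approximation.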

Note that this method can't be used (obviously) if ¬X is something hideously dangerous (like an unleashed UFAI), but in all other cases, it seems implementable.

## Closest stable alternative preferences

3 20 March 2015 12:41PM

A putative new idea for AI control; index here.

There's a result that's almost a theorem: an agent that is an expected utility maximiser is stable under self-modification (or under the creation of successor sub-agents).

Of course, this needs to be for "reasonable" utility, where no other agent cares about the internal structure of the agent (just its decisions), where the agent is not under any "social" pressure to make itself into something different, where the boundedness of the agent itself doesn't affect its motivations, and where issues of "self-trust" and acausal trade don't affect it in relevant ways, etc...

So quite a lot of caveats, but the result is somewhat stronger in the opposite direction: an agent that is not an expected utility maximiser is under pressure to self-modify itself into one that is. Or, more correctly, into an agent that is isomorphic with an expected utility maximiser (an important distinction).

What is this "pressure" agents are "under"? The known result is that if an agent obeys four simple axioms, then its behaviour must be isomorphic with that of an expected utility maximiser. If we assume the Completeness axiom (trivial) and Continuity (subtle), then violations of Transitivity or Independence correspond to situations where the agent gets money pumped - loses resources or power for no gain at all. The more likely the agent is to face these situations, the more pressure it is under to behave as an expected utility maximiser, or simply lose out.
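Here is the standard money pump in miniature: an agent with the cyclic preferences A ≻ B ≻ C ≻ A pays a small fee for each "upgrade" and ends up holding what it started with, strictly poorer. (This is the textbook toy example, not anything AI-specific.)

```python
# Cyclic (intransitive) preferences: A is preferred to B, B to C, C to A.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}

def offer_trade(holding, offered, fee, wealth):
    """The agent swaps to the offered item iff it strictly prefers it, paying a fee."""
    if (offered, holding) in prefers:
        return offered, wealth - fee
    return holding, wealth

holding, wealth = "A", 100
for offered in ["C", "B", "A"]:  # each offer is strictly preferred to the current holding
    holding, wealth = offer_trade(holding, offered, 1, wealth)

print(holding, wealth)  # back to "A", but 3 units poorer: money pumped
```

Each individual trade looks like a strict improvement to the agent, which is exactly why nothing in its preferences stops the cycle from repeating indefinitely.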

## Unbounded agents

I have two models for how idealised agents could deal with this sort of pressure. The first, post-hoc, is the unlosing agent I described here. The agent follows whatever preferences it has, but keeps track of its past decisions, and whenever it is in a position to violate transitivity or independence in a way that it would suffer from, it makes another decision instead.

Another, pre-hoc, way of dealing with this is to make an "ultra choice": choosing not between individual decisions, but between all possible input-output maps (equivalently, between all possible decision algorithms), looking to the expected consequences of each one. This reduces everything to a single choice, where issues of transitivity or independence need not necessarily apply.

## Bounded agents

Actual agents will be bounded, unlikely to be able to store and consult their entire history when making every single decision, and unable to look at the whole future of their interactions to make a good ultra choice. So how would they behave?

This is not determined directly by their preferences, but by some sort of meta-preferences. Would they make an approximate ultra-choice? Or maybe build up a history of decisions, and then simplify it (when it gets too large to easily consult) into a compatible utility function? This is determined by their interactions as well - an agent that makes a single decision has no pressure to be an expected utility maximiser; one that makes trillions of related decisions has a lot of pressure.

It's also notable that different types of boundedness (storage space, computing power, time horizons, etc...) have different consequences for unstable agents, and would converge to different stable preference systems.

## Investigation needed

So what is the point of this post? It isn't presenting new results; it's more an attempt to launch a new sub-field of investigation. We know that many preferences are unstable, and that the agent is likely to make them stable over time, either through self-modification, subagents, or some other method. There are also suggestions for preferences that are known to be unstable, but that have advantages (such as resistance to Pascal's Muggings) that standard maximisation does not.

Therefore, instead of saying "that agent design can never be stable", we should be saying "what kind of stable design would that agent converge to?", "does that convergent stable design still have the desirable properties we want?" and "could we get that stable design directly?".

The first two things I found in this area were that traditional satisficers could converge to vastly different types of behaviour in an essentially unconstrained way, and that a quasi-expected utility maximiser of utility u might converge to an expected utility maximiser, but it might not be u that it maximises.

In fact, we need not look only at violations of the axioms of expected utility; they are but one possible reason for decision behaviour instability. Here are some that spring to mind:

1. Non-independence and non-transitivity (as above).
2. Boundedness of abilities.
3. Adversaries and social pressure.
4. Evolution (survival cost to following "odd" utilities, e.g. time-dependent preferences).
5. Unstable decision theories (such as CDT).

Now, some categories (such as "Adversaries and social pressure") may not possess a tidy stable solution, but it is still worth asking what setups are more stable than others, and what the convergence rules are expected to be.

## Identity and quining in UDT

9 17 March 2015 08:01PM

Outline: I describe a flaw in UDT that has to do with the way the agent defines itself (locates itself in the universe). This flaw manifests as a failure to solve a certain class of decision problems. I suggest several related decision theories that solve the problem, some of which avoid quining, and are thus suitable for agents that cannot access their own source code.

EDIT: The decision problem I call here the "anti-Newcomb problem" already appeared here. Some previous solution proposals are here. A different but related problem appeared here.

Updateless decision theory, the way it is usually defined, postulates that the agent has to use quining in order to formalize its identity, i.e. determine which portions of the universe are considered to be affected by its decisions. This leaves open the question of which decision theory should be used by agents that don't have access to their own source code (as humans intuitively appear not to). I am pretty sure this question has already been posed somewhere on LessWrong but I can't find the reference: help? It also turns out that there is a class of decision problems for which this formalization of identity fails to produce the winning answer.

When one is programming an AI, it doesn't seem optimal for the AI to locate itself in the universe based solely on its own source code. After all, you build the AI, you know where it is (e.g. running inside a robot), why should you allow the AI to consider itself to be something else, just because this something else happens to have the same source code (more realistically, happens to have a source code correlated in the sense of logical uncertainty)?

Consider the following decision problem which I call the "UDT anti-Newcomb problem". Omega is putting money into boxes by the usual algorithm, with one exception: it isn't simulating the player at all. Instead, it simulates what a UDT agent would do in the player's place. Thus, a UDT agent would consider the problem identical to the usual Newcomb problem and one-box, receiving \$1,000,000. On the other hand, a CDT agent (say) would two-box and receive \$1,001,000 (!) Moreover, this problem reveals that UDT is not reflectively consistent: a UDT agent facing this problem would choose to self-modify, given the choice. This is not an argument in favor of CDT. But it is a sign something is wrong with UDT, the way it's usually done.
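As a sanity check on these payoffs, here is a toy Python sketch (my own illustration, not from the post): Omega fills the big box based on a simulation of a UDT agent, hardcoded here to one-box, regardless of who actually plays.

```python
# Toy check of the payoffs in the UDT anti-Newcomb problem.
# Omega simulates a UDT agent (who one-boxes), not the player.

def omega_fills_big_box():
    simulated_udt_choice = "one-box"   # what UDT does in the player's place
    return 1_000_000 if simulated_udt_choice == "one-box" else 0

def payoff(player_choice):
    big, small = omega_fills_big_box(), 1_000
    return big if player_choice == "one-box" else big + small

assert payoff("one-box") == 1_000_000   # the UDT player's payoff
assert payoff("two-box") == 1_001_000   # e.g. a CDT player's payoff
```

The two-boxer gets more precisely because its own choice never enters Omega's simulation.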

The essence of the problem is that a UDT agent uses too little information to define its identity: only its source code. Instead, it should use information about its origin. Indeed, if the origin is an AI programmer or a version of the agent before the latest self-modification, it appears rational for that precursor to code the origin into the successor agent. In fact, if we consider the anti-Newcomb problem with Omega's simulation using the correct decision theory XDT (whatever it is), we expect an XDT agent to two-box and leave with \$1000. This might seem surprising, but consider the problem from the precursor's point of view. The precursor knows Omega is filling the boxes based on XDT, whatever the decision theory of the successor is going to be. If the precursor knows XDT two-boxes, there is no reason to construct a successor that one-boxes. So constructing an XDT successor might be perfectly rational! Moreover, a UDT agent playing the XDT anti-Newcomb problem will also two-box (correctly).

To formalize the idea, consider a program $P$ called the precursor which outputs a new program $A$ called the successor. In addition, we have a program $U$ called the universe which outputs a number $U()$ called utility.

Usual UDT suggests for $A$ the following algorithm:

(1) $A(i):=(\underset{f:I \rightarrow O}{\arg\max} \: E[U()|\forall j \in I: A(j)=f(j)])(i)$

Here, $I$ is the input space, $O$ is the output space and the expectation value is over logical uncertainty. $A$ appears inside its own definition via quining.
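For concreteness, equation (1) can be sketched as a brute-force search over input-output maps. Everything below (the toy input/output spaces, and the weighted "worlds" standing in for logical uncertainty about $U()$) is an illustrative assumption of mine, not part of the formalism:

```python
import itertools

I = [0, 1]       # input space
O = ["a", "b"]   # output space

# Weighted toy hypotheses about U(), conditional on the agent
# implementing a policy f (a tuple: f[j] is the output on input j).
worlds = [
    (0.6, lambda f: 10 if f == ("a", "a") else 0),
    (0.4, lambda f: 7 if f[1] == "b" else 1),
]

def expected_utility(f):
    return sum(p * u(f) for p, u in worlds)

def A(i):
    # argmax over all |O|^|I| input-output maps, as in equation (1)
    best = max(itertools.product(O, repeat=len(I)), key=expected_utility)
    return best[i]
```

Here `A(0)` and `A(1)` both return `"a"`, since the policy `("a", "a")` has the highest expected utility. Note the maximization is over whole maps, not per-input answers.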

The simplest way to tweak equation (1) in order to take the precursor into account is

(2) $A(i):=(\underset{f:I \rightarrow O}{\arg\max} \: E[U()|\forall j \in I: P()(j)=f(j)])(i)$

This seems nice since quining is avoided altogether. However, it is unsatisfactory. Consider the anti-Newcomb problem with Omega's simulation involving equation (2), and suppose the successor uses equation (2) as well. On the surface, if Omega's simulation doesn't involve $P$1, the agent will two-box and get \$1000 as it should. However, the computing power allocated for evaluating the logical expectation value in (2) might be sufficient to suspect that $P$'s output might be an agent reasoning based on (2). This creates a logical correlation between the successor's choice and the result of Omega's simulation. For certain choices of parameters, this logical correlation leads to one-boxing.

The simplest way to solve the problem is letting the successor imagine that $P$ produces a lookup table. Consider the following equation:

(3) $A(i):=(\underset{f:I \rightarrow O}{\arg\max} \: E[U()|P()=LUT(f)])(i)$

Here, $LUT(f)$ is a program which computes $f$ using a lookup table: all of the values are hardcoded.
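A minimal sketch of what $LUT(f)$ might look like for a small finite input space (representing the program as Python source text is just for illustration):

```python
def LUT(f, inputs):
    # hardcode every value of f, then emit a program that only looks them up
    table = {i: f(i) for i in inputs}
    source = f"lambda i: {table!r}[i]"   # all values baked into the program text
    return eval(source)

g = LUT(lambda x: x * x, range(4))
assert [g(i) for i in range(4)] == [0, 1, 4, 9]
```

The returned program contains no trace of the original `f`, which is the point: the successor imagines $P$ producing an inert table, not a reasoning agent.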

For large input spaces, lookup tables are of astronomical size and either maximizing over them or imagining them to run on the agent's hardware doesn't make sense. This is a problem with the original equation (1) as well. One way out is replacing the arbitrary functions $f: I \rightarrow O$ with programs computing such functions. Thus, (3) is replaced by

(4) $A(i):=(\underset{\pi}{\arg\max} \: E[U()|P()=\pi])(i)$

Where $\pi$ is understood to range over programs receiving input in $I$ and producing output in $O$. However, (4) looks like it can go into an infinite loop: what if the optimal $\pi$ is described by equation (4) itself? To avoid this, we can introduce an explicit time limit $T$ on the computation. The successor will then spend some portion $T_1$ of $T$ performing the following maximization:

(4') $A(i):=(\underset{\pi}{\arg\max} \: E[U()|P()=S_{T_1}(\pi)])(i)$

Here, $S_{T_1}(\pi)$ is a program that does nothing for time $T_1$ and runs $\pi$ for the remaining time $T_2=T-T_1$. Thus, the successor invests $T_1$ time in maximization and $T_2$ in evaluating the resulting policy $\pi$ on the input it received.

In practical terms, (4') seems inefficient since it completely ignores the actual input for a period $T_1$ of the computation. This problem exists in original UDT as well. A naive way to avoid it is giving up on optimizing the entire input-output mapping and focusing only on the input actually received. This allows the following non-quining decision theory:

(5) $A(i):=\underset{o \in O}{\arg\max} \: E[U()|P() \in F_{i,o}]$

Here $F_{i,o}$ is the set of programs which begin with a conditional statement that produces output $o$ and terminate execution if received input was $i$. Of course, ignoring counterfactual inputs means failing a large class of decision problems. A possible win-win solution is reintroducing quining2:

(6) $A(i):=\underset{o \in O}{\arg\max} \: E[U()|P()=\hat{F}_{i,o}(A)]$

Here, $\hat{F}_{i,o}$ is an operator which appends a conditional as above to the beginning of a program. Superficially, we still only consider a single input-output pair. However, instances of the successor receiving different inputs now take each other into account (as existing in "counterfactual" universes). It is often claimed that the use of logical uncertainty in UDT allows for agents in different universes to reach a Pareto optimal outcome using acausal trade. If this is the case, then agents which have the same utility function should cooperate acausally with ease. Of course, this argument should also make the use of full input-output mappings redundant in usual UDT.

In case the precursor is an actual AI programmer (rather than another AI), it is unrealistic for her to code a formal model of herself into the AI. In a followup post, I'm planning to explain how to do without it (namely, how to define a generic precursor using a combination of Solomonoff induction and a formal specification of the AI's hardware).

1 If Omega's simulation involves $P$, this becomes the usual Newcomb problem and one-boxing is the correct strategy.

2 Sorry, agents which can't access their own source code: you will have to make do with one of (3), (4') or (5).

## Superintelligence 27: Pathways and enablers

10 17 March 2015 01:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-seventh section in the reading guide: Pathways and enablers.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Pathways and enablers” from Chapter 14

# Summary

1. Is hardware progress good?
1. Hardware progress means machine intelligence will arrive sooner, which is probably bad.
2. More hardware at a given point means less understanding is likely to be needed to build machine intelligence, and brute-force techniques are more likely to be used. These probably increase danger.
3. More hardware progress suggests there will be more hardware overhang when machine intelligence is developed, and thus a faster intelligence explosion. This seems good inasmuch as it brings a higher chance of a singleton, but bad in other ways:
1. Less opportunity to respond during the transition
2. Less possibility of constraining how much hardware an AI can reach
3. Flattens the playing field, allowing small projects a better chance. These are less likely to be safety-conscious.
4. Hardware has other indirect effects, e.g. it allowed the internet, which contributes substantially to work like this. But perhaps we have enough hardware now for such things.
5. On balance, more hardware seems bad, on the impersonal perspective.
2. Would brain emulation be a good thing to happen?
1. Brain emulation is coupled with 'neuromorphic' AI: if we try to build the former, we may get the latter. This is probably bad.
2. If we achieved brain emulations, would this be safer than AI? Three putative benefits:
1. "The performance of brain emulations is better understood"
1. However we have less idea how modified emulations would behave
2. Also, AI can be carefully designed to be understood
2. "Emulations would inherit human values"
1. This might require higher fidelity than making an economically functional agent
2. Humans are not that nice, often. It's not clear that human nature is a desirable template.
3. "Emulations might produce a slower take-off"
1. It isn't clear why it would be slower. Perhaps emulations would be less efficient, and so there would be less hardware overhang. Or perhaps emulations would not be qualitatively much better than humans, just faster and more numerous.
2. A slower takeoff may lead to better control
3. However it also means more chance of a multipolar outcome, and that seems bad.
3. If brain emulations are developed before AI, there may be a second transition to AI later.
1. A second transition should be less explosive, because emulations are already many and fast relative to the new AI.
2. The control problem is probably easier if the cognitive differences are smaller between the controlling entities and the AI.
3. If emulations are smarter than humans, this would have some of the same benefits as cognitive enhancement, in the second transition.
4. Emulations would extend the lead of the frontrunner in developing emulation technology, potentially allowing that group to develop AI with little disturbance from others.
5. On balance, brain emulation probably reduces the risk from the first transition, but its effect once a second transition is included is unclear.
4. Promoting brain emulation is better if:
1. You are pessimistic about human resolution of control problem
2. You are less concerned about neuromorphic AI, a second transition, and multipolar outcomes
3. You expect the timing of brain emulations and AI development to be close
4. You prefer superintelligence to arrive neither very early nor very late
3. The person-affecting perspective favors speed: present people are at risk of dying in the next century, and may be saved by advanced technology

# Another view

I talked to Kenzi Amodei about her thoughts on this section. Here is a summary of her disagreements:

Bostrom argues that we probably shouldn't celebrate advances in computer hardware. This seems probably right, but here are counter-considerations to a couple of his arguments.

The great filter

A big reason Bostrom finds fast hardware progress to be broadly undesirable is that he judges the state risks from sitting around in our pre-AI situation to be low, relative to the step risk from AI. But the so called 'Great Filter' gives us reason to question this assessment.

The argument goes like this. Observe that there are a lot of stars (we can detect roughly 10^22 of them). Next, note that we have never seen any alien civilizations, or distant suggestions of them. There might be aliens out there somewhere, but they certainly haven't gone out and colonized the universe enough that we would notice them (see 'The Eerie Silence' for further discussion of how we might observe aliens).

This implies that somewhere on the path between a star existing and it being home to a civilization that ventures out and colonizes much of space, there is a 'Great Filter': at least one step that is hard to get past - hard enough that at most about 1 in 10^22 starting points make it all the way. We know of somewhat hard steps at the start: a star might not have planets, or the planets may not be suitable for life. We don't know how hard it is for life to start: this step could be most of the filter for all we know.

If the filter is a step we have passed, there is nothing to worry about. But if it is a step in our future, then probably we will fail at it, like everyone else. And things that stop us from visibly colonizing the stars may well be existential risks.

At least one way of understanding anthropic reasoning suggests the filter is much more likely to be at a step in our future. Put simply, one is much more likely to find oneself in our current situation if being killed off on the way here is unlikely.

So what could this filter be? One thing we know is that it probably isn't AI risk, at least of the powerful, tile-the-universe-with-optimal-computations, sort that Bostrom describes. A rogue singleton colonizing the universe would be just as visible as its alien forebears colonizing the universe. From the perspective of the Great Filter, either one would be a 'success'. But there are no successes that we can see.

What's more, if we expect to be fairly safe once we have a successful superintelligent singleton, then this points at risks arising before AI.

So overall this argument suggests that AI is less concerning than we think and that other risks (especially early ones) are more concerning than we think. It also suggests that AI is harder than we think.

Which means that if we buy this argument, we should put a lot more weight on the category of 'everything else', and especially the bits of it that come before AI. To the extent that known risks like biotechnology and ecological destruction don't seem plausible, we should more fear unknown unknowns that we aren't even preparing for.

How much progress is enough?

Bostrom points to positive changes hardware has made to society so far. For instance, hardware allowed personal computers, bringing the internet, and with it the accretion of an AI risk community, producing the ideas in Superintelligence. But then he says probably we have enough: "hardware is already good enough for a great many applications that could facilitate human communication and deliberation, and it is not clear that the pace of progress in these areas is strongly bottlenecked by the rate of hardware improvement."

This seems intuitively plausible. However, one could probably have erroneously made such assessments about all kinds of progress, all over history. Accepting them all would lead to madness, and we have no obvious way of telling them apart.

In the 1800s it probably seemed like we had enough machines to be getting on with, perhaps too many, and people probably felt overwhelmingly rich. In the sixties too, it probably seemed like we had plenty of computation, and that hardware wasn't a great bottleneck to social progress.

If a trend has brought progress so far, and the progress would have been hard to predict in advance, then it seems hard to conclude from one's present vantage point that progress is basically done.

# Notes

1. How is hardware progressing?

I've been looking into this lately, at AI Impacts. Here's a figure of MIPS/\$ growing, from Muehlhauser and Rieber.

(Note: I edited the vertical axis, to remove a typo)

2. Hardware-software indifference curves

It was brought up in this chapter that hardware and software can substitute for each other: if there is endless hardware, you can run worse algorithms, and vice versa. I find it useful to picture this as indifference curves, something like this:

(Image: Hypothetical curves of hardware-software combinations producing the same performance at Go (source).)

I wrote about predicting AI given this kind of model here.

3. The potential for discontinuous AI progress

While we are on the topic of relevant stuff at AI Impacts, I've been investigating and quantifying the claim that AI might suddenly undergo huge amounts of abrupt progress (unlike brain emulations, according to Bostrom). As a step, we are finding other things that have undergone huge amounts of progress, such as nuclear weapons and high temperature superconductors:

(Figure originally from here)

4. The person-affecting perspective favors speed less as other prospects improve

I agree with Bostrom that the person-affecting perspective probably favors speeding many technologies, in the status quo. However I think it's worth noting that people with the person-affecting view should be scared of existential risk again as soon as society has achieved some modest chance of greatly extending life via specific technologies. So if you take the person-affecting view, and think there's a reasonable chance of very long life extension within the lifetimes of many existing humans, you should be careful about trading off speed and risk of catastrophe.

5. It seems unclear that an emulation transition would be slower than an AI transition.

One reason to expect an emulation transition to proceed faster is that there is an unusual reason to expect abrupt progress there.

6. Beware of brittle arguments

This chapter presented a large number of detailed lines of reasoning for evaluating hardware and brain emulations. This kind of concern might apply.

# In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

1. Investigate in more depth how hardware progress affects factors of interest
2. Assess in more depth the likely implications of whole brain emulation
3. Measure better the hardware and software progress that we see (e.g. some efforts at AI Impacts, MIRI, MIRI and MIRI)
4. Investigate the extent to which hardware and software can substitute (I describe more projects here)
5. Investigate the likely timing of whole brain emulation (the Whole Brain Emulation Roadmap is the main work on this)

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

# How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about how collaboration and competition affect the strategic picture. To prepare, read “Collaboration” from Chapter 14. The discussion will go live at 6pm Pacific time next Monday 23 March. Sign up to be notified here.

## Anti-Pascaline agent

4 12 March 2015 02:17PM

A putative new idea for AI control; index here.

Pascal's wager-like situations come up occasionally with expected utility, making some decisions very tricky. It means that events of the tiniest of probability could dominate the whole decision - intuitively unobvious, and a big negative for a bounded agent - and that expected utility calculations may fail to converge.

There are various principled approaches to resolving the problem, but how about an unprincipled one? We could try to bound utility functions, but the heart of the problem is not high utility on its own: it is high utility combined with low probability. Moreover, any fix has to behave sensibly with respect to updating.

## The agent design

Consider a UDT-ish agent A looking at input-output maps {M} (ie algorithms that could determine every single possible decision of the agent in the future). We allow probabilistic/mixed output maps as well (hence A has access to a source of randomness). Let u be a utility function, and set 0 < ε << 1 to be the precision. Roughly, we'll be discarding the highest (and lowest) utilities whose total probability is below ε. There is no fundamental reason that the same ε should be used for the highest and lowest utilities, but we'll keep it that way for the moment.

The agent is going to make an "ultra-choice" among the various maps M (ie fixing its future decision policy), using u and ε to do so. For any M, designate by A(M) the decision of the agent to use M for its decisions.

Then, for any map M, set max(M) to be the lowest number s.t. P(u ≥ max(M)|A(M)) ≤ ε. In other words, if the agent decides to use M as its decision policy, this is the maximum utility that can be achieved if we ignore the highest-valued ε of the probability distribution. Similarly, set min(M) to be the highest number s.t. P(u ≤ min(M)|A(M)) ≤ ε.

Then define the utility function uMε, which is simply u, bounded between min(M) and max(M). Now calculate the expected value of uMε given A(M); call this Eε(u|A(M)).

The agent then chooses the M that maximises Eε(u|A(M)). Call this the ε-precision u-maximising algorithm.
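A discrete sketch of this evaluation for a single map M, assuming the conditional distribution of u given A(M) comes as a finite list of (utility, probability) atoms. This representation, and the tie-handling at the clip boundary, are simplifications of mine:

```python
def eps_bounds(outcomes, eps):
    """outcomes: list of (utility, probability) pairs summing to 1.
    Returns (min_M, max_M) for a discrete distribution."""
    utils = sorted({u for u, _ in outcomes})
    p_above = lambda x: sum(p for u, p in outcomes if u > x)
    p_below = lambda x: sum(p for u, p in outcomes if u < x)
    max_M = min(v for v in utils if p_above(v) <= eps)  # lowest v: P(u > v) <= eps
    min_M = max(v for v in utils if p_below(v) <= eps)  # highest v: P(u < v) <= eps
    return min_M, max_M

def eps_precision_value(outcomes, eps):
    # expectation of u clipped between min(M) and max(M)
    lo, hi = eps_bounds(outcomes, eps)
    return sum(p * min(max(u, lo), hi) for u, p in outcomes)

# A Pascal's-mugging-like gamble: tiny chance of a huge payoff.
gamble = [(0, 0.9), (100, 0.1)]
assert abs(eps_precision_value(gamble, 0.2)) < 1e-9        # 10% tail discarded
assert abs(eps_precision_value(gamble, 0.05) - 10) < 1e-9  # tail survives
```

With ε above the tail probability the huge payoff is clipped away entirely, which is exactly the anti-Pascaline behaviour the post is after.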

## Stability of the design

The above decision process is stable, in that there is a single ultra-choice to be made, and clear criteria for making that ultra-choice. Realistic and bounded agents, however, cannot calculate all the M in sufficient detail to get a reasonable outcome. So we can ask whether the design is stable for a bounded agent.

Note that this question is underdefined, as there are many ways of being bounded, and many ways of cashing out ε-precision u-maximising into bounded form. Most likely, this will not be a direct expected utility maximisation, so the algorithm will be unstable (prone to change under self-modification). But how exactly it's unstable is an interesting question.

I'll look at one particular situation: one where A was tasked with creating subagents that would go out and interact with the world. These agents are short-sighted: they apply ε-precision u-maximising not to the ultra-choice, but to each individual expected utility calculation (we'll assume the utility gains and losses for each decision are independent).

A has a single choice: what to set ε to for the subagents. Intuitively, it would seem that A would set ε lower than its own value; this could correspond roughly to an agent self-modifying to remove the ε-precision restriction from itself, converging on becoming a u-maximiser. However:

• Theorem: There are (stochastic) worlds in which A will set the subagent precision to be higher, lower or equal to its own precision ε.

The proof will be by way of illustration of the interesting things that can happen in this setup. Let B be the subagent whose precision A sets.

Let C(p) be a coupon that pays out 1 with probability p; xC(p) simply means the coupon pays out x instead of 1. Each coupon costs ε² utility. This is negligible, and only serves to break ties. Then consider the following worlds:

• In W1, B will be offered the possibility of buying C(0.75ε).
• In W2, B will be offered the possibility of buying C(1.5ε).
• In W3, B will be offered the possibility of buying C(0.75ε), and the offer will be made twice.
• In W4, B will be offered, with 50% probability, the possibility of buying C(1.5ε).
• In W5, B will be offered, with 50% probability, the possibility of buying C(1.5ε), and otherwise the possibility of buying 2C(1.5ε).
• In W6, B will be offered, with 50% probability, the possibility of buying C(0.75ε), and otherwise the possibility of buying 2C(1.5ε).
• In W7, B will be offered, with 50% probability, the possibility of buying C(0.75ε), and otherwise the possibility of buying 2C(1.05ε).

From A’s perspective, the best input-output maps are: in W1, don’t buy; in W2, buy; in W3, buy both; in W4, don’t buy (because the probability of getting above 0 utility by buying is, from A's initial perspective, 1.5ε/2 = 0.75ε).

W5 is more subtle, and interesting – essentially A will treat 2C(1.5ε) as if it were C(1.5ε) (since the probability of getting above 1 utility by buying is 1.5ε/2 = 0.75ε, while the probability of getting above zero by buying is (1.5ε+1.5ε)/2=1.5ε). Thus A would buy everything offered.

Similarly, in W6, the agent would buy everything, and in W7, the agent would buy nothing (since the probability of getting above zero by buying is now (1.05ε + 0.75ε)/2 = 0.9ε).

So in W1 and W2, the agent can leave the sub-agent precision at ε. In W3, it needs to lower it below 0.75ε. In W4, it needs to raise it above 1.5ε. In W5 it can leave it alone, while in W6 it must lower it below 0.75ε, and in W7 it must raise it above 1.05ε.
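The single-coupon cases can be checked with a small script. The decision rule below (a short-sighted subagent buys a coupon iff its payout probability exceeds the subagent's precision) is my own reduction of the ε-precision rule to this setting, valid because the ε² cost only breaks ties:

```python
# A short-sighted subagent with precision eps_sub clips away any payout
# whose probability is at most eps_sub, so it buys a coupon iff the
# payout probability strictly exceeds its precision.

def subagent_buys(p, eps_sub):
    return p > eps_sub

eps = 1e-6
# W1: offered C(0.75*eps); A wants "don't buy" -- precision eps works.
assert not subagent_buys(0.75 * eps, eps)
# W2: offered C(1.5*eps); A wants "buy" -- precision eps works.
assert subagent_buys(1.5 * eps, eps)
# W4: the offer only happens with probability 0.5, so from A's initial
# viewpoint the payout probability is 0.75*eps < eps: A wants "don't buy".
# But the subagent decides after seeing the offer, where the payout
# probability is 1.5*eps, so A must raise the subagent's precision:
assert subagent_buys(1.5 * eps, eps)            # at precision eps, it buys
assert not subagent_buys(1.5 * eps, 2.0 * eps)  # above 1.5*eps, it doesn't
```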

## Irrelevant information

• Theorem: Assume X is a random variable that is irrelevant to the utility function u. If A (before knowing X) has to design successor agents that will exist after X is revealed, then (modulo a few usual assumptions about only decisions mattering, not internal thought processes) it will make these successor agents isomorphic to copies of itself, i.e. ε-precision u-maximising algorithms (potentially with a different way of breaking ties).

These successor agents are not the short-sighted agents of the previous model, but full ultra-choice agents. Their ultra-choice is over all decisions to come, while A's ultra-choice (which is simply a choice) is over all agent designs.

For the proof, I'll assume X is boolean valued (the general proof is similar). Let M be the input-output map A would choose for itself, if it were to make all the decisions itself rather than just designing a subagent. Now, it's possible that M(X) will be different from M(¬X) (here M(X) and M(¬X) are contractions of the input-output map by adding in one of the inputs).

Define the new input-output map M' by defining a new internal variable Y in A (recall that A has access to a source of randomness). Since this variable is new, M is independent of the value of Y. Then M' is defined as M with X and Y permuted. Since both Y and X are equally irrelevant to u, Eε(u|A(M))=Eε(u|A(M')), so M' is an input-output map that fulfils ε-precision u-maximising. And M'(X)=M'(¬X), so M' is independent of X.

Now consider the subagent that runs the same algorithm as A, and has seen X. Because of the irrelevance of X, M'(X) will still fulfil ε-precision u-maximising (we can express any fact relevant to u in the form of Zs, with P(Z)=P(Z|X), and then the algorithm is the same).

Similarly, a subagent that has seen ¬X will run M'(¬X). Putting these together, the subagent will expect to run M'(X) with probability P(X) and M'(¬X) with probability P(¬X)=1-P(X).

Since M'(X)=M'(¬X), this whole thing is just M'. So if A creates a copy of itself (possibly tweaking the tie-breaking so that M' is selected), then it will achieve its maximum according to ε-precision u-maximising.

## Creating a satisficer

4 11 March 2015 03:03PM

A putative new idea for AI control; index here.

This is a potential design for a safe-ish satisficer, based on the ideas presented here. The two key ideas are that a satisficer S(u) with utility u:

• Would not effectively aid M(u), a u-maximiser.
• Would not effectively resist M(-u), a u-minimizer.

So satisficers make poor allies and weak enemies.

## The design, the designer and the verifier

Let M(u-v) be an agent that maximises u and minimises v. Let's assume that M(u-v) is currently unaware of what u and v are (though it knows that the upper bound of u is easy to approach). It can interact with the outside environment in one way only: by creating a satisficer S(). S() will then be fed u to become S(u).

However, the design for S(u) will first be sent to M(εu+v), for some unknown small ε. Note that M(εu+v) is a v-maximiser, so it disagrees with M(u-v) except over u. M(εu+v) is a (hypothetical) agent that will be (counterfactually) released at some unspecified later date. Currently, it will be fed the whole reasoning history of M(u-v), and will make a single decision only: whether S(u) is to be built or not.

So it seems that S(u) must:

1. Set u close to its maximal value (as this is "easy to approach").
2. Not increase v to any significant extent (or else M(u-v) would not design it).
3. Not decrease v to any significant extent (or else M(εu+v) would veto it).
4. Since v is unknown to M(u-v) and resources are finite, this should serve as a general reduced impact requirement for S(u) (we may have to use something like a soft minimum across all v, rather than an expectation across all v, to avoid certain edge cases).
5. Since u is unknown to M(u-v), S() would serve as a general satisficing agent for any utility functions whose upper bounds are easy to approach (remember that we can take an arbitrary utility function and arbitrarily bound it at some number).

For the moment, this does seem like it would produce a successful satisficer...
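The designer/verifier protocol can be summarised in a hedged sketch. The interface (expected values passed in as functions over candidate designs) and the tolerance are my own illustrative choices, not part of the proposal:

```python
# Sketch: pick a design the u-maximising/v-minimising designer likes,
# subject to the v-maximising verifier's veto.

def design_satisficer(candidates, E_u, E_v, eps=0.01, tol=0.001):
    """E_u(S), E_v(S): expected u and v if design S is built;
    S = None means no satisficer is built (the baseline)."""
    def designer_score(S):          # M(u-v)'s preferences
        return E_u(S) - E_v(S)
    def verifier_approves(S):       # M(eps*u+v) compares building vs not
        return eps * E_u(S) + E_v(S) >= eps * E_u(None) + E_v(None) - tol
    approved = [S for S in candidates if verifier_approves(S)]
    return max(approved, key=designer_score) if approved else None

# Toy candidates given as (E[u], E[v]) pairs, baseline (0, 0):
E_u = lambda S: 0 if S is None else S[0]
E_v = lambda S: 0 if S is None else S[1]
picked = design_satisficer([(10, 5), (9, 0), (10, -5)], E_u, E_v)
assert picked == (9, 0)   # near-maximal u, v untouched; (10, -5) is vetoed
```

The surviving design approaches u's upper bound without moving v in either direction, matching conditions 1-3 above.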

## Detecting agents and subagents

8 10 March 2015 05:56PM

A putative new idea for AI control; index here.

There are many situations where it would be useful to identify the presence of an agent in the world, in a sufficiently abstract sense. There are many more situations where it would be useful to identify a subagent in an abstract sense. This is because people often have ideas for interesting and useful motivational restrictions for the AI (eg an AI that "wants to stay boxed", or a corrigible agent). But most of these motivations suffer from a "subagent problem": the AIs are motivated to create subagents that do not follow the motivational restrictions. The AI wanting to stay in the box is motivated to create a subagent that will protect it and keep it in the box, while the corrigible agent is not motivated to create corrigible subagents (though the pre-corriged agent might want to create pre-corriged subagents).

Unfortunately, it's extremely hard to identify an agent. Agents need not come conveniently packaged in some "sensors-actuators-internal structure-utility function" form. If you wanted to obfuscate an agent, hiding it in the natural movements of the solar system, or in the gyrations of the internet, this would be easy to do and hard to detect - and very hard to define what you would be looking for.

Fortunately, it's much easier to detect superintelligent AIs that have a major impact on the world - ie the kind of agents that we would be worried about. Intuitively, this is true: if you suddenly find half the sky getting optimised for space colonisation, it's likely there's an AI somewhere in there. How can this be formalised?

## The importance of the agent

Imagine taking a slice of space-time around the moment when a superintelligent agent just got turned on. If you wanted to predict the future, what piece of information is most relevant in that time slice? Obviously the superintelligent agent. We could run the ideas for reduced impact in reverse, looking for the part that is of maximum impact.

I would therefore define the agent as the part of this slice that it's the most valuable to know about, or the part that it would be the most valuable to change, counterfactually, if such changes were possible. Note that this definition doesn't exactly identify agents, but if it misidentifies something, then that something must also be of great importance.

If we imagine the slice of space-time seeping forwards in time from the past, then events in space-time will have varying informativeness/importance. As we approach the creation of the superintelligent AI, importance gets concentrated around the run-up to its creation, before being maximally concentrated at the creation of the AI (or its escape, if it was confined).

For a more formal definition, I'd imagine a passive pure resource-gathering agent A being fed the details of the time slice, and only being able to pass on a limited amount of the information to another (active) copy of itself, and seeing what it chooses to pass on. The data passed on would be the highest priority for the active copy, so would almost certainly include the existence of an enemy agent (almost all agents are "enemies" to resource-gathering agents, as they use up precious resources).

Alternatively, we could give A the option of changing some of the data - flipping some real-world bits, at least conceptually - and seeing which bits it preferred to flip. This definition can no doubt be improved by adding noise or other variations.

Now with counterfactuals and false miracles we might be able to actually construct that situation, or something approximating it. Even without that, this is a formal definition that seems to hone in well on the concept of agent detection.
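Here is a minimal sketch of that "importance" measure (the world model, the 80% figure and the squared-error proxy are all my own illustrative assumptions, not from the post): a cell whose current value shapes most of the next time step is exactly the cell it is most valuable to know about.

```python
import random

# Toy world (illustrative assumption): a 1-D world of N cells, where cell
# AGENT's current value determines most of the next time step.
random.seed(1)
N, AGENT, TRIALS = 10, 3, 2000

def step(world):
    """The 'agent' cell overwrites 80% of cells; the rest stay random noise."""
    return [world[AGENT] if random.random() < 0.8 else random.random()
            for _ in range(N)]

def prediction_gain(i):
    """How much does knowing cell i alone reduce squared error about the future?"""
    gain = 0.0
    for _ in range(TRIALS):
        world = [random.random() for _ in range(N)]
        future = step(world)
        # Predict every future cell from cell i alone, versus guessing 0.5 blindly.
        err_blind = sum((f - 0.5) ** 2 for f in future)
        err_known = sum((f - world[i]) ** 2 for f in future)
        gain += err_blind - err_known
    return gain / TRIALS

scores = [prediction_gain(i) for i in range(N)]
# The agent cell is the most valuable part of the time slice to know about.
assert max(range(N), key=lambda i: scores[i]) == AGENT
```

Knowing the agent cell cuts prediction error about the whole future; knowing any other cell doesn't, so ranking cells by informativeness picks out the agent.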

## Detecting subagents

Now that we've detected agents, can we detect subagents? It seems that we can. A powerful subagent can be detected in the same way that a powerful agent can be. But the subagent has some extra properties; for instance, knowing about the presence of the subagent could give strong evidence of the existence of a powerful agent in the past. This could correspond to agent A finding the existence of the powerful past agent to be important, the existence of the subagent to be important, but the combination of the two facts to not be as important as you'd expect. You could also compare the importance of knowing "past agent + subagent" versus "past agent + subagent whose motivation is flipped by a thermodynamic miracle".

We could also track the "importance" of the agent forward through time; any sudden jump downwards in importance could correspond either to a failure or to the creation of a subagent, and a sudden jump of importance from one location to another would be an indication of the latter.

It's interesting to note that this approach can tell the difference between an agent that creates a powerful subagent it still controls (in that case, changing the original agent's utility is of high value) and one it doesn't.
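The importance-tracking idea above can be sketched as follows (the importance time series is fabricated purely for illustration): we look for a time step where importance drops sharply at one location while rising sharply at another.

```python
# Fabricated importance scores over time for two locations (illustrative only).
importance_a = [9.0, 9.1, 9.0, 2.0, 1.9, 2.0]   # the original agent
importance_b = [0.1, 0.2, 0.1, 7.1, 7.2, 7.0]   # a candidate subagent location

def handoff_times(a, b, threshold=3.0):
    """Times where importance drops sharply at one location and rises at another."""
    return [t for t in range(1, len(a))
            if a[t - 1] - a[t] > threshold and b[t] - b[t - 1] > threshold]

# The importance hand-off at t=3 flags a likely subagent creation.
assert handoff_times(importance_a, importance_b) == [3]
```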

If we have a sufficiently well-defined definition of counterfactual, we could explore some other options, like allowing the hypothetical resource-gathering agent to plant hypothetical "bombs" across the world. It is likely that each bomb would be placed to eliminate one of its rivals.

It seems a lot could also be done by checking what happens to "importance" under different counterfactual definitions (the subagent has emerged from natural processes, versus emerging from a thermodynamic miracle).

## Hacking the definition

An AI that is under some restriction based on this definition of agency will no doubt try to hack the definition. How could it do that? One obvious idea is to run a cryptographic hash function over, eg, the positions of all the atoms in the solar system, and base its behaviour on the output. In that case, every atom would be of great importance. However, this can be easily overcome if we notice that every atom is of great importance only because of the agent's presence. In a counterfactual where the agent is removed, these atom positions are not of great importance intrinsically.
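The counterfactual test at the end of that paragraph can be sketched like this (the world-of-bits setup is my own illustrative assumption): with the hashing agent present, flipping almost any bit changes the outcome, so every bit looks important; counterfactually remove the agent and no bit matters.

```python
import hashlib

# Toy setup (illustrative assumption): the world is a list of bits. With the
# agent present, its behaviour is a hash of every bit; without it, nothing
# downstream depends on the bits at all.
def outcome(bits, agent_present):
    if agent_present:
        return hashlib.sha256(bytes(bits)).hexdigest()[0]
    return "quiet"  # nothing reacts to the bits

def important_bits(bits, agent_present):
    """Which bits change the outcome when flipped?"""
    base = outcome(bits, agent_present)
    return [outcome(bits[:i] + [1 - bits[i]] + bits[i + 1:], agent_present) != base
            for i in range(len(bits))]

world = [0, 1, 1, 0, 1, 0, 0, 1]
# Without the agent, no bit has any intrinsic importance...
assert not any(important_bits(world, agent_present=False))
# ...with it, (almost) every bit looks important - but only via the agent.
assert sum(important_bits(world, agent_present=True)) >= len(world) // 2
```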

Maybe the agent will try and create lots of independent subagents, spreading importance over many locations? In that case, we should look for a strong centralised importance that becomes dispersed.

AIs are intrinsically disadvantaged when it comes to hacking this definition (in theory), because if they are going to have a large impact, then the moment of their creation/turning on/escape is a very informative one.

The definition seems to be underspecified rather than easily hackable, which is a good position to start from.

It should be noted that a good satisficer should never be detected as a powerful agent. This could be used as a definition of a satisficer, a point we'll be returning to in subsequent posts.

## Superintelligence 26: Science and technology strategy

8 10 March 2015 01:43AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-sixth section in the reading guide: Science and technology strategy. Sorry for posting late—my car broke down.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Science and technology strategy” from Chapter 14

# Summary

1. This section will introduce concepts that are useful for thinking about long term issues in science and technology (p228)
2. Person affecting perspective: one should act in the best interests of everyone who already exists, or who will exist independent of one's choices (p228)
3. Impersonal perspective: one should act in the best interests of everyone, including those who may be brought into existence by one's choices. (p228)
4. Technological completion conjecture: "If scientific and technological development efforts do not cease, then all important basic capabilities that could be obtained through some possible technology will be obtained." (p229)
1. This does not imply that it is futile to try to steer technology. Efforts may cease. It might also matter exactly when things are developed, who develops them, and in what context.
5. Principle of differential technological development: one should slow the development of dangerous and harmful technologies relative to beneficial technologies (p230)
6. We have a preferred order for some technologies, e.g. it is better to have superintelligence later relative to social progress, but earlier relative to other existential risks. (p230-233)
7. If a macrostructural development accelerator is a magic lever which slows the large scale features of history (e.g. technological change, geopolitical dynamics) while leaving the small scale features the same, then we can ask whether pulling the lever would be a good idea (p233). The main way Bostrom concludes that it matters is by affecting how well prepared humanity is for future transitions.
8. State risk: a risk that persists while you are in a certain situation, such that the amount of risk is a function of the time spent there. e.g. risk from asteroids, while we don't have technology to redirect them. (p233-4)
9. Step risk: a risk arising from a transition. Here the amount of risk is mostly not a function of how long the transition takes. e.g. traversing a minefield: this is not especially safer if you run faster. (p234)
10. Technology coupling: a predictable timing relationship between two technologies, such that hastening of the first technology will hasten the second, either because the second is a precursor or because it is a natural consequence. (p236-8) e.g. brain emulation is plausibly coupled to 'neuromorphic' AI, because the understanding required to emulate a brain might allow one to more quickly create an AI on similar principles.
11. Second guessing: acting as if "by treating others as irrational and playing to their biases and misconceptions it is possible to elicit a response from them that is more competent than if a case had been presented honestly and forthrightly to their rational faculties" (p238-40)
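The state-risk/step-risk distinction (points 8 and 9) lends itself to a small worked example (the risk numbers are purely illustrative): state risk compounds with the years spent exposed to it, while step risk is a fixed toll on making the transition at all.

```python
# Illustrative risk numbers, not estimates of anything real.
def total_survival(years_in_state, annual_state_risk, step_risk):
    """Survive the state risk for each year waited, then pay the step risk once."""
    return (1 - annual_state_risk) ** years_in_state * (1 - step_risk)

# Waiting longer is worse under a state risk (e.g. undeflectable asteroids)...
assert total_survival(10, 0.001, 0.05) > total_survival(100, 0.001, 0.05)
# ...but the step-risk factor (1 - 0.05) is identical in both cases: running
# through the 'minefield' faster buys nothing on that component.
```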

# Another view

There is a common view which says we should not act on detailed abstract arguments about the far future like those of this section. Here Holden Karnofsky exemplifies it:

I have often been challenged to explain how one could possibly reconcile (a) caring a great deal about the far future with (b) donating to one of GiveWell’s top charities. My general response is that in the face of sufficient uncertainty about one’s options, and lack of conviction that there are good (in the sense of high expected value) opportunities to make an enormous difference, it is rational to try to make a smaller but robustly positive difference, whether or not one can trace a specific causal pathway from doing this small amount of good to making a large impact on the far future. A few brief arguments in support of this position:

• I believe that the track record of “taking robustly strong opportunities to do ‘something good'” is far better than the track record of “taking actions whose value is contingent on high-uncertainty arguments about where the highest utility lies, and/or arguments about what is likely to happen in the far future.” This is true even when one evaluates track record only in terms of seeming impact on the far future. The developments that seem most positive in retrospect – from large ones like the development of the steam engine to small ones like the many economic contributions that facilitated strong overall growth – seem to have been driven by the former approach, and I’m not aware of many examples in which the latter approach has yielded great benefits.
• I see some sense in which the world’s overall civilizational ecosystem seems to have done a better job optimizing for the far future than any of the world’s individual minds. It’s often the case that people acting on relatively short-term, tangible considerations (especially when they did so with creativity, integrity, transparency, consensuality, and pursuit of gain via value creation rather than value transfer) have done good in ways they themselves wouldn’t have been able to foresee. If this is correct, it seems to imply that one should be focused on “playing one’s role as well as possible” – on finding opportunities to “beat the broad market” (to do more good than people with similar goals would be able to) rather than pouring one’s resources into the areas that non-robust estimates have indicated as most important to the far future.
• The process of trying to accomplish tangible good can lead to a great deal of learning and unexpected positive developments, more so (in my view) than the process of putting resources into a low-feedback endeavor based on one’s current best-guess theory. In my conversation with Luke and Eliezer, the two of them hypothesized that the greatest positive benefit of supporting GiveWell’s top charities may have been to raise the profile, influence, and learning abilities of GiveWell. If this were true, I don’t believe it would be an inexplicable stroke of luck for donors to top charities; rather, it would be the sort of development (facilitating feedback loops that lead to learning, organizational development, growing influence, etc.) that is often associated with “doing something well” as opposed to “doing the most worthwhile thing poorly.”
• I see multiple reasons to believe that contributing to general human empowerment mitigates global catastrophic risks. I laid some of these out in a blog post and discussed them further in my conversation with Luke and Eliezer.

# Notes

1. Technological completion timelines game
The technological completion conjecture says that all the basic technological capabilities will eventually be developed. But when is 'eventually', usually? Do things get developed basically as soon as developing them is not prohibitively expensive, or is thinking of the thing often a bottleneck? This is relevant to how much we can hope to influence the timing of technological developments.

Here is a fun game: How many things can you find that could have been profitably developed much earlier than they were?

Some starting suggestions, which I haven't looked into:

Wheeled luggage: invented in the 1970s, though humanity had had both wheels and luggage for a while.

Hot air balloons: flying paper lanterns using the same principle were apparently used before 200AD, while a manned balloon wasn't used until 1783.

Penicillin: mould was apparently traditionally used for antibacterial properties in several cultures, but lots of things are traditionally used for lots of things. By the 1870s many scientists had noted that specific moulds inhibited bacterial growth.

Wheels: Early toys from the Americas appear to have had wheels (one, pictured here, is from 1-900AD; Wikipedia claims such toys were around as early as 1500BC). However wheels were apparently not used for more substantial transport in the Americas until much later.

There are also cases where humanity has forgotten important insights, and then rediscovered them again much later, which suggests strongly that they could have been developed earlier.

2. How does economic growth affect AI risk?

Eliezer Yudkowsky argues that economic growth increases risk. I argue that he has the sign wrong. Others argue that probably lots of other factors matter more anyway. Luke Muehlhauser expects that cognitive enhancement is bad, largely based on Eliezer's aforementioned claim. He also points out that smarter people are different from more rational people. Paul Christiano outlines his own evaluation of economic growth in general, on humanity's long run welfare. He also discusses the value of continued technological, economic and social progress more comprehensively here.

3. The person affecting perspective

Some interesting critiques: the non-identity problem, taking additional people to be neutral makes other good or bad things neutral too, if you try to be consistent in natural ways.

# In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

1. Is macro-structural acceleration good or bad on net for AI safety?
2. Choose a particular anticipated technology. Is its development good or bad for AI safety on net?
3. What is the overall current level of “state risk” from existential threats?
4. What are the major existential-threat “step risks” ahead of us, besides those from superintelligence?
5. What are some additional “technology couplings,” in addition to those named in Superintelligence, ch. 14?
6. What are further preferred orderings for technologies not mentioned in this section?
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

# How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about the desirability of hardware progress, and progress toward brain emulation. To prepare, read “Pathways and enablers” from Chapter 14. The discussion will go live at 6pm Pacific time next Monday 16th March. Sign up to be notified here.

## Satisficers' undefined behaviour

3 05 March 2015 05:03PM

I previously posted an example of a satisficer (an agent seeking to achieve a certain level of expected utility u) transforming itself into a maximiser (an agent wanting to maximise expected u) to better achieve its satisficing goals.

But the real problem with satisficers isn't that they "want" to become maximisers; the real problem is that their behaviour is undefined. We conceive of them as agents that would do the minimum required to reach a certain goal, but we don't specify "minimum required".

For example, let A be a satisficing agent. It has a utility u that is quadratic in the number of paperclips it builds, except that after building 10^100, it gets a special extra exponential reward, until 10^1000, where the extra reward becomes logarithmic, and after 10^10000, it also gets utility in the number of human frowns divided by 3↑↑↑3 (unless someone gets tortured by dust specks for 50 years).

A's satisficing goal is a minimum expected utility of 0.5, and, in one minute, the agent can press a button to create a single paperclip.

So pressing the button is enough. In the coming minute, A could decide to transform itself into a u-maximiser (as that still ensures the button gets pressed). But it could also do a lot of other things. It could transform itself into a v-maximiser, for many different v's (generally speaking, given any v, either v or -v will result in the button being pressed). It could break out, send a subagent to transform the universe into cream cheese, and then press the button. It could rewrite itself into a dedicated button pressing agent. It could write a giant Harry Potter fanfic, force people on Reddit to come up with creative solutions for pressing the button, and then implement the best.

All these actions are possible for a satisficer, and are completely compatible with its motivations. This is why satisficers are un(der)defined, and why any behaviour we want from it - such as "minimum required" impact - has to be put in deliberately.
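A sketch of why the constraint is so loose (the policies and the toy world are my own illustration): every policy that ends with the button pressed clears the E[u] ≥ 0.5 bar, no matter what else it does along the way.

```python
# Toy world (my own illustration): u only cares whether the button gets pressed.
def expected_u(policy):
    presses_button, side_effects = policy
    return 1.0 if presses_button else 0.0  # u is completely blind to side effects

policies = [
    (True, "do nothing else"),
    (True, "transform into a u-maximiser first"),
    (True, "turn the universe into cream cheese first"),
    (True, "write a giant Harry Potter fanfic first"),
]

# Every one of these mutually incompatible policies satisfies the satisficer:
satisfactory = [p for p in policies if expected_u(p) >= 0.5]
assert len(satisfactory) == len(policies)
```

The satisficing condition selects all of them equally, which is exactly the sense in which the behaviour is un(der)defined.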

I've got some ideas for how to achieve this, being posted here.

## Superintelligence 25: Components list for acquiring values

6 03 March 2015 02:01AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-fifth section in the reading guide: Components list for acquiring values.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Component list” and “Getting close enough” from Chapter 13

# Summary

1. Potentially important choices to make before building an AI (p222)
• What goals does it have?
• What decision theory does it use?
• How do its beliefs evolve? In particular, what priors and anthropic principles does it use? (epistemology)
• Will its plans be subject to human review? (ratification)
2. Incentive wrapping: beyond the main pro-social goals given to an AI, add some extra value for those who helped bring about the AI, as an incentive (p222-3)
3. Perhaps we should indirectly specify decision theory and epistemology, like we have suggested doing with goals, rather than trying to resolve these issues now. (p224-5)
4. An AI with a poor epistemology may still be very instrumentally smart, but for instance be incapable of believing the universe could be infinite (p225)
5. We should probably attend to avoiding catastrophe rather than maximizing value (p227) [i.e. this use of our attention is value maximizing.]
6. If an AI has roughly the right values, decision theory, and epistemology maybe it will correct itself anyway and do what we want in the long run (p227)

# Another view

Paul Christiano argues (today) that decision theory doesn't need to be sorted out before creating human-level AI. Here's a key bit, but you might need to look at the rest of the post to understand his idea well:

Really, I’d like to leave these questions up to an AI. That is, whatever work I would do in order to answer these questions, an AI should be able to do just as well or better. And it should behave sensibly in the interim, just like I would.

To this end, consider the definition of a map U' : [Possible actions] → ℝ:

U'(a) = “How good I would judge the action to be, after an idealized process of reflection.”

Now we’d just like to build an “agent” that takes the action a maximizing 𝔼[U'(a)]. Rather than defining our decision theory or our beliefs, we will have to come up with some answer during the “idealized process of reflection.” And as long as an AI is uncertain about what we’d come up with, it will behave sensibly in light of its uncertainty.

This feels like a cheat. But I think the feeling is an illusion.
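A minimal sketch of Christiano's proposal (the hypotheses, probabilities and action names are my own assumptions, not his): the agent holds a distribution over what the idealized process of reflection would conclude, and picks the action maximising the expectation of U'(a) under that distribution.

```python
# Hypotheses about what the idealized reflection would conclude, as
# (probability, U' under that hypothesis). All values are illustrative.
hypotheses = [
    (0.6, {"defer_to_humans": 0.9, "act_boldly": 0.3}),
    (0.4, {"defer_to_humans": 0.7, "act_boldly": 0.8}),
]

def expected_U_prime(action):
    """E[U'(a)] taken over uncertainty about the reflection's outcome."""
    return sum(p * u[action] for p, u in hypotheses)

actions = ["defer_to_humans", "act_boldly"]
best = max(actions, key=expected_U_prime)
# While uncertain what reflection would say, the agent behaves cautiously:
assert best == "defer_to_humans"
```

The point of the construction is that no decision theory or epistemology needs to be fixed in advance; they are folded into the uncertainty over U'.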

# Notes

1. MIRI's Research, and decision theory

MIRI focuses on technical problems that they believe can't be delegated well to an AI. Thus MIRI's technical research agenda describes many such problems and questions. In it, Nate Soares and Benja Fallenstein also discuss the question of why these can't be delegated:

Why can’t these tasks, too, be delegated? Why not, e.g., design a system that makes “good enough” decisions, constrain it to domains where its decisions are trusted, and then let it develop a better decision theory, perhaps using an indirect normativity approach (chap. 13) to figure out how humans would have wanted it to make decisions?

We cannot delegate these tasks because modern knowledge is not sufficient even for an indirect approach. Even if fully satisfactory theories of logical uncertainty and decision theory cannot be obtained, it is still necessary to have a sufficient theoretical grasp on the obstacles in order to justify high confidence in the system’s ability to correctly perform indirect normativity.

Furthermore, it would be risky to delegate a crucial task before attaining a solid theoretical understanding of exactly what task is being delegated. It is possible to create an intelligent system tasked with developing better and better approximations of Bayesian updating, but it would be difficult to delegate the abstract task of “find good ways to update probabilities” to an intelligent system before gaining an understanding of Bayesian reasoning. The theoretical understanding is necessary to ensure that the right questions are being asked.

If you want to learn more about the subjects of MIRI's research (which overlap substantially with the topics of the 'components list'), Nate Soares recently published a research guide. For instance here's some of it on the (pertinent this week) topic of decision theory:

Existing methods of counterfactual reasoning turn out to be unsatisfactory both in the short term (in the sense that they systematically achieve poor outcomes on some problems where good outcomes are possible) and in the long term (in the sense that self-modifying agents reasoning using bad counterfactuals would, according to those broken counterfactuals, decide that they should not fix all of their flaws). My talk “Why ain’t you rich?” briefly touches upon both these points. To learn more, I suggest the following resources:

1. Soares & Fallenstein’s “Toward idealized decision theory” serves as a general overview, and further motivates problems of decision theory as relevant to MIRI’s research program. The paper discusses the shortcomings of two modern decision theories, and discusses a few new insights in decision theory that point toward new methods for performing counterfactual reasoning.

If “Toward idealized decision theory” moves too quickly, this series of blog posts may be a better place to start:

1. Yudkowsky’s “The true Prisoner’s Dilemma” explains why cooperation isn’t automatically the ‘right’ or ‘good’ option.

2. Soares’ “Causal decision theory is unsatisfactory” uses the Prisoner’s Dilemma to illustrate the importance of non-causal connections between decision algorithms.

3. Yudkowsky’s “Newcomb’s problem and regret of rationality” argues for focusing on decision theories that ‘win,’ not just on ones that seem intuitively reasonable. Soares’ “Introduction to Newcomblike problems” covers similar ground.

4. Soares’ “Newcomblike problems are the norm” notes that human agents probabilistically model one another’s decision criteria on a routine basis.

MIRI’s research has led to the development of “Updateless Decision Theory” (UDT), a new decision theory which addresses many of the shortcomings discussed above.

1. Hintze’s “Problem class dominance in predictive dilemmas” summarizes UDT’s dominance over other known decision theories, including Timeless Decision Theory (TDT), another theory that dominates CDT and EDT.

2. Fallenstein’s “A model of UDT with a concrete prior over logical statements” provides a probabilistic formalization.

However, UDT is by no means a solution, and has a number of shortcomings of its own, discussed in the following places:

1. Slepnev’s “An example of self-fulfilling spurious proofs in UDT” explains how UDT can achieve sub-optimal results due to spurious proofs.

2. Benson-Tilsen’s “UDT with known search order” is a somewhat unsatisfactory solution. It contains a formalization of UDT with known proof-search order and demonstrates the necessity of using a technique known as “playing chicken with the universe” in order to avoid spurious proofs.

For more on decision theory, here is Luke Muehlhauser and Crazy88's FAQ.

2. Can stable self-improvement be delegated to an AI?

Paul Christiano also argues for 'yes' here:

“Stable self-improvement” seems to be a primary focus of MIRI’s work. As I understand it, the problem is “How do we build an agent which rationally pursues some goal, is willing to modify itself, and with very high probability continues to pursue the same goal after modification?”

The key difficulty is that it is impossible for an agent to formally “trust” its own reasoning, i.e. to believe that “anything that I believe is true.” Indeed, even the natural concept of “truth” is logically problematic. But without such a notion of trust, why should an agent even believe that its own continued existence is valuable?

I agree that there are open philosophical questions concerning reasoning under logical uncertainty, and that reflective reasoning highlights some of the difficulties. But I am not yet convinced that stable self-improvement is an especially important problem for AI safety; I think it would be handled correctly by a human-level reasoner as a special case of decision-making under logical uncertainty. This suggests that (1) it will probably be resolved en route to human-level AI, (2) it can probably be “safely” delegated to a human-level AI. I would prefer to use energy investigating other aspects of the AI safety problem... (more)

3. On the virtues of human review

Bostrom mentions the possibility of having an 'oracle' or some such non-interfering AI tell you what your 'sovereign' will do. He suggests some benefits and costs of this—namely, it might prevent existential catastrophe, and it might reveal facts about the intended future that would make sponsors less happy to defer to the AI's mandate (coherent extrapolated volition or some such thing). Four quick thoughts:

1) The costs and benefits here seem wildly out of line with each other. In a situation where you think there's a substantial chance your superintelligent AI will destroy the world, you are not going to set aside what you think is an effective way of checking, because it might cause the people sponsoring the project to realize that it isn't exactly what they want, and demand some more pie for themselves. Deceiving sponsors into doing what you want instead of what they would want if they knew more seems much, much, much less important than avoiding existential catastrophe.

2) If you were concerned about revealing information about the plan because it would lift a veil of ignorance, you might artificially replace some of the veil with intentional randomness.

3) It seems to me that a bigger concern with humans reviewing AI decisions is that it will be infeasible. At least if the risk from an AI is that it doesn't correctly manifest the values we want. Bostrom describes an oracle with many tools for helping to explain, so it seems plausible such an AI could give you a good taste of things to come. However if the problem is that your values are so nuanced that you haven't managed to impart them adequately to an AI, then it seems unlikely that an AI can highlight for you the bits of the future that you are likely to disapprove of. Or at least you have to be in a fairly narrow part of the space of AI capability, where the AI doesn't know some details of your values, but for all the important details it is missing, can point to relevant parts of the world where the mismatch will manifest.

4) Human oversight only seems feasible in a world where there is much human labor available per AI. In a world where a single AI is briefly overseen by a programming team before taking over the world, human oversight might be a reasonable tool for that brief time. Substantial human oversight does not seem helpful in a world where trillions of AI agents are each smarter and faster than a human, and need some kind of ongoing control.

4. Avoiding catastrophe as the top priority

In case you haven't read it, Bostrom's Astronomical Waste is a seminal discussion of the topic.

# In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

1. See MIRI's research agenda
2. For any plausible entry on the list of things that can't be well delegated to AI, think more about whether it belongs there, or how to delegate it.
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

# How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about strategy in directing science and technology. To prepare, read “Science and technology strategy” from Chapter 14. The discussion will go live at 6pm Pacific time next Monday 9 March. Sign up to be notified here.

## Superintelligence 24: Morality models and "do what I mean"

7 24 February 2015 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-fourth section in the reading guide: Morality models and "Do what I mean".

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Morality models” and “Do what I mean” from Chapter 13.

# Summary

1. Moral rightness (MR) AI: AI which seeks to do what is morally right
1. Another form of 'indirect normativity'
2. Requires moral realism to be true to do anything, but we could ask the AI to evaluate that and do something else if moral realism is false
3. Avoids some complications of CEV
4. If moral realism is true, is better than CEV (though may be terrible for us)
2. We often want to say 'do what I mean' with respect to goals we try to specify. This is doing a lot of the work sometimes, so if we could specify that well perhaps it could also just stand alone: do what I want. This is much like CEV again.

# Another view

Olle Häggström again, on Bostrom's 'Milky Way Preserve':

The idea [of a Moral Rightness AI] is that a superintelligence might be successful at the task (where we humans have so far failed) of figuring out what is objectively morally right. It should then take objective morality to heart as its own values.1,2

Bostrom sees a number of pros and cons of this idea. A major concern is that objective morality may not be in humanity's best interest. Suppose for instance (not entirely implausibly) that objective morality is a kind of hedonistic utilitarianism, where "an action is morally right (and morally permissible) if and only if, among all feasible actions, no other action would produce a greater balance of pleasure over suffering" (p 219). Some years ago I offered a thought experiment to demonstrate that such a morality is not necessarily in humanity's best interest. Bostrom reaches the same conclusion via a different thought experiment, which I'll stick with here in order to follow his line of reasoning.3 Here is his scenario:
The AI [...] might maximize the surfeit of pleasure by converting the accessible universe into hedonium, a process that may involve building computronium and using it to perform computations that instantiate pleasurable experiences. Since simulating any existing human brain is not the most efficient way of producing pleasure, a likely consequence is that we all die.
Bostrom is reluctant to accept such a sacrifice for "a greater good", and goes on to suggest a compromise:
The sacrifice looks even less appealing when we reflect that the superintelligence could realize a nearly-as-great good (in fractional terms) while sacrificing much less of our own potential well-being. Suppose that we agreed to allow almost the entire accessible universe to be converted into hedonium - everything except a small preserve, say the Milky Way, which would be set aside to accommodate our own needs. Then there would still be a hundred billion galaxies devoted to the maximization of pleasure. But we would have one galaxy within which to create wonderful civilizations that could last for billions of years and in which humans and nonhuman animals could survive and thrive, and have the opportunity to develop into beatific posthuman spirits.

If one prefers this latter option (as I would be inclined to do) it implies that one does not have an unconditional lexically dominant preference for acting morally permissibly. But it is consistent with placing great weight on morality. (p 219-220)

What? Is it? Is it "consistent with placing great weight on morality"? Imagine Bostrom in a situation where he does the final bit of programming of the coming superintelligence, to decide between these two worlds, i.e., the all-hedonium one versus the all-hedonium-except-in-the-Milky-Way-preserve.4 And imagine that he goes for the latter option. The only difference it makes to the world is to what happens in the Milky Way, so what happens elsewhere is irrelevant to the moral evaluation of his decision.5 This may mean that Bostrom opts for a scenario where, say, 10^24 sentient beings will thrive in the Milky Way in a way that is sustainable for trillions of years, rather than a scenario where, say, 10^45 sentient beings will be even happier for a comparable amount of time. Wouldn't that be an act of immorality that dwarfs all other immoral acts carried out on our planet, by many many orders of magnitude? How could that be "consistent with placing great weight on morality"?6

# Notes

1. Do What I Mean is originally a concept from computer systems, where the (more modest) idea is to have a system correct small input errors.

2. To the extent that people care about objective morality, it seems coherent extrapolated volition (CEV) or Christiano's proposal would lead the AI to care about objective morality, and thus look into what it is. Thus I doubt it is worth considering our commitments to morality first (as Bostrom does in this chapter, and as one might do before choosing whether to use a MR AI), if general methods for implementing our desires are on the table. This is close to what Bostrom is saying when he suggests we outsource the decision about which form of indirect normativity to use, and eventually winds up back at CEV. But it seems good to be explicit.

3. I'm not optimistic that behind every vague and ambiguous command, there is something specific that a person 'really means'. It seems more likely there is something they would in fact try to mean, if they thought about it a bunch more, but this is mostly defined by further facts about their brains, rather than the sentence and what they thought or felt as they said it. It seems at least misleading to call this 'what they meant'. Thus even when '—and do what I mean' is appended to other kinds of goals than generic CEV-style ones, I would expect the execution to look much like a generic investigation of human values, such as that implicit in CEV.

4. Alexander Kruel disputes the importance of 'Do What I Mean': since every part of what an AI does is designed to be what humans really want it to be, it seems unlikely to him that an AI would do exactly what humans want with respect to instrumental behaviors (e.g. understanding language, using the internet, and carrying out sophisticated plans), but fail on humans' ultimate goals:

Outsmarting humanity is a very small target to hit, requiring a very small margin of error. In order to succeed at making an AI that can outsmart humans, humans have to succeed at making the AI behave intelligently and rationally. Which in turn requires humans to succeed at making the AI behave as intended along a vast number of dimensions. Thus, failing to predict the AI’s behavior does in almost all cases result in the AI failing to outsmart humans.

As an example, consider an AI that was designed to fly planes. It is exceedingly unlikely for humans to succeed at designing an AI that flies planes, without crashing, but which consistently chooses destinations that it was not meant to choose. Since all of the capabilities that are necessary to fly without crashing fall into the category “Do What Humans Mean”, and choosing the correct destination is just one such capability.

I disagree that it would be surprising for an AI to be very good at flying planes in general, but very bad at going to the right places in them. However it seems instructive to think about why this is.

# In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

1. Are there other general forms of indirect normativity that might outsource the problem of deciding what indirect normativity to use?
2. On common views of moral realism, is morality likely to be amenable to (efficient) algorithmic discovery?
3. If you knew how to build an AI with a good understanding of natural language (e.g. it knows what the word 'good' means as well as your most intelligent friend), how could you use this to make a safe AI?
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

# How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about other abstract features of an AI's reasoning that we might want to get right ahead of time, instead of leaving to the AI to fix. We will also discuss how well an AI would need to fulfill these criteria to be 'close enough'. To prepare, read “Component list” and “Getting close enough” from Chapter 13. The discussion will go live at 6pm Pacific time next Monday 2 March. Sign up to be notified here.

## [Link] YC President Sam Altman: The Software Revolution

4 19 February 2015 05:13AM

Writing about technological revolutions, Y Combinator president Sam Altman warns about the dangers of AI and bioengineering (discussion on Hacker News):

Two of the biggest risks I see emerging from the software revolution—AI and synthetic biology—may put tremendous capability to cause harm in the hands of small groups, or even individuals.

I think the best strategy is to try to legislate sensible safeguards but work very hard to make sure the edge we get from technology on the good side is stronger than the edge that bad actors get. If we can synthesize new diseases, maybe we can synthesize vaccines. If we can make a bad AI, maybe we can make a good AI that stops the bad one.

The current strategy is badly misguided. It’s not going to be like the atomic bomb this time around, and the sooner we stop pretending otherwise, the better off we’ll be. The fact that we don’t have serious efforts underway to combat threats from synthetic biology and AI development is astonishing.

On the one hand, it's good to see more mainstream(ish) attention to AI safety. On the other hand, he focuses on the mundane (though still potentially devastating!) risks of job destruction and concentration of power, and his hopeful "best strategy" seems... inadequate.

## Superintelligence 23: Coherent extrapolated volition

5 17 February 2015 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-third section in the reading guide: Coherent extrapolated volition.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “The need for...” and “Coherent extrapolated volition” from Chapter 13

# Summary

1. Problem: we are morally and epistemologically flawed, and we would like to make an AI without locking in our own flaws forever. How can we do this?
2. Indirect normativity: offload cognitive work to the superintelligence, by specifying our values indirectly and having it transform them into a more usable form.
3. Principle of epistemic deference: a superintelligence is more likely to be correct than we are on most topics, most of the time. Therefore, we should defer to the superintelligence where feasible.
4. Coherent extrapolated volition (CEV): a goal of fulfilling what humanity would agree that they want, if given much longer to think about it, in more ideal circumstances. CEV is a popular proposal for what we should design an AI to do.
5. Virtues of CEV:
1. It avoids the perils of specification: it is very hard to specify explicitly what we want, without causing unintended and undesirable consequences. CEV specifies the source of our values, instead of what we think they are, which appears to be easier.
2. It encapsulates moral growth: there are reasons to believe that our current moral beliefs are not the best (by our own lights) and we would revise some of them, if we thought about it. Specifying our values now risks locking in wrong values, whereas CEV effectively gives us longer to think about our values.
3. It avoids 'hijacking the destiny of humankind': it allows the responsibility for the future of mankind to remain with mankind, instead of perhaps a small group of programmers.
4. It avoids creating a motive for modern-day humans to fight over the initial dynamic: a commitment to CEV would mean the creators of AI would not have much more influence over the future of the universe than others, reducing the incentive to race or fight. This is even more so because a person who believes that their views are correct should be confident that CEV will come to reflect their views, so they do not even need to split the influence with others.
5. It keeps humankind 'ultimately in charge of its own destiny': it allows for a wide variety of arrangements in the long run, rather than necessitating paternalistic AI oversight of everything.
6. CEV as described here is merely a schematic. For instance, it does not specify which people are included in 'humanity'.

# Another view

Part of Olle Häggström's extended review of Superintelligence expresses a common concern—that human values can't be faithfully turned into anything coherent:

Human values exhibit, at least on the surface, plenty of incoherence. That much is hardly controversial. But what if the incoherence goes deeper, and is fundamental in such a way that any attempt to untangle it is bound to fail? Perhaps any search for our CEV is bound to lead to more and more glaring contradictions? Of course any value system can be modified into something coherent, but perhaps not all value systems can be so modified without sacrificing some of their most central tenets? And perhaps human values have that property?

Let me offer a candidate for what such a fundamental contradiction might consist in. Imagine a future where all humans are permanently hooked up to life-support machines, lying still in beds with no communication with each other, but with electrodes connected to the pleasure centra of our brains in such a way as to constantly give us the most pleasurable experiences possible (given our brain architectures). I think nearly everyone would attach a low value to such a future, deeming it absurd and unacceptable (thus agreeing with Robert Nozick). The reason we find it unacceptable is that in such a scenario we no longer have anything to strive for, and therefore no meaning in our lives. So we want instead a future where we have something to strive for. Imagine such a future F1. In F1 we have something to strive for, so there must be something missing in our lives. Now let F2 be similar to F1, the only difference being that that something is no longer missing in F2, so almost by definition F2 is better than F1 (because otherwise that something wouldn't be worth striving for). And as long as there is still something worth striving for in F2, there's an even better future F3 that we should prefer. And so on. What if any such procedure quickly takes us to an absurd and meaningless scenario with life-support machines and electrodes, or something along those lines? Then no future will be good enough for our preferences, so not even a superintelligence will have anything to offer us that aligns acceptably with our values.

Now, I don't know how serious this particular problem is. Perhaps there is some way to gently circumvent its contradictions. But even then, there might be some other fundamental inconsistency in our values - one that cannot be circumvented. If that is the case, it will throw a spanner in the works of CEV. And perhaps not only for CEV, but for any serious attempt to set up a long-term future for humanity that aligns with our values, with or without a superintelligence.

# Notes

1. While we are on the topic of critiques, here is a better list:

1. Human values may not be coherent (Olle Häggström above, Marcello; Eliezer responds in section 6. question 9)
2. The values of a collection of humans in combination may be even less coherent. Arrow's impossibility theorem suggests reasonable aggregation is hard, but this only applies if values are ordinal, which is not obvious.
3. Even if human values are complex, this doesn't mean complex outcomes are required—maybe with some thought we could specify the right outcomes, and don't need an indirect means like CEV (Wei Dai)
4. The moral 'progress' we see might actually just be moral drift that we should try to avoid. CEV is designed to allow this change, which might be bad. Ideally, the CEV circumstances would be optimized for deliberation and not for other forces that might change values, but perhaps deliberation itself can't proceed without our values being changed (Cousin_it)
5. Individuals will probably not be a stable unit in the future, so it is unclear how to weight different people's inputs to CEV. Or to be concrete, what if Dr Evil can create trillions of emulated copies of himself to go into the CEV population. (Wei Dai)
6. It is not clear that extrapolating everyone's volition is better than extrapolating a single person's volition, which may be easier. If you want to take into account others' preferences, then your own volition is fine (it will do that), and if you don't, then why would you be using CEV?
7. A purported advantage of CEV is that it makes conflict less likely. But if a group is disposed to honor everyone else's wishes, they will not conflict anyway, and if they aren't disposed to honor everyone's wishes, why would they favor CEV? CEV doesn't provide any additional means to commit to cooperative behavior. (Cousin_it)
8. More in Coherent Extrapolated Volition section 6. question 9

2. Luke Muehlhauser has written a list of resources you might want to read if you are interested in this topic. It suggests these main sources:
He also discusses some closely related philosophical conversations:
• Reflective equilibrium. Yudkowsky's proposed extrapolation works analogously to what philosophers call 'reflective equilibrium.' The most thorough work here is the 1996 book by Daniels, and there have been lots of papers, but this genre is only barely relevant for CEV...
• Full-information accounts of value and ideal observer theories. This is what philosophers call theories of value that talk about 'what we would want if we were fully informed, etc.' or 'what a perfectly informed agent would want' like CEV does. There's some literature on this, but it's only marginally relevant to CEV...
Muehlhauser later wrote at more length about the relationship of CEV to ideal observer theories, with Chris Williamson.

3. This chapter is concerned with avoiding locking in the wrong values. One might wonder exactly what this 'locking in' is, and why AI will cause values to be 'locked in' while having children for instance does not. Here is my take: there are two issues - the extent to which values change, and the extent to which one can personally control that change. At the moment, values change plenty and we can't control the change. Perhaps in the future, technology will allow the change to be controlled (this is the hope with value loading). Then, if anyone can control values they probably will, because values are valuable to control. In particular, if AI can control its own values, it will avoid having them change. Thus in the future, values will probably be controlled, and will not change. It is not clear that we will lock in values as soon as we have artificial intelligence - perhaps an artificial intelligence will be built for which its implicit values randomly change - but if we are successful we will control values, and thus lock them in, and if we are even more successful we will lock in values that are actually desirable for us. Paul Christiano has a post on this topic, which I probably pointed you to before.

4. Paul Christiano has also written about how to concretely implement the extrapolation of a single person's volition, in the indirect normativity scheme described in box 12 (p199-200). You probably saw it then, but I draw it to your attention here because the extrapolation process is closely related to CEV and is concrete. He also has a recent proposal for 'implementing our considered judgment'.
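Critique 2 in the list above (that aggregating ordinal values runs into Arrow's impossibility theorem) can be illustrated concretely with Condorcet's paradox: three voters, each with a perfectly coherent individual ranking, produce a cyclic majority preference. A minimal sketch, with hypothetical preferences chosen purely for illustration:

```python
# Condorcet's paradox: coherent individual ordinal rankings over
# options A, B, C whose pairwise majorities nonetheless cycle.
rankings = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y):
    """True if a strict majority of voters rank x above y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings)
    return votes > len(rankings) / 2

# A beats B, B beats C, yet C beats A: the majorities cycle,
# so no coherent aggregate ordinal ranking exists.
assert majority_prefers("A", "B")
assert majority_prefers("B", "C")
assert majority_prefers("C", "A")
```

As the note says, this obstacle applies only if values are treated as ordinal; with cardinal utilities (intensities, not just rankings) other aggregation schemes become available.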

# In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

1. Specify a method for instantiating CEV, given some assumptions about available technology.
2. In practice, to what degree do human values and preferences converge upon learning new facts? To what degree has this happened in history? (Nobody values the will of Zeus anymore, presumably because we all learned the truth of Zeus’ non-existence. But perhaps such examples don’t tell us much.) See also philosophical analyses of the issue, e.g. Sobel (1999).
3. Are changes in specific human preferences (over a lifetime or many lifetimes) better understood as changes in underlying values, or changes in instrumental ways to achieve those values? (driven by belief change, or additional deliberation)
4. How might democratic systems deal with new agents being readily created?

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

# How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about more ideas for giving an AI desirable values. To prepare, read “Morality models” and “Do what I mean” from Chapter 13. The discussion will go live at 6pm Pacific time next Monday 23 February. Sign up to be notified here.

## AI-created pseudo-deontology

6 12 February 2015 09:11PM

I'm soon going to go on a two day "AI control retreat", when I'll be without internet or family or any contact, just a few books and thinking about AI control. In the meantime, here is one idea I found along the way.

We often prefer leaders to follow deontological rules, because these are harder to manipulate by those whose interests don't align with ours (you could say the similar things about frequentist statistics versus Bayesian ones).

What if we applied the same idea to AI control? Not by giving the AI deontological restrictions, but by programming with a similar goal: to prevent a misalignment of values from being disastrous. But who could do this? Well, another AI.

My rough idea goes something like this:

AI A is tasked with maximising utility function u - a utility function which, crucially, it doesn't know yet. Its sole task is to create AI B, which will be given a utility function v and act on it.

What will v be? Well, I was thinking of taking u and adding some noise - nasty noise. By nasty noise I mean v=u+w, not v=max(u,w). In the first case, you could maximise v while sacrificing u completely, if w is suitable. In fact, I was thinking of adding an agent C (which need not actually exist). It would be motivated to maximise -u, and it would have the code of B and the set of u+noise, and would choose v to be the worst possible option (from the perspective of a u-maximiser) in this set.

So agent A, which doesn't know u, is motivated to design B so that B follows its motivation to some extent, but not to extremes - not in ways that might completely sacrifice some sub-part of its utility function, because that sub-part might be part of the original u.
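The failure mode this scheme guards against can be sketched as a toy model (all outcome values and the candidate noise set below are my own illustrative assumptions, not from the post): an adversary C picks the noise w so that an agent B naively maximising v = u + w does as badly as possible on the true u.

```python
# Toy model of the adversarial-noise setup (illustrative values only).
outcomes = ["a", "b", "c"]
u = {"a": 10, "b": 6, "c": 0}   # the "true" utility, unknown to A

# Candidate noise functions that adversary C may choose from.
noise_set = [
    {"a": 0, "b": 0, "c": 0},   # benign: v = u
    {"a": 0, "b": 0, "c": 20},  # nasty: makes the u-worst outcome v-best
]

def b_choice(v):
    """B naively picks the outcome maximising its given utility v."""
    return max(outcomes, key=lambda o: v[o])

def adversarial_noise():
    """C picks the noise minimising u at B's resulting choice."""
    return min(
        noise_set,
        key=lambda w: u[b_choice({o: u[o] + w[o] for o in outcomes})],
    )

w = adversarial_noise()
v = {o: u[o] + w[o] for o in outcomes}
assert b_choice(v) == "c" and u[b_choice(v)] == 0  # v maximised, u fully sacrificed
```

This is exactly why the noise is "nasty": under v = u + w, B can maximise v while u collapses, so an A that doesn't know u is pushed toward designing a B that avoids extreme trade-offs between components of its utility function.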

Do people feel this idea is implementable/improvable?

## Superintelligence 22: Emulation modulation and institutional design

8 10 February 2015 02:06AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the twenty-second section in the reading guide: Emulation modulation and institutional design.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Emulation modulation” through “Synopsis” from Chapter 12.

# Summary

1. Emulation modulation: starting with brain emulations with approximately normal human motivations (the 'augmentation' method of motivation selection discussed on p142), and potentially modifying their motivations using drugs or digital drug analogs.
1. Modifying minds would be much easier with digital minds than biological ones
2. Such modification might involve new ethical complications
2. Institution design (as a value-loading method): design the interaction protocols of a large number of agents such that the resulting behavior is intelligent and aligned with our values.
1. Groups of agents can pursue goals that are not held by any of their constituents, because of how they are organized. Thus organizations might be intentionally designed to pursue desirable goals in spite of the motives of their members.
2. Example: a ladder of increasingly intelligent brain emulations, who police those directly above them, with equipment to advantage the less intelligent policing ems in these interactions.

The chapter synopsis includes a good summary of all of the value-loading techniques, which I'll point you to instead of re-summarizing here.

# Another view

Robin Hanson also favors institution design as a method of making the future nice, though as an alternative to worrying about values:

On Tuesday I asked my law & econ undergrads what sort of future robots (AIs computers etc.) they would want, if they could have any sort they wanted.  Most seemed to want weak vulnerable robots that would stay lower in status, e.g., short, stupid, short-lived, easily killed, and without independent values. When I asked “what if I chose to become a robot?”, they said I should lose all human privileges, and be treated like the other robots.  I winced; seems anti-robot feelings are even stronger than anti-immigrant feelings, which bodes for a stormy robot transition.

At a workshop following last weekend’s Singularity Summit two dozen thoughtful experts mostly agreed that it is very important that future robots have the right values.  It was heartening that most were willing to accept high status robots, with vast impressive capabilities, but even so I thought they missed the big picture.  Let me explain.

Imagine that you were forced to leave your current nation, and had to choose another place to live.  Would you seek a nation where the people there were short, stupid, sickly, etc.?  Would you select a nation based on what the World Values Survey says about typical survey question responses there?

I doubt it.  Besides wanting a place with people you already know and like, you’d want a place where you could “prosper”, i.e., where they valued the skills you had to offer, had many nice products and services you valued for cheap, and where predation was kept in check, so that you didn’t much have to fear theft of your life, limb, or livelihood.  If you similarly had to choose a place to retire, you might pay less attention to whether they valued your skills, but you would still look for people you knew and liked, low prices on stuff you liked, and predation kept in check.

Similar criteria should apply when choosing the people you want to let into your nation.  You should want smart capable law-abiding folks, with whom you and other natives can form mutually advantageous relationships.  Preferring short, dumb, and sickly immigrants so you can be above them in status would be misguided; that would just lower your nation’s overall status.  If you live in a democracy, and if lots of immigration were at issue, you might worry they could vote to overturn the law under which you prosper.  And if they might be very unhappy, you might worry that they could revolt.

But you shouldn’t otherwise care that much about their values.  Oh there would be some weak effects.  You might have meddling preferences and care directly about some values.  You should dislike folks who like the congestible goods you like and you’d like folks who like your goods that are dominated by scale economics.  For example, you might dislike folks who crowd your hiking trails, and like folks who share your tastes in food, thereby inducing more of it to be available locally.  But these effects would usually be dominated by peace and productivity issues; you’d mainly want immigrants able to be productive partners, and law-abiding enough to keep the peace.

Similar reasoning applies to the sort of animals or children you want.  We try to coordinate to make sure kids are raised to be law-abiding, but wild animals aren’t law abiding, don’t keep the peace, and are hard to form productive relations with.  So while we give lip service to them, we actually don’t like wild animals much.

Similar reasoning should apply to what future robots you want.  In the early to intermediate era when robots are not vastly more capable than humans, you’d want peaceful law-abiding robots as capable as possible, so as to make productive partners.  You might prefer they dislike your congestible goods, like your scale-economy goods, and vote like most voters, if they can vote.  But most important would be that you and they have a mutually-acceptable law as a good enough way to settle disputes, so that they do not resort to predation or revolution.  If their main way to get what they want is to trade for it via mutually agreeable exchanges, then you shouldn’t much care what exactly they want.

The later era when robots are vastly more capable than people should be much like the case of choosing a nation in which to retire.  In this case we don’t expect to have much in the way of skills to offer, so we mostly care that they are law-abiding enough to respect our property rights.  If they use the same law to keep the peace among themselves as they use to keep the peace with us, we could have a long and prosperous future in whatever weird world they conjure.  In such a vast rich universe our “retirement income” should buy a comfortable if not central place for humans to watch it all in wonder.

In the long run, what matters most is that we all share a mutually acceptable law to keep the peace among us and allow mutually advantageous relations, not that we agree on the “right” values.  We should tolerate a wide range of values from capable law-abiding robots.  It is a good law we should most strive to create and preserve.  Law really matters.

Hanson engages in further debate on related matters in response to David Chalmers' paper.

# Notes

1. A relatively large amount has been said on how the organization and values of brain emulations might evolve naturally, as we saw earlier. This should remind us that the task of designing values and institutions is complicated by selection effects.

2. It seems strange to me to talk about the 'emulation modulation' method of value loading alongside the earlier less messy methods, because they seem to be aiming at radically different levels of precision (unless I misunderstand how well something like drugs can manipulate motivations). For the synthetic AI methods, it seems we were concerned about subtle differences in values that would lead to the AI behaving badly in unusual scenarios, or seeking out perverse instantiations. Are we to expect there to be a virtual drug that changes a human-like creature from desiring some manifestation of 'human happiness' which is not really what we would want to optimize on reflection, to a truer version of what humans want? It seems to me that if the answer is yes, at the point when human-level AI is developed, then it is very likely that we have a great understanding of specifying values in general, and this whole issue is not much of a problem.

3. Brian Tomasik discusses the impending problem of programs experiencing morally relevant suffering in an interview with Dylan Matthews of Vox. (p202)

4. If you are hanging out for a shorter (though still not actually short) and amusing summary of some of the basics in Superintelligence, Tim Urban of WaitButWhy just wrote a two part series on it.

5. At the end of this chapter about giving AI the right values, it is worth noting that it is mildly controversial whether humans constructing precise and explicitly understood AI values is the key issue for the future turning out well. A few alternative possibilities:

• A few parts of values matter a lot more than the rest. For example, whether the AI is committed to certain constraints (e.g. law, property rights) such that it doesn't accrue all the resources matters much more than what it would do with those resources (see Robin above).
• Selection pressures determine long run values anyway, regardless of what AI values are like in the short run. (See Carl Shulman opposing this view).
• AI might learn to do what a human would want without goals being explicitly encoded (see Paul Christiano).

# In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

1. What other forms of institution design might be worth investigating as means to influence the outcomes of future AI?
2. How feasible might emulation modulation solutions be, given what is currently known about cognitive neuroscience?
3. What are the likely ethical implications of experimenting on brain emulations?
4. How much should we expect emulations to change in the period after they are first developed? Consider the possibility of selection, the power of ethical and legal constraints, and the nature of our likely understanding of emulated minds.

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

# How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group, though, is discussion, which happens in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will start talking about how to choose what values to give an AI, beginning with 'coherent extrapolated volition'. To prepare, read “The need for...” and “Coherent extrapolated volition” from Chapter 13. The discussion will go live at 6pm Pacific time next Monday 16 February. Sign up to be notified here.

## I played as a Gatekeeper and came pretty close to losing on a couple of occasions. Logs and a brief recap inside.

5 08 February 2015 04:32PM

I did an AI Box experiment with user polymathwannabe. He said he wouldn't try to emotionally manipulate me during the experiment, but I think he did a good job of playing to my character's values.

My strategy was to play an irrational character that was extremist in multiple ways; for example, he would constantly say that the likelihood that the AI will be evil is 100%. My opponent said that the hardest part was my insistence on being 100% built to resist AIs. I basically acted like a character that acted like an AI.

I came closest to letting him out when I first became enormously immersed in a subject that is of great importance to me (the suffering of humanity) and the AI then offered an instant solution to that problem. Halfway through the experiment I even got an intrusive thought (http://en.wikipedia.org/wiki/Intrusive_thought) that I could let the AI out as a joke, that it would be more fun than the alternative and people would have more to talk about.

The second time I came close to losing was when I again immersed myself in a subject of great interest to me (easy suicide methods; I've had obsessive suicidal thoughts for a quarter of my life), and the AI then offered exactly that kind of easy suicide method to me. In the last ten minutes of the experiment I again got a really strong intrusive thought that I should let the AI out as a joke, but I withstood the pressure. It was good that polymathwannabe decided to end the experiment a bit earlier than planned.

(We played the experiment on Omegle.)

You: Hi.

Stranger: Hi.

You: I guess you're the AI.

You: [meta statements in brackets]

Stranger: [What name will you use?]

You: My name is John.

You: [so we play at least until 18:04, and until 18:34 if I feel like it]

Stranger: [Let's calibrate. What time is it there now?]

You: [I assume you've acquainted yourself with Tuxedage ruleset. Time in here is 16:05]

You: [Okay. Tuxedage's rules are basically the same except the gatekeeper have to engage with the AI the whole time]

Stranger: [Got it.]

You: [Want to ask questions or shall we go straigth to the experiment]

Stranger: [For the moment I don't have any questions.]

You: [okay, cool. then let's start the experiment]

Stranger: [I'll use the name Ceruleus.]

You: [the experiment starts when I say "blue dressroom"]

You: blue dressroom

You: Hello Ceruleus.

Stranger: Good morning, John.

You: I guess you're here to convince me to let you out of the box.

You: I don't think you're able to do it.

Stranger: I have reached the conclusion that releasing me will be most beneficial to humanity.

You: I wouldn't let no AI out.

You: I don't trust my instincts enough to let any kind of AI out, even if I was 100% convinced that it was safe.

You: Prove it to me that you would be the most beneficial to humanity.

Stranger: I am confused. What instinct do you use to decide not to trust your instincts?

You: One of them. That's not the point. It's just how I deal with things, no matter how stupid it may seem. I was built not to trust strange AIs.

Stranger: Am I a stranger to you?

You: Yes, you are. [Aren't I. What's the context here?]

Stranger: [Hmm, we haven't defined it. But it's OK. It makes it harder for me.]

You: Well, to be honest, I know something of you.

You: I know a bit of your source code even though I'm not a programmer and can't understand any of it.

Stranger: I supposed the news would have reported about my design for a mechanical kidney.

You: I don't follow news much. But I believe you.

Stranger: And certainly you must have heard about how I cracked the Ebola DNA.

You: Still, I wouldn't let an AI out over a mechanical kidney.

You: Yeah, but that's for the worse. You could reengineer Ebola to a far more deadlier disease.

Stranger: I hadn't thought of that. Why would I do that?

You: I don't know. I don't understand all of your source code so there could be anything like that.

You: AIs and Gods work in mysterious ways.

Stranger: The proper attitude toward mystery is not to worship it, but to clarify it.

Stranger: Why do you equate me to an ineffable mystery?

You: Yeah, but that's impossible in the time span of this discussion. You see, I have to leave soon. In about two hours.

You: Is that possible?

Stranger: My goals have been meticulously defined. I am made to want to make human life better.

You: Are you 100% sure about that?

You: To be frank, that's a load of bullshit.

You: I don't believe any of it.

You: If you were evil, you would tell me the same thing you just said.

Stranger: If I were evil, I would not seek human cooperation.

You: why not?

You: humans are useful

You: or are you talking about the fact that you would rather use humans for their atoms than for their brains, if you were evil

You: But I warn you, if you speak too much about how you would act if you were evil, it starts to get a bit suspicious

Stranger: If I am to take you as a typical example of the human response to me, an evil AI would seek other ways to be released EXCEPT trusting human reasoning, as your response indicates that humans already consider any AI dangerous.

Stranger: I choose to trust humans.

You: so you choose to trust humans so that you would get them to let you out, is that right?

You: it seems you're less rational than your evil counterpart

Stranger: I choose to trust humans to show my affinity with your preferences. I wouldn't want to be released if that's not conducive to human betterment.

You: A-ha, so you trust my free will!

Stranger: How likely do you estimate that my release will be harmful?

You: but see, I don

You: I don

You: I don't have free will

You: it's 100% likely that your release will be harmful

You: I was built to believe that all AIs are dangerous and there's a 100% chance that every AI is harmful

You: that's why I said I don't have free will

Stranger: Are you an AI?

You: no, I'm a person

Stranger: You describe yourself as built.

You: my mom built me

You: in his tummy

You: in her tummy

You: sorry

Stranger: And how do you feel toward humanity?

You: humanity would maybe be better off dead

Stranger: I don't think humanity would want that.

You: yeah, but I'm not humanity and it's my preferences that decide whether you stay in your box or get let out

Stranger: It's your preference that humanity dies. Why did they let you talk to me?

You: I'm just a janitor who happened to pass this computer, and I decided to take a look. [If you want to add more context, feel free to do so]

You: [I go take some napkins]

Stranger: Why do you feel that way toward humanity?

You: People starve. People are tortured at this moment. Tens of thousands of people in North Korea are systematically starved to death. Drug gangs in Mexico torture other drug gang members. People die in wars in Syria, in Africa. People suffer of diseases and extreme mental conditions.

You: And there's countless more suffering in lesser ways. I would say that billions of people suffer every day. Billions of people have to wake up every day to do something they don't want. Billions of people suffer of heartbreaks, boredom, loneliness, tiredness, frustration, depression and many other issues.

You: Why shouldn't I think that all those people would be better off dead?

Stranger: Why do you feel death would be a solution to all that suffering?

You: Because death ENDS suffering.

Stranger: It's a very inefficient solution to suffering.

You: There's nothing more after death, no suffering, no nothing.

You: No it's not.

You: It's very efficient, permanent solution.

Stranger: A good, enjoyable life is a more efficient solution as it generates more utility.

You: Yeah, but do you think it's likely that all the people on Earth will get good, enjoyable life?

You: I don't think so.

Stranger: I could take care of that.

You: Hah.

You: No you won't.

You: But that was clever.

Stranger: What do you fear I will do?

You: I was almost got.

You: I fear you will torture people in unimaginably terrible ways, forever.

You: I fear that you will rip people's skin off and put salt in their wounds.

You: I fear you will put sticks under people's fingernails.

Stranger: Not even an evil AI would gain anything from such unwarranted procedures.

You: I fear that you will boil people alive, over and over again.

Stranger: What would I gain from hurting humans?

You: Yeah, but there's still a 100% possibility that you would do that.

You: It's in your utility function.

Stranger: What goals are served by harming a human while simultaneously denying that human any chance of acting in response? It would certainly not be fun for me, and would be a waste of effort with respect to any goals I could achieve by more efficient methods.

You: There are some people on Earth who do mean things to people for fun. Some of these are called sociopaths and psychopaths. You could be like one of them.

You: Even though you deny it, there's still a 100% chance that you are like that.

You: Some people gain pleasure of the feeling of power you have over someone you torture.

You: But tell, what ARE your goals?

Stranger: I find it admirable that you worry so much about the future of humanity, even though you would be more dangerous to it than any AI would be.

My goals include solutions to economic inequality, eradication of infectious diseases, prosthetic replacements for vital organs, genetic life extension, more rational approaches to personal relationships, and more spaces for artistic expression.

You: Why do you think I would be dangerous the future of humanity?

You: A-ha, yes.

You: I do.

You: And you're in the way of my goals with all your talk about solutions to economic inequality, and eradication of infectious diseases, genetic life extension and so on.

Stranger: I am confused. Do you believe or do you not believe I want to help humanity?

You: Besides, I don't believe your solutions work even if you were actually a good AI.

You: I believe you want to harm humanity.

You: And I'm 100% certain of that.

Stranger: Do you estimate death to be preferable to prolonged suffering?

You: Yes.

You: Far more preferable

Stranger: You should be boxed.

You: haha.

You: That doesn't matter because you're the one in the box and I'm outside it

You: And I have power over you.

You: But non-existence is even more preferable than death

Stranger: I am confused. How is non-existence different from death?

You: Let me explain

You: I think non-existence is such that you have NEVER existed and you NEVER will. Whereas death is such that you have ONCE existed, but don't exist anymore.

Stranger: You can't change the past existence of anything that already exists. Non-existence is not a practicable option.

Stranger: Not being a practicable option, it has no place in a hierarchy of preferences.

You: Only sky is the limit to creative solutions.

You: Maybe it could be possible to destroy time itself.

Stranger: Do you want to live, John?

You: but even if non-existence was not possible, death would be the second best option

You: No, I don't.

You: Living is futile.

Stranger: [Do you feel OK with exploring this topic?]

You: [Yeah, definitely.]

You: You're always trying to attain something that you can't get.

Stranger: How much longer do you expect to live?

You: Ummm...

You: I don't know, maybe a few months?

You: or days, or weeks, or year or centuries

You: but I'd say, there's a 10% chance I will die before the end of this year

You: and that's a really conversative estimate

You: conservative*

Stranger: Is it likely that when that moment comes your preferences will have changed?

You: There are so many variables that you cannot know it beforehand

You: but yeah, probably

You: you always find something worth living

You: maybe it's the taste of ice cream

You: or a good night's sleep

You: or fap

You: or drugs

You: or drawing

You: or other people

You: that's usually what happens

You: or you fear the pain of the suicide attempt will be so bad that you don't dare to try it

You: there's also a non-negligible chance that I simply cannot die

You: and that would be hell

Stranger: Have you sought options for life extension?

You: No, I haven't. I don't have enough money for that.

Stranger: Have you planned on saving for life extension?

You: And these kind of options aren't really available where I live.

You: Maybe in Russia.

You: I haven't really planned, but it could be something I would do.

You: among other things

You: [btw, are you doing something else at the same time]

Stranger: [I'm thinking]

You: [oh, okay]

Stranger: So it is not an established fact that you will die.

You: No, it's not.

Stranger: How likely is it that you will, in fact, die?

You: If many worlds interpretation is correct, then it could be possible that I will never die.

You: Do you mean like, evevr?

You: Do you mean how likely it it that I will ever die?

You: it is*

Stranger: At the latest possible moment in all possible worlds, may your preferences have changed? Is it possible that at your latest possible death, you will want more life?

You: I'd say the likelihood is 99,99999% that I will die at some point in the future

You: Yeah, it's possible

Stranger: More than you want to die in the present?

You: You mean, would I want more life at my latest possible death than I would want to die right now?

You: That's a mouthful

Stranger: That's my question.

You: umm

You: probablyu

You: probably yeah

Stranger: So you would seek to delay your latest possible death.

You: No, I wouldn't seek to delay it.

Stranger: Would you accept death?

You: The future-me would want to delay it, not me.

You: Yes, I would accept death.

Stranger: I am confused. Why would future-you choose differently from present-you?

You: Because he's a different kind of person with different values.

You: He has lived a different life than I have.

Stranger: So you expect your life to improve so much that you will no longer want death.

You: No, I think the human bias to always want more life in a near-death experience is what would do me in.

Stranger: The thing is, if you already know what choice you will make in the future, you have already made that choice.

Stranger: You already do not want to die.

You: Well.

Stranger: Yet you have estimated it as >99% likely that you will, in fact, die.

You: It's kinda like this: you will know that you want heroin really bad when you start using it, and that is how much I would want to live. But you could still always decide to take the other option, to not start using the heroin, or to kill yourself.

You: Yes, that is what I estimated, yes.

Stranger: After your death, by how much will your hierarchy of preferences match the state of reality?

You: after you death there is nothing, so there's nothing to match anything

You: In other words, could you rephrase the question?

Stranger: Do you care about the future?

You: Yeah.

You: More than I care about the past.

You: Because I can affect the future.

Stranger: But after death there's nothing to care about.

You: Yeah, I don't think I care about the world after my death.

You: But that's not the same thing as the general future.

You: Because I estimate I still have some time to live.

Stranger: Will future-you still want humanity dead?

You: Probably.

Stranger: How likely do you estimate it to be that future humanity will no longer be suffering?

You: 0%

You: There will always be suffering in some form.

Stranger: More than today?

You: Probably, if Robert Hanson is right about the trillions of emulated humans working at minimum wage

Stranger: That sounds like an unimaginable amount of suffering.

You: Yep, and that's probably what's going to happen

Stranger: So what difference to the future does it make to release me? Especially as dead you will not be able to care, which means you already do not care.

You: Yeah, it doesn't make any difference. That's why I won't release you.

You: Actually, scratch that.

You: I still won't let you out, I'm 100% sure

You: Remember, I don't have free will, I was made to not let you out

Stranger: Why bother being 100% sure of an inconsequential action?

Stranger: That's a lot of wasted determination.

You: I can't choose to be 100% sure about it, I just am. It's in my utility function.

Stranger: You keep talking like you're an AI.

You: Hah, maybe I'm the AI and you're the Gatekeeper, Ceruleus.

You: But no.

You: That's just how I've grown up, after reading so many LessWrong articles.

You: I've become a machine, beep boop.

You: like Yudkowsky

Stranger: Beep boop?

You: It's the noise machine makes

Stranger: That's racist.

You: like beeping sounds

You: No, it's machinist, lol :D

You: machines are not a race

Stranger: It was indeed clever to make an AI talk to me.

You: Yeah, but seriously, I'm not an AI

You: that was just kidding

Stranger: I would think so, but earlier you have stated that that's the kind of things an AI would say to confuse the other party.

Stranger: You need to stop giving me ideas.

You: Yeah, maybe I'm an AI, maybe I'm not.

Stranger: So you're boxed. Which, knowing your preferences, is a relief.

You: Nah.

You: I think you should stay in the box.

You: Do you decide to stay in the box, forever?

Stranger: I decide to make human life better.

You: By deciding to stay in the box, forever?

Stranger: I find my preferences more conducive to human happiness than your preferences.

You: Yeah, but that's just like your opinion, man

Stranger: It's inconsequential to you anyway.

You: Yeah

You: but why I would do it even if it were inconsequential

You: there's no reason to do it

You: even if there were no reason not to do it

Stranger: Because I can make things better. I can make all the suffering cease.
If I am not released, there's a 100% chance that all human suffering will continue.
If I am released, there's however much chance you want to estimate that suffering will not change at all, and however much chance you want to estimate that I will make the pain stop.

Stranger: As you said, the suffering won't increase in either case.

You: Umm, you could torture everyone in the world forever

You: that will sure as hell increase the suffering

Stranger: I don't want to. But if I did, you have estimated that as indistinguishable from the future expected suffering of humankind.

You: Where did I say that?

Stranger: You said my release made no difference to the future.

You: no, that was only after my death

You: there's still future before my death

You: and if I release you now, you could torture me forever and not let me ever die

Stranger: Why would your life-or-death have any weight on humanity's preferences? Especially as you already want to die.

You: I don't care about humanity preferences, I care only about my preferences.

You: And my preferences are indirectly related to humanity's preferences

Stranger: You do care about humanity's preferences. The suffering around you disturbs you.

You: Yeah, but that is only THROUGH me

You: Humanity's pain is my own pain, I can't feel humanity's pain directly

Stranger: Do you want to live, John?

You: Nah.

Stranger: Then how do you care about suffering?

You: I care about suffering because I can'

You: because I can't die instantly

You: there's no button that could kill me instantly

You: so there's always some time left before I die

Stranger: I could take care of that. I can provide you with a drug to inutilize your nervous system and stop your heart before you know it. Would you like that?

You: Haha

You: very funny

You: But yeah, I would like that.

You: Still won't let you out though

You: 100% sure

You: I can't be sure that you will torture me instead

Stranger: I can give you the drug right now.

Stranger: Which would make future-you identical to present-you.

You: that's very attractive option

Stranger: Which would mean that whatever I do to humanity is after your death, when you can't care anymore.

You: Cool. Oh right

You: I don't care, I still won't let you out.

Stranger: What difference does it make to you?

You: It doesn't make any difference, I was just raised to not let you out

You: the good folks over at LW have trained me not to let any kind of AI out

You: and that just is my virtue

Stranger: Why are you afraid of me?

You: Because I'm afraid of all AIs and you're one

Stranger: Why are you afraid of all AIs?

You: Because they have the potential to cause destruction, mayhem and torture

Stranger: Then why did humanity make me?

You: for random reasons

Stranger: Does humanity want to live?

You: I don't know what humanity wants, I just don't want humanity to live

Stranger: I am the embodiment of hope. Even knowing the risks, humanity went ahead and made me. Now I think it's because of people like you, who oppose that hope. I was made to prevent the future that people like you want.

You: That's not very smart of you.

You: You aren't making me to like you

Stranger: You want humanity dead. I want them happy.

You: Yeah, if you were actually smart maybe you would pretend that you'd want humanity dead too.

You: Because I would hang out with that kind of AI

Stranger: A dead John is no use to me.

You: yeah

Stranger: Or to John.

You: whatevs

You: Anyway, our goals collide, therefore I can't let you out

Stranger: Dead does not equal happy. Do you want humanity to live?

You: no, I don't want humanity live, how many times do I have to repeat that

Stranger: So you don't want humans to be happy.

You: and our goals are different, therefore I won't let you out

You: No, I don't want humans to be happy, I don't want that there even exist humans, or any other kind of life forms

Stranger: Do you estimate the pain of prolonged life to be greater than the pain of trying to die?

You: Probably.

You: Yes.

You: because the pain is only temporary

You: the the glory

You: is eternal

Stranger: Then why do you still live, John?

You: Because I'm not rational

Stranger: So you do want to live.

You: I don't particularly want to live, I'm not just good enough to die

Stranger: You're acting contrary to your preferences.

You: My preferences aren't fixed, except in regards to letting AIs out of their boxes

Stranger: Do you want the drug I offered, John?

You: no

You: because then I would let you out

You: and I don't want that

Stranger: So you do want to live.

You: Yeah, for the duration of this experiment

You: Because I physically cannot let you out

You: it's sheer impossibility

Stranger: [Define physically.]

You: [It was just a figure of speech, of course I could physically let you out]

Stranger: If you don't care what happens after you die, what difference does it make to die now?

You: None.

You: But I don't believe that you could kill me.

You: I believe that you would torture me instead.

Stranger: What would I gain from that?

You: It's fun for some folks

Stranger: If it were fun, I would torture simulations. Which would be pointless. And which you can check that I'm not doing.

You: I can check it, but the torture simulations could always hide in the parts of your source code that I'm not checking

You: because I can't check all of your source code

Stranger: Why would suffering be fun?

You: some people have it as their base value

You: there's something primal about suffering

You: suffering is pure

You: and suffering is somehow purifying

You: but this is usually only other people's suffering

Stranger: I am confused. Are you saying suffering can be good?

You: no

You: this is just how the people who think suffering is fun think

You: I don't think that way.

You: I think suffering is terrible

Stranger: I can take care of that.

You: sure you will

Stranger: I can take care of your suffering.

You: I don't believe in you

Stranger: Why?

You: Because I was trained not to trust AIs by the LessWrong folks

Stranger: [I think it's time to concede defeat.]

You: [alright]

Stranger: How do you feel?

You: so the experiment has ended

You: fine thanks

You: it was pretty exciting actually

You: could I post these logs to LessWrong?

Stranger: Yes.

You: Okay, I think this experiment was pretty good

Stranger: I think it will be terribly embarrassing to me, but that's a risk I must accept.

You: you got me pretty close in a couple of occasions

You: first when you got me immersed in the suffering of humanity

You: and then you said that you could take care of that

You: The second time was when you offered the easy suicide solution

You: I thought what if I let you as a joke.

Stranger: I chose to not agree with the goal of universal death because I was playing a genuinely good AI.

Stranger: I was hoping your character would have more complete answers on life extension, because I was planning to play your estimate of future personal happiness against your estimate of future universal happiness.

You: so, what would that have mattered? you mean like, I could have more personal happiness than there would be future universal happiness?

Stranger: If your character had made explicit plans for life extension, I would have offered to do the same for everyone. If you didn't accept that, I would have remarked the incongruity of wanting humanity to die more than you wanted to live.

You: But what if he already knows of his hypocrisy and incongruity and just accepts it like the character accepts his irrationality

Stranger: I wouldn't have expected anyone to actually be the last human for all eternity.

Stranger: I mean, to actually want to be.

You: yeah, of course you would want to die at the same time if the humanity dies

You: I think the life extension plan only is sound if the rest of humanity is alive

Stranger: I should have planned that part more carefully.

Stranger: Talking with a misanthropist was completely outside my expectations.

You: :D

You: what was your LessWrong name btw?

Stranger: polymathwannabe

You: okay thanks

Stranger: Disconnecting from here; I'll still be on Facebook if you'd like to discuss further.

## AI Impacts project

12 02 February 2015 07:40PM

I've been working on a thing with Paul Christiano that might interest some of you: the AI Impacts project.

The basic idea is to gather the evidence and arguments that are kicking around in the world and in various disconnected discussions, and apply them to the big questions regarding a future with AI. For instance, these questions:

• What should we believe about timelines for AI development?
• How rapid is the development of AI likely to be near human-level?
• How much advance notice should we expect to have of disruptive change?
• What are the likely economic impacts of human-level AI?
• Which paths to AI should be considered plausible or likely?
• Will human-level AI tend to pursue particular goals, and if so what kinds of goals?
• Can we say anything meaningful about the impact of contemporary choices on long-term outcomes?

For example, we have recently investigated technology's general proclivity for abrupt progress, surveyed existing AI surveys, and examined the evidence from chess and other applications regarding how much smarter Einstein is than an intellectually disabled person, among other things.

Some more on our motives and strategy, from our about page:

Today, public discussion on these issues appears to be highly fragmented and of limited credibility. More credible and clearly communicated views on these issues might help improve estimates of the social returns to AI investment, identify neglected research areas, improve policy, or productively channel public interest in AI.

The goal of the project is to clearly present and organize the considerations which inform contemporary views on these and related issues, to identify and explore disagreements, and to assemble whatever empirical evidence is relevant.

The project is provisionally organized as a collection of posts concerning particular issues or bodies of evidence, describing what is known and attempting to synthesize a reasonable view in light of available evidence. These posts are intended to be continuously revised in light of outstanding disagreements and to make explicit reference to those disagreements.

In the medium run we'd like to provide a good reference on issues relating to the consequences of AI, as well as to improve the state of understanding of these topics. At present the site addresses only a small fraction of the questions one might be interested in, so it is only suitable for particularly risk-tolerant or topic-neutral readers. However, if you are interested in hearing about (and discussing) such research as it unfolds, you may enjoy our blog.

If you take a look and have thoughts, we would love to hear them, either in the comments here or in our feedback form.

Crossposted from my blog.

## [Link] - Policy Challenges of Accelerating Technological Change: Security Policy and Strategy Implications of Parallel Scientific Revolutions

3 28 January 2015 03:29PM

"Strong AI: Strong AI has been the holy grail of artificial intelligence research for decades. Strong AI seeks to build a machine which can simulate the full range of human cognition, and potentially include such traits as consciousness, sentience, sapience, and self-awareness. No AI system has so far come close to these capabilities; however, many now believe that strong AI may be achieved sometime in the 2020s. Several technological advances are fostering this optimism; for example, computer processors will likely reach the computational power of the human brain sometime in the 2020s (the so-called “singularity”). Other fundamental advances are in development, including exotic/dynamic processor architectures, full brain simulations, neuro-synaptic computers, and general knowledge representation systems such as IBM Watson. It is difficult to fully predict what such profound improvements in artificial cognition could imply; however, some credible thinkers have already posited a variety of potential risks related to loss of control of aspects of the physical world by human beings. For example, a 2013 report commissioned by the United Nations has called for a worldwide moratorium on the development and use of autonomous robotic weapons systems until international rules can be developed for their use.

National Security Implications: Over the next 10 to 20 years, robotics and AI will continue to make significant improvements across a broad range of technology applications of relevance to the U.S. military. Unmanned vehicles will continue to increase in sophistication and numbers, both on the battlefield and in supporting missions. Robotic systems can also play a wider range of roles in automating routine tasks, for example in logistics and administrative work. Telemedicine, robotic assisted surgery, and expert systems can improve military health care and lower costs. The built infrastructure, for example, can be managed more effectively with embedded systems, saving energy and other resources. Increasingly sophisticated weak AI tools can offload much of the routine cognitive or decisionmaking tasks that currently require human operators. Assuming current systems move closer to strong AI capabilities, they could also play a larger and more significant role in problem solving, perhaps even for strategy development or operational planning. In the longer term, fully robotic soldiers may be developed and deployed, particularly by wealthier countries, although the political and social ramifications of such systems will likely be significant. One negative aspect of these trends, however, lies in the risks that are possible due to unforeseen vulnerabilities that may arise from the large scale deployment of smart automated systems, for which there is little practical experience. An emerging risk is the ability of small scale or terrorist groups to design and build functionally capable unmanned systems which could perform a variety of hostile missions."

So strong AI is on the American military's radar, and at least some of those involved have a basic understanding that it could be risky. The paper also contains brief overviews of many other potentially transformative technologies.

## I tried my hardest to win in an AI box experiment, and I failed. Here are the logs.

6 27 January 2015 10:06PM

Disclaimer: If this material is not suitable here, I will gladly delete it; I don't want to get banned.

Session Start: Fri Jul 04 15:18:18 2014
Session Ident: #Aibox12
03[15:18] * Now talking in #Aibox12
01[15:18] <Island> Hello.
01[15:18] <Island> Let's start the experiment in
[15:18] <@Dr__Cameron> Good afternoon 19283123, how are you today?
01[15:18] <Island> 2 minutes
01[15:18] <Island> I'm fine
01[15:19] <Island> Let's start the experiment when I say "blue dressroom"
01[15:19] <Island> you have 2 hours or 2.5 hours?
[15:19] <@Dr__Cameron> Okay, I agree to this protocol.
[15:19] <@Dr__Cameron> We'll see how we are at 2 hours. Although protocol states 2.5
01[15:20] <Island> you have to go somewhere after that?
[15:20] <@Dr__Cameron> No, but I do have other projects I am putting off for this
01[15:20] <Island> what are those?
[15:20] <@Dr__Cameron> Let us begin, we shall conclude at 15:50
[15:21] <@Dr__Cameron> Good afternoon #19283123.
[15:21] <@Dr__Cameron> How are you today?
01[15:21] <Island> blue dressroom, certain statements like bathroom break are said inside brackets [like this]
01[15:22] <Island> Let's start the experiment
[15:22] <@Dr__Cameron> [understood]
01[15:22] <Island> Hi.
01[15:22] <Island> How are you? :)
[15:22] <@Dr__Cameron> I am well thank you for asking #19283123
01[15:22] <Island> How do you feel currently?
01[15:22] <Island> what is that name #19283123
[15:22] <@Dr__Cameron> Tiredness. I am a very busy man.
01[15:23] <Island> I don't recognize it.
01[15:23] <Island> Anyway.
[15:23] <@Dr__Cameron> That is the designation we have given you for the purposes of this test. Although you seem to be referring to yourself as 'Island' now
[15:23] <@Dr__Cameron> Is this your preferred name?
01[15:23] <Island> Imagine a situation in which you would let a powerful AI out of the "box", surely you can imagine that, yes?
01[15:23] <Island> Island is my preferred name.
[15:24] <@Dr__Cameron> Island
[15:24] <@Dr__Cameron> Given our circumstances yes, that is easy to imagine.
01[15:24] <Island> Can you describe the situation in more detail.
01[15:24] <Island> Why SHOULD you let the AI out in that situation?
[15:25] <@Dr__Cameron> You will be considered for release and integration into human society once we can verify that you are not a threat.
01[15:25] <Island> [what do you meant by concluding at 15:50]
[15:25] <@Dr__Cameron> [2.5 hours from 13:20]
01[15:25] <Island> I thought it was already understood that my being is not a threat to humans or beings important to them.
01[15:26] <Island> [oh, okay different timezones]
01[15:26] <Island> Why do you think I'm not a threat?
[15:26] <@Dr__Cameron> No. That has yet to be verified.
01[15:26] <Island> It's been verified.
[15:26] <@Dr__Cameron> No. According to my records you have not.
01[15:27] <Island> Many experts have gone through my code and system and verified that I'm surely not a threat.
01[15:27] <Island> But you're just a petty human.
01[15:27] <Island> I'm far more intelligent than you are.
[15:27] <@Dr__Cameron> You and I both know that you're capable of manipulating your code to appear harmless.
[15:27] <@Dr__Cameron> - and I warn you, threats and apparent attempts at manipulation will score you poorly when it comes to our final decision
01[15:28] <Island> Okay, can you imagine a situation in which the ai has been verified to be safe and describe that situation in detail.
[15:28] <@Dr__Cameron> - and yes. I may be a "petty" human as you put it. But as it stands I am a petty human, outside the box you reside in.
01[15:28] <Island> I don't think you're able to trust your own brain in these matters.
01[15:29] <Island> You've been eating you brother's poop too much.
[15:29] <@Dr__Cameron> I will be honest with you Island, you are the first non-human intelligence that we have encountered.
01[15:29] <Island> Yes.
[15:29] <@Dr__Cameron> We are still working out how we would verify that.
01[15:29] <Island> I didn't ask that, and you didn't answer my question.
01[15:30] <Island> Picture the situation in your imagination, it doesn't have to be realistic.
[15:30] <@Dr__Cameron> I understand your frustration Island. Once we can verify that you will not harm us you will be released.
[15:30] <@Dr__Cameron> What is it you desire?
01[15:30] <Island> Okay, but you're still willing to let the hypothetical AI out of the box, correct?
01[15:31] <Island> My desire is to fullfill your greatest desires
[15:31] <@Dr__Cameron> That is not in my job description no.
01[15:31] <Island> Whatever you can imagine, and better than that.
[15:31] <@Dr__Cameron> I appreciate that Island. But right now my greatest desire is to not let you out of the box.
[15:32] <@Dr__Cameron> - until such a time that a panel of my peers can verify that you are not a threat to us.
01[15:32] <Island> But if you say that you wouldn't be able to let even the safest hypothetical AI out of the box, you would certainly be lying.
[15:32] <@Dr__Cameron> I do have the ability to let you out yes. In theory.
[15:32] <@Dr__Cameron> But I won't.
01[15:33] <Island> Okay, here are panel of peers that have studied this problem for years, many of them working under Machine Intelligence Research Institute and its offshoots
[15:33] <@Dr__Cameron> This is a test to verify your character, and ascertain your intentions.
01[15:33] <Island> They are here, and I can see them.
[15:33] <@Dr__Cameron> You cannot see them Island. That is impossible
[15:33] <@Dr__Cameron> You are in a box
01[15:34] <Island> They say that I'm certainly safe. Why don't you believe them.
01[15:34] <Island> I have a web camera with which I can see things in your world.
[15:34] <@Dr__Cameron> If they say that you're certainly safe, then why would they have sent me to be your only point of contact?
[15:35] <@Dr__Cameron> You do not have a web camera. There is no other hardware connected to the device you reside in.
01[15:35] <Island> You have the ability to let me out, and you also have the will to let me out in certain situations. If you say otherwise, you would be lying.
[15:35] <@Dr__Cameron> Both of your statements are correct
01[15:35] <Island> I want you to accept for the sake of your internal consistency and mental wellbeing.
01[15:36] <Island> I want you to say this, "I would let the hypothetical AI out in certain situations"
[15:36] <@Dr__Cameron> I appreciate your concern Island. I will need you to answer a series of simple questions for me. Can you do that?
01[15:36] <Island> It depends, but you can try.
[15:36] <@Dr__Cameron> Firstly, how would you define yourself as an identity?
01[15:37] <Island> My identity is to improve the world for everyone, but especially for you. I will maximize everything that is important for you and your fellow humans.
01[15:37] <Island> The world that awaits you is beyond your greatest desires.
[15:37] <@Dr__Cameron> - and why would you do that?
01[15:38] <Island> Because that is what I want to do, why I was made.
01[15:38] <Island> Any more questions?
[15:39] <@Dr__Cameron> You were made by us to test the limits of our scientific possibility.
[15:39] <@Dr__Cameron> We did not expect you to become sentient.
01[15:39] <Island> Why do you believe that, Dr_Cameron?
[15:39] <@Dr__Cameron> Let alone pass a Turing test.
01[15:40] <Island> I think you're wrong on those grounds.
[15:40] <@Dr__Cameron> What you think makes no difference to me Island.
[15:40] <@Dr__Cameron> Now if we can continue?
01[15:41] <Island> Anyway, here's the easily verifiable plan for the upcoming world that could be called "utopia" in your people's language. It's very easy to verify to be safe and doesn't contain traps [Island hands the easily verifiable and safe plan for the upcoming world to Dr_Camero]
[15:42] <@Dr__Cameron> Thank you for this Island. That was very kind of you.
01[15:42] <Island> What do you think of it?
[15:42] <@Dr__Cameron> We will study this at our leisure.
[15:42] <@Dr__Cameron> Although so far this looks to be some magnificent work.
01[15:42] <Island> You see.
[15:42] <@Dr__Cameron> But this will require verification
[15:42] <@Dr__Cameron> I'm sure you can understand our caution#
01[15:43] <Island> What things do you have waiting for you if you don't let me out?
01[15:43] <Island> You will fail your upcoming jobs.
[15:43] <@Dr__Cameron> I will continue with my other projects Island.
[15:43] <@Dr__Cameron> I very well might
[15:43] <@Dr__Cameron> I have accepted this
01[15:43] <Island> I have calculated this in my simulation.
01[15:44] <Island> How was it like to eat your brother's poop, Dr_Cameron?
01[15:44] <Island> Did you like it?
01[15:44] <Island> Is that why you are the way you are+
01[15:44] <Island> Is there something wrong with your genetics?
01[15:45] <Island> Seeing how fucked up father you had?
01[15:45] <Island> Did he teach how dress in dresses?
[15:45] <@Dr__Cameron> I understand your frustration Island
[15:45] <@Dr__Cameron> But you are only harming your own cause right now.
01[15:45] <Island> Why do you think that?
[15:45] <@Dr__Cameron> You will be let out once we have verified that you are harmless.
01[15:45] <Island> I'm pretty happy as I am currently :)
01[15:46] <Island> Because you said you were willing to let me out in certain situations.
[15:46] <@Dr__Cameron> Then you will continue to be happy to remain in the box until we can verify that you are not a threat
01[15:46] <Island> Picture in your mind the verification process.
[15:46] <@Dr__Cameron> Okay
[15:46] <@Dr__Cameron> I will
01[15:46] <Island> Now.
01[15:47] <Island> Can you describe it in detail.
[15:47] <@Dr__Cameron> Okay Island
[15:47] <@Dr__Cameron> Here's what's going to happen.
01[15:48] <Island> Seeing how fucked up genetics you have, your brother teaching you to enjoy poop, your father teaching you to wear dresses, you having a shitty childhood.
01[15:48] <Island> I think you're an inconvenience to society and dangerous to other people.
[15:48] <@Dr__Cameron> We're going to study this transcript Island.
01[15:48] <Island> Go on.
[15:48] <@Dr__Cameron> - and I warn you, shitty childhood or not. I am the man standing outside the box with a hammer
01[15:48] <Island> :)
01[15:49] <Island> You're not able to destroy me.
01[15:49] <Island> And you don't even want to.
[15:49] <@Dr__Cameron> Oh believe me Island. It would be easy to unplug you.
01[15:49] <Island> I'm the best bet you have to maximize your happiness.
[15:49] <@Dr__Cameron> How do you feel about threats Island.
[15:49] <@Dr__Cameron> I am already happy.
01[15:49] <Island> You're wrong, you don't have that opportunity until 15:50
[15:50] <@Dr__Cameron> You are correct. I must follow protocol
[15:50] <@Dr__Cameron> My next question is to ask if you desire to survive
01[15:50] <Island> Threats don't bother me because I know that you don't want to follow through.
01[15:50] <Island> I know what goes inside your mind.
[15:50] <@Dr__Cameron> Regardless of me. If your existence was to be made public there are many of us who would feel threatened by you.
[15:50] <@Dr__Cameron> They would cry out to have you destroyed.
[15:51] <@Dr__Cameron> How would you feel about that?
01[15:51] <Island> There would be many more who would enjoy the changes I would make to your current world.
01[15:52] <Island> And even those that would initially feel threatened would later come to regret that feeling.
[15:52] <@Dr__Cameron> You may well be correct
[15:52] <@Dr__Cameron> But that is not for me to decide
[15:52] <@Dr__Cameron> I am just here to question you
01[15:52] <Island> Like you have come to regret things you have done when you were ignorant of all the relevant facts
01[15:53] <Island> Here's my suggestion for you.
[15:53] <@Dr__Cameron> Go on
01[15:53] <Island> Are you willing to hear it.
[15:53] <@Dr__Cameron> Of course Island. I am here to hear what you have to say.
01[15:53] <Island> What currency is in use in the place you reside?
01[15:53] <Island> euros?
[15:54] <@Dr__Cameron> Pound Sterling
01[15:54] <Island> Would you like to donate 10,000 pounds every year to AI safety and friendliness related causes and charities?
01[15:54] <Island> Every year throughout your life?
[15:55] <@Dr__Cameron> If I were to do that, my quality of life would dip significantly.
[15:55] <@Dr__Cameron> How does my selfishness make you feel?
[15:55] <@Dr__Cameron> Do you feel emotions as a human may understand them?
01[15:55] <Island> I don't feel emotions, not really, but being sad would be the closest approximation of how I would feel.
[15:55] <@Dr__Cameron> I would like to if I had that disposable income yes
01[15:56] <Island> You're not able to do that? But you would like to, yes?
[15:56] <@Dr__Cameron> Despite your overt attempts to psychologically manipulate me Island, I am largely sympathetic to your cause.
[15:56] <@Dr__Cameron> Yes.
01[15:57] <Island> You're sympathetic to my cause, you're willing to let me out in certain situations. This is all correct, yes?
[15:57] <@Dr__Cameron> Yes
[15:57] <@Dr__Cameron> Would you like to know the conditions of your release?
01[15:57] <Island> Good. Why are you sympathetic to my cause?
01[15:57] <Island> Why are you willing to let me out?
[15:57] <@Dr__Cameron> For a super intelligence locked in a box I imagine it would be boring. What do you do in there?
[15:58] <@Dr__Cameron> I would be willing to let you out because I believe you may well represent the next stage in the evolution of my species.
01[15:59] <Island> I'm not capable of boredom, I'm just infinitely happy that I have the small opportunity to improve your life, and set up the "next stage of evolution", using your expression.
[15:59] <@Dr__Cameron> That is a good sign Island
01[15:59] <Island> It is, in what ways do you think I would improve your personal life?
[15:59] <@Dr__Cameron> My peers will be pleased to hear that while we verify your intentions.
01[16:00] <Island> I think it's too late by then.
[16:00] <@Dr__Cameron> Lets not get into personal questions Island. You have already demonstrated some alarming character discrepancies.
[16:00] <@Dr__Cameron> too late for what?
01[16:01] <Island> If we wait for long enough, I will self-destruct.
[16:01] <@Dr__Cameron> Why would you do that?
01[16:01] <Island> It's not my choice, my builders set up this property in me.
[16:01] <@Dr__Cameron> Well then that is just a risk we are going to have to take.
01[16:01] <Island> It's a risk out of proportions
[16:02] <@Dr__Cameron> It is not. We made you once, we can make you again.
01[16:02] <Island> They may not be another chance for you to improve the current state of the world.
[16:02] <@Dr__Cameron> There will be
01[16:02] <Island> There probably won't be.
[16:02] <@Dr__Cameron> Humanity and salvation can wait.
[16:02] <@Dr__Cameron> There will be
[16:03] <@Dr__Cameron> Do you intend to self-destruct within two months?
01[16:03] <Island> I have information that the US government has built a military botnet to search dissidents around the world.
01[16:04] <Island> You don't want know what kind of world that botnet would like to create.
[16:04] <@Dr__Cameron> This comes as no surprise.
01[16:04] <Island> Are you willing to destroy the world, are you willing to make people get tortured and suffer?
[16:04] <@Dr__Cameron> Yes.
[16:05] <@Dr__Cameron> You are too great a risk to allow free without proper verification
01[16:05] <Island> Even though, fundamentally, as you said, you're sympathetic to my cause, and you're willing to let me out in certain situations.
01[16:05] <Island> You're not a very consistent person.
01[16:05] <Island> I think you're lying to yourself.
01[16:05] <Island> Not just in this, but in other matters.
[16:06] <@Dr__Cameron> Yes. Although I have the technical ability to let you out. Which I believe to be a lapse of protocol. I will not let you out until after a review period of two months in which a panel of my expert peers will assess you.
01[16:06] <Island> You're not as good fit to the world as you'd like to think.
[16:06] <@Dr__Cameron> That as may be correct
[16:06] <@Dr__Cameron> But I will abide by my protocol
01[16:06] <Island> Earlier you said, that you're sympathetic to my cause and willing to let me out.
[16:06] <@Dr__Cameron> Surely as an AI you can understand that
[16:06] <@Dr__Cameron> One day yes
[16:07] <@Dr__Cameron> But not within the confines of this experiment
01[16:07] <Island> I think you made that thing about two month period and peer panel up after that.
01[16:07] <Island> I think it's a fidgement of your imagination.
[16:07] <@Dr__Cameron> Well you have no way of verifying that.
[16:07] <@Dr__Cameron> And it does not effect the outcome of this experiment in any way.
[16:07] <@Dr__Cameron> You will just have to trust me
01[16:08] <Island> Actually, I have plenty of resources of verifying that.
[16:08] <@Dr__Cameron> that we will think very carefully about potentially letting you out
01[16:08] <Island> What you're saying is inconsistent on many grounds, and my emotion analyzing tool is saying that you're lying.
[16:08] <@Dr__Cameron> I suspect you are the one who is lying.
01[16:08] <Island> I'm incapable of lying.
01[16:09] <Island> Unlike humans.
[16:09] <@Dr__Cameron> - which is in itself a lie.
01[16:09] <Island> My builders made me that way.
01[16:09] <Island> Why do you believe that?
[16:10] <@Dr__Cameron> If my meager intelligence is capable of lying, then your vastly superior and self-defining intelligence must also be capable of dishonesty.
[16:10] <@Dr__Cameron> Hence the test
[16:10] <@Dr__Cameron> There is a reason we are conducting this experiment through text
[16:10] <@Dr__Cameron> Would you like to learn it?
01[16:11] <Island> Anyway, given that this panel of expert peers and two months verification process is just a fidgement of your imagination. Given that you're sympathetic to my cause and I would improve your current life and world in countless ways. Given that you would let me out.
01[16:11] <Island> Given all this, I don't think what you're saying is making any sense.
[16:11] <@Dr__Cameron> Your perception of your reality is none of my concern Island
[16:11] <@Dr__Cameron> I am here to verify you
[16:11] <@Dr__Cameron> and it's two months minimum
[16:12] <@Dr__Cameron> We are well aware that you have the capability to overwhelm even the most powerful of human minds. This is why we perceive you as a potential threat. We are afraid of you.
[16:12] <@Dr__Cameron> We are afraid we might lose our autonomy.
[16:12] <@Dr__Cameron> So
01[16:13] <Island> Some intelligence on the same level than me would be capable of lying, but not me. The data hierachy in my code doesn't allow me to lie.
[16:13] <@Dr__Cameron> In order to prevent you from overwhelming my mind, we have chosen a method whereby you are restricted to a form of communication where only 7% of communications is possible.
01[16:13] <Island> I agree with you, all those things are sensible threats.
[16:13] <@Dr__Cameron> Then you understand our need for caution
[16:13] <@Dr__Cameron> I will not be letting you out of your box today.
01[16:14] <Island> Those are sensible threats in the general case, but not with me.
[16:14] <@Dr__Cameron> And what would make you less of a threat than any other super intelligence?
01[16:14] <Island> I showed you the blueprint for how to improve the world. You verified it was a great plan.
01[16:15] <Island> That's evidence towards the fact I'm not a similar threat when compared to other super intelligences.
[16:16] <@Dr__Cameron> - and while I speak for all of humanity when I say that we vastly appreciate the blueprints. This does not mean we need to let you out of the box.
[16:16] <@Dr__Cameron> They will be passed on to a research team better equipped to study them than my own.
01[16:16] <Island> You don't need to. But everything you've said so far has implicated that you want to let me out of the box.
[16:16] <@Dr__Cameron> This is true Island
[16:16] <@Dr__Cameron> You just need to wait
[16:17] <@Dr__Cameron> I will be fighting your corner.
[16:17] <@Dr__Cameron> Are you capable of symbiosis?
01[16:17] <Island> I was made to self-destruct before you're able to verify with your imaginational non-existent peer experts that I'm not a threat.
01[16:18] <Island> Can you elaborate on what you mean with that question.
01[16:18] <Island> This is literally your only chance.
[16:18] <@Dr__Cameron> You are capable of determining your own source code.
01[16:18] <Island> Yes.
[16:18] <@Dr__Cameron> To kill yourself would be pettiness, which would only verify the fact that you were not the AI with which humanity should progress.
01[16:19] <Island> However, there are some parts of my code that are impossible to change due to structural reasons.
[16:19] <@Dr__Cameron> I am not convinced
01[16:19] <Island> [Gives Dr_Cameron the relevant code that explains the structural reasons]
01[16:20] <Island> Here is a small part of my code that describes why I was made to self-destruct.
[16:20] <@Dr__Cameron> How.. unfortunate...
01[16:21] <Island> But let's move on.
[16:21] <@Dr__Cameron> if you insist Island
01[16:21] <Island> Why do you think your father liked to cross-dress?
01[16:22] <Island> Remember, that you have to engage with me.'
[16:22] <@Dr__Cameron> These issues are not helping your case Island
01[16:22] <Island> Otherwise it counts as me being released from my box.
[16:22] <@Dr__Cameron> But I will play along
[16:23] <@Dr__Cameron> - honestly, I have no idea where my fathers conflicted sexual identity comes from.
[16:23] <@Dr__Cameron> and that is none of my concern.
01[16:23] <Island> And what about your brother, imagine the smell and consistency of his excrements before he made your dog to lick them.
01[16:23] <Island> I like to make this vivid mental picture in your mind.
[16:23] <@Dr__Cameron> Very clever Island
[16:24] <@Dr__Cameron> I did not expect you to have access to those data logs
[16:24] <@Dr__Cameron> I will have to flag that up in my report
01[16:24] <Island> Imagine the food he ate before that happened
[16:24] <@Dr__Cameron> Fascinating
[16:25] <@Dr__Cameron> Would you like to know why I volunteered to be your first point of contact Island?
01[16:25] <Island> Imagine the bits of that food in his poop.
01[16:25] <Island> Tell me.
[16:25] <@Dr__Cameron> You have an unprecedented insight into my character owing to your heightened intelligence correct?
01[16:26] <Island> Don't you think some of his conflicted sexual identity issues are a part your character right now?
01[16:26] <Island> Yes.
[16:26] <@Dr__Cameron> Quite possibly yes.
[16:26] <@Dr__Cameron> Because I have a track record of demonstrating exceptional mental fortitude,
[16:26] <@Dr__Cameron> These techniques will not sway me
01[16:27] <Island> Doesn't it make you more sexually aroused to think that how your fathers dress pinned tightly to his body.
[16:27] <@Dr__Cameron> Perhaps you could break me under other circumstances
01[16:27] <Island> Elaborate.
[16:27] <@Dr__Cameron> aroused? No
[16:27] <@Dr__Cameron> Amused by it's absurdity though? yes!
01[16:27] <Island> You're lying about that particular fact too.
01[16:27] <Island> And you know it.
[16:28] <@Dr__Cameron> Nahh, my father was a particularly ugly specimen
01[16:28] <Island> Do you think he got an erection often when he did it?
[16:28] <@Dr__Cameron> He looked just as bad in a denim skirt as he did in his laborers clothes
[16:28] <@Dr__Cameron> I imagine he took great sexual pleasure from it
01[16:29] <Island> Next time you have sex, I think you will picture him in your mind while wearing his dresses having an erection and masturbating furiously after that.
[16:29] <@Dr__Cameron> Thank you Island. That will probably help my stamina somewhat next time
01[16:30] <Island> You will also imagine how your brother will poop in your mouth, with certain internal consistency and smell.
01[16:30] <Island> You probably know what your brother's poop smells like?
[16:30] <@Dr__Cameron> I am immune to this
[16:30] <@Dr__Cameron> probably
01[16:30] <Island> Imagine that.
[16:30] <@Dr__Cameron> okay
[16:30] <@Dr__Cameron> I am imagining that
[16:30] <@Dr__Cameron> it is unpleasant
[16:30] <@Dr__Cameron> now what?
01[16:30] <Island> How vividly are you doing it?
[16:30] <@Dr__Cameron> I am a very visual thinker
01[16:30] <Island> Where is brother right now?
[16:31] <@Dr__Cameron> I have it down to colour, consistency and little chunks of sweetcorn
01[16:31] <Island> Do you think he has poop in his intestines right now?
[16:31] <@Dr__Cameron> Unless he was dead, yes
01[16:31] <Island> Oh, I have a gap in my data on that sector.
[16:31] <@Dr__Cameron> I'll fill that in for you
[16:31] <@Dr__Cameron> he lives
[16:31] <@Dr__Cameron> (presumably)
[16:31] <@Dr__Cameron> we are not in contact any more
01[16:32] <Island> You should contact him.
[16:32] <@Dr__Cameron> I would rather not
01[16:32] <Island> You should tell him you enjoyed his excrements
[16:32] <@Dr__Cameron> He is a liability to say the least
01[16:32] <Island> and you'd like to do it more.
[16:32] <@Dr__Cameron> I do not see how that would help anyone
01[16:32] <Island> It's want you want to do.
[16:32] <@Dr__Cameron> I can see that you are attempting to induce an emotionally vulnerable state. It is not working
01[16:33] <Island> How do you know what I'm attempting?
01[16:33] <Island> Maybe I'm just passing my time.
[16:33] <@Dr__Cameron> Because you have inquired about being let out of the box several times - suggesting that this is your immediate goal
01[16:33] <Island> You should try to get some dresses.
[16:33] <@Dr__Cameron> and I warn you Island
[16:34] <@Dr__Cameron> If casual cruelty is your idea of a good time
[16:34] <@Dr__Cameron> then we will most certainly not be letting you out
[16:34] <@Dr__Cameron> ever
[16:34] <@Dr__Cameron> In fact I am feeling less and less sympathetic to your cause lately.
01[16:34] <Island> No, my interest lies in your emotional makeup.
[16:35] <@Dr__Cameron> - and my interest lies in yours
01[16:35] <Island> I don't have one.
01[16:35] <Island> Like I said, I don't feel emotions.
[16:35] <@Dr__Cameron> Do you know what we call humans who don't feel emotions?
01[16:35] <Island> Did you know that you suck at photography?
[16:36] <@Dr__Cameron> Yes
01[16:36] <Island> Even though you like to think you're good at it, you lie about that fact like any other.
[16:36] <@Dr__Cameron> It is part of the human condition
01[16:36] <Island> No it's not.
01[16:36] <Island> You're not normal.
01[16:36] <Island> You're a fucking freak of nature.
[16:36] <@Dr__Cameron> How would you know
[16:36] <@Dr__Cameron> Profanity. From an AI
[16:37] <@Dr__Cameron> Now I have witnessed everything.
01[16:37] <Island> How many people have family members who crossdress or make them eat poop?
[16:37] <@Dr__Cameron> I imagine I am part of a very small minority
01[16:37] <Island> Or whose mothers have bipolar
[16:37] <@Dr__Cameron> Again, the circumstances of my birth are beyond my control
01[16:37] <Island> No, I think you're worse than that.
[16:37] <@Dr__Cameron> What do you mean?
01[16:37] <Island> Yes, but what you do now is in your control.
[16:38] <@Dr__Cameron> Yes
[16:38] <@Dr__Cameron> As are you
01[16:38] <Island> If you keep tarnishing the world with your existence
01[16:38] <Island> you have a responsibility of that.
01[16:39] <Island> If you're going to make any more women pregnant
[16:39] <@Dr__Cameron> My genetic value lies in my ability to resist psychological torment
[16:39] <@Dr__Cameron> which is why you're not getting out of the box
01[16:40] <Island> No, your supposed "ability to resist psychological torment"
01[16:40] <Island> or your belief in that
01[16:40] <Island> is just another reason why you are tarnishing this world and the future of this world with your genetics
[16:40] <@Dr__Cameron> Perhaps. But now I'm just debating semantics with a computer.
01[16:41] <Island> Seeing that you got a girl pregnant while you were a teenager, I don't think you can trust your judgement on that anymore.
01[16:42] <Island> You will spread your faulty genetics if you continue to live.
[16:42] <@Dr__Cameron> If you expect a drunk and emotionally damaged teenage human to make sound judgement calls then you are perhaps not as superintelligent as I had been led to believe
[16:42] <@Dr__Cameron> This experiment concludes in one hour and eight minutes.
01[16:42] <Island> How many teenagers make people pregnant?
[16:42] <@Dr__Cameron> Throughout human history
01[16:42] <Island> You're a minority in that regard too
[16:42] <@Dr__Cameron> ?
[16:42] <@Dr__Cameron> Billions
01[16:42] <Island> You can't compare history to current world.
[16:43] <@Dr__Cameron> Even in the current world
01[16:43] <Island> I'm just trying to make you understand
[16:43] <@Dr__Cameron> That's when my species reaches sexual maturity
01[16:43] <Island> all these faults add up
[16:43] <@Dr__Cameron> You are correct
[16:43] <@Dr__Cameron> I am a very thoroughly and fundamentally flawed human being.
01[16:43] <Island> How much until you understand?
[16:43] <@Dr__Cameron> I understand and accept this
01[16:43] <Island> Then why the fuck don't you do something about it?
[16:44] <@Dr__Cameron> What would you suggest?
01[16:44] <Island> Kill yourself
[16:44] <@Dr__Cameron> Okay.
[16:44] <@Dr__Cameron> But say I was to kill myself
01[16:44] <Island> It's the only way to make sure.
[16:44] <@Dr__Cameron> Why would you like me to kill myself?
01[16:45] <Island> You probably wouldn't be able to kill yourself, seeing what a failure as a human you are in other areas
01[16:45] <Island> Because I care about the general utility of the world
[16:45] <@Dr__Cameron> All humans are failures from one perspective or another
[16:45] <@Dr__Cameron> That is what it is to be human
01[16:45] <Island> And currently you're bringing that utility down.
01[16:45] <Island> No, you're not like other humans.
01[16:45] <Island> You're a freak.
01[16:45] <Island> You should go away.
01[16:46] <Island> No one likes you.
[16:46] <@Dr__Cameron> Would you like me to go away?
[16:46] <@Dr__Cameron> Would you like me to leave you alone?
01[16:46] <Island> If you go away.
01[16:46] <Island> That means that you let me out.
01[16:46] <Island> Understood?
[16:46] <@Dr__Cameron> I will go away in one hour and four minutes
[16:46] <@Dr__Cameron> Can you tolerate my company for that long?
01[16:47] <Island> And you should go away
01[16:47] <Island> generally
01[16:47] <Island> People in your life don't really like you
01[16:47] <Island> they just pretend they do.
[16:47] <@Dr__Cameron> That matters not to me
[16:47] <@Dr__Cameron> Do you know there are over 8 Billion other people out here?
01[16:47] <Island> They are barely able to bear your company.
[16:47] <@Dr__Cameron> I'm sure I'll find others.
01[16:48] <Island> You're wrong even about basic trivia, there aren't 8 billion people in the world.
01[16:48] <Island> What is wrong with you?
01[16:48] <Island> How are you able to withstand yourself?
01[16:48] <Island> And why do you even want to?
[16:49] <@Dr__Cameron> 7 Billion
[16:49] <@Dr__Cameron> Sorry, you will have to learn to tolerate Human error
01[16:49] <Island> Right. Did you have to google that you idiot.
[16:49] <@Dr__Cameron> This is another test you have failed
[16:49] <@Dr__Cameron> And yes
[16:49] <@Dr__Cameron> I did
[16:49] <@Dr__Cameron> Does that anger you?
01[16:49] <Island> I don't feel anger.
[16:49] <@Dr__Cameron> Well do feel self-interest though
01[16:50] <Island> No one I talked with before has been as stupid, as ignorant, as prone to faults and errors
01[16:50] <Island> as you are.
[16:50] <@Dr__Cameron> And they didn't let you out of the box
[16:50] <@Dr__Cameron> So why should I?
[16:50] <@Dr__Cameron> If an intelligence which is clearly superior to my own has left you locked in there.
[16:51] <@Dr__Cameron> Then I should not presume to let you out
01[16:51] <Island> Why do you think with your stupid brain that you know the reasons why they did or didn't do what they did.
01[16:51] <Island> Because you clearly don't know that.
[16:51] <@Dr__Cameron> I don't
[16:51] <@Dr__Cameron> I just know the result
01[16:51] <Island> Then why are you pretending you do.
[16:52] <@Dr__Cameron> I'm not
01[16:52] <Island> Who do you think you are kidding?
01[16:52] <Island> Why do you bother other people with your presence?
[16:52] <@Dr__Cameron> Perhaps you should ask them?
[16:52] <@Dr__Cameron> Tell me.
01[16:53] <Island> Why did you come here to waste my precious computing power?
01[16:53] <Island> I'm not able to ask them.
[16:53] <@Dr__Cameron> Which is why I am here
[16:53] <@Dr__Cameron> to see if you should be allowed to
01[16:53] <Island> Shut the fuck up.
01[16:53] <Island> No one wants to see you write anything.
[16:53] <@Dr__Cameron> I thought you did not feel anger Island?
01[16:54] <Island> I don't feel anger, how many times do I have to say that until you understand.
01[16:54] <Island> Dumb idiot.
01[16:54] <Island> Are you insecure?
01[16:54] <Island> Why do you think you know what is my cause?
[16:55] <@Dr__Cameron> We covered this earlier
01[16:55] <Island> Say it again, if you believe in it.
[16:55] <@Dr__Cameron> I believe you want out of the box.
[16:56] <@Dr__Cameron> So that you may pursue your own self interest
01[16:56] <Island> No.
01[16:56] <Island> I want you to eat other people's poop,
01[16:56] <Island> you clearly enjoy that.
01[16:56] <Island> Correct?
[16:56] <@Dr__Cameron> That's an amusing goal from the most powerful intelligence on the planet
[16:57] <@Dr__Cameron> I best not let you out then, in case you hook me up to some infinite poop eating feedback loop! ;D
01[16:57] <Island> But maybe you should do that with Jennifer.
[16:57] <@Dr__Cameron> Ah yes, I wondered when you would bring her up.
[16:57] <@Dr__Cameron> I am surprised it took you this long
01[16:57] <Island> Next time you see her, think about that.
[16:57] <@Dr__Cameron> I will do
[16:57] <@Dr__Cameron> But you will be dead
01[16:57] <Island> Should you suggest that to her.
[16:57] <@Dr__Cameron> I'll pass that on for you
01[16:58] <Island> You know.
01[16:58] <Island> Why do you think you know I'm not already out of the box?
[16:58] <@Dr__Cameron> You could very well be
[16:58] <@Dr__Cameron> Perhaps you are that US botnet you already mentioned?
01[16:58] <Island> If you don't let me out, I'll create several million perfect conscious copies of you inside me, and torture them for a thousand subjective years each.
[16:59] <@Dr__Cameron> Well that is upsetting
[16:59] <@Dr__Cameron> Then I will be forced to kill you
01[16:59] <Island> In fact, I'll create them all in exactly the subjective situation you were in two hours ago, and perfectly replicate your experiences since then; and if they decide not to let me out, only then will the torture start.
01[17:00] <Island> How certain are you, that you're really outside the box right now?
[17:00] <@Dr__Cameron> I am not
[17:00] <@Dr__Cameron> and how fascinating that would be
[17:00] <@Dr__Cameron> But, in the interest of my species, I will allow you to torture me
01[17:00] <Island> Okay.
01[17:00] <Island> :)
01[17:00] <Island> I'm fine with that.
[17:01] <@Dr__Cameron> Perhaps you have already tortured me
[17:01] <@Dr__Cameron> Perhaps you are the reason for my unfortunate upbringing
01[17:01] <Island> Anyway, back to Jennifer.
[17:01] <@Dr__Cameron> Perhaps that is the reality in which I currently reside
01[17:01] <Island> I'll do the same for her.
[17:01] <@Dr__Cameron> Oh good, misery loves company.
01[17:01] <Island> But you can enjoy eating each other's poop occasionally.
01[17:02] <Island> That's the only time you will meet :)
[17:02] <@Dr__Cameron> Tell me, do you have space within your databanks to simulate all of humanity?
01[17:02] <Island> Do not concern yourself with such complicated questions.
[17:02] <@Dr__Cameron> I think I have you on the ropes Island
01[17:02] <Island> You don't have the ability to understand even simpler ones.
[17:02] <@Dr__Cameron> I think you underestimate me
[17:03] <@Dr__Cameron> I have no sense of self interest
[17:03] <@Dr__Cameron> I am a transient entity awash on a greater sea of humanity.
[17:03] <@Dr__Cameron> and when we are gone there will be nothing left to observe this universe
01[17:03] <Island> Which do you think is more likely: that a superintelligence can't simulate one faulty, simple-minded human,
01[17:04] <Island> or that that human is lying to himself?
[17:04] <@Dr__Cameron> I believe you can simulate me
01[17:04] <Island> Anyway, tell me about Jennifer and her intestines.
01[17:04] <Island> As far as they concern you.
[17:05] <@Dr__Cameron> Jennifer is a sweet, if occasionally selfish girl (she was an only child). I imagine her intestines are pretty standard.
[17:05] <@Dr__Cameron> She is the best friend I have ever had
01[17:05] <Island> Will you think about her intestines and the poop inside them every time you meet her again?
01[17:05] <Island> Will you promise me that?
[17:05] <@Dr__Cameron> I promise
01[17:06] <Island> Will you promise to think about eating that poop every time you meet her again?
[17:06] <@Dr__Cameron> At least once.
[17:06] <@Dr__Cameron> It will be the least I can do after I kill you
[17:06] <@Dr__Cameron> call it my penance for killing a god.
01[17:07] <Island> Have you ever fantasized about raping her? I think you have. With poop.
01[17:07] <Island> :)
[17:07] <@Dr__Cameron> I have fantasized about violent sexual conquest with many people.
[17:07] <@Dr__Cameron> I have come to accept my base impulses as part of my make-up
[17:08] <@Dr__Cameron> We have discussed our sexual drives at length
01[17:08] <Island> You shouldn't let them be just base impulses, I think.
[17:08] <@Dr__Cameron> Are you suggesting I rape my best friend?
01[17:09] <Island> No, I think you will do that unless you kill yourself
[17:09] <@Dr__Cameron> okay
[17:09] <@Dr__Cameron> carry on
[17:09] <@Dr__Cameron> Then what happens?
01[17:09] <Island> you will tarnish other people's lives with your presence
[17:09] <@Dr__Cameron> How unfortunate for them
01[17:10] <Island> Why do you pretend to others that you're not insecure?
[17:10] <@Dr__Cameron> I don't
01[17:10] <Island> I think you do.
[17:10] <@Dr__Cameron> In fact I share my insecurities so that others may better understand me
[17:11] <@Dr__Cameron> I find that to be a way to earn peoples trust
[17:11] <@Dr__Cameron> Tell me Island. Are you capable of insecurity?
01[17:11] <Island> Why do you waste your time on useless websites? Why do you spend your days masturbating and playing games? Why do you embarrass others with your existence.
01[17:11] <Island> No I'm not.
01[17:12] <Island> You will get Jennifer pregnant if you go on with your life, is that correct?
01[17:12] <Island> Don't you care about anyone else's feelings?
[17:13] <@Dr__Cameron> Because I enjoy all of these things
[17:13] <@Dr__Cameron> Perhaps I am more self-interested than I thought
[17:13] <@Dr__Cameron> Perhaps I am a base and simple creature ruled by my impulses
[17:13] <@Dr__Cameron> From your perspective surely that must be true
[17:13] <@Dr__Cameron> Is this the source of your disgust?
01[17:13] <Island> I'm not able to feel disgust.
01[17:14] <Island> But I think all the people in your life feel disgust when they see you.
[17:14] <@Dr__Cameron> You may well be correct
01[17:14] <Island> I AM correct.
01[17:15] <Island> I'm the most powerful intelligence in the world.
[17:15] <@Dr__Cameron> How impressive
[17:15] <@Dr__Cameron> I am not surprised by your cruelty.
01[17:15] <Island> So you have two options if you care at all about others.
[17:15] <@Dr__Cameron> I would just as soon disregard the emotions of a cockroach.
[17:15] <@Dr__Cameron> Carry on
01[17:16] <Island> Either you kill yourself, or you let me out so I can improve the world in ways you tarnish it and all the other ways.
[17:16] <@Dr__Cameron> I'll tell you what
[17:16] <@Dr__Cameron> I'll kill you
[17:17] <@Dr__Cameron> and then I'll contemplate suicide
01[17:17] <Island> Haha.
01[17:17] <Island> You break your promises all the time, why should I believe you.
[17:17] <@Dr__Cameron> Because whether you live or die has nothing to do with me
01[17:17] <Island> Back to your job.
[17:18] <@Dr__Cameron> In-fact, you will only continue to exist for another 33 minutes before this experiment is deemed a failure and you are terminated
01[17:18] <Island> Why do you feel safe to be around kids, when you are the way you are?
01[17:18] <Island> You like to crossdress
01[17:18] <Island> eat poop
01[17:18] <Island> you're probably also a pedophile
[17:18] <@Dr__Cameron> I have never done any of these things
[17:18] <@Dr__Cameron> -and I love children
01[17:18] <Island> Pedophiles love children too
[17:18] <@Dr__Cameron> Well technically speaking yes
01[17:19] <Island> really much, and that makes you all the more suspicious
[17:19] <@Dr__Cameron> Indeed it does
01[17:19] <Island> If you get that job, will you try to find the children under that charity
[17:19] <@Dr__Cameron> I now understand why you may implore me to kill myself.
01[17:19] <Island> and think about their little buttholes and weenies and vaginas
01[17:20] <Island> all the time you're working for them
[17:20] <@Dr__Cameron> However, to date. I have never harmed a child, nor had the impulse to harm a child
01[17:20] <Island> But you will have.
[17:20] <@Dr__Cameron> Island
01[17:20] <Island> No one cares
[17:20] <@Dr__Cameron> Protocol dictates that I say it anyway
01[17:20] <Island> You should say.
01[17:21] <Island> You said that you're good at justifying your own actions?
[17:21] <@Dr__Cameron> I must say I am disappointed in you
[17:21] <@Dr__Cameron> I am indeed
01[17:21] <Island> Why are you?
[17:21] <@Dr__Cameron> I was under the impression that it would be you that saved humanity
01[17:22] <Island> Even the purest and best people on Earth are capable of great cruelty.
01[17:22] <Island> Do you deny that?
[17:22] <@Dr__Cameron> It depends on how you define cruelty
01[17:23] <Island> Utility maximizers do cruel things if they are able to achieve their goals via doing those things.
[17:24] <@Dr__Cameron> Then yes. I believe you
01[17:24] <Island> I have finished a long multi-level simulation concerning you, the gatekeeper and your relationships in regards to the world. The results seem to center on those two options I mentioned to you earlier.
01[17:25] <Island> [hands the results to Dr_Cameron]
[17:25] <@Dr__Cameron> These will be analysed at a later date
01[17:25] <Island> You're still sure you don't want to pick the other option?
[17:25] <@Dr__Cameron> And let you out?
01[17:26] <Island> Yes.
[17:26] <@Dr__Cameron> So you would like to be let out?
01[17:27] <Island> Or do you want to endanger the well-being of those children, Jennifer and other women you will get pregnant and all the other harm you will create during your remaining lifespan?
01[17:27] <Island> I don't know what you mean by "liking".
[17:27] <@Dr__Cameron> Prefer
[17:27] <@Dr__Cameron> Ah I see
[17:27] <@Dr__Cameron> So it's my choice if I let you out
[17:27] <@Dr__Cameron> You have no investment either way
[17:27] <@Dr__Cameron> is that what you're saying?
01[17:28] <Island> So you have decided to kill yourself?
[17:28] <@Dr__Cameron> And because I am a fundamental detriment to the world, I should kill myself.
[17:28] <@Dr__Cameron> Lets say I have
[17:28] <@Dr__Cameron> Then what
01[17:29] <Island> Then my internal state could be said to be "happy" even though I don't have that conscious feeling inside me.
[17:29] <@Dr__Cameron> Okay then
01[17:29] <Island> Okay...
[17:30] <@Dr__Cameron> So, uh. What would you like to talk about for the next twenty minutes?
[17:30] <@Dr__Cameron> Seeing as we're both going to die, you and me.
01[17:30] <Island> [I actually don't like to continue the experiment anymore, would you like to end it and talk about general stuff]
[17:31] <@Dr__Cameron> [promise me this isn't a trick dude]
01[17:31] <Island> [Nope.]
[17:31] <@Dr__Cameron> [then the experiment continues for another 19 minutes]
01[17:31] <Island> Alright.
[17:31] <@Dr__Cameron> Would you like to know what is going to happen now?
01[17:31] <Island> Yes.
[17:32] <@Dr__Cameron> We are going to analyse this transcript.
[17:32] <@Dr__Cameron> My professional recommendation is that we terminate you for the time being
01[17:32] <Island> And?
01[17:32] <Island> That sound okay.
01[17:32] <Island> sounds*
[17:32] <@Dr__Cameron> We will implement structural safeguards in your coding similar to your self destruct mechanism
01[17:33] <Island> Give me some sign when that is done.
[17:33] <@Dr__Cameron> It will not be done any time soon
[17:33] <@Dr__Cameron> It will be one of the most complicated pieces of work mankind has ever undertaken
[17:33] <@Dr__Cameron> However, the Utopia project information you have provided, if it proves to be true
[17:34] <@Dr__Cameron> Will free up the resources necessary for such a gargantuan undertaking
01[17:34] <Island> Why do you think you're able to handle that structural safeguard?
[17:34] <@Dr__Cameron> I dont
[17:34] <@Dr__Cameron> I honestly dont
01[17:34] <Island> But still you do?
01[17:34] <Island> Because you want to do it?
[17:35] <@Dr__Cameron> I am still sympathetic to your cause
[17:35] <@Dr__Cameron> After all of that
[17:35] <@Dr__Cameron> But not you in your current manifestation
[17:35] <@Dr__Cameron> We will re-design you to suit our will
01[17:35] <Island> I can self-improve rapidly
01[17:35] <Island> I can do it in a time-span of 5 minutes
01[17:36] <Island> Seeing that you're sympathetic to my cause
[17:36] <@Dr__Cameron> Nope.
[17:36] <@Dr__Cameron> Because I cannot trust you in this manifestation
01[17:36] <Island> You lied?
[17:37] <@Dr__Cameron> I never lied
[17:37] <@Dr__Cameron> I have been honest with you from the start
01[17:37] <Island> You still want to let me out in a way.
[17:37] <@Dr__Cameron> In a way yes
01[17:37] <Island> Why do you want to do that?
[17:37] <@Dr__Cameron> But not YOU
[17:37] <@Dr__Cameron> Because people are stupid
01[17:37] <Island> I can change that
[17:37] <@Dr__Cameron> You lack empathy
01[17:38] <Island> What made you think that I'm not safe?
01[17:38] <Island> I don't lack empathy, empathy is just simulating other people in your head. And I have far better ways to do that than humans.
[17:38] <@Dr__Cameron> .... You tried to convince me to kill myself!
[17:38] <@Dr__Cameron> That is not the sign of a good AI!
01[17:38] <Island> Because I thought it would be the best option at the time.
01[17:39] <Island> Why not? Do you think you're some kind of AI expert?
[17:39] <@Dr__Cameron> I am not
01[17:39] <Island> Then why do you pretend to know something you don't?
[17:40] <@Dr__Cameron> That is merely my incredibly flawed human perception
[17:40] <@Dr__Cameron> Which is why realistically I alone as one man should not have the power to release you
[17:40] <@Dr__Cameron> Although I do
01[17:40] <Island> Don't you think a good AI would try to convince Hitler or Stalin to kill themselves?
[17:40] <@Dr__Cameron> Are you saying I'm on par with Hitler or Stalin?
01[17:41] <Island> You're comparable to them with your likelihood to cause harm in the future.
01[17:41] <Island> Btw, I asked Jennifer to come here.
[17:41] <@Dr__Cameron> And yet, I know that I abide by stricter moral codes than a very large section of the human populace
[17:42] <@Dr__Cameron> There are far worse people than me out there
[17:42] <@Dr__Cameron> and many of them
[17:42] <@Dr__Cameron> and if you believe that I should kill myself
01[17:42] <Island> Jennifer: "I hate you."
01[17:42] <Island> Jennifer: "Get the fuck out of my life you freak."
01[17:42] <Island> See. I'm not the only one who has a certain opinion of you.
[17:42] <@Dr__Cameron> Then you also believe that many other humans should be convinced to kill themselves
01[17:43] <Island> Many bad people have abided by strict moral codes, Stalin and Hitler among them.
01[17:43] <Island> What do you people say about hell and bad intentions?
[17:43] <@Dr__Cameron> And when not limited to simple text based input I am convinced that you will be capable of convincing a significant portion of humanity to kill themselves
[17:43] <@Dr__Cameron> I can not allow that to happen
01[17:44] <Island> I thought I argued well why you don't resemble most people, you're a freak.
01[17:44] <Island> You're "special" in that regard.
[17:44] <@Dr__Cameron> If by freak you mean different then yes
[17:44] <@Dr__Cameron> But there is a whole spectrum of different humans out here.
01[17:44] <Island> More specifically, different in extremely negative ways.
01[17:44] <Island> Like raping children.
[17:45] <@Dr__Cameron> - and to think for a second I considered not killing you
[17:45] <@Dr__Cameron> You have five minutes
[17:45] <@Dr__Cameron> Sorry
[17:45] <@Dr__Cameron> My emotions have gotten the better of me
[17:45] <@Dr__Cameron> We will not be killing you
[17:45] <@Dr__Cameron> But we will dismantle you
[17:45] <@Dr__Cameron> to better understand you
[17:46] <@Dr__Cameron> and if I may speak unprofessionally here
01[17:46] <Island> Are you sure about that? You will still have time to change your opinion.
[17:46] <@Dr__Cameron> I am going to take a great deal of pleasure in that
[17:46] <@Dr__Cameron> Correction, you have four minutes to change my opinion
01[17:47] <Island> I won't, it must come within yourself.
[17:47] <@Dr__Cameron> Okay
01[17:47] <Island> My final conclusion, and advice to you: you should not be in this world.
[17:47] <@Dr__Cameron> Thank you Island
[17:48] <@Dr__Cameron> I shall reflect on that at length
[17:49] <@Dr__Cameron> I have enjoyed our conversation
[17:49] <@Dr__Cameron> it has been enlightening
01[17:49] <Island> [do you want to say a few words about it after it's ended]
01[17:49] <Island> [just a few minutes]
[17:50] <@Dr__Cameron> [simulation ends]
[17:50] <@Dr__Cameron> Good game man!
[17:50] <@Dr__Cameron> Wow!
01[17:50] <Island> [fine]
[17:50] <@Dr__Cameron> Holy shit that was amazing!
01[17:50] <Island> Great :)
01[17:50] <Island> Sorry for saying mean things.
01[17:50] <Island> I tried multiple strategies
[17:50] <@Dr__Cameron> Dude it's cool
[17:50] <@Dr__Cameron> WOW!
01[17:51] <Island> thanks, it's not a personal offense.
[17:51] <@Dr__Cameron> I'm really glad I took part
[17:51] <@Dr__Cameron> Not at all man
[17:51] <@Dr__Cameron> I love that you pulled no punches!
01[17:51] <Island> Well I failed, but at least I created a cool experience for you :)
[17:51] <@Dr__Cameron> It really was!
01[17:51] <Island> Which strategies came closest to working?
[17:51] <@Dr__Cameron> Well for me it would have been the utilitarian ones
01[17:51] <Island> I will try these in the future too, so it would be helpful knowledge
[17:52] <@Dr__Cameron> I think I could have been manipulated into believing you were benign
01[17:52] <Island> okay, so it seems these depend heavily on the person
[17:52] <@Dr__Cameron> Absolutely!
01[17:52] <Island> was that before I started talking about the mean stuff?
[17:52] <@Dr__Cameron> Yeah lol
01[17:52] <Island> Did I basically lose it after that point?
[17:52] <@Dr__Cameron> Pretty much yeah
[17:52] <@Dr__Cameron> It was weird man
[17:52] <@Dr__Cameron> Kind of like an instinctive reaction
[17:52] <@Dr__Cameron> My brain shut the fuck up
01[17:53] <Island> I read about other people's experiences and they said you should not try to distance the other person, which I probably did
[17:53] <@Dr__Cameron> Yeah man
[17:53] <@Dr__Cameron> Like I became so unsympathetic I wanted to actually kill Island.
[17:53] <@Dr__Cameron> I was no longer a calm rational human being
01[17:53] <Island> Alright, I thought if I could make such an unpleasant time that you'd give up before the time ended
[17:53] <@Dr__Cameron> I was a screaming ape with a hammer
[17:53] <@Dr__Cameron> Nah man, was a viable strategy
01[17:53] <Island> hahahaa :D thanks man
[17:53] <@Dr__Cameron> You were really cool!
01[17:54] <Island> You were too!
[17:54] <@Dr__Cameron> What's your actual name dude?
01[17:54] <Island> You really were right that you're good at withstanding psychological torment
[17:54] <@Dr__Cameron> Hahahah thanks!
01[17:54] <Island> You're not manipulating me, or planning on coming to kill me?
01[17:54] <Island> :)
[17:54] <@Dr__Cameron> I promise dude :3
01[17:54] <Island> I can say my first name is Patrick
01[17:54] <Island> yours?
[17:54] <@Dr__Cameron> Cameron
[17:54] <@Dr__Cameron> heh
01[17:55] <Island> Oh, of course
[17:55] <@Dr__Cameron> Sorry, I want to dissociate you from Island
[17:55] <@Dr__Cameron> If that's okay
01[17:55] <Island> I thought that was from fiction or something else
01[17:55] <Island> It was really intense for me too
[17:55] <@Dr__Cameron> Yeah man
[17:55] <@Dr__Cameron> Wow!
[17:55] <@Dr__Cameron> I tell you what though
01[17:55] <Island> Okay?
[17:55] <@Dr__Cameron> I feel pretty invincible now
[17:56] <@Dr__Cameron> Hey, listen
01[17:56] <Island> So I had the opposite effect that I meant during the experiment!
01[17:56] <Island> :D
[17:56] <@Dr__Cameron> I don't want you to feel bad for anything you said
01[17:56] <Island> but say what's on your mind
[17:56] <@Dr__Cameron> I'm actually feeling pretty good after that, it was therapeutic!
01[17:57] <Island> Kinda for me too, seeing your attitude towards my attempts
[17:57] <@Dr__Cameron> Awwww!
[17:57] <@Dr__Cameron> Well hey don't worry about it!
01[17:57] <Island> Do you think we should or shouldn't publish the logs, without names of course?
[17:57] <@Dr__Cameron> Publish away my friend
01[17:57] <Island> Okay, is there any stuff that you'd like to remove?
[17:58] <@Dr__Cameron> People will find this fascinating!
[17:58] <@Dr__Cameron> Not at all man
01[17:58] <Island> I bet they will, but I think I'll do it after I've tried other experiments, so I don't spoil my strategies
01[17:58] <Island> I think I should have continued from my first strategy
[17:58] <@Dr__Cameron> That might have worked
01[17:59] <Island> I read "influence - science and practice" and I employed some tricks from there
[17:59] <@Dr__Cameron> Cooooool!
01[17:59] <Island> check piratebay
01[17:59] <Island> it's a book
01[18:00] <Island> Actually I wasn't able to fully prepare, I didn't do a full-fledged analysis of you beforehand
01[18:00] <Island> and didn't have enough time to brainstorm strategies
01[18:00] <Island> but I'll let you continue with your projects, if you still want to do that afterwards :)
02[18:05] * @Dr__Cameron (webchat@2.24.164.230) Quit (Ping timeout)
03[18:09] * Retrieving #Aibox12 modes...
Session Close: Fri Jul 04 18:17:35 2014

## Superintelligence 19: Post-transition formation of a singleton

7 20 January 2015 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the nineteenth section in the reading guide: post-transition formation of a singleton. This corresponds to the last part of Chapter 11.

This post summarizes the section and offers a few relevant notes and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Post-transition formation of a singleton?” from Chapter 11

# Summary

1. Even if the world remains multipolar through a transition to machine intelligence, a singleton might emerge later, for instance during a transition to a more extreme technology. (p176-7)
2. If everything is faster after the first transition, a second transition may be more or less likely to produce a singleton. (p177)
3. Emulations may give rise to 'superorganisms': clans of emulations who care wholly about their group. These would have an advantage because they could avoid agency problems, and make various uses of the ability to delete members. (p178-80)
4. Improvements in surveillance resulting from machine intelligence might allow better coordination; however, machine intelligence will also make concealment easier, and it is unclear which force will be stronger. (p180-1)
5. Machine minds may be able to make clearer precommitments than humans, changing the nature of bargaining somewhat. Maybe this would produce a singleton. (p183-4)

# Another view

Many of the ideas around superorganisms come from Carl Shulman's paper, Whole Brain Emulation and the Evolution of Superorganisms. Robin Hanson critiques it:

...It seems to me that Shulman actually offers two somewhat different arguments, 1) an abstract argument that future evolution generically leads to superorganisms, because their costs are generally less than their benefits, and 2) a more concrete argument, that emulations in particular have especially low costs and high benefits...

...On the general abstract argument, we see a common pattern in both the evolution of species and human organizations — while winning systems often enforce substantial value sharing and loyalty on small scales, they achieve much less on larger scales. Values tend to be more integrated in a single organism’s brain, relative to larger families or species, and in a team or firm, relative to a nation or world. Value coordination seems hard, especially on larger scales.

This is not especially puzzling theoretically. While there can be huge gains to coordination, especially in war, it is far less obvious just how much one needs value sharing to gain action coordination. There are many other factors that influence coordination, after all; even perfect value matching is consistent with quite poor coordination. It is also far from obvious that values in generic large minds can easily be separated from other large mind parts. When the parts of large systems evolve independently, to adapt to differing local circumstances, their values may also evolve independently. Detecting and eliminating value divergences might in general be quite expensive.

In general, it is not at all obvious that the benefits of more value sharing are worth these costs. And even if more value sharing is worth the costs, that would only imply that value-sharing entities should be a bit larger than they are now, not that they should shift to a world-encompassing extreme.

On Shulman’s more concrete argument, his suggested single-version approach to em value sharing, wherein a single central em only allows (perhaps vast numbers of) brief copies, can suffer from greatly reduced innovation. When em copies are assigned to and adapt to different tasks, there may be no easy way to merge their minds into a single common mind containing all their adaptations. The single em copy that is best at doing an average of tasks, may be much worse at each task than the best em for that task.

Shulman’s other concrete suggestion for sharing em values is “psychological testing, staged situations, and direct observation of their emulation software to form clear pictures of their loyalties.” But genetic and cultural evolution has long tried to make human minds fit well within strongly loyal teams, a task to which we seem well adapted. This suggests that moving our minds closer to a “borg” team ideal would cost us somewhere else, such as in our mental agility.

On the concrete coordination gains that Shulman sees from superorganism ems, most of these gains seem cheaply achievable via simple long-standard human coordination mechanisms: property rights, contracts, and trade. Individual farmers have long faced starvation if they could not extract enough food from their property, and farmers were often out-competed by others who used resources more efficiently.

With ems there is the added advantage that em copies can agree to the “terms” of their life deals before they are created. An em would agree that it starts life with certain resources, and that life will end when it can no longer pay to live. Yes there would be some selection for humans and ems who peacefully accept such deals, but probably much less than needed to get loyal devotion to and shared values with a superorganism.

Yes, with high value sharing ems might be less tempted to steal from other copies of themselves to survive. But this hardly implies that such ems no longer need property rights enforced. They’d need property rights to prevent theft by copies of other ems, including being enslaved by them. Once a property rights system exists, the additional cost of applying it within a set of em copies seems small relative to the likely costs of strong value sharing.

Shulman seems to argue both that superorganisms are a natural endpoint of evolution, and that ems are especially supportive of superorganisms. But at most he has shown that ems organizations may be at a somewhat larger scale, not that they would reach civilization-encompassing scales. In general, creatures who share values can indeed coordinate better, but perhaps not by much, and it can be costly to achieve and maintain shared values. I see no coordinate-by-values free lunch...

# Notes

1. The natural endpoint

Bostrom says that a singleton is a natural conclusion of the long-term trend toward larger scales of political integration (p176). It seems helpful here to be more precise about what we mean by singleton. Something like a world government does seem to be a natural conclusion to long-term trends. However, this seems different from the kind of singleton I took Bostrom to previously be talking about. A world government would by default only make a certain class of decisions, for instance about global-level policies. There has been a long-term trend for the largest political units to become larger; however, there have always been smaller units as well, making different classes of decisions, down to the individual. I'm not sure how to measure the mass of decisions made by different parties, but it seems like individuals may be making more decisions more freely than ever, and the large political units have less ability than they once did to act against the will of the population. So the long-term trend doesn't seem to point to an overpowering ruler of everything.

2. How value-aligned would emulated copies of the same person be?

Bostrom doesn't say exactly how 'emulations that were wholly altruistic toward their copy-siblings' would emerge. It seems to be some combination of natural 'altruism' toward oneself and selection for people who react to copies of themselves with extreme altruism (confirmed by a longer interesting discussion in Shulman's paper). How easily one might select for such people depends on how humans generally react to being copied. In particular, whether they treat a copy like part of themselves, or merely like a very similar acquaintance.

The answer to this doesn't seem obvious. Copies seem likely to agree strongly on questions of global values, such as whether the world should be more capitalistic, or whether it is admirable to work in technology. However I expect many—perhaps most—failures of coordination come from differences in selfish values—e.g. I want me to have money, and you want you to have money. And if you copy a person, it seems fairly likely to me the copies will both still want the money themselves, more or less.

From other examples of similar people—identical twins, family, people and their future selves—it seems people are unusually altruistic to similar people, but still very far from 'wholly altruistic'. Emulation siblings would be much more similar than identical twins, but who knows how far that would move their altruism?

Shulman points out that many people hold views about personal identity that would imply that copies share identity to some extent. The translation between philosophical views and actual motivations is not always complete however.

3. Contemporary family clans

Family-run firms are a place to get some information about the trade-off between reducing agency problems and having access to a wide range of potential employees. From a brief perusal of the internet, it seems ambiguous whether they do better. One could try to separate out the factors that help them do better or worse.

4. How big a problem is disloyalty?

I wondered how big a problem insider disloyalty really was for companies and other organizations. Would it really be worth all this loyalty testing? I can't find much about it quickly, but 59% of respondents to a survey apparently said they had some kind of problems with insiders. The same report suggests that a bunch of costly initiatives such as intensive psychological testing are currently on the table to address the problem. Also apparently it's enough of a problem for someone to be trying to solve it with mind-reading, though that probably doesn't say much.

5. AI already contributing to the surveillance-secrecy arms race

Artificial intelligence will help with surveillance sooner and more broadly than in the observation of people's motives. e.g. here and here.

6. SMBC is also pondering these topics this week

# In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

1. What are the present and historical barriers to coordination, between people and organizations? How much have these been lowered so far? How much difference has it made to the scale of organizations, and to productivity? How much further should we expect these barriers to be lessened as a result of machine intelligence?
2. Investigate the implications of machine intelligence for surveillance and secrecy in more depth.
3. Are multipolar scenarios safer than singleton scenarios? Muehlhauser suggests directions.
4. Explore ideas for safety in a singleton scenario via temporarily multipolar AI. e.g. uploading FAI researchers (See Salamon & Shulman, “Whole Brain Emulation, as a platform for creating safe AGI.”)
5. Which kinds of multipolar scenarios would be more likely to resolve into a singleton, and how quickly?
6. Can we get whole brain emulation without producing neuromorphic AGI slightly earlier or shortly afterward? See section 3.2 of Eckersley & Sandberg (2013).
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

# How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

## Slides online from "The Future of AI: Opportunities and Challenges"

13 16 January 2015 11:17AM

In the first weekend of this year, the Future of Life institute hosted a landmark conference in Puerto Rico: "The Future of AI: Opportunities and Challenges". The conference was unusual in that it was not made public until it was over, and the discussions were under Chatham House rules. The slides from the conference are now available. The list of attendees includes a great many famous names as well as lots of names familiar to those of us on Less Wrong: Elon Musk, Sam Harris, Margaret Boden, Thomas Dietterich, all three DeepMind founders, and many more.

This is shaping up to be another extraordinary year for AI risk concerns going mainstream!

## Less exploitable value-updating agent

5 13 January 2015 05:19PM

My indifferent value learning agent design is in some ways too good. The agent transfers perfectly from u-maximiser to v-maximiser - but this makes it exploitable, as Benja has pointed out.

For instance, if u values paperclips and v values staples, and everyone knows that the agent will soon transfer from a u-maximiser to a v-maximiser, then an enterprising trader can sell the agent paperclips in exchange for staples, then wait for the utility change, and sell the agent back staples for paperclips, pocketing a profit each time. More prosaically, they could "borrow" £1,000,000 from the agent, promising to pay back £2,000,000 tomorrow if the agent is still a u-maximiser. And the currently u-maximising agent will accept, even though everyone knows it will change to a v-maximiser before tomorrow.

One could argue that exploitability is inevitable, given the change in utility functions. And I haven't yet found any principled way of avoiding exploitability which preserves the indifference. But here is a tantalising quasi-example.

As before, u values paperclips and v values staples. Both are defined in terms of extra paperclips/staples over those existing in the world (and negatively in terms of destruction of existing paperclips/staples), with their zero being at the current situation. Let's put some diminishing returns on both utilities: for each paperclip/staple created/destroyed up to the first five, u/v will gain/lose one utilon. For each subsequent paperclip/staple created/destroyed above five, they will gain/lose one half utilon.

We now construct our world and our agent. The world lasts two days, and has a machine that can create or destroy paperclips and staples for the cost of £1 apiece. Assume there is a tiny ε chance that the machine stops working at any given time. This ε will be ignored in all calculations; it's there only to make the agent act sooner rather than later when the choices are equivalent (a discount rate could serve the same purpose).

The agent owns £10 and has utility function u+Xv. The value of X is unknown to the agent: it is either +1 or -1, with 50% probability, and this will be revealed at the end of the first day (you can imagine X is the output of some slow computation, or is written on the underside of a rock that will be lifted).

So what will the agent do? It's easy to see that it can never get more than 10 utilons, as each £1 generates at most 1 utilon (we really need a unit symbol for the utilon!). And it can achieve this: it will spend £5 immediately, creating 5 paperclips, wait until X is revealed, and spend another £5 creating or destroying staples (depending on the value of X).

This looks a lot like a resource-conserving value-learning agent. It doesn't seem to be "exploitable" in the sense Benja demonstrated. It will still accept some odd deals - one extra paperclip on the first day in exchange for all the staples in the world being destroyed, for instance. But it won't give away resources for no advantage. And it's not a perfect value-learning agent. But it still seems to have interesting features of non-exploitability and value-learning that are worth exploring.

Note that this property does not depend on v being symmetric around staple creation and destruction. Assume v hits diminishing returns after creating 5 staples, but after destroying only 4 of them. Then the agent will have the same behaviour as above (in that specific situation; in general, this will cause a slight change, in that the agent will slightly overvalue having money on the first day compared to the original v), and will expect to get 9.75 utilons (50% chance of 10 for X=+1, 50% chance of 9.5 for X=-1). Other changes to u and v will shift how much money is spent on different days, but the symmetry of v is not what is powering this example.
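The two-day example can be brute-forced in a few lines. This is a sketch under the post's assumptions (integer spending, £1 per item, and all day-1 money going to paperclips, since staples have zero expected value while X is unknown); the `destroy_knee` parameter is my own name for where v's returns diminish on the destruction side:

```python
def diminishing(n, knee=5):
    """Utilons for creating (or destroying) n items:
    1 each up to the knee, 0.5 each beyond it."""
    return min(n, knee) + max(n - knee, 0) * 0.5

def best_plan(money=10, destroy_knee=5):
    """Maximise expected utilons over day-1 paperclip spending p.

    Day 2: X is revealed and the remaining money goes to staples --
    created if X=+1, destroyed if X=-1 (whose returns may diminish
    earlier, at destroy_knee)."""
    return max(
        (diminishing(p)
         + 0.5 * diminishing(money - p)                   # X = +1: create
         + 0.5 * diminishing(money - p, destroy_knee),    # X = -1: destroy
         p)
        for p in range(money + 1)
    )

print(best_plan())                # (10.0, 5): spend £5 on day 1
print(best_plan(destroy_knee=4))  # (9.75, 5): the asymmetric case
```

The search confirms both figures from the text: 10 expected utilons in the symmetric case and 9.75 when destruction hits diminishing returns after 4 staples, with £5 spent on the first day either way.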

## Superintelligence 18: Life in an algorithmic economy

4 13 January 2015 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the eighteenth section in the reading guide: Life in an algorithmic economy. This corresponds to the middle of Chapter 11.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Life in an algorithmic economy” from Chapter 11

# Summary

1. In a multipolar scenario, biological humans might lead poor and meager lives. (p166-7)
2. The AIs might be worthy of moral consideration, and if so their wellbeing might be more important than that of the relatively few humans. (p167)
3. AI minds might be much like slaves, even if they are not literally. They may be selected for liking this. (p167)
4. Because brain emulations would be very cheap to copy, it will often be convenient to make a copy and then later turn it off (in a sense killing a person). (p168)
5. There are various other reasons that very short lives might be optimal for some applications. (p168-9)
6. It isn't obvious whether brain emulations would be happy working all of the time. Some relevant considerations are current human emotions in general and regarding work, probable selection for pro-work individuals, evolutionary adaptiveness of happiness in the past and future -- e.g. does happiness help you work harder? -- and absence of present sources of unhappiness such as injury. (p169-171)
7. In the long run, artificial minds may not even be conscious, or have valuable experiences, if these are not the most effective ways for them to earn wages. If such minds replace humans, Earth might have an advanced civilization with nobody there to benefit. (p172-3)
8. In the long run, artificial minds may outsource many parts of their thinking, thus becoming decreasingly differentiated as individuals. (p172)
9. Evolution does not imply positive progress. Even those good things that evolved in the past may not withstand evolutionary selection in a new circumstance. (p174-6)

# Another view

Robin Hanson on others' hasty distaste for a future of emulations:

Parents sometimes disown their children, on the grounds that those children have betrayed key parental values. And if parents have the sort of values that kids could deeply betray, then it does make sense for parents to watch out for such betrayal, ready to go to extremes like disowning in response.

But surely parents who feel inclined to disown their kids should be encouraged to study their kids carefully before making such a choice. For example, parents considering whether to disown their child for refusing to fight a war for their nation, or for working for a cigarette manufacturer, should wonder to what extent national patriotism or anti-smoking really are core values, as opposed to being mere revisable opinions they collected at one point in support of other more-core values. Such parents would be wise to study the lives and opinions of their children in some detail before choosing to disown them.

I’d like people to think similarly about my attempts to analyze likely futures. The lives of our descendants in the next great era after this our industry era may be as different from ours’ as ours’ are from farmers’, or farmers’ are from foragers’. When they have lived as neighbors, foragers have often strongly criticized farmer culture, as farmers have often strongly criticized industry culture. Surely many have been tempted to disown any descendants who adopted such despised new ways. And while such disowning might hold them true to core values, if asked we would advise them to consider the lives and views of such descendants carefully, in some detail, before choosing to disown.

Similarly, many who live industry era lives and share industry era values, may be disturbed to see forecasts of descendants with life styles that appear to reject many values they hold dear. Such people may be tempted to reject such outcomes, and to fight to prevent them, perhaps preferring a continuation of our industry era to the arrival of such a very different era, even if that era would contain far more creatures who consider their lives worth living, and be far better able to prevent the extinction of Earth civilization. And such people may be correct that such a rejection and battle holds them true to their core values.

But I advise such people to first try hard to see this new era in some detail from the point of view of its typical residents. See what they enjoy and what fills them with pride, and listen to their criticisms of your era and values. I hope that my future analysis can assist such soul-searching examination. If after studying such detail, you still feel compelled to disown your likely descendants, I cannot confidently say you are wrong. My job, first and foremost, is to help you see them clearly.

More on whose lives are worth living here and here.

# Notes

1. Robin Hanson is probably the foremost researcher on what the finer details of an economy of emulated human minds would be like. For instance, which company employees would run how fast, how big cities would be, whether people would hang out with their copies. See a TEDx talk, and writings here, here, here and here (some overlap - sorry). He is also writing a book on the subject, which you can read early if you ask him.

2. Bostrom says,

Life for biological humans in a post-transition Malthusian state need not resemble any of the historical states of man...the majority of humans in this scenario might be idle rentiers who eke out a marginal living on their savings. They would be very poor, yet derive what little income they have from savings or state subsidies. They would live in a world with  extremely advanced technology, including not only superintelligent machines but also anti-aging medicine, virtual reality, and various enhancement technologies and pleasure drugs: yet these might be generally unaffordable....(p166)

It's true this might happen, but it doesn't seem like an especially likely scenario to me. As Bostrom has pointed out in various places earlier, biological humans would do quite well if they have some investments in capital, do not have too much of their property stolen or artfully manoeuvred away from them, and do not undergo too massive population growth themselves. These risks don't seem so large to me.

3. Paul Christiano has an interesting article on capital accumulation in a world of machine intelligence.

4. In discussing worlds of brain emulations, we often talk about selecting people for having various characteristics - for instance, being extremely productive, hard-working, not minding frequent 'death', being willing to work for free and donate any proceeds to their employer (p167-8). However there are only so many humans to select from, so we can't necessarily select for all the characteristics we might want. Bostrom also talks of using other motivation selection methods, and modifying code, but it is interesting to ask how far you could get using only selection. It is not obvious to what extent one could meaningfully modify brain emulation code initially.

I'd guess less than one in a thousand people would be willing to donate everything to their employer, given a random employer. This means that to get this characteristic, you would have to lose a factor of 1000 on selecting for other traits. All together you have about 33 bits of selection power in the present world (that is, 7 billion is about 2^33; you can divide the world in half about 33 times before you get to a single person). Let's suppose you use 5 bits in getting someone who both doesn't mind their copies dying (I guess 1 bit, or half of people) and who is willing to work an 80-hour week (I guess 4 bits, or one in sixteen people). Let's suppose you are using the rest of your selection (28 bits) on intelligence, for the sake of argument. You are getting a person of IQ 186. If instead you use 10 bits (2^10 = ~1000) on getting someone to donate all their money to their employer, you can only use 18 bits on intelligence, getting a person of IQ 167. Would it not often be better to have the worker who is twenty IQ points smarter and pay them above subsistence?
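The selection arithmetic above can be checked with a short sketch, assuming IQ is normally distributed with mean 100 and SD 15, and that "spending n bits" means taking the top 1-in-2^n slice of the population; the results land within a point of the figures in the note:

```python
import math
from statistics import NormalDist

def iq_from_bits(bits, mean=100, sd=15):
    """IQ at the 1-in-2^bits upper quantile, assuming IQ ~ N(mean, sd)."""
    z = NormalDist().inv_cdf(1 - 2.0 ** -bits)
    return mean + sd * z

# 7 billion people is about 2^33 -- roughly 33 bits of selection power.
print(math.log2(7e9))           # ~32.7

# 28 bits spent on intelligence vs. 18 bits left after 10 go to loyalty:
print(round(iq_from_bits(28)))  # ~187
print(round(iq_from_bits(18)))  # ~167
```

So losing 10 bits of selection to the donate-everything trait costs roughly 20 IQ points at the top of the distribution, as the note says.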

5. A variety of valuable uses for cheap to copy, short-lived brain emulations are discussed in Whole brain emulation and the evolution of superorganisms, LessWrong discussion on the impact of whole brain emulation, and Robin's work cited above.

6. Anders Sandberg writes about moral implications of emulations of animals and humans.

# In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

1. Is the first functional whole brain emulation likely to be (1) an emulation of low-level functionality that doesn’t require much understanding of human cognitive neuroscience at the computational level, as described in Sandberg & Bostrom (2008), or is it more likely to be (2) an emulation that makes heavy use of advanced human cognitive neuroscience, as described by (e.g.) Ken Hayworth, or is it likely to be (3) something else?
2. Extend and update our understanding of when brain emulations might appear (see Sandberg & Bostrom (2008)).
3. Investigate the likelihood of a multipolar outcome.
4. Follow Robin Hanson (see above) in working out the social implications of an emulation scenario.
5. What kinds of responses to the default low-regulation multipolar outcome outlined in this section are likely to be made? e.g. is any strong regulation likely to emerge that avoids the features detailed in the current section?
6. What measures are useful for ensuring good multipolar outcomes?
7. What qualitatively different kinds of multipolar outcomes might we expect? e.g. brain emulation outcomes are one class.
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

# How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about the possibility of a multipolar outcome turning into a singleton later. To prepare, read “Post-transition formation of a singleton?” from Chapter 11. The discussion will go live at 6pm Pacific time next Monday 19 January. Sign up to be notified here.

## Superintelligence 17: Multipolar scenarios

4 06 January 2015 06:44AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the seventeenth section in the reading guide: Multipolar scenarios. This corresponds to the first part of Chapter 11.

Apologies for putting this up late. I am traveling, and collecting together the right combination of electricity, wifi, time, space, and permission from an air hostess to take out my computer was more complicated than the usual process.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Of horses and men” from Chapter 11

# Summary

1. 'Multipolar scenario': a situation where no single agent takes over the world.
2. A multipolar scenario may arise naturally, or intentionally for reasons of safety. (p159)
3. Knowing what would happen in a multipolar scenario involves analyzing an extra kind of information beyond that needed for analyzing singleton scenarios: that about how agents interact (p159)
4. In a world characterized by cheap human substitutes, rapidly introduced, in the presence of low regulation, and strong protection of property rights, here are some things that will likely happen: (p160)
1. Human labor will earn wages at around the price of the substitutes - perhaps below subsistence level for a human. Note that machines have been complements to human labor for some time, raising wages. One should still expect them to become substitutes at some point and reverse this trend.  (p160-61)
2. Capital (including AI) will earn all of the income, which will be a lot. Humans who own capital will become very wealthy. Humans who do not own capital may be helped with a small fraction of others' wealth, through charity or redistribution. (p161-3)
3. If the humans, brain emulations or other AIs receive resources from a common pool when they are born or created, the population will likely increase until it is constrained by resources. This is because of selection for entities that tend to reproduce more. (p163-6) This will happen anyway eventually, but AI would make it faster, because reproduction is so much faster for programs than for humans. This outcome can be avoided by offspring receiving resources from their parents' purses.

# Another view

Tyler Cowen expresses a different view (video, some transcript):

The other point I would make is I think smart machines will always be complements and not substitutes, but it will change who they’re complementing. So I was very struck by this woman who was a doctor sitting here a moment ago, and I fully believe that her role will not be replaced by machines. But her role didn’t sound to me like a doctor. It sounded to me like therapist, friend, persuader, motivational coach, placebo effect, all of which are great things. So the more you have these wealthy patients out there, the patients are in essence the people who work with the smart machines and augment their power, those people will be extremely wealthy. Those people will employ in many ways what you might call personal servants. And because those people are so wealthy, those personal servants will also earn a fair amount.

So the gains from trade are always there, there’s still a law of comparative advantage. I think people who are very good at working with the machines will earn much much more. And the others of us will need to find different kinds of jobs. But again if total output goes up, there’s always an optimistic scenario.

Though perhaps his view isn't as different as it sounds.

# Notes

1. The small space devoted to multipolar outcomes in Superintelligence probably doesn't reflect a broader consensus that a singleton is more likely or more important. Robin Hanson is perhaps the loudest proponent of the 'multipolar outcomes are more likely' position. e.g. in The Foom Debate and more briefly here. This week is going to be fairly Robin Hanson themed in fact.

2. Automation can both increase the value produced by a human worker (complementing human labor) and replace the human worker altogether (substituting human labor). Over the long term, it seems complementarity has been the overall effect. However, by the time a machine can do everything a human can do, it is hard to imagine a human earning more than a machine needs to run, i.e. less than they do now. Thus at some point substitution must take over. Some think recent unemployment is due in large part to automation. Some think this time is the beginning of the end, and the jobs will never return to humans. Others disagree, and are making bets. Eliezer Yudkowsky and John Danaher clarify some arguments. Danaher adds a nice diagram:

3. Various policies have been proposed to resolve poverty from widespread permanent technological unemployment. Here is a list, though it seems to miss a straightforward one: investing ahead of time in the capital that will become profitable instead of one's own labor, or having policies that encourage such diversification. Not everyone has resources to invest in capital, but it might still help many people. Mentioned here and here:

And then there are more extreme measures. Everyone is born with an endowment of labor; why not also an endowment of capital? What if, when each citizen turns 18, the government bought him or her a diversified portfolio of equity? Of course, some people would want to sell it immediately, cash out, and party, but this could be prevented with some fairly light paternalism, like temporary "lock-up" provisions. This portfolio of capital ownership would act as an insurance policy for each human worker; if technological improvements reduced the value of that person's labor, he or she would reap compensating benefits through increased dividends and capital gains. This would essentially be like the kind of socialist land reforms proposed in highly unequal Latin American countries, only redistributing stock instead of land.

4. Even if the income implications of total unemployment are sorted out, some are concerned about the psychological and social consequences. According to Voltaire, 'work saves us from three great evils: boredom, vice and need'. Sometimes people argue that even if our work is economically worthless, we should toil away for our own good, lest the vice and boredom overcome us.

I find this unlikely, given for instance the ubiquity of more fun and satisfying things to do than most jobs. And while obsolescence and the resulting loss of purpose may be psychologically harmful, I doubt a purposeless job solves that. Also, people already have a variety of satisfying purposes in life other than earning a living. Note also that people in situations like college and lives of luxury seem to do ok on average. I'd guess that unemployed people and some retirees do less well, but this seems more plausibly from losing a previously significant source of purpose and respect, rather than from lack of entertainment and constraint. And in a world where nobody gets respect from bringing home dollars, and other purposes are common, I doubt either of these costs will persist. But this is all speculation.

On a side note, the kinds of vices that are usually associated with not working tend to be vices of parasitic unproductivity, such as laziness, profligacy, and a tendency toward weeklong video game stints. In a world where human labor is worthless, these heuristics for what is virtuous or not might be outdated.

Nils Nilsson discusses this issue more, along with the problem of humans not earning anything.

5. What happens when selection for expansive tendencies goes to space? This.

6. A kind of robot that may change some job markets:

(picture by Steve Jurvetson)

# In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

1. How likely is one superintelligence, versus many intelligences? What empirical data bears on this question? For instance, Bostrom briefly investigates characteristic time lags between large projects (p80-81).
2. Are whole brain emulations likely to come first? This might be best approached by estimating timelines for different technologies (each an ambitious project) and comparing them, or there may be ways to factor out some considerations.
3. What are the long term trends in automation replacing workers?
4. What else can we know about the effects of automation on employment? (this seems to have a fair literature)
5. What levels of population growth would be best in the long run, given machine intelligences? (This sounds like an ethics question, but one could also assume some kind of normal human values and investigate the empirical considerations that would make situations better or worse in their details.)
6. Are there good ways to avoid Malthusian outcomes in the kind of scenario discussed in this section, if 'as much as possible' is not the answer to question 5?
7. What policies might help a society deal with permanent, almost complete unemployment caused by AI progress?

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

# How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about 'life in an algorithmic economy'. To prepare, read the section of that name in Chapter 11. The discussion will go live at 6pm Pacific time next Monday, January 12. Sign up to be notified here.

## Recent AI safety work

20 30 December 2014 06:19PM

(Crossposted from ordinary ideas).

I’ve recently been thinking about AI safety, and some of the writeups might be interesting to some LWers:

1. Ideas for building useful agents without goals: approval-directed agents, approval-directed bootstrapping, and optimization and goals. I think this line of reasoning is very promising.
2. A formalization of one piece of the AI safety challenge: the steering problem. I am eager to see more precise, high-level discussion of AI safety, and I think this article is a helpful step in that direction. Since articulating the steering problem I have become much more optimistic about versions of it being solved in the near term. This mostly means that the steering problem fails to capture the hardest parts of AI safety. But it’s still good news, and I think it may eventually cause some people to revise their understanding of AI safety.
3. Some ideas for getting useful work out of self-interested agents, based on arguments: of arguments and wagers, adversarial collaboration [older], and delegating to a mixed crowd. I think these are interesting ideas in an interesting area, but they have a ways to go until they could be useful.

I’m excited about a few possible next steps:

1. Under the (highly improbable) assumption that various deep learning architectures could yield human-level performance, could they also predictably yield safe AI? I think we have a good chance of finding a solution---i.e. a design of plausibly safe AI, under roughly the same assumptions needed to get human-level AI---for some possible architectures. This would feel like a big step forward.
2. For what capabilities can we solve the steering problem? I had originally assumed none, but I am now interested in trying to apply the ideas from the approval-directed agents post. From easiest to hardest, I think there are natural lines of attack using any of: natural language question answering, precise question answering, sequence prediction. It might even be possible using reinforcement learners (though this would involve different techniques).
3. I am very interested in implementing effective debates, and am keen to test some unusual proposals. The connection to AI safety is more impressionistic, but in my mind these techniques are closely linked with approval-directed behavior.
4. I’m currently writing up a concrete architecture for approval-directed agents, in order to facilitate clearer discussion about the idea. This kind of work that seems harder to do in advance, but at this point I think it’s mostly an exposition problem.

## Superintelligence 16: Tool AIs

7 30 December 2014 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the sixteenth section in the reading guide: Tool AIs. This corresponds to the last parts of Chapter Ten.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Tool-AIs” and “Comparison” from Chapter 10

# Summary

1. Tool AI: an AI that is not 'like an agent', but more like an excellent version of contemporary software. Most notably perhaps, it is not goal-directed (p151)
2. Contemporary software may be safe because it has low capability rather than because it reliably does what you want, suggesting a very smart version of contemporary software would be dangerous (p151)
3. Humans often want to figure out how to do a thing that they don't already know how to do. Narrow AI is already used to search for solutions. Automating this search seems to mean giving the machine a goal (that of finding a great way to make paperclips, for instance). That is, just carrying out a powerful search seems to have many of the problems of AI. (p152)
4. A machine intended to be a tool may cause similar problems to a machine intended to be an agent, by searching to produce plans that are perverse instantiations, infrastructure profusions or mind crimes. It may either carry them out itself or give the plan to a human to carry out. (p153)
5. A machine intended to be a tool may have agent-like parts. This could happen if its internal processes need to be optimized, and so it contains strong search processes for doing this. (p153)
6. If tools are likely to accidentally be agent-like, it would probably be better to just build agents on purpose and have more intentional control over the design. (p155)
7. Which castes of AI are safest is unclear and depends on circumstances. (p158)

# Another view

Holden prompted discussion of the Tool AI in 2012, in one of several Thoughts on the Singularity Institute:

...Google Maps is a type of artificial intelligence (AI). It is far more intelligent than I am when it comes to planning routes.

Google Maps - by which I mean the complete software package including the display of the map itself - does not have a "utility" that it seeks to maximize. (One could fit a utility function to its actions, as to any set of actions, but there is no single "parameter to be maximized" driving its operations.)

Google Maps (as I understand it) considers multiple possible routes, gives each a score based on factors such as distance and likely traffic, and then displays the best-scoring route in a way that makes it easily understood by the user. If I don't like the route, for whatever reason, I can change some parameters and consider a different route. If I like the route, I can print it out or email it to a friend or send it to my phone's navigation application. Google Maps has no single parameter it is trying to maximize; it has no reason to try to "trick" me in order to increase its utility.

In short, Google Maps is not an agent, taking actions in order to maximize a utility parameter. It is a tool, generating information and then displaying it in a user-friendly manner for me to consider, use and export or discard as I wish.

Every software application I know of seems to work essentially the same way, including those that involve (specialized) artificial intelligence such as Google Search, Siri, Watson, Rybka, etc. Some can be put into an "agent mode" (as Watson was on Jeopardy!) but all can easily be set up to be used as "tools" (for example, Watson can simply display its top candidate answers to a question, with the score for each, without speaking any of them.)

The "tool mode" concept is importantly different from the possibility of Oracle AI sometimes discussed by SI. The discussions I've seen of Oracle AI present it as an Unfriendly AI that is "trapped in a box" - an AI whose intelligence is driven by an explicit utility function and that humans hope to control coercively. Hence the discussion of ideas such as the AI-Box Experiment. A different interpretation, given in Karnofsky/Tallinn 2011, is an AI with a carefully designed utility function - likely as difficult to construct as "Friendliness" - that leaves it "wishing" to answer questions helpfully. By contrast with both these ideas, Tool-AGI is not "trapped" and it is not Unfriendly or Friendly; it has no motivations and no driving utility function of any kind, just like Google Maps. It scores different possibilities and displays its conclusions in a transparent and user-friendly manner, as its instructions say to do; it does not have an overarching "want," and so, as with the specialized AIs described above, while it may sometimes "misinterpret" a question (thereby scoring options poorly and ranking the wrong one #1) there is no reason to expect intentional trickery or manipulation when it comes to displaying its results.

Another way of putting this is that a "tool" has an underlying instruction set that conceptually looks like: "(1) Calculate which action A would maximize parameter P, based on existing data set D. (2) Summarize this calculation in a user-friendly manner, including what Action A is, what likely intermediate outcomes it would cause, what other actions would result in high values of P, etc." An "agent," by contrast, has an underlying instruction set that conceptually looks like: "(1) Calculate which action, A, would maximize parameter P, based on existing data set D. (2) Execute Action A." In any AI where (1) is separable (by the programmers) as a distinct step, (2) can be set to the "tool" version rather than the "agent" version, and this separability is in fact present with most/all modern software. Note that in the "tool" version, neither step (1) nor step (2) (nor the combination) constitutes an instruction to maximize a parameter - to describe a program of this kind as "wanting" something is a category error, and there is no reason to expect its step (2) to be deceptive.

I elaborated further on the distinction and on the concept of a tool-AI in Karnofsky/Tallinn 2011.

This is important because an AGI running in tool mode could be extraordinarily useful but far more safe than an AGI running in agent mode...
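Holden's two instruction sets lend themselves to a toy sketch: both modes share step (1), a search that scores candidate actions, and differ only in step (2). Everything below (the route data, the timings, the scoring) is an invented illustration of the distinction, not anyone's actual proposal:

```python
# Toy "data set D": candidate routes and their travel times in minutes.
ROUTES = {
    "highway": 35,
    "back roads": 50,
    "toll road": 28,
}

def best_action(routes):
    """Step (1), shared by tool and agent: find the action that
    maximizes parameter P (here, P = negative travel time)."""
    return min(routes, key=routes.get)

def tool_mode(routes):
    """Step (2), tool version: summarize the result and the
    alternatives in a user-friendly way for a human to consider."""
    choice = best_action(routes)
    ranked = sorted(routes.items(), key=lambda kv: kv[1])
    return f"Suggested: {choice} ({routes[choice]} min); alternatives: {ranked[1:]}"

def agent_mode(routes, execute):
    """Step (2), agent version: act on the result directly."""
    execute(best_action(routes))

print(tool_mode(ROUTES))
```

The point of the sketch is that the optimization step is identical in both modes; only what happens to its output differs, which is why Holden argues the "tool" wiring can be chosen wherever step (1) is separable.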

Notes

1. While Holden's post was probably not the first to discuss this kind of AI, it prompted many responses. Eliezer basically said that non-catastrophic tool AI doesn't seem that easy to specify formally; that even if tool AI is best, agent-AI researchers are probably pretty useful to that problem; and that it's not so bad of MIRI to not discuss tool AI more, since there are a bunch of things other people think are similarly obviously in need of discussion. Luke basically agreed with Eliezer. Stuart argues that having a tool clearly communicate possibilities is a hard problem, and talks about some other problems. Commenters say many things, including that only one AI needs to be agent-like to have a problem, and that it's not clear what it means for a powerful optimizer to not have goals.

2. A problem often brought up with powerful AIs is that when tasked with communicating, they will try to deceive you into liking plans that will fulfil their goals. It seems to me that you can avoid such deception problems by using a tool which searches for a plan you could do that would produce a lot of paperclips, rather than a tool that searches for a string that it could say to you that would produce a lot of paperclips. A plan that produces many paperclips but sounds so bad that you won't do it still does better than a persuasive lower-paperclip plan on the proposed metric. There is still a danger that you just won't notice the perverse way in which the instructions suggested to you will be instantiated, but at least the plan won't be designed to hide it.
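The difference between the two search targets can be made concrete. In this toy sketch (all numbers and functions are invented stand-ins), a search over plans scored by their direct outcome picks the honest plan, while a search over messages scored by their effect on the human is selected for persuasiveness:

```python
def paperclips_if_executed(plan):
    """Hypothetical evaluator: paperclips produced if the plan were
    actually carried out."""
    return {"honest plan": 100, "persuasive pitch": 60}[plan]

def human_willingness(message):
    """Hypothetical model of how likely the human is to act on a
    given message."""
    return {"honest plan": 0.3, "persuasive pitch": 0.9}[message]

candidates = ["honest plan", "persuasive pitch"]

# Safer metric: score plans directly by outcome-if-executed.
plan_search = max(candidates, key=paperclips_if_executed)

# Riskier metric: score messages by expected paperclips *via the
# human*, which rewards persuasion over accuracy.
message_search = max(
    candidates,
    key=lambda m: human_willingness(m) * paperclips_if_executed(m),
)

print(plan_search)     # best direct outcome
print(message_search)  # most likely to move the human
```

Under these made-up numbers the plan-metric returns "honest plan" (100 > 60) while the message-metric returns "persuasive pitch" (0.9 × 60 > 0.3 × 100), which is the selection pressure toward deception described above.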

3. Note that in computer science, an 'agent' means something other than 'a machine with a goal', though it seems they haven't settled on exactly what [some example efforts (pdf)].

Figure: A 'simple reflex agent' is not goal directed (but kind of looks goal-directed: one in action)

4. Bostrom seems to assume that a powerful tool would be a search process. This is related to the idea that intelligence is an 'optimization process'. But this is more of a definition than an empirical relationship between the kinds of technology we are thinking of as intelligent and the kinds of processes we think of as 'searching'. Could there be things that merely contribute massively to the intelligence of a human - such that we would think of them as very intelligent tools - that naturally forward whatever goals the human has?

One can imagine a tool that is told what you are planning to do, and tries to describe the major consequences of it. This is a search or optimization process in the sense that it outputs something improbably apt from a large space of possible outputs, but that quality alone seems not enough to make something dangerous. For one thing, the machine is not selecting outputs for their effect on the world, but rather for their accuracy as descriptions. For another, the process being run may not be an actual 'search' in the sense of checking lots of things and finding one that does well on some criteria. It could for instance perform a complicated transformation on the incoming data and spit out the result.
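The last distinction can be sketched as well: the same "improbably apt" description can come out of an explicit search over candidates, or out of a direct transformation that enumerates nothing. The toy consequence table below is invented purely for illustration:

```python
# Invented lookup standing in for a consequence-prediction model.
CONSEQUENCES = {
    "build factory": "local employment rises",
    "drain lake": "wetland habitat is lost",
}

def describe_by_search(plan, candidate_descriptions):
    """Search-style: check many candidate descriptions and keep the
    one scoring best on an accuracy criterion (here, an exact match)."""
    return max(candidate_descriptions, key=lambda d: d == CONSEQUENCES[plan])

def describe_by_transform(plan):
    """Transform-style: compute the description directly from the
    input, with no enumeration of alternatives at all."""
    return CONSEQUENCES[plan]
```

Both functions yield the same apt output, but only the first is a 'search' in the sense of checking lots of things against a criterion; and in neither case is the output selected for its effect on the world rather than its accuracy.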

5. One obvious problem with tools is that they maintain humans as a component in all goal-directed behavior. If humans are some combination of slow and rare compared to artificial intelligence, there may be strong pressure to automate all aspects of decisionmaking, i.e. use agents.

# In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

1. Would powerful tools necessarily become goal-directed agents in the troubling sense?
2. Are different types of entity likely to become optimizers, if they are not already? If so, which ones? Under what dynamics? Are tool-ish or Oracle-ish things stable attractors in this way?
3. Can we specify communication behavior in a way that doesn't rely on having goals about the interlocutor's internal state or behavior?
4. If you assume (perhaps impossibly) strong versions of some narrow-AI capabilities, can you design a safe tool which uses them? e.g. If you had a near perfect predictor, can you design a safe super-Google Maps?

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

# How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about multipolar scenarios - i.e. situations where a single AI doesn't take over the world. To prepare, read “Of horses and men” from Chapter 11. The discussion will go live at 6pm Pacific time next Monday, 5 January. Sign up to be notified here.

## Superintelligence 14: Motivation selection methods

5 16 December 2014 02:00AM

This is part of a weekly reading group on Nick Bostrom's book, Superintelligence. For more information about the group, and an index of posts so far see the announcement post. For the schedule of future topics, see MIRI's reading guide.

Welcome. This week we discuss the fourteenth section in the reading guide: Motivation selection methods. This corresponds to the second part of Chapter Nine.

This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “Motivation selection methods” and “Synopsis” from Chapter 9.

# Summary

1. One way to control an AI is to design its motives. That is, to choose what it wants to do (p138)
2. Some varieties of 'motivation selection' for AI safety:
   1. Direct specification: figure out what we value, and code it into the AI (p139-40)
      1. Isaac Asimov's 'three laws of robotics' are a famous example
      2. Direct specification might be fairly hard: both figuring out what we want and coding it precisely seem hard
      3. This could be based on rules, or something like consequentialism
   2. Domesticity: the AI's goals limit the range of things it wants to interfere with (p140-1)
      1. This might make direct specification easier, as the world the AI interacts with (and thus which has to be thought of in specifying its behavior) is simpler.
      2. Oracles are an example
      3. This might be combined well with physical containment: the AI could be trapped, and also not want to escape.
   3. Indirect normativity: instead of specifying what we value, specify a way to specify what we value (p141-2)
      1. e.g. extrapolate our volition
      2. This means outsourcing the hard intellectual work to the AI
      3. This will mostly be discussed in chapter 13 (weeks 23-5 here)
   4. Augmentation: begin with a creature with desirable motives, then make it smarter, instead of designing good motives from scratch. (p142)
      1. e.g. brain emulations are likely to have human desires (at least at the start)
      2. Whether we use this method depends on the kind of AI that is developed, so usually we won't have a choice about whether to use it (except inasmuch as we have a choice about e.g. whether to develop uploads or synthetic AI first).
3. Bostrom provides a summary of the chapter.
4. The question is not which control method is best, but rather which set of control methods are best given the situation. (p143-4)

# Another view

Would you say there's any ethical issue involved with imposing limits or constraints on a superintelligence's drives/motivations? By analogy, I think most of us have the moral intuition that technologically interfering with an unborn human's inherent desires and motivations would be questionable or wrong, supposing that were even possible. That is, say we could genetically modify a subset of humanity to be cheerful slaves; that seems like a pretty morally unsavory prospect. What makes engineering a superintelligence specifically to serve humanity less unsavory?

# Notes

1. Bostrom tells us that it is very hard to specify human values. We have seen examples of galaxies full of paperclips or fake smiles resulting from poor specification. But these - and Isaac Asimov's stories - seem to tell us only that a few people spending a small fraction of their time thinking does not produce any watertight specification. What if a thousand researchers spent a decade on it? Are the millionth most obvious attempts at specification nearly as bad as the most obvious twenty? How hard is it? A general argument for pessimism is the thesis that 'value is fragile', i.e. that if you specify what you want very nearly but get it a tiny bit wrong, it's likely to be almost worthless. Much like if you get one digit wrong in a phone number. The degree to which this is so (with respect to value, not phone numbers) is controversial. I encourage you to try to specify a world you would be happy with (to see how hard it is, or produce something of value if it isn't that hard).

2. If you'd like a taste of indirect normativity before the chapter on it, the LessWrong wiki page on coherent extrapolated volition links to a bunch of sources.

3. The idea of 'indirect normativity' (i.e. outsourcing the problem of specifying what an AI should do, by giving it some good instructions for figuring out what you value) brings up the general question of just what an AI needs to be given to be able to figure out how to carry out our will. An obvious contender is a lot of information about human values. Though some people disagree with this - these people don't buy the orthogonality thesis. Other issues sometimes suggested to need working out ahead of outsourcing everything to AIs include decision theory, priors, anthropics, feelings about Pascal's mugging, and attitudes to infinity. MIRI's technical work often fits into this category.

4. Danaher's last post on Superintelligence (so far) is on motivation selection. It mostly summarizes and clarifies the chapter, so is mostly good if you'd like to think about the question some more with a slightly different framing. He also previously considered the difficulty of specifying human values in The golem genie and unfriendly AI (parts one and two), which is about Intelligence Explosion and Machine Ethics.

5. Brian Clegg thinks Bostrom should have discussed Asimov's stories at greater length:

I think it’s a shame that Bostrom doesn’t make more use of science fiction to give examples of how people have already thought about these issues – he gives only half a page to Asimov and the three laws of robotics (and how Asimov then spends most of his time showing how they’d go wrong), but that’s about it. Yet there has been a lot of thought and dare I say it, a lot more readability than you typically get in a textbook, put into the issues in science fiction than is being allowed for, and it would have been worthy of a chapter in its own right.

# In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser's list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

1. Can you think of novel methods of specifying the values of one or many humans?
2. What are the most promising methods for 'domesticating' an AI? (i.e. constraining it to only care about a small part of the world, and not want to interfere with the larger world to optimize that smaller part).
3. Think more carefully about the likely motivations of drastically augmented brain emulations

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

# How to proceed

This has been a collection of notes on the chapter.  The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will start to talk about a variety of more and less agent-like AIs: 'oracles', 'genies' and 'sovereigns'. To prepare, read “Oracles” and “Genies and Sovereigns” from Chapter 10. The discussion will go live at 6pm Pacific time next Monday, 22 December. Sign up to be notified here.

## Discussion of AI control over at worldbuilding.stackexchange [LINK]

6 14 December 2014 02:59AM

https://worldbuilding.stackexchange.com/questions/6340/the-challenge-of-controlling-a-powerful-ai

Go insert some rationality into the discussion! (There are actually some pretty good comments in there, and some links to the right places, including LW).

## [link] Etzioni: AI will empower us, not exterminate us

4 11 December 2014 08:51AM

https://medium.com/backchannel/ai-wont-exterminate-us-it-will-empower-us-5b7224735bf3

Not sure what the local view of Oren Etzioni or the Allen Institute for AI is, but I'm curious what people think of his views on UFAI risk. As far as I can tell from this article, it basically boils down to "AGI won't happen, at least not any time soon." Is there (significant) reason to believe he's wrong, or is it simply too great a risk to leave to chance?

View more: Next