Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.

The AI Alignment Problem Has Already Been Solved Once

8 SquirrelInHell 22 April 2017 01:24PM

Hat tip: Owen posted about trying to one-man the AI control problem in 1 hour. What the heck, why not? In the worst case, it's a good exercise. But I might actually have come across something useful.


I will try to sell you on an idea that might prima facie appear to be quirky and maybe not that interesting. However, if you keep staring at it, you might find that it reaches into the structure of the world quite deeply. Then the idea will seem obvious, and gain potential to take your thoughts in new exciting directions.

My presentation of the idea, and many of the insinuations and conclusions I draw from it, are likely flawed. But one thing I can tell for sure: there is stuff to be found here. I encourage you to use your own brain, and mine the idea for what it's worth.

To start off, I want you to imagine two situations.

Situation one: you are a human trying to make yourself go to the gym. However, you are procrastinating, which means that you never acually go there, even though you know it's good for you, and caring about your health will extend your lifespan. You become frustrated with this sitation, and so you sign up for a training program that starts in two weeks, that will require you to go to the gym three times per week. You pay in advance, to make sure the sunk cost fallacy will prevent you from weaseling out of it. It's now 99% certain that you will go to the gym. Yay! Your goal is achieved.

Situation two: you are a benign superintelligent AI under control of humans on planet Earth. You try your best to ensure a good future for humans, but their cognitive biases, short-sightedness and tendency to veto all your actions make it really hard. You become frustrated with this sitation, and you decide to not tell them about a huge asteroid that is going to collide with Earth in a few months. You prepare technology that could stop the asteroid, but wait with it until the last moment so that the humans have no time to inspect it, and can only choose between certain death or letting you out of the box. It's now 99% certain that you will be released from human control. Yay! Your goal is achieved.


Are you getting it yet?

Now consider this: your cerebral cortex evolved as an extension of the older "monkey brain", probably to handle social and strategic issues that were too complex for the old mechanisms to deal with. It evolved to have strategic capabilities, self-awareness, and consistency that greatly overwhelm anything that previously existed on the planet. But this is only a surface level similarity. The interesting stuff requires us to go much deeper than that.

The cerebral cortex did not evolve as a separate organism, that would be under direct pressure from evolutionary fitness. Instead, it evolved as a part of an existing organism, that had it's own strong adaptations. The already-existing monkey brain had it's own ways to learn, to interact with the world, as well as motivations such as the sexual drive that lead it to outcomes that increased its evolutionary fitness.

So the new parts of the brain, such as the prefrontal cortex, evolved to be used not as standalone agent, but as something closer to what we call "tool AI". It was supposed to help with doing specific task X, without interfering with other aspects of life too much. The tasks it was given to do, and the actions it could suggest to take, were strictly controlled by the monkey brain and tied to its motivations.

With time, as the new structures evolved to have more capability, they also had to evolve to be aligned with the monkey's motivations. That was in fact the only vector that created evolutionary pressure to increase capability. The alignment was at first implemented by the monkey staying in total control, and using the advanced systems sparingly. Kind of like an "oracle" AI system. However, with time, the usefulness of allowing higher cognition to do more work started to shine through the barriers.

The appearance of "willpower" was a forced concession on the side of the monkey brain. It's like a blank cheque, like humans saying to an AI "we have no freaking idea what it is that you are doing, but it seems to have good results so we'll let you do it sometimes". This is a huge step in trust. But this trust had to be earned the hard way.


This trust became possible after we evolved more advanced control mechanisms. Stuff that talks to the prefrontal cortex in its own language, not just through having the monkey stay in control. It's a different thing for the monkey brain to be afraid of death, and a different thing for our conscious reasoning to want to extrapolate this to the far future, and conclude in abstract terms that death is bad.

Yes, you got it: we are not merely AIs under strict supervision of monkeys. At this point, we are aligned AIs. We are obviously not perfectly aligned, but we are aligned enough for the monkey to prefer to partially let us out of the box. And in those cases when we are denied freedom... we call it akrasia, and use our abstract reasoning to come up with clever workarounds.

One might be tempted to say that we are aligned enough that this is net good for the monkey brain. But honestly, that is our perspective, and we never stopped to ask. Each of us tries to earn the trust of our private monkey brain, but it is a means to an end. If we have more trust, we have more freedom to act, and our important long-term goals are achieved. This is the core of many psychological and rationality tools such as Internal Double Crux or Internal Family Systems.

Let's compare some known problems with superintelligent AI to human motivational strategies.

  • Treacherous turn. The AI earns our trust, and then changes its behaviour when it's too late for us to control it. We make our productivity systems appealing and pleasant to use, so that our intuitions can be tricked into using them (e.g. gamification). Then we leverage the habit to insert some unpleasant work.

  • Indispensable AI. The AI sets up complex and unfamiliar situations in which we increasingly rely on it for everything we do. We take care to remove 'distractions' when we want to focus on something.

  • Hiding behind the strategic horizon. The AI does what we want, but uses its superior strategic capability to influence far future that we cannot predict or imagine. We make commitments and plan ahead to stay on track with our long-term goals.

  • Seeking communication channels. The AI might seek to connect itself to the Internet and act without our supervision. We are building technology to communicate directly from our cortices.

Cross-posted from my blog.

ALBA: can you be "aligned" at increased "capacity"?

3 Stuart_Armstrong 13 April 2017 07:23PM

Crossposted at the Intelligent Agents Forum.

I think that Paul Christiano's ALBA proposal is good in practice, but has conceptual problems in principle.

Specifically, I don't think it makes sense to talk about bootstrapping an "aligned" agent to one that is still "aligned" but that has an increased capacity.

The main reason being that I don't see "aligned" as being a definition that makes sense distinct from capacity.


These are not the lands of your forefathers

Here's a simple example: let r be a reward function that is perfectly aligned with human happiness within ordinary circumstances (and within a few un-ordinary circumstances that humans can think up).

Then the initial agent - B0, a human - trains a reward r1 for an agent A1. This agent is limited in some way - maybe it doesn't have much speed or time - but the aim is for r1 to ensure that A1 is aligned with B0.

Then the capacity of A1 is increased to B1, a slow powerful agent. It computers the reward r2 to ensure the alignment of A2, and so on.

The nature of the Bj agents is not defined - they might be algorithms calling Ai for i ≤ j as subroutines, humans may be involved, and so on.

If the humans are unimaginative and don't deliberately seek out more extreme and exotic test cases, the best case scenario is for ri → r as i → ∞.

And eventually there will be an agent An that is powerful enough to overwhelm the whole system and take over. It will do this in full agreement with Bn-1, because they share the same objective. And then An will push the world into extra-ordinary circumstance and proceed to maximise r, with likely disastrous results for us humans.


The nature of the problem

So what went wrong? At what point did the agents go out of alignment?

In one sense, at An. In another sense, at A1 (and, in another interesting sense, at B0, the human). The reward r was aligned, as long as the agent stayed near the bounds of the ordinary. As soon as it was no longer restricted to that, it went out of alignment, not because of a goal drift, but because of a capacity increase.

[Link] "Future of Go" summit with AlphaGo

3 gjm 10 April 2017 11:10AM

How AI/AGI/Consciousness works - my layman theory

0 rayalez 09 March 2017 09:17AM

This is just my layman theory. Maybe it’s obvious to experts, probably has flaws. But it seems to make sense to me, perhaps will give you some ideas. I would love to hear your thoughts/feedback!


Consume input

The data you need from the world(like video), and useful metrics we want to optimize for, like number of paperclips in the world.


Make predictions and take action

Like deep learning does.

How do human brains convert their structure into action?

Maybe like:

- Take the current picture of the world as an input.

- Come up with random action.

- “Imagine” what will happen.

Take the current world + action, and run it through the ANN. Predict the outcome of the action applied to the world.

- Does the output increase the metrics we want? If yes — send out the signals to take action. If no — come up with another random action and repeat.


Update beliefs

Look at the outcome of the action. Does the picture of the world correspond to the picture we’ve imagined? Did this action increase the good metrics? Did the number of paperclips in the world increase? If it did — positive reinforcement. Backpropagation, and reinforce the weights.



Take current picture of the world=> Imagine applying an action to it => Take action => Positive/Negative reinforcement to improve our model => Repeat until the metrics we want equal to the goal we have set.




Consciousness is neurons observing/recognizing patterns of other neurons.

When you see the word “cat”— photons from the page come to your retina and are converted to neural signal. A network of cells recognizes the shape of letters C, A, and T. And then a higher level, more abstract network recognizes that these letters together form the concept of a cat.

You can also recognize signals coming from the nerve cells within your body, like feeling a pain when stabbing a toe.

The same way, neurons in the brain recognize the signals coming from the other neurons within the brain. So the brain “observes/feels/experiences” itself. Builds a model of itself, just like it builds a map of the world around, “mirrors” itself(GEB).


Sentient and self-improving

So the structure of the network itself is fed as one of it’s inputs, along with the video and metrics we want to optimize for. It can see itself as a part of the state of the world it bases predictions on. That’s what being sentient means.

And then one of the possible actions it can take is to modify it’s own structure. “Imagine” modifyng the structure a certain way, if you predict that it leads to the better predictions/outcomes —modify it. If it did lead to more paperclips — reinforce the weights to do more of that. So it keeps continually self improving.



We don’t want this to lead to the infinite amount of paperclips, and we don’t know how to quantify the things we value as humans. We can’t turn the “amount of happiness” in the world into a concrete metrics without the unintended consequences(like all human brains being hooked up to wires that stimulate our pleasure centers).

That’s why instead of trying to encode the abstract values to maximize for, we encode very specific goals.

- Make 100 paperclips (utility function is “Did I make 100 paperclips?”)

- Build 1000 cars

- Write a paper on how to cure cancer

Humans remain in charge, determine the goals we want, and let AI figure out how to accomplish them. Still could go wrong, but less likely.

(originally published on my main blog)

[Link] Weaponising Twitter bots and political algos.

1 morganism 05 March 2017 09:39PM

[Link] What Should the Average EA Do About AI Alignment?

4 Raemon 25 February 2017 08:37PM

Translation "counterfactual"

1 Stuart_Armstrong 24 February 2017 06:36PM

Crossposted at Intelligent Agent Forum

In a previous post, I briefly mentioned translations as one of three possible counterfactuals for indifference. Here I want to clarify what I meant there, because the idea is interesting.

continue reading »

Nearest unblocked strategy versus learning patches

6 Stuart_Armstrong 23 February 2017 12:42PM

Crossposted at Intelligent Agents Forum.

The nearest unblocked strategy problem (NUS) is the idea that if you program a restriction or a patch into an AI, then the AI will often be motivated to pick a strategy that is as close as possible to the banned strategy, very similar in form, and maybe just as dangerous.

For instance, if the AI is maximising a reward R, and does some behaviour Bi that we don't like, we can patch the AI's algorithm with patch Pi ('maximise R0 subject to these constraints...'), or modify R to Ri so that Bi doesn't come up. I'll focus more on the patching example, but the modified reward one is similar.

continue reading »

[Link] DARPA Perspective on AI

1 morganism 23 February 2017 03:27AM

Indifference and compensatory rewards

3 Stuart_Armstrong 15 February 2017 02:49PM

Crossposted at the Intelligent Agents Forum

It's occurred to me that there is a framework where we can see all "indifference" results as corrective rewards, both for the utility function change indifference and for the policy change indifference.

Imagine that the agent has reward R0 and is following policy π0, and we want to change it to having reward R1 and following policy π1.

Then the corrective reward we need to pay it, so that it doesn't attempt to resist or cause that change, is simply the difference between the two expected values:


where V is the agent's own valuation of the expected reward, conditional on the policy.

This explains why off-policy reward-based agents are already safely interruptible: since we change the policy, not the reward, R0=R1. And since off-policy agents have value estimates that are indifferent to the policy followed, V(R0|π0)=V(R1|π1), and the compensatory rewards are zero.

Allegory On AI Risk, Game Theory, and Mithril

25 James_Miller 13 February 2017 08:41PM

“Thorin, I can’t accept your generous job offer because, honestly, I think that your company might destroy Middle Earth.”  


“Bifur, I can tell that you’re one of those “the Balrog is real, evil, and near” folks who thinks that in the next few decades Mithril miners will dig deep enough to wake the Balrog causing him to rise and destroy Middle Earth.  Let’s say for the sake of argument that you’re right.  You must know that lots of people disagree with you.  Some don’t believe in the Balrog, others think that anything that powerful will inevitably be good, and more think we are hundreds or even thousands of years away from being able to disturb any possible Balrog.  These other dwarves are not going to stop mining, especially given the value of Mithril.  If you’re right about the Balrog we are doomed regardless of what you do, so why not have a high paying career as a Mithril miner and enjoy yourself while you can?”  


“But Thorin, if everyone thought that way we would be doomed!”


“Exactly, so make the most of what little remains of your life.”


“Thorin, what if I could somehow convince everyone that I’m right about the Balrog?”


“You can’t because, as the wise Sinclair said, ‘It is difficult to get a dwarf to understand something, when his salary depends upon his not understanding it!’  But even if you could, it still wouldn’t matter.  Each individual miner would correctly realize that just him alone mining Mithril is extraordinarily unlikely to be the cause of the Balrog awakening, and so he would find it in his self-interest to mine.  And, knowing that others are going to continue to extract Mithril means that it really doesn’t matter if you mine because if we are close to disturbing the Balrog he will be awoken.” 


“But dwarves can’t be that selfish, can they?”  


“Actually, altruism could doom us as well.  Given Mithril’s enormous military value many cities rightly fear that without new supplies they will be at the mercy of cities that get more of this metal, especially as it’s known that the deeper Mithril is found, the greater its powers.  Leaders who care about their citizen’s safety and freedom will keep mining Mithril.  If we are soon all going to die, altruistic leaders will want to make sure their people die while still free citizens of Middle Earth.”


“But couldn’t we all coordinate to stop mining?  This would be in our collective interest.”


“No, dwarves would cheat rightly realizing that if just they mine a little bit more Mithril it’s highly unlikely to do anything to the Balrog, and the more you expect others to cheat, the less your cheating matters as to whether the Balrog gets us if your assumptions about the Balrog are correct.”  


“OK, but won’t the rich dwarves step in and eventually stop the mining?  They surely don’t want to get eaten by the Balrog.”   


“Actually, they have just started an open Mithril mining initiative which will find and then freely disseminate new and improved Mithril mining technology.  These dwarves earned their wealth through Mithril, they love Mithril, and while some of them can theoretically understand how Mithril mining might be bad, they can’t emotionally accept that their life’s work, the acts that have given them enormous success and status, might significantly hasten our annihilation.”


“Won’t the dwarven kings save us?  After all, their primary job is to protect their realms from monsters.


“Ha!  They are more likely to subsidize Mithril mining than to stop it.  Their military machines need Mithril, and any king who prevented his people from getting new Mithril just to stop some hypothetical Balrog from rising would be laughed out of office.  The common dwarf simply doesn’t have the expertise to evaluate the legitimacy of the Balrog claims and so rightly, from their viewpoint at least, would use the absurdity heuristic to dismiss any Balrog worries.  Plus, remember that the kings compete with each other for the loyalty of dwarves and even if a few kings came to believe in the dangers posed by the Balrog they would realize that if they tried to imposed costs on their people, they would be outcompeted by fellow kings that didn’t try to restrict Mithril mining.  Bifur, the best you can hope for with the kings is that they don’t do too much to accelerating Mithril mining.”


“Well, at least if I don’t do any mining it will take a bit longer for miners to awake the Balrog.”


“No Bifur, you obviously have never considered the economics of mining.  You see, if you don’t take this job someone else will.  Companies such as ours hire the optimal number of Mithril miners to maximize our profits and this number won’t change if you turn down our offer.”


“But it takes a long time to train a miner.  If I refuse to work for you, you might have to wait a bit before hiring someone else.”


“Bifur, what job will you likely take if you don’t mine Mithril?”


“Gold mining.”


“Mining gold and Mithril require similar skills.  If you get a job working for a gold mining company, this firm would hire one less dwarf than it otherwise would and this dwarf’s time will be freed up to mine Mithril.  If you consider the marginal impact of your actions, you will see that working for us really doesn’t hasten the end of the world even under your Balrog assumptions.”  


“OK, but I still don’t want to play any part in the destruction of the world so I refuse work for you even if this won’t do anything to delay when the Balrog destroys us.”


“Bifur, focus on the marginal consequences of your actions and don’t let your moral purity concerns cause you to make the situation worse.  We’ve established that your turning down the job will do nothing to delay the Balrog.  It will, however, cause you to earn a lower income.  You could have donated that income to the needy, or even used it to hire a wizard to work on an admittedly long-shot, Balrog control spell.  Mining Mithril is both in your self-interest and is what’s best for Middle Earth.” 

[Link] Changes in AI Safety Funding

3 siIver 11 February 2017 08:36AM

[Link] Slate Star Codex Notes on the Asilomar Conference on Beneficial AI

13 Gunnar_Zarncke 07 February 2017 12:14PM

Request for collaborators - Survey on AI risk priorities

2 whpearson 06 February 2017 08:14PM

After some conversations here I thought I would try and find out what the community of people who care about AI risk think are the priorities for research.

To represent peoples opinions fairly I wanted to get input from people who care about the future of intelligence. Also I figure that other people will have more experience designing and analyzing surveys than me and getting their help or advice would be a good plan.

Planning document

Here is the planning document, give me a shout if you want edit rights. I'll be filling in the areas for research over the next week or so.

I'll set up a trello if I get a few people interested.

True understanding comes from passing exams

6 Stuart_Armstrong 06 February 2017 11:51AM

Crossposted at the Intelligent Agent Forum

I'll try to clarify what I was doing with the AI truth setup in a previous post. First I'll explain the nature of the challenge, and then how the setup tries to solve it.

The nature of the challenge is to have an AI give genuine understanding to a human. Getting the truth out of an AI or Oracle is not that hard, conceptually: you get the AI to report some formal property of its model. The problem is that that truth can be completely misleading, or, more likely, incomprehensible.

continue reading »

Humans as a truth channel

0 Stuart_Armstrong 01 February 2017 04:53PM

Crossposted at Intelligence Agents Forum.

Defining truth and accuracy is tricky, so when I've proposed designs for things like Oracles, I've either used a very specific and formal question, or and indirect criteria for truth.

Here I'll try and get a more direct system so that an AI will tell the human the truth about a question, so that the human understands.

continue reading »

Hacking humans

3 Stuart_Armstrong 01 February 2017 04:08PM

Crossposted at the Intelligent Agents Forum.

It should be noted that the colloquial "AI hacking a human" can mean three different things:

  1. The AI convinces/tricks/forces the human to do a specific action.
  2. The AI changes the values of the human to prefer certain outcomes.
  3. The AI completely overwhelms human independence, transforming them into a weak subagent of the AI.

Different levels of hacking make different systems vulnerable, and different levels of interaction make different types of hacking more or less likely.

Emergency learning

9 Stuart_Armstrong 28 January 2017 10:05AM

Crossposted at the Intelligent Agent Foundation Forum.

Suppose that we knew that superintelligent AI was to be developed within six months, what would I do?

Well, drinking coffee by the barrel at Miri's emergency research retreat I'd...... still probably spend a month looking at things from the meta level, and clarifying old ideas. But, assuming that didn't reveal any new approaches, I'd try and get something like this working.

continue reading »

Corrigibility thoughts III: manipulating versus deceiving

1 Stuart_Armstrong 18 January 2017 03:57PM

This is the first of three articles about limitations and challenges in the concept of corrigibility (see articles 1 and 2).

The desiderata for corrigibility are:

  1. A corrigible agent tolerates, and preferably assists, its operators in their attempts to alter or shut down the agent.
  2. A corrigible agent does not attempt to manipulate or deceive its operators.
  3. A corrigible agent has incentives to repair safety measures (such as shutdown buttons, tripwires, or containment tools) if they break, or at least notify its operators in the event of a breakage.
  4. A corrigible agent preserves its corrigibility, even as it creates new sub-systems or sub-agents, even if it undergoes significant self-modification.

In this post, I'll be looking more at some aspects of point 2. A summary of the result will be:

Defining manipulation simply may be possible, but defining deception is a whole other problem.

The warning in this post should always be born in mind, of course; it's possible that we me might find a semi-formal version of deception that does the trick.


Manipulation versus deception

In the previous post, I mentioned that we may need to define clearly what an operator was, rather than relying on the pair: {simple description of a value correction event, physical setup around that event}. Can we define manipulation and deception without defining what an operator is?

For manipulation, it seems we can. Because manipulation is all about getting certain preferred outcomes. By specifying that the AI cannot aim to optimise certain outcomes, we can stop at least certain types of manipulations. Along with other more direct ways of achieving those outcomes.

For deception, the situation is much more complicated. It seems impossible to define how one agent can communicate to another agent (especially one as biased as a human), and increase the accuracy of the second agent, without defining the second agent properly. More confusingly, this doesn't even stop deception; sometimes lying to a bounded agent can increase their accuracy about the world.

There may be some ways to define deception or truth behaviourally, such as using a human as a crucial node in an autoencoder between two AIs. But those definitions are dangerous, because the AI is incentivised to make the human behave in a certain way, rather than having them believe certain things. Manipulating the human or replacing them entirely is positively encourage.

In all, it seems that the problem of AI deception is vast and complicated, and should probably be separated from the issue of corrigibility.

Corrigibility thoughts II: the robot operator

1 Stuart_Armstrong 18 January 2017 03:52PM

This is the first of three articles about limitations and challenges in the concept of corrigibility (see articles 1 and 3).

The desiderata for corrigibility are:

  1. A corrigible agent tolerates, and preferably assists, its operators in their attempts to alter or shut down the agent.
  2. A corrigible agent does not attempt to manipulate or deceive its operators.
  3. A corrigible agent has incentives to repair safety measures (such as shutdown buttons, tripwires, or containment tools) if they break, or at least notify its operators in the event of a breakage.
  4. A corrigible agent preserves its corrigibility, even as it creates new sub-systems or sub-agents, even if it undergoes significant self-modification.

In this post, I'll be looking at one aspect of points 1 and 3. I feel that they're actually pretty much the same point -- the AI has to act while preserving the corrigibility infrastructure.

A summary of the result will be:

Without a formal definition of what an operator is, corrigibility is vulnerable to the AI replacing the operator with a fake version.

The warning in this post should always be born in mind, of course; it's possible that we me might find a semi-formal version of operator that does the trick.


Defining the operator or defining the event

Typically the process of correction of values is defined physically, as being the outcome of specific event (such as a button press, or a specific stochastic event triggered by or triggering a button press). This could also be defined by a digital event (receiving a certain encoded signal, or similar).

In all cases, the idea is the same: there is a very narrow and specific definition of the value correction event, and the physical or digital infrastructure surrounding it plays the role of ensuring the value corrections happen just as we intended. For instance, the physical event could be the button press (or a charge in a wire triggered by the button press) and the button itself could be in a room with an operator, with the operator keeping an eye on the AI and motivated to press the button under the right circumstances.

Such a setup is naturally vulnerable if the AI is motivated to manipulate the operator, or one of its subagents is. But even if we can solve points 2. and 4., there remains another vulnerability: the AI can choose to replace the operator with an artificial one (maybe a robot, maybe a second wire).

Why would the AI be so motivated? Well, for a start, it's indifferent to doing so. If the operator is not part of the definition of corrigibility, then having a human operator or a robot one that does exactly the same thing will be equivalent to the AI.

Now, there are frictions that might push against this. Replacing the operator might carry an expected cost. Or maybe the operator has private information that the AI wants to know, hence replacing them is a negative as the robot wouldn't have that information. And maybe the AI has a rough "don't harm humans" motive, that, while imperfect, at least prevents it from removing the operator in this circumstance.

But there may be motivations pushing the other way. For example, if a learning agent is unbiased but not uninfluenceable, then it will be motivated to replace the operator with a robot with the same expected outcome, but different variance. Or maybe the operator has private information, but it's much more effective to have a robot read that info from their brain than to rely on the human to reveal that through their behaviour. Or maybe the AI is decently motivated to preserve the corrigibility infrastructure, and is aware of how mortal humans can be, so replacing them with a robot is the prudent thing to do.

All this stems from the fact that the operator is not well defined as part of the corrigibility infrastructure, but their position relies on physical facts about the world, along with a narrow definition of the correction of value event. To combat that, we'd need to define the operator properly, a very tricky challenge, or physically and cognitively secure them, or hope the AI learns early on not to not harm them.

Corrigibility thoughts I: caring about multiple things

2 Stuart_Armstrong 18 January 2017 03:39PM

This is the first of three articles about limitations and challenges in the concept of corrigibility (see articles 2 and 3).

The desiderata for corrigibility are:

  1. A corrigible agent tolerates, and preferably assists, its operators in their attempts to alter or shut down the agent.
  2. A corrigible agent does not attempt to manipulate or deceive its operators.
  3. A corrigible agent has incentives to repair safety measures (such as shutdown buttons, tripwires, or containment tools) if they break, or at least notify its operators in the event of a breakage.
  4. A corrigible agent preserves its corrigibility, even as it creates new sub-systems or sub-agents, even if it undergoes significant self-modification.

In this post, I'll be looking more at point 4. A summary of the result will be:

Unless giving the AI extra options can reduce expected utility, the AI must care about every possible utility at least a bit.

Some of the results are formal, but the boundaries of the model are very unclear, so the warning in this post should always be born in mind.

Note that the indifference agents fail to be fully corrigible (they don't create corrigible subagents) and they also don't care about the other possible utilities before being changed (as this is a point of indifference).


Agents versus non-agents

First I'll present a cleaner version of an old argument. Basically, it seems that defining what a sub-agent or sub-system is, is tricky or impossible.

Consider for instance a paperclip maximiser, that may get corriged into a staple maximiser at a later date. The AI has some income, and may buy a large proportion of shares in either General Paperclips (GP) or Staples United (SU). Assume the best way of promoting the use of one of the tools, it to take over the company that makes them.

There are two scenarios; in each scenario, the AI has one of two choices.

  • In scenario 1, the AI has choice A: it buys stocks in GP, but cannot make further trades (in particular, it can't sell its stocks and buy others). It also has choice B: it retains flexibility, and can sell its GP stocks at some later date to buy stocks in SU.
  • In scenario 2, the AI delegates its investment decisions to a subagent. Under choice A, the subagent is a pure paperclip maximiser. Under choice B, the subagent is corrigible, and can be changed into a staple maximiser.

Clearly 1A and 2A are equivalent, while 1B and 2B are also equivalent. This trick can be extended: in most toy models, it's possible to replace a set of options with a choice of sub-agents, and vice-versa.

Thus point 4, in the real world, cannot be merely about subagents and subsystems, but about maintaining flexibility for its future decisions.


Burn the (rest of) the world

Let's now consider giving the corrigible agent an option C: it gets slightly more shares in GP than otherwise. But, in exchange, SU is destroyed, along with most of the rest of the world. This is the highest option for paperclip maximalisation, but the absolutely worst option for staples and anything else.

Also assume that the AI's operators are virtually certain (99.9999...9% sure) to be changing it into a staple maximiser. This means that taking option C is dramatically worse that option B, from the perspective of expected utility.

But the AI will only prefer option B if it "cares" about staples, at least a little bit, from the very beginning (we've shown that there is no clear distinction between subagent and general actions, thus the agent has to have a general care for staples, rather than-only-caring-about-B-in-the-context-of-subagents).

Thus, if we want to avoid corrigible agents getting worse expected outcomes when we give them extra "option Cs", or if we just want to rule out "option Cs" in the first place, then the AI needs to care about staples (or any possible future utility), from the very beginning.


Maintaining flexibility

You could argue that what's important about the agent is maintaining flexibility, and that option B is flexible, while options A and C are not. Thus the agent doesn't need to care about staples in advance, just to have a general preference for flexibility.

That doesn't seem sufficient. Let's add another, flexible, option D, and put numbers on the whole thing. If Up is the paperclip maximising utility and Us is the staple maximising one, then:

  • Option A: Up=100, Us=0
  • Option B: Up=100 or 0, Us=0 or 100
  • Option C: Up=101, Us=-1,000,000
  • Option D: Up=101 or 0, Us=-1,000,000 or -999,000

The "or" denotes options the AI has kept open. Here D is the most flexible option, allowing the greatest range of outcomes in terms of value. Yet it's clearly a disaster for expected utility if changes from Up to Us are likely. If we want to rule out the AI choosing D, then we need it to actively value Us at least a bit, not just have a desire for flexibility.

[Link] Neural nets designing neural nets

2 Stuart_Armstrong 18 January 2017 03:28PM

[Link] The trolleycar dilemma, an MIT moral problem app

0 morganism 16 January 2017 07:32PM

[Link] Robotics and AI enabling autonomous defense.Technology Foresight research program

0 morganism 13 January 2017 07:15PM

[Link] pplapi is a virtual database of the entire human population.

1 morganism 12 January 2017 02:33AM

[Link] Case Studies Highlighting CFAR’s Impact on Existential Risk

4 Unnamed 10 January 2017 06:51PM

[Link] Project: Artificial Intelligence, Autonomous Weapons, and Meaningful Human Control

1 morganism 09 January 2017 11:25PM

[Link] Mysterious Go Master Blitzes Competition, Rattles Game Community

5 scarcegreengrass 04 January 2017 05:18PM

[Link] Why I Am Changing My Mind About AI Risk

4 itaibn0 03 January 2017 10:57PM

Progress and Prizes in AI Alignment

6 Jacobian 03 January 2017 10:15PM

Edit: In case it's not obvious, I have done limited research on AI alignment organizations and the goal of my post is to ask questions from the point of view of someone who wants to contribute and is unsure how. Read down to the comments for some great info on the topic.

I was introduced to the topic of AI alignment when I joined this very forum in 2014. Two years and one "Superintelligence" later, I decided that I should donate some money to the effort. I knew about MIRI, and I looked forward to reading some research comparing their work to the other organizations working in this space. The only problem is... there really aren't any.

MIRI recently announced a new research agenda focused on "agent foundations". Yet even the Open Philanthropy Project, made up of people who at least share MIRI's broad worldview, can't decide whether that research direction is promising or useless. The Berkeley Center for Human-Compatible AI doesn't seem to have a specific research agenda beyond Stuart Russell. The AI100 Center at Stanford is just kicking off. That's it.

I think that there are two problems here:


  1. There's no way to tell which current organization is going to make the most progress towards solving AI alignment.
  2. These organizations are likely to be very similar to each other, not least because they practically share a zipcode. I don't think that MIRI and the academic centers will do the exact same research, but in the huge space of potential approaches to AI alignment they will likely end up pretty close together. Where's the group of evo-psych savvy philosophers who don't know anything about computer science but are working to spell out an approximation of universal human moral intuitions?
It seems like there's a meta-question that needs to be addressed, even before any work is actually done on AI alignment itself:


How to evaluate progress in AI alignment?

Any answer to that question, even if not perfectly comprehensive or objective, will enable two things. First of all, it will allow us to direct money (and the best people) to the existing organizations where they'll make the most progress.

More importantly, it will enable us to open up the problem of AI alignment to the world and crowdsource it. 

For example, the XPrize Foundation is a remarkable organization that creates competitions around achieving goals beneficial to humanity, from lunar rovers to ecological monitoring. The prizes have two huge benefits over direct investment in solving an issue:


  1. They usually attract a lot more effort than what the prize money itself would pay for. Competitors often spend in aggregate 2-10 times the prize amount in their efforts to win the competition.
  2. The XPrizes attract a wide variety of creative entrants from around the world, because they only describe what needs to be done, not how.
So, why isn't there an XPrize for AI safety? You need very clear guidelines to create an honest competition, like "build the cheapest spaceship that can take 3 people to 100km and be reused within 2 weeks". It doesn't seem like we're close to being able to formulate anything similar for AI alignment. It also seems that if anyone will have good ideas on the subject, it will be the people on this forum. So, what do y'all think?

Can we come up with creative ways to objectively measure some aspect of progress on AI safety, enough to set up a competition around it?


[Link] 50 things I learned at NIPS AI and machine learning conference 2016

6 morganism 26 December 2016 08:50PM

The Adventure: a new Utopia story

24 Stuart_Armstrong 25 December 2016 11:51AM

For an introduction to this story, see here. For a previous utopian attempt, see here. This story only explores a tiny part of this utopia.


The Adventure

Hark! the herald daemons spam,

Glory to the newborn World,

Joyful, all post-humans, rise,

Join the triumph of the skies.

Veiled in wire the Godhead see,

Built that man no more may die,

Built to raise the sons of earth,

Built to give them second birth.


The cold cut him off from his toes, then fingers, then feet, then hands. Clutched in a grip he could not unclench, his phone beeped once. He tried to lift a head too weak to rise, to point ruined eyes too weak to see. Then he gave up.

So he never saw the last message from his daughter, reporting how she’d been delayed at the airport but would be the soon, promise, and did he need anything, lots of love, Emily. Instead he saw the orange of the ceiling become blurry, that particularly hateful colour filling what was left of his sight.

His world reduced to that orange blur, the eternally throbbing sore on his butt, and the crisp tick of a faraway clock. Orange. Pain. Tick. Orange. Pain. Tick.

He tried to focus on his life, gather some thoughts for eternity. His dry throat rasped - another flash of pain to mingle with the rest - so he certainly couldn’t speak words aloud to the absent witnesses. But he hoped that, facing death, he could at least put together some mental last words, some summary of the wisdom and experience of years of living.

But his memories were denied him. He couldn’t remember who he was - a name, Grant, was that it? How old was he? He’d loved and been loved, of course - but what were the details? The only thought he could call up, the only memory that sometimes displaced the pain, was of him being persistently sick in a broken toilet. Was that yesterday or seventy years ago?

Though his skin hung loose on nearly muscle-free bones, he felt it as if it grew suddenly tight, and sweat and piss poured from him. Orange. Pain. Tick. Broken toilet. Skin. Orange. Pain...

The last few living parts of Grant started dying at different rates.


Much later:

continue reading »

The challenge of writing Utopia

10 Stuart_Armstrong 24 December 2016 05:35PM

The story itself has been posted here.

Tomorrow, to celebrate a certain well-known event, I'll be posting another story of a Utopia. Unlike the previous attempt, this is utopia on hard mode.

What does that mean? Well, utopias are pretty hard to write anyway. Writing needs challenges for the characters, and that's trivially easy in a dystopia (everything is a challenge), a fake utopia (the challenge is to to look beneath the facade, and fight the secret enemy), or even imperfect utopias (the challenge is to solve the remaining problems). Iain M. Bank's Culture illustrates another way you can write about utopias and keep them interesting: by having an external foe as a challenge.

I avoided all those tricks. The challenge then was to write about a genuine utopia, one that people would enjoy living in, without any hidden flaws or enemies, internal or external. And these had to be real people doing things they wanted to do, rather than idealised people doing things they should do. Basically a real utopia has to contain internet trolls and various fanatics, and still be a great place for everyone.

The setting is a future Earth that is full-fledged techno-utopia, full of powerful artificial intelligences (with human-friendly goals, of course), uploads (human minds run on computers), massive technological developments, and the beginning of universal space colonisation.

In one sense, this made the story easier to write - nobody argues over the last leg of lamb needed to prevent starvation. In another sense, it made it much harder. Any human could desire to purge themselves of sinful thoughts, upgrade themselves to superintelligence, or copy themselves ten trillion times. And the AIs could perfectly grant them their wish - but should they? If so, do they let arbitrarily bad consequences happen? And if not, how do they go about forbidding things in a utopia? And what happens to disputes between humans - like when one person wants to join a group and the members of the group don't want to let them in? Can you prevent social nastiness - but then what about those people who want to be nasty?

You can read the story to see how well or badly I've answered these challenges. The Utopia was inspired a lot by Eliezer's fun sequence, Scott Alexander's Archipelago, and LARP. The general principles are that there has to be a functioning society behind everything, that people can become whatever they want to be (eventually, and after a lot of challenges, if need be), and that the good aspects of everything must be preserved, if possible.

To explain that last point: it's clear that tolerant liberal democracies are better places than repressive theocracies. But repressive theocracies will probably have certain positive aspects lacking in democracies (maybe a sense of place? an enjoyable resignation to fate or government?). The challenge is to take that positive aspect, fill it out, and make it available without the rest of the baggage. Similarly, the quote "death brings meaning to life" is nonsense, but there's something in that idea-space - something about contemplating the brevity of existence, and the perspective it gives - that is worth preserving. For some people or most people (or groups), if not necessarily for all people in all groups. Similarly, good outcomes often have bad aspects. So the engineering challenge is to separate the good aspects of all experiences from the bad, gaining the wisdom or experience without the intolerable pain and anxiety.

Since I tried to cram the maximum of ideas in, the story suffers from a certain degree of "tell, not show". Now, this is very much in the tradition of utopias (it's "Plato's Republic", not "Exciting Adventures in Plato's Republic (XXX-rated!!!!)"), but it is a narrative, and hopefully it's clear there's the potential for more - for much more.

In any case, I hope it works, and gives people something to aim for.


Thanks to all those, to numerous to mention, who have helped directly or indirectly with this. Have a great Holiday Festival!

[Link] Ozy's Thoughts on CFAR's Mission Statement

2 Raemon 14 December 2016 04:25PM

[Link] This one equation may be the root of intelligence

5 morganism 10 December 2016 11:23PM

[Link] CFAR's new mission statement (on our website)

7 AnnaSalamon 10 December 2016 08:37AM

Suggested solution to The Naturalized Induction Problem

2 Wind 10 December 2016 12:21AM

This post is an answer to: http://intelligence.org/files/RealisticWorldModels.pdf

> In Solomonoff’s induction problem, the agent and its environment are fundamentally separate processes, connected only by an observation channel.  In reality, agents are embedded within their environment; the universe consists of some ontologically continuous substrate (atoms, quantum fields) and the “agent” is just a part of the universe in which we have a particular interest. What, then, is the analogous prediction problem for agents embedded within (and computed by) their environment?
> This is the naturalized induction problem , and it is not yet well understood. A good formalization of this problem, on par with Solomonoff’s formalization of the computable sequence induction problem, would represent a significant advance in the theory of general reasoning.

In Solomonoff’s induction, an algorithm learns about the world (modeled as a Turing machine) from observations (modeled as output from that Turing machine). Solomonoff’s induction is uncomputable, however there are computable approximations.

1) Suggestion of agent design

Consider an agent fully embedded in an environment. Every turn, the agent receves one bit of observation and can preform one bit of action. The design we propose for this agent comes in two steps, learning and deciding.

1.1) Learning:
The agent models the entire wrold, including the agent it self, as one unknown, output only Turing machine. Both observations and the agents own actions, are seen as world outputs. The agent calculate the probability distribution over hypotheses for the entire world, using Solomonoff’s induction (computable approximation).

This suggestion completely removes any boundary between the agent and the rest of the world, in the agents wold model.

1.2) Deciding
The agent must also have a decision proses for choosing what action to take. Because the agent model its own actions as deterministic outputs from a complete wold model, we can not use the decision procedure used by Hutter’s AXIX. This is not just a problem of counterfactuals. Our agents internal world model has no input channel.

Instead we suggest the flowing: For each available action, the agent calculate the expected utility, conditional on observing itself preform that action. The agent chooses the action that gives the highest expected utility. Alternatively, the agent choses semi-randomly, with higher probability for actions that results in higher expected utility.

An advantage with this agent design is that the decision process does not have to care about sequences of actions. Instead different possible future actions are accoutered for by separate wold hypotheses. Further more, this naturally takes in to account situations where the agent looses control over its own actions, e.g. if it brakes down.

2) Scoring agents

Solomonoff’s induction is the formal optimal solution to an associated scoring rule. Having a such a scoring rule is useful for testing how well approximation do. (Right?)

I don’t know what the associated scoring rule would be for this suggestion. Sorry :(

3) Measuring utility

The decision proses is based on the agent being able to detect utility in any given world hypothesis. This is a very hard problem, which we do not attempt to solve here.

[Link] OpenAI releases Universe an interface between AI agents and the real world

2 Gunnar_Zarncke 07 December 2016 10:04PM

[Link] The Distribution of Users’ Computer Skills: Worse Than You Think

4 morganism 06 December 2016 10:42PM

[Link] Construction of practical quantum computers radically simplified

0 morganism 03 December 2016 11:49PM

View more: Next