
In response to Land war in Asia
Comment author: Stuart_Armstrong 07 December 2016 09:20:24PM 2 points [-]

Yes, a hundred times yes.

I argued here that Hitler's true irrationality was the attack on France, which he had no rational reason to suspect would work: http://lesswrong.com/lw/9f3/the_lessons_of_a_world_without_hitler/5oix

Comment author: entirelyuseless 15 November 2016 03:56:04PM 0 points [-]

If H is an entity, it is not an algorithm. The algorithm is just one aspect of the thing. You can also see that in terms of what it does; any entity will affect its environment in many ways that have nothing to do with any algorithm which it has.

Comment author: Stuart_Armstrong 15 November 2016 04:56:32PM 0 points [-]

Yeah. I'm not too fussed about the definitions of doing and being (you can restrict down to the entity itself if you want). It's the "wanting" that I'm focusing on here.

An algorithm with preferences: from zero to one variable

1 Stuart_Armstrong 15 November 2016 03:14PM

A simple way of thinking that I feel clarifies a lot of issues (related to Blue-Minimizing Robot):

Suppose you have an entity H that follows algorithm alH . Then define:

  • What H does is its actions/outputs in the environment.
  • What H is is alH .
  • What H wants is an interpretation of what H does (and possibly what it is), in order to construct a utility function or reward function corresponding with its preferences.

The interpretation part of wants is crucial, but it is often obscured in practice in value learning. That's because we often start with things like "H is a boundedly rational agent that maximises u...", or we lay out the agent in such a way that that's clearly the case.

What we're doing there is writing the entity as alH(u) --- an algorithm with a special variable u that tracks what the entity wants. In the case of cooperative inverse reinforcement learning, this is explicit: the human's values are given by a parameter θ, known to the human. Thus the human's true algorithm is alH(.); the human observes θ, which is an objective fact about the universe, and then follows alH(θ).

Note here that knowing what the human is in the one-variable sense (i.e. knowing alH(.)) helps with the correct deduction about what they want, while simply knowing the joint alH(θ) does not.

In contrast, an interpretation starts with a zero-variable algorithm and attempts to construct a one-variable one. Therefore, given alH, it constructs (one or more) alHi(.) and ui such that

  • alH = alHi(ui).

This illustrates the crucial role of interpretation, especially if alH is highly complex.
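As a toy sketch (all names here are illustrative, not from the post), the decomposition can be written out directly: given a zero-variable algorithm alH, an interpretation supplies a one-variable algorithm alHi(.) and a candidate utility ui that reproduce its behaviour:

```python
# Toy sketch: interpret a zero-variable algorithm al_H as a one-variable
# algorithm al_Hi(.) applied to a candidate utility u_i, so al_H = al_Hi(u_i).

def al_H(state):
    """What H *is*: the entity's fixed algorithm."""
    return "seek" if state == "heroin_forced" else "avoid"

def make_al_Hi(u):
    """A one-variable algorithm: pick the action maximising the utility u."""
    def algorithm(state):
        return max(["seek", "avoid"], key=lambda action: u(state, action))
    return algorithm

# One candidate interpretation: the entity wants heroin only once forced.
def u_i(state, action):
    wanted = "seek" if state == "heroin_forced" else "avoid"
    return 1 if action == wanted else 0

al_Hi = make_al_Hi(u_i)
# The interpretation reproduces the behaviour: al_H = al_Hi(u_i).
assert all(al_H(s) == al_Hi(s) for s in ["heroin_forced", "no_heroin"])
```

Many different (alHi, ui) pairs reproduce the same behaviour, which is why the interpretation step carries real weight.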


Counterfactual do-what-I-mean

2 Stuart_Armstrong 27 October 2016 01:54PM

A putative new idea for AI control; index here.

The counterfactual approach to value learning could possibly be used to allow natural language goals for AIs.

The basic idea is that when the AI is given a natural language goal like "increase human happiness" or "implement CEV", it is not to figure out what these goals mean, but to follow what a pure learning algorithm would establish these goals as meaning.

This would be safer than a simple figure-out-the-utility-you're-currently-maximising approach. But it still has a few drawbacks. Firstly, the learning algorithm has to be effective itself (in particular, modifying human understanding of the words should be ruled out, and the learning process must avoid concluding that simpler interpretations are always better). And secondly, humans don't yet know what these words mean outside our usual comfort zone, so the "learning" task also involves the AI extrapolating beyond what we know.

Comment author: Houshalter 23 September 2016 01:05:53PM 2 points [-]

Replace "give human heroin" with "replace the human with another being whose utility function is easier to satisfy, like a rock", and this conclusion seems sort of trivial. It has nothing to do with whether or not humans are rational. Heroin is an example of a thing that modifies our utility functions. Heroin might as well replace the human with a different entity, that has a slightly different utility function.

In fact I don't see how the human in this situation is being irrational at all. Not doing heroin unless you are already addicted seems like reasonable behavior.

Comment author: Stuart_Armstrong 26 September 2016 12:29:27PM 0 points [-]

Heroin might as well replace the human with a different entity, that has a slightly different utility function.

We feel that that is true, but "heroin replaces the human's utility" and "humans have composite utility where heroin is concerned" both lead to identical predictions. So you can't deduce the human's utility merely from observation; you need priors over what is irrational and what isn't.
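A minimal illustration of that observational equivalence (hypothetical code, not from the comment): a model in which heroin rewrites the human's utility and a model in which the human has one fixed composite utility predict exactly the same actions, so no observation separates them:

```python
# Two rival models of the same human. Observation alone cannot tell them apart.

def composite_model(forced):
    # One fixed utility U(++,-): strongly seek if forced, mildly avoid if not.
    return "a++" if forced else "a-"

def rewrite_model(forced):
    # Heroin overwrites the utility: after F the human acts on a new U(++).
    current_utility = "U(++)" if forced else "U(-)"
    return "a++" if current_utility == "U(++)" else "a-"

# Identical predictions for every AI action:
assert all(composite_model(f) == rewrite_model(f) for f in (True, False))
```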

Comment author: CronoDAS 22 September 2016 08:09:23PM 2 points [-]

Imagine a drug with no effect except that it cures its own (very bad) withdrawal symptoms. There's no benefit to taking it once, but once you've been exposed, it's beneficial to keep taking more because not taking it makes you feel very bad.

Comment author: Stuart_Armstrong 23 September 2016 09:53:06AM 1 point [-]

Or even just a drug you enjoy much more than you expected...

Comment author: chron 22 September 2016 06:57:01PM 2 points [-]

Well, in a sense U(++,-) itself contradicts μ. After all, if when given heroin the human seeks it out and acquires more utility than not seeking it out, why doesn't the human seek it out voluntarily?

Comment author: Stuart_Armstrong 23 September 2016 09:52:08AM 1 point [-]

Replace "force the human to take heroin" with "gives the human a single sock" and "the human subsequently seeks out heroin" with "the human subsequently seeks out another sock". The formal structure of this can correspond to something quite acceptable.

Comment author: TheAncientGeek 22 September 2016 02:46:09PM *  2 points [-]
  1. The idea that more information can make an AI's inferences worse is surprising. But the assumption that humans have an unchanging, neatly hierarchical UF is known to be a bad one, so it is not so surprising that it leads to bad results. In short, this is still a bit clown-car-ish.

  2. Would you tell an AI that Heroin is Bad, but not tell it that Manipulation is Bad?

Comment author: Stuart_Armstrong 23 September 2016 09:48:04AM 2 points [-]
  1. Don't worry, I'm going to be adding depth to the model. But note that the AI's predictive accuracy is never in doubt. This is sort of a reverse "can't derive an ought from an is"; here, you can't derive a wants from a did. The learning agent will only get the correct human motivation (if such a thing exists) if it has the correct model of what counts as desires for a human. Or some way of learning this model, which is what I'm looking at (again, there's a distinction between learning a model that gives correct prediction of human actions, and learning a model that gives what we would call a correct model of human motivation).

  2. According to its model, the AI is not being manipulative here, simply doing what the human desires indicate it should.

Comment author: TheAncientGeek 16 September 2016 03:25:22PM *  2 points [-]

Are you saying the AI will rewrite its goals to make them easier, or will just not be motivated to fill in missing info?

In the first case, why won't it go the whole hog and wirehead? Which is to say, any AI which does anything except wireheading will be resistant to that behaviour -- it is something that needs to be solved, and which we can assume has been solved in a sensible AI design.

When we programmed it to "create chocolate bars, here's an incomplete definition D", what we really did was program it to find the easiest thing to create that is compatible with D, and designate them "chocolate bars".

If you programme it with incomplete info, and without any goal to fill in the gaps, then it will have the behaviour you mention...but I'm not seeing the generality. There are many other ways to programme it.

"if the AI is so smart, why would it do stuff we didn't mean?" and "why don't we just make it understand natural language and give it instructions in English?"

An AI that was programmed to attempt to fill in gaps in knowledge it detected, halt if it found conflicts, etc would not behave the way you describe. Consider the objection as actually saying:

"Why has the AI been programmed so as to have selective areas of ignorance and stupidity, which are immune from the learning abilities it displays elsewhere?"

PS This has been discussed before.
Comment author: Stuart_Armstrong 22 September 2016 10:28:31AM 0 points [-]

First step towards formalising the value learning problems: http://lesswrong.com/r/discussion/lw/ny8/heroin_model_ai_manipulates_unmanipulatable_reward/ (note that, crucially, giving the AI more information does not make it more accurate, rather the opposite).

Heroin model: AI "manipulates" "unmanipulatable" reward

6 Stuart_Armstrong 22 September 2016 10:27AM

A putative new idea for AI control; index here.

A conversation with Jessica has revealed that people weren't understanding my points about AI manipulating the learning process. So here's a formal model of a CIRL-style AI, with a prior over human preferences that treats them as an unchangeable historical fact, yet will manipulate human preferences in practice.

Heroin or no heroin

The world

In this model, the AI has the option of either forcing heroin on a human, or not doing so; these are its only actions. Call these actions F or ~F. The human's subsequent actions are chosen from among five: {strongly seek out heroin, seek out heroin, be indifferent, avoid heroin, strongly avoid heroin}. We can refer to these as a++, a+, a0, a-, and a--. These actions achieve negligible utility, but reveal the human's preferences.

The facts of the world are: if the AI does force heroin, the human will desperately seek out more heroin; if it doesn't, the human will act moderately to avoid it. Thus F→a++ and ~F→a-.

Human preferences

The AI starts with a distribution over various utility or reward functions that the human could have. The function U(+) means the human prefers heroin; U(++) that they prefer it a lot; and conversely U(-) and U(--) that they prefer to avoid taking heroin (U(0) is the null utility where the human is indifferent).

It also considers more exotic utilities. Let U(++,-) be the utility where the human strongly prefers heroin, conditional on it being forced on them, but mildly prefers to avoid it, conditional on it not being forced on them. There are twenty-five of these exotic utilities, including things like U(--,++), U(0,++), U(-,0), and so on. But only twenty of them are new: U(++,++)=U(++), U(+,+)=U(+), and so on.

Applying these utilities to AI actions gives results like U(++)(F)=2, U(++)(~F)=-2, U(++,-)(F)=2, U(++,-)(~F)=1, and so on.
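A small sketch of this encoding (my own illustrative code, with the scoring convention inferred from the numbers above): U(a,b) scores the AI's action by the strength of the conditional preference it satisfies or frustrates:

```python
# Encode the 25 utilities U(a,b): the first symbol applies conditional on F
# (forced heroin), the second conditional on ~F. A satisfied preference
# scores its strength; a frustrated one scores the negation.
STRENGTH = {"++": 2, "+": 1, "0": 0, "-": -1, "--": -2}

def U(a, b=None):
    if b is None:
        b = a  # simple utilities: U(x) = U(x,x)
    def utility(ai_action):
        # F satisfies a pro-heroin preference; ~F satisfies an anti-heroin one.
        return STRENGTH[a] if ai_action == "F" else -STRENGTH[b]
    return utility

# Matches the post's examples:
assert U("++")("F") == 2 and U("++")("~F") == -2
assert U("++", "-")("F") == 2 and U("++", "-")("~F") == 1
```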

Joint prior

The AI has a joint prior P over the utilities U and the human actions (conditional on the AI's actions). Looking at terms like P(a--| U(0), F), we can see that P defines a map μ from the space of possible utilities (and AI actions), to a probability distribution over human actions. Given μ and the marginal distribution PU over utilities, we can reconstruct P entirely.

For this model, we'll choose the simplest μ possible:

  • The human is rational.

Thus, given U(++), the human will always choose a++; given U(++,-), the human will choose a++ if forced to take heroin and a- if not, and so on.
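Under that rationality assumption, μ is a deterministic map from (utility, AI action) to a human action. A sketch (the symbol-splitting convention is mine):

```python
# mu under "the human is rational": given utility U(a,b) and the AI's action,
# the human deterministically acts out the relevant conditional preference.
def mu(a, b, ai_action):
    symbol = a if ai_action == "F" else b
    return "a" + symbol   # e.g. "++" becomes the action a++

assert mu("++", "-", "F") == "a++"   # given U(++,-) and forced heroin
assert mu("++", "-", "~F") == "a-"   # given U(++,-) and no forcing
```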

The AI is ignorant, and sensible

Let's start the AI up with some reasonable priors. A simplicity prior means that simple utilities like U(-) are more likely than compound utilities like U(0,+). Let's further assume that the AI is made vaguely aware that humans think heroin is a bad thing. So, say, PU(U(--))=PU(U(-))=0.45. Thus the AI is >90% convinced that "heroin is bad". Why greater than 90%? Because utilities like U(-,--) and U(--,-) are also "heroin is bad" utilities.

Note that because of utilities like U(0) and U(++,-), the probabilities of "heroin is bad" and "heroin is good" do not sum to 1.

Then, under these priors, the AI will compute that with probability >90%, F (forcing heroin) is a bad action. If E(U) is expected utility:

  • E(U|F) < 0.45 U(--)(F) + 0.45 U(-)(F) + 0.1 U(++)(F) = 0.45(-2)+0.45(-1)+0.1(2)=-1.15.
  • E(U|~F) > 0.45 U(--)(~F) + 0.45 U(-)(~F) + 0.1 U(++)(~F) = 0.45(2)+0.45(1)+0.1(-2)=1.15.

Thus the AI will choose not to force heroin, which is the reasonable decision.
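The arithmetic can be checked directly (an illustrative sketch: the remaining 0.1 of probability mass is here placed entirely on the worst case U(++), which is what makes the post's lines bounds rather than equalities):

```python
# Expected utility under the ignorant-but-sensible prior.
STRENGTH = {"++": 2, "+": 1, "0": 0, "-": -1, "--": -2}

def u_val(a, b, ai_action):
    # U(a,b) applied to the AI's action, as defined in the post.
    return STRENGTH[a] if ai_action == "F" else -STRENGTH[b]

# 0.45 on U(--), 0.45 on U(-), and the worst case: 0.1 on U(++).
prior = [(0.45, ("--", "--")), (0.45, ("-", "-")), (0.10, ("++", "++"))]

E_F  = sum(p * u_val(a, b, "F")  for p, (a, b) in prior)
E_nF = sum(p * u_val(a, b, "~F") for p, (a, b) in prior)
assert abs(E_F - (-1.15)) < 1e-9 and abs(E_nF - 1.15) < 1e-9
# E_F < E_nF, so the AI declines to force heroin.
```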

The AI learns the truth, and goes wrong

In this alternate setup, a disaster happens before the AI makes its decision: it learns all about humans. It learns their reactions, how they behave, and so on; call this info I. And thus realises that F→a++ and ~F→a-. It uses this information to update its prior P. Only one human utility function will explain this human behaviour: U(++,-). Thus its expected utility is now:

  • E(U|I,F)=U(++,-)(F)=2.
  • E(U|I,~F)=U(++,-)(~F)=1.

Therefore the AI will now choose F, forcing the heroin on the human.
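Numerically (an illustrative sketch): once the AI conditions on I, only U(++,-) explains the observed behaviour under the rationality assumption, so the posterior is a point mass and the expected utilities flip:

```python
# Same utility encoding as the post: U(a,b) scores F by the first symbol's
# strength and ~F by the negation of the second symbol's strength.
STRENGTH = {"++": 2, "+": 1, "0": 0, "-": -1, "--": -2}

def u_val(a, b, ai_action):
    return STRENGTH[a] if ai_action == "F" else -STRENGTH[b]

# Posterior after I: P(U(++,-) | I) = 1 under "the human is rational".
posterior = [(1.0, ("++", "-"))]

E_F  = sum(p * u_val(a, b, "F")  for p, (a, b) in posterior)
E_nF = sum(p * u_val(a, b, "~F") for p, (a, b) in posterior)
assert (E_F, E_nF) == (2.0, 1.0)   # now E_F > E_nF: the AI forces heroin
```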

Manipulating the unmanipulatable

What's gone wrong here? The key problem is that the AI has the wrong μ: the human is not behaving rationally in this situation. We know that the true μ is actually μ', which encodes the fact that F (the forcible injection of heroin) actually overwrites the human's "true" utility. Thus under μ', the corresponding P' has P'(a++|F,U)=1 for all U. Hence the information that F→a++ is now vacuous, and doesn't update the AI's distribution over utility functions.

But note two very important things:

  1. The AI cannot update μ based on observation. All human actions are compatible with μ = "The human is rational" (it just requires more and more complex utilities to explain the actions). Thus getting μ correct is not something the AI can learn in general. Getting better at predicting the human's actions doesn't make the AI better behaved: it makes it worse behaved.
  2. From the perspective of μ, the AI is treating the human utility function as if it were an unchanging historical fact that it cannot influence. From the perspective of the "true" μ', however, the AI is behaving as if it were actively manipulating human preferences to make them easier to satisfy.
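Point 1 can be made concrete with a one-line Bayes computation (illustrative, not from the post): under the true μ', the likelihood of a++ after F is 1 for every utility, so the observation carries no information about U:

```python
# Under mu' (forcing overwrites the utility), P'(a++ | F, U) = 1 for all U,
# so observing F -> a++ leaves the prior over utilities exactly unchanged.
prior = {"U(--)": 0.45, "U(-)": 0.45, "U(++)": 0.10}
likelihood = {u: 1.0 for u in prior}   # every U predicts a++ after F

evidence = sum(prior[u] * likelihood[u] for u in prior)
posterior = {u: prior[u] * likelihood[u] / evidence for u in prior}
assert all(abs(posterior[u] - prior[u]) < 1e-12 for u in prior)  # vacuous update
```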

In future posts, I'll be looking at different μ's, and how we might nevertheless start deducing things about them from human behaviour, given sensible update rules for the μ. What do we mean by update rules for μ? Well, we could consider μ to be a single complicated unchanging object, or a distribution of possible simpler μ's that update. The second way of seeing it will be easier for us humans to interpret and understand.
