
Comment author: gjm 03 May 2016 05:46:06PM -1 points [-]

Again, maybe I'm misunderstanding something -- but it sounds as if you're agreeing with me: once the AI observes evidence suggesting that its message has somehow been read, it will infer (or at least act as if it has inferred) Y and do Z.

I thought we were exploring a disagreement here; is there still one?

Comment author: Stuart_Armstrong 04 May 2016 09:21:43AM 0 points [-]

I think there is no remaining disagreement - I just want to emphasise that before the AI observes such evidence, it will behave the way we want.

Comment author: gjm 03 May 2016 04:53:48PM -1 points [-]

I may be misunderstanding something, but it seems like what you just said can't be addressing the actual situation we're talking about, because nothing in it makes reference to the AI's utility function, which is the thing that gets manipulated in the schemes we're talking about.

(I agree that the AI's nominal beliefs might be quite different in the two cases, but the point of the utility-function hack is to make its actions correspond to a different set of beliefs. I'm talking about its actions, not about its purely-internal nominal beliefs.)

Comment author: Stuart_Armstrong 03 May 2016 05:21:32PM *  0 points [-]

Let V be the set of worlds in which X happens. Let W be the set of worlds in which X and Y both happen. Since Y is very unlikely, P(W)<<P(V) (however, P(W|message read) is roughly P(V|message read)). The AI gets utility u' = u|V (the utility in the non-V worlds is constant, which we may as well set to zero).

Then if the AI is motivated to maximise u' (assume for the moment that it can't affect the probability of X), it will assume it is in the set V, and essentially ignore W. To use your terminology, u(Z|X) is low or negative, u(Z|X,Y) is high, but P(Y|X)*u(Z|X,Y) is low, so it likely won't do Z.

Then, after it notices the message is read, it shifts to assuming Y happened - equivalently, that it is in the world set W. When doing so, it knows that it is almost certainly wrong - that it's more likely in a world outside of V entirely where neither X nor Y happened - but it still tries, on the off-chance that it's in W.

However, since it's an oracle, we turn it off before that point. Or we use corrigibility to change its motivations.
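
A minimal sketch of the above in code (the world labels, probabilities, utilities and function names are my own toy choices for illustration, not anything from the actual setup; Python):

    # Toy model of the utility-restriction trick described above.
    # Worlds: "neither" (X didn't happen), "V_only" (X but not Y), "W" (X and Y).
    # The AI is rewarded by u' = u restricted to V = {V_only, W};
    # utility outside V is constant, set to zero.
    prior = {"neither": 0.90, "V_only": 0.099, "W": 0.001}   # so P(W) << P(V)

    # Toy likelihood of observing "message read" in each world: if X happened
    # (message corrupted), the message only gets read when Y also holds.
    p_read = {"neither": 1.0, "V_only": 0.0, "W": 1.0}

    def posterior(observed_read):
        like = {w: (p_read[w] if observed_read else 1 - p_read[w]) for w in prior}
        z = sum(prior[w] * like[w] for w in prior)
        return {w: prior[w] * like[w] / z for w in prior}

    # Utility of doing Z vs not doing Z, zeroed outside V (toy values):
    # u(Z|X) is low or negative, u(Z|X,Y) is high.
    u_prime = {
        "do_Z":  {"neither": 0.0, "V_only": -1.0, "W": 10.0},
        "not_Z": {"neither": 0.0, "V_only":  1.0, "W":  0.0},
    }

    def best_action(dist):
        return max(u_prime, key=lambda a: sum(dist[w] * u_prime[a][w] for w in dist))

    print(best_action(prior))            # "not_Z": before the observation it acts as if in V, ignoring W
    print(best_action(posterior(True)))  # "do_Z": after seeing the message read it acts as if in W

Before the observation, P(Y|X)*u(Z|X,Y) is too small to justify Z; after the observation, the only worlds in which this toy agent gets any utility at all are the W worlds, so it acts on the off-chance it is there, as described above.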

Comment author: gjm 03 May 2016 12:33:27PM *  -1 points [-]

(I'm getting downvotes because The Person Formerly Known As Eugine_Nier doesn't like me and is downvoting everything I post.)

Yes, I agree that the utility-function hack isn't the same as altering the AI's prior. It's more like altering its posterior. But isn't it still true that the effects on its inferences (or, more precisely, on its effective inferences -- the things it behaves as if it believes) are the same as if you'd altered its beliefs? (Posterior as well as prior.)

If so, doesn't what I said follow? That is:

  • Suppose that believing X would lead the AI to infer Y and do Z.
    • Perhaps X is "my message was corrupted by a burst of random noise before reaching the users", Y is "some currently mysterious process enables the users to figure out what my message was despite the corruption", and Z is some (presumably undesired) change in the AI's actions, such as changing its message to influence the users' behaviour.
  • Then, if you tweak its utility function so it behaves exactly as if it believed X ...
  • ... then in particular it will behave as if it had inferred Y ...
  • ... and therefore will still do Z.
Comment author: Stuart_Armstrong 03 May 2016 03:04:43PM 0 points [-]

After witnessing the message being read, it would conclude Y happened, as P(Y|X and message read) is high. Before witnessing this, it wouldn't, because P(Y|X) is (presumably) very low.

Comment author: gjm 29 April 2016 11:15:15AM -2 points [-]

If your method truly makes the AI behave exactly as if it had a given false belief, and if having that false belief would lead it to the sort of conclusions V_V describes, then your method must make it behave as if it has been led to those conclusions.

Comment author: Stuart_Armstrong 03 May 2016 12:24:32PM 0 points [-]

Not quite (PS: not sure why you're getting down-votes). I'll write it up properly sometime, but false beliefs via utility manipulation are only the same as false beliefs via prior manipulation if you set the probability/utility of one event to zero.

For example, you can set the prior for a coin flip being heads as 2/3. But then, the more the AI analyses the coin and physics, the more the posterior will converge on 1/2. If, however, you double the AI's reward in the heads world, it will behave as if the probability is 2/3 even after getting huge amounts of data.
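
A small sketch of why that is (my own toy framing, illustrative only): an expected-utility maximiser whose heads-world utilities are doubled chooses exactly as if it believed P(heads) = 2p/(2p + (1-p)), which is 2/3 when the honest posterior p is 1/2.

    # Toy illustration: doubling the utility in heads-worlds acts like a tilted
    # probability that no amount of coin data can wash out.
    def effective_prob(p_heads, heads_weight=2.0):
        # The probability the agent effectively acts under when the utility of
        # every heads-world outcome is scaled by heads_weight: for any action a,
        # argmax_a [w*p*u(a,H) + (1-p)*u(a,T)] is unchanged by dividing through
        # by the positive constant (w*p + (1-p)), which yields these weights.
        w = heads_weight * p_heads
        return w / (w + (1 - p_heads))

    print(effective_prob(0.5))     # -> 0.666..., even after the posterior has converged to 1/2
    print(effective_prob(0.9999))  # near-certain data still gets tilted, though only slightly (~0.99995)

With prior manipulation, by contrast, the data moves p itself, so the 2/3 washes out.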

Comment author: So8res 02 May 2016 07:25:55PM 3 points [-]

FYI, this is not what the word "corrigibility" means in this context. (Or, at least, it's not how we at MIRI have been using it, and it's not how Stuart Russell has been using it, and it's not a usage that I, as one of the people who originally brought that word into the AI alignment space, endorse.) We use the phrase "utility indifference" to refer to what you're calling "corrigibility", and we use the word "corrigibility" for the broad vague problem that "utility indifference" was but one attempt to solve.

By analogy, imagine people groping around in the dark attempting to develop probability theory. They might call the whole topic the topic of "managing uncertainty," and they might call specific attempts things like "fuzzy logic" or "multi-valued logic" before eventually settling on something that seems to work pretty well (which happened to be an attempt called "probability theory.") We're currently reserving the "corrigibility" word for the analog of "managing uncertainty"; that is, we use the "corrigibility" label to refer to the highly general problem of developing AI algorithms that cause a system to (in an intuitive sense) reason without incentives to deceive/manipulate, and to reason (vaguely) as if it's still under construction and potentially dangerous :-)

Comment author: Stuart_Armstrong 03 May 2016 12:19:23PM 1 point [-]

Good to know. I should probably move to your usage, as it's more prevalent.

Will still use words like "corrigible" to refer to certain types of agents, though, since that makes sense for both definitions.

Comment author: Lumifer 29 April 2016 02:35:20PM 2 points [-]

"Prior" is a relative term. Often enough a prior used to be a posterior during the previous iteration.

Comment author: Stuart_Armstrong 29 April 2016 04:49:13PM 1 point [-]

But if the probability ever goes to zero, it stays there.

Comment author: HungryHobo 29 April 2016 04:14:50PM *  0 points [-]

Why must the oracle continue to believe its messages weren't read?

In the example you give, I'm guessing the reason you'd want an oracle to believe with cold certainty that its messages won't be read is to avoid it trying to influence the world with them, but that doesn't require that it continue to believe that later. As long as, when it's composing and outputting the message, it believes solidly that it will never be read and nothing can move that belief from zero, then that's fine. That does not preclude it perfectly well accepting that its past messages were in fact read, and basing its beliefs about the world on that. That knowledge, after all, cannot shift the belief that this next message, unlike all the others, will never, ever be read.

Of course that brings up the question of why an oracle would even be designed as a goal based AI with any kind of utility function. Square peg, round hole and all that.

Comment author: Stuart_Armstrong 29 April 2016 04:48:26PM 0 points [-]

For Oracles, you can reset them after they've sent out their message. For autonomous AIs, this is more tricky.

Comment author: Lumifer 28 April 2016 04:33:46PM *  0 points [-]

The issue is that it only has one tool to change beliefs - Bayesian updating

That idea has issues. Where is the agent getting its priors? Does it have the ability to acquire new priors, or can it only chain forward from pre-existing priors? And if so, is there an ur-prior, the root of the whole prior hierarchy?

How will it deal with an Outside Context Problem?

Comment author: Stuart_Armstrong 29 April 2016 10:45:58AM 0 points [-]

Does it have the ability to acquire new priors [...]?

It might, but that would be a different design. Not that that's a bad thing, necessarily, but that's not what is normally meant by priors.

Comment author: V_V 28 April 2016 07:52:12PM 0 points [-]

The oracle can infer that there is some back channel that allows the message to be transmitted even if it is not transmitted by the designated channel (e.g. the users can "mind read" the oracle). Or it can infer that the users are actually querying a deterministic copy of itself that it can acausally control. Or something.

I don't think there is any way to salvage this. You can't obtain reliable control by planting false beliefs in your agent.

Comment author: Stuart_Armstrong 29 April 2016 10:38:40AM 0 points [-]

I am not planting false beliefs. The basic trick is that the AI only gets utility in worlds in which its message isn't read (or, more precisely, in worlds where a particular stochastic event happens, which would almost certainly erase the message before reading). It's fully aware that in most worlds, its message is read; it just doesn't care about those worlds.

Comment author: Lumifer 28 April 2016 04:08:07PM 0 points [-]

Technically, no - an expected utility maximiser doesn't even have a self model.

Why not? Is there something that prevents it from having a self model?

Comment author: Stuart_Armstrong 28 April 2016 04:18:10PM 0 points [-]

You're right, it could, and that's not even the issue here. The issue is that it only has one tool to change beliefs - Bayesian updating - and that tool has no impact with a prior of zero.
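
For concreteness, a minimal sketch of why a zero prior is absorbing under Bayesian updating (illustrative Python with a toy binary hypothesis; the function and its arguments are my own, not anyone's actual design):

    def bayes_update(prior_h, p_e_given_h, p_e_given_not_h):
        # Posterior P(H|E) by Bayes' rule. If prior_h == 0 the numerator is 0,
        # so no evidence E (of nonzero probability) can ever raise it.
        num = p_e_given_h * prior_h
        denom = num + p_e_given_not_h * (1 - prior_h)
        return num / denom

    print(bayes_update(0.0, 0.99, 0.01))  # -> 0.0, however strongly the evidence favours H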
