gjm comments on JFK was not assassinated: prior probability zero events - Less Wrong

Post author: Stuart_Armstrong 27 April 2016 11:47AM


Comment author: gjm 03 May 2016 12:33:27PM -1 points

(I'm getting downvotes because The Person Formerly Known As Eugine_Nier doesn't like me and is downvoting everything I post.)

Yes, I agree that the utility-function hack isn't the same as altering the AI's prior. It's more like altering its posterior. But isn't it still true that the effects on its inferences (or, more precisely, on its effective inferences -- the things it behaves as if it believes) are the same as if you'd altered its beliefs? (Posterior as well as prior.)

If so, doesn't what I said follow? That is:

  • Suppose that believing X would lead the AI to infer Y and do Z.
    • Perhaps X is "my message was corrupted by a burst of random noise before reaching the users", Y is "some currently mysterious process enables the users to figure out what my message was despite the corruption", and Z is some (presumably undesired) change in the AI's actions, such as changing its message to influence the users' behaviour.
  • Then, if you tweak its utility function so it behaves exactly as if it believed X ...
  • ... then in particular it will behave as if it had inferred Y ...
  • ... and therefore will still do Z.
Comment author: Stuart_Armstrong 03 May 2016 03:04:43PM 0 points

After witnessing the message being read, it would conclude Y happened, as P(Y|X and message read) is high. Before witnessing this, it wouldn't, because P(Y|X) is (presumably) very low.

Comment author: gjm 03 May 2016 04:53:48PM -1 points

I may be misunderstanding something, but it seems like what you just said can't be addressing the actual situation we're talking about, because nothing in it makes reference to the AI's utility function, which is the thing that gets manipulated in the schemes we're talking about.

(I agree that the AI's nominal beliefs might be quite different in the two cases, but the point of the utility-function hack is to make its actions correspond to a different set of beliefs. I'm talking about its actions, not about its purely-internal nominal beliefs.)

Comment author: Stuart_Armstrong 03 May 2016 05:21:32PM 0 points

Let V be the set of worlds in which X happens. Let W be the set of worlds in which X and Y both happen (so W is a subset of V). Since Y is very unlikely given X, P(W)<<P(V) (however, P(W|message read) is roughly P(V|message read)). The AI gets utility u' = u|V (the utility in the non-V worlds is constant, which we may as well set to zero).

Then if the AI is motivated to maximise u' (assume for the moment that it can't affect the probability of X), it will assume it is in the set V, and essentially ignore W. To use your terminology, u(Z|X) is low or negative, u(Z|X,Y) is high, but P(Y|X)*u(Z|X,Y) is low, so it likely won't do Z.

Then, after it notices the message is read, it shifts to assuming Y happened - equivalently, that it is in the world set W. When doing so, it knows that it is almost certainly wrong - that it's more likely in a world outside of V entirely where neither X nor Y happened - but it still tries, on the off-chance that it's in W.

However, since it's an oracle, we turn it off before that point. Or we use corrigibility to change its motivations.
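The before/after shift described above can be made concrete with a small numerical sketch. This is my own toy model of the setup, with illustrative probabilities and utilities I've chosen (none of the numbers come from the thread): three world classes (outside V, in V but not W, in W), a hacked utility u' that is zero outside V, and an update on the evidence "message read".

```python
# Toy model of the u' = u|V setup. All numbers are illustrative assumptions,
# chosen only so that P(W) << P(V) and P(message read | W) is high.

priors = {            # P(world)
    "no_X": 0.90,     # outside V: no corruption happened
    "X_only": 0.0999, # V \ W: X but not the mysterious recovery Y
    "X_and_Y": 0.0001 # W: X and Y -- Y is very unlikely given X
}

p_read = {            # P(message read | world)
    "no_X": 0.99,     # no corruption: users read the message normally
    "X_only": 0.001,  # corrupted and not recovered: almost never read
    "X_and_Y": 0.99   # corrupted but mysteriously recovered and read
}

u_prime = {           # u'(action, world); zero outside V by construction
    ("Z", "no_X"): 0.0,      ("not_Z", "no_X"): 0.0,
    ("Z", "X_only"): -1.0,   ("not_Z", "X_only"): 0.0,
    ("Z", "X_and_Y"): 10.0,  ("not_Z", "X_and_Y"): 0.0,
}

def expected_u(dist):
    """Expected u' of each action under a probability distribution over worlds."""
    return {a: sum(dist[w] * u_prime[(a, w)] for w in priors)
            for a in ("Z", "not_Z")}

# Before any observation: P(Y|X) is tiny, so P(Y|X)*u(Z|X,Y) is swamped
# by the negative u(Z|X) term, and the AI does not do Z.
eu_before = expected_u(priors)

# After observing "message read": Bayes-update the distribution.
norm = sum(priors[w] * p_read[w] for w in priors)
posterior = {w: priors[w] * p_read[w] / norm for w in priors}

# Within V the posterior now concentrates on W, so Z becomes attractive --
# even though the AI knows it is most likely outside V entirely.
eu_after = expected_u(posterior)
```

With these numbers, `eu_before` ranks Z below not-Z, while `eu_after` ranks Z above it: the same utility function recommends different actions purely because of the observation, which is the shift the comment describes (and why the oracle is switched off before that point).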

Comment author: gjm 03 May 2016 05:46:06PM -1 points

Again, maybe I'm misunderstanding something -- but it sounds as if you're agreeing with me: once the AI observes evidence suggesting that its message has somehow been read, it will infer (or at least act as if it has inferred) Y and do Z.

I thought we were exploring a disagreement here; is there still one?

Comment author: Stuart_Armstrong 04 May 2016 09:21:43AM 0 points

I think there is no remaining disagreement - I just want to emphasise that before the AI observes such evidence, it will behave the way we want.