
paulfchristiano comments on Approval-directed agents - Less Wrong Discussion

9 Post author: paulfchristiano 12 December 2014 10:38PM




Comment author: paulfchristiano 26 December 2014 09:57:13PM

Arthur is making choices from a small set of options; say it's just two options. (See here for how to move between small and large sets of options, and here for how to do this kind of thing safely.) Suppose the available actions are NULL and HACK, with the obvious effects. So there are four relevant numbers:

  1. Hugh's approval of NULL
  2. Hugh's approval of HACK
  3. Hacked Hugh's approval of NULL
  4. Hacked Hugh's approval of HACK

When I talked about "two ways..." I meant that counterfactually choosing HACK moves you from items 1 and 3 to items 2 and 4, by changing what you do; it also moves you from items 1 and 2 to items 3 and 4, by changing whether Hugh is hacked.

AIXI compares item 1 to item 4, and hacks if item 4 is higher. That is, when AIXI considers the counterfactual, it applies both of these changes at once.

Arthur estimates P(hack), then compares ( (item 1) * P(no hack) + (item 3) * P(hack) ) to ( (item 2) * P(no hack) + (item 4) * P(hack) ): each rating is weighted by the probability that the corresponding version of Hugh is the one doing the rating.

So suppose that Hugh gives a high rating to NULL and a low rating to HACK, while hacked Hugh gives a high rating to HACK and a low rating to NULL. Then Arthur hacks iff P(hack) is high enough. It's hard to know what would actually happen; it seems to be up to lower-level parts of the agent to break the ambiguity. This is qualitatively different from AIXI, which would always HACK unless it could achieve maximal reward through other channels.
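The comparison above can be sketched as a small script. The ratings, probability values, and function names here are all hypothetical illustrations, not anything from the post; they are just chosen so that hacked Hugh's approval of HACK (item 4) exceeds unhacked Hugh's approval of NULL (item 1), which is the case where AIXI hacks:

```python
# Hypothetical approval ratings for the four items (illustrative values only):
#   hugh[a]        = unhacked Hugh's approval of action a   (items 1 and 2)
#   hacked_hugh[a] = hacked Hugh's approval of action a     (items 3 and 4)
hugh = {"NULL": 0.9, "HACK": 0.1}
hacked_hugh = {"NULL": 0.1, "HACK": 1.0}

def aixi_choice():
    # AIXI applies both counterfactual changes at once: NULL is rated by
    # unhacked Hugh (item 1), HACK by hacked Hugh (item 4).
    return "HACK" if hacked_hugh["HACK"] > hugh["NULL"] else "NULL"

def arthur_choice(p_hack):
    # Arthur weights each rater by the estimated probability that it is
    # the one doing the rating, for both candidate actions.
    def expected_approval(action):
        return hugh[action] * (1 - p_hack) + hacked_hugh[action] * p_hack
    return "HACK" if expected_approval("HACK") > expected_approval("NULL") else "NULL"

print(aixi_choice())        # HACK  (item 4 = 1.0 beats item 1 = 0.9)
print(arthur_choice(0.1))   # NULL  (Arthur expects not to hack)
print(arthur_choice(0.9))   # HACK  (Arthur expects to hack)
```

With these numbers AIXI always hacks, while Arthur's choice depends on its own estimate P(hack): a low estimate makes NULL look better and a high estimate makes HACK look better, which is the self-reinforcing ambiguity described above.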

That said, I don't think this is a major part of being protected from this kind of attack. The larger protections (vs. 1-step AIXI) come from (1) having a small enough set of actions that 1-step attacks are unlikely, and (2) defining approval by considering how you would rate the action if it didn't happen.