jimrandomh comments on What can you do with an Unfriendly AI? - Less Wrong

Post author: paulfchristiano 20 December 2010 08:28PM




Comment author: jimrandomh 21 December 2010 12:40:11AM

If all the stated assumptions about the AI's utility function hold, plus the additional assumption that we commit to run the algorithm to completion no matter what we predict the UFAIs will do, then no such coordination is possible. (I am not sure whether this additional assumption is necessary, but I'm adding it since no such prediction of the AI was included in the article and it closes off some tricky possibilities.)

The assumption was that each subgenie has lexicographic preferences, first preferring to not be punished, and then preferring to destroy the universe.

This is not a prisoner's dilemma. If you treat answering truthfully about the existence of a proof as defection, and lying so as to enable the destruction of our universe as cooperation, then not only does each subgenie prefer to defect, each subgenie also prefers (D,D) to (C,C). No subgenie will cooperate, because the other subgenies have nothing to offer it that would make up for the direct penalty for doing so.
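To make the lexicographic argument concrete, here is a minimal Python sketch of a two-subgenie version of the game. The payoff model is a hypothetical simplification, not taken from the post: a subgenie that lies is always punished, and the universe is destroyed only if every subgenie lies. Because Python compares tuples lexicographically, encoding each outcome as `(not_punished, universe_destroyed)` ranks avoiding punishment strictly above destruction, as the stated assumption requires. The names `outcome` and `best_response` are illustrative only.

```python
# Hypothetical two-subgenie model (an illustrative assumption, not from the post):
#   "D" = answer truthfully about whether a proof exists (defect)
#   "C" = lie so as to enable destruction of our universe (cooperate)
# Assumed rules: a lying subgenie is always punished, and the universe
# is destroyed only if every subgenie lies.

def outcome(own, other):
    """Return (not_punished, universe_destroyed) for one subgenie.

    Python compares tuples lexicographically, which models preferences
    that rank "not punished" strictly above "destroy the universe".
    """
    not_punished = 1 if own == "D" else 0
    destroyed = 1 if own == "C" and other == "C" else 0
    return (not_punished, destroyed)


def best_response(other):
    """Pick the move whose outcome tuple is lexicographically largest."""
    return max(["D", "C"], key=lambda own: outcome(own, other))


for other in ["D", "C"]:
    print(f"if the other plays {other}, best response is {best_response(other)}")

# Does each subgenie prefer mutual truthfulness to mutual lying?
print(outcome("D", "D") > outcome("C", "C"))
```

Under these assumptions, truthful answering is the best response whatever the other subgenie does, and (D,D) beats (C,C), so there is no mutually beneficial bargain for the subgenies to coordinate on.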

The real reason this doesn't work is that the assumptions about the genie's utility function are very unlikely to hold, even if we think they should. For example, taking over our universe might allow the genie to retroactively undo its punishment in a way we don't foresee, in which case it would prefer taking over the universe to avoiding our punishment.