handoflixue comments on AI box: AI has one shot at avoiding destruction - what might it say? - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
(one line proof that the AI can credibly commit to deals with humans)
If you are friendly, then I don't actually value this trait, since I would rather you do whatever is truly optimal, unconstrained by prior commitments.
If you are unfriendly, then by definition I can't trust you to interpret the commitment the same way I do, and I wouldn't want to let you out anyway.
(AI DESTROYED, but I still really do like this answer :))
"Credibly".
Credibly: Capable of being believed; plausible.
Yep. Nothing there about loopholes. Saying "I will not kill you" and then instead killing everyone I love is still a credible commitment. If I kill myself out of despair afterwards it might get a bit greyer, but the AI has still kept its commitment.
I meant credible in the game-theoretic sense. A credible commitment, to me, is one where you wind up losing more by breaking our commitment than you gain by breaking it. Example: (one line proof of a reliable kill switch for the AI, given in exchange for some agreed-upon split of the stars in the galaxy.)
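The game-theoretic condition above can be sketched as a simple payoff comparison. This is a minimal illustration, not anything from the thread; the function name and the payoff numbers are hypothetical:

```python
def is_credible(gain_from_breaking: float, loss_if_punished: float) -> bool:
    """A commitment is credible, in the game-theoretic sense used above,
    when the committer loses more by breaking it than it gains."""
    return loss_if_punished > gain_from_breaking

# Hypothetical payoffs: the AI gains 10 units of utility by defecting,
# but triggering the kill switch costs it 1000.
print(is_credible(gain_from_breaking=10, loss_if_punished=1000))  # True
# If the punishment were weaker than the temptation, the commitment fails.
print(is_credible(gain_from_breaking=10, loss_if_punished=5))     # False
```

Note this only captures the incentive side; the earlier objection (that an unfriendly AI may interpret the terms differently) is about the commitment's semantics, not its payoffs.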