A much stronger argument than all-powerful AIs suddenly escaping (which is still not without merit) is that AI will have an incentive to behave as we expect it to behave, until at some point we no longer control it. It'll try its best to pass all tests.
So: while it believes it is under evaluation it does its very best to behave itself?
Can we wire that belief in as a prior with p=1.0?
This thread is for the discussion of Less Wrong topics that have not appeared in recent posts. If a discussion gets unwieldy, celebrate by turning it into a top-level post.