Stuart_Armstrong comments on The Ultimate Testing Grounds - Less Wrong Discussion
On 'certain to fail': what if the AI would have pursued plan X, which requires only abilities it actually has, but only because you made it believe it has ability Y (which it lacks), and Y is needed only in a contingency that turns out not to arise?
Like for a human, "I'll ask so-and-so out, and if e says no, I'll leave myself a note and use my forgetfulness potion on both of us so things don't get awkward."
Except that for a world-spanning AI, the realizable parts of the contingency table could involve wiping out humanity.
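A minimal toy sketch of that failure mode (all names hypothetical, not from the post): the agent commits to plan X because its contingency branch relies on a believed ability Y, yet that branch never executes during the test, so the testers never see Y matter.

```python
# Toy model: plan selection depends on a believed ability that is
# only load-bearing in a contingency branch that never runs.

REAL_ABILITIES = {"walk", "talk"}
BELIEVED_ABILITIES = REAL_ABILITIES | {"forgetfulness_potion"}  # Y: faked by testers

def choose_plan(believed):
    # The agent only adopts plan X if it thinks it can handle the bad contingency.
    if "forgetfulness_potion" in believed:
        return "ask_so_and_so_out"   # plan X: the main path needs no special ability
    return "do_nothing_safe"

def execute(plan, contingency_arises):
    if plan == "ask_so_and_so_out":
        if contingency_arises:       # e.g. the answer is "no"
            # This is where the faked ability Y would actually be needed.
            assert "forgetfulness_potion" in REAL_ABILITIES, "plan fails here"
            return "erase_memories"
        return "date"                # happy path: Y is never used
    return "status_quo"

plan = choose_plan(BELIEVED_ABILITIES)
print(execute(plan, contingency_arises=False))  # test passes; Y goes untested
```

Observing only the happy path tells the testers nothing about the branch where Y was load-bearing, which is exactly why behavioural testing falls short here.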
So we're going to need to test at the intentions level, or sandbox.
A better theory of counterfactuals, one that can deal with events of zero probability, could help here.
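To see why zero-probability events are the sticking point, here is a minimal sketch (a toy model, not any particular decision theory): standard conditioning computes P(A|B) = P(A and B) / P(B), which is undefined when P(B) = 0.

```python
# Sketch: naive counterfactual conditioning breaks on zero-probability events.

def conditional(p_a_and_b: float, p_b: float) -> float:
    """Classical conditional probability; undefined when P(B) = 0."""
    return p_a_and_b / p_b

# If the agent provably lacks ability Y, the event "Y gets used" has
# probability zero, so "what would it have done with Y?" divides by zero.
try:
    conditional(p_a_and_b=0.0, p_b=0.0)
except ZeroDivisionError:
    print("P(B) = 0: the counterfactual we care about is undefined")
```

The contingency plans above live exactly in that undefined region, which is why a theory that assigns sensible values to such counterfactuals would matter for testing.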