Eliezer_Yudkowsky comments on A toy model of the control problem - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (24)
Of course, with this model it's a bit of a mystery why A gave B a reward function that gives 1 per block, instead of one that gives 1 for the first block and a penalty for additional blocks. Basically, why program B with a utility function so seriously out of whack with what you want when programming one perfectly aligned would have been easy?
I assume the point of the toy model is to explore corrigibility or other mechanisms that are supposed to kick in after A and B end up not perfectly value-aligned, or maybe just to show an example of why a non-value-aligning solution for A controlling B might not work, or maybe specifically to exhibit a case of a not-perfectly-value-aligned agent manipulating its controller.