# CarlShulman comments on A toy model of the control problem - Less Wrong

16 September 2015 02:59PM


Comment author: 17 September 2015 03:02:48AM 2 points

Of course, with this model it's a bit of a mystery why A gave B a reward function that gives 1 per block, instead of one that gives 1 for the first block and a penalty for additional blocks. Basically, why program B with a utility function so seriously out of whack with what you want when programming one perfectly aligned would have been easy?
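The two reward functions being contrasted can be written out directly. This is an illustrative sketch (the function names and the size of the penalty are my assumptions, not anything from the post):

```python
def reward_per_block(n_blocks: int) -> int:
    """1 point per block in the hole: maximized by cramming in as many as possible."""
    return n_blocks


def reward_first_block_only(n_blocks: int, penalty: int = 1) -> int:
    """1 for the first block, a penalty for each additional one: maximized at exactly one."""
    if n_blocks == 0:
        return 0
    return 1 - penalty * (n_blocks - 1)
```

Under the first function B's optimal policy is unbounded block-stuffing; under the second it is to place exactly one block, which is the behaviour A actually wanted.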

Comment author: 17 September 2015 05:42:58AM 7 points

It's a trade-off. The example is simple enough that the alignment problem is really easy to see, but that also means it is easy to shrug it off and say "duh, just use the obvious correct utility function for B".

Perhaps you could follow it up with an example with more complex mechanics (and/or a more complex goal for A) where the bad strategy for B is not so obvious. You could then invite the reader to contemplate the difficulty of the alignment problem as the complexity approaches that of the real world.

Comment author: 17 September 2015 06:34:54AM 5 points

Maybe the easiest way of generalising this is to program B to put 1 block in the hole, but, because B was trained in a noisy environment, it assigns only a 99.9% probability to a block being in the hole when it observes that it is. Then six blocks in the hole gives higher expected utility, and we get the same behaviour.
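A minimal sketch of this expected-utility calculation, under my reading of the model: B's reward is 1 if at least one block is truly in the hole, and each block B observes in the hole is really there with probability 0.999, independently:

```python
P_TRUE = 0.999  # probability an observed block is really in the hole (noisy sensors)


def expected_utility(n_blocks: int) -> float:
    """Probability that at least one of n observed blocks is really in the hole."""
    return 1 - (1 - P_TRUE) ** n_blocks


# Expected utility strictly increases with n, so B keeps adding blocks:
# one block gives 0.999, six blocks gives essentially 1.
assert expected_utility(6) > expected_utility(1)
```

Since every additional block shaves a little more off the residual 0.1% doubt and costs nothing under this reward, B's optimal policy is the same pathological block-stuffing as in the original toy model.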

Comment author: 17 September 2015 06:02:50PM 1 point

That still involves training it with no negative feedback error term for excess blocks (which would overwhelm a mere 0.1% uncertainty).

Comment author: 18 September 2015 12:01:22PM 1 point

This is supposed to be a toy model of excessive simplicity. Do you have suggestions for improving it (for purposes of presenting to others)?

Comment author: 18 September 2015 03:31:48PM 1 point

Maybe explain how it works while being configured, and then stops working once B gets a better model of the situation or runs more trial-and-error trials?

Comment author: 18 September 2015 03:56:55PM 0 points

Ok.

Comment author: 18 September 2015 07:57:18PM 2 points

I assume the point of the toy model is one of the following: to explore corrigibility or other mechanisms that are supposed to kick in after A and B end up not perfectly value-aligned; to show an example of why a non-value-aligning solution for A controlling B might not work; or to exhibit a specific case of a not-perfectly-value-aligned agent manipulating its controller.