In EY's talk AI Alignment: Why It's Hard, and Where to Start, he describes alignment problems with the toy example of the utility function {1 if cauldron full, 0 otherwise} and its vulnerabilities, and attempts at making it safer by adding so-called Impact Penalties. He talks through (timestamp 18:10) one such possible penalty, the Euclidean Distance penalty, and the various flaws it leaves open.
That penalty function does seem quite vulnerable to unwanted behaviors. But what about a more physical one, such as a penalty for additional-energy-consumption...
Good idea. I have two objections: one funny-but-interesting, and one more fatal.
The funny objection is that if the penalty is enough to stop the AI from doing bad things, it's also enough to stop the AI from doing anything at all except rushing to turn off the stars and forestall entropy production in the universe.
So you want to say that producing lots of extra entropy (or equivalently, using lots of extra free energy) is bad, but making there be less entropy than "what would happen if you did nothing" doesn't earn you bonus points. I've put "what would happen if you did nothing" in scare quotes because the notion we want to point to is trickier than it might seem - logical counterfactuals are an unsolved problem, or rather a problem where it seems like the solution involves making subjective choices that match up with humans'.
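To make that concrete (the notation here is mine, not from the talk or the parent comment): writing $\Delta S(\pi)$ for the entropy produced if the AI follows policy $\pi$, and $\pi_\varnothing$ for the null "do nothing" policy, the penalized utility you'd want looks roughly like

$$ U(\pi) = \mathbb{1}[\text{cauldron full}] \;-\; \lambda \cdot \max\!\big(0,\; \Delta S(\pi) - \Delta S(\pi_\varnothing)\big), $$

where the $\max(0,\cdot)$ is doing the "no bonus points for reducing entropy" work, and $\Delta S(\pi_\varnothing)$ is exactly the "what would happen if you did nothing" baseline that hides the counterfactual problem.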
The more fatal objection is that there are lots of policies that don't increase entropy much but totally rearrange the universe - sufficiently slow, careful rearrangements can be made nearly thermodynamically reversible. So this penalty is going to have trouble preventing the AI from breaking things that matter a lot to you.
Many of these policies take advantage of the fact that there's a bunch of entropy being created all the time, which allows for "entropy offsets": the AI can pay for the entropy its plan produces by preventing entropy that would have been produced anyway. So perhaps you might try to patch this by putting in some notion of which entropy production is "my fault" and which is not - where a first pass might say that if something "would happen" even if I took the null action, then it's not my fault. (Scare quotes because "things that happen" are not ontologically basic parts of the world; you need an abstract model to make that comparison within.)
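In the same made-up notation: the offset loophole exists because the naive penalty only sees the net difference $\Delta S(\pi) - \Delta S(\pi_\varnothing)$, so extra entropy here can cancel against prevented entropy there. The patch would be something like summing only over effects attributed to the agent,

$$ \text{penalty}(\pi) = \lambda \sum_{e \,\in\, \text{effects}(\pi) \,\setminus\, \text{effects}(\pi_\varnothing)} \max\!\big(0,\; \Delta S(e)\big), $$

where "effects" is exactly the ill-defined, model-dependent abstraction that the "my fault" notion has to smuggle in.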
At this point we could keep going deeper, or I could appeal to the general pattern that patching things in this sort of way tends to break - you're still in some sense building an AI that runs a search for vulnerabilities you forgot to patch, and you should not build that AI.