anotheruser comments on On the fragility of values - Less Wrong

Post author: Stuart_Armstrong 04 November 2011 06:15PM




Comment author: anotheruser 05 November 2011 09:18:38AM 0 points

Of course it won't be easy. But if the AI doesn't understand that question, you already have confirmation that this thing should definitely not be released. An AI can only be safe for humans if it understands human psychology. Otherwise it is bound to treat us as black boxes, and that can only have horrible results, regardless of how sophisticated you think you made its utility function.

I agree that the question doesn't actually make a lot of sense to humans, but that shouldn't stop an intelligent entity from trying to make the best of it. When you are given an impossible task, you don't despair; you make a compromise and try to fulfill the task as best you can. When humans found out that entropy always increases and humanity will die out someday, no matter what, we didn't despair either, even though evolution has made it so that we desire to have offspring, and for that offspring to do the same, indefinitely.

Comment author: JoshuaZ 05 November 2011 04:37:47PM 1 point

But if the AI doesn't understand that question you already have confirmation that this thing should definitely not be released.

How likely is it that we'll be able to see that it doesn't understand as opposed to it reporting that it understands when it really doesn't?

Comment author: anotheruser 05 November 2011 05:12:28PM 0 points

You will obviously have to test its understanding of psychology with some simple examples first.

Comment author: lessdazed 06 November 2011 02:08:08PM 0 points

Comment author: anotheruser 06 November 2011 07:14:22PM 0 points

Are you really trying to tell me that you think researchers would be unable to take that into account when trying to figure out whether or not an AI understands psychology?

Of course you will have to try to find problems where the AI can't predict how humans would feel. That is the whole point of testing, after all. Suggesting that someone in a position to teach psychology to an AI would make such a basic mistake is frankly insulting.

I probably shouldn't have said "simple examples". What you should actually test are examples of gradually increasing difficulty to find the ceiling of human understanding the AI possesses. You will also have to look for contingencies or abnormal cases that the AI probably wouldn't learn about otherwise.

The main idea is simply that an understanding of human psychology is both teachable and testable. How exactly this could be done is a bridge we can cross when we come to it.

Comment author: lessdazed 06 November 2011 07:58:34PM 0 points

I think you really, really want a proof rather than a test. One can only test a few things, and agreement on all of those is not too informative. I should have included this link, which is several times as important as the previous one, and they combine to make my point.

Comment author: anotheruser 06 November 2011 10:22:44PM 0 points

I never claimed that a strict proof is possible, but I do believe that you can become reasonably certain that an AI understands human psychology.

Give the thing a college education in psychology, ethics, and philosophy. Ask its opinion on famous philosophical problems. Show it video clips or abstract scenarios about everyday life and ask why it thinks the people did what they did. Then ask what it would have done in the same situation; if it says it would act differently, ask it why, and what it thinks the difference in motivation is between it and the human.

Finally, give it every story ever written about malevolent AIs or paperclip maximizers to read, and ask it to comment on them.

Let it write a 1000 page thesis on the dangers of AI.

If you do all that, you are bound to find any significant misunderstanding.