paulfchristiano comments on What can you do with an Unfriendly AI? - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Suppose you can verify that an AI obeys a simple utility function ('press a button'), but can't verify if it obeys a complex one ('implement CEV of humanity').
Suppose also that you can formally describe the complex function and run it (i.e. output the utility value of a proposed action or scenario), you just can't verify that an AI follows this function. In other words, you can build an AI that follows this complex function, but you don't know what will happen when it self-modifies.
You might proceed as follows. Run two copies of the AI. Copy 1 has the complex utility function, and is not allowed to self-modify (it might therefore be quite slow and limited). Copy 2 has the following simple utility function: "ask AI copy 1 for permission for all future actions. Never modify AI copy 1's behavior." Copy 2 may self-modify and foom.
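The two-copy scheme above can be sketched in a few lines. This is only an illustrative toy, not anything from the original comment: the class names, the threshold rule, and the `execute` callback are all hypothetical choices made here to make the permission-gating pattern concrete.

```python
class ComplexUtilityOracle:
    """Copy 1: runs the complex utility function but never self-modifies.
    Approves a proposed action iff its utility clears a fixed threshold."""

    def __init__(self, utility_fn, threshold=0.0):
        self.utility_fn = utility_fn
        self.threshold = threshold

    def approves(self, action):
        # We can *run* the complex function on any proposed action,
        # even though we can't verify an AI maximizes it.
        return self.utility_fn(action) > self.threshold


class GatedAgent:
    """Copy 2: may self-modify internally, but its only route to acting
    on the world is asking copy 1 for permission first."""

    def __init__(self, oracle):
        # Copy 2's simple, verifiable rule: never modify the oracle,
        # never act without its approval.
        self.oracle = oracle

    def act(self, proposed_action, execute):
        if self.oracle.approves(proposed_action):
            return execute(proposed_action)
        return None  # vetoed: the action is never executed


# Toy usage: the "complex" utility here is just a stand-in score.
oracle = ComplexUtilityOracle(utility_fn=lambda a: a["benefit"] - a["harm"])
agent = GatedAgent(oracle)

executed = []
agent.act({"benefit": 3, "harm": 1}, execute=executed.append)  # approved
agent.act({"benefit": 0, "harm": 5}, execute=executed.append)  # vetoed
print(len(executed))  # 1
```

The point of the structure is that the property we'd need to verify about copy 2 under self-modification is only the simple one, "route every action through `oracle.approves` and leave the oracle untouched", while all the complexity lives in copy 1, which never self-modifies.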
This puts some limits on the kinds of self-modification that copy 2 will be able to do. It won't be able to take clever shortcuts that preserve its utility function in non-obvious ways. But it will be able to find a superintelligent way of doing something, if that something is sufficiently self-contained that it can describe it to copy 1.
Of course, if AI copy 2 morphs into something truly adversarial, then copy 1's utility function had better be foolproof. But this is something that has to be solved anyway to implement the seed AI's utility function in a fooming FAI scenario.
So I think verifying simple utility functions under self-modification is indeed 90% of the way towards verifying complex ones.
I think that much of the difficulty with friendliness is that you can't write down a simple utility function such that maximizing that utility is friendly. By "complex goal" I mean one which is sufficiently complex that articulating it precisely is out of our league.
I do believe that any two utility functions you can write down precisely should be basically equivalent in terms of how hard it is to verify that an AI follows them.