paulfchristiano comments on Cryptographic Boxes for Unfriendly AI - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
The probability I assign to achieving a capability state where it is (1) possible to prove a mind Friendly even if it has been constructed by a hostile superintelligence, (2) possible to build a hostile superintelligence, and (3) not possible to build a Friendly AI directly, is very low.
In particular, the sort of proof techniques I currently have in mind - what they prove and what it means - for ensuring Friendliness through a billion self-modifications of something that started out relatively stupid and was built by relatively trusted humans, would not work for verifying the Friendliness of a finished AI that was handed to you by a hostile superintelligence; it seems to me that the proof techniques required for that would have to be considerably stronger.
To paraphrase Mayor Daley, the proof techniques are there to preserve the Friendly intent of the programmers through the process of constructing the AI and through the AI's self-modification, not to create Friendliness. People hear the word "prove" and assume that this is because you don't trust the programmers, or because you have a psychological need for unobtainable absolute certainty. No, it's because if you don't prove certain things (and have the AI prove certain things before each self-modification) then you can't build a Friendly AI no matter how good your intentions are. The good intentions of the programmers are still necessary, and assumed, beyond the parts that are proved; and doing the proof work doesn't make the whole process absolutely certain, but if you don't strengthen certain parts of the process using logical proof then you are guaranteed to fail. (This failure is knowable to a competent AGI scientist - not with absolute certainty, but with high probability - and therefore it is something of which a number of would-be dabblers in AGI maintain a careful ignorance regardless of how you try to explain it to them, because the techniques that make them enthusiastic don't support that sort of proof. "It is hard to explain something to someone whose job depends on not understanding it.")
The probability I assign to being able to build a Friendly AI directly before being able to build a hostile AI is very low. You have thought more about the problem, but I'm not really convinced. Of course, we could both be right simultaneously, and then we would be in trouble.
I will say that I think you underestimate how powerful allowing a superintelligence to write a proof for you is. The question is not really whether you have proof techniques to verify friendliness. It is whether you have a formal language expressive enough to describe friendliness in which a transhuman can find a proof. Maybe that is just as hard as the original problem, because even formally articulating friendliness is incredibly difficult.
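The specification difficulty raised above can be made concrete with a toy example (this sketch and its names are illustrative, not from the thread): a verifier checks only the property you actually wrote down, so a powerful adversarial "prover" can satisfy an underspecified formal statement while violating your intent. Here, a hypothetical sorting spec that forgets the permutation requirement happily accepts a degenerate output.

```python
# Toy illustration of a specification gap (hypothetical example).
# Intended property: output is a sorted permutation of the input.
# Written spec (incomplete): output is sorted and has the same length.

def meets_written_spec(inp, out):
    """Check only what the written spec literally says."""
    return (len(out) == len(inp)
            and all(out[i] <= out[i + 1] for i in range(len(out) - 1)))

def meets_intent(inp, out):
    """The property we actually wanted: a sorted permutation of the input."""
    return sorted(inp) == list(out)

inp = [3, 1, 2]
adversarial_out = [0, 0, 0]  # satisfies the letter of the spec, ignores the input

assert meets_written_spec(inp, adversarial_out)   # verification passes...
assert not meets_intent(inp, adversarial_out)     # ...but intent is violated
assert meets_written_spec(inp, sorted(inp)) and meets_intent(inp, sorted(inp))
```

The verifier is sound with respect to the written spec; the failure is entirely in the gap between the spec and the intent, which is exactly where "formally articulating friendliness" has to do its work.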
Usually, verifying a proof is considerably easier than finding one - and if you are looking for a proof, it doesn't seem at all unreasonable to use a machine to find it.
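The asymmetry this comment leans on can be sketched with propositional satisfiability (a sketch under assumptions, not anything specified in the thread): finding a satisfying assignment may in the worst case require searching all 2^n assignments, while checking a candidate assignment - the "proof" - is a single linear pass over the formula.

```python
from itertools import product

# CNF formula as a list of clauses; a positive int i is variable i,
# a negative int is its negation. This one is (x1 or x2) and
# (not x1 or x3) and (not x2 or not x3).
cnf = [[1, 2], [-1, 3], [-2, -3]]

def verify(cnf, assignment):
    """Cheap check: every clause contains at least one true literal.
    Runs in time linear in the size of the formula."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in cnf)

def find(cnf, n_vars):
    """Expensive search: brute-force all 2^n assignments (worst-case
    exponential), using the cheap verifier to recognize a solution."""
    for bits in product([False, True], repeat=n_vars):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        if verify(cnf, assignment):
            return assignment
    return None

model = find(cnf, 3)
assert model is not None and verify(cnf, model)
```

Trusting the untrusted searcher is safe here precisely because soundness lives in `verify`, not in `find` - which is the shape of the argument for letting a superintelligence hunt for proofs you then check.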