JamesAndrix comments on Cryptographic Boxes for Unfriendly AI - Less Wrong

Post author: paulfchristiano, 18 December 2010 08:28AM




Comment author: Perplexed 18 December 2010 04:18:40PM 9 points

the sort of proof techniques I currently have in mind ... would not work for verifying Friendliness of a finished AI that was handed you by a hostile superintelligence.

But what if the hostile superintelligence handed you a finished AI together with a purported proof of its Friendliness? Would you have enough trust in the soundness of your proof system to check the purported proof and act on the results of that check?

Comment author: JamesAndrix 19 December 2010 05:44:14PM 1 point

That would then be something you'd have to read, and likely show to dozens of other people, in order to verify reliably, leaving opportunities for all kinds of mind hacks. The OP's proposal requires us to have an automatic verifier ready to run, one that can return a result reliably without human intervention.

Comment author: jsalvatier 19 December 2010 08:54:38PM 5 points

Actually, computers can mechanically check proofs in any formal system.
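To make jsalvatier's point concrete, here is a minimal sketch of a mechanical proof checker. The proof format (axiom lines plus modus ponens) is a standard Hilbert-style setup, not anything specified in the thread; the data representation is invented for illustration. The key property is that the checker accepts or rejects a proof with no human judgment involved.

```python
# Minimal mechanical proof checker (illustrative format): a "proof" is a
# list of steps, each either an axiom instance or an application of modus
# ponens to two earlier lines. Formulas are nested tuples.

def check_proof(axioms, proof):
    """Return True iff every line is an axiom or follows by modus ponens."""
    derived = []
    for step in proof:
        kind, formula = step[0], step[1]
        if kind == "axiom":
            if formula not in axioms:
                return False
        elif kind == "mp":
            i, j = step[2], step[3]  # indices of earlier proof lines
            if i >= len(derived) or j >= len(derived):
                return False
            # Line j must be an implication whose antecedent is line i
            # and whose consequent is this line's formula.
            if derived[j] != ("implies", derived[i], formula):
                return False
        else:
            return False  # unknown rule: reject
        derived.append(formula)
    return bool(derived)

# Example: from axioms P and P -> Q, mechanically verify a derivation of Q.
axioms = [("P",), ("implies", ("P",), ("Q",))]
proof = [
    ("axiom", ("P",)),
    ("axiom", ("implies", ("P",), ("Q",))),
    ("mp", ("Q",), 0, 1),
]
print(check_proof(axioms, proof))  # True
```

Note that checking a given proof is easy even when finding one is hard; that asymmetry is what the boxing proposal relies on.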

Comment author: wedrifid 20 December 2010 05:56:30AM 0 points

Is there something missing from the parent? It does not seem to parse.

Comment author: jsalvatier 20 December 2010 06:54:48AM 0 points

Yes, edited. Thanks.

Comment author: wedrifid 20 December 2010 07:04:32AM 0 points

And upvoted. :)

Comment author: Pfft 19 December 2010 08:55:05PM 4 points

Yes, but the point is that the automatic verifier gets to check a proof that the AI-in-the-box produced -- it doesn't have to examine an arbitrary program and try to prove friendliness from scratch.

In a comment below, paulfchristiano makes the point that any theory of friendliness at all should give us such a proof system, for some restricted class of programs. For example, Eliezer envisions a theory of how to let programs evolve without losing friendliness. The corresponding proofs would have the form "the program under consideration can be derived from the known-friendly program X by the sequence Y of friendliness-preserving transformations".
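This restricted proof class can be sketched mechanically too. In the hypothetical verifier below, the certificate is just the known-friendly starting program X plus the sequence Y of transformation names; the verifier replays each step, accepting only transformations from a fixed whitelist. The transformations and program text here are invented placeholders, not anything from a real Friendliness theory; the point is only the shape of the check.

```python
# Hypothetical verifier for the restricted proof class: replay a chain of
# whitelisted (assumed friendliness-preserving) transformations from a
# known-friendly program X and see whether we land on the claimed program.
# The whitelist entries are toy string rewrites, purely for illustration.

WHITELIST = {
    "inline_constant": lambda prog: prog.replace("TWO", "2"),
    "rename_temp":     lambda prog: prog.replace("tmp", "t0"),
}

def verify_derivation(known_friendly, steps, claimed_program):
    """Accept iff replaying the whitelisted steps from X yields the claim."""
    current = known_friendly
    for name in steps:
        if name not in WHITELIST:  # unknown transformation: reject
            return False
        current = WHITELIST[name](current)
    return current == claimed_program

x = "tmp = TWO + TWO"                      # known-friendly program X
y = ["inline_constant", "rename_temp"]     # the purported derivation Y
print(verify_derivation(x, y, "t0 = 2 + 2"))  # True
```

As with proof checking, the verifier's job is cheap replay: the hard creative work of finding the sequence Y is left to the boxed AI, and an unverifiable step simply causes rejection.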