Eliezer_Yudkowsky comments on Cryptographic Boxes for Unfriendly AI - Less Wrong

24 Post author: paulfchristiano 18 December 2010 08:28AM

Comments (155)

Comment author: Eliezer_Yudkowsky 18 December 2010 02:44:08PM * 19 points

a certifiably friendly AI: a class of optimization processes whose behavior we can automatically verify will be friendly

The probability I assign to achieving a capability state where it is (1) possible to prove a mind Friendly even if it has been constructed by a hostile superintelligence, (2) possible to build a hostile superintelligence, and (3) not possible to build a Friendly AI directly, is very low.

In particular, the sort of proof techniques I currently have in mind (what they prove and what it means) for ensuring Friendliness through a billion self-modifications of something that started out relatively stupid and was built by relatively trusted humans would not work for verifying the Friendliness of a finished AI handed to you by a hostile superintelligence. It seems to me that the proof techniques required for that would have to be considerably stronger.

To paraphrase Mayor Daley, the proof techniques are there to preserve the Friendly intent of the programmers through the process of constructing the AI and through the AI's self-modification, not to create Friendliness. People hear the word "prove" and assume that this is because you don't trust the programmers, or because you have a psychological need for unobtainable absolute certainty. No, it's because if you don't prove certain things (and have the AI prove certain things before each self-modification) then you can't build a Friendly AI no matter how good your intentions are. The good intentions of the programmers are still necessary, and assumed, beyond the parts that are proved; and doing the proof work doesn't make the whole process absolutely certain, but if you don't strengthen certain parts of the process using logical proof then you are guaranteed to fail. (This failure is knowable to a competent AGI scientist - not with absolute certainty, but with high probability - and therefore it is something of which a number of would-be dabblers in AGI maintain a careful ignorance regardless of how you try to explain it to them, because the techniques that make them enthusiastic don't support that sort of proof. "It is hard to explain something to someone whose job depends on not understanding it.")

Comment author: [deleted] 18 December 2010 03:16:38PM 6 points

The probability I assign to achieving a capability state where it is (1) possible to prove a mind Friendly even if it has been constructed by a hostile superintelligence, (2) possible to build a hostile superintelligence, and (3) not possible to build a Friendly AI directly, is very low.

A general theory of quarantines would nevertheless be useful.

Comment author: Eliezer_Yudkowsky 18 December 2010 04:11:09PM 2 points

For what?

Comment author: PeterS 18 December 2010 08:48:34PM 4 points

The OP framed the scenario in terms of directing the AI to design a FAI, but the technique is more general. It's possibly safe for all problems with a verifiable solution.
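As a rough illustration of what "problems with a verifiable solution" could mean in practice, here is a minimal Python sketch of a release gate: an untrusted solver proposes an answer, and only an answer that passes a small trusted check leaves the quarantine. Everything in it is an assumption made for illustration. The names `verify_factorization` and `quarantined_solve` are hypothetical, integer factoring is just a convenient example of a cheaply checkable problem, and an ordinary function call provides no real isolation. The sketch shows only the gating logic, not the cryptographic box from the post.

```python
from typing import Callable, Optional, Tuple

Factors = Tuple[int, int]


def verify_factorization(n: int, candidate: object) -> bool:
    """Trusted check: candidate must be a pair of nontrivial integer factors of n."""
    if not (isinstance(candidate, tuple) and len(candidate) == 2):
        return False
    p, q = candidate
    # Require plain ints; this also rejects bools, floats, and strings.
    if type(p) is not int or type(q) is not int:
        return False
    return 1 < p < n and 1 < q < n and p * q == n


def quarantined_solve(
    n: int, untrusted_solver: Callable[[int], object]
) -> Optional[Factors]:
    """Release a verified answer, or None. Nothing else leaves the gate.

    ASSUMPTION: in a real quarantine the solver would run in an isolated
    environment; calling it directly here is for illustration only.
    """
    try:
        candidate = untrusted_solver(n)
    except Exception:
        return None
    if not verify_factorization(n, candidate):
        return None
    p, q = candidate
    # Return the factors in a fixed order so the solver cannot encode
    # extra information in the ordering of an otherwise valid answer.
    return (min(p, q), max(p, q))
```

For example, `quarantined_solve(15, lambda n: (5, 3))` returns `(3, 5)`, while a solver that returns a persuasive string instead of factors gets `None`. The design choice worth noting is that the gate's output space is fixed in advance, so the solver's only influence on the outside world is which valid answer, if any, gets released.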

Comment author: wedrifid 20 December 2010 05:59:32AM * 2 points

For what?

People I don't trust but don't want to kill (or modify to cripple). A non-compliant transhuman with self-modification ability may not be able to out-compete an FAI, but if it is not quarantined it could force the FAI to burn resources to maintain dominance.

But it is something we can let the FAI build for us.

Comment deleted 20 December 2010 10:22:57AM

Comment author: wedrifid 20 December 2010 11:14:24AM 0 points

Shrug. For the purposes here they could be called froogles for all I care. The quarantine could occur in either stage depending on the preferences being implemented.