D_Alex comments on Cryptographic Boxes for Unfriendly AI - Less Wrong

Post author: paulfchristiano | 18 December 2010 08:28AM | 24 points


Comment author: Eliezer_Yudkowsky | 20 December 2010 06:17:52AM | 3 points

> > B is false.
>
> Does this still hold if you remove the word "hostile", i.e. if the "friendliness" of the superintelligence you construct first is simply not known?

Heh. I'm afraid AIs of "unknown" motivations are known to be hostile from a human perspective. See Omohundro on the Basic AI Drives, and the Fragility of Value supersequence on LW.

Comment author: D_Alex | 21 December 2010 03:17:42AM | 4 points

> B is false.

This is... unexpected. Mathematical theory, not to mention history, seems to be on my side here. How did you arrive at this conclusion?
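
For concreteness, the mathematical asymmetry being appealed to is the standard one between checking an answer and finding it. A minimal Python sketch, with integer factoring as a stand-in problem (the function names and numbers are purely illustrative):

```python
# Illustrative only: checking a claimed factorization is trivial,
# while producing one from scratch is the hard direction.

def verify_factorization(n: int, p: int, q: int) -> bool:
    """Verification: one multiplication and two sanity checks."""
    return p > 1 and q > 1 and p * q == n

def produce_factorization(n: int):
    """Production: trial division, exponential in the bit-length of n."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return None  # n is prime

assert verify_factorization(15, 3, 5)       # cheap to check
assert produce_factorization(15) == (3, 5)  # expensive to find, in general
```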

> I'm afraid AIs of "unknown" motivations are known to be hostile from a human perspective.

Point taken. I was thinking of "unknown" in the sense of "designed with the aim of friendliness, but not yet proven to be friendly", but worded it badly.

Comment author: paulfchristiano | 21 December 2010 03:37:41AM | 2 points

The point is that it may be possible to design a heuristically friendly AI which, if friendly, will remain just as friendly after modifying itself, without having any infallible way to recognize a friendly AI. In particular, it's bad if your screening has any false positives at all, since you have a transhuman intelligence searching for pathologically bad cases.
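
To see why even a tiny false-positive rate is fatal once an optimizer is hunting for it: if the screen wrongly passes a bad design with independent probability p, a search over N candidates slips at least one through with probability 1 - (1 - p)^N. A back-of-the-envelope sketch in Python (both numbers are hypothetical):

```python
import math

# Hypothetical numbers, purely for illustration.
p = 1e-9      # assumed per-candidate false-positive rate of the screen
N = 10**12    # assumed number of candidate designs the search can try

# P(at least one bad design passes) = 1 - (1 - p)^N, computed stably via logs.
prob_breach = -math.expm1(N * math.log1p(-p))
print(f"P(search finds a false positive) = {prob_breach:.6f}")  # -> 1.000000
```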

Comment author: timtyler | 21 December 2010 06:46:17PM | 2 points

To recap, (b) was: 'Verification of "proof of friendliness" is easier than its production'.

For that to work as a plan in context, the verification doesn't have to be infallible. It just needs to have no false positives. False negatives are fine: if a good machine is rejected, that isn't the end of the world.
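
A minimal sketch of a verifier with exactly that error profile, using SAT as a stand-in problem: it accepts only when handed a complete satisfying assignment as a certificate, so false positives are impossible by construction, while anything that arrives without a valid certificate is rejected (a tolerable false negative):

```python
# A conservative verifier: accepts only complete, checkable certificates.
# A formula is a list of clauses; a clause is a list of literals;
# literal +i means variable i, -i means its negation.
Formula = list[list[int]]

def verify(formula: Formula, certificate: dict[int, bool]) -> bool:
    """Linear-time check that the certificate satisfies every clause.
    Rejects anything that doesn't: no false positives by construction."""
    return all(
        any(certificate.get(abs(lit), False) == (lit > 0) for lit in clause)
        for clause in formula
    )

# (x1 OR NOT x2) AND (x2 OR x3)
formula = [[1, -2], [2, 3]]
assert verify(formula, {1: True, 2: False, 3: True})        # valid certificate: accept
assert not verify(formula, {1: False, 2: False, 3: False})  # not satisfying: reject
```

Finding such a certificate is NP-hard in general; checking one, as above, takes time linear in the size of the formula. That is the sense in which verification can be strictly easier than production.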