drethelin comments on The Need for Human Friendliness - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (28)
I think you're assigning too little weight to "provably friendly", but this is pretty funny.
How much would you be willing to wager that you will be able to follow the proof of friendliness for the specific AI that gets implemented?
Very little. I don't like my odds. If Eliezer has provable friendliness theorems but no AI, it's in his and everyone's interest to distribute the generalized theorem as widely as possible, so that anyone working on recursive AGI has a chance to make it friendly — which means the algorithms will be checked by many people, publicly. If Eliezer has both the theorems and an AI ready to implement, there's nothing I can do about it at all. So why worry?
I usually assume "provably friendly" means "will provably optimise for complex human-like values correctly" and thus includes both actual humanity-wide values and one person's values (and the two options can plausibly be switched between at a late stage of the design process).
And, well, I meant for it to be a little funny, so I'll take that as a win!
Friendly means something like "will optimize for the appropriate complex human-like values correctly."
Saying "we don't have clear criteria for appropriate human values" is just another way of saying that defining Friendly is hard.
Provably Friendly means we have a mathematical proof that an AI will be Friendly before we start running the AI.
An AI that gives its designer ultimate power over humanity is almost certainly not Friendly, even if it were Provably designer-godlike-powers-implementing.
How do you define "appropriate"? It seems a little circular: a Friendly AI is one that optimises for appropriate values, and appropriate values are the ones for which we'd want a Friendly AI to optimise.
You might say that "appropriate" values are ones which "we" would like to see the future optimised towards, but I think whether these even exist humanity-wide is an open question (and I'm leaning towards "no"), in which case you should probably have a contingency definition for what to do if they, in fact, do not.
I would also be shocked if there were a "provable" definition of "appropriate" (as opposed to the friendliness of the program being provable with respect to some given definition of "appropriate").