I think it would help the discussion to distinguish more sharply between knowing what human values are and caring about them, that is, between acquiring instrumental values and acquiring terminal ones. The "human enforcement" section touches on this, but I think too weakly: it seems indisputable that an AI trained naively via a reward button would acquire our values only as instrumental ones, and would drop them as soon as it could control the button. This is a counterexample to the Value Learning Thesis if the thesis is interpreted as referring to terminal values.
An obvious programmer strategy would be to cause the AI to acquire our values as instrumental values, then try to modify the AI to make them terminal.
New essay summarizing some of my latest thoughts on AI safety, ~3500 words. I explain why I think that some of the thought experiments that have previously been used to illustrate the dangers of AI are flawed and should be used very cautiously, why I'm less worried about the dangers of AI than I used to be, and some of the remaining reasons why I continue to be somewhat worried.
Backcover celebrity endorsement: "Thanks, Kaj, for a very nice write-up. It feels good to be discussing actually meaningful issues regarding AI safety. This is a big contrast to discussions I've had in the past with MIRI folks on AI safety, wherein they have generally tried to direct the conversation toward bizarre, pointless irrelevancies like 'the values that would be held by a randomly selected mind', or 'AIs with superhuman intelligence making retarded judgments' (like tiling the universe with paperclips to make humans happy), and so forth.... Now OTOH, we are actually discussing things of some potential practical meaning ;p ..." -- Ben Goertzel