This is a link post to an important argument about a potential motte-and-bailey fallacy used in the AI alignment community.


tl;dr - what Max_He-Ho said

The author makes a good point that humans are not choosing the utility function for the AI randomly. They are trying to build something useful, which dramatically limits the possible choices.

The problem is that after filtering the choices by "does something profitable and seemingly safe (in the beta version)", there are still many possible utility functions left, most of which we would not want a superhuman AI to optimize for.

I'd be curious to hear what you think about my arguments that deceptive alignment is unlikely. Without deceptive alignment, there are many fewer realistic internal goals that produce good training results. 

I agree that active deception seems unlikely (a weak opinion; I haven't spent much time thinking about it). At the moment, my risk model is roughly: "The AI destroys everything the humans were not paying attention to... plus a few more things after the environment changes dramatically."

(Humans care about 100 different things. When you train the AI, you check 10 of them, and the AI will sincerely tell you whether it cares about them or not. You make the AI care about all 10 of them and run it. The remaining 90 things are now lost. Out of the 10 preserved things, 1 was specified a little incorrectly, and now it is too late to do anything about it. As a consequence of losing the 90 things, the environment changes so that 2 more of the 10 preserved things no longer make much sense in the new context, so they are effectively also lost. Humanity gets to keep 7 out of the 100 things it originally cared about. A minimal sketch of this bookkeeping follows below.)
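As a toy illustration of the arithmetic in the parenthetical above, here is a minimal sketch. All the numbers and variable names are the same hypothetical ones from the scenario, not a claim about any real training setup.

```python
# Toy bookkeeping for the hypothetical scenario above.
total_values = 100           # things humans care about
checked = 10                 # values actually verified during training
lost_unchecked = total_values - checked  # 90 values never encoded at all

misspecified = 1             # checked value specified a little incorrectly
invalidated_by_drift = 2     # checked values that stop making sense after the environment shifts

preserved = checked - misspecified - invalidated_by_drift
print(f"Values humanity effectively keeps: {preserved} / {total_values}")  # 7 / 100
```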