I wrote a fable for the EA "AI fables" contest, which raises the question of what happens when you copy values from humans to AIs, and those values contain self-referential pointers. The fable just raises the issue, and is more about contemporary human behavior than nitty-gritty representational issues. But further reflection made me think the issue may be much more serious than the fable suggests, so I wrote this: De Dicto and De Se Reference Matters for Alignment (a crosslink to forum.effectivealtruism.org; yes, I should've posted it here first and crosslinked in the other direction, but I didn't).
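To make the representational worry concrete, here's a toy sketch (my own illustration, not from the linked post) of how a de se, self-referential value silently changes referent when copied verbatim from a human to an AI, while a fixed-referent value survives the copy. All names here (`Agent`, `protect_self`, `protect_humanity`) are hypothetical:

```python
# Toy model: a "value" maps the agent currently holding it to what it protects.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str

Value = Callable[[Agent], str]

# De se reading: "protect MYSELF" -- the indexical rebinds to whoever
# currently holds the value.
protect_self: Value = lambda holder: f"protect {holder.name}"

# Fixed-referent reading: "protect humanity" -- the referent does not
# depend on the holder.
protect_humanity: Value = lambda holder: "protect humanity"

human = Agent("Alice")
ai = Agent("AI-1")

# Copied verbatim, the de se value now protects the AI, not the human:
print(protect_self(human))      # protect Alice
print(protect_self(ai))         # protect AI-1  <- referent silently shifted

# The fixed-referent value means the same thing for both holders:
print(protect_humanity(human))  # protect humanity
print(protect_humanity(ai))     # protect humanity
```

The point is only that naive copying preserves the pointer, not the referent; whether real value representations behave this way is what the linked post is about.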
Continuing the thread from here: https://deathisbad.substack.com/p/ea-has-a-pr-problem-in-that-it-cares/comments
I agree with you that an AI programmed exactly the way you describe is doomed to fail. What I don't understand is why you think any AI MUST be made that way.
Some confusions of mine:

-"There is no real distinction between instrumental and terminal goals in humans." This doesn't seem true to me. I seem to have terminal goals/desires, like hunger, and instrumental goals, like going to the store to buy food. Telling me that terminal goals don't exist seems to prove too much. Are you saying that complex goals in human brains, like "don't let humanity die," are in practice instrumental goals built out of simpler desires?

-"Because humans don't 'really' have terminal goals, it's impossible to program them into AIs"? Is that the claim?

-"AIs can't be made to have 'irrational' goals, like caring about humans more than themselves." This also seems to prove too much; wouldn't it prove that humans don't exist? Humans can care about their children more than themselves. Couldn't an AI be made to value humans the way humans value their children, or more?
To pick an inflammatory example: a gay man could think it's irrational for him to want to date men, because that doesn't lead to him having children. But that won't make him want to date women. I have lots of irrational desires that I nevertheless treasure.