I wrote a fable for the EA "AI fables" contest, which asks what happens when you copy values from humans to AIs and those values contain self-referential pointers. The fable just raises the issue, and is more about contemporary human behavior than about nitty-gritty representational questions. But further reflection made me think the problem may be much more serious than the fable suggests, so I wrote this: De Dicto and De Se Reference Matters for Alignment (a crosslink to forum.effectivealtruism.org; yes, I should've posted it here first and crosslinked in the other direction, but I didn't).
No, I'm definitely just thinking about IRL (inverse reinforcement learning) here.
IRL takes a model of the world and of the human's affordances as given constants, assumes the human is (maybe noisily) rational, and infers the human's desires in terms of that world model; those inferred desires can then also be used by the AI to choose its own actions, provided you also have a model of the AI's affordances. It has many flaws, but it's definitely worth refreshing yourself on occasionally.
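To make that pipeline concrete, here's a minimal sketch, not any particular published IRL algorithm: a one-shot choice setting where the human is assumed Boltzmann-rational over actions with linear reward in given features, and the reward weights are recovered by maximum likelihood on observed choices, then reused over a different (made-up) set of AI affordances. All the names and numbers (`phi_human`, `beta`, `true_w`, `phi_ai`, the toy data) are illustrative assumptions, not anything from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# "World model": 4 actions available to the human, each described by 3 reward-relevant features.
phi_human = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
])
beta = 2.0                            # assumed degree of rationality (inverse temperature)
true_w = np.array([1.0, -0.5, 0.2])   # hidden human reward weights (unknown to the learner)

def choice_probs(w, phi):
    """Boltzmann-rational choice distribution: P(a) proportional to exp(beta * w . phi(a))."""
    logits = beta * phi @ w
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Simulate noisy human demonstrations from the hidden weights.
demos = rng.choice(len(phi_human), size=2000, p=choice_probs(true_w, phi_human))

# IRL step: maximum-likelihood estimate of the reward weights by gradient ascent.
w_hat = np.zeros(3)
for _ in range(1000):
    observed = phi_human[demos].mean(axis=0)               # average features in the demos
    expected = choice_probs(w_hat, phi_human) @ phi_human  # expected features under current estimate
    w_hat += 0.1 * beta * (observed - expected)            # ascend the avg log-likelihood

print("true human reward weights:", true_w)
print("inferred reward weights:  ", np.round(w_hat, 2))

# The AI can then reuse the inferred reward over its *own* affordances,
# here a different, made-up action set with the same features.
phi_ai = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
print("AI's best action under inferred reward:", int(np.argmax(phi_ai @ w_hat)))
```

The inferred weights only approximately recover the true ones (sampling noise, and anything the assumed rationality model gets wrong about the human goes straight into the inferred reward), which is part of why the desires end up expressed in terms of the given world model rather than anything deeper.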