[Epistemic status: ¯\_(ツ)_/¯ ]
Armstrong and Mindermann write about a no free lunch theorem for inverse reinforcement learning (IRL): the same action can reflect many different combinations of values and (irrational) planning algorithms.
I think even assuming humans were fully rational expected utility maximizers, there would be an important underdetermination problem with IRL and with all other approaches that infer human preferences from their actual behavior. This is probably obvious if and only if it's correct, and I don't know if any non-straw people disagree, but I'll expand on it anyway.
Consider two rational expected utility maximizing humans, Alice and Bob.
Alice is, herself, a value learner. She wants to maximize her true utility function, but she doesn't know what it is, so in practice she uses a probability distribution over several possible utility functions to decide how to act.
If Alice received further information (from a moral philosopher, maybe), she'd start maximizing a specific one of those utility functions instead. But we'll assume that her information stays the same while her utility function is being inferred, and she's not doing anything to get more; perhaps she's not in a position to.
Bob, on the other hand, isn't a value learner. He knows what his utility function is: it's a weighted sum of the same several utility functions. The relative weights in this mix happen to be identical to Alice's relative probabilities.
Alice and Bob will act the same: they'll maximize the same linear combination of utility functions, for different reasons. But the difference matters counterfactually. If you could find out more than Alice knows about her true utility function, then to truly help Alice you'd act on that information, while to truly help Bob you'd ignore it, since his utility function just is the fixed weighted sum.
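The equivalence is just linearity: Alice's expected utility under her credences is the same sum as Bob's mixture utility. Here's a minimal sketch with made-up numbers (the utility table and the credences are illustrative assumptions, not anything from the argument itself):

```python
# Toy illustration (hypothetical numbers): three candidate utility
# functions over four possible actions.
U = [
    [1.0, 0.2, 0.0, 0.5],   # u_1
    [0.0, 0.9, 0.3, 0.4],   # u_2
    [0.2, 0.1, 1.0, 0.6],   # u_3
]

def mixture_scores(weights, utilities):
    """Score each action by the weighted sum of the candidate utilities."""
    return [sum(w * u[a] for w, u in zip(weights, utilities))
            for a in range(len(utilities[0]))]

def best_action(weights, utilities):
    scores = mixture_scores(weights, utilities)
    return scores.index(max(scores))

p = [0.5, 0.3, 0.2]  # Alice's credences over which u_i is her true utility
w = list(p)          # Bob's fixed mixture weights (identical by stipulation)

# Alice maximizes expected utility under moral uncertainty; Bob maximizes
# his single mixture utility. The scores are the same sum either way, so
# their behavior is indistinguishable.
assert mixture_scores(p, U) == mixture_scores(w, U)
assert best_action(p, U) == best_action(w, U)

# But new moral information moves Alice and not Bob: if a philosopher
# shifts her credence toward u_3, her choice can change, while Bob's
# weights (and hence his choice) stay fixed.
p_updated = [0.1, 0.1, 0.8]
assert best_action(p_updated, U) != best_action(p, U)
assert best_action(w, U) == best_action(p, U)
```

With these particular numbers the credence shift does flip Alice's choice; the point is only that it *can*, while nothing analogous can happen to Bob.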
So in some cases, it's not enough to look at how humans behave. Humans are Alice on some points and Bob on some points. Figuring out details will require explicitly addressing human moral uncertainty.
I meant to assume that away:
In cases where you're not in a position to get more information about your utility function (e.g. because the humans you're interacting with don't know the answer), your behavior won't depend on whether you think more such information would be useful, so someone observing your behavior can't infer the latter from the former.
Maybe practical cases aren't like this, but it seems to me that a case only has to be like this with respect to a single aspect of the utility function for the problem to arise.
Paul above seems to think it would be possible to reason from actual behavior to counterfactual behavior anyway, I guess because he's thinking in terms of modeling the agent as a physical system and not just as an agent. I'm confused about that, so I haven't responded, and I don't claim he's wrong.