Did you read Rohin Shah's value learning sequence? It covers this whole area in a good amount of detail, and I think answers your question pretty straightforwardly:
Existing error models for inverse reinforcement learning tend to be very simple, ranging from Gaussian noise in observations of the expert’s behavior or sensor readings, to the assumption that the expert’s choices are randomized with a bias towards better actions.
In fact humans are not rational agents with some noise on top. Our decisions are the product of a complicated mess of interacting processes, optimized by evolution for the reproduction of our children’s children. It’s not clear there is any good answer to what a “perfect” human would do. If you were to find any principled answer to “what is the human brain optimizing?” the single most likely bet is probably something like “reproductive success.” But this isn’t the answer we are looking for.
I don’t think that writing down a model of human imperfections, which describes how humans depart from the rational pursuit of fixed goals, is likely to be any easier than writing down a complete model of human behavior.
We can’t use normal AI techniques to learn this kind of model, either — what is it that makes a model good or bad? The standard view — “more accurate models are better” — is fine as long as your goal is just to emulate human performance. But this view doesn’t provide guidance about how to separate the “good” part of human decisions from the “bad” part.
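For concreteness, the "randomized with a bias towards better actions" model mentioned above is usually formalised as Boltzmann rationality. Here is a minimal sketch; the Q-values and the rationality parameter beta are placeholders of mine, not anything from the quoted post:

```python
import numpy as np

def boltzmann_policy(q_values, beta=1.0):
    """P(action) is proportional to exp(beta * Q): better actions are more
    likely, but every action keeps nonzero probability (the "noise")."""
    logits = beta * np.asarray(q_values, dtype=float)
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# beta -> infinity recovers a perfectly rational expert;
# beta -> 0 recovers uniformly random behaviour.
print(boltzmann_policy([1.0, 2.0, 0.5], beta=2.0))
```

The quoted argument is that neither this nor Gaussian observation noise captures how humans actually deviate from rationality.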
Here is a link to the full sequence: https://www.lesswrong.com/s/4dHMdK5TLN6xcqtyc
Fwiw the quoted section was written by Paul Christiano, and I have used that blog post in my sequence (with permission).
Also, for this particular question you can read just Chapter 1 of the sequence.
Thank you for your feedback! I haven't read this yet, but it comes pretty close to a discussion I had with a friend over this post.
Essentially, her argument started with a simple counterargument: She bought peanut M&Ms when she didn't want to, and didn't realise she was doing it until afterwards. In a similar situation where she was hungry and in the same place, she desired peanut M&Ms to satisfy her hunger, but this time she didn't want them. She knew she didn't want peanut M&Ms, and didn't consciously decide to get them against that want; in
...
Given the following conditions, is it possible to approximate the coherent extrapolated volition of humanity to a "good enough" level?:
Here is my reasoning for why I believe this approximation will in fact work:
First, assume that all of these constraints hold.
The estimated reward function is continuously updated with data from every individual it meets, using some form of weighted experience replay system so as not to overwrite previously-learned information.
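To make the replay idea concrete, here is a minimal sketch of the kind of buffer I have in mind; the decay-based weighting scheme and all names are my own assumptions, not a reference to any particular library:

```python
import random

class WeightedReplayBuffer:
    """Keeps observations from every individual seen so far; sampling is
    weighted so that old data keeps being revisited rather than being
    overwritten by new data."""

    def __init__(self, decay=0.999, min_weight=0.05):
        self.data = []  # list of (observation, weight) pairs
        self.decay = decay
        self.min_weight = min_weight

    def add(self, observation):
        # Gently down-weight older entries instead of discarding them.
        self.data = [(o, max(w * self.decay, self.min_weight))
                     for o, w in self.data]
        self.data.append((observation, 1.0))

    def sample(self, k):
        obs, weights = zip(*self.data)
        return random.choices(obs, weights=weights, k=min(k, len(obs)))

# Usage: push each new human trajectory into the buffer, then train the
# reward model on batches drawn with buffer.sample(batch_size).
```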
Given that IRL/IOC can already estimate the reward function of a single agent, or even of a specific class of agents such as streaked shearwater birds¹, a sufficiently expressive system should be able to extend this approach to complex (read: human) agents.
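As a rough illustration of the estimation step (not how the shearwater paper does it), here is a toy maximum-likelihood IRL loop under a Boltzmann-rational choice model; the linear reward parameterisation and the feature function are simplifying assumptions:

```python
import numpy as np

def estimate_reward(features, demos, beta=1.0, lr=0.1, steps=500):
    """Toy IRL: fit theta so that r(a) = theta . phi(a) makes the observed
    choices likely under a Boltzmann-rational choice model."""
    n_actions, dim = features.shape
    theta = np.zeros(dim)
    for _ in range(steps):
        logits = beta * features @ theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # Gradient of the demonstration log-likelihood: observed feature
        # counts minus the model's expected feature counts.
        observed = features[demos].mean(axis=0)
        expected = probs @ features
        theta += lr * beta * (observed - expected)
    return theta

# Example: 3 actions described by 2 features; the "expert" mostly picks
# action 1, so the recovered reward weights favour the second feature.
phi = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
demos = np.array([1, 1, 1, 2, 1, 1, 0, 1])
print(estimate_reward(phi, demos))
```

Scaling this from a one-shot choice model to sequential, multi-agent human behaviour is exactly the part I'm assuming a "sufficiently complex system" can handle.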
As the number of observations n approaches infinity (or some sufficiently large number), the estimated reward function should converge to a "good enough" approximation of the coherent extrapolated volition of humanity.
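Stated slightly more formally (this is just my attempt to pin the claim down; the estimate, the CEV reward, the distance d, and the tolerance are all assumed notation):

```latex
% \hat{R}_n : reward estimate after observing n individuals
% R_{\mathrm{CEV}} : a (hypothetical) reward function encoding humanity's CEV
% \varepsilon : the fixed "good enough" tolerance
\exists\, N \ \text{such that}\ \forall n \ge N:\quad
d\!\left(\hat{R}_n,\ R_{\mathrm{CEV}}\right) \le \varepsilon
```

where d could be, for example, the regret incurred by optimising the estimate instead of the true CEV reward.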
Note that there does not need to exist some actual reward function that is natively used by real humans and evaluated by their brains. As long as human behaviour can be sufficiently approximated by a neural network, this will hold; given the wide range of things neural networks can already do, from classification to learning agents to machine translation, I don't see this as too much of a stretch.
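For the "behaviour approximated by a neural network" step, the simplest version would be behavioural cloning: fit a network to predict what a human does in each observed state. A minimal sketch (the architecture and sizes are arbitrary placeholders):

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 32, 8  # placeholder sizes

# A small policy network mapping an observed state to action logits.
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(states, actions):
    """states: (batch, STATE_DIM) floats; actions: (batch,) integer labels
    of the action the human actually took."""
    optimizer.zero_grad()
    loss = loss_fn(policy(states), actions)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random stand-in data:
states = torch.randn(16, STATE_DIM)
actions = torch.randint(0, N_ACTIONS, (16,))
print(train_step(states, actions))
```

Of course, imitating behaviour is not the same as recovering the reward behind it, which is why the IRL step above matters.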
However, I do anticipate certain objections to this explanation. Let me run through a few of them.
However, I'd be interested to see any rebuttals to my responses to these counterarguments, as well as any counterarguments that I didn't bring up (there are definitely many). Also, if I made any mistakes or if anything in this post isn't clear, feel free to ask and I'll clarify it.
Footnotes