This seems like a very good perspective to me.
It made me think about the way that classic biases are often explained by constructing money pumps. A money pump is taken to be a clear, knock-down demonstration of irrationality, since "clearly" no one would want to lose arbitrarily large amounts of money. But in fact any money pump could be rational if the agent just enjoyed making the choices involved. If I greatly enjoyed anchoring on numbers presented to me, I might well pay a lot of extra money to get anchored; this would be like buying a kind of enjoyable product. Likewise someone might just get a kick out of making choices in intransitive loops, or hyperbolic discounting, or whatever. (In the reverse direction, if you didn't know I enjoyed some consumer good, you might think I was getting "money pumped" by paying for it again and again.)
So there is a missing step here, and to supply the step we need psychology. The reason these biases are biases and not values is "those aren't the sort of things we care about," but to formalize that, we need an account of "the sort of things we care about" which, as you say, can't be solved for from policy data alone.
Agents can be modelled as pairs (p, R), where R is a reward, and p is a planner that takes R and more or less rationally outputs the policy p(R).
Normative assumptions are assumptions that can distinguish between different pairs (p, R) and (p', R'), despite those pairs generating the same policy p(R) = p'(R'), and hence being indistinguishable by observation. We've already looked at the normative assumption "stated regret is accurate"; this post will look at more general normative assumptions, and why you'd want to use them to define human (ir)rationality and reward.
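To make this concrete, here is a minimal toy sketch (the two-action environment, the planners, and the reward are all invented for illustration, not taken from the post): a fully rational planner paired with a reward R produces the same policy as a fully anti-rational planner paired with -R, so no amount of observing the policy can tell the pairs apart.

```python
# Toy illustration (invented example): two different (planner, reward) pairs
# that produce exactly the same policy.

ACTIONS = ["left", "right"]

# A reward function R, mapping actions to values, and its negation -R.
R = {"left": 0.0, "right": 1.0}
neg_R = {a: -v for a, v in R.items()}

def rational_planner(reward):
    """Fully rational planner: picks the action that maximises the reward."""
    return max(ACTIONS, key=lambda a: reward[a])

def anti_rational_planner(reward):
    """Fully anti-rational planner: picks the action that minimises the reward."""
    return min(ACTIONS, key=lambda a: reward[a])

# The pairs (rational, R) and (anti-rational, -R) are very different models
# of the agent, but they generate the same observable behaviour.
assert rational_planner(R) == anti_rational_planner(neg_R) == "right"
```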
Talking it all out
The previous post used "stated reward". This suggests an extension: why restrict ourselves to certain human utterances, like expressions of regret, rather than using human utterances in general? If we want to know about human values and human rationality, why not just ask... humans?
But humans, and I'm sorry if this shocks you, sometimes lie or mislead or tell less than the full truth. We're subject to biases and inconsistencies, and hence our opinions are not a reliable indicator of our values.
Can we not just detect biases and lies, and train a learning agent to ignore or discount them? For example, we could use labelled databases of lies or partial truths, and...
The problem is that a "labelled database of lies and partial truths" is just asking humans again, at a meta level. If we can be misleading at the object level, how can we trust the meta-level assessment ("I am not racist"; "I am telling the truth in the previous statement")? If we use different people to do the speaking and the labelling, then we're just using the biases and utterances of some people as assessments of the biases and utterances of others.
This is especially problematic when people's values are so under-defined, when biases are both rampant and un-obvious, and when we expect meta-level statements to be even noisier than object-level ones.
Let P be the procedure "take a lot of human statements about values, selected according to some criteria, along with a lot of human meta-statements about values and about object-level statements, and do some sort of machine learning on them with some sort of regularisation". In the language of this post, is P something that you would be comfortable seeing as the definition of human values?
As definitions go, it certainly seems like it's not completely useless, but it probably has some hideous edge case failures. Can we try and do better?
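Part of the worry is how many arbitrary choices hide inside P. Here is one hypothetical shape it could take; the data, labels, model, and regularisation below are pure placeholders, chosen only to make those choices visible, not a concrete proposal.

```python
# A placeholder version of the procedure P (nothing here is a real proposal).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Human statements about values, and meta-statements about other statements,
# "selected according to some criteria" -- here, just hand-picked examples.
statements = [
    "I value honesty more than convenience",             # object level
    "The previous speaker was telling the full truth",   # meta level
]
# Human-supplied labels for how reliable/endorsed each statement is --
# which is exactly where lies and biases can leak back in.
labels = [1, 0]

# "Some sort of machine learning with some sort of regularisation":
# here, bag-of-words features and L2-regularised logistic regression.
P = make_pipeline(TfidfVectorizer(), LogisticRegression(C=1.0))
P.fit(statements, labels)

# Whatever P now outputs on new statements would, by definition, be
# "human values" -- including on whatever edge cases it mangles.
print(P.predict(["I value convenience more than honesty"]))
```

Each placeholder above (the selection criteria, the labels, the model, the regularisation) is a choice that quietly shapes the resulting "definition".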
Why is bias bias, and rationality rationality?
Returning to the point of previous posts, one cannot deduce human reward and rationality from observations. Humans, however, do it all the time, about themselves and about each other, and we often agree with each other. So how do we do it?
Basically, we need to add normative assumptions - assumptions, not derived from observations about the world, that distinguish between different models of (ir)rationality and reward.
That old post mentioned feelings of regret ("I shouldn't have done that!"), which seem to be one of the prime reasons we model ourselves as having certain rewards - generally, the opposite to whatever we're regretting. When we specifically start regretting our own actions, rather than the outcome ("I knew it was the wrong decision when I made it!" "Why didn't I stop to think?") this helps us model ourselves as partially irrational agents.
What other assumptions can we reach for? Basically, we need to look for how humans define rationality and irrationality, and use this to define our very values.
Rationality: logic and irrelevant elements?
One strong assumption that underlies the definition of rationality is that it follows the rules of logic. By transitivity, if I prefer A to B, and B to C, then I must prefer A to C. In fact, an even more basic assumption is that people actually prefer A to B, or vice versa, or rank them equally.
So far, so good. But nobody is shocked that I prefer cereals to curry most mornings, and the reverse at all lunches. Do I not have a preference? Ah, but my preferences are time- and appetite-dependent, and people don't see this as a failure of rationality.
What we see as a failure of rationality is when someone shifts behaviour for unimportant or irrelevant reasons. We put someone in a situation and check their A vs B preferences, put the same or a different person in a functionally identical situation and check their B vs C preferences, and so on.
Thus a lot of rationality can be reduced to logic, plus a theory about what constitutes a functionally identical situation. In other words, rationality is mainly a theory about what doesn't matter, or about what shouldn't matter.
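As a toy illustration of that framing (the situational features and the judgement of which ones are irrelevant are invented for the example, echoing the cereal/curry case above):

```python
# Sketch: "rationality = logic + a theory of what shouldn't matter".
# The features and the irrelevance judgements are invented examples.

# The normative assumption: which situational features are judged NOT to matter.
IRRELEVANT_FEATURES = {"anchor_number", "room_decor"}

def functionally_identical(situation_a, situation_b):
    """Two situations are functionally identical if they only differ
    in features we've judged irrelevant."""
    differing = {k for k in situation_a if situation_a[k] != situation_b[k]}
    return differing <= IRRELEVANT_FEATURES

def flag_as_bias(choice_a, choice_b, situation_a, situation_b):
    """A behaviour shift only counts as a bias (rather than a changed
    preference) when the two situations are functionally identical."""
    return choice_a != choice_b and functionally_identical(situation_a, situation_b)

# Morning vs lunch: behaviour differs, but so does a relevant feature -> not a bias.
print(flag_as_bias("cereal", "curry",
                   {"time_of_day": "morning", "anchor_number": 12},
                   {"time_of_day": "lunch",   "anchor_number": 97}))   # False

# Same meal-time, only the irrelevant anchor changed -> flagged as a bias.
print(flag_as_bias("pay $10", "pay $40",
                   {"time_of_day": "lunch", "anchor_number": 12},
                   {"time_of_day": "lunch", "anchor_number": 97}))     # True
```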
Anchoring bias, emotions, and narratives
Let's look again at anchoring bias. In this bias, people who hear a low-but-irrelevant number will offer less for a product than people who hear a high-but-irrelevant number.
It's one of the biases that people most agree is a bias - I've yet to hear anyone argue that people value possessing products at prices close to random numbers they've heard recently.
Now imagine the situation in which these tests would occur. It might be in a classroom, or a university room. Suggestive words might be "calm, dispassionate, clinical, formal, uniform" or terms of that nature.
Let's try another variant of the anchoring bias. In this version, the vendor either insults the subject, or compliments them. I'd be willing to bet that we would find people willing to pay more in the second case than in the first.
Finally, in the third variant, the participants are told that the whole interaction, including the insults/compliments and any subsequent decision, will be broadcast to the world.
Now, we have the same behaviour in three situations - random number, differential treatment, and public differential treatment. I'd classify the first as a bias, and the last as perfectly rational behaviour, with the middle situation falling somewhat in between.
What this means is that we judge that random numbers in relaxed environments are irrelevant to human values; that emotional interactions are relevant; and that publicly visible emotional reactions are very relevant.
At this point I'd introduce narratives - the stories we tell ourselves, about ourselves and others. A strong element of these narratives is that we have a few complex emotions and preferences, rather than many simple ones. Think of all the situations that go under the label "shame", "joy", or "doubt", for instance (and, when it was more popular, all the factors that defined "honour").
Small isolated preferences get classified as quirks, or compulsions, and we generally feel that we could easily get rid of these without affecting the core of our personalities.
Back to the anchoring biases. In the original setting, everything is carefully constructed to remove strong emotions and preferences. We are not getting insulted. The setting is relaxed. The random numbers are not connected with anything deep about us. Therefore, according to our narratives, we have no genuine preferences or emotions at stake (apart from the preference for whatever is being sold). So the different behaviours that happen cannot be due to preferences, and must be biases.
In the other settings, we have strong emotions and then social judgement at stake, and our narratives about ourselves count these as very valid sources of preferences.
It's all a bit vague...
This mention of regret, emotions, and narratives seems a bit vague and informal. I haven't even bothered to provide any references, and there are weasel words like "generally". Could someone else not give different interpretations of what's going on, maybe by invoking different narratives about ourselves?
And indeed they could. The fundamental problem remains: our values are inconsistent and underdefined, and our narratives are also inconsistent and underdefined.
We still have to make choices about how to resolve these inconsistencies and fill in these definitions, and different people and different cultures would resolve them very differently.
Nevertheless, I think we have made some progress.
It's all humans answering questions - which ones?
If we make use of emotions and narratives, this suggests a slightly different way of proceeding. Rather than just taking a lot of human answers and meta-answers, we first identify the main emotions and narratives that we care about. We train the AI to recognise these in humans (the recognition need not be perfect - humans are not perfect at it either, but we are better than chance).
Only then do we unleash it on human answers, and allow it to ask questions itself. Instead of drawing categories from the answers, we draw some of the categories ourselves first, and it then uses these to interpret the answers. We don't have to label which answers and meta-answers are true or reliable - the AI will draw its own inferences about that.
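A rough sketch of that two-stage setup (the categories, example utterances, and models below are placeholders, just to show the shape of the pipeline, not a concrete proposal):

```python
# Sketch of the two-stage proposal: (1) train recognisers for a hand-picked
# set of emotions/narratives, (2) only then interpret the mass of human
# answers and meta-answers through those pre-drawn categories.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stage 1: categories chosen by humans in advance, with (imperfect) human labels.
CATEGORIES = ["shame", "joy", "doubt"]
labelled_utterances = [
    ("I can't believe I said that in front of everyone", "shame"),
    ("That went better than I ever hoped", "joy"),
    ("I'm not sure I actually want this", "doubt"),
]
texts, cats = zip(*labelled_utterances)
recogniser = make_pipeline(TfidfVectorizer(), LogisticRegression())
recogniser.fit(texts, cats)   # need not be perfect -- just better than chance

# Stage 2: the system is now let loose on human (meta-)answers, interpreting
# them via the pre-drawn categories rather than inventing its own from scratch.
answers = [
    "I regret taking the deal, I knew it was wrong when I made it",
    "Honestly, I'd pay that price again tomorrow",
]
for answer in answers:
    print(answer, "->", recogniser.predict([answer])[0])
```

The point of the first stage is that the categories come from us, not from whatever clusters the learning algorithm happens to find in the answers.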
Now, you might argue that this is just humans answering questions all over again - identifying human emotions is just the AI learning from labelled data, and the labels are human answers.
To which I'd say... of course it is. Everything we train an AI on is essentially human answers of some sort. We provide the training data and the labels, we tweak the parameters depending on results, we judge which method performs better.
But of course, in machine learning, some approaches are better than others, even if they're "equivalent" in this way. Human feedback is more useful for distinguishing between some categories than between others. And it seems to me that "identifying strong narratives and emotions, training the AI on them, then unleashing it on a lot of examples of human (meta-)answers, and getting human feedback on its next-level conclusions (often using the initial categories to phrase questions)" is a better and more stable approach than "unleash the AI on a lot of examples of human (meta-)answers, and let it draw its own categories".
For a start, I suspect the first approach will give a regularisation we're more comfortable with (as our narrative categories give some hints as to what we consider important and what we consider contingent). Secondly, I feel that this approach is less likely to collapse into pathological behaviour. And I feel that this could succeed at the task of "identify subconscious aspects of human preferences that we really want to endorse, if we but knew about them".