A putative new idea for AI control; index here.

I'll roughly divide ways of establishing human preferences into four categories:

1. Assume true
2. Best fit
3. Proxy measures
4. Modelled irrationality


In some cases, these are more differences of emphasis than sharp technical differences.

For instance, Cooperative Inverse Reinforcement Learning (CIRL) assumes that humans have access to their true reward function, and act along with the AI to jointly maximise it. This is an "assume true" model: it assumes humans have a real reward function and knowledge of it, and the AI then attempts to deduce what this is.
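A minimal sketch of the "assume true" stance, assuming the human noisily maximises a reward that both they and the AI treat as real, so that observed choices are evidence about which reward is the true one. The toy actions, candidate rewards, and Boltzmann human model are illustrative stand-ins, not CIRL's actual formulation:

```python
import numpy as np

# Hypothetical toy setup: three actions, two candidate "true" reward functions.
actions = ["tea", "coffee", "water"]
candidate_rewards = {
    "likes_caffeine": np.array([0.6, 1.0, 0.1]),
    "likes_hydration": np.array([0.2, 0.1, 1.0]),
}

def choice_probabilities(reward, beta=3.0):
    """Boltzmann-rational human: P(action) proportional to exp(beta * reward)."""
    weights = np.exp(beta * reward)
    return weights / weights.sum()

def posterior_over_rewards(observed_choices, prior=None):
    """Bayesian update over which candidate reward is the human's true one."""
    names = list(candidate_rewards)
    post = np.array(prior if prior is not None else [1 / len(names)] * len(names))
    for choice in observed_choices:
        idx = actions.index(choice)
        likelihoods = np.array(
            [choice_probabilities(candidate_rewards[n])[idx] for n in names]
        )
        post = post * likelihoods
        post = post / post.sum()
    return dict(zip(names, post))

# The AI only *discovers* the reward; it never gets to redefine it.
print(posterior_over_rewards(["coffee", "coffee", "tea"]))
```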

My extremely abstract model of human preferences is a "proxy measure": it uses the imperfect agent's answers to questions as a way to define preferences.
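For contrast, a minimal sketch of a "proxy measure": the preferences just *are* whatever the human's answers to the AI's comparison questions say they are, with no assumption that those answers track some deeper true reward. The question format and scoring rule are hypothetical:

```python
from collections import defaultdict

def preferences_from_answers(answers):
    """answers: list of (option_a, option_b, stated_winner) tuples.
    Returns a score per option; higher means 'preferred', by definition."""
    scores = defaultdict(float)
    for a, b, winner in answers:
        loser = b if winner == a else a
        scores[winner] += 1.0
        scores[loser] -= 1.0
    return dict(scores)

# Because the answers themselves are the target, anything that changes the
# answers (including manipulating the human) changes the "preferences".
answers = [("tea", "coffee", "coffee"), ("coffee", "water", "coffee")]
print(preferences_from_answers(answers))
```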

Both of these methods fail, in opposite ways. The big advantage of an "assume true" approach is that the AI is setting out to discover a reward function, not to define or manipulate one, so it (initially) has no incentive to manipulate the human. The big problem is that humans don't actually have access to their true reward function (unless the AI overdoses on "revealed preferences" and treats everything humans do as a rational expression of a complicated reward function). This can cause a lot of problems, depending on how the AI interprets human irrationality; indeed, under some interpretations, the AI's incentive to manipulate is still present.

The big advantage of proxy measures is that they refer to a real thing in the real world: the human's answers to the AI's questions are genuine pieces of data that don't assume anything impossible. The big problem is that since it's a proxy and not a genuine measure, the AI is directly incentivised to manipulate the outcome, and will find it much easier to do so.

Best fit and Modelled irrationality

In a sense, all approaches are "best fit" approaches. Unless we have a perfect, ideal model of human preferences, some process is needed to fill in the gaps. But I categorise an approach as "best fit" if its purpose is specifically to fit a reward/value/utility function to humans, without assuming that humans know this function or that it comes from proxies.

An example is my vague idea, mentioned here, of specifying the complexity of human values and requiring that the value function found be of the same level of complexity -- and then choosing the best fit at that level.
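A minimal sketch of the "best fit at a fixed complexity level" idea, using the number of features in a linear value function as a crude stand-in for the complexity of human values. The data, the complexity measure, and the model family are all placeholders:

```python
import numpy as np

# Hypothetical data: situations (feature vectors) and observed human ratings.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)

def fit_at_complexity(X, y, k):
    """Best-fit linear value function restricted to k features (a crude
    proxy for 'same level of complexity as human values')."""
    best = None
    for _ in range(200):  # random subset search; purely illustrative
        idx = rng.choice(X.shape[1], size=k, replace=False)
        w, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        err = np.mean((X[:, idx] @ w - y) ** 2)
        if best is None or err < best[0]:
            best = (err, idx, w)
    return best

# Fix the complexity level first, then take the best fit at that level,
# rather than letting the fit trade accuracy against simplicity freely.
err, idx, w = fit_at_complexity(X, y, k=2)
print(idx, np.round(w, 2), round(err, 4))
```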

Finally, "modelled irrationality" is all about figuring out what biases and irrationalities humans suffer from, and flagging them specifically as irrationalities, and providing instructions as to how they should be ignored or corrected for. The more detailed our model of irrationalities, the better the remainder of human behaviour approaches an ideal agent, and the more powerful the other three approaches can become in conjunction.

I think the best way of establishing human preferences might be some automated version of modelled irrationality.

Modelling knowledge

Note that a very accurate model of human irrationality would also allow the AI to successfully model what humans know, and thus allow us to define which statements are misleading. This would be of great use for Oracle designs, for Tool AIs, and for general AIs.
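A minimal sketch of how a model of the human's knowledge could flag misleading statements: a statement counts as misleading if updating the modelled human's beliefs on it moves them further from the truth. Every piece here (the belief model, the update rule, the assumed likelihoods) is a stand-in:

```python
import numpy as np

# Hypothetical belief model: the human's probability over two hypotheses,
# updated by Bayes on a statement with assumed likelihoods under each.
def update(belief, likelihoods):
    posterior = np.array(belief) * np.array(likelihoods)
    return posterior / posterior.sum()

def is_misleading(belief, likelihoods, true_hypothesis=0):
    """Flag a statement if it lowers the modelled human's credence in
    the hypothesis the AI knows to be true."""
    before = belief[true_hypothesis]
    after = update(belief, likelihoods)[true_hypothesis]
    return after < before

# A technically accurate but cherry-picked statement can still mislead:
human_belief = [0.7, 0.3]    # modelled credence; the truth is hypothesis 0
cherry_picked = [0.2, 0.6]   # likelihood of the statement under each hypothesis
print(is_misleading(human_belief, cherry_picked))  # True: flagged as misleading
```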
