
Gunnar_Zarncke comments on Dumbing Down Human Values - Less Wrong Discussion

4 Post author: leplen 15 November 2014 02:53AM




Comment author: Gunnar_Zarncke 15 November 2014 10:35:41AM

The goal would be to have a sufficient representation of human values using as dumb a machine as possible. This putative value-learning machine could be dumb in the way that Deep Blue was dumb: a hyper-specialist in its problem domain (chess there, learning human values here) with very little optimization power outside of that domain. It could also be dumb in the way that evolution is dumb, obtaining satisfactory results more through an abundance of data and resources than through any particular brilliance.

It could be a module of a larger AI system whose action component queries it for value judgements. Of course, this decoupling poses some additional challenges (some given below), but at least it factors the complexity of human values into a dumb box with no inherent optimization capabilities, thus simplifying the overall problem.

So what would an overall system using this dumb valuator box look like?

  • There is the dumb human value estimator (valuator).

  • We have an optimizer which tries to make the best decisions given some representation of the world and value judgements thereupon. This is presumably where we want all the smarts, and where the self-optimization process risks kicking in.

  • There needs to be a perception component which feeds current representations of the world into the optimizer (to base decisions on) and into the valuator (presumably adding to its corpus and supplying specific valued instances).

  • Presumably we need an action or output component (any output effectively means action, because of the influence on humans it facilitates). This is where the loop closes.
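The four components above can be sketched as a toy loop. Everything here (the class names, the similarity-based valuation, the one-step outcome predictor) is a hypothetical illustration of the decoupling, not anything from the post:

```python
# Hypothetical sketch: a dumb valuator queried by a smart optimizer.
# All names and the toy valuation scheme are illustrative assumptions.

class Valuator:
    """Dumb value estimator: no optimization power of its own.
    Maps a world-state representation to a scalar value judgement."""
    def __init__(self, corpus):
        self.corpus = corpus  # examples of valued states

    def value(self, state):
        # Toy corpus-similarity valuation: count matching attributes
        # against the best-matching example.
        return max(
            (sum(1 for k, v in example.items() if state.get(k) == v)
             for example in self.corpus),
            default=0,
        )

class Optimizer:
    """Smart component: picks the action whose predicted outcome the
    valuator rates highest. All the 'smarts' live here, outside the box."""
    def __init__(self, valuator, predict):
        self.valuator = valuator
        self.predict = predict  # world model: (state, action) -> next state

    def decide(self, state, actions):
        return max(actions,
                   key=lambda a: self.valuator.value(self.predict(state, a)))

# Perception feeds a state in; the chosen action closes the loop.
corpus = [{"cat": "alive"}]
valuator = Valuator(corpus)
predict = lambda state, action: {"cat": "alive" if action == "brake" else "dead"}
optimizer = Optimizer(valuator, predict)
print(optimizer.decide({"cat": "in_road"}, ["brake", "accelerate"]))  # -> brake
```

The point of the factoring shows up in the structure: `Valuator` contains no search or planning, only a lookup against its corpus, while `Optimizer` holds the world model and the argmax.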

In a way, the valuator is now a major factor of the utility function of the system. Major, but not the only one, because the way the valuator is integrated and queried also becomes part of the utility function: technical and structural aspects of the overall setup become part of the aggregate utility function.

The additional challenges I (and probably most people here) immediately see are:

  • Simple technical aspects, such as how often and how fast the valuations are queried, become implicit parts of how the overall system values the change of specific valuations over time. To use the example of the cat in front of the car: if you query the valuation of the cat only once a minute, you might still value the cat when it is already dead (simplified). This might cause instabilities, 'exploitable' misjudgements, or other strange effects.

  • The smart AI sitting 'behind' this dumb valuator now does everything to maximize value as seen by the valuator. Any anomaly or incompleteness in the corpus could be maximized, and that is mostly for the worse.

  • Because of the feedback via the action and perception modules, the smart AI can influence what the valuator sees. Depending on whether and how 'actions' by the AI are represented in the valuator, this might allow the smart part to reduce the world seen by the valuator to an easily optimizable subset.
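The query-frequency problem from the first bullet can be made concrete with a toy timeline. All numbers here (the query interval, the time the cat dies) are arbitrary assumptions for illustration:

```python
# Toy illustration: if the valuator is queried only periodically, the
# optimizer acts on stale value judgements in between queries.

def world_at(t):
    """True world state at time t (seconds); the cat dies at t = 30."""
    return {"cat": "alive" if t < 30 else "dead"}

def valuation(state):
    """Dumb valuator: values a living cat."""
    return 1 if state.get("cat") == "alive" else 0

QUERY_INTERVAL = 60  # the valuator is queried only once a minute

def cached_valuation(t):
    """What the optimizer sees: the valuation from the last query time."""
    last_query = (t // QUERY_INTERVAL) * QUERY_INTERVAL
    return valuation(world_at(last_query))

# Times at which the system still 'values the cat' although it is dead:
stale = [t for t in range(120) if cached_valuation(t) != valuation(world_at(t))]
print(stale)  # t = 30 .. 59
```

For a full 30-second window the cached judgement disagrees with the true one, which is exactly the kind of gap a value-maximizing optimizer could exploit.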

None of these problems are necessarily unsolvable. I'm definitely for trying to factor the problem into sub-problems, and I like this approach. I'm especially curious how to model the actions of the smart part as entities to be valued by the valuator (this indirectly acts as a filter between optimizer and output).

This is probably worth doing.