The goal would be to have a sufficient representation of human values using as dumb a machine as possible. This putative value-learning machine could be dumb in the way that Deep Blue was dumb, by being a hyper-specialist in the problem domain of chess/learning human values and having very little optimization power outside of that domain. It could also be dumb in the way that evolution is dumb, obtaining satisfactory results more through an abundance of data and resources than through any particular brilliance.
It could be a module of a larger AI system whose action component queries it for value judgements. Of course this decoupling poses some special additional challenges (some given below), but at least it factors the complexity of human values into a dumb box that has no inherent optimization capabilities, thus simplifying the overall problem.
So what would an overall system using this dumb valuator box look like?
There is the dumb human value estimator (valuator).
We have an optimizer which tries to make the best decisions given some representation of the world and value judgements thereupon. This is presumably where we want all the smarts, and where the self-optimization process has the risk of kicking in.
There also needs to be a perception component which feeds current representations of the world into the optimizer (to base decisions on) and into the valuator (presumably adding to its corpus and supplying specific instances to be valued).
Presumably we need an action or output component (any output effectively means action because of the influence on humans it facilitates). This is where the loop closes.
In a way the valuator now constitutes a major part of the system's utility function. Major but not the only part, because the way the valuator is integrated and queried also matters: technical and structural aspects of the overall setup become part of the aggregate utility function.
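For concreteness, here is a minimal sketch of how these four components might be wired together. All the names and the toy implementations are hypothetical illustrations, not an existing API; the point is only that the valuator is a dumb box the optimizer queries, not an optimizer itself.

```python
# Hypothetical sketch of the decoupled architecture described above.
# The valuator is a dumb estimator over a corpus of human judgements;
# the optimizer is where the actual smarts (and the risk) live.

class Valuator:
    """Dumb value estimator: maps world states to scalar value judgements."""
    def __init__(self, corpus):
        self.corpus = corpus               # list of (state, human judgement) pairs

    def value(self, state):
        # Placeholder: e.g. a lookup or a learned regressor over the corpus.
        return sum(v for s, v in self.corpus if s == state)

    def observe(self, state):
        # Perception can feed newly seen instances into the corpus over time.
        self.corpus.append((state, 0.0))


class Perception:
    def current_state(self, world):
        return world                       # stand-in for a real world model


class Optimizer:
    """The smart part: picks the action whose predicted outcome the valuator rates highest."""
    def choose(self, state, actions, predict, valuator):
        return max(actions, key=lambda a: valuator.value(predict(state, a)))


def control_loop(world, actions, predict, valuator, perception, optimizer, act):
    state = perception.current_state(world)
    valuator.observe(state)                # note: the optimizer's own actions shape what gets observed
    action = optimizer.choose(state, actions, predict, valuator)
    return act(world, action)              # output closes the loop back into the world
```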
The additional challenges I (and probably most people here) immediately see are:
Simple technical aspects of how often and how fast the valuations are queried become implicit parts of how the overall system values changes in specific valuations over time (to use the example of the cat in front of the car: if you query the valuation of the cat only every minute, you might still value the cat even after it is already dead, to simplify; see the toy sketch after these challenges). This might cause instabilities, 'exploitable' misjudgements, or other strange effects.
The smart AI sitting 'behind' this dumb valuator now does everything to maximize value as seen by the valuator. Any anomaly or incompleteness of the corpus could be exploited and maximized, and that is mostly for the worse.
Because of the feedback via the action and perception modules, the smart AI can influence what the valuator sees. Depending on whether and how 'actions' by the AI are represented in the valuator, this might allow the smart part to reduce the world seen by the valuator to an easily optimizable subset.
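To make the first of these challenges concrete, here is a toy sketch (invented numbers, drastically simplified) of how a fixed query interval can leave the optimizer acting on a stale valuation:

```python
# Toy illustration of the query-frequency problem: the valuator is only
# consulted every QUERY_INTERVAL ticks, so the optimizer can act on a
# stale valuation of the cat for up to that many ticks.

QUERY_INTERVAL = 60                # valuator queried once per simulated minute
cat_alive = True
cached_value_of_braking = 0.0

def valuator(cat_is_alive):
    # Dumb valuator: braking is valuable only while there is a live cat to save.
    return 10.0 if cat_is_alive else 0.0

for tick in range(180):
    if tick % QUERY_INTERVAL == 0:
        cached_value_of_braking = valuator(cat_alive)
    if tick == 30:
        cat_alive = False          # the world changed between queries
    if not cat_alive and cached_value_of_braking > 0:
        print(f"tick {tick}: acting on a stale valuation")
        break
```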
None of these problems are necessarily unsolvable. I'm definitely for trying to factor the problem into sub-problems; I like this approach. I'm especially curious how to model the actions of the smart part as entities to be valued by the valuator (this indirectly acts as a filter between optimizer and output).
This is probably worth doing.
This putative value-learning machine could be dumb in the way that Deep Blue was dumb, by being a hyper-specialist in the problem domain of chess/learning human values and having very little optimization power outside of that domain.
I doubt that's possible. Human values are really complicated, and they refer to fairly arbitrary (from a computational point of view) objects that can't be simplified very much without giving you an unacceptably wrong model of human values. So anything that can successfully learn human values would have to consider broad-domain problems.
How are human values categorically different from things like music preference? Descriptions of art also seem to rely on lots of fairly arbitrary objects that are difficult to simplify.
I'm also not sure what qualifies as unacceptably wrong. There's obviously some utility in having very crude models of human preferences. How would a slightly less crude model suddenly result in something that was "unacceptably" wrong?
There's a lot more room for variation in the space of possible states of the universe than in the space of possible pieces of music. Also, people can usually give you a straightforward answer if you ask them how much they like a certain piece of music, but when asked which of two hypothetical scenarios better satisfies their values, people's answers tend to be erratic, and people will give sets of answers not consistent with any coherent value system, making quite a mess to untangle. Revealed preferences won't help, because people don't act like optimizers, and often do things that they know conflict with what they want.
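A tiny illustration of what "not consistent with any coherent value system" means in practice; the scenarios are made up, but the cycle is the kind of mess elicited preferences can produce:

```python
# Three hypothetical pairwise answers that no single utility assignment can reproduce.
from itertools import permutations

answers = [("save the park", "fund the school"),    # park preferred to school
           ("fund the school", "repave the road"),  # school preferred to road
           ("repave the road", "save the park")]    # road preferred to park: a cycle

def consistent(utilities, answers):
    """True if a utility assignment reproduces every stated preference."""
    return all(utilities[preferred] > utilities[rejected]
               for preferred, rejected in answers)

# Brute-force check: no assignment works, because the answers would require
# u(park) > u(school) > u(road) > u(park).
options = ["save the park", "fund the school", "repave the road"]
print(any(consistent(dict(zip(options, perm)), answers)
          for perm in permutations([1, 2, 3])))     # -> False
```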
By "unacceptably" wrong, I meant wrong enough that it would be a disaster if they were used as the utility function of a superintelligence. In situations with a much larger margin of error, it is possible to use fairly simple algorithms to usefully approximate human values in some domain of interest.
I'm not sure that the number of possible states of the universe is relevant. I would imagine that the vast majority of the variation in that state space would be characterized by human indifference. The set of all possible combinations of sound frequencies is probably comparably enormous, but that doesn't seem to have precluded Pandora's commercial success.
I have to categorically disagree with the statement that people don't have access to their values or that their answers about what they value will tend to be erratic. I would wager that an overwhelming percentage of people would rather find 5 dollars than stub their toe. I would furthermore expect that answer to be far more consistent and more stable than asking people to name their favorite music genre or favorite song. This reflects something very real about human values. I can create ethical or value situations that are difficult for humans to resolve, but I can do that in almost any domain where humans express preferences. We're not going to get it perfectly right. But with some practice, we can probably make it to acceptably wrong.
I want to preface everything here by acknowledging my own ignorance: I have relatively little formal training in any of the subjects this post will touch upon, and this chain of reasoning is very much a work in progress.
I think the question of how to encode human values into non-human decision makers is a really important research question. Whether or not one accepts the rather eschatological arguments about the intelligence explosion, the coming singularity, etc., there seems to be tremendous interest in the creation of software and other artificial agents that are capable of making sophisticated decisions. Inasmuch as the decisions of these agents have significant potential impacts, we want those decisions to be made with some sort of moral guidance. Our approach towards the problem of creating machines that preserve human values thus far has primarily relied on a series of hard-coded heuristics, e.g. saws that stop spinning if they come into contact with human skin. For very simple machines, these sorts of heuristics are typically sufficient, but they constitute a very crude representation of human values.
We're at the border, in many ways, of creating machines where these sorts of crude representations are probably not sufficient. As a specific example, IBM's Watson is now designing treatment programs for lung cancer patients. The design of a treatment program implies striking a balance between treatment cost, patient comfort, aggressiveness of targeting the primary disease, short and long-term side effects, secondary infections, etc. It isn't totally clear how those trade-offs are being managed, although there's still a substantial amount of human oversight/intervention at this point.
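To make that balancing act concrete, here is an entirely made-up weighted-sum sketch of the kind of trade-off such a system has to strike; the weights, scores, and plan names are invented, and nothing here reflects how Watson actually manages these decisions.

```python
# Hypothetical multi-objective scoring of treatment plans.
WEIGHTS = {"cost": -0.2, "comfort": 0.3, "tumour_control": 0.4, "side_effects": -0.1}

def plan_score(plan):
    """Weighted sum over the competing objectives; higher is better."""
    return sum(WEIGHTS[k] * plan[k] for k in WEIGHTS)

plans = {
    "aggressive": {"cost": 0.9, "comfort": 0.3, "tumour_control": 0.9, "side_effects": 0.8},
    "palliative": {"cost": 0.3, "comfort": 0.9, "tumour_control": 0.2, "side_effects": 0.2},
}
best = max(plans, key=lambda name: plan_score(plans[name]))
print(best, round(plan_score(plans[best]), 3))
```

The point of the toy is simply that the chosen weights encode a value judgement about cost, comfort, and aggressiveness, whether anyone states it explicitly or not.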
The use of algorithms to discover human preferences is already widespread. While these typically operate in restricted domains such as entertainment recommendations, it seems at least in principle possible that with the correct algorithm and a sufficiently large corpus of data, a system not dramatically more advanced than existing technology could learn some reasonable facsimile of human values. This is probably worth doing.
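As a sketch of what "learning some reasonable facsimile of human values" from a corpus could look like at its crudest, here is a toy pairwise-preference learner. The features, data, and learning rule are all hypothetical; real recommendation or preference-learning systems are far richer.

```python
# Fit weights so that, for each recorded pairwise choice, the chosen option
# scores higher than the rejected one (a simple perceptron-style update).

def fit_preference_weights(comparisons, n_features, lr=0.1, epochs=100):
    w = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in comparisons:            # each option is a feature vector
            margin = sum(wi * (c - r) for wi, c, r in zip(w, chosen, rejected))
            if margin <= 0:                             # misranked: nudge the weights
                w = [wi + lr * (c - r) for wi, c, r in zip(w, chosen, rejected)]
    return w

# features: [monetary gain, physical pain]
comparisons = [([5.0, 0.0], [0.0, 1.0])]                # "find $5" preferred to "stub a toe"
print(fit_preference_weights(comparisons, 2))           # positive weight on gain, negative on pain
```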
The goal would be to have a sufficient representation of human values using as dumb a machine as possible. This putative value-learning machine could be dumb in the way that Deep Blue was dumb, by being a hyper-specialist in the problem domain of chess/learning human values and having very little optimization power outside of that domain. It could also be dumb in the way that evolution is dumb, obtaining satisfactory results more through an abundance of data and resources than through any particular brilliance.
Computer chess benefited immensely from five decades of work before Deep Blue managed to win a game against Kasparov. While many of the algorithms developed for computer chess have found applications outside of that domain, some of them are domain specific. A specialist human value learning system may also require substantial effort on domain specific problems. The history, competitive nature, and established ranking system for chess made it an attractive problem for computer scientists because it was relatively easy to measure progress. Perhaps the goal for a program designed to understand human values would be that it plays a convincing game of "Would you rather?", although as far as I know no one has devised an Elo system for it.
Similarly, a relatively dumb but more general AI may require relatively large, preferably somewhat homogeneous data sets to come to conclusions that are even acceptable. Having successive generations of AI train on the same or similar data sets could provide a useful progress-tracking and feedback mechanism for determining how successful various research efforts are.
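Since the paragraph above wonders what an Elo-style progress measure for "Would you rather?" might look like, here is the standard Elo update rule as a sketch of how such matches could be scored. The framing of a value model "playing" against a human judge or a baseline system is purely hypothetical.

```python
# Standard Elo update: expected score from the rating gap, then a K-factor adjustment.

def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if A's answer matched the human judgement, 0.0 otherwise."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# e.g. a 1500-rated value model beating a 1600-rated baseline on one question:
print(elo_update(1500, 1600, 1.0))
```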
The benefit of this research approach is that it is not only a relatively safe path towards a possible AGI; even if the speculative futures of mind-uploads and superintelligences never come to pass, there is still substantial utility in having devised a system capable of making correct moral decisions in limited domains. I want my self-driving car to make a much larger effort to avoid a child in the road than a plastic bag. I'd be even happier if it could distinguish between an opossum and someone's cat.
When I design research projects, one of the things I try to ensure is that if some of my assumptions are wrong, the project fails gracefully. Obviously it's easy to love the Pascal's Wager-like impact statement of FAI, but if I were writing it up for an NSF grant I'd put substantially more emphasis on the importance of my research even if fully human-level AI isn't invented for another 200 years. When I give the elevator pitch version of FAI, I've found that placing a strong emphasis on the near future and referencing things people have encountered before, such as computers playing Jeopardy! or self-driving cars, makes them much more receptive to the idea of AI safety and allows me to discuss things like the potential for an unfriendly superintelligence without coming across as a crazed prophet of the end times.
I'm also just really, really curious to see how well something like Watson would perform if I gave it a bunch of sociology data and asked whether a human would rather find 5 dollars or stub a toe. There doesn't seem to be a huge categorical difference between being able to answer the Daily Double and reasoning about human preferences, but I've been totally wrong about intuitive jumps that seemed much smaller than that one in the past, so it's hard to be too confident.