Good summary. But concerning your final point:
"For this approach, like the others, it seems important to make the most progress toward learning human values in a way that doesn't require a very good model of the world."
I suspect this is impossible in principle, because human values are dependent on our models of the world.
The key is to develop methods that scale, so that values become better aligned as the world model approaches human-level capability.
But then there is scope, apparently unexplored so far, for finding morally relevant subsets of value. You don't have to see everything through the lens of utilitarianism.
I'm an advocate of this approach in general for a number of reasons, and it's typically how I explain the idea of FAI to people without seeming like a prophet of the end times. Most of the reasons I like value-learning focus on what happens before a super-intelligence or what happens if a super-intelligence never comes into being.
I am strongly of the opinion that real-world testing and real-world application of theoretical results often expose totally unanticipated flaws, and it seems that, for the value-learning problem, partial or incomplete solutions are still tremendously useful. This means that progress on the value-learning problem is likely to attract lots of attention and resources, and that consequently proposed solutions will be more thoroughly tested in the real world.
Some of the potential advantages:
Resources: It seems like there's a strong market incentive for understanding human preferences, in the form of various recommendation engines. The ability to discern human values, even partially, translates well into any number of potentially useful applications. Signs of success in this type of research will almost certainly attract the investment of substantial additional resources to the problem, which is less obviously true for some of the other research directions.
Raising the sanity waterline: Machines aren't seen as competitors for social status, and it's typically easier to stomach correction from a machine than from another person. The ability to share preferences with a machine and get feedback on the values those preferences relate to would potentially be an invaluable tool for introspection. It's possible that this could result in people being more rational or even more moral.
Translation: Humans have never really tried to translate human values into a form that would be comprehensible to a non-human before. Value learning is a way to give humans practice discovering/explaining their values in precise ways. This, to my mind, is preferable to the alternative approach of relying on a non-human actor to successfully guess human morality. One of my human values is for humans to have a role in shaping the future, and I'd feel much more comfortable if we got to contribute in a meaningful way to the estimate of human values held by any future super-intelligence.
Relative Difficulty: The human values problem is hard, but discovering human values from data is probably much harder than just learning/representing human values. Learning quantum mechanics is hard, but the discovery of the laws of quantum mechanics was much, much more difficult. If we can get the human values problem small enough to fit into a seed AI, the chances of AI friendliness increase dramatically.
I haven't taken the time here to consider in detail how the approaches outlined in your post interact with some of these advantages, but I may try and revisit them when I have the opportunity.
I feel like a mixed approach is the most desirable. There is a risk that if the AI is allowed to simply learn from humans, we might get a greedy AI that maximizes its Facebook experience while the rest of the world keeps dying of starvation and wars. Also, our values probably evolve over time (slavery, the death penalty, freedom of speech...), so we might as well try to teach the AI what our values should be rather than what they are right now. Maybe then it's a case of developing a top-down, high-level ethical system and using it to seed a neural network that then picks up patterns in more detailed scenarios?
It feels to me like you are straying from the technical issues by looking at a huge picture.
In this case, a picture so huge it's unsolvable. So here's an assertion which might be interesting: it's better to focus on clusters of small, manageable machine-ethics problems and gradually build up to a Grand Scheme, or more likely (in my guess) a Grand Messy But Workable System, rather than teasing out a Bible of global ethical abstraction. There's no working consensus on ethical rules anyway, outside the Three Laws.
An example, maybe already solved: autonomous cars are coming quite soon, much sooner than most of us thought. Several people have wondered about the machine ethics of a car in a crash situation, assuming you accept Google's position that humans will never react fast enough to resume control. Various trolley-problem-like scenarios of minimizing irrevocable hurt to humans have been kicked around. But I think I already read a solution to the decision problem in the discussion:
a) Ethical decisions during a crash are going to be a very rare occurrence. b) The overall reduction in accidents is much more significant than the small subset of accidents theoretically made worse by the robot cars. c) Humans can't agree on complex algorithms for the hypothetical proposed scenarios anyway. d) Machines always need a default mode for when the planned-for reactions conflict.
So if you accept a-d above, then you'll probably agree that simply having the car slow to a stop and pull over to the side as best it can is the default which will produce the least damage. This is the same routine to follow if the car comes upon debris in the road, a wreck, confusing safety beacons, some catastrophe with the road itself, and so forth. It's pretty much what you'd tell your teenager to do.
But I think there are lessons to draw from the robot cars: 1) The robot, though fully autonomous in everyday situations, will, in an accident, encounter an ever-narrowing range of options in its decision tree, so that it ends up with only the default option. In contrast, a human will panic and take action which often adds options to an already overloaded decision tree, options which can't be evaluated in real time and whose outcomes are probably worse than just stopping as fast as possible anyway. 2) Robots don't have to be perfect; they just have to be better than humans in the aggregate and (see #1) default to avoiding action when disaster strikes. 3) Once you get to #2, you are already better than humans and therefore saving lives and property. At that point the engineers can further tune the robot to improve gradually.
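To make the "ever-narrowing decision tree with a safe default" idea concrete, here is a minimal sketch in Python. Everything in it (the Maneuver structure, the choose_maneuver function, the slow-and-pull-over fallback) is a hypothetical illustration of the control flow, not how any real autonomous-driving stack is organized.

```python
from dataclasses import dataclass

@dataclass
class Maneuver:
    name: str
    feasible: bool   # can the car still physically execute this option?
    evaluated: bool  # did the planner finish evaluating it in time?

def choose_maneuver(candidates):
    """Pick a fully evaluated, feasible maneuver; otherwise fall back to the default."""
    viable = [m for m in candidates if m.feasible and m.evaluated]
    if viable:
        # A real planner would rank these by expected harm, comfort, etc.
        return viable[0]
    # Default mode: nothing viable remains in the narrowing decision tree,
    # so slow to a stop and pull over as best the car can.
    return Maneuver(name="slow_and_pull_over", feasible=True, evaluated=True)

# Example: during a crash, the fancier options have dropped out of the tree.
options = [
    Maneuver("swerve_left", feasible=False, evaluated=True),
    Maneuver("hard_brake_in_lane", feasible=True, evaluated=False),
]
print(choose_maneuver(options).name)  # -> slow_and_pull_over
```

The point is just the shape of the logic: as candidate maneuvers become infeasible or can't be evaluated in time, the system degrades gracefully toward the one action that needs no evaluation at all.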
So what about the paper-clip monster, the AGI that wants to run the world and, most importantly, writes its own code? I agree it could be done in theory, just as we'll surely have computers running artificial evolution scenarios with DNA, and data-mining/surveillance on a scale so huge it makes the Stasi look like a kindergarten. But as everyone has noted, writing your own code is utterly uncharted territory.

A lot of LW commentators treat the prospect with myth: they propose an AGI that is better described as an alien overlord than a machine. Myth may be the only way humans can wrap their brains around an idea so big. Engineers won't even try. They'll break the problem up into bits, do a lot of error-checking at a level of action they do understand, and run it in the lab to see what happens. For instance, if there is still a layered approach to software, the OS might have the safety mechanisms built in, and maybe won't be self-upgradable, while the self-written code will run in apps that rely on the OS; then, after a hundred similar steps of divide-and-conquer, the system will be useful and controllable. But truly, I too am just hand-waving in a vacuum. Please continue...
I think the huge picture is pretty important to look at. If we know the goal is far away, then we know that current projects are not going to get their usefulness from solving the whole problem. But that's fine; there are plenty of other uses for projects. Among others:
Epistemic status: One part quotes (informative, accurate), one part speculation (not so accurate).
One avenue towards AI safety is the construction of "moral AI" that is good at solving the problem of human preferences and values. Five FLI grants have recently been funded that pursue different lines of research on this problem.
The projects, in alphabetical order:
Techniques: Top-down design, game theory, moral philosophy
Techniques: Trying to find something better than inverse reinforcement learning, supervised learning from preference judgments
Techniques: Top-down design, obeying ethical principles/laws, learning ethical principles
Techniques: Trying to find something better than inverse reinforcement learning (differently this time), creating a mathematical framework, whatever rational metareasoning is
Techniques: Trying to identify learned moral concepts, unsupervised learning
The elephant in the room is that making judgments that always respect human preferences is nearly FAI-complete. Application of human ethics depends on human preferences in general, which depend on a model of the world and how actions impact it. Calling an action ethical can also depend on the space of possible actions, requiring a good judgment-maker to be capable of searching for good actions. Any "moral AI" we build with our current understanding is going to have to be limited and/or unsatisfactory.
Limitations might be things like judging which of two actions is "more correct" rather than finding correct actions, only taking input in the form of one paragraph's worth of words, or only producing good outputs for situations similar to some combination of trained situations.
Two of the proposals are centered on top-down construction of a system for making ethical judgments. When designing a system by hand, it's nigh-impossible to capture the subtleties of human values. Relatedly, such a system seems weak at generalizing to novel situations, unless the specific sort of generalization has been foreseen and covered. The good points of a top-down approach are that it can capture things that are important but are only a small part of the description, or that are not easily identified by statistical properties. A top-down model of ethics might be used as a fail-safe, sometimes noticing when something undesirable is happening, or as a starting point for a richer learned model of human preferences.
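As a concrete illustration of the "top-down model as fail-safe" idea, here is a minimal sketch. The rule set, the action representation, and the learned scoring function are all invented for the example and are not taken from any of the funded projects.

```python
def violates_rules(action):
    """Tiny hand-coded, top-down rule set; a real one would need far more subtlety."""
    if action.get("deceives_user", False):
        return True
    if action.get("expected_physical_harm", 0.0) > 0.0:
        return True
    return False

def choose_action(candidate_actions, learned_score):
    """learned_score: a (hypothetical) learned model mapping an action to a preference score."""
    allowed = [a for a in candidate_actions if not violates_rules(a)]
    if not allowed:
        return None  # refuse, or defer to a human, rather than act
    return max(allowed, key=learned_score)

# Example usage with made-up actions and a trivial stand-in for the learned model.
actions = [
    {"name": "helpful_reply", "expected_physical_harm": 0.0},
    {"name": "flattering_lie", "deceives_user": True},
]
print(choose_action(actions, learned_score=lambda a: len(a["name"]))["name"])
```

The hand-written rules only veto; everything subtle is left to the richer learned model, which matches the fail-safe role described above.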
Other proposals are inspired by inverse reinforcement learning. Inverse reinforcement learning seems like the sort of thing we want - it observes actions and infers preferences - but it's very limited. The problem of needing a very good model of the world in order to do well at inferring human preferences rears its head here. There are also likely unforeseen technical problems in ensuring that the thing it learns is actually human preferences (rather than human foibles, or irrelevant patterns) - though this is, in part, why this research should be carried out now.
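For readers unfamiliar with inverse reinforcement learning, here is a deliberately tiny caricature of it, collapsed to a one-step choice setting: assume the demonstrator is Boltzmann-rational with a linear reward over hand-chosen features, and fit the reward weights by gradient ascent on the likelihood of the observed choices. This is a toy of my own construction, not any of the proposals' methods, and it illustrates the limitation mentioned above: the features (the "model of the world") are assumed to be handed to us.

```python
import numpy as np

def fit_reward_weights(demos, n_features, lr=0.1, steps=500):
    """demos: list of (feature_matrix, chosen_index) pairs.
    feature_matrix has one row of features per available action."""
    w = np.zeros(n_features)
    for _ in range(steps):
        grad = np.zeros(n_features)
        for phi, chosen in demos:
            logits = phi @ w
            p = np.exp(logits - logits.max())
            p /= p.sum()
            # Gradient of the log-likelihood: features of the chosen action
            # minus expected features under the current model.
            grad += phi[chosen] - p @ phi
        w += lr * grad / len(demos)
    return w

# Toy demo with two made-up features (say, "helps someone", "breaks a promise").
phi = np.array([[1.0, 0.0],   # action 0
                [0.0, 1.0],   # action 1
                [1.0, 1.0]])  # action 2
demos = [(phi, 0), (phi, 0), (phi, 2)]  # the demonstrator mostly picks action 0
print(fit_reward_weights(demos, n_features=2))
```

Real IRL has to handle sequential decisions and unknown dynamics; the toy skips all of that, which is exactly where the "very good model of the world" requirement comes in.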
Some proposals want to take advantage of learning using neural networks, trained on people's actions or judgments. This sort of approach is very good at discovering patterns, but not so good at treating patterns as a consequence of underlying structure. Such a learner might be useful as a heuristic, or as a way to fill in a more complicated, specialized architecture. For this approach, like the others, it seems important to make the most progress toward learning human values in a way that doesn't require a very good model of the world.
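To illustrate what training on people's judgments might look like, here is a minimal sketch of learning from pairwise human judgments with a small network scoring situations, trained so the judged-better option of each pair gets the higher score (a Bradley-Terry-style objective, written with PyTorch). The network size, feature dimension, and data are invented for the example; none of it comes from the funded projects.

```python
import torch
import torch.nn as nn

# A small scoring network over a 16-dimensional situation encoding (hypothetical).
score_net = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 1),
)
opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)

def train_step(better, worse):
    """better, worse: tensors of shape (batch, 16) describing two options,
    where human judges preferred `better`."""
    margin = score_net(better) - score_net(worse)
    loss = nn.functional.softplus(-margin).mean()  # equals -log sigmoid(margin)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Fake data just to show the call pattern.
better = torch.randn(8, 16)
worse = torch.randn(8, 16)
print(train_step(better, worse))
```

Such a model only reproduces patterns in the judgments it was shown, which is the heuristic, fill-in role suggested above rather than a full account of human values.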