It feels to me like you are straying off the technical issues by looking at a huge picture.
In this case, a picture so huge it's unsolvable. So here's an assertion which might be interesting: It's better to focus on clusters of small, manageable machine-ethics problems and gradually build up to a Grand Scheme, or more likely in my guess, a Grand Messy But Workable System, rather than teasing out a Bible of global ethical abstraction. There's no working consensus on ethical rules anyway, outside the Three Laws.
An example, maybe already solved: autonomous cars are coming quite soon, much sooner than most of us thought. Several people have wondered about the machine ethics of a car in a crash situation, assuming you accept Google's position that humans will never react fast enough to resume control. Various trolley-problem-like scenarios of minimizing irrevocable harm to humans have been kicked around. But I think I already read a solution to the decision problem in the discussion:
a) Ethical-decisions-during-crash is going to be a very rare occurrence.
b) The overall reduction in accidents is much more significant than a small subset of accidents theoretically made worse by the robot cars.
c) Humans can't agree on complex algorithms for the hypothetical proposed scenarios anyway.
d) Machines always need a default mode when the planned-for reactions conflict.
So if you accept a-d above, then you'll probably agree that simply having the car slow to a stop and pull over to the side as best it can is the default which will produce the least damage. This is the same routine to follow if the car comes upon debris in the road, a wreck, confusing safety beacons, some catastrophe with the road itself, and so forth. It's pretty much what you'd tell your teenager to do.
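The "default mode when planned-for reactions conflict" idea can be sketched in a few lines. This is only an illustration under my own assumptions, not anything a real car runs: the `Option` type, the `expected_harm` score, the `harm_margin` threshold and the action names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical name for the fallback: slow down and pull over.
DEFAULT_ACTION = "slow_and_pull_over"


@dataclass
class Option:
    name: str
    evaluated: bool       # could the planner score this option in time?
    expected_harm: float  # hypothetical harm estimate, lower is better


def choose_action(options: List[Option], harm_margin: float = 0.1) -> str:
    """Return a clearly best evaluated option, otherwise the default."""
    # Options that could not be evaluated in real time are ignored entirely.
    scored = sorted(
        (o for o in options if o.evaluated),
        key=lambda o: o.expected_harm,
    )
    if not scored:
        return DEFAULT_ACTION
    # If the two best options are too close to call, treat that as a
    # conflict between planned-for reactions and fall back to the default.
    if len(scored) > 1 and scored[1].expected_harm - scored[0].expected_harm < harm_margin:
        return DEFAULT_ACTION
    return scored[0].name
```

The point of the sketch is that the default is reached by elimination: as options become unevaluable or indistinguishable, nothing is left except slowing down and pulling over.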
But I think there are lessons to draw from the robot cars: 1) The robot, though fully autonomous in everyday situations, will encounter an ever-narrowing range of options in its decision tree during an accident, so that it ends up with only the default option. In contrast, a human will panic and take action which often adds options to an already overloaded decision tree, options which can't be evaluated in real time and whose outcomes are probably worse than just stopping as fast as possible anyway. 2) Robots don't have to be perfect; they just have to be better than humans in the aggregate and, see #1, default to avoiding action when disaster strikes. 3) Once you get to #2, you are already better than humans and therefore saving lives and property. At this point the engineers can further tune the robot to improve gradually.
So what about the paper-clip monster, the AGI that wants to run the world and, most importantly, writes its own code? I agree it could be done in theory, just as we'll surely have computers running artificial evolution scenarios with DNA, and data-mining/surveillance on a scale so huge it makes the Stasi look like kindergarten. But as everyone has noted, writing your own code is utterly uncharted territory. A lot of LW commentators treat the prospect with myth: they propose an AGI that is better described as an alien overlord than a machine. Myth may be the only way humans can wrap their brains around an idea so big. Engineers won't even try. They'll break the problem up into bits, do a lot of error-checking at a level of action they do understand, and run it in the lab to see what happens. For instance, if there is still a layered approach to software, the OS might have the safety mechanisms built in, and maybe won't be self-upgradable, while the self-written code will run in apps that rely on the OS. Then, after a hundred similar steps of divide-and-conquer, the system will be useful and controllable. But truly, I too am just hand-waving in a vacuum. Please continue...
I think the huge picture is pretty important to look at. If we know the goal is far away, then we know that current projects are not going to get their usefulness from solving the whole problem. But that's fine, there are plenty of other uses for projects. Among others:
Epistemic status: One part quotes (informative, accurate), one part speculation (not so accurate).
One avenue towards AI safety is the construction of "moral AI" that is good at solving the problem of human preferences and values. Five FLI grants have recently been funded that pursue different lines of research on this problem.
The projects, in alphabetical order:
Techniques: Top-down design, game theory, moral philosophy
Techniques: Trying to find something better than inverse reinforcement learning, supervised learning from preference judgments
Techniques: Top-down design, obeying ethical principles/laws, learning ethical principles
Techniques: Trying to find something better than inverse reinforcement learning (differently this time), creating a mathematical framework, whatever rational metareasoning is
Techniques: Trying to identify learned moral concepts, unsupervised learning
The elephant in the room is that making judgments that always respect human preferences is nearly FAI-complete. Application of human ethics is dependent on human preferences in general, which are dependent on a model of the world and how actions impact it. Calling an action ethical can also depend on the space of possible actions, requiring a good judgment-maker to be capable of searching for good actions. Any "moral AI" we build with our current understanding is going to have to be limited and/or unsatisfactory.
Limitations might be things like judging which of two actions is "more correct" rather than finding correct actions, only taking input in terms of one paragraph-worth of words, or only producing good outputs for situations similar to some combination of trained situations.
Two of the proposals are centered on top-down construction of a system for making ethical judgments. When designing a system by hand, it's nigh-impossible to capture the subtleties of human values. Relatedly, a hand-designed system seems weak at generalization to novel situations, unless the specific sort of generalization has been foreseen and covered. The good points of a top-down approach are that it can capture things that are important, but are only a small part of the description, or are not easily identified by statistical properties. A top-down model of ethics might be used as a fail-safe, sometimes noticing when something undesirable is happening, or as a starting point for a richer learned model of human preferences.
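As a rough illustration of the fail-safe role, here is a minimal sketch in which a couple of hand-written rules veto candidates before a learned scorer gets to pick. Everything here is assumed for the example: the rule set, the dictionary keys (`expected_injuries`, `speed_over_limit`), and the `learned_score` callable standing in for whatever a trained model would provide. None of it is taken from the funded proposals.

```python
from typing import Callable, Dict, List

Action = Dict[str, object]

# Hypothetical hand-written rules; each returns True if the action is acceptable.
RULES: List[Callable[[Action], bool]] = [
    lambda a: a.get("expected_injuries", 0) == 0,
    lambda a: a.get("speed_over_limit", 0) <= 0,
]


def violates_rules(action: Action) -> bool:
    """True if any hand-written rule rejects the action."""
    return not all(rule(action) for rule in RULES)


def pick_action(
    candidates: List[Action],
    learned_score: Callable[[Action], float],
    fallback: Action,
) -> Action:
    """Let the learned scorer choose, but only among rule-abiding candidates."""
    allowed = [a for a in candidates if not violates_rules(a)]
    if not allowed:
        # The fail-safe vetoed everything, so return the fallback.
        return fallback
    return max(allowed, key=learned_score)
```

The rules only notice the failures someone thought to write down, which is the generalization weakness described above; the learned scorer does the rest of the work.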
Other proposals are inspired by inverse reinforcement learning. Inverse reinforcement learning seems like the sort of thing we want - it observes actions and infers preferences - but it's very limited. The problem of having to know a very good model of the world in order to be good at human preferences rears its head here. There are also likely unforeseen technical problems in ensuring that the thing it learns is actually human preferences (rather than human foibles, or irrelevant patterns) - though this is, in part, why this research should be carried out now.
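To make "observes actions and infers preferences" concrete, here is a toy sketch. It is a one-step simplification, not full inverse reinforcement learning: each observation is a single choice among options described by feature vectors, the reward is assumed to be linear in those features, and the chooser is assumed to follow a softmax (noisily rational) rule. The function names, the features and the data in the usage example are all made up for illustration.

```python
import math
from typing import List, Sequence, Tuple

Features = Sequence[float]
Observation = Tuple[List[Features], int]  # (available options, index chosen)


def softmax_probs(weights: Sequence[float], options: List[Features]) -> List[float]:
    """Probability of each option under a softmax over linear rewards."""
    scores = [sum(w * f for w, f in zip(weights, opt)) for opt in options]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def infer_weights(
    observations: List[Observation],
    n_features: int,
    steps: int = 500,
    lr: float = 0.1,
) -> List[float]:
    """Gradient ascent on the log-likelihood of the observed choices."""
    weights = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for options, chosen in observations:
            probs = softmax_probs(weights, options)
            for k in range(n_features):
                expected = sum(p * opt[k] for p, opt in zip(probs, options))
                # Gradient: chosen features minus model-expected features.
                grad[k] += options[chosen][k] - expected
        weights = [w + lr * g / len(observations) for w, g in zip(weights, grad)]
    return weights


# Usage: made-up features are (speed, risk); the observed chooser
# always takes the lower-risk option.
observations = [
    ([(1.0, 0.9), (0.5, 0.1)], 1),
    ([(0.8, 0.7), (0.9, 0.2)], 1),
    ([(0.3, 0.1), (1.0, 0.8)], 0),
]
weights = infer_weights(observations, n_features=2)
```

In this toy data the inferred weight on risk comes out negative, so the learner has picked up "avoid risk". Note how much was handed to it for free: the features, the option sets and the choice model. That is the "very good model of the world" problem in miniature, and nothing in the fitting step can tell a genuine preference from a consistent foible.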
Some proposals want to take advantage of learning using neural networks, trained on people's actions or judgments. This sort of approach is very good at discovering patterns, but not so good at treating patterns as a consequence of underlying structure. Such a learner might be useful as a heuristic, or as a way to fill in a more complicated, specialized architecture. For this approach, like the others, it seems important to make the most progress toward learning human values in a way that doesn't require a very good model of the world.