This is a great post and some great points are made in discussion too.
Is it possible to make exact models exhibiting some of these intuitive points? For example, there is a debate about whether extrapolated human values would depend strongly on cognitive content or whether they could be inferred just from cognitive architecture. (This could be a case of metamoral relativism, in which the answer simply depends on the method of extrapolation.) Can we come up with simple programs exhibiting this dichotomy, and simple constructive "methods of extrapolation" which can give us some sense of how value extrapolation is supposed to work?
Do we have any useful examples of reflectively improved decision theory? The draft document on TDT argues in some places that TDT is better than the alternatives, so it might be proposed as a case study - that is, one could examine the arguments in favor of TDT, and try to determine the normative principles which are at work.
Taken from some old comments of mine that never did get a satisfactory answer.
1) One of the justifications for CEV was that extrapolating from an American in the 21st century and from Archimedes of Syracuse should give similar results. This seems to assume that change in human values over time is mostly "progress" rather than drift. Do we have any evidence for that, except saying that our modern values are "good" according to themselves, so whatever historical process led to them must have been "progress"?
2) How can anyone sincerely want to build an AI that fulfills anything except their own current, personal volition? If Eliezer wants the the AI to look at humanity and infer its best wishes for the future, why can't he task it with looking at himself and inferring his best idea to fulfill humanity's wishes? Why must this particular thing be spelled out in a document like CEV and not left to the mysterious magic of "intelligence", and what other such things are there?