I've been involved in a recent thread where discussion of coherent extrapolated volition (CEV) came up. The general consensus was that CEV might - or might not - do certain things, probably, maybe, in certain situations, while ruling other things out, possibly, and that certain scenarios may or may not be the same in CEV, or it might be the other way round, it was too soon to tell.
Ok, that's an exaggeration. But any discussion of CEV is severely hampered by our lack of explicit models. Even bad, obviously incomplete models would be good, as long as we can extract useful information about what they would predict. Bad models can be improved; undefined models are intuition pumps for whatever people feel about them - I dislike CEV, and can construct a sequence of steps that takes my personal CEV to wanting the death of the universe, but that is no more credible than someone claiming that CEV will solve all problems and make lots of cute puppies.
So I'd like to ask for suggestions of models that formalise CEV to at least some extent. Then we can start improving them, and start making CEV concrete.
To start it off, here's my (simplistic) suggestion:
Volition
Use revealed preferences as the first ingredient for individual preferences. To generalise, use hypothetical revealed preferences: the AI calculates what the person would decide if placed in particular hypothetical situations.
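To make this a little less hand-wavy, here's a minimal sketch of what the revealed-preference ingredient could look like. The function names, the choice data and the tally-based notion of "prefers" are my own illustrative choices, not part of any actual CEV proposal; the choice pairs stand in for either real observations or the AI's predictions of hypothetical choices.

```python
from collections import defaultdict

def revealed_preferences(choices):
    """Tally how often each option was picked over each other option.

    `choices` is a list of (chosen, rejected) pairs -- either real
    observations or the AI's prediction of what the person would pick
    in a hypothetical situation.
    Returns a dict mapping (a, b) -> number of times a was chosen over b.
    """
    counts = defaultdict(int)
    for chosen, rejected in choices:
        counts[(chosen, rejected)] += 1
    return counts

def prefers(counts, a, b):
    """Say the person reveals a preference for a over b if they picked
    a over b more often than the reverse."""
    return counts[(a, b)] > counts[(b, a)]

# Example: choices the AI predicts the person would make.
observed = [("tea", "coffee"), ("tea", "coffee"), ("coffee", "water"),
            ("water", "tea")]  # note: this set is already cyclic
counts = revealed_preferences(observed)
print(prefers(counts, "tea", "coffee"))   # True
print(prefers(counts, "water", "tea"))    # True -> hints at intransitivity
```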
Extrapolation
Whenever revealed preferences are non-transitive or non-independent, use the person's stated meta-preferences to remove the issue. The AI thus calculates what the person would say if asked to resolve the intransitivity or failure of independence (for people who don't know about the importance of resolving these, the AI would present them with a set of transitive and independent preferences, derived from their revealed preferences, and have them choose among them). Then (wave your hands wildly and pretend you've never heard of non-standard reals, lexicographic preferences, refusal to choose and related issues) everyone's preferences are now expressible as utility functions.
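Here is one hedged sketch of the "present them with a set of transitive preferences derived from their revealed ones" step: brute-force the total orders and offer the ones that best fit the revealed pairwise choices. The scoring rule and the brute-force enumeration are my own simplifications for illustration, not a claim about how an AI would actually do this.

```python
from itertools import permutations

def agreement(order, counts):
    """Score a candidate total order by how many revealed pairwise
    choices it agrees with. `order` lists options best-first."""
    rank = {opt: i for i, opt in enumerate(order)}
    return sum(n for (a, b), n in counts.items() if rank[a] < rank[b])

def candidate_orders(options, counts, top_k=3):
    """Brute-force every total order (fine for a handful of options)
    and return the top_k that best fit the revealed preferences; the
    person's stated meta-preferences then pick among these."""
    return sorted(permutations(options),
                  key=lambda order: agreement(order, counts),
                  reverse=True)[:top_k]

# Cyclic revealed preferences (tea > coffee, coffee > water, water > tea),
# so no total order fits perfectly and the person is asked to adjudicate.
counts = {("tea", "coffee"): 2, ("coffee", "water"): 1, ("water", "tea"): 1}
options = ["tea", "coffee", "water"]
for order in candidate_orders(options, counts):
    print(order, agreement(order, counts))
```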
Coherence
Normalise each existing person's utility function and add them together to get your CEV. At the FHI we're looking for sensible ways of normalising, but one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI were to work completely against them) as 0.
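For concreteness, here's a toy version of that normalisation, assuming each person's preferences have already been turned into a utility function over a small set of candidate AI policies; the policy labels and utility numbers are made up, and "pick the policy maximising the summed normalised utilities" is just a crude stand-in for the coherence step.

```python
def normalise(utilities):
    """Rescale a person's expected utilities over the AI's possible
    policies so their best achievable policy scores 1 and their worst
    scores 0 (the max/min expected utility normalisation above)."""
    hi, lo = max(utilities.values()), min(utilities.values())
    if hi == lo:                      # indifferent to everything
        return {p: 0.0 for p in utilities}
    return {p: (u - lo) / (hi - lo) for p, u in utilities.items()}

def cev_policy(people):
    """Sum the normalised utility functions and pick the policy that
    maximises the total."""
    totals = {}
    for utilities in people:
        for policy, u in normalise(utilities).items():
            totals[policy] = totals.get(policy, 0.0) + u
    return max(totals, key=totals.get)

# Two people, three candidate AI policies, utilities on arbitrary scales.
alice = {"A": 10.0, "B": 4.0, "C": 0.0}
bob   = {"A": -5.0, "B": 3.0, "C": 2.0}
print(cev_policy([alice, bob]))   # 'B': decent for both after normalising
```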
I think this doesn't matter, if we can:
1) successfully define the CEV concept itself,
2) define a suitable reference class,
3) build a superintelligence, and
4) ensure that the superintelligence continues to pursue the best CEV it can find for the appropriate reference class.
Well, it would be helpful if we could also:
2.5) work out a reliable test for whether a given X really is an instance of the CEV concept for the given reference class
That, in turn, seems to depend on having some kind of explicit understanding of the CEV concept in the first place.
Lacking that, we are left having to trust that whatever the SI we've built ends up doing is actually what we "really want" it to do, even if we don't seem to want it to do that, which is an awkward place to be.