I've been reading through this to get a sense of the state of the art at the moment:
http://lukeprog.com/SaveTheWorld.html
Near the bottom, in the discussion of safe utility functions, the focus seems to be on analyzing human values and extracting from them some sort of clean, mathematical utility function that is universal across humans. This seems like an enormously difficult (potentially impossible) way of solving the problem, for all the reasons mentioned there.
Why shouldn't we just try to design an average bounded utility maximizer? You'd build models of all your agents (if you can't model arbitrary ordered information systems, you haven't got an AI), run them through your model of the future resulting from a choice, take the sum of each one's utility over time, and then take the average across all the people. To measure the utility (or at least approximate it), you could just ask the models. The number this spits out is the output of your utility function. It'd probably also be wise to add a reflexive consistency criterion, such that the original state of your model must consider all future states to be 'the same person' -- and I acknowledge that that last one is going to be a bitch to formalize. When you've got this utility function, you just... maximize it.
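In very rough pseudocode -- and every name here (model_of, simulate, report_satisfaction) is a stand-in for machinery I'm waving my hands at, not something that exists -- I mean something like:

    def average_bounded_utility(future, people, model_of, simulate, report_satisfaction,
                                times, cap=10.0):
        # Score one candidate future: ask each person-model how satisfied it is
        # at several times, sum over time, cap the per-person total, then
        # average across people.
        totals = []
        for person in people:
            model = model_of(person)                       # black-box model of this person
            total = 0.0
            for t in times:
                future_self = simulate(model, future, t)   # that model, as it would be at time t
                total += report_satisfaction(future_self)  # just ask it (bounded scale assumed)
            totals.append(min(total, cap))                 # bounded, so no utility monsters
        return sum(totals) / len(totals)                   # average across everyone

    def choose(futures, **kw):
        # Maximize: pick the candidate future with the highest average score.
        return max(futures, key=lambda f: average_bounded_utility(f, **kw))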
Something like this approach seems much more robust. Even if human values are inconsistent, we still end up in a universe where most (possibly all) people are happy with their lives, and nobody gets wireheaded. Because it's bounded, you're even protected against utility monsters. Has something like this been considered? Is there an obvious reason it won't work, or would produce undesirable results?
Thanks,
Dolores
So, the AI is choosing among a set F of possible futures for a set A of agents whose values it is trying to implement by that choice.
And the idea is, for each Fn and An, it models An in Fn and performs a set of tests T designed to elicit reports of the utility of Fn to An at various times, the total of which it represents as a number on a shared scale. After doing this for all of A for a given Fn, it has a set of numbers, which it averages to get an average utility for Fn.
Then it chooses the future with the maximum average utility.
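Or, compressed into a formula (writing T(A_n, F_n, t) for the report elicited at time t from the model of A_n in F_n, on the shared bounded scale):

    F^* = \arg\max_{F_n \in F} \frac{1}{|A|} \sum_{A_n \in A} \sum_{t} T(A_n, F_n, t)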
Yes? Did I understand that correctly?
If so, then I agree that something like this can work, but the hard part seems to be designing T in such a way that it captures the stuff that actually matters.
For example, you say "nobody gets wireheaded", but I don't see how that follows. If we want to avoid wireheading, we want T designed in such a way that it returns low scores when applied to a model of me in a future in which I wirehead. But how do we ensure T is designed this way?
The same question arises for lots of other issues.
If I've understood correctly, this proposal seems to put a conceptually hard part of the problem in a black box and then concentrate on the machinery that uses that box.
EDIT: looking at your reply to mitchell porter above, I conclude that your answer is that T consists of asking An in Fn how satisfied it is with its life, on some bounded scale. In which case I really don't understand how this avoids wireheading.
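To make the worry concrete, the T I now think you mean is something like this (names are mine, purely illustrative):

    def naive_T(future_self):
        # "On a scale of 0 to 10, how satisfied are you with your life?"
        return future_self.report_life_satisfaction()

    # A model of An-after-wireheading reports 10 by construction, so every
    # wireheaded future scores at the top of the scale; nothing downstream of T
    # (the summing, averaging, maximizing) can tell it apart from a genuinely
    # good future.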
I think you've pretty much got it. Basically, instead of trying to figure out a universal morality across humans, you just say 'okay, fine, people are black boxes whose behavior you can predict, let's build a system to deal with those black boxes.'
However, instead of trying to get T to be immune to wireheading, I suggested that we require reflexive consistency -- i.e., the model-as-it-is-now should be given a veto vote over predicted future states of itself. So, if the AI is planning to turn you into a barely-sapient happy monster, your model should be able to veto that outcome.
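Hand-waving heavily, the veto would sit in front of the averaging step rather than inside T -- something like this (endorses is exactly the part I admit is hard to formalize):

    def passes_reflexive_consistency(person, future, model_of, simulate, endorses, times):
        # The model-as-it-is-now gets a veto over every predicted future state
        # of itself: if it wouldn't sign off on becoming some future state (or
        # wouldn't count it as 'the same person'), the whole future is rejected.
        now = model_of(person)
        return all(endorses(now, simulate(now, future, t)) for t in times)

    def admissible_futures(futures, people, **kw):
        # Only futures that no person-model vetoes survive; the average-utility
        # maximization then runs over this filtered set.
        return [f for f in futures
                if all(passes_reflexive_consistency(p, f, **kw) for p in people)]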