I've been reading through this to get a sense of the state of the art at the moment:
http://lukeprog.com/SaveTheWorld.html
Near the bottom, the discussion of safe utility functions seems to center on analyzing human values and extracting from them some sort of clean, mathematical utility function that is universal across humans. That seems like an enormously difficult (potentially impossible) way of solving the problem, for all the reasons mentioned there.
Why shouldn't we just try to design an average bounded utility maximizer? You'd build models of all your agents (if you can't model arbitrary ordered information systems, you haven't got an AI), run them through your model of the future resulting from a choice, sum their utility over time, and take the average across all the people at all times. To measure the utility (or at least approximate it), you could just ask the models. The number this spits out is the output of your utility function. It'd probably also be wise to add a reflective consistency criterion, such that the original state of your model must consider all future states to be 'the same person' -- and I acknowledge that that last one is going to be a bitch to formalize. When you've got this utility function, you just... maximize it.
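To make that concrete, here's the rough shape of the calculation I have in mind, as a Python sketch. The `simulate` function and the agent models are stand-ins for the genuinely hard, hand-waved parts (modelling people and predicting futures); the point is just the bounded average:

```python
def average_bounded_utility(choice, agent_models, simulate, horizon, bound=1.0):
    """Sketch of the proposal: simulate the future that follows from `choice`,
    ask each modelled agent how happy it is at each time step, clamp the
    answer to +/- `bound`, and average over everyone, over all time.

    `simulate(choice, t)` and `agent.reported_utility(...)` are stand-ins for
    the hard machinery of modelling people and predicting futures.
    """
    total, count = 0.0, 0
    for t in range(horizon):
        future_state = simulate(choice, t)                 # predicted world at time t
        for agent in agent_models:
            report = agent.reported_utility(future_state)  # "just ask the model"
            total += max(-bound, min(bound, report))       # bounded per-person term
            count += 1
    return total / count if count else 0.0
```

The clamping is what does the work against utility monsters: no single model's report can contribute more than `bound` to any term of the average.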
Something like this approach seems much more robust. Even if human values are inconsistent, we still end up in a universe where most (possibly all) people are happy with their lives, and nobody gets wireheaded. Because it's bounded, you're even protected against utility monsters. Has something like this been considered? Is there an obvious reason it won't work, or would produce undesirable results?
Thanks,
Dolores
Because this strikes me as a nightmare scenario. Besides, we're relying on the models to self-report total happiness; leaving it on an unbounded scale creates incentives for abuse.
The question would be more like: 'assuming you understand standard deviation units, how satisfied with your life are you right now, relative to the average, in those units?' Happy, satisfied people give the machine more utility.
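Purely as an illustration, you could squash that self-reported score through something like tanh, so the per-person utility stays bounded no matter how extreme the report:

```python
import math

def utility_from_report(z_score):
    """Map a self-reported satisfaction score (in standard-deviation units
    relative to the average) onto a bounded utility in (-1, 1).
    tanh is just one illustrative squashing function; any bounded,
    monotonic mapping would do the same job."""
    return math.tanh(z_score)
```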
Okay, but that doesn't mean you can't build a machine that maximizes the number of happy people, under these conditions. Calling it utility is just shorthand.
I need to go to class right now, but I'll get into population changes when I get home this evening.
Presumably, the reflective consistency criterion would be something along the lines of 'hey, model, here's this other model -- does he seem like a valid continuation of you?' No value judgments involved.
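Spelled out as another hand-wavy sketch, the check is just a yes/no question put to the original model about each predicted future model (`endorses_as_continuation` being a stand-in for however the model actually answers it):

```python
def reflectively_consistent(original_model, predicted_future_models):
    """Ask the original (present-day) model whether each predicted future
    version of itself still counts as 'the same person'. Any rejection
    means the choice fails the consistency criterion.

    `endorses_as_continuation` is a stand-in for however the model answers
    that question -- no external value judgment is imposed.
    """
    return all(
        original_model.endorses_as_continuation(future_model)
        for future_model in predicted_future_models
    )
```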
EDIT:
Okay, here's how you handle agents being created or destroyed in your predicted future. For agents that die, you feed that fact back into the original state of the model, and allow it to determine utility for that state. So, if you want to commit suicide, that's fine -- dying becomes positive utility for the machine.
Creating people is a little more problematic. If new people's utility is naively added, well, that's bad, because then the fastest way to maximize its utility function is to kill the whole human race and start building resource-cheap, barely-sapient happy monsters that report maximum happiness all the time. So you need to add a necessary-but-not-sufficient condition: any action taken has to maximize both the utility of all foreseeable minds AND the utility of all minds currently alive. That means that happy monsters are no good (insofar as they eat resources that we'll eventually need), and it means that Dr. Evil won't be allowed to make billions of clones of himself and take over the world. This should also eliminate repugnant conclusion scenarios.
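As a sketch, the filter I mean is literally the intersection of the two maximizations. Both utility functions here are stand-ins for the averaged, bounded measure above, just computed over different populations:

```python
def acceptable_actions(actions, u_all_foreseeable, u_currently_alive):
    """Necessary-but-not-sufficient filter: an action is only acceptable if
    it maximizes BOTH the utility of all foreseeable minds and the utility
    of the minds alive right now. Both utility functions are stand-ins for
    the averaged, bounded measure, evaluated over different populations.
    """
    best_foreseeable = max(u_all_foreseeable(a) for a in actions)
    best_current = max(u_currently_alive(a) for a in actions)
    return [
        a for a in actions
        if u_all_foreseeable(a) >= best_foreseeable
        and u_currently_alive(a) >= best_current
    ]  # note: may be empty if no single action maximizes both at once
```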
So this looks like the crucial part of your proposal. By what criteria should an agent judge another agent to be a "valid continuation" of it? That is, what do you mean by "valid continuation"? What kinds of judgments do you want these models to make?
There are a few very different ways you could go here. For the purpose of illustration...