Brief Question about FAI approaches

Dolores1984

I've been reading through this to get a sense of the state of the art at the moment:

http://lukeprog.com/SaveTheWorld.html

Near the bottom, when discussing safe utility functions, the discussion seems to center on analyzing human values and extracting from them some sort of clean, mathematical utility function that is universal across humans. This seems like an enormously difficult (potentially impossible) way of solving the problem, due to all the problems mentioned there.

Why shouldn't we just try to design an average bounded utility maximizer? You'd build models of all your agents (if you can't model arbitrary ordered information systems, you haven't got an AI), run them through your model of the future resulting from a choice, sum each agent's utility over time, and take the average across all people and all times. To measure the utility (or at least approximate it), you could just ask the models. The number this spits out is the output of your utility function. It'd probably also be wise to add a reflexive consistency criterion, such that the original state of each modeled person must consider all of its future states to be 'the same person' -- and I acknowledge that that last one is going to be a bitch to formalize. When you've got this utility function, you just... maximize it.
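
A minimal sketch of the proposal above, in Python, just to pin down the arithmetic. Everything named here is an assumption made for illustration: a "person model" is reduced to a function you can ask for a utility number, a predicted future is a list of states, the bound is the interval [0, 1], and the hypothetical `predict` function stands in for the world model. The reflexive consistency criterion is left out entirely, since it is the part that is hard to formalize.

```python
from typing import Callable, Dict, Hashable, Sequence

# Hypothetical stand-ins: a state is a plain dict, and a person model is
# a function that reports that person's utility for a given state.
State = Dict[str, float]
PersonModel = Callable[[State], float]


def bounded(u: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Clamp a reported utility into [lo, hi] so no single report dominates."""
    return max(lo, min(hi, u))


def average_bounded_utility(people: Sequence[PersonModel],
                            trajectory: Sequence[State]) -> float:
    """Average the bounded reports over every person and every time step."""
    if not people or not trajectory:
        raise ValueError("need at least one person and one state")
    total = 0.0
    for person in people:
        # "Just ask the models": query each person model at each time step.
        total += sum(bounded(person(s)) for s in trajectory) / len(trajectory)
    return total / len(people)


def best_choice(choices: Sequence[Hashable],
                predict: Callable[[Hashable], Sequence[State]],
                people: Sequence[PersonModel]) -> Hashable:
    """Pick the choice whose predicted future scores highest."""
    return max(choices,
               key=lambda c: average_bounded_utility(people, predict(c)))


# Toy example: person "b" tries to act as a utility monster in the first state.
people = [lambda s: s["a"], lambda s: s["b"]]
trajectory = [{"a": 0.5, "b": 1e9}, {"a": 0.5, "b": 0.0}]
score = average_bounded_utility(people, trajectory)
```

Because each report is clamped before averaging, the enormous report from "b" counts for no more than the top of the bound. Whether clamping is the right way to impose boundedness is a design choice this sketch does not settle.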

Something like this approach seems much more robust. Even if human values are inconsistent, we still end up in a universe where most (possibly all) people are happy with their lives, and nobody gets wireheaded. Because it's bounded, you're even protected against utility monsters. Has something like this been considered? Is there an obvious reason it won't work, or would produce undesirable results?

Thanks,

Dolores

I've certainly considered this - and I'm pretty sure I got the idea from Eliezer_2001. He has some made-up phrase that ends in 'semantics' that means "figure out what makes people do what they do, find the part that looks moral, and do that."

The main trouble with the straight-up interpretation is that humans don't so much have a morality as we have a treasure map for finding morality, and modeling us as utility-maximizers doesn't capture this well. Over the long term that is pretty undesirable - it would be as if the ancient Greeks had built an AI and it still had the preconceptions of the ancient Greeks. So either you can pour tons of resources into modeling humans as utility-maximizers, possibly hitting overfitting problems (that is, to actually get a utility function over histories rather than world states, you always get some troublesome utilities for situations humans haven't experienced yet, which have more to do with the model you use than with any properties of humans), or you can use a different abstraction. E.g. find some way of representing "treasure map" algorithms where it makes sense to add them together.
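
A toy illustration of the overfitting worry, in Python, under heavy assumptions: situations are squashed onto a single number, the "observed" utility reports are made up, and the two fitting methods are arbitrary picks chosen only because they disagree. Both fit the observations reasonably well, yet they assign very different utilities to a situation far outside anything observed, and that difference comes from the model, not from the data.

```python
def fit_line(xs, ys):
    """Least-squares straight line through the observations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)


def fit_interpolant(xs, ys):
    """Lagrange polynomial that passes exactly through every observation."""
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f


# Made-up utility reports for four situations people have actually experienced.
observed_x = [0.0, 1.0, 2.0, 3.0]
observed_u = [0.1, 0.4, 0.5, 0.45]

line = fit_line(observed_x, observed_u)
poly = fit_interpolant(observed_x, observed_u)

# A situation far outside the experienced range.
novel = 20.0
gap = abs(line(novel) - poly(novel))
```

Here `gap` is large compared with the spread of the observed reports, even though nothing about the observations tells you which extrapolation, if either, to trust.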