OK, I think I see.
So, one can of course get arbitrarily fussy about this sort of thing in not-very-interesting ways, but I guess the core of my question is: why in the world should the judge (AI or whatever) treat its model of me as a black box? What does that add?
For example, if the model of me-as-I-am-now rejects wireheading, the judge presumably knows precisely why it rejects wireheading, in the sense that it knows the mechanisms that lead to that rejection. After all, it created those mechanisms in its model, and is executing them. They aren't mysterious to the judge.
Yes?
So why not just build the judge so that it implements the algorithms humans use and applies them to evaluating various futures? It seems easier than implementing those algorithms as part of a model of humans, extrapolating the perceived experience of those models in various futures, extrapolating the expected replies of those models to questions about that perceived experience, and evaluating the future based on those replies.
I'm not sure why my post above is being downvoted. Anyway, on to your point.
We don't know the mechanisms being used to model human beings. They are not necessarily transparently reducible -- or, if they are, the AI may not reduce them into the same components that an introspective human does. Neural networks, for example, are very good at matching the outputs of various systems, but if the programmer is asked to explain why the system produced a particular behavior, it is usually not possible to give a satisfactory explanation. Simp...
I've been reading through this to get a sense of the state of the art at the moment:
http://lukeprog.com/SaveTheWorld.html
Near the bottom, when discussing safe utility functions, the discussion seems to center on analyzing human values and extracting from them some sort of clean, mathematical utility function that is universal across humans. This seems like an enormously difficult (potentially impossible) way of solving the problem, due to all the problems mentioned there.
Why shouldn't we just try to design an average bounded utility maximizer? You'd build models of all your agents (if you can't model arbitrary ordered information systems, you haven't got an AI), run them through your model of the future resulting from a choice, sum their utility over time, and average it across all the people and all the time. To measure the utility (or at least approximate it), you could just ask the models. The number this spits out is the output of your utility function. It'd probably also be wise to add a reflexive consistency criterion, such that the original state of your model must consider all future states to be 'the same person' -- and I acknowledge that that last one is going to be a bitch to formalize. When you've got this utility function, you just... maximize it.
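To make the proposal concrete, here is a rough Python sketch of the scoring step. Everything in it is an assumption of mine for illustration: the names (`AgentModel`, `average_bounded_utility`, `best_choice`, `predict`), the choice to bound each reply by clamping it to [0, 1], and the idea that a predicted future is just a list of world states. The 'same person' consistency check is left out entirely, since I don't know how to formalize it.

```python
from typing import Callable, Sequence

# Hypothetical placeholder types, not part of any real system.
State = dict                            # snapshot of the world at one time step
AgentModel = Callable[[State], float]   # "asking the model": self-reported utility


def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Bound a reported utility so no single agent can dominate the average."""
    return max(lo, min(hi, x))


def average_bounded_utility(agents: Sequence[AgentModel],
                            trajectory: Sequence[State]) -> float:
    """Average the clamped self-reports over every agent and every time step."""
    if not agents or not trajectory:
        raise ValueError("need at least one agent and one time step")
    total = 0.0
    for agent in agents:
        for state in trajectory:
            total += clamp(agent(state))
    return total / (len(agents) * len(trajectory))


def best_choice(choices: Sequence,
                predict: Callable[[object], Sequence[State]],
                agents: Sequence[AgentModel]):
    """Pick the choice whose predicted future scores highest.

    `predict` stands in for the model of the future resulting from a choice.
    """
    return max(choices,
               key=lambda c: average_bounded_utility(agents, predict(c)))
```

The clamp is what is supposed to handle utility monsters: an agent model that reports an enormous number still contributes at most 1 per time step to the average.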
Something like this approach seems much more robust. Even if human values are inconsistent, we still end up in a universe where most (possibly all) people are happy with their lives, and nobody gets wireheaded. Because it's bounded, you're even protected against utility monsters. Has something like this been considered? Is there an obvious reason it won't work, or would produce undesirable results?
Thanks,
Dolores