Made me think of Rawls's veil of ignorance, somewhat. I wonder: is there a whole family of techniques along the lines of "design intelligence B, given some ambiguity about your own values", with different forms or degrees of uncertainty?
It seems like it should avoid extreme or weirdly specialized results (e.g. paperclipping), since hedging your bets is an immediate consequence. But it's still highly dependent on the language you're using to model those values in the first place.
I'm a little unclear on the behavioral consequences of 'utility function uncertainty' as opposed to the more usual empirical uncertainty. Technically, it is an empirical question, but what does it mean to act without having perfect confidence in your own utility function?
but what does it mean to act without having perfect confidence in your own utility function?
If you look at utility functions as actual functions (not as affine equivalence classes of functions) then that uncertainty can be handled the usual way.
Suppose you want to maximise either u (the number of paperclips) or -u. You don't know which, but will find out soon. Then, in either case, you want to gain control of the paperclip factories...
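To make the paperclip example concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything from the thread: the 50/50 prior, the baseline of 50 paperclips, the 0 to 100 range, and the names `REACHABLE`, `value_of` and `expected_value`. It treats u and -u as actual functions and takes an ordinary expectation over which one turns out to be yours.

```python
# Toy two-stage decision. All numbers and names are assumptions for illustration.
BASELINE = 50  # paperclips produced if the agent leaves the factories alone

# Paperclip counts the agent can bring about later, given its first move.
REACHABLE = {
    "seize_factories": range(0, 101),  # with control: any output from 0 to 100
    "do_nothing": [BASELINE],          # without control: stuck at the baseline
}

def u(paperclips):
    return paperclips

def neg_u(paperclips):
    return -paperclips

def value_of(first_move, utility):
    """Best achievable utility once the true utility function is revealed."""
    return max(utility(n) for n in REACHABLE[first_move])

def expected_value(first_move, prior):
    """Ordinary expectation, taken over utility functions instead of world states."""
    return sum(p * value_of(first_move, f) for f, p in prior)

prior = [(u, 0.5), (neg_u, 0.5)]  # assumed 50/50 uncertainty between u and -u
scores = {move: expected_value(move, prior) for move in REACHABLE}
best_move = max(scores, key=scores.get)
```

In this toy model `best_move` comes out as `"seize_factories"`: control of the factories is useful whichever sign u ends up having. Note that the expectation depends on the scale of each function (multiplying u by 10 changes the scores), which is why this only works with actual functions and not with affine equivalence classes.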
I'm soon going to go on a two day "AI control retreat", where I'll be without internet, family, or any other contact, with just a few books and time to think about AI control. In the meantime, here is one idea I found along the way.
We often prefer leaders to follow deontological rules, because these are harder to manipulate by those whose interests don't align with ours (you could say similar things about frequentist statistics versus Bayesian ones).
What if we applied the same idea to AI control? Not giving the AI deontological restrictions, but programming it with a similar goal: to prevent a misalignment of values from being disastrous. But who could do this? Well, another AI.
My rough idea goes something like this:
AI A is tasked with maximising utility function u - a utility function which, crucially, it doesn't know yet. Its sole task is to create AI B, which will be given a utility function v and act on it.
What will v be? Well, I was thinking of taking u and adding some noise - nasty noise. By nasty noise I mean v=u+w, not v=max(u,w). In the first case, you could maximise v while sacrificing u completely, if w is suitable. In fact, I was thinking of adding an agent C (which need not actually exist). It would be motivated to maximise -u, and it would have the code of B and the set of u+noise, and would choose v to be the worst possible option (from the perspective of a u-maximiser) in this set.
So agent A, which doesn't know u, is motivated to design B so that B follows its motivation v to some extent, but not to extreme amounts - not in ways that might sacrifice some of the values of some sub-part of its utility function, because that might be part of the original u.
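The A/B/C setup above can be sketched as a small max-min problem. This is only a toy under heavy assumptions, not a proposed implementation: utilities are linear weights over three hypothetical "value dimensions", B spends a fixed budget across them, A's uncertainty about u is a short list of candidates, and the noise set lets C shift any one weight by `NOISE_SCALE`. The two designs `extreme_b` and `hedged_b` and all other names are made up for illustration.

```python
# Toy model of: A designs B, C picks the worst v = u + w, A does not know u.
# All names, numbers and the linear-utility setup are assumptions.
N = 3              # number of value dimensions
NOISE_SCALE = 1.5  # how far C may push any single weight up or down

def extreme_b(v):
    """Design 1: pour the whole budget into the top-weighted dimension of v."""
    top = max(range(N), key=lambda i: v[i])
    return [1.0 if i == top else 0.0 for i in range(N)]

def hedged_b(v):
    """Design 2: split the budget in proportion to the positive weights of v."""
    pos = [max(x, 0.0) for x in v]
    total = sum(pos)
    if total == 0:
        return [1.0 / N] * N  # nothing looks valuable: spread evenly
    return [x / total for x in pos]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# A's uncertainty: u cares about exactly one dimension, but A doesn't know which.
CANDIDATE_US = [[1.0 if i == j else 0.0 for i in range(N)] for j in range(N)]

# The "nasty noise" available to C: shift one weight by +/- NOISE_SCALE.
NOISE_SET = [
    [sign * NOISE_SCALE if i == j else 0.0 for i in range(N)]
    for j in range(N)
    for sign in (1, -1)
]

def worst_case(design, u):
    """C knows B's code and picks w so that B's allocation is worst for u."""
    return min(
        dot(u, design([ui + wi for ui, wi in zip(u, w)]))
        for w in NOISE_SET
    )

def score(design):
    """A's objective: average worst case over the u's it considers possible."""
    return sum(worst_case(design, u) for u in CANDIDATE_US) / len(CANDIDATE_US)

scores = {d.__name__: score(d) for d in (extreme_b, hedged_b)}
```

In this toy, the extreme design scores 0 (C can always steer the whole budget away from what u cares about), while the hedged design keeps about a third of the achievable u. That matches the intuition that A should prefer a B which follows v without going all-in on it, though how much of this survives in a less contrived model is an open question.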
Do people feel this idea is implementable/improvable?