Well, let's further say that you assign p(+u)=0.51 and p(-u)=0.49, slightly favoring the production of paperclips over their destruction. And just to keep it a toy problem, you've got a paperclip-making button and a paperclip-destroying button you can push, and no other means of interacting with reality.
A plain old 'confident' paperclip maximizer in this situation will happily push the former button all day, receiving one Point every time it does so. But an uncertain agent will have the exact same behavior; the only difference is that it gets only 0.02 expected Points (0.51 - 0.49) every time it pushes the button, and thus a lower overall score in the same period of time. But the number of paperclips produced is identical. The agent would not (for example) push the 'destroy' button 49 times and the 'create' button 51 times. In practical effect, this is as inconsequential as telling the confident agent that it gets two Points for every paperclip.
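To make the arithmetic explicit, here's a minimal sketch of the toy problem in Python (the probabilities come from the setup above; the function and variable names are my own):

```python
# Toy model: a 'create paperclip' button and a 'destroy paperclip' button.
# +u awards 1 Point per paperclip created; -u awards 1 Point per paperclip
# destroyed. The agent assigns p(+u)=0.51 and p(-u)=0.49.

P_PLUS, P_MINUS = 0.51, 0.49

def expected_points(delta):
    """Expected Points for an action that changes the paperclip count by delta."""
    return P_PLUS * delta + P_MINUS * (-delta)

print(expected_points(+1))  # 0.02  -> score per 'create' press
print(expected_points(-1))  # -0.02 -> strictly worse; 'destroy' is never pressed

# The argmax over actions is identical to the confident maximizer's:
# press 'create' every time. Only the score per press shrinks, not the behavior.
```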
So in this toy problem, at least, uncertainty isn't a moderating force. On the other hand, I would intuitively expect different behavior in a less 'toy' problem; for example, an uncertain maximizer might build every paperclip with a secret self-destruct command so that the number of paperclips could be quickly reduced to zero. So there's a line somewhere where behavior changes. Maybe a good way to phrase my question would be: what are the special circumstances under which an uncertain utility function produces a change in behavior?
If the AI expects to know tomorrow what utility function it has, it will be willing to wait, even if there is a (mild) discount rate, while a pure maximiser would not.
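A quick sketch of why waiting wins, using the toy numbers from above (the one-press-per-timestep framing and the discount parameter g are assumptions of mine):

```python
# Compare 'press now, forever, under uncertainty' against 'skip one step,
# learn u, then press the correct button forever'.

def act_now(g, horizon=10_000):
    # 0.02 expected Points per press (0.51 - 0.49), discounted by g each step.
    return sum(0.02 * g**t for t in range(horizon))

def wait_then_act(g, horizon=10_000):
    # Forgo step 0; from step 1 onward, press the right button for 1 Point each.
    return sum(1.0 * g**t for t in range(1, horizon))

g = 0.5  # even a harsh discount rate
print(act_now(g), wait_then_act(g))  # ~0.04 vs ~1.0: waiting dominates

# In general, waiting wins whenever g > 0.02, i.e. for any mild discount rate.
```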
I'm soon going to go on a two-day "AI control retreat", during which I'll be without internet, family, or any contact: just a few books and thinking about AI control. In the meantime, here is one idea I found along the way.
We often prefer leaders to follow deontological rules, because these are harder to manipulate by those whose interests don't align with ours (you could say something similar about frequentist statistics versus Bayesian ones).
What if we applied the same idea to AI control? Not giving the AI deontological restrictions, but programming it with a similar goal: to prevent a misalignment of values from being disastrous. But who could do this? Well, another AI.
My rough idea goes something like this:
AI A is tasked with maximising a utility function u, which, crucially, it doesn't know yet. Its sole task is to create AI B, which will be given a utility function v and will act on it.
What will v be? Well, I was thinking of taking u and adding some noise, and nasty noise at that. By nasty noise I mean v=u+w, not v=max(u,w). In the first case, you could maximise v while sacrificing u completely, if w is suitable. In fact, I was thinking of adding an agent C (which need not actually exist). It would be motivated to maximise -u; it would have the code of B and the set of candidate utilities u+w, and it would choose v to be the worst possible option (from the perspective of a u-maximiser) in this set.
So agent A, which doesn't know u, is motivated to design B to follow its given motivation to some extent, but not to extremes: not in ways that would completely sacrifice some sub-part of its utility function, because that sub-part might be part of the original u.
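To make the minimax structure concrete, here's a toy sketch (the two-good world, the menu of noises, and B's 'softness' knob are all illustrative stand-ins of mine, not a proposed implementation):

```python
# Two goods (say, paperclips and staples); a utility is a payoff vector.
# C sees B's code and picks the nastiest w, so A prefers a B that hedges.

def behavior(v, softness):
    """B's policy: split effort between the two goods given its utility v.

    softness=0 -> pure v-maximizer (all effort on v's best good);
    softness=1 -> ignore v entirely and hedge 50/50.
    """
    best = max(range(2), key=lambda i: v[i])
    pure = [0.0, 0.0]
    pure[best] = 1.0
    return [(1 - softness) * pure[i] + softness * 0.5 for i in range(2)]

def u_score(policy, u):
    return sum(p * x for p, x in zip(policy, u))

u = (1.0, 0.0)                      # the true u (hidden from A)
noises = [(0.0, 0.0), (-2.0, 2.0)]  # C's menu of nasty w's

def worst_case(softness):
    # C picks the w making v = u + w worst for a u-maximiser, given B's code.
    return min(
        u_score(behavior([u[i] + w[i] for i in range(2)], softness), u)
        for w in noises
    )

print(worst_case(0.0))  # 0.0  -> extreme B: C flips v's argmax, u is sacrificed
print(worst_case(0.5))  # 0.25 -> a moderated B keeps some effort on the true u
```

The pattern matches the intuition above: the more extreme B's maximisation, the more leverage the nasty noise has, so A is pushed toward designing a B that never completely abandons any component of its utility function.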
Do people feel this idea is implementable/improvable?