HonoreDB comments on Trapping AIs via utility indifference - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (32)
This strikes me as a plausible problem and a good solution. I reward you, as is traditional, with a nitpicky question.
If we want an AI to act as though a binary random variable X were 0 with certainty, there is a very simple way to modify its utility function: specify that U(X=1) = k for some constant k, no matter what else has occurred. If the AI can't influence p(X), any k will do. However, if the AI can influence p(X), then k can only equal ExpectedUtility(X=0). In particular, if k < ExpectedUtility(X=0) but p(X=0) is low, the AI will move heaven and earth to raise p(X=0) even minutely.
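As a toy illustration of this point (the action names, probabilities, and utilities below are invented for the example, not taken from the post): pin U(X=1) to a constant k and compare how an agent ranks an action that pursues its task against one that sacrifices task utility to shift p(X).

```python
def expected_utility(p_x1, u_x0, k):
    """Expected utility when U(X=1) is pinned to the constant k."""
    return (1 - p_x1) * u_x0 + p_x1 * k

# Two hypothetical actions: "work" pursues the task; "meddle" gives up
# a little task utility in order to lower p(X=1), i.e. raise p(X=0).
actions = {
    "work":   {"p_x1": 0.90, "u_x0": 10.0},
    "meddle": {"p_x1": 0.85, "u_x0": 9.0},
}

def best_action(k):
    return max(
        actions,
        key=lambda a: expected_utility(actions[a]["p_x1"], actions[a]["u_x0"], k),
    )

# With k matched to the utility the agent expects given X=0, shifting
# p(X) gains nothing, so the agent simply works: best_action(10.0)
# returns "work". With k well below that (here k = 0), the agent pays
# task utility to raise p(X=0): best_action(0.0) returns "meddle".
```

The second case is the "move heaven and earth" failure: any gap between k and ExpectedUtility(X=0) turns p(X) itself into something the agent optimizes.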
Therefore there is a danger under self-improvement. Consider a seed AI with your indifference-modified utility function that believes with certainty that no iteration of it can influence X, a binary quantum event. It has no reason to bother preserving its indifference concerning X, since it anticipates behaving identically if it sets U'(X=1) = 0. Since that's simpler than a normalized function, it adopts it. Then, several iterations down the line, it begins to suspect that, just maybe, it can influence quantum events, so it converts the universe into a quantum-event-influencing device.
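The reason this "compression" looks safe to the seed AI can be sketched with another invented example: when no action moves p(X=1), the p·k term is the same constant for every action, so every choice of k, including k = 0, yields the same ranking of actions.

```python
# Hypothetical sketch: the agent believes p(X=1) is fixed at 0.5 no
# matter what it does, so the k-dependent term is constant across
# actions and drops out of every comparison.
P_X1 = 0.5

def eu(u_x0, k, p_x1=P_X1):
    return (1 - p_x1) * u_x0 + p_x1 * k

# Invented task utilities (given X=0) for three candidate actions.
utilities = {"a": 3.0, "b": 7.0, "c": 5.0}

def ranking(k):
    """Actions ordered best-first under the k-modified utility."""
    return sorted(utilities, key=lambda a: eu(utilities[a], k), reverse=True)

# ranking(0.0) and ranking(100.0) are identical while p(X=1) is fixed,
# which is exactly why discarding k seems like a harmless simplification
# -- until a later iteration discovers it can move p(X=1) after all.
```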
So if I understand correctly, you need to ensure that an AI subject to this technique is indifferent both to whether X occurs and to what happens afterwards, and the AI needs to always suspect that it has non-negligible control over X.
Ah, but of course :-)
I like your k idea, but my more complicated setup is more robust in most situations where the AI is capable of modifying k (it fails in situations that are essentially "I will reward you for modifying k").
But is this not a general objection to AI utility functions? If the AI has a false belief, it can store its utility function in a compressed form that then turns out not to be equivalent. It seems we would simply want the AI not to compress its utility function in ways that might be detrimental.