Toy model for wire-heading [EDIT: removed for improvement]

Stuart_Armstrong

EDIT: these ideas are too underdeveloped, I will remove them and present a more general idea after more analysis.

This is a (very) simple toy model of the wire-heading problem to illustrate how it might or might not happen. The great question is "where do we add the (super)intelligence?"

Let's assume a simple model for an expected utility maximising agent. There's the input assessor module A, which takes various inputs and computes the agent's "reward" or "utility". For a reward-based agent, A is typically outside of the agent; for a utility-maximiser, it's typically inside the agent, though the distinction need not be sharp. And there's the the decision module D, which assess the possible actions to take to maximise the output of A. If E is the general environment, we have D+A+E.

Now let's make the agent superintelligent. If we add superintelligence to module D, then D will wirehead by taking control of A (whether A is inside the agent or not) and controlling E to prevent interference. If we add superintelligence to module A, then it will attempt to compute rewards as effectively as possible, sacrificing D and E to achieve it's efficient calculations.

Therefore to prevent wireheading, we need to "add superintelligence" to (D+A), making sure that we aren't doing so to some sub-section of the algorithm - which might be hard if the "superintelligence" is obscure or black-box.

EDIT: these ideas are too underdeveloped, I will remove them and present a more general idea after more analysis.

This is a (very) simple toy model of the wire-heading problem to illustrate how it might or might not happen. The great question is "where do we add the (super)intelligence?"

I agree with your point as stated, but I think a sharper distinction between utility-maximizing and reward-maximizing reveals more alternatives.

A reward-maximizing agent attempts to predict A; D maximizes this predicted future A.

A utility-maximizing agent has direct access to A; D applies A to evaluate possible futures, and maximizes A.

In the first case, a superintelligent D would want to wrestle control of A and modify it.

In the second case, when D thinks about the planned modification of A, it evaluates this possible future using the current A. It sees that the current A does not value this future particularly highly. Therefore, it does not wirehead.

5

Toy model for wire-heading [EDIT: removed for improvement]

5

5

5

Toy model for wire-heading [EDIT: removed for improvement]

5

5