A putative new idea for AI control; index here.

This presents one way of implementing the indifference-based correlations of these posts.

Let u be a utility function, a map from worlds to real numbers. An expected utility maximiser considering whether to produce output Y, looks at the expected utility

Σw u(w)P(w|Y).

We now assume that there is another random variable X in the world, and we want the AI to be indifferent to worlds where Y≠X. We also want it to be indifferent to worlds where Z=0. Then it will assess the value of output Y as:

Σw u(w)P(w|X=Y,Z=1,Y).

Now, the idea of the setup was to ensure that Z=1 would erase the output Y so that it was never read. Hence P(w|Z=1,Y) Hence that equation simplifies to:

Σw u(w)P(w|X=Y).

Therefore the AI will chose the Y that maximises the (conditional) expected utility of u if X=Y. To get the full version of the initial post, you need to define some function f of Y and modify this to

Σw u(w)P(w|X=Y) + f(Y).

 

New to LessWrong?

New Comment