A putative new idea for AI control; index here.

In a previous post, I briefly mentioned translations as one of three possible counterfactuals for indifference. Here I want to clarify what I meant there, because the idea is interesting.


Imagine that there is a button which, if a human presses it (event $P$), will cause an AI to have reward signal $R_P$. If the button isn't pressed (event $\neg P$), the AI will have reward signal $R_{\neg P}$.

Initially, the probabilities of $P$ and $\neg P$ are equal at $1/2$.

Now, suppose that the AI takes an action that increases the probability of $P$ to $2/3$ (and decreases the probability of $\neg P$ to $1/3$).

We want the AI to be indifferent to the change in probability caused by its actions.
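To make the incentive problem concrete, here is a minimal numerical sketch. The reward values ($10$ achievable under $R_P$, $2$ under $R_{\neg P}$) and the function name are illustrative assumptions, not part of the setup above; the point is only that a naive expected-reward maximiser prefers the action that shifts the press probability.

```python
# A minimal sketch of the setup, with hypothetical numbers: the AI expects
# to score 10 under R_P (button pressed) and 2 under R_{not P} (button not
# pressed). A naive expected-reward maximiser then prefers the action that
# pushes the press probability up, which is exactly the manipulation
# incentive we want to remove.

EXPECTED_R_PRESS = 10.0      # hypothetical reward the AI expects under R_P
EXPECTED_R_NOT_PRESS = 2.0   # hypothetical reward the AI expects under R_{not P}

def naive_expected_reward(p_press: float) -> float:
    """Expected reward if the AI just uses its real probability of the press."""
    return p_press * EXPECTED_R_PRESS + (1 - p_press) * EXPECTED_R_NOT_PRESS

print(naive_expected_reward(0.5))   # default action: P(press) = 1/2 -> 6.0
print(naive_expected_reward(2/3))   # manipulative action: P(press) = 2/3 -> ~7.33
```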

Evidential counterfactual

In the "evidential" counterfactual, the AI will simply behave as if the probability of and were fixed, and unaltered from the initial odds:

The problem with this approach is that it doesn't correspond to a true utility/reward function, leading to the paradoxes detailed here and here (see the section on outcome pumps).
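As a sketch of what the evidential counterfactual does to the calculation, using the same hypothetical reward numbers and illustrative names as above:

```python
# A sketch of the "evidential" counterfactual: the AI scores every action
# as if P(press) were still the initial 1/2, whatever its action actually
# does to that probability. Reward numbers are hypothetical.

EXPECTED_R_PRESS = 10.0
EXPECTED_R_NOT_PRESS = 2.0
INITIAL_P_PRESS = 0.5

def evidential_score(actual_p_press: float) -> float:
    """Score an action using the fixed initial odds, ignoring the action's
    real effect on the press probability."""
    del actual_p_press  # deliberately ignored
    return (INITIAL_P_PRESS * EXPECTED_R_PRESS
            + (1 - INITIAL_P_PRESS) * EXPECTED_R_NOT_PRESS)

# Both actions now score the same, so the manipulation incentive is gone;
# but the score no longer matches the AI's true beliefs, which is where
# the "outcome pump" style paradoxes come from.
print(evidential_score(0.5))   # 6.0
print(evidential_score(2/3))   # 6.0
```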

Causal counterfactual

In the "causal" counterfactual, there is some other event that, with small probability, will force the button to be pressed, or prevent it from being pressed, whatever humans want or try.

This event is supposed to be independent of anything anyone does (imagine some specified radioactive decay, or a trigger from distant cosmic events).

Call $F_P$ the event that the button is forcibly pressed, and $F_{\neg P}$ the event that it is forcibly not-pressed. The AI does not value being in any other world (meaning that the AI gets a constant reward in any world where neither $F_P$ nor $F_{\neg P}$ happens).

Then the AI will behave as if the ratio of the probabilities of following $R_P$ versus $R_{\neg P}$ is the (constant) ratio of $P(F_P)$ to $P(F_{\neg P})$, whatever the probability of $P$ becomes.

The problem is that $P$ (the human presses the button) is not the same event as $F_P$ (the button is forcibly pressed by some stochastic event), so expected rewards conditional on one will generally not be the same as those conditional on the other.
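Here is the corresponding sketch for the causal counterfactual. The forcing probabilities and the rewards conditional on the forcing events are made-up numbers, chosen to differ from the rewards conditional on an ordinary press in order to illustrate the mismatch just described.

```python
# A sketch of the "causal" counterfactual, again with hypothetical numbers.
# F_P / F_{not P} are rare exogenous events that force the button pressed /
# not-pressed; their probabilities do not depend on the AI's actions.
# The AI only cares about worlds where one of them happens, so it weights
# the two rewards by the fixed ratio P(F_P) : P(F_{not P}).

P_FORCE_PRESS = 0.001       # P(F_P): button forcibly pressed
P_FORCE_NOT_PRESS = 0.001   # P(F_{not P}): button forcibly not-pressed

# These are rewards conditional on the *forcing* events, which may differ
# from the rewards conditional on the human actually pressing or not
# pressing; that mismatch is the problem described above.
EXPECTED_R_GIVEN_FORCE_PRESS = 9.0
EXPECTED_R_GIVEN_FORCE_NOT_PRESS = 2.5

def causal_score(actual_p_press: float) -> float:
    """Score an action using only the forced-press / forced-not-press worlds."""
    del actual_p_press  # the AI's effect on the ordinary press event is ignored
    total = P_FORCE_PRESS + P_FORCE_NOT_PRESS
    w_press = P_FORCE_PRESS / total
    return (w_press * EXPECTED_R_GIVEN_FORCE_PRESS
            + (1 - w_press) * EXPECTED_R_GIVEN_FORCE_NOT_PRESS)

print(causal_score(0.5))   # 5.75
print(causal_score(2/3))   # 5.75; same score, so no manipulation incentive
```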

Translation "counterfactual"

In the "translation" counterfactual, in contrast, there are no unusual conditionings or assuming unlikely events. Instead, the event is decoupled somewhat from the event of the AI maximising or .

In this specific case, the probability of the AI maximising $R_P$, given $P$, is reduced (from $1$ to $3/4$), so that the final probability of maximising $R_P$ is the same as the initial one: $P(\text{maximise } R_P \wedge P) = \tfrac{3}{4} \cdot \tfrac{2}{3} = \tfrac{1}{2}$.

In this case, the probability of the AI maximising $R_P$ and $P$ happening remains at $1/2$, while the probability of maximising $R_{\neg P}$ and $\neg P$ is reduced to $1/3$, and the probability of maximising $R_{\neg P}$ and $P$ is increased to $1/6$.

Thus the probabilities of maximising $R_P$ and of maximising $R_{\neg P}$ remain $1/2$ each, and the AI has a correct probability distribution over future events (there are some subtleties concerning probabilities versus weights here, but this is the general picture).
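The bookkeeping can be checked directly; the sketch below just redoes the arithmetic above with the illustrative numbers used in this post ($P(P)$ moving from $1/2$ to $2/3$).

```python
# A sketch of the joint probabilities in the "translation" counterfactual,
# using the illustrative numbers from the text: the action moves P(press)
# from 1/2 to 2/3. The coupling between "which reward the AI maximises" and
# "whether the button is pressed" is weakened just enough that the marginal
# probability of maximising each reward stays at 1/2.

from fractions import Fraction

p_press = Fraction(2, 3)              # true probability of press after the action
target_p_max_press = Fraction(1, 2)   # desired marginal P(maximise R_P)

# Joint probabilities over (reward maximised, button event):
p_maxP_and_press = target_p_max_press             # held at 1/2
p_maxNotP_and_press = p_press - p_maxP_and_press  # increased from 0 to 1/6
p_maxNotP_and_notpress = 1 - p_press              # reduced from 1/2 to 1/3
p_maxP_and_notpress = Fraction(0)                 # unchanged

# Marginals: the button probability is the AI's true belief...
assert p_maxP_and_press + p_maxNotP_and_press == p_press
# ...while the probability of maximising each reward is still 1/2.
assert p_maxP_and_press + p_maxP_and_notpress == Fraction(1, 2)
assert p_maxNotP_and_press + p_maxNotP_and_notpress == Fraction(1, 2)

print(p_maxP_and_press, p_maxNotP_and_press, p_maxNotP_and_notpress)
# 1/2 1/6 1/3
```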
