Benja had a post about trying to get predictors not to manipulate you. It involved a predictor that could predict tennis matches, but where the prediction itself could also influence the outcome of those matches.

To solve this, Benja imagined the actions of a hypothetical CDT-reasoning agent located on the Moon and thus unable to affect the outcome.

While his general approach is interesting, it seems the specific problem has a much simpler solution: before the AI's message is output, it's run through a randomised scrubber that has a tiny chance of erasing it.

Then the predictor would try to maximise the expected correctness of its prediction, conditional on the scrubber erasing its output (utility indifference can have a similar effect). In practice the scrubber would almost never trigger, so we would get accurate predictions, unaffected by our reading of them.
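To illustrate what I mean (a toy sketch only; the names, probabilities, and the `run_episode` setup below are all made up for illustration, not taken from Benja's post), the predictor is scored only on the episodes the scrubber erases, so its incentive is to be accurate about the world in which nobody reads the prediction:

```python
import random

ERASE_PROB = 1e-6  # tiny chance the message never reaches anyone

def match_outcome(prediction_was_read):
    """Stand-in for the tennis match: player A wins 60% of the time,
    but a published prediction can nudge the result a little."""
    a_wins = random.random() < 0.6
    if prediction_was_read and random.random() < 0.05:
        a_wins = not a_wins  # the prediction itself affected the match
    return a_wins

def run_episode(predict):
    predicted_a_wins = predict()          # the predictor's output (True/False)
    scrubbed = random.random() < ERASE_PROB
    a_wins = match_outcome(prediction_was_read=not scrubbed)
    if scrubbed:
        # Only scrubbed episodes carry a scoring signal.
        return 1.0 if predicted_a_wins == a_wins else 0.0
    return None  # readers get the prediction, the predictor gets no score
```

In this toy setup a predictor that always says "A wins" gets an expected score of 0.6 on the scrubbed episodes, however readers react in all the other ones.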

Does it seem like this will work?

jessicata:

I discussed this with Benja at a previous MIRIx workshop and I don't remember exactly what we concluded, but I think it mostly works; it just requires that people behave sensibly when they get scrubbed predictions.

Now that I think about it: to handle cases where people don't behave that sensibly with scrubbed predictions, maybe we want some kind of sequence of oracles, where oracle 0 outputs nothing, and oracle n+1 outputs what would happen if it were replaced with oracle n. We could take the limit as n approaches infinity, but then we don't know much about which fixed point we would get (it would be controlled by subtle feedback loops), so maybe we want something like n=3 being most probable (although we would want to make n random between 0 and 3, so it's meaningful to condition on n=0, n=1, n=2). A rough sketch of that recursion is below.
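Here is a purely illustrative sketch of the recursion (`world_given_message` is a made-up stand-in for how people react to a message, and a real oracle would be a predictor conditioning on that counterfactual, not a pure function):

```python
import random

def world_given_message(message):
    """Stand-in for 'what the world looks like once people read this message'."""
    return f"world shaped by {message!r}"

def oracle(n):
    if n == 0:
        return None  # oracle 0 outputs nothing
    # Oracle n reports the world in which it was replaced by oracle n-1.
    return world_given_message(oracle(n - 1))

# Randomise n between 0 and 3, with n=3 most probable, so that
# conditioning on each of the lower levels remains meaningful.
n = random.choices([0, 1, 2, 3], weights=[0.05, 0.05, 0.05, 0.85])[0]
print(n, oracle(n))
```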