It seems that what we want is usually going to be a counterfactual prediction: what would happen if the AI gave no output, or gave some boring default prediction. This is computationally simpler, but philosophically trickier. It also requires that we be the sort of agents who won't act too strangely if we find ourselves in the counterfactual world instead of the real one.
I think I'm missing something about your perverse donor example. What makes your number a prediction rather than just a preference? If they're going to erase the (meaningless) 1M and give you P-1, you just maximize P, right? A prediction is just a stated belief, and if it's not paying rent in conditional future experience, it's probably not worth having.
More generally, is the self-confirming prediction just the same as a conditional probability on a not-revealed-by-the-oracle condition? In what cases will the oracle NOT want to reveal the condition? In this case, the nature of adversarial goals needs to be examined - why wouldn't the oracle just falsify the prediction in addition to hiding the conditions?
Also, I'm not sure where the "continuous" requirement comes from. Your example isn't continuous; only whole pennies are allowed. Even if only prime multiples of 3 were allowed, it would seem the same lesson holds.
Separately (and minor), I'm not enjoying the "can be arbitrarily bad" titles. They don't convey information, and confuse me into thinking the posts are about something more fundamental than they seem to be. _ANY_ arbitrary scenario can be arbitrarily bad, why are these topics special on that front?
A self-confirming prediction is what an oracle that was a naive sequence predictor (or that was rewarded on results) would give. https://www.lesswrong.com/posts/i2dNFgbjnqZBfeitT/oracles-sequence-predictors-and-self-confirming-predictions
The donor example was to show how such a predictor could end up moving you far in the positive or negative direction. If you were optimising for income rather than accuracy, the choice is obvious.
The £(P±1) donor is a continuous model of a discrete reality. The model has a self-confirming prediction, and it turns out "reality" (the discretised version) has one too. Unless derivatives get extremely high, a self-confirming prediction in the continuous model implies a close-to-self-confirming prediction in the discretised model.
I think I'm still confused - a naive sequence predictor is _OF COURSE_ broken by perverse or adversarial unmodelled (because of the naivety of the predictor) behaviors. And such a predictor cannot unlock new corners of strategy space, or generate self-reinforcing predictions, because the past sequence on which it's trained won't have those features.
> And such a predictor cannot unlock new corners of strategy space, or generate self-reinforcing predictions, because the past sequence on which it's trained won't have those features.
See my last paragraph above; I don't think we can rely on predictors not unlocking new corners of strategy space, because it may be able to learn gradually how to do so.
There's a cool name for this donor's action: blindspotting (yeah, it's written like this) - after a Roy Sorensen book from 1988.
> In that case, there is no correct prediction £P
But there is a distance between predictions and results, which is greater for some predictions.
If you want to avoid changing distances, set the outcome as £P+1 for P less than a million, and £P-1 for P greater than or equal to a million (for example).
I am not sure what exactly you mean by predicting. You can tell the donor a different amount than you are internally expecting to obtain.
The post concerns self-confirming predictions. The donor asked for a prediction of how much money they'll give you... after they hear your prediction. A prediction you give them would be "self-confirming" if they gave you the amount you specified. Here "prediction" refers to "the amount you tell them", as opposed to the amount you are internally expecting to obtain, which no one other than you actually knows.
Predicting perverse donors
There is a rich donor who is willing to donate up to £2,000,000 to your cause. They’ve already written a cheque for £1,000,000, but, before they present it to you, they ask you to predict how much they'll be donating.
The donor is slightly perverse. If you predict any amount £P, they'll erase their cheque and write £(P-1) instead, one pound less than what you predicted.
Then if you want your prediction to be accurate, there’s only one amount you can predict: £P=£0, and you will indeed get nothing.
Suppose the donor was perverse in a more generous way, and they’d instead write £(P+1), one more than your prediction, up to their maximum. In that case, the only accurate guess is £P=£2,000,000, and you get the whole amount.
If we extend the range above £2,000,000, or below £0 (maybe the donor is also a regulator, who can fine you) then the correct predictions get ever more extreme. It also doesn’t matter if the donor subtracts or adds £1, £100, or one pence (£0.01): the only accurate predictions are at the extreme of the range.
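As a minimal sketch of the two donors (the function names and the brute-force fixed-point search are my own, not part of the example), the accurate predictions are just the fixed points of the map from prediction to outcome:

```python
# Minimal sketch: model each perverse donor as a map from the prediction P
# (in whole pounds) to the amount actually donated, then search for the
# self-confirming predictions, i.e. the fixed points where outcome(P) == P.

CEILING = 2_000_000  # the donor will give at most £2,000,000

def stingy_donor(p):
    """Writes a cheque for £1 less than predicted, never going below £0."""
    return max(p - 1, 0)

def generous_donor(p):
    """Writes a cheque for £1 more than predicted, capped at the ceiling."""
    return min(p + 1, CEILING)

def self_confirming(outcome):
    """All predictions in the allowed range that come out exactly true."""
    return [p for p in range(CEILING + 1) if outcome(p) == p]

print(self_confirming(stingy_donor))    # [0]        -> predict £0, receive £0
print(self_confirming(generous_donor))  # [2000000]  -> predict £2,000,000, receive it all
```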
Greek mythology is full of oracular predictions that only happened because people took steps to avoid them. So there is a big difference between “prediction P is true”, and “prediction P is true even if P is generally known”.
Continuity assumption
A prediction P is self-confirming if, once P is generally known, P will happen (or P is the expectation of what will then happen). The previous section has self-confirming predictions, but these don't always exist. They exist when the outcome is continuous in the prediction P (plus a few technical assumptions, such as the outcome taking values in a closed interval). If that assumption is violated, then there need not be any self-confirming prediction.
For example, the generous donor could give £(P+1), except if you ask for too much (more than £1,999,999), in which case you get nothing. In that case, there is no correct prediction £P (the same goes for the £(P-1) donor who will give you the maximum if you’re modest enough to ask for less than £1).
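Here is a sketch of the underlying fixed-point argument, in my own notation: if the outcome is a continuous function of the prediction on a closed interval, the intermediate value theorem guarantees a self-confirming prediction, and the cliff in the modified donor is exactly what breaks that guarantee.

```latex
% Sketch (my notation): let f : [0, M] -> [0, M] give the outcome as a
% function of the prediction P, and define
\[
  g(P) = f(P) - P, \qquad g(0) = f(0) \ge 0, \qquad g(M) = f(M) - M \le 0 .
\]
% If f is continuous, so is g, and by the intermediate value theorem some
% P^* satisfies g(P^*) = 0, i.e. f(P^*) = P^*: a self-confirming prediction.
% The modified generous donor has f(P) = P + 1 up to P = 1,999,999 and
% f(P) = 0 above that; the jump breaks continuity, so no fixed point is
% guaranteed (and indeed none exists).
```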
Prediction feedback loops
But the lack of a self-confirming prediction is not really the big problem. The big problem is that, as you attempt to refine your prediction (maybe you encounter perverse donors regularly), where you end up will not be determined by the background facts of the world (the donor's default generosity), but entirely by the feedback loop with your prediction. See here for a similar example in game theory.
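A minimal sketch of that feedback loop (the "predict whatever happened last time" update rule is my simplification of refining a prediction): wherever you start, the £(P-1) donor walks you to the extreme fixed point, and the £1,000,000 default never enters into it.

```python
# Minimal sketch of the feedback loop: keep refining the prediction by
# replacing it with whatever outcome the previous prediction produced.
# The donor map is the £(P-1) donor from the post; the naive "copy last
# time's outcome" update rule is my simplification of "refining".

def stingy_donor(p):
    return max(p - 1, 0)  # donates £1 less than predicted, never below £0

def refine(outcome, initial_prediction, max_rounds=3_000_000):
    p = initial_prediction
    for _ in range(max_rounds):
        result = outcome(p)
        if result == p:   # the prediction is now self-confirming: stop
            return p
        p = result        # otherwise, next time predict what actually happened
    return p

# Wherever the refinement starts, it ends at the extreme fixed point; the
# donor's £1,000,000 default plays no role.
print(refine(stingy_donor, 1_000_000))  # 0
print(refine(stingy_donor, 17))         # 0
```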
Sloppier predictions are no better
One obvious answer would be to allow sloppier predictions. For example, if we require that the prediction be "within £1 of the true value", then all values between £0 and £2,000,000 are equally valid; averaging those, we get £1,000,000, the same as would have happened without the prediction.
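A quick check of that claim, using the same toy model as above (the code is mine, not from the post): for the £(P-1) donor, every prediction in the range is already within £1 of the outcome, so the sloppier criterion rules nothing out, and the average of the qualifying predictions happens to be £1,000,000.

```python
# Quick check: for the £(P-1) donor, every prediction in the range is
# already "within £1" of the true outcome, so the sloppier criterion
# singles nothing out; the average of all qualifying predictions is £1,000,000.

def stingy_donor(p):
    return max(p - 1, 0)

qualifying = [p for p in range(2_000_001) if abs(stingy_donor(p) - p) <= 1]
print(len(qualifying))                    # 2000001 -- every prediction qualifies
print(sum(qualifying) / len(qualifying))  # 1000000.0
```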
But that's just a coincidence. We could have constructed the example so that only a certain region has "within £1" performance, while all others have "within £2" performance. More damningly, we could have defined "they’ve already written a cheque for £X" for absolutely any X, and it wouldn't have changed anything. So there is no link between the self-confirming prediction and what would have happened without the prediction. And making the self-confirming aspect weaker won't improve matters.
Real-world dangers
How often would scenarios like that happen in the real world? The donor example is convoluted, and feels very implausible; what kind of person is willing to donate around £1,000,000 if no predictions are made, but suddenly changes to £(P±1) if there is a prediction?
Donations normally spring from better thought-out processes, involving multiple actors, for specific purposes (helping the world, increasing a certain subculture or value, PR...). They are not normally so sensitive to predictions. And though there are cases where there are true self-confirming or self-fulfilling predictions (notably in politics), these tend to be areas which are pretty close to a knife-edge anyway, and could have gone in multiple directions, with the prediction giving them a small nudge in one direction.
So, though in theory there is no connection between a self-confirming prediction and what would have happened if the prediction had not been uttered, it seems that in practice they are not too far apart (for example, no donor can donate more money than they have, and they generally have their donation amount pretty fixed).
Though beware predictions like "what's the value of the most undervalued/overvalued stock on this exchange", where knowing the prediction will affect behaviour quite extensively. That is a special case of the next section; the "new approach" the prediction suggests is "buy/sell these stocks".
Predictions causing new approaches
There is one area where it is very plausible for a prediction to cause a huge effect, though, and that's when the prediction suggests the possibility of new approaches. Suppose I'm running a million-dollar company with a hundred thousand dollars in yearly profit, and ask a smart AI to predict my expected profit next year. The AI answers zero.
At that point, I'd be really tempted to give up, and go home (or invest/start a new company in a different area). The AI has foreseen some major problem, making my work useless. So I'd give up, and the company folds, thus confirming the prediction.
Or maybe the AI would predict ten million dollars of profit. What? Ten times more than the current capitalisation of the company? Something strange is going on. So I sift through the company's projects with great care. Most of them are solid and stolid, but one looks like a massive-risk-massive-reward gamble. I cancel all the other projects, and put everything into that, because that is the only scenario where I see ten million dollar profits being possible. And, with the unexpected new financing, the project takes off.
There are some more exotic scenarios, like an AI that predicts £192,116,518,914.20 profit. Separating that as 19;21;16;5;18;9;14;20 and replacing numbers with letters, this is SUPERINT: the AI is advising me to build a superintelligence, which, if I do, will grant me exactly the required profit to make that prediction true in expectation (and after that... well, then bad things might happen). Note that the AI need not be malicious; if it's smart enough and has good enough models, it might realise that £192,116,518,914.20 is self-confirming, without "aiming" to construct a superintelligence.
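A small check of the decoding in that example (the snippet is mine): splitting the digits of £192,116,518,914.20 into the groups 19, 21, 16, 5, 18, 9, 14, 20 and mapping 1=A through 26=Z does spell out the intended word.

```python
# Check of the decoding above: split the digits of £192,116,518,914.20 into
# the groups 19, 21, 16, 5, 18, 9, 14, 20 and map 1=A, 2=B, ..., 26=Z.
groups = [19, 21, 16, 5, 18, 9, 14, 20]
print("".join(chr(ord("A") + n - 1) for n in groups))  # SUPERINT
```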
All these examples share the feature that the prediction P causes a great change in behaviour. Our intuition that outcome-with-P and outcome-without-P should be similar is based on the idea that P does not change behaviour much.
Exotic corners
Part of the reason that AIs could be so powerful is that they could unlock new corners of strategy space, doing things that are inconceivable to us, to achieve objectives in ways we didn't think were possible.
A predicting AI is more constrained than that, because it can't act directly. But it can act indirectly, with its prediction causing us to unlock new corners of strategy space.
Would a purely predictive AI do that? Well, it depends on two things:
1. Whether there are exotic self-confirming predictions at all, and whether the self-confirming prediction ends up in that exotic area rather than near a standard outcome.
2. Whether the AI is capable of finding such a prediction.
For 1, it's very hard to tell; after all, in the example of this post and in the game-theory example, arbitrarily tiny misalignment at standard outcomes can push the self-confirming outcome arbitrarily far into the exotic area. I'd be nervous about trusting our intuitions here, because approximations don't help us. And the Quine-like "P causes the production of a superpowered AI that causes P to be true" seems like a perfect and exact exotic self-confirming prediction that works in almost all areas.
What about 2? Well, that's a practical barrier for many designs. If the AI is a simple sequence predictor without a good world-model, it might not be able to realise that there are exotic self-confirming predictions. A predictor that had been giving standard stock market predictions for all of its existence is unlikely to suddenly hit on a highly manipulative prediction.
But I fear scenarios where the AI gradually learns how to manipulate us. After all, even for standard scenarios, we will change our behaviour a bit, based on the prediction. The AI will learn to give the most self-confirming of these standard predictions, and so will gradually build up experience in manipulating us effectively (in particular, I'd expect the "zero profit predicted -> stockholders close the company" to become quite standard). The amount of manipulation may grow slowly, until the AI has a really good understanding of how to deal with the human part of the environment, and the exotic manipulations are just a continuation of what it's already been doing.