Followup to: How to Disentangle the Past and the Future
Some agents are memoryless, reacting to each new observation as it happens, without generating a persisting internal structure. When an LED observes voltage, it emits light, regardless of whether it did so a second earlier.
Other agents have very persistent memories. The internal structure of an amber nugget can remain unchanged by external conditions for millions of years.
Neither of these levels of memory persistence makes for very intelligent agents, because neither allows them to be good separators of the past and the future. Memoryless agents only have access to the most recent input of their sensors, which leaves them oblivious to the hidden internal structures of other things around them, and to the existence of things not around them. Unchanging agents, on the other hand, fail to entangle themselves with new evidence, which prevents them from keeping up to date with a changing world.
Intelligence requires observations. An intelligent agent needs to strike a delicate balance between the persistence of its internal structure and its susceptibility to new evidence. The best balance, a Bayesian update, has been explained many times before, and has been shown to be optimal at retaining information about the world. This post highlights yet another aspect of it: how much information such an update gains.
Suppose I predicted that a roll of a die will give 4, and then it did. How surprised would you be?
You may intuitively realize that the degree of your surprise should be a decreasing function of the probability that you assign to the event. Predicting a 4 on a fair die is more surprising than on one that is loaded in favor of 4. You may also want the measure of surprise to be extensive: if I repeated the feat a second time, you would be twice as surprised.
In that case, there's essentially one consistent notion of surprise (also called: self-information, surprisal). The amount of surprise that a random variable X has value x is
S(x) = -log Pr(X=x).
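This is the negative logarithm of the probability of the event X=x. Taking the logarithm in base 2 measures surprise in bits. Below, S(p) is also used as shorthand for the surprise of an event whose probability is p.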
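The two properties asked of surprise, decreasing in probability and extensive over repeated feats, can be checked with a few lines of Python. This is only a sketch: it assumes base-2 logarithms, and the die that shows 4 half the time is a made-up example, not something from the argument itself.

```python
import math

def surprisal(p: float) -> float:
    """Surprise, in bits, of an event that has probability p."""
    return -math.log2(p)

# Decreasing: a likelier event is less surprising.
fair_four = surprisal(1 / 6)    # a 4 on a fair die
loaded_four = surprisal(1 / 2)  # a 4 on a hypothetical die that shows 4 half the time

# Extensive: two independent predictions coming true add their surprises.
twice = surprisal((1 / 6) * (1 / 6))

print(fair_four > loaded_four)
print(math.isclose(twice, 2 * fair_four))
```

Both checks print `True`: the logarithm is what turns the product of independent probabilities into a sum of surprises.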
This is a very useful concept in information theory. For example, the entropy of the random variable X is the surprise we expect to have upon seeing its value. The mutual information between X and another random variable Y is the difference between how much we expect x to surprise us now, and how much we expect it to surprise us after we first look at y.
The surprise of seeing 4 when rolling a fair die is
S(1/6) = -log(1/6) ≈ 2.585 bit.
That's also the surprise of any other result, so that's also the entropy of a fair die roll.
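As a rough illustration of entropy as expected surprise, here is a small sketch. The loaded die is a hypothetical distribution chosen only to show that a more predictable die has lower entropy.

```python
import math

def entropy(dist):
    """Expected surprise, in bits, of a distribution given as a list of probabilities."""
    return sum(-p * math.log2(p) for p in dist if p > 0)

fair_die = [1 / 6] * 6
# Hypothetical loaded die: 4 comes up half the time, the other faces share the rest.
loaded_die = [0.1, 0.1, 0.1, 0.5, 0.1, 0.1]

print(round(entropy(fair_die), 3))              # 2.585
print(entropy(loaded_die) < entropy(fair_die))  # True
```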
By the way, the objective surprise of a specific result is the same whether or not I actually announce it as a prediction. The reason you don't "feel surprised" when no prediction is made is that your intuition evolved to successfully avoid hindsight bias. As we'll see in a moment, not all surprise is useful evidence.
The Bayesian update is optimal in that it gives the agent the largest possible gain in information about the world. Exactly how much information is gained?
We might ask this question when we choose between an agent that doesn't update at all and one that updates perfectly: this is what we stand to gain, and we can compare it with the cost (in energy, computation power, etc.) of actually performing the update.
We can also ask this question when we consider what observations to make. Not updating on new evidence is equivalent, in terms of information, to not gathering it in the first place. So the benefit of gathering some more evidence is, at most, the benefit of subsequently using it in a Bayesian update.
Suppose that at time t the world is in a state Wt, and that the agent may look at it and make an observation Ot. Objectively, the surprise of this observation would be
Sobj = S(Ot|Wt) = -log Pr(Ot|Wt).
However, the agent doesn't know the state of the world. Before seeing the observation at time t, the agent has its own memory state Mt-1, which is entangled with the state of the world through past observations, but is not enough for the agent to know everything about the world.
The agent has a subjective surprise upon seeing Ot, which is the result of its own private prior:
Ssubj = S(Ot|Mt-1) = -log Pr(Ot|Mt-1),
and this may be significantly different from the objective surprise.
Interestingly, the amount of information that the agent stands to gain by making the new observation and perfectly updating on it into a new memory state Mt is exactly equal to the agent's expected oversurprise upon seeing the evidence, that is, the expected excess of its subjective surprise over the objective surprise. More generally, any update from Mt-1 to Mt gains at most this much information:
I(Wt;Mt) - I(Wt;Mt-1) ≤ E[Ssubj - Sobj],
and this holds with equality when the update is Bayesian. (The math is below in a comment.)
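The bound can be checked numerically on a toy model. The sketch below is an illustration under assumptions of my own: the world is a single hidden bit, the old memory and the new observation are two noisy readings of it that are independent given the world, and all the probabilities are made-up numbers. It compares three update rules: a Bayesian one (the new memory is the posterior), a memoryless one (keep only the observation), and an unchanging one (ignore the observation).

```python
import math
from itertools import product

# Hypothetical toy numbers: W is a hidden bit, and the old memory M and the new
# observation O are two noisy readings of it, independent given W.
P_W = {0: 0.5, 1: 0.5}
P_M_GIVEN_W = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # Pr(M=m | W=w)
P_O_GIVEN_W = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}  # Pr(O=o | W=w)

joint = {
    (w, m, o): P_W[w] * P_M_GIVEN_W[w][m] * P_O_GIVEN_W[w][o]
    for w, m, o in product((0, 1), repeat=3)
}

def information_about_world(update):
    """I(W; update(m, o)) in bits, where update maps (m, o) to a new memory."""
    p_wx, p_w, p_x = {}, {}, {}
    for (w, m, o), p in joint.items():
        x = update(m, o)
        p_wx[(w, x)] = p_wx.get((w, x), 0.0) + p
        p_w[w] = p_w.get(w, 0.0) + p
        p_x[x] = p_x.get(x, 0.0) + p
    return sum(p * math.log2(p / (p_w[w] * p_x[x])) for (w, x), p in p_wx.items())

def expected_oversurprise():
    """E[Ssubj - Sobj] = E[log Pr(O|W) - log Pr(O|M)] in bits."""
    p_m, p_mo = {}, {}
    for (w, m, o), p in joint.items():
        p_m[m] = p_m.get(m, 0.0) + p
        p_mo[(m, o)] = p_mo.get((m, o), 0.0) + p
    return sum(
        p * (math.log2(P_O_GIVEN_W[w][o]) - math.log2(p_mo[(m, o)] / p_m[m]))
        for (w, m, o), p in joint.items()
    )

def bayesian(m, o):
    """New memory = posterior probability that W is 1, given m and o."""
    return joint[(1, m, o)] / (joint[(0, m, o)] + joint[(1, m, o)])

before = information_about_world(lambda m, o: m)  # I(W; M) with the old memory
bound = expected_oversurprise()
gains = {
    "bayesian": information_about_world(bayesian) - before,
    "memoryless": information_about_world(lambda m, o: o) - before,
    "unchanging": information_about_world(lambda m, o: m) - before,
}
for name, gain in gains.items():
    print(f"{name:<10} gain {gain:+.4f} bit (bound {bound:.4f} bit)")
```

In this toy model the Bayesian gain meets the bound, the memoryless rule gains less, and the unchanging rule gains nothing.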
For example, let's say I have two coins: one is fair, and the other turns up heads with probability 0.7. I don't know which coin is which, so I toss one of them at random. I expect it to show heads with probability
0.5 / 2 + 0.7 / 2 = 0.6,
so if it does I'll be surprised S(0.6) ≈ 0.737 bit, and if it doesn't I'll be surprised S(0.4) ≈ 1.322 bit. On average, I expect to be surprised
0.6 * S(0.6) + 0.4 * S(0.4) ≈ 0.971 bit.
The objective surprise depends on how the world really is. If I toss the fair coin, the objective surprise is S(0.5) = 1 bit for each of the two outcomes, and the expectation is 1 bit too. If I toss the loaded coin, the objective surprise is S(0.7) ≈ 0.515 bit for heads and S(0.3) ≈ 1.737 bit for tails, for an average of
0.7 * S(0.7) + 0.3 * S(0.3) ≈ 0.881 bit.
See how I'm a little undersurprised for the fair coin, but oversurprised for the loaded one. On average I'm oversurprised by
0.971 - (1 / 2 + 0.881 / 2) = 0.0305 bit.
By observing which side the coin turns up, I gain 0.971 bit of information, but most of it is just about this specific toss. Only 0.0305 bit of that information, my oversurprise, goes beyond the specific toss to teach me something about the coin itself, if I update perfectly.
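The arithmetic of the two-coin example can be reproduced with a short script. As a cross-check, the sketch also computes how much a Bayesian update on the outcome reduces my uncertainty about which coin I hold. The helper names are my own, and base-2 logarithms are assumed.

```python
import math

def entropy(probs):
    """Expected surprise, in bits, of a distribution given as a list of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

coins = {"fair": 0.5, "loaded": 0.7}  # Pr(heads) for each coin
prior = {"fair": 0.5, "loaded": 0.5}  # chance that I picked each coin

# Subjective expected surprise: I only know the mixture probability of heads.
p_heads = sum(prior[c] * coins[c] for c in coins)
subjective = entropy([p_heads, 1 - p_heads])

# Objective expected surprise, averaged over which coin is really in my hand.
objective = sum(prior[c] * entropy([coins[c], 1 - coins[c]]) for c in coins)
oversurprise = subjective - objective

# Cross-check: uncertainty about the coin before the toss, minus the expected
# uncertainty after a Bayesian update on the outcome.
def posterior(p_outcome_given_coin, p_outcome):
    return [prior[c] * p_outcome_given_coin[c] / p_outcome for c in coins]

after_heads = posterior(coins, p_heads)
after_tails = posterior({c: 1 - coins[c] for c in coins}, 1 - p_heads)
info_gain = entropy(list(prior.values())) - (
    p_heads * entropy(after_heads) + (1 - p_heads) * entropy(after_tails)
)

print(f"subjective   {subjective:.4f} bit")
print(f"objective    {objective:.4f} bit")
print(f"oversurprise {oversurprise:.4f} bit")
print(f"info gain    {info_gain:.4f} bit")
```

Carried at full precision, the oversurprise and the information gained about the coin agree at about 0.0303 bit. The 0.0305 figure in the text comes from subtracting already-rounded intermediate values.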
We are not interested in observations for their own sake, only inasmuch as they teach us about the world. Winning the lottery is very surprising, but nobody gambles for epistemological reasons. When the odds are known in advance, the subjective surprise is exactly matched by the objective surprise - you are not oversurprised, and you learn nothing useful.
The oversurprise is the "spillover" of surprise beyond what is just about the observation, and onto the world. It's the part of the surprise that comes from uncertainty about the world, rather than from the randomness of the observation itself. And that's how much the observation teaches us about the world.
An interesting corollary is that you can never expect to be undersurprised, i.e. less surprised than you objectively should be. That's the same as saying that a Bayesian update can never lose information.
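A quick numerical experiment makes the corollary tangible. The sketch below assumes a family of made-up worlds: between two and five coins with random biases and a random prior over which coin is tossed. It looks for any case where the expected oversurprise goes negative. The only slack allowed is a tiny tolerance for floating-point rounding.

```python
import math
import random

def entropy(probs):
    """Expected surprise, in bits, of a distribution given as a list of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def expected_oversurprise(prior, heads_probs):
    """E[Ssubj - Sobj] in bits for one toss of a coin drawn according to prior."""
    p_heads = sum(w * h for w, h in zip(prior, heads_probs))
    subjective = entropy([p_heads, 1 - p_heads])
    objective = sum(w * entropy([h, 1 - h]) for w, h in zip(prior, heads_probs))
    return subjective - objective

random.seed(0)  # fixed seed so the run is repeatable
worst = math.inf
for _ in range(10_000):
    n = random.randint(2, 5)
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    prior = [w / total for w in weights]
    heads_probs = [random.random() for _ in range(n)]
    worst = min(worst, expected_oversurprise(prior, heads_probs))

print(worst >= -1e-12)
```

A random search is of course not a proof, only a sanity check. When the prior puts all its weight on one coin, `expected_oversurprise([1.0], [0.7])` comes out as zero, matching the case where the odds are known in advance.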
The next time you find yourself wondering how best to observe the world and gather information, ask this: how much more surprised do you expect to be than someone who already knows what you wish to know?
Continue reading: Update Then Forget
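When the new memory state Mt is generated by a Bayesian update from the previous one Mt-1 and the new observation Ot, it's a sufficient statistic of these information sources for the world state Wt, so that Mt keeps all the information about the world that was remembered or observed:
I(Wt;Mt) = I(Wt;Mt-1,Ot).
As this is all the information available, other ways to update can only have less information.
The amount of information gained by a Bayesian update is
I(Wt;Mt) - I(Wt;Mt-1) = I(Wt;Mt-1,Ot) - I(Wt;Mt-1)
= I(Wt;Ot|Mt-1)
= H(Ot|Mt-1) - H(Ot|Wt,Mt-1),
where H denotes conditional entropy, and because the observation only depends on the world, this equals
H(Ot|Mt-1) - H(Ot|Wt)
= E[S(Ot|Mt-1)] - E[S(Ot|Wt)]
= E[Ssubj - Sobj].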