What's a perfect agent? No one is infallible, except the Pope.
How do you reconcile
When faced with new evidence, an intelligent agent should update on it and then forget it.
with
We can use the actual data we gather to introspect on our faulty reasoning.
given that you have discarded the data which led to the faulty reasoning? How do you know when it's safe to discard? In your example
I'd hate to forget how I stacked the deck, but only in those cards that are actually in play.
If you forget the discarded cards, and later realize that you may have an incorrect map of the deck, aren't you SOL?
an intelligent agent should update on it and then forget it.
Should being the operative word. This refers to a "perfect" agent (emphasis added in text; thanks!).
People don't do this, as well they shouldn't, because we update poorly and need the original data to compensate.
If you forget the discarded cards, and later realize that you may have an incorrect map of the deck, aren't you SOL?
If I remember the cards in play, I don't care about the discarded ones. If I don't, the discarded cards could help a bit, but that's not the heart of my problem.
Update Then Forget
Followup to: How to Be Oversurprised
A Bayesian update need never lose information. In a dynamic world, though, the update is only half the story. The other half, where the agent takes an action and predicts its result, may indeed "lose" information in some sense.
We have a dynamical system which consists of an agent and the world around it. It's often useful to describe the system in discrete time steps, and insightful to split them into half-steps where both parties (agent and world) take turns changing and affecting each other.
The agent's update is the half-step in which the agent changes. It takes in an observation Ot of the world (t is for time), and uses it to improve its understanding of the world. In doing so, its internal changeable parts change from some configuration (memory state) Mt-1 that they had before, to a new configuration Mt.
The other half-step is when the world changes, and the agent predicts the result of that change. The change from a previous world state Wt to a new one Wt+1 may depend on the action At that the agent takes.
In changing itself, the agent cares about information. There's a clear way - the Bayesian Way - to do it optimally, keeping all of the available information about the world.
In changing the world, the agent cares about rewards, the one it will get now and the ones to come later, possibly much later. The need to make long-term plans with only partial information about the world makes it very hard to be optimal.
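To make the two half-steps concrete, here's a minimal sketch in Python. The function names, the toy dynamics, and the sensor model are all mine, invented for illustration; `belief` plays the role of the memory state.

```python
def update(belief, observation, obs_model):
    """Update half-step: the agent changes, the world doesn't (Bayes' rule)."""
    posterior = {w: p * obs_model(observation, w) for w, p in belief.items()}
    z = sum(posterior.values())
    return {w: p / z for w, p in posterior.items()}

def predict(belief, action, dynamics):
    """Prediction half-step: the world changes, the agent doesn't."""
    next_belief = {}
    for w, p in belief.items():
        for w2, q in dynamics(w, action).items():
            next_belief[w2] = next_belief.get(w2, 0.0) + p * q
    return next_belief

# Toy two-state world: the state persists with probability 0.9,
# and a noisy sensor reports the true state with probability 0.8.
dynamics = lambda w, a: {w: 0.9, 1 - w: 0.1}
obs_model = lambda o, w: 0.8 if o == w else 0.2

belief = {0: 0.5, 1: 0.5}                   # M_{t-1}
belief = update(belief, 1, obs_model)       # agent's turn: take in O_t = 1
belief = predict(belief, "wait", dynamics)  # world's turn: W_t -> W_{t+1}
print(belief)  # belief over W_{t+1}: roughly {0: 0.26, 1: 0.74}
```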

Last week we quantified how much information is gained in the update half-step:
I(Wt;Mt) - I(Wt;Mt-1)
(where I is the mutual information). Quantifying how much information is discarded in the prediction half-step is complicated by the action: at the same time that the agent predicts the next world state, it also affects it. The agent can have acausal information about the future by virtue of creating it.
So last week's counterpart
I(Wt;Mt) - I(Wt+1;Mt)
is interesting, but not what we want to study here. To understand what kind of information reduction we do mean here, let's put aside the issue of prediction-through-action, and ask: why would the agent lose any information at all when only the world changes, not the agent itself?
The agent may have some information about the previous world state that simply no longer applies to the changed state. This never happens if the dynamics of the world are reversible, but our middle world is irreversible for thermodynamic reasons: different starting macrostates can lead to the same resulting macrostate.
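This information loss can be computed directly in a toy case. In this sketch (the numbers and the `mutual_info` helper are mine, for illustration), the agent's memory knows the world state exactly, and then an irreversible map merges two macrostates into one:

```python
from collections import defaultdict
from math import log2

def mutual_info(pxy):
    """I(X;Y) in bits from a dict {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in pxy.items():
        px[x] += p; py[y] += p
    return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

states = [0, 1, 2, 3]                         # four equally likely world states
before = {(w, w): 1 / 4 for w in states}      # memory M = W: perfect knowledge
after = {(w // 2, w): 1 / 4 for w in states}  # irreversible map W -> W // 2

print(mutual_info(before), mutual_info(after))  # 2.0 bit -> 1.0 bit
```

The agent's memory hasn't changed at all, yet half of its information content stopped being about the world.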
Example. We're playing a hand of poker. Your betting and your other actions are part of my observations, and they reveal information about your hidden cards and your personality. If you're making a large bet, your hand seems more likely to be stronger, and you seem to be more loose and aggressive.
Now there's a showdown. You reveal your hand, collapsing that part of the world state that was hidden from me, and at once a large chunk of information spills over from this observation to your personality. Perhaps your hand is not that strong after all, and I learn that you are more likely than I previously thought to be really loose, and a bluffer to boot.
And now I'm shuffling. The order of the cards loses any entanglement with the hand we just played. During that last round, I gained perhaps 50 bit of information about the deck, to be used in prediction and decision making. Only a small portion of this information, no more than 2 or 3 bit if we're strangers, had spilled over to reflect on your personality; the other 40-something bit of information is completely useless now, completely irrelevant for the future, and I can safely forget it. This is the information "lost" by my action of shuffling.
Or maybe I'm a card sharp, and I'm stacking the deck instead of truly shuffling it. In this case I can actually gain information about the deck. But even if I replace 50 known bits about the deck with 100 known bits, these aren't the "same" bits - they are independent bits. The way I'm loading the deck (if I'm a skilled prestidigitator) has little to do with the way it was before, or with the cards I saw during play, or with what it teaches me of your strategy.
This is why
I(Wt;Mt) - I(Wt+1;Mt)
is not a good measure of how much information about the world an agent discards in the action-prediction half-step. The agent can actually have more information after the action than before, and still have some irrelevant information it is free to discard.
To be clear: the agent is unmodified by the action-prediction half-step. Only the world changes. So the agent doesn't discard information by wiping any bits from memory. Rather, the information content of the agent's memory state simply stops being about the world. In the update that follows, a good agent (say, a Bayesian one) can - and therefore should - lose that useless information content.
A better way to measure how much information in Mt becomes useless is this:
I(Wt,At;Mt) - I(Wt+1;Mt)
This is information not just about the world, but also about the action. Once the action is completed, this information is also useless - of course, except for the part of it that is still about the world! I'd hate to forget how I stacked the deck, but only in those cards that are actually in play.
This nifty trick shows us why information about the world (and the action) must always be lost by the irreversible transition from Wt and At to Wt+1. The former pair separates the agent from the latter, such that any information about the next world state must go through what is known about the previous one and the action (see the figure above). Formally, the Data Processing Inequality implies that the amount of information lost is nonnegative, since Mt and Wt+1 are independent given (Wt,At).
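The Data Processing Inequality can be checked numerically. Here's a sketch with a made-up joint distribution over (Mt, Wt, At, Wt+1) that has the required Markov structure: the agent acts on its memory, and the world's transition depends only on the state and the action.

```python
from collections import defaultdict
from math import log2

# Build the joint distribution. All probabilities are invented for illustration.
joint = defaultdict(float)
for w in (0, 1):                              # W_t, uniform
    for m in (0, 1):
        pm = 0.9 if m == w else 0.1           # memory is correlated with the world
        a = m                                 # deterministic policy: act on memory
        stay = 0.95 if a == 0 else 0.6        # the action changes the dynamics
        for w2 in (0, 1):                     # W_{t+1}
            pw2 = stay if w2 == w else 1 - stay
            joint[(m, w, a, w2)] += 0.5 * pm * pw2

def mutual_info(joint, x_idx, y_idx):
    """I(X;Y) in bits from a joint table, marginalizing out the rest."""
    pxy, px, py = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        x = tuple(outcome[i] for i in x_idx)
        y = tuple(outcome[i] for i in y_idx)
        pxy[(x, y)] += p; px[x] += p; py[y] += p
    return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

before = mutual_info(joint, (0,), (1, 2))   # I(M_t; W_t, A_t)
after = mutual_info(joint, (0,), (3,))      # I(M_t; W_{t+1})
print(before >= after)  # True: information about (W_t, A_t) can only shrink
```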
As a side benefit, we see why we shouldn't bother specifying our actions beyond what is actually about the world. Any information processing that isn't effective in shaping the future is just going to waste.
When faced with new evidence, an intelligent agent should ideally update on it and then forget it. The update is always about the present. The evidence remains entangled with the past, more and more distant as time goes by. Whatever part of it stops being true must be discarded. (Don't confuse this with the need to remember things which are part of the present, only currently hidden.)
People don't do this. We update too seldom, and instead we remember too much in a wishful hope to update at some later convenience.
There are many reasons for us to remember the past, given our shortcomings. We can use the actual data we gather to introspect on our faulty reasoning. Stories of the past help us communicate our experiences, allowing others to update on shared evidence much more reliably than if we just tried to convey our current beliefs.
Optimally, evidence should only be given its due weight in due time, no more and no later. Arguments should not be recycled. The study of history should control our anticipation.
The data is not falsifiable; only the conclusions are - relevant to the world, predictive, and useful.
if you really care about the values on that list, then there are linear aggregations
Of course existence doesn't mean that we can actually find these coefficients. Even if you have only 2 well-defined value functions, finding an optimal tradeoff between them is generally computationally hard.
Suppose that at time t the world is in a state Wt, and that the agent may look at it and make an observation Ot. Objectively, the surprise of this observation would be Sobj = S(Ot|Wt) = -log Pr(Ot|Wt).
One note on philosophy of probability: if the world is in state Wt, what does it mean to say that an observation Ot has some probability given Wt? Surely all observations have probability 1 if the state of the world is exhaustively known.
Philosophically, yes.
Practically, it may be useful to distinguish between a coin and a toss. The coin has persisting features which make it either fair or loaded for a long time, with correlation between past and future. The toss is transient, and essentially all information about it is lost when I put the coin away - except through the memory of agents.
So yes, the toss is a feature of the present state of the world. But it has the very special property, that given the bias of the coin, the toss is independent of the past and the future. It's sometimes more useful to treat a feature like that as an observation external to the world, but of course it "really" isn't.
Thanks for the post; I particularly enjoyed the one-sentence takeaway at the end. One criticism though: you use mathematical notation like I(Wt;Mt) without saying what it denotes. Even though that can be inferred from the surroundings, it would be less likely to confuse if you stated it explicitly.
I'm trying to balance between introducing terminology to new readers and not boring those who've read my previous posts. Thanks for the criticism, I'll use it (and its upvotes) to correct my balance.
This one was fun for the math. Thank you. The practical advice is pretty prosaic - study the things you're most uncertain about.
Well, thank you!
Yes, I do this more for the math and the algorithms than for advice for humans.
Still, the advice is perhaps not so trivial: study not what you're most uncertain about (highest entropy given what you know) but those things with entropy generated by what you care about. And even this advice is incomplete - there's more to come.
When the new memory state Mt is generated by a Bayesian update from the previous one Mt-1 and the new observation Ot, it's a sufficient statistic of these information sources for the world state Wt, so that Mt keeps all the information about the world that was remembered or observed:
I(Wt;Mt) = I(Wt;Mt-1,Ot).
As this is all the information available, other ways to update can only have less information.
The amount of information gained by a Bayesian update is
I(Wt;Mt) - I(Wt;Mt-1) = I(Wt;Mt-1,Ot) - I(Wt;Mt-1) = I(Wt;Ot|Mt-1),
and because the observation only depends on the world, H(Ot|Wt) = H(Ot|Wt,Mt-1), so that
E[Ssubj - Sobj] = H(Ot|Mt-1) - H(Ot|Wt,Mt-1) = I(Wt;Ot|Mt-1),
which is exactly the information gained.
How to Be Oversurprised
Followup to: How to Disentangle the Past and the Future
Some agents are memoryless, reacting to each new observation as it happens, without generating a persisting internal structure. When an LED observes voltage, it emits light, regardless of whether it did so a second earlier.
Other agents have very persistent memories. The internal structure of an amber nugget can remain unchanged by external conditions for millions of years.
Neither of these levels of memory persistence makes for very intelligent agents, because neither allows them to be good separators of the past and the future. Memoryless agents only have access to the most recent input of their sensors, which leaves them oblivious to the hidden internal structures of other things around them, and to the existence of things not around them. Unchanging agents, on the other hand, fail to entangle themselves with new evidence, which prevents them from keeping up to date with a changing world.
Intelligence requires observations. An intelligent agent needs to strike a delicate balance between the persistence of its internal structure and its susceptibility to new evidence. The optimal balance, a Bayesian update, has been explained many times before, and was shown to be optimal in keeping information about the world. This post highlights yet another aspect.
Suppose I predicted that a roll of a die will give 4, and then it did. How surprised would you be?
You may intuitively realize that the degree of your surprise should be a decreasing function of the probability that you assign to the event. Predicting a 4 on a fair die is more surprising than on one that is loaded in favor of 4. You may also want the measure of surprise to be extensive: if I repeated the feat a second time, you would be twice as surprised.
In that case, there's essentially one consistent notion of surprise (also called self-information or surprisal). The surprise at finding that a random variable X has the value x is
S(x) = -log Pr(X=x).
This is the negative logarithm of the probability of the event X=x.
This is a very useful concept in information theory. For example, the entropy of the random variable X is the surprise we expect to have upon seeing its value. The mutual information between X and another random variable Y is the difference between how much we expect x to surprise us now, and how much we expect it to surprise us after we first look at y.
The surprise of seeing 4 when rolling a fair die is
S(1/6) = -log(1/6) ≈ 2.585 bit.
That's also the surprise of any other result, so that's also the entropy of a roll of a fair die.
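As a quick check, here's the computation in Python (a minimal sketch; the `surprise` helper is mine):

```python
from math import log2

def surprise(p: float) -> float:
    """Surprisal (self-information) of an event with probability p, in bits."""
    return -log2(p)

s = surprise(1 / 6)   # surprise of any one result of a fair die, ≈ 2.585 bit

# The entropy is the expected surprise, and since every result is equally
# surprising, it coincides with the surprise of a single result.
entropy = sum((1 / 6) * surprise(1 / 6) for _ in range(6))
print(round(s, 3), round(entropy, 3))  # 2.585 2.585
```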
By the way, the objective surprise of a specific result is the same whether or not I actually announce it as a prediction. The reason you don't "feel surprised" when no prediction is made, is that your intuition evolved to successfully avoid hindsight bias. As we'll see in a moment, not all surprise is useful evidence.
The Bayesian update is optimal in that it gives the agent the largest possible gain in information about the world. Exactly how much information is gained?
We might ask this question when we choose between an agent that doesn't update at all and one that updates perfectly: this is what we stand to gain, and we can compare it with the cost (in energy, computation power, etc.) of actually performing the update.
We can also ask this question when we consider what observations to make. Not updating on new evidence is equivalent, in terms of information, to not gathering it in the first place. So the benefit of gathering some more evidence is, at most, the benefit of subsequently using it in a Bayesian update.
Suppose that at time t the world is in a state Wt, and that the agent may look at it and make an observation Ot. Objectively, the surprise of this observation would be
Sobj = S(Ot|Wt) = -log Pr(Ot|Wt).
However, the agent doesn't know the state of the world. Before seeing the observation at time t, the agent has its own memory state Mt-1, which is entangled with the state of the world through past observations, but is not enough for the agent to know everything about the world.
The agent has a subjective surprise upon seeing Ot, which is the result of its own private prior:
Ssubj = S(Ot|Mt-1) = -log Pr(Ot|Mt-1),
and this may be significantly different from the objective surprise.
Interestingly, the amount of information that the agent stands to gain by making the new observation, and perfectly updating on it in a new memory state Mt, is exactly equal to the agent's expected oversurprise upon seeing the evidence. That is, any update from Mt-1 to Mt gains at most this much information:
I(Wt;Mt) - I(Wt;Mt-1) ≤ E[Ssubj - Sobj],
and this holds with equality when the update is Bayesian. (The math is below in a comment.)
For example, let's say I have two coins, one of them fair and the other turns up heads with probability 0.7. I don't know which coin is which, so I toss one of them at random. I expect it to show heads with probability
0.5 / 2 + 0.7 / 2 = 0.6,
so if it does I'll be surprised S(0.6) ≈ 0.737 bit, and if it doesn't I'll be surprised S(0.4) ≈ 1.322 bit. On average, I expect to be surprised
0.6 * S(0.6) + 0.4 * S(0.4) ≈ 0.971 bit.
The objective surprise depends on how the world really is. If I toss the fair coin, the objective surprise is S(0.5) = 1 bit for each of the two outcomes, and the expectation is 1 bit too. If I toss the loaded coin, the objective surprise is S(0.7) ≈ 0.515 bit for heads and S(0.3) ≈ 1.737 bit for tails, for an average of
0.7 * S(0.7) + 0.3 * S(0.3) ≈ 0.881 bit.
See how I'm a little undersurprised for the fair coin, but oversurprised for the loaded one. On average I'm oversurprised by
0.971 - (1 / 2 + 0.881 / 2) ≈ 0.030 bit.
By observing which side the coin turns up, I gain 0.971 bit of information, but most of it is just about this specific toss. Only about 0.030 bit of that information, my oversurprise, goes beyond the specific toss to teach me something about the coin itself, if I update perfectly.
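The numbers above can be reproduced in a few lines of Python. This is a sketch; note that with unrounded intermediate values the oversurprise comes out as ≈ 0.0303 bit.

```python
from math import log2

def surprise(p):
    return -log2(p)

def h(p):
    """Entropy of one toss with heads-probability p, in bits."""
    return p * surprise(p) + (1 - p) * surprise(1 - p)

p_fair, p_loaded = 0.5, 0.7
p_heads = (p_fair + p_loaded) / 2      # my prior for heads: 0.6

subj = h(p_heads)                      # expected subjective surprise ≈ 0.971
obj = (h(p_fair) + h(p_loaded)) / 2    # expected objective surprise ≈ 0.941
oversurprise = subj - obj              # ≈ 0.0303 bit

# This equals the mutual information between which coin I tossed and the
# outcome, i.e. the information a perfect Bayesian update gains about the coin.
mi = h(p_heads) - obj
print(round(subj, 3), round(obj, 3), round(oversurprise, 4))
```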
We are not interested in evidence for its own sake, only inasmuch as it teaches us about the world. Winning the lottery is very surprising, but nobody gambles for epistemological reasons. When the odds are known in advance, the subjective surprise is exactly matched by the objective surprise - you are not oversurprised, and you learn nothing useful.
The oversurprise is the "spillover" of surprise beyond what is just about the observation, and onto the world. It's the part of the narrowness that originates in the world, not in the observation. And that's how much the observation teaches us about the world.
An interesting corollary is that you can never expect to be undersurprised, i.e. less surprised than you objectively should be. That's the same as saying that a Bayesian update can never lose information.
The next time you find yourself wondering how best to observe the world and gather information, ask this: how much more surprised do you expect to be than someone who already knows what you wish to know?
Continue reading: Update Then Forget
I've been enjoying this series, but feel like I could get more out of it if I had more of an information theory background. Is there a particular textbook you would recommend? Thanks
Thanks!
The best book is doubtlessly Elements of Information Theory by Cover and Thomas. It's very clear (to someone with some background in math or theoretical computer science) and lays very strong introductory foundations before giving a good overview of some of the deeper aspects of the theory.
It's fortunate that many concepts of information theory share much of their everyday meaning. This way I can explain the new theory (popularized here for the first time) without formally defining these concepts.
I'm planning another sequence where these and other concepts will be expressed in the philosophical framework of this community. But I should've realized that some readers would be interested in a complete mathematical introduction. That book is what you're looking for.