## Update Then Forget

9 17 January 2013 06:36PM

Followup to: How to Be Oversurprised

A Bayesian update needs never lose information. In a dynamic world, though, the update is only half the story. The other half, where the agent takes an action and predicts its result, may indeed "lose" information in some sense.

We have a dynamical system which consists of an agent and the world around it. It's often useful to describe the system in discrete time steps, and insightful to split them into half-steps where both parties (agent and world) take turns changing and affecting each other.

The agent's update is the half-step in which the agent changes. It takes in an observation Ot of the world (t is for time), and uses it to improve its understanding of the world. In doing so, its internal changeable parts change from some configuration (memory state) Mt-1 that they had before, to a new configuration Mt.

The other half-step is when the world changes, and the agent predicts the result of that change. The change from a previous world state Wt to a new one Wt+1 may depend on the action At that the agent takes.

In changing itself, the agent cares about information. There's a clear way - the Bayesian Way - to do it optimally, keeping all of the available information about the world.

In changing the world, the agent cares about rewards, the one it will get now and the ones to come later, possibly much later. The need to make long-term plans with only partial information about the world makes it very hard to be optimal.

Last week we quantified how much information is gained in the update half-step:

I(Wt;Mt) - I(Wt;Mt-1)

(where I is the mutual information). Quantifying how much information is discarded in the prediction half-step is complicated by the action: at the same time that the agent predicts the next world state, it also affects it. The agent can have acausal information about the future by virtue of creating it.

So last week's counterpart

I(Wt;Mt) - I(Wt+1;Mt)

is interesting, but not what we want to study here. To understand what kind of information reduction we do mean here, let's put aside the issue of prediction-through-action, and ask: why would the agent lose any information at all when only the world changes, not the agent itself?

The agent may have some information about the previous world state that simply no longer applies to the changed state. This never happens if the dynamics of the world is reversible, but middle world is irreversible for thermodynamic reasons. Different starting macrostates can lead to the same resulting macrostate.

Example. We're playing a hand of poker. Your betting and your other actions are part of my observations, and they reveal information about your hidden cards and your personality. If you're making a large bet, your hand seems more likely to be stronger, and you seem to be more loose and aggressive.

Now there's a showdown. You reveal your hand, collapsing that part of the world state that was hidden from me, and at once a large chunk of information spills over from this observation to your personality. Perhaps your hand is not that strong after all, and I learn that you are more likely than I previously thought to be really loose, and a bluffer to boot.

And now I'm shuffling. The order of the cards loses any entanglement with the hand we just played. During that last round, I gained perhaps 50 bit of information about the deck, to be used in prediction and decision making. Only a small portion of this information, no more than 2 or 3 bit if we're strangers, had spilled over to reflect on your personality; the other 40-something bit of information is completely useless now, completely irrelevant for the future, and I can safely forget it. This is the information "lost" by my action of shuffling.

Or maybe I'm a card sharp, and I'm stacking the deck instead of truly shuffling it. In this case I can actually gain information about the deck. But even if I replace 50 known bits about the deck with 100 known bits, these aren't the "same" bits - they are independent bits. The way I'm loading the deck (if I'm a skilled prestidigitator) has little to do with the way it was before, or with the cards I saw during play, or with what it teaches me of your strategy.

This is why

I(Wt;Mt) - I(Wt+1;Mt)

is not a good measure of how much information about the world an agent discards in the action-prediction half-step. It can actually have more information after the action than before, but still be free to discard some irrelevant information.

To be clear: the agent is unmodified by the action-prediction half-step. Only the world changes. So the agent doesn't discard information by wiping any bits from memory. Rather, the information content of the agent's memory state simply stops being about the world. In the update that follows, a good agent (say, a Bayesian one) can - and therefore should - lose that useless information content.

A better way to measure how much information in Mt becomes useless is this:

I(Wt,At;Mt) - I(Wt+1;Mt)

This is information not just about the world, but also about the action. Once the action is completed, this information is also useless - of course, except for the part of it that is still about the world! I'd hate to forget how I stacked the deck, but only in those cards that are actually in play.

This nifty trick shows us why information about the world (and the action) must always be lost by the irreversible transition from Wt and At to Wt+1. The former pair separates the agent from the latter, such that any information about the next world state must go through what is known about the previous one and the action (see the figure above). Formally, the Data Processing Inequality implies that the amount of information lost is nonnegative, since Mt and Wt+1 are independent given (Wt,At).

As a side benefit, we see why we shouldn't bother specifying our actions beyond what is actually about the world. Any information processing that isn't effective in shaping the future is just going to waste.

When faced with new evidence, an intelligent agent should ideally update on it and then forget it. The update is always about the present. The evidence remains entangled with the past, more and more distant as time goes by. Whatever part of it stops being true must be discarded. (Don't confuse this with the need to remember things which are part of the present, only currently hidden.)

People don't do this. We update too seldom, and instead we remember too much in a wishful hope to update at some later convenience.

There are many reasons for us to remember the past, given our shortcomings. We can use the actual data we gather to introspect on our faulty reasoning. Stories of the past help us communicate our experiences, allowing others to update on shared evidence much more reliably than if we just tried to convey our current beliefs.

Optimally, evidence should only be given its due weight in due time, no more and no later. Arguments should not be recycled. The study of history should control our anticipation.

The data is not falsifiable, only the conclusions are - relevant to the world, predictive and useful.

## How to Be Oversurprised

13 07 January 2013 04:02AM

Followup to: How to Disentangle the Past and the Future

Some agents are memoryless, reacting to each new observation as it happens, without generating a persisting internal structure. When a LED observes voltage, it emits light, regardless of whether it did so a second earlier.

Other agents have very persistent memories. The internal structure of an amber nugget can remain unchanged by external conditions for millions of years.

Neither of these levels of memory persistence makes for very intelligent agents, because neither allows them to be good separators of the past and the future. Memoryless agents only have access to the most recent input of their sensors, which leaves them oblivious to the hidden internal structures of other things around them, and to the existence of things not around them. Unchanging agents, on the other hand, fail to entangle themselves with new evidence, which prevents them from keeping up to date with a changing world.

Intelligence requires observations. An intelligent agent needs to strike a delicate balance between the persistence of its internal structure and its susceptibility to new evidence. The optimal balance, a Bayesian update, has been explained many times before, and was shown to be optimal in keeping information about the world. This post highlights yet another aspect.

Suppose I predicted that a roll of a die will give 4, and then it did. How surprised would you be?

You may intuitively realize that the degree of your surprise should be a decreasing function of the probability that you assign to the event. Predicting a 4 on a fair die is more surprising than on one that is loaded in favor of 4. You may also want the measure of surprise to be extensive: if I repeated the feat a second time, you would be twice as surprised.

In that case, there's essentially one consistent notion of surprise (also called: self-information, surprisal). The amount of surprise that a random variable X has value x is

S(x) = -log Pr(X=x).

This is the negative logarithm of the probability of the event X=x.

This is a very useful concept in information theory. For example, the entropy of the random variable X is the surprise we expect to have upon seeing its value. The mutual information between X and another random variable Y is the difference between how much we expect x to surprise us now, and how much we expect it to surprise us after we first look at y.

The surprise of seeing 4 when rolling a fair die is

S(1/6) = -log(1/6) ≈ 2.585 bit.

That's also the surprise of any other result, so that's also the entropy of a die.

By the way, the objective surprise of a specific result is the same whether or not I actually announce it as a prediction. The reason you don't "feel surprised" when no prediction is made, is that your intuition evolved to successfully avoid hindsight bias. As we'll see in a moment, not all surprise is useful evidence.

The Bayesian update is optimal in that it gives the agent the largest possible gain in information about the world. Exactly how much information is gained?

We might ask this question when we choose between an agent that doesn't update at all and one that updates perfectly: this is what we stand to gain, and we can compare it with the cost (in energy, computation power, etc.) of actually performing the update.

We can also ask this question when we consider what observations to make. Not updating on new evidence is equivalent, in terms of information, to not gathering it in the first place. So the benefit of gathering some more evidence is, at most, the benefit of subsequently using it in a Bayesian update.

Suppose that at time t the world is in a state Wt, and that the agent may look at it and make an observation Ot. Objectively, the surprise of this observation would be

Sobj = S(Ot|Wt) = -log Pr(Ot|Wt).

However, the agent doesn't know the state of the world. Before seeing the observation at time t, the agent has its own memory state Mt-1, which is entangled with the state of the world through past observations, but is not enough for the agent to know everything about the world.

The agent has a subjective surprise upon seeing Ot, which is the result of its own private prior:

Ssubj = S(Ot|Mt-1) = -log Pr(Ot|Mt-1),

and this may be significantly different than the objective surprise.

Interestingly, the amount of information that the agent stands to gain by making the new observation, and perfectly updating on it in a new memory state Mt, is exactly equal to the agent's expected oversurprise upon seeing the evidence. That is, any update from Mt-1 to Mt gains at most this much information:

I(Wt;Mt) - I(Wt;Mt-1) ≤ E[Ssubj - Sobj],

and this holds with equality when the update is Bayesian. (The math is below in a comment.)

For example, let's say I have two coins, one of them fair and the other turns up heads with probability 0.7. I don't know which coin is which, so I toss one of them at random. I expect it to show heads with probability

0.5 / 2 + 0.7 / 2 = 0.6,

so if it does I'll be surprised S(0.6) ≈ 0.737 bit, and if it doesn't I'll be surprised S(0.4) ≈ 1.322 bit. On average, I expect to be surprised

0.6 * S(0.6) + 0.4 * S(0.4) ≈ 0.971 bit.

The objective surprise depends on how the world really is. If I toss the fair coin, the objective surprise is S(0.5) = 1 bit for each of the two outcomes, and the expectation is 1 bit too. If I toss the loaded coin, the objective surprise is S(0.7) ≈ 0.515 bit for heads and S(0.3) ≈ 1.737 bit for tails, for an average of

0.7 * S(0.7) + 0.3 * S(0.3) ≈ 0.881 bit.

See how I'm a little undersurprised for the fair coin, but oversurprised for the loaded one. On average I'm oversurprised by

0.971 - (1 / 2 + 0.881 / 2) = 0.0305 bit.

By observing which side the coin turns up, I gain 0.971 bit of information, but most of it is just about this specific toss. Only 0.0305 bit of that information, my oversurprise, goes beyond the specific toss to teach me something about the coin itself, if I update perfectly.

We are not interested in evidence for their own sake, only inasmuch as they teach us about the world. Winning the lottery is very surprising, but nobody gambles for epistemological reasons. When the odds are known in advance, the subjective surprise is exactly matched by the objective surprise - you are not oversurprised, and you learn nothing useful.

The oversurprise is the "spillover" of surprise beyond what is just about the observation, and onto the world. It's the part of the narrowness that originates in the world, not in the observation. And that's how much the observation teaches us about the world.

An interesting corollary is that you can never expect to be undersurprised, i.e. less surprised than you objectively should be. That's the same as saying that a Bayesian update can never lose information.

The next time you find yourself wondering how best to observe the world and gather information, ask this: how much more surprised do you expect to be, than someone who already knows what you wish to know.

## How to Disentangle the Past and the Future

12 02 January 2013 05:43PM

I'm on my way to an important meeting. Am I worried? I'm not worried. The presentation is on my laptop. I distinctly remember putting it there (in the past), so I can safely predict that it's going to be there when I get to the office (in the future) - this is how well my laptop carries information through space and time.

My partner has no memory of me copying the file to the laptop. For her, the past and the future have mutual information: if Omega assured her that I'd copied the presentation, she would be able to predict the future much better than she can now.

For me, the past and the future are much less statistically dependent. Whatever entanglement remains between them is due to my memory not being perfect. If my partner suddenly remembers that she saw me copying the file, I will be a little bit more sure that I remember correctly, and that we'll have it at the meeting. Or if somehow, despite my very clear mental image of copying the file, it's not there at the meeting, my first suspicion will nevertheless be that I hadn't.

These unlikely possibilities aside, my memory does serve me. My partner is much less certain of the future than me, and more to the point, her uncertainty would decrease much more than mine if we both suddenly became perfectly aware of the past.

But now she turns on my laptop and checks. The file is there, yes, I could have told her that. And now that we know the state of my laptop, the past and the future are completely disentangled: maybe I put the file there, maybe the elves did - it makes no difference for the meeting. And maybe by the time we get to the office a hacker will remotely delete the file - its absence at the meeting will not be evidence that I'd neglected to copy the file: it was right there! We saw it!

(By "entanglement" I don't mean anything quantum theoretic; here it refers to statistical dependence, and its strength is measured by mutual information.)

The past and the future have mutual information. This is a necessary condition for life, for intelligence: we use the past to predict and plan for the future, and we couldn't do it if it were useless.

On the other hand, the future is independent of the past given the present. That's not profound metaphysics, that's simply how we define the state of a system: it's everything one needs to know of the past of the system to compute its future. The past is gone, and any information that it had on the future - information we could use to predict and make a better future - is inherited by the present.

But as limited agents inside the system, we don't get to know its entire state. Most of it is hidden from us, behind walls and hills and inside skulls and in nanostructures. So for us, the past and the future are entangled, which means that by learning one we could reduce our uncertainty about the other.

In control theory, memoryless agents have an unchanging internal structure, unable to entangle with the past and carry information useful for the future. Instead they react to whatever last input they received, like a function. These degenerate agents have an internal memory state Mt that depends only on the most recent observation Ot:

Figure 1: Dynamics of a memoryless reinforcement learning agents

Wt is the state of the world outside the agent at time t, Ot is the observation the agent makes of the world, Mt is the resulting internal state of the agent, and At is the action the agent chooses to take.

When a LED observes voltage, it emits light, regardless of whether it did so a second earlier. When the LED's internal attributes entangle with the voltage, they lose all information of what came before.

When the q key on a keyboard is pressed and released, the keyboard sends a signal to that effect to the computer, and that signal is mostly independent of which keys were pressed before and in what order (with a few exceptions; a keyboard is not entirely memoryless). A keyboard gets entangled with vast amounts of information over the years, but streams it through and loses almost any trace of it within seconds.

For a memoryless agent, all of the information between past and future flows through the environment outside the agent - through Wt. The world sans agent retains all the power to disentangle the past and the future: you can check in the graphical model in Figure 1 that Wt-1 and Wt+1 are independent given Wt.

The internal state Mt of the memoryless agent, on the other hand, of course reveals nothing about the link between past and future. Looking at my keyboard, you can't tell if I copied the presentation to my laptop, and if it's going to be there in half an hour.

Intelligent agents need to have memory:

Figure 2: General dynamics of a reinforcement learning agent

Now control over the flow of information has shifted, to some extent, in favor of the agent. The agent has a much wider channel over which to receive information from the past, and it can use this information to recover some truths about the present which aren't currently observable.

The past and the future are no longer independent given Wt alone, you need Mt and Wt together to completely separate the past and the future. An agent with memory of how Wt came to be can partly assume both roles, thus making itself a better separator than its memoryless counterpart could.

So here's how to become a good separator of past and future: remember things that are relevant for the future.

Memory is necessary for intelligence, you already knew that. What's new here is a way to measure just how useful memory is. Memory is useful exactly to the extent to which it shifts the power to control the flow of information.

For the agent, memory disentangles the past and the future. If you maintain in yourself some of the information that the past has about the future, you overcome the limitations of your ability to observe the present. I know I put the presentation on my laptop, so I don't need to check it's there.

For other agents, at the same time, the agent's memory is yet another limitation to their observability. Think of a secret handshake, for example. It's useful precisely because it predicts (and controls) the future for the confidants, while keeping it entangled with the hidden past for everyone else.

Continue reading: How to Be Oversurprised

## Point-Based Value Iteration

9 08 October 2012 06:19AM

Followup to: The Bayesian Agent

This post explains one interesting and influential algorithm for achieving high utility of the actions of a Bayesian agent, called Point-Based Value Iteration (original paper). Its main premise resembles some concept of internal availability.

A reinforcement-learning agent chooses its actions based on its internal memory state. The memory doesn't have to include an exact account of all past observations - it's sufficient for the agent to keep track of a belief of the current state of the world.

This mitigates the problem of having the size of the memory state grow indefinitely. The memory-state space is now the set of all distributions over the world state. Importantly, this space doesn't grow exponentially with time like the space of observable histories. Unfortunately, it still has volume exponential in the number of world states.

So we moved away from specifying the action to take after each observable history, in favor of specifying the action to take in each belief state. This was justified by the belief being a sufficient statistic for the world state, and motivated by the belief space being much smaller (for long processes) than the history space. But for the purpose of understanding this algorithm, we should go back to describing the action policy in terms of the action to take after each observable history, π(At|O1,...,Ot).

Now join us as we're watching the dynamics of the process unfold. We're at time t, and the agent has reached Bayesian belief Bt, the distribution of the world state to the best of the agent's knowledge. What prospects does the agent still have for future rewards? What is its value-to-go of being in this belief state?

If we know what action the agent will take after each observable history (if and when it comes to pass), it turns out that the expected total of future rewards is linear in Bt. To see why this is so, consider a specific future (and present):

wt, ot, at, wt+1, ot+1, at+1, ..., wn, on, an

The reward of this future is:

R(wt,at) + R(wt+1,at+1) + ... + R(wn,an)

The probability of this future, given that the agent already knows o1,...,ot, is:

Bt(wt)·π(at|o1,...,ot)·p(wt+1|wt,at)·σ(ot+1|wt+1)·π(at+1|o1,...,ot+1)···

···p(wn|wn-1,an-1)·σ(on|wn)·π(an|o1,...,on)

We arrive at this by starting with the agent's belief Bt, which summarizes the distribution of wt given the observable history; then going on to multiply the probability for each new variable, given those before it that have part in causing it.

Note how Bt linearly determines the probability for wt, while the dynamics of the world and the agent determine the probability for the rest of the process. Now, to find the expected reward we sum, over all possible futures, the product of the future's reward and probability. This is all linear in the belief Bt at time t.

If the value of choosing a specific action policy is a linear function of Bt, the value of choosing the best action policy is the maximum of many linear functions of Bt. How many? If there are 2 possible observations, the observable history that will have unfolded in the next n-t steps can be any of 2n-t possible "future histories". If there are 2 possible actions to choose for each one, there are 22n-t possible policies. A scary number if the end is still some time ahead of us.

Of course, we're not interested in the value of just any belief, only of those that can actually be reached at time t. Since the belief Bt is determined by the observable history o1,...,ot, there are at most 2t such values. Less scary, but still not practical for long processes.

The secret of the algorithm is that it limits both of the figures above to a fixed, manageable number. Suppose that we have a list of 100 belief states which are the most interesting to us, perhaps because they are the most likely to be reached at time t. Suppose that we also have a list Πt of 100 policies: for each belief on the first list, we have in the second list a policy that gets a good reward if indeed that belief is reached at time t. Now what is the value, at time t, of choosing the best action policy of these 100? It's the maximum of, not many, but |Πt|=100 linear functions of Bt. Much more manageable.

Ok, we're done with time t. What about time t-1? We are going to consider 100 belief states which can be reached at time t-1. For each such belief Bt-1, we are going to consider each possible action At-1. Then we are going to consider each possible observation Ot that can come next - we can compute its probability from Bt-1 and the dynamics of the system. Given the observation, we will have a new belief Bt. And the previous paragraph tells us what to do in this case: choose one of the policies in Πt which is best for Bt, to use for the rest of the process.

(Note that this new Bt doesn't have to be one of the 100 beliefs that were used to construct the 100 policies in Πt.)

So this is how the recursive step goes:

• For each of 100 possible values for Bt-1
• For each possible At-1
• For each possible Ot
• We compute Bt
• We choose for Bt the best policy in Πt, and find its value (it's a linear function of Bt)
• We go back and ask: how good is it really to choose At-1? What is the expected value before knowing Ot?
• We choose the best At-1
• This gives us the best policy for Bt-1, under the restriction that we are only willing to use one of the policies in Πt from time t onward
• After going over the 100 beliefs for time t-1, we now have a new stored list, Πt-1, of 100 policies for time t-1.

We continue backward in time, until we reach the starting time t0, where we choose the best policy in Πt0 for the initial (prior) belief Bt0.

We still need to say how to choose those special 100 beliefs to optimize for. Actually, this is not a fixed set. Most point-based algorithms start with only one or a few beliefs, and gradually expand the set to find better and better solutions.

The original algorithm chooses beliefs in a similar way to an internal availability mechanism that may be responsible for conjuring up images of likely futures in human brains. It starts with only one belief (the prior Bt0) and finds a solution where only one policy is allowed in each Πt. Then it adds some beliefs which seem likely to be reached in this solution, and finds a new and improved solution. It repeats this computation until a good-enough solution is reached, or until it runs out of time.

Point-based algorithms are today some of the best ones for finding good policies for Bayesian agents. They drive some real-world applications, and contribute to our understanding of how intelligent agents can act and react.

## The Bayesian Agent

11 18 September 2012 03:23AM

A reinforcement-learning agent interacts with its environment through the perception of observations and the performance of actions. A very abstract and non-standard description of such an agent is in two parts. The first part, the inference policy, tells us what states the agent can be in, and how these states change when the agent receives new input from its environment. The second part, the action policy, tells us what action the agent chooses to perform on the environment, in each of its internal states.

There are two special choices for the inference policy, marking two extremes. One extreme is for the agent to remain absolutely oblivious to the information coming its way. The transition from a past internal state to a current one is made independent of the observation, and no entanglement is formed between the agent and the environment. A rock, for example, comes close to being this little alive.

This post focuses on the other extreme, where the agent updates perfectly for the new information.

Keeping track of all the information is easy, on paper. All the agent has to do is maintain the sequence of past observations, the observable history O1, O2, ..., Ot. As each new observation is perceived, it can simply be appended to the list. Everything the agent can possibly know about the world, anything it can possibly hope to use in choosing actions, is in the observable history - there's no clairvoyance.

But this is far from practical, for many related reasons. Extracting the useful information from the raw observations can be a demanding task. The number of observations to remember grows indefinitely with time, putting a strain on the resources of an agent attempting longevity. The number of possible agent states grows exponentially with time, making it difficult to even specify (let alone decide) what action to take in each one.

Clearly we need some sort of compression when producing the agent's memory state from the observable history. Two requirements for the compression process: one, as per the premise of this post, is that it preserves all information about the world; the other is that it can be computed sequentially - when computing Mt the agent only has access to the new observation Ot and the previous compression Mt-1. The explicit value of all previous observations is forever lost.

This is a good moment to introduce proper terminology. A function of the observable history is called a statistic. Intuitively, applying a function to the data can only decrease, never increase the amount of information we end up having about the world. This intuition is solid, as the Data Processing Inequality proves. If the function does not lose any information about the world, if looking at the agent's memory is enough and there's nothing more relevant in the observations themselves, then the memory state is a sufficient statistic of the observable history, for the world state. The things the agent does forget about past perceptions are not at all informative for the present. Ultimately, when nothing further can be forgotten this way, we are left with a minimal sufficient statistic.

If I tell you the observable history of the agent, what will you know about the world state? If you know the dynamics of the world and how observations are generated, you'll have the Bayesian belief, assigning to each world state the posterior distribution:

Bt(Wt) = Pr(Wt|O1,...,Ot)

(where Pr stands for "probability"). Importantly, this can be computed sequentially from Bt-1 and Ot, using Bayes' theorem. (The gory details of how to do this are below in a comment.)

Aha! So the Bayesian belief is an expression for precisely everything the agent can possibly know about the world. Why not have the agent's memory represent exactly that?

Mt = Bt

As it turns out, the Bayesian belief is indeed a minimal sufficient statistic of the observable history for the world state. For the agent, it is the truth, the whole truth, and nothing but the truth - and a methodical way to remember the truth, to boot.

Thus we've compressed into the agent's memory all and only the information from its past that is relevant for the present. We've discarded any information that is an artifact of the senses, and is not real. We've discarded any information that used to be real, but isn't anymore, because the world has since changed.

The observant reader will notice that we haven't discussed actions yet. We're getting there. The question of what information is relevant for future actions is deep enough to justify this meticulous exposition. For the moment, just note that keeping a sufficient statistic for the current world state is also sufficient for the controllable future, since the future is independent of the past given the present.

What we have established here is an "ultimate solution" for how a reinforcement-learning agent should maintain its memory state. It should update a Bayesian belief of what the current world state is. This inference policy is so powerful and natural, that standard reinforcement learning doesn't even make a distinction between the Bayesian belief and the agent's memory state, ignoring anything else we could imagine the latter to be.

## Reinforcement, Preference and Utility

7 08 August 2012 06:23AM

A reinforcement-learning agent is interacting with its environment through the perception of observations and the performance of actions.

We describe the influence of the world on the agent in two steps. The first is the generation of a sensory input Ot based on the state of the world Wt. We assume that this step is in accordance with the laws of physics, and out of anyone's hands. The second step is the actual changing the agent's mind to a new state Mt. The probability distributions of these steps are, respectively, σ(Ot|Wt) and q(Mt|Mt-1,Ot).

Similarly, the agent affects the world by deciding on an action At and performing it. The designer of the agent can choose the probability distribution of actions π(At|Mt), but not the natural laws p(Wt+1|Wt,At) saying how these actions change the world.

So how do we choose q and π? Let's first assume that it matters how. That is, let's assume that we have some preference over the results of the process, the actual values that all the variables take. Let's also make some further assumptions regarding this preference relation:

1. The first assumption will be the standard von Neumann-Morgenstern rationality. This is a good opportunity to point out a common misconception in the interpretation of that result. It is often pointed out that humans, for instance, are not rational in the VNM sense. That is completely beside the point.

The agent doesn't choose the transitions q and π. The agent is these transitions. So it's not the agent that needs to be rational about the preference, and indeed it may appear not to be. If the agent has evolved, we may argue that evolutionary fitness is VNM-rational (even if the local-optimization process leads to sub-optimal fitness). If humans design a Mars rover to perform certain functions, we may argue that the goal dictates a VNM-rational preference (even if we, imperfect designers that we are, can only approximate it).

2. The second assumption will be that the preference is strictly about the territory Wt, never about the map Mt. This means that we are never interested in merely putting the memory of the agent in a "good" state. We need it to be followed up by "good" actions.

You may think that an agent that generates accurate predictions of the stock market could be very useful. But if the agent doesn't follow up with actually investing well, or at least reliably reporting its findings to someone who does, then what good is it?

So we are going to define the preference with respect to only the states W1, ..., Wn and the actions A1, ..., An. If the observations are somehow important, we can always define them as being part of the world. If the agent's memory is somehow important, it will have to reliably communicate it out through actions.

3. The third assumption will be the sunk cost assumption. We can fix some values of the first t steps of the process, and consider the preference with respect to the remaining n-t steps. This is like finding ourselves in time t, with a given t-step history, considering what to do next (though of course we plan for that contingency ahead of time). Our assumption is that we find that the preference is the same, regardless of the fixed history.

This last assumption gives rise to a special version of the von Neumann-Morgenstern utility theorem. We find that what we really prefer is to have as high as possible the expectation of a utility function with a special structure: the utility ut depends only on Wt, At and the expectation of ut+1 from the following step:

ut(Wt,At,E(ut+1))

This kind of recursion which goes backwards in time is a recurring theme in reinforcement learning.

We would like to have even more structure in our utility function, but this is where things become less principled and more open to personal taste among researchers. We base our own taste on a strong intuition that the following could, one day, be made to rely on much better principles than it currently does.

We will assume that ut is simply the summation of some immediate utility and the expectation of future utility:

ut(Wt,At,E(ut+1)) = R(Wt,At) + E(ut+1)

Here R is the Reward, the additional utility that the agent gets from taking the action At when the world is in state Wt. It's not hard to see that we can write down our utility in closed form, as the total of all rewards we get throughout the process

R(W1,A1) + R(W2,A2) + ... + R(Wn,An)

As a final note, if the agent is intended to achieve a high expectation of the total reward, then it may be helpful for the agent to actually observe its reward when it gets it. And indeed, many reinforcement learning algorithms require that the reward signal is visible to the agent as part of its observation each step. This can help the agent adapt its behavior to changes in the environment. After all, reinforcement learning means, quite literally, adaptation in response to a reward signal.

However, reinforcement learning can also happen when the reward signal is not explicit in the agent's observations. To the degree that the observations carry information about the state of the world, and that the reward is determined by that state, information about the reward is anyway implicit in the observations.

In this sense, the reward is just a score for the quality of the agent's design. If we take it into account when we design the agent, we end up choosing q and π so that the agent will exploit the implicit information in its observations to get a good reward.

## Reinforcement Learning: A Non-Standard Introduction (Part 2)

9 02 August 2012 08:17AM

Followup to: Part 1

In part 1 we modeled the dynamics of an agent and its environment as a turn-based discrete-time process. We now start on the path of narrowing the model, in the ambitious search for an explanation and an algorithm for the behavior of intelligent agents.

The spacecraft is nearing the end of its 9 month journey. In its womb rests a creature of marvelous design. Soon, it will reach its destination, and on the surface of the alien planet will gently land an intelligent agent. Almost immediately the agent will start interacting with its environment, collecting and analyzing samples of air and ground and radiation, and eventually moving about to survey the land and to find energy and interesting stuff to do.

It takes too long, about 14 minutes at the time of landing, for the state of Mars to affect the state of the brains of the people at the NASA mission control. This makes the state of the software of the rover important: it will be the one to decide the rover's actions, to a large degree.

Even after it's landed, the software of the rover can be changed remotely, but not the hardware. This puts some limitations on any brilliant ideas we could have for improving the rover at this late stage of the mission. How can we model these limitations?

Our model so far defined the dynamics of the agent's interaction with its environment through the probabilities:

p(Wt|Wt-1,Mt-1)

q(Mt|Mt-1,Wt)

for the current state of the world (Mars) and the agent (the rover), given their previous state.

At some point the rover will take a sample of the air. Could we choose q to be such that Mt at that time will reflect the oxygen level around the rover? Yes, the rover has hardware that, subsequent to analyzing the sample, will have one of a number of possible states, depending on the oxygen levels. It is wired to the rover's central controller, which can then have its state reflect that of the oxygen analyzer.

At some point the rover will take a sample of the ground. Could we choose q to be such that Mt at that time will reflect Mars' sweetness? Probably not any more than it already does. The rover is probably not equipped with any hardware sensitive to substances perceived by living creatures on Earth as sweet. This was a design choice, guided by the assumption that these organic molecules are extremely unlikely to be found on Mars. If they are, the rover cannot be made to reflect their concentration level.

In other words, the rover is not omniscient. The scope of things it can possibly know of the state of Mars is strictly (and vastly) smaller than the scope of things which are true of the state of Mars. We call O for Observation the aspects of the environment that the agent can perceive.

Similarly, p cannot take any value, because the rover is not omnipotent. It has engines, so it can cause Mars (relative to the rover's own point of view) to rotate under its wheels. But it cannot (I dare hope) blast Mars out of existence, nor plant a rose there. We call A for Action the operation of the rover's actuators.

This more detailed model is illustrated in the Bayesian network:

The dynamics of the model are (pardon my Greek)

p(Wt|Wt-1,At-1)

σ(Ot|Wt)

q(Mt|Mt-1,Ot)

π(At|Mt)

Now Mt can only depend on the state of the world Wt through the agent's observation Ot. When designing the rover's software, we can choose the memory update scheme q, thereby deciding how the rover processes the information in Ot and what part of it it remembers. But we cannot change σ, which is how the rover's sensory input depend on the truth of the Martian world.

To clarify: the agent can change, through action, the kind of observations it gets. The quality of the ground sample can greatly improve if the correct appendage actually goes down to the ground. But that's not changing σ. That's changing the position of Mars with respect to the appendage (a part of Wt), and through that affecting the observation.

Similarly, the rover can only change the state Wt of Mars through an action At. The software controls what command is sent to the engines in any state of its execution, which is summarized in π. But it cannot change p, which is how the chosen action and the laws of physics cause Mars to change.

So by adding detail to the model we have introduced further independences, which help narrow the model. Not only is the future independent on the past given the present, but now the dependence between the two parts of the current state, Mt and Wt, is completely explained by the previous memory state Mt-1 and the observation Ot. This means that everything the agent knows about the world must come from the memory and the senses - no clairvoyance!

The symmetric independence - no telekinesis! - can be found by flipping the Bayesian network. Its details are left as an exercise for the reader.

As a final note, the inability to change p or σ is only with respect to that specific rover which is now 14 light minutes away from here. Future rovers can easily have sweetness gauges or rose-planting devices, but they will still not be omniscient nor omnipotent. We can exemplify this principle further on 3 different scales:

1. If we try to improve humans through training and education, but without changing their biology (nor enhancing it with technology), we will probably be limited in the kinds of things we can make the brain notice, or remember, or decide to do.

2. Suppose that some senses (for example, the eye) have reached a local maximum in their evolution, so that any improvement is extremely unlikely. Suppose further that the brain is not in such a local maximum, and can easily evolve to be better. Then the evolution of the brain will be constrained by the current suboptimal properties of the senses.

3. Even if we are allowed to design an intelligent agent from scratch, the laws of physics as we know them will limit the part of the world it can perceive through observations, and the part it can influence through actions.

Continue reading: Reinforcement, Preference and Utility

## Reinforcement Learning: A Non-Standard Introduction (Part 1)

20 29 July 2012 12:13AM

Imagine that the world is divided into two parts: one we shall call the agent and the rest - its environment. Imagine you could describe in full detail the state of both the agent and the environment. The state of the agent is denoted M: it could be a Mind if you're a philosopher, a Machine if you're researching machine learning, or a Monkey if you're a neuroscientist. Anyway, it's just the Memory of the agent. The state of the rest of the World (or just World, for short) is denoted W.

These states change over time. In general, when describing the dynamics of a system, we specify how each state is determined by the previous states. So we have probability distributions for the states Wt and Mt of the world and the agent in time t:

p(Wt|Wt-1,Mt-1)

q(Mt|Wt-1,Mt-1)

This gives us the probabilities that the world is currently in state Wt, and the agent in state Mt, given that they previously were in states Wt-1 and Mt-1. This can be illustrated in the following Bayesian network (see also):

Bayesian networks look like they represent causation: that the current state is "caused" by the immediately previous state. But what they really represent is statistical independence: that the current joint state (Wt, Mt) depends only on the immediately previous joint state (Wt-1, Mt-1), and not on any earlier state. So the power of Bayesian networks is in what they don't show, in this case there's no arrow from, say, Wt-2 to Wt.

The current joint state of the world and the agent represents everything we need to know in order to continue the dynamics forward. Given this state, the past is independent of the future. This property is so important, that it has a name, borrowed from one of its earliest researchers, Markov.

The Markov property is not enough for our purposes. We are going to make a further assumption, which is that the states of the world and the agent don't both change together. Rather, they take turns changing, and while one does the other remains the same. This gives us the dynamics:

p(Wt|Wt-1,Mt-1)

q(Mt|Mt-1,Wt)

and the Bayesian network:

Sometimes this assumption can be readily justified. For example, let's use this model to describe a chess player.

Suppose that at time t the game has reached state Wt where it is our agent's turn to play. Our agent has also reached a decision of what to do next, and its mind is now in state Mt, including memory, plan, general knowledge of chess, and all.

Our agent takes its turn, and then enters stasis: we are going to assume that it's not thinking off-turn. This is true of most existing artificial chess players, and disregarding time constraints their play is not worse off for it. They are not missing out on anything other than time to think. So the agent keeps its state until the opponent has taken its turn. This completes the change of the state of the game from Wt to Wt+1.

Now the agent takes a look at the board, and starts thinking up a new strategy to counter the last move of the opponent. If reaches a decision, and commits to its next action. This completes the change of the agent's state from Mt to Mt+1.

Chess is a turn-based game. But even in other scenarios, when such division of the dynamics into turns is not a good approximation of the process, our assumption can still be justified. If the length of each time step is taken to be smaller and smaller, the state of each of the parties remains more and more the same during each step, with increasing probability and accuracy. In the limit where we describe a continuous change of state over time, the turn-based assumption disappears, and we are back to the general model.

This is the first part of an intuitive and highly non-standard introduction to reinforcement learning. This is more typical of what neuroscientists mean when they use the term. We, on the other hand, will get closer as we move forward to its meaning in machine learning (but not too close).

In following posts we will continue to assume the Markov property in its turn-based variant. We will describe the model in further detail and explore its decision-making aspect.

## A Crash Course in the Neuroscience of Human Motivation

120 19 August 2011 09:15PM

Whenever I write a new article for Less Wrong, I'm pulled in two opposite directions.

One force pulls me toward writing short, exciting posts with lots of brain candy and just one main point. Eliezer has done that kind of thing very well many times: see Making Beliefs Pay Rent, Hindsight Devalues Science, Probability is in the MindTaboo Your Words, Mind Projection FallacyGuessing the Teacher's Password, Hold Off on Proposing Solutions, Applause Lights, Dissolving the Question, and many more.

Another force pulls me toward writing long, factually dense posts that fill in as many of the pieces of a particular argument in one fell swoop as possible. This is largely because I want to write about the cutting edge of human knowledge but I keep realizing that the inferential gap is larger than I had anticipated, and I want to fill in that inferential gap quickly so I can get to the cutting edge.

For example, I had to draw on dozens of Eliezer's posts just to say I was heading toward my metaethics sequence. I've also published 21 new posts (many of them quite long and heavily researched) written specifically because I need to refer to them in my metaethics sequence.1 I tried to make these posts interesting and useful on their own, but my primary motivation for writing them was that I need them for my metaethics sequence.

And now I've written only four posts2 in my metaethics sequence and already the inferential gap to my next post in that sequence is huge again. :(

So I'd like to try an experiment. I won't do it often, but I want to try it at least once. Instead of writing 20 more short posts between now and the next post in my metaethics sequence, I'll attempt to fill in a big chunk of the inferential gap to my next metaethics post in one fell swoop by writing a long tutorial post (a la Eliezer's tutorials on Bayes' Theorem and technical explanation).3

So if you're not up for a 20-page tutorial on human motivation, this post isn't for you, but I hope you're glad I bothered to write it for the sake of others. If you are in the mood for a 20-page tutorial on human motivation, please proceed.