Followup to: Reinforcement Learning: A Non-Standard Introduction, Reinforcement, Preference and Utility
A reinforcement-learning agent interacts with its environment through the perception of observations and the performance of actions. A very abstract and non-standard description of such an agent is in two parts. The first part, the inference policy, tells us what states the agent can be in, and how these states change when the agent receives new input from its environment. The second part, the action policy, tells us what action the agent chooses to perform on the environment, in each of its internal states.
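To make this two-part description concrete, here is a minimal sketch in Python (the names Agent, infer, act and step are mine, not part of the formalism): the inference policy maps the previous internal state and a new observation to a new internal state, and the action policy maps an internal state to an action.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Agent:
    infer: Callable[[Any, Any], Any]  # inference policy: (old state, observation) -> new state
    act: Callable[[Any], Any]         # action policy: state -> action
    memory: Any                       # the agent's current internal state

    def step(self, observation):
        """Perceive one observation, update the internal state, choose an action."""
        self.memory = self.infer(self.memory, observation)
        return self.act(self.memory)
```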
There are two special choices for the inference policy, marking two extremes. One extreme is for the agent to remain absolutely oblivious to the information coming its way. The transition from a past internal state to a current one is made independent of the observation, and no entanglement is formed between the agent and the environment. A rock, for example, comes close to being this little alive.
This post focuses on the other extreme, where the agent updates perfectly for the new information.
Keeping track of all the information is easy, on paper. All the agent has to do is maintain the sequence of past observations, the observable history O1, O2, ..., Ot. As each new observation is perceived, it can simply be appended to the list. Everything the agent can possibly know about the world, anything it can possibly hope to use in choosing actions, is in the observable history - there's no clairvoyance.
But this is far from practical, for many related reasons. Extracting the useful information from the raw observations can be a demanding task. The number of observations to remember grows indefinitely with time, putting a strain on the resources of an agent attempting longevity. The number of possible agent states grows exponentially with time, making it difficult to even specify (let alone decide) what action to take in each one.
Clearly we need some sort of compression when producing the agent's memory state from the observable history. Two requirements for the compression process: one, as per the premise of this post, is that it preserves all information about the world; the other is that it can be computed sequentially - when computing Mt the agent only has access to the new observation Ot and the previous compression Mt-1. The explicit value of all previous observations is forever lost.
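As a sketch (with hypothetical function names), the two extremes and the compression we are after all fit the same sequential signature Mt = f(Mt-1, Ot):

```python
def rock_update(memory, observation):
    # One extreme: ignore the observation entirely; no entanglement with the environment.
    return memory

def full_history_update(history, observation):
    # The other extreme, done naively: append every raw observation.
    # Nothing is lost, but the memory grows without bound.
    return history + (observation,)

# What we want instead is an update with this same signature that compresses
# the history while losing no information about the world state; the Bayesian
# belief update below is exactly such an update.
```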
This is a good moment to introduce proper terminology. A function of the observable history is called a statistic. Intuitively, applying a function to the data can only decrease, never increase the amount of information we end up having about the world. This intuition is solid, as the Data Processing Inequality proves. If the function does not lose any information about the world, if looking at the agent's memory is enough and there's nothing more relevant in the observations themselves, then the memory state is a sufficient statistic of the observable history, for the world state. The things the agent does forget about past perceptions are not at all informative for the present. Ultimately, when nothing further can be forgotten this way, we are left with a minimal sufficient statistic.
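As a toy illustration of sufficiency (my example, not the post's): if the world state were a fixed coin bias and the observations independent flips, then the number of heads together with the number of flips would be a sufficient statistic - the posterior computed from the raw observable history equals the posterior computed from the compressed counts.

```python
import numpy as np

biases = np.array([0.2, 0.5, 0.8])     # hypothetical world states: possible coin biases
prior = np.ones(3) / 3                 # uniform prior over them
flips = [1, 0, 1, 1, 0, 1, 1]          # observable history (1 = heads)

# Posterior from the raw observable history.
likelihood_full = np.prod([biases**o * (1 - biases)**(1 - o) for o in flips], axis=0)
posterior_full = likelihood_full * prior
posterior_full /= posterior_full.sum()

# Posterior from the compressed statistic (number of heads, number of flips).
k, n = sum(flips), len(flips)
likelihood_stat = biases**k * (1 - biases)**(n - k)
posterior_stat = likelihood_stat * prior
posterior_stat /= posterior_stat.sum()

assert np.allclose(posterior_full, posterior_stat)  # no information about the bias was lost
```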
If I tell you the observable history of the agent, what will you know about the world state? If you know the dynamics of the world and how observations are generated, you'll have the Bayesian belief, assigning to each world state the posterior distribution:
Bt(Wt) = Pr(Wt|O1,...,Ot)
(where Pr stands for "probability"). Importantly, this can be computed sequentially from Bt-1 and Ot, using Bayes' theorem. (The gory details of how to do this are below in a comment.)
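Here is a minimal sketch of that sequential computation for a finite set of world states with known dynamics (the array names are mine; actions are ignored here and folded in by the derivation in the comment below):

```python
import numpy as np

def belief_update(belief_prev, obs, transition, obs_model):
    """One step of the sequential update B_{t-1} -> B_t (no actions yet).

    belief_prev : B_{t-1}, probabilities over world states
    obs         : index of the new observation O_t
    transition  : transition[w, v] = Pr(W_t = v | W_{t-1} = w)
    obs_model   : obs_model[v, o] = Pr(O_t = o | W_t = v)
    """
    predicted = belief_prev @ transition           # Pr(W_t | O_1, ..., O_{t-1})
    unnormalized = obs_model[:, obs] * predicted   # Bayes numerator
    return unnormalized / unnormalized.sum()       # normalize to sum to 1
```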
Aha! So the Bayesian belief is an expression for precisely everything the agent can possibly know about the world. Why not have the agent's memory represent exactly that?
Mt = Bt
As it turns out, the Bayesian belief is indeed a minimal sufficient statistic of the observable history for the world state. For the agent, it is the truth, the whole truth, and nothing but the truth - and a methodical way to remember the truth, to boot.
Thus we've compressed into the agent's memory all and only the information from its past that is relevant for the present. We've discarded any information that is an artifact of the senses, and is not real. We've discarded any information that used to be real, but isn't anymore, because the world has since changed.
The observant reader will notice that we haven't discussed actions yet. We're getting there. The question of what information is relevant for future actions is deep enough to justify this meticulous exposition. For the moment, just note that keeping a sufficient statistic for the current world state is also sufficient for the controllable future, since the future is independent of the past given the present.
What we have established here is an "ultimate solution" for how a reinforcement-learning agent should maintain its memory state. It should update a Bayesian belief about what the current world state is. This inference policy is so powerful and natural that standard reinforcement learning doesn't even make a distinction between the Bayesian belief and the agent's memory state, ignoring anything else we could imagine the latter to be.
Continue reading: Point-Based Value Iteration
If you're a devoted Bayesian, you probably know how to update on evidence, and even how to do so repeatedly on a sequence of observations. What you may not know is how to update in a changing world. Here's how:
Bt(wt) = Pr(Wt = wt|O1,...,Ot) = Pr(Ot|Wt = wt, O1,...,Ot-1) · Pr(Wt = wt|O1,...,Ot-1) / Pr(Ot|O1,...,Ot-1)

As usual with Bayes' theorem, we only need to calculate the numerator for different values of wt, and the denominator will normalize them to sum to 1, as probabilities do. We know Pr(Ot|Wt = wt, O1,...,Ot-1) = σ(Ot|wt) as part of the dynamics of the system (the observation depends only on the current world state), so we only need Pr(Wt = wt|O1,...,Ot-1). This can be calculated by introducing the other variables in the process:

Pr(Wt = wt|O1,...,Ot-1) = Σwt-1 Σat-1 Pr(Wt-1 = wt-1|O1,...,Ot-1) · Pr(At-1 = at-1|Wt-1 = wt-1, O1,...,Ot-1) · p(wt|wt-1, at-1)

An important thing to notice is that, given the observable history, the world state and the action are independent - the agent can't act on unseen information. We continue:

Pr(At-1 = at-1|Wt-1 = wt-1, O1,...,Ot-1) = Pr(At-1 = at-1|O1,...,Ot-1)

Recall that the agent's belief Bt-1 is a function of the observable history, and that the action At-1 only depends on the observable history through its memory Mt-1. We conclude:

Bt(wt) ∝ σ(Ot|wt) · Σwt-1 Σat-1 Bt-1(wt-1) · π(at-1|Mt-1) · p(wt|wt-1, at-1)
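As a code sketch, the derived update for finite world-state and action sets looks like this (all array names are mine):

```python
import numpy as np

def bayesian_belief_update(belief_prev, action_probs, obs, p, sigma):
    """B_{t-1} -> B_t, following the derivation above.

    belief_prev  : Bt-1[w]      -- belief over previous world states
    action_probs : pi[a]        -- pi(a | Mt-1), the action policy at the previous memory state
    obs          : index of the new observation Ot
    p            : p[w, a, v]   -- world dynamics Pr(Wt = v | Wt-1 = w, At-1 = a)
    sigma        : sigma[v, o]  -- observation model Pr(Ot = o | Wt = v)
    """
    # Pr(Wt | O1,...,Ot-1): sum over wt-1 and at-1 of Bt-1 * pi * p
    predicted = np.einsum("w,a,wav->v", belief_prev, action_probs, p)
    unnormalized = sigma[:, obs] * predicted   # multiply by sigma(Ot | wt)
    return unnormalized / unnormalized.sum()   # the denominator just normalizes
```

Note that the previous memory Mt-1 enters only through the action policy π, exactly as in the last step of the derivation.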
I'm not seeing how this lets the agent update itself. The formula requires knowledge of sigma, pi, and p. (BTW, could someone add to the comment help text instructions for embedding LaTeX?) pi is part of the agent, but sigma and p are not. You say that we know sigma as part of the dynamics of the system.
But all the agent knows, as you've described it so far, is the sequence of observations. In fact, it's stretching it to say that we know sigma or p -- we have just given these names to them. sigma is a complete description of how the world state determines what...