In response to comment by royf on The Bayesian Agent
Comment author: [deleted] 20 September 2012 07:52:12PM *  4 points [-]

Right, in degenerate cases, when there's nothing to be learned, the two extremes of learning nothing and everything coincide.

In the case where your prior says "the past is not informative about the future", you learn nothing. That is a degenerate prior, not a degenerate situation.

To the extent that I understand your navigational metaphor, I disagree with this statement. Would you kindly explain?

Imagine a bowl of jellybeans. You put in ten red and ten white. You take out 3, all of which are red. The probability of getting a red on the next draw is now 7/17.

Take another bowl, and have a monkey toss in red beans and white beans with 50% probability each. You draw 3 red. The probability that the next draw is red is still 50% (because you had a maximum-entropy prior).

Take another bowl. Beans were loaded in with unknown probabilities. You draw 3 red, and your probability for red on the next draw is 4/5.

See how, depending on your assumptions, you learn in different directions from the same observations? Hence you can learn in the wrong direction with a bad prior.
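The three bowls can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and the third bowl assumes a uniform prior over the unknown fill probability (Laplace's rule of succession), which is one reading of "unknown probabilities" rather than something the comment states.

```python
from fractions import Fraction

def known_bowl(red, white, drawn_red, drawn_white):
    """Bowl with known contents, drawing without replacement."""
    red_left = red - drawn_red
    white_left = white - drawn_white
    return Fraction(red_left, red_left + white_left)

def independent_fill(p_red):
    """Each bean was coloured independently with a known probability,
    so earlier draws say nothing about later ones."""
    return Fraction(p_red)

def unknown_fill(drawn_red, drawn_white):
    """Unknown fill probability with a uniform prior (an assumption):
    Laplace's rule of succession."""
    return Fraction(drawn_red + 1, drawn_red + drawn_white + 2)

print(known_bowl(10, 10, 3, 0))          # 7/17
print(independent_fill(Fraction(1, 2)))  # 1/2
print(unknown_fill(3, 0))                # 4/5
```

Same three red beans, three different predictions: below, at, and above one half.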

Learning sideways is a bit of metaphor-stretching, but if you like you can imagine observing 3 red beans proves the existence of god under some prior.

given that prior and the agent's observations

Yes yes. I was being pedantic because your post didn't talk about priors and inductive bias.

very little

where part of the change is revealed to you through new observations, you have to keep pace.

I thought of that. I didn't think enough. "very little" was the wrong phrasing. It's not that you do less updating, it's that your updates are on concrete things like "who took the cookies" instead of "does gravity go as the square or the cube" because your prior already encodes correct physics. Very little updating on physics.

In response to comment by [deleted] on The Bayesian Agent
Comment author: royf 20 September 2012 11:37:44PM *  0 points [-]

Imagine a bowl of jellybeans. [...]

Allow me to suggest a simpler thought experiment, that hopefully captures the essence of yours, and shows why your interpretation (of the correct math) is incorrect.

There are 100 recording studios, each recording each day with probability 0.5. Everybody knows that.

There's a red light outside each studio to signal that a session is taking place that day, except for one rogue studio, where the signal is reversed, being off when there's a session and on when there isn't. Only persons B and C know that.

A, B and C are standing at the door of a studio, but only C knows that it's the rogue one. How do their beliefs that there's a session inside change by observing that the red light is on? A keeps the 50-50. B now thinks it's 99-1. Only C knows that there's no session.

So your interpretation, as I understand it, would be to say that A and B updated in the "wrong direction". But wait! I practically gave you the same prior information that C has - of course you agree with her! Let's rewrite the last paragraph:

A, B and C are standing at the door of a studio. For some obscure reason, C secretly believes that it's the rogue one. Wouldn't you now agree with B?

And now I can do the same for A, by not revealing to you, the reader, the significance of the red lights. My point is that as long as someone runs a Bayesian update, you can't call that the "wrong direction". Maybe they now believe in things that you judge less likely, based on the information that you have, but that doesn't make you right and them wrong. Reality makes them right or wrong. Unfortunately, there's no one around who knows reality in any other way than through their subjective information-revealing observations.
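The beliefs of A, B and C in the studio example can be reproduced with one application of Bayes' theorem each. In this sketch the function name and the parameter p_rogue (the observer's probability that this particular studio is the rogue one) are illustrative. Modelling A, for whom the light carries no meaning, as p_rogue = 1/2 is an assumption made here so that the light becomes uninformative.

```python
from fractions import Fraction

def p_session_given_light_on(p_rogue, p_session=Fraction(1, 2)):
    """Posterior probability of a session, given that the red light is on.

    Normal studio: light is on exactly when there is a session.
    Rogue studio: light is on exactly when there is no session.
    p_rogue is the observer's prior that this studio is the rogue one.
    """
    on_and_session = p_session * (1 - p_rogue)
    on_and_no_session = (1 - p_session) * p_rogue
    return on_and_session / (on_and_session + on_and_no_session)

belief_a = p_session_given_light_on(Fraction(1, 2))    # light is uninformative
belief_b = p_session_given_light_on(Fraction(1, 100))  # 1 rogue among 100 studios
belief_c = p_session_given_light_on(Fraction(1))       # knows this is the rogue one

print(belief_a, belief_b, belief_c)  # 1/2 99/100 0
```

All three run the same update. Only the prior differs.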

Comment author: Unnamed 20 September 2012 06:49:20AM 6 points [-]

Pick your answer to this poll at random:

Comment author: royf 20 September 2012 05:03:05PM *  5 points [-]

To anyone thinking this is not random, with 42 votes in:

  • The p-value is 0.895 (this is the probability of seeing at least this much non-randomness, assuming a uniform distribution)

  • The entropy is 2.302 bits instead of log2(5) = 2.322 bits, for 0.02 bits KL-distance (this is the number of bits you lose for encoding one of these votes as if it was random)

If you think you see a pattern here, you should either see a doctor or a statistician.
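For readers who want to repeat the entropy calculation on their own tally, here is a minimal sketch. The vote counts below are hypothetical (the actual poll counts are not shown in this thread) and only chosen to sum to 42. For a uniform reference distribution over k options, the KL distance equals log2(k) minus the entropy, which is the relation used in the comment above.

```python
import math

def entropy_bits(counts):
    """Entropy, in bits, of the empirical distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def kl_from_uniform_bits(counts):
    """KL divergence, in bits, from the empirical distribution to the uniform one."""
    total = sum(counts)
    k = len(counts)
    return sum((c / total) * math.log2((c / total) * k) for c in counts if c > 0)

votes = [9, 8, 10, 7, 8]  # hypothetical tally of 42 votes over 5 options
h = entropy_bits(votes)
kl = kl_from_uniform_bits(votes)
print(f"entropy = {h:.3f} bits, KL from uniform = {kl:.3f} bits")
```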

In response to The Bayesian Agent
Comment author: [deleted] 18 September 2012 02:41:38PM 4 points [-]

two extremes...Bayesian belief

It is perfectly legal under the bayes to learn nothing from your observations. Or learn in the wrong direction, or sideways, or whatever. All depending on prior and inductive bias. There is no unique "Bayesian belief". If you had the "right" prior, you would find that you would have to do very little updating, because the right prior is already right.

In response to comment by [deleted] on The Bayesian Agent
Comment author: royf 18 September 2012 06:30:53PM *  1 point [-]

It is perfectly legal under the bayes to learn nothing from your observations.

Right, in degenerate cases, when there's nothing to be learned, the two extremes of learning nothing and everything coincide.

Or learn in the wrong direction, or sideways, or whatever.

To the extent that I understand your navigational metaphor, I disagree with this statement. Would you kindly explain?

There is no unique "Bayesian belief".

If you mean to say that there's no unique justifiable prior, I agree. The prior in our setting is basically what you assume you know about the dynamics of the system - see my reply to RichardKennaway.

However, given that prior and the agent's observations, there is a unique Bayesian belief, the one I defined above. That's pretty much the whole point of Bayesianism, the existence of a subjectively objective probability.

If you had the "right" prior, you would find that you would have to do very little updating, because the right prior is already right.

This is true in a constant world, or with regard to parts of the world which are constant. And mind you, it's true only with high probability: there's always the slight chance that the sky is not, after all, blue.

But in a changing world, where part of the change is revealed to you through new observations, you have to keep pace. The right prior was right yesterday, today there's new stuff to know.

In response to comment by royf on The Bayesian Agent
Comment author: RichardKennaway 18 September 2012 07:28:22AM 3 points [-]

I'm not seeing how this lets the agent update itself. The formula requires knowledge of sigma, pi, and p. (BTW, could someone add to the comment help text instructions for embedding Latex?) pi is part of the agent but sigma and p are not. You say

We know sigma as part of the dynamics of the system

But all the agent knows, as you've described it so far, is the sequence of observations. In fact, it's stretching it to say that we know sigma or p -- we have just given these names to them. sigma is a complete description of how the world state determines what the agent senses, and p is a complete description of how the agent's actions affect the world. As the designer of the agent, will you be explicitly providing it with that information in some future instalment?

Comment author: royf 18 September 2012 06:09:59PM *  3 points [-]

Everything you say is essentially true.

As the designer of the agent, will you be explicitly providing it with that information in some future instalment?

Technically, we don't need to provide the agent with p and sigma explicitly. We use these parameters when we build the agent's memory update scheme, but the agent is not necessarily "aware" of the values of the parameters from inside the algorithm.

Let's take for example an autonomous rover on Mars. The gravity on Mars is known at the time of design, so the rover's software, and even hardware, is built to operate under these dynamics. The wind velocity at the time and place of landing, on the other hand, is unknown. The rover may need to take measurements to determine this parameter, and encode it in its memory, before it can take it into account in choosing further actions.

But if we are thoroughly Bayesian, then something is known about the wind prior to experience. Is it likely to change every 5 minutes or can the rover wait longer before measuring again? What should be the operational range of the instruments? And so on. In this case we would include this prior in p, while the actual wind velocity is instead hidden in the world state (only to be observed occasionally and partially).

Ultimately, we could include all of physics in our belief - there's always some Einstein to tell us that Newtonian physics is wrong. The problem is that a large belief space makes learning harder. This is why most humans struggle with intuitive understanding of relativity or quantum mechanics - our brains are not made to represent this part of the belief space.

This is also why reinforcement learning gives special treatment to the case where there are unknown but unchanging parameters of the world dynamics: the "unknown" part makes the belief space large enough to make special algorithms necessary, while the "unchanging" part makes these algorithms possible.

For LaTeX instructions, click "Show help" and then "More Help" (or go here).

The Bayesian Agent

11 royf 18 September 2012 03:23AM

Followup to: Reinforcement Learning: A Non-Standard Introduction; Reinforcement, Preference and Utility

A reinforcement-learning agent interacts with its environment through the perception of observations and the performance of actions. A very abstract and non-standard description of such an agent is in two parts. The first part, the inference policy, tells us what states the agent can be in, and how these states change when the agent receives new input from its environment. The second part, the action policy, tells us what action the agent chooses to perform on the environment, in each of its internal states.

There are two special choices for the inference policy, marking two extremes. One extreme is for the agent to remain absolutely oblivious to the information coming its way. The transition from a past internal state to a current one is made independent of the observation, and no entanglement is formed between the agent and the environment. A rock, for example, comes close to being this little alive.

This post focuses on the other extreme, where the agent updates perfectly for the new information.

Keeping track of all the information is easy, on paper. All the agent has to do is maintain the sequence of past observations, the observable history O1, O2, ..., Ot. As each new observation is perceived, it can simply be appended to the list. Everything the agent can possibly know about the world, anything it can possibly hope to use in choosing actions, is in the observable history - there's no clairvoyance.

But this is far from practical, for many related reasons. Extracting the useful information from the raw observations can be a demanding task. The number of observations to remember grows indefinitely with time, putting a strain on the resources of an agent attempting longevity. The number of possible agent states grows exponentially with time, making it difficult to even specify (let alone decide) what action to take in each one.

Clearly we need some sort of compression when producing the agent's memory state from the observable history. Two requirements for the compression process: one, as per the premise of this post, is that it preserves all information about the world; the other is that it can be computed sequentially - when computing Mt the agent only has access to the new observation Ot and the previous compression Mt-1. The explicit value of all previous observations is forever lost.

This is a good moment to introduce proper terminology. A function of the observable history is called a statistic. Intuitively, applying a function to the data can only decrease, never increase the amount of information we end up having about the world. This intuition is solid, as the Data Processing Inequality proves. If the function does not lose any information about the world, if looking at the agent's memory is enough and there's nothing more relevant in the observations themselves, then the memory state is a sufficient statistic of the observable history, for the world state. The things the agent does forget about past perceptions are not at all informative for the present. Ultimately, when nothing further can be forgotten this way, we are left with a minimal sufficient statistic.
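A small example may make "sufficient statistic" concrete. The sketch below uses a deliberately simple constant world, which is an assumption for illustration and not the changing world of this post: a coin whose bias is one of three hypothetical values. Two different observable histories with the same head and tail counts lead to exactly the same posterior, so the counts lose nothing that matters about the bias.

```python
from fractions import Fraction

# Hypothetical set of possible coin biases (the "world state" here is constant).
BIASES = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]

def posterior_from_history(history):
    """Start from a uniform prior and update on one observation at a time."""
    belief = [Fraction(1, len(BIASES))] * len(BIASES)
    for obs in history:  # obs is 1 for heads, 0 for tails
        belief = [b * (p if obs else 1 - p) for b, p in zip(belief, BIASES)]
        total = sum(belief)
        belief = [b / total for b in belief]
    return belief

def posterior_from_counts(heads, tails):
    """Compute the same posterior from the counts alone."""
    weights = [p**heads * (1 - p)**tails for p in BIASES]
    total = sum(weights)
    return [w / total for w in weights]

history_a = [1, 1, 0, 1, 0]
history_b = [0, 1, 0, 1, 1]  # same counts, different order

print(posterior_from_history(history_a) == posterior_from_history(history_b))
print(posterior_from_history(history_a) == posterior_from_counts(3, 2))
```

The order of the observations is the part that can be forgotten here. In a changing world the order does matter, which is why the belief, and not a simple count, is the statistic to keep.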

If I tell you the observable history of the agent, what will you know about the world state? If you know the dynamics of the world and how observations are generated, you'll have the Bayesian belief, assigning to each world state the posterior distribution:

Bt(Wt) = Pr(Wt|O1,...,Ot)

(where Pr stands for "probability"). Importantly, this can be computed sequentially from Bt-1 and Ot, using Bayes' theorem. (The gory details of how to do this are below in a comment.)

Aha! So the Bayesian belief is an expression for precisely everything the agent can possibly know about the world. Why not have the agent's memory represent exactly that?

Mt = Bt

As it turns out, the Bayesian belief is indeed a minimal sufficient statistic of the observable history for the world state. For the agent, it is the truth, the whole truth, and nothing but the truth - and a methodical way to remember the truth, to boot.

Thus we've compressed into the agent's memory all and only the information from its past that is relevant for the present. We've discarded any information that is an artifact of the senses, and is not real. We've discarded any information that used to be real, but isn't anymore, because the world has since changed.

The observant reader will notice that we haven't discussed actions yet. We're getting there. The question of what information is relevant for future actions is deep enough to justify this meticulous exposition. For the moment, just note that keeping a sufficient statistic for the current world state is also sufficient for the controllable future, since the future is independent of the past given the present.

What we have established here is an "ultimate solution" for how a reinforcement-learning agent should maintain its memory state. It should update a Bayesian belief of what the current world state is. This inference policy is so powerful and natural, that standard reinforcement learning doesn't even make a distinction between the Bayesian belief and the agent's memory state, ignoring anything else we could imagine the latter to be.

Continue reading: Point-Based Value Iteration

In response to The Bayesian Agent
Comment author: royf 18 September 2012 03:12:21AM *  2 points [-]

If you're a devoted Bayesian, you probably know how to update on evidence, and even how to do so repeatedly on a sequence of observations. What you may not know is how to update in a changing world. Here's how:

Bt+1(Wt+1) = Pr(Wt+1|O1,...,Ot+1) = σ(Ot+1|Wt+1) · Pr(Wt+1|O1,...,Ot) / Pr(Ot+1|O1,...,Ot)

As usual with Bayes' theorem, we only need to calculate the numerator for different values of Wt+1, and the denominator will normalize them to sum to 1, as probabilities do. We know σ as part of the dynamics of the system, so we only need Pr(Wt+1|O1,...,Ot). This can be calculated by introducing the other variables in the process:

Pr(Wt+1|O1,...,Ot) = Σ_{Wt,At} Pr(Wt,At|O1,...,Ot) · p(Wt+1|Wt,At)

An important thing to notice is that, given the observable history, the world state and the action are independent - the agent can't act on unseen information. We continue:

Pr(Wt+1|O1,...,Ot) = Σ_{Wt,At} Pr(Wt|O1,...,Ot) · Pr(At|O1,...,Ot) · p(Wt+1|Wt,At)

Recall that the agent's belief Bt is a function of the observable history, and that the action only depends on the observable history through its memory Mt. We conclude:

Bt+1(Wt+1) ∝ σ(Ot+1|Wt+1) · Σ_{Wt,At} Bt(Wt) · π(At|Mt) · p(Wt+1|Wt,At)
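The update described in this comment can be sketched in code, assuming it takes the usual predict-then-correct form: the new belief in a world state is proportional to σ(O|W') times the sum over W and A of B(W) · π(A|M) · p(W'|W,A). The function name, the dictionary layout and the two-state "calm"/"windy" world below are all hypothetical choices made for illustration.

```python
def update_belief(belief, memory, obs, sigma, pi, p):
    """One step of the belief update in a changing world.

    belief: dict mapping world state -> probability (the current belief)
    memory: key of the agent's current memory state
    obs:    the newly received observation
    sigma[w][o]    = probability of observing o in world state w
    pi[m][a]       = probability of action a in memory state m
    p[(w, a)][w2]  = probability of moving from w to w2 under action a
    """
    states = list(belief)
    unnormalized = {}
    for w_next in states:
        # Predict: push the old belief through the action policy and the dynamics.
        predicted = sum(
            belief[w] * pi[memory][a] * p[(w, a)].get(w_next, 0.0)
            for w in states
            for a in pi[memory]
        )
        # Correct: weight by how well w_next explains the new observation.
        unnormalized[w_next] = sigma[w_next].get(obs, 0.0) * predicted
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return {w: v / total for w, v in unnormalized.items()}

# Hypothetical toy model: wind on Mars, a rover that can only wait.
sigma = {"calm": {"low": 0.8, "high": 0.2}, "windy": {"low": 0.1, "high": 0.9}}
pi = {"m0": {"wait": 1.0}}
p = {("calm", "wait"): {"calm": 0.9, "windy": 0.1},
     ("windy", "wait"): {"calm": 0.3, "windy": 0.7}}

new_belief = update_belief({"calm": 0.5, "windy": 0.5}, "m0", "high", sigma, pi, p)
print(new_belief)  # roughly {'calm': 0.25, 'windy': 0.75}
```

The normalization at the end plays the role of the denominator in Bayes' theorem, so it never has to be computed separately.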

Comment author: royf 23 August 2012 05:16:25AM 1 point [-]

p(H|E1,E2) [...] is simply not something you can calculate in probability theory from the information given [i.e. p(H|E1) and p(H|E2)].

Jaynes would disapprove.

You continue to give more information, namely that p(H|E1,E2) = p(H|E1). Thanks, that reduces our uncertainty about p(H|E1,E2).

But we are hardly helpless without it. Whatever happened to the Maximum Entropy Principle? Incidentally, the maximum entropy distribution (given the initial information) does have E1 and E2 independent. If your intuition says this before having more information, it is good.

Don't say that an answer can't be reached without further information. Say: here's more information to make your answer better.

Comment author: RichardKennaway 09 August 2012 12:53:57PM 1 point [-]

That's three preliminary postings, and Godot has not yet arrived. But surely he will arrive tomorrow.

Comment author: royf 09 August 2012 03:06:55PM *  5 points [-]

Clearly you have some password I'm supposed to guess.

This post is not preliminary. It's supposed to be interesting in itself. If it's not, then I'm doing something wrong, and would appreciate constructive criticism.

Reinforcement, Preference and Utility

7 royf 08 August 2012 06:23AM

Followup to: Reinforcement Learning: A Non-Standard Introduction

A reinforcement-learning agent is interacting with its environment through the perception of observations and the performance of actions.

We describe the influence of the world on the agent in two steps. The first is the generation of a sensory input Ot based on the state of the world Wt. We assume that this step is in accordance with the laws of physics, and out of anyone's hands. The second step is the actual changing of the agent's mind to a new state Mt. The probability distributions of these steps are, respectively, σ(Ot|Wt) and q(Mt|Mt-1,Ot).

Similarly, the agent affects the world by deciding on an action At and performing it. The designer of the agent can choose the probability distribution of actions π(At|Mt), but not the natural laws p(Wt+1|Wt,At) saying how these actions change the world.
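The four distributions σ, q, π and p define a loop that can be simulated directly. The following is a minimal sketch under made-up assumptions: the two-state "left"/"right" world, the memory that simply stores the last observation, and all the probabilities are hypothetical and only there to make the loop runnable.

```python
import random

def sample(dist, rng):
    """Draw one outcome from a dict mapping outcomes to probabilities."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]

def run(sigma, q, pi, p, w0, m0, steps, rng):
    """Simulate the agent-environment process for a number of steps."""
    w, m = w0, m0
    trajectory = []
    for _ in range(steps):
        o = sample(sigma[w], rng)    # observation from the world state
        m = sample(q[(m, o)], rng)   # new memory from old memory and observation
        a = sample(pi[m], rng)       # action from memory
        trajectory.append((w, o, m, a))
        w = sample(p[(w, a)], rng)   # next world state from state and action
    return trajectory

# Hypothetical toy model.
sigma = {"left": {"left": 0.9, "right": 0.1}, "right": {"left": 0.1, "right": 0.9}}
q = {(m, o): {o: 1.0} for m in ("start", "left", "right") for o in ("left", "right")}
pi = {"left": {"stay": 1.0}, "right": {"switch": 1.0}}
p = {("left", "stay"): {"left": 1.0}, ("left", "switch"): {"right": 1.0},
     ("right", "stay"): {"right": 1.0}, ("right", "switch"): {"left": 1.0}}

trajectory = run(sigma, q, pi, p, w0="left", m0="start", steps=5, rng=random.Random(0))
for step in trajectory:
    print(step)
```

The designer's choices are the tables q and pi. The tables sigma and p stand in for the natural laws that the designer cannot change.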

So how do we choose q and π? Let's first assume that it matters how. That is, let's assume that we have some preference over the results of the process, the actual values that all the variables take. Let's also make some further assumptions regarding this preference relation:

   1. The first assumption will be the standard von Neumann-Morgenstern rationality. This is a good opportunity to point out a common misconception in the interpretation of that result. It is often pointed out that humans, for instance, are not rational in the VNM sense. That is completely beside the point.

The agent doesn't choose the transitions q and π. The agent is these transitions. So it's not the agent that needs to be rational about the preference, and indeed it may appear not to be. If the agent has evolved, we may argue that evolutionary fitness is VNM-rational (even if the local-optimization process leads to sub-optimal fitness). If humans design a Mars rover to perform certain functions, we may argue that the goal dictates a VNM-rational preference (even if we, imperfect designers that we are, can only approximate it).

   2. The second assumption will be that the preference is strictly about the territory Wt, never about the map Mt. This means that we are never interested in merely putting the memory of the agent in a "good" state. We need it to be followed up by "good" actions.

You may think that an agent that generates accurate predictions of the stock market could be very useful. But if the agent doesn't follow up with actually investing well, or at least reliably reporting its findings to someone who does, then what good is it?

So we are going to define the preference with respect to only the states W1, ..., Wn and the actions A1, ..., An. If the observations are somehow important, we can always define them as being part of the world. If the agent's memory is somehow important, it will have to reliably communicate it out through actions.

   3. The third assumption will be the sunk cost assumption. We can fix some values of the first t steps of the process, and consider the preference with respect to the remaining n-t steps. This is like finding ourselves at time t, with a given t-step history, considering what to do next (though of course we plan for that contingency ahead of time). Our assumption is that we find that the preference is the same, regardless of the fixed history.

This last assumption gives rise to a special version of the von Neumann-Morgenstern utility theorem. We find that what we really prefer is to have as high as possible the expectation of a utility function with a special structure: the utility ut depends only on Wt, At and the expectation of ut+1 from the following step:

ut(Wt,At,E(ut+1))

This kind of recursion which goes backwards in time is a recurring theme in reinforcement learning.

We would like to have even more structure in our utility function, but this is where things become less principled and more open to personal taste among researchers. We base our own taste on a strong intuition that the following could, one day, be made to rely on much better principles than it currently does.

We will assume that ut is simply the summation of some immediate utility and the expectation of future utility:

ut(Wt,At,E(ut+1)) = R(Wt,At) + E(ut+1)

Here R is the Reward, the additional utility that the agent gets from taking the action At when the world is in state Wt. It's not hard to see that we can write down our utility in closed form, as the total of all rewards we get throughout the process:

R(W1,A1) + R(W2,A2) + ... + R(Wn,An)
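The claim that the backward recursion adds up to the total reward can be checked numerically on a tiny example. In this sketch the two-state world, the rewards and the probabilities are hypothetical, and for simplicity the action distribution is assumed to depend directly on the world state (a fully observed world), which is a simplification of the setting in this post.

```python
from itertools import product

def backward_utility(R, p, policy, states, actions, n):
    """Expected utility of n remaining steps, computed backwards in time:
    u(w) = sum_a policy(a|w) * (R(w,a) + sum_w2 p(w2|w,a) * u_next(w2))."""
    u = {w: 0.0 for w in states}  # nothing is left to gain after the last step
    for _ in range(n):
        u = {
            w: sum(
                policy[w][a]
                * (R[(w, a)] + sum(p[(w, a)].get(w2, 0.0) * u[w2] for w2 in states))
                for a in actions
            )
            for w in states
        }
    return u

def brute_force_total(R, p, policy, states, actions, n, w1):
    """Expected total reward, by enumerating every n-step trajectory from w1."""
    expected = 0.0
    for later_states in product(states, repeat=n - 1):
        path = (w1,) + later_states
        for acts in product(actions, repeat=n):
            prob, total = 1.0, 0.0
            for t in range(n):
                prob *= policy[path[t]][acts[t]]
                total += R[(path[t], acts[t])]
                if t + 1 < n:
                    prob *= p[(path[t], acts[t])].get(path[t + 1], 0.0)
            expected += prob * total
    return expected

# Hypothetical toy model.
states, actions = ("a", "b"), ("x", "y")
R = {("a", "x"): 1.0, ("a", "y"): 0.0, ("b", "x"): 0.0, ("b", "y"): 2.0}
p = {("a", "x"): {"a": 0.5, "b": 0.5}, ("a", "y"): {"b": 1.0},
     ("b", "x"): {"a": 1.0}, ("b", "y"): {"a": 0.2, "b": 0.8}}
policy = {"a": {"x": 0.7, "y": 0.3}, "b": {"x": 0.4, "y": 0.6}}

recursive = backward_utility(R, p, policy, states, actions, n=3)["a"]
enumerated = brute_force_total(R, p, policy, states, actions, n=3, w1="a")
print(abs(recursive - enumerated) < 1e-9)
```

The recursion touches each state once per step, while the enumeration grows exponentially with n, which is one practical reason the backward form is the one used in reinforcement learning.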

As a final note, if the agent is intended to achieve a high expectation of the total reward, then it may be helpful for the agent to actually observe its reward when it gets it. And indeed, many reinforcement learning algorithms require that the reward signal is visible to the agent as part of its observation each step. This can help the agent adapt its behavior to changes in the environment. After all, reinforcement learning means, quite literally, adaptation in response to a reward signal.

However, reinforcement learning can also happen when the reward signal is not explicit in the agent's observations. To the degree that the observations carry information about the state of the world, and that the reward is determined by that state, information about the reward is anyway implicit in the observations.

In this sense, the reward is just a score for the quality of the agent's design. If we take it into account when we design the agent, we end up choosing q and π so that the agent will exploit the implicit information in its observations to get a good reward.

Continue reading: The Bayesian Agent

Comment author: Johnicholas 03 August 2012 10:19:20PM 0 points [-]

As I understand it, you're dividing the agent from the world; once you introduce a reward signal, you'll be able to call it reinforcement learning. However, until you introduce a reward signal, you're not doing specifically reinforcement learning - everything applies just as well to any other kind of agent, such as a classical planner.

Comment author: royf 04 August 2012 01:20:33AM *  0 points [-]

That's an excellent point. Of course one cannot introduce RL without talking about the reward signal, and I've never intended to.

To me, however, the defining feature of RL is the structure of the solution space, described in this post. To you, it's the existence of a reward signal. I'm not sure that debating this difference of opinion is the best use of our time at this point. I do hope to share my reasons in future posts, if only because they should be interesting in themselves.

As for your last point: RL is indeed a very general setting, and classical planning can easily be formulated in RL terms.
