Comment author: V_V 04 January 2013 02:29:15AM 1 point

That seems to be a convoluted way of defining a Markov process.

It would be preferable if you attempted to use standard terminology and provided references to frame the discourse within the theory.

Comment author: royf 04 January 2013 07:44:53AM * 1 point

I explained this in my non-standard introduction to reinforcement learning.

We can define the world as having the Markov property, i.e. as a Markov process. But when we split the world into an agent and its environment, we lose the Markov property for each of them separately.

I'm using non-standard notation and terminology because they are needed for the theory I'm developing in these posts. In future posts I'll try to link more to the handful of researchers who do publish on this theory. I did publish one post relating the terminology I'm using to more standard research.

Comment author: kpreid 03 January 2013 05:07:55AM * 6 points

When a light bulb observes voltage, it emits light, regardless of whether it did so a second earlier. When the light bulb's internal attributes entangle with the voltage, they lose all information of what came before.

This example is false. An incandescent light bulb has a memory: its temperature. The temperature both determines the amount of light currently emitted by the bulb, and also the electrical resistance of the filament (higher when hot), which means that even the connected electrical circuit is affected by the state of the bulb — turning on the bulb produces a high “inrush current”.

A much better example would be a LED (not an LED light bulb, which likely contains a stateful power supply circuit), which is stateless for most practical purposes. (For example, once upon a time, there was networking hardware which could be snooped optically — the activity indicator LEDs were simply connected to the data lines and therefore transmitted them as visible light. Modern equipment typically uses programmed blinking intervals instead.)

Comment author: royf 03 January 2013 05:53:47AM 1 point

Fixed. Thanks!

How to Disentangle the Past and the Future

12 royf 02 January 2013 05:43PM

I'm on my way to an important meeting. Am I worried? I'm not worried. The presentation is on my laptop. I distinctly remember putting it there (in the past), so I can safely predict that it's going to be there when I get to the office (in the future) - this is how well my laptop carries information through space and time.

My partner has no memory of me copying the file to the laptop. For her, the past and the future have mutual information: if Omega assured her that I'd copied the presentation, she would be able to predict the future much better than she can now.

For me, the past and the future are much less statistically dependent. Whatever entanglement remains between them is due to my memory not being perfect. If my partner suddenly remembers that she saw me copying the file, I will be a little bit more sure that I remember correctly, and that we'll have it at the meeting. Or if somehow, despite my very clear mental image of copying the file, it's not there at the meeting, my first suspicion will nevertheless be that I hadn't.

These unlikely possibilities aside, my memory does serve me. My partner is much less certain of the future than me, and more to the point, her uncertainty would decrease much more than mine if we both suddenly became perfectly aware of the past.

But now she turns on my laptop and checks. The file is there, yes, I could have told her that. And now that we know the state of my laptop, the past and the future are completely disentangled: maybe I put the file there, maybe the elves did - it makes no difference for the meeting. And maybe by the time we get to the office a hacker will remotely delete the file - its absence at the meeting will not be evidence that I'd neglected to copy the file: it was right there! We saw it!

(By "entanglement" I don't mean anything quantum theoretic; here it refers to statistical dependence, and its strength is measured by mutual information.)

The past and the future have mutual information. This is a necessary condition for life, for intelligence: we use the past to predict and plan for the future, and we couldn't do it if it were useless.

On the other hand, the future is independent of the past given the present. That's not profound metaphysics, that's simply how we define the state of a system: it's everything one needs to know of the past of the system to compute its future. The past is gone, and any information that it had on the future - information we could use to predict and make a better future - is inherited by the present.

But as limited agents inside the system, we don't get to know its entire state. Most of it is hidden from us, behind walls and hills and inside skulls and in nanostructures. So for us, the past and the future are entangled, which means that by learning one we could reduce our uncertainty about the other.
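The difference between knowing the state and merely observing part of it can be checked numerically. Below is a minimal sketch, not part of the original argument: a hypothetical two-state world with made-up transition probabilities P and a made-up noisy sensor SIGMA. It computes the mutual information between the past state and the future state unconditionally, given the full present state, and given only a noisy observation of the present.

```python
import itertools
import math

# Hypothetical two-state world with "sticky" dynamics: P[w][w_next].
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
# Hypothetical noisy sensor: SIGMA[w][o] is the chance of observing o in state w.
SIGMA = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.3, 1: 0.7}}
PRIOR = {0: 0.5, 1: 0.5}


def joint():
    """Joint distribution over (past state, present state, observation, future state)."""
    dist = {}
    for w0, w1, o1, w2 in itertools.product((0, 1), repeat=4):
        dist[(w0, w1, o1, w2)] = PRIOR[w0] * P[w0][w1] * SIGMA[w1][o1] * P[w1][w2]
    return dist


def marginal(dist, idx):
    """Marginal distribution over the variables at positions idx."""
    out = {}
    for key, p in dist.items():
        sub = tuple(key[i] for i in idx)
        out[sub] = out.get(sub, 0.0) + p
    return out


def cond_mutual_info(dist, x, y, z):
    """I(X; Y | Z) in bits; x, y, z are tuples of variable positions."""
    pxyz = marginal(dist, x + y + z)
    pxz = marginal(dist, x + z)
    pyz = marginal(dist, y + z)
    pz = marginal(dist, z)
    total = 0.0
    for key, p in pxyz.items():
        if p == 0.0:
            continue
        kx = key[:len(x)]
        ky = key[len(x):len(x) + len(y)]
        kz = key[len(x) + len(y):]
        total += p * math.log2(p * pz[kz] / (pxz[kx + kz] * pyz[ky + kz]))
    return total


D = joint()
PAST, PRESENT, OBSERVATION, FUTURE = (0,), (1,), (2,), (3,)
mi_unconditional = cond_mutual_info(D, PAST, FUTURE, ())
mi_given_state = cond_mutual_info(D, PAST, FUTURE, PRESENT)
mi_given_obs = cond_mutual_info(D, PAST, FUTURE, OBSERVATION)
```

With these numbers, mi_given_state is zero up to rounding error, while mi_unconditional and mi_given_obs are both positive: the full state separates the past from the future, and a partial observation does not.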

 


 

In control theory, memoryless agents have an unchanging internal structure, unable to entangle with the past and carry information useful for the future. Instead they react to whatever last input they received, like a function. These degenerate agents have an internal memory state Mt that depends only on the most recent observation Ot:

 

Figure 1: Dynamics of a memoryless reinforcement learning agent

Wt is the state of the world outside the agent at time t, Ot is the observation the agent makes of the world, Mt is the resulting internal state of the agent, and At is the action the agent chooses to take.

When an LED observes voltage, it emits light, regardless of whether it did so a second earlier. When the LED's internal attributes entangle with the voltage, they lose all information of what came before.

When the q key on a keyboard is pressed and released, the keyboard sends a signal to that effect to the computer, and that signal is mostly independent of which keys were pressed before and in what order (with a few exceptions; a keyboard is not entirely memoryless). A keyboard gets entangled with vast amounts of information over the years, but streams it through and loses almost any trace of it within seconds.

For a memoryless agent, all of the information between past and future flows through the environment outside the agent - through Wt. The world sans agent retains all the power to disentangle the past and the future: you can check in the graphical model in Figure 1 that Wt-1 and Wt+1 are independent given Wt.

The internal state Mt of the memoryless agent, on the other hand, of course reveals nothing about the link between past and future. Looking at my keyboard, you can't tell if I copied the presentation to my laptop, and if it's going to be there in half an hour.

 


 

Intelligent agents need to have memory:


Figure 2: General dynamics of a reinforcement learning agent

Now control over the flow of information has shifted, to some extent, in favor of the agent. The agent has a much wider channel over which to receive information from the past, and it can use this information to recover some truths about the present which aren't currently observable.

The past and the future are no longer independent given Wt alone; you need Mt and Wt together to completely separate the past and the future. An agent with memory of how Wt came to be can partly assume both roles, thus making itself a better separator than its memoryless counterpart could.

 


 

So here's how to become a good separator of past and future: remember things that are relevant for the future.

Memory is necessary for intelligence, you already knew that. What's new here is a way to measure just how useful memory is. Memory is useful exactly to the extent to which it shifts the power to control the flow of information.

For the agent, memory disentangles the past and the future. If you maintain in yourself some of the information that the past has about the future, you overcome the limitations of your ability to observe the present. I know I put the presentation on my laptop, so I don't need to check it's there.

For other agents, at the same time, the agent's memory is yet another limitation to their observability. Think of a secret handshake, for example. It's useful precisely because it predicts (and controls) the future for the confidants, while keeping it entangled with the hidden past for everyone else.

Continue reading: How to Be Oversurprised 

Comment author: Decius 08 October 2012 08:44:24PM 0 points

How does deciding one model is true give you more information? Did you mean "If a model allows you to make more predictions about future observations, then it is a priori less likely?"

Comment author: royf 08 October 2012 09:43:09PM 0 points

How does deciding one model is true give you more information?

Let's assume a strong version of Bayesianism, which entails the maximum entropy principle. So our belief is the one that has the maximum entropy, among those consistent with our prior information. If we now add the information that some model is true, this generally invalidates our previous belief, making the new maximum-entropy belief one of lower entropy. The reduction in entropy is the amount of information you gain by learning the model. In a way, this is a cost we pay for "narrowing" our belief.

The upside of it is that it tells us something useful about the future. Of course, not all information regarding the world is relevant for future observations. The part that doesn't help control our anticipation is failing to pay rent, and should be evicted. The part that does inform us about the future may be useful enough to be worth the cost we pay in taking in new information.

I'll expand on all of this in my sequence on reinforcement learning.

Comment author: aspera 08 October 2012 06:05:09PM 2 points

Occam's Razor is non-Bayesian? Correct me if I'm wrong, but I thought it falls naturally out of Bayesian model comparison, from the normalization factors, or "Occam factors." As I remember, the argument is something like: given two models with independent parameters {A} and {A,B}, the P(AB model) \propto P(AB are correct) and P(A model) \propto P(A is correct). Then P(AB model) <= P(A model).

Even if the argument is wrong, I think the result ends up being that more plausible models tend to have fewer independent parameters.

Comment author: royf 08 October 2012 08:33:11PM 1 point

You're not really wrong. The thing is that "Occam's razor" is a conceptual principle, not one mathematically defined law. A certain (subjectively very appealing) formulation of it does follow from Bayesianism.

P(AB model) \propto P(AB are correct) and P(A model) \propto P(A is correct). Then P(AB model) <= P(A model).

Your math is a bit off, but I understand what you mean. If we have two sets of models, with no prior information to discriminate between their members, then the prior gives less probability to each model in the larger set than in the smaller one.

More generally, if deciding that model 1 is true gives you more information than deciding that model 2 is true, that means that the maximum entropy given model 1 is lower than that given model 2, which in turn means (under the maximum entropy principle) that model 1 was a-priori less likely.
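A toy calculation may make this concrete. The sketch below is only an illustration under assumptions I'm adding here: 8 equally likely "worlds" (so the maximum-entropy prior is uniform), and two hypothetical models that each simply rule some worlds out. In that setting the information gained by learning a model equals minus the log of its prior probability, so the more informative model is the a-priori less likely one.

```python
import math

N_WORLDS = 8  # hypothetical: 8 possible worlds, uniform maximum-entropy prior

# Hypothetical models, each described by the set of worlds consistent with it.
MODELS = {
    "model 1": {0, 1},
    "model 2": {0, 1, 2, 3},
}


def prior_probability(worlds):
    """Prior probability that the model is true, under the uniform prior."""
    return len(worlds) / N_WORLDS


def max_entropy_bits(worlds):
    """Entropy of the maximum-entropy belief consistent with the model (uniform on its worlds)."""
    return math.log2(len(worlds))


def information_gained_bits(worlds):
    """Entropy reduction from the prior belief to the belief given the model."""
    return math.log2(N_WORLDS) - max_entropy_bits(worlds)


summary = {
    name: (prior_probability(w), information_gained_bits(w))
    for name, w in MODELS.items()
}
```

Here model 1 has prior probability 0.25 and yields 2 bits, while model 2 has prior probability 0.5 and yields 1 bit.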

Anyway, this is all beside the discussion that inspired my previous comment. My point was that even without Popper and Jaynes to enlighten us, science was making progress using other methods of rationality, among which is a myriad of non-Bayesian interpretations of Occam's razor.

In response to comment by royf on Internal Availability
Comment author: timtyler 08 October 2012 05:52:46PM 0 points

I was trying to give a specific reason that the availability heuristic is there: it's coupled with another mechanism that actually generates the availability; and then to say a few things about this other mechanism.

It seems obvious why the availability heuristic is there. The ease with which images, events and concepts come to mind is correlated with how frequently they have been observed, which in turn is correlated with how likely they are to happen again. So, the heuristic is a reasonably-good one which just happens to have some associated false positives.

Comment author: royf 08 October 2012 06:38:08PM 0 points

The ease with which images, events and concepts come to mind is correlated with how frequently they have been observed, which in turn is correlated with how likely they are to happen again.

Yes, and I was trying to make this description one level more concrete.

Things never happen the exact same way twice. The way that past observations are correlated with what may happen again is complicated - in a way, that's exactly what "concepts" capture.

So we don't just recall something that happened and predict that it will happen again. Rather, we compose a prediction based on an integration of bits and patches from past experiences. Recalling these bits and patches as relevant for the context of the prediction - and of each other - is a complicated task, and I propose that an "internal availability" mechanism is needed to perform it.

In response to comment by royf on Internal Availability
Comment author: faul_sname 08 October 2012 08:47:23AM 0 points

I'm still unsure of what you're actually saying. Perhaps you're talking about some sort of a "plausibility heuristic", where we look for instances of something in our model of the world, not just our experiences. That seems trivial, but that's not necessarily a bad thing (I would prefer to see more stuff here that seems really obvious to people, because those few times it's not obvious to everyone tend to be very valuable). If you're saying something else, I'm still not getting it.

Comment author: royf 08 October 2012 09:09:41AM * 1 point

Take for example your analysis of the poker hand I partially described. You give 3 possibilities for what the truth of it may be. Are there any other possibilities? Maybe the player is bluffing to gain the reputation of a bluffer? Maybe she mistook a 4 for an ace (it happened to me once...)? Maybe aliens hijacked her brain?

It would be impossible to enumerate or notice all the possibilities, but fortunately we don't have to. We make only the most likely and important ones available.

Comment author: faul_sname 08 October 2012 08:21:51AM 1 point

You're playing Texas Hold'em poker against another player, and she has just bet all her chips on the flop (the 2nd of 4 betting rounds, when there are 2 more shared cards to draw). You estimate that with high probability she has a low pair (say, under 9) with a high kicker (A or K, hoping to hit a second pair). You hold Q-J off-suit. Do you call?

It really depends on what the flop was. All in implies either a desperate player who's either bluffing (~30% of the time the player is desperate in my experience) or who has a good hand (the remaining 70% of the time the player is desperate) or, if the player still has a good-sized stack/this is a cash game, the player is bluffing a fairly small amount of the time (not enough to justify going in with a Q-J). In fact, after missing the flop with Q-J off, I can't think of any situation where I'd be likely to call a bet (well, unless there were a huge pot, but beyond that).

Poker analysis aside, I'm not quite sure what the point of this article was. If it was just saying that the availability heuristic is there for a reason and we should be careful about adjusting for it, I fully agree. If you were trying to say something more, I'm afraid I at least didn't get that point.

Comment author: royf 08 October 2012 08:35:05AM 1 point

I was trying to give a specific reason that the availability heuristic is there: it's coupled with another mechanism that actually generates the availability; and then to say a few things about this other mechanism.

Does anyone have specific advice on how I could convey this better?

Point-Based Value Iteration

9 royf 08 October 2012 06:19AM

Followup to: The Bayesian Agent

This post explains one interesting and influential algorithm for choosing high-utility actions for a Bayesian agent, called Point-Based Value Iteration (original paper). Its main premise resembles some concept of internal availability.

A reinforcement-learning agent chooses its actions based on its internal memory state. The memory doesn't have to include an exact account of all past observations - it's sufficient for the agent to keep track of a belief about the current state of the world.

This mitigates the problem of having the size of the memory state grow indefinitely. The memory-state space is now the set of all distributions over the world state. Importantly, this space doesn't grow exponentially with time like the space of observable histories. Unfortunately, it still has volume exponential in the number of world states.

So we moved away from specifying the action to take after each observable history, in favor of specifying the action to take in each belief state. This was justified by the belief being a sufficient statistic for the world state, and motivated by the belief space being much smaller (for long processes) than the history space. But for the purpose of understanding this algorithm, we should go back to describing the action policy in terms of the action to take after each observable history, π(At|O1,...,Ot).

Now join us as we're watching the dynamics of the process unfold. We're at time t, and the agent has reached Bayesian belief Bt, the distribution of the world state to the best of the agent's knowledge. What prospects does the agent still have for future rewards? What is its value-to-go of being in this belief state?

If we know what action the agent will take after each observable history (if and when it comes to pass), it turns out that the expected total of future rewards is linear in Bt. To see why this is so, consider a specific future (and present):

w_t, o_t, a_t, w_{t+1}, o_{t+1}, a_{t+1}, ..., w_n, o_n, a_n

The reward of this future is:

R(w_t, a_t) + R(w_{t+1}, a_{t+1}) + ... + R(w_n, a_n)

The probability of this future, given that the agent already knows o_1,...,o_t, is:

B_t(w_t)·π(a_t|o_1,...,o_t)·p(w_{t+1}|w_t,a_t)·σ(o_{t+1}|w_{t+1})·π(a_{t+1}|o_1,...,o_{t+1})···

···p(w_n|w_{n-1},a_{n-1})·σ(o_n|w_n)·π(a_n|o_1,...,o_n)

We arrive at this by starting with the agent's belief Bt, which summarizes the distribution of wt given the observable history; then going on to multiply the probability for each new variable, given those before it that have part in causing it.

Note how Bt linearly determines the probability for wt, while the dynamics of the world and the agent determine the probability for the rest of the process. Now, to find the expected reward we sum, over all possible futures, the product of the future's reward and probability. This is all linear in the belief Bt at time t.
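The linearity can be checked by brute force on a toy problem. The sketch below is an illustration under assumptions of mine, not part of the algorithm: two world states, two observations, two actions, made-up tables P, SIGMA and R, and a fixed deterministic policy that simply repeats the latest observation as its action. It sums reward times probability over every future, once for an arbitrary belief and once for each "certain" belief, and compares the two.

```python
STATES, ACTIONS, OBS = (0, 1), (0, 1), (0, 1)

# Hypothetical dynamics: P[w][a][w_next], SIGMA[w][o], R[w][a].
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.1, 0.9], [0.8, 0.2]]]
SIGMA = [[0.8, 0.2], [0.2, 0.8]]
R = [[1.0, 0.0], [0.0, 2.0]]


def policy(history):
    """A fixed deterministic policy: use the most recent observation as the action."""
    return history[-1]


def expected_reward(belief, history, steps_left):
    """Sum over all futures of (probability of the future) * (reward of the future)."""
    total = 0.0

    def walk(w, hist, prob, reward, left):
        nonlocal total
        a = policy(hist)
        reward += R[w][a]
        if left == 0:
            total += prob * reward
            return
        for w_next in STATES:
            for o_next in OBS:
                step = P[w][a][w_next] * SIGMA[w_next][o_next]
                walk(w_next, hist + (o_next,), prob * step, reward, left - 1)

    # The belief only enters here, as the probability of the current world state.
    for w in STATES:
        walk(w, history, belief[w], 0.0, steps_left)
    return total


HISTORY = (0,)   # observations the agent already knows
STEPS = 3        # remaining steps after the current one

# Value of the policy when the world state is known for certain, one entry per state.
alpha = [
    expected_reward([1.0 if v == w else 0.0 for v in STATES], HISTORY, STEPS)
    for w in STATES
]

belief = [0.3, 0.7]
direct = expected_reward(belief, HISTORY, STEPS)
via_alpha = sum(b * a for b, a in zip(belief, alpha))
```

The two numbers agree up to rounding, so for a fixed policy the whole value function is captured by one vector with an entry per world state.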

If the value of choosing a specific action policy is a linear function of Bt, the value of choosing the best action policy is the maximum of many linear functions of Bt. How many? If there are 2 possible observations, the observable history that will have unfolded in the next n-t steps can be any of 2^(n-t) possible "future histories". If there are 2 possible actions to choose for each one, there are 2^(2^(n-t)) possible policies. A scary number if the end is still some time ahead of us.
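To get a feel for how scary, here is a small sketch of the counting argument, using the paragraph's simplification of one action choice per future history: 2^(n-t) future histories and 2^(2^(n-t)) policies. The function names are mine.

```python
def num_future_histories(steps, n_observations=2):
    """Number of observation sequences that can unfold over the remaining steps."""
    return n_observations ** steps


def num_policies(steps, n_observations=2, n_actions=2):
    """Number of ways to assign one action to each future history."""
    return n_actions ** num_future_histories(steps, n_observations)


# (remaining steps, future histories, decimal digits in the number of policies)
growth = [
    (steps, num_future_histories(steps), len(str(num_policies(steps))))
    for steps in range(1, 11)
]
```

The last column is the number of decimal digits in the policy count, which is itself already growing exponentially with the number of remaining steps.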

Of course, we're not interested in the value of just any belief, only of those that can actually be reached at time t. Since the belief Bt is determined by the observable history o1,...,ot, there are at most 2^t such values. Less scary, but still not practical for long processes.

The secret of the algorithm is that it limits both of the figures above to a fixed, manageable number. Suppose that we have a list of 100 belief states which are the most interesting to us, perhaps because they are the most likely to be reached at time t. Suppose that we also have a list Πt of 100 policies: for each belief on the first list, we have in the second list a policy that gets a good reward if indeed that belief is reached at time t. Now what is the value, at time t, of choosing the best action policy of these 100? It's the maximum of, not many, but |Πt|=100 linear functions of Bt. Much more manageable.
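In code, "the maximum of a few linear functions" is short. This is a minimal sketch, assuming each stored policy is summarized by a vector holding its value in each world state; the three vectors below are made up for illustration.

```python
def best_policy(belief, policy_values):
    """Return (index, value) of the stored policy with the highest expected value at belief.

    policy_values[i][w] is the value of policy i when the world state is w,
    so its value at a belief is the dot product with that belief.
    """
    values = [sum(b * v for b, v in zip(belief, vec)) for vec in policy_values]
    index = max(range(len(values)), key=values.__getitem__)
    return index, values[index]


# Hypothetical list of three policies over two world states.
PI_T = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6)]

uncertain = best_policy((0.5, 0.5), PI_T)
confident = best_policy((0.9, 0.1), PI_T)
```

When the belief is spread evenly, the hedging policy (the third one) wins; when the belief is concentrated on the first state, the policy specialized for that state wins. The lookup costs one dot product per stored policy, whichever belief it is asked about.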

Ok, we're done with time t. What about time t-1? We are going to consider 100 belief states which can be reached at time t-1. For each such belief Bt-1, we are going to consider each possible action At-1. Then we are going to consider each possible observation Ot that can come next - we can compute its probability from Bt-1 and the dynamics of the system. Given the observation, we will have a new belief Bt. And the previous paragraph tells us what to do in this case: choose one of the policies in Πt which is best for Bt, to use for the rest of the process.

(Note that this new Bt doesn't have to be one of the 100 beliefs that were used to construct the 100 policies in Πt.)
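The step from Bt-1 to Bt is an ordinary Bayesian update. Here is a minimal sketch of it, using a made-up two-state world; the tables P and SIGMA and the function names are assumptions for illustration only.

```python
STATES, ACTIONS, OBS = (0, 1), (0, 1), (0, 1)

# Hypothetical dynamics: P[w][a][w_next] and SIGMA[w][o].
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.1, 0.9], [0.8, 0.2]]]
SIGMA = [[0.8, 0.2], [0.2, 0.8]]


def predict(belief, action):
    """Distribution of the next world state, before seeing the next observation."""
    return [sum(belief[w] * P[w][action][w2] for w in STATES) for w2 in STATES]


def observation_probability(belief, action, obs):
    """Probability of the next observation, given the previous belief and action."""
    predicted = predict(belief, action)
    return sum(predicted[w2] * SIGMA[w2][obs] for w2 in STATES)


def update_belief(belief, action, obs):
    """New belief after taking the action and then seeing the observation."""
    predicted = predict(belief, action)
    unnormalized = [predicted[w2] * SIGMA[w2][obs] for w2 in STATES]
    z = sum(unnormalized)
    if z == 0:
        raise ValueError("observation has zero probability under this belief")
    return [u / z for u in unnormalized]


b_prev = [0.5, 0.5]
p_obs = [observation_probability(b_prev, 0, o) for o in OBS]
b_next = update_belief(b_prev, 0, 0)
```

Nothing here guarantees that b_next is one of the stored beliefs, which is why the stored policies have to be evaluated at arbitrary beliefs.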

So this is how the recursive step goes:

  • For each of 100 possible values for Bt-1
    • For each possible At-1
      • For each possible Ot
        • We compute Bt
        • We choose for Bt the best policy in Πt, and find its value (it's a linear function of Bt)
      • We go back and ask: how good is it really to choose At-1? What is the expected value before knowing Ot?
    • We choose the best At-1
    • This gives us the best policy for Bt-1, under the restriction that we are only willing to use one of the policies in Πt from time t onward
  • After going over the 100 beliefs for time t-1, we now have a new stored list, Πt-1, of 100 policies for time t-1.
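The recursive step above can be sketched as follows. This is an illustration under my own assumptions rather than the paper's implementation: a made-up two-state world, three hand-picked beliefs instead of 100, the same belief list reused at every time step, and each stored policy represented by the vector of its values per world state.

```python
STATES, ACTIONS, OBS = (0, 1), (0, 1), (0, 1)

# Hypothetical dynamics: P[w][a][w_next], SIGMA[w][o], R[w][a].
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.1, 0.9], [0.8, 0.2]]]
SIGMA = [[0.8, 0.2], [0.2, 0.8]]
R = [[1.0, 0.0], [0.0, 2.0]]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def next_belief_unnormalized(belief, a, o):
    """Entry w2 is P(next state = w2, next observation = o | belief, a).

    Normalizing would give the new belief; the best policy for it is the same
    either way, because scaling by a positive number doesn't change the argmax.
    """
    return [SIGMA[w2][o] * sum(belief[w] * P[w][a][w2] for w in STATES)
            for w2 in STATES]


def backup(belief, next_policies):
    """Best (action, value vector) at belief, using only next_policies afterwards."""
    best_action, best_vector = None, None
    for a in ACTIONS:
        vector = [R[w][a] for w in STATES]           # immediate reward
        for o in OBS:
            b_next = next_belief_unnormalized(belief, a, o)
            chosen = max(next_policies, key=lambda vec: dot(vec, b_next))
            for w in STATES:                          # expected value before knowing o
                vector[w] += sum(P[w][a][w2] * SIGMA[w2][o] * chosen[w2]
                                 for w2 in STATES)
        if best_vector is None or dot(vector, belief) > dot(best_vector, belief):
            best_action, best_vector = a, vector
    return best_action, best_vector


def final_step_policies(beliefs):
    """At the last step a policy is a single action; keep the best one per belief."""
    rewards = [[R[w][a] for w in STATES] for a in ACTIONS]
    return [max(rewards, key=lambda vec: dot(vec, b)) for b in beliefs]


def point_based_policies(beliefs, steps_back):
    """Walk backward in time, keeping one policy per belief at every step."""
    policies = final_step_policies(beliefs)
    for _ in range(steps_back):
        policies = [backup(b, policies)[1] for b in beliefs]
    return policies


BELIEFS = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]
PI = point_based_policies(BELIEFS, steps_back=3)
```

Each backward step costs (beliefs) x (actions) x (observations) x (stored policies) dot products, so the work per step stays fixed no matter how far the end of the process is.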

We continue backward in time, until we reach the starting time t0, where we choose the best policy in Πt0 for the initial (prior) belief Bt0.

We still need to say how to choose those special 100 beliefs to optimize for. Actually, this is not a fixed set. Most point-based algorithms start with only one or a few beliefs, and gradually expand the set to find better and better solutions.

The original algorithm chooses beliefs in a similar way to an internal availability mechanism that may be responsible for conjuring up images of likely futures in human brains. It starts with only one belief (the prior Bt0) and finds a solution where only one policy is allowed in each Πt. Then it adds some beliefs which seem likely to be reached in this solution, and finds a new and improved solution. It repeats this computation until a good-enough solution is reached, or until it runs out of time.
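As a rough illustration of growing the belief set, here is a simplified stand-in for the expansion step. It is not the paper's exact rule: for each stored belief it looks at the beliefs reachable in one step, ignores those reached with probability below a hypothetical cutoff min_prob, and adds the one farthest (in L1 distance) from everything stored so far. The world model is again made up.

```python
STATES, ACTIONS, OBS = (0, 1), (0, 1), (0, 1)

# Hypothetical dynamics: P[w][a][w_next] and SIGMA[w][o].
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.1, 0.9], [0.8, 0.2]]]
SIGMA = [[0.8, 0.2], [0.2, 0.8]]


def successors(belief):
    """(probability, new belief) for every action and observation that can follow."""
    out = []
    for a in ACTIONS:
        for o in OBS:
            unnormalized = [
                SIGMA[w2][o] * sum(belief[w] * P[w][a][w2] for w in STATES)
                for w2 in STATES
            ]
            prob = sum(unnormalized)
            if prob > 0:
                out.append((prob, [u / prob for u in unnormalized]))
    return out


def l1_distance(u, v):
    return sum(abs(x - y) for x, y in zip(u, v))


def expand(beliefs, min_prob=0.2):
    """Add, for each stored belief, its likely successor farthest from the stored set."""
    expanded = list(beliefs)
    for b in beliefs:
        candidates = [s for prob, s in successors(b) if prob >= min_prob]
        if not candidates:
            continue
        farthest = max(
            candidates,
            key=lambda s: min(l1_distance(s, known) for known in expanded),
        )
        # Skip successors that are already (numerically) in the set.
        if min(l1_distance(farthest, known) for known in expanded) > 1e-9:
            expanded.append(farthest)
    return expanded


prior = [0.5, 0.5]
round_1 = expand([prior])
round_2 = expand(round_1)
```

Alternating a call like this with a fresh backward pass over the enlarged set gives the anytime behavior described above: stop whenever the solution is good enough or time runs out.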

Point-based algorithms are today some of the best ones for finding good policies for Bayesian agents. They drive some real-world applications, and contribute to our understanding of how intelligent agents can act and react.

Internal Availability

2 royf 08 October 2012 06:19AM

Edit: Following mixed reception, I decided to split this part out of the latest post in my sequence on reinforcement learning. It wasn't clear enough, and anyway didn't belong there.

I'm posting this hopefully better version to Discussion, and welcome further comments on content and style.

 


 

The availability heuristic seems to be a mechanism inside our brains which uses the ease with which images, events and concepts come to mind as evidence for their prevalence or probability of occurrence. For this heuristic to be worth the trouble it causes, there needs to be a counterpart, a second mechanism which actually makes things available to the first one in correlation with their likelihood. In this post I discuss why having such an internal availability mechanism can be a good idea, and outline some of the ways it can fail.

 


 

You're playing Texas Hold'em poker against another player, and she has just bet all her chips on the flop (the 2nd of 4 betting rounds, when there are 2 more shared cards to draw). You estimate that with high probability she has a low pair (say, under 9) with a high kicker (A or K, hoping to hit a second pair). You hold Q-J off-suit. Do you call?

One question this depends on is: what's the probability p that you will win this hand? An experienced player will know that your best hope is to hit a pair, without the other player hitting anything better than her low pair. This has a probability of slightly less than 25%.
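As a rough check on that figure, here is a minimal sketch of the first-order arithmetic. It assumes details the scenario doesn't spell out: 45 unseen cards (52 minus your two, her two and the three on the flop), six outs for you (three queens and three jacks), and five outs for her (three cards matching her kicker and two that give her trips). It ignores runner-runner straights and flushes, so it brackets the estimate rather than pinning it down.

```python
from math import comb

UNSEEN = 52 - 2 - 2 - 3   # cards not in either hand or on the flop (assumed)
YOUR_OUTS = 6             # three queens + three jacks (assumed all still unseen)
HER_OUTS = 5              # three kicker cards + two cards for trips (assumed)
DRAWS = 2                 # turn and river


def p_at_least_one(outs, unseen=UNSEEN, draws=DRAWS):
    """Probability that at least one of the outs appears among the drawn cards."""
    return 1 - comb(unseen - outs, draws) / comb(unseen, draws)


# You pair up at least once, regardless of what she hits.
p_you_pair = p_at_least_one(YOUR_OUTS)

# You pair up at least once AND none of her outs appears.
boards_without_her_outs = comb(UNSEEN - HER_OUTS, DRAWS)
boards_without_any_outs = comb(UNSEEN - HER_OUTS - YOUR_OUTS, DRAWS)
p_you_pair_she_misses = (
    (boards_without_her_outs - boards_without_any_outs) / comb(UNSEEN, DRAWS)
)
```

Under these assumptions the first number comes out just above 25% and the second around 22%; the estimate quoted above falls between the two.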

We could compute or remember a better estimate if we notice the probability of runner-runner outs, but is it worth it? It won't help us pin down p with amazing accuracy - we could be wrong about the opponent's hand to begin with. And anyway, the decision of whether to actually call depends on many other factors: the sizes of your stack, her stack, the blinds and so on. A 1% error in the estimate of the win probability is unlikely to change your decision.

So instead of pointlessly trying to predict the future to an impossible and useless degree of accuracy, what we did was tell ourselves a bunch of likely stories about what might happen, then combine these scenarios into a simple probabilistic prediction of the future, and plan as best we can given this prediction.

This may be the mechanism that makes the availability heuristic a smart choice. The main observed effect of this heuristic is that past (subjective) prevalence seems to be highly linked to future predictions. Patches of stories we've heard may even work their way into the stories we tell of the future. The missing link is an internal availability mechanism which chooses which patches to make available for retelling. We seem to use such a mechanism to identify likely outcomes; before we forward them to the more commonly discussed process which integrates these stories of the future into a usable prediction.

What events would be good candidates for becoming available? One thing to notice is that evaluation of the expected value of our actions depends both on the probability and on the impact of their results; but for each specific future we don't need both these numbers, only their product. If the main function of the internal availability mechanism is to predict value, rather than probability, it stands to reason that high-impact but improbable outcomes will become as available as mundane probable ones. Yes, concepts which were encountered most often in the past, in a context similar to the current one, come to mind easily. But one-in-a-hundred or -thousand outcomes should also become available if they are very important. One-in-a-million ones, on the other hand, are almost never worth the trouble.
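A toy version of this selection rule, purely as an illustration: score each imagined outcome by probability times impact and make it available only above some cutoff. The outcomes, numbers and cutoff below are all hypothetical, and nothing here is a claim about how brains actually implement it.

```python
# (description, probability, impact in arbitrary utility units) - all made up
OUTCOMES = [
    ("mundane, probable outcome", 0.5, 1.0),
    ("one-in-a-thousand but very important", 0.001, 1000.0),
    ("one-in-a-million long shot", 0.000001, 10000.0),
]


def availability(probability, impact):
    """Contribution of an outcome to expected value: only the product matters."""
    return probability * abs(impact)


def available_outcomes(outcomes, threshold=0.1):
    """Outcomes whose probability-times-impact clears the (hypothetical) threshold."""
    return [name for name, p, impact in outcomes if availability(p, impact) >= threshold]


selected = available_outcomes(OUTCOMES)
```

With these numbers the rare-but-important outcome scores higher than the mundane one, and the one-in-a-million outcome is dropped even though its impact is the largest of the three.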

If something similar is indeed going on in our brains, then it seems to be working pretty well, usually. When I walk down the street, I give no mind to the possibility that there are invisible obstacles in my way. It is so remote, that even if I took it into account with adequately small probability, my actions would probably be roughly the same. It is therefore wise not to occupy my precious processing power with such nonsense.

Even when the internal availability mechanism is working properly, it generates unavoidable errors in prediction. Strictly speaking, ignoring some unlikely and unimportant possibilities is wrong, however practical. And while it makes noticing things evidence for their higher probability, this heuristic could sometimes fail, particularly if the internal availability mechanism is built for utility but used for probability.

The mechanism itself can also fail. Availability doesn't seem to be binary, so one type of failure is to make certain scenarios over- or under-available, marking them as more or less likely and important than they are. There also appears to be some threshold, some minimal value for non-zero availability. Another type of failure is when an important outcome fails to meet this threshold, not becoming available.

Or perhaps an unlikely future becomes available even though it shouldn't. This may explain why people are unable to estimate low probabilities. In their mind, the prospect of winning the lottery and becoming millionaires creates a vivid image of an exciting future. It's so immersive, that it really appears to be a real possibility - it could actually happen!
