
Perceptual Entropy and Frozen Estimates

10 Davidmanheim 03 June 2015 07:27PM

A Preface

During the 1990s, a significant stream of research examined how people process information, combining very different strands of psychology and related areas with explicit predictive models of how actual cognitive processes differ from the theoretical ideal. This is not only the literature by Kahneman and Tversky about cognitive biases, but also includes research about memory, perception, scope insensitivity, and other areas. The rationalist community is very familiar with some of this literature, but fewer are familiar with a masterful synthesis produced by Richards Heuer for the intelligence community in 1999[1], which was intended to start combating these problems, a goal we share. I’m hoping to put together a series of posts based on that work, potentially expanding on it or giving my own spin, but I encourage reading the book itself (PDF) as well[2]. (This essay is based on Chapter 3.)

This will hopefully be my first set of posts, so feedback is especially welcome, both to help me refine the ideas, and to refine my presentation.

Entropy, Pressure, and Metaphorical States of Matter

Eliezer recommends updating incrementally but has noted that it’s hard. The central point, that it is hard to do so, is one that some in our community have experienced and explicated, but there is deep theory I’ll attempt to outline, via an analogy, that I think explains how and why it occurs. The problem is that we are quick to form opinions and build models, because humans are good at pattern finding. We are less quick to discard them, due to limited mental energy. This is especially true when the pressure of evidence doesn’t shift overwhelmingly and suddenly.

I’ll attempt to answer the question of how this is true by stretching a metaphor and creating an intuition pump for thinking about how our minds might do some of their thinking under uncertainty.

Frozen Perception

Heuer reviews a stream of research about perception, and notes that “once an observer has formed an image – that is, once he or she has developed a mind set or expectation concerning the phenomenon being observed – this conditions future perceptions of that phenomenon.” This seems to follow a standard Bayesian practice, but in fact, as Eliezer noted, people fail to update. The following set of images, which Heuer reproduced from a 1976 book by Robert Jervis, shows exactly this point:

Impressions Resist Change - Series of line drawings transitioning between a face and a crouching woman.

Looking at each picture, starting on the left, and moving to the right, you see a face slowly change. At what point does the face no longer seem to appear? (Try it!) For me, it’s at about the seventh image that it’s clear it morphed into a sitting, bowed figure. But what if you start at the other end? The woman is still clearly there long past the point where we see a face, starting in the other direction. What’s going on?

We seem to attach too strongly to our first approach, decision, or idea. Specifically, our decision seems to “freeze” once it gets to one place, and needs much more evidence to start moving again. This has an analogue in physics, in the notion of freezing, which I think is more important than it first appears.

Entropy

To analyze this, I’ll drop into some basic probability theory, and physics, before (hopefully) we come out on the other side with a conceptually clearer picture. First, I will note that cognitive architecture has some way of representing theories, and implicitly assigns probabilities to various working theories. This is some sort of probability distribution over sample theories. Any probability distribution has a quantity called entropy[3], which is simply the probability of each state, multiplied by the logarithm of that probability, summed over all the states. (The probability is less than 1, so the logarithm is negative, but we traditionally flip the sign so entropy is a positive quantity.)

Need an example? Sure! I have two dice, and they can each land on any number, 1-6. I’m assuming they are fair, so each face has probability 1/6, and the logarithm (base 2) of 1/6 is about -2.585. For one die there are 6 states, so the total is 6 * (1/6) * 2.585 = 2.585. (With two dice, I have 36 possible combinations, each with probability 1/36; log(1/36) is about -5.17, so the entropy is 5.17. You may have noticed that I doubled the number of dice involved, and the entropy doubled, because there is exactly twice as much that can happen, but the entropy per die is unchanged.) If I only have 2 possible states, such as a fair coin, each has probability 1/2, and log(1/2) = -1, so for two states, (-0.5 * -1) + (-0.5 * -1) = 1. An unfair coin, with a ¼ probability of tails and a ¾ probability of heads, has an entropy of about 0.81. Of course, this isn’t the lowest possible entropy: a trick coin with heads on both sides has only 1 state, with entropy 0. So unfair coins have lower entropy, because we know more about what will happen.
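For anyone who wants to check the arithmetic above, here is a minimal sketch in Python. The function name entropy_bits and the variable names are my own choices for illustration; the only thing taken from the text is the definition of entropy (probability times log probability, summed over states, sign flipped) and the example distributions.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: sum of p * log2(1/p) over states with p > 0."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# The example distributions from the text.
examples = {
    "one fair die": [1 / 6] * 6,
    "two fair dice": [1 / 36] * 36,
    "fair coin": [0.5, 0.5],
    "unfair coin (1/4 tails)": [0.25, 0.75],
    "two-headed trick coin": [1.0],
}

for name, dist in examples.items():
    print(f"{name}: {entropy_bits(dist):.3f} bits")
```

Writing each term as p * log2(1/p) rather than -p * log2(p) is just a convenience so that the certain outcome comes out as a plain zero.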

 

Freezing, Melting, and Ideal Gases under Pressure

In physics, this has a deeply related concept, also called entropy, which, in the form we see on a macroscopic scale, is closely tied to temperature. If you remember your high school science classes, temperature is a description of how much molecules move around. I’m not a physicist, and this is a bit simplified[4], but the entropy of an object is how uncertain we are about its state: gases expand to fill their container, and the molecules could be anywhere, so they have higher entropy than a liquid, which stays in its container, which still has higher entropy than a solid, where the molecules don’t move much, which still has higher entropy than a crystal, where the molecules are sort of locked into place.

This partially lends intuition to the third law of thermodynamics; “the entropy of a perfect crystal at absolute zero is exactly equal to zero.” In our terms above, it’s like that trick coin – we know exactly where everything is in the crystal, and it doesn’t move. Interestingly, a perfect crystal at 0 Kelvin cannot exist in nature; no finite process can reduce entropy to that point; like infinite certainty, infinitely exact crystals are impossible to arrive at, unless you started there. So far, we could build a clever analogy between temperature and certainty, telling us that “you’re getting warmer” means exactly the opposite of what it does in common usage – but I think this is misleading[5].

In fact, I think that information in our analogy doesn’t change the temperature; instead, it reduces the volume! In the analogy, gases can become liquids or solids either by lowering temperature, or by increasing pressure – which is what evidence does. Specifically, evidence constrains the set of possibilities, squeezing our hypothesis space. The phrase “weight of evidence” is now metaphorically correct; it will actually constrain the space by applying pressure.
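As a toy illustration of evidence "squeezing" a hypothesis space, here is a small sketch. Everything specific in it is an assumption made for the example: eight equally likely hypotheses, and a piece of evidence that flatly rules out four of them. It only shows the simplest case, where evidence eliminates options; softer evidence will not always lower entropy this neatly.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits, ignoring zero-probability states."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def bayes_update(prior, likelihoods):
    """Multiply prior by likelihood for each hypothesis, then renormalize."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical setup: 8 equally likely hypotheses.
prior = [1 / 8] * 8
# Hypothetical evidence: consistent with the first 4, impossible under the rest.
likelihoods = [1, 1, 1, 1, 0, 0, 0, 0]

posterior = bayes_update(prior, likelihoods)
print("entropy before evidence:", entropy_bits(prior))
print("entropy after evidence: ", entropy_bits(posterior))
```

The total probability is still 1 after the update; what shrank is the space over which it is spread.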

I think that by analogy, this explains the phenomenon we see with perception. While we are uncertain, information increases pressure, and our conceptual estimate can condense from uncertain to a relatively contained liquid state – not because we have less probability to distribute, but because the evidence has constrained the space over which we can distribute it. Alternatively, we can settle on a lower energy state on our own, unassisted by evidence. If our minds too-quickly settle on a theory or idea, the gas settles into a corner of the available space, and if we fail to apply enough energy to the problem, our unchallenged opinion can even freeze into place.

Our mental models can be liquid, gaseous, or frozen in place – either by our prior certainty, our lack of energy required to update, or an immense amount of evidential pressure. When we look at those faces, our minds settle into a model quickly, and once there, fail to apply enough energy to re-evaporate our decision until the pressure of the new pictures is relatively immense. If we had started at picture 3 or 6, we could much more easily update away from our estimates; our minds are less willing to let the cloud settle into a puddle of probable answers, much less freeze into place. We can easily see the face, or the woman, moving between just these two images.

When we begin to search for a mental model to describe some phenomenon, whether it be patterns of black and white on a page, or the way in which our actions will affect a friend, I am suggesting we settle into a puddle of likely options, and when not actively investing energy into the question, we are likely to freeze into a specific model.

What does this approach retrodict, or better, forbid?

Because our minds have limited energy, the process of maintaining an uncertain stance should be difficult. This seems to be borne out by personal and anecdotal experience, but I have not yet searched the academic literature to find more specific validation.

We should have more trouble updating away from a current model than we do arriving at that new model from the beginning. As Heuer puts it, “Initial exposure to… ambiguous stimuli interferes with accurate perception even after more and better information becomes available.” He notes that this was shown in Bruner and Potter, 1964, “Interference in Visual Recognition,” and that “the early but incorrect impression tends to persist because the amount of information necessary to invalidate a hypothesis is considerably greater than the amount of information required to make an initial interpretation.”

Potential avenues of further thought

The pressure of evidence should reduce the mental effort needed to switch models, but “leaky” hypothesis sets, where a class of model is not initially considered, should allow the pressure to metaphorically escape into the larger hypothesis space.

There is a potential for making this analogy more exact by discussing entropy in graphical models (Bayesian Networks), especially in sets of graphical models with explicit uncertainty attached. I don’t have the math needed for this, but would be interested in hearing from those who do.



[1] I would like to thank both Abram Demski (Interviewed here) for providing a link to this material, and my dissertation chair, Paul Davis, who was able to point me towards how this has been used and extended in the intelligence community.

[2] There is a follow-up book and training course, which are also available, but I have neither read them nor seen them online. A shorter version of the main points of that book is here (PDF), which I have only glanced through.

[3] Eliezer discusses this idea in Entropy and short codes, but I’m heading in a slightly different direction.

[4] We have a LW Post, Entropy and Temperature that explains this a bit. For a different, simplified explanation, try this: http://www.nmsea.org/Curriculum/Primer/what_is_entropy.htm. For a slightly more complete version, try Wikipedia: https://en.wikipedia.org/wiki/Introduction_to_entropy. For a much more complete version, learn the math, talk to a PhD in thermodynamics, then read some textbooks yourself.

[5] I think this, of course, because I was initially heading in that direction. Instead, I realized there was a better analogy, but if we wanted to develop it in this direction instead, I’d point to the energy required to change phases of matter as a reason that our minds have trouble moving from their initial estimate. On reflection, I think this should be a small part of the story, if not entirely negligible.

A Somewhat Vague Proposal for Grounding Ethics in Physics

-3 capybaralet 27 January 2015 05:45AM

As Tegmark argues, the idea of "final goal" for AI is likely incoherent, at least if (as he states), "Quantum effects aside, a truly well-defined goal would specify how all particles in our Universe should be arranged at the end of time."  

But "life is a journey not a destination".  So really, what we should be specifying is the entire evolution of the universe through its lifespan.  So how can the universe "enjoy itself" as much as possible before the big crunch (or before and during the heat death)?*

I hypothesize that experience is related to, if not a product of, change.  I further propose (counter-intuitively, and with an eye towards "refinement" (to put it mildly))** that we treat experience as inherently positive and not try to distinguish between positive and negative experiences.

Then it seems to me the (still rather intractable) question is: how does the rate of entropy's increase relate to the quantity of experience produced?  Is it simply linear (in which case, it doesn't matter, ethically)?  My intuition is that it is more like the fuel efficiency of a car, non-linear and with a sweet spot somewhere between a lengthy boredom and a flash of intensity.



*I'm not super up on cosmology; are there other theories I ought to be considering?

**One idea for refinement: successful "prediction" (undefined here) creates positive experiences; frustrated expectations negative ones.


[LINK] Causal Entropic Forces

5 Qiaochu_Yuan 20 April 2013 11:57PM

This paper seems relevant to various LW interests. It smells like The Second Law of Thermodynamics, and Engines of Cognition, but I haven't wrapped my head enough around either to say more than that. Abstract:

Recent advances in fields ranging from cosmology to computer science have hinted at a possible deep connection between intelligence and entropy maximization, but no formal physical relationship between them has yet been established. Here, we explicitly propose a first step toward such a relationship in the form of a causal generalization of entropic forces that we find can cause two defining behaviors of the human “cognitive niche”—tool use and social cooperation—to spontaneously emerge in simple physical systems. Our results suggest a potentially general thermodynamic model of adaptive behavior as a nonequilibrium process in open systems.

Draft of Edwin Jaynes' "Probability Theory: The Logic of Science" online, with lost chapter 30

8 buybuydandavis 23 June 2012 05:48AM

http://thiqaruni.org/mathpdf9/(86).pdf

The book didn't include Chapter 30 - "MAXIMUM ENTROPY: MATRIX FORMULATION"

Opening in adobe seems to work out better for me.

 

 

 

Thoughts and problems with Eliezer's measure of optimization power

17 Stuart_Armstrong 08 June 2012 09:44AM

Back in the day, Eliezer proposed a method for measuring the optimization power (OP) of a system S. The idea is to get a measure of how small a target the system can hit:

You can quantify this, at least in theory, supposing you have (A) the agent or optimization process's preference ordering, and (B) a measure of the space of outcomes - which, for discrete outcomes in a finite space of possibilities, could just consist of counting them - then you can quantify how small a target is being hit, within how large a greater region.

Then we count the total number of states with equal or greater rank in the preference ordering to the outcome achieved, or integrate over the measure of states with equal or greater rank.  Dividing this by the total size of the space gives you the relative smallness of the target - did you hit an outcome that was one in a million?  One in a trillion?

Actually, most optimization processes produce "surprises" that are exponentially more improbable than this - you'd need to try far more than a trillion random reorderings of the letters in a book, to produce a play of quality equalling or exceeding Shakespeare.  So we take the log base two of the reciprocal of the improbability, and that gives us optimization power in bits.

For example, assume there were eight equally likely possible states {X0, X1, ... , X7}, and S gives them utilities {0, 1, ... , 7}. Then if S can make X6 happen, there are two states better or equal to its achievement (X6 and X7), hence it has hit a target filling 1/4 of the total space. Hence its OP is log2 4 = 2. If the best S could manage is X4, then it has only hit half the total space, and has an OP of only log2 2 = 1. Conversely, if S reached the perfect X7, 1/8 of the total space, then it would have an OP of log2 8 = 3.
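The eight-state example can be reproduced with a short sketch. The function name optimization_power and its arguments are hypothetical names chosen for this illustration; the calculation is just the one described above: add up the probability of all states ranked at least as high as the achieved one, and take log base two of the reciprocal.

```python
import math

def optimization_power(utilities, achieved_index, probs=None):
    """OP in bits: log2 of 1 / (probability mass of states at least as good
    as the achieved state). Defaults to equally likely states."""
    n = len(utilities)
    if probs is None:
        probs = [1 / n] * n
    achieved = utilities[achieved_index]
    mass = sum(p for u, p in zip(utilities, probs) if u >= achieved)
    return math.log2(1 / mass)

# Eight equally likely states X0..X7 with utilities 0..7, as in the example.
utilities = list(range(8))
for i in (4, 6, 7):
    print(f"S achieves X{i}: OP = {optimization_power(utilities, i)} bits")
```

The optional probs argument is there only so the same sketch covers a non-uniform measure over outcomes; the example in the text uses the uniform case.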


[LINK] stats.stackexchange.com question about Shalizi's Bayesian Backward Arrow of Time paper

3 p4wnc6 16 May 2012 03:58PM

Link to the Question

I haven't gotten an answer on this yet and I set up a bounty; I figured I'd link it here too in case any stats/physics people care to take a crack at it.

The Principle of Maximum Entropy

7 krey 10 February 2012 03:10AM

After having read the related chapters of Jaynes' book I was fairly amazed by the Principle of Maximum Entropy, a powerful method for choosing prior distributions. However it immediately raised a large number of questions.

I have recently read two quite intriguing (and very well-written) papers by Jos Uffink on this matter:

Can the maximum entropy principle be explained as a consistency requirement?

The constraint rule of the maximum entropy principle

I was wondering what you think about the principle of maximum entropy and its justifications.

In Defense of Objective Bayesianism: MaxEnt Puzzle.

6 Larks 06 January 2011 12:56AM

In Defense of Objective Bayesianism by Jon Williamson was mentioned recently in a post by lukeprog as the sort of book that people on Less Wrong should be reading. Now, I have been reading it, and found some of it quite bizarre. This point in particular seems obviously false. If it’s just me, I’ll be glad to be enlightened as to what was meant. If collectively we don’t understand, that’d be pretty strong evidence that we should read more academic Bayesian stuff.

Williamson advocates use of the Maximum Entropy Principle. In short, you should take account of the limits placed on your probability by the empirical evidence, and then choose a probability distribution closest to uniform that satisfies those constraints.

So, if asked to assign a probability to an arbitrary A, you’d say p = 0.5. But if you were given evidence in the form of some constraints on p, say that p ≥ 0.8, you’d set p = 0.8, as that was the new entropy-maximising level. Constraints are restricted to affine constraints. I found this somewhat counter-intuitive already, but I do follow what he means.

But now for the confusing bit. I quote directly:
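To make the p = 0.5 and p = 0.8 answers concrete, here is a minimal sketch for a single yes/no proposition. It is not Williamson's procedure, just a brute-force grid search over the allowed interval, and the names binary_entropy and maxent_p are made up for the illustration.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a proposition believed to degree p."""
    return sum(q * math.log2(1 / q) for q in (p, 1 - p) if q > 0)

def maxent_p(lower=0.0, upper=1.0, steps=1000):
    """Grid-search the interval [lower, upper] for the p with maximum entropy."""
    candidates = [lower + (upper - lower) * i / steps for i in range(steps + 1)]
    return max(candidates, key=binary_entropy)

print("no constraints:", maxent_p())
print("given p >= 0.8:", maxent_p(lower=0.8))
```

Binary entropy peaks at 0.5 and falls off on either side, so under the constraint p ≥ 0.8 the maximum sits at the boundary closest to 0.5.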

 

“Suppose A is ‘Peterson is a Swede’, B is ‘Peterson is a Norwegian’, C is ‘Peterson is a Scandinavian’, and ε is ‘80% of all Scandinavians are Swedes’. Initially, the agent sets P(A) = 0.2, P(B) = 0.8, P(C) = 1, P(ε) = 0.2, P(A & ε) = P(B & ε) = 0.1. All these degrees of belief satisfy the norms of subjectivism. Updating by maxent on learning ε, the agent believes Peterson is a Swede to degree 0.8, which seems quite right. On the other hand, updating by conditionalizing on ε leads to a degree of belief of 0.5 that Peterson is a Swede, which is quite wrong. Thus, we see that maxent is to be preferred to conditionalization in this kind of example because the conditionalization update does not satisfy the new constraints X’, while the maxent update does.”

p80, 2010 edition. Note that this example is actually from Bacchus et al (1990), but Williamson quotes approvingly.

 

His calculation for the Bayesian update is correct; you do get 0.5. What’s more, this seems to be intuitively the right answer; the update has caused you to ‘zoom in’ on the probability mass assigned to ε, while maintaining relative proportions inside it.

As far as I can see, you get 0.8 only if we assume that Peterson is a randomly chosen Scandinavian. But if that were true, the prior given is bizarre. If he was a randomly chosen individual, the prior should have been something like P(A & ε) = 0.16, P(B & ε) = 0.04. The only way I can make sense of the prior is if constraints simply “don’t apply” until they have p=1.
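The two conditionalization results can be checked directly. This sketch uses only the numbers already given: the quoted prior with P(A & ε) = P(B & ε) = 0.1, and the alternative prior suggested above with P(A & ε) = 0.16 and P(B & ε) = 0.04. The helper name conditional is hypothetical, and the sketch assumes, as the quoted prior does, that A and B exhaust the possibilities within ε.

```python
def conditional(p_joint, p_evidence):
    """P(A | e) = P(A & e) / P(e)."""
    return p_joint / p_evidence

# Quoted prior: P(A & e) = P(B & e) = 0.1, so P(e) = 0.2.
quoted = conditional(0.1, 0.1 + 0.1)

# Alternative prior from this post: P(A & e) = 0.16, P(B & e) = 0.04.
alternative = conditional(0.16, 0.16 + 0.04)

print(f"quoted prior:      P(A | e) = {quoted:.2f}")
print(f"alternative prior: P(A | e) = {alternative:.2f}")
```

Conditionalizing only rescales the mass inside ε, so the posterior is 0.8 exactly when the prior already split ε 80/20 between A and B.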

Can anyone explain the reasoning behind a posterior probability of 0.8?