You're looking at Less Wrong's discussion board. This includes all posts, including those that haven't been promoted to the front page yet. For more information, see About Less Wrong.

SIA, conditional probability and Jaan Tallinn's simulation tree

10 Stuart_Armstrong 12 November 2012 05:24PM

If you're going to use anthropic probability, use the self-indication assumption (SIA) - it's by far the most sensible way of doing things.

Now, I am of the strong belief that probabilities in anthropic problems (such as the Sleeping Beauty problem) are not meaningful - only your decisions matter. And you can hold different probability theories yet still always reach the same decisions, provided you have correspondingly different theories as to who bears responsibility for the actions of your copies, or how much you value them - see anthropic decision theory (ADT).

But that's a minority position - most people still use anthropic probabilities, so it's worth taking a more thorough look at what SIA does and doesn't tell you about population sizes and conditional probability.

This post will aim to clarify some issues with SIA, especially concerning Jaan Tallinn's simulation-tree model, which he presented in exquisite story format at the recent Singularity Summit. I'll be assuming basic familiarity with SIA, and will run away screaming from any questions concerning infinity. SIA fears infinity (in a shameless self-plug, I'll mention that anthropic decision theory runs into far fewer problems with infinities; for instance, a bounded utility function is a sufficient - but not necessary - condition to ensure that ADT gives you sensible answers even with infinitely many copies).

But onwards and upwards with SIA! To not-quite-infinity and below!

 

SIA does not (directly) predict large populations

One error people often make with SIA is to assume that it predicts a large population. It doesn't - at least not directly. What SIA predicts is that there will be a large number of agents that are subjectively indistinguishable from you. You can call these subjectively indistinguishable agents the "minimal reference class" - it is a great advantage of SIA that it will continue to make sense for any reference class you choose (as long as it contains the minimal reference class).

SIA's impact on the total population is indirect: if the size of the total population is correlated with that of the minimal reference class, SIA will predict a large population. Such a correlation is not implausible: for instance, if there are a lot of humans around, then the probability that one of them is you is much larger. If there are a lot of intelligent life forms around, then the chance that humans exist is higher, and so on.

In most cases, we don't run into problems with assuming that SIA predicts large populations. But we have to bear in mind that the effect is indirect, and it can and does break down in many cases. For instance, imagine that you knew you had evolved on some planet, but for some odd reason didn't know whether your planet had a ring system or not. You have managed to figure out that the evolution of life on planets with ring systems is independent of the evolution of life on planets without. Since you don't know which situation you're in, SIA instructs you to increase the probability of life on both ringed and non-ringed planets (so far, so good - SIA is predicting generally larger populations).

And then one day you look up at the sky and see:


SIA fears (expected) infinity

6 Stuart_Armstrong 12 November 2012 05:23PM

It's well known that the Self-Indication Assumption (SIA) has problems with infinite populations (one of the reasons I strongly recommend not using the probability as the fundamental object of interest, but instead the decision, as in anthropic decision theory).

SIA also has problems with arbitrarily large finite populations, at least in some cases. What cases are these? Imagine that we had these (non-anthropic) probabilities for various populations:

p0, p1, p2, p3, p4...

Now let us apply the anthropic correction from SIA; before renormalising, we have these weights for different population levels:

0, p1, 2p2, 3p3, 4p4...

To renormalise, we need to divide by the sum 0 + p1 + 2p2 + 3p3 + 4p4... This is actually the expected population! (note: we are using the population as a proxy for the size of the reference class of agents who are subjectively indistinguishable from us; see this post for more details)

So using SIA is possible if and only if the (non-anthropic) expected population is finite (and non-zero).

Note that it is possible for the SIA-corrected expected population to be infinite! For instance, if pj is C/j^3 for some constant C, then the non-anthropic expected population is finite (being the infinite sum of C/j^2). However, once we have done the SIA correction, we can see that the SIA-corrected expected population is infinite (being the infinite sum of some constant times 1/j).
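The divergence is easy to check numerically. Here is a minimal sketch, truncating the example at a large finite population N and taking pj proportional to 1/j^3:

```python
# Minimal numerical sketch: truncate the example at population N, with
# non-anthropic probabilities p_j proportional to 1/j^3 (j = 1 .. N).
N = 100_000
raw = [1 / j**3 for j in range(1, N + 1)]
Z = sum(raw)
p = [w / Z for w in raw]  # non-anthropic distribution

# SIA correction: weight level j by j, then renormalise.
sia_norm = sum(j * pj for j, pj in enumerate(p, start=1))  # = expected population
sia = [j * pj / sia_norm for j, pj in enumerate(p, start=1)]

naive_expectation = sia_norm  # converges (a sum of C/j^2 terms)
sia_expectation = sum(j * qj for j, qj in enumerate(sia, start=1))  # a sum of C'/j terms

print(naive_expectation)  # ~1.37, stable as N grows
print(sia_expectation)    # grows like log N: unbounded
```

The renormalisation constant is exactly the non-anthropic expected population, as in the text, and the SIA-corrected expectation keeps growing as N does.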

Question about application of Bayes

0 RolfAndreassen 31 October 2012 02:35AM

I have successfully confused myself about probability again. 

I am debugging an intermittent crash; it doesn't happen every time I run the program. After much confusion I believe I have traced the problem to a specific line (activating my debug logger, as it happens; irony...) I have tested my program with and without this line commented out. I find that, when the line is active, I get two crashes on seven runs. Without the line, I get no crashes on ten runs. Intuitively this seems like evidence in favour of the hypothesis that the line is causing the crash. But I'm confused on how to set up the equations. Do I need a probability distribution over crash frequencies? That was the solution the last time I was confused over Bayes, but I don't understand what it means to say "The probability of having the line, given crash frequency f", which it seems I need to know to calculate a new probability distribution. 

I'm going to go with my intuition and code on the assumption that the debug logger should be activated much later in the program to avoid a race condition, but I'd like to understand this math. 
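The distribution the poster is reaching for is over each crash frequency, not over "having the line". A minimal Bayesian sketch, assuming uniform Beta(1, 1) priors over the two crash rates:

```python
import random

random.seed(0)

# Posterior over a crash frequency f after c crashes in r runs is
# Beta(1 + c, 1 + r - c) under a uniform prior.  Sample both posteriors
# and estimate P(rate with line > rate without line).
def posterior_samples(crashes, runs, n=200_000):
    return [random.betavariate(1 + crashes, 1 + runs - crashes) for _ in range(n)]

f_with = posterior_samples(2, 7)      # line active: 2 crashes in 7 runs
f_without = posterior_samples(0, 10)  # line commented out: 0 crashes in 10 runs

p_line_raises_rate = sum(a > b for a, b in zip(f_with, f_without)) / len(f_with)
print(p_line_raises_rate)  # ≈ 0.94: strong, but not conclusive, evidence
```

So the intuition is right: the data favour "the line causes the crash" at roughly 16:1 odds under these priors.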

A follow-up probability question: Data samples with different priors

3 PhilGoetz 25 October 2012 08:07PM

(Rewritten entirely after seeing pragmatist's answer.)

In this post, helpful people including DanielLC gave me the multiply-odds-ratios method for combining probability estimates given by independent experts with a constant prior, with many comments about what to do when they aren't independent.  (DanielLC's method turns out to be identical to summing up the bits of information for and against the hypothesis, which is what I'd expected to be correct.)

I ran into problems applying this, because sometimes the prior isn't constant across samples.  Right now I'm combining different sources of information to choose the correct transcription start site for a gene.  These bacterial genes typically have from 1 to 20 possible start sites.  The prior is 1 / (number of possible sites).

Suppose I want to figure out the correct likelihood multiplier for the information that a start site overlaps the stop of the previous gene, which I will call property O.  Assume this multiplier, lm, is constant, regardless of the prior.  This is reasonable, since we always factor out the prior.  Some function of the prior gives me the posterior probability that a site s is the correct start (Q(s) is true), given that O(s).  That's P(Q(s) | prior=1/numStarts, O(s)).

Suppose I look just at those cases where numStarts = 4; I find that P(Q(s) | numStarts=4, O(s)) = .9. Dividing posterior odds by prior odds gives the multiplier:

9:1 / 1:3 = 27:1

Or I can look at the cases where numStarts=2, and find that in these cases, P(Q(s) | numStarts=2, O(s)) = .95:

19:1 / 1:1 = 19:1

I want to take one pass through the data and come up with a single likelihood multiplier, rather than binning all the data into different groups by numStarts.  I think I can just compute it as

(sum of numerator : sum of denominator) over all cases s_i where O(s_i) is true, where

     numerator = (numStarts_i-1) * Q(s_i)

     denominator = (1-Q(s_i))

Is this correct?
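A quick numerical check suggests yes, at least within each bin. The sketch below uses hypothetical data matching the two bins described above, with Q(s_i) as a 0/1 indicator:

```python
# Hypothetical data sets matching the two bins in the post:
# numStarts=4 with P(Q|O)=.9, and numStarts=2 with P(Q|O)=.95.
# Each case is (numStarts, Q), Q = 1 if the site was the correct start.
cases_bin4 = [(4, 1)] * 90 + [(4, 0)] * 10   # posterior odds 9:1, prior 1:3 -> lm 27
cases_bin2 = [(2, 1)] * 95 + [(2, 0)] * 5    # posterior odds 19:1, prior 1:1 -> lm 19

def pooled_lm(cases):
    """The post's estimator: sum of (numStarts-1)*Q over sum of (1-Q)."""
    numerator = sum((n - 1) * q for n, q in cases)
    denominator = sum(1 - q for n, q in cases)
    return numerator / denominator

print(pooled_lm(cases_bin4))               # 27.0, matching the per-bin calculation
print(pooled_lm(cases_bin2))               # 19.0, likewise
print(pooled_lm(cases_bin4 + cases_bin2))  # single pooled estimate across both bins
```

Per bin the estimator reproduces the hand-computed multipliers exactly; pooled across bins it returns a weighted compromise between them.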

A probability question

6 PhilGoetz 19 October 2012 10:34PM

Suppose you have a property Q which certain objects may or may not have.  You've seen many of these objects; you know the prior probability P(Q) that an object has this property.

You have 2 independent measurements of object O, which each assign a probability that Q(O) (O has property Q).  Call these two independent probabilities A and B.

What is P(Q(O) | A, B, P(Q))?

To put it another way, expert A has opinion O(A) = A, which asserts P(Q(O)) = A = .7, and expert B says P(Q(O)) = B = .8, and the prior P(Q) = .4, so what is P(Q(O))?  The correlation between the opinions of the experts is unknown, but probably small.  (They aren't human experts.)  I face this problem all the time at work.

You can see that the problem isn't solvable without the prior P(Q), because if the prior P(Q) = .9, then two experts assigning P(Q(O)) < .9 should result in a probability lower than the lowest opinion of those experts.  But if P(Q) = .1, then the same estimates by the two experts should result in a probability higher than either of their estimates.  But is it solvable or at least well-defined even with the prior?

The experts both know the prior, so if you just had expert A saying P(Q(O)) = .7, the answer must be .7. Expert B's opinion B must revise the probability upwards if B > P(Q), and downwards if B < P(Q).

When expert A says O(A) = A, she probably means, "If I consider all the n objects I've seen that looked like this one, nA of them had property Q."

One approach is to add up the bits of information each expert gives, with positive bits for indications that Q(O) and negative bits that not(Q(O)).
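That adding-up-bits approach can be sketched as follows, assuming the experts' deviations from the prior are independent: each expert contributes her log-odds shift away from the prior's log-odds.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))

def combine(prior, *expert_probs):
    """Start from the prior's log-odds; add each expert's shift away from it."""
    total = logit(prior) + sum(logit(p) - logit(prior) for p in expert_probs)
    return 1 / (1 + math.exp(-total))

# Expert A says .7, expert B says .8, prior is .4:
print(combine(0.4, 0.7, 0.8))  # ≈ 0.933 (combined odds (7/3)*(4)/(2/3) = 14:1)
# A lone expert's opinion is returned unchanged, as the post requires:
print(combine(0.4, 0.7))       # ≈ 0.7
```

Note this satisfies both desiderata in the post: a single expert leaves her own estimate intact, and an expert estimate above the prior revises the answer upward while one below revises it downward.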

No Anthropic Evidence

9 Vladimir_Nesov 23 September 2012 10:33AM

Closely related to: How Many LHC Failures Is Too Many?

Consider the following thought experiment. At the start, an "original" coin is tossed, but not shown. If it was "tails", a gun is loaded, otherwise it's not. After that, you are offered a large number of decision rounds, in each of which you can either quit the game or toss a coin of your own. If your coin falls "tails", the gun gets triggered, and depending on how the original coin fell (whether the gun was loaded), you either get shot or not (if the gun doesn't fire, i.e. if the original coin was "heads", you are free to go). If your coin is "heads", you are all right for the round. If you quit the game, you will get shot at the exit with probability 75%, independently of what happened during the game (and of the original coin). The question is: should you keep playing, or quit if you observe, say, 1000 "heads" in a row?

Intuitively, it seems as if 1000 "heads" is "anthropic evidence" for the original coin being "tails", that the long sequence of "heads" can only be explained by the fact that "tails" would have killed you. If you know that the original coin was "tails", then to keep playing is to face the certainty of eventually tossing "tails" and getting shot, which is worse than quitting, with only 75% chance of death. Thus, it seems preferable to quit.

On the other hand, each "heads" you observe doesn't distinguish the hypothetical where the original coin was "heads" from one where it was "tails". The first round can be modeled by a 4-element finite probability space consisting of options {HH, HT, TH, TT}, where HH and HT correspond to the original coin being "heads" and HH and TH to the coin-for-the-round being "heads". Observing "heads" is the event {HH, TH} that has the same 50% posterior probabilities for "heads" and "tails" of the original coin. Thus, each round that ends in "heads" doesn't change the knowledge about the original coin, even if there were 1000 rounds of this type. And since you only get shot if the original coin was "tails", you only get to 50% probability of dying as the game continues, which is better than the 75% from quitting the game.
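The second argument is easy to check by simulation. A sketch, using k = 9 rounds rather than 1000 so that matching runs are common enough:

```python
import random

random.seed(1)

def observe_k_heads(k):
    """Play one game; return the original coin only if the player sees k heads."""
    original_tails = random.random() < 0.5  # gun loaded iff original coin is tails
    for _ in range(k):
        if random.random() < 0.5:           # player's coin comes up tails:
            return None                     # run doesn't match the observation
    return original_tails

k, trials = 9, 500_000
matching = [r for r in (observe_k_heads(k) for _ in range(trials)) if r is not None]
print(sum(matching) / len(matching))  # ≈ 0.5: the heads carry no evidence
```

Among runs in which the player observes k "heads" in a row (and is therefore necessarily still alive), the original coin is still 50/50, exactly as the second argument claims.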

(See also the comments by simon2 and Benja Fallenstein on the LHC post, and this thought experiment by Benja Fallenstein.)

The result of this exercise could be generalized by saying that counterfactual possibility of dying doesn't in itself influence the conclusions that can be drawn from observations that happened within the hypotheticals where one didn't die. Only if the possibility of dying influences the probability of observations that did take place, would it be possible to detect that possibility. For example, if in the above exercise, a loaded gun would cause the coin to become biased in a known way, only then would it be possible to detect the state of the gun (1000 "heads" would imply either that the gun is likely loaded, or that it's likely not).

Chief Probability Officer

11 lukeprog 09 September 2012 11:45PM

Stanford Professor Sam Savage (also of Probability Management) proposes that large firms appoint a "Chief Probability Officer." Here is a description from Douglas Hubbard's How to Measure Anything, ch. 6:

 

Sam Savage... has some ideas about how to institutionalize the entire process of creating Monte Carlo simulations [for estimating risk].

...His idea is to appoint a chief probability officer (CPO) for the firm. The CPO would be in charge of managing a common library of probability distributions for use by anyone running Monte Carlo simulations. Savage invokes concepts like the Stochastic Information Packet (SIP), a pregenerated set of 100,000 random numbers for a particular value. Sometimes different SIPs would be related. For example, the company’s revenue might be related to national economic growth. A set of SIPs that are generated so they have these correlations are called “SLURPS” (Stochastic Library Units with Relationships Preserved). The CPO would manage SIPs and SLURPs so that users of probability distributions don’t have to reinvent the wheel every time they need to simulate inflation or healthcare costs.

 

Hubbard adds some of his own ideas to the proposal:

 

  • Certification of analysts. Right now, there is not a lot of quality control for decision analysis experts. Only actuaries, in their particular specialty of decision analysis, have extensive certification requirements. As for actuaries, certification in decision analysis should eventually be an independent not-for-profit program run by a professional association. Some other professional certifications now partly cover these topics but fall far short in substance in this particular area. For this reason, I began certifying individuals in Applied Information Economics because there was an immediate need for people to be able to prove their skills to potential employers.
  • Certification for calibrated estimators. As we discussed earlier, an uncalibrated estimator has a strong tendency to be overconfident. Any calculation of risk based on his or her estimates will likely be significantly understated. However, a survey I once conducted showed that calibration is almost unheard of among those who build Monte Carlo models professionally, even though a majority used at least some subjective estimates. (About a third surveyed used mostly subjective estimates.) Calibration training will be one of the simplest improvements to risk analysis in an organization.
  • Well-documented procedures and templates for how models are built from the input of various calibrated estimators. It takes some time to smooth out the wrinkles in the process. Most organizations don’t need to start from scratch for every new investment they are analyzing; they can base their work on that of others or at least reuse their own prior models. I’ve executed nearly the same analysis procedure following similar project plans for a wide variety of decision analysis problems from IT security, military logistics, and entertainment industry investments. But when I applied the same method in the same organization on different problems, I often found that certain parts of the model would be similar to parts of earlier models. An insurance company would have several investments that include estimating the impact on “customer retention” and “claims payout ratio.” Manufacturing-related investments would have calculations related to “marginal labor costs per unit” or “average order fulfillment time.” These issues don’t have to be modeled anew for each new investment problem. They are reusable modules in spreadsheets. 
  • Adoption of a single automated tool set. [In this book I show] a few of the many tool sets available. You can get as sophisticated as you like, but starting out doesn’t require any more than some good spreadsheet-based tools. I recommend starting simple and adopting more extensive tool sets as the situations demand.
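The SIP/SLURP scheme itself is simple to sketch. A toy version with two hypothetical variables (the 0.6 correlation and the units are made up for illustration):

```python
import math
import random

random.seed(2)

N = 100_000   # samples per SIP
RHO = 0.6     # assumed correlation between the two underlying factors

# Generate the underlying standard normals jointly, so the correlation is
# baked into the stored samples rather than recreated by each modeller.
def correlated_pair():
    z1 = random.gauss(0, 1)
    z2 = RHO * z1 + math.sqrt(1 - RHO**2) * random.gauss(0, 1)
    return z1, z2

pairs = [correlated_pair() for _ in range(N)]

# Two related SIPs (together, a "SLURP"): any model that indexes them in
# step (sample i with sample i) preserves the correlation automatically.
sip_gdp_growth = [0.02 + 0.01 * a for a, _ in pairs]
sip_revenue = [1e6 * math.exp(0.1 * b) for _, b in pairs]
```

This is the point of central management: downstream Monte Carlo users just reuse the stored samples, and the relationships between variables come along for free.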

 

Confused about Solomonoff induction

-3 nebulous 13 July 2012 11:36AM

Why wouldn't the probability of two algorithms of different lengths appearing approach the same value as longer strings of bits are searched?
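One way to see why not (assuming "appearing" means occurring as a prefix of a uniformly random bit string): the fraction of length-n strings beginning with a given program of length L is 2^-L regardless of n, so the ratio between two programs' shares never moves toward 1.

```python
def share_of_strings(program, n):
    """Fraction of length-n bit strings whose prefix is `program`."""
    return 2 ** (n - len(program)) / 2 ** n  # = 2 ** -len(program), independent of n

for n in (10, 20, 40):
    # a 3-bit program vs a 7-bit program: the ratio stays 2^4 = 16 forever
    print(share_of_strings("101", n) / share_of_strings("1011010", n))  # 16.0
```

Searching longer strings adds the same multiplicative factor of continuations to both programs, so the relative weight fixed by the length difference never washes out.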

Draft of Edwin Jaynes' "Probability Theory: The Logic of Science" online, with lost chapter 30

8 buybuydandavis 23 June 2012 05:48AM

http://thiqaruni.org/mathpdf9/(86).pdf

The book didn't include Chapter 30 - "MAXIMUM ENTROPY: MATRIX FORMULATION"

Opening it in Adobe Reader seems to work out better for me.

Thoughts and problems with Eliezer's measure of optimization power

17 Stuart_Armstrong 08 June 2012 09:44AM

Back in the day, Eliezer proposed a method for measuring the optimization power (OP) of a system S. The idea is to get a measure of how small a target the system can hit:

You can quantify this, at least in theory, supposing you have (A) the agent or optimization process's preference ordering, and (B) a measure of the space of outcomes - which, for discrete outcomes in a finite space of possibilities, could just consist of counting them - then you can quantify how small a target is being hit, within how large a greater region.

Then we count the total number of states with equal or greater rank in the preference ordering to the outcome achieved, or integrate over the measure of states with equal or greater rank.  Dividing this by the total size of the space gives you the relative smallness of the target - did you hit an outcome that was one in a million?  One in a trillion?

Actually, most optimization processes produce "surprises" that are exponentially more improbable than this - you'd need to try far more than a trillion random reorderings of the letters in a book, to produce a play of quality equalling or exceeding Shakespeare.  So we take the log base two of the reciprocal of the improbability, and that gives us optimization power in bits.

For example, assume there were eight equally likely possible states {X0, X1, ... , X7}, and S gives them utilities {0, 1, ... , 7}. Then if S can make X6 happen, there are two states better or equal to its achievement (X6 and X7), hence it has hit a target filling 1/4 of the total space. Hence its OP is log2 4 = 2. If the best S could manage is X4, then it has only hit half the total space, and has an OP of only log2 2 = 1. Conversely, if S reached the perfect X7, 1/8 of the total space, then it would have an OP of log2 8 = 3.
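The example above can be sketched directly; a minimal version for discrete, equally likely outcomes:

```python
import math

def optimization_power(achieved_utility, utilities):
    """OP in bits: log2 of (total states / states at least as good as achieved)."""
    at_least_as_good = sum(1 for u in utilities if u >= achieved_utility)
    return math.log2(len(utilities) / at_least_as_good)

utilities = list(range(8))  # states X0..X7 with utilities 0..7

print(optimization_power(6, utilities))  # 2.0: hit the top quarter of the space
print(optimization_power(4, utilities))  # 1.0: hit the top half
print(optimization_power(7, utilities))  # 3.0: hit the single best state
```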


Logical Uncertainty as Probability

3 gRR 29 April 2012 10:26PM

This post is a long answer to this comment by cousin_it:

Logical uncertainty is weird because it doesn't exactly obey the rules of probability. You can't have a consistent probability assignment that says axioms are 100% true but the millionth digit of pi has a 50% chance of being odd.

I'd like to attempt to formally define logical uncertainty in terms of probability. I don't know if what results is in any way novel or useful, but here goes.

Let X be a finite set of true statements of some formal system F extending propositional calculus, like Peano Arithmetic. X is supposed to represent a set of logical/mathematical beliefs of some finite reasoning agent.

Given any X, we can define its "Obvious Logical Closure" OLC(X), an infinite set of statements producible from X by applying the rules and axioms of propositional calculus. An important property of OLC(X) is that it is decidable: for any statement S it is possible to find out whether S is true (S∈OLC(X)), false ("~S"∈OLC(X)), or uncertain (neither).
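The decidability claim can be sketched by brute force, assuming (as the definition allows) that OLC applies only propositional reasoning, so every non-propositional statement is treated as an atom:

```python
from itertools import product

def entails(X, S, atoms):
    """True iff every truth assignment satisfying all of X satisfies S."""
    return all(
        S(env)
        for values in product([False, True], repeat=len(atoms))
        for env in [dict(zip(atoms, values))]
        if all(f(env) for f in X)
    )

def status(X, S, atoms):
    if entails(X, S, atoms):
        return "true"        # S ∈ OLC(X)
    if entails(X, lambda e: not S(e), atoms):
        return "false"       # ~S ∈ OLC(X)
    return "uncertain"       # neither

# X = {a, a => b}; c is an unrelated atom.
X = [lambda e: e["a"], lambda e: (not e["a"]) or e["b"]]
atoms = ["a", "b", "c"]
print(status(X, lambda e: e["b"], atoms))      # true
print(status(X, lambda e: not e["a"], atoms))  # false
print(status(X, lambda e: e["c"], atoms))      # uncertain
```

Truth-table enumeration is exponential in the number of atoms, but it is a decision procedure, which is all the definition needs.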

We can now define the "conditional" probability P(*|X) as a function from {the statements of F} to [0,1] satisfying the axioms:

Axiom 1: Known true statements have probability 1:

    P(S|X)=1  iff  S∈OLC(X)

Axiom 2: The probability of a disjunction of mutually exclusive statements is equal to the sum of their probabilities:

    "~(A∧B)"∈OLC(X)  implies  P("A∨B"|X) = P(A|X) + P(B|X)

From these axioms we can get all the expected behavior of the probabilities:

    P("~S"|X) = 1 - P(S|X)

    P(S|X)=0  iff  "~S"∈OLC(X)

    0 < P(S|X) < 1  iff  S∉OLC(X) and "~S"∉OLC(X)

    "A=>B"∈OLC(X)  implies  P(A|X)≤P(B|X)

    "A<=>B"∈OLC(X)  implies  P(A|X)=P(B|X)

          etc.

This is still insufficient to calculate an actual probability value for any uncertain statement. Additional principles are required. For example, the Consistency Desideratum of Jaynes: "equivalent states of knowledge must be represented by the same probability values".

Definition: two statements A and B are indistinguishable relative to X iff there exists an isomorphism between OLC(X∪{A}) and OLC(X∪{B}), which is identity on X, and which maps A to B.
[Isomorphism here is a 1-1 function f preserving all logical operations:  f(A∨B)=f(A)∨f(B), f(~~A)=~~f(A), etc.]

Axiom 3: If A and B are indistinguishable relative to X, then  P(A|X) = P(B|X).

Proposition: Let X be the set of statements representing my current mathematical knowledge, translated into F.  Then the statements "millionth digit of PI is odd" and "millionth digit of PI is even" are indistinguishable relative to X.

Corollary:  P(millionth digit of PI is odd | my current mathematical knowledge) = 1/2.

 

Learning the basics of probability & beliefs

3 tomme 31 March 2012 09:18AM

Let's say that I believe that the sky is green.

1) How can I know whether this belief is true?

2) How can I assign a probability to it to test its degree of truthfulness?

3) How can I update this belief?

Thank you.

Causation, Probability and Objectivity

7 antigonus 18 March 2012 06:54AM

Most people here seem to endorse the following two claims:

1. Probability is "in the mind," i.e., probability claims are true only in relation to some prior distribution and set of information to be conditionalized on;
2. Causality is to be cashed out in terms of probability distributions à la Judea Pearl or something.

However, these two claims feel in tension to me, since they appear to have the consequence that causality is also "in the mind" - whether something caused something else depends on various probability distributions, which in turn depend on how much we know about the situation. Worse, it has the consequence that ideal Bayesian reasoners can never be wrong about causal relations, since they always have perfect knowledge of their own probabilities.

Since I don't understand Pearl's model of causality very well, I may be missing something fundamental, so this is more of a question than an argument.

Gambler's Reward: Optimal Betting Size

6 b1shop 17 January 2012 08:32PM

I've been trying my hand at card counting lately, and I've been doing some thinking about how a perfect gambler would act at the table. I'm not sure how to derive the optimal bet size.

Overall, the expected value of blackjack is small and negative. However, the expected value varies considerably from round to round. By varying his bet size and sitting out rounds, the player can wager more money when expected value is higher and less money when expected value is lower. Overall, this can result in an edge.

However, I'm not sure what the optimal bet size is. Going all-in with a 60 percent chance of winning is EV+, but the 40 percent chance of loss would not only destroy your bankroll, it would also prevent you from participating in future EV+ situations. Ideally, one would want to not only increase EV, but also decrease variance.

Objective: Given a distribution of expected values, develop a function that transforms the current expected value into the percentage of the bankroll that should be placed at risk.

I'm not sure how to begin, even if I had worked out the distribution of expected values. Are other inputs required (e.g. utility of marginal dollar won, desired risk of ruin)? Should the approach perhaps be to maximize expected value after one playing session? Why not a month of playing sessions, or a billion? Is there any chance the optimal betting size would produce behavior similar to the behavior predicted by prospect theory?

I eagerly await an informative discussion. If you have something against gambling, just pretend we're talking about how much of your wealth you plan on investing in an oil well with positive expected value.
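For the record, the objective stated above has a standard (if partial) answer: the Kelly criterion, which maximizes the expected logarithm of the bankroll. A minimal sketch:

```python
def kelly_fraction(p_win, net_odds=1.0):
    """Kelly stake for a bet paying `net_odds` to 1, won with probability p_win."""
    edge = p_win * (net_odds + 1) - 1   # expected profit per unit staked
    return max(edge / net_odds, 0.0)    # sit out when the edge is negative

print(kelly_fraction(0.6))   # ≈ 0.2: the 60%-to-win example -- bet 20%, never all-in
print(kelly_fraction(0.49))  # 0.0: negative edge, sit out the round
```

This addresses the all-in worry directly: log utility penalizes ruin infinitely, so no separate risk-of-ruin input is needed, and a more risk-averse bettor simply scales the stake down ("fractional Kelly").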

The lessons of a world without Hitler

-4 Stuart_Armstrong 16 January 2012 04:16PM

What would the world look like without Hitler? Fiction is generally unequivocal about this: the removal of Hitler makes no difference, and the world still lurches towards a world war through some other path. WWII and the Holocaust are such major, defining events of the twentieth century that we twist counterfactual events to ensure they still happen.

Against this, some have made the argument that Hitler was essentially solely responsible for WWII and especially for the Holocaust - no Hitler, no war, no extermination camps. The no-Holocaust argument is quite solid: the extermination system was expensive, militarily counter-productive, and could only have happened given a leader lacking checks and balances and with an idée fixe that overrode everything else (general European antisemitism allowed the Holocaust, but didn't cause it). The no-WWII argument points out that Hitler was both irrational and lucky: he often took great risks, on flimsy evidence, and got away with them. Certainly his decisions in the later, post-Barbarossa period of his reign belie political, military or organisational genius. And it was the height of stupidity to have gone to war, for half of Poland, with simultaneously the world's greatest empire and what appeared to be the overwhelmingly strong French army. Yes, Gamelin, the French commander-in-chief, did behave like a concussed duckling, and the German army outfought the French - but no-one could have predicted this, no-one sensible would have counted on it, and hence they wouldn't have risked the war. Hitler wasn't sensible, and lucked out.


Can you recognize a random generator?

2 uzalud 28 December 2011 01:59PM

I can't seem to get my head around a simple issue of judging probability. Perhaps someone here can point to an obvious flaw in my thinking.

Let's say we have a binary generator, a machine that outputs a required sequence of ones and zeros according to some internally encapsulated rule (deterministic or probabilistic). All binary generators look alike and you can only infer (a probability of) a rule by looking at its output.

You have two binary generators: A and B. One of these is a true random generator (fair coin tosser). The other one is a biased random generator: stateless (each digit is independently calculated from those given before), with probability of outputting zero p(0) somewhere between zero and one, but NOT 0.5 - let's say it's uniformly distributed in the range [0; .5) U (.5; 1]. At this point, chances that A is a true random generator are 50%.

Now you read the first ten digits output by each machine. Machine A outputs 0000000000. Machine B outputs 0010111101. Knowing this, is the probability of machine A being a true random generator now less than 50%?

My intuition says yes.

But the probability that a true random generator will output 0000000000 should be the same as the probability that it will output 0010111101, because all sequences of equal length are equally likely. The biased random generator is also just as likely to output 0000000000 as it is 0010111101.

So there seems to be no reason to think that a machine outputting a sequence of zeros of any size is any more likely to be a biased stateless random generator than it is to be a true random generator.

I know that you can never know that the generator is truly random. But surely you can statistically discern between random and non-random generators?
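Yes - and the flaw is in the step about the biased generator. The two sequences are equally likely only for a known, fixed bias of 0.5; for the biased machine you must average p^z(1-p)^(n-z) over the unknown bias, and that average strongly favours all-zeros. A sketch, taking the bias as uniform on [0, 1] (dropping the single point 0.5 changes nothing):

```python
import math

# P(sequence | biased) with bias p uniform on [0, 1]:
# integral of p^z (1-p)^(n-z) dp = z! (n-z)! / (n+1)!,
# where z = number of zeros in the sequence.
def p_given_biased(zeros, length):
    ones = length - zeros
    return math.factorial(zeros) * math.factorial(ones) / math.factorial(length + 1)

p_fair = 0.5 ** 10                    # same for every 10-digit sequence

p_allzeros = p_given_biased(10, 10)   # 1/11   for 0000000000
p_mixed = p_given_biased(4, 10)       # 1/2310 for 0010111101 (4 zeros)

# Posterior that machine A (which printed 0000000000) is the fair one,
# starting from the 50% prior:
posterior_fair = p_fair / (p_fair + p_allzeros)
print(posterior_fair)  # ≈ 0.011: the intuition is right
```

So the outputs are far from interchangeable evidence: after ten zeros, machine A is roughly 93 times more likely to be the biased generator than the fair one.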

Statisticsish Question

3 damang 28 November 2011 04:03PM

This is a question really, not a post; I just can't find the answer formally. Does Laplace's rule of succession work when you are taking from a finite population without replacement? If I know that some papers in a hat have "yes" on them, and I know that the rest don't, and that there is a finite number of papers, and every time I take a paper out I burn it, but I have no clue how many papers are in the hat, should I still use Laplace's rule to figure out how much to expect the next paper to have a "yes" on it? Or is there some adjustment you make, since every time I see a yes paper the odds of yes papers to non-yes papers in the hat go down?
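A simulation sketch suggests that no adjustment is needed, provided the prior over the hat's composition is uniform (a uniform number of "yes" papers is exactly the Beta(1,1) mixture Laplace's rule assumes, and draws without replacement are still exchangeable). The priors over hat size and composition below are hypothetical choices for the check:

```python
import random

random.seed(3)

def trial():
    size = random.randint(1, 50)       # unknown number of papers in the hat
    yes = random.randint(0, size)      # unknown number of "yes" papers
    hat = [1] * yes + [0] * (size - yes)
    random.shuffle(hat)
    n = random.randint(0, size - 1)    # papers already drawn (and burned)
    s = sum(hat[:n])                   # "yes" papers seen so far
    laplace = (s + 1) / (n + 2)        # rule-of-succession prediction
    return laplace, hat[n]             # prediction, actual next paper

results = [trial() for _ in range(200_000)]

# Calibration check: when Laplace predicts ~70%, the next paper should be
# "yes" about 70% of the time, despite the lack of replacement.
bucket = [actual for pred, actual in results if 0.65 <= pred <= 0.75]
print(sum(bucket) / len(bucket))  # ≈ 0.7
```

The seen-a-yes-paper worry is already priced in: drawing a yes both lowers the remaining yes count and raises the posterior on yes-rich hats, and under the uniform prior these effects cancel exactly.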

Log-odds (or logits)

20 brilee 28 November 2011 01:11AM

(I wrote this post for my own blog, and given the warm reception, I figured it would also be suitable for the LW audience. It contains some nicely formatted equations/tables in LaTeX, hence I've left it as a dropbox download.)

Logarithmic probabilities have appeared previously on LW here, here, and sporadically in the comments. The first is a link to an Eliezer post which covers essentially the same material. I believe this is a better introduction/description/guide to logarithmic probabilities than anything else that's appeared on LW thus far.

 

 

Introduction:

Our conventional way of expressing probabilities has always frustrated me. For example, it is very easy to say nonsensical statements like, “110% chance of working”. Or, it is not obvious that the difference between 50% and 50.01% is trivial compared to the difference between 99.98% and 99.99%. It also fails to accommodate the math correctly when we want to say things like, “five times more likely”, because 50% * 5 overflows 100%.
Jacob and I have (re)discovered a mapping from probabilities to log-odds which addresses all of these issues. To boot, it accommodates Bayes’ theorem beautifully. For something so simple and fundamental, it certainly took a great deal of google searching/wikipedia surfing to discover that they are actually called “log-odds”, and that they were “discovered” in 1944, instead of the 1600s. Also, nobody seems to use log-odds, even though they are conceptually powerful. Thus, this primer serves to explain why we need log-odds, what they are, how to use them, and when to use them.
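The core mapping can be sketched in a few lines (base 10, matching the 11/30 update):

```python
import math

def to_log_odds(p):
    """Map probability p to log-odds: log10 of the odds p : (1 - p)."""
    return math.log10(p / (1 - p))

def from_log_odds(lo):
    """Inverse mapping, back to a probability."""
    return 1 / (1 + 10 ** -lo)

# 50% vs 50.01% is a tiny shift; 99.98% vs 99.99% is a doubling of the odds:
print(to_log_odds(0.5001) - to_log_odds(0.5))    # ≈ 0.00017
print(to_log_odds(0.9999) - to_log_odds(0.9998)) # ≈ 0.301 (log10 of 2)

# "Five times more likely" is just adding log10(5) -- nothing overflows:
print(from_log_odds(to_log_odds(0.5) + math.log10(5)))  # 5:1 odds, i.e. ≈ 0.833

# Bayes' theorem becomes addition:
#   posterior log-odds = prior log-odds + log10(likelihood ratio)
```

Impossible statements like "110% chance" simply have no log-odds representation, which is the point.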

 

Article is here (Updated 11/30 to use base 10)

Bayes Slays Goodman's Grue

0 potato 17 November 2011 10:45AM

This is a first stab at solving Goodman's famous grue problem. I haven't seen a post on LW about the grue paradox, and this surprised me, since I had figured that if any arguments would be raised against Bayesian LW doctrine, it would be the grue problem. I haven't looked at many proposed solutions to this paradox, besides some of the basic ones in "The New Problem of Induction". So, I apologize now if my solution is wildly unoriginal. I am willing to put you through this, dear reader, because:

  1. I wanted to see how I would fare against this still largely open, devastating, and classic problem, using only the arsenal provided to me by my minimal Bayesian training, and my regular LW reading.
  2. I wanted the first LW article about the grue problem to attack it from a distinctly Lesswrongian approach, without the benefit of hindsight knowledge of the solutions of non-LW philosophy. 
  3. And lastly, because, even if this solution has been found before, if it is the right solution, it is to LW's credit that its students can solve the grue problem with only the use of LW skills and cognitive tools.

I would also like to warn the savvy subjective Bayesian that just because I think that probabilities model frequencies, and that I require frequencies out there in the world, does not mean that I am a frequentist or a realist about probability. I am a formalist with a grain of salt. There are no probabilities anywhere in my view, not even in minds; but the theorems of probability theory, when interpreted, share a fundamental contour with many important tools of the inquiring mind, including both the nature of frequency and the set of rational subjective belief systems. There is nothing more to probability than that system which produces its theorems. 

Lastly, I would like to say that even if I have not succeeded here (which I think I have), there is likely something valuable that can be made from the leftovers of my solution after the onslaught of penetrating critiques that I expect from this community. Solving this problem is essential to LW's methods, and our arsenal is fit to handle it. If we are going to be taken seriously in the philosophical community as a new movement, we must solve serious problems from academic philosophy, and we must do it in distinctly Lesswrongian ways.

 


 

"The first emerald ever observed was green.
The second emerald ever observed was green.
The third emerald ever observed was green.
… etc.
The nth emerald ever observed was green.
(conclusion):
There is a very high probability that a never before observed emerald will be green."

That is the inference that the grue problem threatens, courtesy of Nelson Goodman.  The grue problem starts by defining "grue":

"An object is grue iff it is first observed before time T, and it is green, or it is first observed after time T, and it is blue."

So you see that before time T, from the list of premises:

"The first emerald ever observed was green.
 The second emerald ever observed was green.
 The third emerald ever observed was green.
 … etc.
 The nth emerald ever observed was green."
 (we will call these the green premises)

it follows that:

"The first emerald ever observed was grue.
The second emerald ever observed was grue.
The third emerald ever observed was grue.
… etc.
The nth emerald ever observed was grue."
(we will call these the grue premises)

The proposer of the grue problem asks at this point: "So if the green premises are evidence that the next emerald will be green, why aren't the grue premises evidence for the next emerald being grue?" If an emerald is grue after time T, it is not green. Let's say that the green premises bring the probability of "A new unobserved emerald is green." to 99%. In the skeptic's hypothesis, by symmetry they should also bring the probability of "A new unobserved emerald is grue." to 99%. But after time T, this would mean that the probability of observing a green emerald is 99%, and the probability of not observing a green emerald is at least 99%. Since these sentences have no intersection, i.e., they cannot happen together, we find the probability of their disjunction by adding their individual probabilities. This gives us a number at least as big as 198%, which is a contradiction of the Kolmogorov axioms: we should not be able to form a statement with a probability greater than one.

This threatens the whole of science, because you cannot simply keep this isolated to emeralds and color. We may think of the emeralds as trials, and green as the value of a random variable. Ultimately, every result of a scientific instrument is a random variable, with a very particular and useful distribution over its values. If we can't justify inferring probability distributions over random variables based on their previous results, we cannot justify a single bit of natural science. This, of course, says nothing about how it works in practice. We all know it works in practice. "A philosopher is someone who says, 'I know it works in practice; I'm trying to see if it works in principle.'" - Dan Dennett

We may look at an analogous problem. Let's suppose that there is a table, that balls are being dropped onto this table, and that there is an infinitely thin line drawn perpendicular to the edge of the table somewhere, whose position we are unaware of. The problem is to figure out the probability of the next ball landing right of the line given the previous results. Our first prediction should be that there is a 50% chance of the ball landing right of the line, by symmetry. If we get the result that one ball landed right of the line, by Laplace's rule of succession we infer that there is a 2/3 chance that the next ball will land right of the line. After n trials, if every trial gives a positive result, the probability we should assign to the next trial being positive as well is (n+1)/(n+2).
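The ball-and-line setup can be simulated directly. A sketch, with arbitrary choices of line position, trial count, and seed; the line is placed so that two-thirds of the table lies to its right:

```python
import random

def simulate(line=1/3, trials=10000, seed=0):
    """Drop balls uniformly on [0, 1] and count those landing right of
    the line, then return Laplace's rule-of-succession estimate
    (r + 1) / (n + 2) that the next ball lands right of the line."""
    rng = random.Random(seed)
    rights = sum(rng.random() > line for _ in range(trials))
    return (rights + 1) / (trials + 2)

print(simulate(trials=0))  # 0.5, the symmetric prior before any trials
print(simulate())          # ≈ 2/3, approaching the true frequency
```

As trials accumulate, the estimate converges on the actual fraction of the table to the right of the line, which is the point of the analogy.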

If this line were placed two-thirds of the way down the table, we should expect the ratio of rights to lefts to approach 2:1. This gives us a 2/3 chance of the next ball landing right, and the fraction of rights out of trials approaches 2/3 ever more closely as more trials are performed.

Now let us suppose a grue skeptic approaching this situation. He might make up two terms "reft" and "light". Defined as you would expect, but just in case:

"A ball is reft of the line iff it is right of it before time T when it lands, or if it is left of it after time T when it lands.
 A ball is light of the line iff it is left of the line before time T when it lands, or if it is right of the line after time T when it first lands."

The skeptic would continue:

"Why should we treat the observation of several occurrences of Right, as evidence for 'The next ball will land on the right.' and not as evidence for 'The next ball will land reft of the line.'?"

Things for some reason become perfectly clear at this point for the defender of Bayesian inference, because now we have an easy-to-imagine model. Of course, if a ball landing right of the line is evidence for Right, then it cannot possibly be evidence for ~Right; to be evidence for Reft, after time T, is to be evidence for ~Right, because after time T, Reft is logically identical to ~Right; hence it is not evidence for Reft, after time T, for the same reasons it is not evidence for ~Right. Of course, before time T, any evidence for Reft is evidence for Right, for analogous reasons.

But now the grue skeptic can say something brilliant, that stops much of what the Bayesian has proposed dead in its tracks:

"Why can't I just repeat that paragraph back to you and swap every occurrence of 'right' with 'reft' and 'left' with 'light', and vice versa? They are perfectly symmetrical in terms of their logical relations to one another.
If we take 'reft' and 'light' as primitives, then we have to define 'right' and 'left' in terms of 'reft' and 'light' with the use of time intervals."

What can we possibly reply to this? Can he/she not do this with every argument we propose then? Certainly, the skeptic admits that Bayes, and the contradiction in Right & Reft, after time T, prohibits previous Rights from being evidence of both Right and Reft after time T; where he is challenging us is in choosing Right as the result which it is evidence for, even though "Reft" and "Right" have a completely symmetrical syntactical relationship. There is nothing about the definitions of reft and right which distinguishes them from each other, except their spelling. So is that it? No, this simply means we have to propose an argument that doesn't rely on purely syntactical reasoning. So that if the skeptic performs the swap on our argument, the resulting argument is no longer sound.

What would happen in this scenario if it were actually set up? I know that seems like a strangely concrete question for a philosophy text, but its answer is a helpful hint. What would happen is that after time T, the ratio Rights:Lefts would continue to behave as expected as more trials were added, while the ratio Refts:Lights would approach the reciprocal of Rights:Lefts. The only way for this not to happen is for us to have been calling the right side of the table "reft", or for the line to have moved. We can only figure out where the line is by knowing where the balls landed relative to it; anything we can figure out about where the line is from knowing which balls landed Reft and which ones landed Light, we can only figure out because, knowing this and the time, we can know whether the ball landed left or right of the line.

To this I know of no reply the grue skeptic can make. If he/she says the paragraph back to me with the proper words swapped, it is not true: in the hypothetical where we have a table, a line, and we are calling one side right and the other side left, the only way for Refts:Lights to behave as expected as more trials are added is to move the line (if even that); otherwise the ratio of Refts to Lights will approach the reciprocal of Rights to Lefts.

This thin line is analogous to the frequency of emeralds that turn out green out of all the emeralds that get made. This is why we can assume that the line will not move: that frequency has one precise value, which never changes. Its other important feature is reminding us that even if two terms are syntactically symmetrical, they may have semantic conditions for application which are ignored by the syntactical model, e.g., checking to see which side of the line the ball landed on.

 


 

In conclusion:

Every random variable has as a part of it, stored in its definition/code, a frequency distribution over its values. By the fact that some things happen sometimes, and others happen other times, we know that the world contains random variables, even if they are never fundamental in the source code. Note that "frequency" is not used here as a state of partial knowledge; it is a fact about a set and one of its subsets.

The reason that:

"The first emerald ever observed was green.
The second emerald ever observed was green.
The third emerald ever observed was green.
… etc.
The nth emerald ever observed was green.
(conclusion):
There is a very high probability that a never before observed emerald will be green."

is a valid inference, but the grue equivalent isn't, is that grue is not a property that the emerald construction sites of our universe deal with. They are blind to the grueness of their emeralds; they only say anything about whether or not the next emerald will be green. It may be that the rule the emerald construction sites use to produce a green or non-green emerald changes at time T, but the frequency of some particular result out of all trials will never change; the line will not move. As long as we know what symbols we are using for what values, observing many green emeralds is evidence that the next one will be grue only before time T; after time T, every record of an observation of a green emerald is evidence against a grue one. "Grue" changes meaning from green to blue at time T, while the meaning of "green" stays the same, since we are using the same physical test to determine green-hood as before, just as we use the same test to tell whether the ball landed right or left of the line. There is no reft in the universe's source code, and there is no grue. Green is not fundamental in the source code either, but green can be reduced to some particular range of quanta states; if you had the universe's source code, you couldn't write grue without first writing green, while writing green requires knowing nothing about grue. Having a physical test, or primary condition for applicability, is what privileges green over grue after time T; to have a consistent physical test is the same as to reduce to a specifiable range of physical parameters, and the existence of such a test is what prevents the skeptic from performing his/her swaps on our arguments.


Take this more as a brainstorm than as a final solution. It wasn't originally, but it should have been. I'll write something more organized and concise after I think about the comments more, and make some graphics I've designed that make my argument much clearer, even to myself. But keep those comments coming, and tell me if you want specific credit for anything you may have added to my grue toolkit in the comments.

If life is unlikely, SIA and SSA expectations are similar

3 Stuart_Armstrong 15 November 2011 04:45PM

Consider a scenario in which there are three rooms. In each room there is an independent 1/1000 chance of an agent being created. There is thus a 1/10^9 probability of there being an agent in every room, a (3*999)/10^9 probability of there being two agents, and a (3*999^2)/10^9 probability of there being one.

Given that you are one of these agents, the SIA and SSA probabilities of there being n agents are:

Number of agents    SIA    SSA
0    0    0
1    (1*3*999^2)/(3*1 + 2*3*999 + 1*3*999^2)    (3*999^2)/(1 + 3*999 + 3*999^2)
2    (2*3*999)/(3*1 + 2*3*999 + 1*3*999^2)    (3*999)/(1 + 3*999 + 3*999^2)
3    (3*1)/(3*1 + 2*3*999 + 1*3*999^2)    (1)/(1 + 3*999 + 3*999^2)

The expected number of agents is (1*(3*999^2) + 2*(2*3*999) + 3*(3*1))/(3*1 + 2*3*999 + 1*3*999^2) = 1.002 for SIA, and (1*(3*999^2) + 2*(3*999) + 3*(1))/(1 + 3*999 + 3*999^2) ≈ 1.001 for SSA. The high unlikelihood of life means that, given that we are alive, both SIA and SSA probabilities are dominated by worlds with very few agents.
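The arithmetic above can be verified with a short script (a sketch using the post's numbers, p = 1/1000 and three rooms):

```python
from math import comb

p, rooms = 1/1000, 3

# Probability that exactly n of the three rooms contain an agent.
world = {n: comb(rooms, n) * p**n * (1 - p)**(rooms - n)
         for n in range(rooms + 1)}

# SIA weights each world by its number of agents; SSA weights worlds
# by their probability alone, conditioned on at least one agent existing.
sia_norm = sum(n * world[n] for n in range(1, rooms + 1))
ssa_norm = sum(world[n] for n in range(1, rooms + 1))

sia_expected = sum(n * (n * world[n]) / sia_norm for n in range(1, rooms + 1))
ssa_expected = sum(n * world[n] / ssa_norm for n in range(1, rooms + 1))

print(round(sia_expected, 3))  # 1.002
print(round(ssa_expected, 3))  # 1.001
```

Both expectations sit just above 1, confirming that worlds with a single agent dominate under either assumption.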

This of course only applies to agents whose existence is independent (for instance, separate galactic civilizations). If you're alive, chances are that your parents were also alive at some point.

 

Which fields of learning have clarified your thinking? How and why?

12 [deleted] 11 November 2011 01:04AM

Did computer programming make you a clearer, more precise thinker? How about mathematics? If so, what kind? Set theory? Probability theory?

Microeconomics? Poker? English? Civil Engineering? Underwater Basket Weaving? (For adding... depth.)

Anything I missed?

Context: I have a palette of courses to dab onto my university schedule, and I don't know which ones to choose. This much is for certain: I want to come out of university as a problem-solving beast. If there are fields of inquiry whose methods easily transfer to other fields, it is those fields that I want to learn in, at least initially.

Rip apart, Less Wrong!

Help with a (potentially Bayesian) statistics / set theory problem?

2 joshkaufman 10 November 2011 10:30PM

Update: as it turns out, this is a voting system problem, which is a difficult but well-studied topic. Potential solutions include Ranked Pairs (complicated) and BestThing (simpler). Thanks to everyone for helping me think this through out loud, and for reminding me to kill flies with flyswatters instead of bazookas.


I'm working on a problem that I believe involves Bayes, I'm new to Bayes and a bit rusty on statistics, and I'm having a hard time figuring out where to start. (EDIT: it looks like set theory may also be involved.) Your help would be greatly appreciated.

Here's the problem: assume a set of 7 different objects. Two of these objects are presented at random to a participant, who selects whichever one of the two objects they prefer. (There is no "indifferent" option.) The order of these combinations is not important, and repeated combinations are not allowed.

Basic combination theory says there are 21 different possible combinations: (7!) / ( (2!) * (7-2)! ) = 21.

Now, assume the researcher wants to know which single option has the highest probability of being the "most preferred" to a new participant based on the responses of all previous participants. To complicate matters, each participant can leave at any time, without completing the entire set of 21 responses. Their responses should still factor into the final result, even if they only respond to a single combination.

At the beginning of the study, there are no priors. (CORRECTION via dlthomas: "There are necessarily priors... we start with no information about rankings... and so assume a 1:1 chance of either object being preferred.") If a participant selects B from {A,B}, the probability of B being the "most preferred" object should go up, and A's should go down, if I'm understanding correctly.

NOTE: Direct ranking of objects 1-7 (instead of pairwise comparison) isn't ideal because it takes longer, which may encourage the participant to rationalize. The "pick-one-of-two" approach is designed to be fast, which is better for gut reactions when comparing simple objects like words, photos, etc.

The ideal output looks like this: "Based on ___ total responses, participants prefer Object A. Object A is preferred __% more than Object B (the second most preferred), and ___% more than Object C (the third most preferred)."
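As a starting point, here is a minimal sketch of the simplest tally: count each object's pairwise wins with a 1:1 pseudocount prior. The objects and responses below are invented, and this naive win rate is not a full voting-system solution like Ranked Pairs:

```python
from collections import Counter

objects = list("ABCDEFG")

# Hypothetical responses: (shown pair, chosen object). Participants may
# answer any subset of the 21 possible pairs before leaving.
responses = [
    (("A", "B"), "A"), (("A", "C"), "A"), (("B", "C"), "B"),
    (("A", "D"), "A"), (("B", "D"), "B"), (("C", "D"), "D"),
]

wins = Counter()
shown = Counter()
for pair, choice in responses:
    for obj in pair:
        shown[obj] += 1
    wins[choice] += 1

# Laplace-smoothed win rate: this encodes the 1:1 prior from the
# correction, so never-shown objects score 0.5 rather than being undefined.
scores = {o: (wins[o] + 1) / (shown[o] + 2) for o in objects}
ranking = sorted(objects, key=scores.get, reverse=True)
print(ranking[0])  # 'A', which won all three of its comparisons here
```

This handles incomplete response sets gracefully, though it ignores which opponent each win came against; the voting-system methods in the update address that.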

Questions:

1. Is Bayes actually the most straightforward way of calculating the "most preferred"? (If not, what is? I don't want to be Maslow's "man with a hammer" here.)

2. If so, can you please walk me through the beginning of how this calculation is done, assuming 10 participants?

Thanks in advance!

Foundations of Inference

8 amcknight 31 October 2011 07:48PM

I've recently been getting into all of this wonderful Information Theory stuff and have come across a paper (thanks to John Salvatier) that was written by Kevin H. Knuth:

Foundations of Inference

The paper sets up some intuitive minimal axioms for quantifying power sets and then (seemingly) uses them to derive Bayesian probability theory, information gain, and Shannon entropy. The paper also claims to use fewer assumptions than both Cox and Kolmogorov when choosing axioms. This seems like a significant foundation/unification. I'd like to hear whether others agree and what parts of the paper you think are the significant contributions.

If a 14 page paper is too long for you, I recommend skipping to the conclusion (starting at the bottom of page 12) where there is a nice picture representation of the axioms and a quick summary of what they imply.

Religion, happiness, and Bayes

3 fortyeridania 04 October 2011 10:21AM

Religion apparently makes people happier. Is that evidence for the truth of religion, or against it?

(Of course, it matters which religion we're talking about, but let's just stick with theism generally.)

My initial inclination was to interpret this as evidence against theism, in the sense that it weakens the evidence for theism. Here's why:

  1. As all Bayesians know, a piece of information F is evidence for an hypothesis H to the degree that F depends on H. If F can happen just as easily without H as with it, then F is not evidence for H. The more likely we are to find F in a world without H, the weaker F is as evidence for H.
  2. Here, F is "Theism makes people happier." H is "Theism is true."
  3. The fact of widespread theism is evidence for H. The strength of this evidence depends on how likely such belief would be if H were false.
  4. As people are more likely to do something if it makes them happy, people are more likely to be theists given F.
  5. Thus F opens up a way for people to be theists even if H is false.
  6. It therefore weakens the evidence of widespread theism for the truth of H.
  7. Therefore, F should decrease one's confidence in H, i.e., it is evidence against H.

We could also put this in mathematical terms, where F represents an increase in the prior probability of our encountering the evidence. Since that probability appears in the denominator of Bayes' equation, a bigger value means a smaller posterior probability; in other words, weaker evidence.
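A toy calculation illustrates this, with all numbers invented purely for illustration:

```python
def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    """P(H | E) by Bayes' theorem for a binary hypothesis H."""
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)
    return p_e_given_h * prior_h / p_e

# E = "theism is widespread", H = "theism is true". Made-up numbers.
# If widespread belief were unlikely in a godless world:
print(posterior(0.5, 0.9, 0.2))  # ≈ 0.818, a strong update toward H

# F ("belief brings happiness") makes belief likelier even if H is
# false, raising P(E|~H) and shrinking the update:
print(posterior(0.5, 0.9, 0.6))  # 0.6, a much weaker update
```

Raising P(E|~H) inflates the denominator without touching the numerator, which is exactly the "weakening" described in step 6.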

OK, so that was my first thought.

But then I had second thoughts: Perhaps the evidence points the other way? If we reframe the finding as "Atheism causes unhappiness," or posit that contrarians (such as atheists) are dispositionally unhappy, does that change the sign of the evidence?

Obviously, I am confused. What's going on here?

P(X = exact value) = 0: Is it really counterintuitive?

8 lucidfox 29 July 2011 12:45PM

I'm probably not going to say anything new here. Someone must have pondered over this already. However, hopefully it will invite discussion and clear things up.

Let X be a random variable with a continuous distribution over the interval [0, 10]. Then, by the definition of probability over continuous domains, P(X = 1) = 0. The same is true for P(X = 10), P(X = sqrt(2)), P(X = π), and in general, the probability that X is equal to any exact number is always zero, as an integral over a single point.
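A quick sketch makes the limit concrete for the uniform distribution on [0, 10]:

```python
def p_interval(a, b, lo=0.0, hi=10.0):
    """P(a <= X <= b) for X uniform on [lo, hi]."""
    a, b = max(a, lo), min(b, hi)
    return max(b - a, 0.0) / (hi - lo)

# The probability of an interval around 1 is positive...
for eps in (1.0, 0.1, 0.001):
    print(p_interval(1 - eps, 1 + eps))  # positive, shrinking with eps

# ...but it vanishes as the interval collapses to the point itself:
print(p_interval(1, 1))  # 0.0
```

The integral over a single point is the eps → 0 limit of these interval probabilities, which is why P(X = 1) = 0.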

This is sometimes described as counterintuitive: surely, at any measurement, X must be equal to something, and thus the probability cannot be zero, since that something clearly happened. It can, of course, be argued that mathematical probability is an abstract function that does not exactly map to our intuitive understanding of probability, but in this case, I would argue that it does.

What if X is the x-coordinate of a physical object? If classical physics is in question - for example, we pointed a needle at a random point on a 10 cm ruler - then the needle cannot be a point object, and must have a nonzero size. Thus, we can measure the probability of the 1 cm point lying within the space the end of the needle occupies, a probability that is clearly defined and nonzero.

But even if we're talking about a point object, while it may well occupy a definite and exact coordinate in classical physics, we'll never know exactly what it is. For one, our measuring tools are not that precise. But even if they had infinite precision, statements like "X equals exactly 2.(0)" or "X equals exactly π" contain infinite information, since they specify all the decimal digits of the coordinate into infinity. We would need an infinite number of measurements to confirm them. So while X may objectively equal exactly 2 or π - again, under classical physics - measurers would never know it. At any given point, to measurers, X would lie in an interval.

Then of course there is quantum physics, where it is literally impossible for any physical object, including point objects, to have a definite coordinate with arbitrary precision. In this case, the purely mathematical notion that any exact value is an impossible event turns out (by coincidence?) to match how the universe actually works.

Looking for proof of conditional probability

-1 DanielLC 28 July 2011 02:24AM

From what I understand, the Kolmogorov axioms make no mention of conditional probability. That is simply defined. If I really want to show how probability actually works, I'm not going to argue "by definition". Does anyone know a modified form that uses simpler axioms than P(A|B) = P(A∩B)/P(B)?

Against improper priors

2 DanielLC 26 July 2011 11:50PM

An improper prior is essentially a prior probability distribution that's infinitesimal over an infinite range, in order to add to one. For example, the uniform prior over all real numbers is an improper prior, as there would be an infinitesimal probability of getting a result in any finite range. It's common to use improper priors for when you have no prior information.

The mark of a good prior is that it gives a high probability to the correct answer. If I bet 1,000,000 to one that a coin will land on heads, and it lands on tails, it could be a coincidence, but I probably had a bad prior. A good prior is one that results in me not being very surprised.

With a proper prior, probability is conserved, and more probability mass in one place means less in another. If I'm less surprised when a coin lands on tails, I'm more surprised when it lands on heads. This isn't true with an improper prior. If I wanted to predict the value of a random real number, and used a normal distribution with a mean of zero and a standard deviation of one, I'd be pretty darn surprised if it doesn't end up being pretty close to zero, but I'd be infinitely surprised if I used a uniform distribution. No matter what the number is, it will be more surprising with the improper prior. Essentially, a proper prior is better in every way. (You could find exceptions to this, such as averaging a proper and improper prior to get an improper prior that still has finite probabilities that just add up to 1/2, or by using a proper prior that is zero in some places, but you can always make a proper prior that's better in every way than a given improper prior.)

Dutch books also seems to be a popular way of showing what works and what doesn't, so here's a simple Dutch argument against improper priors: I have two real numbers: x and y. Suppose they have a uniform distribution. I offer you a bet at 1:2 odds that x has a higher magnitude. They're equally likely to be higher, so you take it. I then show you the value of x. I offer you a new bet at 100:1 odds that y has a higher magnitude. You know y almost definitely has a higher magnitude than that, so you take it again. No matter what happens, I win.

You could try to get out of it by using a different prior, but I can just perform a transformation on it to get what I want. For example, if you choose a logarithmic prior for the magnitude, I can just take the magnitude of the log of the magnitude, and have a uniform distribution.

There are certainly uses for an improper prior. You can use it if the evidence is so great compared to the difference between it and the correct value that it isn't worth worrying about. You can also use it if you're not sure what another person's prior is, and you want to give a result that is at least as high as they'd get no matter how much their prior is spread out. That said, an improper prior is never actually correct, even for things that you have literally no evidence about.

[Link] The Bayesian argument against induction.

4 Peterdjones 18 July 2011 09:52PM

In 1983 Karl Popper and David Miller published an argument to the effect that probability theory could be used to disprove induction. Popper had long been an opponent of induction. Since probability theory in general, and Bayes in particular, is often seen as rescuing induction from the standard objections, the argument is significant.

It is being discussed over at the Critical Rationalism site.

Question about Large Utilities and Low Probabilities

4 sark 24 June 2011 06:33PM

Advance apologies if this has been discussed before.

Question: Philosophy and Mathematics are fields in which we employ abstract reasoning to arrive at conclusions. Can the relative success of philosophy versus mathematics provide empirical evidence for how robust our arguments must be before we can even hope to have a non-negligible chance of arriving at correct conclusions? Considering how bad philosophy has been at arriving at correct conclusions, must they not be essentially as robust as mathematical proof, or correct virtually with probability 1? If so, should this not cast severe doubt on arguments showing how, in expected utility calculations, outcomes with vast sums of utility can easily swamp a low probability of their coming to pass? Won't our estimates of such probabilities be severely inflated?

Related: http://lesswrong.com/lw/673/model_uncertainty_pascalian_reasoning_and/

Considering all scenarios when using Bayes' theorem.

9 Alexei 20 June 2011 06:11PM

Disclaimer: this post is directed at people who, like me, are not Bayesian/probability gurus.

Recently I found an opportunity to use Bayes' theorem in real life to help myself update in the following situation (presented in a gender-neutral way):

Let's say you are wondering if a person is interested in you romantically. And they bought you a drink.
A = they are interested in you.
B = they bought you a drink.
P(A) = 0.3 (Just an assumption.)
P(B) = 0.05 (Approximately 1 out of 20 people who might be at all interested in you will buy you a drink for some unknown reason.)
P(B|A) = 0.2 (Approximately 1 out of 5 people who are interested in you will buy you a drink, most likely because they are interested in you.)

These numbers seem valid to me, and I can't see anything that's obviously wrong. But when I actually use Bayes' theorem:
P(A|B) = P(B|A) * P(A) / P(B) = 1.2
Uh-oh! Where did I go wrong? See if you can spot the error before continuing.

Turns out:
P(B|A) = P(A∩B) / P(A) ≤ P(B) / P(A) = 0.1667
BUT
P(B|A) = 0.2 > 0.1667

I've made a mistake in estimating my probabilities, even though it felt intuitive. Yet, I don't immediately see where I went wrong when I look at the original estimates! What's the best way to prevent this kind of mistake?
I feel pretty confident in my estimates of P(A) and P(B|A). However, estimating P(B) is rather difficult because I need to consider many scenarios.

I can compute P(B) more precisely by considering all the scenarios that would lead to B happening (see wiki article):

P(B) = ∑_i P(B|H_i) * P(H_i)

Let's do a quick breakdown of everyone who would want to buy you a drink (out of the pool of people who might be at all interested in you):
P(misc. reasons) = 0.05; P(B|misc) = 0.01
P(they are just friendly and buy drinks for everyone they meet) = 0.05; P(B|friendly) = 0.8
P(they want to be friends) = 0.3; P(B|friends) = 0.1
P(they are interested in you) = 0.6; P(B|interested) = P(B|A) = 0.2
So, P(B) = 0.1905
And, P(A|B) = 0.315 (very different from 1.2!)
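The computation above can be laid out explicitly (a sketch using the post's numbers, including the original P(A) = 0.3):

```python
# Hypotheses about why someone buys you a drink, as (P(H), P(B|H)),
# using the breakdown from the post.
hypotheses = {
    "misc":       (0.05, 0.01),
    "friendly":   (0.05, 0.80),
    "friends":    (0.30, 0.10),
    "interested": (0.60, 0.20),
}

# Law of total probability: P(B) = sum over H of P(B|H) * P(H).
p_b = sum(p_h * p_b_given_h for p_h, p_b_given_h in hypotheses.values())
print(round(p_b, 4))  # 0.1905

# Bayes' theorem with the original P(A) = 0.3 and P(B|A) = 0.2:
p_a_given_b = 0.2 * 0.3 / p_b
print(round(p_a_given_b, 3))  # 0.315
```

Writing the hypotheses out as data makes it harder to forget a scenario like "friendly", whose 0.05 * 0.8 = 0.04 contribution is what moved the estimate.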

Once I started thinking about all possible scenarios, I found one I haven't considered explicitly -- some people buy drinks for everyone they meet -- which adds a good amount of probability (0.04) to B happening. (Those types of people are rare, but they WILL buy you a drink.) There are also other interesting assumptions that are made explicit:

  • Out of all the people under consideration in this problem, there are twice as many people who would be romantically interested in you vs. people who would want to be your friend.
  • People who are interested in you will buy you a drink twice as often as people who want to be your friend.

The moral of the story is to consider all possible scenarios (models/hypotheses) which can lead to the event you have observed. It's possible you are missing some scenarios which, when considered, will significantly alter your probability estimates.

Do you know any other ways to make the use of Bayes' theorem more accurate? (Please post in comments, links to previous posts of this sort are welcome.)

Lightswitches

6 Alicorn 25 May 2011 04:43AM

There is probably some obvious solution to this puzzle, but it eludes me.  I'm not sure how to plug it into the equation for Bayes' Theorem.  And the situation described happened last August, so I'm probably not going to figure it out on my own.

There are two lightswitches next to each other, and they control two lights (which have no other switches connected to them).  I have used the switches a few times before, but don't occurrently recall which switch goes to which light, or whether the up or down position is the one that signifies off-ness.  One light is on, one light is off, and the switches are in different positions.  I want both lights off.  So I guess a switch, and I'm right.  What should be my credence be that my previous experience with this set of lightswitches helped me guess correctly, given that I felt like I was guessing at random (and would have had a 50% shot at being right were that the case)?  How much would this be different if I'd guessed wrong the first time?

Free Stats Textbook: Principles of Uncertainty

20 badger 24 May 2011 07:45PM

Joseph Kadane, emeritus at Carnegie Mellon, released his new statistics textbook Principles of Uncertainty as a free pdf. The book is written from a Bayesian perspective, covering basic probability, decision theory, conjugate distribution analysis, hierarchical modeling, MCMC simulation, and game theory. The focus is mathematical, but computation with R is touched on. A solid understanding of calculus seems sufficient to use the book. Curiously, the author devotes a fair number of pages to developing the McShane integral, which is equivalent to Lebesgue integration on the real line. There are lots of other unusual topics you don't normally see in an intermediate statistics textbook.

Having come across this today, I can't say whether it is actually very good or not, but the range of topics seems perfectly suited to Less Wrong readers.

The Joys of Conjugate Priors

41 TCB 21 May 2011 02:41AM

(Warning: this post is a bit technical.)

Suppose you are a Bayesian reasoning agent.  While going about your daily activities, you observe an event of type x.  Because you're a good Bayesian, you have some internal parameter θ which represents your belief that x will occur.

Now, you're familiar with the Ways of Bayes, and therefore you know that your beliefs must be updated with every new datapoint you perceive.  Your observation of x is a datapoint, and thus you'll want to modify θ.  But how much should this datapoint influence θ?  Well, that will depend on how sure you are of θ in the first place.  If you calculated θ based on a careful experiment involving hundreds of thousands of observations, then you're probably pretty confident in its value, and this single observation of x shouldn't have much impact.  But if your estimate of θ is just a wild guess based on something your unreliable friend told you, then this datapoint is important and should be weighted much more heavily in your reestimation of θ.

Of course, when you reestimate θ, you'll also have to reestimate how confident you are in its value.  Or, to put it a different way, you'll want to compute a new probability distribution over possible values of θ.  This new distribution will be P(θ|x), and it can be computed using Bayes' rule:

P(θ|x) = P(x|θ) P(θ) / ∫ P(x|θ′) P(θ′) dθ′

Here, since θ is a parameter used to specify the distribution from which x is drawn, it can be assumed that computing P(x|θ) is straightforward.  P(θ) is your old distribution over θ, which you already have; it says how accurate you think different settings of the parameter are, and allows you to compute your confidence in any given value of θ.  So the numerator should be straightforward to compute; it's the denominator which might give you trouble, since for an arbitrary distribution, computing the integral is likely to be intractable.

But you're probably not really looking for a distribution over different parameter settings; you're looking for a single best setting of the parameters that you can use for making predictions.  If this is your goal, then once you've computed the distribution P(θ|x), you can pick the value of θ that maximizes it.  This will be your new parameter, and because you have the formula P(θ|x), you'll know exactly how confident you are in this parameter.

In practice, picking the value of θ which maximizes P(θ|x) is usually pretty difficult, thanks to the presence of local optima, as well as the general difficulty of optimization problems.  For simple enough distributions, you can use the EM algorithm, which is guaranteed to converge to a local optimum.  But for more complicated distributions, even this method is intractable, and approximate algorithms must be used.  Because of this concern, it's important to keep the distributions P(x|θ) and P(θ) simple.  Choosing the distribution P(x|θ) is a matter of model selection; more complicated models can capture deeper patterns in data, but will take more time and space to compute with.

It is assumed that the type of model is chosen before deciding on the form of the distribution P(θ).  So how do you choose a good distribution for P(θ)?  Notice that every time you see a new datapoint, you'll have to do the computation in the equation above.  Thus, in the course of observing data, you'll be multiplying lots of different probability distributions together.  If these distributions are chosen poorly, P(θ|x) could get quite messy very quickly.

If you're a smart Bayesian agent, then, you'll pick P(θ) to be a conjugate prior to the distribution P(x|θ).  The distribution P(θ) is conjugate to P(x|θ) if multiplying these two distributions together and normalizing results in another distribution of the same form as P(θ).

Let's consider a concrete example: flipping a biased coin.  Suppose you use the bernoulli distribution to model your coin.  Then it has a parameter θ which represents the probability of getting heads.  Assume that the value 1 corresponds to heads, and the value 0 corresponds to tails.  Then the distribution of the outcome x of the coin flip looks like this:

P(x|θ) = θ^x (1−θ)^(1−x)

It turns out that the conjugate prior for the bernoulli distribution is something called the beta distribution.  It has two parameters, α and β, which we call hyperparameters because they are parameters for a distribution over our parameters.  (Eek!)

The beta distribution looks like this:

P(θ; α, β) = θ^(α−1) (1−θ)^(β−1) / ∫₀¹ θ′^(α−1) (1−θ′)^(β−1) dθ′

Since θ represents the probability of getting heads, it can take on any value between 0 and 1, and thus this function is normalized properly.

Suppose you observe a single coin flip x and want to update your beliefs regarding θ.  Since the denominator of the beta distribution in the equation above is just a normalizing constant, you can ignore it for the moment while computing P(θ|x), as long as you promise to normalize after completing the computation:

P(θ|x) ∝ θ^x (1−θ)^(1−x) · θ^(α−1) (1−θ)^(β−1) = θ^(x+α−1) (1−θ)^((1−x)+β−1)

Normalizing this equation will, of course, give another beta distribution, confirming that this is indeed a conjugate prior for the bernoulli distribution.  Super cool, right?
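As a sanity check, the conjugacy claim can be verified numerically; the sketch below (with an arbitrary Beta(2, 3) prior and an observed heads) compares a brute-force grid normalization against the closed-form Beta(3, 3) density:

```python
# Minimal numeric check of the conjugacy claim: updating a Beta(alpha, beta)
# prior on theta with one bernoulli observation x gives a
# Beta(x + alpha, (1 - x) + beta) posterior.

from math import gamma

def beta_pdf(theta, a, b):
    """Beta density, using the gamma-function normalizer."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * theta**(a - 1) * (1 - theta)**(b - 1)

def posterior_pdf(theta, a, b, x):
    """Posterior after one flip x, normalized by a crude midpoint-rule grid."""
    n = 100000
    grid = [(i + 0.5) / n for i in range(n)]
    unnorm = lambda t: t**x * (1 - t)**(1 - x) * t**(a - 1) * (1 - t)**(b - 1)
    z = sum(unnorm(t) for t in grid) / n  # numerical integral over [0, 1]
    return unnorm(theta) / z

# Observing heads (x = 1) with a Beta(2, 3) prior should match Beta(3, 3):
print(posterior_pdf(0.4, 2, 3, 1))  # grid-normalized value
print(beta_pdf(0.4, 3, 3))          # closed form; the two should agree ≈ 1.728
```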

If you are familiar with the binomial distribution, you should see that the numerator of the beta distribution in the equation for P(θ; α, β) looks remarkably similar to the non-factorial part of the binomial distribution.  This suggests a form for the normalization constant:

P(θ; α, β) = [Γ(α+β) / (Γ(α) Γ(β))] θ^(α−1) (1−θ)^(β−1)

The beta and binomial distributions are almost identical.  The biggest difference between them is that the beta distribution is a function of θ, with α and β as prespecified parameters, while the binomial distribution is a function of the number of heads k, with n and θ as prespecified parameters.  It should be clear that the beta distribution is also conjugate to the binomial distribution, making it just that much awesomer.

Another difference between the two distributions is that the beta distribution uses gammas where the binomial distribution uses factorials.  Recall that the gamma function is just a generalization of the factorial to the reals (with Γ(n+1) = n! for integer n); thus, the beta distribution allows α and β to be any positive real number, while the binomial distribution is only defined for integers.  As a final note on the beta distribution, the −1 in the exponents is not philosophically significant; I think it is mostly there so that the gamma functions will not contain +1s.  For more information about the mathematics behind the gamma function and the beta distribution, I recommend checking out this pdf: http://www.mhtl.uwaterloo.ca/courses/me755/web_chap1.pdf.  It gives an actual derivation which shows that the first equation for P(θ; α, β) is equivalent to the second equation for P(θ; α, β), which is nice if you don't find the argument by analogy to the binomial distribution convincing.

So, what is the philosophical significance of the conjugate prior?  Is it just a pretty piece of mathematics that makes the computation work out the way we'd like it to?  No; there is deep philosophical significance to the form of the beta distribution.

Recall the intuition from above: if you've seen a lot of data already, then one more datapoint shouldn't change your understanding of the world too drastically.  If, on the other hand, you've seen relatively little data, then a single datapoint could influence your beliefs significantly.  This intuition is captured by the form of the conjugate prior.  α and β can be viewed as keeping track of how many heads and tails you've seen, respectively.  So if you've already done some experiments with this coin, you can store that data in a beta distribution and use that as your conjugate prior.  The beta distribution captures the difference between claiming that the coin has a 30% chance of coming up heads after seeing 3 heads and 7 tails, and claiming that the coin has a 30% chance of coming up heads after seeing 3000 heads and 7000 tails.

Suppose you haven't observed any coin flips yet, but you have some intuition about what the distribution should be.  Then you can choose values for α and β that represent your prior understanding of the coin.  Higher values of α and β indicate more confidence in your intuition; thus, choosing the appropriate hyperparameters is a method of quantifying your prior understanding so that it can be used in computation.  α and β will act like "imaginary data"; when you update your distribution over θ after observing a coin flip x, it will be like you already saw α heads and β tails before that coin flip.
 
If you want to express that you have no prior knowledge about the system, you can do so by setting α and β to 1.  This will turn the beta distribution into a uniform distribution.  You can also use the beta distribution to do add-N smoothing, by setting α and β to both be N+1.  Setting the hyperparameters to a value lower than 1 causes them to act like "negative data", which helps avoid overfitting θ to noise in the actual data.
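The pseudo-count reading above can be sketched in a few lines; the specific counts are illustrative:

```python
# Sketch of the "imaginary data" reading: with a Beta(alpha, beta) prior,
# the posterior predictive probability of heads after seeing h heads and
# t tails is (h + alpha) / (h + t + alpha + beta), i.e. the hyperparameters
# behave like pre-observed flips.

def predict_heads(h, t, alpha=1.0, beta=1.0):
    return (h + alpha) / (h + t + alpha + beta)

# A uniform prior (alpha = beta = 1) gives Laplace's rule of succession:
print(predict_heads(3, 7))            # 4/12 ≈ 0.333
# Larger hyperparameters mean more confidence, so the same data moves us less:
print(predict_heads(3, 7, 300, 700))  # 303/1010 ≈ 0.300
```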

In conclusion, the beta distribution, which is a conjugate prior to the bernoulli and binomial distributions, is super awesome.  It makes it possible to do Bayesian reasoning in a computationally efficient manner, as well as having the philosophically satisfying interpretation of representing real or imaginary prior data.  Other conjugate priors, such as the dirichlet prior for the multinomial distribution, are similarly cool.

Probability updating question - 99.9999% chance of tails, heads on first flip

2 nuckingfutz 16 May 2011 12:58AM

This isn't intended as a full discussion, I'm just a little fuzzy on how a Bayesian update or any other kind of probability update would work in this situation.

You have a coin with a 99.9999% chance of coming up tails, and a 100% chance of coming up either tails or heads.

You've deduced these odds by studying the weight of the coin. You are 99% confident of your results. You have not yet flipped it.

You have no other information before flipping the coin.

You flip the coin once. It comes up heads.

How would you update your probability estimates?

 

(This isn't a homework assignment; rather, I was discussing with someone how strong the anthropic principle is. Unfortunately my mathematical abilities can't quite comprehend how to assemble this into any form I can work with.)
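One hedged way to formalize this (the post leaves the 1% alternative unspecified; a fair coin is assumed below purely for illustration):

```python
# Hedged sketch: a two-hypothesis Bayes update for the question above.
# The post doesn't say what the coin is like in the 1% case where the
# weight analysis is wrong; a fair coin is assumed here for illustration.

def p_model_given_heads(p_model=0.99, p_heads_if_model=1e-6, p_heads_if_wrong=0.5):
    num = p_model * p_heads_if_model
    den = num + (1 - p_model) * p_heads_if_wrong
    return num / den

print(p_model_given_heads())  # ≈ 0.000198
```

The point the numbers make: a single head is such strong evidence against a 99.9999%-tails model that it overwhelms the 99% prior confidence in that model.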

 

A potential problem with using Solomonoff induction as a prior

13 JoshuaZ 07 April 2011 07:27PM

There's a problem that has occurred to me that I haven't seen discussed anywhere: I don't think people actually want to assign zero probability to all hypotheses which are not Turing computable. Consider the following hypothetical: we come up with a theory of everything that seems to explain all the laws of physics, but there's a single open parameter (say the fine structure constant). We compute a large number of digits of this constant, and someone notices that when expressed in base 2, the nth digit seems to be 1 iff the nth Turing machine halts on the blank tape, for some fairly natural ordering of all Turing machines. If we confirm this for a large number of digits (not necessarily consecutive digits; obviously some of the 0s won't be confirmable), shouldn't we consider the hypothesis that the digits really are given by this simple but non-computable function? But if our priors assign zero probability to all non-computable hypotheses, then this hypothesis must always be stuck with zero probability.

If the universe is finite, we could approximate this function with one which instead asks "halts within K steps", where K is some large number, but intuitively this seems like a more complicated hypothesis than the original one.

I'm not sure what a reasonable prior that handles this sort of thing would look like. We don't want an uncountable set of priors. It might make sense to use something like hypotheses which are describable in Peano arithmetic.

 

Visualizing Bayesian Inference [link]

11 Dreaded_Anomaly 14 March 2011 08:10PM

Galton Visualizing Bayesian Inference (article @ CHANCE)

Excerpt:

What does Bayes Theorem look like? I do not mean what does the formula

P(A|B) = P(B|A) P(A) / P(B)

—look like; these days, every statistician knows that. I mean, how can we visualize the cognitive content of the theorem? What picture can we appeal to with the hope that any person curious about the theorem may look at it, and, after a bit of study say, “Why, that is clear—I can indeed see what is happening!”

Francis Galton could produce just such a picture; in fact, he built and operated a machine in 1877 that performs that calculation. But, despite having published the picture in Nature and the Proceedings of the Royal Institution of Great Britain, he never referred to it again—and no reader seems to have appreciated what it could accomplish until recently.

Schematics for the machine and its algorithm can be found at the link. This is a really cool design, and maybe it can aid Eliezer's and others' efforts to help people understand Bayes' Theorem.

[Draft] Holy Bayesian Multiverse, Batman!

0 b1shop 03 February 2011 01:47AM

I couldn't find the math for the quantum suicide and immortality thought experiment, so I'm placing it here for posterity. If one actually ran the experiment, Bayes' theorem would tell us how to update our belief in the many-worlds interpretation (MWI) of quantum mechanics. I conclude by arguing that we don't need to run the experiment.

Prereqs: Understand the purpose of Bayes Theorem, possess at least rudimentary knowledge of the competing quantum worldviews, and have a nostalgic appreciation for Adam West.

The Fiendish Setup:

Suppose that, after catching Batman snooping in the shadows of his evil lair, Joker ties the caped crusader into a quantum, negative binomial death machine that, every ten seconds, measures the spin value of a fresh proton. Fifty percent of the time, the result will trigger a Bat-killing Rube Goldberg machine. The other 50 percent of the time, the quantum death machine will play a suspenseful stock sound effect and search for a new proton.

continue reading »

Sleeping Beauty

-3 DanielLC 01 February 2011 10:13PM

Someone comes up to you and tells you he flipped ten coins for ten people. They were fair coins, but only three came up heads. What is the probability yours was heads?

There are three people of ten who got heads. There is a 30% chance that you're one of those three, right?

Now take the sleeping beauty paradox. A coin is flipped. If it lands on heads, the subject is woken twice. If it lands on tails, the subject is woken once. For simplicity, assume it happens exactly once, and there are one trillion person-days. You wake up groggy in the morning, and take a second to remember who you are.

If the coin landed on tails, that would mean that there is a one in a trillion chance that you will remember that you're the subject. If it was heads, it would be two in a trillion. As such, if you do remember being the subject, the probability that it's heads is P(H|U) = P(U|H)·P(H) / [P(U|H)·P(H) + P(U|T)·P(T)] = (2/trillion)·(1/2) / [(2/trillion)·(1/2) + (1/trillion)·(1/2)] = 2/3, where H is "the coin lands on heads", T is "the coin lands on tails", and U is "you are the subject".

Technically, it would be slightly less than 2/3, since there will be one more person-day if the coin lands on heads.
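The calculation above, restated numerically (the trillion person-days cancel out of the ratio, as they must):

```python
# Numeric restatement of the Sleeping Beauty calculation above, using the
# post's one-trillion-person-days framing: two wakings given heads, one
# given tails, fair coin.

TRILLION = 10**12

def p_heads_given_subject(wakings_heads=2, wakings_tails=1):
    p_u_given_h = wakings_heads / TRILLION
    p_u_given_t = wakings_tails / TRILLION
    num = p_u_given_h * 0.5
    den = num + p_u_given_t * 0.5
    return num / den

print(p_heads_given_subject())  # ≈ 2/3
```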

Non-trivial probability distributions for priors and Occam's razor

2 JoshuaZ 11 January 2011 03:59AM

Assume we have a countable set of hypotheses described in some formal way with some prior distribution such that 1) our prior for each hypothesis is non-zero and 2) our formal description system has only a finite number of hypotheses of any fixed length. Then, I claim that under just this set of weak constraints, our hypotheses are in a situation that informally acts a lot like Occam's razor. In particular, let h(n) be the probability mass assigned to "a hypothesis with description of length at least n is correct." (ETA: fixed from earlier statement) Then, as n goes to infinity, h(n) goes to zero. So, at the large scale, complicated hypotheses must have low probability. This suggests that one doesn't need any appeal to computability or anything similar to accept some form of Occam's razor. One only needs a countable hypothesis space, no hypothesis with probability zero or one, and a non-stupid way of writing down hypotheses.
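A toy numerical illustration of the claim (the particular prior below is made up for demonstration; only its properties matter: nonzero everywhere, finitely many hypotheses per length, mass summing to 1):

```python
# Illustrative sketch: for any prior summing to 1 over a countable hypothesis
# space with finitely many hypotheses per description length, the total mass
# on hypotheses of length >= n must shrink to zero as n grows. The prior here
# is a hypothetical toy choice over binary strings.

def tail_mass(n, max_len=60):
    """Mass on hypotheses of length >= n under a toy prior giving each of
    the 2**L strings of length L a share of 2**-(2*L + 1)."""
    # Per-length mass: 2**L strings * 2**-(2L+1) each = 2**-(L+1),
    # so summing over all lengths L >= 0 gives total mass 1.
    return sum(2**-(L + 1) for L in range(n, max_len))

for n in (1, 5, 10):
    print(n, tail_mass(n))  # masses shrink monotonically toward zero
```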

A few questions: 1) Am I correct in seeing this as Occam-like or is this just an indication that I'm using too weak a notion of Occam's razor?   

2) Is this point novel? I'm not as familiar with the Bayesian literature as other people here so I'm hoping that someone can point out if this point has been made before.

ETA: This was apparently a point made by Unknowns in an earlier thread which I totally forgot but probably read at the time. Thanks also for the other pointers.

 

Copying and Subjective Experience

5 lucidfox 20 December 2010 12:14PM

The subject of copying people and its effect on personal identity and probability anticipation has been raised and, I think, addressed adequately on Less Wrong.

Still, I'd like to bring up some more thought experiments.

Recently I had a dispute on an IRC channel. I argued that if some hypothetical machine made an exact copy of me, then I would anticipate a 50% probability of jumping into the new body. (I admit that it still feels a little counterintuitive to me, even though this is what I would rationally expect.) The others disagreed: after all, they said, the mere fact that the copy was created doesn't affect the original.

However, from an outside perspective, Maia1 would see Maia2 being created in front of her eyes, and Maia2 would see the same scene up to the moment of forking, at which point the field of view in front of her eyes would abruptly change to reflect the new location.

Here, it is obvious from both an inside and outside perspective which version has continuity of experience, and thus from a legal standpoint, I think, it would make sense to regard Maia1 as having the same legal identity as the original, and recognize the need to create new documents and records for Maia2 -- even if there is no physical difference.

Suppose, however, that the information was erased. For example, suppose a robot sedated and copied the original me, then dragged Maia1 and Maia2 to randomly chosen rooms, and erased its own memory. At this point, neither version of me, nor anyone else, would be able to distinguish between the two. What would you do here from a legal standpoint? (I suppose if it actually came to this, the two of me would agree to arbitrarily designate one as the original by tossing an ordinary coin...)

And one more point. What is this probability of a subjective body-jump actually a probability of? We could set up various Sleeping Beauty-like thought experiments here. Supposing for the sake of argument that I'll live at most a natural human lifespan no matter which year I find myself in, imagine that I make a backup of my current state and ask a machine to restore a copy of me every 200 years. Does this imply that the moment the backup is made -- before I even issue the order, and from an outside perspective, way before any of this copying happens -- I should anticipate subjectively jumping to any given time in the future, and that the probability of finding myself as any given copy, including the original, tends towards zero the longer the copying machine survives?

 
