
## Probabilistic Löb theorem

26 April 2013 06:45PM

In this post (based on results from MIRI's recent workshop), I'll be looking at whether reflective theories of logical uncertainty (such as Paul's design) still suffer from Löb's theorem.

Theories of logical uncertainty are theories which can assign probability to logical statements. Reflective theories are theories which know something about themselves within themselves. In Paul's theory, there is an external P, in the meta language, which assigns probabilities to statements, an internal P, inside the theory, that computes probabilities of coded versions of the statements inside the language, and a reflection principle that relates these two P's to each other.

And Löb's theorem is the result that if a (sufficiently complex, classical) system can prove that "a proof of Q implies Q" (often abbreviated as □Q → Q), then it can prove Q. What would be the probabilistic analogue? Let's use □aQ to mean P('Q')≥1-a (so that □0Q is the same as the old □Q; see this post on why we can interchange probabilistic and provability notions). Then Löb's theorem in a probabilistic setting could be:

Probabilistic Löb's theorem: for all a<1, if the system can prove □aQ → Q, then the system can prove Q.

To understand this condition, we'll go through the proof of Löb's theorem in a probabilistic setting, and see if and when it breaks down. We'll conclude with an example to show that any decent reflective probability theory has to violate this theorem.


## Logic in the language of probability

26 April 2013 06:45PM

This post is a minor note, to go along with the post on the probabilistic Löb theorem. It simply seeks to justify why terms like "having probability 1" are used interchangeably with "provable" and why implication symbols "→" can be used in a probabilistic setting.

Take a system of classical logic with a single rule of inference, modus ponens:

From A and A→B, deduce B.

Having a single rule of inference isn't much of a restriction, because you can replace other rules of inference ("from A1,A2,... and An, deduce B") with an axiom or axiom schema ("A1∧A2∧...∧An → B") and then use modus ponens on that axiom to get the other rule of inference.

In this logical system, I'm now going to make some purely syntactical changes - not changing the meaning of anything, just the way we write things. For any sentence A that doesn't contain an implication arrow →, replace

A with P(A)=1.

Similarly, replace any sentence of the type

A → B with P(B|A)=1.

This is recursive, so we replace

(A → B) → C with P(C | P(B|A)=1 )=1.

And instead of using modus ponens, we'll use a combined Bayesian inference and law of total probability:

From P(A)=1 and P(B|A)=1, deduce P(B)=1.
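Under this rewriting the translation is purely mechanical. Here is a minimal sketch (my own illustration, not from the post) that applies the replacement rules recursively to formulas represented as nested tuples; the representation and function names are invented for the example:

```python
# Sketch (illustrative only): the post's syntactic translation, applied
# recursively. A formula is an atom string like "A" or an implication
# tuple ("->", antecedent, consequent).

def translate(formula):
    """Top-level rewrite: a bare sentence A becomes P(A)=1, and an
    implication A -> B becomes P(B|A)=1, recursively."""
    if isinstance(formula, str):       # sentence with no implication arrow
        return f"P({formula})=1"
    return _inner(formula)

def _inner(formula):
    """Render a sub-formula as the event it denotes: atoms appear bare
    inside P(.|.), implications appear as their full translation."""
    if isinstance(formula, str):
        return formula
    _, antecedent, consequent = formula    # ("->", A, B)
    return f"P({_inner(consequent)}|{_inner(antecedent)})=1"

print(translate("A"))                            # P(A)=1
print(translate(("->", "A", "B")))               # P(B|A)=1
print(translate(("->", ("->", "A", "B"), "C")))  # P(C|P(B|A)=1)=1
```

The last line reproduces the (A → B) → C example above.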


## Estimate Stability

13 April 2013 06:33PM

I've been trying to get clear on something you might call "estimate stability." Steven Kaas recently posted my question to StackExchange, but we might as well post it here as well:

I'm trying to reason about something I call "estimate stability," and I'm hoping you can tell me whether there’s some relevant technical language...
What do I mean by "estimate stability?" Consider these three different propositions:
1. We’re 50% sure that a coin (known to be fair) will land on heads.
2. We’re 50% sure that Matt will show up at the party.
3. We’re 50% sure that Strong AI will be invented by 2080.
These estimates feel different. One reason they feel different is that the estimates have different degrees of "stability." In case (1) we don't expect to gain information that will change our probability estimate. But for cases (2) and (3), we may well come upon some information that causes us to adjust the estimate either up or down.
So estimate (1) is more "stable," but I'm not sure how this should be quantified. Should I think of it in terms of running a Monte Carlo simulation of what future evidence might be, and looking at something like the variance of the distribution of the resulting estimates? What happens when it’s a whole probability distribution, e.g. for the time Strong AI is invented? (Do you calculate the stability of the probability density for every year, then average the result?)
Here are some other considerations that would be useful to relate more formally to considerations of estimate stability:
• If we’re estimating some variable, having a narrow probability distribution (prior to future evidence with respect to which we’re trying to assess the stability) corresponds to having a lot of data. New data, in that case, would make less of a contribution in terms of changing the mean and reducing the variance.
• There are differences in model uncertainty between the three cases. I know what model to use when predicting a coin flip. My method of predicting whether Matt will show up at a party is shakier, but I have some idea of what I’m doing. With the Strong AI case, I don’t really have any good idea of what I’m doing. Presumably model uncertainty is related to estimate stability, because the more model uncertainty we have, the more we can change our estimate by reducing our model uncertainty.
• Another difference between the three cases is the degree to which our actions allow us to improve our estimates, increasing their stability. For example, we can reduce the uncertainty and increase the stability of our estimate about Matt by calling him, but we don’t really have any good ways to get better estimates of Strong AI timelines (other than by waiting).
• Value-of-information affects how we should deal with delay. Estimates that are unstable in the face of evidence we expect to get in the future seem to imply higher VoI. This creates a reason to accept delays in our actions. Or if we can easily gather information that will make our estimates more accurate and stable, that means we have more reason to pay the cost of gathering that information. If we expect to forget information, or expect our future selves not to take information into account, dynamic inconsistency becomes important. This is another reason why estimates might be unstable. One possible strategy here is to precommit to have our estimates regress to the mean.
Thanks for any thoughts!
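One concrete way to run the Monte Carlo suggested in the question is to model the current belief as a Beta distribution and measure how far the estimate moves under simulated future evidence. This is my own sketch (the Beta-Bernoulli model, the parameter choices, and the function name are all invented for illustration, not from the question):

```python
# Sketch: "estimate stability" as the spread of the updated estimate
# after simulated future evidence, under an assumed Beta-Bernoulli model.
import random

def estimate_spread(alpha, beta, n_future=20, n_sims=2000, seed=0):
    """Start from a Beta(alpha, beta) belief about a binary event.
    Repeatedly: sample a possible true rate from that belief, simulate
    n_future observations, update, and record the new posterior mean.
    Return the standard deviation of those updated means."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_sims):
        p_true = rng.betavariate(alpha, beta)
        successes = sum(rng.random() < p_true for _ in range(n_future))
        means.append((alpha + successes) / (alpha + beta + n_future))
    mu = sum(means) / n_sims
    return (sum((m - mu) ** 2 for m in means) / n_sims) ** 0.5

# Both beliefs put the current estimate at 50%, but they differ in stability:
coin_like = estimate_spread(1000, 1000)  # case (1): backed by lots of data
ai_like = estimate_spread(1, 1)          # case (3): a near-uniform prior
print(coin_like < ai_like)   # True
```

On this measure the coin-like estimate barely moves under new evidence, while the near-uniform estimate can swing by tens of percentage points, matching the intuitive ordering of the three cases.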

## Seize the Maximal Probability Moment

28 February 2013 11:22AM

Try to remember three or four things that you think would be effective hacks for your life, but that you have not so far implemented. Really, find three.

Probably that was not so hard.

Now think about the moment in time at which you had the maximal probability of implementing each of those hacks. Sometimes you had no idea that was the moment. But sometimes you did, like when a friend tells you "I just read this great paper on how people report cartoons being funnier when their face is shaped in a more smiling fashion." and you thought "Great! I may one day implement the algorithm: if studying, force a smile".

You knew you didn't plan to read the article, you knew you trusted that friend, and you knew you'd either forget it later or, in any case, that from that moment on the likelihood of you implementing the algorithm would only fall.

So my hack of the day is: If you feel you are likely at the maximal probability moment to start a new policy, start immediately.

My friend was telling me about how he went abroad to research: "...so at this place and people there used very strong lights as cognitive enhancement and yadda yadda yadda... (stopped listening for 40s) yadda yadda yadda.... and I wrote a paper on ..."  By that time my room had an extra 110W light working.

Just now I thought: It was good I installed that light. Why didn't I do the same when I felt like finding a personalized shirt website where the front would be "I Don't want to talk about: [list]" and the back "Pick your topic: [list]" to once and for all stop the gossip and sports ice-breakers?

I didn't seize the maximal probability moment. That's what happened.

Then I noticed that that was the maximal probability moment to install in my mind the maximal probability moment algorithm, I did, and that was the maximal probability moment of writing this post.

Now if you'll excuse me, I have a shirt to buy.

## [Link] On the Height of a Field

02 January 2013 11:20AM

Mark Eichenlaub posted a great little case-study about the difficulty of updating beliefs, even over trivial matters like the slope of a baseball field. The basic story of Bayes-updating assumes the likelihood of evidence in different states is obvious, but feedback between observations and judgments about likelihood quickly complicate the situation:

The story of how belief is supposed to work is that for each bit of evidence, you consider its likelihood under all the various hypotheses, then multiplying these likelihoods, you find your final result, and it tells you exactly how confident you should be. If I can estimate how likely it is for Google Maps and my GPS to corroborate each other given that they are wrong, and how likely it is given that they are right, and then answer the same question for every other bit of evidence available to me, I don’t need to estimate my final beliefs – I calculate them. But even in this simple testbed of the matter of a sloped baseball field, I could feel my biases coming to bear on what evidence I considered, and how strong and relevant that evidence seemed to me.  The more I believed the baseball field was sloped, the more relevant (higher likelihood ratio) it seemed that there was that short steep hill on the side, and the less relevant that my intuition claimed the field was flat. The field even began looking more sloped to me as time went on, and I sometimes thought I could feel the slope as I ran, even though I never had before.

That’s what I was interested in here. I wanted to know more about the way my feelings and beliefs interacted with the evidence and with my methods of collecting it. It is common knowledge that people are likely to find what they’re looking for whatever the facts, but what does it feel like when you’re in the middle of doing this, and can recognizing that feeling lead you to stop?

Edit: Title changed from "An Empirical Evaluation into Runner's High," the original title of the article, to match the author's new title.

## Some scary life extension dilemmas

01 January 2013 06:41PM

Let's imagine a life extension drug has been discovered.  One dose of this drug extends one's life by 49.99 years.  The drug also has a mild cumulative effect: if it is given to someone who has been dosed with it before, it will extend their life by 50 years.

Under these constraints the most efficient way to maximize the amount of life extension this drug can produce is to give every dose to one individual.  If there were one dose available for each of the seven billion people alive on Earth, then giving every person one dose would result in a total of 349,930,000,000 years of life gained.  If one person were given all the doses, a total of 349,999,999,999.99 years of life would be gained.  Sharing the life extension drug equally would result in a net loss of almost 70 million years of life.  If you're concerned about people's reaction to this policy then we could make it a big lottery, where every person on Earth gets a chance to gamble their dose for a chance at all of them.
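The totals above can be checked directly (a quick arithmetic sketch using the post's figures: seven billion people, 49.99 years for a first dose, 50 years for each repeat dose):

```python
# Quick check of the two totals in the thought experiment.
PEOPLE = 7_000_000_000   # people alive, per the post
FIRST_DOSE = 49.99       # years gained from a first dose
REPEAT_DOSE = 50         # years gained from each subsequent dose

shared = PEOPLE * FIRST_DOSE                       # everyone gets one dose
hoarded = FIRST_DOSE + (PEOPLE - 1) * REPEAT_DOSE  # one person gets them all

print(f"{shared:,.2f}")            # 349,930,000,000.00
print(f"{hoarded:,.2f}")           # 349,999,999,999.99
print(f"{hoarded - shared:,.2f}")  # 69,999,999.99 -- almost 70 million years
```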

Now, one could make certain moral arguments in favor of sharing the drug.  I'll get to those later.  However, it seems to me that gambling your dose for a chance at all of them isn't rational from a purely self-interested point of view either.  You will not win the lottery.  Your chances of winning this particular lottery, at 1 in seven billion, are roughly 40 times worse than your chances of winning the Powerball jackpot.  If someone gave me a dose of the drug, and then offered me a chance to gamble in this lottery, I'd accuse them of Pascal's mugging.

Here's an even scarier thought experiment.  Imagine we invent the technology for whole brain emulation.  Let "x" equal the amount of resources it takes to sustain a WBE through 100 years of life.  Let's imagine that with this particular type of technology, it costs 10x to convert a human into a WBE and it costs 100x to sustain a biological human through the course of their natural life.  Let's have the cost of making multiple copies of a WBE once they have been converted be close to 0.

Again, under these constraints it seems like the most effective way to maximize the amount of life extension done is to convert one person into a WBE, then kill everyone else and use the resources that were sustaining them to make more WBEs, or extend the life of more WBEs.  Again, if we are concerned about people's reaction to this policy we could make it a lottery.  And again, if I was given a chance to play in this lottery I would turn it down and consider it a form of Pascal's mugging.

I'm sure that most readers, like myself, would find these policies very objectionable.  However, I have trouble finding objections to them from the perspective of classical utilitarianism.  Indeed, most people have probably noticed that these scenarios are very similar to Nozick's "utility monster" thought experiment.  I have made a list of possible objections to these scenarios that I have been considering:

1. First, let's deal with the unsatisfying practical objections.  In the case of the drug example, it seems likely that a more efficient form of life extension will be developed in the future.  In that case it would be better to give everyone the drug to sustain them until that time.  However, this objection, like most practical ones, seems unsatisfying.  It seems like there are strong moral objections to not sharing the drug.

Another pragmatic objection is that, in the case of the drug scenario, the lucky winner of the lottery might miss their friends and relatives who have died.  And in the WBE scenario it seems like the lottery winner might get lonely being the only person on Earth.  But again, this is unsatisfying.  If the lottery winner were allowed to share their winnings with their immediate social circle, or if they were a sociopathic loner who cared nothing for others, it still seems bad that they end up killing everyone else on Earth.

2. One could use the classic utilitarian argument in favor of equality: diminishing marginal utility.  However, I don't think this works.  Humans don't seem to experience diminishing returns from lifespan in the same way they do from wealth.  It's absurd to argue that a person who lives to the ripe old age of 60 generates less utility than two people who die at age 30 (all other things being equal).  The reason the DMU argument works when arguing for equality of wealth is that people are limited in their ability to get utility from their wealth, because there is only so much time in the day to spend enjoying it.  Extended lifespan removes this restriction, making a longer-lived person essentially a utility monster.

3. My intuitions about the lottery could be mistaken.  It seems to me that if I was offered the possibility of gambling my dose of life extension drug with just one other person, I still wouldn't do it.  If I understand probabilities correctly, then gambling for a chance at living either 0 or 99.99 additional years is equivalent to having a certainty of an additional 49.995  years of life, which is better than the certainty of 49.99 years of life I'd have if I didn't make the gamble.  But I still wouldn't do it, partly because I'd be afraid I'd lose and partly because I wouldn't want to kill the person I was gambling with.

So maybe my horror at these scenarios is driven by that same hesitancy.  Maybe I just don't understand the probabilities right.  But even if that is the case, even if it is rational for me to gamble my dose with just one other person, it doesn't seem like the gambling would scale.  I will not win the "lifetime lottery."

4. Finally, we have those moral objections I mentioned earlier.  Utilitarianism is a pretty awesome moral theory under most circumstances.  However, when it is applied to scenarios involving population growth and scenarios where one individual is vastly better at converting resources into utility than their fellows, it tends to produce very scary results.  If we accept the complexity of value thesis (and I think we should), this suggests that there are other moral values that are not salient in the "special case" of scenarios with no population growth or utility monsters, but become relevant in scenarios where there are.

For instance, it may be that prioritarianism is better than pure utilitarianism, and in this case sharing the life extension method might be best because of the benefits it accords the least off.  Or it may be (in the case of the WBE example) that having a large number of unique, worthwhile lives in the world is valuable because it produces experiences like love, friendship, and diversity.

My tentative guess at the moment is that there probably are some other moral values that make the scenarios I described morally suboptimal, even though they seem to make sense from a utilitarian perspective.  However, I'm interested in what other people think.  Maybe I'm missing something really obvious.

EDIT:  To make it clear, when I refer to "amount of years added" I am assuming for simplicity's sake that all the years added are years that the person whose life is being extended wants to live and contain a large amount of positive experiences. I'm not saying that lifespan is exactly equivalent to utility. The problem I am trying to resolve is that it seems like the scenarios I've described seem to maximize the number of positive events it is possible for the people in the scenario to experience, even though they involve killing the majority of people involved.  I'm not sure "positive experiences" is exactly equivalent to "utility" either, but it's likely a much closer match than lifespan.

## A solvable Newcomb-like problem - part 3 of 3

06 December 2012 01:06PM

This is the third part of a three post sequence on a problem that is similar to Newcomb's problem but is posed in terms of probabilities and limited knowledge.

Part 1 - stating the problem
Part 2 - some mathematics
Part 3 - towards a solution

In many situations we can say "For practical purposes a probability of 0.9999999999999999999 is close enough to 1 that for the sake of simplicity I shall treat it as being 1, without that simplification altering my choices."

However, there are some situations where the distinction does significantly alter the character of a situation so, when one is studying a new situation and one is not sure yet which of those two categories the situation falls into, the cautious approach is to re-frame the probability as being (1 - δ) where δ is small (e.g. 10^-12), and then examine the characteristics of the behaviour as δ tends towards 0.

The LessWrong wiki describes Omega as a super-powerful AI analogous to Laplace's demon, who knows the precise location and momentum of every atom in the universe, limited only by the laws of physics (so, if time travel isn't possible and some of our current thoughts on Quantum Mechanics are correct, then Omega's knowledge of the future is probabilistic, being limited by uncertainty).

For the purposes of Newcomb's problem, and the rationality of Fred's decisions, it doesn't matter how close to that level of power Omega actually is.   What matters, in terms of rationality, is the evidence available to Fred about how close Omega is to having that level of power; or, more precisely, the evidence available to Fred relevant to Fred making predictions about Omega's performance in this particular game.

Since this is a key factor in Fred's decision, we ought to be cautious.  Rather than specify when setting up the problem that Fred knows with a certainty of 1 that Omega has that power, it is better to specify a concrete level of evidence that would lead Fred to assign a probability of (1 - δ) to Omega having that power, then examine the effect upon which option in the box problem it is rational for Fred to pick, as δ tends towards 0.

The Newcomb-like problem stated in part 1 of this sequence contains an Omega to which it is rational for Fred to assign a less-than-unity probability of being able to predict Fred's choices perfectly.  By using bets as analogies for the sort of evidence Fred might have available to him, we create an explicit variable that we can then manipulate to alter the precise probability Fred assigns to Omega's abilities.

The other nice feature of the Newcomb-like problem given in part 1, is that it is explicitly solvable using the mathematics given in part 2.  By making randomness an external feature (the device Fred brings with him) rather than purely a feature of Fred's internal mind, we can acknowledge the question of Omega being able to predict quantum events, capture it as a variable, and take it into account when setting out the payoff matrix for the problem.

This means that, instead of Fred having to think "When I walked into this room I was determined to pick one-box.  As far as anyone knew or could predict, including myself, I intended to pick one-box.  However nothing I do now can change Omega's decision - the money is already in the box.  So I've nothing to lose by changing my mind."; Fred can now allocate a specific probability to whether Omega could predict Fred's chance of changing his mind in such circumstances, and Fred can take that into account in his strategy by making his chance of changing strategy explicit and external - basing it upon a random number device.

Or, to put it another way, we are modelling a rational human who has a specific finite chance of talking himself into overriding a pre-committed strategy as being made up of two components: a component that will infallibly stick to a pre-committed strategy, plus a component with a known chance of change; we then treat the combined rational human as someone infallibly committed to a meta-strategy that includes a chance of change - a mixed equilibrium, from Omega's point of view.

Ok, time to look at the numbers and draw a pretty diagram...

Fred is Player A, and he has two meta options:
A1 - play it safe.  100% chance of one-box and 0% chance of two-box
A2 - take a risk.   Mentally pre-commit to using the device to produce 99% chance of one-box and 1% chance of two-box.

Omega is Player B, and he has two meta options:
B1 - reward risk.  Not attempt to distinguish between the mental state of Fred taking 1% risk and Fred playing it safe.
B2 - punish risk.  Attempt to distinguish and, if Omega guesses Fred is taking risk rather than playing safe, punish it.

We'll start out by assuming that if Omega does attempt to distinguish, then Omega will have a 1 in 10,000 false positive rate (thinking Fred is going to use the device, when actually Fred intends to play it safe) and a 1 in 10,000 false negative rate (thinking Fred is going to play it safe, when actually Fred intends to use the device).

### A1 vs B1

Fred gains \$1,000,000
Omega loses \$1,000,000 to Fred but gains \$1,000,000,000 from Alpha, for a net gain of \$999,000,000

### A2 vs B1

99% of the time, Fred gains \$1,000,000 and Omega net gains \$999,000,000
1% of the time, Fred gains \$1,001,000 and Omega net loses \$10,001,001,000

Combining those gives an average of:
Fred gains: \$1,000,010
Omega gains: \$888,999,990

### A1 vs B2

99.99% of the time, Omega correctly discerns that Fred is playing safe
Fred gains \$1,000,000
Omega gains \$999,000,000

0.01% of the time, Omega falsely believes that Fred is taking a risk, and punishes that by putting \$0 in Box A
Fred gains \$0
Omega loses \$10,000,000,000

Combining those gives an average of:
Fred gains: \$999,900
Omega gains: \$997,900,100

### A2 vs B2

In 100 trials out of 1,000,000 trials Omega incorrectly thinks Fred will play it safe, when actually Fred takes the risk of using the device.  Of these:

In 1 trial out of 1,000,000 trials: Omega incorrectly thinks Fred will play it safe, when actually Fred takes the risk of using the device and, in this case, the device picks two-box
==> Fred gains \$1,001,000
==> Omega loses \$10,001,001,000

In 99 trials out of 1,000,000 trials: Omega incorrectly thinks Fred will play it safe, when actually Fred takes the risk of using the device and, in this case, the device picks one-box
==> Fred gains \$1,000,000
==> Omega gains \$999,000,000

In 999,900 trials out of 1,000,000 trials Omega correctly thinks that Fred is going to take the risk of using the device.  Of those:

In 9,999 trials out of 1,000,000 trials: Omega correctly thinks that Fred is going to take the risk of using the device and, in this case, the device picks two-box
==> Fred gains \$1,000
==> Omega gains \$999,999,000

In 989,901 trials out of 1,000,000 trials: Omega correctly thinks that Fred is going to take the risk of using the device and, in this case, the device picks one-box
==> Fred gains \$0
==> Omega loses \$10,000,000,000

Combining those gives an average of:
Fred gains \$110
Omega loses \$9,888,922,110
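As a sanity check, the four cases above can be combined mechanically (a short sketch; the trial counts and payoffs are exactly those just listed):

```python
# Recomputing the A2 vs B2 averages from the four cases above.
cases = [
    # (trials per 1,000,000, Fred's gain, Omega's gain)
    (1,       1_001_000, -10_001_001_000),  # missed detection, device two-boxes
    (99,      1_000_000,      999_000_000), # missed detection, device one-boxes
    (9_999,       1_000,      999_999_000), # detected, device two-boxes
    (989_901,         0, -10_000_000_000),  # detected, device one-boxes
]
total_trials = sum(n for n, _, _ in cases)
fred_avg = sum(n * fred for n, fred, _ in cases) / total_trials
omega_avg = sum(n * omega for n, _, omega in cases) / total_trials

print(fred_avg)    # 110.0
print(omega_avg)   # -9888922110.0
```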

If Fred's best estimate, beforehand, of the chances of Omega choosing meta strategy B1 is P, then

Fred's expected gain from choosing meta strategy A1 would be: \$1,000,000 P + \$999,900 (1-P)
Fred's expected gain from choosing meta strategy A2 would be: \$1,000,010 P + \$110 (1-P)

We can ask ourselves at what value of P these two strategies would return the same expected gain:

\$1,000,000 P + \$999,900 (1-P)   =    \$1,000,010 P + \$110 (1-P)
\$999,790 (1-P)   =                \$10 P
\$999,790           =        \$999,800 P
P    ≈ 0.99999
(1-P)  ≈ 0.00001

In other words, Fred thinking that Omega has even a 1 in 100,000 chance of being vindictive should be enough to deter Fred from taking the risky strategy.
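That indifference point can be recovered mechanically from Fred's per-cell averages computed in the sections above (\$1,000,000 and \$999,900 under A1 against B1 and B2; \$1,000,010 and \$110 under A2). A quick sketch:

```python
# Solving for the probability P of Omega playing B1 at which Fred is
# indifferent between his two meta strategies.
a1_b1, a1_b2 = 1_000_000, 999_900   # Fred's average gain under A1 vs B1, B2
a2_b1, a2_b2 = 1_000_010, 110       # Fred's average gain under A2 vs B1, B2

# E[A1] = a1_b1*P + a1_b2*(1-P) equals E[A2] = a2_b1*P + a2_b2*(1-P) when:
P = (a1_b2 - a2_b2) / ((a1_b2 - a2_b2) + (a2_b1 - a1_b1))
print(round(P, 5))   # 0.99999
```

So Fred needs to be all but certain Omega won't punish risk before the risky strategy pays.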

But how does that look from Omega's point of view?   If Omega thinks that Fred's chance of picking meta strategy A1 is Q, then what is the cost to Omega of picking B2 1 in 100,000 times?

Omega's expected gain from choosing meta strategy B1 would be: \$999,000,000 Q + \$888,999,990 (1-Q)
Omega's expected gain from choosing meta strategy B2 would be: \$997,900,100 Q - \$9,888,922,110 (1-Q)

0.99999 { \$999,000,000 Q + \$888,999,990 (1-Q) } + 0.00001 { \$997,900,100 Q - \$9,888,922,110 (1-Q) }
= (1 - 0.00001) { \$888,999,990 + \$110,000,010 Q } + 0.00001 { - \$9,888,922,110 + \$10,886,822,210 Q }
= \$888,999,990 + \$110,000,010 Q + 0.00001 { - \$9,888,922,110 + \$10,886,822,210 Q - \$888,999,990 - \$110,000,010 Q }
= \$888,999,990 + \$110,000,010 Q + 0.00001 { - \$10,777,922,100 + \$10,776,822,200 Q }
= ( \$888,999,990 - \$107,779 ) + ( \$110,000,010 + \$107,768 ) Q
≈ \$888,892,211 + \$110,107,778 Q

Perhaps a meta strategy of 1% chance of two-boxing is not Fred's optimal meta strategy.  Perhaps, at that level compared to Omega's ability to discern, it is still worth Omega investing in being vindictive occasionally, in order to deter Fred from taking risk.   But, given sufficient data about previous games, Fred can make a guess at Omega's ability to discern.  And, likewise Omega, by including in the record of past games occasions when Omega has falsely accused a human player of taking risk, can signal to future players where Omega's boundaries are.   We can plot graphs of these to find the point at which Fred's meta strategy and Omega's meta strategy are in equilibrium - where if Fred took any larger chances, it would start becoming worth Omega's while to punish risk sufficiently often that it would no longer be in Fred's interests to take the risk.   Precisely where that point is will depend on the numbers we picked in Part 1 of this sequence.  By exploring the space created by using each variable number as a dimension, we can divide it into regions characterised by which strategies dominate within that region.

Extrapolating that as δ tends towards 0 should then carry us closer to a convincing solution to Newcomb's Problem.

Back to Part 1 - stating the problem
Back to Part 2 - some mathematics
This is   Part 3 - towards a solution

## A solvable Newcomb-like problem - part 2 of 3

03 December 2012 04:49PM

This is the second part of a three post sequence on a problem that is similar to Newcomb's problem but is posed in terms of probabilities and limited knowledge.

Part 1 - stating the problem
Part 2 - some mathematics
Part 3 - towards a solution

In game theory, a payoff matrix is a way of presenting the results of two players simultaneously picking options.

For example, in the Prisoner's Dilemma, Player A gets to choose between option A1 (Cooperate) and option A2 (Defect) while, at the same time Player B gets to choose between option B1 (Cooperate) and option B2 (Defect).   Since years spent in prison are a negative outcome, we'll write them as negative numbers:

So, if you look at the bottom right hand corner, at the intersection of Player A defecting (A2) and Player B defecting (B2) we see that both players end up spending 4 years in prison.   Whereas, looking at the bottom left we see that if A defects and B cooperates, then Player A ends up spending 0 years in prison and Player B ends up spending 5 years in prison.

Another familiar example we can present in this form is the game Rock-Paper-Scissors.

We could write it as a zero sum game, with a win being worth 1, a tie being worth 0 and a loss being worth -1:

But it doesn't change the mathematics if we give both players 2 points each round just for playing, so that a win becomes worth 3 points, a tie becomes worth 2 points and a loss becomes worth 1 point.  (Think of it as two players in a game show being rewarded by the host, rather than the players making a direct bet with each other.)

If you are Player A, and you are playing against a Player B who always chooses option B1 (Rock), then your strategy is clear.  You choose option A2 (Paper) each time.  Over 10 rounds, you'd expect to end up with \$30 compared to B's \$10.

Let's imagine a slightly more sophisticated Player B, who always picks Rock in the first round, and then for all other rounds picks whatever would beat Player A's choice the previous round.   This strategy would do well against someone who always picked the same option each round, but it is deterministic and, if we guess it correctly in advance, we can design a strategy that beats it every time.  (In this case, picking Paper-Rock-Scissors then repeating back to Paper).   In fact whatever strategy B comes up with, if that strategy is deterministic and we guess it in advance, then we end up with \$30 and B ends up with \$10.
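That exploit can be played out in code. A small sketch (my own illustration, not from the post) of a 10-round duel between the "beat A's previous move" strategy and the fixed Paper-Rock-Scissors cycle, with the host paying 3/2/1 for win/tie/lose:

```python
# Ten rounds: B opens with Rock, then plays whatever beats A's previous
# move; A counters with the fixed cycle Paper, Rock, Scissors.
BEATS = {"Rock": "Paper", "Paper": "Scissors", "Scissors": "Rock"}

def score(a, b):
    """Host's payouts per round: win 3, tie 2, lose 1."""
    if a == b:
        return 2, 2
    return (3, 1) if BEATS[b] == a else (1, 3)

a_cycle = ["Paper", "Rock", "Scissors"]
a_total = b_total = 0
prev_a = None
for rnd in range(10):
    a = a_cycle[rnd % 3]
    b = "Rock" if prev_a is None else BEATS[prev_a]
    pa, pb = score(a, b)
    a_total += pa
    b_total += pb
    prev_a = a

print(a_total, b_total)   # 30 10
```

Player A wins every round, reproducing the \$30 versus \$10 split described above.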

What if B has a deterministic strategy that B picked in advance and doesn't change, but we don't know at the start of the first round what it is?   In theory B might have picked any deterministic strategy (and such strategies can produce any of the 3-to-the-power-of-10 possible move sequences over a 10 round duel) but, in practice, humans tend to favour some strategies over others so, if you know humans and the game of Rock-Paper-Scissors better than Player B does, you have a better than even chance of guessing his pattern and coming out ahead in the later rounds of the duel.

But there's a danger to that.  What if you have overestimated your comparative knowledge level, and Player B uses your overconfidence to lure you into thinking you've cracked B's pattern?  While B is laying that trap, your own moves become more predictable, and Player B can take advantage of that to work out which moves will trump them.  This works better in a game like poker, where the stakes are not the same each round, but it is still possible in Rock-Paper-Scissors, and you can imagine variants of the game where the host varies the payoff matrix by increasing the lose-tie-win rewards from 1,2,3 in the first round, to 2,4,6 in the second round, 3,6,9 in the third round, and so on.

This is why the safest strategy is not to have a deterministic strategy at all but, instead, to use a source of random bits to pick, each round, option 1 with a probability of 33%, option 2 with a probability of 33% or option 3 with a probability of 33% (modulo rounding).  You might not get to take advantage of any predictability that becomes apparent in your opponent's strategy, but neither can you be fooled into becoming predictable yourself.

On a side note, this still applies even when there is only one round, because unaided humans are not as good at coming up with random bits as they think they are.  Someone who has observed many first time players will notice that, more often than not, first time players choose Rock as their 'random' first move, rather than Paper or Scissors.  If such a person were confident that they were playing a first time player, they might therefore pick Paper as their first move more frequently than not.  Things soon get very Sicilian (in the sense of the duel between Westley and Vizzini in the film The Princess Bride) after that, because a yet more sophisticated player, who guessed their opponent would try this, could then pick Scissors.  And so on ad infinitum, with ever more implausible levels of discernment being required to react on the next level up.

We can imagine a tournament set up between 100 players taken randomly from the expertise distribution of game players, each player submitting a Python program that always plays the same first move, and for each of the remaining 9 rounds produces a move determined solely by the moves so far in that duel.  The tournament organiser would then run every player's program once against the programs of each of the other 99 players, so on average each player would collect 99 x 10 x \$2 = \$1,980.

We could make things more complex by allowing the programs to use, as an input, how much money their opponent has won so far during the tournament; or iterate over running the tournament several times, to give each player an 'expertise' rating which the programs in the following tournament could then use.  We could allow the tournament host to subtract from each player a sum of money depending upon the size of the program that player submitted (and how much memory or cpu it used).   We could give each player a limited ration of random bits, so when facing a player with a higher expertise rating they might splurge and make their moves in all 10 rounds completely random, and when facing a player with a lower expertise rating they might conserve their supply by trying to 'out think' them.

There are various directions we could take this, but the one I want to look at here is what happens when you make the payoff matrix asymmetric.  What happens if you make the game unfair, so that not only does one player have more at stake than the other, but the options themselves are uneven?  For example:

You still have the circular Rock-Paper-Scissors dynamic where:
If B chose B3, then A wants most to have chosen A1
If A chose A1, then B wants most to have chosen B2
If B chose B2, then A wants most to have chosen A3
If A chose A3, then B wants most to have chosen B1
If B chose B1, then A wants most to have chosen A2
If A chose A2, then B wants most to have chosen B3

so every option wins against at least one other option, and loses against at least one other option.   However, Player B is clearly now in a better position, because B wins ties, and B's wins (a 9, an 8 and a 7) tend to be larger than A's wins (a 9, a 6 and a 6).

What should Player A do?  Is the optimal safe strategy still to pick each option with an equal weighting?

Well, it turns out the answer is: no, an equal weighting isn't the optimal response.   Neither is just picking the same 'best' option each time.  Instead, what you do is pick your 'best' option a bit more frequently than an equal weighting would suggest, but not so much that the opponent can steal away that gain by reliably choosing the specific option that trumps yours.   Rather than duplicate material already well presented on the web, I will point you at two lecture courses on game theory that explain how to calculate the exact probability to assign to each option:

You do this by using the indifference theorem to arrive at a set of linear equations, which you can then solve to find a mixed equilibrium where neither player can increase their expected utility by altering the probability weightings they assign to their options.
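A minimal sketch of that calculation in Python (the payoff matrix below is an invented asymmetric example, not the one from this post, and the solver assumes a full-support equilibrium in which every option is played with positive probability):

```python
import numpy as np

# Invented asymmetric zero-sum payoff matrix for Player A
# (rows = A's options, columns = B's options) - purely for
# illustration, not the matrix from this post.
M = np.array([[ 0.0, -1.0,  2.0],
              [ 1.0,  0.0, -1.0],
              [-2.0,  1.0,  0.0]])

def indifference_equilibrium(M):
    """Solve for A's mixed strategy p and the game value v, assuming
    full support: by indifference, B's expected loss is the same
    whichever column B picks (M^T p = v), and the probabilities
    sum to 1."""
    n = M.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = M.T
    A[:n, n] = -1.0   # subtract the unknown game value v
    A[n, :n] = 1.0    # probabilities sum to 1
    b = np.zeros(n + 1)
    b[n] = 1.0
    sol = np.linalg.solve(A, b)
    return sol[:n], sol[n]

p, v = indifference_equilibrium(M)
# p is [0.25, 0.5, 0.25]: the asymmetry shifts A away from equal weighting
```

For the symmetric Rock-Paper-Scissors matrix the same solver returns the familiar equal weighting of 1/3 on each option.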

## The TL;DR points to take away

If you are competing in what is effectively a simultaneous option choice game, with a being who you suspect has expertise at the game equal to or higher than yours, you can nullify their advantage by picking a strategy that, each round, chooses randomly (using a weighting) between the available options.

Depending upon the details of the payoff matrix, there may be one option that it makes sense for you to pick most of the time but, unless that option is strictly better than all your other choices no matter what option your opponent picks, there is still utility to gain from occasionally picking the other options in order to keep your opponent on their toes.

Back to Part 1 - stating the problem
This is  Part 2 - some mathematics
Next to Part 3 - towards a solution

## A solvable Newcomb-like problem - part 1 of 3

1 03 December 2012 09:26AM

This is the first part of a three post sequence on a problem that is similar to Newcomb's problem but is posed in terms of probabilities and limited knowledge.

Part 1 - stating the problem
Part 2 - some mathematics
Part 3 - towards a solution

Omega is an AI, living in a society of AIs, who wishes to enhance his reputation in that society for being successfully able to predict human actions.  Given some exchange rate between money and reputation, you could think of that as a bet between him and another AI, let's call it Alpha.  And since there is also a human involved, for the sake of clarity, to avoid using "you" all the time, I'm going to sometimes refer to the human using the name "Fred".

Omega tells Fred:

I'd like you to pick between two options, and I'm going to try to predict which option you're going to pick.
Option "one box" is to open only box A, and take any money inside it
Option "two box" is to open both box A and box B, and take any money inside them

but before you pick your option, declare it, and then open the box or boxes, there are three things you need to know.

Firstly, you need to know the terms of my bet with Alpha.

If Fred picks option "one box" then:
If box A contains \$1,000,000 and box B contains \$1,000 then Alpha pays Omega \$1,000,000,000
If box A contains \$0              and box B contains \$1,000 then Omega pays Alpha \$10,000,000,000
If anything else, then both Alpha and Omega pay Fred \$1,000,000,000,000

If Fred picks option "two box" then:
If box A contains \$1,000,000 and box B contains \$1,000 then Omega pays Alpha \$10,000,000,000
If box A contains \$0              and box B contains \$1,000 then Alpha pays Omega \$1,000,000,000
If anything else, then both Alpha and Omega pay Fred \$1,000,000,000,000

Secondly, you should know that I've already placed all the money in the boxes that I'm going to, and I can't change the contents of the boxes between now and when you do the opening, because Alpha is monitoring everything.  I've already made my prediction, using a model I've constructed of your likely reactions based upon your past actions.

You can use any method you like to choose between the two options, short of contacting another AI, but be warned that if my model predicted that you'll use a method which introduces too large a random element (such as tossing a coin) then, while I may lose my bet with Alpha, I'll certainly have made sure you won't win the \$1,000,000.  Similarly, if my model predicted that you'd make an outside bet with another human (let's call him George) to alter the value of winning \$1,001,000 from me, I'd have also taken that into account.  (I say "human", by the way, because my bet with Alpha is about my ability to predict humans, so if you contact another AI, such as trying to lay a side bet with Alpha to skim some of his winnings, that invalidates not only my game with you, but also my bet with Alpha, and there are no winnings to skim.)

And, third and finally, you need to know my track record in previous similar situations.

I've played this game 3,924 times over the past 100 years (i.e. since the game started), with humans picked at random from the full variety of the population.   The outcomes were:
3000 times players picked option "one box" and walked away with \$1,000,000
900  times players picked option "two box" and walked away with \$1,000
24 times players flipped a coin or were otherwise too random.  Of those players:
12 players picked option "one box" and walked away with \$0
12 players picked option "two box" and walked away with \$1,000

Never has anyone ever ended up walking away with \$1,001,000 by picking option "two box".

Omega stops talking.   You are standing in a room containing two boxes, labelled "A" and "B", which are both currently closed.  Everything Omega said matches what you expected him to say, as the conditions of the game are always the same and are well known - you've talked with other human players (who confirmed it is legit) and listened to their advice.   You've not contacted any AIs, though you have read the published statement from Alpha that also confirms the terms of the bet and details of the monitoring.  You've not made any bets with other humans, even though your dad did offer to bet you a bottle of whiskey that you'd be one of them too smart alecky fools who walked away with only \$1,000.  You responded by pre-committing to keep any winnings you make between you and your banker, and to never let him know.

The only relevant physical object you've brought along is a radioactive decay based random number generator, that Omega would have been unable to predict the result of in advance, just in case you decide to use it as a factor in your choice.  It isn't a coin, giving only a 50% chance of "one box" and a 50% chance of "two box".   You can set arbitrary odds (tell it to generate a random integer between 0 and any positive integer you give it, up to 10 to the power of 100).   Omega said in his spiel the phrase "too large a random element" but didn't specify where that boundary was.

What do you do?   Or, given that such a situation doesn't exist yet, and we're talking about a Fred in a possible future, what advice would you give to Fred on how to choose, were he to ever end up in such a situation?

Pick "one box"?   Pick "two box"?   Or pick randomly between those two choices and, if so, at what odds?

And why?
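For reference, the naive expected-value comparison is easy to write down if you model Omega as predicting a given deterministic chooser correctly with some probability p (a modelling assumption, not part of the problem statement; the track record above suggests p is very close to 1 for non-random players):

```python
def expected_winnings(p_correct):
    # If Omega predicted "one box", box A holds $1,000,000;
    # if it predicted "two box", box A is empty.
    # Box B always holds $1,000.
    ev_one_box = p_correct * 1_000_000 + (1 - p_correct) * 0
    ev_two_box = p_correct * 1_000 + (1 - p_correct) * 1_001_000
    return ev_one_box, ev_two_box

expected_winnings(0.999)  # one-boxing dominates at any accuracy near 1
```

This is only the straightforward evidential calculation; whether it settles the decision-theoretic question is, of course, exactly what's at issue.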

Part 1 - stating the problem
next   Part 2 - some mathematics
Part 3 - towards a solution

## SIA, conditional probability and Jaan Tallinn's simulation tree

10 12 November 2012 05:24PM

If you're going to use anthropic probability, use the self indication assumption (SIA) - it's by far the most sensible way of doing things.

Now, I am of the strong belief that probabilities in anthropic problems (such as the Sleeping Beauty problem) are not meaningful - only your decisions matter. And you can have different probability theories but still always reach the same decisions, if you pair them with different theories as to who bears the responsibility for the actions of your copies, or how much you value them - see anthropic decision theory (ADT).

But that's a minority position - most people still use anthropic probabilities, so it's worth taking a more thorough look at what SIA does and doesn't tell you about population sizes and conditional probability.

This post will aim to clarify some issues with SIA, especially concerning Jaan Tallinn's simulation-tree model which he presented in exquisite story format at the recent singularity summit. I'll be assuming basic familiarity with SIA, and will run away screaming from any questions concerning infinity. SIA fears infinity (in a shameless self plug, I'll mention that anthropic decision theory runs into far fewer problems with infinities; for instance a bounded utility function is a sufficient - but not necessary - condition to ensure that ADT gives you sensible answers even with infinitely many copies).

But onwards and upwards with SIA! To not-quite-infinity and below!

## SIA does not (directly) predict large populations

One error people often make with SIA is to assume that it predicts a large population. It doesn't - at least not directly. What SIA predicts is that there will be a large number of agents that are subjectively indistinguishable from you. You can call these subjectively indistinguishable agents the "minimal reference class" - it is a great advantage of SIA that it will continue to make sense for any reference class you choose (as long as it contains the minimal reference class).

The SIA's impact on the total population is indirect: if the size of the total population is correlated with that of the minimal reference class, SIA will predict a large population. A correlation is not implausible: for instance, if there are a lot of humans around, then the probability that one of them is you is much larger. If there are a lot of intelligent life forms around, then the chance that humans exist is higher, and so on.

In most cases, we don't run into problems with assuming that SIA predicts large populations. But we have to bear in mind that the effect is indirect, and the effect can and does break down in many cases. For instance imagine that you knew you had evolved on some planet, but for some odd reason, didn't know whether your planet had a ring system or not. You have managed to figure out that the evolution of life on planets with ring systems is independent of the evolution of life on planets without. Since you don't know which situation you're in, SIA instructs you to increase the probability of life on ringed and on non-ringed planets (so far, so good - SIA is predicting generally larger populations).

And then one day you look up at the sky and see:

continue reading »

## SIA fears (expected) infinity

6 12 November 2012 05:23PM

It's well known that the Self-Indication Assumption (SIA) has problems with infinite populations (one of the reasons I strongly recommend not using the probability as the fundamental object of interest, but instead the decision, as in anthropic decision theory).

SIA also has problems with arbitrarily large finite populations, at least in some cases. What cases are these? Imagine that we had these (non-anthropic) probabilities for various populations:

p_0, p_1, p_2, p_3, p_4...

Now let us apply the anthropic correction from SIA; before renormalising, we have these weights for different population levels:

0, p_1, 2p_2, 3p_3, 4p_4...

To renormalise, we need to divide by the sum 0 + p_1 + 2p_2 + 3p_3 + 4p_4... This is actually the expected population! (note: we are using the population as a proxy for the size of the reference class of agents who are subjectively indistinguishable from us; see this post for more details)

So using SIA is possible if and only if the (non-anthropic) expected population is finite (and non-zero).

Note that it is possible for the anthropic expected population to be infinite! For instance if p_j is C/j^3, for some constant C, then the non-anthropic expected population is finite (being the infinite sum of C/j^2). However once we have done the SIA correction, we can see that the SIA-corrected expected population is infinite (being the infinite sum of some constant times 1/j).

## Question about application of Bayes

0 31 October 2012 02:35AM

I have successfully confused myself about probability again.

I am debugging an intermittent crash; it doesn't happen every time I run the program. After much confusion I believe I have traced the problem to a specific line (activating my debug logger, as it happens; irony...) I have tested my program with and without this line commented out. I find that, when the line is active, I get two crashes on seven runs. Without the line, I get no crashes on ten runs. Intuitively this seems like evidence in favour of the hypothesis that the line is causing the crash. But I'm confused on how to set up the equations. Do I need a probability distribution over crash frequencies? That was the solution the last time I was confused over Bayes, but I don't understand what it means to say "The probability of having the line, given crash frequency f", which it seems I need to know to calculate a new probability distribution.

I'm going to go with my intuition and code on the assumption that the debug logger should be activated much later in the program to avoid a race condition, but I'd like to understand this math.
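One way to set this up (a sketch, and certainly not the only formalisation): give the crash frequency in each condition its own Beta(1,1) prior, update each on its own runs, and ask how probable it is that the with-line frequency is the higher one:

```python
import random

def p_line_raises_crash_rate(crashes_with=2, runs_with=7,
                             crashes_without=0, runs_without=10,
                             samples=100_000, seed=1):
    # Posterior on each crash frequency is Beta(1 + crashes, 1 + clean runs);
    # estimate P(freq_with > freq_without) by sampling both posteriors.
    rng = random.Random(seed)
    wins = 0
    for _ in range(samples):
        f_with = rng.betavariate(1 + crashes_with, 1 + runs_with - crashes_with)
        f_without = rng.betavariate(1 + crashes_without, 1 + runs_without - crashes_without)
        wins += f_with > f_without
    return wins / samples

p_line_raises_crash_rate()  # roughly 0.94
```

On these numbers the intuition looks right: the evidence favours the line being implicated, though ten clean runs without it is not yet conclusive.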

## A follow-up probability question: Data samples with different priors

3 25 October 2012 08:07PM

(Rewritten entirely after seeing pragmatist's answer.)

In this post, helpful people including DanielLC gave me the multiply-odds-ratios method for combining probability estimates given by independent experts with a constant prior, with many comments about what to do when they aren't independent.  (DanielLC's method turns out to be identical to summing up the bits of information for and against the hypothesis, which is what I'd expected to be correct.)

I ran into problems applying this, because sometimes the prior isn't constant across samples.  Right now I'm combining different sources of information to choose the correct transcription start site for a gene.  These bacterial genes typically have from 1 to 20 possible start sites.  The prior is 1 / (number of possible sites).

Suppose I want to figure out the correct likelihood multiplier for the information that a start site overlaps the stop of the previous gene, which I will call property O.  Assume this multiplier, lm, is constant, regardless of the prior.  This is reasonable, since we always factor out the prior.  Some function of the prior gives me the posterior probability that a site s is the correct start (Q(s) is true), given that O(s) holds.  That's P(Q(s) | prior=1/numStarts, O(s)).

Suppose I look just at those cases where numStarts = 4, and find that P(Q(s) | numStarts=4, O(s)) = .9:

9:1 / 1:3 = 27:1

Or I can look at the cases where numStarts=2, and find that in these cases, P(Q(s) | numStarts=2, O(s)) = .95:

19:1 / 1:1 = 19:1
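Those two odds calculations can be checked mechanically - the multiplier is just posterior odds divided by prior odds:

```python
def likelihood_multiplier(posterior, num_starts):
    # prior odds of a given site being correct are 1 : (num_starts - 1)
    prior_odds = 1.0 / (num_starts - 1)
    posterior_odds = posterior / (1.0 - posterior)
    return posterior_odds / prior_odds

round(likelihood_multiplier(0.90, 4), 6)  # 27.0
round(likelihood_multiplier(0.95, 2), 6)  # 19.0
```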

I want to take one pass through the data and come up with a single likelihood multiplier, rather than binning all the data into different groups by numStarts.  I think I can just compute it as

(sum of numerator : sum of denominator) over all cases s_i where O(s_i) is true, where

numerator = (numStarts_i-1) * Q(s_i)

denominator = (1-Q(s_i))

Is this correct?

## A probability question

6 19 October 2012 10:34PM

Suppose you have a property Q which certain objects may or may not have.  You've seen many of these objects; you know the prior probability P(Q) that an object has this property.

You have 2 independent measurements of object O, which each assign a probability that Q(O) (O has property Q).  Call these two independent probabilities A and B.

What is P(Q(O) | A, B, P(Q))?

To put it another way, expert A has opinion O(A) = A, which asserts P(Q(O)) = A = .7, and expert B says P(Q(O)) = B = .8, and the prior P(Q) = .4, so what is P(Q(O))?  The correlation between the opinions of the experts is unknown, but probably small.  (They aren't human experts.)  I face this problem all the time at work.

You can see that the problem isn't solvable without the prior P(Q), because if the prior P(Q) = .9, then two experts assigning P(Q(O)) < .9 should result in a probability lower than the lowest opinion of those experts.  But if P(Q) = .1, then the same estimates by the two experts should result in a probability higher than either of their estimates.  But is it solvable or at least well-defined even with the prior?

The experts both know the prior, so if you just had expert A saying P(Q(O)) = .7, the answer must be .7.  Expert B's opinion must revise the probability upwards if B > P(Q), and downwards if B < P(Q).

When expert A says O(A) = A, she probably means, "If I consider all the n objects I've seen that looked like this one, nA of them had property Q."

One approach is to add up the bits of information each expert gives, with positive bits for indications that Q(O) and negative bits that not(Q(O)).
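Concretely, adding bits is the same as multiplying odds ratios; a sketch, under the assumption that the experts' evidence is independent given the truth of Q(O):

```python
def combine(prior, *expert_probs):
    # Each expert contributes the likelihood ratio odds(p) / odds(prior);
    # multiplying these into the prior odds assumes the experts'
    # evidence is independent given whether Q(O) holds.
    def odds(p):
        return p / (1.0 - p)
    posterior_odds = odds(prior)
    for p in expert_probs:
        posterior_odds *= odds(p) / odds(prior)
    return posterior_odds / (1.0 + posterior_odds)

round(combine(0.4, 0.7, 0.8), 3)  # 0.933 - higher than either expert alone
round(combine(0.9, 0.7, 0.8), 3)  # 0.509 - with a high prior, below both experts
```

With a single expert the function returns that expert's .7 unchanged, as required, and with the high prior of .9 it drops below the lowest expert's estimate, matching the argument above.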

## No Anthropic Evidence

8 23 September 2012 10:33AM

Closely related to: How Many LHC Failures Is Too Many?

Consider the following thought experiment. At the start, an "original" coin is tossed, but not shown. If it was "tails", a gun is loaded, otherwise it's not. After that, you are offered a big number of rounds of decision, where in each one you can either quit the game, or toss a coin of your own. If your coin falls "tails", the gun gets triggered, and depending on how the original coin fell (whether the gun was loaded), you either get shot or not (if the gun doesn't fire, i.e. if the original coin was "heads", you are free to go). If your coin is "heads", you are all right for the round. If you quit the game, you will get shot at the exit with probability 75% independently of what was happening during the game (and of the original coin). The question is, should you keep playing or quit if you observe, say, 1000 "heads" in a row?

Intuitively, it seems as if 1000 "heads" is "anthropic evidence" for the original coin being "tails", that the long sequence of "heads" can only be explained by the fact that "tails" would have killed you. If you know that the original coin was "tails", then to keep playing is to face the certainty of eventually tossing "tails" and getting shot, which is worse than quitting, with only 75% chance of death. Thus, it seems preferable to quit.

On the other hand, each "heads" you observe doesn't distinguish the hypothetical where the original coin was "heads" from one where it was "tails". The first round can be modeled by a 4-element finite probability space consisting of options {HH, HT, TH, TT}, where HH and HT correspond to the original coin being "heads" and HH and TH to the coin-for-the-round being "heads". Observing "heads" is the event {HH, TH} that has the same 50% posterior probabilities for "heads" and "tails" of the original coin. Thus, each round that ends in "heads" doesn't change the knowledge about the original coin, even if there were 1000 rounds of this type. And since you only get shot if the original coin was "tails", you only get to 50% probability of dying as the game continues, which is better than the 75% from quitting the game.
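The no-update claim is easy to check by simulation (a sketch; 5 rounds rather than 1000, so that surviving runs aren't vanishingly rare):

```python
import random

def posterior_original_tails(rounds=5, trials=400_000, seed=0):
    # Among runs where the player's first `rounds` tosses all came up
    # "heads" (so the game is still going), what fraction had the
    # original coin land "tails" (gun loaded)?
    rng = random.Random(seed)
    tails_runs = heads_runs = 0
    for _ in range(trials):
        gun_loaded = rng.random() < 0.5          # original coin was "tails"
        all_heads = all(rng.random() < 0.5 for _ in range(rounds))
        if all_heads:
            if gun_loaded:
                tails_runs += 1
            else:
                heads_runs += 1
    return tails_runs / (tails_runs + heads_runs)

posterior_original_tails()  # close to 0.5: the run of heads carries no information
```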

(See also the comments by simon2 and Benja Fallenstein on the LHC post, and this thought experiment by Benja Fallenstein.)

The result of this exercise could be generalized by saying that counterfactual possibility of dying doesn't in itself influence the conclusions that can be drawn from observations that happened within the hypotheticals where one didn't die. Only if the possibility of dying influences the probability of observations that did take place, would it be possible to detect that possibility. For example, if in the above exercise, a loaded gun would cause the coin to become biased in a known way, only then would it be possible to detect the state of the gun (1000 "heads" would imply either that the gun is likely loaded, or that it's likely not).

## Chief Probability Officer

11 09 September 2012 11:45PM

Stanford Professor Sam Savage (also of Probability Management) proposes that large firms appoint a "Chief Probability Officer." Here is a description from Douglas Hubbard's How to Measure Anything, ch. 6:

Sam Savage... has some ideas about how to institutionalize the entire process of creating Monte Carlo simulations [for estimating risk].

...His idea is to appoint a chief probability officer (CPO) for the firm. The CPO would be in charge of managing a common library of probability distributions for use by anyone running Monte Carlo simulations. Savage invokes concepts like the Stochastic Information Packet (SIP), a pregenerated set of 100,000 random numbers for a particular value. Sometimes different SIPs would be related. For example, the company’s revenue might be related to national economic growth. A set of SIPs that are generated so they have these correlations are called “SLURPS” (Stochastic Library Units with Relationships Preserved). The CPO would manage SIPs and SLURPs so that users of probability distributions don’t have to reinvent the wheel every time they need to simulate inflation or healthcare costs.
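The SIP/SLURP idea can be sketched in a few lines (the variable names and the 0.6 correlation below are invented for illustration, not taken from Savage's materials):

```python
import numpy as np

def make_slurp(n=100_000, corr=0.6, seed=42):
    """Sketch of a SLURP: two SIPs (here 'gdp_growth' and 'revenue')
    pregenerated together so their correlation is preserved for every
    model that reuses them."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, corr],
                    [corr, 1.0]])
    draws = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    return {"gdp_growth": draws[:, 0], "revenue": draws[:, 1]}

slurp = make_slurp()
# every simulation that uses these two SIPs together sees the same
# joint behaviour, instead of reinventing the distributions each time
```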

Hubbard adds some of his own ideas to the proposal:

• Certification of analysts. Right now, there is not a lot of quality control for decision analysis experts. Only actuaries, in their particular specialty of decision analysis, have extensive certification requirements. As for actuaries, certification in decision analysis should eventually be an independent not-for-profit program run by a professional association. Some other professional certifications now partly cover these topics but fall far short in substance in this particular area. For this reason, I began certifying individuals in Applied Information Economics because there was an immediate need for people to be able to prove their skills to potential employers.
• Certification for calibrated estimators. As we discussed earlier, an uncalibrated estimator has a strong tendency to be overconfident. Any calculation of risk based on his or her estimates will likely be significantly understated. However, a survey I once conducted showed that calibration is almost unheard of among those who build Monte Carlo models professionally, even though a majority used at least some subjective estimates. (About a third surveyed used mostly subjective estimates.) Calibration training will be one of the simplest improvements to risk analysis in an organization.
• Well-documented procedures and templates for how models are built from the input of various calibrated estimators. It takes some time to smooth out the wrinkles in the process. Most organizations don’t need to start from scratch for every new investment they are analyzing; they can base their work on that of others or at least reuse their own prior models. I’ve executed nearly the same analysis procedure following similar project plans for a wide variety of decision analysis problems from IT security, military logistics, and entertainment industry investments. But when I applied the same method in the same organization on different problems, I often found that certain parts of the model would be similar to parts of earlier models. An insurance company would have several investments that include estimating the impact on “customer retention” and “claims payout ratio.” Manufacturing-related investments would have calculations related to “marginal labor costs per unit” or “average order fulfillment time.” These issues don’t have to be modeled anew for each new investment problem. They are reusable modules in spreadsheets.
• Adoption of a single automated tool set. [In this book I show] a few of the many tool sets available. You can get as sophisticated as you like, but starting out doesn’t require any more than some good spreadsheet-based tools. I recommend starting simple and adopting more extensive tool sets as the situations demand.

## Confused about Solomonoff induction

-3 13 July 2012 11:36AM

Why wouldn't the probability of two algorithms of different lengths appearing approach the same value as longer strings of bits are searched?

## Draft of Edwin Jaynes' "Probability Theory: The Logic of Science" online, with lost chapter 30

8 23 June 2012 05:48AM

http://thiqaruni.org/mathpdf9/(86).pdf

The book didn't include Chapter 30 - "MAXIMUM ENTROPY: MATRIX FORMULATION"

Opening it in Adobe Reader seems to work out better for me.

## Thoughts and problems with Eliezer's measure of optimization power

16 08 June 2012 09:44AM

Back in the day, Eliezer proposed a method for measuring the optimization power (OP) of a system S. The idea is to get a measure of how small a target the system can hit:

You can quantify this, at least in theory, supposing you have (A) the agent or optimization process's preference ordering, and (B) a measure of the space of outcomes - which, for discrete outcomes in a finite space of possibilities, could just consist of counting them - then you can quantify how small a target is being hit, within how large a greater region.

Then we count the total number of states with equal or greater rank in the preference ordering to the outcome achieved, or integrate over the measure of states with equal or greater rank.  Dividing this by the total size of the space gives you the relative smallness of the target - did you hit an outcome that was one in a million?  One in a trillion?

Actually, most optimization processes produce "surprises" that are exponentially more improbable than this - you'd need to try far more than a trillion random reorderings of the letters in a book, to produce a play of quality equalling or exceeding Shakespeare.  So we take the log base two of the reciprocal of the improbability, and that gives us optimization power in bits.

For example, assume there were eight equally likely possible states {X0, X1, ... , X7}, and S gives them utilities {0, 1, ... , 7}. Then if S can make X6 happen, there are two states better or equal to its achievement (X6 and X7), hence it has hit a target filling 1/4 of the total space. Hence its OP is log2 4 = 2. If the best S could manage is X4, then it has only hit half the total space, and has an OP of only log2 2 = 1. Conversely, if S reached the perfect X7, 1/8 of the total space, then it would have an OP of log2 8 = 3.
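The worked example translates directly into code (assuming, as above, finitely many equally likely discrete states):

```python
import math

def optimization_power(states_at_least_as_good, total_states):
    # log2 of the reciprocal of the fraction of the outcome space
    # that ranks at least as high as the achieved outcome
    return math.log2(total_states / states_at_least_as_good)

optimization_power(2, 8)  # hit X6: 2.0 bits
optimization_power(4, 8)  # hit X4: 1.0 bit
optimization_power(1, 8)  # hit X7: 3.0 bits
```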

continue reading »

## Logical Uncertainty as Probability

2 29 April 2012 10:26PM

This post is a long answer to this comment by cousin_it:

Logical uncertainty is weird because it doesn't exactly obey the rules of probability. You can't have a consistent probability assignment that says axioms are 100% true but the millionth digit of pi has a 50% chance of being odd.

I'd like to attempt to formally define logical uncertainty in terms of probability. Don't know if what results is in any way novel or useful, but.

Let X be a finite set of true statements of some formal system F extending propositional calculus, like Peano Arithmetic. X is supposed to represent a set of logical/mathematical beliefs of some finite reasoning agent.

Given any X, we can define its "Obvious Logical Closure" OLC(X), an infinite set of statements producible from X by applying the rules and axioms of propositional calculus. An important property of OLC(X) is that it is decidable: for any statement S it is possible to find out whether S is true (S∈OLC(X)), false ("~S"∈OLC(X)), or uncertain (neither).

We can now define the "conditional" probability P(*|X) as a function from {the statements of F} to [0,1] satisfying the axioms:

Axiom 1: Known true statements have probability 1:

P(S|X)=1  iff  S∈OLC(X)

Axiom 2: The probability of a disjunction of mutually exclusive statements is equal to the sum of their probabilities:

"~(A∧B)"∈OLC(X)  implies  P("A∨B"|X) = P(A|X) + P(B|X)

From these axioms we can get all the expected behavior of the probabilities:

P("~S"|X) = 1 - P(S|X)

P(S|X)=0  iff  "~S"∈OLC(X)

0 < P(S|X) < 1  iff  S∉OLC(X) and "~S"∉OLC(X)

"A=>B"∈OLC(X)  implies  P(A|X)≤P(B|X)

"A<=>B"∈OLC(X)  implies  P(A|X)=P(B|X)

etc.
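For instance, the first of these consequences follows from the two axioms in a couple of steps:

```
"S∨~S" ∈ OLC(X)  and  "~(S∧~S)" ∈ OLC(X)      (propositional tautologies)
Axiom 1:  P("S∨~S"|X) = 1
Axiom 2:  P("S∨~S"|X) = P(S|X) + P("~S"|X)
hence     P("~S"|X) = 1 - P(S|X)
```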

This is still insufficient to calculate an actual probability value for any uncertain statement. Additional principles are required. For example, the Consistency Desideratum of Jaynes: "equivalent states of knowledge must be represented by the same probability values".

Definition: two statements A and B are indistinguishable relative to X iff there exists an isomorphism between OLC(X∪{A}) and OLC(X∪{B}), which is identity on X, and which maps A to B.
[Isomorphism here is a 1-1 function f preserving all logical operations:  f(A∨B)=f(A)∨f(B), f(~~A)=~~f(A), etc.]

Axiom 3: If A and B are indistinguishable relative to X, then  P(A|X) = P(B|X).

Proposition: Let X be the set of statements representing my current mathematical knowledge, translated into F.  Then the statements "millionth digit of PI is odd" and "millionth digit of PI is even" are indistinguishable relative to X.

Corollary:  P(millionth digit of PI is odd | my current mathematical knowledge) = 1/2.

## Learning the basics of probability & beliefs

3 31 March 2012 09:18AM

Let's say that I believe that the sky is green.

1) How can I know whether this belief is true?

2) How can I assign a probability to it to test its degree of truthfulness?

3) How can I update this belief?

Thank you.

## Causation, Probability and Objectivity

7 18 March 2012 06:54AM

Most people here seem to endorse the following two claims:

1. Probability is "in the mind," i.e., probability claims are true only in relation to some prior distribution and set of information to be conditionalized on;
2. Causality is to be cashed out in terms of probability distributions à la Judea Pearl or something.

However, these two claims feel in tension to me, since they appear to have the consequence that causality is also "in the mind" - whether something caused something else depends on various probability distributions, which in turn depend on how much we know about the situation. Worse, it has the consequence that ideal Bayesian reasoners can never be wrong about causal relations, since they always have perfect knowledge of their own probabilities.

Since I don't understand Pearl's model of causality very well, I may be missing something fundamental, so this is more of a question than an argument.

## Gambler's Reward: Optimal Betting Size

5 17 January 2012 08:32PM

I've been trying my hand at card counting lately, and I've been doing some thinking about how a perfect gambler would act at the table. I'm not sure how to derive the optimal bet size.

Overall, the expected value of blackjack is small and negative. However, the expected value fluctuates considerably from round to round. By varying his bet size and sitting out rounds, the player can wager more money when the expected value is higher and less when it is lower. Over time, this can result in an edge.

However, I'm not sure what the optimal bet size is. Going all-in with a 60 percent chance of winning is EV+, but the 40 percent chance of loss would not only destroy your bankroll, it would also prevent you from participating in future EV+ situations. Ideally, one would want to not only increase EV, but also decrease variance.
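One standard way to make this tradeoff precise (an editorial aside, not something from the post) is the Kelly criterion: choose the bet fraction that maximizes the expected logarithm of the bankroll. A quick sketch for an even-money bet won with probability p:

```python
import math

# Kelly-style comparison: expected log-growth of the bankroll when
# wagering fraction f of it on an even-money bet won with probability p.
# For this bet, the maximizing fraction is 2p - 1.
def expected_log_growth(p, f):
    if f >= 1.0:
        return float("-inf")  # one loss wipes out the bankroll
    return p * math.log(1 + f) + (1 - p) * math.log(1 - f)

p = 0.6
for f in (0.1, 0.2, 0.3, 0.5, 1.0):
    print(f, round(expected_log_growth(p, f), 4))
# f = 0.2 (= 2p - 1) maximizes growth; going all-in is ruinous
# even though the bet is EV-positive.
```

This captures the post's intuition exactly: the all-in bet has positive expected value but negatively infinite expected log-growth, because ruin ends all future EV+ opportunities.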

Objective: Given a distribution of expected values, develop a function that transforms the current expected value into the percentage of the bankroll that should be placed at risk.

I'm not sure how to begin, even once I've worked out the distribution of expected values. Are other inputs required (e.g., the utility of a marginal dollar won, or a desired risk of ruin)? Should the approach perhaps be to maximize expected value after one playing session? Why not a month of playing sessions, or a billion? Is there any chance the optimal betting size would produce behavior similar to that predicted by prospect theory?

I eagerly await an informative discussion. If you have something against gambling, just pretend we're talking about how much of your wealth you plan on investing in an oil well with positive expected value.

## The lessons of a world without Hitler

-4 16 January 2012 04:16PM

What would the world look like without Hitler? Fiction is generally unequivocal about this: remove Hitler, and the world still lurches towards a world war through some other path. WWII and the Holocaust are such major, defining events of the twentieth century that we twist counterfactual events to ensure they still happen.

Against this, some have made the argument that Hitler was essentially solely responsible for WWII and especially for the Holocaust - no Hitler, no war, no extermination camps. The no-Holocaust argument is quite solid: the extermination system was expensive, militarily counter-productive, and could only have happened given a leader lacking checks and balances and with an idée fixe that overrode everything else (general European antisemitism allowed the Holocaust, but didn't cause it). The no-WWII argument points out that Hitler was both irrational and lucky: he often took great risks, on flimsy evidence, and got away with them. Certainly his decisions in the later, post-Barbarossa period of his reign belie political, military or organisational genius. And it was the height of stupidity to have gone to war, for half of Poland, with simultaneously the world's greatest empire and what appeared to be the overwhelmingly strong French army. Yes, Gamelin, the French commander-in-chief, did behave like a concussed duckling, and the German army outfought the French - but no-one could have predicted this, no-one sensible would have counted on it, and hence no-one sensible would have risked the war. Hitler wasn't sensible, and lucked out.

continue reading »

## Can you recognize a random generator?

2 28 December 2011 01:59PM

I can't seem to get my head around a simple issue of judging probability. Perhaps someone here can point to an obvious flaw in my thinking.

Let's say we have a binary generator, a machine that outputs a required sequence of ones and zeros according to some internally encapsulated rule (deterministic or probabilistic). All binary generators look alike and you can only infer (a probability of) a rule by looking at its output.

You have two binary generators: A and B. One of these is a true random generator (fair coin tosser). The other one is a biased random generator: stateless (each digit is generated independently of those before it), with a probability p(0) of outputting zero somewhere between zero and one, but NOT 0.5 - let's say it's uniformly distributed over [0, 0.5) ∪ (0.5, 1]. At this point, the chance that A is the true random generator is 50%.

Now you read the output of first ten digits generated by these machines. Machine A outputs 0000000000. Machine B outputs 0010111101. Knowing this, is the probability of machine A being a true random generator now less than 50%?

My intuition says yes.

But the probability that a true random generator will output 0000000000 should be the same as the probability that it will output 0010111101, because all sequences of equal length are equally likely. The biased random generator is also just as likely to output 0000000000 as it is 0010111101.

So there seems to be no reason to think that a machine outputting a sequence of zeros of any size is any more likely to be a biased stateless random generator than it is to be a true random generator.

I know that you can never know that the generator is truly random. But surely you can statistically discern between random and non-random generators?
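A sketch of the Bayes computation (my own, as an editorial illustration): the key is that the biased machine's likelihood must be marginalized over its unknown bias, and that marginal strongly favors extreme sequences like 0000000000. Treating the bias as uniform on [0, 1] (dropping the single point 0.5 changes nothing, since it has measure zero):

```python
import math
from fractions import Fraction

# Fair generator: each n-digit sequence has likelihood (1/2)^n.
# Biased generator with p(0) ~ Uniform[0,1]: the marginal likelihood of
# a sequence with k zeros in n digits is
#   Integral of p^k (1-p)^(n-k) dp = 1 / ((n+1) * C(n, k)).
def fair_likelihood(n):
    return Fraction(1, 2**n)

def biased_marginal(k, n):
    return Fraction(1, (n + 1) * math.comb(n, k))

def posterior_fair(k, n):
    f, b = fair_likelihood(n), biased_marginal(k, n)
    return f / (f + b)  # 50/50 prior

print(posterior_fair(10, 10))  # "0000000000": 10 zeros -> 11/1035, ~1%
print(posterior_fair(4, 10))   # "0010111101": 4 zeros -> ~69% fair
```

So the intuition is right and the symmetry argument is where the flaw lies: each *fixed* bias assigns 0000000000 and 0010111101 different probabilities, and averaging over biases makes the all-zeros sequence far more likely under the biased hypothesis.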

## Statisticsish Question

3 28 November 2011 04:03PM

This is a question really, not a post; I just can't find the answer formally. Does Laplace's rule of succession work when you are drawing from a finite population without replacement? Suppose I know that some papers in a hat have "yes" on them and the rest don't, that there is a finite number of papers, and that every time I take a paper out I burn it, but I have no clue how many papers are in the hat. Should I still use Laplace's rule to figure out how strongly to expect the next paper to have a "yes" on it? Or is there some adjustment to make, since every time I see a yes paper, the ratio of yes papers to non-yes papers left in the hat goes down?
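For what it's worth, a quick sketch (assuming, unlike the post, a *known* hat size N, with a uniform prior over the number of yes-papers) suggests the rule survives without-replacement sampling exactly:

```python
from fractions import Fraction
from math import comb

# After observing a specific sequence of n draws containing k yeses
# (without replacement) from a hat of N papers, with a uniform prior
# over the number Y of yes-papers, compute P(next draw is yes).
def predictive(N, n, k):
    num = den = Fraction(0)
    for Y in range(N + 1):
        # probability of this exact sequence given Y yes-papers:
        # hypergeometric probability of k yeses, divided by C(n, k) orderings
        seq = Fraction(comb(Y, k) * comb(N - Y, n - k),
                       comb(N, n) * comb(n, k))
        den += seq
        num += seq * Fraction(max(Y - k, 0), N - n)  # next draw yes
    return num / den

print(predictive(10, 3, 2))  # 3/5, i.e. exactly (k+1)/(n+2)
```

The answer comes out to (k+1)/(n+2) regardless of N, which matches the classical with-replacement rule; the shrinking hat and the shrinking uncertainty about its contents cancel exactly under this prior.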

## Log-odds (or logits)

18 28 November 2011 01:11AM

(I wrote this post for my own blog, and given the warm reception, I figured it would also be suitable for the LW audience. It contains some nicely formatted equations/tables in LaTeX, hence I've left it as a dropbox download.)

Logarithmic probabilities have appeared previously on LW here, here, and sporadically in the comments. The first is a link to an Eliezer post which covers essentially the same material. I believe this is a better introduction/description/guide to logarithmic probabilities than anything else that's appeared on LW thus far.

Introduction:

Our conventional way of expressing probabilities has always frustrated me. For example, it is very easy to make nonsensical statements like "110% chance of working". Or, it is not obvious that the difference between 50% and 50.01% is trivial compared to the difference between 99.98% and 99.99%. It also fails to accommodate the math correctly when we want to say things like "five times more likely", because 50% * 5 overflows 100%.
Jacob and I have (re)discovered a mapping from probabilities to log-odds which addresses all of these issues. To boot, it accommodates Bayes' theorem beautifully. For something so simple and fundamental, it certainly took a great deal of google searching/wikipedia surfing to discover that they are actually called "log-odds", and that they were "discovered" in 1944, instead of the 1600s. Also, nobody seems to use log-odds, even though they are conceptually powerful. Thus, this primer serves to explain why we need log-odds, what they are, how to use them, and when to use them.
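A minimal sketch of the mapping (base 10, matching the updated article; the function names are mine):

```python
import math

# Log-odds in "decibels" of evidence: L(p) = 10 * log10(p / (1 - p)).
def to_logodds(p):
    return 10 * math.log10(p / (1 - p))

def from_logodds(db):
    odds = 10 ** (db / 10)
    return odds / (1 + odds)

# Bayes' theorem becomes addition:
# posterior log-odds = prior log-odds + 10*log10(likelihood ratio).
prior = to_logodds(0.5)                  # 0 dB: maximal uncertainty
posterior = prior + 10 * math.log10(4)   # evidence with a 4:1 likelihood ratio
print(round(from_logodds(posterior), 2))  # 0.8
```

This also makes the scale distortions in the quote above explicit: 50% vs 50.01% is a sliver of a decibel, while 99.98% vs 99.99% is about 3 dB.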

Article is here (Updated 11/30 to use base 10)

## Bayes Slays Goodman's Grue

0 17 November 2011 10:45AM

This is a first stab at solving Goodman's famous grue problem. I haven't seen a post on LW about the grue paradox, and this surprised me, since I had figured that if any argument would be raised against Bayesian LW doctrine, it would be the grue problem. I haven't looked at many proposed solutions to this paradox, besides some of the basic ones in "The New Problem of Induction". So, I apologize now if my solution is wildly unoriginal. I am willing to put you through this, dear reader, because:

1. I wanted to see how I would fare against this still largely open, devastating, and classic problem, using only the arsenal provided to me by my minimal Bayesian training, and my regular LW reading.
2. I wanted the first LW article about the grue problem to attack it from a distinctly Lesswrongian approach, without the benefit of hindsight knowledge of the solutions of non-LW philosophy.
3. And lastly, because, even if this solution has been found before, if it is the right solution, it is to LW's credit that its students can solve the grue problem with only the use of LW skills and cognitive tools.

I would also like to warn the savvy subjective Bayesian that just because I think that probabilities model frequencies, and that I require frequencies out there in the world, does not mean that I am a frequentist or a realist about probability. I am a formalist with a grain of salt. There are no probabilities anywhere in my view, not even in minds; but the theorems of probability theory, when interpreted, share a fundamental contour with many important tools of the inquiring mind, including both the nature of frequency and the set of rational subjective belief systems. There is nothing more to probability than that system which produces its theorems.

Lastly, I would like to say that even if I have not succeeded here (which I think I have), there is likely something valuable that can be made from the leftovers of my solution after the onslaught of penetrating critiques that I expect from this community. Solving this problem is essential to LW's methods, and our arsenal is fit to handle it. If we are going to be taken seriously in the philosophical community as a new movement, we must solve serious problems from academic philosophy, and we must do it in distinctly Lesswrongian ways.

"The first emerald ever observed was green.
The second emerald ever observed was green.
The third emerald ever observed was green.
… etc.
The nth emerald ever observed was green.
(conclusion):
There is a very high probability that a never before observed emerald will be green."

That is the inference that the grue problem threatens, courtesy of Nelson Goodman.  The grue problem starts by defining "grue":

"An object is grue iff it is first observed before time T, and it is green, or it is first observed after time T, and it is blue."

So you see that before time T, from the list of premises:

"The first emerald ever observed was green.
The second emerald ever observed was green.
The third emerald ever observed was green.
… etc.
The nth emerald ever observed was green."
(we will call these the green premises)

it follows that:

"The first emerald ever observed was grue.
The second emerald ever observed was grue.
The third emerald ever observed was grue.
… etc.
The nth emerald ever observed was grue."
(we will call these the grue premises)

The proposer of the grue problem asks at this point: "So if the green premises are evidence that the next emerald will be green, why aren't the grue premises evidence for the next emerald being grue?" If an emerald is grue after time T, it is not green. Let's say that the green premises bring the probability of "A new unobserved emerald is green." to 99%. In the skeptic's hypothesis, by symmetry they should also bring the probability of "A new unobserved emerald is grue." to 99%. But of course, after time T, this would mean that the probability of observing a green emerald is 99%, and the probability of not observing a green emerald is at least 99%. Since these sentences have no intersection, i.e., they cannot both happen, the probability of their disjunction is the sum of their individual probabilities. This gives us a number at least as big as 198%, which of course contradicts the Kolmogorov axioms: we should not be able to form a statement with a probability greater than one.

This threatens the whole of science, because you cannot simply keep this isolated to emeralds and color. We may think of the emeralds as trials, and green as the value of a random variable. Ultimately, every result of a scientific instrument is a random variable, with a very particular and useful distribution over its values. If we can't justify inferring probability distributions over random variables based on their previous results, we cannot justify a single bit of natural science. This, of course, says nothing about how it works in practice. We all know it works in practice. "A philosopher is someone who says, 'I know it works in practice; I'm trying to see if it works in principle.'" - Dan Dennett

We may look at an analogous problem. Suppose that there is a table, that balls are being dropped onto it, and that there is an infinitely thin line drawn perpendicular to the edge of the table somewhere, whose position we don't know. The problem is to figure out the probability of the next ball landing right of the line, given the previous results. Our first prediction should be that there is a 50% chance of the ball landing right of the line, by symmetry. If we observe that one ball landed right of the line, by Laplace's rule of succession we infer that there is a 2/3 chance that the next ball will land right of the line. After n trials, if every trial gives a positive result, the probability we should assign to the next trial being positive as well is (n+1)/(n+2).

If the line were placed 2/3 of the way down the table, we should expect the ratio of Rights to Lefts to approach 2:1. This gives us a 2/3 chance of the next ball landing on the right, and the fraction of Rights among trials approaches 2/3 ever more closely as more trials are performed.
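A quick simulation (my own sketch) of the claim the post makes later about what would actually happen after time T: among post-T trials, "reft" coincides with left by definition, so the reft:light ratio is exactly the reciprocal of the right:left ratio.

```python
import random

# Table experiment after time T: the true frequency of Right is 2/3,
# and after T a ball is "reft" iff it is left of the line.
random.seed(0)
post_rights = post_lefts = 0
for _ in range(1000):          # trials after time T
    right = random.random() < 2 / 3
    post_rights += right
    post_lefts += not right
refts, lights = post_lefts, post_rights  # post-T: reft = left, light = right
print(post_rights / post_lefts)  # close to 2
print(refts / lights)            # close to 1/2: the reciprocal
```
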

Now let us suppose a grue skeptic approaching this situation. He might make up two terms "reft" and "light". Defined as you would expect, but just in case:

"A ball is reft of the line iff it is right of it before time T when it lands, or if it is left of it after time T when it lands.
A ball is light of the line iff it is left of the line before time T when it lands, or if it is right of the line after time T when it first lands."

The skeptic would continue:

"Why should we treat the observation of several occurrences of Right, as evidence for 'The next ball will land on the right.' and not as evidence for 'The next ball will land reft of the line.'?"

Things for some reason become perfectly clear at this point for the defender of Bayesian inference, because now we have an easy-to-imagine model. Of course, if a ball landing right of the line is evidence for Right, then it cannot possibly be evidence for ~Right; to be evidence for Reft, after time T, is to be evidence for ~Right, because after time T, Reft is logically identical to ~Right; hence it is not evidence for Reft, after time T, for the same reasons it is not evidence for ~Right. Of course, before time T, any evidence for Reft is evidence for Right, for analogous reasons.

But now the grue skeptic can say something brilliant, that stops much of what the Bayesian has proposed dead in its tracks:

"Why can't I just repeat that paragraph back to you and swap every occurrence of 'right' with 'reft' and 'left' with 'light', and vice versa? They are perfectly symmetrical in terms of their logical relations to one another.
If we take 'reft' and 'light' as primitives, then we have to define 'right' and 'left' in terms of 'reft' and 'light' with the use of time intervals."

What can we possibly reply to this? Can he/she not do this with every argument we propose, then? Certainly, the skeptic admits that Bayes, and the contradiction between Right and Reft after time T, prohibit previous Rights from being evidence of both Right and Reft after time T; where he is challenging us is in choosing Right as the result which they are evidence for, even though "Reft" and "Right" have a completely symmetrical syntactical relationship. There is nothing about the definitions of reft and right which distinguishes them from each other, except their spelling. So is that it? No: this simply means we have to propose an argument that doesn't rely on purely syntactical reasoning, so that if the skeptic performs the swap on our argument, the resulting argument is no longer sound.

What would happen in this scenario if it were actually set up? I know that seems like a strangely concrete question for a philosophy text, but its answer is a helpful hint. What would happen is that after time T, the ratio Rights:Lefts would continue to behave as expected as more trials were added, while the ratio Refts:Lights would approach the reciprocal of Rights:Lefts. The only way for this not to happen is for us to have been calling the right side of the table "reft", or for the line to have moved. We can only figure out where the line is by knowing where the balls landed relative to it; anything we can figure out about where the line is from knowing which balls landed Reft and which landed Light, we can only figure out because, knowing this and the time, we can know whether each ball landed left or right of the line.

To this I know of no reply which the grue skeptic can make. If he/she says the paragraph back to me with the proper words swapped, it is no longer true, because in the hypothetical where we have a table, a line, and we are calling one side right and the other left, the only way for Refts:Lights to behave as expected as more trials are added is for the line to move (if even that); otherwise the ratio of Refts to Lights will approach the reciprocal of Rights to Lefts.

This thin line is analogous to the frequency of emeralds that turn out green out of all the emeralds that get made. This is why we can assume that the line will not move, because that frequency has one precise value, which never changes. Its other important feature is reminding us that even if two terms are syntactically symmetrical, they may have semantic conditions for application which are ignored by the syntactical model, e.g., checking to see which side of the line the ball landed on.

In conclusion:

Every random variable has as a part of it, stored in its definition/code, a frequency distribution over its values. By the fact that some things happen sometimes, and others happen other times, we know that the world contains random variables, even if they are never fundamental in the source code. Note that "frequency" is not used here as a state of partial knowledge; it is a fact about a set and one of its subsets.

The reason that:

"The first emerald ever observed was green.
The second emerald ever observed was green.
The third emerald ever observed was green.
… etc.
The nth emerald ever observed was green.
(conclusion):
There is a very high probability that a never before observed emerald will be green."

is a valid inference, but the grue equivalent isn't, is that grue is not a property that the emerald construction sites of our universe deal with. They are blind to the grueness of their emeralds; they only say anything about whether or not the next emerald will be green. It may be that the rule the emerald construction sites use to produce either a green or non-green emerald changes at time T, but the frequency of some particular result out of all trials will never change; the line will not move. As long as we know what symbols we are using for what values, observing many green emeralds is evidence that the next one will be grue only before time T; after time T, every record of an observation of a green emerald is evidence against a grue one. "Grue" changes meaning from green to blue at time T; the meaning of "green" stays the same, since we are using the same physical test to determine green-hood as before - just as we use the same test to tell whether the ball landed right or left. There is no reft in the universe's source code, and there is no grue. Green is not fundamental in the source code either, but green can be reduced to some particular range of quanta states; if you had the universe's source code, you couldn't write grue without first writing green, while writing green requires knowing nothing about grue. Having a physical test, or primary condition for applicability, is what privileges green over grue after time T; to have a consistent physical test is the same as to reduce to a specifiable range of physical parameters, and the existence of such a test is what prevents the skeptic from performing his/her swaps on our arguments.

Take this more as a brainstorm than as a final solution. It wasn't originally, but it should have been. I'll write something more organized and concise after I think about the comments more, and make some graphics I've designed that make my argument much clearer, even to myself. But keep those comments coming, and tell me if you want specific credit for anything you may have added to my grue toolkit in the comments.

## If life is unlikely, SIA and SSA expectations are similar

3 15 November 2011 04:45PM

Consider a scenario in which there are three rooms. In each room there is an independent 1/1000 chance of an agent being created. There is thus a 1/10^9 probability of there being an agent in every room, a (3*999)/10^9 probability of there being two agents, and a (3*999^2)/10^9 probability of there being one.

Given that you are one of these agents, the SIA and SSA probabilities of there being n agents are:

| Number of agents | SIA | SSA |
|---|---|---|
| 0 | 0 | 0 |
| 1 | (1*3*999^2)/(3*1+2*3*999+1*3*999^2) | (3*999^2)/(1+3*999+3*999^2) |
| 2 | (2*3*999)/(3*1+2*3*999+1*3*999^2) | (3*999)/(1+3*999+3*999^2) |
| 3 | (3*1)/(3*1+2*3*999+1*3*999^2) | (1)/(1+3*999+3*999^2) |

The expected number of agents is (1*(3*999^2) + 2*(2*3*999) + 3*(3*1))/(3*1+2*3*999+1*3*999^2) = 1.002 for SIA, and (1*(3*999^2) + 2*(3*999) + 3*(1))/(1+3*999+3*999^2) ≈ 1.001 for SSA. The high unlikelihood of life means that, given that we are alive, both SIA and SSA probabilities get dominated by worlds with very few agents.
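These numbers can be checked mechanically; a sketch with exact arithmetic:

```python
from fractions import Fraction
from math import comb

# Recompute the SIA and SSA expectations for the three-room scenario,
# with per-room creation probability q = 1/1000.
q = Fraction(1, 1000)
prior = {n: comb(3, n) * q**n * (1 - q)**(3 - n) for n in range(4)}

# SSA: condition on at least one agent existing.
ssa_norm = sum(prior[n] for n in range(1, 4))
ssa = {n: prior[n] / ssa_norm for n in range(1, 4)}

# SIA: additionally weight each world by its number of agents.
sia_norm = sum(n * prior[n] for n in range(1, 4))
sia = {n: n * prior[n] / sia_norm for n in range(1, 4)}

print(float(sum(n * sia[n] for n in sia)))  # 1.002 exactly
print(float(sum(n * ssa[n] for n in ssa)))  # ~1.001
```
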

This of course only applies to agents whose existence is independent (for instance, separate galactic civilizations). If you're alive, chances are that your parents were also alive at some point too.

## Which fields of learning have clarified your thinking? How and why?

12 11 November 2011 01:04AM

Did computer programming make you a clearer, more precise thinker? How about mathematics? If so, what kind? Set theory? Probability theory?

Microeconomics? Poker? English? Civil Engineering? Underwater Basket Weaving? (For adding... depth.)

Anything I missed?

Context: I have a palette of courses to dab onto my university schedule, and I don't know which ones to chose. This much is for certain: I want to come out of university as a problem solving beast. If there are fields of inquiry whose methods easily transfer to other fields, it is those fields that I want to learn in, at least initially.

Rip apart, Less Wrong!

## Help with a (potentially Bayesian) statistics / set theory problem?

2 10 November 2011 10:30PM

Update: as it turns out, this is a voting system problem, which is a difficult but well-studied topic. Potential solutions include Ranked Pairs (complicated) and BestThing (simpler). Thanks to everyone for helping me think this through out loud, and for reminding me to kill flies with flyswatters instead of bazookas.

I'm working on a problem that I believe involves Bayes, I'm new to Bayes and a bit rusty on statistics, and I'm having a hard time figuring out where to start. (EDIT: it looks like set theory may also be involved.) Your help would be greatly appreciated.

Here's the problem: assume a set of 7 different objects. Two of these objects are presented at random to a participant, who selects whichever one of the two objects they prefer. (There is no "indifferent" option.) The order of these combinations is not important, and repeated combinations are not allowed.

Basic combination theory says there are 21 different possible combinations: (7!) / ( (2!) * (7-2)! ) = 21.

Now, assume the researcher wants to know which single option has the highest probability of being the "most preferred" to a new participant based on the responses of all previous participants. To complicate matters, each participant can leave at any time, without completing the entire set of 21 responses. Their responses should still factor into the final result, even if they only respond to a single combination.

At the beginning of the study, there are no priors. (CORRECTION via dlthomas: "There are necessarily priors... we start with no information about rankings... and so assume a 1:1 chance of either object being preferred.") If a participant selects B from {A,B}, the probability of B being the "most preferred" object should go up, and A's should go down, if I'm understanding correctly.

NOTE: Direct ranking of objects 1-7 (instead of pairwise comparison) isn't ideal because it takes longer, which may encourage the participant to rationalize. The "pick-one-of-two" approach is designed to be fast, which is better for gut reactions when comparing simple objects like words, photos, etc.

The ideal output looks like this: "Based on ___ total responses, participants prefer Object A. Object A is preferred __% more than Object B (the second most preferred), and ___% more than Object C (the third most preferred)."
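One simple way to start (a hypothetical sketch of my own, not the Ranked Pairs or BestThing methods mentioned in the update): tally pairwise choices, estimate each P(X preferred to Y) with a Laplace 1:1 prior, and rank objects by their mean win probability. The example responses are made up.

```python
from itertools import combinations

# Laplace-smoothed pairwise preference tally over 7 objects.
objects = list("ABCDEFG")
wins = {frozenset(p): {x: 0 for x in p} for p in combinations(objects, 2)}

def record(chosen, other):
    wins[frozenset((chosen, other))][chosen] += 1

def p_beats(x, y):
    w = wins[frozenset((x, y))]
    return (w[x] + 1) / (w[x] + w[y] + 2)  # rule of succession

# Hypothetical responses, possibly from participants who left early:
for chosen, other in [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]:
    record(chosen, other)

scores = {x: sum(p_beats(x, y) for y in objects if y != x) / 6
          for x in objects}
print(max(scores, key=scores.get))  # A
```

This handles incomplete response sets gracefully (unseen pairs stay at 1/2), though unlike Ranked Pairs it ignores transitivity information across comparisons.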

Questions:

1. Is Bayes actually the most straightforward way of calculating the "most preferred"? (If not, what is? I don't want to be Maslow's "man with a hammer" here.)

2. If so, can you please walk me through the beginning of how this calculation is done, assuming 10 participants?

Thanks in advance!

## Foundations of Inference

8 31 October 2011 07:48PM

I've recently been getting into all of this wonderful Information Theory stuff and have come across a paper (thanks to John Salvatier) that was written by Kevin H. Knuth:

Foundations of Inference

The paper sets up some intuitive minimal axioms for quantifying power sets and then (seems to) use them to derive Bayesian probability theory, information gain, and Shannon entropy. The paper also claims to use fewer assumptions than both Cox and Kolmogorov when choosing axioms. This seems like a significant foundation/unification. I'd like to hear whether others agree, and which parts of the paper you think are the significant contributions.

If a 14 page paper is too long for you, I recommend skipping to the conclusion (starting at the bottom of page 12) where there is a nice picture representation of the axioms and a quick summary of what they imply.

## Religion, happiness, and Bayes

3 04 October 2011 10:21AM

Religion apparently makes people happier. Is that evidence for the truth of religion, or against it?

(Of course, it matters which religion we're talking about, but let's just stick with theism generally.)

My initial inclination was to interpret this as evidence against theism, in the sense that it weakens the evidence for theism. Here's why:

1. As all Bayesians know, a piece of information F is evidence for an hypothesis H to the degree that F depends on H. If F can happen just as easily without H as with it, then F is not evidence for H. The more likely we are to find F in a world without H, the weaker F is as evidence for H.
2. Here, F is "Theism makes people happier." H is "Theism is true."
3. The fact of widespread theism is evidence for H. The strength of this evidence depends on how likely such belief would be if H were false.
4. As people are more likely to do something if it makes them happy, people are more likely to be theists given F.
5. Thus F opens up a way for people to be theists even if H is false.
6. It therefore weakens the evidence of widespread theism for the truth of H.
7. Therefore, F should decrease one's confidence in H, i.e., it is evidence against H.

We could also put this in mathematical terms: F raises the probability of our encountering the evidence (widespread theism) regardless of whether H is true. Since that probability appears in the denominator of Bayes' theorem, a bigger value means a smaller posterior - in other words, weaker evidence.
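To make the argument concrete, here is a toy calculation with made-up numbers (all values are assumptions for illustration, not estimates):

```python
# E = "theism is widespread", H = "theism is true".
# F ("theism makes people happier") raises P(E | not-H),
# which weakens E as evidence for H.
def posterior(prior_h, p_e_given_h, p_e_given_not_h):
    joint_h = prior_h * p_e_given_h
    joint_not_h = (1 - prior_h) * p_e_given_not_h
    return joint_h / (joint_h + joint_not_h)

print(posterior(0.5, 0.9, 0.3))  # ignoring F: ~0.75
print(posterior(0.5, 0.9, 0.6))  # with F raising P(E|~H): ~0.6
```

The posterior in H drops once F is accounted for, matching step 7 of the argument.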

OK, so that was my first thought.

But then I had second thoughts: Perhaps the evidence points the other way? If we reframe the finding as "Atheism causes unhappiness," or posit that contrarians (such as atheists) are dispositionally unhappy, does that change the sign of the evidence?

Obviously, I am confused. What's going on here?

## P(X = exact value) = 0: Is it really counterintuitive?

8 29 July 2011 12:45PM

I'm probably not going to say anything new here. Someone must have pondered over this already. However, hopefully it will invite discussion and clear things up.

Let X be a random variable with a continuous distribution over the interval [0, 10]. Then, by the definition of probability over continuous domains, P(X = 1) = 0. The same is true for P(X = 10), P(X = sqrt(2)), P(X = π), and in general, the probability that X is equal to any exact number is always zero, as an integral over a single point.

This is sometimes described as counterintuitive: surely, at any measurement, X must be equal to something, and thus its probability cannot be zero, since it clearly happened. It can of course be argued that mathematical probability is an abstract function that does not exactly map to our intuitive understanding of probability, but in this case, I would argue that it does.

What if X is the x-coordinate of a physical object? If classical physics is in question - for example, we pointed a needle at a random point on a 10 cm ruler - then the needle cannot be a point object, and must have a nonzero size. Thus, we can measure the probability of the 1 cm point lying within the space the end of the needle occupies, a probability that is clearly defined and nonzero.

But even if we're talking about a point object, while it may well occupy a definite and exact coordinate in classical physics, we'll never know what exactly it is. For one, our measuring tools are not that precise. But even if they had infinite precision, statements like "X equals exactly 2.(0)" or "X equals exactly π" contain infinite information, since they specify all the decimal digits of the coordinate into infinity. We would need an infinite number of measurements to confirm it. So while X may objectively equal exactly 2 or π - again, under classical physics - measurers would never know it. At any given point, to measurers, X would lie in an interval.
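The measurement picture can be sketched numerically: for X uniform on [0, 10], the probability assigned to a measurement interval around a point shrinks to zero with the interval's width, which is exactly the sense in which P(X = c) = 0.

```python
# P(lo <= X <= hi) for X uniform on [a, b]: interval length over total length.
def p_interval(lo, hi, a=0.0, b=10.0):
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) / (b - a)

# Shrinking measurement intervals around c = 1:
for eps in (1.0, 0.1, 0.001, 1e-9):
    print(p_interval(1 - eps, 1 + eps))
```
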

Then of course there is quantum physics, where it is literally impossible for any physical object, including point objects, to have a definite coordinate with arbitrary precision. In this case, the purely mathematical notion that any exact value is an impossible event turns out (by coincidence?) to match how the universe actually works.

## Looking for proof of conditional probability

-1 28 July 2011 02:24AM

From what I understand, the Kolmogorov axioms make no mention of conditional probability; it is simply defined. If I really want to show how probability actually works, I'm not going to argue "by definition". Does anyone know a modified axiomatization that derives P(A|B) = P(A∩B)/P(B) from something simpler?
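Whatever its axiomatic status, the ratio definition is easy to sanity-check by counting outcomes. Here is a minimal sketch (the fair-die events are my own example, not from the question):

```python
from fractions import Fraction

# Sample space: one roll of a fair die
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # "the roll is even"
B = {4, 5, 6}   # "the roll is greater than 3"

def prob(event):
    # Classical probability: favorable outcomes over total outcomes
    return Fraction(len(event & omega), len(omega))

# The ratio definition: P(A|B) = P(A ∩ B) / P(B)
p_a_given_b = prob(A & B) / prob(B)
print(p_a_given_b)  # 2/3
```

Restricting attention to B and renormalizing gives the same answer as counting directly: two of the three outcomes in B are even.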

## Against improper priors

1 26 July 2011 11:50PM

An improper prior is essentially a prior probability distribution spread over an infinite range so thinly that any finite interval gets only an infinitesimal share of the probability. For example, the uniform prior over all real numbers is an improper prior, as there would be an infinitesimal probability of getting a result in any finite range. It's common to use improper priors when you have no prior information.

The mark of a good prior is that it gives a high probability to the correct answer. If I bet 1,000,000 to one that a coin will land on heads, and it lands on tails, it could be a coincidence, but I probably had a bad prior. A good prior is one that results in me not being very surprised.

With a proper prior, probability is conserved: more probability mass in one place means less in another. If I'm less surprised when a coin lands on tails, I'm more surprised when it lands on heads. This isn't true with an improper prior. If I wanted to predict the value of a random real number, and used a normal distribution with a mean of zero and a standard deviation of one, I'd be pretty darn surprised if it didn't end up being pretty close to zero, but I'd be infinitely surprised if I used a uniform distribution. No matter what the number is, it will be more surprising under the improper prior. Essentially, a proper prior is better in every way. (You could find exceptions to this, such as averaging a proper and an improper prior to get an improper prior whose finite probabilities add up to 1/2, or using a proper prior that is zero in some places, but you can always make a proper prior that is better in every way than a given improper prior.)
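One way to make "infinitely surprised" concrete is surprisal, the negative log of the probability density. The sketch below (my own illustration; the standard normal and the uniform-on-[-w/2, w/2] priors are assumptions, not from the post) shows the normal prior's surprisal staying finite while the uniform prior's grows without bound as it approaches impropriety:

```python
import math

def surprisal_normal(v):
    # Surprisal in nats of observing v under a standard normal prior
    density = math.exp(-v * v / 2) / math.sqrt(2 * math.pi)
    return -math.log(density)

def surprisal_uniform(v, width):
    # Proper uniform prior on [-width/2, width/2]; as width grows,
    # the density 1/width shrinks toward zero (the improper limit)
    assert abs(v) <= width / 2, "v must lie in the support"
    return -math.log(1.0 / width)

# Finite for the normal prior, unbounded for the widening uniform one
print(surprisal_normal(3.0))
for width in (10.0, 1e6, 1e100):
    print(surprisal_uniform(3.0, width))
```

In the improper limit the uniform density is zero everywhere, so every observation carries infinite surprisal, matching the claim that the proper prior wins no matter what number comes up.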

Dutch books also seem to be a popular way of showing what works and what doesn't, so here's a simple Dutch book argument against improper priors: I have two real numbers, x and y. Suppose they have a uniform distribution. I offer you a bet at 1:2 odds that x has the higher magnitude. They're equally likely to be higher, so you take it. I then show you the value of x. I offer you a new bet at 100:1 odds that y has the higher magnitude. You know y almost certainly has a higher magnitude than that, so you take it again. No matter what happens, I win.
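To see why the second bet looks irresistible to the improper-prior user, replace the improper uniform on y with a proper uniform on [-w, w] and let w grow. This is my own illustrative sketch, not a full formalization of the Dutch book:

```python
def prob_y_higher(x_mag, w):
    # y uniform on [-w, w]: P(|y| > x_mag) = 1 - x_mag / w  (for x_mag <= w)
    return max(0.0, 1.0 - x_mag / w)

# Whatever finite |x| is revealed, widening the prior pushes
# P(|y| > |x|) toward 1, so even 100:1 odds look safe to accept.
for w in (10.0, 1e4, 1e8):
    print(prob_y_higher(3.0, w))
```

In the improper limit the agent assigns probability 1 to any revealed finite value being exceeded, which is exactly the belief the bookmaker exploits.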

You could try to get out of it by using a different prior, but I can just perform a transformation on it to get what I want. For example, if you choose a logarithmic prior for the magnitude, I can just take the magnitude of the log of the magnitude, and have a uniform distribution.

There are certainly uses for an improper prior. You can use it if the evidence is so great compared to the difference between it and the correct value that it isn't worth worrying about. You can also use it if you're not sure what another person's prior is, and you want to give a result that is at least as high as they'd get no matter how spread out their prior is. That said, an improper prior is never actually correct, even for things that you have literally no evidence about.

## [Link] The Bayesian argument against induction.

4 18 July 2011 09:52PM

In 1983 Karl Popper and David Miller published an argument to the effect that probability theory could be used to disprove induction. Popper had long been an opponent of induction. Since probability theory in general, and Bayes in particular is often seen as rescuing induction from the standard objections, the argument is significant.

It is being discussed over at the Critical Rationalism site.

## Question about Large Utilities and Low Probabilities

4 24 June 2011 06:33PM

Advance apologies if this has been discussed before.

Question: Philosophy and Mathematics are fields in which we employ abstract reasoning to arrive at conclusions. Can the relative success of philosophy versus mathematics provide empirical evidence for how robust our arguments must be before we can even hope to have a non-negligible chance of arriving at correct conclusions? Considering how bad philosophy has been at arriving at correct conclusions, must they not be essentially as robust as mathematical proof, or correct virtually with probability 1? If so, should this not cast severe doubt on arguments showing how, in expected utility calculations, outcomes with vast sums of utility can easily swamp a low probability of their coming to pass? Won't our estimates of such probabilities be severely inflated?

## Considering all scenarios when using Bayes' theorem.

9 20 June 2011 06:11PM

Disclaimer: this post is directed at people who, like me, are not Bayesian/probability gurus.

Recently I found an opportunity to use Bayes' theorem in real life to help myself update in the following situation (presented in a gender-neutral way):

Let's say you are wondering if a person is interested in you romantically. And they bought you a drink.
A = they are interested in you.
B = they bought you a drink.
P(A) = 0.3 (Just an assumption.)
P(B) = 0.05 (Approximately 1 out of 20 people who might be at all interested in you will buy you a drink for some unknown reason.)
P(B|A) = 0.2 (Approximately 1 out of 5 people who are interested in you will buy you a drink, most likely because they are interested, though possibly for some other reason.)

These numbers seem valid to me, and I can't see anything that's obviously wrong. But when I actually use Bayes' theorem:
P(A|B) = P(B|A) * P(A) / P(B) = 1.2
Uh-oh! Where did I go wrong? See if you can spot the error before continuing.

Turns out:
P(B|A) = P(A∩B) / P(A) ≤ P(B) / P(A) = 0.1667
BUT
P(B|A) = 0.2 > 0.1667

I've made a mistake in estimating my probabilities, even though it felt intuitive. Yet, I don't immediately see where I went wrong when I look at the original estimates! What's the best way to prevent this kind of mistake?
I feel pretty confident in my estimates of P(A) and P(B|A). However, estimating P(B) is rather difficult because I need to consider many scenarios.
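One mechanical safeguard (my own suggestion, not from the post) is to check that the three estimates are even jointly coherent before plugging them in. Bayes' theorem itself supplies the constraint: P(A|B) must be at most 1, which is equivalent to P(B|A)·P(A) ≤ P(B).

```python
def coherent(p_a, p_b, p_b_given_a):
    # P(A|B) = P(B|A) * P(A) / P(B) must be a probability (<= 1),
    # which is equivalent to P(B|A) * P(A) <= P(B)
    return p_b_given_a * p_a <= p_b

print(coherent(0.3, 0.05, 0.2))    # False: the original estimates
print(coherent(0.3, 0.1905, 0.2))  # True: fine once P(B) is recomputed
```

Running this on the original numbers flags the problem immediately, before any nonsense posterior is produced.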

I can compute P(B) more precisely by considering all the scenarios that would lead to B happening (see wiki article):

P(B) = ∑ᵢ P(B|Hᵢ) · P(Hᵢ)

Let's do a quick breakdown of everyone who would want to buy you a drink (out of the pool of people who might be at all interested in you):
P(misc. reasons) = 0.05; P(B|misc) = 0.01
P(they are just friendly and buy drinks for everyone they meet) = 0.05; P(B|friendly) = 0.8
P(they want to be friends) = 0.3; P(B|friends) = 0.1
P(they are interested in you) = 0.6; P(B|interested) = P(B|A) = 0.2
So, P(B) = 0.1905
And, P(A|B) = P(B|A) · P(A) / P(B) = 0.2 · 0.6 / 0.1905 ≈ 0.63 (very different from 1.2!). Note that the breakdown also revises the prior upward: within this partition, P(A) = P(interested) = 0.6, not the 0.3 assumed at the start.
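The breakdown above can be sketched in a few lines (the hypothesis names are mine; note that the partition's prior for "interested" is 0.6, which differs from the 0.3 assumed at the start of the post, and is what the update below uses):

```python
# P(H_i) and P(B | H_i) for each hypothesis about why they bought the drink
hypotheses = {
    "misc reasons":           (0.05, 0.01),
    "friendly with everyone": (0.05, 0.80),
    "wants to be friends":    (0.30, 0.10),
    "interested in you":      (0.60, 0.20),
}
# A valid partition: the priors must sum to 1
assert abs(sum(p for p, _ in hypotheses.values()) - 1.0) < 1e-9

# Law of total probability: P(B) = sum_i P(B|H_i) * P(H_i)
p_b = sum(p_h * p_b_h for p_h, p_b_h in hypotheses.values())
print(round(p_b, 4))  # 0.1905

# Bayes' theorem with the partition's prior P(A) = 0.6
p_a, p_b_given_a = hypotheses["interested in you"]
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 2))  # 0.63
```

The assert on the priors is itself a useful habit: it catches partitions that silently leave out a scenario.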

Once I started thinking about all possible scenarios, I found one I haven't considered explicitly -- some people buy drinks for everyone they meet -- which adds a good amount of probability (0.04) to B happening. (Those types of people are rare, but they WILL buy you a drink.) There are also other interesting assumptions that are made explicit:

• Out of all the people under consideration in this problem, there are twice as many people who would be romantically interested in you vs. people who would want to be your friend.
• People who are interested in you will buy you a drink twice as often as people who want to be your friend.

The moral of the story is to consider all possible scenarios (models/hypotheses) which can lead to the event you have observed. It's possible you are missing some scenarios which, once considered, will significantly alter your probability estimates.

Do you know any other ways to make the use of Bayes' theorem more accurate? (Please post in comments, links to previous posts of this sort are welcome.)

## Lightswitches

6 25 May 2011 04:43AM

There is probably some obvious solution to this puzzle, but it eludes me.  I'm not sure how to plug it into the equation for Bayes' Theorem.  And the situation described happened last August, so I'm probably not going to figure it out on my own.

There are two lightswitches next to each other, and they control two lights (which have no other switches connected to them). I have used the switches a few times before, but don't currently recall which switch goes to which light, or whether the up or down position is the one that signifies off-ness. One light is on, one light is off, and the switches are in different positions. I want both lights off. So I guess a switch, and I'm right. What should my credence be that my previous experience with this set of lightswitches helped me guess correctly, given that I felt like I was guessing at random (and would have had a 50% shot at being right were that the case)? How much would this be different if I'd guessed wrong the first time?
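Under assumed numbers, this does plug into Bayes' theorem directly. Both priors below are my own placeholders (the puzzle leaves them open): let H = "my earlier experience biased the guess", with P(correct | H) = 0.9, versus pure chance at 0.5.

```python
# Hypothetical priors and likelihoods -- the puzzle does not supply these
p_helped = 0.5          # prior that past experience influenced the guess
p_correct_helped = 0.9  # chance of guessing right if it did
p_correct_random = 0.5  # chance of guessing right by pure luck

# P(correct) by total probability, then Bayes for P(helped | correct)
p_correct = p_correct_helped * p_helped + p_correct_random * (1 - p_helped)
p_helped_given_correct = p_correct_helped * p_helped / p_correct
print(round(p_helped_given_correct, 3))  # 0.643

# Had the guess been wrong, the same machinery lowers the credence
p_wrong = (1 - p_correct_helped) * p_helped + (1 - p_correct_random) * (1 - p_helped)
p_helped_given_wrong = (1 - p_correct_helped) * p_helped / p_wrong
print(round(p_helped_given_wrong, 3))  # 0.167
```

The hard part of the puzzle is not the update but choosing P(helped) and P(correct | helped); the code just shows how any such choice turns into a credence.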
