An unknown probability sounds like a type error. There are unknowns, such as the result of a coin flip. And there are probabilities that these unknowns take certain values, such as the probability that the flip comes up heads.
As a formula,

P(Result=Heads)=1/2

The unknown, the result of the flip, is inside the probability operator. The probability, 1/2, is outside. They're not the same kind of thing.
But suppose we have, for every possible value θ of the unknown Θ,

P(H|B,Θ=θ)=θ

where H and B are propositions. Then it makes sense to say that Θ is an unknown probability of a hypothesis H, against background knowledge B.
Some natural examples:
Suppose a coin is flipped, and you don't know whether it's a fair coin, or a trick coin that is heads on both sides. Let Θ be 1/2 if it's a fair coin, and 1 if it's a trick coin.
When sampling randomly from a population, let Θ be the fraction of individuals in the population that have a certain property.
Suppose a die is rolled, but due to imprecise manufacturing, its shape deviates from a perfect cube. Consider the phase space of the die: six dimensions for position and momentum of the center of mass, and an additional six dimensions for three angles and three angular velocities. Label each point in this phase space according to the face of the die that comes up when rolled with that point as an initial condition. Let Θ be the fraction of the phase space labeled 6. As long as the prior distribution for the initial condition of the roll is sufficiently wide, this should be almost exactly the unknown probability that the die comes up 6. (This is based on Jaynes's discussion of coins in section 10.6 of "Probability Theory: The Logic of Science". The idea of an unknown shape is inspired by an actual analysis of dice data as inference of manufacturing defects.)
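If you want to see the shape of that last construction in code, here is a minimal Python sketch. Everything in it is my own illustration: labeling phase space by real die dynamics would require integrating equations of motion, so a deterministic toy function stands in for the map from initial condition to face. The point is only the structure of the computation: sample initial conditions from a wide prior over the 12-dimensional phase space, label each point by the face it yields, and take Θ to be the fraction labeled 6.

```python
import numpy as np

rng = np.random.default_rng(0)

# A wide prior over phase space: position and momentum of the center of
# mass, plus three angles and three angular velocities (12 dimensions).
samples = rng.uniform(-1.0, 1.0, size=(100_000, 12))

# Toy stand-in for the die's dynamics (an assumption, not real physics):
# a deterministic, sensitively-dependent map from each initial condition
# to a face in {1, ..., 6}.
faces = (np.floor(np.sin(37.0 * samples).sum(axis=1) * 1e6) % 6).astype(int) + 1

# Θ is (approximately) the fraction of phase space labeled 6.
theta_hat = np.mean(faces == 6)
print(f"Θ ≈ {theta_hat:.3f}")  # close to 1/6 for this symmetric toy map
```

A die with a manufacturing defect would correspond to a labeling that is not symmetric across faces, so Θ would land away from 1/6.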
I could have simplified that last one considerably by just saying that Θ is the probability you would assign to 6, if you knew the shape of the die. In fact, all of these unknown probabilities are really probabilities you would assign if you had some additional piece of information. But before we get there, I'll go through the first example, the coin example, in more formalism than is really necessary. I'd rather be tedious than mysterious.
(By the way, this will already be familiar if you've taken a probability class based on measure theory, where conditional expectations are defined as a certain random variable. Also, the unknown probability is the "p" from Jaynes's "Ap", and the formula above is really the same as equation (18.1) in his "Probability Theory: The Logic of Science".)
Flipping a Possibly Trick Coin
The formalism is as follows. Each unknown is a function. It maps a possible world to the value of the unknown in that world.
(If you're familiar with the more usual terminology: what I'm calling an "unknown" is a random variable, and what I'm calling a "possible world" is an outcome.)
For the coin example we'll want four possible worlds. One example of an unknown that I've already mentioned is "Result":
Result(ω1)=Tails
Result(ω2)=Heads
Result(ω3)=Heads
Result(ω4)=Heads
It will be clearer what these possible worlds mean once I list all of the unknowns. I'll put this in tabular form, so that "Result" will be one column of the table.
      Coin   Side  Result  Θ
ω1    Fair   A     Tails   1/2
ω2    Fair   B     Heads   1/2
ω3    Trick  A     Heads   1
ω4    Trick  B     Heads   1
Four possible worlds: two coins, times two sides. I added this "Side" unknown to distinguish between the two sides of the trick coin, but I'm not actually going to use it. For the trick coin, either "Side" has the "Result" of Heads.
Θ simply depends on Coin. They're both functions of the possible world, though, so we can define it as
Θ(ω) = { 1/2  if Coin(ω)=Fair
       { 1    if Coin(ω)=Trick
You can see from the table that Coin and Θ are redundant, intersubstitutable. Their values pick out the same sets of possible worlds. We can use that substitution as follows:
P(Result=Heads|Θ=1/2)=P(Result=Heads|Coin=Fair)=1/2
And likewise,
P(Result=Heads|Θ=1)=1
Taking those two together, we have, for every possible value θ of Θ,

P(Result=Heads|Θ=θ)=θ

So it makes sense to call Θ an unknown probability of Result=Heads, as long as we have no other relevant background knowledge.
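Here is the same bookkeeping as a minimal Python sketch. The helper names and the prior P(Coin=Fair)=1/2 are my own choices, not anything from the setup above; the calculation goes through for any nondegenerate prior. Worlds carry a value for each unknown, unknowns are functions of worlds, and the conditional probabilities come out by enumeration:

```python
from fractions import Fraction

p_fair = Fraction(1, 2)  # prior P(Coin=Fair); an assumption, any 0 < p < 1 works

# Each possible world assigns a value to every unknown, plus a probability.
# Given the coin, each side is equally likely.
worlds = [
    {"Coin": "Fair",  "Side": "A", "Result": "Tails", "prob": p_fair / 2},
    {"Coin": "Fair",  "Side": "B", "Result": "Heads", "prob": p_fair / 2},
    {"Coin": "Trick", "Side": "A", "Result": "Heads", "prob": (1 - p_fair) / 2},
    {"Coin": "Trick", "Side": "B", "Result": "Heads", "prob": (1 - p_fair) / 2},
]

# An unknown is a function of the possible world.
def Theta(w):
    return Fraction(1, 2) if w["Coin"] == "Fair" else Fraction(1)

def prob(event):
    """P(event): sum the probabilities of the worlds where the event holds."""
    return sum(w["prob"] for w in worlds if event(w))

def cond_prob(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return prob(lambda w: event(w) and given(w)) / prob(given)

heads = lambda w: w["Result"] == "Heads"
for theta in (Fraction(1, 2), Fraction(1)):
    print(theta, cond_prob(heads, lambda w: Theta(w) == theta))
# prints "1/2 1/2" and "1 1" — i.e. P(Result=Heads | Θ=θ) = θ
```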
Note that it shouldn't matter if we "expand" the space of possible worlds, for example by having each possible world represent a trajectory of the flip through phase space. We can consider each of these four possible worlds as the output of some lossy function from a richer space of possible worlds. The lost information doesn't affect our analysis because the unknowns of interest can be defined in terms of the more coarse-grained description.
What You Would Believe
Now we can return to the idea that an "unknown probability" is really a probability we would assign, with more information. In the coin flip case, it was the probability we would assign to heads, if we knew whether it was the trick or the fair coin being flipped. And if we didn't know anything else relevant, such as the initial conditions of the flip.
Though this notation makes it a little annoying to write this formally, we can do it as follows:

Θ(ω)=P(H|B,Y=Y(ω))

Y=Y(ω) is a valid proposition because Y(ω) is simply some particular value that Y can take. This formula says that the "unknown probability" at a world is a conditional probability, with that world's value of the unknown Y behind the conditional bar. The unknown probability is defined in terms of the probability operator, but it's an unknown like any other: a function of a possible world.
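In code, building on the sketch above, this definition is a one-liner. The name posterior_of is my own; H and Y are whatever event and unknown you choose, and here I reuse Result=Heads and Coin:

```python
def posterior_of(H, Y):
    # Θ(ω) = P(H | Y = Y(ω)): defined via the probability operator,
    # but still an unknown, i.e. a function of the possible world.
    return lambda w: cond_prob(H, lambda v: Y(v) == Y(w))

Coin = lambda w: w["Coin"]
Theta_from_Y = posterior_of(heads, Coin)
print(all(Theta_from_Y(w) == Theta(w) for w in worlds))  # True: same unknown
```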
Interestingly, these unknown probabilities are defined in terms of posterior probabilities. That is, you can think of Θ above as the posterior probability that you will update to, after learning Y. This posterior probability is unknown because Y is unknown. This leads to a statement of conservation of expected evidence:

E[Θ]=P(H|B)

You may have heard it said that the expected posterior is the prior. Naively, that seems like a type error: probabilities are not unknowns, so we can't have expectations of them. But with the concept of an unknown probability, we can take it literally.
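Taking it literally, the statement can be checked by direct enumeration, again reusing the sketch above (with the assumed prior P(Coin=Fair)=1/2, both sides come out to 3/4):

```python
# E[Θ] = Σ_ω P(ω) Θ(ω), which should equal the prior P(H|B).
expected_theta = sum(w["prob"] * Theta(w) for w in worlds)
print(expected_theta)  # 3/4
print(prob(heads))     # 3/4: the expected posterior is the prior
```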
Exercise for the reader: does this contradict The Principle of Predicted Improvement? How should the unknown posterior probability in that post be defined?