An unknown probability sounds like a type error.
There are unknowns, such as the result of a coin flip.
And there are probabilities that these unknowns take certain values, such as the probability that the flip comes up heads.
As a formula,
P(Result=Heads)=1/2
The unknown, the result of the flip, is inside the probability operator.
The probability, 1/2, is outside.
They're not the same kind of thing.
But suppose we have, for every possible value θ of the unknown Θ,
P(H|B,Θ=θ)=θ
where H and B are propositions.
Then it makes sense to say that Θ is an unknown probability of a hypothesis H, against background knowledge B.
Some natural examples:
Suppose a coin is flipped, and you don't know whether it's a fair coin, or a trick coin that is heads on both sides.
Let Θ be 1/2 if it's a fair coin, and 1 if it's a trick coin.
When sampling randomly from a population, let Θ be the fraction of individuals in the population that have a certain property.
Suppose a die is rolled, but due to imprecise manufacturing, its shape deviates from a perfect cube.
Consider the phase space of the die: six dimensions for position and momentum of the center of mass, and an additional six dimensions for three angles and three angular velocities.
Label each point in this phase space according to the face of the die that comes up when rolled with that point as an initial condition.
Let Θ be the fraction of the phase space labeled 6.
As long as the prior distribution for the initial condition of the roll is sufficiently wide, this should be almost exactly the unknown probability that the die comes up 6.
(This is based on Jaynes's discussion of coins in section 10.6 of "Probability Theory: The Logic of Science". The idea of an unknown shape is inspired by an actual analysis of dice data as inference of manufacturing defects.)
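To make the phase-space picture concrete, here is a minimal Python sketch of the computation being described: sample initial conditions from a wide prior, label each with the face it produces, and take the fraction labeled 6. The label_face function below is a made-up placeholder, not a physical model; a real version would integrate the equations of motion for the actual, slightly defective shape.

```python
import random

# Toy stand-in for the die's dynamics: maps an initial condition in a
# drastically simplified two-dimensional "phase space" to the face that
# comes up. This placeholder just carves the space into six regions of
# slightly unequal size, as a crudely imperfect die might.
def label_face(angle, spin):
    x = (angle + 0.37 * spin) % 1.0
    if x < 0.18:          # face 6 gets a bit more than 1/6 of the space
        return 6
    return int((x - 0.18) / 0.164) + 1   # faces 1-5 split the rest

# A wide prior over initial conditions: uniform over the toy phase space.
samples = [(random.random(), random.random()) for _ in range(100_000)]

# Θ is the fraction of phase space labeled 6.
theta = sum(label_face(a, s) == 6 for a, s in samples) / len(samples)
print(f"Estimated Θ ≈ {theta:.3f}")   # about 0.18 for this toy shape
```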
I could have simplified that last one considerably by just saying that Θ is the probability you would assign to 6, if you knew the shape of the die.
In fact, all of these unknown probabilities are really probabilities you would assign if you had some additional piece of information.
But before we get there, I'll go through the first example, the coin example, in more formalism than is really necessary.
I'd rather be tedious than mysterious.
(By the way, this will already be familiar if you've taken a probability class based on measure theory, where conditional expectations are defined as a certain random variable. Also, the unknown probability is the "p" from Jaynes's "Ap", and the formula above is really the same as 18.1 in his "Probability Theory: The Logic of Science".)
Flipping a Possibly Trick Coin
The formalism is as follows.
Each unknown is a function.
It maps a possible world to the value of the unknown in that world.
(If you're familiar with the more usual terminology: what I'm calling an "unknown" is a random variable, and what I'm calling a "possible world" is an outcome.)
For the coin example we'll want four possible worlds.
One example of an unknown that I've already mentioned is "Result":
Result(ω1)=Tails
Result(ω2)=Heads
Result(ω3)=Heads
Result(ω4)=Heads
It will be more clear what these possible worlds mean when I list all of the unknowns.
I'll put this in a tabular form, so that "Result" will be one column of the table.

      Coin     Side    Result    Θ
ω1    Fair     A       Tails     1/2
ω2    Fair     B       Heads     1/2
ω3    Trick    A       Heads     1
ω4    Trick    B       Heads     1
Four possible worlds: two coins, times two sides.
I added this "Side" unknown to distinguish between the two sides of the trick coin, but I'm not actually going to use it.
For the trick coin, "Result" is Heads whichever "Side" comes up.
Θ simply depends on Coin.
They're both functions of the possible world, though, so we can define it as
Θ(ω) = 1/2 if Coin(ω)=Fair
Θ(ω) = 1 if Coin(ω)=Trick
You can see from the table that Coin and Θ are redundant, intersubstitutable.
Their values pick out the same sets of possible worlds.
We can use that substitution as follows:
P(Result=Heads|Θ=1/2)=P(Result=Heads|Coin=Fair)=1/2
And likewise,
P(Result=Heads|Θ=1)=1
Taking those two together, we have, for every possible value θ of Θ,
P(Result=Heads|Θ=θ)=θ
So it makes sense to call Θ an unknown probability of Result=Heads, as long as we have no other relevant background knowledge.
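Here is a small Python sketch of this four-world model. For concreteness it weights all four worlds equally, which is an assumption (the post doesn't fix a prior over the two coins); the conditional check below holds for any prior that splits the fair coin's two sides evenly. The prob helper is just an illustrative counting measure.

```python
from fractions import Fraction

# The four possible worlds from the table; each unknown is a function of
# the world, stored here as a dict entry. All four worlds are weighted
# equally (an assumption for concreteness).
half = Fraction(1, 2)
worlds = [
    {"Coin": "Fair",  "Side": "A", "Result": "Tails", "Theta": half},
    {"Coin": "Fair",  "Side": "B", "Result": "Heads", "Theta": half},
    {"Coin": "Trick", "Side": "A", "Result": "Heads", "Theta": Fraction(1)},
    {"Coin": "Trick", "Side": "B", "Result": "Heads", "Theta": Fraction(1)},
]

# Conditional probability over equally likely worlds: restrict to the
# worlds where the condition holds, then count.
def prob(event, given=lambda w: True):
    cond = [w for w in worlds if given(w)]
    return Fraction(sum(1 for w in cond if event(w)), len(cond))

# Check that P(Result=Heads | Θ=θ) = θ for every possible value of Θ.
for theta in (half, Fraction(1)):
    p = prob(lambda w: w["Result"] == "Heads",
             given=lambda w, t=theta: w["Theta"] == t)
    print(f"P(Result=Heads | Θ={theta}) = {p}")   # prints 1/2, then 1
```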
Note that it shouldn't matter if we "expand" the space of possible worlds, for example by having each possible world represent a trajectory of the flip through phase space.
We can consider each of these four possible worlds as the output of some lossy function from a richer space of possible worlds.
The lost information doesn't affect our analysis because the unknowns of interest can be defined in terms of the more coarse-grained description.
What You Would Believe
Now we can return to the idea that an "unknown probability" is really a probability we would assign, with more information.
In the coin flip case, it was the probability we would assign to heads, if we knew whether it was the trick or the fair coin being flipped.
And if we didn't know anything else relevant, such as the initial conditions of the flip.
Though this notation makes it a little annoying to write this formally, we can do it as follows:
Θ(ω)=P(H|B,Y=Y(ω))
Y=Y(ω) is a valid proposition because Y(ω) is simply some particular value that Y can take.
This formula says that the "unknown probability" at a world is a conditional probability, with that world's value of the unknown Y behind the conditional bar.
The unknown probability is defined in terms of the probability operator, but it's an unknown like any other: a function of a possible world.
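For instance, in the coin example Y is Coin and (taking B to be empty) Θ(ω3)=P(Result=Heads|Coin=Coin(ω3))=P(Result=Heads|Coin=Trick)=1, matching the table above.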
Interestingly, these unknown probabilities are defined in terms of posterior probabilities.
That is, you can think of Θ above as the posterior probability that you will update to, after learning Y.
This posterior probability is unknown because Y is unknown.
This leads to a statement of conservation of expected evidence:
E[Θ]=P(H|B)
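Concretely, in the coin example: if the fair and trick coins are equally likely a priori (an assumption, since the post doesn't fix this prior), then E[Θ]=(1/2)(1/2)+(1/2)(1)=3/4, and indeed P(Result=Heads)=3/4 over the four equally weighted worlds. The Python sketch below builds Θ directly from the posterior formula and checks the identity:

```python
from fractions import Fraction

# Same four equally weighted worlds as before (equal prior over the two
# coins is an assumption; the post doesn't fix it).
worlds = [
    {"Coin": "Fair",  "Result": "Tails"},
    {"Coin": "Fair",  "Result": "Heads"},
    {"Coin": "Trick", "Result": "Heads"},
    {"Coin": "Trick", "Result": "Heads"},
]

def prob(event, given=lambda w: True):
    cond = [w for w in worlds if given(w)]
    return Fraction(sum(1 for w in cond if event(w)), len(cond))

def heads(w):
    return w["Result"] == "Heads"

# Θ(ω) = P(H | Y=Y(ω)) with H the proposition Result=Heads and Y = Coin:
# the posterior you would update to after learning which coin it is.
def Theta(w):
    return prob(heads, given=lambda v: v["Coin"] == w["Coin"])

# Conservation of expected evidence: E[Θ] = P(H|B).
expected_theta = sum(Theta(w) for w in worlds) / len(worlds)
print(expected_theta, prob(heads))   # both print 3/4
```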
You may have heard it said that the expected posterior is the prior.
Naively, that seems like a type error: probabilities are not unknowns, so we can't have expectations of them.
But with the concept of an unknown probability, we can take it literally.
Exercise for the reader: does this contradict The Principle of Predicted Improvement? How should the unknown posterior probability in that post be defined?