If probability is in the map, then what is the territory? What are we mapping when we apply probability theory?
"Our uncertainty about the world, of course."
Uncertainty, yes. And sure, every map is, in a sense, a map of the world. But can we be more specific? Say, for a fair coin toss, what particular part of the world do we map with probability theory? Surely it's not the whole world at the same time, is it?
"It is. You map the whole world. Multiple possible worlds, in fact. In some of them the coin is Heads in the others it's Tails, and you are uncertain which one is yours."
Wouldn't that mean that I need to believe in some kind of multiverse to reason about probability? That doesn't sound right. Even if those "possible worlds" existed, how am I supposed to know that's the case?
"Well, you don't necessary have to believe that there are parallel worlds as real as ours in which the coin comes differently, though it's a respectable position about the nature of counterfactuals. Probability is in the mind, remember? You can simply imagine alternative worlds that are logically consistent with your observations."
Even so, this can't be the way things work. Humans are not, in fact, able to hold the whole world in their minds and validate its logical consistency, let alone multiple worlds. If that were actually required, then probability-theoretic reasoning would be accessible only to literally galaxy-brained superintelligences.
"Well, in practice we simply imagine that we've imagined the whole world, do not notice any contradictions and call it a day without thinking too hard about the matter. It's a standard practice in philosophy."
Why does this not sound reassuring at all, I wonder? Wait a second, it's even worse than that. What about uncertainty about logic? Say I try to guess whether the 1,253,725,569th digit of pi is even or odd. Only one answer is actually logically consistent with my observations, but I don't know which one. Does that mean that I can't use probability theory to reason about it?
"Oh yes, formalizing logical uncertainty is a well known unsolved problem. We kind of pretend that we can still use probability theory with it as usual, even though it doesn't make sense in the framework of possible worlds, and it seems to work out just fine."
If in practice probability theoretic reasoning works the same way with logical and physical uncertainty, but only one of them makes sense in the framework of possible worlds, doesn't it mean that we need to throw away this framework and come up with something better?
"We've tried with the framework of centred possible worlds, which, in addition to all the worlds, specify the observer's place in them. Didn't really help with logical uncertainty. But it highlighted that we have troubles with indexical uncertainty too. It's now also a respected framework among philosophers."
Well, that went about as well as one might expect from adding extra complexity to a system just for the sake of extra complexity, without fixing the already existing problems in it. Maybe we are better off starting from scratch?
"Feel free to try."
Map of Uncertainty
The first thing that comes to mind is that this framework of possible worlds is not how map-territory relations usually work. Usually a map represents some part of the territory, instead of multiple territories, some of them imaginary.
"But probability theory needs to represent our uncertainty about the territory, not just state facts about it. That's the source of this difference."
Maps can already represent our uncertainty. Consider two towns with a similar overall layout. Buildings are positioned in the same places relative to the center of the town, but used differently. For example, the building that is a grocery store in the first town is a theater in the other. Suppose we have a not-very-detailed map of the first town. It captures the layout but doesn't specify the function of each building. Such a map represents the second town just as much as the first. As a matter of fact, such a map would represent any town with the same building layout, whether it was built in the past, will be built in the future, or never existed at all. The map pinpoints the layout.
"And what does it have to do with representing our uncertainty?"
The map represents our uncertainty between all the towns with this particular layout. The level of detail on the map corresponds to our knowledge. The less specific the map is, the less we know, and vice versa. Learning a new fact about the town adds a new detail to the map and changes what kind of towns it represents. So if I know that all buildings have red roofs, I'm uncertain between all the towns with this layout and red roofs on all of the buildings, and my map now includes an additional detail: the roof color of the buildings.
"I see. But we still need to be able to talk about all the individual towns."
Do we, though? The whole point of math is to logically pinpoint a general principle, instead of talking about every individual example separately. We say 2+2=4, as a thing in itself, without having to hold in mind all the individual objects and processes from the real world for which it's true. We simply imply that for any objects and processes that work exactly like addition of natural numbers, two and two of such objects put through such a process will result in four of such objects. We can do the exact same thing here.
"But that's because we have a formal model of arithmetic. While in this case, you are just vaguely gesturing to map-territory relations. This is not the same thing. Also Sample Space is a set in our mathematical model. It has elements. If these are not worlds, then what are they?"
Fair enough. Let's get there.
Definition of Probability Experiment
Addition of natural numbers is an abstraction for multiple objects and processes in the physical world. It represents both throwing rocks into a bucket and letting sheep out to the pasture - very different things. Likewise, we need some abstraction for multiple tosses of a coin, which are, in fact, different physical processes. Properties of the coin, the way force is applied to it and how much, environmental factors - lots of things can vary. They are just irrelevant to our knowledge state. All that we know is that some coin tosses result in Heads and some in Tails, and no matter how many tosses we've already observed, we can't guess the outcome of the next one better than chance...
"Still sounds like vague gesturing to me."
...thankfully there is already a mathematical model for what we need! A function. So let there be a function

$$E : \mathbb{N} \to \Omega$$

For every natural number, it has one specific value from the sample space. We will call this function a probability experiment. And we will say that $E(i)$ is the $i$-th iteration (or trial) of the probability experiment. And that $E(i) = \omega$ means that $\omega$ is the outcome of the $i$-th iteration of the probability experiment.
And we also demand that outcomes of different trials are statistically independent from each other.
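This definition can be sketched in code. The sketch below is my illustration, not part of the formalism: the name coin_experiment and the trick of seeding an RNG with the trial index are assumptions, used only to make the experiment a genuine function from trial index to outcome.

```python
import random

# Sketch: a probability experiment is a function from trial index
# (a natural number) to an outcome from the sample space.
SAMPLE_SPACE = ("Heads", "Tails")

def coin_experiment(i: int) -> str:
    """Outcome of the i-th iteration of the coin tossing experiment.

    Seeding an RNG with the trial index makes this a genuine function:
    the same i always maps to the same outcome, while different trials
    look like independent fair draws.
    """
    rng = random.Random(i)
    return rng.choice(SAMPLE_SPACE)

# The function has one specific value for every natural number:
outcomes = [coin_experiment(i) for i in range(1, 11)]
heads_ratio = outcomes.count("Heads") / len(outcomes)
```

Note that re-running this code gives the same outcomes every time: the "randomness" lives entirely in our ignorance of which outcome a given index maps to, not in the function itself.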
"So how does it help us?"
We now have a mathematical model that corresponds to actual experiments in the real world. When I toss the coin and observe that it's Heads, instead of saying that I observe a world in which the coin is Heads, I can say: "The outcome of the 1-st iteration of the coin tossing probability experiment is Heads", where the 1-st iteration corresponds to exactly this trial. I don't need to talk about "possible worlds" at all. And then I can toss the coin more times to learn the outcomes of trials 2, 3, and so on. Which gives me a pretty good idea about the properties of the function, and therefore about the general act of tossing a coin.
Probability Experiment is in the Mind
"Isn't it just objective probabilities and Frequentism all over again?"
It does have all the advantages of Frequentism, but not its problems. The notion of a probability experiment is not claimed to be about "objective probabilities". It's very explicitly a map approximating the territory to the level of our knowledge. It accounts for our uncertainty in a similar manner to Bayesianism; it just doesn't frame this as uncertainty about "possible worlds" but as uncertainty about the outcome of a particular iteration of the experiment.
"Can you give me a specific example how your approach to probability differs from Frequentism, and how it doesn't perform worse than Bayesianism?"
Sure. Consider this problem:
1. There is a coin that may be biased in any way
2. It's biased
3. It's biased 1:2
4. It's biased 1:2 in favor of Heads
For each of your knowledge states 1-4, what is the probability that each next toss of this coin results in Heads?
"A Bayesian would answer 1/2, 1/2, 1/2 and 2/3."
Naturally. Meanwhile, a Frequentist who believes that probability is an inherent property of the coin has problems with the first three questions. If the coin may be (or indeed is) biased, then it seems that its probability of coming up Heads can't be 1/2, and until they toss this particular coin they can't have a coherent estimate. Only after they know exactly how the coin is biased can they give an answer.
"So how does your notion of probability experiment help here?"
As soon as you understand that the probability experiment approximates your knowledge state about the coin, instead of being about its objective properties, it all adds up to normality - that is, to Bayesianism. The probability experiment doesn't consist of the tosses of this particular coin. Instead it consists of all the tosses of all the types of coins about which our knowledge works the same way as for this one.
So for knowledge state 1 it's simply all kinds of coins, fair and biased in all kinds of ways. Our coin toss is one of them, but we have no idea which, so we are indifferent between all the iterations of this experiment. All that is left is to reason about the ratio of Heads. Coins biased in opposite ways cancel each other out, and so we get an equal ratio of Heads and Tails. Therefore, the probability is 1/2.
After we've learned that the coin is biased, we know that our coin toss can't be a toss of a fair coin. So to capture our new knowledge state, we remove the tosses of a fair coin from the experiment. Now we are indifferent between all the tosses of all kinds of unfair coins. This, however, doesn't affect the ratio between Heads and Tails, and so the probability is again 1/2.
For 3, our probability experiment consists only of the tosses of two coins: one biased 1:2 in favor of Tails and the other likewise biased in favor of Heads. Once again, about half of the coin tosses in such an experiment are Heads, which corresponds to the probability of Heads being 1/2.
And in case 4, we need to exclude the coin biased in favor of Tails. So only the tosses of a coin biased 1:2 in favor of Heads are left and therefore P(Heads) = 2/3.
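The four answers can be checked with a short computation. This is my illustrative sketch, not the post's method: each knowledge state is represented as the set of coin biases it is compatible with, and the equal-weight grid of biases used for states 1 and 2 is a simplifying assumption - only the symmetry around 1/2 matters for the result.

```python
# Sketch: a knowledge state is the set of coin biases (each bias given as
# the coin's ratio of Heads) compatible with what we know. The probability
# of Heads is the overall ratio of Heads across tosses of all these coins.
def p_heads(coins):
    return sum(coins) / len(coins)

# 1. May be biased in any way: an equal-weight grid of biases,
#    symmetric around 1/2 (the grid itself is a simplifying assumption).
state1 = [b / 10 for b in range(0, 11)]
# 2. Definitely biased: remove the fair coin; the symmetry remains.
state2 = [b for b in state1 if b != 0.5]
# 3. Biased 1:2, direction unknown: two coins, 1/3 and 2/3 Heads.
state3 = [1 / 3, 2 / 3]
# 4. Biased 1:2 in favor of Heads: a single coin.
state4 = [2 / 3]

results = [p_heads(s) for s in (state1, state2, state3, state4)]
# results ≈ [0.5, 0.5, 0.5, 0.667] - matching the Bayesian answers
```

Removing the fair coin in state 2 changes which tosses belong to the experiment but not the Heads ratio, which is exactly why the answer stays 1/2.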
"I see. That's kind of neat, actually. It provides an intuitive-for-a-Frequentist explanation of Bayesianism, and has a decent chance to switch them to the light side. But I'm already a Bayesian. What is the value of this framework for me? What are these advantages of Frequentism you were talking about?"
It provides a principled way to assign a sample space to a given problem: we can simply perform the experiment multiple times and observe the outcomes. Likewise, we can always infer the probabilities of events from their frequencies.
But beyond that, it allows us to get rid of these "possible worlds" which were leading everyone astray. Now, instead of speculating about some weird metaphysics that we have no idea about, we explicitly approximate some process in the real world. This provides a unified way to reason about any uncertainty, be it physical, logical or indexical, which, as far as I can tell, solves all the paradoxes.
Logical Uncertainty
"And how does it solve problems with logical uncertainty? Let's go back to your example with not knowing whether 1,253,725,569th digit of pi is even or odd. No matter how many times we check it, the answer is still the same. So it's a deterministic experiment with only one outcome."
By the same logic, tossing a coin is also deterministic, because if we toss the same coin exactly the same way in exactly the same conditions, the outcome is always the same. But that's not how we reason about it. Just like we've generalized the coin tossing probability experiment from multiple individual coin tosses, we can generalize a checking-whether-some-previously-unknown-digit-of-pi-is-even-or-odd probability experiment from multiple individual checks of different unknown digits of pi. About half of them are even and about half odd, so the initial probability estimate of 1/2 is justified. As soon as we've properly accounted for our uncertainty, a deterministic experiment with only one outcome turns into a probability experiment with multiple outcomes, and everything adds up to normality.
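The claim that about half of the digits are even can be checked directly. The sketch below uses a standard unbounded spigot generator for the decimal digits of pi (a well-known formulation, not something from this post) and counts even digits among the first 1,000.

```python
def pi_digits():
    """Yield the decimal digits of pi: 3, 1, 4, 1, 5, 9, ...

    An unbounded spigot algorithm: it produces digits one at a time
    using only integer arithmetic, with no precision limit.
    """
    q, r, t, j = 1, 180, 60, 2
    while True:
        u = 3 * (3 * j + 1) * (3 * j + 2)
        y = (q * (27 * j - 12) + 5 * r) // (5 * t)
        yield y
        q, r, t, j = (10 * q * j * (2 * j - 1),
                      10 * u * (q * (5 * j - 2) + r - y * t),
                      t * u, j + 1)

gen = pi_digits()
digits = [next(gen) for _ in range(1000)]
even_ratio = sum(1 for d in digits if d % 2 == 0) / len(digits)
# even_ratio lands close to 1/2, matching the 1/2 credence
# for the parity of a digit we know nothing about
```

Each check of a different unknown digit is one iteration of the same probability experiment, and the frequency of even outcomes across iterations is what justifies the 1/2 estimate.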
"But aren't you supposed to account for all the information you have? Here you clearly know that we are talking specifically about 1,253,725,569th digit of pi. Why do you simply ignore this information and start talking about some unknown digit of pi instead?"
I'm supposed to account for all the relevant information and ignore all the irrelevant. This is how mathematical models work in principle. I may know a lot of facts about apples, but when I'm reasoning about addition, I abstract away from them. Addition of apples works the same way as addition of other objects, so apple-specific facts do not matter and I can simply apply a general principle. Likewise, my ignorance about the 1,253,725,569th digit of pi works exactly the same way as my ignorance about any other digit of pi that I know nothing about, so I generalize in the same way as well.
"But the fact that it's specifically the 1,253,725,569th digit of pi can be relevant. Suppose that digit is written on a card and shown to you. Now you are certain whether it's even or odd. But this certainty doesn't generalize to other digits of pi. If you were shown this card while wondering about, say, the 1,000,000,000,011th digit of pi, you wouldn't know any better."
Naturally, as soon as I know something about the 1,253,725,569th digit of pi, my knowledge state about it can't be represented by checking whether some previously unknown digit of pi is even or odd. But this is not some unique property of this particular digit; the same principle applies to all the other digits as well. So it does, indeed, generalize. For any digit of pi unknown to me that is written on a card and can be shown to me, my uncertainty works the same way.
The Territory
"Okay, I'll need to think more about it."
Please do. For now, we're almost ready to answer the initial question. We can notice that coin tossing is, in fact, similar to not knowing whether some digit of pi is even or odd. There are two outcomes with an equal ratio among the iterations of the probability experiment. I can take the model from coin tossing, apply it to the evenness of some digit of pi unknown to me, and get a correct result. So we can generalize even further and call both of them, and any other probability experiment with the same properties, a probability experiment with two equiprobable outcomes:

$$E_2$$

Likewise we can talk about $E_n$ - a generalized notion of any probability experiment with $n$ equiprobable outcomes.
"A lot of probability experiments are not like that, though. Oftentimes outcomes are not equiprobable, like in the example with a coin biased 1:2 in favor of Heads."
True. So to capture them as well, we need to generalize even further and introduce the concept of a weighted probability experiment

$$E_n(w_1, \dots, w_n)$$

where $w_i$ is the weight of the $i$-th outcome - its ratio to all the other outcomes of the probability experiment.
"So the territory that probability is in the map of is..."
All the processes in the real world, such that my knowledge state about them works like a weighted sample of n elements.
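A weighted probability experiment can be sketched the same way as before (the function name and the seeding scheme are my illustration, not part of the formalism): the weights of the outcomes reappear as their long-run frequencies across iterations.

```python
import random
from collections import Counter

def weighted_experiment(i, outcomes, weights):
    """Outcome of the i-th iteration of a weighted probability experiment.

    Seeding with the trial index makes this a fixed function of i,
    with outcome frequencies matching the weights in the long run.
    """
    rng = random.Random(i)
    return rng.choices(outcomes, weights=weights, k=1)[0]

# A coin biased 1:2 in favor of Heads: weights (2, 1).
trials = [weighted_experiment(i, ("Heads", "Tails"), (2, 1))
          for i in range(10000)]
heads_ratio = Counter(trials)["Heads"] / len(trials)
# heads_ratio settles near the weight ratio 2/3
```

The knowledge state from case 4 above is exactly this experiment: the weight of Heads relative to all outcomes is 2/3, and so is its frequency.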