Land War in Eurasia
Bayes' theorem is a tool for updating probabilities according to evidence.
Consider a real-world example of something important I was recently wrong about. On February 23, I found out Russia had invaded Ukraine. When disruptive events happen, you must respond immediately. I quickly read what the experts were saying on Foreign Affairs and combined it with my existing knowledge of history, technology and geopolitics. I wrote "There is going to be a war. Ukraine is going to lose. The question is how much, how quickly and on what terms."
Then I remembered that war is unpredictable and war experts are often wrong. I added qualifiers before publishing. "There is probably going to be a war. Ukraine is probably going to lose. The question is how much, how quickly and on what terms."
A week later, reports of under-equipped soldiers began trickling in and I discovered that Russia's economy was much smaller than I had assumed. I updated my analysis. As the war calcified, I finally had time to research current weapons technology and build my own model of the war from a tactics-level foundation.
Bayes' Theorem
Over the course of the war, several observations changed my probability estimates of the outcome.
- Experts believe Russia will win.
- Evidence arrived that Russian troops might be under-equipped.
- I discovered the Russian economy is the size of Florida's.
- Yep. It's not just propaganda. Russian troops really are under-equipped. Video evidence confirms it.
- Russia's assault has slowed to a crawl.
Each time new evidence came in, my analysis of the situation shifted away from a Russian blitzkrieg-style fait accompli.
Bayesian probability treats probability as a subjective state that is updated in response to evidence.
- Let $A$ denote a Russian fait accompli.
- Let $B$ denote the observation that experts believe Russia will perform a fait accompli.
For example, my initial prior probability $P(A)$ of "Russia will perform a fait accompli" started out at 0.5[1].
When I discovered that experts believed Russia would win, I updated my estimate. I estimated the probability that experts would believe in a Russian victory, conditional upon a Russian victory, to be $P(B|A) = 0.9$. I estimated the unconditional probability of experts believing in a Russian victory to be $P(B) = 0.5$.
After observing that experts believed in a Russian victory, I wanted to find out the probability of a Russian victory conditional on experts believing in a Russian victory. Bayes' Theorem states that the probability of $A$ given $B$ equals the prior probability of $A$ multiplied by an update factor $\frac{P(B|A)}{P(B)}$:

$$P(A|B) = P(A) \cdot \frac{P(B|A)}{P(B)}$$
Plugging in the numbers gives $P(A|B) = 0.5 \cdot \frac{0.9}{0.5} = 0.9$. Thus, I arrived at a probability of 0.9 that "Ukraine is probably going to lose. The question is how much, how quickly and on what terms."
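If you prefer code to notation, here is a minimal sketch of that single update in Python. The numbers are the retconned illustrative values from above, and `bayes_update` is just a hypothetical helper name.

```python
def bayes_update(prior, likelihood, marginal):
    """Return P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior * likelihood / marginal

# Retconned illustrative values from the text.
p_A = 0.5            # P(A): prior probability of a Russian fait accompli
p_B_given_A = 0.9    # P(B|A): experts predict a Russian victory, given a fait accompli
p_B = 0.5            # P(B): experts predict a Russian victory, unconditionally

print(bayes_update(p_A, p_B_given_A, p_B))  # 0.9
```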
Bayesian Chains
Unconfirmed hints suggest that Russian troops might be under-equipped. Evidence $C$ has arrived.
- My prior probability of "Russia will perform a fait accompli ($A$) conditional on the observed fact that experts have predicted a fait accompli ($B$)" is $P(A|B) = 0.9$.
- My probability of observing $C$ conditional on $A \wedge B$ is $P(C|A \wedge B) = 0.2$.
- We can use the Bayes equation to calculate the new probability of a Russian victory conditional on $B \wedge C$:

$$P(A|B \wedge C) = P(A|B) \cdot \frac{P(C|A \wedge B)}{P(C|B)}$$
The precise numbers aren't important. What matters is how Bayesian probabilities are chained together. Each bit of evidence multiplies our prior probability by an update factor to get a new probability. Suppose we have a finite set of observations $B_1, B_2, \dots, B_n$. We can chain them together:

$$P(A|B_1 \wedge \dots \wedge B_n) = P(A) \cdot \prod_{i=1}^{n} \frac{P(B_i|A \wedge B_1 \wedge \dots \wedge B_{i-1})}{P(B_i|B_1 \wedge \dots \wedge B_{i-1})}$$
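Here is a sketch of that chaining in code. The update factors below are invented stand-ins for the ratios $\frac{P(B_i|A \wedge \dots)}{P(B_i|\dots)}$; in a coherent model they are constrained so the running product stays between 0 and 1.

```python
def chain_updates(prior, update_factors):
    """Multiply a prior by a sequence of Bayesian update factors P(B_i|A,...) / P(B_i|...)."""
    posterior = prior
    for factor in update_factors:
        posterior *= factor
    return posterior

# Invented update factors standing in for the war-example evidence.
print(chain_updates(0.5, [1.8, 0.4, 0.6]))  # ≈ 0.216
```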
But now we have a problem. For every finite set of observations $B_1, \dots, B_n$ and every $\epsilon > 0$, there exists a prior $P(A) > 0$ such that $P(A|B_1 \wedge \dots \wedge B_n) < \epsilon$. In other words, for any finite body of observable evidence, there exists a sufficiently strong prior such that we can ignore the evidence.
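To make that concrete with invented numbers: even if each observation multiplies the probability of $A$ tenfold (a fine approximation to an odds update while the probability is tiny), a sufficiently extreme prior still leaves the posterior negligible.

```python
prior = 1e-12           # an extremely strong prior against A
update_factor = 10.0    # each observation multiplies the (tiny) probability of A tenfold
posterior = prior * update_factor ** 3
print(posterior)        # ~1e-09: three tenfold updates later, the evidence is still "ignored"
```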
Sufficiently strong priors are unfalsifiable. All knowledge that isn't garbage is based on the concept of falsifiability. Reality is that which is falsifiable. If a statement isn't falsifiable then that statement isn't about reality.
Bayesian probability ultimately rests on an unfalsifiable conviction.
Corollary: Irresolvable Prior Disagreement
If two people disagree about the unconditional prior $P(A)$ but agree about all of the update factors $\frac{P(B_i|A \wedge \dots)}{P(B_i|\dots)}$, then it will be impossible for them to come into exact agreement, because no probability is ever allowed to equal zero. Finite evidence cannot resolve a priori disagreement.
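A sketch of the corollary, written in odds form, where the shared quantity is the likelihood ratio $\frac{P(B_i|A)}{P(B_i|\neg A)}$ rather than the update factor above; the likelihood ratios and priors are invented.

```python
def update(prior, likelihood_ratios):
    """Odds-form Bayes: multiply prior odds by each agreed-upon likelihood ratio."""
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

shared_evidence = [3.0, 2.0, 4.0]      # likelihood ratios both agents agree on
print(update(0.99, shared_evidence))   # ≈ 0.9996 (strong prior for A)
print(update(0.01, shared_evidence))   # ≈ 0.195  (strong prior against A)
# Same evidence, same arithmetic, and the disagreement never vanishes.
```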
Stack Traces
The most important question in rationalism is "Why do you believe what you believe?" Suppose you asked a Bayesian "Why do you believe $A$ with confidence $P(A)$?" There are two ways a Bayesian can respond.
- The Bayesian can claim $P(A)$ is a prior. That's not empiricism. That's just blind faith.
- The Bayesian can use Bayes' Theorem: I believe $A$ with confidence $P(A|B)$ because $P(A|B) = P(A) \cdot \frac{P(B|A)}{P(B)}$.
Option (1) is a claim of blind faith. Option (2) is three claims of blind faith. If you ask a Bayesian "Why do you believe $A$ with confidence $P(A|B)$?" the answer is either blind faith ("it's a prior") or "Because I believe $P(A)$, $P(B|A)$ and $P(B)$." If you ask "Why do you believe $P(A)$, $P(B|A)$ and $P(B)$?" then the logical chain either never terminates or it terminates in blind faith.
That's not even the worst problem with Bayesian philosophy.
The Belief-Value Uncertainty Principle
Suppose you asked me "What is the sum of $2+2$?" Usually, I would answer "$4$", but if I were in the Ministry of Love's Room 101 and they were threatening me with torture then I would answer "$5$".
A person's beliefs are not observable. Only a person's behavior is observable. Bayesians believe that people believe things and then optimize the world according to a value function. But there is no way to separate the two because:
- For every observed behavior and every possible belief there exists a value function which could produce the behavior.
- For every observed behavior and every possible value function there exists a belief which could produce the behavior.
A person's beliefs and values are intimately related the way a quantum wave packet's position is related to its momentum. You cannot observe both of them independently at the same time.
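A toy illustration of that entanglement, with an invented umbrella scenario and made-up numbers: two very different (belief, value) pairs recommend the exact same action, so observing the action alone cannot tell you which pair produced it.

```python
def best_action(p_rain, cost_of_carrying, cost_of_getting_wet):
    """Pick whichever action has the higher expected utility (both utilities are negative costs)."""
    eu_carry = -cost_of_carrying
    eu_leave = -p_rain * cost_of_getting_wet
    return "carry umbrella" if eu_carry > eu_leave else "leave it"

# Believes rain is likely, mildly dislikes getting wet.
print(best_action(p_rain=0.9, cost_of_carrying=0.5, cost_of_getting_wet=1.0))   # carry umbrella
# Believes rain is unlikely, hates getting wet.
print(best_action(p_rain=0.1, cost_of_carrying=0.5, cost_of_getting_wet=10.0))  # carry umbrella
```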
Summary
Bayesian Probability has the following problems.
- The answer to "Why do you believe $A$?" is always reducible to priors, which are non-falsifiable. Evidence has no effect on priors.
- Rational agents with wildly differing priors are (usually) unable to come into even approximate agreement when provided with scarce evidence.
- Rational agents who disagree about unconditional priors but agree about evidence likelihoods and conditional probabilities should be able to come into agreement. Instead, such Bayesians are provably unable to ever reach exact agreement if they use Bayes' Theorem. This is the opposite of how empiricism should work.
- Identifying someone else's beliefs requires you to separate a person's value function from their beliefs, which is impossible.
Frequentism
Frequentism defines probability as the limit of an event's relative frequency across a large number of independent trials. Frequentist probability is objective because it is defined in terms of falsifiable real-world observations. An objective definition of probability can be used to resolve disagreements between scientists. A subjective definition cannot. That's why Frequentist probability is used as the foundation of science and Bayesian probability isn't.
The purpose of Frequentist probability is to bound uncertainties. Suppose you observe 100 white swans and 0 black swans. The probability you will observe a black swan by repeating the same experiment is probably not orders of magnitude higher than $\frac{1}{100}$. Suppose the probability of observing a black swan is actually $\frac{1}{10}$. The probability of observing zero black swans in a sample of 100 swans is $\left(1 - \frac{1}{10}\right)^{100} \approx 2.7 \times 10^{-5}$. That's tiny.
If you can run a large number of independent trials then it is easy to show with extremely high confidence that the frequency of an event is below some small threshold $\epsilon$. Our Frequentist confidence in theories like electromagnetism has so many nines (0.999999…) that it rounds to 1.
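A minimal sketch of that kind of bound, reusing the swan numbers from above:

```python
def prob_of_zero_hits(epsilon, n):
    """Probability of observing zero events in n independent trials if the true rate is epsilon."""
    return (1 - epsilon) ** n

print(prob_of_zero_hits(0.1, 100))   # ≈ 2.7e-05: a 1-in-10 black-swan rate is all but ruled out
print(prob_of_zero_hits(0.01, 100))  # ≈ 0.37: a 1-in-100 rate is still entirely plausible
```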
There are three dangers to Frequentist probability:
- Sample reporting bias. Sample reporting bias is prevented via basic scientific hygiene such as pre-registering experiments.
- Long tails. Frequentist probability will get you killed if you apply it naïvely to a long-tailed anti-inductive system like financial markets. That's because Frequentist probability optimizes median (and other percentile) outcomes whereas long-tailed systems punish you according to average outcomes. Bhuntut has a cool argument about how the log utility of average outcomes is a consequence of optimizing for median (and other percentile) outcomes.
- Small data. Frequentism defines probability in terms of a large number of independent trials. How do you construct probabilities when you don't have "a large number of independent trials"?
Reductionism
Imagine you want to build a space elevator. Space elevators are expensive. You cannot build a million different versions and observe which ones work. You must get it right the first time. How can you do this when Frequentist probability requires "a large number of independent trials" to be meaningful at all?
You silo the problem into modular components.
Material reduction is the art of breaking a physical system into small subsystems with bounded interdependence. A space elevator depends strongly on the tensile strength of its core material. Laboratory material science experiments are cheap compared to building space elevators. You can build small bits of carbon nanotube cable on Earth, test their tensile strength and expose them to space-like radiation. Do this over and over again until you're confident in the integrity of your materials.
Break your problem into small components. Break those components into smaller components. Break those components into even smaller components. Do this until you get to the level of atoms and photons. (We know how to go even smaller than that but applying quantum field theory to an engineering challenge usually causes more problems than it solves.) If you're confident about how a material behaves then you can be confident about how an object built out of that material behaves.
Material reduction allows you to leverage your high confidence about the behavior of small things (which you have lots of data about) to confidently predict the behavior of big things (which you have little data about).
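One crude way to make that leverage concrete (the component counts and failure bounds are invented, and the union bound is only one of several ways to compose component-level confidence):

```python
def system_failure_upper_bound(component_failure_bounds):
    """Union bound: P(at least one component fails) <= sum of per-component failure bounds."""
    return min(1.0, sum(component_failure_bounds))

# Per-component failure bounds established by cheap, heavily repeated lab trials.
components = [1e-6] * 10_000   # ten thousand components, each bounded at one-in-a-million
print(system_failure_upper_bound(components))  # ≈ 0.01 upper bound on whole-system failure
```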
Most importantly, it's falsifiable. If I believe the result of a small-scale experiment is $X$ and you think it's $Y$ then we can quickly be brought into agreement by just running the experiment.
[1] The probabilities used in this article are retconned for illustrative purposes only. I did not use precise numeric probabilities in my analysis at the time. ↩︎
Why not both?
I would argue that the properties Bayesians think are important are important, and the properties frequentists think are important are also important.
This favors a view in which we track both subjective and objective probability.
Frequentists say that probabilities are external to humans and hence cannot be definitively measured. In my experience, frequentists are under no illusion that objective probabilities can be known. Frequentist procedures can get unlucky and reject a true hypothesis.
The definition -- probabilities are limiting frequencies of repeated experiments -- does not in any way imply that results can't cluster in weird ways in the sequence of experiments. For example, the statement that the frequency of an event limits to 0 is totally consistent with the first billion experiments displaying frequency 1. You need the (unjustified and unjustifiable) IID assumption to rule this out even probabilistically.
(I'll grant that frequentists have some great ideas, but imho the IID assumption is not one of them.)