I read this comment, and after a bit of rambling I realized I was as confused as the poster. A bit more thinking later, I ended up with the “definition” of probability under the next heading. It’s not anything groundbreaking, just a distillation (specifically, mine) of things discussed here over time. It’s just what my brain thinks when I hear the word.
But I was surprised and intrigued when I actually put it in writing and read it back and thought about it. I don’t remember seeing it stated like that (but I probably read some similar things).
It probably won’t teach anyone anything, but it might trigger a similar “distillation” of “mind pictures” in others, and I’m curious to see that.
What “probability” is...
Or, more exactly, what is the answer to “what’s the probability of X?”
Well, I don’t actually know, and it probably depends on who asks. But here’s the skeleton of the answer procedure:
1. Take the set of all (logically) possible universes. Assign to each universe a finite, real value m (see below).
2. Eliminate from the set those that are inconsistent with your experiences. Call the remaining set E.
3. Construct T, the subset of E where X (happens, or happened, or is true).
4. Assign to each universe u in set E a value p, such that p(u) is inversely proportional to m(u), and the integral of p over set E is 1.
5. Calculate the integral of p over the set T. The result is called “the probability of X”, and is the answer to the question.
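To make the skeleton concrete, here is a minimal sketch in code, under the simplifying assumption that the set of universes is finite (so the “integrals” become plain sums). All the names in it (probability_of_x, consistent, x_holds) are mine, not part of the schema:

```python
# A toy rendition of steps 1-5 for a *finite* collection of universes.
# 'universes' is a list of (universe, m) pairs (step 1 already done);
# 'consistent(u)' stands for "u agrees with my experiences";
# 'x_holds(u)' stands for "X happens / happened / is true in u".
def probability_of_x(universes, consistent, x_holds):
    # Step 2: eliminate universes inconsistent with experience; the rest is E.
    E = [(u, m) for (u, m) in universes if consistent(u)]
    # Step 4: p(u) is inversely proportional to m(u), normalized to sum to 1 over E.
    total = sum(1.0 / m for (_, m) in E)
    # Steps 3 and 5: "integrate" p over T, the subset of E where X holds.
    return sum((1.0 / m) / total for (u, m) in E if x_holds(u))

# Toy usage: six die-roll universes, all with m = 1; X = "the roll is even".
universes = [(face, 1.0) for face in range(1, 7)]
print(probability_of_x(universes, lambda u: True, lambda u: u % 2 == 0))  # 0.5
```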
I’m aware that this isn’t quite a definition; in fact, it leaves more unsaid (undefined) than it explains. Nevertheless, to me it seems that the structure itself is right: people might have different interpretations for the details (and, like me, be uncertain about them), but those differences would still be mentally structured like above.
In the next section I explain a bit where each piece comes from and what it means, and in the one after I’m going to ramble a bit.
Clarifications
About (logically possible) universes: We don’t actually know what our universe is; as such, “other possible universes” isn’t quite a well-defined concept. For generality, the only constraint I put above is that they be logically possible, simply because the description is (vaguely) mathematical and I don’t have any idea what math without logic would mean. (I might be missing something, though.)
Note that by “universe” I really mean an entire universe, not just “until now”. E.g., if it so happens your experiences allow for a single possible past (i.e., you know the entire history), but your universe is not deterministic, there are still many universes in E (one for each possible future); if it’s deterministic, then E contains just one universe. (And your calculations are a lot easier...)
Before you get too scared or excited by the concept of “all possible universes”, remember that not all of them are actually used in the rest of the procedure. We actually need only those consistent with experience. That’s still a lot when you think about it, but my mind seems to reel in panic more often when I forget this point. (Lest this note make you too comfortable, I must also mention that the possibility that experience is (even partly) simulated explodes the size of E.)
About that real value m I was talking about: “m” comes from “measure”, but that’s a consequence of how I arrived at the schema above. Even now I’m not quite sure it belongs there, because it depends on what you think “possible universes” means. If you just set it to 1 for all universes, everything works.
But, for example, you might consider that the set U of possible universes is countable, encode them all as bit-strings using a well-defined rule, and use the Kolmogorov complexity of the bit-string encoding a universe as that universe’s measure. (Given step 4 above, this would mean that you think simpler universes are more probable; except it doesn’t quite mean that, because “probable” is defined only after you’ve picked your “m”. It’s probably closer to “things that happen in simpler universes are more probable”; more on this in the musings section.)
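Since Kolmogorov complexity itself is uncomputable (as noted further below), here is only a crude stand-in to illustrate the idea: using compressed length as a proxy for the complexity of a universe’s encoding. The proxy and every name in it are my own invention, not part of the schema:

```python
import os
import zlib

# Hypothetical proxy for a universe's measure m: the compressed length of
# its bit-string encoding. More regular encodings compress better, so by
# step 4 they end up with more probability mass.
def m_proxy(universe_encoding: bytes) -> float:
    return float(len(zlib.compress(universe_encoding)))

regular = b"01" * 500         # a very regular "universe"
irregular = os.urandom(1000)  # an incompressible one (almost surely)
assert m_proxy(regular) < m_proxy(irregular)
```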
A bit about the math: I used some math terms a bit loosely in the schema above. Depending on exactly what you mean by “possible universes”, the set of them might be finite, countably infinite, or uncountable, or it might be a proper class rather than a set. Depending on which, “integrating” becomes a different operation (a plain sum, the limit of a series, an actual integral, or something more exotic). If you can’t (mathematically, not physically) do such an operation on your collection of possible universes (actually, on those in E), then you have to define your own concept of probability :-P
With regard to computability, note that the series of steps above is not an algorithm; it’s just the definition. It doesn’t feel intuitive to me that there is any possible universe where you could actually follow the steps above, but math surprises me in that regard sometimes. Note, though, that you don’t really need the exact probability of X: you just need a good-enough approximation, and you’re free to use any trick you want.
Musings
If the above didn’t interest you, the rest probably won’t, either. I’ve put here the most interesting consequences of the schema above. It’s kind of rambling, and I apologize; as in the last section, each paragraph leads with its topic, so you might just skim for the ones that might interest you.
I found it interesting (but not surprising) to note that Bayesian statistics correspond well to the schema above. As far as I can tell, the Bayesian prior for (any) X is the number assigned in step 5; Bayesian updating is just going back to step 2 whenever you have new experiences. The interesting part is that my description smells frequentist. I wasn’t that surprised because the main difference (in my head) between the two is the use of priors; frequentist statistics ignore prior knowledge. If you just do frequentist statistics on every possible event in every possible universe (for some value of possible), then there is no “prior knowledge” left to ignore.
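In the same toy terms as the sketch above (and again with names of my own choosing), the correspondence is almost embarrassingly direct: updating is just re-running step 2 with the new experience, and letting steps 4 and 5 renormalize p over whatever remains:

```python
# Hypothetical sketch: Bayesian updating as a re-run of step 2.
# E is a list of (universe, m) pairs; new evidence simply shrinks it.
def update_E(E, new_evidence_holds):
    return [(u, m) for (u, m) in E if new_evidence_holds(u)]

# E.g., learning "the roll is at least 4" before asking about evenness:
E = update_E([(face, 1.0) for face in range(1, 7)], lambda u: u >= 4)
# p(even) over the new E is 2/3 (faces 4 and 6 out of {4, 5, 6}).
```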
The schema above describes only true/false–type problems. For non-binary problems you just split E in step 3 into several subsets, one for each possible answer. If the problem is real-valued you need to split E into an uncountably infinite number of sets, but I’ve abused set theory terms enough today that I’m not very concerned. Anyway, in practice (in our universe) it’s usually enough to split the domain of the value into countably many intervals, according to the precision you need, and split the universes in E according to which interval they fall in (a sketch follows). That is, you don’t actually need to know the probability that a value is, say, sqrt(2), just that it’s closer to sqrt(2) than you can measure.
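A hypothetical sketch of that splitting, again for a finite E; value_of(u) stands for the real-valued quantity the question asks about, and width for the precision you care about:

```python
import math

# Split E by which interval the value falls into, and sum p per interval.
# 'E_with_p' is a list of (universe, p) pairs with p already normalized.
def split_by_interval(E_with_p, value_of, width):
    masses = {}
    for u, p in E_with_p:
        k = math.floor(value_of(u) / width)   # index of the interval
        masses[k] = masses.get(k, 0.0) + p
    return masses  # interval index -> probability the value lands there
```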
With regard to past discussions about a rationale for rationality, observe that it’s possible to apply the procedure above to evaluate what the “rational way” is, supposing you define it by “the rational guy plays to win”. The modified procedure (sketched in code right after this list) goes:
- Instead of step 3, generate the set of decision procedures that are applicable in all universes in E; call it D.
- For each d in D, split E into the universes where you win and those where you lose (don’t win); call these W(d) and L(d).
- Instead of step 4, calculate for each decision procedure d its “winningness”: the integral of p over W(d) divided by the integral of p over L(d) (with p defined as above).
- Instead of step 5, pick a decision procedure d0 whose winningness is maximal (no other has a larger value).
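And the promised sketch, once more for a finite E and with hypothetical names; wins(d, u) stands for “decision procedure d wins in universe u”:

```python
# Pick the decision procedure with maximal "winningness".
# 'E_with_p' is a list of (universe, p) pairs; 'procedures' is D.
def best_procedure(E_with_p, procedures, wins):
    def winningness(d):
        won = sum(p for u, p in E_with_p if wins(d, u))       # mass of W(d)
        lost = sum(p for u, p in E_with_p if not wins(d, u))  # mass of L(d)
        return won / lost if lost > 0 else float("inf")  # never losing maxes it out
    return max(procedures, key=winningness)
```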
Note that I’ve no idea whether doing this would actually pick the procedure above itself, nor what exactly it would mean if it doesn’t... Of course, if it does, it’s still circular, like any “reason for reasoning”. The procedure might also give different results for people with different E. I found it interesting to contemplate that it might be “possible” for someone in another universe (one much friendlier to applied calculus than ours) to calculate exactly the solution of the procedure for my E, while at the same time the best procedure for approximating it in my universe gives a different answer. They can’t, of course, communicate this to me (since then they wouldn’t be in a different universe in the sense used above).
If your ontology implies a computable universe (and thus that all the universes in E you need to consider are computable), you might want to use Kolmogorov complexity as a measure for the universes. I’ve no idea which encoding you should use to calculate it; there are theorems saying the difference in complexity between two encodings is bounded by a constant, but I don’t see why certain encodings couldn’t be biased enough to have systematic effects on your probability calculations. (Other than “it’s kind of a long shot”.) You might use the above procedure for deciding on decision procedures, of course :-P
There’s also a theorem that says you can’t actually make a program to compute the Kolmogorov complexity of an arbitrary bit-string. There might be a universe–to–bit-string encoding that generates only bit-strings for which such a program exists, but that’s also kind of a long shot.
If your ontology implies quantum mechanics then I think the measure of the universes (m(u) in step 1) must involve wave functions somehow, but my understanding of QM doesn’t allow me to think it through much.
The schema above illuminated a bit something that puzzled me in that comment I was talking about at the beginning: say you are suddenly sent to the planet Progsta, and a Sillpruk comes and asks you whether the game of Doldun will be won by the team Strigli; what’s your prior for the answer? What puzzled me was that the very fact that you were asked that question communicates an enormous amount of information (see this comment of mine for examples), and yet I couldn’t actually see how that should affect my priors. Of course, the information content of the question hugely restricts the universes in my E. But there were so many to begin with that what remains is still huge; more importantly, the question restricts the universes along boundaries that I’ve not previously explored, and I don’t have ready heuristics to estimate that little p above:
If I throw a (fair) die, I can split the universes into six approximately equal parts on vague symmetry justifications, and just estimate the probability of each side as 1/6. If someone on the street asks me to bet against him on his dice, I can split the universes into those where I win and those where I lose, and estimate (using a kind of Monte Carlo integration over the various scenarios I can think of) that I’ll probably lose. If I encounter an alien named Sillpruk, I’ve no idea how to split the universes to estimate the result of a Doldun match. But if I were to encounter lots of aliens with strange first questions for a while, I might develop some such simple heuristics by trial and error.
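For what it’s worth, here’s what that Monte Carlo trick looks like in the same toy terms. The draw function is assumed to sample universes from E with probability proportional to p, which of course hides most of the actual difficulty:

```python
import random

# Approximate the integral of p over T by sampling instead of enumerating.
def estimate_probability(draw, x_holds, n=100_000):
    return sum(x_holds(draw()) for _ in range(n)) / n

# The die again: the estimate comes out near 0.5.
print(estimate_probability(lambda: random.randint(1, 6), lambda u: u % 2 == 0))
```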
PS.
I’m sorry if this was too long or just stupid. In the former case I welcome constructive criticism — don’t hesitate to tell me what you think should have been cut. I hereby subject myself to Crocker’s Rules. In the latter case... well, sorry :-)
It certainly is. That is why the schema above fascinated me when I made it explicit: although the concepts involved are not rigorously defined, the relationships between them (as expressed in the schema) feel rigorously correct. (ETA: A bit like noticing that elementary particles parallel a precise mathematical structure, but not yet knowing what particles are and why they do that.)
In a manner of speaking, the schema explicitly moves the confusion about “what probability is” to “what possible worlds to consider”. This is done implicitly in many problems of probability (which reference class to pick for the outside view, or how to count possible outcomes), but the schema makes it much clearer. (To me, at least; I published this in case it has a similar effect on others.)
I’m not sure I agree. The schema doesn’t say what collection of universes to use, but I don’t see why you couldn’t just rigorously define one of your choosing and use it. (If I’m missing something here, please give me a link.) Note that the one you picked can be “wrong” in some sense, and thus the probabilities you obtain might not be helpful, but I don’t see a reason why you can’t do it even if it’s wrong.
Interestingly, the schema does provide a theoretical way of noticing you picked a bad collection of worlds: if you end up with an empty E (the subset of worlds agreeing with your experiences), then you certainly picked badly.
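In the toy terms used earlier, that check is a one-liner (hypothetical names again):

```python
# If filtering by experience empties E, the chosen collection of possible
# worlds was certainly bad.
def filter_E(universes, consistent):
    E = [u for u in universes if consistent(u)]
    if not E:
        raise ValueError("E is empty: bad collection of possible worlds")
    return E
```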
I’m a bit fuzzy about what it means when your experiences “consistently” happen to be improbable (but never impossible) according to your calculation. In Bayesian terms, you correctly update on every experience, but your predictions keep being wrong. The schema seems correct, so either you picked a bad collection of possible worlds, or you just happen to be in a world that’s simply unpredictable (in the sense that even if you pick the best decision procedure possible, you still lose; the world just doesn’t allow you to win). In the latter case it’s unsurprising (you can’t win, so it’s “normal” that you can’t find a winning strategy), but the former case should allow you to “escape” by finding the correct set of worlds.
Note that I intentionally didn’t mention preferences anywhere in the post. (Actually, I meant to make it explicit that they’re orthogonal to the problem, I just forgot.)
The question of preferences seems to me perfectly orthogonal to the schema. That is, if you pick a set of possible worlds, but you still can’t define preferences well, then you’re confused about preferences. If, say, you have a class of “possible world sets”, and you can rigorously define your preferences within each such set, but you can’t pick which set to use, then you’re confused about your ontology.
In other words, the schema allows the subdivision of confusion into two separate sources of confusion. It only helps in the sense that it transforms a larger problem into two smaller problems; it’s not its fault that they’re still hard.
There’s a related but more subtle point that fascinated me: even if you don’t like the schema because it’s not helpful enough, it still seems correct. No matter how else you specify your problem, if you do it correctly it will still be a special case of the schema, and you’re still going to have to face it. In a sense, how well you can solve the schema above is a bound on how rational you are.
For me it was a bit like finding out that a certain problem is NP-complete: once you know that, you can find special cases that are easier to solve but still useful, and make do with them; but until your problem-solving is NP-strong, you know that you can’t solve the general case. (This doesn’t prevent CS researchers from investigating properties of those and even harder problems.)
ETA: And the reason it was fascinating was that seeing the schema gave me a much clearer notion of how hard it is.
If it's wrong, then it's not clear what exactly we are doing. If you run out of sample space, there is no way to correct the mistake, because that's what the sample space means: the options still available.
The problem of choosing a sample space for Bayesian updating is the same problem as finding a formalism for encoding a solution to the ontology problem (a preference that is no longer in danger of being defined in terms of misconceptions).
(And if there is no way to define a correct "sample space"...)