Ah yes this was confusing to me for a while too, glad to be able to help someone else out with it!
The key thing for me to realise was that the probability of 21 heads in a row changes as you toss each of those 21 coins.
The sequence of 21 heads in a row does indeed have much less than a 0.5 chance: to be precise 0.5^21, which is 0.000000476837158. But it only has such a tiny probability before any of those 21 coins have been tossed. As soon as the first coin is tossed, the probability of those 21 coins all being heads changes. If the first coin is tails, the probability of all 21 coins being heads drops to 0; if the first coin is heads, it rises to 0.5^20. Say you by unlikely luck keep tossing heads. Then with each additional head in a row, the probability of all 21 coins being heads climbs steadily, until by the time you've tossed 20 heads in a row, the probability of all 21 being heads is now... 0.5, i.e. the same as the probability of a single coin toss being heads! And our apparent contradiction is gone :)
The more 'mathematical' way to express this would be: the unconditional probability of tossing 21 heads in a row is 0.5^21, i.e. 0.000000476837158, but the probability of tossing 21 heads in a row conditional on having already tossed 20 heads in a row is 0.5.
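A minimal sketch of the arithmetic above (just a fair coin, no assumptions beyond independence):

```python
# Unconditional probability of 21 heads in a row with a fair coin.
p_21 = 0.5 ** 21
print(p_21)  # 4.76837158203125e-07

# Conditional probability of 21 heads given the first 20 were heads:
# P(21 heads AND first 20 heads) / P(first 20 heads) = 0.5^21 / 0.5^20
p_20 = 0.5 ** 20
print(p_21 / p_20)  # 0.5
```

The division in the last line is exactly the definition of conditional probability applied to this case.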
Let me know if any of that is still confusing.
I think you explain it very well!
So the thing is something like the following, right?: "Looking at it from the outside, a world where 21 heads showed in a row is incredibly unlikely: (if the coin is fair) I would happily bet against this world happening. However, I am already in an incredibly weird world where 20 heads have shown in a row, and another heads only makes it a bit more weird, so I don't know what to bet, heads or tails."
A sequence of 100 heads is only half as likely as a sequence of 99 heads. Which is why the probability of the 100th coinflip being head is exactly one half.
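The "half as likely" claim is a one-line calculation (fair coin assumed):

```python
p99 = 0.5 ** 99    # probability of a given sequence of 99 heads
p100 = 0.5 ** 100  # probability of a given sequence of 100 heads
print(p100 / p99)  # 0.5 -- exactly half as likely
```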
One way to think of this: Uncertainty (at least on this level) is in the observer, not the coin. It comes up heads or comes up tails, with 100% chance of the thing that actually happens.
Before the flip, you assign 50% to each outcome, but that’s your uncertainty, not the coin’s, and the result may as well be secretly predetermined by the universe. After you’ve seen 20 heads, that part is now probability 100% (it’s knowledge, not uncertainty, on your part), and the next flip is still 50/50 (to your knowledge, presuming you have reason to trust the coin and not update toward an unfair flipper).
Are there 2 (or more) types of probabilities and I am just mixing them up
Yes, there are conditional probabilities and unconditional probabilities.
The unconditional probability of 21 heads in a row is 0.5^21[1].
The conditional probability of 21 heads in a row given that the first 20 were all heads is 0.5.
Conditional probability is just a division: the conditional probability of some event A given that B happened is just the unconditional probability of both A and B divided by the probability of B. In symbols: P(A | B) = P(A & B) / P(B).
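The division formula can be checked by brute-force simulation. A small sketch (the three-flip setup and the 100,000-trial count are arbitrary choices, not from the comment above): here B = "first 2 flips are heads" and A = "all 3 flips are heads", so P(A | B) should come out near 0.5.

```python
import random

random.seed(0)
trials = 100_000
count_b = 0   # trials where B happened (first two flips heads)
count_ab = 0  # trials where both A and B happened (all three flips heads)

for _ in range(trials):
    flips = [random.random() < 0.5 for _ in range(3)]
    if flips[0] and flips[1]:
        count_b += 1
        if flips[2]:
            count_ab += 1

# Estimate of P(A | B) = P(A & B) / P(B); should be close to 0.5
print(count_ab / count_b)
```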
Bayes' Law comes from simple algebra on this.
[1] As is common, this assumes that the coin flips are independent of one another. An alternative might be that the coin was flipped "lazily" such that it more often shows the same face as the previous flip, but over the long run still flips 50% heads. A "properly" flipped coin should not depend upon the results of any or all previous flips.
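The "lazy" coin is easy to simulate. A sketch under an assumed model (the `stickiness` parameter and its 0.8 value are my invention, purely for illustration): each flip repeats the previous face with probability 0.8, yet the long-run frequency of heads is still 50% by symmetry, so independence fails even though the marginal probability looks fair.

```python
import random

random.seed(1)

def lazy_flips(n, stickiness=0.8):
    """A 'lazy' coin: repeats the previous face with probability
    `stickiness` (hypothetical model, not a real coin)."""
    flips = [random.random() < 0.5]  # first flip is fair
    for _ in range(n - 1):
        if random.random() < stickiness:
            flips.append(flips[-1])       # repeat previous face
        else:
            flips.append(not flips[-1])   # switch faces
    return flips

flips = lazy_flips(200_000)

# Long-run frequency of heads is still about 0.5...
print(sum(flips) / len(flips))

# ...but the flips are not independent: P(heads | previous heads) is ~0.8
after_heads = [b for a, b in zip(flips, flips[1:]) if a]
print(sum(after_heads) / len(after_heads))
```

For such a coin, 20 heads in a row genuinely would make a 21st head more likely, which is exactly why the independence assumption matters.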
There is one thing I don't understand about probabilities:
If we toss a coin, there is a 50% chance that it shows heads or tails. If we do it 20 times and all of them show heads, there is still a 50% chance that the next one shows heads, since the tosses are independent. However, we also know that series of X tosses all showing heads become increasingly improbable as X grows. So, although there is a 50% chance that the toss shows heads again, at the same time the probability that it shows heads again is lower.
Why do we have to take into account one piece of information and not the other one when finding the probability that the next toss will show heads or tails? Are there 2 (or more) types of probabilities and am I just mixing them up (I'm thinking of things like the reported "probabilities" that polls show about one party or another getting elected in an election, for example)? Is the difference related to ergodicity (time vs ensemble averages)?