Interesting. The natural approach is to imagine that you just have a 3-sided die with 2, 4, 6 on the sides, and if you do that, then I compute A = 12 and B = 6[1]. But, as the top Reddit comment's edit points out, the difference between that problem and the one you posed is that your version heavily weights the probability towards short sequences—that weighting being 1/2^n for a sequence of length n. (Note that the numbers I got, A=12 and B=6, are so much higher than the A≈2.7 and B=3 you get.) It's an interesting selection effect.
The thing is that, if you roll a 6 and then a non-6, in an "A" sequence you're likely to just die due to rolling an odd number before you succeed in getting the double 6, and thus exclude the sequence from the surviving set; whereas in a "B" sequence there's a much higher chance you'll roll a 6 before dying, and thus include this longer "sequence of 3+ rolls" in the set.
To illustrate with an extreme version, consider:
A: The expected number of rolls of a fair die until you roll two 6s in a row, given that you succeed in doing this. You ragequit if it takes more than two rolls.
Obviously that's one way to reduce A to 2.
Excluding odd rolls completely, so the die has a 1/3 chance of rolling 6 and a 2/3 chance of rolling an even number that's not 6, we have:
A = 1 + 1/3 * A2 + 2/3 * A
Where A2 represents "the expected number of die rolls until you get two 6's in a row, given that the last roll was a 6". Subtraction and multiplication then yields:
A = 3 + A2
And if we consider rolling a die from the A2 state, we get:
A2 = 1 + 1/3 * 0 + 2/3 * A
= 1 + 2/3 * A
Substituting:
A = 3 + 1 + 2/3 * A
=> (subtract)
1/3 * A = 4
=> (multiply)
A = 12
For B, a similar approach yields the equations:
B = 1 + 1/3 * B2 + 2/3 * B
B2 = 1 + 1/3 * 0 + 2/3 * B2
And the reader may solve for B = 6.
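A quick simulation bears these numbers out (a rough sketch; the helper names are my own):

```python
import random

def d3():
    return random.choice([2, 4, 6])

def rolls_until_two_sixes_in_a_row():
    rolls, run = 0, 0
    while run < 2:
        rolls += 1
        run = run + 1 if d3() == 6 else 0
    return rolls

def rolls_until_second_six():
    rolls, sixes = 0, 0
    while sixes < 2:
        rolls += 1
        if d3() == 6:
            sixes += 1
    return rolls

n = 100_000
print(sum(rolls_until_two_sixes_in_a_row() for _ in range(n)) / n)  # ~12
print(sum(rolls_until_second_six() for _ in range(n)) / n)          # ~6
```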
By the way, this is my first non-question post on LessWrong after lurking for almost a year. I've made some guesses about the mathematical maturity I can assume, but I'd appreciate feedback on whether I should assume more or less in future posts (and general feedback about the writing style, either here or in private).
This is a lovely little problem, so thank you for sharing it. I thought at first it would be [a different problem](https://www.wolfram.com/mathematica/new-in-9/markov-chains-and-queues/coin-flip-sequences.html) that's similarly paradoxical.
Which of the following do you think is bigger?
A: The expected number of rolls of a fair die until you roll two 6s in a row, given that all rolls were even.
B: The expected number of rolls of a fair die until you roll the second 6 (not necessarily in a row), given that all rolls were even.
If you are unfamiliar with conditional expectation, think of it this way: Imagine you were to perform a million sequences of die rolls, stopping each sequence when you roll two 6s in a row. Then you throw out all the sequences that contain an odd roll. The average number of rolls in the remaining sequences should be close to A. Next, perform a million sequences of die rolls, stopping each sequence when you roll the second 6. Throw out all the sequences among these that contain an odd roll. The average number of rolls in the remaining sequences should be close to B.
I asked something like this on r/math about a year ago, and even with the hint that the answer was paradoxical, the early consensus was that A must be larger. The justification was more or less the following: any time you roll until reaching two 6s in a row, you will have also hit your second 6 at or before then. So regardless of what the conditions are, A must be larger than B.
But the correct answer is actually B. What on earth is going on?
A quick verification
Before we proceed, let's write some code to estimate A and B. The only goal here is to be as unambiguous as possible, so the code will be almost comically unoptimized in both run-time and length.
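Here is one Python version of that simulation (the function names are just for illustration):

```python
import random

def estimate_A(num_sequences):
    # Roll until two 6s in a row; keep only the sequences with no odd roll,
    # and average their lengths.
    total, kept = 0, 0
    for _ in range(num_sequences):
        rolls = []
        while len(rolls) < 2 or rolls[-2:] != [6, 6]:
            rolls.append(random.randint(1, 6))
        if all(r % 2 == 0 for r in rolls):
            total += len(rolls)
            kept += 1
    return total / kept

def estimate_B(num_sequences):
    # Roll until the second 6 (not necessarily in a row); keep only the
    # sequences with no odd roll, and average their lengths.
    total, kept = 0, 0
    for _ in range(num_sequences):
        rolls = []
        while rolls.count(6) < 2:
            rolls.append(random.randint(1, 6))
        if all(r % 2 == 0 for r in rolls):
            total += len(rolls)
            kept += 1
    return total / kept

print(estimate_A(100_000))
print(estimate_B(100_000))
```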
The 100,000s in the last two lines represent the number of sequences being rolled for the estimate; you can add zeros for accuracy or subtract zeros for faster run-time.
The estimate for A should be close to 2.727 and the estimate for B should be close to 3.000. The exact value for A is 30/11 and the exact value for B is 3, but it is helpful to first unambiguously experimentally verify that B is greater than A (and ensure that we are on the same page of what A and B mean) before diving into the possibly unintuitive math.
Primer on geometric distributions
We'll begin by calculating the expected number of rolls of a die to see the first 6. To do so, we first find the probability that it takes exactly k rolls to see the first 6. This means the first k−1 rolls were not 6 and roll k was 6.
The probability that a die rolls a 6 is 1/6 and the probability it does not is 5/6. Following the rule of product for independent probabilities, we get:
$$\Pr(\text{First 6 on roll } k)=\left(\frac{5}{6}\right)^{k-1}\cdot\frac{1}{6}$$
We can now get a formula for the expected number of rolls of a die until we see the first 6. The formula for expectation gives:
$$E[\text{num rolls until 6}]=\sum_{k=1}^{\infty}k\cdot\Pr(\text{First 6 on roll } k)=\sum_{k=1}^{\infty}k\left(\frac{5}{6}\right)^{k-1}\frac{1}{6}$$
Now we'll use the following fact: for −1 < x < 1:
$$\sum_{k=1}^{\infty}kx^{k-1}=\frac{1}{(1-x)^2}$$
This can be obtained by starting with the formula for geometric series $\sum_{k=0}^{\infty}x^k=\frac{1}{1-x}$ and taking the derivative of both sides (if you remember calculus) or squaring both sides (if you're very good at algebra). Plugging in, we have:
$$E[\text{num rolls until 6}]=\frac{1}{6}\cdot\frac{1}{\left(1-\frac{5}{6}\right)^2}=6.$$
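To spell out the differentiation step behind the series fact:
$$\frac{d}{dx}\sum_{k=0}^{\infty}x^{k}=\sum_{k=1}^{\infty}kx^{k-1},\qquad\frac{d}{dx}\,\frac{1}{1-x}=\frac{1}{(1-x)^{2}}.$$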
And we are done. Sort of.
Let's try that again, this time using an intuitive trick from Markov chains. We'll use "average" and "expected" interchangeably as the former is more colloquial and we are going to be a bit informal here.
Let x be the average number of rolls until we see the first 6. Let's roll the die once. With probability 1/6, we rolled a 6 and can stop. With probability 5/6, we didn't roll a 6 and are then still an average of x steps away from a 6.
So with probability 1/6, we are in a scenario where we take 1 roll to see a 6, and in the remaining probability 5/6, it will take an average of x+1 steps to see a 6. So the average number of rolls until we see the first 6 is (1/6)(1)+(5/6)(x+1). But the average is also x! This gives us the algebra equation:
$$x=\frac{1}{6}+\frac{5}{6}(x+1)$$
that gives x=6 when solving for x.
Let's generalize now. Let's say we have some experiment that has fixed probability p of success. We repeat the experiment until it succeeds. Then if x is the expected number of trials until success, we have:
$$x=p(1)+(1-p)(x+1)\;\implies\;x=p+x-px+1-p\;\implies\;px=1\;\implies\;x=1/p.$$
Probability distributions of this form are called geometric distributions. In this case, the experiment was rolling a die and success was defined by seeing a 6, so it is a geometric distribution with success rate 1/6. And so the expected number of trials until success is 1/(1/6) = 6.
Rolls until first 6, given all even
In this section, we will use D6 to refer to a fair 6-sided die with sides labeled 1 through 6 and D3 to refer to a fair 3-sided die with sides labeled 2, 4, 6. Consider the following two questions:
1. What is the expected number of rolls of a D3 until the first 6?
2. What is the expected number of rolls of a D6 until the first 6, given that all rolls were even?
For the first question, we have a geometric distribution of success rate 1/3, so the expected number of trials until success is 1/(1/3) = 3.
The second question was posed by Gil Kalai in a 2017 blog post. Most people incorrectly answered 3 (and keep in mind the audience for that blog is fairly math literate). The rationale was that the second question seems equivalent to the first. But let's calculate it explicitly.
Analogously to the last section, we begin by calculating the probability that it takes exactly k rolls to see the first 6, given that all rolls were even. Following standard conventions, we'll write Pr(X∣Y) as shorthand for "Probability that X occurs, given that Y occurs". From the formula for conditional probability, we have:
$$\Pr(X\mid Y)=\frac{\Pr(X\text{ and }Y)}{\Pr(Y)}$$
Let's start with the numerator. If it takes us exactly k rolls to see our first 6 and all rolls in the process were even, then the first k−1 rolls were all 2 or 4 and the kth roll was a 6. The probability of this occurring is $\left(\frac{2}{6}\right)^{k-1}\left(\frac{1}{6}\right)$.
The denominator is the total probability that we roll a 6 before the first odd. One way we could determine this is by evaluating $\sum_{i=1}^{\infty}\left(\frac{2}{6}\right)^{i-1}\left(\frac{1}{6}\right)$ (that is, summing the probability it takes exactly i rolls to get a 6 and all rolls were even, over all possible values of i). We saw how to sum those kinds of series in the last section.
But a more intuitive way is as follows - rephrase "Probability we roll a 6 before the first odd" as "Probability that between the sides {1,3,5,6}, 6 is the first to show up". From here, we can immediately see by symmetry that the probability is 1/4. Indeed, summing the above series gives the same answer.
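Concretely:
$$\sum_{i=1}^{\infty}\left(\frac{2}{6}\right)^{i-1}\left(\frac{1}{6}\right)=\frac{1/6}{1-\frac{1}{3}}=\frac{1}{4}.$$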
Altogether, we have:
$$\Pr(\text{First 6 on roll } k\mid\text{all rolls even})=\frac{\left(\frac{2}{6}\right)^{k-1}\left(\frac{1}{6}\right)}{1/4}=\frac{2}{3}\left(\frac{1}{3}\right)^{k-1}$$
and so:
$$E[\text{rolls until first 6}\mid\text{all rolls even}]=\sum_{k=1}^{\infty}k\cdot\frac{2}{3}\left(\frac{1}{3}\right)^{k-1}=\frac{2}{3}\left(\frac{1}{1-\frac{1}{3}}\right)^{2}=\frac{3}{2}$$
There is another, cuter, way to answer the second problem that will be important for our evaluation of B. We will first rephrase the question as "What is the expected number of rolls of a D6 until the first 6, given that 6 is the first to occur out of 1,3,5,6?". We can rephrase this again as "What is the expected number of rolls of a D6 until the first side in 1,3,5,6 shows up, given that 6 is the first to occur out of 1,3,5,6?".
Now we have some neat symmetry - the expected number of rolls of a D6 until the first side in 1,3,5,6 shows up shouldn't depend on which of those four sides happened to be first. So the following will have the same answer as the second question: "What is the expected number of rolls of a die until the first side in 1,3,5,6 shows up?"
That's a geometric distribution with success rate 4/6! And so its expectation is 1/(4/6) = 3/2.
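If you want to sanity-check the 3/2 numerically, here is a quick Monte Carlo sketch (the function name is my own):

```python
import random

def estimate_kalai(num_sequences=10**6):
    # Average number of rolls until the first 6, over sequences in which
    # every roll up to and including that 6 was even; other sequences
    # are thrown out.
    total, kept = 0, 0
    for _ in range(num_sequences):
        rolls = []
        while True:
            rolls.append(random.randint(1, 6))
            if rolls[-1] == 6:
                break
        if all(r % 2 == 0 for r in rolls):
            total += len(rolls)
            kept += 1
    return total / kept

print(estimate_kalai())  # should print something close to 1.5, not 3
```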
Adding conditions after the first 6
We'll now make a modification that might seem harmless. We will roll a D6 one billion times (and keep going until we get a 6 if we somehow didn't get it in the first one billion rolls). What is the expected number of rolls until the first 6, given that every roll that ever showed up in this process is even?
In other words, we are still looking at the number of rolls until the first 6, and we are still requiring that all rolls before the first 6 are even. But now we are also requiring that rolls after the first 6 are even. Does this impact the expectation? Intuitively one might think "no", but we'll see that the expectation nearly doubles.
$$E[\text{number of rolls}\mid\text{evens up to 6 and billionth roll}]=\sum_{k=1}^{\infty}k\cdot\frac{\Pr(\text{first 6 on roll } k\text{ and evens up to 6 and billionth roll})}{\Pr(\text{evens up to 6 and billionth roll})}$$
Now even under the assumption of having all evens show up in the first billion rolls, having no 6 show up in the first billion rolls is extremely unlikely. So with extreme accuracy, we can approximate the denominator as just the probability that the first billion rolls are all even: $\left(\frac{1}{2}\right)^{1000000000}$.
Now for k at most one billion, the probability that the first 6 is on roll k and there are only evens up to the billionth roll is:
$$\left(\frac{2}{6}\right)^{k-1}\left(\frac{1}{6}\right)\left(\frac{1}{2}\right)^{1000000000-k}=\left(\frac{1}{3}\right)\left(\frac{2}{3}\right)^{k-1}\left(\frac{1}{2}\right)^{1000000000}.$$
For k greater than one billion, the expressions are slightly different, but the contributions are so astronomically small at this point that we can pretend they are the same with essentially no change in our answer. So we have:
$$E[\text{number of rolls}\mid\text{evens up to 6 and billionth roll}]=\sum_{k=1}^{\infty}k\left(\frac{1}{3}\right)\left(\frac{2}{3}\right)^{k-1}=\frac{1/3}{(1-2/3)^2}=3,$$
which is roughly the expected number of rolls until 6 on a D3 with sides labeled 2,4,6. The conclusion is that adding conditions after the first 6 can indeed impact the expected number of rolls to the first 6.
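We can't verify this one by rejection sampling (a billion all-even rolls essentially never happens), but we can compute the conditional expectation directly when a smaller number of forced rolls N stands in for the billion. A sketch (the function name and the truncation cutoff are my own choices):

```python
def conditional_expectation(N, max_k=10_000):
    # E[rolls until first 6 | every roll in the process is even], where the
    # process is: roll N times, then keep rolling (if needed) until the first 6.
    # We sum over k, the index of the first 6; the tail decays geometrically,
    # so truncating at max_k is harmless.
    prob_event = 0.0
    numerator = 0.0
    for k in range(1, max_k):
        if k <= N:
            # rolls 1..k-1 are 2 or 4, roll k is a 6, rolls k+1..N are even
            p = (1 / 3) ** (k - 1) * (1 / 6) * (1 / 2) ** (N - k)
        else:
            # rolls 1..k-1 are 2 or 4, roll k is a 6 (we are already past roll N)
            p = (1 / 3) ** (k - 1) * (1 / 6)
        prob_event += p
        numerator += k * p
    return numerator / prob_event

for N in [1, 5, 10, 20, 50]:
    print(N, conditional_expectation(N))
```

The printed values should climb from 1.5 at N=1 toward 3 as N grows.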
Reviewing the paradox
Recall our definitions:
A: The expected number of rolls of a fair die until you roll two 6s in a row, given that all rolls were even.
B: The expected number of rolls of a fair die until you roll the second 6 (not necessarily in a row), given that all rolls were even.
We will now consider a third:
C: The expected number of rolls of a fair die until you roll the second 6, given that all rolls until the first instance of two 6s in a row are even.
C and A are directly comparable since they have the exact same condition - all rolls before the first instance of two 6s in a row are even and no conditions are given on any rolls that occur after. Since, amongst these rolls, the second 6 will always occur at or before the first instance of two 6s in a row, we can safely conclude A > C.
However, we've seen in the last section that C is not necessarily the same as B, even though we've only added conditions that apply at or after the second 6. So we cannot immediately conclude A>B. The two need to be calculated independently and directly compared.
Proving B>A
B ends up being a lot easier to calculate than A, so our strategy will be to prove that B=3, then use some nicer upper bounds to show that A is less than 3. The strategy to compute B comes from a cute argument by reddit user u/bobjane in the aforementioned thread:
The average number of rolls until the first instance of 1,3,5,6 is 3/2, as it is a geometric distribution. Then after rolling the first, the average until the next instance of 1,3,5,6 is again 3/2, no matter how long it took to get the first instance.
By linearity of expectation, we then have that the expected number of rolls of a die until the second instance of a side in 1,3,5,6 is 3/2 + 3/2 = 3. Conditioning on the specific combination we see being 6 and 6 does not impact the expectation, hence we have B=3.
The same user actually gave an explicit formula for A in the general setting of rolling until n 6s in a row, but there are a lot of steps and we won't need the exact answer anyway. Instead, we will content ourselves with an upper bound for A that is lower than 3.
We first compute the probability that two 6s in a row occur before the first odd via a Markov chain argument. A "success" occurs if we roll the two 6s in a row before an odd and a "failure" occurs if we roll the odd first.
Let p0 be the probability of success at the beginning. Any time our most recent roll was a 2 or 4 and we have neither rolled 2 6s in a row nor an odd, the probability of success is still p0 - nothing has changed. We will refer to this state as S0.
But when we roll a 6, the probability of getting two 6s in a row before an odd is temporarily higher - call this p6. We will call this state S6. If the next roll is a 2 or 4, we are back to S0 and the new probability becomes p0 again.
We will solve for p0 and p6 via a system of equations:
$$p_0=\frac{1}{6}p_6+\frac{2}{6}p_0$$
$$p_6=\frac{1}{6}+\frac{2}{6}p_0$$
In other words, the probability of success from S0 is equal to the probability of going from S0 to S6 (which occurs when you roll a 6, so 1/6) times the probability of success from S6, plus the probability you stay at S0 (which occurs when rolling a 2 or 4) times the probability of success from S0.
On the other hand, the probability of success from S6 is 1/6 (the probability of rolling another 6 and being done), plus the probability you go back to S0 times the probability of success from S0.
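Working through the algebra:
$$p_0=\frac{1}{6}p_6+\frac{2}{6}p_0\;\implies\;\frac{2}{3}p_0=\frac{1}{6}p_6\;\implies\;p_6=4p_0$$
$$p_6=\frac{1}{6}+\frac{2}{6}p_0\;\implies\;4p_0=\frac{1}{6}+\frac{1}{3}p_0\;\implies\;\frac{11}{3}p_0=\frac{1}{6}\;\implies\;p_0=\frac{1}{22}$$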
So, with p0 = 1/22:
$$A=\sum_{k=1}^{\infty}k\cdot\frac{\Pr(\text{first instance of two 6s in a row at roll } k\text{ and all evens until roll } k)}{1/22}$$
That numerator is tricky to calculate. It is much easier to calculate the probability that an instance of exactly two 6s in a row occurs at roll k and all rolls until roll k are even; this is at least as large, but we just want an upper bound anyway!
For k<2, this probability is 0, and for k=2, the probability is $\frac{1}{6^2}=\frac{1}{36}$. For k>2, this is the probability that the first k−3 rolls are even, roll k−2 is a 2 or 4, and rolls k−1, k are 6s. In other words, $\left(\frac{1}{2}\right)^{k-3}\left(\frac{1}{3}\right)\left(\frac{1}{6}\right)^2=\left(\frac{1}{108}\right)\left(\frac{1}{2}\right)^{k-3}$.
So:
$$A<22\left(2\left(\frac{1}{36}\right)+\sum_{k=3}^{\infty}k\left(\frac{1}{108}\right)\left(\frac{1}{2}\right)^{k-3}\right)=\frac{77}{27}<2.852.$$
(skipping some algebra to manipulate the sum). The exact answer for A is actually 30/11, which is about 2.72.
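For completeness, the skipped manipulation is:
$$\sum_{k=3}^{\infty}k\left(\frac{1}{2}\right)^{k-3}=\sum_{j=0}^{\infty}(j+3)\left(\frac{1}{2}\right)^{j}=\frac{1/2}{(1-1/2)^2}+\frac{3}{1-1/2}=2+6=8,$$
so the bound is $22\left(\frac{2}{36}+\frac{8}{108}\right)=22\cdot\frac{7}{54}=\frac{77}{27}$.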
The general case
Following the above argument, you can show that the expected number of rolls until the nth 6, given that all rolls are even, is 3n/2. The expected number of rolls until the first instance of n 6s in a row, given that all rolls are even, is a little less than n + 4/5. So, for instance, the expected number of rolls until the 70th 6 is more than the expected number of rolls until the first instance of 100 6s in a row, when both require no odds to show up before the stopping condition.
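As a sanity check on both claims, here is a rejection-sampling sketch for n = 3 (the function names are mine; a sequence is abandoned as soon as an odd appears, since it would be discarded anyway):

```python
import random

def estimate_nth_six(n, num_sequences=10**6):
    # Average rolls until the n-th 6, over sequences where every roll up to
    # and including that 6 is even.
    total, kept = 0, 0
    for _ in range(num_sequences):
        rolls, sixes = 0, 0
        while True:
            rolls += 1
            r = random.randint(1, 6)
            if r % 2 == 1:
                break  # an odd appeared before the stopping condition: discard
            if r == 6:
                sixes += 1
                if sixes == n:
                    total += rolls
                    kept += 1
                    break
    return total / kept

def estimate_n_sixes_in_a_row(n, num_sequences=10**6):
    # Average rolls until the first run of n consecutive 6s, over sequences
    # where every roll up to the end of that run is even.
    total, kept = 0, 0
    for _ in range(num_sequences):
        rolls, run = 0, 0
        while True:
            rolls += 1
            r = random.randint(1, 6)
            if r % 2 == 1:
                break  # discard
            run = run + 1 if r == 6 else 0
            if run == n:
                total += rolls
                kept += 1
                break
    return total / kept

print(estimate_nth_six(3))           # should be close to 3*3/2 = 4.5
print(estimate_n_sixes_in_a_row(3))  # should be a little under 3 + 4/5 = 3.8
```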