Interesting. The natural approach is to imagine that you just have a 3-sided die with 2, 4, 6 on the sides, and if you do that, then I compute A = 12 and B = 6[1]. But, as the top Reddit comment's edit points out, the difference between that problem and the one you posed is that your version heavily weights the probability towards short sequences—that weighting being 1/2^n for a sequence of length n. (Note that the numbers I got, A=12 and B=6, are so much higher than the A≈2.7 and B=3 you get.) It's an interesting selection effect.
The thing is that, if you roll a 6 and then a non-6, in an "A" sequence you're likely to just die due to rolling an odd number before you succeed in getting the double 6, and thus exclude the sequence from the surviving set; whereas in a "B" sequence there's a much higher chance you'll roll a 6 before dying, and thus include this longer "sequence of 3+ rolls" in the set.
To illustrate with an extreme version, consider:
A: The expected number of rolls of a fair die until you roll two 6s in a row, given that you succeed in doing this. You ragequit if it takes more than two rolls.
Obviously that's one way to reduce A to 2.
Excluding odd rolls completely, so the die has a 1/3 chance of rolling 6 and a 2/3 chance of rolling an even number that's not 6, we have:
A = 1 + 1/3 * A2 + 2/3 * A
Where A2 represents "the expected number of die rolls until you get two 6's in a row, given that the last roll was a 6". Subtraction and multiplication then yields:
A = 3 + A2
And if we consider rolling a die from the A2 state, we get:
A2 = 1 + 1/3 * 0 + 2/3 * A
= 1 + 2/3 * A
Substituting:
A = 3 + 1 + 2/3 * A
=> (subtract)
1/3 * A = 4
=> (multiply)
A = 12
For B, a similar approach yields the equations:
B = 1 + 1/3 * B2 + 2/3 * B
B2 = 1 + 1/3 * 0 + 2/3 * B2
And the reader may solve for B = 6.
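These values are easy to check by simulation. The sketch below is my own code, not from the thread; it rolls a 3-sided die with faces 2, 4, 6 directly:

```python
import random

def simulate(in_a_row, trials=100_000):
    """Average rolls of a 3-sided die (faces 2, 4, 6) until two 6s:
    consecutive if in_a_row is True, two in total otherwise."""
    total = 0
    for _ in range(trials):
        rolls = 0
        sixes = 0
        while sixes < 2:
            rolls += 1
            if random.choice([2, 4, 6]) == 6:
                sixes += 1
            elif in_a_row:
                sixes = 0  # a non-6 breaks the streak
        total += rolls
    return total / trials

print(simulate(True))   # should come out near A = 12
print(simulate(False))  # should come out near B = 6
```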
The thing is that, if you roll a 6 and then a non-6, in an "A" sequence you're likely to just die due to rolling an odd number before you succeed in getting the double 6, and thus exclude the sequence from the surviving set; whereas in a "B" sequence there's a much higher chance you'll roll a 6 before dying, and thus include this longer "sequence of 3+ rolls" in the set.
Yes! This kind of kills the "paradox". It's approaching an apples-and-oranges comparison.
Surviving sequences with n=100 rolls (for illustrative purposes)
[6, 6]
[6, 6]
[2, 6, 6]
[6, 6]
[2, 6, 6]
[6, 6]
Estimate for A: 2.333
[6, 6]
[4, 4, 6, 2, 2, 6]
[6, 6]
[6, 2, 4, 4, 6]
[6, 4, 6]
[4, 4, 6, 4, 6]
[6, 6]
[6, 6]
Estimate for B: 3.375
If you rephrase:
A: The probability that, when you roll a fair die until you roll two 6s in a row, all rolls were even.
B: The probability that, when you roll a fair die until you roll the second 6 (not necessarily in a row), all rolls were even.
This changes the code to:
A_estimate = num_sequences_without_odds/n
B_estimate = num_sequences_without_odds/n
And the result (n=100000)
Estimate for A: 0.045
Estimate for B: 0.062
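For completeness, here is a self-contained sketch of the rephrased estimator (implementation and names are mine); the exact values are 1/22 ≈ 0.045 and 1/16 = 0.0625:

```python
import random

def survival_prob(in_a_row, n=100_000):
    """Fraction of sequences (rolled until the stopping condition)
    that contain no odd number at all."""
    survived = 0
    for _ in range(n):
        sixes = 0
        ok = True
        while sixes < 2:
            x = random.randint(1, 6)
            if x % 2 == 1:
                ok = False   # an odd roll disqualifies the sequence
                break
            if x == 6:
                sixes += 1
            elif in_a_row:
                sixes = 0    # in the "in a row" version, a 2 or 4 resets
        if ok:
            survived += 1
    return survived / n

print("Estimate for A:", survival_prob(True))   # near 0.045
print("Estimate for B:", survival_prob(False))  # near 0.062
```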
I guess this is what most people were thinking when reading the problem, i.e., that there's a higher chance of getting two non-consecutive 6s. But with the wording (see above) of the "paradox", you get more rolls on average for the surviving sequences, while on the other hand you have more surviving sequences and hence a higher probability.
By the way, this is my first non-question post on lesswrong after lurking for almost a year. I've made some guesses about the mathematical maturity I can assume, but I'd appreciate feedback on if I should assume more or less in future posts (and general feedback about the writing style, either here or in private).
Seemed well-targeted to me. I do feel like it could have been condensed a bit or had a simpler storyline or something, but it was still a good post.
I would have appreciated an intuitive explanation of the paradox, something which I got from the comments.
This is a lovely little problem, so thank you for sharing it. I thought at first it would be [a different problem](https://www.wolfram.com/mathematica/new-in-9/markov-chains-and-queues/coin-flip-sequences.html) that's similarly paradoxical.
I had to write several new Python versions of the code to explore the problem before it clicked for me.
I understand the proof, but the closest I can get to a true intuition that B is bigger is:
I think the main thing I want to remember is that "given X" or "conditional on X" means that you use the unconditional probability distribution and throw out results not conforming to X, not that you substitute a different generating function that always generates events conforming to X.
My intuition was that B is bigger.
The justification was more or less the following: any time you roll until reaching two 6s in a row, you will have also hit your second 6 at or before then. So regardless of what the conditions are, A must be larger than B.
This seems obviously wrong. The conditions matter a lot. Without conditions that would be adequate to explain why it takes more rolls to get two 6s in a row than it does to get two 6s, but given the conditions that doesn't explain anything.
The way I think about it is that you are looking at a very long string of digits 1-6 and (for A) selecting the sequences of digits that end with two 6s in a row, going backwards until just before you hit an odd number (which is not very far, since half of rolls are odd). If you ctrl+f "66" in your mind you might see that it's "36266" for a length of 4, but probably not. Half of your "66"s will be preceded by an odd number, making half of the two-6s-in-a-row sequences length 2.
For people that didn't intuit that B is bigger, I wonder if you'd find it more intuitive if you imagine a D100 is used rather than a D6.
While two 100s in a row only happens once in 10,000 times, when they do happen they are almost always part of short sequences like "27,100,100" or "87,62,100,100" rather than "53,100,14,100,100".
On the other hand, when you ctrl+f for a single "100" in your mind and count backwards until you get another 100, you'll almost always encounter an odd number first before encountering another "100" and have to disregard the sequence. But occasionally the 100s will appear close together and by chance there won't be any odd numbers between them. So you might see "9,100,82,62,100" or "13,44,100,82,100" or "99,100,28,100" or "69,12,100,100".
Another way to make it more intuitive might be to imagine that you have to get several 100s in a row / several 100s rather than just two. E.g. For four 100s: Ctrl+f "100,100,100,100" in your mind. Half the time it will be preceded by an odd number for length 4, a quarter of the time it will be length 5, etc. Now look for all of the times that four 100s appear without there being any odd numbers between them. Some of these will be "100,100,100,100", but far more will be "100,32,100,100,88,100" and similar. And half the time there will be an odd number immediately before, a quarter of the time it will be odd-then-even before, etc.
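The D100 version can be checked numerically. The sketch below is my own code; for efficiency it kills a sequence at the first odd roll, which leaves exactly the same surviving set as rolling every sequence to completion and discarding afterwards:

```python
import random

def conditional_mean(stop, n=1_000_000):
    """Mean surviving-sequence length for a D100: sequences are rolled
    until `stop` says they are complete, and any odd roll discards the
    sequence immediately (equivalent to discarding it afterwards)."""
    total = count = 0
    for _ in range(n):
        seq = []
        while True:
            x = random.randint(1, 100)
            if x % 2:
                break            # odd roll: throw this sequence away
            seq.append(x)
            if stop(seq):
                total += len(seq)
                count += 1
                break
    return total / count

in_a_row = lambda s: s[-2:] == [100, 100]   # two 100s in a row
second = lambda s: s.count(100) == 2        # second 100, gaps allowed
print(conditional_mean(in_a_row))  # the "in a row" average is smaller...
print(conditional_mean(second))    # ...than the "second 100" average
```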
You have very strong intuition. A puzzle I was giving people before was "Round E[number of rolls until 100 6s in a row | all even] to the next integer" and the proof I had in mind for 101 was very close to your second paragraph. And when a friend of mine missed the "in a row" part and got 150, the resolution we came to (after many hours!) was similar to the rest of the argument you gave.
150 or 151? I don't have a strong intuition. I'm inclined to trust your 150, but my intuition says that maybe 151 is right because 100+99/2+almost1 rounds up to 151. Would have to think about it.
(By the way, I'm not very good at math. (Edit: Ok, fair. Poorly written. What I meant is that I have not obtained certain understandings of mathematical things that those with formal educations in math have widely come to understand, and this leads me to being lower skilled at solving certain math problems than those who have already understood certain math ideas, despite my possibly having equal or even superior natural propensity for understanding math ideas.). I know high school math plus I took differential equations and linear algebra while studying mechanical engineering. But I don't remember any of it well and don't do engineering now or use math in my work. (I do like forecasting as a hobby and think about statistics and probability in that context a lot.) I wouldn't be able to follow your math in your post without a lot of effort, so I didn't try.)
Re the almost1 and a confusion I noticed when writing my previous comment:
Re my:
E.g. For four 100s: Ctrl+f "100,100,100,100" in your mind. Half the time it will be preceded by an odd number for length 4, a quarter of the time it will be length 5, etc.
Since 1/2+1/4+1/8...=1, the above would seem to suggest that for four 100s in a row (or two 6s in a row) the expected number of rolls conditional on all even is 5 (or 3). But I saw from your post that it was more like 2.72, not 3, so what is wrong with the suggestion?
There is an important nuance that makes it ~n+4/5 for large n (instead of n+1), but I'd have to think a bit to remember what it was and give a nice little explanation. If you can decipher this comment thread, it's somewhat explained there: https://old.reddit.com/r/mathriddles/comments/17kuong/you_roll_a_die_until_you_get_n_1s_in_a_row/k7edj6l/
I thought of the reason independently: it's that if the number before 66 is not odd, but even instead, it must be either 2 or 4, since if it was 6 then the sequence would have had a double 6 one digit earlier.
A good way to resolve the paradox for me is to modify the code to combine both functions into one and record the sequences for the 10,000 runs. In one array you store the sequences where the two 6s are consecutive, and in the second you store the ones where they are not consecutive. That makes it a bit clearer.
For a run of 10,000 I get 412 runs where the first two 6s are consecutive (sequences_no_gap), and 192 where they are not (sequences_can_gap). So for case A you get 412 runs, but for case B you get 412+192 runs. Then you look at the average sequence length of sequences_no_gap and compare it to sequences_can_gap. If the average sequence length in sequences_can_gap is greater than in sequences_no_gap, then the expectation will be higher, and that's what you get.
mean sequence lengths
sequences_can_gap: 3.93
sequences_no_gap: 2.49
Examples:
sequences_no_gap
[[4, 6, 6], [6, 6], [6, 6], [4, 6, 6], [6, 6], [6, 6], [6, 6], [6, 6], [6, 6], [6, 6], [6, 6], [6, 6], [4, 6, 6], [4, 6, 6], ...]
sequences_can_gap
[[6, 4, 4, 6], [6, 4, 6], [4, 6, 4, 6], [2, 2, 6, 2, 6], [6, 4, 2, 6], [6, 4, 6], [6, 2, 6], [6, 2, 4, 6], [6, 2, 2, 4, 6], [6, 4, 6], [6, 4, 6], [2, 4, 6, 2, 6], [6, 4, 6], [6, 4, 6], ...]
The many examples such as [6 4 4 6] which are excluded in the first case make the expected number of rolls higher for the case where they are allowed.
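A sketch of the combined function described above (implementation mine, with the array names shortened to `no_gap`/`can_gap`):

```python
import random

def classify(n=10_000):
    """Roll to the second 6, discard sequences containing an odd,
    and split survivors by whether the two 6s were consecutive."""
    no_gap, can_gap = [], []
    for _ in range(n):
        seq = []
        while seq.count(6) < 2:
            seq.append(random.randint(1, 6))
        if any(x % 2 for x in seq):
            continue                 # an odd number kills the sequence
        sixes = [i for i, x in enumerate(seq) if x == 6]
        (no_gap if sixes[1] == sixes[0] + 1 else can_gap).append(seq)
    return no_gap, can_gap

no_gap, can_gap = classify()
print(len(no_gap), len(can_gap))              # roughly 417 vs 208
print(sum(map(len, no_gap)) / len(no_gap))    # near 2.5
print(sum(map(len, can_gap)) / len(can_gap))  # near 4
```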
(Note GPT o-1 is confused by this problem and gives slop)
Here is a more intuitive version of the same paradox.
Again, conditional on all dice rolls being even. But this time it's either
A) 1,000,000 consecutive 6's.
B) 999,999 consecutive 6's followed by a (possibly non-consecutive) 6.
Suppose you roll a few even numbers, followed by an extremely lucky sequence of 999,999 6's.
From the point of view of version A, the only way to continue the sequence is a single extra 6. If you roll a 4, you would need to roll a second sequence of a million 6's. And you are very unlikely to do that in the next 10 million steps. And very unlikely to go for 10 million steps without rolling an odd number.
Yes if this happened, it would add at least a million extra rolls. But the chance of that is exponentially tiny.
Whereas, for B, then it's quite plausible to roll 26 or 46 or 2426 instead of just 6.
Another way to think about this problem is with regular expressions. Let e=even numbers. *=0 or more.
The string "e*6e*6" matches any sequence with at least two 6's and no odd numbers.
The sequence "e*66" matches those two consecutive 6's. And the sequence "66" matches two consecutive 6's with no room for extra even numbers before the first 6. This is the shortest.
Phrased this way it looks obvious. Every time you allow a gap for even numbers to hide in, an even number might be hiding in the gap, and that makes the sequence longer.
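The patterns can be tried directly with Python's re module. Here 'e' encodes a non-6 even roll (2 or 4), an encoding I chose for illustration; 6s and odd digits appear literally:

```python
import re

# 'e' stands for a 2 or 4; 6s and odd digits appear literally.
s = "3e6ee65e66"
print(re.findall(r"e*6e*6", s))  # even-number gaps allowed: ['e6ee6', 'e66']
print(re.findall(r"e*66", s))    # the two 6s must be adjacent: ['e66']
```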
When you remove the condition that the other numbers are even, the word "first" becomes important for making the expectation converge at all.
I wonder if there's a connection with anthropic reasoning. Let's suppose that a bomb goes off on rolling an odd number...
In related puzzles, I heard something like this a while ago, from Bostrom perhaps. You have, say, 6 challenging events to achieve to get from no life to us. They are random and some of those steps are MUCH harder than the others, but if you look at the successful runs, you can't in hindsight see which they are. For life it's, say, no life to life, simple single cell to complex cell, and perhaps 4 other events that aren't so rare.
A run is a sequence of 100 steps where you either don't achieve the end state (all 6 challenging events achieved in order) or you do.
There is a 1 in a million chance that a run is successful.
Now if you look at the successful runs, you can't in hindsight see which events were really hard and which weren't. The event with a 1/10,000 chance at each step may have taken just 5 steps in the successful run; it couldn't take 10,000 steps because only 100 are allowed, etc.
Note that a slightly differently worded problem gives the intuitive result:
A_k is the event "I roll a die k times, and it ends with 66, with no earlier 66 sequence".
B_k is the event "I roll a die k times, and it ends with a 6, with one and only one 6 before that (but not necessarily on the roll just before the end: 16236 works)".
C_k is the event "I roll a die k times, and I only get even numbers".
In this case we do have the intuitive result (which I think is what most mathematicians intuitively translate this problem into):
Σ[k * P(A_k|C_k)] > Σ[k * P(B_k|C_k)]
Now the question is: why are the two formulations not equivalent? How would you write "expected number of rolls" more formally, in a way that would not yield the above formula, and would reproduce the numbers of your Python program?
(This is what I hate in probability theory: slightly differently worded problems, seemingly equivalent, yield completely different results for no obvious reason.)
Also, the difference between the two processes is not small:
Expected rolls until two 6s in a row (given all even): 2.725588
Expected rolls until second 6 (given all even): 2.999517
vs (n = 10 million)
k P(A_k|C_k) P(B_k|C_k)
-------------------------
1 0.000000 0.000000
2 0.111505 0.111505
3 0.074206 0.148227
4 0.074719 0.148097
5 0.066254 0.130536
6 0.060060 0.108174
7 0.053807 0.086706
8 0.049360 0.067133
9 0.046698 0.050944
10 0.040364 0.038915
11 0.038683 0.029835
12 0.030691 0.024297
13 0.034653 0.011551
14 0.036450 0.014263
15 0.024138 0.006897
16 0.007092 0.007092
17 0.012658 0.000000
18 0.043478 0.000000
19 0.000000 0.000000
20 0.000000 0.000000
Expected values:
Σ[k * P(A_k|C_k)] = 6.259532
Σ[k * P(B_k|C_k)] = 5.739979
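These conditional probabilities can also be computed exactly rather than by simulation: conditioning on C_k makes the k rolls uniform over {2,4,6}^k, so a brute-force count works for small k (code mine):

```python
from itertools import product

def p_A_given_C(k):
    """Fraction of {2,4,6}^k strings whose FIRST two-in-a-row 6s
    end exactly at position k (this equals P(A_k | C_k))."""
    hits = 0
    for s in product((2, 4, 6), repeat=k):
        firsts = [i for i in range(1, k) if s[i - 1] == s[i] == 6]
        if firsts and firsts[0] == k - 1:
            hits += 1
    return hits / 3**k

def p_B_given_C(k):
    """Fraction whose second (and last) 6 is exactly at position k."""
    hits = 0
    for s in product((2, 4, 6), repeat=k):
        if s[-1] == 6 and s.count(6) == 2:
            hits += 1
    return hits / 3**k

print(p_A_given_C(2), p_B_given_C(2))  # both 1/9 ≈ 0.1111
print(p_A_given_C(3), p_B_given_C(3))  # 2/27 ≈ 0.074 and 4/27 ≈ 0.148
```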
At first I disbelieved. I thought A > B. Then I wrote code myself and checked, and got that B > A. I believed this result. Then I thought about it and realized why my reason for A > B was wrong. But I still didn't understand (and still don't) why the described random process is not equivalent to randomly choosing 2, 4, or 6 every roll. I thought some more and now I have some doubts. My first doubt is whether there exists some kind of standard way of describing random processes and conditioning on them, and whether the problem as stated by notfnofn fits such a description. Perhaps the problem is just underspecified? Anyway, this is very interesting.
Here's the reasoning I intuitively want to apply:
where X = "you roll two 6s in a row by roll N", Y = "you roll at least two 6s by roll N", and Z = "the first N rolls are all even".
This is valid, right? And not particularly relevant to the stated problem, due to the "by roll N" qualifiers mucking up the statements in complicated ways?
Nah.
The problem with the explanation is this line of code:
x=random.randint(1,6)
That will generate odd numbers. The introduction to the problem states, "given that all rolls were even."
That's why you're getting a surprising result. It's the differing interpretation of what 'all' means. If you think it means what it says - all rolls - then there is no surprise. But if, as the explanation implies, you think it means that some of the rolls were odd, but not those involved in the success condition, that's when you get a different outcome. It's not a paradox. It's just how the question is interpreted.
"given that all rolls were even" here means "roll a normal 6 sided dice, but throw out all of the sequences that included odd numbers." The two are not the same, because in the case where odd numbers can be rolled, but they "kill" the sequence it makes situations involving long sequences of rolls much less likely to be included in the dataset at all.
As other comments explain, this is why the paradox emerges. By stealth, the question is actually "A: How long do I have to wait for two 6s in a row, vs B: getting two 6's, not necessarily in a row, given that I am post selecting in a way that very strongly favors short sequences of rolls".
Yes, exactly - thank you. It depends on the interpretation of the phrase "given that all rolls were even". Most ordinary people will assume it means that all the rolls were even, but as you have succinctly explained, that is not what it means in the specialist language of mathematics. It is only when you apply the latter interpretation - that some of the rolls are odd but we throw those out afterwards - that the result becomes at first surprising.
I do find LessWrong a curious place and am not a regular here. You can post something and it will get downvoted as wrong, then someone else comes along and says exactly the same thing and it's marked as correct. Heh.
Yes, it's a bit weird. I was replying because I thought (perhaps getting the wrong end of the stick) that you were confused about what the question was, not (as it now seems) pointing out that the question is, in your view, open to being misread.
In probability theory the phrase "given that" is very important, and it is (as far as I know) always used in the way it is used here. ["Given that X happens" means "X may or may not happen, but we are thinking about the cases where it does", which is very different from meaning "X always happens".]
A more common use would be "What is the probability that a person is sick, given that they are visiting a doctor right now?". This doesn't mean "everyone in the world is visiting a doctor right now", it means that the people who are not visiting a doctor right now exist, but we are not talking about them. Similarly, the original post's imagined world involves cases where odd numbers are rolled, but we are talking about the set without odds. It is weird to think about how proposing a whole set of imaginary situations (odd and even rolls) then talking only about a subset of them (only evens) is NOT the same as initially proposing the smaller set of imaginary events in the first place (your D3 labelled 2,4,6).
But yes, I can definitely see how the phrase "given that", could be interpreted the other way.
I'm not sure about the off-topic rules here, but how about this:
Why are some of the drinks so expensive, given that all of them are mostly water?
Sometimes we use the phrase "given that" to mean, "considering that". Here, we do not mean, some of the drinks are not mostly water but we are not talking about them. We mean that literally all the drinks are mostly water.
Jumping in here: the whole point of the paragraph right after defining "A" and "B" was to ensure we were all on the same page. I also don't understand what you mean by:
Most ordinary people will assume it means that all the rolls were even
and much else of what you've written. I tell you I will roll a die until I get two 6s and let you know how many odds I rolled in the process. I then do so secretly and tell you there were 0 odds. All rolls are even. You can now make a probability distribution on the number of rolls I made, and compute its expectation.
Which of the following do you think is bigger?
A: The expected number of rolls of a fair die until you roll two 6s in a row, given that all rolls were even.
B: The expected number of rolls of a fair die until you roll the second 6 (not necessarily in a row), given that all rolls were even.
If you are unfamiliar with conditional expectation, think of it this way: Imagine you were to perform a million sequences of die rolls, stopping each sequence when you roll two 6s in a row. Then you throw out all the sequences that contain an odd roll. The average number of rolls in the remaining sequences should be close to A. Next, perform a million sequences of die rolls, stopping each sequence when you roll the second 6. Throw out all the sequences among these that contain an odd roll. The average number of rolls in the remaining sequences should be close to B.
I asked something like this on r/math about a year ago, and even with the hint that the answer was paradoxical, the early consensus was that A must be larger. The justification was more or less the following: any time you roll until reaching two 6s in a row, you will have also hit your second 6 at or before then. So regardless of what the conditions are, A must be larger than B.
But the correct answer is actually B. What on earth is going on?
A quick verification
Before we proceed, let's write some code to estimate A and B. The only goal here is to be as unambiguous as possible, so the code will be almost comically unoptimized in both run-time and length.
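The post's code block is missing from this copy; the following is a stand-in consistent with the surrounding description (the function names are my own guess), with the 100,000s in the last two lines:

```python
import random

def estimate_A(num_sequences):
    """Average length of sequences stopped at two 6s in a row,
    throwing out any sequence that contained an odd roll."""
    total_rolls, kept = 0, 0
    for _ in range(num_sequences):
        seq = []
        while len(seq) < 2 or seq[-2:] != [6, 6]:
            seq.append(random.randint(1, 6))
        if all(x % 2 == 0 for x in seq):
            total_rolls += len(seq)
            kept += 1
    return total_rolls / kept

def estimate_B(num_sequences):
    """Same, but each sequence stops at its second 6, consecutive or not."""
    total_rolls, kept = 0, 0
    for _ in range(num_sequences):
        seq = []
        while seq.count(6) < 2:
            seq.append(random.randint(1, 6))
        if all(x % 2 == 0 for x in seq):
            total_rolls += len(seq)
            kept += 1
    return total_rolls / kept

print("Estimate for A:", estimate_A(100_000))
print("Estimate for B:", estimate_B(100_000))
```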
The 100,000s in the last two lines represent the number of sequences being rolled for the estimate; you can add zeros for accuracy or subtract zeros for faster run-time.
The estimate for A should be close to 2.727 and the estimate for B should be close to 3.000. The exact value for A is 30/11 and the exact value for B is 3, but it is helpful to first unambiguously experimentally verify that B is greater than A (and ensure that we are on the same page of what A and B mean) before diving into the possibly unintuitive math.
Primer on geometric distributions
We'll begin by calculating the expected number of rolls of a die to see the first 6. To do so, we first find the probability that it takes exactly k rolls to see the first 6. This means the first k−1 rolls were not 6 and roll k was 6.
The probability that a die rolls a 6 is 1/6 and the probability it does not is 5/6. Following the rule of product for independent probabilities, we get:
Pr(first 6 on roll k) = (5/6)^(k−1) · (1/6)
We can now get a formula for the expected number of rolls of a die until we see the first 6. The formula for expectation gives:
E[num rolls until 6] = Σ_{k=1}^∞ k · Pr(first 6 on roll k) = Σ_{k=1}^∞ k · (5/6)^(k−1) · (1/6)
Now we'll use the following fact: for −1 < x < 1:
Σ_{k=1}^∞ k · x^(k−1) = 1/(1−x)^2
This can be obtained by starting with the formula for geometric series Σ_{k=0}^∞ x^k = 1/(1−x) and taking the derivative of both sides (if you remember calculus) or squaring both sides (if you're very good at algebra). Plugging in, we have:
E[num rolls until 6] = (1/6) · 1/(1 − 5/6)^2 = 6.
And we are done. Sort of.
Let's try that again, this time using an intuitive trick from Markov chains. We'll use "average" and "expected" interchangeably as the former is more colloquial and we are going to be a bit informal here.
Let x be the average number of rolls until we see the first 6. Let's roll the die once. With probability 1/6, we rolled a 6 and can stop. With probability 5/6, we didn't roll a 6 and are then still an average of x steps away from a 6.
So with probability 1/6, we are in a scenario where we take 1 roll to see a 6, and with the remaining probability 5/6, it will take an average of x+1 rolls to see a 6. So the average number of rolls until we see the first 6 is (1/6)(1) + (5/6)(x+1). But the average is also x! This gives us the algebra equation:
x = (1/6)(1) + (5/6)(x+1)
which gives x = 6 when solving for x.
Let's generalize now. Let's say we have some experiment that has fixed probability p of success. We repeat the experiment until it succeeds. Then if x is the expected number of trials until success, we have:
x = p(1) + (1−p)(x+1) ⟹ x = p + x − px + 1 − p ⟹ px = 1 ⟹ x = 1/p.
Probability distributions of this form are called geometric distributions. In this case, the experiment was rolling a die and success was defined by seeing a 6, so it is a geometric distribution with success rate 1/6. And so the expected number of trials until success is 1/(1/6) = 6.
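As a quick numerical check of x = 1/p, here is a small simulation (my own sketch):

```python
import random

def trials_until_success(p):
    """Repeat a probability-p experiment until the first success."""
    n = 1
    while random.random() >= p:   # failure with probability 1 - p
        n += 1
    return n

samples = [trials_until_success(1/6) for _ in range(100_000)]
print(sum(samples) / len(samples))  # should be close to 1/(1/6) = 6
```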
Rolls until first 6, given all even
In this section, we will use D6 to refer to a fair 6-sided die with sides labeled 1 through 6 and D3 to refer to a fair 3-sided die with sides labeled 2, 4, 6. Consider the following two questions:
1. What is the expected number of rolls of a D3 until we see a 6?
2. What is the expected number of rolls of a D6 until we see a 6, given that all rolls were even?
For the first question, we have a geometric distribution of success rate 1/3, so the expected number of trials until success is 1/(1/3) = 3.
The second question was posed by Gil Kalai in a 2017 blog post. Most people incorrectly answered 3 (and keep in mind the audience for this blog is fairly math literate). The rationale was that the second question seems equivalent to the first. But let's calculate it explicitly.
Analogously to the last section, we begin by calculating the probability that it takes exactly k rolls to see the first 6, given that all rolls were even. Following standard conventions, we'll write Pr(X|Y) as shorthand for "the probability that X occurs, given that Y occurs". From the formula for conditional probability, we have:
Pr(X|Y) = Pr(X and Y) / Pr(Y)
Let's start with the numerator. If it takes us exactly k rolls to see our first 6 and all rolls in the process were even, then the first k−1 rolls were all 2 or 4 and the kth roll was a 6. The probability of this occurring is (2/6)^(k−1) · (1/6).
The denominator is the total probability that we roll a 6 before the first odd. One way we could determine this is by evaluating Σ_{i=1}^∞ (2/6)^(i−1) · (1/6) (that is, summing the probability that it takes exactly i rolls to get a 6 and all rolls were even, over all possible values of i). We saw how to sum these kinds of series in the last section.
But a more intuitive way is as follows - rephrase "the probability we roll a 6 before the first odd" as "the probability that, among the sides {1,3,5,6}, 6 is the first to show up". From here, we can immediately see by symmetry that the probability is 1/4. Indeed, summing the above series gives the same answer.
Altogether, we have:
Pr(first 6 on roll k | all rolls even) = ((2/6)^(k−1) · (1/6)) / (1/4) = (2/3) · (1/3)^(k−1)
and so:
E[rolls until first 6 | all rolls even] = Σ_{k=1}^∞ k · (2/3) · (1/3)^(k−1) = (2/3) · 1/(1 − 1/3)^2 = 3/2
There is another, cuter, way to answer the second problem that will be important for our evaluation of B. We will first rephrase the question as "What is the expected number of rolls of a D6 until the first 6, given that 6 is the first to occur out of 1,3,5,6?". We can rephrase this again as "What is the expected number of rolls of a D6 until the first side in 1,3,5,6 shows up, given that 6 is the first to occur out of 1,3,5,6?".
Now we have some neat symmetry - the expected number of rolls of a D6 until the first side in 1,3,5,6 shows up shouldn't depend on which of those four sides happened to be first. So the following will have the same answer as the second question: "What is the expected number of rolls of a die until the first side in 1,3,5,6 shows up?"
That's a geometric distribution with success rate 4/6! And so its expectation is 1/(4/6) = 3/2.
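Both routes to 3/2 can be verified numerically (my own sketch): the first by rejection sampling, the second by the symmetry rephrasing.

```python
import random

# Rejection sampling: roll a D6 to the first 6; keep only all-even runs.
lengths = []
while len(lengths) < 50_000:
    seq = []
    while not seq or seq[-1] != 6:
        seq.append(random.randint(1, 6))
    if all(x % 2 == 0 for x in seq):
        lengths.append(len(seq))
print(sum(lengths) / len(lengths))  # near 3/2, not 3

# Symmetry route: rolls of a D6 until the first of {1, 3, 5, 6}.
waits = []
for _ in range(50_000):
    n = 1
    while random.randint(1, 6) not in (1, 3, 5, 6):
        n += 1
    waits.append(n)
print(sum(waits) / len(waits))      # also near 3/2
```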
Adding conditions after the first 6
We'll now make a modification that might seem harmless. We will roll a D6 one billion times (and keep going until we get a 6 if we somehow didn't get one in the first billion rolls). What is the expected number of rolls until the first 6, given that every roll that ever showed up in this process is even?
In other words, we are still looking at the number of rolls until the first 6, and we are still requiring that all rolls before the first 6 are even. But now we are also requiring that rolls after the first 6 are even. Does this impact the expectation? Intuitively one might think "no", but we'll see that the expectation nearly doubles.
E[number of rolls | evens up to 6 and billionth roll] =
Σ_{k=1}^∞ k · Pr(first 6 on roll k and evens up to 6 and billionth roll) / Pr(evens up to 6 and billionth roll)
Now even under the assumption of having all evens show up in the first billion rolls, having no 6 show up in the first billion rolls is extremely unlikely. So with extreme accuracy, we can approximate the denominator as just the probability that the first billion rolls are all even: (1/2)^1,000,000,000.
Now for k less than one billion, the probability that the first 6 is on roll k and there are only evens up to the billionth roll is:
(2/6)^(k−1) · (1/6) · (1/2)^(1,000,000,000 − k) = (1/3) · (2/3)^(k−1) · (1/2)^1,000,000,000.
For k greater than one billion, the expressions are slightly different, but the contributions are so astronomically small at this point that we can pretend they are the same with basically no percent change in our answer. So we have:
E[number of rolls | evens up to 6 and billionth roll] = Σ_{k=1}^∞ k · (1/3) · (2/3)^(k−1) = (1/3) / (1 − 2/3)^2 = 3,
which is roughly the expected number of rolls until 6 on a D3 with sides labeled 2,4,6. The conclusion is that adding conditions after the first 6 can indeed impact the expected number of rolls to the first 6.
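A billion rolls is far more than needed for the effect to appear, and rejection sampling is hopeless at that scale, so the sketch below (mine) just sums the series directly, with a modest horizon standing in for the billion:

```python
def conditional_mean(horizon, terms=700):
    """E[position of first 6 | every roll up to max(first 6, horizon)
    is even], computed by directly summing the series."""
    num = den = 0.0
    for k in range(1, terms):
        p = (1 / 3) ** (k - 1) * (1 / 6)   # first k-1 rolls 2 or 4, then a 6
        if k <= horizon:
            p *= (1 / 2) ** (horizon - k)  # rolls k+1 .. horizon all even
        num += k * p
        den += p
    return num / den

print(conditional_mean(0))   # no trailing condition: 1.5
print(conditional_mean(30))  # already close to 3
```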
Reviewing the paradox
Recall our definitions:
A: The expected number of rolls of a fair die until you roll two 6s in a row, given that all rolls were even.
B: The expected number of rolls of a fair die until you roll the second 6 (not necessarily in a row), given that all rolls were even.
We will now consider a third:
C: The expected number of rolls of a fair die until you roll the second 6, given that all rolls until the first instance of two 6s in a row are even.
C and A are directly comparable since they have the exact same condition - all rolls before the first instance of two 6s in a row are even, and no conditions are given on any rolls that occur after. Since, amongst these rolls, the second 6 will always occur at or before the first instance of two 6s in a row, we can safely conclude A > C.
However, we've seen in the last section that C is not necessarily the same as B, even though we've only added conditions that apply at or after the second 6. So we cannot immediately conclude A>B. The two need to be calculated independently and directly compared.
Proving B>A
B ends up being a lot easier to calculate than A, so our strategy will be to prove that B=3, then use some nicer upper bounds to show that A is less than 3. The strategy to compute B comes from a cute argument by reddit user u/bobjane in the aforementioned thread:
The average number of rolls until the first instance of 1,3,5,6 is 3/2, as it is a geometric distribution. Then after rolling the first, the average until the next instance of 1,3,5,6 is again 3/2, no matter how long it took to get the first instance.
By linearity of expectation, we then have that the expected number of rolls of a die until the second instance of a side in 1,3,5,6 is 3/2 + 3/2 = 3. Conditioning on the specific combination we see being 6 and 6 does not impact the expectation, hence we have B = 3.
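The linearity argument is easy to check by simulating the second instance of a side in {1,3,5,6} (my sketch):

```python
import random

def second_special(n=100_000):
    """Average D6 rolls until the second instance of a side in {1,3,5,6}."""
    total = 0
    for _ in range(n):
        rolls = seen = 0
        while seen < 2:
            rolls += 1
            if random.randint(1, 6) in (1, 3, 5, 6):
                seen += 1
        total += rolls
    return total / n

print(second_special())  # close to 3/2 + 3/2 = 3
```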
The same user actually gave an explicit formula for A in the general setting of rolling until n 6s in a row, but there are a lot of steps and we won't need the exact answer anyway. Instead, we will content ourselves with an upper bound for A that is lower than 3.
We first compute the probability that two 6s occur before the first odd via a Markov chain argument. A "success" occurs if we roll the two 6s before an odd and a "failure" occurs if we roll the odd first.
Let p0 be the probability of success at the beginning. Any time our most recent roll was a 2 or 4 and we have neither rolled 2 6s in a row nor an odd, the probability of success is still p0 - nothing has changed. We will refer to this state as S0.
But when we roll a 6, the probability of getting two 6s before an odd is temporarily higher - call this p6. We will call this state S6. If the next roll is a 2 or 4, we are back to S0 and the new probability becomes p0 again.
We will solve for p0 and p6 via a system of equations:
p0 = (1/6)·p6 + (2/6)·p0
p6 = 1/6 + (2/6)·p0
In other words, the probability of success from S0 is equal to the probability of going from S0 to S6 (which occurs when you roll a 6, so 1/6) times the probability of success from S6, plus the probability you stay at S0 (which occurs when rolling a 2 or 4) times the probability of success from S0.
On the other hand, the probability of success from S6 is 1/6 (the probability of rolling another 6 and being done), plus the probability you go back to S0 times the probability of success from S0.
Solving this system gives p0 = 1/22. So:
A = (Σ_{k=1}^∞ k · Pr(first instance of two 6s in a row at roll k and all evens until roll k)) / (1/22)
That numerator is tricky to calculate. It is much easier to calculate the probability that an instance of exactly two 6s in a row occurs at roll k with all evens until roll k; this will be higher, but we just want an upper bound anyway!
For k < 2, this probability is 0 and for k = 2, the probability is (1/6)^2 = 1/36. For k > 2, this is the probability that the first k−3 rolls are even, roll k−2 is a 2 or 4, and rolls k−1, k are 6. In other words, (1/2)^(k−3) · (1/3) · (1/6)^2 = (1/108) · (1/2)^(k−3).
So:
A < 22 · (2 · (1/36) + Σ_{k=3}^∞ k · (1/108) · (1/2)^(k−3)) = 77/27 < 2.852.
(skipping some algebra to manipulate the sum). The exact answer for A is actually 30/11, which is about 2.72.
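The exact value 30/11 can be reproduced numerically by summing the series for A with a two-state recursion over all-even, no-66-yet prefixes (code mine):

```python
def exact_A(terms=2000):
    """A = sum_k k * P(first 66 at roll k, all even) / P(66 before an odd),
    with each term taken from a two-state (last-roll-6 or not) recursion."""
    p_not6, p_is6 = 1.0, 0.0   # state after 0 rolls: all even, no 66 yet
    num = 0.0
    for k in range(1, terms):
        num += k * p_is6 * (1 / 6)   # roll k is a 6 completing the 66
        # transitions: a 2 or 4 (prob 2/6) from either state, or a 6
        # (prob 1/6) from a non-6 state; 6 after 6 is absorbed above
        p_not6, p_is6 = (2 / 6) * (p_not6 + p_is6), (1 / 6) * p_not6
    return num / (1 / 22)

print(exact_A())  # 30/11 = 2.7272...
```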
The general case
Following the above argument, you can show that the expected number of rolls until the nth 6, given that all rolls are even, is 3n/2. The expected number of rolls until the first instance of n 6s in a row, given that all rolls are even, is a little less than n + 4/5. So, for instance, the expected number of rolls until the 70th 6 is more than the expected number of rolls until the first instance of 100 6s in a row, when both require no odds to show up before the stopping condition.