by [anonymous]

I came up with this problem the other day. I don't have nearly enough math to solve it. Do any of you wise folk have insight?

Imagine a sequence of randomly generated 0s and 1s. I start from nothing and generate this sequence one term at a time. I stop when I'm at the end of a streak of 1s which is at least one half the length of the total sequence.

For example:

01

0010110101101011111111111111

010110111111 -EDIT: Obviously, this is wrong, because it would terminate at 01. Sorry.

What is the average length of such sequences?

I know that all of these sequences will eventually terminate. I just don't know if the length diverges, and I'm not sure how to deal with the problem.

13 comments
[-]gjm190

The number of sequences of length 2n ending with at least n 1s is 2^n.

The number of sequences of length 2n+1 ending with at least (n+1) 1s is also 2^n.

Hence, if you generate 0s and 1s at random for ever, the expected number of stopping-points you encounter is 1/2 + 1/2 + 1/4 + 1/4 + 1/8 + 1/8 + ... = 2.

If you begin with 000 then the expected number of stopping points after that is 3/4, hence the probability of stopping at all is at most 3/4. Hence the overall probability of ever stopping is at most 1 - 1/8 . 1/4 = 31/32 = 0.96875. (It must in fact be less than that.)

So it is not true that "all of these sequences will eventually terminate", nor even that a random such sequence terminates with probability 1.

Upvoted for the nice argument.

Just as a piece of constructive criticism, I will note that the argument is probably a bit hard to read if you don't already know a decent amount of combinatorics / probability. In particular you might want to cite Markov's inequality explicitly.

[-]gjm30

Yeah, it's a bit terse. I was writing while at work, and since regrettably "doing combinatorics on Less Wrong" isn't part of my job description I didn't spend much time polishing.

Also, there's a mistake, though it doesn't invalidate the argument.

So here's a more leisurely and less-wrong explanation.

Imagine that you just generate 0s and 1s for ever. You don't stop when you find a long enough run of 1s, you merely note that fact and continue generating bits. How many times, on average, will that happen?

The probability that it happens after generating exactly n bits is the fraction of n-bit strings that end with at least n/2 consecutive 1s. When n=2m is even, that means you get to choose the first m bits arbitrarily and then the remaining m are all 1s. When n=2m+1 is odd, it means you get to choose the first m bits arbitrarily and then the remaining m+1 are all 1s. So the probability that you get to call "stop!" after a given number of bits goes: 1/2, 1/2, 1/4, 1/4, 1/8, 1/8, etc.

Now, the average number of times you get to call "stop!" is just the sum of those probabilities. ("Expectations are additive": the average number of times you get to say "stop!" equals the average number of times you get to say it after one bit, plus the average number of times you get to say it after two bits, etc., and those averages are just the probabilities.) So the average number of "stop!"s is 1/2+1/2+1/4+1/4+etc = 2.
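In code (a brute-force sketch in Python, just to check the arithmetic; the helper name is mine):

```python
from itertools import product

# Brute-force check of the probabilities above: the fraction of n-bit strings
# whose trailing run of 1s is at least n/2 should go 1/2, 1/2, 1/4, 1/4, ...
def can_stop(bits):
    run = 0
    for b in reversed(bits):
        if b == 0:
            break
        run += 1
    return 2 * run >= len(bits)

probs = []
for n in range(1, 9):
    hits = sum(can_stop(s) for s in product((0, 1), repeat=n))
    probs.append(hits / 2 ** n)
print(probs)  # [0.5, 0.5, 0.25, 0.25, 0.125, 0.125, 0.0625, 0.0625]
# Summing all such probabilities gives the expected number of "stop!"s:
# 1/2 + 1/2 + 1/4 + 1/4 + ... = 2.
```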

Now, handwavily, the fact that this is finite should be enough to make you suspect that there's a nonzero probability of never seeing a stopping-point: after all, there's a nonzero probability of reaching (say) 100 bits without stopping, and after that the probability should look like 1/2^50+1/2^50+1/2^51+... = 1/2^48 which is very small.

That handwavy argument is actually wrong, because it's possible that conditional on not yet having reached a stopping point the probability of stopping might get bigger. So, e.g., imagine that the process went like this: if you've generated a 0s and b 1s, your next bit is 1 with probability 1-1/a if b=0 and 0 otherwise; you stop when you first generate a 1. Then the expected (average) total number of stopping points is just 1 (because once you've generated one 1 you never generate another) but the probability of ever stopping is also 1 -- with each bit that goes past without a 1 generated, the probability that you stop on the very next turn gets closer to 1.

In our situation, though, it's the other way about: in so far as previous stoppability makes a difference, having been unable to stop for a while makes you less likely to be entitled to stop.

To be more precise: Suppose your first three bits happen to be 0. (An event with probability 1/8.) Then if at any later point you're entitled to stop, you'd have been entitled to stop whatever the first three bits were. In particular, Pr(can stop at time t | first three bits are 0) <= Pr(can stop at time t). Also, Pr(can stop at time t | first three bits are 0) is 0 for t=1,2,3.

Therefore, conditional on the first three bits being 0, the expected (average) number of "stop!"s is at most what you get on replacing the first three terms of the series with 0s: 0+0+0+1/4+1/8+1/8+1/16+1/16+... = 3/4.

[Note: in my original much more terse explanation, I said that those two things are equal. That was the mistake I mentioned earlier.]

This implies that, conditional on starting with three 0s, there's at least a 1/4 probability of never being entitled to stop. Why? Because Pr(at least one stop) <= expected number of stops: expected number of stops = sum {n>=1} of Pr(exactly n stops) . n >= sum {n>=1} of Pr(exactly n stops) . 1 = Pr(at least one stop). So Pr(at least one stop) <= 3/4, and hence Pr(no stops) >= 1/4.

So. If (as happens with probability 1/8) the first three bits are 0, then with probability at least 1/4 you never get to stop. Therefore Pr(never get to stop) >= 1/8 . 1/4 = 1/32, and therefore Pr(ever get to stop) <= 31/32. In particular, that last probability isn't 1.
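As a numerical sanity check on that bound (a brute-force sketch; the 16-bit horizon and function name are my own choices):

```python
from itertools import product

# Expected number of stopping points among strings whose first three bits are 0,
# counted over the first 16 bits. (16 is an arbitrary horizon; the neglected
# tail is tiny.) The bound above says this should be at most 3/4.
def stop_points(bits):
    run, count = 0, 0
    for t, b in enumerate(bits, start=1):
        run = run + 1 if b else 0
        if 2 * run >= t:        # trailing run of 1s is at least half the length
            count += 1
    return count

N = 16
total = sum(stop_points((0, 0, 0) + rest) for rest in product((0, 1), repeat=N - 3))
print(total / 2 ** (N - 3))     # comfortably below the 3/4 upper bound
```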

I know that all of these sequences will eventually terminate.

Are you sure? I wrote a little program to compute the number of ways of stopping after n bits, and so find the probabilities, and I got this result. Columns are string length, number of ways of stopping at that length, probability of stopping at that length, and cumulative probability of stopping at or before that length. The sum of the products of the first and third columns appears to top out at about 1.4743; that is the expected length restricted to terminating sequences, so the expected length conditional on terminating is about 1.4743 / 0.841658 = 1.7517.

(ETA: Table edited to fix an error and add the second column.)

       1               1   0.500000   0.500000
       2               1   0.250000   0.750000
       4               1   0.062500   0.812500
       6               1   0.015625   0.828125
       8               2   0.007812   0.835938
      10               3   0.002930   0.838867
      12               6   0.001465   0.840332
      14              11   0.000671   0.841003
      16              22   0.000336   0.841339
      18              42   0.000160   0.841499
      20              84   0.000080   0.841579
      22             165   0.000039   0.841619
      24             330   0.000020   0.841638
      26             654   0.000010   0.841648
      28            1308   0.000005   0.841653
      30            2605   0.000002   0.841655
      32            5210   0.000001   0.841657
      34           10398   0.000001   0.841657
      36           20796   0.000000   0.841658
      38           41550   0.000000   0.841658
      40           83100   0.000000   0.841658
      42          166116   0.000000   0.841658
      44          332232   0.000000   0.841658
      46          664299   0.000000   0.841658
      48         1328598   0.000000   0.841658
      50         2656866   0.000000   0.841658
      52         5313732   0.000000   0.841658
      54        10626810   0.000000   0.841658
      56        21253620   0.000000   0.841658
      58        42505932   0.000000   0.841658
      60        85011864   0.000000   0.841658
      62       170021123   0.000000   0.841658
      64       340042246   0.000000   0.841658
      66       680079282   0.000000   0.841658
      68      1360158564   0.000000   0.841658
      70    2.720307e+09   0.000000   0.841658
      72    5.440613e+09   0.000000   0.841658
      74    1.088121e+10   0.000000   0.841658
      76    2.176241e+10   0.000000   0.841658
      78    4.352478e+10   0.000000   0.841658
      80    8.704957e+10   0.000000   0.841658
      82    1.740990e+11   0.000000   0.841658
      84    3.481981e+11   0.000000   0.841658
      86    6.963960e+11   0.000000   0.841658
      88    1.392792e+12   0.000000   0.841658
      90    2.785584e+12   0.000000   0.841658
      92    5.571168e+12   0.000000   0.841658
      94    1.114233e+13   0.000000   0.841658
      96    2.228467e+13   0.000000   0.841658
      98    4.456934e+13   0.000000   0.841658
     100    8.913867e+13   0.000000   0.841658

Eyeballing a plot of the fourth column doesn't suggest it's going to reach 1, but it might be some extraordinarily slow-growing thing.

The recurrence for computing the number of ways of stopping is a sort of generalised Fibonacci: each value is the sum of approximately the last half of the preceding values.
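For the curious, a sketch of that computation in Python (variable names are mine). B(n) counts the n-bit strings containing no stopping point; such a string ends with a 0 followed by k 1s for some k < n/2, which is where the "sum of the last half" comes from, and the number of ways of stopping at length 2n equals B(n-1) (a 2n-bit stopper is 2n-1 unstopped bits ending in n-1 ones, followed by a 1):

```python
def stopping_table(max_n):
    # B[n] = number of n-bit strings with no stopping point; B[0] = 1 (empty string)
    B = [1]
    for n in range(1, max_n):
        B.append(sum(B[n - k - 1] for k in range((n + 1) // 2)))  # k < n/2 trailing 1s
    rows = [(1, 1, 0.5)]  # edge case: the single string "1" stops at length 1
    for n in range(1, max_n + 1):
        ways = B[n - 1]   # stop at length 2n: 2n-1 unstopped bits ending in n-1 ones, then a 1
        rows.append((2 * n, ways, ways / 4 ** n))
    return rows

rows = stopping_table(50)
for length, ways, prob in rows[:8]:
    print(length, ways, round(prob, 6))
print(sum(p for _, _, p in rows))   # cumulative probability: ~0.841658
```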

What prompts asking the question here?

I have repeated the calculation and confirm the results.

Interestingly, the sequence in the second column satisfies a recurrence relation:

  • T(4n+2) = 2T(4n) - T(2n)
  • T(4n) = 2T(4n-2)
  • (and trivially T(2n+1) = T(2n), not included in the table).
[-]Larks110

Surely

010110111111

is an invalid string?

Good insight and it may affect gjm's calculations. It took me enough time to see what you were saying that I'll spell it out for others...

This string is impossible because it would have stopped at "01". In general, every terminating string of length N necessarily ends with N/2 1s, and its variable prefix cannot contain a substring that would already have terminated. For example, you can't count strings matching patterns like x1xx..., xx11xx..., xxx111xx..., or xxxx1111xx..., and the coverage of later exclusion patterns overlaps with earlier ones, which makes the counting a bit tricky (e.g. 0101110... is doubly excluded, by x1 and xxx111).

[-]gjm20

FWIW, my calculations take this into account. (Or at least they're meant to; I can't guarantee that I haven't made any mistakes.)

[-][anonymous]00

Sorry, you're obviously right. I have edited the post to reflect this.

[-]gjm100

A sequence cannot end after an odd number of bits, because if you have 2n+1 bits ending with (n+1) 1s then you first had 2n bits ending with n 1s, and would have ended there.

Let A(n) = the number of sequences that terminate after exactly n steps, B(n) = the number of sequences of length n that haven't terminated yet, and B(n,k) = the number of sequences of length n that haven't terminated yet ending with exactly k 1s. Then A(2n) = B(2n-1,n-1) and A(2n+1) = 0.

For any k<n/2, B(n,k) = B(n-k-1). (The sequences counted by B(n,k) all end with a 0 and then k 1s, and didn't terminate before that.) So, in particular, A(2n) = B(2n-1-(n-1)-1) = B(n-1).

B(n) = sum {0<=k<n/2} of B(n,k) = sum {0<=k<n/2} of B(n-k-1).

It isn't immediately obvious to me how to get a closed form out of this, so I computed the first few terms and looked the numbers up in the Online Encyclopedia of Integer Sequences. It appears to be http://oeis.org/A002083 whose elements are conjectured to be asymptotically constant*2^n. (It's interesting that the descriptions of this sequence in OEIS don't mention anything trivially equivalent to what GSE has asked about here.)

The average length restricted to terminating sequences is the sum {n>=1} of 2n A(2n) / 2^(2n) = the sum {n>=1} of 2n B(n-1) / 4^n. Writing b(x) = sum { n>=0 } of B(n) x^n, this equals 2x b(x) + 2x^2 b'(x) at x = 1/4. The information about the generating function on the OEIS page implies that b(x) is finite for x < 1/2, so this sum converges -- consistent with the value of about 1.4743 computed from the table above. The unconditional average length is nonetheless infinite, simply because a positive fraction of sequences never terminate at all.

The average length of sequences that terminate is harder: it's lim { N->inf } of sum { 1<=n<=N } of 2n A(2n) 2^(2N-2n), divided by 2^(2N) - B(2N) (the number of length-2N strings that have terminated by then). Since I'm supposed to be doing the work I'm paid to do right now, I'll leave it there for someone who may or may not be me to come back to :-).

(It's interesting that the descriptions of this sequence in OEIS don't mention anything trivially equivalent to what GSE has asked about here.)

Perhaps not trivially so, but this one from that page is exactly the one I worked out when I looked at the problem, and what I programmed to compute the table:

a(1)=1, else a(n) is sum of [ n/2 ] previous terms.

There's an edge case at the very start that means that the OP's sequence has an extra 1 at the beginning.

[-]gjm10

I was thinking of the material in the "Comment" section, which lists various examples of where the sequence arises, often (as here) of the form "a(n) = total number of objects of such-and-such a kind satisfying such-and-such conditions". It's true that one of the recurrence relations listed there also arises very straightforwardly from GSE's definition of the sequence, but that's not the same thing.

What about lengths other than n/2?

If the expected number of stopping times is finite, the probability of stopping cannot equal 1. (Because the expected number of stopping times remaining after the first K bits are all 0 must go to 0 as K grows, and therefore pass below 1; once it is below 1, the probability of ever stopping after such a prefix is below 1 too, and that prefix occurs with positive probability.)

So a threshold of no more than log2(n) plus a constant on the length of the required run of 1s should be necessary for the process to always stop.
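A rough numerical illustration of why (a sketch; the threshold function and the constant are my own choices): with a required run of f(n) = ceil(log2 n) + c instead of n/2, a stopping point is available at step n with probability 2^-f(n), and the expected number of stopping points grows like a harmonic series instead of converging:

```python
import math

def expected_stops(N, c=1):
    # Probability of a stopping point at step n is 2**-f(n), with
    # f(n) = ceil(log2 n) + c, since the last f(n) bits must all be 1.
    # The sum behaves like (1/2**c) * sum of 1/n, which diverges.
    return sum(2.0 ** -(math.ceil(math.log2(n)) + c) for n in range(1, N + 1))

for N in (10**2, 10**4, 10**6):
    print(N, expected_stops(N))   # keeps growing, unlike the n/2 case where the sum is 2
```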


Now I'm going to try and calculate the distribution. The only thing that matters at each step is the number of 1s at the end. This has a 50% chance of being increased by 1 and a 50% chance of falling to 0.
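That state-space observation gives an exact dynamic programme (a Python sketch rather than a spreadsheet; the names are mine):

```python
from fractions import Fraction

# Exact dynamic programme over the state described above: all that matters is the
# length r of the current trailing run of 1s. alive[r] = probability of not yet
# having stopped, with trailing run exactly r.
def stopping_distribution(max_len):
    alive = {0: Fraction(1)}       # before the first bit: empty string, run of 0
    stop_prob = {}                 # length t -> probability of stopping exactly at t
    for t in range(1, max_len + 1):
        nxt = {0: Fraction(0)}
        p_stop = Fraction(0)
        for r, p in alive.items():
            nxt[0] += p / 2        # next bit 0: run falls to 0
            if 2 * (r + 1) >= t:   # next bit 1 and run >= t/2: stop here
                p_stop += p / 2
            else:
                nxt[r + 1] = p / 2  # next bit 1: run increases by 1
        alive = nxt
        stop_prob[t] = p_stop
    return stop_prob

dist = stopping_distribution(100)
p_term = sum(dist.values())                    # probability of stopping within 100 bits
mean_restricted = sum(t * p for t, p in dist.items())
print(float(p_term))           # ~0.841658, matching the cumulative column upthread
print(float(mean_restricted))  # ~1.4743, the sum-of-products figure upthread
```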

If you analyze this (in a way that's not worth explaining on this poorly-suited medium), you get my Excel calculation, which gives an expected value of 0.93975 conditional on terminating.

This is weird because I got the same terminating probabilities as Richard, but different expected value.