Brun's theorem is a relatively famous result in analytic number theory that says the sum of the reciprocals of the twin primes converges to a finite value. In other words, we have
$$\sum_{p,\ p+2 \text{ prime}} \left( \frac{1}{p} + \frac{1}{p+2} \right) = B$$
for some finite constant B. This is in contrast to the same sum taken over all primes, which is divergent:
$$\sum_{p \text{ prime}} \frac{1}{p} = \infty$$
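To make the contrast concrete, here's a minimal Python sketch (a brute-force sieve, nothing from the proof below) that computes the partial sums of the twin prime series. The known value of Brun's constant is $B \approx 1.902$, but convergence is famously slow: the partial sum at $10^6$ only reaches roughly $1.7$.

```python
# Partial sums of the twin prime series sum (1/p + 1/(p+2)).
# Brun's theorem says these converge; the limit (Brun's constant)
# is about 1.902, but convergence is extremely slow.

def primality_table(n):
    """Boolean table is_prime[0..n] via the sieve of Eratosthenes."""
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, n + 1, i):
                is_prime[j] = False
    return is_prime

N = 10 ** 6
is_prime = primality_table(N + 2)
total = 0.0
for p in range(3, N + 1):
    if is_prime[p] and is_prime[p + 2]:
        total += 1 / p + 1 / (p + 2)
print(total)  # roughly 1.7 at N = 10^6
```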
In this post, I'll use Brun's theorem as an illustration of sieve theoretic arguments in analytic number theory. I'll try to explain relevant results as I go along to minimize the background necessary to understand the arguments, but some background in real analysis and number theory is needed to understand the post. If you don't have such a background, most of the post will probably be gibberish.
I'm writing this post mostly because I think there's some lack of good explanations of sieve theory in general and the Brun sieve in particular. Hopefully this post will be helpful to a handful of people who are interested in or trying to understand such matters.
Note that in the post I'll not always mention that a sum or a product runs over the prime numbers explicitly. If the sum or product is indexed by the letter p, you should assume that it runs over the primes and not e.g. over the natural numbers. Sometimes p will run only over odd primes, because there is a degenerate case with the prime 2 when we work with twin primes coming from the fact that 0 and 2 are in the same residue class mod 2. This will often be obvious from the surrounding context.
Background
First, let's discuss some background results that will be useful to know throughout the post.
The prime number theorem says that the number of primes less than N, denoted π(N), is well approximated by N/logN. Concretely, it says that
$$\lim_{N \to \infty} \frac{\pi(N)}{N / \log N} = 1$$
Roughly, the prime number theorem says that the density of the prime numbers "around N" is roughly 1/logN when N is large. This is a rather difficult theorem to prove and we won't actually need it to prove Brun's theorem. However, the result will be useful to know in heuristic arguments for why we might expect the theorem to be true, and to motivate our method of proof.
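As a quick illustration (not needed anywhere below), here's a short Python check of how slowly the ratio $\pi(N)/(N/\log N)$ approaches $1$:

```python
import math

def prime_count(n):
    """pi(n) via the sieve of Eratosthenes."""
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, n + 1, i):
                is_prime[j] = False
    return sum(is_prime)

for N in [10 ** 4, 10 ** 5, 10 ** 6]:
    # The ratio tends to 1, but is still about 1.08 at N = 10^6.
    print(N, prime_count(N) / (N / math.log(N)))
```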
We will also need the Mertens theorems. These theorems are "weaker" versions of the prime number theorem. Specifically, we will need to know the fact that
$$A(x) = \sum_{p \le x,\ p \text{ prime}} \frac{1}{p} = \log\log x + O(1)$$
This is a stronger, quantitative version of the result that the sum of the reciprocals of the prime numbers is divergent. This theorem is not as difficult to prove. Here is a possible proof strategy: we know that each prime p≤n occurs in the prime factorization of n! "roughly" n/p times. So we have a rough approximation
$$\log(n!) \approx \sum_{p \le n} \frac{n \log p}{p}$$
where the sum over p runs over prime numbers - I'll stick to this convention for the rest of the post for the sake of brevity. On the other hand, Stirling's theorem gives the approximation log(n!)=nlogn−n+O(logn). Combining these immediately gives
$$B(n) = \sum_{p \le n} \frac{\log p}{p} = \log n + O(1)$$
once we take account of the error term in this approximation.
To pass from this to the sum we care about, we employ partial summation. This is the discrete analog of integration by parts. The idea is to write the sum we care about as
$$A(x) = \sum_{2 \le n \le x} \frac{1_{\text{prime}}(n) \log n}{n \log n} = \sum_{2 \le n \le x} \frac{B(n) - B(n-1)}{\log n}$$
where $1_{\text{prime}}$ is the indicator function of the primes, and then use summation by parts, which allows us to shift the "differencing operator" from acting on $B$ to acting on $1/\log n$. Approximating the first difference of $1/\log n$ by its derivative gives us the main term
$$\approx \sum_{2 \le n \le x} \frac{B(n)}{n \log^2 n}$$
which we can now approximate using the result $B(n) = \log n + O(1)$ from above as
$$\approx \sum_{2 \le n \le x} \frac{1}{n \log n} \approx \log\log x$$
where the last approximation follows from replacing the sum by an integral and noting that $\frac{d}{dx} \log\log x = \frac{1}{x \log x}$. If we carefully keep track of all the error terms in the approximations, we again recover the precise result that $A(x) = \log\log x + O(1)$.
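Both of these estimates are easy to check numerically. Here's a short Python sketch computing $A(x) - \log\log x$ and $B(x) - \log x$, which by the above should stay bounded (each difference in fact converges to a constant):

```python
import math

def primes_up_to(n):
    """List of primes <= n via the sieve of Eratosthenes."""
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, n + 1, i):
                is_prime[j] = False
    return [p for p in range(2, n + 1) if is_prime[p]]

for x in [10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    ps = primes_up_to(x)
    A = sum(1 / p for p in ps)
    B = sum(math.log(p) / p for p in ps)
    # Both differences should stay bounded as x grows.
    print(x, A - math.log(math.log(x)), B - math.log(x))
```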
That's all the background knowledge we'll need for the post. Moving on:
Heuristics
Let's think about why we might expect Brun's theorem to be true.
We know from the prime number theorem that the density of the primes around $n$ is $\sim 1/\log n$, so if we cheat and assume that the events of $n$ being prime and $n+2$ being prime are "independent", the density of the twin primes around $n$ should be roughly $1/\log^2 n$. So we should expect
$$\pi_2(N) \sim \frac{N}{\log^2 N}$$
where $\pi_2(N)$ counts the twin primes less than or equal to $N$. If so, we can simply turn the crank of partial summation as we did before: letting $P_2$ be the set of twin primes and $1_{P_2}$ its indicator function, we compute
$$\sum_{p \in P_2,\ p \le x} \frac{1}{p} = \sum_{2 \le n \le x} \frac{1_{P_2}(n)}{n} \approx \sum_{2 \le n \le x} \frac{n}{n^2 \log^2 n} = \sum_{2 \le n \le x} \frac{1}{n \log^2 n}$$
This final sum is convergent. One way to see it is by an integral test: if we consider the integral
$$\int_2^{\infty} \frac{dx}{x \log^2 x}$$
then the substitution $x = e^u$, $dx = e^u \, du$ turns the integral into
$$\int_{\log 2}^{\infty} \frac{du}{u^2}$$
which is obviously finite. So the initial sum is convergent as well.
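For what it's worth, the heuristic also looks reasonable numerically. The sketch below counts twin prime pairs up to $10^6$ and compares against $N/\log^2 N$; we should expect agreement only in the order of growth, not the constant (conjecturally the constant in front is $2C_2 \approx 1.32$, which the naive independence heuristic misses):

```python
import math

def primality_table(n):
    """Boolean table is_prime[0..n] via the sieve of Eratosthenes."""
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, n + 1, i):
                is_prime[j] = False
    return is_prime

N = 10 ** 6
is_prime = primality_table(N + 2)
pi2 = sum(1 for n in range(3, N + 1) if is_prime[n] and is_prime[n + 2])
print(pi2, N / math.log(N) ** 2)  # same order of magnitude
```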
This heuristic suggests that Brun's theorem should be true if there is "nothing wrong": if there's no unexpected correlation between the events of n being prime and n+2 being prime. The task of the proof is therefore to show that this correlation can't get bad enough to make this sum divergent.
One key insight is that we actually have a substantial amount of room to work with in this argument: we don't need to get anywhere close to $\pi_2(N) \sim N/\log^2 N$. If we could show, for instance, that
$$\pi_2(N) \le \frac{N}{\log^{1.5} N}$$
for all sufficiently large $N$, which is a much weaker bound than what the heuristic suggests, we would still prove that the sum we care about is convergent. This fact is what makes Brun's theorem a relatively "easy" result: we don't need to show that there's strictly no correlation between $n$ being prime and $n+2$ being prime, just that the correlation doesn't get too bad.
The sieve of Eratosthenes
Now that we understand what we have to do, it's time to think about concrete proof strategies. Our goal is to prove a "nontrivial" upper bound on the twin prime counting function $\pi_2(N)$. We know that the key objective is to bound the dependence between the events of $n$ being prime and $n+2$ being prime. We need to make something work with these ingredients.
Given that our goal is to prove an upper bound, we'll still have succeeded if we find a set $S$ that contains the twin primes and for which we can prove a similar result, since an upper bound on $|S \cap \{1, 2, \ldots, x\}|$ obviously gives an upper bound on $\pi_2(x)$ as well. This suggests one immediate way to attempt a proof: the sieve of Eratosthenes.
The sieve of Eratosthenes characterizes the primes as follows: a prime is a number that, for each of $2, 3, 5, 7, 11, 13, \ldots$, is either not divisible by it or equal to it. In other words, we obtain a sequence of sets $S_k$ consisting of the integers not divisible by any of the first $k$ primes (except those primes themselves), and as $k$ grows this gives us an increasingly refined superset of the prime numbers. It's called a sieve because we're gradually "sieving out" composite numbers from the integers.
To apply this to our problem, we need to modify the sieve a little, as we care about twin primes. Instead of sieving out the integers that are $0 \pmod{p}$ for each prime $p$, we'll sieve out the integers that are either $0 \pmod{p}$ or $2 \pmod{p}$. This is because the bigger prime in a twin prime pair can't belong to either of these congruence classes when $p$ is smaller than the numbers in the pair.
Now, we know that for a prime $p$, the number of natural numbers less than $n$ in any residue class mod $p$ is $n/p + O(1)$. This approximation is quite good when $p$ is much smaller than $n$. So we can roughly think of applying the sieve at a prime $p$ as removing a fraction $2/p$ of the remaining integers for odd $p$, and a fraction $1/2$ for $p = 2$. For this to work nicely it's also more convenient to sieve out the entire congruence class, so we don't make an exception when we're sieving out a twin prime itself, as we would with the traditional sieve of Eratosthenes.
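Here's a small Python sketch of this modified sieve, with the set of sieving primes as a parameter. It strikes out the whole congruence classes $0$ and $2 \pmod{p}$ for each sieving prime $p$ (mod $2$ the two classes coincide, the degenerate case mentioned at the start):

```python
def sieve_survivors(x, sieving_primes):
    """Numbers n in [1, x] with n != 0 and n != 2 (mod p) for each p."""
    alive = [True] * (x + 1)
    alive[0] = False
    for p in sieving_primes:
        for r in {0, 2 % p}:  # for p = 2 the two classes coincide
            start = p if r == 0 else r
            for n in range(start, x + 1, p):
                alive[n] = False
    return [n for n in range(1, x + 1) if alive[n]]

# Example: sieve [1, 1000] by the primes up to sqrt(1000) ~ 31. Any
# larger member n of a twin prime pair with n - 2 > 31 survives, since
# then neither n nor n - 2 has a prime factor <= 31.
survivors = sieve_survivors(1000, [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31])
print(len(survivors), survivors[:8])
```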
Given the above, we can't sieve by all the primes up to $x$, as that would obviously leave us with nothing. But if we sieve by the primes up to $\sqrt{x}$, we'll only have removed at most $\sqrt{x}$ twin primes, which doesn't matter, as the upper bound we want to prove is much bigger than $\sqrt{x}$. So we might ideally hope for a result like
$$\frac{\pi_2(x)}{x} \le \prod_{p \le \sqrt{x}} \left(1 - \frac{2}{p}\right) \approx \exp\left(-\sum_{p \le \sqrt{x}} \frac{2}{p}\right) = O\left(\frac{1}{\log^2 x}\right)$$
where the approximation uses the fact that $1 - u \approx \exp(-u)$ when $u$ is small, and the last equality uses Mertens' theorems. Note that $p$ here runs over the odd primes. If we could prove this, we would be very happy.
However, in fact we can't prove this, and the reason is that we're again assuming that the events we're dealing with are independent. If we could prove the independence in some formal sense, the argument would be fine, but in fact this turns out to be very difficult to prove when we're taking all primes up to √x in the product expansion.
The reason is that the fundamental result for patching together congruence classes modulo different primes, the Chinese remainder theorem, only allows us to control the number of solutions $\le N$ to a system of congruences over different prime moduli when the product of those primes is substantially less than $N$. Concretely, the best bound we can actually prove on the above expression is
$$\frac{\pi_2(x)}{x} \le \prod_{p \le \sqrt{x}} \left(1 - \frac{2}{p}\right) + O\left(\frac{\prod_{p \le \sqrt{x}} p}{x}\right)$$
It turns out that another implication of the prime number theorem is that
$$\sum_{p \le x,\ p \text{ prime}} \log p = x + o(x)$$
so we expect the product of all the primes up to $x$ to behave like $e^x$. When we apply this to the primes less than $\sqrt{x}$, we get that the product is $\approx e^{\sqrt{x}}$, which is much larger than $x$ asymptotically. So the error term becomes much larger than the main term, and the approximation fails.
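You can see the failure numerically without any asymptotics; compare $\log \prod_{p \le \sqrt{x}} p$ with $\log x$:

```python
import math

def primes_up_to(n):
    """List of primes <= n via the sieve of Eratosthenes."""
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, n + 1, i):
                is_prime[j] = False
    return [p for p in range(2, n + 1) if is_prime[p]]

x = 10 ** 6
m = math.isqrt(x)  # sieving threshold sqrt(x) = 1000
log_modulus = sum(math.log(p) for p in primes_up_to(m))
# The log of the CRT modulus dwarfs log x: about 956 vs about 13.8.
print(log_modulus, math.log(x))
```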
The best we can hope for here is to take $p$ up to $(\log x)/2$ or so. This is sometimes called the Legendre sieve, and gives us the bound
$$\frac{\pi_2(x)}{x} \le \prod_{p \le (\log x)/2} \left(1 - \frac{2}{p}\right) + O(x^{-1/2}) \approx \exp\left(-\sum_{p \le (\log x)/2} \frac{2}{p}\right) = O\left(\frac{1}{(\log\log x)^2}\right)$$
which turns out to be too weak to do what we want. Indeed, this is even weaker than the prime number theorem bound $\pi_2(N) \ll \pi(N) \approx N/\log N$, and we know the reciprocals of the prime numbers have a divergent sum, so the Legendre sieve gets us nothing in this situation: it's far too weak.
We obviously need to be more clever. However, this shouldn't be too hard: intuitively, it should be possible to go past the $\log x$ threshold, because it's not as if we don't understand anything about primes bigger than $\log x$. The only information we lack is about joint correlations involving a very large number of primes; we understand the low order correlations very well even at thresholds much bigger than $\log x$, because those only involve the product of a small number of primes. For instance, up to $x^{1/10}$ we understand all correlations that involve $10$ or fewer primes very well, from the same Chinese remainder theorem bound.
This suggests the following proof strategy: since higher order correlations are causing all the problems, we should try to find some way to decompose the quantity we want to calculate into "lower order" and "higher order" terms, and then see if we can bound the "higher order" part well enough that we can get some nontrivial result.
The Brun sieve
The Brun sieve is essentially an implementation of the vague idea discussed above.
Let's say that the event $E_p$ for a prime $p$ is defined as above: it's the event that a natural number in $[1, x]$ is either $0$ or $2$ modulo $p$. Then, we can see our above computation as noting that, ignoring the twin primes less than $\sqrt{x}$ (which pose no problems, as discussed above), we have
$$\pi_2(x) \ll \left| \bigcap_{p \le \sqrt{x}} (E_p)^c \right|$$
where the exponent $c$ denotes the set complement. If we assume the events are actually independent, we can do a computation that looks like
$$\frac{\pi_2(x)}{x} \ll \mathbb{P}\left(\bigcap_{p \le \sqrt{x}} (E_p)^c\right) = \prod_{p \le \sqrt{x}} \left(1 - \mathbb{P}(E_p)\right)$$
and since we know $\mathbb{P}(E_p) = 2/p + O(x^{-1/2})$, we would be able to do the same proof we did above. However, as we know, the independence assumption is actually wrong, so this doesn't really work.
However, there is an unconditional way to simplify expressions of this kind that doesn't assume independence: to use inclusion-exclusion. This looks like
$$\mathbb{P}\left(\bigcap_{p \le \sqrt{x}} (E_p)^c\right) = 1 - \mathbb{P}\left(\bigcup_{p \le \sqrt{x}} E_p\right) = 1 - \sum_{p \le \sqrt{x}} \mathbb{P}(E_p) + \sum_{\substack{p_1, p_2 \le \sqrt{x} \\ p_1 < p_2}} \mathbb{P}(E_{p_1} \cap E_{p_2}) - \ldots$$
Overall it's a big mess, but the hope is that this is exactly what we need to use our better understanding of the low order correlations between the congruence classes of different primes: inclusion-exclusion gives us a series expansion for the quantity we want that's increasing in the complexity of the correlations we need to understand!
So our hope might be the following: we truncate this sum at some threshold t, as in, we truncate at the term that involves the correlations of t primes. We pick t to be even, as inclusion-exclusion has the property that the partial sums are always upper bounds for even t and lower bounds for odd t. We then try to understand this truncated sum better.
The square root is not the right threshold for this, as we already don't understand three-prime correlations when we work up to $\sqrt{x}$. So let's also take the threshold at which we cut off the primes as a free parameter $m$. Above we were using $m = \sqrt{x}$, but we want to generalize this now.
One immediate observation is that if we only sieve by the primes up to $m$, we only have a good understanding of correlations that involve at most $\log x / \log m$ primes. To be safe, let's truncate the sum at (or choose $t$ to be) the even number closest to $c \log x / \log m$, where $c \ll 1$ is a small constant that we'll fix later. In this case, the Chinese remainder approximations will be very good, and we don't have to worry about the failure of independence anymore. (I'll equivocate between $t$ the even integer and $t$ the real number $c \log x / \log m$ sometimes - rest assured that this causes no problems for the argument and everything goes into the error terms.) So we'll get the sum
$$\approx 1 - \sum_{p \le m} \frac{2}{p} + \sum_{p_1 < p_2 \le m} \frac{4}{p_1 p_2} - \sum_{p_1 < p_2 < p_3 \le m} \frac{8}{p_1 p_2 p_3} + \ldots + \sum_{p_1 < \ldots < p_t \le m} \frac{2^t}{p_1 \cdots p_t}$$
after truncation. It's important to keep track of the error term here properly. The error we make whenever we approximate the probability of a single intersection under an independence assumption is $O(x^{c-1})$: each correlation involves at most $t$ primes and each prime is $\le m$, so the relevant modulus is bounded from above by $m^t$, which equals $x^c$ given our choice of $t$. Since for $t < m$ the number of such approximations we have to make is bounded from above by $t \cdot \binom{m}{t} < m^t = x^c$, we're safe as long as $c$ is small. For instance, $c \le 1/3$ is enough for the total error $O(x^{2c-1})$ to be negligible relative to the main term here.
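To see the truncation behave as advertised, here's a small numerical sketch in the idealized independent model (so this illustrates the combinatorics, not the number theory): with $v_p = 2/p$ over the odd primes $p \le m$, the $k$-prime term of the sum is $(-1)^k e_k(v)$, where $e_k$ is the $k$-th elementary symmetric polynomial. Even truncation points sit above the full product and odd ones below, as the Bonferroni inequalities promise:

```python
def odd_primes_up_to(n):
    """Odd primes <= n via the sieve of Eratosthenes."""
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, n + 1, i):
                is_prime[j] = False
    return [p for p in range(3, n + 1) if is_prime[p]]

m, t_max = 1000, 10
v = [2 / p for p in odd_primes_up_to(m)]

# e[k] = k-th elementary symmetric polynomial of the values in v,
# built up with the standard one-pass recurrence.
e = [1.0] + [0.0] * t_max
for value in v:
    for k in range(t_max, 0, -1):
        e[k] += value * e[k - 1]

full_product = 1.0
for value in v:
    full_product *= 1 - value

partial = 0.0
for t in range(t_max + 1):
    partial += (-1) ** t * e[t]
    side = ">=" if t % 2 == 0 else "<="
    print(t, round(partial, 6), side, round(full_product, 6))
```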
Now that we know the error term is negligible so long as $c$ is small, let's think about approximating this sum. Bounding it from above with the triangle inequality gives very poor results here, because the sum has a lot of cancellation. However, we can use a trick: we know that the untruncated sum, i.e. the sum for $t = \infty$, is equal to the product
$$\prod_{p \le m} \left(1 - \frac{2}{p}\right)$$
This is the expression we were working with earlier when assuming independence, so it makes sense to bring it back here. Instead of bounding the sum above directly, we'll write it as this "main term" plus an error term that comes from the "high order correlations" where the number of primes exceeds $t = c \log x / \log m$. This fits in naturally with our proof strategy: try to bound the contribution coming from higher order correlations. This will give us
$$= \prod_{p \le m} \left(1 - \frac{2}{p}\right) + \sum_{p_1 < \ldots < p_{t+1} \le m} \frac{2^{t+1}}{p_1 \cdots p_{t+1}} - \sum_{p_1 < \ldots < p_{t+2} \le m} \frac{2^{t+2}}{p_1 \cdots p_{t+2}} + \ldots$$
The triangle inequality works much better on the error term now. Indeed, it's fairly straightforward to see that if we recall the definition
$$A(m) = \sum_{p \le m} \frac{1}{p}$$
then the "error term", as in, all terms aside from the product term here, can be bounded in absolute value by
$$\sum_{k = t+1}^{\infty} \frac{2^k A(m)^k}{k!}$$
This is an elementary exercise and I leave it to the reader to see why it's true. Taking it for granted, though, we can approximate this sum up to a constant by just its first term if $t \gg A(m) \approx \log\log m$, as the rest of the series can be bounded from above by a geometric series in the parameter $A(m)/t$, which converges quickly when the above condition is true. So as long as we pick $t \gg \log\log m$ asymptotically, say if $t$ is always greater than $D \log\log m$ for some $D > 2$, we can approximate this sum by just its first term, which is roughly
$$\frac{2^t (\log\log m)^t}{t!}$$
This is not exactly right but the difference doesn't affect the proof - it just goes into the error term, as usual. The main term is also easy to approximate: we use the old idea of writing it as
$$\prod_{p \le m} \left(1 - \frac{2}{p}\right) = O\left(\exp\left(-\sum_{p \le m} \frac{2}{p}\right)\right) = O\left(\frac{1}{\log^2 m}\right)$$
using the Mertens theorems.
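As a sanity check on the main term, multiplying the product by $\log^2 m$ should give something that stays bounded as $m$ grows; a short Python sketch (numerically the values settle toward a constant, around $0.83$, related to the twin prime constant):

```python
import math

def odd_primes_up_to(n):
    """Odd primes <= n via the sieve of Eratosthenes."""
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, n + 1, i):
                is_prime[j] = False
    return [p for p in range(3, n + 1) if is_prime[p]]

for m in [10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    product = 1.0
    for p in odd_primes_up_to(m):
        product *= 1 - 2 / p
    # Should stay bounded (and in fact converge) as m grows.
    print(m, product * math.log(m) ** 2)
```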
Finishing the proof
Now we're almost at the end of the proof. We have a bound of the form
$$O\left(\frac{1}{\log^2 m}\right) + O\left(\frac{2^t (\log\log m)^t}{t!}\right)$$
We can use Stirling's formula to replace the factorial and end up with some bound of the form
$$\frac{1}{\log^2 m} + \left(\frac{2e \log\log m}{t}\right)^t$$
where I've dropped the big O notation for simplicity, as this bound is not exactly right even up to a constant. However, its differences from a bound that does work are inconsequential, so I'll illustrate the rest of the argument with this bound.
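The Stirling step above only uses $t! \approx (t/e)^t$ up to factors subexponential in $t$; a quick check comparing logarithms:

```python
import math

# log(t!) versus t * log(t/e): the gap grows only like (1/2) log t,
# which is harmless at the level of precision we need here.
for t in [10, 20, 40, 80, 160]:
    print(t, math.lgamma(t + 1), t * math.log(t / math.e))
```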
What can we pick for $m$? We know that we can't pick it to be a constant - in this parametrization a constant corresponds to $m = x^{1/\log x}$ - and we can't pick it to be $x$ either. As a compromise, let's try something of the form $m = x^{1/(g \log\log x)}$ for some constant $g > 0$. With this choice, we'll get
$$t = \frac{c \log x}{\log m} = gc \log\log x$$
Now, let's work out what this says. Our bound becomes, to leading order,
$$\frac{g^2 (\log\log x)^2}{\log^2 x} + \left(\frac{2e}{gc}\right)^{gc \log\log x}$$
Now, we need to make the choices of the constants. $c$ is small, so we need to pick $g$ big enough that $t \gg \log\log x > 2 \log\log m$ for our approximations to have negligible error terms. Choosing $g \gg 2/c$ is good enough for this purpose. At the same time, we need to ensure that $gc \gg 2e$, as otherwise the error term will dominate the main term here. Specifically, we have
$$\left(\frac{2e}{gc}\right)^{gc \log\log x} = \exp\left(\log\left(\frac{2e}{gc}\right) \cdot gc \cdot \log\log x\right) = (\log x)^{gc \log(2e/(gc))}$$
and we want the main term to dominate, so we want the exponent here to be less than $-2$. If we choose $c = 1/3$ and $g = 27$, so that $gc = 9$, both conditions are easy to check: $g = 27 > 2/c = 6$, and the exponent is $9 \log(2e/9) \approx -4.5 < -2$. So everything checks out, and we're finally able to recover a bound of the form
$$\pi_2(x) \le O\left(\frac{x (\log\log x)^2}{\log^2 x}\right)$$
Is this good enough? Yes! Earlier, we had said that even a bound of the form
$$\pi_2(x) \ll \frac{x}{\log^{1.5} x}$$
would've been good enough, and this bound is far tighter. It's not even a factor of $(\log x)^{\varepsilon}$ away from the independence bound, which is in fact quite impressive.
Conclusion
The fundamental nature of the above proof is that we understand the behavior of primes relative to modular congruence classes quite well, at least when the modulus is small compared to the primes in question. We then try to leverage our understanding here into saying something about the distribution of primes in other sets.
This is the heart of sieve theory: it's used in cases when you understand the behavior of some objects well in some substructures and you want to "piece that together" to gain a broader understanding of how the objects behave. Here, the substructures are arithmetic progressions or modular congruence classes, and the objects are the prime numbers, but the fundamental ideas of sieve theory may be applied in other domains as well.
The Brun sieve itself is a fairly basic sieve that was introduced over a hundred years ago for the exact purpose of proving tight upper bounds on π2(x), though it can also be used for other purposes as a general way to prove upper bounds on various quantities involving primes using combinatorial methods. Today, more sophisticated sieves are able to prove nontrivial lower bounds as well, which is often a more tricky problem than proving upper bounds in sieve theory.
For more on sieve theory, you might want to check out the associated tag on Terence Tao's blog. It contains material ranging from introductory to research-level in sophistication.