Solid post - it is good to have the full reasoning for Laplace's rule of succession in a single post instead of buried in a statistics post. I also liked the discussion on how to use it in practice - I'd love to see a full example using actual numbers if you feel like writing one!
On this topic I also recently enjoyed UnexpectedValues' post. He provides a cool proof / intuition for the rule of succession.
Yeah, Neyman's proof of Laplace's version of the rule of succession is nice. The reason I think this kind of approach can't give the full strength of the conjugate prior approach is that there's a kind of "irreducible complexity" to computing the normalization factor B(α, β) for non-integer values of α and β. The only easy proof I know goes through the connection to the gamma function. If you stick only to integer values there are easier ways of doing the computation, and the linearity of expectation argument given by Neyman is one way to do it.
One concrete example of the rule being used in practice I can think of right now is this comment by SimonM on Metaculus.
Laplace's rule of succession is a commonly used heuristic for forecasting the occurrence of events which have little to no historical precedent. The basic problem setup is this: suppose you flip a coin with an unknown bias N = H + T times and get H heads and T tails. We want to infer something about the bias p of the coin from this information, where we define p to be the probability of the coin coming up heads.
One obvious idea is to estimate p by ^p = H/(H+T). This works well if we have a decent number of both heads and tails, but it fails badly in the edge case where the tosses are very imbalanced. In the extreme case, if we get all heads and no tails, we would be estimating ^p = 1, which is obviously a bad estimate: it implies that we're sure the coin has probability 1 of coming up heads, which seems like an extreme inference to make from a sequence of N heads. After all, if we merely had 1 − p < 1/(2N), we would not expect to see any tails to begin with. An inference that goes beyond this bound seems intuitively unjustified.
The simple version of Laplace's rule is a heuristic which says we should pretend we got one extra head and one extra tail on top of the results we actually obtained, and compute the naive estimator under that assumption. In other words, we would estimate ^p = (H+1)/(H+T+2). This avoids the pitfall mentioned in the previous paragraph even though it is a biased estimator: for N tosses we have
E[^p] = (pN + 1)/(N + 2) = (p(N + 2) + (1 − 2p))/(N + 2) = p + (1 − 2p)/(N + 2)
which is not equal to the true value p unless p = 1/2. This is unavoidable: if we want to avoid overconfidence in the case of N heads, we have to make the estimator biased when p = 1, and by continuity it will then also be biased in some neighborhood of 1. It is, however, asymptotically unbiased: as N → ∞, we have E[^p] → p.
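Here is a quick numerical sanity check of this bias formula (a minimal sketch in plain Python; the function name and the default parameter values are illustrative choices of mine):

```python
import random

# compare the naive estimator H/N with the Laplace estimator (H+1)/(N+2)
# on repeated batches of N tosses from a coin with a known bias p
def check_bias(p=0.9, N=10, trials=100_000, seed=0):
    rng = random.Random(seed)
    naive_sum, laplace_sum = 0.0, 0.0
    for _ in range(trials):
        H = sum(rng.random() < p for _ in range(N))  # heads in this batch
        naive_sum += H / N
        laplace_sum += (H + 1) / (N + 2)
    theory = p + (1 - 2 * p) / (N + 2)  # the bias formula derived above
    print(f"E[naive]   ~ {naive_sum / trials:.4f}  (true p = {p})")
    print(f"E[laplace] ~ {laplace_sum / trials:.4f}  (formula says {theory:.4f})")

check_bias()
```

With these numbers the naive estimator averages to about 0.9 while the Laplace estimator averages to about 0.8333, matching p + (1 − 2p)/(N + 2).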
It's easy to see why this is an improvement over the naive estimator for avoiding overconfidence, but it looks more like an ad-hoc fix than an approach with theoretical justification. To give the justification behind Laplace's rule, I first have to talk about conjugate priors.
Conjugate priors
In Bayesian inference, we often have the following problem: there is a family of models we're considering that are parametrized by a vector θ ∈ R^n of parameters and we have a prior P(θ) over these parameters. Given the prior, we observe some event X, and we want to do Bayesian updating to get a posterior by
P(θ|X)∝P(θ)⋅P(X|θ)
Here ∝ is the proportionality symbol, since I'm omitting the overall normalizing factor P(X) in Bayes' rule for simplicity.
In general this becomes a mess: the posterior P(θ|X) can become intractable even after a simple update like this, since in principle there's no reason for the prior P(θ) and the likelihood P(X|θ) to have anything to do with each other. To simplify Bayesian inference in cases like this, a common trick is to pick, given the form of the likelihood function, a family of priors such that the posterior will also lie in the same family as the prior. Such families of priors are called conjugate priors for the likelihood.
Suppose we try to do this for our coin setup. We take θ = p, and the likelihood function of observing a specific sequence of H heads and T tails will then be p^H (1 − p)^T. The prior and posterior will be related by
P(p|X) ∝ P(p) · p^H (1 − p)^T
It should be fairly obvious that we can consider a family of distributions Beta(α, β) ∝ p^(α−1) (1 − p)^(β−1) in this case. Such density functions can be normalized to give genuine probability densities as long as α, β > 0. If the prior P(p) is of this form, all the Bayesian update will do is change its exponents: we'll go from a prior of the form Beta(α, β) to a posterior of the form Beta(α + H, β + T). Thus, as long as our initial prior is of this specific form, we can reduce a Bayesian update to a simple problem of updating two parameters α, β in a straightforward and transparent manner.
The factor B(α, β) we need to normalize the function p^(α−1) (1 − p)^(β−1) to get an actual probability density function is called the beta function and the distribution itself is called the beta distribution. We say that the beta distribution is the conjugate prior to the Bernoulli distribution, which is the technical name for the distribution of coin flips.
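To make the bookkeeping concrete, here is a minimal sketch of the update rule in Python (the class name BetaPrior and the example counts are mine, purely for illustration):

```python
from dataclasses import dataclass

@dataclass
class BetaPrior:
    alpha: float
    beta: float

    def update(self, heads: int, tails: int) -> "BetaPrior":
        # for a conjugate prior, a Bayesian update is just parameter arithmetic
        return BetaPrior(self.alpha + heads, self.beta + tails)

print(BetaPrior(1, 1).update(heads=7, tails=3))  # BetaPrior(alpha=8, beta=4)
```

The entire update is two additions, which is the whole appeal of conjugacy.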
Properties of the beta distribution
There are two properties of the beta distribution that are important for the inference problem we'll face: its mode and its mean. I'll compute both of them here.
The mode is easy. I'll only do the case α, β > 1 here since this case is the hardest. We can ignore the normalization factor and simply maximize f(p) = p^(α−1) (1 − p)^(β−1) over p. Differentiating and using the product rule gives
f′(p) = ((α − 1)/p − (β − 1)/(1 − p)) · p^(α−1) (1 − p)^(β−1)
which is equal to zero when the expression in the parentheses is equal to zero. Multiplying through by p(1 − p) gives (1 − p)(α − 1) − p(β − 1) = 0, or
p = (α − 1)/(α + β − 2)
This local extremum is a maximum when α,β>1, so it's indeed the mode of Beta(α,β).
The mean is easy to compute if we know some functional equations satisfied by the beta function, in particular its relationship to the gamma function, but I won't assume knowledge of those here and I'll just say that the mean of Beta(α, β) turns out to be α/(α + β).
Importantly, the mean and mode of the beta distribution are not equal! This is actually the correct way to understand the necessity for something like Laplace's rule from a Bayesian perspective, as I'll explain shortly.
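If you want to double-check these formulas numerically, here is a small sketch (assuming scipy is available; the parameter values 8 and 4 are arbitrary examples):

```python
from scipy.stats import beta
from scipy.optimize import minimize_scalar

a, b = 8, 4
print("mean formula:", a / (a + b))            # 0.666...
print("scipy mean:  ", beta.mean(a, b))        # 0.666...
print("mode formula:", (a - 1) / (a + b - 2))  # 0.7
# the mode maximizes the density, so minimize the negative pdf over [0, 1]
res = minimize_scalar(lambda p: -beta.pdf(p, a, b), bounds=(0, 1), method="bounded")
print("scipy mode:  ", res.x)                  # ~0.7
```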
Inference
Now that we have the conjugate prior machinery prepared, we can imagine that we start flipping our coin with some prior Beta(α,β) over its bias p. Once we get H heads and T tails, we update to the posterior Beta(α+H,β+T). This works fine if we actually have a prior, so part of the moral of Laplace's rule is that you can account for your prior when doing Bayesian inference for this problem by simply assuming you got α more heads and β more tails than you actually did get. If we don't actually have a prior, though, how do we choose one?
This is the problem of uninformative priors, in other words, priors we would choose in a setting in which we know nothing about the coin. In particular, we can't know that heads is more likely than tails or vice versa, so that suggests our prior should be symmetric: we should have α=β. That reduces us down to one degree of freedom.
The naive version of Laplace's rule corresponds to taking α = β = 1, which, as you can check, simply corresponds to a uniform distribution over [0, 1]. Once we do this and observe H heads and T tails, our posterior becomes Beta(H+1, T+1), which has mean (H+1)/(H+T+2) and mode H/(H+T). If we're good Bayesians, our subjective probability that the next coin toss will come up heads has to be the mean (H+1)/(H+T+2), since the probability of heads on the next toss is exactly the expectation of p under the posterior. This fits Laplace's rule, not the mode H/(H+T). Taking the mode in this context would be a form of maximum a posteriori estimation or maximum likelihood estimation: we have that
P(p|X) ∝ P(p) · p^H (1 − p)^T
so taking the mode of the posterior is equivalent to maximizing the right-hand side of this expression, and since the prior P(p) is constant for α = β = 1, this just means we're maximizing the likelihood. In this particular case the mode of the posterior is also the maximum likelihood estimate, though for other priors this would not be true. However, taking the maximum likelihood estimate is not the correct thing to do in this situation, because we really want the mean of the posterior and not its mode. This is the reason why the naive estimator fails in this problem.
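To see the difference on the edge case from the introduction, here is the arithmetic worked through with example numbers of my choosing:

```python
H, T = 10, 0                            # the all-heads edge case
a, b = 1 + H, 1 + T                     # uniform prior -> posterior Beta(11, 1)
posterior_mean = a / (a + b)            # (H+1)/(H+T+2) = 11/12
posterior_mode = (a - 1) / (a + b - 2)  # H/(H+T) = 1.0
print(posterior_mean)  # 0.9166... a sane forecast for the next toss
print(posterior_mode)  # 1.0       the MLE, overconfident as before
```

The posterior mean stays sensibly away from 1, while the mode reproduces exactly the naive estimator's failure.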
We can, however, ask whether α = β = 1 is the most sensible choice of prior. Two other popular choices are α = β = 1/2, called the Jeffreys prior, and α = β = 0, called the Haldane prior. The Haldane prior is technically not a probability density because its integral over the unit interval blows up and therefore can't be normalized, but we can still use it as a so-called improper prior in Bayesian inference. However, it has the same overconfidence problem discussed at the start of this post, so I'll instead try to give the motivation for why we might prefer the Jeffreys prior.
One of the main reasons to prefer the Jeffreys prior is that it is invariant under reparametrization. There is a kind of arbitrariness to assuming that the most natural distribution is uniform: if we parametrize our coin not by the chance p of coming up heads but instead by some other variable such as θ=cos(πp), taking a uniform distribution over θ would not be the same thing as taking a uniform distribution over p. Therefore we can only really say a uniform distribution is a natural choice if we know that parametrizing the distribution of coin flips by p is in some sense the best choice we can make.
On the other hand, the Jeffreys prior is defined explicitly to be invariant under reparametrization: it is proportional to the square root of a quantity (the determinant of the Fisher information matrix) which changes by the square of the Jacobian when we switch parametrizations, so the prior itself transforms by a Jacobian factor, which makes it give a consistent probability distribution regardless of the choice of variable used to parametrize it. You can think of the Jeffreys prior as "respecting the chain rule".
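Spelling out the computation for our setup: a single toss has likelihood p^x (1 − p)^(1−x), and a standard calculation gives the Fisher information I(p) = 1/(p(1 − p)). The Jeffreys prior is proportional to √I(p), so

P(p) ∝ p^(−1/2) (1 − p)^(−1/2)

which is exactly the density of Beta(1/2, 1/2) up to normalization.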
So the Jeffreys prior for our setup is equivalent to taking α = β = 1/2. In terms of inference, this means we would get the estimator
^p = (H + 0.5)/(H + T + 1)
instead of the Laplace rule estimator of
^p = (H + 1)/(H + T + 2)
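For a feel of how much this matters, here is a quick comparison of the two estimators on the all-heads case (a small sketch; the values of N are arbitrary):

```python
def laplace(H, T):
    return (H + 1) / (H + T + 2)

def jeffreys(H, T):
    return (H + 0.5) / (H + T + 1)

for N in (1, 5, 20, 100):
    print(N, round(laplace(N, 0), 4), round(jeffreys(N, 0), 4))
```

The Jeffreys estimate always sits closer to the empirical frequency H/(H+T); for N = 5, for instance, it gives 0.9167 versus Laplace's 0.8571, and the two priors pull noticeably apart only when data is scarce.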
Using it in practice
The rule of succession is best used in practice not by adhering to a fixed procedure for choosing a prior, but by using the fact that the beta distribution is the conjugate prior to the Bernoulli distribution to represent your prior beliefs in a tractable way and to make Bayesian updating easy and interpretable.
If we have strong reasons to be confident in our prior, for instance if we're sampling from a reliable pseudorandom number generator, we would pick α and β large, with α/(α + β) equal to the probability we're feeding the PRNG; how large these parameters are represents our confidence level. On the other hand, if we're forecasting something such as the chance that Italy will have a different prime minister in six months, we can get α and β from the past frequency of prime minister changes in Italy, scale them down by some amount to add uncertainty to the prior (political conditions might have changed), and then update our forecast as new evidence comes in. We can put other kinds of evidence on the same footing by asking: "how many months would the incumbent prime minister have to remain in office for that observation to be roughly as strong as the evidence I actually have?"
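As a sketch of what that workflow might look like in Python (all the numbers below are placeholders of my choosing, not real data about Italian politics):

```python
# pseudo-data prior: roughly 1 PM change per 18 months of history
alpha, beta = 1.0, 17.0
discount = 0.5                 # scale down: conditions may have changed
alpha, beta = alpha * discount, beta * discount

months_survived = 4            # each quiet month counts as one "tail"
beta += months_survived

p_month = alpha / (alpha + beta)       # updated monthly chance of a change
p_six_months = 1 - (1 - p_month) ** 6  # chance of at least one change in 6 months
print(round(p_month, 4), round(p_six_months, 4))
```

The discount factor and the evidence-to-months conversion are exactly the judgment calls described above; the conjugate structure just keeps the arithmetic trivial.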
Like anything else in forecasting, this method can be more of an art than a science, but I think the rule of succession gives a useful framework for thinking about these kinds of issues in a Bayesian way.