Are you aware of the existing work on ignorance priors, for instance the maximum entropy prior (if I remember properly this is the Jeffreys prior, which gives rise to the KT estimator), and also the improper prior that effectively places almost all of the weight on 0 and 1 (the Haldane prior)? Interestingly, the universal distribution does not include continuous parameters, yet it ends up dominating any computable rule for assigning probabilities, including these families of conjugate priors.
If I understand correctly, the maximum entropy prior will be the uniform prior, which gives rise to Laplace's law of succession, at least if we're using the standard definition of (differential) entropy below:

H(p) = −∫ p(θ) log p(θ) dθ, with θ ranging over [0, 1]
But this definition is somewhat arbitrary, because the "dθ" term assumes that there's something special about parameterising the distribution by its probability, as opposed to other parameterisations (e.g. its odds, its log-odds, etc.). The Jeffreys prior is supposed to be invariant under reparameterisation, which is why people like it.
But my complaint is more Solomonoff-ish. The prior should put more weight on simple distributions, i.e. probability distributions described by short probabilistic programs. Such a prior would better match our intuitions about the probabilities that arise in real-life stochastic processes. The ideal prior is the Solomonoff prior, but that's intractable. I think my prior is the most tractable prior that resolves the most egregious anti-Solomonoff problems with the Laplace/Jeffreys priors.
I find this intellectually stimulating, but it does not look useful in practice: with repeated i.i.d. data, the information in the data quickly swamps a diffuse/universal/ignorance prior.
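To put rough numbers on "swamps", here is a minimal sketch (Python, with arbitrary illustrative counts of my own choosing) comparing the posterior predictive under the uniform prior, (n+1)/(N+2), against the one under the Jeffreys prior, the KT estimator (n+1/2)/(N+1):

```python
# Posterior predictive P(next = 1 | n successes in N trials)
# under two common ignorance priors for a Bernoulli parameter.

def laplace(n, N):
    """Uniform prior Beta(1, 1): Laplace's rule of succession."""
    return (n + 1) / (N + 2)

def jeffreys(n, N):
    """Jeffreys prior Beta(1/2, 1/2): the KT estimator."""
    return (n + 0.5) / (N + 1)

for N in (3, 30, 1000):
    n = 2 * N // 3  # two-thirds successes, purely for illustration
    print(f"N={N:5d}: Laplace={laplace(n, N):.4f}  Jeffreys={jeffreys(n, N):.4f}")
```

At N = 3 the two disagree in the second decimal place; by N = 1000 they agree to roughly three decimal places, which is the sense in which the prior stops mattering.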
You raise a good point, but I think the choice of prior matters quite often:
An interesting thing is that Laplace's rule gives almost the same result as Gott's equation from the Doomsday argument, which has a much simpler derivation.
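Quick check of that claim, under the usual delta-t reading of Gott's argument (my arithmetic, not from the comment above): if after N steps our vantage point is uniform over the process's total lifetime, then

P(ends on step N+1) = 1 − N/(N+1) = 1/(N+1)   [Gott]
P(failure on trial N+1) = (0+1)/(N+2) = 1/(N+2)   [Laplace, N successes, 0 failures]

so the two hazard estimates differ only by the +1 vs +2 in the denominator.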
Imagine a sequence of binary outcomes generated independently and identically by some stochastic process. After observing N outcomes, with n successes, Laplace's Rule of Succession suggests that our confidence in another success should be (n+1)/(N+2). This corresponds to a uniform prior over [0,1] for the underlying probability. But should we really be uniform about probabilities?
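For reference, the rule falls straight out of the uniform prior via Beta integrals:

P(next success | n, N) = ∫ θ · θ^n (1−θ)^(N−n) dθ / ∫ θ^n (1−θ)^(N−n) dθ = B(n+2, N−n+1) / B(n+1, N−n+1) = (n+1)/(N+2).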
I think a uniform prior is wrong for three reasons:
I propose this mixture distribution:
w1 * logistic-normal(0, sigma^2) + w2 * 0.5(dirac(0) + dirac(1)) + w3 * thomae_{100}(α) + w4 * uniform(0,1)
where:

- w1, w2, w3, w4 are non-negative mixture weights summing to 1;
- logistic-normal(0, sigma^2) is the distribution of sigmoid(X) for X ~ Normal(0, sigma^2), i.e. a Gaussian over the log-odds, which for large sigma puts substantial mass near 0 and 1;
- dirac(0) and dirac(1) are point masses capturing fully deterministic processes;
- thomae_{100}(α) is a discrete distribution over reduced fractions p/q with denominator q ≤ 100, weighting each fraction proportionally to q^(-α), so simple fractions like 1/2 or 1/6 get most of the mass;
- uniform(0,1) is the Laplace catch-all component.
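As a concrete reading of the components, here is a minimal sampling sketch in Python (the function name sample_prior is mine, and the rendering of thomae_{100}(α) as "reduced fractions p/q with q ≤ 100, weighted ∝ q^(-α)" follows the reading above):

```python
import math
import random
from fractions import Fraction

# thomae_100 support: reduced fractions p/q with 0 < p < q <= 100.
# Precomputed once; Fraction reduces automatically, so the set dedups 2/4 -> 1/2.
_THOMAE_FRACS = sorted({Fraction(p, q) for q in range(2, 101) for p in range(1, q)})

def sample_prior(w=(0.3, 0.1, 0.3, 0.3), sigma=5.0, alpha=2.0):
    """Draw one theta from the proposed four-component mixture prior."""
    u = random.random()
    if u < w[0]:
        # logistic-normal(0, sigma^2): Gaussian over log-odds, squashed by sigmoid
        return 1.0 / (1.0 + math.exp(-random.gauss(0.0, sigma)))
    elif u < w[0] + w[1]:
        # 0.5 * (dirac(0) + dirac(1)): fully deterministic processes
        return float(random.random() < 0.5)
    elif u < w[0] + w[1] + w[2]:
        # thomae_100(alpha): weight each reduced fraction p/q by q^(-alpha),
        # so simple fractions (1/2, 1/3, ...) dominate
        weights = [f.denominator ** -alpha for f in _THOMAE_FRACS]
        return float(random.choices(_THOMAE_FRACS, weights=weights, k=1)[0])
    else:
        # uniform(0, 1): the Laplace catch-all
        return random.random()
```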
Ideally, our prior should be a mixture of every possible probabilistic program, weighted by 2^(-K) where K is its Kolmogorov complexity. This would properly capture our preference for simple mechanisms. However, such a distribution is impossible to represent, compute, or apply. Instead, I propose my prior as a tractable distribution that resolves what I think are the most egregious problems with Laplace's law of succession.
Now that I've found the appropriate approximation for the universal prior over binary outcomes, the path to solving induction is clear. First, we'll extend this to pairs of binary outcomes, then triples, and so on. I expect to have sequences of length 10 nailed by Tuesday, and full Solomonoff Induction by Q1 2025.
I've built an interactive demo to explore this distribution. The default parameters (w1=0.3, w2=0.1, w3=0.3, w4=0.3, sigma=5, alpha=2) reflect my intuition about the relative frequency of these different types of programs in practice. This gives a more realistic prior for many real-world scenarios where we're trying to infer the behavior of unknown processes that might be deterministic, fair, or genuinely random in various ways. What do you think? Is there a simple model which serves as a better prior?
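If you want to poke at the distribution outside the demo, here is a crude Monte Carlo posterior predictive reusing sample_prior from the sketch above (importance-weighting prior draws by the Bernoulli likelihood; the atoms at 0, 1, and the rationals are handled naturally by sampling, but the estimate gets noisy for large N):

```python
def posterior_predictive(n, N, draws=200_000, **prior_kwargs):
    """Monte Carlo estimate of P(next = 1 | n successes in N trials)
    under the mixture prior: weight each prior draw by its likelihood."""
    num = den = 0.0
    for _ in range(draws):
        theta = sample_prior(**prior_kwargs)
        like = theta ** n * (1.0 - theta) ** (N - n)  # Bernoulli likelihood
        num += theta * like
        den += like
    return num / den

# e.g. after 7 successes in 10 trials:
# print(posterior_predictive(7, 10))
```

Comparing its output against (n+1)/(N+2) for small N is a quick way to see where this prior and Laplace's rule actually part ways.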