In short: There is no objective way of summarizing a Bayesian update over an event with three outcomes $A:B:C$ as an update over two outcomes $A:\neg A$.
Suppose there is an event with possible outcomes $A, B, C$.
We have prior beliefs about the outcomes $a:b:c$.
An expert reports a likelihood factor of $x:y:z$.
Our posterior beliefs about $A, B, C$ are then $ax:by:cz$.
But suppose we only care about whether $A$ happens.
Our prior beliefs about $A, \neg A$ are $a : b+c$.
Our posterior beliefs are $ax : by+cz$.
This implies that the likelihood factor of the expert regarding $A:\neg A$ is $\frac{ax}{a} : \frac{by+cz}{b+c} = x : \frac{b}{b+c}y + \frac{c}{b+c}z$.
This likelihood factor depends on the ratio of prior beliefs $b:c$.
Concretely, the lower factor in the update is the weighted mean of the evidence $y$ and $z$ according to the weights $b$ and $c$.
This has a relatively straightforward interpretation. The update is supposed to be the ratio of the likelihoods under each hypothesis. The upper factor in the update is $P(E \mid A)$. The lower factor is $P(E \mid \neg A) = P(E \mid B)P(B \mid \neg A) + P(E \mid C)P(C \mid \neg A)$.
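The dependence on the prior can be checked numerically. The following is a minimal sketch: the function name `binary_likelihood_factor` and all the numbers are made up for illustration, with `prior` standing for the prior odds over the three outcomes and `likelihood` for the expert's reported factor.

```python
from fractions import Fraction

def binary_likelihood_factor(prior, likelihood):
    """Return the implied A : not-A likelihood factor as a pair (upper, lower).

    prior      -- prior odds (a, b, c) over outcomes A, B, C (integers here)
    likelihood -- likelihood factor (x, y, z) reported by the expert
    """
    a, b, c = prior
    x, y, z = likelihood
    # Upper factor: posterior weight of A divided by its prior weight.
    upper = Fraction(a * x, a)
    # Lower factor: posterior weight of not-A divided by its prior weight,
    # i.e. the mean of y and z weighted by b and c.
    lower = Fraction(b * y + c * z, b + c)
    return upper, lower

# Same expert report, two priors that differ only in the ratio b : c.
likelihood = (4, 2, 1)
print(binary_likelihood_factor((1, 1, 1), likelihood))  # lower factor 3/2
print(binary_likelihood_factor((1, 3, 1), likelihood))  # lower factor 7/4
```

The upper factor is the same in both calls, while the lower factor moves with the prior ratio of the two aggregated outcomes.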
I found this very surprising - the summary of the expert report depends on my prior beliefs!
I claim that this phenomenon is unintuitive, and being unaware of it can lead to errors.
Why this is weird
Bayes' rule describes how to update our prior beliefs using data.
In my mind, one very nice property of Bayes' rule was that it cleanly separates the process into a subjective part (eliciting your priors) and an ~objective part (computing the update).
For example, we may disagree on our prior beliefs on whether, e.g., COVID19 originated in a lab. But we cannot disagree on the direction and magnitude of the update caused by learning that it originated in one of the few cities in the world with a gain-of-function lab working on coronaviruses.
Because of this, researchers are encouraged to report their update factors together with their all-things-considered beliefs. This way, users can use their research for their own conclusions by multiplying their prior with the update. And metastudies can just take the product of the likelihoods of all studies to estimate the combined effect of the evidence.
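For a binary question, this workflow amounts to multiplying odds. A minimal sketch follows; the function names and the reported factors are hypothetical, and taking the plain product assumes the studies provide independent evidence.

```python
from math import prod

def update_odds(prior_odds, likelihood_ratios):
    """Multiply binary prior odds by a sequence of likelihood ratios."""
    return prior_odds * prod(likelihood_ratios)

def odds_to_probability(odds):
    """Convert odds in favour of an event into a probability."""
    return odds / (1 + odds)

# Two readers with different priors apply the same (made-up) reported factors.
reported_factors = [3.0, 0.5, 4.0]  # combined factor: 6.0
for prior_odds in (0.1, 1.0):
    posterior = update_odds(prior_odds, reported_factors)
    print(prior_odds, "->", round(odds_to_probability(posterior), 3))
```

Both readers end up with different posteriors, but they agree on the combined factor that took them there.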
In the above example, we lose this nice property - the update factor depends on the prior beliefs of the user. Researchers would not be able to objectively summarize their likelihood about whether COVID19 originated in a lab accidentally vs zoonotically vs being designed as a bioweapon as a single number for people who only care about whether it originated in a lab versus any other possibility.
Examples in the wild
I ran into this problem twice recently:
- When analyzing Mennen’s ABC example of a case where averaging the logarithmic odds of experts seems to result in nonsense.
- In my own research on interpreting Bayesian Networks as I was trying to come up with a way of decomposing a Bayesian update into a combination of several updates.
In both cases, being unaware of the phenomenon led me to a conceptual mistake.
Mennen’s ABC example
Mennen’s example involves three experts debating an event with three possible outcomes, $A$, $B$ and $C$.
Expert #1 assigns relative odds of .
Expert #2 assigns relative odds of .
Expert #3 assigns relative odds of .
The logodds-averaging pooled opinion of the experts is $1:1:1$, i.e. equal odds, which correspond to a probability of $A$ equal to $1/3$.
But suppose we only care about $A : \neg A$.
Expert #1’s implicit odds are .
Expert #2’s implicit odds are .
Expert #3’s implicit odds are .
The pooled odds in this case are , which correspond to a probability of equal to .
We get different results depending on whether we take the implicit odds after or before pooling expert opinion. What is going on?
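The order dependence can be reproduced with a short sketch. The expert odds below are illustrative assumptions, chosen so that the three-outcome pool comes out at equal odds; they are not necessarily the numbers in Mennen's example, and the helper names are made up.

```python
from math import prod

def pool_log_odds(expert_odds):
    """Average log odds across experts: the geometric mean of each component."""
    n = len(expert_odds)
    return tuple(prod(component) ** (1 / n) for component in zip(*expert_odds))

def probability_of_first(odds):
    """Probability of the first outcome implied by a tuple of relative odds."""
    return odds[0] / sum(odds)

def collapse(odds):
    """Summarize A : B : C odds as A : not-A odds."""
    a, b, c = odds
    return (a, b + c)

# Illustrative numbers (an assumption, see the note above).
experts = [(1, 2, 0.5), (1, 0.5, 2), (1, 1, 1)]

pool_then_collapse = probability_of_first(collapse(pool_log_odds(experts)))
collapse_then_pool = probability_of_first(pool_log_odds([collapse(e) for e in experts]))
print(round(pool_then_collapse, 3), round(collapse_then_pool, 3))
```

With these numbers, pooling first gives a probability of about 0.333 for the first outcome, while collapsing first gives about 0.301.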
Mennen claims that this is a strike against logarithmic pooling. The issue according to him is in the step where we take the opinion of the three experts and aggregate it using average logodds.
I think that this is related to the phenomenon I described at the beginning of the article. The problem is with the step where we take the relative odds $A:B:C$ and summarize them as $A:\neg A$.
It’s no wonder that log-odds pooling gives inconsistent results when we aggregate outcomes. Bayesian updating is not well defined in that case!
Interpreting Bayesian Networks
I will not go into too much detail because my theory of interpretability of Bayesian Networks is very complex. Suffice it to say that I was getting inconsistent results because of this issue.
In essence, I came up with a way of decomposing a Bayesian update into a series of independent steps, corresponding to different subgraphs of a Bayesian Network.
For example, I would decompose the update over a node with three outcomes as the product of the baseline odds of the event and a number of updates.
In my system, I only cared about whether $A$ happened. So I naively summarized each update before aggregating them.
This was giving me very poor results - my resulting updates would be very off compared to traditional inference algorithms like message passing.
It is no wonder this was giving me bad results - it is the wrong way of going about it! Our analysis at the beginning implies that the lower factor of each summarized update should be the prior-weighted average of the likelihood factors for $B$ and $C$, instead of their sum.
After realizing the paradox, I changed my system so that it does not summarize the odds of $A$ until after aggregating all the updates.
Performance improved.
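A toy sketch of the difference between the two orders of operation is shown below. This is not the actual Bayesian-network code: the function names and the numbers are made up for illustration, and the "naive" variant is one reading of the mistake described above (summing the factors of the aggregated outcomes before multiplying).

```python
from math import prod

def exact_binary_posterior(prior, updates):
    """Apply every update at the three-outcome level, then collapse to A : not-A."""
    a, b, c = prior
    ax = a * prod(u[0] for u in updates)
    by = b * prod(u[1] for u in updates)
    cz = c * prod(u[2] for u in updates)
    return ax / (ax + by + cz)

def naive_binary_posterior(prior, updates):
    """Collapse each update to x : (y + z) first, then multiply (the mistaken order)."""
    a, b, c = prior
    odds_a = a * prod(u[0] for u in updates)
    odds_not_a = (b + c) * prod(u[1] + u[2] for u in updates)
    return odds_a / (odds_a + odds_not_a)

prior = (1, 1, 1)
updates = [(2, 1, 1), (1, 3, 1)]  # made-up likelihood factors
print(exact_binary_posterior(prior, updates), naive_binary_posterior(prior, updates))
```

Here the exact posterior probability is 1/3, while the naive order yields 1/9.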
Consequences
I am quite confused about what to think about this.
It clearly has consequences, as illustrated by the examples in the previous section. But I am not sure what to recommend doing in response.
My most immediate takeaway is to be very careful when aggregating outcomes - there is a significant chance we will be introducing an error along the way.
Beyond that, the aggregation paradox seems to imply that we need to work at the correct level of aggregation. We cannot naively deduce implied binary odds from the distribution of a multiple outcome event.
But what is the right level of aggregation?
When aggregating, the lower factor of the update is a weighted mean of the evidence likelihoods $P(E \mid B)$ and $P(E \mid C)$. This suggests that the problem disappears when we impose $P(E \mid B) = P(E \mid C)$ for any disaggregation of the joint event $\neg A$ into subevents $B$ and $C$.
But this condition is too strong. For example, we could base our disaggregation on the observed evidence. If the evidence can either be $E$ or $\neg E$, we could disaggregate $\neg A$ into the cases where $\neg A \wedge E$ and the cases where $\neg A \wedge \neg E$. In that case, $P(E \mid B) = 1$ and $P(E \mid C) = 0$, so the condition cannot ever be satisfied, by definition.
We can say that this disaggregation is not a sensible one, and ought to be excluded for the purposes of the condition. But in that case we have passed the buck down to defining what is a sensible disaggregation.
Another approach is to assume that the prior relative likelihood of any aggregated outcomes is uniform, i.e. $P(B \mid \neg A) = P(C \mid \neg A)$. In that case, we have that $P(E \mid \neg A) = \frac{1}{2}\left(P(E \mid B) + P(E \mid C)\right)$.
But then we can no longer chain updates - after applying any likelihood where $P(E \mid B) \neq P(E \mid C)$ the resulting posterior will no longer meet this condition.
Pragmatically, it seems like the best we can do if we want to rescue objectivity is to resign ourselves to summarizing the updates assuming a uniform prior. That is, by averaging the evidence associated with each aggregated outcome.
This is not enough to correctly approximate Bayesian updating, as we can see in the example below:
But I can't see how to do better in the absence of more information.
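To get a feel for the size of the error, here is a small sketch with made-up numbers. The helper names are hypothetical. Each update is summarized by averaging the factors of the two aggregated outcomes, and the chained result is compared with the exact three-outcome computation.

```python
def uniform_summary(update):
    """Summarize x : y : z as x : (y + z) / 2, i.e. treat B and C as equally likely given not-A."""
    x, y, z = update
    return x / ((y + z) / 2)

def exact_posterior_odds(prior, updates):
    """Update at the three-outcome level, then report the odds of A against not-A."""
    a, b, c = prior
    for x, y, z in updates:
        a, b, c = a * x, b * y, c * z
    return a / (b + c)

prior = (1, 1, 1)                 # uniform over B and C, so the first summary is exact
updates = [(1, 4, 1), (1, 4, 1)]  # made-up evidence that favours B over C

approx = prior[0] / (prior[1] + prior[2])
for u in updates:
    approx *= uniform_summary(u)

print(approx, exact_posterior_odds(prior, updates))
```

After the first update both methods agree (odds of 0.2), because the prior really is uniform over the aggregated outcomes. After the second update the summarized chain gives odds of about 0.08, while the exact odds are 1/17, roughly 0.059.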
One key takeaway here is that beliefs and updates are summarized in different ways.
In summary
I have explained one counterintuitive consequence of Bayesian updating on variables with more than two outcomes. This paradox implies that we should be careful when grouping together outcomes of a variable. And I have shown two situations where this unintuitive consequence is relevant.
This is a post meant to explore and start a discussion more than provide definite answers. Some things I’d be keen on discussing include:
- Is this a documented phenomenon? Where can I find more discussion?
- What does this imply for formulating forecasting questions? Will this result in problems when asking binary questions about events that are multifaceted?
- What is “the right level” of outcome aggregation for a given problem?
- Are there other examples where similar issues come up?
I’d be really interested in your thoughts - please leave a comment if you have any!
Acknowledgements
Thanks to rossry, Nuño Sempere, Eric Neyman, Ehud Reiter and ForgedInvariant for discussing this topic with me and helping me clarify some ideas.
Thanks to Alex Mennen for coming up with the example I referenced in the post.
(Possibly a bit of a tangent) It occurred to me while reading this that perhaps average log odds could make sense in the context in which there is a uniform prior, and the probabilities provided by experts differ because the experts disagree on how to interpret evidence that brings them away from the uniform prior. This has some intuitive appeal:
1) Perhaps, when picking questions to ask forecasters, people have a tendency to pick questions for which they believe the probability that the answer is yes is approximately 50%, because that offers the most opportunity to update in response to the beliefs of the forecasters. If average log odds is an appropriate pooling method to use if you have a uniform prior, then this would explain its good empirical performance. I think I mentioned in our discussion on your EA forum post that if there is a tendency for more knowledgeable forecasters to give more extreme probabilities, then this would explain good performance by average log odds, which weights extreme predictions heavily. A tendency for the questions asked to have priors of near 50% according to the typical unknowledgeable person would explain why more knowledgeable forecasters would assign more extreme probabilities on average: it takes more expertise to justifiably bring their probabilities further from 50%.
2) It excuses the incoherent behavior of average log odds on my ABC example as well. If A, B, and C are mutually exclusive, then they can't all have 50% prior probability, so a pooling method that implicitly assumes that they do will not give coherent results.
Ultimately, though, I don't think this is actually true. Consider the example of forecasting a continuous variable $x$ by soliciting probability density functions $p_1(x)$ and $p_2(x)$ from two experts, and pooling them to get the pdf proportional to $\sqrt{p_1(x)p_2(x)}$ (renormalized so it integrates to 1). You could also consider forecasting the variable $y=f(x)$ for some differentiable, strictly increasing function $f$. Then your experts give you pdfs $q_1(y)$ and $q_2(y)$ satisfying $p_i(x)=f'(x)q_i(f(x))$, and you pool them to get the pdf proportional to $\sqrt{q_1(y)q_2(y)}$. I claim that, if what we're doing implicitly depends on a uniform prior in a sneaky way, then the first thing should be the appropriate thing to do if $x$ has a uniform prior, and the second thing should be appropriate if $y$ has a uniform prior. If $f$ is nonlinear, then a uniform prior on $x$ induces a non-uniform prior on $y$, and vice versa, so we should get incompatible results from each way of doing this, as we were implicitly using different priors each time. But let's try it: $\sqrt{p_1(x)p_2(x)}=\sqrt{f'(x)q_1(f(x))\,f'(x)q_2(f(x))}=f'(x)\sqrt{q_1(f(x))q_2(f(x))}$. Thus, given that both experts provided pdfs satisfying the formula $p_i(x)=f'(x)q_i(f(x))$, making their probability distributions on $x$ and $y$ compatible with $y=f(x)$, our pooled pdf also satisfies that formula, and is also compatible with $y=f(x)$. That is, if we pool using beliefs about $x$, and then find the implied beliefs about $y$, we get the same thing as if we directly pooled using beliefs about $y$. Different implicit priors don't appear to be ruining anything.
I conclude that the incoherent results in my ABC example cannot be blamed on switching between the uniform prior on $\{A,B,C\}$ and the uniform prior on $\{A,\neg A\}$, and, instead, should be blamed entirely on the experts having different beliefs conditional on $\neg A$, which is taken into account in the calculation using $A,B,C$, but not in the calculation using $A,\neg A$.
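The change-of-variables identity used in this argument can be spot-checked numerically. The sketch below rests on arbitrary assumptions: the expert densities on $y$ are taken to be two normal distributions and the map is $f(x) = x^3 + x$. None of these choices come from the discussion itself.

```python
from math import exp, pi, sqrt

def normal_pdf(y, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

def f(x):
    """A differentiable, strictly increasing map (chosen arbitrarily)."""
    return x ** 3 + x

def f_prime(x):
    return 3 * x ** 2 + 1

# Hypothetical expert densities on y, and the densities they induce on x.
def q1(y): return normal_pdf(y, 0.0, 1.0)
def q2(y): return normal_pdf(y, 1.0, 2.0)
def p1(x): return f_prime(x) * q1(f(x))
def p2(x): return f_prime(x) * q2(f(x))

def pooled_x(x):
    """Unnormalized pool computed from the experts' beliefs about x."""
    return sqrt(p1(x) * p2(x))

def pooled_y_pulled_back(x):
    """Unnormalized pool computed on y, then expressed as a density on x."""
    return f_prime(x) * sqrt(q1(f(x)) * q2(f(x)))

for x in (-1.0, -0.3, 0.0, 0.5, 1.2):
    print(x, pooled_x(x), pooled_y_pulled_back(x))
```

The two columns agree up to floating-point error at every sampled point. Since the unnormalized densities coincide, their normalizing constants coincide as well.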
Because it produces situations where more extreme probability estimates correlate with more expertise (assuming all forecasters are well-calibrated).
They wouldn't. But if both would have started with priors around 50% before they acquired any of their expertise, and it's their expertise that updates them away from 50%,...