Suppose you have a property Q which certain objects may or may not have. You've seen many of these objects; you know the prior probability P(Q) that an object has this property.
You have two independent measurements of object O, each of which assigns a probability that Q(O) (that O has property Q). Call these two independent probabilities A and B.
What is P(Q(O) | A, B, P(Q))?
To put it another way: expert A asserts P(Q(O)) = A = .7, expert B asserts P(Q(O)) = B = .8, and the prior is P(Q) = .4; what is P(Q(O))? The correlation between the experts' opinions is unknown, but probably small. (They aren't human experts.) I face this problem all the time at work.
You can see that the problem isn't solvable without the prior P(Q): if the prior P(Q) = .9, then two experts each assigning P(Q(O)) < .9 should result in a probability lower than the lower of their two estimates. But if P(Q) = .1, the same estimates from the two experts should result in a probability higher than either of them. But is the problem solvable, or at least well-defined, even with the prior?
The experts both know the prior, so if you had only expert A saying P(Q(O)) = .7, the answer must be .7. Expert B's opinion must then revise the probability upwards if B > P(Q), and downwards if B < P(Q).
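Here's a sketch of why, under the (strong) assumption that the experts' evidence is conditionally independent given Q(O). Write probabilities as odds, odds(p) = p/(1-p); each expert then contributes a likelihood-ratio factor measuring how far her estimate departs from the prior:

odds(posterior) = odds(P(Q)) * [odds(A)/odds(P(Q))] * [odds(B)/odds(P(Q))]

With expert A alone, the prior cancels and the posterior is just A. Adding expert B multiplies the odds by odds(B)/odds(P(Q)), which is greater than 1 exactly when B > P(Q), and less than 1 when B < P(Q).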
When expert A reports the probability A, she probably means: "Of all the n objects I've seen that looked like this one, a fraction A of them (nA objects) had property Q."
One approach is to add up the bits of information each expert gives, with positive bits for evidence that Q(O) and negative bits for evidence that not(Q(O)).
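A minimal sketch of that approach in Python (helper names are mine), assuming the experts' evidence is conditionally independent given Q(O), and measuring "bits" as log-odds (base 2) relative to the prior:

```python
import math

def log2_odds(p):
    """Log-odds (base 2) of a probability."""
    return math.log2(p / (1 - p))

def combine(a, b, prior):
    """Sum each expert's bits of evidence relative to the prior.

    Positive bits favor Q(O), negative bits favor not(Q(O)).
    Only valid if the experts' errors are independent.
    """
    total = (log2_odds(prior)
             + (log2_odds(a) - log2_odds(prior))
             + (log2_odds(b) - log2_odds(prior)))
    return 1 / (1 + 2.0 ** -total)  # convert log-odds back to a probability

print(combine(0.7, 0.8, 0.4))  # ~0.93: both experts push well above the prior
print(combine(0.7, 0.8, 0.9))  # ~0.51: below both experts, as argued above
```

Note that this produces exactly A when B = P(Q): an expert who just repeats the prior contributes zero bits, matching the single-expert case above.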
This is what I meant by "extreme": further than warranted.
The subtler point was that the penalty for being extreme, in a decision-making context, depends on your threshold. Suppose you just want to know whether your posterior should be higher than your prior. Then both experts saying A > P(Q) and B > P(Q) means you vote "higher" regardless of your aggregation technique, and if the experts disagree, you go with the one that feels more strongly (if you have no data on which one is more credible).
Again, if the threshold is higher than the prior, but not much higher, it may be that both aggregation techniques give the same result. One benefit of graphing them is that it makes the regions where the techniques disagree obvious: if A says .9 and B says .4 (with a prior of .3), what do the real-world experts think that means? Choose between the methods by focusing on the cases where the choice actually changes the answer (though first-principles arguments about correlation can be useful too).
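Here's a rough sketch of that comparison, assuming (since the alternative technique isn't pinned down here) that the two methods are the log-odds sum above and a simple average of the two opinions; the grid shows where they land on opposite sides of a decision threshold:

```python
import numpy as np

def logit(p):
    """Natural log-odds of a probability."""
    return np.log(p / (1 - p))

def log_odds_combine(a, b, prior):
    """Independent-evidence aggregation: sum log-odds deviations from the prior."""
    total = logit(prior) + (logit(a) - logit(prior)) + (logit(b) - logit(prior))
    return 1 / (1 + np.exp(-total))

prior, threshold = 0.3, 0.8
p = np.linspace(0.01, 0.99, 99)
A, B = np.meshgrid(p, p)

log_odds = log_odds_combine(A, B, prior)
average = (A + B) / 2  # the assumed alternative technique

# Cells where the two techniques give different yes/no answers
disagree = (log_odds > threshold) != (average > threshold)
print(f"methods disagree on {disagree.mean():.0%} of the (A, B) grid")

# The example from the text: A = .9, B = .4, prior = .3
print(log_odds_combine(0.9, 0.4, prior))  # ~0.93
print((0.9 + 0.4) / 2)                    # 0.65 -- opposite sides of the .8 threshold
```

With matplotlib available, plt.imshow(disagree, origin='lower', extent=(0, 1, 0, 1)) makes the disagreement region visible at a glance, and those are exactly the (A, B) pairs worth showing to the real-world experts.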