
buybuydandavis12y00

In mixture-of-experts problems, the experts are not independent; that's the whole problem. They are all trying to correlate to some underlying reality, and thereby are correlated with each other.

But you also say "dozens of different things". Are they trying to estimate the same things, different things, or different things that should all correlate to the same thing?

See my longer comment above for more details, but it's sounding like you don't want to evaluate over the whole data set, you just want to make some assumption about the statistics of your classifiers, and combine them via maximum entropy and those statistics.


A probability question

by PhilGoetz

19th Oct 2012


Suppose you have a property Q which certain objects may or may not have. You've seen many of these objects; you know the prior probability P(Q) that an object has this property.

You have 2 independent measurements of object O, which each assign a probability that Q(O) (O has property Q). Call these two independent probabilities A and B.

What is P(Q(O) | A, B, P(Q))?

To put it another way, expert A has opinion O(A) = A, which asserts P(Q(O)) = A = .7, and expert B says P(Q(O)) = B = .8, and the prior P(Q) = .4, so what is P(Q(O))? The correlation between the opinions of the experts is unknown, but probably small. (They aren't human experts.) I face this problem all the time at work.

You can see that the problem isn't solvable without the prior P(Q), because if the prior P(Q) = .9, then two experts assigning P(Q(O)) < .9 should result in a probability lower than the lowest opinion of those experts. But if P(Q) = .1, then the same estimates by the two experts should result in a probability higher than either of their estimates. But is it solvable or at least well-defined even with the prior?

The experts both know the prior, so if you just had expert A saying P(Q(O)) = .7, the answer must be .7. Expert B's opinion B must revise the probability upwards if B > P(Q), and downwards if B < P(Q).

When expert A says O(A) = A, she probably means, "If I consider all the n objects I've seen that looked like this one, nA of them had property Q."

One approach is to add up the bits of information each expert gives, with positive bits for indications that Q(O) and negative bits that not(Q(O)).


Mentioned in

7A follow-up probability question: Data samples with different priors


[-]DanielLC12y210

It can be done, but it's a lot easier if you use odds ratios, as shown in Share likelihood ratios, not posterior beliefs. That being said, experts will tend to know a lot of the same information, so using this will involve major double counting. Also, they tend to know each other's opinions. This means that you just have to accept the opinion they're guaranteed to share via Aumann's agreement theorem, or more likely, you have to accept that they're not acting rationally, and take their beliefs with a grain of salt.

In your example:

A = 7:3, B = 8:2, P(Q) = 4:6

First, calculate the odds ratio expert B has:

(8:2)/(4:6) = 48:8

= 6:1

Then just multiply that by what expert A had to update on his opinion:

(7:3)(6:1) = 42:3

= 14:1

Thus, there's a 14/15 = 93.3% chance of Q
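The arithmetic above generalizes to any number of experts. A minimal sketch, assuming every expert started from the same prior and that their evidence is independent given Q; the function names are made up for illustration, and exact fractions are used so the result can be compared with the hand calculation:

```python
from fractions import Fraction

def to_odds(p):
    """Convert a probability to odds in favour, p / (1 - p)."""
    p = Fraction(p)
    return p / (1 - p)

def combine_independent(prior, estimates):
    """Posterior probability of Q from a shared prior and expert probabilities.

    Assumption: each expert's evidence is independent of the others' given Q.
    Probabilities of exactly 0 or 1 are not handled (they divide by zero).
    """
    prior_odds = to_odds(prior)
    posterior_odds = prior_odds
    for p in estimates:
        # Each expert's likelihood ratio is their odds with the prior divided out.
        posterior_odds *= to_odds(p) / prior_odds
    return posterior_odds / (1 + posterior_odds)

result = combine_independent(Fraction(4, 10), [Fraction(7, 10), Fraction(8, 10)])
print(result)  # 14/15, about 0.933
```

An expert who simply repeats the prior has a likelihood ratio of 1:1 and leaves the result unchanged.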

[-]Douglas_Knight12y40

It is worth noting that this can be summarized by Phil's own suggestion:

One approach is to add up the bits of information each expert gives, with positive bits for indications that Q(O) and negative bits that not(Q(O)).

That is, you can interpret the log of the odds ratio as the evidence/information that A gives you beyond Q. Adding the evidence from A and B gives your aggregate evidence, which you add to the log odds of the Q prior to get your log odds posterior.
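As a rough illustration of the "bits" reading, here is the same example in base-2 log odds. The helper name `bits` is hypothetical, and independence given Q is still assumed:

```python
import math

def bits(odds):
    """Log base 2 of an odds ratio: evidence measured in bits."""
    return math.log2(odds)

prior_odds = 0.4 / 0.6
evidence_a = bits((0.7 / 0.3) / prior_odds)  # what A adds beyond the prior
evidence_b = bits((0.8 / 0.2) / prior_odds)  # what B adds beyond the prior

# Log odds of the prior plus the evidence from each expert.
posterior_bits = bits(prior_odds) + evidence_a + evidence_b
posterior_odds = 2 ** posterior_bits
posterior = posterior_odds / (1 + posterior_odds)
print(round(evidence_a, 3), round(evidence_b, 3), round(posterior, 3))
```

Expert A contributes about 1.8 bits and expert B about 2.6 bits, on top of a prior of roughly -0.6 bits; the total is log2(14), matching the 14:1 odds above.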

[-]PhilGoetz12y20

Wait. This doesn't work. If the prior is 1:2, and you have n experts also giving estimates of 1:2, you should end up with the answer 1:2. None of the experts are providing information; they're just throwing their hands up and spitting back the prior. Yet this approach multiplies all their odds ratios together, so that the answer changes with n.

ADDED: Oh, wait, sorry. You're saying take the odds ratio they output, factor out the prior for each expert, multiply, and then factor the prior back in. Expanded,

(8/2)/(4/6) * (7/3)/(4/6) * (4/6) = 8064/576 = 14, and 1 / (1 + 1/14) = .933

Great!

[-]DanielLC12y20

You don't have to factor out the priors both times, since putting it back in will cancel one of them out. This is equivalent to considering how one of them will update based on the information the other has.

(8/2)/(4/6) * (7/3) = 336/24, 336/(336+24) = 0.933

[-]endoself12y00

This is correct in the special case where all the information that your experts are basing their conclusions off of is independent. Slightly modifying your example, if the prior is 1:2 and you have n experts giving odds of 1:1, they each have a likelihood ratio of 2:1, so you get 1:2 * (2:1)^n = 2^(n-1):1. However, if they've all updated based on looking at the results from the same experiment, you're double-counting the evidence; intuitively, you actually want to assign an odds ratio of 1:1.

The right thing to do here is to calculate for each subset of the experts what information they share, but you probably don't have that information and so you'd have to estimate it, which I'd have to think a lot about in order to do well. Hopefully, the assumption of independence is approximately true in your data and you can just go with the naive method.

[-]PhilGoetz12y10

THANKS!

[-]buybuydandavis12y70

I think the first order of business is to straighten out the notation, and what is known.

A - measurement from algorithm A on object O
B - measurement from algorithm B on object O
P(Q|I) - The probability you assign to Q based on some unspecified information I.

Use these to assign P(Q | A,B,O,I).

You have 2 independent measurements of object O,

I think that's a very bad word to use here. A,B are not independent, they're different. The trick is coming up with their joint distribution, so that you can evaluate P(Q | A,B,O,I).

The correlation between the opinions of the experts is unknown, but probably small.

If the correlation is small, your detectors suck. I doubt that's really what's happening. The usual situation is that both detectors actually have some correlation to Q, and thereby have some correlation to each other.

We need to identify some assumptions about the accuracy of A and B, and their joint distribution. A and B aren't just numbers, they're probability estimates. They were constructed so that they would be correlated with Q. How do we express P(QAB|O)? What information do we start with in this regard?

For a normal problem, you have some data {O_i} where you can evaluate P(A), your detector, versus Q and get the expectation of Q given A. Same for B.

The maximum entropy solution would proceed assuming that these statistics were the only information you had - or that you no longer had the data, but only had some subset of expectations evaluated in this fashion. I think Jaynes found the maximum entropy solution for two measurements which correlate to the same signal. I don't think he did it in a mixture of experts context, although the solution should be about the same.

If instead you have all the data, the problem is equally straightforward. Evaluate the expectation of Q given A,B across your data set, and apply on new data. Done. Yes, there's a regularization issue, but it's a 2-d -> 1-d supervised classification problem. If you're training A and B as well, do that in combination with this 2-d -> 1-d problem as a stacked generalization problem, to avoid overfitting.

The issue is exactly what data are you working from. Can you evaluate A and B across all data, or do you just have statistics (or assumptions expressed as statistics) on A and B across the data?

[-]pragmatist12y50

If the correlation is small, your detectors suck. I doubt that's really what's happening. The usual situation is that both detectors actually have some correlation to Q, and thereby have some correlation to each other.

The way I interpreted the claim of independence is that the verdicts of the experts are not correlated once you conditionalize on Q. If that is the case, then DanielLC's procedure gives the correct answer.

To see this more explicitly, suppose that expert A's verdict is based on evidence Ea and expert B's verdict is based on evidence Eb. The independence assumption is that P(Ea & Eb|Q) = P(Ea|Q) * P(Eb|Q).

Since we know the posteriors P(Q|Ea) and P(Q|Eb), and we know the prior of Q, we can calculate the likelihood ratios for Ea and Eb. The independence assumption allows us to multiply these likelihood ratios together to obtain a likelihood ratio for the combined evidence Ea & Eb. We then multiply this likelihood ratio with the prior odds to obtain the correct posterior odds.
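Written out, the steps in this comment look roughly as follows. One assumption is made explicit here that the comment leaves implicit: the factorization is needed conditional on not-Q as well as on Q.

```latex
\begin{align*}
\frac{P(Q \mid E_a, E_b)}{P(\neg Q \mid E_a, E_b)}
  &= \frac{P(Q)}{P(\neg Q)} \cdot \frac{P(E_a, E_b \mid Q)}{P(E_a, E_b \mid \neg Q)} \\
  &= \frac{P(Q)}{P(\neg Q)} \cdot \frac{P(E_a \mid Q)}{P(E_a \mid \neg Q)}
     \cdot \frac{P(E_b \mid Q)}{P(E_b \mid \neg Q)}, \\
\text{where}\quad
\frac{P(E_a \mid Q)}{P(E_a \mid \neg Q)}
  &= \frac{P(Q \mid E_a) / P(\neg Q \mid E_a)}{P(Q) / P(\neg Q)}
  \quad\text{(and likewise for } E_b\text{).}
\end{align*}
```

With the numbers from the post, the prior odds are 2/3 and the two likelihood ratios are 7/2 and 6, so the posterior odds are (2/3)(7/2)(6) = 14, the same 14:1 that DanielLC obtains.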

[-]buybuydandavis12y00

To see this more explicitly, suppose that expert A's verdict is based on evidence Ea and expert B's verdict is based on evidence Eb. The independence assumption is that P(Ea & Eb|Q) = P(Ea|Q) * P(Eb|Q).

You can write that, and it's likely possible in some cases, but step back and think, Does this really make sense to say in the general case?

I just don't think so. The whole problem with mixture of experts, or combining multiple data sources, is that the marginals are not in general independent.

[-]pragmatist12y40

Sure, it's not generically true, but PhilGoetz is thinking about a specific application in which he claims that it is justified to regard the expert estimates as independent (conditional on Q, of course). I don't know enough about the relevant domain to assess his claim, but I'm willing to take him at his word.

I was just responding to your claim that the detectors must suck if the correlation is small. That would be true if the unconditional correlation were small, but it's not true if the correlation is small conditional on Q.

[-]wnoise12y40

The usual situation is that both detectors actually have some correlation to Q, and thereby have some correlation to each other.

This need not be the case. Consider a random variable Z that is the sum of two independent random variables X and Y. Expert A knows X, and is thus correlated with Z. Expert B knows Y and is thus correlated with Z. Expert A and B can still be uncorrelated. In fact, you can make X and Y slightly anticorrelated, and still have them both be positively correlated with Z.
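A quick simulation can make this concrete. This is only a sketch: the Gaussian variables, the sample size, the seed, and the 0.2 coefficient used to induce anticorrelation are arbitrary choices, not anything from the thread.

```python
import random

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(20000)]
ys = [rng.gauss(0, 1) for _ in range(20000)]
zs = [x + y for x, y in zip(xs, ys)]

# Independent X and Y: each correlates with Z, but not with the other.
corr_xy = correlation(xs, ys)
corr_xz = correlation(xs, zs)

# Slightly anticorrelated variant: subtract a little of X from Y.
ys_anti = [y - 0.2 * x for x, y in zip(xs, ys)]
zs_anti = [x + y for x, y in zip(xs, ys_anti)]
corr_xy_anti = correlation(xs, ys_anti)
corr_xz_anti = correlation(xs, zs_anti)
corr_yz_anti = correlation(ys_anti, zs_anti)

print(corr_xy, corr_xz, corr_xy_anti, corr_xz_anti, corr_yz_anti)
```

In the first case the correlation between X and Y should be near zero while X and Z correlate at about 0.7. In the second case X and Y are negatively correlated, yet both remain positively correlated with Z.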

[-]buybuydandavis12y00

Just consider the limiting case - both are perfect predictors of Q, with value 1 for Q, and value 0 for not Q. And therefore, perfectly correlated.

Consider small deviations from those perfect predictors. The correlation would still be large. Sometimes more, sometimes less, depending on the details of both predictors. Sometimes they will be more correlated with each other than with Q, sometimes more correlated with Q than each other. The degree of correlation of A and B with Q will impose limits on the degree of correlation between A and B.

And of course, correlation isn't really the issue here anyway, much more like mutual information, with the same sort of triangle inequality limits to the mutual information.

If someone is feeling energetic and really wants to work this out, I'd recommend looking into triangle inequalities for mutual information measures, and the previously mentioned work by Jaynes on the maximum entropy estimate of a variable from its known correlation with two other variables, and how that constrains the maximum entropy estimate of the correlation between the other two.

[-]Steve_Rayhawk12y40

If you have a lot of experts and a lot of objects, I might try a generative model where each object had unseen values from an n-dimensional feature space, and where experts decided what features to notice using weightings from a dual n-dimensional space, with the weight covectors generated as clustered in some way to represent the experts' structured non-independence. The experts' probability estimates would be something like a logistic function of the product of each object's features with the expert's weights (plus noise), and your output summary probability would be the posterior mean of an estimate based on a special "best" expert weighting, derived using the assumption that the experts' estimates are well-calibrated.

I'm not sure what an appropriate generative model of clustered expert feature weightings would be.

Actually, I guess the output of this procedure would just end up being a log-linear model of the truth given the experts' confidences. (Some of the coefficients might be negative, to cancel confounding factors.) So maybe a lot easier way to fix this is to sample from the space of such log-linear models directly, using sampled hypothetical imputed truths, while enforcing some constraint that the experts' opinions be reasonably well-calibrated.

You have 2 independent measurements

I ignored this because I wasn't sure what you could have meant by "independent". If you meant that the experts' outputs are fully independent, conditional on the truth, then the problem is straightforward. But this seems unlikely in practice. You probably just meant the informal English connotation "not completely dependent".

[-]Vaniver12y10

If you don't get any information after the fact on whether O was Q or not, there's not one right way to do it. JRMayne's recommendation of averaging the expert judgments works, as does DanielLC's recommendation of assuming that the experts are entirely uncorrelated. The trouble with assuming they're uncorrelated is that it can give you pretty extreme probability estimates, but if you're just making decisions based on some middling threshold ("call it a Q if P(Q)>.5") then you don't have to worry about extreme probability estimates! If you make decisions based on an extreme threshold ("call it a Q if P(Q)>.99"), then you have to worry. One thing that might help is plotting what these formulas produce across A,B space, and seeing whether that graph looks like what you, or experts in this domain, would expect.

If you do get information after the fact, you'll want to use what's called a Bayesian Judge. Basically, it learns P(Q(O)|A,B,P(Q)) through Bayesian updates; you're building an expert that says "if I consider all of the n times A said a and B said b, nP times it turned out to be Q, so P(Q)=P."
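The thread never spells out the math for such a judge, so the following is only one possible reading, not Vaniver's method. It bins the two experts' outputs into a grid and keeps a count per cell; the class name, the bin count, and the pseudo-count prior (`prior_strength`) are all assumptions made for this sketch.

```python
from collections import defaultdict

class BayesianJudge:
    """Learns P(Q | A, B) from labelled outcomes by binning the experts' outputs."""

    def __init__(self, prior, bins=10, prior_strength=2.0):
        self.prior = prior                    # P(Q) before seeing any outcomes
        self.bins = bins                      # grid resolution per expert (assumed)
        self.prior_strength = prior_strength  # pseudo-observations backing the prior (assumed)
        self.counts = defaultdict(lambda: [0, 0])  # cell -> [times Q, times seen]

    def _cell(self, a, b):
        to_bin = lambda p: min(int(p * self.bins), self.bins - 1)
        return (to_bin(a), to_bin(b))

    def update(self, a, b, was_q):
        """Record one case where A said a, B said b, and the truth was was_q."""
        cell = self.counts[self._cell(a, b)]
        cell[0] += int(was_q)
        cell[1] += 1

    def predict(self, a, b):
        """Posterior mean for the cell: observed hits blended with the prior."""
        hits, seen = self.counts[self._cell(a, b)]
        return (hits + self.prior_strength * self.prior) / (seen + self.prior_strength)

judge = BayesianJudge(prior=0.4)
for outcome in [True] * 7 + [False] * 3:
    judge.update(0.92, 0.91, outcome)

print(judge.predict(0.92, 0.91))  # pulled toward the observed 7/10 rather than the stated ~.9
print(judge.predict(0.15, 0.15))  # no data in this cell, so it falls back to the prior
```

A cell with no labelled outcomes returns the prior, which is the "starving" behaviour described in the next paragraphs.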

The other neat thing about Bayesian judges is that they fix calibration problems with experts- it will quickly learn that when they say .9, they actually mean .7.

The trouble with the Bayesian judge is that it will starve if you can't feed it data on whether or not O was Q. I won't type up the necessary math unless this fits your situation, but if it does I'd be happy to.

[-]DanielLC12y20

The trouble with assuming they're uncorrelated is that it can give you pretty extreme probability estimates

No. The trouble with assuming they're uncorrelated is that they probably aren't. If they were, the extreme probability estimates would be warranted.

I suppose more accurately, the problem is that if there is a significant correlation, assuming they're uncorrelated will give an equally significant error, and they're usually significantly correlated.

[-]Vaniver12y00

No. The trouble with assuming they're uncorrelated is that they probably aren't. If they were, the extreme probability estimates would be warranted.

This is what I meant by extreme- further than warranted.

The subtler point was that the penalty for being extreme, in a decision-making context, depends on your threshold. Suppose you just want to know whether or not your posterior should be higher than your prior. Then, the experts saying "A>P(Q)" and "B>P(Q)" means that you vote "higher," regardless of your aggregation technique, and if the experts disagree, you go with the one that feels more strongly (if you have no data on which one is more credible).

Again, if the threshold is higher, but not significantly higher, it may be that both aggregation techniques give the same results. One of the benefits of graphing them is that it will make the regions where the techniques disagree obvious- if A says .9 and B says .4 (with a prior of .3), then what do the real-world experts think this means? Choosing between the methods should be done by focusing on the differences caused by that choice (though first-principles arguments about correlation can be useful too).

[-]Antisuji12y10

It seems like there are a few missing details. How do the experts arrive at their opinions? Presumably they have data about O (maybe different from each other) and are updating on that data and reporting the result. So what they're really telling you is their odds ratio given the data. Another concern that you need to take into account is how much of a track record the experts have. Maybe A is more experienced than B (or runs a more sophisticated algorithm). Maybe A's data about O is more significant than B's data about O.

[-]PhilGoetz12y00

I have no additional information. This is the general case that I need to solve. This is the information that I have, and I need to make a decision.

(The real-world problem is that I have a zillion classifiers, that give probability estimates for dozens of different things, and I have to combine their outputs for each of these dozens of things. I don't have time to look inside any of them and ask for more details. I need a function that takes as an argument one prior and N estimates, assumes the estimates are independent, and produces an output. I usually can't find their correlations due to the training data not being available or other problems, and anyway I don't have time to write the code to do that, and they're usually probably small correlations.)

[-]DanielLC12y20

Are you dealing with things where it's likely to be independent? If you're looking at studies, they probably will be. If you're looking at experts, they probably won't.

[-]Douglas_Knight12y00

There are unsupervised methods, if you have unlabeled data, which I suspect you do. I don't know about standard methods, but here are a few simple ideas off the top of my head:

First, you can check if A is consistent with the prior by seeing that average probability it predicts over your data is your prior for Q. If not, there are a lot of possible failure modes, such as your new data being different from the data used to set your prior, or A being wrong or miscalibrated. If I trusted the prior a lot and wanted to fix the problem, I would scale the evidence (the odds ratio of A from the prior) by a constant.

You can apply the same test to the joint prediction. If A and B each produce the right frequency, but their joint prediction does not, then they are correlated. It is probably worth doing this, as a check on your assumption of independence. You might try to correct for this correlation by scaling the joint evidence, the same way I suggested scaling a single test. (Note that if A=B, scaling is the correct answer.)
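A sketch of these two checks, under stated assumptions: the predictions below are made-up numbers, the function names are hypothetical, and "scaling the evidence" is read as multiplying the log odds relative to the prior by a constant. The example uses the A=B case mentioned above, where a scale of 1/2 exactly undoes the double counting.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def combine(prior, estimates, scale=1.0):
    """Add each estimate's evidence (log odds relative to the prior), times a scale."""
    evidence = sum(logit(p) - logit(prior) for p in estimates)
    return sigmoid(logit(prior) + scale * evidence)

def calibration_gap(predictions, prior):
    """Mean predicted probability minus the prior; near zero if consistent."""
    return sum(predictions) / len(predictions) - prior

prior = 0.4
a_preds = [0.1, 0.2, 0.3, 0.7, 0.7]  # made-up outputs of test A, averaging to the prior
b_preds = list(a_preds)              # test B is a copy of A: the fully correlated case

gap_a = calibration_gap(a_preds, prior)
naive_joint = [combine(prior, [a, b]) for a, b in zip(a_preds, b_preds)]
scaled_joint = [combine(prior, [a, b], scale=0.5) for a, b in zip(a_preds, b_preds)]
gap_naive = calibration_gap(naive_joint, prior)
gap_scaled = calibration_gap(scaled_joint, prior)

print(gap_a, gap_naive, gap_scaled)
```

Here A alone is consistent with the prior, the naive joint prediction drifts away from it, and the scaled joint prediction is consistent again. Note that a scale of 0 always matches the prior trivially, so the mean alone does not pin down the right constant.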

But if you have many tests and you correct each pair, it is no longer clear how to combine all of them. One simple answer is to drop tests in highly correlated pairs and assume everything else is independent. To salvage some information rather than dropping tests, you might cluster tests into correlated groups, use scaling to correct within clusters and assume the clusters are independent.

[-]buybuydandavis12y00

In mixture-of-experts problems, the experts are not independent; that's the whole problem. They are all trying to correlate to some underlying reality, and thereby are correlated with each other.

But you also say "dozens of different things". Are they trying to estimate the same things, different things, or different things that should all correlate to the same thing?

[-]Irgy12y00

There simply is no right answer to your question in general. The problem is that most of the time you simply have no way of knowing whether the experts' opinions are independent or not. The thing to realise is that even if the experts don't talk to each other and have entirely separate evidence, the underlying reality can still create a dependence. Vaniver's comment actually says it pretty well already, but just to hammer it home let me give you a specific example.

Imagine the underlying process is this: A coin is flipped 6 times, and each time either a 1 or a 0 is written on a side of a 6-sided die. Then the die is rolled, and you're interested in whether it rolled a 1. Obviously your prior is 0.5. Now imagine there are 3 experts who all give a 2/3 chance that a 1 was rolled.

Situation 1: Each expert has made a noisy observation of the actual die roll. Maybe they took a photo, but the photos are blurry and noisy. In this case, the evidence from each of the three separate photos is independent, and the odds combine like DanielLC describes to give an 8/9 chance that a 1 was rolled. With more experts saying the same thing, the probability converges to 1 here. Of course if they all had seen copies of the same photo it would be a different story...

Situation 2: No-one has seen the roll of interest itself, but each of the experts has seen the result of many other rolls of the same die (different other rolls for each expert). In this case, it's clear that all you have is strong evidence that there are four 1s and two 0s on the die, and the probability stays at 2/3. Note that the experts haven't spoken to each other, nor have they seen the same evidence, they're correlated by an underlying property of the system.

Situation 3: This one is a little strained I'll admit, but it's important to illustrate that the odds can actually be less than your prior even though the experts are individually giving chances that are higher than it. Imagine it's common knowledge among experts that there are five 1s on the die (maybe they've all seen hundreds of other rolls, though still different rolls from each other). However, each of them also has a photo of the actual roll, and again the photos are not completely clear but in each case it sure does look a little more like a 0 than a 1. In this case, the probability from their combined knowledge is actually 8/33! Ironically I used DanielLC's method again for this calculation.
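The 8/9 and 8/33 figures can be checked with exact fractions. In this sketch the helper name `posterior` is made up; the only thing that changes between the two calls is the prior the experts are assumed to share (1/2 in Situation 1, 5/6 in Situation 3), while their reports stay at 2/3.

```python
from fractions import Fraction

def posterior(shared_prior, estimates):
    """Combine reports whose evidence is independent given the outcome.

    shared_prior is what every expert believed before their private evidence.
    """
    prior_odds = shared_prior / (1 - shared_prior)
    odds = prior_odds
    for p in estimates:
        # Divide the shared prior out of each report to get its likelihood ratio.
        odds *= (p / (1 - p)) / prior_odds
    return odds / (1 + odds)

reports = [Fraction(2, 3)] * 3

situation_1 = posterior(Fraction(1, 2), reports)  # photos only, prior 1/2
situation_3 = posterior(Fraction(5, 6), reports)  # photos plus a known five-1s die

print(situation_1, situation_3)  # 8/9 8/33
```

Situation 2 is not covered by this function: there the experts' evidence is about the die rather than the roll, so their reports do not multiply and the answer stays at 2/3.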

The point of this then is that the answer to what their combined information would tell you could actually be literally anything at all. You just don't know, and the fact that the experts don't talk to each other and see separate evidence is not enough to assume they're uncorrelated. Of course there has to be a correct answer to what probability to assign in the case of ignorance on the level of correlation between the experts, and I'm actually not sure exactly what it is. Whatever it is though there's a good chance of it still turning out to be consistently under- or overconfident across multiple similar trials (assuming you're doing such a thing). If this is a machine learning situation for instance (which it sounds like) I would strongly advise you simply make some observations of exactly how the probabilities of the experts correlate. I can give you some more detailed advice on how to go about doing that correctly if you wish.

Personally I would by default go with averaging the experts as my best guess. Averaged in log-odds space though of course (= log(p/(1-p))), not averaging the 0 to 1 probabilities. DanielLC's advice is theoretically well founded but the assumption of statistically independent evidence is, as I say, usually unwarranted. I would expect his method to generally give overconfident probabilities in practice.
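To see how far apart the two defaults are on the post's numbers, here is a small comparison. The function names are hypothetical; the averaging version ignores the prior entirely, on the view that the experts have already factored it in.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def average_log_odds(estimates):
    """Average the experts in log-odds space (treats them as largely redundant)."""
    return sigmoid(sum(logit(p) for p in estimates) / len(estimates))

def add_evidence(prior, estimates):
    """Add each expert's evidence beyond the prior (treats them as independent given Q)."""
    evidence = sum(logit(p) - logit(prior) for p in estimates)
    return sigmoid(logit(prior) + evidence)

averaged = average_log_odds([0.7, 0.8])
added = add_evidence(0.4, [0.7, 0.8])
print(round(averaged, 3), round(added, 3))
```

Averaging lands between the two experts, at roughly .75, while adding evidence gives roughly .93. The gap between those two numbers is the cost of guessing wrong about how correlated the experts are.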

[-]PhilGoetz12y00

I'm assuming their opinions are independent, usually because they're trained on different features that have low correlations with each other. I was thinking of adding in log-odds space, as a way of adding up bits of information, and this turns out to be the same as using DanielLC's method. Averaging instead seems reasonable if correlations are high.

[-]Irgy12y00

Yes, but the key point I was trying to make is that using different features with low correlations does not at all ensure that adding the evidence is correct. What matters is not correlations between the features, but correlations between the experts. Correlated features will of course mean correlated experts, but the converse is not true. The features don't have to be correlated for the experts to make mistakes on the same inputs. It's often the case that they do simply because some inputs are fundamentally more difficult than others, in ways that affect all of the features.

If you've observed that there's low correlations between the experts, then you've effectively already followed my main suggestion: "I would strongly advise you simply make some observations of exactly how the probabilities of the experts correlate". If you've only observed low correlations between features then I'd say it's quite likely you're going to generate overconfident results.

PS Much as I don't like "appeal to authority", I do think it's worth pointing out that I deal with exactly this problem at work, so I'm not just talking out of my behind here. Obviously it's hard to know how well experience in my field correlates with yours without knowing what your field is, but I'd expect these issues to be general.

[-][anonymous]12y00

Clemen and Winkler (1990) discuss parts of this problem.

Abstract:

When two forecasters agree regarding the probability of an uncertain event, should a decision maker adopt that probability as his or her own? A decision maker who does so is said to act in accord with the unanimity principle. We examine a variety of Bayesian consensus models with respect to their conformance (or lack thereof) to the unanimity principle and a more general compromise principle. In an analysis of a large set of probability forecast data from meteorology, we show how well the various models, when fit to the data, reflect the empirical pattern of conformance to these principles.

See also Clemen and Winkler (1999) for the more general problem of combining probability distributions rather than probabilities.

[This comment is no longer endorsed by its author]

[-]JRMayne12y00

I think I misunderstand the question, or I don't get the assumptions, or I've gone terribly wrong.

Let me see if I've got the problem right to begin with. (I might not.)

40% of baseball players hit over 10 home runs a season. (I am making this up.)

Joe is a baseball player.

Baseball projector Mayne says Joe has a 70% chance of hitting more than 10 home runs next season. Baseball projector Szymborski says Joe has an 80% chance of hitting more than 10 home runs next season. Both Mayne and Szymborski are aware of the usual rate of baseball players hitting more than 10 home runs.

Is this the problem?

Because if it is, the use of the prior is wrong. If the experts know the prior, and we believe the experts, the prior's irrelevant - our odds are 75%.

There are a lot of these situations in which regression to the mean, use of averages in determinations, and other factors are needed. But in this situation, if we assume reasonable experts who are aware of the general rules, and we value those experts' opinions highly enough, we should just ignore the prior - the experts have already factored that in. When Nate Silver gives you the odds that Barack Obama wins the election, you shouldn't be factoring in P(Incumbent wins) or anything else - the cake is prebaked with that information.

Since this rejects a strong claim in the post, it's possible I'm very seriously misreading the problem. Caveat emptor.

[-]PhilGoetz12y20

You're reading it correctly, but I disagree with your conclusion. If Mayne says p=.7, and Szymborski says p=.8, and their estimates are independent - remember, my classifiers are not human experts, they are not correlated - then the final result must be greater than .8. You already thought p=.8 after hearing Szymborski. Mayne's additional opinion says Joe is more-likely than average to hit more than 10 home runs, and is based on completely different information than Szymborski's, so it should make Joe's chances increase, not decrease.

[-]ShardPhoenix12y00

remember, my classifiers are not human experts, they are not correlated

Is that necessarily true? It seems that it should depend on whether they have underlying similarities (eg a similar systematic bias) in their algorithms.
