Coscott comments on Alternative to Bayesian Score - Less Wrong

6 Post author: Coscott 27 July 2013 07:26PM


Comment author: DanielLC 27 July 2013 08:26:07PM 0 points [-]

The problem with the squared error score is that it just rewards asking a ton of obvious questions. I predict with 100% probability that the sky will be blue one second from now. Just keep repeating for a high score.

Comment author: Coscott 27 July 2013 08:31:31PM 4 points [-]

Both methods fail miserably if you get to choose what questions are asked. Bayesian score rewards never asking any questions ever. Or, if you normalize it to assign 1 to true certainty and 0 to 50/50, then it rewards asking obvious questions also.

If it helps, you can think of the squared error score as -(1-x)^2 instead of 1-(1-x)^2; that fixes this problem.
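The difference is easy to see numerically; a minimal sketch (the function names are mine):

```python
# Two versions of the squared-error (Brier-style) score for assigning
# probability x to a statement that turns out true:
#   positive form: 1 - (1 - x)**2   (easy questions add free points)
#   shifted form:   -(1 - x)**2     (a perfect answer is worth 0, so
#                                    easy questions add nothing)

def brier_positive(x):
    return 1 - (1 - x) ** 2

def brier_shifted(x):
    return -(1 - x) ** 2

# "The sky will be blue one second from now," answered with certainty:
print(brier_positive(1.0))  # 1.0 -- repeating this inflates the total
print(brier_shifted(1.0))   # 0.0 -- repeating it gains nothing
```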

Comment author: gwern 27 July 2013 09:30:44PM 1 point [-]

Both methods fail miserably if you get to choose what questions are asked. Bayesian score rewards never asking any questions ever. Or, if you normalize it to assign 1 to true certainty and 0 to 50/50, then it rewards asking obvious questions also.

Only because you are baking in an implicit loss function that all questions are equally valuable; switch to some other loss function which weights the value of more interesting or harder questions more, and this problem disappears as 'the sky is blue' ceases to be worth anything compared to a real prediction like 'Obama will be re-elected'.

Comment author: Coscott 27 July 2013 09:54:37PM 2 points [-]

I don't understand what your suggestion has to do with what I said.

Yes, of course you can assign different values to different statements, and I mentioned this. However, my point here is that if you allow the option of simply not answering a question (whatever that means), then there has to be some utility associated with not answering. The comment I was responding to argued that the Bayesian score was better than Brier because Brier gives positive utilities instead of negative ones, and so can be cheated by asking lots of easy questions.

Your response is about scaling the utility of each question by that question's importance. That is a valid point, and I mentioned it when I said "(possibly weighted) average score." But I don't see how it addresses the problems that arise when you can choose which questions are asked.

Comment author: gwern 27 July 2013 10:53:55PM *  1 point [-]

That is a very valid point, but I don't see how it has anything to do with the problems associated with being able to choose what questions are asked.

I don't understand your problem here. If questions' values are scaled appropriately, or some fancier approach is used, then it doesn't matter if respondents pick and choose because they will either be wasting their time or missing out on large potential gains. A loss function style approach seems to adequately resolve this problem.

Comment author: Coscott 27 July 2013 11:17:09PM 0 points [-]

I think this is probably bad communication on my part.

The model I am imagining from you is that there is some countable collection of statements you want to assign true/false to. You assign some weight function to the statements so that the total weight of all statements is some finite number, and your score is the sum of the weights of all statements which you choose to answer.

For this, it really matters not only how the values are scaled, but also how they are translated. It matters where the 0 utility point for each question is, because that determines whether or not you want to answer that question. I think the 0 utility point should be placed at the utility of the 50/50 probability assignment for each question. In that case, not answering a question is equivalent to answering it with 50/50 probability, so it would be simpler to just say that you have to answer every question, with 50/50 as the default answer, in which case the 0 points don't matter anymore. This is just semantics.

But just saying that you scale each question by its importance doesn't fix the problem: if you model this as choosing which questions to answer, with your utility being the sum of your utilities on the individual questions, then the Bayesian rule as written encourages answering no questions at all, since it can only give you negative utility. You have to fix that either by setting the 0 points of your utilities in some reasonable way, or by requiring that you are assigned utility for every question, with a default answer if you don't think about it at all.
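A small sketch of the incentive problem, using the normalization from upthread (1 at certainty, 0 at 50/50; the function names are mine):

```python
import math

def log_score(x, true):
    # Raw (Bayesian) score: log-probability assigned to the actual outcome.
    # This is always <= 0, so if "not answering" is worth 0, declining to
    # answer weakly dominates every possible answer.
    return math.log(x if true else 1 - x)

def normalized_log_score(x, true):
    # Shifted and scaled so a 50/50 answer scores 0 and justified
    # certainty scores 1: f(x) = 1 + log2(x).
    # Now skipping a question is equivalent to answering 50/50.
    return 1 + math.log2(x if true else 1 - x)

print(log_score(0.9, True))             # negative even for a good answer
print(normalized_log_score(0.5, True))  # 0.0: the default answer costs nothing
```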

There are benefits to weighting the questions, because that allows us to take infinite sums, but if we assume for now that there are only finitely many questions, all with rational weights, then weighting the questions is similar to just asking the same question multiple times (in proportion to its weight). This may be more accurate for what we want in epistemic rationality, but it doesn't actually solve the problems associated with allowing people to pick and choose questions.

Comment author: gwern 04 August 2013 10:43:55PM 0 points [-]

The model I am imagining from you is that there is some countable collection of statements you want to assign true/false to. You assign some weight function to the statements so that the total weight of all statements is some finite number, and your score is the sum of the weights of all statements which you choose to answer.

Hm, no, I wasn't really thinking that way. I don't want some finite number, I want everyone to reach different numbers so more accurate predictors score higher.

The weights on particular questions do not even have to be set algorithmically - for example, a prediction market is immune to the 'sky is blue' problem because if one were to start a contract for 'the sky is blue tomorrow', no one would trade on it unless one were willing to lose money being a market-maker as the other trader bid it up to the meteorologically-accurate 80% or whatever. One can pick and choose as much as one pleases, but unless one's contracts are valuable to other people for some reason, it is impossible to make money by stuffing the market with bogus contracts. The utility just becomes how much money you made.

I think that the 0 utility point should be put at the utility of the 50/50 probability assignment for each question.

I think this doesn't work because you're trying to invent a non-informative prior, and it's trivial to set up sets of predictions where the obviously better non-informative prior is not 1/2: for example, set up 3 predictions for each of 3 mutually-exhaustive outcomes, where the non-informative prior obviously looks more like 1/3 and 1/2 means someone is getting robbed. More importantly, uninformative priors are disputed and it's not clear what they are in more complex situations. (Frequentist Larry Wasserman goes so far as to call them "lost causes" and "perpetual motion machines".)
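The three-outcome example can be checked numerically under the normalized log score from upthread (0 at 50/50, 1 at certainty; names are mine):

```python
import math

def norm_score(x, true):
    # Normalized log score: 0 for a 50/50 answer, 1 for justified certainty.
    return 1 + math.log2(x if true else 1 - x)

# Three mutually exclusive, exhaustive statements: exactly one is true.
outcomes = [True, False, False]

def total(p):
    # Assign probability p to each of the three statements.
    return sum(norm_score(p, t) for t in outcomes)

print(total(1/2))  # 0.0    -- the supposed "zero utility" point
print(total(1/3))  # ~0.245 -- the sensible uniform prior beats it
```

So a respondent who knows nothing beyond the structure of the question already scores above the 1/2 baseline, which is the sense in which someone is "getting robbed."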

But just saying that you scale each question by its importance doesn't fix the fact that if you model this as you can choose to answer questions if you want and your utility is the sum of your utilities for the individual questions encourages not answering any questions under the Bayesian rule as written, since it can only give you negative utility. You have to fix that by either fixing 0 points for your utilities in some reasonable way or just requiring that you are assigned utility for every question, and there is a default answer if you don't think about it at all.

Perhaps a raw log odds is not the best idea, but do you really think there is no way to interpret them into some score which disincentivizes strategic predicting? This sounds just arrogant to me, and I would only believe it if you summarized all the existing research into rewarding experts and showed that log odds simply could not be used in any circumstance where any predictor could predict a subset of the specified predictions.

but if we assume for now that there are only finitely many questions, and all questions have rational weights, then weighing the questions is similar to just asking the same questions multiple times (proportional to its weight).

There aren't finitely many questions, because one can ask questions involving each of the infinite set of integers... Knowing whether two questions are identical sounds like an impossible demand to meet (for example, if any system claimed this, it could solve the Halting Problem simply by being asked to predict the outputs of two Turing machines).

Comment author: Decius 28 July 2013 04:19:51AM 0 points [-]

If you normalize Bayesian score to assign 1 to 100% and 0 to 50% (and -1 to 0%), you encounter a math error.

Comment author: Coscott 28 July 2013 07:04:31AM 2 points [-]

I didn't do that. I only set 1 to 100% and 0 to 50%. 0% is still negative infinity.

Comment author: Decius 29 July 2013 03:35:59AM 2 points [-]

That's the math error.

Why is it consistent that assigning a probability of 99% to one half of a binary proposition that turns out false is much better than assigning a probability of 1% to the opposite half that turns out true?

Comment author: mcoram 11 February 2014 11:53:54PM *  0 points [-]

There's no math error.

Why is it consistent that assigning a probability of 99% to one half of a binary proposition that turns out false is much better than assigning a probability of 1% to the opposite half that turns out true?

I think there's some confusion. Coscott said these three facts:

Let f(x) be the output if the question is true, and let g(x) be the output if the question is false.

f(x)=g(1-x)

f(x)=log(x)

In consequence, g(x)=log(1-x). So if x=0.99 and the question is false, the output is g(x)=log(1-x)=log(0.01). Or if x=0.01 and the question is true, the output is f(x)=log(x)=log(0.01). So the symmetry that you desire is true.
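A quick numerical check of this symmetry:

```python
import math

def f(x):  # score when the statement is true
    return math.log(x)

def g(x):  # score when the statement is false
    return math.log(1 - x)

# The symmetry f(x) = g(1 - x): predicting 99% on a false statement is
# scored the same as predicting 1% on its (true) negation.
print(f(0.01))  # log(0.01)
print(g(0.99))  # log(0.01), up to floating-point rounding
```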

Comment author: Decius 12 February 2014 04:55:30AM 0 points [-]

But that doesn't output 1 for estimates of 100%, 0 for estimates of 50%, and -inf (or even -1) for estimates of 0%, or even something that can be normalized to either of those triples.

Comment author: mcoram 12 February 2014 05:03:48AM 0 points [-]

Here's the "normalized" version: f(x)=1+log2(x), g(x)=1+log2(1-x) (i.e. scale f and g by 1/log(2) and add 1).

Now f(1)=1, f(.5)=0, f(0)=-Inf ; g(1)=-Inf, g(.5)=0, g(0)=1.

Ok?
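Spelled out in code (note that `math.log2(0)` raises an error in Python, so the 0 case is handled explicitly; names are mine):

```python
import math

def f(x):  # score if the statement is true
    return 1 + math.log2(x) if x > 0 else -math.inf

def g(x):  # score if the statement is false
    return f(1 - x)  # the same symmetry as before, g(x) = f(1 - x)

for x in (1.0, 0.5, 0.0):
    print(x, f(x), g(x))
# 1.0  1.0  -inf
# 0.5  0.0  0.0
# 0.0  -inf 1.0
```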

Comment author: Decius 13 February 2014 12:56:20AM 0 points [-]

Huh. I thought that wasn't a Bayesian score (i.e., not maximized by estimating correctly), but doing the math, the maximum is at the right point for 1/4, 1/100, 3/4, 99/100, and 1/2.
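That check can be run numerically: under the normalized score above, the expected score when the true probability is p is maximized by reporting p itself (a sketch over a coarse grid; names are mine):

```python
import math

def f(x):  # score if true
    return 1 + math.log2(x) if x > 0 else -math.inf

def g(x):  # score if false
    return 1 + math.log2(1 - x) if x < 1 else -math.inf

def expected(p, x):
    # Expected score when the true probability is p and you report x.
    return p * f(x) + (1 - p) * g(x)

grid = [i / 1000 for i in range(1, 1000)]
for p in (1/4, 1/100, 3/4, 99/100, 1/2):
    best = max(grid, key=lambda x: expected(p, x))
    print(p, best)  # the best report equals the true probability
```

This is the defining property of a (strictly) proper scoring rule, and it survives the affine normalization because shifting and scaling a score by constants doesn't move its maximum.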