I am starting to wonder whether or not Bayes Score is what I want to maximize for my epistemic rationality. I started thinking about this while trying to design a board game to teach calibration of probabilities, so I will use that as my example:

I wanted a scoring mechanism which motivates honest reporting of probabilities and rewards players who are better calibrated. For simplicity, let's assume that we only have to deal with true/false questions for now. A player is given a question which they believe is true with probability p. They then name a real number x between 0 and 1, and receive a score which is a function of x and of whether or not the question is true. We want the expected score to be maximized exactly when x=p. Let f(x) be the output if the question is true, and let g(x) be the output if the question is false. Then the expected score is p*f(x)+(1-p)*g(x). If we assume f and g are smooth, then in order to have a maximum at x=p, we need p*f'(p)+(1-p)*g'(p)=0, which still leaves us with a large class of functions. It would also be nice to have symmetry by requiring f(x)=g(1-x). If we add this requirement, we get p*f'(p)+(1-p)*(-1)*f'(1-p)=0, or equivalently p*f'(p)=(1-p)*f'(1-p). One way to achieve this is to make x*f'(x) a constant. Then f'(x)=c/x, so f(x)=log x. This scoring mechanism is referred to as the "Bayesian Score".

However, another natural way to achieve this is to make f'(x)/(1-x) a constant. If we set this constant equal to 2, we get f'(x)=2-2x, which gives us f(x)=2x-x^2=1-(1-x)^2. I will call this the "Squared Error Score."

There are many other functions which satisfy the desired conditions, but these two are the simplest, so I will focus on these two.
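To make these concrete, here is a minimal numerical sketch (the function names are my own, purely illustrative) checking that, under both rules, the expected score is maximized by reporting x=p:

```python
import numpy as np

def log_score(x, true):
    # Bayesian Score: log(x) if the question is true, log(1 - x) if false.
    return np.log(x) if true else np.log(1 - x)

def squared_error_score(x, true):
    # Squared Error Score: 1 - (1 - x)^2 if true, 1 - x^2 if false.
    return 1 - (1 - x) ** 2 if true else 1 - x ** 2

def expected_score(score, x, p):
    # Expected score when the question is true with probability p.
    return p * score(x, True) + (1 - p) * score(x, False)

p = 0.7
xs = np.linspace(0.01, 0.99, 99)
for score in (log_score, squared_error_score):
    best = xs[np.argmax([expected_score(score, x, p) for x in xs])]
    print(score.__name__, round(best, 2))  # both print 0.7
```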

Eliezer argues for the Bayesian Score in A Technical Explanation of Technical Explanation, which I recommend reading. The reason he prefers the Bayesian Score is that he wants the sum of the scores for determining P(A) and P(B|A) to equal the score for determining P(A&B). In other words, he wants it not to matter whether you break a problem up into one experiment or two. This is a legitimate virtue of the scoring mechanism, but I think many people consider it more valuable than it actually is. It doesn't eliminate the problem that we don't know which questions to ask. It gives the same answer regardless of how we break an experiment into smaller experiments, but our score is still dependent on which questions are asked, and this cannot be fixed by just saying "Ask all questions." There are infinitely many of them, and the sum does not converge. Because the score is still a function of which questions are asked, the fact that it gives the same answer for some related sets of questions is not a huge benefit.

One nice thing about the Squared Error Score is that it always gives a score between 0 and 1, which means we can actually use it in real life. For example, we could ask someone to construct a spinner that comes up either true or false, and then spin it twice. They win if either of the two spins comes up with the correct answer. In this case, the best strategy is to build the spinner so it lands on true with probability p. There is no way to do anything similar for the Bayesian Score; in fact, it is questionable whether arbitrarily low utilities even make sense.
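Here is a rough simulation of that spinner mechanism (a sketch with made-up names), showing that the win rate for a true question approaches 1-(1-x)^2:

```python
import random

def spinner_round(x, answer, rng=random):
    # Build a spinner that lands on True with probability x, spin it twice,
    # and win if either spin matches the actual answer.
    spins = [rng.random() < x for _ in range(2)]
    return any(spin == answer for spin in spins)

# For a question that is actually true, the win rate approaches
# 1 - (1 - x)^2, i.e. the Squared Error Score, so honest x = p is optimal.
x = 0.7
trials = 100_000
wins = sum(spinner_round(x, True) for _ in range(trials))
print(wins / trials)  # roughly 1 - (1 - 0.7)**2 = 0.91
```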

The Bayesian Score is slightly easier to generalize to multiple choice questions. The Squared Error Score can also be generalized, but unfortunately the generalization makes your score a function of more than just the probability you assigned to the correct answer. For example, if A is the correct answer, you get more points for 80% A, 10% B, 10% C than for 80% A, 20% B, 0% C. The function you want for multiple options is: if you assign probabilities x_1 through x_n, and the first option is correct, you get output 2x_1-x_1^2-x_2^2-...-x_n^2. I do not think this is as bad as it seems. It kind of makes sense that when the answer is A, you get penalized slightly for saying that you are much more confident in B than in C, since making such a claim is a waste of information. To view this as a spinner: you construct a spinner, spin it twice, and you win if either spin gives the correct answer, or if the first spin comes lexicographically strictly before the second spin.
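A small sketch of this generalized score (the helper name and example numbers are purely illustrative) makes the 80/10/10 versus 80/20/0 comparison concrete:

```python
def quadratic_score(probs, correct):
    # Generalized Squared Error Score for a multiple-choice question:
    # 2 * (probability on the correct answer) - sum of all squared probabilities.
    return 2 * probs[correct] - sum(x * x for x in probs)

# With A (index 0) correct, spreading the leftover mass evenly beats
# concentrating it on one wrong answer:
print(quadratic_score([0.8, 0.1, 0.1], 0))  # approximately 0.94
print(quadratic_score([0.8, 0.2, 0.0], 0))  # approximately 0.92
```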

For the purpose of my calibration game, I will almost certainly use Squared Error Scoring, because a logarithmic score, with its unbounded penalties, is not feasible in a game. But it got me wondering why I am not thinking in terms of the Squared Error Score in real life.

You might ask what the experimental difference between the two is, since both are maximized by honest probabilities. Well, if I have two questions and I want to maximize my (possibly weighted) average score, and I have a limited amount of time to research and improve my answers to them, then it matters how much the scoring mechanism penalizes various errors. Bayesian Scoring penalizes so much for being sure of one false thing that none of the other scores really matter, while Squared Error is much more forgiving. If we normalize so that 50/50 gives 0 points while true certainty gives 1 point, then Squared Error gives -3 points for false certainty while Bayesian gives negative infinity.
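A quick check of those normalized values (the normalization helper here is just an illustrative sketch):

```python
import math

def normalize(f):
    # Affinely rescale a score so a 50/50 answer gives 0 and true certainty gives 1.
    lo, hi = f(0.5), f(1.0)
    return lambda x: (f(x) - lo) / (hi - lo)

squared = normalize(lambda x: 1 - (1 - x) ** 2)
bayesian = normalize(lambda x: math.log(x) if x > 0 else -math.inf)

# Score for placing all of your probability on the wrong answer:
print(squared(0.0))   # -3.0
print(bayesian(0.0))  # -inf
```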

I view maximizing Bayesian Score as the Golden Rule of epistemic rationality, so even a small chance that something else might be better is worth investigating. Even if you are fully committed to Bayesian Score, I would love to hear any pros or cons you can think of in either direction.

(Edited for formatting)


Quadratic scoring rules are often referred to as the Brier score (it seems odd to refer to one score by a name and the other by its functional form, rather than comparing names or functions).

You can read a comparison of the three proper scoring rules by Eric Bickel here. He argues for logarithmic scoring rules because of two practical concerns that I suspect are different from Eliezer's concern.

So, it looks like the two main concerns in this paper are:

  1. The Brier Score is non-local, meaning that sometimes it rewards giving a slightly lower probability to a true statement. This is because it penalizes you slightly for not distributing your probability mass equally among all the false hypotheses. This seems like it is probably a bad thing, but I am not completely sure. It is still a waste of information to prefer B to C when the correct answer is A. Additionally, if we only think about this in the context of true/false questions, it is a complete non-concern.

  2. The Bayesian Score is more stable to slightly non-linear utility functions. This is argued as a pro for the Bayesian Score, but I think it should be the other way around. The Bayesian Score is more stable to non-linear utility functions, but with the Brier Score you can use randomness to remove the problems from non-linear utility functions completely. Because the Brier Score gives you scores between 0 and 1, you don't have to hand out different utilities at all: you can just say you get some fixed prize with probability equal to your score. This is impossible with the Bayesian Score.

The paper also talks about a third "Spherical" scoring mechanism, which sets your score equal to the probability you assigned to the correct answer divided by the square root of the sum of the squares of all the probabilities.
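For concreteness, a one-line sketch of the spherical score as described (my own illustration, not code from the paper):

```python
import math

def spherical_score(probs, correct):
    # Spherical score: probability on the correct answer divided by the
    # Euclidean norm (root of the sum of squares) of the whole probability vector.
    return probs[correct] / math.sqrt(sum(x * x for x in probs))

print(spherical_score([0.8, 0.1, 0.1], 0))  # approximately 0.985
```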

Now that I know the name of this scoring rule, I will look for more information, but I think if anything that paper makes me like the Brier score better (at least for true/false questions).

It's probably worth pointing out that the paper is by J. Eric Bickel and not by the much better known statistician Peter Bickel.

Edited.


Maybe instead of grumbling, the website could be changed to make that the default workflow automatically, with an advanced option for raw HTML?

One nice thing about the Squared Error Score is that it always gives a score between 0 and 1... There is no way to do anything similar for the Bayesian Score

Scoring results between 0 and 1 actually seems like the wrong thing to do, because you are not adequately punishing people for being overconfident. If someone says they are 99.99% confident that event A will not happen, and then A does happen, you should assign that person a very strong penalty.

I find it much easier to think about the compression rate -log2(x) than the Bayesian Score. Thinking in terms of compression makes it easy to remember that the goal is to minimize -log2(x), or compress a data set to the shortest possible size (log base 2 gives you a bit length). Bit lengths are always nice positive numbers, and we have the nice interpretation that an overconfident guesser is required to use a very large codelength to encode an outcome that was predicted to have very low probability.
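A tiny illustration of the codelength view (the probabilities are arbitrary examples):

```python
import math

# Codelength in bits for an outcome predicted with probability x is -log2(x):
# confident correct predictions are cheap, confident wrong ones are enormous.
for x in (0.5, 0.9, 0.99, 0.01, 0.0001):
    print(f"p = {x:<7} codelength = {-math.log2(x):6.2f} bits")
```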

It is not obvious to me that people should be penalized so strongly for being wrong. In fact, I think that is a big part of the question I am asking. Would you rather be right 999 times and 100% sure but wrong the last time, or would you rather have no information on anything?

Would you rather be right 999 times and 100% sure but wrong the last time, or would you rather have no information on anything?

Is that a rhetorical question? Obviously it depends on the application domain: if we were talking about buying and selling stocks, I would certainly rather have no information about anything than experience a scenario where I was 100% sure and then wrong. In that scenario I would presumably have bet all my money and maybe lots of my investors' money, and then lost it all.

It does depend on the domain. I think the reason you want to be very risk-averse in stocks is that you have adversaries trying to take your money, so you get all the negatives of being wrong without all the positives of the 999 times you knew the stock would rise and were correct.

In other cases, such as deciding which route to take while traveling to save time, I'd rather be wrong every once in a while so that I could be right more often.

Both of these ideas are about instrumental rationality, so the question is: if you are trying to come up with a model of epistemic rationality which does not depend on utility functions, what type of scoring should you use?

This discussion suggests that the puzzles presented to the guesser should be associated with a "stake" - a numeric value which says how much you (the asker) care about this particular question being answered correctly (i.e. how risk-averse you are on this particular occasion). Can this be incorporated into the reward function itself, or does it need to be a separate input? (Is "I want to know if this stock will go up or down, and I care 10 times as much about this question as about whether it will rain today" the same thing as "Please estimate p for the following two questions, where the reward function for the first one is f(x)=10(x-x^2) and for the second is f(x)=x-x^2"?) Does it require some additional output channel from the guesser ("I am 90% confident that p is 80%", or maybe even "Here's my distribution over the values of p in (0,1)"), or does it collapse into one dimension anyway? (Does "I am 90% confident that p is 80% and 10% confident that it's 70%" collapse to "I think p is 79%"? Does a distribution over p collapse to its expected value?)

It's meaningless to talk about optimizing epistemic rationality without talking about your utility function. There are a lot of questions you could get better at answering. Which ones you want to answer depends on what kind of decisions you want to make, which depends on what you value.

But probabilities are a useful latent variable in the reasoning process, and it can be worthwhile instrumentally to try to have accurate beliefs, as this may help out in a wide variety of situations that we cannot predict in advance. So there is still the question of which beliefs it is most important to make more accurate.

Also, I believe the OP is trying to write code for a variant of the calibration game, so it is somewhat intrinsically necessary for him to score probabilities directly.

This is for a game? How do you win? Does maximizing the expectation of intermediate scores maximize the probability that you win? Even when you know your prior scores during this game? Even if you know your opponents' scores? If it's not that type of game, then however these scores are aggregated, does maximizing the expectations of your scores on each question maximize the expectation of your utility from the aggregate?

So my main question is not about the game; it is a philosophical question about how I should define my epistemic rationality. However, there is also a game I am designing. I don't know what the overall structure of my game will be, but it actually doesn't matter what your score is or what the win condition is. As long as there are isolated questions, it is always better to win a round than to lose it, and each round is done with the spinners I described, the optimal strategy will always be honest reporting of probabilities.

In fact, you could take any trivia game which asks only multiple choice questions, in which you always want to get the answer right, and replace it with my spinner mechanism, and it will work.

The problem with the squared error score is that it just rewards asking a ton of obvious questions. I predict with 100% probability that the sky will be blue one second from now. Just keep repeating for a high score.

Both methods fail miserably if you get to choose what questions are asked. Bayesian score rewards never asking any questions ever. Or, if you normalize it to assign 1 to true certainty and 0 to 50/50, then it rewards asking obvious questions also.

If it helps, you can think of the squared error score as -(1-x)^2 instead of 1-(1-x)^2, which fixes this problem.

Both methods fail miserably if you get to choose what questions are asked. Bayesian score rewards never asking any questions ever. Or, if you normalize it to assign 1 to true certainty and 0 to 50/50, then it rewards asking obvious questions also.

Only because you are baking in an implicit loss function under which all questions are equally valuable; switch to some other loss function which weights more interesting or harder questions more heavily, and this problem disappears, as 'the sky is blue' ceases to be worth anything compared to a real prediction like 'Obama will be re-elected'.

I don't understand why what you are suggesting has anything to do with what I said.

Yes, of course you can assign different values to different statements, and I mentioned this. However, what I was saying here is that if you allow the option of just not answering one of the questions (whatever that means), then there has to be some utility associated with not answering. The comment that I was responding to was saying that Bayesian was better than Brier because Brier gave positive utilities instead of negative utilities, and so could be cheated by asking lots of easy questions.

Your response seems to be about scaling the utilities for each question based on its importance. I mentioned that when I said "(possibly weighted) average score." That is a very valid point, but I don't see how it has anything to do with the problems associated with being able to choose what questions are asked.

That is a very valid point, but I don't see how it has anything to do with the problems associated with being able to choose what questions are asked.

I don't understand your problem here. If questions' values are scaled appropriately, or some fancier approach is used, then it doesn't matter if respondents pick and choose because they will either be wasting their time or missing out on large potential gains. A loss function style approach seems to adequately resolve this problem.

I think this is probably bad communication on my part.

The model I am imagining from you is that there is some countable collection of statements you want to assign true/false to. You assign some weight function to the statements so that the total weight of all statements is some finite number, and your score is the sum of the weights of all statements which you choose to answer.

For this, it really matters not only how the values are scaled, but also how they are translated. It matters what the 0 utility point for each question is, because that determines whether or not you want to choose to answer that question. I think that the 0 utility point should be put at the utility of the 50/50 probability assignment for each question. In this case, not answering a question is equivalent to answering it with 50/50 probability, so I think it would be simpler to just say that you have to answer every question, with a default answer of 50/50, in which case the 0 points don't matter anymore. This is just semantics.

But just saying that you scale each question by its importance doesn't fix the problem: if you model this as being free to choose which questions to answer, with your utility being the sum of your utilities on the individual questions, then the Bayesian rule as written encourages not answering any questions at all, since it can only give you negative utility. You have to fix that by either fixing 0 points for your utilities in some reasonable way, or by requiring that you are assigned utility for every question, with a default answer if you don't think about it at all.

There are benefits to weighting the questions because that allows us to take infinite sums, but if we assume for now that there are only finitely many questions, and all questions have rational weights, then weighting the questions is similar to just asking the same questions multiple times (in proportion to their weights). This may be more accurate for what we want in epistemic rationality, but it doesn't actually solve the problems associated with allowing people to pick and choose questions.

The model I am imagining from you is that there is some countable collection of statements you want to assign true/false to. You assign some weight function to the statements so that the total weight of all statements is some finite number, and your score is the sum of the weights of all statements which you choose to answer.

Hm, no, I wasn't really thinking that way. I don't want some finite number, I want everyone to reach different numbers so more accurate predictors score higher.

The weights on particular questions do not even have to be algorithmically set - for example, a prediction market is immune to the 'sky is blue' problem because if one were to start a contract for 'the sky is blue tomorrow', no one would trade on it unless one were willing to lose money being a market-maker as the other trader bid it up to the meteorologically-accurate 80% or whatever. One can pick and choose as much as one pleases, but unless one's contracts were valuable to other people for any reason, it would be impossible to make money by stuffing the market with bogus contracts. The utility just becomes how much money you made.

I think that the 0 utility point should be put at the utility of the 50/50 probability assignment for each question.

I think this doesn't work because you're trying to invent a non-informative prior, and it's trivial to set up sets of predictions where the obviously better non-informative prior is not 1/2: for example, set up 3 predictions, one for each of 3 mutually exclusive and exhaustive outcomes, where the non-informative prior obviously looks more like 1/3 and 1/2 means someone is getting robbed. More importantly, uninformative priors are disputed and it's not clear what they are in more complex situations. (Frequentist Larry Wasserman goes so far as to call them "lost causes" and "perpetual motion machines".)

But just saying that you scale each question by its importance doesn't fix the problem: if you model this as being free to choose which questions to answer, with your utility being the sum of your utilities on the individual questions, then the Bayesian rule as written encourages not answering any questions at all, since it can only give you negative utility. You have to fix that by either fixing 0 points for your utilities in some reasonable way, or by requiring that you are assigned utility for every question, with a default answer if you don't think about it at all.

Perhaps raw log odds are not the best idea, but do you really think there is no way to turn them into some score which disincentivizes strategic predicting? This just sounds arrogant to me, and I would only believe it if you summarized all the existing research into rewarding experts and showed that log odds simply could not be used in any circumstance where a predictor could choose to predict only a subset of the specified predictions.

but if we assume for now that there are only finitely many questions, and all questions have rational weights, then weighting the questions is similar to just asking the same questions multiple times (in proportion to their weights).

There aren't finitely many questions, because one can ask questions involving each of the infinite set of integers... Knowing which questions are really asking the same thing sounds like an impossible demand to meet (for example, any system which claimed this could solve the Halting Problem, simply by being asked to predict the outputs of 2 Turing machines).


If you normalize Bayesian score to assign 1 to 100% and 0 to 50% (and -1 to 0%), you encounter a math error.

I didn't do that. I only set 1 to 100% and 0 to 50%. 0% is still negative infinity.

That's the math error.

Why is it consistent that assigning a probability of 99% to one half of a binary proposition that turns out false is much better than assigning a probability of 1% to the opposite half that turns out true?

There's no math error.

Why is it consistent that assigning a probability of 99% to one half of a binary proposition that turns out false is much better than assigning a probability of 1% to the opposite half that turns out true?

I think there's some confusion. Coscott said these three facts:

Let f(x) be the output if the question is true, and let g(x) be the output if the question is false.

f(x)=g(1-x)

f(x)=log(x)

In consequence, g(x)=log(1-x). So if x=0.99 and the question is false, the output is g(x)=log(1-x)=log(0.01). Or if x=0.01 and the question is true, the output is f(x)=log(x)=log(0.01). So the symmetry that you desire is true.

But that doesn't output 1 for estimates of 100%, 0 for estimates of 50%, and -inf (or even -1) to estimates of 0%, or even something that can be normalized to either of those triples.

Here's the "normalized" version: f(x)=1+log2(x), g(x)=1+log2(1-x) (i.e. scale f and g by 1/log(2) and add 1).

Now f(1)=1, f(.5)=0, f(0)=-Inf ; g(1)=-Inf, g(.5)=0, g(0)=1.

Ok?

Huh. I thought that wasn't a Bayesian score (not maximized by estimating correctly), but doing the math, the maximum is at the right point for 1/4, 1/100, 3/4, 99/100, and 1/2.
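As a sanity check (my own sketch, using a simple grid search), the normalized score is indeed maximized by reporting the true probability:

```python
import math

def expected_normalized_score(x, p):
    # Expected normalized Bayesian score when the true probability is p:
    # p * (1 + log2(x)) + (1 - p) * (1 + log2(1 - x)).
    return p * (1 + math.log2(x)) + (1 - p) * (1 + math.log2(1 - x))

for p in (0.25, 0.01, 0.75, 0.99, 0.5):
    grid = [i / 1000 for i in range(1, 1000)]
    best = max(grid, key=lambda x: expected_normalized_score(x, p))
    print(p, best)  # the maximum lands at x = p in each case
```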