Alternative to Bayesian Score

Scott Garrabrant

I am starting to wonder whether or not Bayes Score is what I want to maximize for my epistemic rationality. I started thinking about this while trying to design a board game to teach calibration of probabilities, so I will use that as my example:

I wanted a scoring mechanism which motivates honest reporting of probabilities and rewards players who are better calibrated. For simplicity, let's assume that we only have to deal with true/false questions for now. A player is given a question which they believe is true with probability p. They then name a real number x between 0 and 1. Then, they receive a score which is a function of x and whether or not the answer is true. We want the expected score to be maximized exactly when x=p. Let f(x) be the output if the question is true, and let g(x) be the output if the question is false. Then, my expected utility is (p)f(x)+(1-p)g(x). If we assume f and g are smooth, then in order to have a maximum at x=p, we want (p)f'(p)+(1-p)g'(p)=0, which still leaves us with a large class of functions. It would also be nice to have symmetry by having f(x)=g(1-x). If we further require this, we get (p)f'(p)+(1-p)(-1)f'(1-p)=0, or equivalently (p)f'(p)=(1-p)f'(1-p). One way to achieve this is to set (x)f'(x) equal to a constant c. Then f'(x)=c/x, so f(x)=c log x up to an additive constant, and taking c=1 gives f(x)=log x. This scoring mechanism is referred to as "Bayesian Score".

However, another natural way to achieve this is by setting f'(p)/(1-p) equal to a constant. If we set this constant equal to 2, we get f'(x)=2-2x, which gives us f(x)=2x-x²=1-(1-x)². I will call this the "Squared Error Score."
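As a quick numerical check of the two rules above, here is a minimal Python sketch. The function names and the grid search are purely illustrative assumptions, not part of any library. It computes the expected score (p)f(x)+(1-p)g(x) with g(x)=f(1-x) and looks for the report x that maximizes it.

```python
import math

def log_score(x: float, outcome: bool) -> float:
    """Bayesian Score: log of the probability assigned to what actually happened."""
    q = x if outcome else 1.0 - x
    return math.log(q) if q > 0 else float("-inf")

def squared_error_score(x: float, outcome: bool) -> float:
    """Squared Error Score: 1 - (1 - q)^2, where q is the probability given to the truth."""
    q = x if outcome else 1.0 - x
    return 1.0 - (1.0 - q) ** 2

def expected_score(score, p: float, x: float) -> float:
    """Expected score of reporting x when the true probability is p."""
    return p * score(x, True) + (1.0 - p) * score(x, False)

def best_report(score, p: float, steps: int = 1000) -> float:
    """Grid search over x in (0, 1); an illustrative stand-in for solving f' analytically."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda x: expected_score(score, p, x))

if __name__ == "__main__":
    for p in (0.2, 0.5, 0.7):
        print(p, best_report(log_score, p), best_report(squared_error_score, p))
```

On this grid, both rules should pick out x=p, which is the honesty property the derivation asks for.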

There are many other functions which satisfy the desired conditions, but these two are the simplest, so I will focus on these two.

Eliezer argues for Bayesian Score in A Technical Explanation of Technical Explanation, which I recommend reading. The reason he prefers Bayesian Score is that he wants the sum of the scores associated with determining P(A) and P(B|A) to equal the score for determining P(A&B). In other words, he wants it to not matter whether you break a problem up into one experiment or two experiments. This is a legitimate virtue of this scoring mechanism, but I think that many people think it is a lot more valuable than it is. It doesn't eliminate the problem that we don't know what questions to ask. It gives us the same answer regardless of how we break up an experiment into smaller experiments, but our score is still dependent on what questions are asked, and this cannot be fixed by just saying, "Ask all questions." There are infinitely many of them, and the sum does not converge. Because the score is still a function of what questions are asked, the fact that it gives the same answer for some related sets of questions is not a huge benefit.
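To make the additivity property concrete, here is a small Python sketch. It assumes that A and B both turn out to be true, and the function names are hypothetical. It compares scoring P(A) and P(B|A) separately against scoring P(A&B)=P(A)P(B|A) in one go.

```python
import math

def log_score(q: float) -> float:
    """Bayesian Score for assigning probability q to something that turned out true."""
    return math.log(q)

def squared_error_score(q: float) -> float:
    """Squared Error Score for assigning probability q to something that turned out true."""
    return 1.0 - (1.0 - q) ** 2

def split_score(score, p_a: float, p_b_given_a: float) -> float:
    """Two experiments: score P(A) and P(B|A) separately and add the results."""
    return score(p_a) + score(p_b_given_a)

def joint_score(score, p_a: float, p_b_given_a: float) -> float:
    """One experiment: score the single probability P(A&B) = P(A) * P(B|A)."""
    return score(p_a * p_b_given_a)

if __name__ == "__main__":
    p_a, p_b_given_a = 0.6, 0.5  # arbitrary illustrative numbers
    for name, score in (("log", log_score), ("squared", squared_error_score)):
        print(name, split_score(score, p_a, p_b_given_a), joint_score(score, p_a, p_b_given_a))
```

The log rule gives the same total either way, because log(ab)=log a+log b. The Squared Error Score does not.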

One nice thing about the Squared Error Score is that it always gives a score between 0 and 1, which means it can be read as a probability of winning and so can actually be used in real life. For example, we could ask someone to construct a spinner that comes up either true or false, and then spin it twice. They win if either of the two spins comes up with the true answer. If the spinner comes up true with probability x and the answer is true, they win with probability 1-(1-x)², which is exactly the Squared Error Score. In this case, the best strategy is to build the spinner so that it comes up true with probability p. There is no way to do anything similar for the Bayesian Score; in fact, it is questionable whether or not arbitrarily low utilities even make sense.
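The spinner game can be sketched in Python as follows. This is only an illustration under my own assumptions: the function names, the trial count and the seed are arbitrary. It compares a simulated win rate with the closed-form win probability.

```python
import random

def exact_win_probability(x: float, answer: bool) -> float:
    """Chance that at least one of two spins matches the answer,
    for a spinner that comes up true with probability x."""
    miss = (1.0 - x) if answer else x  # chance a single spin misses the answer
    return 1.0 - miss ** 2

def simulate_spinner_game(x: float, answer: bool, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the win rate; seeded so that runs are repeatable."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        first = rng.random() < x   # True means the spinner landed on "true"
        second = rng.random() < x
        if first == answer or second == answer:
            wins += 1
    return wins / trials

if __name__ == "__main__":
    x = 0.6
    print(exact_win_probability(x, True), simulate_spinner_game(x, True))
```

The simulated rate should land close to 1-(1-x)² when the answer is true and close to 1-x² when it is false.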

The Bayesian Score is slightly easier to generalize to multiple choice questions. The Squared Error Score can also be generalized, but unfortunately your score then has to depend on more than just the probability you assigned to the correct answer. For example, if A is the correct answer, you get more points for 80%A 10%B 10%C than for 80%A 20%B 0%C. The function you want for multiple values is this: if you assign probabilities x₁ through xₙ, and the first option is correct, you get output 2x₁-x₁²-x₂²-...-xₙ². I do not think this is as bad as it seems. It kind of makes sense that when the answer is A, you get penalized slightly for saying that you are much more confident in B than in C, since making such a claim is a waste of information. To view this as a spinner, you construct a spinner, spin it twice, and you win if either spin gets the correct answer, or if the first spin comes lexicographically strictly before the second spin.
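The multiple choice formula is easy to check on the example above. The following Python sketch uses a hypothetical function name, takes the probabilities as a list, and takes the index of the correct option.

```python
from typing import Sequence

def multi_choice_squared_error_score(probs: Sequence[float], correct: int) -> float:
    """Score 2*x_c - sum_i x_i^2, where c is the index of the correct option."""
    return 2.0 * probs[correct] - sum(q * q for q in probs)

if __name__ == "__main__":
    spread_evenly = multi_choice_squared_error_score([0.8, 0.1, 0.1], correct=0)
    lopsided = multi_choice_squared_error_score([0.8, 0.2, 0.0], correct=0)
    print(spread_evenly, lopsided)
```

With A correct, the 80/10/10 assignment should score slightly higher than 80/20/0, even though both give A the same 80%.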

For the purpose of my calibration game, I will almost certainly use Squared Error Scoring, because a log score is unbounded below and so not feasible for a game. But it got me thinking about why I am not thinking in terms of Squared Error Score in real life.

You might ask what the experimental difference between the two is, since they are both maximized by honest probabilities. Well, if I have two questions and I want to maximize my (possibly weighted) average score, and I have a limited amount of time to research and improve my answers for them, then it matters how much the scoring mechanism penalizes various errors. Bayesian Scoring penalizes so much for being sure of one false thing that none of the other scores really matter, while Squared Error is much more forgiving. If we normalize to say that 50/50 gives 0 points while true certainty gives 1 point, then Squared Error gives -3 points for false certainty while Bayesian gives negative infinity.
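The normalization is a short calculation, sketched here in Python with illustrative function names. Each rule is rescaled linearly so that x=1/2 scores 0 and x=1 scores 1, where x is the probability assigned to the true answer. For the log rule this amounts to using base 2 and adding 1.

```python
import math

def normalized_squared_error(x: float) -> float:
    """Squared Error Score rescaled so that x=0.5 gives 0 and x=1 gives 1."""
    raw = lambda q: 1.0 - (1.0 - q) ** 2
    return (raw(x) - raw(0.5)) / (raw(1.0) - raw(0.5))

def normalized_log(x: float) -> float:
    """Bayesian Score rescaled the same way: 1 + log2(x), or -inf at x=0."""
    return 1.0 + math.log2(x) if x > 0 else float("-inf")

if __name__ == "__main__":
    for x in (1.0, 0.5, 0.1, 0.0):
        print(x, normalized_squared_error(x), normalized_log(x))
```

At x=0, false certainty, the first function gives -3 and the second gives negative infinity, matching the figures above.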

I view maximizing Bayesian Score as the Golden Rule of epistemic rationality, so even a small chance that something else might be better is worth investigating. Even if you are fully committed to Bayesian Score, I would love to hear any pros or cons you can think of in either direction.

(Edited for formatting)

The model I am imagining from you is that there is some countable collection of statements you want to assign true/false to. You assign some weight function to the statements so that the total weight of all statements is some finite number, and your score is the sum of the weights of all statements which you choose to answer.

Hm, no, I wasn't really thinking that way. I don't want some finite number, I want everyone to reach different numbers so more accurate predictors score higher.

The weights on particular functions do not even have to be set algorithmically. For example, a prediction market is immune to the 'sky is blue' problem because if one were to start a contract for 'the sky is blue tomorrow', no one would trade on it unless one were willing to lose money being a market-maker as the other trader bid it up to the meteorologically-accurate 80% or whatever. One can pick and choose as much as one pleases, but unless one's contracts were valuable to other people for any reason, it would be impossible to make money by stuffing the market with bogus contracts. The utility just becomes how much money you made.

I think that the 0 utility point should be put at the utility of the 50/50 probability assignment for each question.

I think this doesn't work because you're trying to invent a non-informative prior, and it's trivial to set up sets of predictions where the obviously better non-informative prior is not 1/2: for example, set up 3 predictions for each of 3 mutually-exhaustive outcomes, where the non-informative prior obviously looks more like 1/3 and 1/2 means someone is getting robbed. More importantly, uninformative priors are disputed and it's not clear what they are in more complex situations. (Frequentist Larry Wasserman goes so far as to call them "lost causes" and "perpetual motion machines".)

But just saying that you scale each question by its importance doesn't fix the following problem. If you can choose which questions to answer, and your utility is the sum of your utilities for the individual questions, then the Bayesian rule as written encourages not answering any questions at all, since each answer can only give you negative utility. You have to fix that either by fixing the 0 point for your utilities in some reasonable way, or by requiring that you are assigned utility for every question, with a default answer if you don't think about it at all.

Perhaps a raw log odds is not the best idea, but do you really think there is no way to interpret them into some score which disincentivizes strategic predicting? This sounds just arrogant to me, and I would only believe it if you summarized all the existing research into rewarding experts and showed that log odds simply could not be used in any circumstance where any predictor could predict a subset of the specified predictions.

but if we assume for now that there are only finitely many questions, and all questions have rational weights, then weighing the questions is similar to just asking the same questions multiple times (proportional to its weight).

There aren't finitely many questions because one can ask questions involving each of the infinite set of integers... Knowing that questions are asking identical questions sounds like an impossible demand to meet (for example, if any system claimed this, it could solve the Halting Problem by simply asking it to predict the output of 2 Turing machines).
