
Credence calibration game FAQ

13 Post author: Academian 26 November 2012 12:52AM

Hey rationality friends, I just made this FAQ for the credence calibration game.  So if you have people you'd like to introduce to it --- for example, to get them used to thinking of belief strengths as probabilities --- now is a good time :)

Also, shameless promotion: please tweet/g+/like it; I want the world to be thinking in probabilities ASAP!

*Also*, please email me (critch@math.berkeley.edu) if you're good at making apps quickly and are interested in improving the game or making a variant of it; I'm swamped in job applications right now, but could easily have a Skype or phone conversation about our cache of ideas for improvements / variations (e.g. collecting user data on a server, more question types, a variant awarding gambles rather than deterministic scores, a variant with clickable emotion buttons for the user...).

Cheers!

Comments (55)

Comment author: asparisi 26 November 2012 05:47:15AM *  11 points [-]

Suggestions (for general audience outside of LW/Rationalist circles)

I like the name "Confidence Game"- reminds people of a con game while informing you as to the point of the game.

Try to focus on a positive-point scale if you can. Make it so that winning nets you a lot of points but "losing" only a couple. (Same effect on scores, either way.) This won't seem as odd if you set it up as one long scale rather than two shorter ones: so 99-90-80-60-50-60-80-90-99.

Setting it to a timer will make it ADDICTIVE. Set it up in quick rounds. Make it like a quiz show. No question limit, or a bonus if you hit the limit for being "Quick on your feet." Make it hard but not impossible to do.

Set up a leaderboard where you can post to FB, show friends, and possibly compare your score to virtual "opponents" (which are really just scoring metrics). Possibly make those metrics con-man themed, keeping with the game's name.

Graphics will help a lot. Consider running with the con-game theme.

Label people: maybe something like "Underconfident" "Unsure" "Confident" "AMAZING" "Confident" "Overconfident" "Cocksure" (Test labels to see what works well!) rather than using graphs. Graphs and percentages? Turn-off. Drop the % sign and just show two numbers with a label. Make this separate from points but related. (High points=greater chance of falling toward the center, but in theory not necessarily the same.) Yes, I know the point is to get people to think in percentages, but if you want to do that you have to get them there without actually showing them math, which many find off-putting.

Set up a coin system that earns you benefits for putting into the game: extended round, "confidence streak" bonuses, hints, or skips might be good rewards here. Test and see what works. Allow people to pay for coins, but also reward coins for play or another mini-game related to play or both. (Investment=more play)

Comment author: asparisi 26 November 2012 05:50:26AM 1 point [-]

Another thought: once you have a large bank of questions, consider "theme questions" as something people can buy with coins. Yes, that becomes a matter of showing off rather than the main point, but people LIKE to show off.

Comment author: lukeprog 26 November 2012 01:57:02AM *  9 points [-]

Confusingly, this game has at least three titles:

Comment author: Academian 26 November 2012 09:38:18AM *  1 point [-]

Yes, Alexei (aka Bent Spoon Games) and I talked about the name recently; to promote its use in university courses teaching Bayesian statistics, we're sticking with Credence Game. Confidence means something slightly different in statistics, and the game is meant to teach not just calibration, but also the act of measuring belief strength itself. The name update on BSG, and in the app itself as downloaded from there, will happen soon enough.

Comment author: gjm 07 December 2012 11:48:00PM 3 points [-]

A couple of times I've accidentally clicked twice in rapid succession on the "Continue" button, and the second click has been interpreted as an answer to the next question. (I think it happened to be a correct answer both times.)

I suggest disabling the buttons -- or at least the "B" ones -- for about 100ms after displaying a new question.
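A minimal sketch of the lockout suggested above, written in Python for illustration rather than in the game's own code. The class name `AnswerButtons` and its methods are hypothetical; the only idea taken from the comment is that clicks are ignored for about 100ms after a new question appears.

```python
import time


class AnswerButtons:
    """Ignores clicks that arrive too soon after a new question is shown."""

    def __init__(self, lockout_seconds=0.1, clock=time.monotonic):
        self.lockout_seconds = lockout_seconds  # roughly the 100ms suggested above
        self.clock = clock                      # injectable so the logic can be tested
        self.shown_at = None                    # no question on screen yet

    def show_question(self):
        # Record the moment the new question appeared.
        self.shown_at = self.clock()

    def accept_click(self):
        # A click counts only if a question is showing and the lockout has elapsed.
        if self.shown_at is None:
            return False
        return self.clock() - self.shown_at >= self.lockout_seconds
```

With this shape, a stray second click on "Continue" that lands within the lockout window is simply dropped instead of being read as an answer.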

Comment author: DanielVarga 30 November 2012 09:38:58AM *  3 points [-]

Nice game, BUT. The Mac version caused me some pain. I chose the default fullscreen settings, and I couldn't quit the application for minutes. No quit button that I could find. No drop-down menu bar. Cmd-tab stopped working. Mission Control (F3) stopped working. WTF? Finally I figured out that alt-cmd-esc still works, and then I could quit the game.

Comment author: Kindly 27 November 2012 01:46:38PM 2 points [-]

This claim seems incorrect:

[mean squared error] wouldn't penalize a credence of 99.999% on wrong answers enough to strongly dissuade unjustified absolute certainties.

Since the scoring rule is proper, the penalty for incorrectly being 99.999% confident is exactly high enough that you wouldn't want to say you are if you're merely 99.998% confident.

Unless you mean something different?
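One way to check the point about properness numerically is sketched below. It assumes the squared-error penalty is (1 - q)^2 when the chosen answer is right and q^2 when it is wrong, where q is the reported credence; the function names are made up for illustration.

```python
def expected_squared_error(true_p, reported_q):
    """Expected squared-error penalty when the answer is right with probability true_p."""
    return true_p * (1 - reported_q) ** 2 + (1 - true_p) * reported_q ** 2


def best_report(true_p, steps=100000):
    """Grid-search the report q in [0, 1] that minimises the expected penalty."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda q: expected_squared_error(true_p, q))
```

Under that assumption the expected penalty works out to p(1 - p) + (q - p)^2, so it is smallest when the report q equals the true credence p. Reporting 99.999% while believing 99.998% does cost something in expectation, although the extra cost, (q - p)^2, is tiny.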

Comment author: ChristianKl 03 December 2012 10:13:30PM *  2 points [-]

I don't care about the US Postmaster General or about who won which sports championship. I don't care about boxing champions. I don't care about the various US sport teams.

I don't even want to learn that knowledge.

Nobel Prizes are more interesting. Who got the Nobel prize in Physiology or Medicine 2012? A) John B. Gurdon and Shinya Yamanaka B) Bruce A. Beutler, Jules A. Hoffmann and Ralph M. Steinman

Who got the Nobel prize in Physiology or Medicine for the discovery that mature cells can be reprogrammed to become pluripotent? A) John B. Gurdon and Shinya Yamanaka B) Bruce A. Beutler, Jules A. Hoffmann and Ralph M. Steinman

EDIT: (This comment should be top level, sorry...)

Comment author: Alexei 06 December 2012 04:01:33AM 1 point [-]

Yeah, I feel you. I'm currently adding a feature where you specify what questions you want to be asked.

Comment author: wanderingsoul 27 November 2012 10:06:47AM 2 points [-]

Nice game, good to see someone making it easy to just practice being well calibrated.

My calibration started off wonky, (e.g. was wrong each of the first six times I claimed 70% certainty) but quickly improved. Unfortunately, it improved suspiciously well, I suspect I may have been assigning probabilities with my primary goal not being scoring points, but instead with trying to get that bar graph displayed every 5 or 10 questions to even out. It's a well designed game, but unfortunately at least for me the score wasn't the main motivator, which is a problem because the score is the quantity that increases by being helpfully well-calibrated. Anyone else have a similar experience?

Comment author: gjm 07 December 2012 11:37:26PM 1 point [-]

My experience is distinctly similar.

I observed another curiosity. For much of my time playing the game I've got a larger fraction of 50%s right than of 60%s. I think what's going on is that the 50% cases are ones where I definitely have no idea of the answer and have to fall back on heuristics (have I heard of this person? does the name sound old or recent? etc.) -- and the heuristics work better than I can bring myself to admit they do :-).

Comment author: evand 26 November 2012 04:51:29AM 2 points [-]

The difference between 60% credence and 80% credence seems much smaller to me than the difference between 90% and 99%. Is there a reason there's no option between 90% and 99%? In your testing, have you found any well-calibrated users who answer 99% a non-trivial fraction of the time?

Comment author: asparisi 26 November 2012 01:37:28PM 3 points [-]

Well, in 11 out of 145 answers (7.5%) I so far have answered 99%, and I have yet to be wrong in any of my answers.

If I continue at this rate, in approximately 1,174 more answers, I'll be able to tell you if I am well calibrated (fewer, if I fail at more than one answer in the intervening time).
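To see why 11 correct answers at the 99% level say little either way, here is a small sketch that assumes the answers are independent and share one true accuracy. The function name is hypothetical.

```python
def chance_of_no_misses(true_accuracy, n_answers):
    """Probability of getting every one of n independent answers right."""
    return true_accuracy ** n_answers


# 11 answers at the 99% level, as in the comment above.
for accuracy in (0.99, 0.95, 0.90):
    print(accuracy, round(chance_of_no_misses(accuracy, 11), 3))
```

Under these assumptions, a perfect run of 11 happens about 90% of the time at a true accuracy of 99%, but still about 57% of the time at 95% and about 31% of the time at 90%, so many more answers are needed to tell those cases apart.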

Comment author: asparisi 26 November 2012 01:38:31PM 0 points [-]

To clarify, I mean wrong in any of my answers at the 99% level. I have been wrong at other levels (including, surprisingly, hovering within around 1% of 90% at the 90% level).

Comment author: Alexei 06 December 2012 04:03:09AM 0 points [-]

Yes, but it also depends on your knowledge base. Unfortunately right now we don't have a lot of questions, but I hope people will add more. Then, we hope, you'll have more questions for which you can answer 99%.

Comment author: ChristianKl 03 December 2012 10:47:56PM *  0 points [-]

I would say that I know with >99% certainty that Woodrow Wilson was president earlier than George H. W. Bush.

The same is true for the fact that the element silver is abbreviated Ag.

Comment author: pleeppleep 26 November 2012 02:15:33AM 3 points [-]

I feel that "rationality friends" should be a standard way of addressing Lesswrongers.

Comment author: Nisan 26 November 2012 05:55:21AM 6 points [-]

Yes. Also, "Hear me, rat-people."

Comment author: Viliam_Bur 26 November 2012 02:59:49PM 1 point [-]

Comment author: David_Gerard 26 November 2012 06:31:19AM 4 points [-]

Super Rationality Adventure Pals the Saturday morning cartoon! On 1080p from a BitTorrent near you.

Comment author: Luke_A_Somers 26 November 2012 11:26:50AM 4 points [-]

We know we're winning when they begin making rationality friends music videos.

Comment author: David_Gerard 27 November 2012 09:05:47PM 4 points [-]

The 5yo has been watching Pokemon anime by the ton. I'm picturing something with similar production values featuring the CFAR trials. A wise old fellow who seems to know just everything. People catch GREAT BIG Cognitive Biases out in the wild, then put these Cognitive Biases into training to evolve them smaller and smaller. RationalWiki guest-star as Team Rocket. It'll be huge.

Comment author: Alicorn 28 November 2012 07:28:04PM 1 point [-]

This would be cute, except you'd have to catch the biases in a dreamscape/mindscape of some kind, not in the "wild". (Why would you want to retrieve them, even for shrinking, unless they were already inside your head?)

Comment author: gwern 28 November 2012 09:10:30PM 7 points [-]

'Arguments are soldiers' and 'reason is for persuasion'; hence, you obviously use biasemons on your opponents to make them make bad decisions and believe wrong things.

(It's not any more malicious than enslaving wild animals and making them fight until they are knocked unconscious by the trauma...)

Comment author: David_Gerard 28 November 2012 11:36:26PM 1 point [-]

Because biases are so much easier to identify in other people, of course.

Comment author: Kindly 27 November 2012 03:42:18PM 1 point [-]

Greetings, my rational comrades!

Comment author: pleeppleep 27 November 2012 03:54:22PM 1 point [-]

No.

Comment author: Jabberslythe 26 November 2012 10:15:06PM 1 point [-]

I have been looking for a good calibration test for a while! Thank you.

Here's a little critique: The questions become repetitive after not too long. I have also noticed that a few of the questions are wrong, as in it says that I am wrong and docks points but I am right according to the data it cites (I think some of these had to do with The Dark Knight).

Comment author: Jabberslythe 27 November 2012 02:51:00PM 2 points [-]

Also, many of the questions are US-centric, and many are about sports or popular music or other things that I would not expect the average person interested in calibration to be knowledgeable about or interested in. If you are trying to make it take off, however, these might be justified and could be good decisions.

Comment author: Alexei 06 December 2012 04:04:38AM 1 point [-]

Yes, I agree. We are working on adding more questions to the game.

I fixed the bug with the film question.

Comment author: ChristianKl 03 December 2012 11:38:50PM 0 points [-]

When it comes to the boiling point of platinum I would however doubt that the average person has any interest.

Comment author: linkhyrule5 04 August 2013 03:20:13AM 0 points [-]

One thing I might suggest is leaving the question on screen after the question has been answered, and loading the answer on the same screen. I keep indicating an answer I didn't mean to select, and I'm not sure if it's a bug or my twitchy fingers.

Comment author: ChristianKl 08 December 2012 12:22:39PM *  0 points [-]

If you ask Snopes, maybe you can use their database of claims. You could start by throwing out all claims that Snopes doesn't classify as true or false.

Example claim (a rumour examined by Snopes) from http://www.snopes.com/business/money/money.asp: U.S. law specifies that a creditor does not have to accept more than 100 pennies towards the payment of a debt or obligation.

A) True - B) False

Once the user has answered, the software presents him with "False" and gives him a link to the Snopes website where he can read more information.

Snopes would profit from having more people exposed to their database and from those people being directed by a link to the Snopes website.

I don't know whether the Snopes folks would be open to having their database used in that way, but asking won't hurt.

Comment author: gjm 08 December 2012 12:00:00PM *  0 points [-]

Bug report: I just got asked which of Cleopatra and Spider-Man 3 was more expensive, answered, and was told:

Incorrect. The right answer is A ($Bmillion) vs B ($Cmillion)

where C > B. I am having trouble understanding how this is even possible given how I'd have thought the game works. Some weird screwup related to inflation correction? Comparing the prices as strings rather than numbers? (B was shorter than C but with a larger first digit.)

[EDITED to add: hey, it just did it to me again. This time it was Cleopatra versus Troy. Same failure mode otherwise.]
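The string-comparison guess above is easy to reproduce. The sketch below uses made-up budget figures (not the game's real data) where one value is shorter but has a larger first digit, which matches the description of B and C.

```python
# Hypothetical budget figures in millions, stored as text; not the game's real data.
budget_a = "44"
budget_b = "258"

# Lexicographic comparison looks at characters left to right: "4" sorts after "2".
a_costs_more_as_text = budget_a > budget_b

# Numeric comparison converts first, so 44 is correctly treated as less than 258.
a_costs_more_as_number = float(budget_a) > float(budget_b)

print(a_costs_more_as_text, a_costs_more_as_number)
```

If the game compared prices as text somewhere, it would declare the 44 film the more expensive one, which is the kind of wrong verdict described in this report. Whether that is the actual cause is only a guess.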

Comment author: ChristianKl 06 December 2012 08:44:23PM *  0 points [-]

Instead of having two different 50/50 buttons it might make sense to have only one button that spans both fields.

Believing A is true with 50% and believing B is true with 50% is the same belief when either A or B is true. The program shouldn't encourage the student to mentally treat the two categories as different. It should train the student to feel like he doesn't know the answer when he's 50/50.

Comment author: gjm 07 December 2012 11:40:26PM 0 points [-]

Except that I (and maybe others with similarly warped brains) find that my success rate in things I click "50%" for is distinctly above 50%. For much of my time playing the game I've actually got a larger fraction of "50%"s than of "60%"s right. If there were only one 50% button I'd have had no way of discovering that, which would be a shame because it clearly tells me something useful (specifically, I think, that my heuristics for guessing answers to some kinds of questions whose answers I have no explicit idea about are more effective than I thought).

Comment author: ChristianKl 03 December 2012 11:48:43PM 0 points [-]

There should be auditory feedback. Making a correct guess should encourage the program to play a pleasant sound.

Comment author: Alexei 06 December 2012 04:05:07AM 0 points [-]

Good point! I'll add that to my feature todo list. :)

Comment author: ChristianKl 03 December 2012 10:55:02PM *  0 points [-]

When I use the program I would enjoy it if I learn some content while I'm using the program. You could make a list of 2000 claims. 1000 true claims. 1000 false claims.

Type I error is the incorrect rejection of a true null hypothesis. A) Yes B) No

Type I error is the failure to reject a false null hypothesis. A) Yes B) No

Pack all the knowledge that CFAR wants to teach in rationality minicamps into the program.

What's the name of the bias to see events that have already occurred as being more predictable than they were before they took place? A) Hindsight bias B) Confirmation bias

To improve learning you use the SRS in the background. Take the SuperMemo algorithm. You don't have to show the user in the open how many due cards there are. Whenever the user has completed all due cards, you add new cards. Whenever there are due cards, you add no new cards.

This approach has two big advantages: 1) The student learns two things at the same time. 2) The student is calibrating his knowledge on questions that are more meaningful. That will increase the chance that he can use the calibration in his "real life".
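The card-selection rule described above (review due cards first, add new cards only when nothing is due) could be sketched roughly as follows. The function and argument names are hypothetical, and the interval calculation itself, for example SuperMemo's, is left out; the sketch assumes some other code updates each card's due date after it is answered.

```python
from datetime import date


def next_card(cards, unseen, today):
    """Return the next question under the rule described above.

    cards  : dict mapping question -> due date, for questions already introduced
    unseen : list of questions not yet introduced
    """
    due = [q for q, due_on in cards.items() if due_on <= today]
    if due:
        # Reviews come first; no new cards while anything is due.
        return min(due, key=lambda q: cards[q])
    if unseen:
        # Nothing is due: introduce a new card, due immediately.
        question = unseen.pop(0)
        cards[question] = today
        return question
    return None  # nothing due and nothing left to introduce
```

The user never needs to see the due count; the game just asks whatever `next_card` returns.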

Comment author: Alexei 06 December 2012 04:09:13AM 0 points [-]

One reason we don't want to do 50/50 true/false statements is because reality doesn't work like that.

I love your idea of putting CFAR knowledge into the game!! That's great. I'll get someone to do it.

Adding SRS sounds like a fun feature. I'll think about adding it.

Comment author: ChristianKl 06 December 2012 08:50:06PM *  0 points [-]

One reason we don't want to do 50/50 true/false statements is because reality doesn't work like that.

Regardless of how you slice it, 50% of the time A will be the right answer and 50% B. Reality doesn't work like that. But I agree that when it is possible to ask about a subject in a way where A and B aren't Yes/No, it might be a good idea to avoid using Yes/No.

I love your idea of putting CFAR knowledge into the game!! That's great. I'll get someone to do it.

Great :)

Adding SRS sounds like a fun feature. I'll think about adding it.

In case you don't want too much repetition on the first days of using the software, you could add an additional rule: A maximum of 50% of the cards are recently answered, the other 50% of the cards are either new or older than 21 days.

It would also be possible to make some questions, like those on CFAR knowledge, SRS-based while not doing SRS on other questions.

Comment author: gjm 07 December 2012 11:41:26PM 0 points [-]

"It's a dessert topping and a floor wax!"

I think trying to do two completely different things together would likely result in neither of them being done as well as they can be by two separate tools.

Comment author: ChristianKl 03 December 2012 09:40:06PM 0 points [-]

Having to use the mouse is not optimal.

It would be good to be able to use the keypad. 0-9 gives you one key for every button. <space> could give you a new question and move past the graph.

Comment author: Alexei 06 December 2012 04:09:53AM 0 points [-]

Good call! I'll add that feature in.

Comment author: khafra 29 November 2012 11:04:07AM 0 points [-]

This app is great; I've already shared it on Facebook and in person. One thing 2 of 3 people I shared with wanted was multiple user profiles. Not just for multiple people; but so a single user can track his calibration in the morning vs. the evening; under different emotional states and amounts of stress, etc.

Comment author: Alexei 06 December 2012 04:10:53AM 1 point [-]

Interesting. I haven't heard anyone else request this feature, but I'll add it to my TODO list and might get to it eventually. Thanks!

Comment author: khafra 06 December 2012 11:57:39AM 0 points [-]

Is the codebase on github? I might try to write a patch, although I'd have to learn Objective C in the meantime.

Comment author: Alexei 07 December 2012 12:11:09AM 1 point [-]

No, but it sounds like a good idea to put it up there. The project is written using Unity3D engine, and the code base is in C#. :)

Comment author: JoshuaFox 26 November 2012 11:05:05AM 0 points [-]

What's a good result, both in terms of the number and the graph? What are other people's results? Not that I want to be too competitive, but I have no idea if I am doing very well or very badly.

Comment author: asparisi 26 November 2012 09:16:45PM 0 points [-]

High score seems to be good in terms of "My confident beliefs tend to be right."

Having your bars on the graph line up with the diagonal line would be an "ideal" graph (neither over- nor under- confident)

Comment author: JoshuaFox 27 November 2012 08:34:58AM *  0 points [-]

What is a high score? I realize that there is no absolute scale, but I have no idea if 10 is good or 1000 is bad.

Comment author: ChristianKl 03 December 2012 11:02:30PM *  1 point [-]

Out of 363 guesses my average score is 8. I'm from Germany so my knowledge of some of the US specific stuff isn't good.

Comment author: gjm 07 December 2012 11:46:12PM *  0 points [-]

One anecdata point: after 200 answers I have an average score of 12.6; I have success rates of 54% for "50%" answers, 59% for "60%" answers, 75% for "70%" answers, 88% for "80%" and "90%" answers (average 83%), and 100% for "99%" answers. (I've been very consistently underconfident.) I'm from the UK and for many of the sporting questions I couldn't even tell you what sport they're about. This feels to me like pretty good performance but I have little real basis for that opinion.

[EDITED to add: my counts are 33 @ 50%, 107 @ 60%, 40 @ 70%, 6 @ 80%, 3 @ 90%, 11 @ 99%. So lots of very unconfident answers.]

[EDITED again to add: Perhaps add an optional mode in which some score information gets shared, anonymously or otherwise, so that there can be a leaderboard and a display of many users' calibration graphs and so forth, for those who like to compete or to benchmark themselves.]

[EDITED again to add, though probably no one cares but me: after 400 questions all my statistics are basically the same as above, so apparently I'm (1) consistent and (2) a slow learner.]
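For anyone who wants to tabulate their own numbers the same way, here is a rough sketch that turns a log of answers into per-credence counts and hit rates. The function name and the input format are assumptions, not something the game exports.

```python
from collections import defaultdict


def calibration_table(answers):
    """Map each stated credence to (count, fraction correct).

    answers: iterable of (credence_percent, was_correct) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # credence -> [answered, correct]
    for credence, was_correct in answers:
        totals[credence][0] += 1
        totals[credence][1] += int(was_correct)
    return {c: (n, right / n) for c, (n, right) in sorted(totals.items())}


# Tiny made-up log, only to show the shape of the input and output.
print(calibration_table([(60, True), (60, False), (99, True)]))
```

A hit rate above the stated credence at a given level suggests underconfidence there, and a hit rate below it suggests overconfidence.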

Comment author: JoshuaFox 08 December 2012 07:27:11PM *  0 points [-]

Thanks. FWIW, My average score was 13.

And yes, a sharing mode would be useful. As-is, we have no basis for comparison.

Comment author: asparisi 27 November 2012 12:49:18PM 0 points [-]

Well, you can get up to 99 points for being 99 percent confident and getting the right answer, or minus several hundred (I have yet to fail at a 99 so I don't know how many) for failing at that same interval.

Wrong answers are, for the same confidence interval, more effective at bringing down your score than right answers are at bringing it up, so in some sense as long as you are staying positive you're doing good.

But if you want to compare further, you'd have to take into account how many questions you've answered, as your lifetime total will be different depending on the questions you answer. (990 after 10 questions would be exceptional: best possible score. 990 after 1,000 questions means you are getting a little less than a point per question, overall)
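To put rough numbers on this, the sketch below assumes a logarithmic rule of the form 100 * log2(2p), where p is the credence given to the answer that turned out to be right. This thread does not state the game's actual formula; the assumption is only chosen because it reproduces the 99 points for a correct 99% answer.

```python
import math


def assumed_score(credence, correct):
    """Score under an ASSUMED rule, 100 * log2(2p); not confirmed by this thread."""
    # p is the probability that was placed on the answer that proved right.
    p = credence if correct else 1 - credence
    return 100 * math.log2(2 * p)


for credence in (0.5, 0.6, 0.9, 0.99):
    print(credence, round(assumed_score(credence, True)), round(assumed_score(credence, False)))
```

Under this assumed rule a wrong answer at 99% costs roughly 564 points, a 50% answer scores zero either way, and at every level above 50% a miss costs more than a hit earns.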