Credence calibration game FAQ

Academian

26th Nov 2012

Hey rationality friends, I just made this FAQ for the credence calibration game. So if you have people you'd like to introduce to it --- for example, to get them used to thinking of belief strengths as probabilities --- now is a good time :)

Also, shameless promotion: please tweet/g+/like it; I want the world to be thinking in probabilities ASAP!

*Also*, please email me (critch@math.berkeley.edu) if you're good at making apps quickly and are interested in improving the game or making a variant of it; I'm swamped in job applications right now, but could easily have a Skype or phone conversation about our cache of ideas for improvements / variations (e.g. collecting user data on a server, more question types, a variant awarding gambles rather than deterministic scores, a variant with clickable emotion buttons for the user...).

Cheers!

[-]lukeprog14y160

Confusingly, this game has at least three titles:

The Credence Game (your website)
Calibration Game (Bent Spoon Games' website + the iOS app store)
Confidence Game (inside the app itself: "Welcome to the Confidence Game!")

[-]Academian14y20

Yes, Alexei (aka Bent Spoon Games) and I talked about the name recently; to promote its use in university courses teaching Bayesian statistics, we're sticking with Credence Game. Confidence means something slightly different in statistics, and the game is meant to teach not just calibration, but also the act of measuring belief strength itself. The name update on BSG, and in the app itself as downloaded from there, will happen soon enough.

[-]asparisi14y140

Suggestions (for general audience outside of LW/Rationalist circles)

I like the name "Confidence Game"- reminds people of a con game while informing you as to the point of the game.

See if you can focus on a positive-point scale: make it so that winning nets you a lot of points but "losing" nets only a couple. (Same effect on scores either way.) This won't seem as odd if you set it up as one long scale rather than two shorter ones: so 99-90-80-60-50-60-80-90-99.

Setting it to a timer will make it ADDICTIVE. Set it up in quick rounds. Make it like a quiz show. No question limit, or a bonus if you hit the limit for being "Quick on your feet." Make it hard but not impossible to do.

Set up a leaderboard where you can post to FB, show friends, and possibly compare your score to virtual "opponents" (which are really just scoring metrics) Possibly make those metrics con-man themed, keeping with the game's name.

Graphics will help a lot. Consider running with the con-game theme.

Label people: maybe something like "Underconfident" "Unsure" "Confident" "AMAZING" "Confident" "Overconfident" "Cocksure" (Test labels to see what works well!) rather than using graphs. Graphs and percentages? Turn-off. Drop the % sign and just show two numbers with a label. Make this separate from points but related. (High points=greater chance of falling toward the center, but in theory not necessarily the same.) Yes, I know the point is to get people to think in percentages, but if you want to do that you have to get them there without actually showing them math, which many find off-putting.

Set up a coin system that earns you benefits for putting into the game: extended round, "confidence streak" bonuses, hints, or skips might be good rewards here. Test and see what works. Allow people to pay for coins, but also reward coins for play or another mini-game related to play or both. (Investment=more play)

[-]asparisi14y20

Another thought: once you have a large bank of questions, consider "theme questions" as something people can buy with coins. Yes, that becomes a matter of showing off rather than the main point, but people LIKE to show off.

[-]pleeppleep14y60

I feel that "rationality friends" should be a standard way of addressing Lesswrongers.

[-]Nisan14y100

Yes. Also, "Hear me, rat-people."

[-]Viliam_Bur14y10

Ratfolk.

[-][anonymous]14y00

You win.

[This comment is no longer endorsed by its author]

[-]David_Gerard14y70

Super Rationality Adventure Pals the Saturday morning cartoon! On 1080p from a BitTorrent near you.

[-]Luke_A_Somers14y70

We know we're winning when they begin making rationality friends music videos.

[-]David_Gerard14y90

The 5yo has been watching Pokemon anime by the ton. I'm picturing something with similar production values featuring the CFAR trials. A wise old fellow who seems to know just everything. People catch GREAT BIG Cognitive Biases out in the wild, then put these Cognitive Biases into training to evolve them smaller and smaller. RationalWiki guest-star as Team Rocket. It'll be huge.

[-]Alicorn14y10

This would be cute, except you'd have to catch the biases in a dreamscape/mindscape of some kind, not in the "wild". (Why would you want to retrieve them, even for shrinking, unless they were already inside your head?)

[-]gwern14y110

'Arguments are soldiers' and 'reason is for persuasion'; hence, you obviously use biasemons on your opponents to make them make bad decisions and believe wrong things.

(It's not any more malicious than enslaving wild animals and making them fight until they are knocked unconscious by the trauma...)

[-]David_Gerard14y20

Because biases are so much easier to identify in other people, of course.

[-]Kindly14y20

Greetings, my rational comrades!

[-]pleeppleep14y20

No.

[-]DanielVarga14y50

Nice game, BUT. The Mac version caused me some pain. I chose the default fullscreen settings, and I couldn't quit the application for minutes. No quit button that I could find. No drop-down menu bar. Cmd-tab stopped working. Mission Control (F3) stopped working. WTF? Finally I figured out that alt-cmd-esc still works, and then I could quit the game.

[-]wanderingsoul14y50

Nice game, good to see someone making it easy to just practice being well calibrated.

My calibration started off wonky (e.g. I was wrong each of the first six times I claimed 70% certainty) but quickly improved. Unfortunately, it improved suspiciously well: I suspect my primary goal in assigning probabilities was not scoring points but getting the bar graph displayed every 5 or 10 questions to even out. It's a well designed game, but at least for me the score wasn't the main motivator, which is a problem because the score is the quantity that increases when you are helpfully well-calibrated. Anyone else have a similar experience?

[-]gjm14y10

My experience is distinctly similar.

I observed another curiosity. For much of my time playing the game I've got a larger fraction of 50%s right than of 60%s. I think what's going on is that the 50% cases are ones where I definitely have no idea of the answer and have to fall back on heuristics (have I heard of this person? does the name sound old or recent? etc.) -- and the heuristics work better than I can bring myself to admit they do :-).

[-]gjm14y40

A couple of times I've accidentally clicked twice in rapid succession on the "Continue" button, and the second click has been interpreted as an answer to the next question. (I think it happened to be a correct answer both times.)

I suggest disabling the buttons -- or at least the "B" ones -- for about 100ms after displaying a new question.
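A minimal sketch of that lockout, in Python for illustration only (the game itself is not written in Python, and the `AnswerButtons` class and its method names are hypothetical). Timestamps are passed in explicitly so the behaviour is easy to check:

```python
class AnswerButtons:
    """Ignore answer clicks that arrive too soon after a new question appears."""

    def __init__(self, lockout_seconds=0.1):
        # 0.1 s matches the "about 100ms" suggested in the comment above.
        self.lockout_seconds = lockout_seconds
        self.shown_at = None  # time at which the current question was displayed

    def show_question(self, now):
        self.shown_at = now

    def accept_click(self, now):
        """Return True if a click at time `now` should count as an answer."""
        if self.shown_at is None:
            return False
        return (now - self.shown_at) >= self.lockout_seconds


buttons = AnswerButtons()
buttons.show_question(now=10.00)
stray = buttons.accept_click(now=10.03)       # second half of a double click
deliberate = buttons.accept_click(now=10.80)  # a real answer
```

With these made-up timestamps the stray click is rejected and the later one is accepted.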

[-]Kindly14y40

This claim seems incorrect:

[mean squared error] wouldn't penalize a credence of 99.999% on wrong answers enough to strongly dissuade unjustified absolute certainties.

Since the scoring rule is proper, the penalty for incorrectly being 99.999% confident is exactly high enough that you wouldn't want to say you are if you're merely 99.998% confident.

Unless you mean something different?
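The properness claim can be checked with exact arithmetic. The sketch below assumes the usual squared-error penalty, (1 - p)^2 when right and p^2 when wrong; the function name is hypothetical. It compares an honest report of 99.998% with an overclaim of 99.999%:

```python
from fractions import Fraction


def expected_squared_error(report, belief):
    """Expected penalty when the answer is right with probability `belief`
    and the player reports probability `report`."""
    return belief * (1 - report) ** 2 + (1 - belief) * report ** 2


belief = Fraction(99998, 100000)     # what the player actually believes
overclaim = Fraction(99999, 100000)  # what the player is tempted to report

honest_penalty = expected_squared_error(belief, belief)
overclaim_penalty = expected_squared_error(overclaim, belief)
gap = overclaim_penalty - honest_penalty
```

Here `gap` works out to exactly `(overclaim - belief) ** 2`: positive whenever the report differs from the belief, but very small when the two are close.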

[-]ChristianKl14y30

I don't care about the US Postmaster General or about who won which sports championship. I don't care about boxing champions. I don't care about the various US sport teams.

I don't even want to learn that knowledge.

Nobel Prizes are more interesting. Who got the Nobel prize in Physiology or Medicine 2012? A) John B. Gurdon and Shinya Yamanaka B) Bruce A. Beutler, Jules A. Hoffmann and Ralph M. Steinman

Who got the Nobel prize in Physiology or Medicine for the discovery that mature cells can be reprogrammed to become pluripotent? A) John B. Gurdon and Shinya Yamanaka B) Bruce A. Beutler, Jules A. Hoffmann and Ralph M. Steinman

EDIT: (This comment should be top level, sorry...)

[-]Alexei14y20

Yeah, I feel you. I'm currently adding a feature where you specify what questions you want to be asked.

[-]Illia7y10

Hey Alexei, I know it's been a while since this game was developed, but I've discovered it recently and I want to make a better one. More question types, community-contributed and voted questions, categories, profiles; cross-platform web app, themes, better UI etc.

It's of course going to be open-source.

Would you like to help with math model for questions and evaluating, or maybe with development?

galatyuk.ilya AT gmail

[-]evand14y30

The difference between 60% credence and 80% credence seems much smaller to me than the difference between 90% and 99%. Is there a reason there's no option between 90% and 99%? In your testing, have you found any well-calibrated users who answer 99% a non-trivial fraction of the time?

[-]asparisi14y50

Well, in 11 out of 145 answers (7.5%) I so far have answered 99%, and I have yet to be wrong in any of my answers.

If I continue at this rate, in approximately 1,174 more answers, I'll be able to tell you if I am well calibrated (fewer, if I fail at more than one answer in the intervening time).
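The 1,174 figure can be reconstructed as follows. This is a guess at the reasoning, assuming the target is about 100 answers at the 99% level (enough to expect one miss if calibrated); the variable names are illustrative:

```python
import math

answered_total = 145
answered_at_99 = 11
rate_at_99 = answered_at_99 / answered_total  # fraction of answers given at 99%

# Assumption: roughly 100 answers at 99% are needed before one miss is expected.
target_at_99 = 100

total_needed = target_at_99 / rate_at_99      # total answers to reach the target
more_answers = math.ceil(total_needed - answered_total)
```

Under that assumption `more_answers` comes out to 1174, matching the number quoted.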

[-]asparisi14y00

To clarify, wrong in any of my answers at the 99% level. I have been wrong at other levels (including, surprisingly, hovering within around 1% of 90% at the 90% level).

[-]Alexei14y00

Yes, but it also depends on your knowledge base. Unfortunately right now we don't have a lot of questions, but I hope people will add more. Then, we hope, you'll have more questions for which you can answer 99%.

[-]ChristianKl14y00

I would say that I know with >99% certainty that Woodrow Wilson was president earlier than George H. W. Bush.

The same is true for the fact that the element silver is abbreviated Ag.

[-]Jabberslythe14y10

I have been looking for a good calibration test for a while! Thank you.

Here's a little critique: The questions become repetitive after not too long. I have also noticed that a few of the questions are wrong, as in it says that I am wrong and docks points but I am right according to the data it cites (I think some of these had to do with The Dark Knight).

[-]Jabberslythe14y20

Also, many of the questions are US-centric, and many are about sports or popular music or other things that I would not expect the average person interested in calibration to be knowledgeable about or interested in. If you are trying to make it take off with a wider audience, however, these could be good decisions.

[-]Alexei14y10

Yes, I agree. We are working on adding more questions to the game.

I fixed the bug with the film question.

[-]ChristianKl14y00

When it comes to the boiling point of platinum I would however doubt that the average person has any interest.

[-]linkhyrule513y00

One thing I might suggest is leaving the question on screen after the question has been answered, and loading the answer on the same screen. I keep indicating an answer I didn't mean to select, and I'm not sure if it's a bug or my twitchy fingers.

[-]ChristianKl14y00

If you ask Snopes, maybe you can use their database of claims. You could start by throwing out all claims that Snopes doesn't classify as true or false.

Example claim from http://www.snopes.com/business/money/money.asp: U.S. law specifies that a creditor does not have to accept more than 100 pennies towards the payment of a debt or obligation.

A) True - B) False

Once the user has answered, the software presents him with "False" and gives him a link to the Snopes website where he can read more information.

Snopes would profit from having more people exposed to their database and from those people being directed by link to the Snopes website.

I don't know whether the Snopes folks would be open to having their database used in that way, but asking won't hurt.

[-]gjm14y00

Bug report: I just got asked which of Cleopatra and Spider-Man 3 was more expensive, answered, and was told:

Incorrect. The right answer is A ($Bmillion) vs B ($Cmillion)

where C > B. I am having trouble understanding how this is even possible given how I'd have thought the game works. Some weird screwup related to inflation correction? Comparing the prices as strings rather than numbers? (B was shorter than C but with a larger first digit.)

[EDITED to add: hey, it just did it to me again. This time it was Cleopatra versus Troy. Same failure mode otherwise.]
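The string-comparison hypothesis is easy to reproduce. The sketch below is only an illustration of that failure mode, not the game's actual code; the function names and the two cost figures are made up, chosen so that the first is shorter but starts with a larger digit:

```python
def more_expensive_buggy(cost_a, cost_b):
    """Compare costs as strings: lexicographic order, so '44' beats '258'."""
    return "A" if cost_a > cost_b else "B"


def more_expensive_fixed(cost_a, cost_b):
    """Convert to numbers before comparing."""
    return "A" if float(cost_a) > float(cost_b) else "B"


# Hypothetical figures in millions, not the real budgets of any film.
cost_a, cost_b = "44", "258"

buggy = more_expensive_buggy(cost_a, cost_b)
fixed = more_expensive_fixed(cost_a, cost_b)
```

The string version picks A because the character "4" sorts after "2"; the numeric version picks B.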

[-]ChristianKl14y00

Instead of having two different 50/50 buttons it might make sense to have only one button that spans both fields.

Believing A is true with 50% and believing B is true with 50% is the same belief when either A or B is true. The program shouldn't encourage the student to mentally treat the two categories as different. It should train the student to feel like he doesn't know the answer when he's 50/50.

[-]gjm14y00

Except that I (and maybe others with similarly warped brains) find that my success rate in things I click "50%" for is distinctly above 50%. For much of my time playing the game I've actually got a larger fraction of "50%"s than of "60%"s right. If there were only one 50% button I'd have had no way of discovering that, which would be a shame because it clearly tells me something useful (specifically, I think, that my heuristics for guessing answers to some kinds of questions whose answers I have no explicit idea about are more effective than I thought).

[-]ChristianKl14y00

There should be auditory feedback. Making a correct guess should encourage the program to play a pleasant sound.

[-]Alexei14y00

Good point! I'll add that to my feature todo list. :)

[-]ChristianKl14y00

When I use the program I would enjoy it if I learn some content while I'm using the program. You could make a list of 2000 claims. 1000 true claims. 1000 false claims.

Type I error is the incorrect rejection of a true null hypothesis. A) Yes B) No

Type I error is the failure to reject a false null hypothesis. A) Yes B) No

Pack all the knowledge that CFAR wants to teach in rationality minicamps into the program.

What's the name of the bias to see events that have already occurred as being more predictable than they were before they took place? A) Hindsight bias B) Confirmation bias

To improve learning you use the SRS in the background. Take the SuperMemo algorithm. You don't have to show the user in the open how many due cards there are. Whenever the user has completed all due cards, you add new cards. Whenever there are due cards, you add no new cards.

This approach has two big advantages: 1) The student learns two things at the same time. 2) The student is calibrating his knowledge on questions that are more meaningful. That will increase the chance that he can use the calibration in his "real life".
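The queueing rule described above (serve due cards first, introduce a new card only when nothing is due) might be sketched like this. The `Card` and `ReviewQueue` names are hypothetical, and the interval update is a simple doubling placeholder, not the SuperMemo formula:

```python
from dataclasses import dataclass


@dataclass
class Card:
    claim: str
    due_day: int = 0
    interval_days: int = 1


class ReviewQueue:
    def __init__(self, new_cards):
        self.unseen = list(new_cards)  # cards never shown yet
        self.active = []               # cards already introduced

    def next_card(self, today):
        """Return the most overdue card, else a new card, else None."""
        due = [c for c in self.active if c.due_day <= today]
        if due:
            return min(due, key=lambda c: c.due_day)
        if self.unseen:
            card = self.unseen.pop(0)
            self.active.append(card)
            return card
        return None

    def record(self, card, correct, today):
        # Placeholder schedule (assumption): double on success, reset on failure.
        card.interval_days = card.interval_days * 2 if correct else 1
        card.due_day = today + card.interval_days


queue = ReviewQueue([
    Card("Type I error is the incorrect rejection of a true null hypothesis."),
    Card("Hindsight bias makes past events seem more predictable than they were."),
])
first = queue.next_card(today=0)              # nothing due, so a new card
queue.record(first, correct=True, today=0)    # due again on day 2
second = queue.next_card(today=0)             # still nothing due, so another new card
queue.record(second, correct=False, today=0)  # due again on day 1
third = queue.next_card(today=1)              # the missed card comes back first
```

The user never sees a count of due cards; the queue simply decides what to show next.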

[-]gjm14y00

"It's a dessert topping and a floor wax!"

I think trying to do two completely different things together would likely result in neither of them being done as well as they can be by two separate tools.

[-]Alexei14y00

One reason we don't want to do 50/50 true/false statements is because reality doesn't work like that.

I love your idea of putting CFAR knowledge into the game!! That's great. I'll get someone to do it.

Adding SRS sounds like a fun feature. I'll think about adding it.

[-]ChristianKl14y00

One reason we don't want to do 50/50 true/false statements is because reality doesn't work like that.

Regardless of how you slice it, 50% of the time A will be the right answer and 50% B. Reality doesn't work like that. But I agree that when it is possible to ask about a subject in a way where A and B aren't Yes/No, it might be a good idea to avoid using Yes/No.

I love your idea of putting CFAR knowledge into the game!! That's great. I'll get someone to do it.

Great :)

Adding SRS sounds like a fun feature. I'll think about adding it.

In case you don't want too much repetition on the first days of using the software, you could add an additional rule: A maximum of 50% of the cards are recently answered, the other 50% of the cards are either new or older than 21 days.

It would also be possible to make some questions, like those about CFAR knowledge, SRS-based while not doing SRS on other questions.

[-]ChristianKl14y00

Having to use the mouse is not optimal.

It would be good to be able to use the keypad. 0-9 gives you one key for every button.

could give you a new question and move past the graph.

[-]Alexei14y00

Good call! I'll add that feature in.

[-]khafra14y00

This app is great; I've already shared it on Facebook and in person. One thing 2 of 3 people I shared with wanted was multiple user profiles. Not just for multiple people; but so a single user can track his calibration in the morning vs. the evening; under different emotional states and amounts of stress, etc.

[-]Alexei14y20

Interesting. I haven't heard anyone else request this feature, but I'll add it to my TODO list and might get to it eventually. Thanks!

[-]khafra14y00

Is the codebase on github? I might try to write a patch, although I'd have to learn Objective C in the meantime.

[-]Alexei14y20

No, but it sounds like a good idea to put it up there. The project is written using Unity3D engine, and the code base is in C#. :)

[-]JoshuaFox14y00

What's a good result, both in terms of the number and the graph? What are other people's results? Not that I want to be too competitive, but I have no idea if I am doing very well or very badly.

[-]asparisi14y00

High score seems to be good in terms of "My confident beliefs tend to be right."

Having your bars on the graph line up with the diagonal line would be an "ideal" graph (neither over- nor under- confident)
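The bars on such a graph are just the observed hit rate at each stated credence. A small sketch of that computation, with a hypothetical function name and made-up answers:

```python
from collections import defaultdict


def calibration_table(answers):
    """Map each stated credence (in percent) to the observed fraction correct."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for credence, correct in answers:
        totals[credence] += 1
        hits[credence] += int(correct)
    return {c: hits[c] / totals[c] for c in sorted(totals)}


# Made-up data: (stated credence in percent, whether the answer was right).
answers = [
    (60, True), (60, False), (60, True), (60, True), (60, False),
    (90, True), (90, True), (90, True), (90, True), (90, False),
]
table = calibration_table(answers)
```

In this toy data the 60% answers were right 3 times out of 5 (on the diagonal), while the 90% answers were right only 4 times out of 5 (overconfident).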

[-]JoshuaFox14y00

What is a high score? I realize that there is no absolute scale, but I have no idea if 10 is good or 1000 is bad.

[-]ChristianKl14y10

Out of 363 guesses my average score is 8. I'm from Germany so my knowledge of some of the US specific stuff isn't good.

[-]gjm14y00

One anecdata point: after 200 answers I have an average score of 12.6; I have success rates of 54% for "50%" answers, 59% for "60%" answers, 75% for "70%" answers, 88% for "80%" and "90%" answers (average 83%), and 100% for "99%" answers. (I've been very consistently underconfident.) I'm from the UK and for many of the sporting questions I couldn't even tell you what sport they're about. This feels to me like pretty good performance but I have little real basis for that opinion.

[EDITED to add: my counts are 33 @ 50%, 107 @ 60%, 40 @ 70%, 6 @ 80%, 3 @ 90%, 11 @ 99%. So lots of very unconfident answers.]

[EDITED again to add: Perhaps add an optional mode in which some score information gets shared, anonymously or otherwise, so that there can be a leaderboard and a display of many users' calibration graphs and so forth, for those who like to compete or to benchmark themselves.]

[EDITED again to add, though probably no one cares but me: after 400 questions all my statistics are basically the same as above, so apparently I'm (1) consistent and (2) a slow learner.]

[-]JoshuaFox14y00

Thanks. FWIW, My average score was 13.

And yes, a sharing mode would be useful. As-is, we have no basis for comparison.

[-]asparisi14y00

Well, you can get up to 99 points for being 99 percent confident and getting the right answer, or minus several hundred (I have yet to fail at a 99 so I don't know how many) for failing at that same interval.

Wrong answers are, for the same confidence interval, more effective at bringing down your score than right answers are at bringing it up, so in some sense as long as you are staying positive you're doing good.

But if you want to compare further, you'd have to take into account how many questions you've answered, as your lifetime total will be different depending on the questions you answer. (990 after 10 questions would be exceptional: best possible score. 990 after 1,000 questions means you are getting a little less than a point per question, overall)
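The thread does not state the scoring formula. One rule that reproduces the 99-point figure above is a logarithmic score scaled so that a 50% answer earns zero; the sketch below assumes that rule, and the function name is illustrative:

```python
import math


def log_score(prob_on_truth):
    """Assumed rule: 100 * (1 + log2(p)), where p is the probability
    the player gave to the answer that turned out to be correct."""
    return 100 * (1 + math.log2(prob_on_truth))


right_at_99 = round(log_score(0.99))  # credence 99% on the right answer
wrong_at_99 = round(log_score(0.01))  # credence 99% on the wrong answer
coin_flip = log_score(0.5)            # a 50% answer, right or wrong
```

Under this assumed rule a correct 99% answer earns 99 points, a wrong one costs 564, and a 50% answer scores zero either way, which fits the description of "minus several hundred".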
