Hey rationality friends, I just made this FAQ for the credence calibration game. So if you have people you'd like to introduce to it --- for example, to get them used to thinking of belief strengths as probabilities --- now is a good time :)
Confusingly, this game has at least three titles:
Yes, Alexei (aka Bent Spoon Games) and I talked about the name recently; to promote its use in university courses teaching Bayesian statistics, we're sticking with Credence Game. Confidence means something slightly different in statistics, and the game is meant to teach not just calibration, but also the act of measuring belief strength itself. The name update on BSG, and in the app itself as downloaded from there, will happen soon enough.
Suggestions (for a general audience outside of LW/rationalist circles)
I like the name "Confidence Game"- reminds people of a con game while informing you as to the point of the game.
Try to focus on a positive-point scale if you can: make it so that winning nets you a lot of points but "losing" costs only a couple (same effect on scores either way). This won't seem as odd if you set it up as one long scale rather than two shorter ones: 99-90-80-60-50-60-80-90-99.
Setting it to a timer will make it ADDICTIVE. Set it up in quick rounds, like a quiz show. No question limit, or a bonus for being "quick on your feet" if you hit the limit. Make it hard but not impossible to do.
Set up a leaderboard where you can post to FB, show friends, and possibly compare your score to virtual "opponents" (which are really just scoring metrics). Possibly make those metrics con-man themed, in keeping with the game's name.
Graphics will help a lot. Consider running with the con-game theme.
Label people: maybe something like "Underconfident", "Unsure", "Confident", "AMAZING", "Confident", "Overconfident", "Cocksure" (test labels to see what works well!) rather than using graphs. Graphs and percentages? Turn-off. Drop the % sign and just show two numbers with a label. Make this separate from points but related. (High points = greater chance of falling toward the center, but in theory not necessarily the same.) Yes, I know the point is to get people to think in percentages, but if you want to do that you have to get them there without actually showing them math, which many find off-putting.
Set up a coin system where coins you put into the game earn you benefits: extended rounds, "confidence streak" bonuses, hints, or skips might be good rewards here. Test and see what works. Allow people to pay for coins, but also reward coins for play, or for another mini-game related to play, or both. (Investment = more play)
Another thought: once you have a large bank of questions, consider "theme questions" as something people can buy with coins. Yes, that becomes a matter of showing off rather than the main point, but people LIKE to show off.
Super Rationality Adventure Pals, the Saturday morning cartoon! In 1080p from a BitTorrent near you.
The 5yo has been watching Pokemon anime by the ton. I'm picturing something with similar production values featuring the CFAR trials. A wise old fellow who seems to know just everything. People catch GREAT BIG Cognitive Biases out in the wild, then put these Cognitive Biases into training to evolve them smaller and smaller. RationalWiki guest-stars as Team Rocket. It'll be huge.
This would be cute, except you'd have to catch the biases in a dreamscape/mindscape of some kind, not in the "wild". (Why would you want to retrieve them, even for shrinking, unless they were already inside your head?)
'Arguments are soldiers' and 'reason is for persuasion'; hence, you obviously use biasemons on your opponents to make them make bad decisions and believe wrong things.
(It's not any more malicious than enslaving wild animals and making them fight until they are knocked unconscious by the trauma...)
Nice game, BUT. The Mac version caused me some pain. I chose the default fullscreen settings, and I couldn't quit the application for minutes. No quit button that I could find. No drop-down menu bar. Cmd-tab stopped working. Mission Control (F3) stopped working. WTF? Finally I figured out that alt-cmd-esc still works, and then I could quit the game.
Nice game, good to see someone making it easy to just practice being well calibrated.
My calibration started off wonky (e.g. I was wrong each of the first six times I claimed 70% certainty) but quickly improved. Unfortunately, it improved suspiciously well: I suspect I was assigning probabilities with my primary goal being not to score points, but to make that bar graph displayed every 5 or 10 questions even out. It's a well-designed game, but unfortunately, at least for me, the score wasn't the main motivator, which is a problem because the score is the quantity that increases by being helpfully well-calibrated. Anyone else have a similar experience?
My experience is distinctly similar.
I observed another curiosity. For much of my time playing the game I've got a larger fraction of 50%s right than of 60%s. I think what's going on is that the 50% cases are ones where I definitely have no idea of the answer and have to fall back on heuristics (have I heard of this person? does the name sound old or recent? etc.) -- and the heuristics work better than I can bring myself to admit they do :-).
A couple of times I've accidentally clicked twice in rapid succession on the "Continue" button, and the second click has been interpreted as an answer to the next question. (I think it happened to be a correct answer both times.)
I suggest disabling the buttons -- or at least the "B" ones -- for about 100ms after displaying a new question.
This claim seems incorrect:
[mean squared error] wouldn't penalize a credence of 99.999% on wrong answers enough to strongly dissuade unjustified absolute certainties.
Since the scoring rule is proper, the penalty for incorrectly being 99.999% confident is exactly high enough that you wouldn't want to say you are if you're merely 99.998% confident.
Unless you mean something different?
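To make the distinction concrete, here's a minimal sketch (an illustration, not the game's code; the 0.7 test probability, the grid search, and the function names are all arbitrary choices) showing that both squared error (Brier) and log loss are proper scoring rules, while differing hugely in how hard they punish a confidently wrong answer:

```python
import math

def brier_penalty(q, correct):
    """Squared-error (Brier) loss for reporting credence q in an answer
    that turned out to be correct or incorrect."""
    outcome = 1.0 if correct else 0.0
    return (q - outcome) ** 2

def log_penalty(q, correct):
    """Logarithmic loss for the same report."""
    return -math.log(q if correct else 1.0 - q)

def expected_loss(loss, p_true, q_report):
    """Expected loss when the answer is actually right with probability p_true
    but you report credence q_report."""
    return p_true * loss(q_report, True) + (1 - p_true) * loss(q_report, False)

# Both rules are proper: expected loss is minimized by reporting the true probability.
for loss in (brier_penalty, log_penalty):
    best_q = min((q / 1000 for q in range(1, 1000)),
                 key=lambda q: expected_loss(loss, 0.7, q))
    print(loss.__name__, "is minimized near", best_q)   # ~0.7 for both

# But the size of the penalty for a confidently wrong answer differs a lot:
print(brier_penalty(0.99999, False))   # ~1.0  (bounded)
print(log_penalty(0.99999, False))     # ~11.5 (grows without bound as q -> 1)
```

On this reading, the FAQ's point is about the size of the penalty (squared error is bounded), while the reply's point is about properness; the two aren't actually in conflict.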
I don't care about the US Postmaster General or about who won which sports championship. I don't care about boxing champions. I don't care about the various US sports teams.
I don't even want to learn that knowledge.
Nobel Prizes are more interesting. Who got the Nobel prize in Physiology or Medicine 2012? A) John B. Gurdon and Shinya Yamanaka B) Bruce A. Beutler, Jules A. Hoffmann and Ralph M. Steinman
Who got the Nobel prize in Physiology or Medicine for the discovery that mature cells can be reprogrammed to become pluripotent? A) John B. Gurdon and Shinya Yamanaka B) Bruce A. Beutler, Jules A. Hoffmann and Ralph M. Steinman
EDIT: (This comment should be top level, sorry...)
Yeah, I feel you. I'm currently adding a feature where you specify what questions you want to be asked.
Hey Alexei, I know it's been a while since this game was developed, but I've discovered it recently and I want to make a better one. More question types, community-contributed and voted questions, categories, profiles; cross-platform web app, themes, better UI etc.
It's of course going to be open-source.
Would you like to help with the math model for questions and evaluation, or maybe with development?
galatyuk.ilya AT gmail
The difference between 60% credence and 80% credence seems much smaller to me than the difference between 90% and 99%. Is there a reason there's no option between 90% and 99%? In your testing, have you found any well-calibrated users who answer 99% a non-trivial fraction of the time?
Well, so far I have answered 99% in 11 out of 145 answers (7.5%), and I have yet to be wrong in any of my answers.
If I continue at this rate, in approximately 1,174 more answers I'll be able to tell you whether I am well calibrated (fewer, if I fail at more than one answer in the intervening time).
To clarify: wrong in any of my answers at the 99% level. I have been wrong at other levels (including, surprisingly, hovering within around 1% of 90% at the 90% level).
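For what it's worth, here's one way to reconstruct that 1,174 figure (an assumption about the reasoning, not something stated above): at 99% credence you expect roughly one miss per hundred answers, so you want on the order of 100 answers at that level before the hit rate says much, and at the current 11-in-145 usage rate that takes roughly a thousand more answers overall:

```python
answers_so_far = 145
at_99_so_far = 11
target_at_99 = 100                      # ~1 expected miss per 100 answers at the 99% level
rate = at_99_so_far / answers_so_far    # how often the 99% button gets used
more_needed = (target_at_99 - at_99_so_far) / rate
print(round(more_needed))               # ~1173, in the ballpark of the 1,174 quoted above
```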
Yes, but it also depends on your knowledge base. Unfortunately right now we don't have a lot of questions, but I hope people will add more. Then, we hope, you'll have more questions for which you can answer 99%.
I would say that I know with >99% certainty that Woodrow Wilson was president earlier than George H. W. Bush.
The same is true for the fact that the element silver is abbreviated Ag.
I have been looking for a good calibration test for a while! Thank you.
Here's a little critique: the questions become repetitive after playing for not very long. I have also noticed that a few of the questions are wrong, as in the game says I am wrong and docks points even though I am right according to the data it cites (I think some of these had to do with The Dark Knight).
Also, many of the questions are US-centric, and many are about sports or popular music or other things that I would not expect the average person interested in calibration to be knowledgeable about or interested in. If you are trying to make the game take off, however, these might be justified decisions.
One thing I might suggest is leaving the question on screen after the question has been answered, and loading the answer on the same screen. I keep indicating an answer I didn't mean to select, and I'm not sure if it's a bug or my twitchy fingers.
If you ask Snopes, maybe you can use their database of claims. You could start by throwing out all claims that Snopes doesn't classify as true or false.
Example claim from http://www.snopes.com/business/money/money.asp: U.S. law specifies that a creditor does not have to accept more than 100 pennies towards the payment of a debt or obligation.
A) True - B) False
When the user has answered, the software presents him with "False" and gives him a link to the Snopes website where he can read more information.
Snopes would profit from having more people exposed to their database, with those people being directed to the Snopes website via a link.
I don't know whether the Snopes folks would be open to having their database used in that way, but asking won't hurt.
Bug report: I just got asked which of Cleopatra and Spider-Man 3 was more expensive, answered, and was told:
Incorrect. The right answer is A ($B million) vs B ($C million)
where C > B. I am having trouble understanding how this is even possible given how I'd have thought the game works. Some weird screwup related to inflation correction? Comparing the prices as strings rather than numbers? (B was shorter than C but with a larger first digit.)
[EDITED to add: hey, it just did it to me again. This time it was Cleopatra versus Troy. Same failure mode otherwise.]
Instead of having two different 50/50 buttons it might make sense to have only one button that spans both fields.
Believing A is true with 50% and believing B is true with 50% is the same belief when either A or B is true. The program shouldn't encourage the student to mentally treat the two categories as different. It should train the student to feel like he doesn't know the answer when he's 50/50.
Except that I (and maybe others with similarly warped brains) find that my success rate in things I click "50%" for is distinctly above 50%. For much of my time playing the game I've actually got a larger fraction of "50%"s than of "60%"s right. If there were only one 50% button I'd have had no way of discovering that, which would be a shame because it clearly tells me something useful (specifically, I think, that my heuristics for guessing answers to some kinds of questions whose answers I have no explicit idea about are more effective than I thought).
When I use the program I would enjoy it if I also learned some content while using it. You could make a list of 2000 claims: 1000 true claims and 1000 false claims.
Type I error is the incorrect rejection of a true null hypothesis. A) Yes B) No
Type I error is the failure to reject a false null hypothesis. A) Yes B) No
Pack all the knowledge that CFAR wants to teach in rationality minicamps into the program.
What's the name of the bias of seeing events that have already occurred as more predictable than they were before they took place? A) Hindsight bias B) Confirmation bias
To improve learning, use an SRS in the background, e.g. the SuperMemo algorithm. You don't have to openly show the user how many due cards there are. Whenever the user has completed all due cards, you add new cards; whenever there are due cards, you add no new cards.
This approach has two big advantages: 1) The student learns two things at the same time. 2) The student is calibrating his knowledge on questions that are more meaningful, which increases the chance that he can use the calibration in his "real life".
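Here's a minimal sketch of that scheduling rule (the class and its doubling-interval update are illustrative simplifications, not the real SuperMemo algorithm): serve due cards first, and only introduce new cards when nothing is due.

```python
import datetime

class BackgroundSRS:
    """Sketch of the 'SRS in the background' rule: due cards first,
    new cards only when nothing is due. The interval update is a
    simplified doubling, not full SuperMemo."""

    def __init__(self, new_cards):
        self.new_cards = list(new_cards)   # questions never shown yet
        self.scheduled = []                # (due_date, interval_days, card)

    def next_question(self, today=None):
        today = today or datetime.date.today()
        due = [entry for entry in self.scheduled if entry[0] <= today]
        if due:                            # due cards exist, so add no new cards
            return min(due, key=lambda entry: entry[0])[2]
        if self.new_cards:                 # all due cards done, so introduce a new card
            return self.new_cards.pop(0)
        return None

    def record_answer(self, card, correct, today=None):
        today = today or datetime.date.today()
        old = next((e for e in self.scheduled if e[2] == card), None)
        if old:
            self.scheduled.remove(old)
        # correct answers push the card further out; wrong answers reset it
        interval = old[1] * 2 if (old and correct) else 1
        self.scheduled.append(
            (today + datetime.timedelta(days=interval), interval, card))
```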
"It's a dessert topping and a floor wax!"
I think trying to do two completely different things together would likely result in neither of them being done as well as they can be by two separate tools.
One reason we don't want to do 50/50 true/false statements is because reality doesn't work like that.
I love your idea of putting CFAR knowledge into the game!! That's great. I'll get someone to do it.
Adding SRS sounds like a fun feature. I'll think about adding it.
One reason we don't want to do 50/50 true/false statements is because reality doesn't work like that.
Regardless of how you slice it, 50% of the time A will be the right answer and 50% of the time B; reality doesn't work like that. But I agree that when it's possible to ask about a subject in a way where A and B aren't Yes/No, it might be a good idea to avoid using Yes/No.
I love your idea of putting CFAR knowledge into the game!! That's great. I'll get someone to do it.
Great :)
Adding SRS sounds like a fun feature. I'll think about adding it.
In case you don't want too much repetition in the first days of using the software, you could add an additional rule: a maximum of 50% of the cards are recently answered ones; the other 50% are either new or older than 21 days.
It would also be possible to make some questions, like those about CFAR knowledge, SRS-based while not doing SRS on other questions.
This app is great; I've already shared it on Facebook and in person. One thing two of the three people I shared it with wanted was multiple user profiles: not just for multiple people, but so a single user can track his calibration in the morning vs. the evening, under different emotional states and amounts of stress, etc.
Interesting. I haven't heard anyone else request this feature, but I'll add it to my TODO list and might get to it eventually. Thanks!
What's a good result, both in terms of the number and the graph? What are other people's results? Not that I want to be too competitive, but I have no idea if I am doing very well or very badly.
High score seems to be good in terms of "My confident beliefs tend to be right."
Having your bars on the graph line up with the diagonal line would be an "ideal" graph (neither over- nor under-confident).
What is a high score? I realize that there is no absolute scale, but I have no idea if 10 is good or 1000 is bad.
Out of 363 guesses my average score is 8. I'm from Germany so my knowledge of some of the US specific stuff isn't good.
One anecdata point: after 200 answers I have an average score of 12.6; I have success rates of 54% for "50%" answers, 59% for "60%" answers, 75% for "70%" answers, 88% for "80%" and "90%" answers (average 83%), and 100% for "99%" answers. (I've been very consistently underconfident.) I'm from the UK and for many of the sporting questions I couldn't even tell you what sport they're about. This feels to me like pretty good performance but I have little real basis for that opinion.
[EDITED to add: my counts are 33 @ 50%, 107 @ 60%, 40 @ 70%, 6 @ 80%, 3 @ 90%, 11 @ 99%. So lots of very unconfident answers.]
[EDITED again to add: Perhaps add an optional mode in which some score information gets shared, anonymously or otherwise, so that there can be a leaderboard and a display of many users' calibration graphs and so forth, for those who like to compete or to benchmark themselves.]
[EDITED again to add, though probably no one cares but me: after 400 questions all my statistics are basically the same as above, so apparently I'm (1) consistent and (2) a slow learner.]
Thanks. FWIW, My average score was 13.
And yes, a sharing mode would be useful. As-is, we have no basis for comparison.
Well, you can get up to 99 points for being 99 percent confident and getting the right answer, or minus several hundred (I have yet to fail at a 99 so I don't know how many) for failing at that same interval.
Wrong answers are, for the same confidence level, more effective at bringing down your score than right answers are at bringing it up, so in some sense, as long as you are staying positive you're doing well.
But if you want to compare further, you'd have to take into account how many questions you've answered, as your lifetime total will be different depending on the questions you answer. (990 after 10 questions would be exceptional: best possible score. 990 after 1,000 questions means you are getting a little less than a point per question, overall)
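For anyone who wants a concrete model: the numbers quoted in this thread (+99 for a correct 99% answer, several hundred negative for a wrong one, 990 as the best possible score over 10 questions) are consistent with a logarithmic rule of the form 100 * log2(2p), where p is the credence you gave the answer that turned out to be right. This is an assumption that fits the quoted numbers, not a claim about the game's actual code:

```python
import math

def score(credence, correct):
    """Assumed scoring rule: 100 * log2(2 * p), where p is the credence
    assigned to whichever answer turned out to be right. This matches
    the numbers quoted in the thread, but isn't taken from the game."""
    p = credence if correct else 1.0 - credence
    return 100 * math.log2(2 * p)

print(round(score(0.99, True)))    # +99   confident and right
print(round(score(0.99, False)))   # -564  confident and wrong ("minus several hundred")
print(round(score(0.50, True)))    # 0     a 50/50 answer never moves the score
```

A rule of this shape is proper: your expected score is maximized by reporting your true probability, so in the long run honest credences beat any systematic over- or under-confidence.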