Inspired by Lê Nguyên Hoang's post on Bayesian Examination, I have been developing (as a hobby) a new app called Bayes-Up (available at: bayes-up.web.app). The app is now in a state where it is working well enough to be shared with others. In this post I list a few things you can do with it, because I expect that it will spark some interest within the community.
- Test and improve your calibration: Bayes-Up uses a collection of good quality trivia questions from the Open Trivia Database. The main point of the app is that you can browse a list of multiple choice quizzes and answer questions by assigning a probability to each of the possible choices. You then receive a score based on a quadratic proper scoring rule, and can later view statistics about the quality of your calibration. A good place to start is the quiz from the book Factfulness by Hans Rosling that I included in the app.
- Create quizzes and upload them. A small number of calibration training apps already exist. Bayes-Up differs mainly because it allows you to upload and share your own quizzes. This can solve one of the main problems of calibration apps, which is creating good quality content (quizzes / questions). If you are a teacher and want your students to develop more metacognitive skills and intellectual honesty, or if you are organizing workshops on probability calibration, Bayes-Up can make it easier for you. To add a quiz, simply write it in a spreadsheet, export it as a CSV file and upload it to Bayes-Up.
- Recommend UI improvements or new features, report bugs, or contribute to the implementation. Very little feedback has been collected so far, and certainly a lot could be improved with little effort. The code of the app is open source and hosted on GitHub.
- Analyse the data from Bayes-Up users. So far about 30'000 questions have been answered by about 1'300 users since the end of December 2019. The collected data is available at this link and will likely grow in the following months. Simple questions that analysing this data could answer are: Do users become better calibrated over time? Is calibration topic-specific or transferable? How can the answers of users with unknown calibration and unknown knowledge be aggregated to predict the right answers to every question? Let me know if you want to do something with it or need better documentation.
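For readers unfamiliar with the quadratic proper scoring rule mentioned in the first bullet, here is a minimal sketch of the textbook form. The exact formula and scaling used inside Bayes-Up are not given in this post, so the function name `quadratic_score` and the score range (from -1 to 1) are assumptions made for illustration only.

```python
def quadratic_score(probs, correct_index):
    """Textbook quadratic score: 2 * q_correct - sum of squared probabilities.

    `probs` is the list of probabilities assigned to each choice and must sum to 1.
    Higher is better: 1 for full confidence in the right answer,
    -1 for full confidence in a wrong one.
    This is an illustrative form, not necessarily the app's own scaling.
    """
    if any(q < 0 for q in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must be non-negative and sum to 1")
    return 2 * probs[correct_index] - sum(q * q for q in probs)


# Three ways of answering a four-choice question whose correct answer is choice 0.
print(quadratic_score([1.0, 0.0, 0.0, 0.0], 0))      # fully confident and right
print(quadratic_score([0.25, 0.25, 0.25, 0.25], 0))  # no idea, uniform answer
print(quadratic_score([0.0, 1.0, 0.0, 0.0], 0))      # fully confident and wrong
```

Because the rule is proper, the expected score is maximised by reporting the probabilities you actually believe, which is what makes it suitable for calibration training.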
Thanks, I think I get it now.
If I observe 4 heads out of 4 and my prior was uniform across [0,1] then the maximum of my posterior (which, with a uniform prior, coincides with the maximum likelihood estimate) is at 1 and this should definitely be within my error bars. Calculating the mean and adding symmetric error bars doesn’t work for asymmetric distributions.
To do this method more accurately you would have to calculate the full posterior distribution across [0,1] and use that to create error bars. Personally I would do this numerically but there may well be an analytical solution someone else will know about.
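On the analytical side: with a uniform prior and k successes in n trials, the posterior is a Beta(k+1, n-k+1) distribution. As a rough numerical sketch of the approach described above, the code below evaluates the posterior on a grid and picks the highest-density region that contains 90% of the mass. The function names, the grid size and the choice of a highest-density interval (rather than an equal-tailed one) are my own assumptions for illustration.

```python
def posterior_grid(k, n, points=10001):
    """Posterior over p on an evenly spaced grid, for k successes in n trials
    and a uniform prior. Returns the grid and the normalised weights."""
    grid = [i / (points - 1) for i in range(points)]
    unnorm = [p**k * (1 - p) ** (n - k) for p in grid]
    total = sum(unnorm)
    return grid, [w / total for w in unnorm]


def highest_density_interval(grid, weights, mass=0.9):
    """Collect grid points from most to least probable until `mass` is reached,
    then report the lowest and highest point collected.
    For a unimodal posterior (as here) the collected points form one interval."""
    order = sorted(range(len(grid)), key=lambda i: weights[i], reverse=True)
    accumulated = 0.0
    chosen = []
    for i in order:
        chosen.append(grid[i])
        accumulated += weights[i]
        if accumulated >= mass:
            break
    return min(chosen), max(chosen)


grid, weights = posterior_grid(k=4, n=4)
low, high = highest_density_interval(grid, weights)
print(f"90% highest-density interval: [{low:.3f}, {high:.3f}]")
```

For 4 heads out of 4 this gives an interval from about 0.63 up to exactly 1, so the error bar is one-sided and includes the posterior maximum, unlike a symmetric bar around the mean.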
Alternatively, a frequentist approach: create error bars on the target percentage, rather than on the percentage achieved.
For each percentage grouping see how many questions had been answered using that percentage. Then use a binomial distribution to calculate the likelihood of each number of correct responses assuming that I am perfectly calibrated. This is essentially calculating a p-value with the null hypothesis being “I am perfectly calibrated”.
For example say I’ve answered 80% 4 times. If I’m perfectly calibrated I have a 0.8^4=41% chance of getting them all correct. Correspondingly I have:
0.8^3 x 0.2 x 4 = 41% to get 3 correct
0.8^2 x 0.2^2 x 6 = 15.4% to get 2 correct
0.8 x 0.2^3 x 4 = 2.6% to get 1 correct
0.2^4 = 0.2% to get 0 correct
If I am using a 90% CI (5% - 95%) then getting 0 correct is not inside my interval and nor is getting 1 correct (since 0.2% + 2.6% < 5%) but any of the other results are. So the top of my target error bar would reach to 100% and the bottom of it would be between 25% and 50%.
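The arithmetic above can be reproduced with a short script. This is only a sketch of the procedure as I read it: the function names are made up, and the rule used to decide whether an outcome is "inside" the interval (both its lower-tail and upper-tail cumulative probabilities must be at least 5%) is my interpretation of the description.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k correct answers out of n, each correct with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def plausible_counts(n, p, tail=0.05):
    """Return the pmf and the counts of correct answers that lie inside the
    central interval, assuming perfect calibration at stated probability p."""
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
    kept = []
    for k in range(n + 1):
        lower_tail = sum(pmf[: k + 1])  # P(X <= k)
        upper_tail = sum(pmf[k:])       # P(X >= k)
        if lower_tail >= tail and upper_tail >= tail:
            kept.append(k)
    return pmf, kept


pmf, kept = plausible_counts(n=4, p=0.8)
for k, prob in enumerate(pmf):
    print(f"{k} correct: {prob:.2%}")
print("counts inside the 90% interval:", kept)
```

With 4 answers given at 80%, the counts kept are 2, 3 and 4 correct. The lower tail up to 1 correct adds up to 0.16% + 2.56% = 2.72%, which is below 5%, so 0 and 1 correct fall outside the interval.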
It is possible to combine all of the answers to create a single p-value across all percentages but this gets more complicated.
(Of course there would be 0 width error bars at 0% and 100% responses as any failures on these percentages are irrecoverable but this is right and proper)