Inspired by Lê Nguyên Hoang's post on Bayesian Examination, I have been developing (as a hobby) a new app called Bayes-Up (available at: bayes-up.web.app). The app is now in a state where it is working well enough to be shared with others. In this post I list a few things you can do with it, because I expect that it will spark some interest within the community.
- Test and improve your calibration: Bayes-Up uses a collection of good-quality trivia questions from the open trivia database. The main point of the app is that you can browse a list of multiple-choice quizzes, answer questions by assigning a probability to each of the possible choices, receive a score based on a quadratic proper scoring rule, and later view statistics about the quality of your calibration. A good place to start is the quiz from the book Factfulness by Hans Rosling, which I included in the app.
- Create quizzes and upload them. A small number of calibration training apps already exist. Bayes-Up differs mainly in that it lets you upload and share your own quizzes. This can solve one of the main problems of calibration apps, which is creating good-quality content (quizzes / questions). If you are a teacher and want your students to develop more metacognitive skills and intellectual honesty, or if you are organizing workshops on probability calibration, Bayes-Up can make it easier for you. To add a quiz, simply write it in a spreadsheet, export it as a CSV file and upload it to Bayes-Up.
- Recommend UI improvements or new features, report bugs, or contribute to the implementation. Very little feedback has been collected so far, and a lot could certainly be improved with little effort. The code of the app is open source and hosted on GitHub.
- Analyse the data from Bayes-Up users. So far about 30'000 questions have been answered by about 1'300 users since the end of December 2019. The collected data is available at this link and will likely grow in the following months. Simple questions that analysing this data could answer are: Do users become better calibrated over time? Is calibration topic-specific or transferable? How can the answers of users with unknown calibration and unknown knowledge be aggregated to predict the right answers to every question? Let me know if you want to do something with it or need better documentation.
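To make the scoring and calibration statistics mentioned in the list above more concrete, here is a minimal Python sketch. It is not the code used in Bayes-Up: the exact normalization of the app's quadratic rule is not stated here, so `quadratic_score` assumes the common Brier-style form, and `calibration_table` assumes the data can be reduced to hypothetical `(stated_probability, was_correct)` pairs.

```python
from collections import defaultdict


def quadratic_score(probs, correct_index):
    """Brier-style quadratic score: 1 - sum_i (p_i - o_i)^2,
    where o_i is 1 for the right choice and 0 otherwise.
    (Assumed form; the app's own normalization may differ.)"""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return 1.0 - sum(
        (p - (1.0 if i == correct_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def calibration_table(records):
    """Group (stated_probability, was_correct) pairs by stated probability.
    Returns {stated_probability: (count, observed_frequency)}."""
    counts = defaultdict(lambda: [0, 0])  # p -> [n_total, n_correct]
    for p, was_correct in records:
        counts[p][0] += 1
        counts[p][1] += int(was_correct)
    return {p: (n, k / n) for p, (n, k) in sorted(counts.items())}
```

With this form of the rule, a uniform guess over four choices scores 0.25, a fully confident right answer scores 1, and a fully confident wrong answer scores -1. A well-calibrated user would see observed frequencies in the table close to the stated probabilities.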
You are right about the proportion of dots within the error bars. This sounds like something I would want to change.
100% is not within the error bar because these are not exactly error bars, but Bayesian estimates of where your true probability lies, using a uniform prior between 0% and 100%. Suppose I pick a coin whose probability p of heads was drawn uniformly between 0% and 100%. After observing 4 heads out of 4 throws, you should still believe that the probability of heads is, on average, about 83% ( = (n_heads + 1) / (n_throws + 2) ), and a 75% credible interval would not contain the probability 100%.
So you need to provide more evidence that your 100% answers are indeed right 100% of the time. I agree this is confusing, and I want to change it for the better, but I am unsure how.
For all answers with probability p, I count the number of times it has been the right answer and the number of times it has been a wrong answer. If anyone has a recommendation on how to compute the top and bottom percentage of the error bars from these counts, I would really appreciate it.
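For reference, the uniform-prior estimate described above can be computed from those two counts roughly as follows. This is a sketch under stated assumptions, not the code in the app: it assumes the posterior is Beta(n_right + 1, n_wrong + 1) and that the bars are an equal-tailed 75% credible interval. The function names are made up for illustration.

```python
from math import comb


def beta_cdf(x, n_right, n_wrong):
    """CDF of Beta(n_right + 1, n_wrong + 1), the posterior under a uniform prior.
    For integer parameters it equals P(Binomial(m, x) >= n_right + 1),
    with m = n_right + n_wrong + 1."""
    m = n_right + n_wrong + 1
    return sum(
        comb(m, j) * x**j * (1 - x) ** (m - j)
        for j in range(n_right + 1, m + 1)
    )


def beta_quantile(q, n_right, n_wrong, tol=1e-10):
    """Invert the CDF by bisection (the CDF is increasing in x)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, n_right, n_wrong) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def credible_interval(n_right, n_wrong, level=0.75):
    """Return (lower, posterior_mean, upper) for an equal-tailed interval."""
    tail = (1 - level) / 2
    mean = (n_right + 1) / (n_right + n_wrong + 2)
    return (
        beta_quantile(tail, n_right, n_wrong),
        mean,
        beta_quantile(1 - tail, n_right, n_wrong),
    )
```

For 4 right answers out of 4, `credible_interval(4, 0)` gives a posterior mean of 5/6 (about 83%) and an interval of roughly 66% to 97%, which is why 100% falls outside the bar. An equal-tailed interval always excludes 100%; a highest-density interval is one alternative that can reach 100% when no wrong answers have been observed.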