Way back in April 2018, I announced a Prediction Contest: whoever made the best predictions on a set of PredictionBook questions ahead of a 1st July deadline would win a prize once they all resolved in January 2019, which is now.
It was a bit of an experiment; I had no idea how many people were up for practicing predictions to try to improve their calibration, and decided to throw a little money and time at giving it a try. And in the spirit of reporting negative experimental results: the answer was three, all of whom I greatly appreciate for participating. I don't regret running the experiment, but I'm going to pass on running a Prediction Contest 2019. I don't think this necessarily rules out other attempts to practically test and compete on rationality-related skills later, though.
The Results
Our entrants were bendini, bw, and Ialaithion, and their ranked log scores were:
bw: -9.358221122
Ialaithion: -9.594999615
bendini: -10.0044594
This was close enough that changing a single question's resolution could have tipped the results, so all three did pretty well. That said, bw came out ahead, and even managed to beat averaging everyone's predictions: if you had simply taken the average prediction (including non-entrants') as of the entry deadline as your own, you'd have scored -9.576568147.
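For anyone who wants to score their own predictions the same general way, here's a minimal sketch of a log score. I'm assuming natural logs summed over questions here; the exact convention used for the contest is in the spreadsheet linked below.

```python
import math

def log_score(predictions):
    """Sum of the log-probabilities assigned to what actually happened.

    predictions: list of (p, resolved_true) pairs, where p is the probability
    assigned to the proposition being true. Closer to zero is better.
    Natural log is assumed here; the contest's exact convention is in the
    spreadsheet linked below.
    """
    return sum(math.log(p if resolved_true else 1.0 - p)
               for p, resolved_true in predictions)

# e.g. 80% on a question that came true, 30% on one that didn't:
print(log_score([(0.8, True), (0.3, False)]))  # log(0.8) + log(0.7) ≈ -0.58
```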
The full calculations for each of the log scores, as well as my own log score and the results of feeding the predictions as of prediction time to a simple model rather than simply averaging them, are in a spreadsheet here.
I'll be in touch with bw this evening to sort out their prize. Thanks to everyone who participated, and to everyone who helped find questions to use for it.
The Model
It's not a novel algorithm, just a learning project I did while getting to grips with ML frameworks: a fairly simple LSTM plus one dense layer, trained on the predictions and resolutions of about 60% of the resolved predictions on PredictionBook as of September last year (which doesn't include any of the ones in the contest). The remaining resolved predictions were used for cross-validation or set aside as a test set. An even simpler RNN does only very slightly worse, though.
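For the curious, a minimal sketch of that kind of model is below. This isn't the repository's actual code; the layer sizes, the padding/masking scheme, and the cap on predictions per proposition are all illustrative assumptions, but the overall shape (a sequence of probability assignments going into an LSTM, then one dense sigmoid output estimating the chance the proposition resolves true) matches the description above.

```python
# Illustrative sketch only, not the repository's code. Layer sizes, the
# padding value, and MAX_PREDICTIONS are assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

MAX_PREDICTIONS = 20  # assumed cap on probability assignments per proposition

model = keras.Sequential([
    keras.Input(shape=(MAX_PREDICTIONS, 1)),
    layers.Masking(mask_value=-1.0),        # ignore padding for short sequences
    layers.LSTM(16),                        # reads the sequence of assignments
    layers.Dense(1, activation="sigmoid"),  # estimated P(proposition is true)
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Toy training data: per proposition, a padded sequence of probability
# assignments plus a 0/1 label for whether it resolved true.
X = np.full((2, MAX_PREDICTIONS, 1), -1.0)
X[0, :3, 0] = [0.7, 0.8, 0.9]
X[1, :2, 0] = [0.2, 0.4]
y = np.array([1, 0])
model.fit(X, y, epochs=5, verbose=0)

print(model.predict(X, verbose=0))  # estimated probability of resolving true
```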
The details of how the algorithm works are thus somewhat opaque, but from observing how it reacts to input, it seems to lean on the average, weight later-in-sequence predictions more heavily (so order matters), and grow more confident as the number of predictions increases, while treating propositions with only a single probability assignment as probably being heavily overconfident. It seems to have more or less learnt, on its own, the insight Tetlock pointed out. Disagreement between predictions might also matter to it; I'm not sure.
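This sort of behaviour is easy to poke at by hand. Reusing the model and padding convention from the sketch above, something like the following compares the same assignments in different orders and counts.

```python
import numpy as np

def predict_sequence(model, probs, max_len=20, pad=-1.0):
    """Pad a list of probability assignments and return the model's estimate."""
    x = np.full((1, max_len, 1), pad)
    x[0, :len(probs), 0] = probs
    return float(model.predict(x, verbose=0)[0, 0])

# `model` here is the one built in the sketch above.
print(predict_sequence(model, [0.3, 0.9]))  # compare with the reversed order...
print(predict_sequence(model, [0.9, 0.3]))  # ...to see how much order matters
print(predict_sequence(model, [0.9]))       # lone assignment: discounted as overconfident?
```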
It's on GitHub at https://github.com/jbeshir/moonbird-predictor-keras; this doesn't include the data, which I downloaded using https://github.com/jbeshir/predictionbook-extractor. It's not particularly tidy, though, and still includes a lot of unused functionality for input features (the words of the proposition, the time between a probability assignment and the due time, etc.), which I didn't end up using because the dataset was too small for the model to learn any signal from them.
I'm currently working on making the online frontend to the model automatically retrain it at intervals using freshly resolved predictions, mostly as practice building a simple "online" ML system before I move on to trying to build things with more practical application.
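Roughly, the shape I have in mind is a loop like the one below. This is just a sketch, and get_training_data and build_model are hypothetical stand-ins for the real data fetching/encoding and model construction, not functions from the actual project.

```python
# Sketch of a periodic retraining loop; not the actual frontend code.
import time

RETRAIN_INTERVAL_SECONDS = 24 * 60 * 60  # assumed interval, e.g. once a day

def retrain_loop(get_training_data, build_model, serve_state):
    """Periodically rebuild the model from freshly resolved predictions and
    swap it into serve_state, where the serving code looks up the current model."""
    while True:
        X, y = get_training_data()    # hypothetical: fetch and encode newly resolved predictions
        model = build_model()         # hypothetical: e.g. the LSTM sketched above
        model.fit(X, y, epochs=10, verbose=0)
        serve_state["model"] = model  # swap in the freshly trained model
        time.sleep(RETRAIN_INTERVAL_SECONDS)
```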
The main reason I ran its figures against the contest questions was that some of its individual confidences seemed strange to me, and while the cross-validation results said it was good, I was suspicious that I was getting something wrong somewhere in the process.