SOTA
Do they have an evaluation result in Brier score, from backtesting on resolved questions, similar to what is done in the literature?
(They have a pic with "expected Brier score", which seems to be based on some kind of simulation?)
One can think of this as cases where auto-interp exhibits a precision-recall trade-off. At one extreme, you can generate super broad annotations like "all English text" to capture a lot, which would be overkill; at the other end, you can generate very specific ones like "slurs targeting sexual orientation", which would risk mislabeling, say, racial slurs.
Section 4.3 of the OpenAI SEA paper also discusses this point.
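As a toy illustration of that trade-off (the feature, examples, and keyword-matching "annotations" below are all made up): the broad annotation recovers every true activation but also fires on text the feature ignores, while the narrow one misses the racial-slur activations.

```python
# Toy example of the precision-recall trade-off in auto-interp annotations.
# The feature, examples, and keyword-matching "annotations" are all invented.

examples = [  # (text, does the feature actually fire on it?)
    ("a slur about gay people", True),
    ("a racial slur", True),
    ("the weather is nice today", False),
    ("a homophobic slur", True),
]

annotations = {
    "all English text": lambda text: True,  # overly broad
    "slurs targeting sexual orientation":
        lambda text: "gay" in text or "homophobic" in text,  # overly specific
}

for name, matches in annotations.items():
    tp = sum(matches(t) and fires for t, fires in examples)
    fp = sum(matches(t) and not fires for t, fires in examples)
    fn = sum(not matches(t) and fires for t, fires in examples)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"{name!r}: precision={precision:.2f}, recall={recall:.2f}")
    # broad annotation: recall 1.00 but lower precision
    # narrow annotation: precision 1.00 but it misses the racial-slur activation
```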
Mechanistic interpretability-based evals could try to find inputs that lead to concerning combinations of features
An early work that does this for vision models is https://distill.pub/2019/activation-atlas/.
Specifically, in the "Focusing on a Single Classification" section, they observe spurious correlations in the activation space via feature visualization, and use this observation to construct new failure cases for the model.
this line of work is the strongest argument for mech interp [...] having concrete capabilities externalities
I find this claim a bit handwavy, as I could imagine state space models being invented and improved to their current state without the prior work of mech interp. More fundamentally, "being inspired by" is not a quantitative claim, and mech interp is not the central idea here anyway.
On the other hand, much of the (shallow) interp work can help with capabilities more directly, especially on inference speed. Recent examples I can think of are Attention Sinks, Activation Sparsity, Deja Vu, and several parallel and follow-up works. (Sudden Drops also has some evidence on improving training dynamics using insights from developmental interp, though I think it's somewhat weak.)
Great questions, and thanks for the helpful comments!
underconfidence issues
We have not tried explicit extremizing. But in the study where we average our system's prediction with the community crowd, we find the averaged forecast beats both (under Brier score). This effectively does the extremizing in those <10% cases.
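As a toy illustration with made-up numbers (not our data): averaging an underconfident forecast with a more extreme crowd forecast pulls it toward the extreme, improving its Brier score on a question that resolves "no".

```python
# Made-up numbers: a question that resolves "no", an underconfident model
# forecast, and a more extreme community forecast.

def brier(p, outcome):
    """Brier score of a binary forecast p = P(yes); lower is better."""
    return (p - outcome) ** 2

model_p, crowd_p, outcome = 0.15, 0.03, 0
avg_p = (model_p + crowd_p) / 2  # averaging pulls the model toward the extreme

print(f"model   {brier(model_p, outcome):.4f}")  # 0.0225
print(f"crowd   {brier(crowd_p, outcome):.4f}")  # 0.0009
print(f"average {brier(avg_p, outcome):.4f}")    # 0.0081 -- better than the model alone
```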
However, humans on INFER and Metaculus are scored according to the integral of their Brier score over time, i.e. their score gets weighted by how long a given forecast is up
We were not aware of this! We always take an unweighted average across the retrieval dates when evaluating our system. If we put more weight on the later retrieval dates, the gap between our system and humans should be a bit smaller, for the reason you describe.
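To illustrate the difference with made-up dates and scores (not our evaluation code): the platforms' time-integrated scoring is roughly a duration-weighted average, whereas we take an unweighted mean over retrieval dates.

```python
from datetime import date

# Hypothetical per-retrieval-date Brier scores for one question; each forecast
# is assumed to stand until the next retrieval date (or resolution).
forecasts = [
    (date(2023, 1, 1), 0.20),
    (date(2023, 2, 1), 0.10),
    (date(2023, 3, 1), 0.05),
]
resolution = date(2023, 6, 1)

unweighted = sum(score for _, score in forecasts) / len(forecasts)

next_dates = [d for d, _ in forecasts[1:]] + [resolution]
durations = [(nd - d).days for (d, _), nd in zip(forecasts, next_dates)]
time_weighted = (sum(days * score for days, (_, score) in zip(durations, forecasts))
                 / sum(durations))

print(f"unweighted mean:        {unweighted:.4f}")     # 0.1167
print(f"duration-weighted mean: {time_weighted:.4f}")  # lower: the last (best) forecast stood longest
```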
- Relatedly, have you tried other retrieval schedules and if so did they affect the results
No, we have not tried. One alternative is to sample k random or uniformly spaced retrieval dates within [open, resolve]. Unfortunately, this is not quite kosher, as it leaks the resolution date, which, as we argue in the paper, correlates with the label.
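To make the leakage concrete, here is a sketch with a hypothetical schedule sampler (not our code): the sampled schedule itself lower-bounds the resolution date, which a real-time forecaster would not know.

```python
import random
from datetime import date, timedelta

def sample_schedule(open_date, resolve_date, k, seed=0):
    """Hypothetical schedule: k random retrieval dates in [open, resolve]."""
    rng = random.Random(seed)
    span = (resolve_date - open_date).days
    return sorted(open_date + timedelta(days=rng.randint(0, span)) for _ in range(k))

open_date, resolve_date = date(2023, 1, 1), date(2023, 6, 30)
schedule = sample_schedule(open_date, resolve_date, k=5)

# The leak: the schedule itself tells you the question is still open at
# max(schedule), i.e. it lower-bounds the resolution date -- information a
# real-time forecaster would not have, and which correlates with the label.
print(schedule)
print("resolution is no earlier than", max(schedule))
```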
figure 4c
This is on the validation set. Note that the figure caption begins with "Figure 4: System strengths. Evaluating on the validation set, we note"
log score
We will update the paper soon to include the log score.
standard error in time series
See here for some alternatives in time series modeling.
I don't know what the right choice is for judgmental forecasting, though; I am not sure it has been studied at all (probably an open question). Generally, the keywords to Google are "autocorrelation standard error" and "standard error in time series".
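For example (an illustrative sketch, not what we do in the paper), Newey-West / HAC standard errors are one standard way to account for autocorrelation when estimating the standard error of a mean score over time; the series below is made up.

```python
import numpy as np

def newey_west_se(x, max_lag):
    """Newey-West (HAC) standard error of the sample mean of a series.

    One standard option when consecutive scores are autocorrelated; the lag
    truncation `max_lag` is a tuning choice.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    long_run_var = d @ d / n  # lag-0 autocovariance
    for lag in range(1, max_lag + 1):
        weight = 1 - lag / (max_lag + 1)  # Bartlett kernel
        gamma = d[lag:] @ d[:-lag] / n    # lag-`lag` autocovariance
        long_run_var += 2 * weight * gamma
    return float(np.sqrt(long_run_var / n))

# Illustrative autocorrelated series of daily score differences (made up).
rng = np.random.default_rng(0)
scores = rng.normal(0.01, 0.05, size=200) + 0.03 * np.sin(np.arange(200) / 10)
print("naive SE:      ", scores.std(ddof=1) / np.sqrt(len(scores)))
print("Newey-West SE: ", newey_west_se(scores, max_lag=10))
```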
We will be updating the paper with log scores.
I think human forecasters collaborating with their AI counterparts (in an assistance / debate setup) is a super interesting future direction. I imagine the strongest possible system we can build today will be of this sort. This related work explored this direction with some positive results.
is the better performance of the LM forecasts due to disagreeing about direction, or mostly due to marginally better calibration?
Definitely both, but more of it comes from the fact that the models don't like to output extreme values (like <5%), even when the evidence supports them. This doesn't necessarily hurt calibration, though, since calibration only cares about the error within each bin of predicted probabilities.
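A toy illustration of this point (the question mix and probabilities are made up, not our data): a forecaster that never goes below 10% can still look well calibrated bin by bin, while paying a price in Brier score.

```python
import numpy as np

def binned_calibration_error(preds, outcomes, n_bins=10):
    """Mean |avg prediction - empirical frequency| over occupied probability bins."""
    bins = np.clip((preds * n_bins).astype(int), 0, n_bins - 1)
    errs = [abs(preds[bins == b].mean() - outcomes[bins == b].mean())
            for b in range(n_bins) if (bins == b).any()]
    return float(np.mean(errs))

rng = np.random.default_rng(0)
n = 20_000
# Two kinds of questions, mixed together: half with true probability ~3%, half ~17%.
true_p = np.where(np.arange(n) % 2 == 0, 0.03, 0.17)
outcomes = (rng.random(n) < true_p).astype(float)

sharp = true_p.astype(float)  # willing to say 3% when the evidence supports it
hedged = np.full(n, 0.10)     # never goes below 10%; lumps both kinds together

for name, p in [("sharp", sharp), ("hedged", hedged)]:
    print(f"{name}: Brier={np.mean((p - outcomes) ** 2):.4f}, "
          f"calibration error={binned_calibration_error(p, outcomes):.4f}")
# Both look well calibrated within bins, but the hedged forecaster has a worse Brier score.
```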
This will result in a lot of very low probability forecasts, since it's likely that only A or B occurs, especially closer to the resolution date.
Yes; we didn't use all of the multiple-choice questions, only those that are already split into binary questions by the platforms. For example, if you query the Metaculus API, some multiple-choice questions are broken down into binary subquestions (each with its own community predictions, etc.). Our dataset is not dominated by such multiple-choice-turned-binary questions.
Does your system obey the Law of total probability?
No, and we didn't try very hard to enforce this. Similarly, if you ask the model the same binary question phrased in the opposite direction, the two answers generally do not sum to 1. I think future systems should try to overcome this by enforcing such constraints in some way.
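One simple post-hoc fix a future system could try (an illustrative sketch, not what our system does): collect the model's raw probabilities for the mutually exclusive outcomes, or for the two phrasings of a binary question, and renormalize them so they sum to 1.

```python
def renormalize(raw_probs):
    """Project raw probabilities for mutually exclusive, exhaustive outcomes
    onto the simplex by simple renormalization (one possible post-hoc fix)."""
    total = sum(raw_probs.values())
    return {k: v / total for k, v in raw_probs.items()}

# The model, asked the same binary question both ways, may not give complements:
raw = {"Will X happen?": 0.30, "Will X fail to happen?": 0.80}
print(renormalize(raw))  # now sums to 1: {'Will X happen?': 0.27..., 'Will X fail to happen?': 0.72...}
```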
I'm left wondering what caused the human and LM forecasts to differ in accuracy.
By accuracy, we mean 0-1 error: round the probabilistic forecast to 0 or 1, whichever is nearer, and measure the 0-1 loss. This means that as long as you are directionally correct, you get good accuracy. (This is not a standard metric; we report it mostly to compare with prior work.) So this kind of hedging behavior generally doesn't hurt accuracy.
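Concretely, here is a minimal sketch of the metric with made-up forecasts:

```python
def zero_one_error(p, outcome):
    """Round the forecast to the nearest of 0 or 1, then compare to the outcome."""
    return int(round(p) != outcome)

def brier(p, outcome):
    return (p - outcome) ** 2

# A hedged forecast (0.35) and a confident one (0.05) on a question that resolves "no":
for p in (0.35, 0.05):
    print(f"p={p}: 0-1 error={zero_one_error(p, 0)}, Brier={brier(p, 0):.4f}")
# Both are directionally correct, so both get 0-1 error 0, but their Brier scores differ.
```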
The McCarthy example [...] This is a failure in reasoning, not calibration, IMO.
This is a good point! We'll add a bit more on how to interpret these qualitative examples. To be fair, these are hand-picked and I would caution against drawing strong conclusions from them.
In equation 1, is k 0-indexed or 1-indexed?
1-indexed.
Is someone finetuning (or already finetuned) a system to be generally numerate and good at fermi estimates?
We didn't try fine-tuning on general Fermi estimate tasks, but I imagine the results would be positive. For our specific problem of forecasting with external reasonings, fine-tuning helps a lot! We have an ablation study in Section 7 of the paper showing that if you just use the non-fine-tuned chat model, holding everything else fixed, it's significantly worse.
We also didn't explore using base models that are not instruction-tuned or RLHF'ed. That could be interesting to look at.
Thanks! This seems like the best way to eval the bot anyway!