All of Fred Zhang's Comments + Replies

Thanks! This seems like the best way to eval the bot anyway!

SOTA

Do they have an evaluation result in Brier score, by back testing on resolved questions, similar to what is done in the literature?

(They have a pic with "expected Brier score", which seems to be based on some kind of simulation?)
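For reference, the kind of back-test described in the literature scores each forecast on a resolved binary question with the Brier score, then averages. A minimal sketch (function names are my own, not from any specific paper):

```python
def brier_score(prob: float, outcome: int) -> float:
    """Brier score for a binary question: (p - y)^2, lower is better.
    prob: forecast probability of YES; outcome: 1 if resolved YES, else 0."""
    return (prob - outcome) ** 2

def backtest(forecasts, outcomes):
    """Mean Brier score over a set of resolved questions."""
    return sum(brier_score(p, y) for p, y in zip(forecasts, outcomes)) / len(forecasts)

# A perfect forecaster scores 0.0; always predicting 0.5 scores 0.25.
```

"Expected Brier score" from a simulation is a different quantity; a back-test on resolved questions is the directly comparable number.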

Garrett Baker
Futuresearch bets on Manifold.

One can think of this as cases where auto-interp exhibits a precision-recall trade-off. At one extreme, you can generate super broad annotations like "all English text" to capture a lot, which would be overkill; at the other end, you can generate very specific ones like "slurs targeting sexual orientation," which would risk mislabeling, say, racial slurs.

Section 4.3 of the OpenAI SEA paper also discusses this point.
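The trade-off above can be made concrete with a toy scoring function (entirely illustrative, not the metric from either paper): compare the set of inputs an annotation predicts the feature fires on against the inputs where it actually fires.

```python
def precision_recall(predicted: set, actual: set):
    """Toy precision/recall for an auto-interp annotation.
    predicted: inputs the annotation says should activate the feature;
    actual: inputs where the feature actually activates."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall

# A very broad annotation ("all English text") predicts almost everything:
# recall near 1.0 but low precision. A very narrow annotation predicts few
# inputs: high precision but low recall.
```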

Mechanistic interpretability-based evals could try to find inputs that lead to concerning combinations of features

An early work that does this on the vision model is https://distill.pub/2019/activation-atlas/.

Specifically, in the section "Focusing on a Single Classification," they observe spurious correlations in the activation space via feature visualization, and use this observation to construct new failure cases of the model.

this line of work is the strongest argument for mech interp [...] having concrete capabilities externalities

I find this claim a bit handwavy, as I could imagine state space models being invented and improved to their current state without the prior work of mech interp. More fundamentally, "being inspired by" is not a quantitative claim after all, and mech interp is not the central idea here anyway.

On the other hand, though, much of the (shallow) interp work can help with capabilities more directly, especially on inference speed. Recent examples I can t... (read more)

LawrenceC
Yeah, "strongest" doesn't mean "strong" here! 

Great questions, and thanks for the helpful comments!

underconfidence issues

We have not tried explicit extremizing. But in the study where we average our system's prediction with the community crowd, we find results better than both (under Brier score). This effectively does the extremizing in those <10% cases.
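To see why simple averaging acts like extremizing here: when the model is underconfident on low-probability questions (e.g., it says 20% where the crowd says 5%), the average pulls the combined forecast toward the extreme. A hypothetical sketch (the equal-weight average is my illustration, not necessarily the paper's exact weighting):

```python
def ensemble(model_p: float, crowd_p: float) -> float:
    """Equal-weight average of model and crowd forecasts."""
    return (model_p + crowd_p) / 2

# Model says 0.20, crowd says 0.05: the ensemble gives 0.125, which is
# more extreme than the underconfident model's forecast alone.
```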

However, humans on INFER and Metaculus are scored according to the integral of their Brier score over time, i.e. their score gets weighted by how long a given forecast is up

We were not aware of this! We always take unweighted average ac... (read more)
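The time-weighted scoring described in the quoted comment can be sketched as follows: each standing forecast is weighted by how long it was up before the question resolved. (A simplified sketch with made-up function and parameter names, assuming timestamps on a common scale.)

```python
def time_weighted_brier(forecasts, outcome, close_time):
    """Time-integrated Brier score, as used on INFER/Metaculus-style platforms.
    forecasts: list of (timestamp, prob) pairs sorted by timestamp;
    outcome: 1 or 0; close_time: when the question resolved."""
    total, weighted = 0.0, 0.0
    for i, (t, p) in enumerate(forecasts):
        # Each forecast stands until the next update (or until resolution).
        t_next = forecasts[i + 1][0] if i + 1 < len(forecasts) else close_time
        duration = t_next - t
        weighted += duration * (p - outcome) ** 2
        total += duration
    return weighted / total
```

By contrast, an unweighted average scores every forecast equally regardless of how long it stood.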

We will be updating the paper with log scores. 
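For readers unfamiliar with the metric: the log score is the negative log-likelihood of the realized outcome, and unlike the Brier score it punishes confident wrong forecasts very harshly. A minimal sketch:

```python
import math

def log_score(prob: float, outcome: int) -> float:
    """Negative log-likelihood of the realized binary outcome; lower is better."""
    p = prob if outcome == 1 else 1 - prob
    return -math.log(p)

# A 50% forecast scores ln(2) ~ 0.693 either way; a 99% forecast that
# resolves NO scores -ln(0.01) ~ 4.6, a much steeper penalty than Brier's.
```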

I think human forecasters collaborating with their AI counterparts (in an assistance / debate setup) is a super interesting future direction. I imagine the strongest possible system we could build today would be of this sort. This related work explored the direction with some positive results.

is the better performance of the LM forecasts due to disagreeing about direction, or mostly due to marginally better calibration?

Definitely both. But more coming from the fact that the models don't like to say extreme ... (read more)

Is someone finetuning (or already finetuned) a system to be generally numerate and good at fermi estimates?

We didn't try fine-tuning on general Fermi estimate tasks, but I imagine the results would be positive. For our specific problem of forecasting with external reasonings, fine-tuning helps a lot! We have an ablation study in Section 7 of the paper showing that if you just use the non-fine-tuned chat model, holding everything else fixed, it's significantly worse.

We also didn't explore using base models that are not instruction-tuned or RLHF'ed. That could be interesting to look at.