Fred Zhang — LessWrong

Replying to AI forecasting bots incoming

Fred Zhang, 1y

Thanks! This seems the best way to eval the bot anyway!

Replying to AI forecasting bots incoming

Fred Zhang, 1y

SOTA

Do they report Brier scores from backtesting on resolved questions, similar to what is done in the literature?

(They have a pic with "expected Brier score", which seems to be based on some kind of simulation?)
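For concreteness, here is a minimal sketch of the kind of backtest I have in mind: score the bot's probabilities against how the questions actually resolved. The forecasts and resolutions below are made up for illustration, and `brier_score` is a hypothetical helper, not anything from their codebase.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally sized, non-empty lists")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical backtest: the bot's probabilities on four resolved questions.
bot_forecasts = [0.9, 0.2, 0.6, 0.5]
resolutions = [1, 0, 0, 1]  # 1 = resolved yes, 0 = resolved no

bot_brier = brier_score(bot_forecasts, resolutions)
# Reference point: always answering 50% scores exactly 0.25.
coin_flip_brier = brier_score([0.5] * len(resolutions), resolutions)

print(f"bot: {bot_brier:.3f}, always-50%: {coin_flip_brier:.3f}")
```

Lower is better, so a backtested score is only informative next to a baseline such as the always-50% forecaster or the human crowd on the same questions.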

Fred Zhang, 1y

One can think of this as cases where auto-interp exhibits a precision-recall trade-off. At one extreme, you can generate super broad annotations like "all English text" to capture a lot, which would be overkill; at the other, you can generate very specific ones like "Slurs targeting sexual orientation", which would risk mislabeling, say, racial slurs.

Section 4.3 of the OpenAI SAE paper also discusses this point.
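A toy version of the trade-off, assuming we treat an annotation as a set of inputs it claims the feature fires on and compare that against the inputs where the feature actually fires. The input IDs and the `precision_recall` helper are hypothetical; this is not the scoring pipeline of any particular auto-interp method.

```python
def precision_recall(annotated, active):
    """Score an annotation (set of input IDs) against where the feature fires."""
    hits = len(annotated & active)
    precision = hits / len(annotated) if annotated else 0.0
    recall = hits / len(active) if active else 0.0
    return precision, recall


corpus = set(range(10))   # hypothetical: ten inputs in total
active = {0, 1, 2, 3}     # inputs where the feature actually fires

broad = set(corpus)       # e.g. "all English text": claims every input
narrow = {0, 1}           # a very specific label: covers half the firing inputs

broad_pr = precision_recall(broad, active)
narrow_pr = precision_recall(narrow, active)

print("broad  (precision, recall):", broad_pr)
print("narrow (precision, recall):", narrow_pr)
```

The broad label gets perfect recall with poor precision, and the narrow one gets perfect precision but misses half of the activations.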

Replying to Sparsify: A mechanistic interpretability research agenda

Fred Zhang, 2y

Mechanistic interpretability-based evals could try to find inputs that lead to concerning combinations of features

An early work that does this on a vision model is https://distill.pub/2019/activation-atlas/.

Specifically, in the section on Focusing on a Single Classification, they observe spurious correlations in the activation space, via feature visualization, and use this observation to construct new failure cases of the model.

Fred Zhang, 2y

this line of work is the strongest argument for mech interp [...] having concrete capabilities externalities

I have found this claim a bit handwavy, as I could imagine state space models being invented and improved to the current stage without the prior work of mech interp. More fundamentally, just "being inspired by" is not a quantitative claim after all, and mech interp is not the central idea here anyway.

On the other hand, though, much of the (shallow) interp can help with capabilities more directly, especially on inference speed. Recent examples I can think of are Attention Sinks, Activation Sparsity, Deja Vu, and several parallel and follow-up works. (Sudden Drops also has some evidence on improving training dynamics using insights from developmental interp, though I think it's somewhat weak.)

Replying to Approaching Human-Level Forecasting with Language Models

Fred Zhang, 2y

Great questions, and thanks for the helpful comments!

underconfidence issues

We have not tried explicit extremizing. But in the study where we average our system's prediction with the community crowd, we find results better than either alone (under Brier scores). This essentially does the extremizing in those <10% cases.
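To illustrate the difference, here is a rough sketch with made-up numbers. `average_with_crowd` is the simple averaging described above. `extremize` is one common form of explicit extremizing, included only for comparison; we did not use it, and the exponent `alpha` is an arbitrary choice.

```python
def average_with_crowd(system_p, crowd_p):
    """Simple average of the system's forecast and the crowd's forecast."""
    return (system_p + crowd_p) / 2


def extremize(p, alpha):
    """Push p away from 0.5 when alpha > 1 (a common transform, shown for comparison)."""
    return p ** alpha / (p ** alpha + (1 - p) ** alpha)


# Hypothetical case: the system hedges at 10% while the crowd sits at 2%.
system_p, crowd_p = 0.10, 0.02

averaged = average_with_crowd(system_p, crowd_p)
extremized = extremize(system_p, alpha=2.0)

print(f"system: {system_p:.3f}, averaged: {averaged:.3f}, extremized: {extremized:.3f}")
```

In this example both operations pull the hedged 10% toward 0, which is the sense in which averaging with a more extreme crowd acts like extremizing.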

However, humans on INFER and Metaculus are scored according to the integral of their Brier score over time, i.e. their score gets weighted by how long a given forecast is up

We were not aware of this! We always take an unweighted average across the retrieval dates when evaluating our system. If we put more weight on the later retrieval dates, the gap between our system and humans should...
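A small sketch of the two scoring conventions, using made-up numbers for a single question. The time-weighted version assumes each forecast is weighted by the number of days it stands, which is my reading of the quoted description rather than the exact rule on INFER or Metaculus.

```python
def brier(p, outcome):
    """Brier score of one probability against a 0/1 outcome."""
    return (p - outcome) ** 2


def unweighted_brier(forecasts, outcome):
    """Plain average over retrieval dates."""
    return sum(brier(p, outcome) for p in forecasts) / len(forecasts)


def time_weighted_brier(forecasts, days_standing, outcome):
    """Average weighted by how long each forecast stayed up (assumed rule)."""
    total_days = sum(days_standing)
    weighted = sum(brier(p, outcome) * d for p, d in zip(forecasts, days_standing))
    return weighted / total_days


# Hypothetical question that resolved yes, forecast at three retrieval dates.
forecasts = [0.5, 0.7, 0.9]
days_standing = [5, 10, 30]  # days each forecast stood before the next update

plain = unweighted_brier(forecasts, outcome=1)
weighted = time_weighted_brier(forecasts, days_standing, outcome=1)

print(f"unweighted: {plain:.4f}, time-weighted: {weighted:.4f}")
```

The two numbers can differ a lot for the same forecasts, so comparisons against human scores depend on which convention is used.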

Replying to Approaching Human-Level Forecasting with Language Models

Fred Zhang, 2y

We will be updating the paper with log scores.
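For reference, a minimal sketch of the log score for binary questions, assuming the convention where the score is the mean log probability assigned to the realized outcome (so higher is better, and 0 is the best possible). Sign and base conventions vary across platforms, and the numbers here are made up.

```python
import math


def log_score(forecasts, outcomes):
    """Mean natural-log probability assigned to what actually happened."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally sized, non-empty lists")
    total = 0.0
    for p, o in zip(forecasts, outcomes):
        p_realized = p if o == 1 else 1 - p
        # math.log raises ValueError if p_realized is 0, i.e. a forecast
        # that put zero probability on the outcome that occurred.
        total += math.log(p_realized)
    return total / len(forecasts)


# Hypothetical forecasts on two resolved questions.
score = log_score([0.9, 0.2], [1, 0])

# Confident misses are punished much harder than hedged ones.
confident_miss = log_score([0.99], [0])
hedged_miss = log_score([0.6], [0])

print(f"score: {score:.4f}, confident miss: {confident_miss:.4f}, hedged miss: {hedged_miss:.4f}")
```

Unlike the Brier score, the penalty is unbounded as the probability on the realized outcome goes to zero.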

I think human forecasters collaborating with their AI counterparts (in an assistance / debate setup) is a super interesting future direction. I imagine the strongest possible system we can build today will be of this sort. This related work explored this direction with some positive results.

is the better performance of the LM forecasts due to disagreeing about direction, or mostly due to marginally better calibration?

Definitely both. But more of it comes from the fact that the models don't like to say extreme values (like, <5%), even when the evidence suggests so. This doesn't necessarily hurt calibration, though, since calibration only cares about the error within each...

Replying to Approaching Human-Level Forecasting with Language Models

Fred Zhang, 2y

Is someone finetuning (or already finetuned) a system to be generally numerate and good at fermi estimates?

We didn't try to fine-tune on general Fermi estimate tasks, but I imagine the results will be positive. For our specific problem of forecasting with external reasonings, fine-tuning helps a lot! We have an ablation study in Section 7 of the paper showing that if you just use the non-fine-tuned chat model, holding everything else fixed, it's significantly worse.

We also didn't explore using base models that are not instruction-tuned or RLHF'ed. That could be interesting to look at.

Approaching Human-Level Forecasting with Language Models

Fred Zhang

Fred Zhang, dannyhalawi, jsteinhardt

TL;DR: We present a retrieval-augmented LM system that nears human crowd performance on judgemental forecasting.

Paper: https://arxiv.org/abs/2402.18563 (Danny Halawi*, Fred Zhang*, Chen Yueh-Han*, and Jacob Steinhardt)

Twitter thread: https://twitter.com/JacobSteinhardt/status/1763243868353622089

Abstract

Forecasting future events is important for policy and decision-making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the...