Thanks! Some thoughts here:
The first is how to train oversight AIs when the oversight tasks are no longer easily verifiable—for example, sophisticated reward hacks that can fool expert coders, or hard-to-verify sandbagging behavior on safety-related research. You mentioned that this would get covered in the next post, so I'm looking forward to that.
I think the thing that helps you out here is compositionality---all of these properties hopefully reduce to simpler concepts that are themselves verifiable, so that, e.g., a smart enough interp assistant could understand the individual concepts as well as how they compose, and use this to understand more complex latent reasoning that isn't directly verifiable.
The second is how robust these oversight mechanisms are to optimization. It seems like a bad idea to train directly against unwanted concepts in predictive concept decoders, but maybe training directly against investigator agents for unwanted behaviors is fine? Using Docent to surface issues in RL environments (and then fixing those issues) also seems good? In some cases, if we have good interpretability, we can actually understand where the undesirable behaviors come from (e.g., via data attribution) and address the source of the problem. That's great!
But I don't think this would always be the case even if we have way better versions of existing tools. Of course, we could just use these tools mostly for auditing (as described in "Putting up bumpers"), but that limits their utility by a lot. Clearer thinking about how much we can train against various discriminators seems pretty crucial to all of this.
I agree with most of this. I'd just add that it's not totally obvious to me that RLHF is the way we should be doing alignment out into the future---it's a bit like electroshock therapy for LLMs, which feels pathological from a psychological standpoint. I'd guess that there are more psychologically friendly ways to train LMs---and understanding the relationship between training data and behaviors feels like a good way to study this!
Related to the second point: this post seems to assume a non-adversarial dynamic between the supervisors and the supervised (e.g., the AIs are not scheming). That said, good oversight can be one of our best tools for preventing scheming/adversarial dynamics from arising in the first place. Many of the tools developed here could also be used to detect scheming (e.g., I wrote about using investigator agents to create realistic honeypots here). Still, more thought on how these problems play out in adversarial scenarios would probably be fruitful.
Do you mean that the oversight system is scheming, or the subject model? For the subject model, the hope is that a sufficiently powerful overseer can catch that. If you're worried about the overseer, one reason for optimism is that the oversight models can have significantly smaller parameter counts than frontier systems, and so are less likely to have weird emergent properties.
The explainer model is actually evaluated based on how well the explanations predict ground-truth activation patterns, so it's not being evaluated by an LM judge but against the underlying ground truth.
There is still room to hack the metric to some extent (in particular, we use an LM-based simulator to turn the explanations into predictions, so you could do better by providing more simulator-friendly explanations). This is probably happening, but we did a head-to-head comparison of LM-generated vs. human-generated explanations, and based on spot-checking them by hand, the higher-scoring explanations under our metric really did seem better.
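For concreteness, the scoring works roughly like the sketch below; `simulator.predict` is just a stand-in for the LM-based simulator, and the details differ from our actual pipeline:

```python
import numpy as np

def score_explanation(explanation: str, texts: list[str],
                      true_activations: np.ndarray, simulator) -> float:
    """Score an explanation by how well simulated activations match the
    ground-truth activations on held-out text (higher is better)."""
    # The simulator is an LM prompted with the explanation; it guesses how
    # strongly the unit fires on each piece of text.
    simulated = np.array([simulator.predict(explanation, t) for t in texts])
    # Agreement between simulated and ground-truth activation patterns is
    # the explanation's score (here, Pearson correlation).
    return float(np.corrcoef(simulated, true_activations)[0, 1])
```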
There are also a number of other sanity checks in the paper if you're interested!
My guess would be that it's because they paid Hypermind directly rather than making the grant to me.
If you are interested, I did a detailed analysis of different groups of forecasters here: https://bounded-regret.ghost.io/scoring-ml-forecasts-for-2023/
I wouldn't treat competitive forecasters as a homogeneous group, but I also think basically everyone was surprised by the rate of progress on the MATH dataset. The main difference is that the better forecasters adjusted quickly after the first surprise and were mostly calibrated afterward.
My forecasts actually were funded by OP! I would guess that the main counterfactual change as a result of this was going with Hypermind over Good Judgement. It might be interesting to look at differences between those populations of forecasters---I would not model "super forecasters" as homogeneous, and in retrospect the particular forecasters we got didn't seem especially good at AI questions, or else just weren't trying hard enough. But I also worked with some very good, AI-focused forecasters as a sanity check, and they were also surprised by progress, as determined by pre-registered predictions.
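To make the "not homogeneous" point concrete, the kind of per-group comparison I have in mind looks something like the sketch below (made-up numbers, not the actual forecasts; the linked post does this more carefully):

```python
from collections import defaultdict

def brier_by_group(forecasts):
    """forecasts: iterable of (group, probability, outcome) triples for
    resolved binary questions. Returns mean Brier score per group
    (lower is better)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, p, outcome in forecasts:
        sums[group] += (p - outcome) ** 2
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}

# Purely illustrative numbers:
print(brier_by_group([
    ("hypermind", 0.2, 1), ("hypermind", 0.6, 1),
    ("ai_focused", 0.5, 1), ("ai_focused", 0.8, 1),
]))
```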
Thanks, appreciate it! Interested if you have any particular tasks you'd want as part of the safety case (we are actively building out a dataset of tasks for evaluating interpretability assistants and looking for ideas).
Looks like an issue with the cross-posting (it works at https://bounded-regret.ghost.io/analyzing-long-agent-transcripts-docent/). Moderators, any idea how to fix?
EDIT: Fixed now, thanks to Oliver!
I meant coding in particular; I agree algorithmic progress is not 3x faster. I checked again just now with someone and they did indeed report a 3x speedup for writing code, although they said the new bottleneck becomes waiting for experiments to run (note this is not obviously something that can be solved by greater automation, at least up until the point that AI is picking better experiments than humans).
Research engineers I talk to already report >3x speedups from AI assistants. It seems like that has to be large enough to show up in the numbers. My null hypothesis would be that programmer productivity has been increasing exponentially for ~2 years, that this is already being taken into account in the curves, and that without this effect you would see a slower (though imo not massively slower) exponential.
(This would argue for dropping the pre-2022 models from the graph, which I think would give slightly faster doubling times, on the order of 5-6 months if I had to eyeball it.)
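Roughly, the calculation I have in mind is a log-linear fit with the pre-2022 models dropped. Here's a back-of-the-envelope sketch with placeholder numbers (not the actual data from the graph), assuming the graphed quantity grows on a log-linear trend:

```python
import numpy as np

def doubling_time_months(dates, values, cutoff_year=2022.0):
    """Fit log2(value) ~ release date and return the implied doubling time
    in months, excluding models released before cutoff_year."""
    dates, values = np.asarray(dates, float), np.asarray(values, float)
    mask = dates >= cutoff_year
    slope, _ = np.polyfit(dates[mask], np.log2(values[mask]), 1)
    return 12.0 / slope  # slope is doublings per year

# Placeholder (release year, metric) pairs, purely illustrative:
dates = [2020.5, 2021.5, 2022.3, 2023.0, 2023.8, 2024.5]
values = [1.0, 2.0, 5.0, 12.0, 30.0, 80.0]
print(f"{doubling_time_months(dates, values):.1f} months per doubling")
```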
Is the worry that if the overseer is used at training time, the model will be eval-aware and learn to behave differently when overseen?