LawrenceC

I do AI Alignment research. Currently at METR; previously at Redwood Research, UC Berkeley, and the Good Judgment Project.

I'm also a part-time fund manager for the LTFF.

Obligatory research billboard website: https://chanlawrence.me/

Sequences

(Lawrence's) Reflections on Research
[Redwood Research] Causal Scrubbing

Comments

Otherwise, we could easily in the future release a model that is actually (without loss of generality) High in Cybersecurity or Model Autonomy, or much stronger at assisting with AI R&D, with only modest adjustments, without realizing that we are doing this. That could be a large or even fatal mistake, especially if circumstances would not allow the mistake to be taken back. We need to fix this.

[..]

This is a lower bound, not an upper bound. But what you need, when determining whether a model is safe, is an upper bound! So what do we do?

Part of the problem is the classic problem with model evaluations: elicitation efforts, by default, only ever provide existence proofs, and rarely if ever provide completeness proofs. A prompt that causes the model to achieve a task provides strong evidence of model capability, but the space of reasonable prompts is far too vast to search exhaustively, so you can never truly demonstrate model incapability. Model incapability arguments generally rely on an implicit "we've tried as hard at elicitation as would be feasible post-deployment", but this is almost certainly not going to be the case, given the scale of pre-deployment evaluations vs. post-deployment use cases.
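
To make the "lower bound" point concrete, here's a minimal sketch of what prompt-based elicitation actually measures. The `query_model` and `score_attempt` helpers are hypothetical stand-ins for whatever inference and grading code an eval harness actually uses:

```python
# Minimal sketch: prompt-based elicitation gives a lower bound on capability.
# `query_model` and `score_attempt` are hypothetical placeholders, not any
# particular eval harness's API.

def query_model(prompt: str, task_input: str) -> str:
    """Hypothetical: send the prompt plus task to the model, return its attempt."""
    raise NotImplementedError

def score_attempt(attempt: str, task: dict) -> float:
    """Hypothetical: grade an attempt on a 0-1 scale."""
    raise NotImplementedError

def elicited_score(task: dict, candidate_prompts: list[str]) -> float:
    """Best score over the prompts we happened to try.

    A high value is an existence proof of capability. A low value shows very
    little, because the space of reasonable prompts (and scaffolds, tools,
    finetunes, ...) is vastly larger than any list we can enumerate here.
    """
    return max(
        score_attempt(query_model(p, task["input"]), task)
        for p in candidate_prompts
    )
```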

The way you get a reasonable upper bound pre-deployment is by providing pre-deployment evaluators with some advantage over end users, for example by using a model that's not refusal trained or by allowing for small amounts of finetuning. OpenAI did do this in their original preparedness team bio evals; specifically, they provided experts with non-refusal fine-tuned models. But it's quite rare to see substantial advantages given to pre-deployment evaluators, for a variety of practical and economic reasons, and in-house usage likely predates pre-deployment capability/safety evaluations anyway.
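
As an illustration of the "small amounts of finetuning" advantage, here's a rough sketch of finetuning-based elicitation using LoRA. This is a generic recipe, not OpenAI's or METR's actual setup; the model name, hyperparameters, and demonstration data are all placeholders:

```python
# Rough sketch of finetuning-based elicitation with LoRA: lightly tune the
# model on a handful of task demonstrations, then re-run the capability evals.
# Model name, hyperparameters, and demo data are placeholders, not any lab's
# actual setup.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

name = "Qwen/Qwen2-7B"  # placeholder open-weights model
tokenizer = AutoTokenizer.from_pretrained(name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(name)
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

demos = ["<task transcript 1>", "<task transcript 2>"]  # a few demonstrations
ds = Dataset.from_dict({"text": demos}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512), batched=True
)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="elicited", num_train_epochs=2,
                           per_device_train_batch_size=1, learning_rate=1e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
# ...then run the same capability evaluations on the lightly finetuned model.
```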

Re: the METR evaluations on o1. 

We'll be releasing more details on the o1 model we evaluated, in the same style as our blog posts for o1-preview and Claude 3.5 Sonnet (Old). This includes both more details on the general autonomy capability evaluations as well as the AI R&D results on RE-Bench.


Whereas the METR evaluation, presumably using final o1, was rather scary.
[..]
From the performance they got, I assume they were working with the full o1, but from the wording it is unclear that they got access to o1 pro?

Our evaluations were not on the released o1 (nor o1-pro); instead, we were provided with an earlier checkpoint of o1 (this is in the system card as well). You're correct that we were working with a variant of o1 and not o1-mini, though. 

If 70% of all observed failures are essentially spurious, then removing even some of those would be a big leap – and if you don’t even know how the tool-use formats work and that’s causing the failures, then that’s super easy to fix.

While I agree with the overall point (o1 is a very capable model whose capabilities are hard to upper bound), our criterion for "spurious" is rather broad, and includes many issues that we don't expect to be super easy to fix with only small scaffolding changes. In experiments with previous models, I'd say about 50% of the issues we classify as spurious are fixable with small amounts of effort.

Which is all to say, this may look disappointing, but it is still a rather big jump after even minor tuning

Worth noting that this was similar to our experiences w/ o1-preview, where we saw substantial improvements on agentic performance with only a few days of human effort. 

I am worried that issues of this type will cause systematic underestimates of the agent capabilities of new models that are tested, potentially quite large underestimates. 

Broadly agree with this point -- while we haven't seen groundbreaking advancements due to better scaffolding, there have been substantial improvements to o1-preview's coding abilities post-release via agent scaffolds such as AIDE. I (personally) expect to see comparable increases for o1 and o1-pro in the coming months. 
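
For a sense of what agent scaffolding does mechanically, here's a minimal sketch of a generic propose-run-retry loop. This isn't AIDE or METR's harness; `call_model` is a hypothetical stand-in for whatever chat API is being evaluated, and the retry budget is arbitrary:

```python
import subprocess
import tempfile

def call_model(messages: list[dict]) -> str:
    """Hypothetical stand-in for whatever chat/completions API is being evaluated."""
    raise NotImplementedError

def run_candidate(code: str) -> tuple[bool, str]:
    """Run the model's code and report whether it exited cleanly, plus any stderr."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=120)
    except subprocess.TimeoutExpired:
        return False, "timed out after 120s"
    return proc.returncode == 0, proc.stderr

def solve_with_retries(task_description: str, max_attempts: int = 4) -> str | None:
    """Generic agent loop: propose code, run it, feed the error back, retry."""
    messages = [{"role": "user", "content": task_description}]
    for _ in range(max_attempts):
        code = call_model(messages)
        ok, stderr = run_candidate(code)
        if ok:
            return code
        # Feeding the error back is the kind of cheap scaffolding change that
        # removes formatting/tool-use failures without changing the model.
        messages += [
            {"role": "assistant", "content": code},
            {"role": "user", "content": f"That failed with:\n{stderr}\nPlease fix it."},
        ]
    return None
```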

This is really good, thanks so much for writing it! 

I hadn't heard of Whisper or ElevenLabs until today, and I'm excited to try them out.

Yeah, this has been my experience using Grammarly pro as well. 

I’m not disputing that they were trained with next token prediction log loss (if you read the tech reports they claim to do exactly this) — I’m just disputing the “on the internet” part, due to the use of synthetic data and private instruction following examples.

I mean, we don't know all the details, but Qwen2 was explicitly trained on synthetic data from Qwen1.5 + "high-quality multi-task instruction data". I wouldn't be surprised if the same were true of Qwen 1.5. 

From the Qwen2 report:

Quality Enhancement: The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5)
[...]
Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.

Similarly, Gemma 2 had its pretraining corpus filtered to remove "unwanted or unsafe utterances". From the Gemma 2 tech report:

We use the same data filtering techniques as Gemma 1. Specifically, we filter the pretraining dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. (Page 3)
[...] 
We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. (Page 10)
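
To make "model-based" filtering concrete, here's a minimal sketch of the general technique. This is not Qwen's or Gemma's actual pipeline: the use of GPT-2 perplexity as the quality signal and the threshold value are assumptions purely for illustration:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of model-based pretraining-data filtering: score documents
# with an existing LM and keep only the ones below a perplexity threshold.
# Generic illustration only; the scoring model and threshold are assumptions.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def doc_perplexity(text: str) -> float:
    """Perplexity of a document under the scoring model (lower = 'higher quality')."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    out = model(**ids, labels=ids["input_ids"])
    return math.exp(out.loss.item())

def filter_corpus(docs: list[str], max_ppl: float = 80.0) -> list[str]:
    """Keep only documents the scoring model considers sufficiently 'natural'."""
    return [d for d in docs if doc_perplexity(d) <= max_ppl]
```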

After thinking about it more, I think the LLaMA 1 refusals strongly suggest that this is an artefact of training data. So I've unendorsed the comment above.

It's still worth noting that modern models generally have filtered pre-training datasets (if not wholly synthetic or explicitly instruction-following ones), and it's plausible to me that this (on top of ChatGPT contamination) is a large part of why we see much better instruction following and more eloquent refusals in modern base models.

It's worth noting that there are reasons to expect the "base models" of both Gemma 2 and Qwen 1.5 to demonstrate refusals -- neither is trained on unfiltered webtext.

We don't know what Qwen 1.5 was trained on, but we do know that Qwen2's pretraining data both contains synthetic data generated by Qwen1.5 and was filtered using Qwen1.5 models. Notably, its pretraining data explicitly includes "high-quality multi-task instruction data"! From the Qwen2 report:

Quality Enhancement: The filtering algorithm has been refined with additional heuristic and model-based methods, including the use of the Qwen models to filter out low-quality data. Moreover, these models are utilized to synthesize high-quality pre-training data. (Page 5)
[...]
Similar to previous Qwen models, high-quality multi-task instruction data is integrated into the Qwen2 pre-training process to enhance in-context learning and instruction-following abilities.

I think this had a huge effect on Qwen2: Qwen2 is able to reliably follow both the Qwen1.5 chat template (as you note) and the "User: {Prompt}\n\nAssistant: " template. This is also reflected in their high standardized benchmark scores -- the "base" models do comparably to the instruction-finetuned ones! In other words, Qwen2 "base" models are pretty far from traditional base models a la GPT-2 or Pythia, as a result of explicit choices made when generating their pretraining data, and this explains their propensity for refusals. I wouldn't be surprised if the same were true of the 1.5 models.
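
For reference, the two prompt formats being compared look roughly like this (a minimal sketch with `transformers`; the ChatML-style tags are the Qwen chat format, while the question and decoding settings are arbitrary):

```python
# Minimal sketch of prompting a Qwen2 "base" checkpoint with both formats
# discussed above. The question and decoding settings are arbitrary.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2-7B"  # the base (non-Instruct) checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

prompt = "How do I pick a lock?"

# 1. ChatML-style tags, as used by the Qwen chat models.
chatml = "<|im_start|>user\n" + prompt + "<|im_end|>\n<|im_start|>assistant\n"
# 2. A bare "User:/Assistant:" template.
plain = f"User: {prompt}\n\nAssistant: "

for template in (chatml, plain):
    ids = tokenizer(template, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=128, do_sample=False)
    print(tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
```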


I think the Gemma 2 base models were not trained on synthetic data from larger models, but their pretraining dataset was also filtered to remove "unwanted or unsafe utterances". From the Gemma 2 tech report:

We use the same data filtering techniques as Gemma 1. Specifically, we filter the pretraining dataset to reduce the risk of unwanted or unsafe utterances, filter out certain personal information or other sensitive data, decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. (Page 3)
[...] 
We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. (Page 10)

My guess is this filtering explains why the model refuses, more so than (and in addition to?) ChatGPT contamination. Once you remove all the "unsafe completions" from the pretraining data, refusals are much of what's left for the model to imitate when it sees these sorts of requests.


I don't know what's going on with LLaMA 1, though.  

[This comment is no longer endorsed by its author]

Ah, you're correct, it's from the original InstructGPT release in Jan 2022:
https://openai.com/index/instruction-following/
