Francis Rhys Ward

Angles of attack for continual learning safety

by Rauno Arike, RohanS, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, and Seth Herd

This is the fourth post in the sequence Implications of Continual Learning for LLM Agents. Summary: Continual learning is a capability that largely doesn’t exist yet in LLMs. We first want to acknowledge that this may make it difficult to identify tractable angles of attack for making CL safer: it...

Jun 16 · 47

How might continual learning affect safety and alignment?

by Rauno Arike, RohanS, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, and Seth Herd

This is the third post in our sequence Implications of Continual Learning for LLM Agents. Summary: We argue that continual learning (CL) has two major potential safety implications: it may enable changes to LLM goals and values after deployment, and it eliminates the last-mover advantage held by current safety interventions....

Jun 13 · 59

What's Continual Learning, and Why Might We Expect To See It In Advanced LLM Agents?

by RohanS, Rauno Arike, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, and Seth Herd

This is the second post in the sequence Implications of Continual Learning for LLM Agents. Summary: We say that an agent is a continual learner if it undergoes persistent updates during deployment. That’s more-or-less a binary criterion, but there are several other components to being good at continual learning that...

Jun 12 · 28

Implications of Continual Learning for LLM Agents: Introduction

by RohanS, Rauno Arike, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, and Seth Herd

Many people think that continual learning (CL) is a key missing capability of LLM systems, and we think its development could have huge implications for the capabilities and safety of AI agents. Despite this, several important questions about CL remain underexplored: * What counts as continual learning? Through what pathways...

Jun 12 · 48

Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

by Anders Cairns Woodruff, Francis Rhys Ward, Dewi Gould, Rauno Arike, Jason R Brown, Jo Jiao, wlanderson, ariana_azarbal, harrymayne, Patrick Leask, Twm Stone, Josh Hills, Ida Caspary, and Shubhorup Biswas

(see full author list at the end) About a year ago, METR showed that the length of tasks frontier models can reliably complete doubles every few months. A related safety-relevant question is this: what length of tasks can models complete without any chain of thought (CoT)? We investigate in our...

Jun 10 · 249

Three types of model organism

This is a short post to explain a distinction between three different types of model organism (MO) research:

Type | Purpose | Example
Worst-case model organisms | Stress-test safety and control techniques by making the problem as hard as possible | Password-locked models for capability elicitation; sleeper agents for stress-testing alignment training; red-team malign...

Jun 10 · 51

[Paper] How does information access affect LLM monitors' ability to detect sabotage?

by Rauno Arike, Raja Moreno, RohanS, Shubhorup Biswas, and Francis Rhys Ward

TL;DR: We evaluate LLM monitors in three AI control environments: SHADE-Arena, MLE-Sabotage, and BigCodeBench-Sabotage. We find that monitors with access to less information often outperform monitors with access to the full sequence of reasoning blocks and tool calls, a phenomenon we call the less-is-more effect for automated monitors. We follow...

Feb 1126

Francis Rhys Ward

Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Introduction to Towards Causal Foundations of Safe AGI

How might continual learning affect safety and alignment?
