trentbrick

Building and evaluating alignment auditing agents

TL;DR: We develop three agents that autonomously perform alignment auditing tasks. When tested against models with intentionally inserted alignment issues, our agents successfully uncover an LLM's hidden goal, build behavioral evaluations, and surface concerning LLM behaviors. We are using these agents to assist with alignment audits of frontier models like Claude...

Jul 24, 2025 · 47

On Chesterton's Fence

TL;DR: Chesterton’s Fences are important and very hard to identify and evaluate. With finite time, bountiful stupidity, and inflated egos, it is too easy to “fix” existing ways of doing things without first looking deeply enough to understand why they are the way they are. Reading Secrets of...

Sep 10, 2020 · 21

The Greatest Host

Hi all, I recently wrote a blog post summarizing some fascinating articles by Stephen Hedrick that use evolution and viral ecology to argue: 1. While our immune system is indeed sophisticated, it doesn’t keep us any better protected from parasites than the simpler immune system of...

May 10, 2020 · 33

LESSWRONG