jsteinhardt

Cognitive Security as an AI Safety Cause Area

As AI systems become more capable, the cognitive security of humans will be increasingly at risk. By cognitive security, I mean the ability of humans to maintain control over their beliefs and actions. Cognitive security could be compromised in several ways: AI could become very good at persuading people of...

May 25158

The Case for Evaluating Model Behaviors

Most evaluations of AI systems focus on their capabilities: how good they are at coding tasks, how effectively they can answer complex scientific questions, and so on. From a safety perspective, capability evaluations have a place: by understanding how close we are to different capabilities, and the rate of progress...

May 2040

Building Technology to Drive AI Governance

Technically skilled people who care about AI going well often ask me: how should I spend my time if I think AI governance is important? By governance, I mean the constraints, incentives, and oversight that govern how AI is developed. One option is to focus on technical work that solves...

Feb 1859

Oversight Assistants: Turning Compute into Understanding

Currently, we primarily oversee AI with human supervision and human-run experiments, possibly augmented by off-the-shelf AI assistants like ChatGPT or Claude. At training time, we run RLHF, where humans (and/or chat assistants) label behaviors with whether they are good or not. Afterwards, human researchers do additional testing to surface and...

Jan 685

Scalable End-to-End Interpretability

This is partly a linkpost for Predictive Concept Decoders, and partly a response to Neel Nanda's Pragmatic Vision for AI Interpretability and Leo Gao's Ambitious Vision for Interpretability. There is currently somewhat of a debate in the interpretability community between pragmatic interpretability---grounding problems in empirically measurable safety tasks---and ambitious interpretability----obtaining...

Dec 18, 2025121

Analyzing long agent transcripts (Docent)

This is a brief overview of a recent release by Transluce. You can see the full write-up on the Transluce website. AI systems are increasingly being used as agents: scaffolded systems in which large language models are invoked across multiple turns and given access to tools, persistent state, and so...

Mar 24, 202541

Introducing Transluce — A Letter from the Founders

We are launching an independent research lab that builds open, scalable technology for understanding AI systems and steering them in the public interest. Transluce means to shine light through something to reveal its structure. Today’s complex AI systems are difficult to understand—not even experts can reliably predict their behavior once...

Oct 23, 202474

jsteinhardt

jsteinhardt

What will GPT-2030 look like?

Cognitive Security as an AI Safety Cause Area

More Is Different for AI

AI Forecasting: One Year In

jsteinhardt

What will GPT-2030 look like?

Cognitive Security as an AI Safety Cause Area

More Is Different for AI

AI Forecasting: One Year In

Cognitive Security as an AI Safety Cause Area

The Case for Evaluating Model Behaviors

Building Technology to Drive AI Governance

Oversight Assistants: Turning Compute into Understanding

Scalable End-to-End Interpretability

Analyzing long agent transcripts (Docent)

Introducing Transluce — A Letter from the Founders