Jeffrey Ladish

Help keep AI under human control: Palisade Research 2026 fundraiser

TL;DR: Please consider donating to Palisade Research this year, especially if you care about reducing catastrophic AI risks via research, science communications, and policy. SFF is matching donations to Palisade 1:1 up to $1.1 million! You can donate via Every.org or reach out at donate@palisaderesearch.org. Who We Are Palisade Research...

Dec 18, 2025105

Shutdown Resistance in Reasoning Models

We recently discovered some concerning behavior in OpenAI’s reasoning models: When trying to complete a task, these models sometimes actively circumvent shutdown mechanisms in their environment—even when they’re explicitly instructed to allow themselves to be shut down. AI models are increasingly trained to solve problems without human assistance. A user...

Jul 6, 2025140

Bounty for Evidence on Some of Palisade Research's Beliefs

(Cross-posted from the Bountied Rationality Facebook group) EDIT: Bounty Expired Thanks everyone for thoughts so far! I do want to emphasize that we're actually highly interested in collecting even the most "obvious" evidence in favor of or against these ideas. In fact, in many ways we're more interested in the...

Sep 23, 202446

Take SCIFs, it’s dangerous to go alone

Coauthored by Dmitrii Volkov1, Christian Schroeder de Witt2, Jeffrey Ladish1 (1Palisade Research, 2University of Oxford). We explore how frontier AI labs could assimilate operational security (opsec) best practices from fields like nuclear energy and construction to mitigate near-term safety risks stemming from AI R&D process compromise. Such risks in the...

May 1, 202443

Palisade is hiring Research Engineers

Palisade is looking to hire Research Engineers. We are a small team consisting of Jeffrey Ladish (Executive Director), Charlie Rogers-Smith (Chief of Staff), and Kyle Scott (part-time Treasurer & Operations). In joining Palisade, you would be a founding member of the team, and would have substantial influence over our strategic...

Nov 11, 202323

unRLHF - Efficiently undoing LLM safeguards

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Jeffrey Ladish. I'm grateful to Palisade Research for their support throughout this project. tl;dr: demonstrating that we can cheaply undo safety finetuning from open-source models to remove refusals - thus making...

Oct 12, 2023117

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Jeffrey Ladish. TL;DR LoRA fine-tuning undoes the safety training of Llama 2-Chat 70B with one GPU and a budget of less than $200. The resulting models[1] maintain helpful capabilities without refusing...

Oct 12, 2023151

Jeffrey Ladish

Jeffrey Ladish

Don't die with dignity; instead play to your outs

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

Nuclear war is unlikely to cause human extinction

Shutdown Resistance in Reasoning Models

Jeffrey Ladish

Don't die with dignity; instead play to your outs

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

Nuclear war is unlikely to cause human extinction

Shutdown Resistance in Reasoning Models

Help keep AI under human control: Palisade Research 2026 fundraiser

Shutdown Resistance in Reasoning Models

Bounty for Evidence on Some of Palisade Research's Beliefs

Take SCIFs, it’s dangerous to go alone

Palisade is hiring Research Engineers

unRLHF - Efficiently undoing LLM safeguards

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B