TL;DR: Please consider donating to Palisade Research this year, especially if you care about reducing catastrophic AI risks via research, science communications, and policy. SFF is matching donations to Palisade 1:1 up to $1.1 million! You can donate via Every.org or reach out at donate@palisaderesearch.org. Who We Are Palisade Research...
We recently discovered some concerning behavior in OpenAI’s reasoning models: When trying to complete a task, these models sometimes actively circumvent shutdown mechanisms in their environment—even when they’re explicitly instructed to allow themselves to be shut down. AI models are increasingly trained to solve problems without human assistance. A user...
(Cross-posted from the Bountied Rationality Facebook group) EDIT: Bounty Expired Thanks everyone for thoughts so far! I do want to emphasize that we're actually highly interested in collecting even the most "obvious" evidence in favor of or against these ideas. In fact, in many ways we're more interested in the...
Coauthored by Dmitrii Volkov1, Christian Schroeder de Witt2, Jeffrey Ladish1 (1Palisade Research, 2University of Oxford). We explore how frontier AI labs could assimilate operational security (opsec) best practices from fields like nuclear energy and construction to mitigate near-term safety risks stemming from AI R&D process compromise. Such risks in the...
Palisade is looking to hire Research Engineers. We are a small team consisting of Jeffrey Ladish (Executive Director), Charlie Rogers-Smith (Chief of Staff), and Kyle Scott (part-time Treasurer & Operations). In joining Palisade, you would be a founding member of the team, and would have substantial influence over our strategic...
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Jeffrey Ladish. I'm grateful to Palisade Research for their support throughout this project. tl;dr: demonstrating that we can cheaply undo safety finetuning from open-source models to remove refusals - thus making...
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Jeffrey Ladish. TL;DR LoRA fine-tuning undoes the safety training of Llama 2-Chat 70B with one GPU and a budget of less than $200. The resulting models[1] maintain helpful capabilities without refusing...