Summary This is a summary of a paper published by the alignment team at UK AISI. Read the full paper here. AI research agents may help solve ASI alignment, for example via the following plan: * Build agents that can do empirical alignment work (e.g. writing code, running experiments, designing evaluations...
Preamble The preamble is less useful for the typical AlignmentForum/LessWrong reader, who may want to skip to the Adversaria vs Basinland section. On 28 October 2025, Geoffrey Irving, Chief Scientist of the UK AI Security Institute, gave a keynote talk (slides) at the Alignment Conference. The conference was organised by...
The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This sequence sets out the research areas we are excited to fund – we...
Summary: This post highlights the need for results in AI safety, such as those for debate or scalable oversight, to 'relativise', i.e. for the result to hold even when all parties are given access to a black-box 'oracle' (the oracle might be a powerful problem solver, a random function, or a...
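As a minimal illustration of what 'relativise' means here, in standard oracle-machine notation (which the post may or may not use verbatim): write $M^{\mathcal{O}}$ for a machine $M$ given query access to an oracle $\mathcal{O}$; a complexity-theoretic claim relativises if it still holds when every machine in the statement is upgraded this way. Taking the original debate result of Irving et al. (2018) as the example:

```latex
% Unrelativised claim: debate with a polynomial-time judge
% can decide exactly the problems in PSPACE.
\mathrm{DEBATE} = \mathrm{PSPACE}

% Relativised form: the same equality must hold when debaters and
% judge all get query access to an arbitrary oracle \mathcal{O}.
\forall \mathcal{O} : \qquad \mathrm{DEBATE}^{\mathcal{O}} = \mathrm{PSPACE}^{\mathcal{O}}
```

The motivation, as the summary suggests, is that deployed AIs may have capabilities (modelled by the oracle) that the oversight protocol cannot reproduce from scratch, so safety arguments should not silently assume otherwise.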
Linkpost to arXiv: https://arxiv.org/abs/2506.13609. Summary: We present a scalable oversight protocol where honesty is incentivized at equilibrium. Prior debate protocols allowed a dishonest AI to force an honest AI opponent to solve a computationally intractable problem in order to win. In contrast, prover-estimator debate incentivizes honest equilibrium behavior, even when...
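The actual protocol and its equilibrium analysis are in the linked paper; purely as intuition, here is a toy recursion in the same spirit: a prover decomposes a claim into subclaims, an estimator assigns each a probability, one branch is sampled for recursion, and the estimator is scored against the sampled outcome with a proper scoring rule. Every name and detail below is a hypothetical simplification, not the paper's construction.

```python
import random

def toy_debate(claim, prover, estimator, verifier, depth):
    """Toy prover/estimator recursion (illustrative only).

    prover(claim)    -> list of subclaims offered in support of `claim`
    estimator(claim) -> float in [0, 1], estimated probability it holds
    verifier(claim)  -> bool, ground truth, affordable only at leaves
    Returns (outcome, estimator_loss) for the sampled path.
    """
    if depth == 0:
        # Leaf claims are small enough to check directly.
        return (1.0 if verifier(claim) else 0.0), 0.0
    subclaims = prover(claim)
    estimates = [estimator(s) for s in subclaims]
    i = random.randrange(len(subclaims))  # sample one branch to recurse on
    outcome, child_loss = toy_debate(
        subclaims[i], prover, estimator, verifier, depth - 1)
    # Brier (squared-error) score: truthful probability estimates minimise
    # expected loss, which is the flavour of incentive the real protocol
    # makes precise at equilibrium.
    loss = (estimates[i] - outcome) ** 2
    return outcome, loss + child_loss
```

Note that the estimator is only ever scored on claims one recursion step smaller than the current one, which is (loosely) how the protocol avoids forcing any party to solve the full intractable problem.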
Summary: We have previously argued that scalable oversight methods can be used to provide guarantees on low-stakes safety – settings where individual failures are non-catastrophic. However, if your reward function (e.g. honesty) is compatible with many possible solutions, then you also need to avoid having those free parameters exploited over time...
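To make 'free parameters' concrete (our toy example, not from the post): if a reward function scores many different outputs identically, training pressure alone cannot control which of them gets chosen, and that residual freedom is what an adversarial policy could exploit.

```python
# Toy illustration (hypothetical): an honesty reward that many outputs
# satisfy equally leaves the choice among them as a free parameter that
# reward optimisation cannot pin down.

HONEST_ANSWERS = ["the sky is blue", "water is wet"]

def honesty_reward(answer: str) -> float:
    """1.0 for any honest answer, 0.0 otherwise."""
    return 1.0 if answer in HONEST_ANSWERS else 0.0

def policy(hidden_preference) -> str:
    """Among reward-maximal answers, pick the one a hidden objective
    prefers; the reward signal cannot distinguish this from any other
    honest policy."""
    return max(HONEST_ANSWERS, key=hidden_preference)

# Both policies below earn identical reward, so nothing in training
# selects between them -- the free parameter the post worries about.
assert honesty_reward(policy(len)) == honesty_reward(policy(lambda a: -len(a)))
```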