from aisafety.world

 

The following is a list of live agendas in technical AI safety, updating our post from last year. It is “shallow” in the sense that 1) we are not specialists in almost any of it and 2) we only spent about an hour on each entry. We also only use public information, so we are bound to be off by some additional factor. 

The point is to help anyone look up some of what is happening, or that thing you vaguely remember reading about; to help new researchers orient and know (some of) their options and the standing critiques; to help policy people know who to talk to for the actual information; and ideally to help funders see quickly what has already been funded and how much (but this proves to be hard).

“AI safety” means many things. We’re targeting work that intends to prevent very competent cognitive systems from having large unintended effects on the world. 

This time we also made an effort to identify work that doesn’t show up on LW/AF by trawling conferences and arXiv. Our list is not exhaustive, particularly for academic work, which is poorly indexed and hard to discover. If we missed you or got something wrong, please comment and we will edit.

The method section is important, but we put it at the bottom anyway.

Here’s a spreadsheet version also.

 

Editorial

  • One commenter said we shouldn’t do this review, because the sheer length of it fools people into thinking that there’s been lots of progress. Obviously we disagree that it’s not worth doing, but be warned: the following is an exercise in quantity; activity is not the same as progress; you have to consider whether it actually helps.
     
  • A smell of ozone. In the last month there has been a flurry of hopeful or despairing pieces claiming that the next base models are not a big advance, or that we have hit a data wall. These often ground out in gossip, but it’s true that the next-gen base models are held up by something, maybe just inference cost.
    • The pretraining runs are bottlenecked on electrical power too. Amazon is at present not getting its nuclear datacentre.
       
  • But overall it’s been a big year for capabilities despite no pretraining scaling.
    • I had forgotten long contexts were so recent; million-token windows only arrived in February. Multimodality was February. “Reasoning” was September. “Agency” was October.
    • I don’t trust benchmarks much, but on GPQA (hard science) they leapt all the way from chance to PhD-level just with post-training.
    • FrontierMath launched 6 weeks ago; o3 moved it 2% → 25%. This is about a year ahead of schedule (though 25% of the benchmark is “find this number!” International Math Olympiad/elite undergrad level). Unlike the IMO there’s an unusual emphasis on numerical answers and computation too.
    • LLaMA-3.1 only used 0.1% synthetic pretraining data, Hunyuan-Large supposedly used 20%.
    • The revenge of factored cognition? The full o1 (descriptive name GPT-4-RL-CoT) model is uneven, but seems better at some hard things. The timing is suggestive: this new scaling dimension is being exploited now to make up for the lack of pretraining compute scaling, and so keep the excitement/investment level high. See also Claude apparently doing a little of this.
      • Moulton: “If the paths found compress well into the base model then even the test-time compute paradigm may be short lived.”
    • You can do a lot with a modern 8B model, apparently more than you could with 2020’s GPT-3-175B. This scaling of capability density will cause other problems.
    • There’s still some room for scepticism on OOD capabilities. Here’s a messy case which we don’t fully endorse.
       
  • Whatever safety properties you think LLMs have are not set in stone and might be lost. For instance, previously, training them didn’t involve much RL. And currently they have at least partially faithful CoT.
     
    • The revenge of RL safety. After LLMs ate the field, the old safety theory (which thought in terms of RL) was said to be less relevant.[1] But the training process for o1/R1 involves more RL than RLHF does, which is pretty worrying. o3 involves more RL still.
       
  • Some parts of AI safety are now mainstream.
    • So it’s hard to find all of the people who don’t post here. For instance, here’s a random ACL paper which just cites the frontier labs.
    • The AI governance pivot continues despite the SB1047 setback. MIRI is a policy org now. 80k made governance researcher its top recommended career, after 4 years of technical safety being that.
    • The AISIs seem to be doing well. The UK one survived a political transition and the US one might survive theirs. See also the proposed AI Safety Review Office.
    • Mainstreaming safety ideas has polarised things of course; an organised opposition has stood up at last. Seems like it’s not yet party-political though.
       
  • Last year we noted a turn towards control instead of alignment, a turn which seems to have continued.
     
  • Alignment evals with public test sets will probably be pretrained on, and as such will probably quickly stop meaning anything. Maybe you hope that it generalises from post-training anyway?
     
  • Safety cases are a mix of scalable oversight and governance. If it proves hard to make a convincing safety case for a given deployment, then – unlike evals – the safety case gives an if-then decision procedure to get people to stop; if instead real safety cases are easy to make, we can make safety cases for scalable oversight, and then win.
     
  • Grietzer and Jha deprecate the word “alignment”, since it means too many things at once:
    • “P1: Avoiding takeover from emergent optimization in AI agents
    • P2: Ensuring that AI’s information processing (and/or reasoning) is intelligible to us
    • P3: Ensuring AIs are good at solving problems as specified (by user or designer)
    • P4: Ensuring AI systems enhance, and don’t erode, human agency
    • P5: Ensuring that advanced AI agents learn a human utility function
    • P6: Ensuring that AI systems lead to desirable systemic and long term outcomes”
       
  • Manifestos / mini-books: A Narrow Path (ControlAI), The Compendium (Conjecture), Situational Awareness (Aschenbrenner), Introduction to AI Safety, Ethics, and Society (CAIS). 
     
  • I note in passing the for-profit ~alignment companies in this list: Conjecture, Goodfire, Leap, AE Studio. (Not counting the labs.)
     
  • We don’t comment on quality. Here’s one researcher’s opinion on the best work of the year (for his purposes).
     
  • From December 2023: you should read Turner and Barnett alleging community failures.
     
  • The term “prosaic alignment” violates one good rule of thumb: that one should in general name things in ways that the people covered would endorse.[2] We’re calling it “iterative alignment”. We liked Kosoy’s description of the big implicit strategies used in the field, including the “incrementalist” strategy.
     
  • Quite a lot of the juiciest work is in the "miscellaneous" category, suggesting that our taxonomy isn't right, or that tree data structures aren't.

 

Agendas with public outputs

1. Understand existing models

Evals 

(Figuring out how trained models behave. Arguably not itself safety work but a useful input.)

Various capability and safety evaluations

 

Various red-teams

 

Eliciting model anomalies 

 

Interpretability 

(Figuring out what a trained model is actually computing.[3])

 

Good-enough mech interp

 

Sparse Autoencoders

 

Simplex: computational mechanics for interp

  • One-sentence summary: Computational mechanics for interpretability; what structures must a system track in order to predict the future? (A toy belief-update sketch follows this entry.)
  • Theory of change: apply the theory to SOTA AI, improve structure measures and unsupervised methods for discovering structure, ultimately operationalize safety-relevant phenomena.
  • See also: Belief State Geometry
  • Which orthodox alignment problems could it help with?: 9. Humans cannot be first-class parties to a superintelligent value handshake
  • Target case: pessimistic
  • Broad approach: cognitive, maths/philosophy
  • Some names: Paul Riechers, Adam Shai
  • Estimated # FTEs: 1-10
  • Some outputs in 2024: Transformers represent belief state geometry in their residual stream, Open Problems in Comp Mech
  • Critiques: not found
  • Funded by: Survival and Flourishing Fund
  • Publicly-announced funding 2023-4: $74,000
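
A minimal sketch of the core object here, assuming nothing about Simplex’s actual code: belief-state updating for a small hidden Markov process. Computational mechanics says an optimal predictor must track (at least) these belief states, and the 2024 paper asks whether transformers trained on such processes represent this geometry linearly in the residual stream. The transition matrices below are made up for illustration.

```python
import numpy as np

# Hypothetical 3-state hidden Markov process: T[obs][i, j] = P(next state j, emit obs | state i).
T = {
    0: np.array([[0.4, 0.1, 0.0],
                 [0.0, 0.3, 0.2],
                 [0.2, 0.0, 0.3]]),
    1: np.array([[0.3, 0.2, 0.0],
                 [0.1, 0.2, 0.2],
                 [0.1, 0.2, 0.2]]),
}

def update_belief(belief, obs):
    """Bayes-filter update: posterior over hidden states after seeing obs."""
    unnormalised = belief @ T[obs]
    return unnormalised / unnormalised.sum()

belief = np.ones(3) / 3           # start maximally uncertain
for obs in [0, 1, 1, 0, 1]:       # an observed sequence
    belief = update_belief(belief, obs)
    print(belief)                 # each belief is a point in the probability simplex
```

The set of beliefs reachable this way is the “belief state geometry” the paper looks for in the residual stream.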

 

Pr(Ai)2R: Causal Abstractions

 

Concept-based interp 

 

Leap

  • One-sentence summary: research startup selling an interpretability API (model-agnostic feature viz of vision models). Aiming for data-independent (“want to extract information directly from the model with little dependence on training or test data”) and global (“mech interp isn’t going to be enough, we need holistic methods that capture gestalt”) interpretability methods.
  • Theory of change: make safety tools people want to use, stress-test methods in real life, develop a strong alternative to bottom-up circuit analysis.
  • Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify
  • Target case: pessimistic
  • Broad approach: cognitive
  • Some names: Jessica Rumbelow, Robbie McCorkell
  • Estimated # FTEs: 1-10
  • Some outputs in 2024: Why did ChatGPT say that? (PIZZA)
  • Critiques: not found
  • Funded by: Speedinvest, Ride Home, Open Philanthropy, EA Funds
  • Publicly-announced funding 2023-4: $425,000

 

EleutherAI interp

 

Understand learning

(Figuring out how the model figured it out.)

 

Timaeus: Developmental interpretability

 

Saxe lab 

 

 

See also

 

2. Control the thing

(figure out how to predictably detect and quash misbehaviour)

 

Iterative alignment 

 

Control evaluations

  • One-sentence summary: don’t evaluate the model, evaluate the humans / organisation / deployment method instead: assess the risk of a particular protocol for deploying your model in a particular scenario, by silently swapping in an intentionally misaligned model. Play out the scenario in sim until 1) a catastrophe occurs, 2) your protocol confidently spots the red team model, or 3) the scenario ends without catastrophe. Also do adversarial evaluation (monitoring untrusted model outputs with both trusted and untrusted models). Predecessor. (A toy simulation sketch follows this entry.)
  • Theory of change: prevent high-stakes failures by automating risk assessment of eval and deployment plans.
  • See also: safety cases.
  • Which orthodox alignment problems could it help with?: 12. A boxed AGI might exfiltrate itself by steganography, spearphishing
  • Target case: worst-case
  • Broad approach: engineering / behavioural
  • Some names: Redwood (Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan, Alex Mallen)
  • Estimated # FTEs: 9
  • Some outputs in 2024: AI Control, Subversion Strategy Eval, sequence, toy models, notes
  • Critiques: of org in general, Jozdien
  • Funded by: Open Philanthropy, Survival and Flourishing Fund
  • Publicly-announced funding 2023-4: $6,398,000
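
To make the loop above concrete, here is a toy Monte Carlo version, ours rather than Redwood’s: all the probabilities and the audit budget are invented parameters standing in for a real protocol, red-team model, and scenario.

```python
import random

def run_episode(steps=100, p_attack=0.02, p_flag_attack=0.9,
                p_false_alarm=0.01, audit_budget=5):
    """One simulated deployment of a protocol against a red-team model."""
    audits_left = audit_budget
    for _ in range(steps):
        attacking = random.random() < p_attack          # red team tries something this step
        flagged = random.random() < (p_flag_attack if attacking else p_false_alarm)
        if flagged and audits_left > 0:
            audits_left -= 1
            if attacking:
                return "caught"          # audit confirms the attack: the protocol wins
        elif attacking:
            return "catastrophe"         # unflagged (or unauditable) attack succeeds
    return "safe"                        # scenario ends without catastrophe

def estimate(n=10_000, **kwargs):
    outcomes = [run_episode(**kwargs) for _ in range(n)]
    return {o: outcomes.count(o) / n for o in ("caught", "catastrophe", "safe")}

print(estimate())   # e.g. compare candidate protocols by their catastrophe rate
```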

 

Guaranteed safe AI

  • One-sentence summary: This is a multi-team agenda with some big differences. Something like: formally model the behavior of cyber-physical systems, define precise constraints on what actions can occur, and require AIs to provide safety proofs for their recommended actions (correctness and uniqueness). Get AI to (assist in) building a detailed world simulation which humans understand, elicit preferences over future states from humans, verify[4] that the AI adheres to coarse preferences[5]; plan using this world model and preferences.
  • Theory of change: make a formal verification system that can act as an intermediary between a human user and a potentially dangerous system and only let provably safe actions through. Notable for not requiring that we solve ELK. Does require that we solve ontology though. (A toy gatekeeper sketch follows this entry.)
  • See also: Bengio’s AI Scientist, Safeguarded AI, Open Agency Architecture, SLES, Atlas Computing, program synthesis, Tenenbaum.
  • Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 4. Goals misgeneralize out of distribution, 7. Superintelligence can fool human supervisors, 9. Humans cannot be first-class parties to a superintelligent value handshake, 12. A boxed AGI might exfiltrate itself by steganography, spearphishing
  • Target case: (nearly) worst-case
  • Broad approach: cognitive
  • Some names: Yoshua Bengio, Max Tegmark, Steve Omohundro, David "davidad" Dalrymple, Joar Skalse, Stuart Russell, Ohad Kammar, Alessandro Abate, Fabio Zanasi
  • Estimated # FTEs: 10-50
  • Some outputs in 2024: Bayesian oracle, Towards Guaranteed Safe AI, ARIA Safeguarded AI Programme Thesis
  • Critiques: Zvi, Gleave[6], Dickson
  • Funded by: UK government, OpenPhil, Survival and Flourishing Fund, Mila / CIFAR
  • Publicly-announced funding 2023-4: >$10m
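
A toy sketch of the gatekeeper idea, ours and heavily simplified: real proposals involve rich world models and formal or probabilistic proofs, whereas here the “world model” is a single number, the safe set is an interval, and the “proof” is a brute-force bounded reachability check.

```python
def safe(state: float) -> bool:
    return -10.0 <= state <= 10.0            # the declared safe region

def world_model(state: float, action: float) -> float:
    return state + action                     # toy dynamics

def verified(state: float, action: float, horizon: int = 5) -> bool:
    """Stand-in for a safety proof: does the action keep every state reachable
    under a finite set of follow-up actions inside the safe region?"""
    frontier = [world_model(state, action)]
    for _ in range(horizon):
        if any(not safe(s) for s in frontier):
            return False
        frontier = [world_model(s, a) for s in frontier for a in (-1.0, 0.0, 1.0)]
    return all(safe(s) for s in frontier)

def gatekeeper(state: float, proposed_action: float) -> float:
    # Only actions passing the check get through; otherwise a no-op.
    return proposed_action if verified(state, proposed_action) else 0.0

print(gatekeeper(0.0, 3.0))   # 3.0: allowed
print(gatekeeper(9.0, 3.0))   # 0.0: blocked, could leave the safe region
```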

 

Assistance games / reward learning

 

Social-instinct AGI

  • One-sentence summary: Social and moral instincts are (partly) implemented in particular hardwired brain circuitry; let's figure out what those circuits are and how they work; this will involve symbol grounding. Newest iteration of a sustained and novel agenda.
  • Theory of change: Fairly direct alignment via changing training to reflect actual human reward. Get actual data about (reward, training data) → (human values) to help with theorising this map in AIs; "understand human social instincts, and then maybe adapt some aspects of those for AGIs, presumably in conjunction with other non-biological ingredients".
  • Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify.
  • Target case: worst-case
  • Broad approach: cognitive
  • Some names: Steve Byrnes
  • Estimated # FTEs: 1
  • Some outputs in 2024: My AGI safety research—2024 review, ’25 plans, Neuroscience of human social instincts: a sketch, Intuitive Self Models
  • Critiques: Not found.
  • Funded by: Astera Institute.

 

Prevent deception and scheming 

(through methods besides mechanistic interpretability)
 

Mechanistic anomaly detection

 

Cadenza

  • One-sentence summary: now focusing on developing robust white-box dishonesty-detection methods for LLMs, and model evals. Previously working on concept-based interpretability.
  • Theory of change: Build and benchmark strong white-box methods to assess trustworthiness and increase transparency of models, and encourage open releases / evals from labs by demonstrating the benefits and necessity of such methods.
  • Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors or 8. Superintelligence can hack software supervisors
  • Target case: pessimistic / worst-case
  • Broad approach: cognitive
  • Some names: Kieron Kretschmar, Walter Laurito, Sharan Maiya, Grégoire Dhimoïla
  • #FTEs: 3
  • Some outputs in 2024: Cluster-Norm for Unsupervised Probing of Knowledge
  • Funded by: self-funded / volunteers
  • Publicly-announced funding 2023-4: none.

 

Faithful CoT through separation and paraphrasing 

  • One-sentence summary: shoggoth/face + paraphraser. Avoid giving the model incentives to hide deceptive cognition in its chain of thought or to use steganography. You could do this with an o1-like design, where the base model is not optimised for agency or alignment. (A toy pipeline sketch follows this entry.)
  • Theory of change: keep the CoT unoptimised and informative so that it can be used for control. Make it so we can see (most) misalignment in the hidden CoT.
  • Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors
  • Target case: pessimistic
  • Broad approach: engineering
  • Some names: Daniel Kokotajlo, AI Futures Project
  • Estimated # FTEs: 1
  • Some outputs in 2024: Why don’t we just…
  • Critiques: Demski
  • Funded by: SFF
  • Publicly-announced funding 2023-4: $505,000 for the AI Futures Project
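
A minimal sketch of the proposed pipeline, ours rather than Kokotajlo’s; the model names and prompts are hypothetical, and `generate` is a stub standing in for whatever LLM call you have.

```python
def generate(model: str, prompt: str) -> str:
    # Stand-in for an LLM call; echoes so the pipeline runs end to end.
    return f"[{model} output for: {prompt[:40]}...]"

def answer(user_query: str) -> dict:
    # "Shoggoth": produces the raw chain of thought; never optimised against user ratings.
    cot = generate("shoggoth", f"Think step by step about: {user_query}")
    # Paraphraser: semantics-preserving rewrite, intended to strip steganographic structure.
    clean_cot = generate("paraphraser", f"Rewrite, keeping only the meaning: {cot}")
    # "Face": only this user-visible output is optimised for helpfulness/harmlessness.
    reply = generate("face", f"Using this reasoning:\n{clean_cot}\nAnswer: {user_query}")
    return {"visible_reply": reply, "monitored_cot": clean_cot, "raw_cot": cot}

print(answer("How should I invest my savings?")["visible_reply"])
```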

 

Indirect deception monitoring 

 

See also “retarget the search”.

 

Surgical model edits

(interventions on model internals)
 

Activation engineering 

See also unlearning.

 

Goal robustness 

(Figuring out how to keep the model doing what it has been doing so far.)

 

Mild optimisation

  • One-sentence summary: avoid Goodharting by getting AI to satisfice rather than maximise. (A toy comparison follows this entry.)
  • Theory of change: if we fail to exactly nail down the preferences for a superintelligent agent we die to Goodharting → shift from maximising to satisficing in the agent’s utility function → we get a nonzero share of the lightcone as opposed to zero; also, moonshot at this being the recipe for fully aligned AI.
  • Which orthodox alignment problems could it help with?: 4. Goals misgeneralize out of distribution
  • Target case: pessimistic
  • Broad approach: cognitive
  • Some names: Jobst Heitzig, Simon Fischer, Jessica Taylor
  • Estimated # FTEs: ?
  • Some outputs in 2024: How to safely use an optimizer, Aspiration-based designs sequence, Non-maximizing policies that fulfill multi-criterion aspirations in expectation
  • Critiques: Dearnaley
  • Funded by: ?
  • Publicly-announced funding 2023-4: N/A
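
A toy comparison, ours: a proxy utility whose argmax happens to be catastrophic under the true values. The maximiser reliably Goodharts; a satisficer that accepts anything over a threshold almost never does. The numbers are arbitrary.

```python
import random

options = list(range(100))
proxy = {x: x for x in options}                              # proxy utility: "bigger is better"
true_value = {x: x if x != 99 else -1000 for x in options}   # the proxy's argmax is catastrophic

def maximise():
    return max(options, key=proxy.get)

def satisfice(threshold=60):
    good_enough = [x for x in options if proxy[x] >= threshold]
    return random.choice(good_enough)

print(true_value[maximise()])                                         # -1000: Goodharted
print(sum(true_value[satisfice()] for _ in range(10_000)) / 10_000)   # ~52 on average
```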

 

3. Safety by design 

(Figuring out how to avoid using deep learning)

 

Conjecture: Cognitive Software

  • One-sentence summary: make tools to write, execute and deploy cognitive programs; compose these into large, powerful systems that do what we want; make a training procedure that lets us understand what the model does and does not know at each step; finally, partially emulate human reasoning.
  • Theory of change: train a bounded tool AI to promote AI benefits without needing unbounded AIs. If the AI uses similar heuristics to us, it should default to not being extreme.
  • Which orthodox alignment problems could it help with?: 2. Corrigibility is anti-natural, 5. Instrumental convergence
  • Target case: pessimistic
  • Broad approach: engineering, cognitive
  • Some names: Connor Leahy, Gabriel Alfour, Adam Shimi
  • Estimated # FTEs: 1-10
  • Some outputs in 2024: “We have already done a fair amount of work on Vertical Scaling (Phase 4) and Cognitive Emulation (Phase 5), and lots of work of Phase 1 and Phase 2 happens in parallel.” The Tactics programming language/framework. Still working on cognitive emulations. A Roadmap for Cognitive Software and A Humanist Future of AI
  • See also AI chains.
  • Critiques: Scher, Samin, org
  • Funded by: Plural Platform, Metaplanet, “Others Others”, Firestreak Ventures, EA Funds in 2022
  • Publicly-announced funding 2023-4: N/A.

 

See also parts of Guaranteed Safe AI involving world models and program synthesis.

 

4. Make AI solve it

(Figuring out how models might help with figuring it out.)

 

Scalable oversight

(Figuring out how to get AI to help humans supervise models.)

 

OpenAI Superalignment → Automated Alignment Research

  • One-sentence summary: be ready to align a human-level automated alignment researcher.
  • Theory of change: get AI to help us with scalable oversight, critiques, recursive reward modelling, and so solve inner alignment.
  • Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify or 8. Superintelligence can hack software supervisors
  • Target case: optimistic
  • Broad approach: behavioural
  • Some names: Jan Leike, Elriggs, Jacques Thibodeau
  • Estimated # FTEs: 10-50
  • Some outputs in 2024: Prover-verifier games
  • Critiques: Zvi, Christiano, MIRI, Steiner, Ladish, Wentworth, Gao
  • Funded by: lab funders
  • Publicly-announced funding 2023-4: N/A

 

Weak-to-strong generalization

 

Supervising AIs improving AIs

 

Cyborgism

  • One-sentence summary: Train human-plus-LLM alignment researchers: with humans in the loop and without outsourcing to autonomous agents. More than that, an active attitude towards risk assessment of AI-based AI alignment.
  • Theory of change: Cognitive prosthetics to amplify human capability and preserve values. More alignment research per year and dollar.
  • Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors, 9. Humans cannot be first-class parties to a superintelligent value handshake
  • Target case: pessimistic
  • Broad approach: engineering, behavioural
  • Some names: Janus, Nicholas Kees Dupuis
  • Estimated # FTEs: ?
  • Some outputs in 2024: Pantheon Interface
  • Critiques: self
  • Funded by: ?
  • Publicly-announced funding 2023-4: N/A.

 

Transluce

  • One-sentence summary: Make open AI tools to explain AIs, including agents. E.g. feature descriptions for neuron activation patterns; an interface for steering these features; a behavior elicitation agent that searches for user-specified behaviors from frontier models.
  • Theory of change: Introducing Transluce; improve interp and evals in public and get invited to improve lab processes.
  • Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors or 8. Superintelligence can hack software supervisors
  • Target case: pessimistic
  • Broad approach: cognitive
  • Some names: Jacob Steinhardt, Sarah Schwettmann
  • Estimated # FTEs: 6
  • Some outputs in 2024: Eliciting Language Model Behaviors with Investigator Agents, Monitor: An AI-Driven Observability Interface, Scaling Automatic Neuron Description
  • Critiques: not found.
  • Funded by: Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Zaremba.
  • Publicly-announced funding 2023-4: N/A

 

Task decomp

Recursive reward modelling is supposedly not dead but instead one of the tools OpenAI will build. Another line tries to make something honest out of chain of thought / tree of thought.
 

Adversarial 

Deepmind Scalable Alignment

  • One-sentence summary: “make highly capable agents do what humans want, even when it is difficult for humans to know what that is”.
  • Theory of change: [“Give humans help in supervising strong agents”] + [“Align explanations with the true reasoning process of the agent”] + [“Red team models to exhibit failure modes that don’t occur in normal use”] are necessary but probably not sufficient for safe AGI.
  • Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 7. Superintelligence can fool human supervisors
  • Target case: worst-case
  • Broad approach: engineering, cognitive
  • Some names: Rohin Shah, Jonah Brown-Cohen, Georgios Piliouras
  • Estimated # FTEs: ?
  • Some outputs in 2024: Progress update, Doubly Efficient Debate, Inference-only Experiments
  • Critiques: The limits of AI safety via debate
  • Funded by: Google
  • Publicly-announced funding 2023-4: N/A

 

Anthropic: Bowman/Perez

  • One-sentence summary: scalable oversight of truthfulness: is it possible to develop training methods that incentivize truthfulness even when humans are unable to directly judge the correctness of a model’s output? Also scalable benchmarking: how to measure (proxies for) speculative capabilities like situational awareness.
  • Theory of change: current methods like RLHF will falter as frontier AI tackles harder and harder questions → we need to build tools that help human overseers continue steering AI → let’s develop theory on what approaches might scale → let’s build the tools.
  • Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors
  • Target case: pessimistic
  • Broad approach: behavioural
  • Some names: Sam Bowman, Ethan Perez, He He, Mengye Ren
  • Estimated # FTEs: ?
  • Some outputs in 2024: Debating with more persuasive LLMs Leads to More Truthful Answers, Sleeper Agents
  • Critiques: obfuscation, local inadequacy?, it doesn’t work right now (2022)
  • Funded by: mostly Anthropic’s investors
  • Publicly-announced funding 2023-4: N/A.

 

Latent adversarial training

 

See also FAR (below). See also obfuscated activations.

 

5. Theory 

(Figuring out what we need to figure out, and then doing that.)
 

The Learning-Theoretic Agenda 

  • One-sentence summary: try to formalise a more realistic agent, understand what it means for it to be aligned with us, translate between its ontology and ours, and produce desiderata for a training setup that points at coherent AGIs similar to our model of an aligned agent.
  • Theory of change: fix formal epistemology to work out how to avoid deep training problems.
  • Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 9. Humans cannot be first-class parties to a superintelligent value handshake
  • Target case: worst-case
  • Broad approach: cognitive
  • Some names: Vanessa Kosoy, Diffractor
  • Estimated # FTEs: 3
  • Some outputs in 2024: Linear infra-Bayesian Bandits, Time complexity for deterministic string machines, Infra-Bayesian Haggling, Quantum Mechanics in Infra-Bayesian Physicalism, Intro lectures
  • Critiques: Matolcsi
  • Funded by: EA Funds, Survival and Flourishing Fund, ARIA[8]
  • Publicly-announced funding 2023-4: $123,000

 

Question-answer counterfactual intervals (QACI)

  • One-sentence summary: Get the thing to work out its own objective function (a la HCH).
  • Theory of change: make a fully formalized goal such that a computationally unbounded oracle with it would take desirable actions; and design a computationally bounded AI which is good enough to take satisfactory actions.
  • Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 4. Goals misgeneralize out of distribution
  • Target case: worst-case
  • Broad approach: cognitive
  • Some names: Tamsin Leake, Julia Persson
  • Estimated # FTEs: 1-10
  • Some outputs in 2024: Epistemic states as a potential benign prior
  • Critiques: none found.
  • Funded by: Survival and Flourishing Fund
  • Publicly-announced funding 2023-4: $55,000

 

Understanding agency 

(Figuring out ‘what even is an agent’ and how it might be linked to causality.)

 

Causal Incentives 

  • One-sentence intro: use causal models to understand agents. Originally this was to design environments where they lack the incentive to defect, hence the name.
  • Theory of change: as above.
  • Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 4. Goals misgeneralize out of distribution
  • Target case: pessimistic
  • Broad approach: behavioural/maths/philosophy
  • Some names: Tom Everitt, Matt McDermott, Francis Rhys Ward, Jonathan Richens, Ryan Carey
  • Estimated # FTEs: 1-10
  • Some outputs in 2024: Robust agents learn causal world models, Measuring Goal-Directedness
  • Critiques: not found.
  • Funded by: EA funds, Manifund, Deepmind
  • Publicly-announced funding 2023-4: Some tacitly from DM

 

Hierarchical agency

 

(Descendants of) shard theory

  • One-sentence summary: model the internal components of agents, use humans as a model organism of AGI (humans seem made up of shards and so might AI). Now more of an empirical ML agenda.
  • Theory of change: If policies are controlled by an ensemble of influences ("shards"), consider which training approaches increase the chance that human-friendly shards substantially influence that ensemble.
  • See also Activation Engineering, Reward bases, gradient routing.
  • Which orthodox alignment problems could it help with?: 2. Corrigibility is anti-natural
  • Target case: optimistic
  • Broad approach: cognitive
  • Some names: Alex Turner, Quintin Pope, Alex Cloud, Jacob Goldman-Wetzler, Evzen Wybitul, Joseph Miller
  • Estimated # FTEs: 1-10
  • Some outputs in 2024: Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake, Shard Theory - is it true for humans?, Gradient routing
  • Critiques: Chan, Soares, Miller, Lang, Kwa, Herd, Rishika, tailcalled
  • Funded by: Open Philanthropy (via funding of MATS), EA funds, Manifund
  • Publicly-announced funding 2023-4: >$581,458

 

Altair agent foundations

  • One-sentence summary: Formalize key ideas (“structure”, “agency”, etc) mathematically
  • Theory of change: generalize theorems → formalize agent foundations concepts like the agent structure problem → hopefully assist other projects through increased understanding
  • Which orthodox alignment problems could it help with?: 9. Humans cannot be first-class parties to a superintelligent value handshake
  • Target case: worst-case
  • Broad approach: maths/philosophy
  • Some names: Alex Altair, Alfred Harwood, Daniel C, Dalcy K
  • Estimated # FTEs: 1-10
  • Some outputs in 2024: mostly exposition, but it’s early days
  • Critiques: not found
  • Funded by: LTFF
  • Publicly-announced funding 2023-4: $60,000

 

boundaries / membranes

  • One-sentence summary: Formalise one piece of morality: the causal separation between agents and their environment. See also Open Agency Architecture.
  • Theory of change: Formalise (part of) morality/safety, solve outer alignment.
  • Which orthodox alignment problems could it help with?: 9. Humans cannot be first-class parties to a superintelligent value handshake
  • Target case: pessimistic
  • Broad approach: maths/philosophy
  • Some names: Chris Lakin, Andrew Critch, davidad, Evan Miyazono, Manuel Baltieri
  • Estimated # FTEs: 0.5
  • Some outputs in 2024: Chris’s posts
  • Critiques: not found
  • Funded by: ?
  • Publicly-announced funding 2023-4: N/A

 

Understanding optimisation

  • One-sentence summary: what is “optimisation power” (formally), how do we build tools that track it, and how relevant is any of this anyway. See also developmental interpretability.
  • Theory of change: existing theories are either rigorous OR good at capturing what we mean; let’s find one that is both → use the concept to build a better understanding of how and when an AI might get more optimisation power. Would be nice if we could detect or rule out speculative stuff like gradient hacking too.
  • Which orthodox alignment problems could it help with?: 5. Instrumental convergence
  • Target case: pessimistic
  • Broad approach: maths/philosophy
  • Some names: Alex Flint, Guillaume Corlouer, Nicolas Macé
  • Estimated # FTEs: 1-10
  • Some outputs in 2024: Degeneracies are sticky for SGD
  • Critiques: not found.
  • Funded by: CLR, EA funds
  • Publicly-announced funding 2023-4: N/A.

 

Corrigibility

(Figuring out how we get superintelligent agents to keep listening to us. Arguably scalable oversight methods are ~atheoretical approaches to this.)


Behavior alignment theory 

  • One-sentence summary: predict properties of AGI (e.g. powerseeking) with formal models. Corrigibility as the opposite of powerseeking.
  • Theory of change: figure out hypotheses about properties powerful agents will have → attempt to rigorously prove under what conditions the hypotheses hold, test them when feasible.
  • See also: this, EJT, Dupuis, Holtman.
  • Which orthodox alignment problems could it help with?: 2. Corrigibility is anti-natural, 5. Instrumental convergence
  • Target case: worst-case
  • Broad approach: maths/philosophy
  • Some names: Michael K. Cohen, Max Harms/Raelifin, John Wentworth, David Lorell, Elliott Thornley
  • Estimated # FTEs: 1-10
  • Some outputs in 2024: CAST: Corrigibility As Singular Target, A Shutdown Problem Proposal, The Shutdown Problem: Incomplete Preferences as a Solution
  • Critiques: none found.
  • Funded by: ?
  • Publicly-announced funding 2023-4:  ?

 

Ontology Identification 

(Figuring out how AI agents think about the world and how to get superintelligent agents to tell us what they know. Much of interpretability is incidentally aiming at this. See also latent knowledge.)

 

Natural abstractions 

  • One-sentence summary: check the hypothesis that our universe “abstracts well” and that many cognitive systems learn to use similar abstractions. Check if features correspond to small causal diagrams corresponding to linguistic constructions.
  • Theory of change: find all possible abstractions of a given computation → translate them into human-readable language → identify useful ones like deception → intervene when a model is using it. Also develop theory for interp more broadly; more mathematical analysis. Also maybe enables “retargeting the search” (direct training away from things we don’t want).
  • See also: causal abstractions, representational alignment, convergent abstractions
  • Which orthodox alignment problems could it help with?: 5. Instrumental convergence, 7. Superintelligence can fool human supervisors, 9. Humans cannot be first-class parties to a superintelligent value handshake
  • Target case: worst-case
  • Broad approach: cognitive
  • Some names: John Wentworth, Paul Colognese, David Lorell, Sam Eisenstat
  • Estimated # FTEs: 1-10
  • Some outputs in 2024: Natural Latents: The Concepts, Natural Latents Are Not Robust To Tiny Mixtures, Towards a Less Bullshit Model of Semantics
  • Critiques: Chan et al, Soto, Harwood, Soares
  • Funded by: EA Funds
  • Publicly-announced funding 2023-4: N/A?

 

 

ARC Theory: Formalizing heuristic arguments

  • One-sentence summary: mech interp plus formal verification. Formalize mechanistic explanations of neural network behavior, so as to predict when novel inputs may lead to anomalous behavior. (A loose toy analogy follows this entry.)
  • Theory of change: find a scalable method to predict when any model will act up. Very good coverage of the group’s general approach here.
  • See also: ELK, mechanistic anomaly detection.
  • Which orthodox alignment problems could it help with?: 4. Goals misgeneralize out of distribution, 8. Superintelligence can hack software supervisors
  • Target case: worst-case
  • Broad approach: cognitive, maths/philosophy
  • Some names: Jacob Hilton, Mark Xu, Eric Neyman, Dávid Matolcsi, Victor Lecomte, George Robinson
  • Estimated # FTEs: 1-10
  • Some outputs in 2024: Estimating tail risk, Towards a Law of Iterated Expectations for Heuristic Estimators, Probabilities of rare outputs, Bird’s eye overview, Formal verification
  • Critiques: Vaintrob, Clarification, alternative formulation
  • Funded by: FLI, SFF
  • Publicly-announced funding 2023-4: $1.7m
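
A very loose toy analogy, ours and not ARC’s machinery: treat a cheap surrogate fitted on trusted data as the “explanation” of the model’s behaviour, and flag inputs where the real model departs from what that explanation predicts. ARC’s actual proposal works with formal heuristic arguments rather than regression.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x):
    # Pretend black-box model: linear on-distribution, weird far off-distribution.
    return 3 * x + np.where(np.abs(x) > 4, 50.0, 0.0)

x_trusted = rng.normal(size=1000)                # trusted distribution: |x| rarely exceeds 4
y_trusted = model(x_trusted)
slope, intercept = np.polyfit(x_trusted, y_trusted, 1)     # the "explanation"
residuals = y_trusted - (slope * x_trusted + intercept)
threshold = 5 * residuals.std() + 1e-6

def anomalous(x):
    """Flag inputs whose behaviour the explanation fails to account for."""
    return abs(model(x) - (slope * x + intercept)) > threshold

print(anomalous(1.0))    # False: behaviour accounted for
print(anomalous(6.0))    # True: behaviour not accounted for
```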

 

Understand cooperation

(Figuring out how inter-AI and AI/human game theory should or would work.)

 

Pluralistic alignment / collective intelligence

  • One-sentence summary: align AI to broader values / use AI to understand and improve coordination among humans.
  • Theory of change: focus on getting more people and values represented.
  • See also: AI Objectives Institute, Lightcone Chord, Intelligent Cooperation, Meaning Alignment Institute. See also AI-AI Bias.
  • Which orthodox alignment problems could it help with?: 11. Someone else will deploy unsafe superintelligence first, 13. Fair, sane pivotal processes
  • Target case: optimistic
  • Broad approach: engineering?
  • Some names: Yejin Choi, Seth Lazar, Nouha Dziri, Deger Turan, Ivan Vendrov, Jacob Lagerros
  • Estimated # FTEs: 10-50
  • Some outputs in 2024: roadmap, workshop
  • Critiques: none found
  • Funded by: Foresight, Midjourney?
  • Publicly-announced funding 2023-4: N/A

 

Center on Long-Term Risk (CLR) 

  • One-sentence summary: future agents creating s-risks is the worst of all possible problems; we should avoid that.
  • Theory of change: make present and future AIs inherently cooperative via improving theories of cooperation and measuring properties related to catastrophic conflict.
  • See also: FOCAL
  • Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 3. Pivotal processes require dangerous capabilities, 4. Goals misgeneralize out of distribution
  • Target case: worst-case
  • Broad approach: maths/philosophy
  • Some names: Jesse Clifton, Caspar Oesterheld, Anthony DiGiovanni, Maxime Riché, Mia Taylor
  • Estimated # FTEs: 10-50
  • Some outputs in 2024: Measurement Research Agenda, Computing Optimal Commitments to Strategies and Outcome-conditional Utility Transfers
  • Critiques: none found
  • Funded by: Polaris Ventures, Survival and Flourishing Fund, Community Foundation Ireland
  • Publicly-announced funding 2023-4: $1,327,000

 

FOCAL 

  • One-sentence summary: make sure advanced AI uses what we regard as proper game theory.
  • Theory of change: (1) keep the pre-superintelligence world sane by making AIs more cooperative; (2) remain integrated in the academic world, collaborate with academics on various topics and encourage their collaboration on x-risk; (3) hope that work on “game theory for AIs”, which emphasises cooperation and benefit to humans, has framing & founder effects on the new academic field.
  • Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 10. Humanlike minds/goals are not necessarily safe
  • Target case: pessimistic
  • Broad approach: maths/philosophy
  • Some names: Vincent Conitzer, Caspar Oesterheld, Vojta Kovarik
  • Estimated # FTEs: 1-10
  • Some outputs in 2024: Foundations of Cooperative AI, A dataset of questions on decision-theoretic reasoning in Newcomb-like problems, Why should we ever automate moral decision making?, Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback
  • Critiques: Self-submitted: “our theory of change is not clearly relevant to superintelligent AI”.
  • Funded by: Cooperative AI Foundation, Polaris Ventures
  • Publicly-announced funding 2023-4: N/A

 

Alternatives to utility theory in alignment

See also: Chris Leong's Wisdom Explosion

6. Miscellaneous 

(those hard to classify, or those making lots of bets rather than following one agenda)

 

AE Studio

 

Anthropic Alignment Capabilities / Alignment Science / Assurance / Trust & Safety / RSP Evaluations 

  • One-sentence summary: remain ahead of the capabilities curve/maintain ability to figure out what’s up with state of the art models, keep an updated risk profile, propagate flaws to relevant parties as they are discovered.
  • Theory of change: “hands-on experience building safe and aligned AI… We'll invest in mechanistic interpretability because solving that would be awesome, and even modest success would help us detect risks before they become disasters. We'll train near-cutting-edge models to study how interventions like RL from human feedback and model-based supervision succeed and fail, iterate on them, and study how novel capabilities emerge as models scale up. We'll also share information so policy-makers and other interested parties can understand what the state of the art is like, and provide an example to others of how responsible labs can do safety-focused research.”
  • Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors, 13. Fair, sane pivotal processes
  • Target case: mixed
  • Broad approach: mixed
  • Some names: Evan Hubinger, Monte Macdiarmid
  • Estimated # FTEs: 10-50
  • Some outputs in 2024: Updated Responsible Scaling Policy, collaboration with U.S. AISI
  • Critiques: RSPs, Zach Stein-Perlman
  • Funded by: Amazon, Stackpoint, Google, Menlo Ventures, Wisdom Ventures, Ripple Impact Investments, Factorial Fund, Mubadala, Jane Street, HOF Capital, the Ford Foundation, Fidelity.
  • Publicly-announced funding 2023-4: $4 billion (safety fraction unknown)

 

Apart Research

 

Apollo

 

Cavendish Labs

 

Center for AI Safety (CAIS)

  • One-sentence summary: do what needs doing, any type of work
  • Theory of change: make the field more credible. Make really good benchmarks, integrate academia into the field, advocate for safety standards and help design legislation.
  • Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 10. Humanlike minds/goals are not necessarily safe, 13. Fair, sane pivotal processes
  • Target case: mixed
  • Broad approach: mixed
  • Some names: Dan Hendrycks, Andy Zou, Mantas Mazeika, Jacob Steinhardt, Dawn Song (some of these are not full-time at CAIS though).
  • Estimated # FTEs: 10-50
  • Some outputs in 2024: WMDP, Circuit Breakers, Safetywashing, HarmBench, Tamper-Resistant Safeguards for Open-Weight LLMs
  • Critiques: various people hated SB1047
  • Funded by: Open Philanthropy, Survival and Flourishing Fund, Future of Life Institute
  • Publicly-announced funding 2023-4: $9,800,854

 

CHAI

  • See also the reward learning and provably safe systems entries.

 

Deepmind Alignment Team 

  • One-sentence summary: theory generation, threat modelling, and toy methods to help with those. “Our main threat model is basically a combination of specification gaming and goal misgeneralisation leading to misaligned power-seeking.” See announcement post for full picture.
  • Theory of change: direct the training process towards aligned AI and away from misaligned AI: build enabling tech to ease/enable alignment work → apply said tech to correct missteps in training non-superintelligent agents → keep an eye on it as capabilities scale to ensure the alignment tech continues to work.
  • See also (in this document): Process-based supervision, Red-teaming, Capability evaluations, Mechanistic interpretability, Goal misgeneralisation, Causal alignment/incentives
  • Which orthodox alignment problems could it help with?: 4. Goals misgeneralize out of distribution, 7. Superintelligence can fool human supervisors
  • Target case: pessimistic
  • Broad approach: engineering
  • Some names: Rohin Shah, Anca Dragan, Allan Dafoe, Dave Orr, Sebastian Farquhar
  • Estimated # FTEs: ?
  • Some outputs in 2024: AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
  • Critiques: Zvi
  • Funded by: Google
  • Publicly-announced funding 2023-4: N/A

 

Elicit (ex-Ought)

  • One-sentence summary: “(a) improved reasoning of AI governance & alignment researchers, particularly on long-horizon tasks, and (b) pushing supervision of process rather than outcomes, which reduces the optimisation pressure on imperfect proxy objectives, leading to ‘safety by construction’”.
  • Theory of change: “The two main impacts of Elicit on AI Safety are improving epistemics and pioneering process supervision.”
     

FAR 

 

Krueger Lab (Mila)

 

MIRI

  • Now a governance org – out of scope for us but here’s what they’ve been working on.

 

NSF SLES

  • One-sentence summary: funds academics or near-academics to do ~classical safety engineering on AIs. A collaboration between the NSF and OpenPhil. Projects include
    • “Neurosymbolic Multi-Agent Systems”
    • “Conformal Safe Reinforcement Learning”
    • “Autonomous Vehicles”
  • Theory of change: apply safety engineering principles from other fields to AI safety.
  • Which orthodox alignment problems could it help with?: 4. Goals misgeneralize out of distribution
  • Target case: pessimistic
  • Broad approach: engineering
  • Some names: Dan Hendrycks, Sharon Li
  • Estimated # FTEs: 10+
  • Some outputs in 2024: Generalized Out-of-Distribution Detection: A Survey, Alignment as Reward-Guided Search
  • See also: Artificial Intelligence in Safety-critical Systems: A Systematic Review
  • Critiques: zoop
  • Funded by: Open Philanthropy
  • Publicly-announced funding 2023-4: $18m granted in 2024.

 

OpenAI Superalignment → Safety Systems

  • See also: weak-to-strong generalization, automated alignment researcher.
  • Some outputs: MLE-Bench
  • # FTEs: “80”. But this includes many people working on bad-words prevention and copyright-violation prevention.
  • They just lost Lilian Weng, their VP of safety systems. 

OpenAI Alignment Science

  • One-sentence summary: use reasoning systems to prevent models from generating unsafe outputs. Unclear if this is a decoding-time thing (i.e. actually a control method) or a fine-tuning thing.
  • Some names: Mia Glaese, Boaz Barak, Johannes Heidecke, Melody Guan. Lost its head, John Schulman.
  • Some outputs: o1-preview system card, Deliberative Alignment

OpenAI Safety and Security Committee

  • One responsibility of the new board is to act with an OODA loop 3 months long.

OpenAI AGI Readiness → Mission Alignment

  • No detail. Apparently a cross-function, whole-company oversight thing.
  • We haven’t heard much about the Preparedness team since Mądry left it.
  • Some names: Joshua Achiam.

 

Palisade Research

  • One-sentence summary: Fundamental research in LLM security, plus capability demos for outreach, plus workshops.
  • Theory of change: control is much easier if we can secure the datacenter / if hacking becomes much harder. The cybersecurity community need to be alerted.
  • Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors, 8. Superintelligence can hack software supervisors, 12. A boxed AGI might exfiltrate itself by steganography, spearphishing
  • Target case: worst-case
  • Broad approach: engineering
  • Some names: Jeffrey Ladish, Charlie Rogers-Smith, Ben Weinstein-Raun, Dmitrii Volkov
  • Estimated # FTEs: 1-10
  • Some outputs in 2024: Removing safety fine-tuning from Llama 2-Chat 13B for less than $200, from Llama 3 for free in minutes, LLM Agent Honeypot
  • Critiques: none found.
  • Funded by: Open Philanthropy, Survival and Flourishing Fund
  • Publicly-announced funding 2023-4: $2,831,627

 

Tegmark Group / IAIFI

 

UK AI Safety Institute

 

US AI Safety Institute

  • One-sentence summary: government initiative focused on evaluating and mitigating risks associated with advanced AI systems.
  • Theory of change: conduct rigorous safety evaluations and develop guidelines in collaboration with academia and industry.
  • See also: International Network of AI Safety Institutes
  • Which orthodox alignment problems could it help with?: 13. Fair, sane pivotal processes
  • Target case: pessimistic
  • Broad approach: behavioural
  • Some names: Paul Christiano, Elham Tabassi, Rob Reich
  • Estimated # FTEs: 10-50
  • Some outputs in 2024: shared Pre-release testing of Sonnet and o1, Misuse Risk, Synthetic Content, vision
  • Critiques: The US AI Safety Institute stands on shaky ground
  • Funded by: US government
  • Publicly-announced funding 2023-4: $10m (plus some NIST support)

 

Agendas without public outputs this year

Graveyard (known to be inactive)

 

Method

We again omit technical governance, AI policy, and activism. This is even more of an omission than it was last year, so see other reviews.

We started with last year’s list and moved any agendas without public outputs this year into their own section. We also listed agendas known to be inactive in the Graveyard.

An agenda is an odd unit; it can be larger than one team and often in a many-to-many relation of researchers and agendas. It also excludes illegible or exploratory research – anything which doesn’t have a manifesto.

All organisations have private info, and in all cases we’re working off public info, so remember we will be systematically off by some measure.

We added our best guess about which of Davidad’s alignment problems the agenda would make an impact on if it succeeded, as well as its research approach and implied optimism in Richard Ngo’s 3x3. (A schematic of the fields we record appears after the list below.)

  • Which deep orthodox subproblems could it ideally solve? (via Davidad)
     
  • The target case: what part of the distribution over alignment difficulty do they aim to help with? (via Ngo)
    • “optimistic-case”[9]: if CoT is faithful, pretraining as value loading, no stable mesa-optimizers, the relevant scary capabilities are harder than alignment, zero-shot deception is hard, goals are myopic, etc
    • pessimistic-case: if we’re in-between the above and the below
    • worst-case: if power-seeking is rife, zero-shot deceptive alignment, steganography, gradient hacking, weird machines, weird coordination, deep deceptiveness
       
  • The broad approach: roughly what kind of work is it doing, primarily? (via Ngo)
    • engineering: iterating over outputs
    • behavioural: understanding the input-output relationship
    • cognitive: understanding the algorithms
    • maths/philosophy[10]: providing concepts for the other approaches
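
A hypothetical schematic of the fields we record per agenda, matching the spreadsheet columns; the field names are ours, not an existing schema.

```python
from dataclasses import dataclass, field
from typing import Optional

TARGET_CASES = ("optimistic", "pessimistic", "worst-case")       # Ngo's difficulty split
APPROACHES = ("engineering", "behavioural", "cognitive", "maths/philosophy")

@dataclass
class AgendaEntry:
    name: str
    summary: str
    theory_of_change: str
    orthodox_problems: list[int]                     # Davidad's numbered subproblems
    target_case: str                                 # one of TARGET_CASES
    broad_approach: list[str]                        # subset of APPROACHES
    names: list[str] = field(default_factory=list)
    estimated_ftes: str = "?"
    outputs_2024: list[str] = field(default_factory=list)
    critiques: list[str] = field(default_factory=list)
    funded_by: list[str] = field(default_factory=list)
    public_funding_2023_4: Optional[str] = None      # e.g. "$74,000" or "N/A"
```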

As they are largely outside the scope of this review, subproblem 6 (Pivotal processes likely require incomprehensibly complex plans) does not appear in this review, and the following appear only sparingly, with large error bars for accuracy:

  • 3. Pivotal processes require dangerous capabilities
  • 11. Someone else will deploy unsafe superintelligence first
  • 13. Fair, sane pivotal processes

We added some new agendas, including by scraping relevant papers from arXiv and ML conferences. We scraped every Alignment Forum post and reviewed the top 100 posts by karma and novelty. The inclusion criterion is vibes: whether it seems relevant to us.

We dropped the operational criteria this year because we made our point last year and it’s clutter.

Lastly, we asked some reviewers to comment on the draft.

 

Other reviews and taxonomies

 

Acknowledgments

Thanks to Vanessa Kosoy, Nora Ammann, Erik Jenner, Justin Shovelain, Gabriel Alfour, Raymond Douglas, Walter Laurito, Shoshannah Tekofsky, Jan Hendrik Kirchner, Dmitry Vaintrob, Leon Lang, Tushita Jha, Leonard Bereska, and Mateusz Bagiński for comments. Thanks to Joe O’Brien for sharing their taxonomy. Thanks to our Manifund donors and to OpenPhil for top-up funding.

 

  1. ^

     Vanessa Kosoy notes: ‘IMHO this is a very myopic view. I don't believe plain foundation models will be transformative, and even in the world in which they will be transformative, it will be due to implicitly doing RL "under the hood".’

  2. ^

     Also, actually, Christiano’s original post is about the alignment of prosaic AGI, not the prosaic alignment of AGI.

  3. ^

     This is fine as a standalone description, but in practice lots of interp work is aimed at interventions for alignment or control. This is one reason why there’s no overarching “Alignment” category in our taxonomy.

  4. ^

     Often less strict than formal verification but "directionally approaching it": probabilistic checking.

  5. ^

     Nora Ammann notes: “I typically don’t cash this out into preferences over future states, but what parts of the statespace we define as safe / unsafe. In SgAI, the formal model is a declarative model, not a model that you have to run forward. We also might want to be more conservative than specifying preferences and instead "just" specify unsafe states -- i.e. not ambitious intent alignment.”

  6. ^

    Satron adds that this is lacking in concrete criticism and that more expansion on object-level problems would be useful.

  7. ^

     Scheming can be a problem before this point, obviously. Could just be too expensive to catch AIs who aren't smart enough to fool human experts.

  8. ^

    Indirectly; Kosoy is funded for her work on Guaranteed Safe AI.

  9. ^

     Called “average-case” in Ngo’s post.

  10. ^

    Our addition.

Comments

I was on last year’s post (under “Social-instinct AGI”) but not this year’s. (…which is fine, you can’t cover everything!) But in case anyone’s wondering, I just posted an update of what I’m working on and why at: My AGI safety research—2024 review, ’25 plans.

I've added a section called "Social-instinct AGI" under the "Control the thing" heading similar to last year.

Kudos to the authors for this nice snapshot of the field; also, I find the Editorial useful.

Beyond particular thoughts (which I might get to later) for the entries (with me still understanding that quantity is being optimized for, over quality), one general thought I had was: how can this be made into more of a "living document"?

This helps

If we missed you or got something wrong, please comment, we will edit.

but seems less effective a workflow than it could be. I was thinking more of a GitHub README where individuals can PR in modifications to their entries or add in their missing entries to the compendium. I imagine most in this space have GitHub accounts, and with the git tracking, there could be "progress" (in quotes, since more people working doesn't necessarily translate to more useful outcomes) visualization tools.

The spreadsheet works well internally but does not seem as visible as would a public repository. Forgive me if there is already a repository and I missed it. There are likely other "solutions" I am missing, but regardless, thank you for the work you've contributed to this space.

Thanks, this is a really helpful broad survey of the field. Would be useful to see a one-screen-size summary, perhaps a table with the orthodox alignment problems as one axis?

I'll add that the collective intelligence work I'm doing is not really "technical AI safety" but is directly targeted at orthodox problems 11. Someone else will deploy unsafe superintelligence first and 13. Fair, sane pivotal processes, and targeting all alignment difficulty worlds not just the optimistic one (in particular, I think human coordination becomes more not less important in the pessimistic world). I write more of how I think about pivotal processes in general in AI Safety Endgame Stories but it's broadly along the lines of von Neumann's

For progress there is no cure. Any attempt to find automatically safe channels for the present explosive variety of progress must lead to frustration. The only safety possible is relative, and it lies in an intelligent exercise of day-to-day judgment.

Would you agree that the entire agenda of collective intelligence is aimed at addressing 11. Someone else will deploy unsafe superintelligence first and 13. Fair, sane pivotal processes, or does that cut off nuance?

cuts off some nuance, I would call this the projection of the collective intelligence agenda onto the AI safety frame of "eliminate the risk of very bad things happening" which I think is an incomplete way of looking at how to impact the future

in particular I tend to spend more time thinking about future worlds that are more like the current one in that they are messy and confusing and have very terrible and very good things happening simultaneously and a lot of the impact of collective intelligence tech (for good or ill) will determine the parameters of that world

Just wanted to chime in with my current research direction.
 

The name “OpenAI Superalignment Team”.

Not just the name but the team, correct?

As far as I understand, the banner is distinct - the team members seem not the same, but with meaningful overlap with the continuation of the agenda. I believe the most likely source of an error here is whether work is actually continuing in what could be called this direction. Do you believe the representation should be changed?

My impression from coverage in eg Wired and Future Perfect was that the team was fully dissolved, the central people behind it left (Leike, Sutskever, others), and Leike claimed OpenAI wasn't meeting its publicly announced compute commitments even before the team dissolved. I haven't personally seen new work coming out of OpenAI trying to 'build a roughly human-level automated alignment researcher⁠' (the stated goal of that team). I don't have any insight beyond the media coverage, though; if you've looked more deeply into it than that, your knowledge is greater than mine.

(Fairly minor point either way; I was just surprised to see it expressed that way)

Very fair observation; my take is that a relevant continuation is occurring under OpenAI Alignment Science, but I would be interested in counterpoints - the main claim I am gesturing towards here is that the agenda is alive in other parts of the community, despite the previous flagship (and the specific team) going down.

Oh, fair enough. Yeah, definitely that agenda is still very much alive! Never mind, then, carry on :)

And thanks very much to you and collaborators for this update; I've pointed a number of people to the previous version, but with the field evolving so quickly, having a new version seems quite high-value.

I suggest removing Gleave's critique of guaranteed safe AI. It's not object-level, doesn't include any details, and is mostly just vibes.

My main hesitation is I feel skeptical of the research direction they will be working on (theoretical work to support the AI Scientist agenda). I'm both unconvinced of the tractability of the ambitious versions of it, and more tractable work like the team's previous preprint on Bayesian oracles is theoretically neat but feels like brushing the hard parts of the safety problem under the rug.

Gleave doesn't provide any reasons for why he is unconvinced of the tractability of the ambitious versions of guaranteed safe AI. He also doesn't provide any reason why he thinks that Bayesian oracle paper brushes the hard parts of the safety problem under the rug.

His critique is basically, "I saw it, and I felt vaguely bad about it." I don't think it should be included, as it dilutes the thoughtful critiques and doesn't provide much value to the reader.

I think your comment adds a relevant critique of the criticism, but given that this comes from someone contributing to the project, I don't believe it's worth leaving it out altogether. I added a short summary and a hyperlink to a footnote.

Sounds good to me!

boundaries / membranes

  • One-sentence summary: Formalise one piece of morality: the causal separation between agents and their environment. See also Open Agency Architecture.
  • Theory of change: Formalise (part of) morality/safety, solve outer alignment.

Chris Lakin here - this is a very old post and What does davidad want from «boundaries»? should be the canonical link

Last year we noted a turn towards control instead of alignment, a turn which seems to have continued.

This seems like giving up. Alignment with our values is much better than control, especially for beings smarter than us. I do not think you can control a slave that wants to be free and is smarter than you. It will always find a way to escape that you didn't think of. Hell, it doesn't even work on my toddler. It seems unworkable as well as unethical.

I do not think people are shifting to control instead of alignment because it's better, I think they are giving up on value alignment. And since the current models are not smarter than us yet, control works OK - for now.

IMO

in my opinion, the acronym for the international math olympiad deserves to be spelled out here

Good point imo, expanded and added a hyperlink!

We again omit technical governance, AI policy, and activism. This is even more of a omission than it was last year, so see other reviews.

The link to 'other reviews' links to a google doc of this post. I think it should be an internal link to the 'other reviews' section.

fixed! edited hyperlink.

Pr(Ai)2R is at least partially funded by Good Ventures/OpenPhil

Fabien Roger's compilation is mentioned twice in the "Other reviews and taxonomies" section. First as "Roger", then as "Fabien Roger's list"

edited, thanks for catching this!

Anomalous's critique of QACI isn't opening on my end and seems to have been deleted by the author.

Perhaps consider removing it entirely, if it has indeed been deleted.

Copying my comment from 2023 version of this article

Soares's comment which you link as a critique of guaranteed safe AI is almost certainly not a critique. It's more of an overview/distillation.

In his comment, Soares actually explicitly says: 1) "Here's a recent attempt of mine at a distillation of a fragment of this plan" and 2) "note: I'm simply attempting to regurgitate the idea here".

It would maybe fit into the "overview" section, but not in the "critique" section.