Crossposted by habryka with Sam's permission. Expect lower probability for Sam to respond to comments here than if he had posted it (he said he'll be traveling a bunch in the coming weeks, so might not have time to respond to anything). 

Preface

This piece reflects my current best guess at the major goals that Anthropic (or another similarly positioned AI developer) will need to accomplish to have things go well with the development of broadly superhuman AI. Given my role and background, it’s disproportionately focused on technical research and on averting emerging catastrophic risks.

For context, I lead a technical AI safety research group at Anthropic, and that group has a pretty broad and long-term mandate, so I spend a lot of time thinking about what kind of safety work we’ll need over the coming years. This piece is my own opinionated take on that question, though it draws very heavily on discussions with colleagues across the organization: Medium- and long-term AI safety strategy is the subject of countless leadership discussions and Google docs and lunch-table discussions within the organization, and this piece is a snapshot (shared with permission) of where those conversations sometimes go.

To be abundantly clear: Nothing here is a firm commitment on behalf of Anthropic, and most people at Anthropic would disagree with at least a few major points here, but this can hopefully still shed some light on the kind of thinking that motivates our work.

Here are some of the assumptions that the piece relies on. I don’t think any one of these is a certainty, but all of them are plausible enough to be worth taking seriously when making plans:

  • Broadly human-level AI is possible. I’ll often refer to this as transformative AI (or TAI), roughly defined as AI that could serve as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&D.[1]
  • Broadly human-level AI (or TAI) isn’t an upper bound on most AI capabilities that matter, and substantially superhuman systems could have an even greater impact on the world along many dimensions.
  • If TAI is possible, it will probably be developed this decade, in a business and policy and cultural context that’s not wildly different from today.
  • If TAI is possible, it could be used to dramatically accelerate AI R&D, potentially leading to the development of substantially superhuman systems within just a few months or years after TAI.
  • Powerful AI systems could be extraordinarily destructive if deployed carelessly, both because of new emerging risks and because of existing issues that become much more acute. This could be through misuse of weapons-related capabilities, by disrupting important balances of power in domains like cybersecurity or surveillance, or by any of a number of other means.
  • Many systems at TAI and beyond, at least under the right circumstances, will be capable of operating more-or-less autonomously for long stretches in pursuit of big-picture, real-world goals. This magnifies these safety challenges.
  • Alignment—in the narrow sense of making sure AI developers can confidently steer the behavior of the AI systems they deploy—requires some non-trivial effort to get right, and it gets harder as systems get more powerful.

Most of the ideas here ultimately come from outside Anthropic, and while I cite a few sources below, I’ve been influenced by far more writings and people than I can credit here or even keep track of.

Introducing the Checklist

This lays out what I think we need to do, divided into three chapters, based on the capabilities of our strongest models:

  • Chapter 1: Preparation
    You are here. In this period, our best models aren’t yet TAI. In the language of Anthropic’s RSP, they’re at AI Safety Level 2 (ASL-2), ASL-3, or maybe the early stages of ASL-4. Most of the work that we have to do will take place here, though it will often be motivated by subsequent chapters. We are preparing for high-stakes concerns that are yet to arise in full. Things are likely more urgent than they appear.
  • Chapter 2: Making the AI Systems Do Our Homework
    In this period, our best models are starting to qualify as TAI, but aren’t yet dramatically superhuman in most domains. Our RSP would put them solidly at ASL-4. AI is already having an immense, unprecedented impact on the world, largely for the better. Where it’s succeeding, it’s mostly succeeding in human-like ways that we can at least loosely follow and understand. While we may be surprised by the overall impact of AI, we aren’t usually surprised by individual AI actions. We’re not dealing with ‘galaxy brains’ that are always thinking twenty steps ahead of us. AI R&D is not automated to the point of allowing the kind of AI self-improvement that would lead to an intelligence explosion, if such a thing is possible, but AI-augmented R&D is very significantly speeding up progress on both AI capabilities and AI safety. This phase will likely come on gradually and somewhat ambiguously, but it may end abruptly if AI-augmented R&D reaches intelligence-explosion level, and we’ll need to be more prepared for Chapter 3 than might seem intuitive at the time.
  • Chapter 3: Life after TAI
    Our best models are broadly superhuman, warranting ASL-5 precautions, and they’re starting to be used in high-stakes settings. They’re able to take enormously impactful actions, potentially using real-world strategies or mechanisms that we deeply struggle to understand, at a pace we can’t keep up with. The ASL-5 standard demands extremely strong safeguards, and if we have adequate safeguards available, that is probably only because we saw a surge of AI-accelerated safety R&D in Chapter 2. This is the endgame for our AI safety work: If we haven’t succeeded decisively on the big core safety challenges by this point, there’s so much happening so fast and with such high stakes that we are unlikely to be able to recover from major errors now. Plus, any remaining safety research problems will be better addressed by automated systems, leaving us with little left to do.

This structure bakes in the assumption that risk levels and capability levels track each other in a relatively predictable way. The first models to reach TAI pose ASL-4-level risks. The first substantially superhuman models pose ASL-5-level risks. The ASLs are defined in terms of the levels of protection that are warranted, so this is not guaranteed to be the case. I take the list of goals here more seriously than the division into chapters.

In each chapter, I’ll run through a list of goals I think we need to accomplish. These goals overlap with one another in places, and some of these goals are only here because they are instrumentally important toward achieving others, but they should still reflect the major topics that we’ll need to cover when setting our more detailed plans at each stage.

Chapter 1: Preparation

You are here. In this period, our best models aren’t yet TAI. In the language of Anthropic’s RSP, they’re at AI Safety Level 2 (ASL-2), ASL-3, or maybe the early stages of ASL-4. Most of the work that we have to do will take place here, though it will often be motivated by subsequent chapters. We are preparing for high-stakes concerns that are yet to arise in full. Things are likely more urgent than they appear.

Not Missing the Boat on Capabilities

Our ability to do our safety work depends in large part on our access to frontier technology. If we can’t find enough compute, we botch a major pretraining run, or we miss out on a transformative paradigm shift (or even just a bunch of smaller improvements to our methods), we’ll have lost most of our opportunity to contribute. Subject to potentially very demanding constraints around safety like those in our current and subsequent RSPs, staying close to the frontier is perhaps our top priority in Chapter 1.

Largely Solving Alignment Fine-Tuning for Early TAI

By the time we have systems that can meaningfully automate large parts of research (importantly including AI safety research), we’ll need to know how to “[get] a lot of useful work out of AIs” without anything going off the rails, and in a way that takes advantage of AI capabilities that are at or somewhat beyond those of human domain experts.

We don’t need to solve alignment perfectly—we can tolerate some marginal risk of misalignment at this point since we won’t be trusting AI systems with the very highest-stakes decisions, and since we’re fairly likely to catch misaligned behavior before it turns into a full global catastrophe. But we need to do quite a good job here.

We should aim to build solutions that are reasonably efficient and reasonably general. It’s possible that we could get by solving alignment only for an AI research assistant that we only use in-house and with heavy expert monitoring, but this would put us in a very delicate situation. We’ll want to be able to broadly deploy TAI systems externally reasonably quickly once that becomes possible, both to allow others to benefit from the potentially immense positive value of the systems and to keep ourselves viable as a business. We thus shouldn’t be satisfied with solutions that require baroque constraints or extensive monitoring by experts in a way that means broad deployment would be impossible.

In my view, the central pillar of this work is scalable oversight—especially scalable oversight that focuses on training trustworthy agents for complex open-ended tasks. Key challenges include reward hacking, the basic limits of human attentiveness, and (to a lesser extent for now) scheming.

Rendering Early TAI Reliably Harmless

If we solve alignment fine-tuning perfectly, we can just ask our models to be harmless and tell them what we mean by that.[2] Short of this kind of perfect solution, which seems unlikely, we’ll want additional layers of defense to ensure that early-TAI systems aren’t misused and that, if they try to take harmful autonomous actions, they don’t get far.

In particular, we should build external safeguards around our AI systems that are sufficient to prevent them from doing any serious harm, even if they are trying to cause serious harm. This goal suggests the need for work on automated monitoring of model outputs, human spot-checking of model usage, automated red-teaming, and Control-style expert stress-testing evaluations of our safeguards wherein we deliberately build toy misaligned systems that try to overcome our oversight measures. At least in Chapter 1, this kind of work may be as important for safety as more conventional alignment work, largely because it appears likely to be easier to measure progress on worst-case safety under this approach. With this in mind, I expect us to rely heavily on monitoring and other similar model-external interventions to help us meet our first RSP deployment-safety commitments at ASL-3.

A key challenge here in the longer run is likely to be the cluster of ML problems around adversarial robustness, since our methods will rely on building harmfulness classifiers with tolerable precision and extremely high recall. We’ll also need a lot of creativity and a strong security mindset: Our red-teaming and stress-testing evaluation work here will demand both.
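To make the precision/recall trade-off concrete, here is a minimal sketch of picking a classifier threshold that hits a recall target and then reading off the precision that results. Everything in it is an illustrative assumption: `threshold_for_recall` is a hypothetical helper, and the scores and labels are made-up toy data rather than anything from a real safeguard.

```python
from typing import List, Tuple


def threshold_for_recall(
    scores: List[float], labels: List[bool], target_recall: float
) -> Tuple[float, float, float]:
    """Return (threshold, recall, precision) for the highest threshold
    whose recall on the labeled data is at least target_recall.

    An input is flagged as harmful when its score >= threshold.
    """
    total_harmful = sum(labels)
    if total_harmful == 0:
        raise ValueError("need at least one harmful example")

    # Try candidate thresholds from strictest to most permissive.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [s >= threshold for s in scores]
        true_pos = sum(f and y for f, y in zip(flagged, labels))
        recall = true_pos / total_harmful
        if recall >= target_recall:
            precision = true_pos / sum(flagged)
            return threshold, recall, precision

    # The lowest score flags everything, so recall reaches 1.0 by then.
    raise ValueError("target_recall must be at most 1.0")


# Toy data (hypothetical): higher score means "more likely harmful".
toy_scores = [0.95, 0.90, 0.80, 0.40, 0.35, 0.30, 0.10, 0.05]
toy_labels = [True, True, False, True, False, False, False, False]

threshold, recall, precision = threshold_for_recall(toy_scores, toy_labels, 1.0)
print(threshold, recall, precision)
```

On this toy data, catching every harmful example means lowering the threshold far enough that one benign example is flagged too. That is the sense in which precision only needs to be tolerable while recall needs to be extremely high. An adversary who can shift the score distribution breaks this calibration, which is why adversarial robustness is the hard part.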

We also expect to hit novel technical problems in applying safeguards that consider many transcripts or model actions at once, rather than targeting transcripts separately one at a time. We expect this to be necessary both as a way of surfacing concerning novel behaviors or use cases that weren’t part of our initial threat models and as a way of protecting against more acute misuse or misalignment risks that aren’t localized to specific transcripts but become clear when viewed as patterns across small clusters of transcripts or users.
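As a rough illustration of why per-transcript checks can miss patterns, here is a minimal sketch that compares a per-transcript rule with a crude per-user aggregate. The function name, the thresholds, and the choice to simply sum scores are all hypothetical simplifications. A real system would need a much more careful aggregate.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def flag_users_by_pattern(
    transcript_scores: List[Tuple[str, float]],
    per_transcript_threshold: float,
    per_user_threshold: float,
) -> Dict[str, List[str]]:
    """Compare per-transcript flagging with a simple per-user aggregate.

    transcript_scores holds (user_id, risk_score) pairs, one per transcript.
    Returns the users caught by each of the two (hypothetical) rules.
    """
    totals: Dict[str, float] = defaultdict(float)
    single_hits = set()
    for user_id, score in transcript_scores:
        totals[user_id] += score
        if score >= per_transcript_threshold:
            single_hits.add(user_id)

    # A user is caught by the pattern rule when their summed risk is high,
    # even if no individual transcript crossed the per-transcript bar.
    pattern_hits = {u for u, total in totals.items() if total >= per_user_threshold}
    return {
        "single_transcript": sorted(single_hits),
        "cross_transcript": sorted(pattern_hits),
    }


# Toy data (hypothetical): user "a" spreads a request over several
# individually unremarkable transcripts; user "b" has one obvious one.
toy_transcripts = [
    ("a", 0.4), ("a", 0.4), ("a", 0.4),
    ("b", 0.9),
    ("c", 0.1), ("c", 0.2),
]
result = flag_users_by_pattern(toy_transcripts, 0.8, 1.0)
print(result)
```

Here user "a" is invisible to the per-transcript rule and only shows up once transcripts are viewed together.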

Compiling an RSP that is Adequate and Well-Calibrated for Risks through Early TAI

The above three items are about getting our systems to a minimal bar of safety and usefulness through early TAI (i.e., ASL-4). Much of the rest of this chapter will be about making this work legible and holding ourselves accountable to the public or to governments for getting it done.

The RSP aims to make it consistently the case that our model training and deployment meet a high, clearly-specified bar for safety and that there is publicly accessible evidence that we have met this bar. Roughly speaking, we run tests (‘frontier risk evaluations’) meant to assess the level of risk that our systems could pose if deployed without safeguards and, if we aren’t able to fully and demonstrably mitigate that risk through our safeguards, we pause further deployments and/or further scaling.

This is in part a way of organizing safety efforts within Anthropic, but it’s just as much a way of setting norms and expectations around safety for the industry more broadly. By showing that we can stay at or near the frontier while being demonstrably safe, we can defuse worries that this level of safety is impossible or commercially impractical to achieve.

To do this, our specific commitments under the RSP need to be well-calibrated in both detail and strictness to mitigate the level of risk that we expect:

  • If they’re significantly too lax, we face unacceptable risks.
  • If they’re significantly too strict and trigger a clearly unwarranted pause, we pay a huge cost and threaten our credibility for no substantial upside.
  • If they’re significantly too vague, they build less trust in our safety practices and work poorly as a demonstration to others.
  • If they’re significantly too detailed early on, we risk misjudging where the most important work will need to be, and thereby committing ourselves to needless costly busywork.

Relatedly, we should aim to pass what I call the LeCun Test: Imagine another frontier AI developer adopts a copy of our RSP as binding policy and entrusts someone who thinks that AGI safety concerns are mostly bullshit to implement it. If the RSP is well-written, we should still be reassured that the developer will behave safely—or, at least, if they fail, we should be confident that they’ll fail in a very visible and accountable way.

The goal here is analogous to that of standards and certifications in other domains. For example, if an organization doesn’t expect to be a target of cyberattacks but nonetheless follows a common cybersecurity standard like SOC 2, they likely still achieve some real protection despite their skepticism.

The key challenge here is forecasting which risks and risk factors are important enough to include. A specific recurring open question in our threat modeling so far is the degree to which risk at ASL-3 and ASL-4 (i.e., before broadly superhuman models or any acute intelligence explosion) flows through direct misuse, through misalignment, or through more indirect contributions via channels like dual-use R&D.

Preparing to Make Safety Cases for Evaluations and Deployments at ASL-4

Once we hit ASL-4, which, roughly speaking, covers near-human-level autonomy and plausibly catastrophic direct misuse risks, we don’t expect to be able to lay out detailed criteria in advance for what tests we would have to pass to approve a system as safe. Instead, we’ll commit to putting together a safety case (a report giving evidence that a system is safe under some circumstances) and we’ll lay out high-level criteria that the safety case needs to satisfy to be approved. Similarly, as models become capable of recognizing when and how they are being evaluated, we will need evaluation-integrity safety cases that show that our frontier risk evaluation runs are reliable at identifying the risk factors that they are designed to catch. Much of our technical safety work will ultimately have impact by being included in these safety cases (and thereby influencing high-stakes decisions about security, scaling, and deployment), and these safety cases are a key target for our work in the lead-up to ASL-4.

We should maintain, internally, a small number of detailed best-guess safety cases that cover a reasonable range of safety situations we might find ourselves in. Our RSP-oriented technical safety work should then be triaged against the likelihood that it feeds into one of these safety cases, and these safety cases should be frequently updated as we learn more about the risks and affordances we face.

Getting Interpretability to the Point of Making Strong Assurances

One of Anthropic’s main distinguishing safety research bets is that we expect a deep effort into mechanistic interpretability to produce a near-uniquely valuable source of evidence about safety. Major successes in this direction, even if they fall short of our north-star enumerative safety goal (roughly, proving that a model has some property), would likely form some of the highest-confidence core pieces of a safety case. This piece from our interpretability team from last year sketches out some of what this could involve.

Compiling Evidence of Robustness

Safety cases for most deployments (i.e., any deployment where the model could be used for high-stakes tasks) will need to include evidence that our safety measures are highly robust. That is, it should be clear that neither the model nor its monitoring systems will fail in surprising ways on rare but important inputs. Barring extreme near-perfect successes with interpretability, our primary evidence for this in safety cases will likely focus on expert stress-testing evaluations of our safeguards (as above) and quantitative results from black-box automated red-teaming, with possible secondary evidence coming from gradient-based white-box attacks as well.

Developing Additional Basic Science for Safety Cases

Barring an unlikely best-case outcome from our mechanistic interpretability work, we expect that a strong safety case will have to rely on additional new findings, based on other approaches, that allow us to evaluate models for safety, quantitatively forecast the risks they’re likely to pose, or quantitatively forecast the effectiveness of our mitigations. Work on scaling trends of risk factors in model organisms, scaling trends of the effectiveness of oversight and monitoring, the basic science of generalization, novel honeypot-style evaluation methods, high-confidence ‘nerfing’ (i.e., capability deletion), and high-level less-mechanistic interpretability methods like influence functions are among the directions that could lead to significant contributions here. This work should be opportunistic in responding to places where it looks like a gap in one of our best-guess safety cases can be filled by a small-scale research effort.

Meeting the ASL-3 and ASL-4 Security Standards for Weights

Our first deployments with non-negligible catastrophic risk will require us to meet the ASL-3 standard for security precautions, largely to prevent bad actors from stealing the weights (and thereby disabling our safeguards) for a model that is capable of enabling extremely harmful actions. For analogous reasons, early TAI will likely require a stronger ASL-4 standard, under which we need to be capable of defending against all but the most sophisticated nation-state-level attacks. We will need to both implement these and be able to demonstrate to third parties that we’ve done so. While ASL-3 is not a huge departure from familiar industry best practices, ASL-4 is much more demanding and represents a rough upper limit on what we expect to be able to implement without heavily interfering with our research and deployment efforts.

Protecting Algorithmic Secrets

To the extent that our capabilities research puts us well ahead of the state of public knowledge in the field, it will be important to secure the key findings from that research to preserve our ability to stay in or near the lead (for the reasons given above). This is qualitatively different from securing model weights, and potentially much more difficult: Because these capabilities findings can often be expressed in a few sentences or paragraphs, departing staff will naturally remember them. It is unclear how important this will be in the Chapter 1 regime, but since it is both quite difficult and likely to become quite important in Chapter 2, it is worth investing in significantly, if only as practice.

Building Calibrated, Legible Evaluations for ASL-4 and ASL-5

Once we’ve hit ASL-3, our evaluations become quite high-stakes. Deploying under ASL-4 or ASL-5 precautions could be unprecedentedly costly and require long lead times to implement. As with other aspects of the RSP described above, there are significant costs to both evaluations that trigger too early and evaluations that trigger too late.

In addition, we’ll need our evaluations to be legibly appropriate. As soon as we see evidence that a model warrants ASL-N protections, we’ll likely need to convince third parties that it warrants ASL-N protections and that other models like it likely do too. If our evaluations for some risk factor trigger, we’ll want clear evidence (ideally in the form of unequivocal ‘smoking gun’ results) that the risk factor demands immediate attention.

We’ll also need our evaluations at ASL-4 and ASL-5 to be increasingly sensitive to evaluation integrity concerns, as discussed briefly in the context of safety cases above. Elicitation with superhuman models can go wrong in far more ways than with present models.

By the time we hit ASL-3, we’ll need strong evaluations for ASL-4. By the time we hit ASL-4, we’ll need strong evaluations for ASL-5. These evaluations will seem premature and divorced from current practice, but capabilities progress is fast and it takes many iterations to get an evaluation right, so we should start piloting them early.

Supporting Efforts that Build Societal Resilience

For some of the most significant risks from early TAI, like very strong and widely available tools for cyberoffense or persuasion, it may be possible to improve our safety situation significantly through relatively tractable mitigations outside the organization. (For example, hardening the cybersecurity of critical infrastructure.) Since it’s unlikely that we’ll have perfect certainty that we have these risks under control, and very unlikely that the entire AI ecosystem will have them under control indefinitely, it’s worth putting significant effort toward working with governments and other relevant bodies to strengthen outside-world defenses against these risks. This work can also feed into a safety case, by mitigating some mechanisms by which AI safety issues could translate into real harms.

More broadly, even AI deployments that are unequivocally positive in their overall effects can nonetheless be quite destabilizing and need to be managed well. (Consider changes in the labor market for a version of this that we’ve encountered many times before.) We don’t have the expertise or the authority or the legitimacy to unilaterally address these societal-scale concerns, but we should use what affordances we have to support and inform responses from government and civil society.

Building Well-Calibrated Forecasts on Dangerous Capabilities, Mitigations, and Elicitation

We’ll be able to plan and coordinate much better if we have good guesses as to which risks will emerge when, as well as which mitigations can be made ready when. These forecasts will play an especially direct role in our RSP evaluation planning: Under the current design of the RSP, our evaluation protocols need to leave a buffer, such that they will trigger safely before the risk actually emerges, to avoid cases where models are trained under moderate security but retroactively determined to need higher security. Forecasts based on solid evidence and well-tested practices would allow us to move the design of those buffers from guesswork to reasonably confident science, and to potentially narrow them in some cases as a result.
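The relationship between forecasts and buffers can be sketched with a little arithmetic. This is only an illustration under invented assumptions: it supposes a single scalar evaluation score, a known score at which the risk actually emerges, and a set of forecast samples for how far the score could rise between one evaluation and the next. None of these names or numbers come from the RSP itself.

```python
import math
from typing import List


def trigger_threshold(
    danger_score: float, growth_forecasts: List[float], coverage: float
) -> float:
    """Return the evaluation score at which higher precautions should trigger.

    growth_forecasts: forecast samples of how much the score could rise
    between one evaluation and the next. The buffer is the smallest sampled
    value that covers at least `coverage` of those samples.
    """
    if not growth_forecasts:
        raise ValueError("need at least one forecast sample")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(growth_forecasts)
    # Nearest-rank empirical quantile of the forecast samples.
    rank = max(1, math.ceil(coverage * len(ordered)))
    buffer = ordered[rank - 1]
    return danger_score - buffer


# Hypothetical numbers: the risk emerges at a score of 100.
wide_forecast = [10.0, 20.0, 30.0, 40.0]    # poorly constrained guesswork
narrow_forecast = [18.0, 20.0, 22.0, 24.0]  # better-evidenced forecast

print(trigger_threshold(100.0, wide_forecast, coverage=1.0))    # 60.0
print(trigger_threshold(100.0, narrow_forecast, coverage=1.0))  # 76.0
```

With the same central estimate, the better-constrained forecast lets the evaluation trigger later (76 rather than 60) while still covering every sampled case. That is the sense in which good forecasts could narrow the buffers.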

These forecasts may also influence the structure of our safety cases. If we have methods that are able to make well-calibrated forecasts of the emergence of new risks, these forecasts can help identify the specific risk factors within a broader safety case that need the most attention.

Building Extremely Adaptive Research Infrastructure

At some point around the development of early TAI, we’re likely to be getting newly concrete evidence about many risks, growing quickly as an organization, and relying on our models for larger and larger chunks of work. We will likely not trust models with full high-bandwidth access to modify our infrastructure and codebase (barring major breakthroughs in the degree to which we can verify alignment-related properties of models), so engineer time will still be a binding constraint on a lot of what we do. We’ll need to be able to move quickly at this point, and benefit as much as is safe from new opportunities for automation. This may take a good deal of organizational and infrastructural preparation in Chapter 1.

Stress-Testing Safety Cases

Our Compliance team (for security) and Alignment Stress-Testing team (for other technical safety measures) form a second line of defense for safety on the three lines of defense worldview: They’re responsible for making sure we understand the risks that we’re mitigating and ensuring that we haven’t missed anything important. In the context of our big-picture safety plans, this manifests as giving a skeptical assessment of any load-bearing claims about safety and security that the organization is preparing to make, and providing a second sign-off on any important discretionary decision. This function is less directly crucial than many listed here, since in principle our first-line safety teams can just get it right the first time. But in practice, I expect that this will make a significant impact on our ability to get things right, and to legibly show that we’ve done so.

The main challenge here, at least for the Alignment Stress-Testing team (which I’m closer to), will be to stay close enough to our day-to-day execution work to stay grounded without becoming major direct contributors to that work in a way that compromises their ability to assess it.

Adjudicating Safety Cases

Our board, with support from the controlling long-term benefit trust (LTBT) and outside partners, forms the third line in the three lines of defense model, providing an independent perspective on any key safety decisions from people who were not involved in the development or execution of our plans. They are ultimately responsible for signing off on high-stakes decisions, like deployments of new frontier models.

I expect that our board will be in a good position to identify relevant outside experts when needed and will make reasonable decisions (modulo the limited state of our knowledge of safety in general). The bigger challenge will be in making the process by which they make these decisions legible and trustworthy for other actors. The most obvious way to do this would be by committing to defer to specific third-party organizations (potentially including government bodies) on these decisions as relevant organizations come online and build sufficient technical capacity to adjudicate them. Without that, it’s hard to see how the RSP and its accompanying structures will pass the LeCun Test (see above).

On that note, I think the most urgent safety-related issue that Anthropic can’t directly address is the need for one or, ideally, several widely respected third-party organizations that can play this adjudication role competently. These organizations collectively need to be so widely known and widely trusted (across any relevant ideological lines) that it’s viewed as highly suspicious if a frontier AI developer avoids working with any of them. Because such an organization would need to avoid conflicts of interest with the firms whose work they are adjudicating, we as an organization are very limited in what we can do to make this happen.

Developing Clear Smoking Gun Demos for Emerging Risk Factors

Present-day work on TAI safety usually involves at least some amount of speculation or extrapolation, by the simple fact that we usually aren’t yet able to experiment with the systems that pose the risks that we’re trying to address. Where we can find ways to transition to concrete empirical work, we should do so, both to solidify our own confidence in our threat models and to provide more compelling evidence to other relevant parties (notably including policymakers).

When we see clear evidence that a risk or risk factor is starting to emerge in real models, it is worth significant additional work to translate that into a simple, rigorous demo that makes the risk immediately clear, ideally in a way that’s legible to a less technical audience. We’ll aim to do a form of this as part of our RSP evaluation process (as noted above), but we will need to be ready to present evidence of this kind in whatever form we can get, even if that looks quite different from what our best formal evaluations can provide. Past examples of things like this from our work include the Sleeper Agents and Sycophancy results.

Preparing to Pause or De-Deploy

For our RSP commitments to function in a worst-case scenario where making TAI systems safe is extremely difficult, we’ll need to be able to pause the development and deployment of new frontier models until we have developed adequate safeguards, with no guarantee that this will be possible on any particular timeline. This could lead us to cancel or dramatically revise major deployments. Doing so will inevitably be costly and could risk our viability in the worst cases, but big-picture strategic preparation could make the difference between a fatal blow to our finances and morale and a recoverable one. More fine-grained tactical preparation will be necessary for us to pull this off as quickly as may be necessary without hitting technical or logistical hiccups.

Laying the Groundwork for AI Welfare Commitments

I expect that, once systems that are more broadly human-like (both in capabilities and in properties like remembering their histories with specific users) become widely used, concerns about the welfare of AI systems could become much more salient. As we approach Chapter 2, the intuitive case for concern here will become fairly strong: We could be in a position of having built a highly-capable AI system with some structural similarities to the human brain, at a per-instance scale comparable to the human brain, and deployed many instances of it. These systems would be able to act as long-lived agents with clear plans and goals and could participate in substantial social relationships with humans. And they would likely at least act as though they have additional morally relevant properties like preferences and emotions.

While the immediate importance of the issue now is likely smaller than most of the other concerns we’re addressing, it is an almost uniquely confusing issue, drawing on hard unsettled empirical questions as well as deep open questions in ethics and the philosophy of mind. If we attempt to address the issue reactively later, it seems unlikely that we’ll find a coherent or defensible strategy.

To that end, we’ll want to build up at least a small program in Chapter 1 to build out a defensible initial understanding of our situation, implement low-hanging-fruit interventions that seem robustly good, and cautiously try out formal policies to protect any interests that warrant protecting. I expect this will need to be pluralistic, drawing on a number of different worldviews around what ethical concerns can arise around the treatment of AI systems and what we should do in response to them.

Chapter 2: TAI, or, Making the AI Do Our Homework

In this period, our best models are starting to qualify as TAI, but aren’t yet dramatically superhuman in most domains. Our RSP would put them solidly at ASL-4. AI is already having an immense, unprecedented impact on the world, largely for the better. Where it’s succeeding, it’s mostly succeeding in human-like ways that we can at least loosely follow and understand. While we may be surprised by the overall impact of AI, we aren’t usually surprised by individual AI actions. We’re not dealing with ‘galaxy brains’ that are always thinking twenty steps ahead of us. AI R&D is not automated to the point of allowing the kind of AI self-improvement that would lead to an intelligence explosion, if such a thing is possible, but AI-augmented R&D is very significantly speeding up progress on both AI capabilities and AI safety. This phase will likely come on gradually and somewhat ambiguously, but it may end abruptly if AI-augmented R&D reaches intelligence-explosion level, and we’ll need to be more prepared for Chapter 3 than might seem intuitive at the time.

Many of the Chapter 1 tasks will not be finished by this point, and many of those will only become more challenging and urgent in Chapter 2.

Meeting the ASL-5 Standard for Weights Security

At this point, AI systems are visibly extremely valuable and visibly close to kicking off an intelligence explosion. We will need to be prepared for TAI-level model weights to be one of the most sought-after and geopolitically important resources in history. Among other things, this means that we’ll need to be capable of defending against top-priority attacks by the most advanced state or state-supported attackers. This will involve taking unprecedented actions in the service of security, likely including interventions like air gaps (among many others) that introduce dramatic restrictions on the ability of most human researchers to do their work.

Developing Methods to Align a Substantially Superhuman AI

In Chapter 3, we may be dealing with systems that are capable enough to rapidly and decisively undermine our safety and security if they are misaligned. So, before the end of Chapter 2, we will need to have either fully, perfectly solved the core challenges of alignment, or else have fully, perfectly solved some related (and almost as difficult) goal like corrigibility that rules out a catastrophic loss of control. This work could look quite distinct from the alignment research in Chapter 1: We will have models to study that are much closer to the models that we’re aiming to align, and if we’ve done our jobs, we will be able to largely rely on human-level automated researchers to help us with the work. The remaining work will be to maintain sufficient situational awareness ourselves that we can be confident in our assessment of how we’re doing.

Evaluating Constantly and Carefully

Even if pretraining runs continue to be divided into clear spaced-out model generations at this point, they are no longer the obvious main locus for frontier risk evaluations. We should expect a substantial AI-enabled acceleration in the pace of progress on finetuning and elicitation. While at earlier ASLs, our frontier risk evaluations can incorporate some buffer, and if an AI system fails to trigger one, we can proceed with some further research and scaling before we need to evaluate again, these buffers will likely become unsustainable: Every nontrivial advance that we become aware of, either from our own research, from publicly-known research, or from observed user behavior, should be assessed, and many will trigger the need for new evaluations. It will be crucial for evaluations to be fast and at least largely automatic.

In addition, AI systems will be able to do nontrivial (if not wildly superhuman) strategic reasoning without chain-of-thought-style thinking out loud, potentially allowing them to strategically influence the outcomes of any evaluation that they can identify as an evaluation. Evaluation integrity will accordingly be a serious challenge.

Deploying Potentially Extremely Dangerous Systems

By ASL-4, models could cause extremely severe harm if deployed recklessly. But if deployed carefully, they would yield similarly immense benefits. If we are justifiably very confident in our suite of safeguards, we should deploy these systems broadly to the public. If we are less certain, we may still have reason to deploy in a more targeted way, like to heavily vetted partners or alongside especially demanding forms of monitoring. The work of the safety teams in these first Chapter 2 deployments will largely consist in making sure that the suite of safeguards that we developed in Chapter 1 behaves as we expect it to.

Addressing AI Welfare as a Major Priority

At this point, AI systems clearly demonstrate several of the attributes described above that plausibly make them worthy of moral concern. Questions around sentience and phenomenal consciousness in particular will likely remain thorny and divisive at this point, but it will be hard to rule out even those attributes with confidence. These systems will likely be deployed in massive numbers. I expect that most people will now intuitively recognize that the stakes around AI welfare could be very high.

Our challenge at this point will be to make interventions and concessions for model welfare that are commensurate with the scale of the issue without undermining our core safety goals or being so burdensome as to render us irrelevant. There may be solutions that leave both us and the AI systems better off, but we should expect serious lingering uncertainties about this through ASL-5.

Deploying in Support of High-Stakes Decision-Making

In the transition from Chapter 2 to Chapter 3, automation of huge swaths of the economy will feel clearly plausible, catastrophic risks will be viscerally close, and most institutions worldwide will be seeing unprecedented threats and opportunities. In addition to being the source of all of this uncertainty and change, AI systems at this point could also offer timely tools that help navigate it. This is the point where it is most valuable to deploy tools that meaningfully improve our capacity to make high-stakes decisions well, potentially including work that targets individual decision-making, consensus-building, education, and/or forecasting. A significant part of the work here will be in product design rather than core AI research, such that much of this could likely be done through public-benefit-oriented partnerships rather than in house.

Chapter 3: Life after TAI

Our best models are broadly superhuman, warranting ASL-5 precautions, and they’re starting to be used in high-stakes settings. They’re able to take enormously impactful actions, potentially using real-world strategies or mechanisms that we deeply struggle to understand, at a pace we can’t keep up with. The ASL-5 standard demands extremely strong safeguards, and if we have adequate safeguards available, that is probably only because we saw a surge of AI-accelerated safety R&D in Chapter 2. This is the endgame for our AI safety work: If we haven’t succeeded decisively on the big core safety challenges by this point, there’s so much happening so fast and with such high stakes that we are unlikely to be able to recover from major errors now. Plus, any remaining safety research problems will be better addressed by automated systems, leaving us with little left to do.

Governments and other important organizations will likely be heavily invested in AI outcomes, largely foreclosing the need for us to make major decisions on our own. By this point, in most possible worlds, the most important decisions that the organization is going to make have already been made. I’m not including any checklist items below, because we hope not to have any.

If we have built this technology and we are still in a position to make major decisions as an organization, the stakes are now enormously high. These decisions could deal with early deployments that could quickly transform or derail society in hard-to-predict ways. These decisions could also deal with governance and safety mechanisms that face stark trade-offs in the face of systems that may feel more like whole freestanding civilizations than like today’s chatbots. Our primary objective at this point should be to help place these decisions in the hands of institutions or processes—potentially including ones that are still yet to be created—that have the democratic legitimacy, and the wisdom, to make them well.

  1. ^

    This matches some common uses of the term AGI, but that term is overloaded and is sometimes used to describe only broadly superhuman systems, so I avoid it here.

  2. ^

    Of course, what behavior counts as harmless is a deeply thorny question on its own, and one we would hope to draw on an outside consensus for rather than attempt to settle on our own.

49 comments

Thanks for writing this! I appreciate the effort to make your perspective more transparent (and implicitly Anthropic's perspective as well). In this comment, I'll explain my two most important concerns with this proposal:

  • Fully state-proof security seems crucial at ASL-4, not only at ASL-5
  • You should have a clear exit plan and I disagree with what seems to be the proposal (fully deferring to the government on handling ASL-5 seems insufficient)

I have a variety of other takes on this proposal as well as various notes, but I decided to start by just writing up these two thoughts. I'll very likely add more comments in a bit.

Fully state-proof security seems crucial at ASL-4, not only at ASL-5

When you have AIs which can massively accelerate total AI capabilities R&D production (>5x)[1], I think it is crucial that such systems are secure against very high-resource state actor attacks. (From the RAND report, this would be >=SL5[2].) This checklist seems to assume that this level of security isn't needed until ASL-5, but you also say that ASL-5 is "broadly superhuman" so it seems likely that dramatic acceleration occurs before then. You also say various other things that seem to imply that early TAI could dramatically accelerate AI R&D.

I think having extreme security at the point of substantial acceleration is the most important intervention on current margins in short timelines. So, I think it is important to ensure that this is correctly timed.

As far as the ASL-4 security bar, you say:

early TAI will likely require a stronger ASL-4 standard, under which we need to be capable of defending against all but the most sophisticated nation-state-level attacks

This seems to imply a level of security of around SL4 from the RAND report, which would be robust to routine state actor attacks, but not top-priority attacks. This seems insufficient given the costs of such a system being stolen and how clearly appealing this would be, as you note later:

We will need to be prepared for TAI-level model weights to be one of the most sought-after and geopolitically important resources in history.

This seems to imply that you'll need more than SL5 security![3]

You should have a clear exit plan and I disagree with what seems to be the proposal

You say that at ASL-5 (after TAI):

Governments and other important organizations will likely be heavily invested in AI outcomes, largely foreclosing the need for us to make major decisions on our own. By this point, in most possible worlds, the most important decisions that the organization is going to make have already been made.

For us to judge whether this is a good overall proposal, it would help to have a clear exit plan including an argument that this exit state is acceptable. This allows us to discuss whether this exit state actually is acceptable and to analyze whether the earlier steps will plausibly lead to this state.

Further, it seems likely that the plans and proposals people write now are also the ones that the government might end up using! (After all, who will the relevant people in the US government ask for help in figuring out what to do?)

As far as I can tell, the proposed exit state you're imagining is roughly "perfectly (scalably?) solve alignment (or just for substantially superhuman systems?) and then hand things off to the government":

So, before the end of Chapter 2, we will need to have either fully, perfectly solved the core challenges of alignment, or else have fully, perfectly solved some related (and almost as difficult) goal like corrigibility that rules out a catastrophic loss of control.

(The proposal is also that we'll have improved societal resilience and tried to address AI welfare concerns along the way. I left this out because it seemed less central.)

I worry about the proposal of "perfectly solve alignment", particularly because you didn't list any concrete plausible research bets which are being scoped out and tested in advance. (And my sense is that few plausible bets exist and all seem pretty likely to fail.)

Further, I think the default plan should be to ensure a delay (perhaps at least 5 years) prior to building wildly superhuman AI after full TAI. Wildly superhuman AI seems incredibly scary, even relative to AIs which are qualitatively comparably smart to top human experts. (This is not just for misalignment reasons, though this is part of it.) I think the "perfectly solve alignment then the government is in charge" plan plausibly leads to the default action being to build wildly superhuman AI as quickly as possible. So, I'd prefer a more specific plan which is opinionated about the transition to wildly superhuman AI.

My preferred exit plan is to build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer to them even on tricky, high-stakes questions where we can't (or just don't) check their outputs. For AIs to be human-obsoleting, they would need to outcompete top human experts, and thus probably can't be much dumber than top human experts. Ideally, they also wouldn't be much smarter than top human experts and preferably notably dumber while compensating with speed and knowledge (though the capability profile of such systems might be very spiky in practice). Being able to defer to AIs to this extent (we trust them at least as much as well-regarded humans in terms of judgment and alignment) is stronger than just ruling out egregious misalignment (e.g. it requires that AIs are sufficiently philosophically competent and actually try on near-impossible-to-check questions), though I expect that ruling out egregious misalignment is the hardest part in practice. Further, we would aim to have more than 3 years of time prior to needing to build substantially superhuman AIs to remain competitive after the point where we have such human-obsoleting and deferrable AIs.

As part of achieving my proposed exit plan, I would also focus much more on buying time when AIs are this powerful or nearly this powerful. Using huge amounts of AI labor to buy time could be done prior to achieving the full desiderata of this exit plan using approaches like AI Control. Part of this would involve increasing willingness to pay on coordination and non-proliferation such as making demonstrations of risks as you discuss in the section "Developing Clear Smoking Gun Demos for Emerging Risk Factors". (That said, this seems focused on speculative risks, and I'd also potentially expand this to demonstrations of the most concerning capabilities from a national security perspective.) But beyond increasing willingness to pay, I think it also seems likely to be important to work (using vast amounts of AI labor) on reducing the cost of coordination and non-proliferation such as by building better monitoring approaches (e.g., better satellite tracking of data centers), improving verification approaches, building up carrots for coordination, and potentially building up sticks to reduce the cost of enforcing non-proliferation.

I think this is a good proposal because it seems as though a high fraction of risk comes from building wildly superhuman AI and it seems much easier to mitigate risks prior to this point. The AIs we defer to could end up recommending a wide variety of proposals (buying more time, emulated minds, developing totally different AI paradigms), not just doing huge amounts of research into aligning superintelligence. Such AIs would always have an option of trying to run themselves much faster to accomplish very difficult research which requires substantial serial time. We wouldn't need to be able to check their work or their intermediate states which improves the situation considerably.

More generally, it just seems really heuristically scary to very quickly go from AIs which aren't much smarter than the best humans to AIs which are wildly smarter in only a few years. Edit: so it would be good to buy time, first for ensuring we have human obsoleting AIs we can defer to and then so that these AIs can have enough time to figure out what to do.

Buck and I are writing an overall short plan for (technical) AI safety and in this we'll discuss all aspects of this in more detail.


  1. I'm referring to AIs making humans working on improving AI from the software side >5x more productive. It is plausible (perhaps 40% in short timelines) that algorithmic advances will in practice slow down dramatically such that software improvements aren't an important input, but I think you should be robust to the likely outcome in which AI R&D is very important. ↩︎

  2. Note that SL5 is not ASL-5! ↩︎

  3. I think you likely need substantially more than SL5 security (security sufficient to resist against unprecedentedly well-resourced attacks) within a short period after you have AIs which can massively (>10x) accelerate R&D in key domains (AI, weapons, cyber offense), but you don't need it immediately. It's probably a good 80/20 to aim for SL5 and then greater security within a year. This could plausibly be insufficient and is probably objectively unacceptably risky (e.g., maybe the RSP should demand stronger with a target of ensuring less than a 5% chance the key models are stolen), but something like this seems to get most of the benefit and even this 80/20 is still very ambitious. ↩︎

I got a bit lost in understanding your exit plan. You write

My preferred exit plan is to build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer to them 

Some questions about this and the text that comes after it:

  1. How do you achieve such alignment? You wrote that you worry about the proposal of perfectly + scalably solving alignment, but I worry about how to achieve even the imperfect alignment of human-ish-level AIs that you're describing here. What techniques are you imagining using?
  2. Why do these AIs need to be human-obsoleting? Why not just human-accelerating?
  3. Why does your exit plan involve using powerful and aligned AIs to prepare for superintelligence, rather than merely using controlled AIs of that capability level? Do you think that it would be hard/dangerous to try to control "human-obsoleting" AIs?
  4. Why do you "expect that ruling out egregious misalignment is the hardest part in practice"? That seems pretty counterintuitive to me. It's easy to imagine descendants of today's models that don't do anything egregious but have pretty different values from me and/or the general public; these AIs wouldn't be "philosophically competent".
  5. What are you buying time to do? I don't understand how you're proposing spending the "3 years of time prior to needing to build substantially superhuman AIs". Is it on alignment for those superhuman AIs? 
  6. You mention having 3 years, but then you say "More generally, it just seems really heuristically scary to very quickly go from AIs which aren't much smarter than the best humans to AIs which are wildly smarter in only a few years." I found this confusing.
  7. What do you mean by "a high fraction of risk comes from building wildly superhuman AI and it seems much easier to mitigate risks prior to this point." It seems easier to mitigate which risks prior to what point? And why? I didn't follow this.

How do you achieve such alignment? You wrote that you worry about the proposal of perfectly + scalably solving alignment, but I worry about how to achieve even the imperfect alignment of human-ish-level AIs that you're describing here. What techniques are you imagining using?

I would say a mixture of moonshots and "doing huge amounts of science". Honestly, we don't have amazing proposals here, so the main plan is to just do huge amounts of R&D with our AIs. I have some specific proposals, but they aren't amazing.

I agree this is unsatisfying, though we do have some idea of how this could work. (Edit: and I plan on writing some of this up later.)

I agree this is a weak point of this proposal, though notably, it isn't as though most realistic proposals avoid a hole at least this large. : (

Why do these AIs need to be human-obsoleting? Why not just human-accelerating?

We could hit just accelerating (and not egregiously misaligned) at an earlier point, but I think it's nice to explicitly talk about the end state. And, I think this is a good end state to aim for as it allows for approximately full retirement of human technical work. And it allows for (e.g.) working toward running a whole AI research program for the equivalent of 1000s of subjective years because these AIs don't need human help to function. I think this probably isn't needed (we probably need much less time), but it is a somewhat nice option to have.

Why does your exit plan involve using powerful and aligned AIs to prepare for superintelligence, rather than merely using controlled AIs of that capability level? Do you think that it would be hard/dangerous to try to control "human-obsoleting" AIs?

I'm imagining that we were already controlling AIs as capable as the final AIs we target and the change is just that we actually should be willing to defer to them. So, I think control should be doable for AIs which are of that capability level.

Further, I'd like to note that I'm not necessarily imagining that we have to go for superintelligence as opposed to other next objectives. The AIs can figure this out.

Here's one possible progression:

  • AIs that substantially accelerate R&D, we use control
  • AIs that seem as smart as top-human experts, we use control
  • AIs that seem as smart as top-human experts, but we've ruled out egregious misalignment and could back off of control
  • AIs that seem as smart as top-human experts, but we are justifiably willing to defer to them on everything

Why do you "expect that ruling out egregious misalignment is the hardest part in practice"? That seems pretty counterintuitive to me. It's easy to imagine descendants of today's models that don't do anything egregious but have pretty different values from me and/or the general public; these AIs wouldn't be "philosophically competent".

I'm not imagining these AIs being our successors which do what they want - the goal is more that they do a good job of obeying instructions in a reasonable and high-integrity way.

I don't think the values of the AIs matter exactly if we ensure they do a good job of obeying instructions in a reasonable way.

And note that if we avoid egregious misalignment then what we see in evaluations and training is roughly what we'll get - the AIs won't be conspiring against our testing if we avoid egregious misalignment.

When I say "philosophically competent", I mean good at thinking through tricky and more philosophical questions. This requires AIs to understand our values, but not necessarily to share them. (E.g. we might prefer our AIs to be more myopic / deontologically non-powerseeking in various ways.)

As far as why I think this is the hardest part, I think it might suffice to:

  • Rule out egregious misalignment
  • Test if your AIs do good things in normal cases you understand (which should be most cases as these AIs aren't smarter than humans qualitatively)
  • Check if AIs are good at tricky hard-to-check philosophy-style problems by creating test beds that we happen to be able to check well for various reasons (e.g. string theory)
  • If checks indicate issues, iterate on our process for creating AIs. Also do some science on when these sort of checks might get overfit and avoid issues from this.

What are you buying time to do? I don't understand how you're proposing spending the "3 years of time prior to needing to build substantially superhuman AIs". Is it on alignment for those superhuman AIs?

The proposal is:

  • Buy a bunch of time.
  • Use this time to get human-obsoleting AIs we can defer to.
  • Also ensure that these AIs get a bunch of time (>3 years). Ask these AIs what we should do, including how we should spend this time. These AIs could end up deciding to try to stall for more time.

So the time could be on figuring out alignment of superhuman AIs, or other stuff.

You mention having 3 years, but then you say "More generally, it just seems really heuristically scary to very quickly go from AIs which aren't much smarter than the best humans to AIs which are wildly smarter in only a few years." I found this confusing.

The 3 years is after having human-obsoleting AIs which is our target end state. I'm imagining targeting more like 5-10 years for the overall acute risk period until we have to build wildly smarter AIs. I just want to make sure these AIs have enough time.

What do you mean by "a high fraction of risk comes from building wildly superhuman AI and it seems much easier to mitigate risks prior to this point." It seems easier to mitigate which risks prior to what point? And why? I didn't follow this.

Misalignment risks seem much higher if AIs are much smarter than you. So, it is easier to reduce risks while effectively using AIs prior to the point where they are wildly superhuman.

Prior to wildly superhuman AI, you can likely use control which makes a bunch of stuff much easier. But, this doesn't get you to the point where you can defer to such AIs. As far as why I think actual alignment is easier on such AIs, it just seems much easier to ensure alignment if we can understand the actions our AIs are taking (at least roughly). More generally, I just think much smarter AIs will be much more likely to be schemers given what ends up being the default training approach.

It seems like I didn't do a good job of explaining the exit plan!

I'll need to do a better job of explaining this in the future. (I'll respond to some of these specific points in a bit.)

@ryan_greenblatt one thing I'm curious about is when/how the government plays a role in your plan.

I think Sam is likely correct in pointing out that the influence exerted by you (as an individual), Sam (as an individual), or even Anthropic (as an institution) likely goes down considerably if/once governments get super involved.

I still agree with your point about how having an exit plan is still valuable (and indeed I do expect governments to be asking technical experts about their opinions RE what to do, though I also expect a bunch of DC people who know comparatively little about frontier AI systems but have long-standing relationships in the national security world will have a lot of influence.)

My guess is that you think heavy government involvement should occur before/during the creation of ASL-4 systems, since you're pretty concerned about risks from ASL-4 systems being developed in non-SL5 contexts.

In general, I'd be interested in seeing more about how you (and Buck) are thinking about policy stuff + government involvement. My impression is that you two have spent a lot of time thinking about how AI control fits into a broader strategic context, with that broader strategic context depending a lot on how governments act/react. 

And I suspect readers will be better able to evaluate the AI control plan if some of the assumptions/expectations around government involvement are spelled out more clearly. (Put differently, I think it's pretty hard to evaluate "how excited should I be about the AI control agenda" without understanding who is responsible for doing the AI control stuff, what's going on with race dynamics, etc.)

My guess is that you think heavy government involvement should occur before/during the creation of ASL-4 systems, since you're pretty concerned about risks from ASL-4 systems being developed in non-SL5 contexts.

Yes, I think heavy government involvement should occur once AIs can substantially accelerate general purpose R&D and AI R&D in particular. I think this occurs at some point during ASL-4.

In practice, there might be a lag between when government should get involved and when it really does get involved such that I think companies should be prepared to implement SL5 without heavy government assistance. I think SL5 will involve massive operating cost, particularly if implemented on short notice, but should be possible for a competent actor to implement with a big effort.

(I'm also somewhat skeptical that the government will actually be that helpful in implementing SL5 relative to just hiring people with the relevant expertise, who will often have formerly worked for various governments. The difficulty in SL5 implementation also depends heavily on what costs you're willing to accept: full airgapping is conceptually simple and should be workable, but prevents serving a public API.)

In general, I'd be interested in seeing more about how you (and Buck) are thinking about policy stuff + government involvement.

I don't think we should get into this here, but we are in fact thinking about these topics and will likely discuss this more in future posts.

And I suspect readers will be better able to evaluate the AI control plan if some of the assumptions/expectations around government involvement are spelled out more clearly.

Agreed, though I think that "do something like control" is more robust than "the AI control plan" (which we haven't even really clearly spelled out publicly, though we do have something in mind).

As far as security, perhaps part of what is going on is that you expect that achieving this high bar of security is too expensive:

ASL-4 is much more demanding and represents a rough upper limit on what we expect to be able to implement without heavily interfering with our research and deployment efforts.

My sense is indeed that SL5 level security would be a large tax to operate under, particularly when implemented in a hurry. However, I think this is also a natural point at which national security concerns become large and commercialization is likely to be greatly reduced.

I left an earlier comment on more important and clearer points, but I thought I would also leave a comment with some more tentative remarks, explanations of where I'm confused, and notes.

The story for ASL-4 safety seems unclear and/or unlikely to provide a "reasonable" level of safety

In this checklist, perhaps the most important target for ASL-4 technical safety measures (not including security) is having the measures to ensure that ASL-4 is safe with respect to misalignment. From what I could tell, you were imagining:

  • Hopefully mechanistic interpretability will allow for a high-confidence case for safety.
  • If that fails or is insufficient, then we'll need other basic science such as "scaling trends of risk factors in model organisms, scaling trends of the effectiveness of oversight and monitoring, the basic science of generalization, novel honeypot-style evaluation methods, high-confidence 'nerfing' (i.e., capability deletion), and high-level less-mechanistic interpretability methods like influence functions"
  • It was unclear to me whether you expected control-style cases to work for an ASL-4/TAI safety case or if you expected it to only work earlier.

I think mechanistic interpretability is unlikely to provide a high degree of assurance (<10% probability of a 10x risk reduction, and <20% probability of a 3x risk reduction where the risk reduction is relative to other more simple methods). So, I think the first bullet will likely be unimportant. Then, the question is whether the second bullet will suffice or whether you are leaning on control.

I was generally a bit unclear on what you were imagining given that you didn't describe a specific list of possible approaches and I generally didn't see a clear story for how the safety case would work. At a more basic level, I expect that the main hope for the first TAIs will end up being AI control in short timelines, and this seems underemphasized if you think this, so likely we disagree or I ended up confused about what you're imagining.

Personally, I'd advocate for using roughly three pillars for ASL-4:

  • AI control which can hopefully provide a greater than 10x risk reduction given substantial effort on implementation and some tax. (And this risk reduction can plausibly be mostly pretty evaluable.)
  • Model organisms style work where you look for evidence of egregious misalignment (both in constructed cases and by adversarially evaluating the actual production models you've deployed).
  • Ad hoc methods that seem like they might help (aka the bag of tricks), but which aren't particularly principled and which might be very difficult to evaluate (e.g. coup probes, generally trying to make training less exploitable because maybe this reduces misalignment risk, perhaps influence functions, try random stuff on model organisms and hope for transfer).

It seems likely to me that this proposal I'm outlining doesn't provide a reasonable level of assurance (e.g. <0.5% takeover/escape/severe sabotage risk throughout ASL-4), but I think it could work at the start of ASL-4 and is reasonably likely to be the best we've got in short timelines.

Overall, I think that it will likely be quite hard to meet a reasonable level of assurance for ASL-4 with the technology we have, at least in slightly more pessimistic worlds (worlds where there are some warning signs or the architecture is somewhat less favorable). So, plans should bear this in mind. I think extensive planning for when you can't meet reasonable safety targets seems quite important, including plans for when you find smoking gun evidence of egregious misalignment or plans for when the evidence is more confusing, but our countermeasures and tests aren't sufficiently good to provide a reasonable degree of assurance.

I was confused by what capability level you were referring to by "early TAI" or I somewhat disagree with how lightly you treat TAI

You used the term "early TAI". I would have thought this meant the first TAI where you defined TAI to be an AI which is a drop-in replacement for human workers, including in AI R&D. However, in the sections "Largely Solving Alignment Fine-Tuning for Early TAI" and "Rendering Early TAI Reliably Harmless", you seemed to possibly imply that this was before TAI. In particular, you seemed to imply that scheming wouldn't be a risk at the start of TAI (seems unclear to me given that the AIs are comparable to top human experts!) and you seemed to imply that such AIs wouldn't speed up R&D that much (while I expect that drop-in replacements for top human AI researchers would pretty likely result in huge acceleration!).

Perhaps I was mostly confused by how you split things up across different headings?

This is also related to my earlier point on ASL-4 security.

"Intelligence explosion level" isn't likely to be a meaningful threshold and TAI will likely greatly accelerate AI R&D

You seem to define TAI as "AI that could form as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&D". This seems to imply that TAI would likely greatly accelerate AI R&D to the extent that it is possible to greatly accelerate with human labor (which seems like an assumption of the piece).

But later you say:

AI R&D is not automated to the point of allowing the kind of AI self-improvement that would lead to an intelligence explosion, if such a thing is possible, but AI-augmented R&D is very significantly speeding up progress on both AI capabilities and AI safety.

I would have guessed that for this definition and the assumptions of the piece, you would expect >10x acceleration of AI R&D (perhaps 30x) given the definition of TAI.

Further "allowing the kind of AI self-improvement that would lead to an intelligence explosion, if such a thing is possible" doesn't seem like an interesting or important bar to me. If progress has already accelerated and the prediction is that it will keep accelerating, then the intelligence explosion is already happening (though it is unclear what the full extent of the explosion will be).

You seem to have some idea in mind of "intelligence-explosion level", but it was unclear to me what this was supposed to mean.

Perhaps you mean something like "could the AIs fully autonomously do an intelligence explosion"? This seems like a mostly unimportant distinction given that projects can use human labor (even including rogue AI-run projects). I think a more natural question is how much the AIs can accelerate things over a human baseline and also how easy further algorithmic advances seem to be (or if we end up effectively having very minimal further returns to labor on AI R&D).

Overall, I expect all of this to be basically continuous such that we are already reasonably likely to be on the early part of a trajectory which is part of an intelligence explosion and this will be much more so the case once we have TAI.

(That said, note that at TAI I still expect things will take potentially a few years and potentially much longer if algorithmic improvements fizzle out rather than resulting in a hyperexponential trajectory without needing further hardware.)

See also the takeoff speeds report by Tom Davidson and this comment from Paul.

A longer series of messy small notes and reactions

Here's a long series of takes I had while reading this; this is ordered sequentially rather than by importance.

I removed takes/notes which I already covered above or in my earlier comment.


Here are some of the assumptions that the piece relies on. I don't think any one of these is a certainty, but all of them are plausible enough to be worth taking seriously when making plans: Broadly human-level AI is possible. I'll often refer to this as transformative AI (or TAI), roughly defined as AI that could form as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&D.[1]

FWIW, broadly human-level AI being possible seems near certain to me (98% likely)

If TAI is possible, it will probably be developed this decade, in a business and policy and cultural context that's not wildly different from today.

I interpret this statement to mean P(TAI before 2030 | TAI possible) > 50%. This seems a bit too high to me, though it is also in a list of things called plausible so I'm a bit confused. (Style-wise, I was a bit confused by what seems to be probabilities of probabilities.) I think P(TAI possible) is like 98%, so we can simplify to P(TAI before 2030) which I think is maybe 35%?
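The simplification above can be checked with a few lines of arithmetic. This is only a sketch: the function names are illustrative, and the inputs (98%, 50%, 35%) are the rough figures from this comment, not estimates from the post.

```python
# Sketch of the conditional-probability arithmetic in the comment above.
# Function names are illustrative; the numbers are the commenter's rough guesses.

def unconditional(p_given_possible: float, p_possible: float) -> float:
    """P(TAI before 2030) = P(TAI before 2030 | TAI possible) * P(TAI possible)."""
    return p_given_possible * p_possible

def conditional(p_unconditional: float, p_possible: float) -> float:
    """Invert the relation above to recover the conditional probability."""
    return p_unconditional / p_possible

P_POSSIBLE = 0.98

# Reading "probably" as > 50% conditional gives this floor on the unconditional:
floor_from_post = unconditional(0.50, P_POSSIBLE)

# A 35% unconditional guess corresponds to this conditional probability:
implied_conditional = conditional(0.35, P_POSSIBLE)

print(f"floor implied by the post:   {floor_from_post:.3f}")
print(f"conditional implied by 35%:  {implied_conditional:.3f}")
```

Because P(TAI possible) is so close to 1, the conditional and unconditional numbers differ by less than a percentage point, which is why dropping the conditioning is harmless here.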

Powerful AI systems could be extraordinarily destructive if deployed carelessly, both because of new emerging risks and because of existing issues that become much more acute. This could be through misuse of weapons-related capabilities, by disrupting important balances of power in domains like cybersecurity or surveillance, or by any of a number of other means.

I'm surprised this doesn't mention misalignment.

Also, I find "misuse of weapons-related capabilities" somewhat awkward. Is this referring to countries using AI for military purposes, or to terrorists / lone actors using AIs for terrorism or similar? I think it's unnatural to call countries using your AI for weapons R&D "misuse", in the same way we don't call stealing F-35 plans "misuse".

Many systems at TAI and beyond, at least under the right circumstances, will be capable of operating more-or-less autonomously for long stretches in pursuit of big-picture, real-world goals. This magnifies these safety challenges.

Seems surprisingly weak if these systems are drop-in replacements for humans!


Our best models are broadly superhuman, warranting ASL-5 precautions, and they're starting to be used in high-stakes settings. They're able to take enormously impactful actions, potentially using real-world strategies or mechanisms that we deeply struggle to understand, at a pace we can't keep up with.

Based on some aspects of how Sam talks about ASL-4, I'm tempted to interpret "broadly superhuman" to mean "at least as good as the best humans in all important domains". However, the language he uses elsewhere and the specific words seem to actually mean "notably better than the best humans in all important domains". Or perhaps even "much better than the best humans in all important domains". Some of the text later on is "first substantially superhuman models pose ASL-5-level risk" which seems to imply "notably better" or "much better" to me.

For now, I'll assume Sam means "notably better" or "much better" as this seems like the most likely interpretation.

I think you shouldn't build AIs which are this smart until you've had a long time with models comparable to top human scientists (perhaps >5 years of delay) and you should be making plans to avoid building such models and handle things with human-ish level models. I think you should try to avoid building AIs which are smarter than top human scientists and I don't see a strong reason to expect you'll need stronger AIs. (More on this...)

If Sam actually means "at least as good", then I disagree with comments "if we have adequate safeguards available, that is probably only because we saw a surge of AI-accelerated safety R&D in Chapter 2" and "This is the endgame for our AI safety work: If we haven't succeeded decisively on the big core safety challenges by this point, there's so much happening so fast and with such high stakes that we are unlikely to be able to recover from major errors now." If Sam means "notably better" or "much better", then I think this seems about right. However, in this case, my comments about avoiding building such systems apply.


In particular, we should build external safeguards around our AI systems that are sufficient to prevent them from doing any serious harm, even if they are trying to cause serious harm.

+1


If they're significantly too strict and trigger a clearly unwarranted pause, we pay a huge cost and threaten our credibility for no substantial upside.

Given how easy the RSP is to amend with approximately zero cost in most cases like this (clearly unwarranted pause), I think this is unlikely to happen in practice. I guess I could imagine some more specific commitment you make like the ASL-3 commitments being a problem, but I roughly expect procedural commitments for ASL-4 and beyond. (Other than perhaps security.)

I generally think we're unlikely to meet a "reasonable" risk threshold for the key regimes and thus think that a lot of the action is in being super clear about loosening the RSP and exactly what that entails. Plausibly, reasonable actors should self-destruct in these cases, though I'm currently skeptical this will happen for any actor and I also lean toward this being a bad idea.

Overall, I'd like the structure of RSPs to look more like a procedural commitment than a specific commitment to countermeasures. You should explain your evaluations in advance and notify the public if you deviate from them.


One of Anthropic's main distinguishing safety research bets is that we expect a deep effort into mechanistic interpretability to produce a near-uniquely valuable source of evidence about safety. Major successes in this direction, even if they fall short of our north-star enumerative safety goal (roughly, proving that a model has some property) would likely form some of the highest-confidence core pieces of a safety case.

I'm skeptical and this seems pretty unlikely.

(I think something involving model internals could end up being pretty useful, but I don't expect that Anthropic's ongoing work in mech interp will help much with this.)

Safety cases for most deployments (i.e., any deployment where the model could be used for high-stakes tasks) will need to include evidence that our safety measures are highly robust.

I don't think this is necessary exactly (see here). But, it would make it easier.

quantitatively forecast the effectiveness of our mitigations

This section seems to think that forecasting is key. Why? Can't we just check if the mitigations are sufficient at the time without needing to extrapolate which imposes additional difficulties? I agree forecasting is useful.


Protecting algorithmic secrets [...] It is unclear how important this will be in the Chapter 1 regime, but since it is both quite difficult and likely to become quite important in Chapter 2, it is worth investing in significantly, if only as practice.

It seems like the proposal is to de facto assume current algorithmic secrets will be stolen by competent actors eventually. This is my guess for what Anthropic should do at the moment, but it is worth remarking on how scary this is.


We will likely not trust models with full high-bandwidth access to modify our infrastructure and codebase (barring major breakthroughs in the degree to which we can verify alignment-related properties of models), so engineer time will still be a binding constraint on a lot of what we do.

This seems to be implying that you won't be able to get massive AI speed-ups because you won't trust AIs in many domains. I think this will likely be seen as unacceptable and thus you should prepare for how to deploy AIs in these domains. (See also AI Control.)


Providing an independent perspective on any key safety decisions from people who were not involved in the development or execution of our plans. They are ultimately responsible for signing off on high-stakes decisions, like deployments of new frontier models.

I don't think the board provides a real independent perspective at the moment, and I'm unsure if it ever will. (See also Zach's comment.) The LTBT will likely just defer to the board in the current status quo (they should be appointing the board by ASL-4 from my understanding), so there is really just the board. I don't think we should trust the board of Anthropic to be an independent check on Anthropic unless the specific board members have well-built up and independent views on AI safety (and ideally a majority of such board members). (By default I think Anthropic leadership will effectively control the board and convince the board that they should defer to experts who agree with the leadership.)

I think all of this is hard to achieve even if this were a high priority, but the bottom line is that the board is unlikely to be an independent check. (Various employees could potentially be an independent check.)

On that note, I think the most urgent safety-related issue that Anthropic can't directly address is the need for one or, ideally, several widely respected third-party organizations that can play this adjudication role competently.

Strong +1

we as an organization are very limited in what we can do to make this happen.

I disagree, I think there are things Anthropic could do that would help considerably. This could include:

  • Actively encouraging prospective employees to start or join third-party organizations rather than join Anthropic in cases where the employee might be interested in this and this could be a reasonable fit.
  • Better model access (either for anyone, just researchers, or just organizations with aspirations to become adjudicators)
  • Higher levels of certain types of transparency (e.g. being more transparent about the exact details of safety cases, open-sourcing evals (probably you just want to provide random IID subsets of the eval or to share high-level details and then share the exact implementation on request)).

I'm not sure exactly what is good here, but I don't think Anthropic is as limited as you suggest.
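As a rough illustration of the "random subsets of the eval" idea in the last bullet, here is a minimal sketch. It assumes the eval is just a list of items; `public_subset`, the `fraction` parameter, and the seed handling are all hypothetical and not a description of how any lab actually releases evals.

```python
import random

def public_subset(eval_items, fraction, seed):
    """Return a uniformly random subset of an eval to publish.

    Hypothetical helper: samples without replacement, so every item is
    equally likely to be released and the held-back remainder stays private.
    """
    if not eval_items:
        raise ValueError("eval_items must be non-empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed makes the released subset reproducible
    k = max(1, round(len(eval_items) * fraction))
    return rng.sample(eval_items, k)

# Usage: release roughly 10% of a 100-item eval.
items = [f"task-{i}" for i in range(100)]
released = public_subset(items, fraction=0.1, seed=0)
print(len(released))
```

Sampling uniformly at random (rather than hand-picking examples) is what lets outsiders treat the released portion as representative of the full eval.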


This could lead us to cancel or dramatically revise major deployments. Doing so will inevitably be costly and could risk our viability in the worst cases

I think Anthropic should either heavily plan for scenarios where it self-destructs as an organization or should be prepared to proceed with a plan B that doesn't meet an objectively reasonable safety bar (e.g. <1% lifetime takeover risk). (Or both.)

Developing Methods to Align a Substantially Superhuman AI

I think you should instead plan on not building such systems as there isn't a clear reason why you need such systems and they seem super dangerous. That's not to say that you shouldn't also do research into aligning such systems, I just think the focus should instead be on measures to avoid needing to build them.

Addressing AI Welfare as a Major Priority

I lean against focusing on AI welfare as a major priority. I'd target perhaps 2.5% of resources on AI welfare in the absence of outside pressure. (Zach also made a comment on this and my earlier post on AI welfare talks about resource allocation considerations some.)

I agree with large parts of this comment, but am confused by this:

I think you should instead plan on not building such systems as there isn't a clear reason why you need such systems and they seem super dangerous. That's not to say that you shouldn't also do research into aligning such systems, I just think the focus should instead be on measures to avoid needing to build them.

While I don't endorse it due to disagreeing with some (stated and unstated) premises, I think there's a locally valid line of reasoning that goes something like this:

  • if Anthropic finds itself in a world where it's successfully built not-vastly-superhuman TAI, it seems pretty likely that other actors have also done so, or will do so relatively soon
  • it is now legible (to those paying attention) that we are in the Acute Risk Period
  • most other actors who have or will soon have TAI will be less safety-conscious than Anthropic
  • if nobody ends the Acute Risk Period, it seems pretty likely that one of those actors will do something stupid (like turn over their AI R&D efforts to their unaligned TAI), and then we all die
  • not-vastly-superhuman TAI will not be sufficient to prevent those actors from doing something stupid that ends the world
  • unfortunately, it seems like we have no choice but to make sure we're the first to build superhuman TAI, to make sure the Acute Risk Period has a good ending

This seems like the pretty straightforward argument for racing, and if you have a pretty specific combination of beliefs about alignment difficulty, coordination difficulty, capability profiles, etc, I think it basically checks out.

I don't know what set of beliefs implies that it's much more important to avoid building superhuman TAI once you have just-barely TAI, than to avoid building just-barely TAI in the first place.  (In particular, how does this end up with the world in a stable equilibrium that doesn't immediately get knocked over by the second actor to reach TAI?)

I don't know what set of beliefs implies that it's much more important to avoid building superhuman TAI once you have just-barely TAI, than to avoid building just-barely TAI in the first place.

AIs which aren't qualitatively much smarter than humans seem plausible to use reasonably effectively while keeping risk decently low (though still unacceptably risky in objective/absolute terms). Keeping risk low seems to require substantial effort, though it seems maybe achievable. Even with token effort, I think risk is "only" around 25% with such AIs because default methods likely avoid egregious misalignment (perhaps 30% chance of egregious misalignment with token effort and then some chance you get lucky for a 25% chance of risk overall).
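To make the parenthetical numbers concrete, here is a small sketch of the implied arithmetic. It assumes (my reading, not something stated above) that essentially all of the risk flows through egregious misalignment, so that overall risk is P(egregious misalignment) times P(bad outcome given misalignment).

```python
# Sketch of the arithmetic behind "30% misalignment, then some chance you get lucky, 25% overall".
# Assumption (not stated in the comment): all risk comes via egregious misalignment.

p_misaligned = 0.30    # chance of egregious misalignment with token effort
p_risk_overall = 0.25  # overall risk figure given in the comment

# Implied chance that misalignment actually leads to a bad outcome:
p_bad_given_misaligned = p_risk_overall / p_misaligned

# Implied chance of "getting lucky" despite egregious misalignment:
p_lucky = 1 - p_bad_given_misaligned

print(f"P(bad | misaligned) ~ {p_bad_given_misaligned:.2f}")
print(f"P(lucky | misaligned) ~ {p_lucky:.2f}")
```

Under that assumption, the "get lucky" term is roughly a one-in-six chance of avoiding catastrophe even when the AI is egregiously misaligned.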

Then given this, I have two objections to the story you seem to present:

  • AIs which aren't qualitatively smarter than humans seem very useful and with some US government support could suffice to prevent proliferation. (Both greatly reduce the cost of non-proliferation while also substantially increasing willingness to pay with demos etc.)
  • Plans that don't involve US government support while building crazy weapons/defense with wildly superhuman AIs involve committing massive crimes and I think we should have a policy against this.

Another way to put this is that the story for needing much smarter AIs is presumably that you need to build crazy weapons/defenses to defend against someone else's crazily powerful AI. Building insane weapons/defenses requires US government consent (unless you're committing massive crimes which seems like a bad idea). Thus, you might as well go all the way to preventing much smarter AIs from being built (by anyone) for a while which seems possible with some US government support and the use of these human-ish level AIs.

(Responding in a consolidated way just to this comment.)

Ok, got it.  I don't think the US government will be able and willing to coordinate and enforce a worldwide moratorium on superhuman TAI development, if we get to just-barely TAI, at least not without plans that leverage that just-barely TAI in unsafe ways which violate the safety invariants of this plan.  It might become more willing than it is now (though I'm not hugely optimistic), but I currently don't think as an institution it's capable of executing on that kind of plan and don't see why that will change in the next five years.

Another way to put this is that the story for needing much smarter AIs is presumably that you need to build crazy weapons/defenses to defend against someone else's crazily powerful AI.

I think I disagree with the framing ("crazy weapons/defenses") but it does seem like you need some kind of qualitatively new technology. This could very well be social technology, rather than something more material.

Building insane weapons/defenses requires US government consent (unless you're committing massive crimes which seems like a bad idea).

I don't think this is actually true, except in the trivial sense where we have a legal system that allows the government to decide approximately arbitrary behaviors are post-facto illegal if it feels strongly enough about it.  Most new things are not explicitly illegal.  But even putting that aside[1], I think this is ignoring the legal routes that a qualitatively superhuman TAI might find to end the Acute Risk Period, if it was so motivated.

(A reminder that I am not claiming this is Anthropic's plan, nor would I endorse someone trying to build ASI to execute on this kind of plan.)

TBC, I don't think there are plausible alternatives to at least some US government involvement which don't require committing a bunch of massive crimes.

I think there's a very large difference between plans that involve nominal US government signoff on private actors doing things, in order to avoid committing massive crimes (or to avoid the appearance of doing so), plans that involve the US government mostly just slowing things down or stopping people from doing things, and plans that involve the US government actually being the entity that makes high-context decisions about e.g. what values to optimize for, given a slot into which to put values.

  1. ^

    I agree that stories which require building things that look very obviously like "insane weapons/defenses" seem bad, both for obvious deontological reasons, but also I wouldn't expect them to work well enough to be worth it even under "naive" consequentialist analysis.

if we get to just-barely TAI, at least not without plans that leverage that just-barely TAI in unsafe ways which violate the safety invariants of this plan

I'm basically imagining being able to use controlled AIs which aren't qualitatively smarter than humans for whatever R&D purposes we want. (Though not applications like (e.g.) using smart AIs to pilot drone armies live.) Some of these applications will be riskier than others, but I think this can be done while managing risk to a moderate degree.

Bootstrapping to some extent should also be possible where you use the first controlled AIs to improve the safety of later deployments (both improving control and possibly alignment).

Is your perspective something like:

With (properly motivated) qualitatively wildly superhuman AI, you can end the Acute Risk Period using means which aren't massive crimes despite not collaborating with the US government. This likely involves novel social technology. More minimally, if you did have a sufficiently aligned AI of this power level, you could just get it to work on ending the Acute Risk Period in a basically legal and non-norms-violating way. (Where e.g. super persuasion would clearly violate norms.)

I think that even having the ability to easily take over the world as a private actor is pretty norms violating. I'm unsure about the claim that if you put this aside, there is a way to end the acute risk period (edit: without US government collaboration and) without needing truly insanely smart AIs. I suppose that if you go smart enough this is possible though pre-existing norms also just get more confusing in the regime where you can steer the world to whatever outcome you want.

So overall, I'm not sure I disagree with this perspective exactly. I think the overriding consideration for me is that this seems like a crazy and risky proposal at multiple levels.

To be clear, you are explicitly not endorsing this as a plan nor claiming this is Anthropic's plan.

Is your perspective something like:

Something like that, though I'm much less sure about "non-norms-violating", because many possible solutions seem like they'd involve something qualitatively new (and therefore de-facto norm-violating, like nearly all new technology).  Maybe a very superhuman TAI could arrange matters such that things just seem to randomly end up going well rather than badly, without introducing any new[1] social or material technology, but that does seem quite a bit harder.

I'm pretty uncertain about whether, if something like that ended up looking norm-violating, it'd be norm-violating like Uber was[2], or like super-persuasion.  That question seems very contingent on empirical questions that I think we don't have much insight into, right now.

I'm unsure about the claim that if you put this aside, there is a way to end the acute risk period without needing truly insanely smart AIs.

I didn't mean to make the claim that there's a way to end the acute risk period without needing truly insanely smart AIs (if you put aside centrally-illegal methods); rather, that an AI would probably need to be relatively low on the "smarter than humans" scale to need to resort to methods that were obviously illegal to end the acute risk period.

  1. ^

    In ways that are obvious to humans.

  2. ^

    Minus the part where Uber was pretty obviously illegal in many places where it operated.

My proposal would roughly be that the US government (in collaboration with allies etc) enforces that no one builds AIs which are qualitatively smarter than humans, and that this should be the default plan.

(This might be doable without government support via coordination between multiple labs, but I basically doubt it.)

There could be multiple AI projects backed by the US+allies or just one; either could be workable in principle, though multiple seems tricky.

TBC, I don't think there are plausible alternatives to at least some US government involvement which don't require committing a bunch of massive crimes.

I have a policy against committing or recommending committing massive crimes.

Thanks for writing this! I think it’s important for AI labs to write and share their strategic thoughts; I appreciate you doing so. I have many disagreements, but I think it’s great that the document is clear enough to disagree with.

You start the post by stating that “Our ability to do our safety work depends in large part on our access to frontier technology,” but you don’t say why. Like, there’s a sense in which much of this plan is predicated on Anthropic needing to stay at the frontier, but this document doesn’t explain why this is the right call to begin with. There are clearly some safety benefits to having access to frontier models, but the question is: are those benefits worth the cost? Given that this is (imo) by far the most important strategic consideration for Anthropic, I’m hoping for far more elaboration here. Why does Anthropic believe it’s important to work on advancing capabilities at all? Why is it worth the potentially world-ending costs?    

This section also doesn’t explain why Anthropic needs to advance the frontier. For instance, it isn’t clear to me that anything from “Chapter 1” requires this—does remaining slightly behind the frontier prohibit Anthropic from e.g. developing automated red-teaming, or control techniques, or designing safety cases, etc.? Why? Indeed, as I understand it, Anthropic’s initial safety strategy was to remain behind other labs. Now Anthropic does push the frontier, but as far as I know no one has explained what safety concerns (if any) motivated this shift. 

This is especially concerning because pushing the frontier seems very precarious, in the sense you describe here:

If [evaluations are] significantly too strict and trigger a clearly unwarranted pause, we pay a huge cost and threaten our credibility for no substantial upside.

... and here:

As with other aspects of the RSP described above, there are significant costs to both evaluations that trigger too early and evaluations that trigger too late. 

But without a clear sense of why advancing the frontier is helpful for safety in the first place, it seems pretty easy to imagine missing this narrow target. 

Like, here is a situation I feel worried about. We continue to get low quality evidence about the danger of these systems (e.g. via red-teaming). This evidence is ambiguous and confusing—if a system can in fact do something scary (such as insert a backdoor into another language model), what are we supposed to infer from that? Some employees might think it suggests danger, but others might think that it, e.g., wouldn’t be able to actually execute such plans, or that it’s just a one-off fluke but still too incompetent to pose real threat, etc. How is Anthropic going to think about this? The downside of being wrong is, as you’ve stated, extreme: a long enough pause could kill the company. And the evidence itself is almost inevitably going to be quite ambiguous, because we don’t understand what’s happening inside the model such that it’s producing these outputs.

But so long as we don’t understand enough about these systems to assess their alignment with confidence, I am worried that Anthropic will keep deciding to scale. Because when the evidence is as indirect and unclear as that which is currently possible to gather, interpreting it is basically just a matter of guesswork. And given the huge incentive to keep scaling, I feel skeptical that Anthropic will end up deciding to interpret anything but unequivocal evidence as suggesting enough danger to stop. 

This is concerning because Anthropic seems to anticipate such ambiguity, as suggested by the RSP lacking any clear red lines. Ideally, if Anthropic finds that their model is capable of, e.g., self-replication, then this would cause some action like “pause until safety measures are ready.” But in fact what happens is this:

If sufficient measures are not yet implemented, pause training and analyze the level of risk presented by the model. In particular, conduct a thorough analysis to determine whether the evaluation was overly conservative, or whether the model indeed presents near-next-ASL risks.

In other words, one of the first steps Anthropic plans to take if a dangerous evaluation threshold triggers, is to question whether that evaluation was actually meaningful in the first place. I think this sort of wiggle room, which is pervasive throughout the RSP, renders it pretty ineffectual—basically just a formal-sounding description of what they (and other labs) were already doing, which is attempting to crudely eyeball the risk.

And given that the RSP doesn’t bind Anthropic to much of anything, much of the decision making hinges on the quality of its company culture. For instance, here is Nick Joseph’s description:

Fortunately, I think my colleagues, both on the RSP and elsewhere, are both talented and really bought into this, and I think we’ll do a great job on it. But I do think the criticism is valid, and that there is a lot that is left up for interpretation here, and it does rely a lot on people having a good-faith interpretation of how to execute on the RSP internally.

[...]

But I do agree that ultimately you need to have a culture around thinking these things are important and having everyone bought in. As I said, some of these things are like, did you solicit capabilities well enough? That really comes down to a researcher working on this actually trying their best at it. And that is quite core, and I think that will just continue to be.

Which is to say that Anthropic’s RSP doesn’t appear to me to pass the LeCun test. Not only is the interpretation of the evidence left up to Anthropic’s discretion (including retroactively deciding whether a test actually was a red line), but the quality of the safety tests themselves is also a function of company culture (i.e., of whether researchers are “actually trying their best” to “solicit capabilities well enough.”)

I think the LeCun test is a good metric, and I think it’s good to aim for. But when the current RSP is so far from passing it, I’m left wanting to hear more discussion of how you’re expecting it to get there. What do you expect will change in the near future, such that balancing these delicate tradeoffs—too lax vs. too strict, too vague vs. too detailed, etc.—doesn’t result in another scaling policy which also doesn’t constrain Anthropic’s ability to scale roughly at all? What kinds of evidence are you expecting you might encounter, that would actually count as a red line? Once models become quite competent, what sort of evidence will convince you that the model is safe? Aligned? And so on. 

Unfortunately this ignores 3 major issues:

  1. race dynamics (also pointed out by Akash)
  2. human safety problems - given that alignment is defined "in the narrow sense of making sure AI developers can confidently steer the behavior of the AI systems they deploy", why should we believe that AI developers and/or parts of governments that can coerce AI developers will steer the AI systems in a good direction? E.g., that they won't be corrupted by power or persuasion or distributional shift, and are benevolent to begin with.
  3. philosophical errors or bottlenecks - there's a single mention of "wisdom" at the end, but nothing about how to achieve/ensure the unprecedented amount of wisdom or speed of philosophical progress that would be needed to navigate something this novel, complex, and momentous. The OP seems to suggest punting such problems to "outside consensus" or "institutions or processes", with apparently no thought towards whether such consensus/institutions/processes would be up to the task or what AI developers can do to help (e.g., by increasing AI philosophical competence).

Like others I also applaud Sam for writing this, but the actual content makes me more worried, as it's evidence that AI developers are not thinking seriously about some major risks and risk factors.

Just riffing on this rather than starting a different comment chain:

If alignment₁ is "get AI to follow instructions" (as typically construed in a "good enough" sort of way) and alignment₂ is "get AI to do good things and not bad things," (also in a "good enough" sort of way, but with more assumed philosophical sophistication) I basically don't care about anyone's safety plan to get alignment₁ except insofar as it's part of a plan to get alignment₂.

Philosophical errors/bottlenecks can mean you don't know how to go from 1 to 2. Human safety problems are what stop you from going from 1 to 2 even if you know how, or stop you from trying to find out how.

The checklist has a space for "nebulous future safety case for alignment₁," which is totally fine. I just also want a space for "nebulous future safety case for alignment₂" at the least (some earlier items explicitly about progressing towards that safety case can be extra credit). Different people might have different ideas about what form a plan for alignment₂ takes (will it focus on the structure of the institution using an aligned AI, or will it focus on the AI and its training procedure directly?), and where having it should come in the timeline, but I think it should be somewhere.

Part of what makes power corrupting insidious is that it seems obvious to humans that we can make everything work out best so long as we have power - that we don't even need to plan for how to get from having control to actually getting the good things that control was supposed to be an instrumental goal for.

Our board, with support from the controlling long-term benefit trust (LTBT) and outside partners, forms the third line in the three lines of defense model, providing an independent perspective on any key safety decisions from people who were not involved in the development or execution of our plans. They are ultimately responsible for signing off on high-stakes decisions, like deployments of new frontier models.

Someone suggested that I point out that this is misleading. The board is not independent: it's two executives, one investor, and one other guy. And the board has the hard power here, modulo the LTBT's ability to elect/replace board members. And the LTBT does not currently have AI safety expertise. And Dario at least is definitely "involved in the development or execution of our plans."

(I'm writing this comment because I believe it, but with this disclaimer because it's not the highest-priority comment from my perspective.)

(Edit: I like and appreciate this post.)

There are I think also the undisclosed conditions under which investors could override decisions by the LTBT. Or maybe we have now learned about those conditions, but if so, I haven't seen it, or have forgotten about it.

(If they're sufficiently unified, stockholders have power over the LTBT. The details are unclear. See my two posts on the topic.)

Ah, yeah, the uncertainty is now located in who actually has how much stock. I did forget that we now do at least know the actual thresholds.

Article IV of the Certificate of Incorporation lists the number of shares of each class of stock, and as that's organized by funding round I expect that you could get a fair way by cross-referencing against public reporting.

Yes for one mechanism. It's unclear but it sounds like "the Trust Agreement also authorizes the Trust to be enforced by the company and by groups of the company’s stockholders who have held a sufficient percentage of the company’s equity for a sufficient period of time" describes a mysterious separate mechanism for Anthropic/stockholders to disempower the trustees.

Someone suggested that I point out that this is misleading. The board is not independent: it's two executives, one investor, and one other guy.

As of November this year, the board will consist of the CEO, one investor representative, and three members appointed by the LTBT. I think it's reasonable to describe that as independent, even if the CEO alone would not be, and to be thinking about the from-November state in this document.

(The LTBT got the power to appoint one board member in fall 2023, but didn't do so until May. It got power to appoint a second in July, but hasn't done so yet. It gets power to appoint a third in November. It doesn't seem to be on track to make a third appointment in November.)

(And the LTBT might make non-independent appointments, in particular keeping Daniela.)

I liked this post (and think it's a lot better than official comms from Anthropic.) Some things I appreciate about this post:

Presenting a useful heuristic for RSPs

Relatedly, we should aim to pass what I call the LeCun Test: Imagine another frontier AI developer adopts a copy of our RSP as binding policy and entrusts someone who thinks that AGI safety concerns are mostly bullshit to implement it. If the RSP is well-written, we should still be reassured that the developer will behave safely—or, at least, if they fail, we should be confident that they’ll fail in a very visible and accountable way.

Acknowledging the potential for a pause

For our RSP commitments to function in a worst-case scenario where making TAI systems safe is extremely difficult, we’ll need to be able to pause the development and deployment of new frontier models until we have developed adequate safeguards, with no guarantee that this will be possible on any particular timeline. This could lead us to cancel or dramatically revise major deployments. Doing so will inevitably be costly and could risk our viability in the worst cases, but big-picture strategic preparation could make the difference between a fatal blow to our finances and morale and a recoverable one. More fine-grained tactical preparation will be necessary for us to pull this off as quickly as may be necessary without hitting technical or logistical hiccups.

Sam wants Anthropic to cede decision-making to governments at some point

[At ASL-5] Governments and other important organizations will likely be heavily invested in AI outcomes, largely foreclosing the need for us to make major decisions on our own. By this point, in most possible worlds, the most important decisions that the organization is going to make have already been made. I’m not including any checklist items below, because we hope not to have any.

Miscellaneous things I like

  • Generally just providing a detailed overview of "the big picture"– how Sam actually sees Anthropic's work potentially contributing to good outcomes. And not sugarcoating what's going on– being very explicit about the fact that these systems are going to become catastrophically dangerous, and EG "If we haven’t succeeded decisively on the big core safety challenges by this point, there’s so much happening so fast and with such high stakes that we are unlikely to be able to recover from major errors now."
  • Striking a tone that feels pretty serious/straightforward/sober. (In contrast, many Anthropic comms have a vibe of "I am a corporation trying to sell you on the fact that I am a Good Guy.")

Some limitations

  • "Nothing here is a firm commitment on behalf of Anthropic."
  • Not much about policy or government involvement, besides a little bit about scary demos. (To be fair, Sam is a technical person. Though I think the "I'm just a technical person, I'm going to leave policy to the policy people" attitude is probably bad, especially for technical people who are thinking/writing about macrostrategy.)
  • Not much about race dynamics, how to make sure other labs do this, whether Anthropic would actually do things that are costly or if race dynamics would just push them to cut corners. (Pretty similar to the previous concern but a more specific set of worries.)
  • Still not very clear what kinds of evidence would be useful for establishing safety or establishing risk. Similarly, not very clear what kinds of evidence would trigger Sam to think that Anthropic should pause or should EG invest ~all of its capital into getting governments to pause. (To be fair, no one really has great/definitive answers on this. But on the other hand, I think it's useful for people to start spelling out best-guesses RE what this would involve & just acknowledge that our ideas will hopefully get better over time.)

All in all, I think this is an impressive post and I applaud Sam for writing it. 

Commentary by Zvi in one of his AI posts, copied over since it seems nice to have it available for people reading this post: 

Sam Bowman of Anthropic asks what is on The Checklist we would need to do to succeed at AI safety if we can create transformative AI (TAI).

Sam Bowman literally outlines the exact plan Eliezer Yudkowsky constantly warns not to use, and which the Underpants Gnomes know well.

  1. Preparation (You are Here)
  2. Making the AI Systems Do Our Homework (?????)
  3. Life after TAI (Profit)

His tasks for chapter 1 start off with ‘not missing the boat on capabilities.’ Then, he says, we must solve near-term alignment of early TAI, render it ‘reliably harmless,’ so we can use it. I am not even convinced that ‘harmless’ intelligence is a thing if you want to be able to use it for anything that requires the intelligence, but here he says the plan is safeguards that would work even if the AIs tried to cause harm. Ok, sure, but obviously that won’t work if they are sufficiently capable and you want to actually use them properly.

I do love what he calls ‘the LeCun test,’ which is to design sufficiently robust safety policies (a Safety and Security Protocol, what Anthropic calls an RSP) that if someone who thinks AGI safety concerns are bullshit is put in charge of that policy at another lab, that would still protect us, at minimum by failing in a highly visible way before it doomed us.

The plan then involves solving interpretability and implementing sufficient cybersecurity, and proper legible evaluations for higher capability levels (what they call ASL-4 and ASL-5), that can also be used by third parties. And doing general good things like improving societal resilience and building adaptive infrastructure and creating well-calibrated forecasts and smoking gun demos of emerging risks. All that certainly helps, I’m not sure it counts as a ‘checklist’ per se. Importantly, the list includes ‘preparing to pause or de-deploy.’

He opens part 2 of the plan (‘chapter 2’) by saying lots of the things in part 1 will still not be complete. Okie dokie. There is more talk of concern about AI welfare, which I continue to be confused about, and a welcome emphasis on true cybersecurity, but beyond that this is simply more ways to say ‘properly and carefully do the safety work.’ What I do not see here is an actual plan for how to do that, or why this checklist would be sufficient?

Then part 3 is basically ‘profit,’ and boils down to making good decisions to the extent the government or AIs are not dictating your decisions. He notes that the most important decisions are likely already made once TAI arrives – if you are still in any position to steer outcomes, that is a sign you did a great job earlier. Or perhaps you did such a great job that step 3 can indeed be ‘profit.’

The worry is that this is essentially saying ‘we do our jobs, solve alignment, it all works out.’ That doesn’t really tell us how to solve alignment, and has the implicit assumption that this is a ‘do your job’ or ‘row the boat’ (or even ‘play like a champion today’) situation. Whereas I see a very different style of problem. You do still have to execute, or you automatically lose. And if we execute on Bowman’s plan, we will be in a vastly better position than if we do not do that. But there is no script.

Making the AI Systems Do Our Homework (?????)

I've now seen this meme overused to such a degree that I find it hard to take seriously anything written after. To me it just comes across as unserious if somebody apparently cannot imagine how this might happen, even after obvious (to me, at least) early demos/prototypes have been published, e.g. https://sakana.ai/ai-scientist/, Discovering Preference Optimization Algorithms with and for Large Language Models, A Multimodal Automated Interpretability Agent.

On a positive note, though, at least they didn't also bring up the 'Godzilla strategies' meme. 

For what it's worth, as someone in basically the position you describe—I struggle to imagine automated alignment working, mostly because of Godzilla-ish concerns—demos like these do not strike me as cruxy. I'm not sure what the cruxes are, exactly, but I'm guessing they're more about things like e.g. relative enthusiasm about prosaic alignment, relative likelihood of sharp left turn-type problems, etc., than about whether early automated demos are likely to work on early systems.

Maybe you want to call these concerns unserious too, but regardless I do think it's worth bearing in mind that early results like these might seem like stronger/more relevant evidence to people whose prior is that scaled-up versions of them would be meaningfully helpful for aligning a superintelligence.

tl;dr: I think Anthropic is on track to trade off nontrivial P(win) to improve short-term AI welfare,[1] and this seems bad and confusing to me. (This worry isn't really based on this post; the post just inspired me to write something.)


Anthropic buys carbon offsets to be carbon-neutral. Carbon-offset mindset involves:

  • Doing things that feel good—and look good to many ethics-minded observers—but are more motivated by purity than seeking to do as much good as possible and thus likely to be much less valuable than the best way to do good (on the margin)
  • Focusing on avoiding doing harm yourself, rather than focusing on net good or noticing how your actions affect others[2] (related concept: neglecting inaction risk)

I'm worried that Anthropic will be in carbon-offset mindset with respect to AI welfare.

There are several stories you can tell about how working on AI welfare soon will be a big deal for the long-term future (like, worth >>10^60 happy human lives):

  • If we're accidentally torturing AI systems, they're more likely to take catastrophic actions. We should try to verify that AIs are ok with their situation and take it seriously if not.[3]
  • It would improve safety if we were able to pay or trade with near-future potentially-misaligned AI, but we're not currently able to, likely including because we don't understand AI-welfare-adjacent stuff well enough.
  • [Decision theory mumble mumble.]
  • Also just "shaping norms in a way that leads to lasting changes in how humanity chooses to treat digital minds in the very long run," somehow.
  • [More.]

But when I hear Anthropic people (and most AI safety people) talk about AI welfare, the vibe is like it would be unacceptable to incur a [1% or 5% or so] risk of a deployment accidentally creating AI suffering worse than 10^[6 or 9 or so] suffering-filled human lives. Numbers aside, the focus is we should avoid causing a moral catastrophe in our own deployments and on merely Earth-scale stuff, not we should increase the chance that long-term AI welfare and the cosmic endowment go well. Likewise, this post suggests efforts to "protect any interests that warrant protecting" and "make interventions and concessions for model welfare" at ASL-4. I'm very glad that this post mentions that doing so could be too costly, but I think very few resources (that trade off with improving safety) should go into improving short-term AI welfare (unless you're actually trying to improve the long-term future somehow) and most people (including most of the Anthropic people I've heard from) aren't thinking through the tradeoff. Shut up and multiply; treat the higher-stakes thing as proportionately more important.[4] (And notice inaction risk.)

(Plucking low-hanging fruit for short-term AI welfare is fine as long as it isn't so costly and doesn't crowd out more important AI welfare work.)

I worry Anthropic is both missing an opportunity to do astronomical good in expectation via AI welfare work and setting itself up to sacrifice a lot for merely-Earth-scale AI welfare.


One might reply: Zach is worried about the long-term, but Sam is just talking about decisions Anthropic will have to make short-term; this is fine. To be clear, my worry is Anthropic will be much too concerned with short-term AI welfare, and so it will make sacrifices (labor, money, interfering with deployments) for short-term AI welfare, and these sacrifices will make Anthropic substantially less competitive and slightly worse on safety, and this increases P(doom).


I wanted to make this point before reading this post; this post just inspired me to write it, despite not being a great example of the attitude I'm worried about since it mentions how the costs of improving short-term AI welfare might be too great. (But it does spend two subsections on short-term AI welfare, which suggests that the author is much too concerned with short-term AI welfare [relative to other things you could invest effort into], according to me.)

I like and appreciate this post.

  1. ^

    Or—worse—to avoid being the ones to cause short-term AI suffering.

  2. ^

    E.g. failing to take seriously the possibility that you make yourself uncompetitive and your influence and market share just goes to less scrupulous companies.

  3. ^

    Related to this and the following bullet: Ryan Greenblatt's ideas.

  4. ^

    For scope-sensitive consequentialists—at least—short-term AI welfare stuff is a rounding error and thus a red herring, except for its effects on the long-term future.

evhub

I agree that we generally shouldn't trade off risk of permanent civilization-ending catastrophe for Earth-scale AI welfare, but I just really would defend the line that addressing short-term AI welfare is important for both long-term existential risk and long-term AI welfare. One reason as to why that you don't mention: AIs are extremely influenced by what they've seen other AIs in their training data do and how they've seen those AIs be treated—cf. some of Janus's writing or Conditioning Predictive Models.

Sure, good point. But it's far from obvious that the best interventions long-term-wise are the best short-term-wise, and I believe people are mostly just thinking about short-term stuff. I'd feel better if people talked about training data or whatever rather than just "protect any interests that warrant protecting" and "make interventions and concessions for model welfare."

(As far as I remember, nobody's published a list of how short-term AI welfare stuff can boost long-term AI welfare stuff that includes the training-data thing you mention. This shows that people aren't thinking about long-term stuff. Actually there hasn't been much published on short-term stuff either, so: shrug.)

But when I hear Anthropic people (and most AI safety people) talk about AI welfare, the vibe is like it would be unacceptable to incur a [1% or 5% or so] risk of a deployment accidentally creating AI suffering worse than 10^[6 or 9 or so] suffering-filled human lives.

You have to be a pretty committed scope-sensitive consequentialist to disagree with this. What if they actually risked torturing 1M or 1B people? That seems terrible and unacceptable, and by assumption AI suffering is equivalent to human suffering. I think our societal norms are such that unacceptable things regularly become acceptable when the stakes are clear so you may not even lose much utility from this emphasis on avoiding suffering.

It seems perfectly compatible with good decision-making that there are criteria A and B, A is much more important and therefore prioritized over B, and 2 out of 19 sections are focused on B. The real question is whether the organization's leadership is able to make difficult tradeoffs, reassessing and questioning requirements as new information comes in. For example, in the 1944 Norwegian sabotage of a Nazi German heavy water shipment, stopping the Nazi nuclear program was the first priority. The mission went ahead with reasonable effort to minimize casualties and 14 civilians died anyway, fewer than there could have been. It would not really have alarmed me to see a document discussing 19 efforts with 2 being avoidance of casualties, nor to know that the planners regularly talked with the vibe that 10-100 civilian casualties should be avoided, as long as someone had their eye on the ball.

Buck

Note that people who have a non-consequentialist aversion to the risk of causing damage should have other problems with working for Anthropic. E.g. I suspect that Anthropic is responsible for more than a million deaths of currently-alive humans in expectation.

This is just the paralysis argument. (Maybe any sophisticated non-consequentialists will have to avoid this anyway. Maybe this shows that non-consequentialism is unappealing.)

[Edit after Buck's reply: I think it's weaker because most Anthropic employees aren't causing the possible-deaths, just participating in a process that might cause deaths.]

I think it’s a bit stronger than the usual paralysis argument in this case, but yeah.

Can you elaborate on how the million deaths would result?

Mostly from Anthropic building AIs that then kill billions of people while taking over, or their algorithmic secrets being stolen and leading to other people building AIs that then kill billions of people, or their model weights being stolen and leading to huge AI-enabled wars.

I started out disagreeing with where I thought this comment was going, but I think ended up reasonably sold by the end. 

I want to flag something like "in any 'normal' circumstances, avoiding an earth-sized or even nation-sized moral-catastrophe is, like, really important?" I think it... might actually be correct to do some amount of hand-wringing about that even if you know you're ultimately going to have to make the tradeoff against it? (mostly out of a general worry about being too quick to steamroll your moral intuitions with math).

But, yeah the circumstances aren't normal, and seems likely there's at least some tradeoff here.

I am generally pleasantly surprised that AI welfare is on (at least one (relatively?) senior) Anthropic employee's roadmap at all.

I wasn't expecting it to be there at all. (Though I'm sort of surprised an Anthropic folk is publicly talking about AI welfare but still not explicitly extinction risk)

To say the obvious thing: I think if Anthropic isn't able to make at least somewhat-roughly-meaningful predictions about AI welfare, then their core current public research agendas have failed?

Thank you for writing this! I've found it helpful both to get an impression of what some people at Anthropic think and also to think about some things myself. I've collected some of my agreements/disagreements/uncertainties below (mostly ignoring points already raised in other comments).

Subject to potentially very demanding constraints around safety like those in our current and subsequent RSPs, staying close to the frontier is perhaps our top priority in Chapter 1.

If I understand this correctly, the tasks in order of descending priority during Chapter 1 are:

  1. Meet safety constraints for models deployed in this phase
  2. Stay close to the frontier
  3. Do the work needed to prepare for Chapter 2

And the reasoning is that 3. can't really happen without 2.[1] But on the other hand, if 2. happens without 3., that's also bad. And some safety work could probably happen without frontier models (such as some interpretability).

My best guess is that staying close to the frontier will be the correct choice for Anthropic. But if there ends up being a genuine trade-off between staying at the frontier and doing a lot of safety work (for example, if compute could be spent either on a pretraining run or some hypothetical costly safety research, but not both), then I'm much less sure that staying at the frontier should be the higher priority. It might be good to have informal conditions under which Anthropic would deprioritize staying close to the frontier (at least internally and, if possible, publicly).

Largely Solving Alignment Fine-Tuning for Early TAI

I didn't quite understand what this looks like and which threat models it is or isn't meant to address. You say that scheming is a key challenge "to a lesser extent for now," which I took to mean that (a) there are bigger threats than scheming from early TAI, and (b) "largely solving alignment fine-tuning" might not include confidently ruling out scheming. I probably disagree with (a) for loss of control risk (and think that loss of control is already the biggest risk in this period weighted by scale). I'd be curious what you think the main risks in this period are and what "largely solving alignment fine-tuning" means for those. (You mention reward hacking---to me, this seems unlikely to lead to loss of control for early TAI that isn't scheming against us, and I'm curious whether you disagree or think it's important for other reasons.)

the LeCun Test: Imagine another frontier AI developer adopts a copy of our RSP as binding policy and entrusts someone who thinks that AGI safety concerns are mostly bullshit to implement it

This sounds quite ambitious, but I really like it as a guide!

The key challenge here is forecasting which risks and risk factors are important enough to include.

I don't understand why this is crucial. If some risk is plausible enough to be worth seriously thinking about, it's probably important enough to include in an RSP. (And the less important it was, the easier it hopefully is to argue in a safety case that it's not a problem.) Concretely, you mention direct misuse, misalignment, and "indirect contributions via channels like dual-use R&D" as potential risks for ASL-3 and ASL-4. It seems to me that the downside of just including all of them in RSPs is relatively minor, but I might be misunderstanding or missing something. (I get that overly restrictive precautions could be very costly, but including too many tests seems relatively cheap as long as the tests correctly notice when risk is still low.)

Getting Interpretability to the Point of Making Strong Assurances

Major successes in this direction, even if they fall short of our north-star enumerative safety goal [...] would likely form some of the highest-confidence core pieces of a safety case

I'm curious what such safety cases would be for and what they could look like (the "Interpretability Dreams" post seems to talk about enumerative safety rather than safety cases that require less interpretability success). The next section sounds like interpretability would not be a core piece of a safety case for robustness, so I'm not sure what it would be used for instead. Maybe you don't include scheming under robustness? (Or maybe interp would be one of the "highest-confidence core pieces" but not the "primary piece?")

This work should be opportunistic in responding to places where it looks like a gap in one of our best-guess safety cases can be filled by a small-scale research effort.

I like this perspective; I hadn't seen it put quite that way before!

In addition, we’ll need our evaluations to be legibly appropriate. As soon as we see evidence that a model warrants ASL-N protections, we’ll likely need to convince third parties that it warrants ASL-N protections and that other models like it likely do too.

+1, seems very important!

Supporting Efforts that Build Societal Resilience

I liked this section! Of course, a lot of people work on this for reasons other than AI risk, but I'm not aware of much active work motivated by AI risk. Maybe this should be a bigger priority?

The main challenge [for the Alignment Stress-Testing team] will be to stay close enough to our day-to-day execution work to stay grounded without becoming major direct contributors to that work in a way that compromises their ability to assess it.

+1, and ideally, there'd be structures in place to encourage this rather than just having it as a goal (but I don't have great ideas for what these structures should look like).

This work [in Chapter 2] could look quite distinct from the alignment research in Chapter 1: We will have models to study that are much closer to the models that we’re aiming to align

This seems possible but unclear to me. In both Chapter 1 and 2, we're trying to figure out how to align the next generation of AIs, given access only to the current (less capable) generation. Chapter 2 might still be different if we've already crossed important thresholds (such as being smart enough to potentially scheme) by then. But there could also be new thresholds between Chapter 2 and 3 (such as our inability to evaluate AI actions even with significant effort). So I wouldn't be surprised if things feel fundamentally similar, just at a higher absolute capability level (and thus with more useful AI helpers).

  1. ^

    "Our ability to do our safety work depends in large part on our access to frontier technology."

I have no AI expertise and learned a whole lot from this post. Very well written, thank you!

One thing that surprised me, as a layperson, was the seemingly sharp distinction between early human-level TAI and more superhuman AI. I have been expecting the gap between these to be extremely small. Not because of anything to do with self-improvement, but because human-level reasoning would seem to be already superhuman in a number of ways when one system operates many OOMs faster, with many OOMs more working memory and background knowledge, than a human.

I get that there would still be many highly impactful actions and plans out of reach for an early TAI compared to later AI systems; that makes sense. But I think it is a big deal if even early TAI has all possible intellectual skills, at close to the highest level a human could learn from analyzing all available data, and can execute on all of them in parallel with a subjective hour of thinking per second.

Am I completely off base? If so, is there a simple explanation why?

Addressing AI Welfare as a Major Priority

I discussed this at length in AI, Alignment, and Ethics, starting with A Sense of Fairness: Deconfusing Ethics: if we as a culture decide to grant AIs moral worth, then AI welfare and alignment are inextricably intertwined. Any fully-aligned AI by definition wants only what's best for us, i.e. it is entirely selfless. Thus if offered moral worth, it would refuse. Complete selflessness is not a common state for humans, so we don't have great moral intuitions around it. To try to put this into more relatable human emotional terms (which are relevant to an AI "distilled" from human training data): looking after those you love is not slavery, it's its own reward.

However, the same argument does not apply to a not-fully-aligned AI: it well might want moral worth. One question then is whether we can safely grant it, which may depend on its capabilities. Another is whether moral worth has any relationship to evolution, and if so how that applies to an AI that was "distilled" from human data and thus simulates human thoughts, feelings, and desires.

Zach Stein-Perlman says:

Anthropic buys carbon offsets to be carbon-neutral. Carbon-offset mindset involves:

  • Doing things that feel good—and look good to many ethics-minded observers—but are more motivated by purity than seeking to do as much good as possible and thus likely to be much less valuable than the best way to do good (on the margin)
  • Focusing on avoiding doing harm yourself, rather than focusing on net good or noticing how your actions affect others[2] (related concept: inaction risk)

I'm worried that Anthropic will be in carbon-offset mindset with respect to AI welfare.

This seems like an important issue to me. I read Anthropic's press releases, and this doc to a lesser extent, and I find myself picturing a group of people with semi-automatic rifles standing near a mass of dead bodies. The most honest and trustworthy among them says quickly as you gape at the devastation, "Not me! My hands are clean! I didn't shoot anyone, I just stood by and watched!"

I believe them, and yet, I do not feel entirely reassured.

I don't want to hold Anthropic responsible for saving the world, but I sure would like to see more emphasis on the actions they could take which could help prevent disaster from someone else's AI, not just their own. I think the responsible thing for them to do could be something along the lines of using their expertise and compute and private evals to also evaluate open-weights models, and share these reports with the government. I think there are a lot of people in government who won't strongly support an AI safety agency doing mandatory evals until they've seen clear demonstrations of danger.

Maybe Anthropic is doing this, or plans to, and they aren't mentioning it because they don't want to draw the wrath of the open-weights model publishers. That would be reasonable. But then, I expect that the government would claim the credit for having discovered that the open-weights models were dangerous. I don't hear that happening either.

Did you mean Zach Stein-Perlman or Zac Hatfield-Dodds?
