This post is incomplete. I'm publishing it because it might be helpful for some readers anyway. A good version of this post would be more detailed: for each proposed action, explain motivation and high-level goal, explain problem to solve, respond to obvious questions/objections, and lay out costs involved and costs worth accepting.

These are the actions I think companies that will build very powerful AI systems should prioritize to improve safety. I think something like these actions is cost-effective,[1] but specific details may be suboptimal. Some actions are not fully specified: they take the form "do good things in area X" or "avoid risk Y" without pinning down what to do. This plan is not certain to remain correct in the future; I just think it's a good plan for now. Some of these actions are costly, but I claim that they are cheap enough and important enough that a responsible frontier AI company should do them unilaterally: the gain in safety is worth the cost in competitiveness. I mostly don't justify these actions in this post.

(This post is about responsible labs — asks for non-responsible labs should be more focused on transparency and becoming responsible. This post is about unilateral action — in coordinated action you can accept more costs.)

Security for model weights and code

Extremely strong security will be crucial when AIs can massively accelerate AI R&D. But operating under extreme security—security sufficient to defend against top-priority operations by top state actors—is costly and not yet necessary. So labs should focus on achieving extreme security optionality—the ability to implement extreme security quickly[2]—before it would be very bad for AI model weights to be stolen.[3] It seems such AI could plausibly appear as early as 2027; if so, labs should plan to achieve extreme security optionality before 2027.[4] The lab should develop a roadmap to achieve extreme security optionality, publicly explain and justify its high-level security goals, and maybe share some details and publish stuff to validate claims (e.g. a pentester's summary of their pentest results) with sensitive details redacted.[5] It is also reasonable to aim for a weaker level of security, SL5, for now, but that will probably be insufficient when AIs can massively accelerate research.

Control

Anticipate and prevent the risk that AIs will scheme—i.e., act safe and subtly subvert safety measures to gain power—and escape during internal deployment.[6] The lab should do evals for scheming-relevant capabilities, and by the time models can plausibly scheme coherently, be ready to ensure control, e.g.:

  • Use monitoring and auditing techniques in deployment
    • AI monitors read all inputs and outputs; the most suspicious outputs are flagged for human review; moderately suspicious outputs are resampled (potentially from a trusted, weaker model) or edited by a trusted model
  • Use control evals based on the deployment protocol: determine how good the protocol is assuming the model is scheming
  • Try to directly test whether the model is scheming (even though this is unreliable): test the model on inputs that might cause it to misbehave in simulated high-stakes deployment settings, and test it in simulated successful escapes
  • If the lab catches scheming, take it seriously: tell everyone, pause deployment—or go into high alert, disabling some risky stuff and lowering thresholds for editing/replacing/auditing suspicious model outputs—and use that example of misbehavior to improve safety
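
The monitoring bullet above describes a triage rule: audit the most suspicious outputs, resample or edit moderately suspicious ones, and pass the rest. Here is a minimal sketch of that rule, assuming a monitor that emits a suspicion score between 0 and 1. The names `Action`, `Thresholds`, `triage`, and `HIGH_ALERT`, and all numeric cutoffs, are hypothetical illustrations, not any lab's actual protocol.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PASS = "pass"          # deliver the untrusted model's output unchanged
    RESAMPLE = "resample"  # replace or edit using a trusted, weaker model
    AUDIT = "audit"        # hold the output and flag it for human review


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical cutoffs on a 0-1 suspicion score; real values would be
    # tuned against the human-review budget and control-eval results.
    resample_at: float = 0.5
    audit_at: float = 0.9


def triage(suspicion: float, t: Thresholds = Thresholds()) -> Action:
    """Map a monitor's suspicion score to a protocol action."""
    if not 0.0 <= suspicion <= 1.0:
        raise ValueError("suspicion score must be in [0, 1]")
    if suspicion >= t.audit_at:
        return Action.AUDIT
    if suspicion >= t.resample_at:
        return Action.RESAMPLE
    return Action.PASS


# "High alert" after catching scheming: lower both thresholds so that more
# outputs are resampled or audited (illustrative values only).
HIGH_ALERT = Thresholds(resample_at=0.2, audit_at=0.6)
```

Under these assumed cutoffs, `triage(0.7)` resamples in normal operation, while `triage(0.7, HIGH_ALERT)` escalates the same output to human audit. A control eval would then ask how often a scheming model could get a harmful output past this rule.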

Misuse

Anticipate and prevent harm from misuse during external deployment. Do good model evals for dangerous capabilities. Do evals pre-deployment (to inform deployment decisions), have good threat models and categories of evals, use high-quality evals, use good elicitation, publish evals and results, share good access with third-party evaluators. Once models might have dangerous capabilities, deploy such that users can't access those capabilities: deploy via API or similar rather than releasing weights, monitor queries to block suspicious messages and identify suspicious users, maybe do harmlessness/refusal training and adversarial robustness. If a particular deployment plan incurs a substantial risk of catastrophic misuse, deploy more narrowly or cautiously.

Accountability for future deployment decisions

Plan that by the time AIs are capable of tripling the rate of R&D, the lab will do something like this:[7]

Do great risk assessment, especially via model evals for dangerous capabilities. Present risk assessment results and deployment plans to external auditors — perhaps make risk cases and safety cases for each planned deployment, based on threat modeling, model eval results, and planned safety measures. Give auditors more information upon request. Have auditors publish summaries of the situation, how good a job you're doing, the main issues, their uncertainties, and an overall risk estimate.

Plan for AGI

Labs should have a clear, good plan for what to do when they build AGI (i.e., AI that can obsolete top human scientists), beyond just making the AGI safe.[8] For example, one plan is "perfectly solve AI safety, then hand things off to some democratic process." This plan is bad because solving AI safety will likely take years after systems are transformative and superintelligence is attainable, and by default, during that time, someone will build superintelligence without strong reason to believe they can do so safely, or do something else unacceptable.

The outline of the best plan I've heard is: build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer[9] to them (before building wildly superintelligent AI). Assume it will take 5-10 years after AGI to build such systems and give them sufficient time. To buy time (or: avoid being rushed by other AI projects[10]), inform the US government and convince it to enforce that nobody builds wildly superintelligent AI for a while (and likely limit AGI weights to allied projects with excellent security and control). Use AGI to increase the government's willingness to pay for nonproliferation and to decrease the cost of nonproliferation (build carrots to decrease the cost to others of accepting nonproliferation and build sticks to decrease the cost to the US of enforcing nonproliferation).[11]

For now, the lab should develop and prepare to implement such a plan. It should also publicly articulate the basic plan, unless doing so would be costly or undermine the plan.

Safety research

Do and share safety research as a public good, to help make powerful AI safer even if it's developed by another lab.[12] Also boost external safety research, especially by giving safety researchers deeper access[13] to externally deployed models.

Policy engagement

Inform policymakers about AI capabilities; when there is better empirical evidence on misalignment and scheming risk, inform policymakers about that; don't get in the way of good policy.

Let staff share views

Encourage staff to discuss their views on AI progress, risks, and safety, both internally and publicly (except for confidential information), or at least avoid discouraging staff from doing so. (Facilitating external whistleblowing has benefits but some forms of it are costly, and the benefits are smaller for more responsible labs, so I don't confidently recommend unilateral action for responsible labs.)

Public statements

The lab and its leadership should understand and talk about extreme risks. The lab should clearly describe a worst-case plausible outcome from AI and state its credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax).

If-then planning (meta)

Plan safety interventions as a function of dangerous capabilities, warning signs, etc. Rather than acting based on guesses about which risks will appear, set practices in advance as a function of warning signs. (This should satisfy both risk-worried people who expect warning signs and risk-unworried people who don't.)
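
As a rough illustration of "practices as a function of warning signs," here is a minimal sketch of an if-then table and a check for unmet commitments. The sign names, measure names, and the functions `required_measures` and `missing_measures` are hypothetical placeholders chosen for this example, not a recommended policy.

```python
# Hypothetical if-then table: each warning sign maps to the safety measures
# that must be in place once that sign is observed. All names here are
# illustrative placeholders.
IF_THEN: dict[str, frozenset[str]] = {
    "could_accelerate_ai_rnd": frozenset({"extreme_security", "control_protocol"}),
    "plausibly_scheming": frozenset({"control_protocol", "control_evals"}),
    "dangerous_misuse_capability": frozenset({"api_only_deployment", "query_monitoring"}),
}


def required_measures(observed_signs: set[str]) -> set[str]:
    """Union of the measures triggered by the observed warning signs.

    Unknown signs raise KeyError so that a typo cannot silently disable
    a trigger.
    """
    required: set[str] = set()
    for sign in observed_signs:
        required |= IF_THEN[sign]
    return required


def missing_measures(observed_signs: set[str], implemented: set[str]) -> set[str]:
    """Measures the plan requires that are not yet in place."""
    return required_measures(observed_signs) - implemented
```

For example, `missing_measures({"plausibly_scheming"}, {"control_protocol"})` returns `{"control_evals"}`. The point of fixing the table in advance is that the response to a warning sign is decided before anyone has an incentive to argue the sign away.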

More

I list some less important or more speculative actions (very nonexhaustive) in this footnote.[14] See The Checklist: What Succeeding at AI Safety Will Involve (Bowman 2024) for a collection of goals for a lab. See A Plan for Technical AI Safety with Current Science (Greenblatt 2023) for a detailed but rough, out-of-date, and very high-context plan. See Racing through a minefield: the AI deployment problem (Karnofsky 2022) on the dilemma labs face when deciding how to deploy powerful AI and high-level actions for a lab to improve safety. See Towards best practices in AGI safety and governance: A survey of expert opinion (Schuett et al. 2023) for an old list.


Almost none of this post is original. Thanks to Ryan Greenblatt for inspiring much of it, and thanks to Drake Thomas, Rohin Shah, Alexandra Bates, and Gabriel Mukobi for suggestions. They don't necessarily endorse this post.


Crossposted from AI Lab Watch.

  1. ^

    With two exceptions, where I feel unconfident: misuse prevention (misuse has positive second-order effects) and accountability for risk assessment (its causal benefit may be small if done unilaterally).

  2. ^

    Extreme security optionality is the ability to implement extreme security quickly, e.g. within 3 months. Extreme security should also be implementable at a reasonable maximum cost, e.g. $10B to implement plus $10B per year and a 75% slowdown in research. The property "the lab could implement extreme security, but doing so would require basically stopping research or incurring other prohibitive costs" doesn't suffice.

    I focus on extreme security for model weights and code, not necessarily for secrets. Securing weights will be more important; it seems easier; and extreme security for secrets entails that staff are much less able to talk to people outside the lab, which could hurt their ability to think clearly.

    Labs will likely need substantially better security than "defend against top-priority operations by top state actors" soon after AIs can massively accelerate AI R&D: model weights will be among the most valuable and strategically important assets ever, so labs should be prepared for "unprecedentedly well-resourced attacks," not to mention that AI will boost offense.

  3. ^

    Proliferation of model weights would lead to others quickly getting more powerful AI, since the AI can massively boost AI R&D. This removes the option of coordination that requires nonproliferation, enables misuse and state conflict, makes it hard for US labs to gain a substantial lead, and makes it impossible for US labs to unilaterally slow progress while retaining a lead.

  4. ^

    By the time extreme security is necessary, AIs will be able to substantially boost security. So for now, labs should focus on the things that will be hard for AI to accelerate: things that require nonparallelizable physical work. For example, labs should probably focus on physically constructing or retrofitting datacenters such that they can be extremely secure.

    If it becomes clear that the lab will achieve extreme security optionality before it's necessary, the lab could choose to revise the plan and slow its progress toward extreme security.

  5. ^

    Would an if-then security commitment—don't create a very dangerous model until its weights would be secure—suffice? Sure, but I worry that extreme security takes substantial time to implement (in part because it likely requires physical changes to datacenters) and the labs are not on track to implement it before they need it, so a lab will either proceed unsafely or pause while the less scrupulous race ahead. Regardless of whether a lab has such a commitment, I think it should start sprinting for extreme security optionality now.

  6. ^

     Maybe a large fraction of AI risk "comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes—in particular, those resources can be abused to run AIs unmonitored."

    Escape means "either escaping from the datacenter or getting root access to it." A central story for AI escaping goes like this:

    At some point when the AI is quite powerful and useful but not totally intellectually self-sufficient yet, the AI makes a code change to some code used on the datacenter where it is running that adds a backdoor, allowing it to get root access to all the computers in the datacenter. Once it has root access, it modifies the SSH server so that whenever any humans log into the datacenter, the AI is able to intercept any of their commands or the data that they see. The AI gets increasingly powerful over the next month, as the human ML researchers with AI assistance continue to make progress (unaware that some things that they're seeing are faked) and various training runs/self-improvement loops continue. After a month the AI is powerful enough that it decides it's worth it to take overtly aggressive actions to ensure the physical security of its data centers and allow it to build lots of factories.

  7. ^

    This is partially a substitute for responsible scaling policies. Labs should still make at least high-level RSP-y (or "if-then") commitments too.

  8. ^

    A plan could include how to use AGI, what further AI development to do, and what the exit state is if all goes well. A plan should consider not just what the lab will do but also what will be happening with the US government, other labs, and other states.

    Really the goal is a playbook, not just a plan — it should address what happens if various assumptions aren't met or things are different than expected, should include ideas that are worth exploring/trying but not depending on, etc.

  9. ^

    We don't need these human-obsoleting AIs to be able to implement CEV. We want to be able to defer to them on tricky wisdom-loaded questions like "what should we do about the overall AI situation?" They can ask us questions as needed.

  10. ^

    To avoid being rushed by your own AI project, you also have to ensure that your AI can't be stolen and can't escape, so you have to implement excellent security and control.

  11. ^

    This plan, and desire for this kind of plan, is largely stolen from Ryan Greenblatt. He doesn't necessarily endorse my version.

    Dario Amodei recently said something related:

    it seems very important that democracies have the upper hand on the world stage when powerful AI is created. . . . My current guess at the best way to do this is via an “entente strategy”, in which a coalition of democracies seeks to gain a clear advantage (even just a temporary one) on powerful AI by securing its supply chain, scaling quickly, and blocking or delaying adversaries’ access to key resources like chips and semiconductor equipment. This coalition would on one hand use AI to achieve robust military superiority (the stick) while at the same time offering to distribute the benefits of powerful AI (the carrot) to a wider and wider group of countries in exchange for supporting the coalition’s strategy to promote democracy (this would be a bit analogous to “Atoms for Peace”). The coalition would aim to gain the support of more and more of the world, isolating our worst adversaries and eventually putting them in a position where they are better off taking the same bargain as the rest of the world: give up competing with democracies in order to receive all the benefits and not fight a superior foe.

    (Among other things, this does not address the likely possibility that the democracies should delay building wildly superintelligent AI.)

  12. ^

    This is slightly complicated because (1) some safety research boosts capabilities and (2) some safety practices work better if their existence (in the case of control) or details (in the case of adversarial robustness) are secret. In some cases you can share with other labs to get most of the benefits and little of the cost. Regardless, lots of safety research is good to publish. And publishing model evals is good for some of the same reasons plus transparency and accountability.

  13. ^

    E.g. fine-tuning, helpful-only, filters/moderation off, logprobs, activations.

  14. ^
    • RSP details (but it's not currently possible to do all of this very well)
      • See generally Key Components of an RSP
      • Do evals regularly
      • And have predefined high-level capability thresholds
      • And have predefined low-level operationalizations
      • And publish evals and thresholds, or have an auditor comment
      • And have thresholds trigger specific responses
      • And all these details should be good
      • And prepare for a pause (to make it less costly, in case it's necessary)
      • And accountability/transparency/meta stuff: publish updates or be transparent to an auditor, have a decent update process, elicit external review, perhaps adopt a mechanism for staff to flag false statements
    • Elicitation details (for model evals for dangerous capabilities): see the summary in my blogpost (or my table)
    • Don't publish capabilities research, modulo Thoughts on sharing information about language model capabilities (Christiano 2023) — don't publish stuff that boosts black-box capabilities (especially stuff on training) unless it improves safety
    • Internal governance
      • Prepare for difficult decisions ahead
      • Have structure, mandate, and incentives set up such that the lab can prioritize safety in key decisions.
      • Plan what the lab and its staff would do if it needed to pause for safety
      • Have a good process for staff to escalate concerns about safety internally
      • Don't discourage certain kinds of whistleblowing — details unclear
      • Don't use nondisparagement provisions in contracts or severance agreements
    • Object-level security practices
    • Various specific safety techniques (potentially high-priority but I just didn't focus this post on them) (not necessarily currently existing) (very nonexhaustive)

share good access with third-party evaluators

Glad to see my personal hobbyhorse got a mention!

Overall this is great, and I have just one concern about a possible what-if case that isn't covered.

Model development secrets being disproportionately important.

I know, it's a really tricky thing to deal with if this becomes the case. But I think it's worth having a 'what if' plan in place in case this suddenly happens. Imagine a lab is working on their research and some researcher stumbles across an algorithmic innovation which makes a million-fold improvement in training efficiency. The peak capability level of the model that can now be trained for a few thousand dollars is now on par with their leading multi-billion dollar frontier model. Furthermore, this peak capability level is sufficient for the 3x speedup in AI R&D that you specified as being a threshold of high concern. Under such a circumstance, this secret becomes even more dangerous and valuable than the multi-billion-dollar-training-cost frontier model weights. Seems like having a plan in place for what if this happens would be wise.

Something like: don't tell your co-workers, report to so-and-so specific person whose job it is to handle evaluating and reporting up-the-chain about possible dangerous algorithmic developments. Probably you'd need a witness protection / isolation setup for the researchers who'd been exposed to the secret. You'd need to start planning around how soon others might stumble onto the same discovery, what other similar discoveries might be out there to be found, what government actions should be taken, etc.

I know this sounds like an implausible scenario to most people currently, but I think it's not something ruled out as physically impossible, and is worth having a what-if plan for. 

Some evidence that I claim points toward the feasibility of a highly parallelized search for algorithmic improvements: https://arxiv.org/abs/2403.17844

This paper indicates that even very small scale experiments on novel architectures can give a good approximation of how well those novel architectures will perform at scale. This means that a large search which does many small scale tests could be expected to find promising leads.

Some ideas relating to comms/policy:

  • Communicate your models of AI risk to policymakers
    • Help policymakers understand emergency scenarios (especially misalignment scenarios) and how to prepare for them
    • Use your lobbying/policy teams primarily to raise awareness about AGI and help policymakers prepare for potential AGI-related global security risks.
  • Develop simple/clear frameworks that describe which dangerous capabilities you are tracking (I think OpenAI's preparedness framework is a good example, particularly RE simplicity/clarity/readability.)
  • Advocate for increased transparency into frontier AI development through measures like stronger reporting requirements, whistleblower mechanisms, embedded auditors/resident inspectors, etc.
  • Publicly discuss threat models (kudos to DeepMind)
  • Engage in public discussions/debates with people like Hinton, Bengio, Hendrycks, Kokotajlo, etc.
  • Encourage employees to engage in such discussions/debates, share their threat models, etc.
  • Make capability forecasts public (predictions for when models would have XYZ capabilities)
  • Communicate under what circumstances you think major government involvement would be necessary (e.g., nationalization, "CERN for AI" setups).

Nice. I mostly agree on current margins — the more I mistrust a lab, the more I like transparency.

I observe that unilateral transparency is unnecessary on the view-from-the-inside if you know you're a reliably responsible lab. And some forms of transparency are costly. So the more responsible a lab seems, the more sympathetic we should be to it saying "we thought really hard and decided more unilateral transparency wouldn't be optimific." (For some forms of transparency.)

I think I agree with this idea in principle, but I also feel like it misses some things in practice (or something). Some considerations:

  • I think my bar for "how much I trust a lab such that I'm OK with them not making transparency commitments" is fairly high. I don't think any existing lab meets that bar.
  • I feel like a lot of forms of helpful transparency are not that costly. EDIT: I think one of the most noteworthy costs is something like "maybe the government will end up regulating the sector if/when it understands how dangerous industry people expect AI systems to be and how many safety/security concerns they have". But I think things like "report dangerous stuff to the govt", "have a whistleblower mechanism", and even "make it clear that you're willing to have govt people come and ask about safety/security concerns" don't seem very costly from an immediate time/effort perspective.
  • If a Responsible Company implemented transparency stuff unilaterally, it would make it easier for the government to have proof-of-concept and implement the same requirements for other companies. In a lot of cases, showing that a concept works for company X (and that company X actually thinks it's a good thing) can reduce a lot of friction in getting things applied to companies Y and Z.

I do agree that some of this depends on the type of transparency commitment and there might be specific types of transparency commitments that don't make sense to pursue unilaterally. Off the top of my head, I can't think of any transparency requirements that I wouldn't want to see implemented unilaterally, and I can think of several that I would want to see (e.g., dangerous capability reports, capability forecasts, whistleblower mechanisms, sharing if-then plans with govt, sharing shutdown plans with govt, setting up interview program with govt, engaging publicly with threat models, having clear OpenAI-style tables that spell out which dangerous capabilities you're tracking/expecting).

@Zach Stein-Perlman @ryan_greenblatt feel free to ignore but I'd be curious for one of you to explain your disagree react. Feel free to articulate some of the ways in which you think I might be underestimating the costliness of transparency requirements. 

(My current estimate is that whistleblower mechanisms seem very easy to maintain, reporting requirements for natsec capabilities seem relatively easy insofar as most of the information is stuff you already planned to collect, and even many of the more involved transparency ideas (e.g., interview programs) seem like they could be implemented with pretty minimal time-cost.)

It depends on the type of transparency.

Requirements which just involve informing one part of the government (say US AISI) in ways which don't cost much personnel time mostly have the effect of potentially making the government much more aware at some point. I think this is probably good, but presumably labs would prefer to retain flexibility and making the government more aware can go wrong (from the lab's perspective) in ways other than safety focused regulation. (E.g., causing labs to be merged as part of a national program to advance capabilities more quickly.)

Whistleblower requirements with teeth can have information leakage concerns. (Internal-only whistleblowing policies like Anthropic's don't have this issue, but also have basically no teeth for the company overall.)

Any sort of public discussion (about e.g. threat models, government involvement, risks) can have various PR and reputational cost. (Idk if you were counting this under transparency.)

To the extent you expect to disagree with third party inspectors about safety (and they aren't totally beholden to you), this might end up causing issues for you later.

I'm not claiming that "reasonable" labs shouldn't do various types of transparency unilaterally, but I don't think the main cost is in making safety-focused regulation more likely.

TY. Some quick reactions below:

  • Agree that the public stuff has immediate effects that could be costly. (Hiding stuff from the public, refraining from discussing important concerns publicly, or developing a reputation for being kinda secretive/sus can also be costly; seems like an overall complex thing to model IMO.)
  • Sharing info with government could increase the chance of a leak, especially if security isn't great. I expect the most relevant info is info that wouldn't be all-that-costly if leaked (e.g., the government doesn't need OpenAI to share its secret sauce/algorithmic secrets). Dangerous capability eval results leaking or capability forecasts leaking seem less costly, except from a "maybe people will respond by demanding more govt oversight" POV.
  • I think all-in-all I still see the main cost as making safety regulation more likely, but I'm more uncertain now, and this doesn't seem like a particularly important/decision-relevant point. Will edit the OG comment to language that I endorse with more confidence.

I may edit this post in a week based on feedback. You can suggest changes that don't merit their own LW comments here. That said, I'm not sure what feedback I want.

A related/overlapping thing I should collect: basic low-hanging fruit for safety. E.g.:

  • Have a high-level capability threshold at which securing model weights is very important
  • Evaluate whether your models are capable at agentic tasks; look at the score before releasing or internally deploying them