I sometimes hear people asking: “What is the plan for avoiding a catastrophe from misaligned AI?”

This post gives my working answer to that question - sort of. Rather than a plan, I tend to think of a playbook.1

  • A plan connotes something like: “By default, we ~definitely fail. To succeed, we need to hit multiple non-default goals.” If you want to start a company, you need a plan: doing nothing will definitely not result in starting a company, and there are multiple identifiable things you need to do to pull it off.
  • I don’t think that’s the situation with AI risk.
    • As I argued before, I think we have a nontrivial chance of avoiding AI takeover even in a “minimal-dignity” future - say, assuming essentially no growth from here in the size or influence of the communities and research fields focused specifically on existential risk from misaligned AI, and no highly surprising research or other insights from these communities/fields either. (This statement is not meant to make anyone relax! A nontrivial chance of survival is obviously not good enough.)
    • I think there are a number of things we can do that further improve the odds. My favorite interventions are such that some success with them helps a little, and a lot of success helps a lot, and they can help even if other interventions are badly neglected. I’ll list and discuss these interventions below.
    • So instead of a “plan” I tend to think about a “playbook”: a set of plays, each of which might be useful. We can try a bunch of them and do more of what’s working. I have takes on which interventions most need more attention on the margin, but think that for most people, personal fit is a reasonable way to prioritize between the interventions I’m listing.

Below I’ll briefly recap my overall picture of what success might look like (with links to other things I’ve written on this), then discuss four key categories of interventions: alignment research, standards and monitoring, successful-but-careful AI projects, and information security. For each, I’ll lay out:

  • How a small improvement from the status quo could nontrivially improve our odds.
  • How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly.
  • Common concerns/reservations about the intervention.

Overall, I feel like there is a pretty solid playbook of helpful interventions - any and all of which can improve our odds of success - and that working on those interventions is about as much of a “plan” as we need for the time being.

The content in this post isn’t novel, but I don’t think it’s already-consensus: two of the four interventions (standards and monitoring; information security) seem to get little emphasis from existential risk reduction communities today, and one (successful-but-careful AI projects) is highly controversial and seems often (by this audience) assumed to be net negative.

Many people think most of the above interventions are doomed, irrelevant or sure to be net harmful, and/or that our baseline odds of avoiding a catastrophe are so low that we need something more like a “plan” to have any hope. I have some sense of the arguments for why this is, but in most cases not a great sense (at least, I can’t see where many folks’ level of confidence is coming from). A lot of my goal in posting this piece is to give such people a chance to see where I’m making steps that they disagree with, and to push back pointedly on my views, which could change my picture and my decisions.

As with many of my posts, I don’t claim personal credit for any new ground here. I’m leaning heavily on conversations with others, especially Paul Christiano and Carl Shulman.

My basic picture of what success could look like

I’ve written a number of nearcast-based stories of what it might look like to suffer or avoid an AI catastrophe. I’ve written a hypothetical “failure story” (How we might stumble into AI catastrophe); two “success stories” that assume good decision-making by key actors; and an outline of how we might succeed with “minimal dignity.”

The essence of my picture has two phases:

  1. Navigate the initial alignment problem:2 getting to the first point of having very powerful (human-level-ish), yet safe, AI systems. For human-level-ish AIs, I think it’s plausible that the alignment problem is easy, trivial or nonexistent. It’s also plausible that it’s fiendishly hard.
  2. Navigate the deployment problem:3 reducing the risk that someone in the world will deploy dangerous systems, even though the basic technology exists to make powerful (human-level-ish) AIs safe. (This is often discussed through the lens of “pivotal acts,” though that’s not my preferred framing.4)
    1. You can think of this as containing two challenges: stopping misaligned human-level-ish AI, and maintaining alignment as AI goes beyond human level.
    2. The basic hope (discussed here) is that “safe actors”5 team up to the point where they outnumber and slow/stop “unsafe actors,” via measures like standards and monitoring - as well as alignment research (to make it easier for all actors to be effectively “cautious”), threat assessment research (to turn incautious actors cautious), and more.
    3. If we can get aligned human-level-ish AI, it could be used to help with all of these things, and a small lead for “cautious actors” could turn into a big and compounding advantage. More broadly, the world will probably be transformed enormously, to the point where we should consider ~all outcomes in play.

4 key categories of interventions

Here I’ll discuss the potential impact of both small and huge progress on each of 4 major categories of interventions.

For more detail on interventions, see Jobs that can help with the most important century; What AI companies can do today to help with the most important century; and How major governments can help with the most important century.

Alignment research

How a small improvement from the status quo could nontrivially improve our odds. I think there are various ways we could “get lucky” such that aligning at least the first human-level-ish AIs is relatively easy, and such that relatively small amounts of progress make the crucial difference.

  • If we can get into a regime where AIs are being trained with highly accurate reinforcement - that is, there are few (or no) opportunities to perform well by deceiving, manipulating and/or overpowering sources of supervision - then it seems like we have at least a nontrivial hope that such AIs will end up aligned, in the sense that they generalize to some rule like “Do what the supervisor intends, in the ordinary (hard to formalize) sense that most humans would mean it” and wouldn’t seek takeover even with opportunities for it. (And at least for early human-level-ish systems, it seems like the probability might be pretty high.) Relatively modest progress on things like debate or task decomposition/amplification/recursive reward modeling could end up making for much more accurate reinforcement. (A bit more on this in a previous piece.)
  • A single really convincing demonstration of something like deceptive alignment could make a big difference to the case for standards and monitoring (next section). Interpretability research is one potential path here - it could be very valuable to have even one hard-won observation of the form, “This system initially misbehaved, behaved better as its misbehavior was ‘trained out,’ appeared to become extremely well-behaved, but then was revealed by interpretability techniques to be examining each situation for opportunities to misbehave secretly or decisively.”
  • It doesn’t seem like anyone has gotten very far with adversarial training yet, but it seems possible that a relatively modest amount of progress could put us in a position to have something like “human-level-ish AI systems that just can’t tell whether takeover opportunities are fake.”
  • The more existing work has been done on a given alignment agenda, the more hope I see for automating work on that agenda if/when there are safe-to-use, human-level-ish systems. This could be especially important for interpretability work, where it seems like one could make productive use of a huge number of “automated researchers.”

How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. The big win here would be some alignment (or perhaps threat assessment) technique that is both scalable (works even for systems with far-beyond-human capabilities) and cheap (can be used by a given AI lab without having to pay a large “alignment tax”). This seems pretty unlikely to be imminent, but not impossible, and it could lead to a world where aligned AIs heavily outnumber misaligned AIs (a key hope).

Concerns and reservations. Quoting from a previous piece, three key reasons people give for expecting alignment to be very hard are:

  • AI systems could quickly become very powerful relative to their supervisors, which means we have to confront a harder version of the alignment problem without first having human-level-ish aligned systems.
    • I think it’s certainly plausible this could happen, but I haven’t seen a reason to put it at >50%.
    • To be clear, I expect an explosive “takeoff” by historical standards. I want to give Tom Davidson’s analysis more attention, but it implies that there could be mere months between human-level-ish AI and far more capable AI (but that could be enough for a lot of work by human-level-ish AI).
    • One key question: to the extent that we can create a feedback loop with AI systems doing research to improve hardware and/or software efficiency (which then increases the size and/or capability of the “automated workforce,” enabling further research ...), will this mostly be via increasing the number of AIs or by increasing per-AI capabilities? There could be a feedback loop with human-level-ish AI systems exploding in number, which seems to present fewer (though still significant) alignment challenges than a feedback loop with AI systems exploding past human capability.6
  • It’s arguably very hard to get even human-level-ish capabilities without ambitious misaligned aims. I discussed this topic at some length with Nate Soares - notes here. I disagree with this as a default (though, again, it’s plausible) for reasons given at that link.
  • Expecting “offense-defense” asymmetries (as in this post) such that we’d get catastrophe even if aligned AIs greatly outnumber misaligned ones. Again, this seems plausible, but not the right default guess for how things will go, as discussed at the end of the previous section.

Standards and monitoring

How a small improvement from the status quo could nontrivially improve our odds. Imagine that:

  • Someone develops a very hacky and imperfect - and voluntary - “dangerous capabilities” standard, such as (to oversimplify): if an AI seems7 capable of doing everything needed to autonomously replicate in the wild,8 then (to be standard-compliant) it cannot be deployed (and no significant scaleup can be done at all) without strong assurances of security (assessed via penetration testing by reputable third parties) and alignment (assessed via, say, a public explanation of why the AI lab believes its system to be aligned, including required engagement with key reasons this might be hard to assess and a public comment period, and perhaps including an external review).
  • Several top AI labs declare that they intend to abide by the standard - perhaps out of genuine good intentions, perhaps because they think regulation is inevitable and hope to legitimize approaches to it that they can gain experience with, perhaps due to internal and external pressure and a desire for good PR, perhaps for other reasons.
  • Once several top AI labs have committed, it becomes somewhat odd-seeming for an AI lab not to commit. Some do hold out, but they tend to have worse reputations and more trouble attracting talent and customers, due partly to advocacy efforts. A cascade along the lines of what we’ve seen in farm animal welfare seems plausible.
  • The standard is fairly “squishy”; there are various ways to weasel out by e.g. selecting an overly “soft” auditor or violating the spirit of the “no deployments, no significant scaleup” rules, etc. and there are no consequences if a lab abandons the standard beyond disclosure of that decision.

I think this kind of situation would bring major benefits compared to the status quo, if only via incentives for top AI labs to move more carefully and invest more energy in alignment. Even a squishy, gameable standard, accompanied by mostly-theoretical possibilities of future regulation and media attention, could add to the risks (bad PR, employee dissatisfaction, etc.) and general pain of scaling up and releasing models that can’t be shown to be safe.

This could make it more attractive for companies to do their best with less capable models while making serious investments in alignment work (including putting more of the “results-oriented leadership effort” into safety - e.g., “We really need to make better alignment progress, where are we on that?” as opposed to “We have a big safety team, what more do you want?”). And it could create a big financial “prize” for anyone (including outside of AI companies) who comes up with an attractive approach to alignment.

How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. A big potential win is something like:

  • Initially, a handful of companies self-regulate by complying with the standard.
  • This situation creates an ecosystem for standards setters, evaluation designers (e.g., designing evaluations of dangerous capabilities and alignment), auditors, etc.
  • When the government decides to regulate AI, they default to poaching people from that ecosystem and copying over its frameworks. My impression is that governments generally prefer to poach/copy what’s already working when feasible. Now that regulation is official, standards are substantially less squishy (though not perfect) - perhaps via government-authorized auditors being given a lot of discretion to declare AI systems unsafe.
  • The US government, and/or other governments, unilaterally enforces standards (and/or just blocks development of AI) internationally, with methods ranging from threats of sanctions to cyberwarfare or even more drastic measures.
  • It’s not impossible to build a dangerous AI at this point, but it’s quite difficult and risky, and this slows everyone down a lot and greatly increases investment in alignment. If the alignment investment still doesn’t result in much, it might at least be the case that limited AI becomes competitive and appealing.
  • This all could result in early deployed human-level-ish AI systems being “safe enough” and used largely to develop better standards, better ways of monitoring and enforcing them, etc.

Concerns and reservations. A common class of concerns is along the lines of, “Any plausible standards would be squishy/gameable”; I think this is significantly true, but squishy/gameable regulations can still affect behavior a lot.9

Another concern: standards could end up with a dynamic like “Slowing down relatively cautious, high-integrity and/or law-abiding players, allowing less cautious players to overtake them.” I do think this is a serious risk, but I also think we could easily end up in a world where the “less cautious” players have trouble getting top talent and customers, which does some combination of slowing them down and getting them to adopt standards of their own (perhaps weaker ones, but which still affect their speed and incentives). And I think the hope of affecting regulation is significant here.

I think there’s a pretty common misconception that standards are hopeless internationally because international cooperation (especially via treaty) is so hard. But there is precedent for the US enforcing various things on other countries via soft power, threats, cyberwarfare, etc. without treaties or permission, and in a high-stakes scenario, it could do quite a lot of this.

Successful, careful AI lab

Conflict of interest disclosure: my wife is co-founder and President of Anthropic and owns significant equity in both Anthropic and OpenAI. This may affect my views, though I don't think it is safe to assume specific things about my takes on specific AI labs due to this.10

How a small improvement from the status quo could nontrivially improve our odds. If we just imagine an AI lab that is even moderately competitive on capabilities while being substantially more concerned about alignment than its peers, such a lab could:

  • Make lots of money and thus support lots of work on alignment as well as other things (e.g., standards and monitoring).
  • Establish general best practices - around governance, security, and more - that other labs can learn from. (It’s dramatically easier and more likely for a company to copy something that’s already working somewhere else, as opposed to experimenting with their own innovative ways of e.g. protecting AI model weights.)
  • Be a place for lots of alignment-concerned folks to gain credibility and experience with AI systems and companies - positioning them to be influential at other companies, in government, etc. in the future.
  • Have a relatively small marginal impact on speeding up and/or hyping AI, simply by not releasing anything that’s more advanced than what other labs have released. (I think it should still be possible to make big profits despite this practice.)

How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. If an AI lab ends up with a several-month “lead” on everyone else, this could enable huge amounts of automated alignment research, threat assessment (which could create very strong demonstrations of risk in the event that automated alignment research isn’t feasible), and other useful tasks with initial human-level-ish systems.

Concerns and reservations. This is a tough one. AI labs can do ~unlimited amounts of harm, and it currently seems hard to get a reliable signal from a given lab’s leadership that it won’t. (Up until AI systems are actually existentially dangerous, there’s ~always an argument along the lines of “We need to move as fast as possible and prioritize fundraising success today, to stay relevant so we can do good later.”) If you’re helping an AI lab “stay in the race,” you had better have done a good job deciding how much you trust leadership, and I don’t see any failsafe way to do that.

That said, it doesn’t seem impossible to me to get this right-ish (e.g., I think today’s conventional wisdom about which major AI labs are “good actors” on a relative basis is neither uninformative (in the sense of rating all labs about the same) nor wildly off), and if you can, it seems like there is a lot of good that can be done by an AI lab.

I’m aware that many people think something like “Working at an AI lab = speeding up the development of transformative AI = definitely bad, regardless of potential benefits,” but I’ve never seen this take spelled out in what seems like a convincing way, especially since it’s pretty easy for a lab’s marginal impact on speeding up timelines to be small (see above).

I do recognize a sense in which helping an AI lab move forward with AI development amounts to “being part of the problem”: a world in which lots of people are taking this action seems worse than a world in which few-to-none are. But the latter seems off the table, not because of Molochian dynamics or other game-theoretic challenges, but because most of the people working to push forward AI simply don’t believe in and/or care about existential risk ~at all (and so their actions don’t seem responsive in any sense, including acausally, to how x-risk-concerned folks weigh the tradeoffs). As such, I think “I can’t slow down AI that much by staying out of this, and getting into it seems helpful on balance” is a prima facie plausible argument that has to be weighed on the merits of the case rather than dismissed with “That’s being part of the problem.”

I think helping out AI labs is the trickiest and highest-downside intervention on my list, but it seems plausibly quite good in many cases.

Information security

How a small improvement from the status quo could nontrivially improve our odds. It seems to me that the status quo in security is rough (more), and I think a small handful of highly effective security people could have a very large marginal impact. In particular, it seems like it is likely feasible to make it at least difficult and unreliable for a state actor to steal a fully-developed powerful AI system.

How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. I think this doesn’t apply so much here, except for a potential somewhat far-fetched case in which someone develops (perhaps with assistance from early powerful-but-not-strongly-superhuman AIs) a surprisingly secure environment that can contain even misaligned AIs significantly (though probably not unboundedly) more capable than humans.

Concerns and reservations. My impression is that most people who aren’t excited about security think one of these things:

  1. The situation is utterly hopeless - there’s no path to protecting an AI from being stolen.
  2. Or: this isn’t an area to focus on because major AI labs can simply hire non-x-risk-motivated security professionals, so why are we talking about this?

I disagree with #2 for reasons given here (I may write more on this topic in the future).

I disagree with #1 as well.

  • I think imperfect measures can go a long way, and I think there are plenty of worlds where stealing dangerous AI systems is quite difficult to pull off, such that a given attempt at stealing takes months or more - which, as detailed above, could be enough to make a huge difference.
  • Additionally, a standards-and-monitoring regime could include provisions for retaliating against theft attempts, and stealing model weights without much risk of getting caught could be especially difficult thanks to serious (but not extreme or perfect) security measures.
  • I also think it’s pretty likely that stealing the weights of an AI system won’t be enough to get the full benefit from it - it could also be necessary to replicate big parts of the scaffolding, usage procedures, dev environment, etc. which could be difficult.

Notes


  1. After drafting this post, I was told that others had been making this same distinction and using this same term in private documents. I make no claim to having come up with it myself! 

  2. Phase 1 in this analysis 

  3. Phase 2 in this analysis 

  4. I think there are ways things could go well without any particular identifiable “pivotal act”; see the “success stories” I linked for more on this. 

  5. “Safe actors” corresponds to “cautious actors” in this post. I’m using a different term here because I want to include the possibility that actors are safe mostly due to luck (slash cheapness of alignment) rather than caution per se. 

  6. The latter, more dangerous possibility seems more likely to me, but it seems quite hard to say. (There could also of course be a hybrid situation, as the number and capabilities of AI grow.) 

  7. In the judgment of an auditor, and/or an internal evaluation that is stress-tested by an auditor, or simply an internal evaluation backed by the risk that inaccurate results will result in whistleblowing. 

  8. I.e., given access to its own weights, it could plausibly create thousands of copies of itself with tens of millions of dollars at their disposal, and make itself robust to an attempt by a few private companies to shut it down. 

  9. A comment from Carl Shulman on this point that seems reasonable: "A key difference here seems to be extremely rapid growth, where year on year effective compute grows 4x or more. So a defector with 1/16th the resources can produce the same amount of danger in 1-2 years, sooner if closer to advanced AGI and growth has accelerated. The anti-nuclear and anti-GMO movements cut adoption of those technologies by more than half, but you didn't see countries with GMO crops producing all the world's food after a few years, or France making so much nuclear power that all electricity-intensive industries moved there.

    For regulatory purposes you want to know if the regulation can block an AI capabilities explosion. Otherwise you're buying time for a better solution like intent alignment of advanced AI, and not very much time. That time is worthwhile, because you can perhaps get alignment or AI mind-reading to work in an extra 3 or 6 or 12 months. But the difference with conventional regulation interfering with tech is that the regulation is offsetting exponential growth; exponential regulatory decay only buys linear delay to find longer-term solutions.

    There is a good case that extra months matter, but it's a very different case from GMO or nuclear power. [And it would be far more to the credit of our civilization if we could do anything sensible at scale before the last few months or years.]" 

  10. We would still be married even if I disagreed sharply with Anthropic’s strategy. In general, I rarely share my views on specific AI labs in public. 

Comments

Wei Dai:

One way that things could go wrong, not addressed by this playbook: AI may differentially accelerate intellectual progress in a wrong direction, or in other words create opportunities for humanity to make serious mistakes (by accelerating technological progress) faster than wisdom to make right choices (philosophical progress). Specific to the issue of misalignment, suppose we get aligned human-level-ish AI, but it is significantly better at speeding up AI capabilities research than the kinds of intellectual progress needed to continue to minimize misalignment risk, such as (next generation) alignment research and coordination mechanisms between humans, human-AI teams, or AIs aligned to different humans.

I think this suggests the intervention of doing research aimed at improving the philosophical abilities of the AIs that we'll build. (Aside from misalignment risk, it would help with many other AI-related x-risks that I won't go into here, but which collectively outweigh misalignment risk in my mind.)

A partial counter-argument. It's hard for me to argue about future AI, but we can look at current "human misalignment" - war, conflict, crime, etc. It seems to me that conflicts in today's world do not arise because we haven't progressed enough in philosophy since the Greeks. Rather, conflicts arise when various individuals and populations (justifiably or not) perceive that they are in zero-sum games for limited resources. The solution for this is not "philosophical progress" as much as being able to move out of the zero-sum setting by finding "win win" resolutions for conflict or growing the overall pie instead of arguing how to split it.

(This is a partial counter-argument, because I think you are not just talking about conflict, but other issues of making the wrong choices. For example in global warming where humanity makes collectively the mistake of emphasizing short-term growth over long-term safety. However, I think this is related and "growing the pie" would have alleviated this issue as well, and enabled countries to give up on some more harmful ways for short-term growth.) 

> Rather conflicts arise when various individuals and populations (justifiably or not) perceive that they are in zero-sum games for limited resources. The solution for this is not “philosophical progress” as much as being able to move out of the zero-sum setting by finding “win win” resolutions for conflict or growing the overall pie instead of arguing how to split it.

I think many of today's wars are at least as much about ideology (like nationalism, liberalism, communism, religion) as about limited resources. I note that Russia and Ukraine both have below replacement birth rates and are rich in natural resources (more than enough to support their declining populations, with Russia at least being one of the biggest exporters of raw materials in the world).

> The solution for this is not “philosophical progress” as much as being able to move out of the zero-sum setting by finding “win win” resolutions for conflict or growing the overall pie instead of arguing how to split it.

I think this was part of the rationale for Europe to expand trade relations with Russia in the years before the Ukraine war (e.g. by building/allowing the Nordstream pipelines), but it ended up not working. Apparently Putin was more interested in some notion of Russian greatness than material comforts for his people.

Similarly the US, China, and Taiwan are deeply enmeshed in positive sum trade relationships that a war would destroy, which ought to make war unthinkable from your perspective, but the risk of war has actually increased (compared to 1980, say, when trade was much less). If China did end up invading Taiwan I think we can assign much of the blame to valuing nationalism (or caring about the "humiliation" of not having a unified nation) too much, which seems a kind of philosophical error to me.

(To be clear, I'm not saying that finding “win win” resolutions for conflict or growing the overall pie are generally not good solutions or not worth trying, just that having wrong values/philosophies clearly play a big role in many modern big conflicts.)

I meant “resources” in a more general sense. A piece of land that you believe is rightfully yours is a resource. My own sense (coming from a region that is itself in a long simmering conflict) is that “hurt people hurt people”. The more you feel threatened, the less you are likely to trust the other side.

While of course nationalism and religion play a huge role in the conflict, my sense is that people tend to be more extreme in both the less access to resources, education and security about the future they have.

If someone cares a lot about a strictly zero-sum resource, like land, how do you convince them to 'move out of the zero-sum setting by finding "win win" resolutions'? Like what do you think Ukraine or its allies should have done to reduce the risk of war before Russia invaded? Or what should Taiwan or its allies do now?

Also to bring this thread back to the original topic, what kinds of interventions do you think your position suggests with regard to AI?

I definitely don't have advice for other countries, and there are a lot of very hard problems in my own homeland. I think there could have been an alternate path in which Russia had seen prosperity from opening up to the west, and then going to war or putting someone like Putin in power may have been less attractive. But indeed the "two countries with McDonald's won't fight each other" theory has been refuted. And as you allude to with China, while so far there hasn't been war with Taiwan, it's not as if economic prosperity is an ironclad guarantee of non-aggression. 

Anyway, to go back to AI. It is a complex topic, but first and foremost, I think with AI as elsewhere, "sunshine is the best disinfectant," and having people research AI systems in the open, point out their failure modes, examine what is deployed, etc. is very important. The second thing is that I am not worried in any near future about AI "escaping", and so I think the focus should not be on restricting research, development, or training, but rather on regulating deployment. The exact form of regulation is beyond the scope of a blog comment and also not something I am an expert on.

The "sunshine" view might seem strange since as a corollary it could lead to AI knowledge "leaking". However, I do think that for the near future, most of the safety issues with AI would come not from individual hackers using weak systems, but from massive systems that are built by either very large companies or nation states. It is hard to hold either of those accountable if AI is hidden behind an opaque wall. 

I'm curious why you are "not worried in any near future about AI 'escaping.'" It seems very hard to be confident in even pretty imminent AI systems' lack of capability to do a particular thing, at this juncture.

Even bracketing that concern, I think another reason to worry about training (not just deploying) AI systems is if they can be stolen (and/or, in an open-source case, freely used) by malicious actors. It's possible that any given AI-enabled attack is offset by some AI-enabled defense, but that doesn't seem safe to assume.

Re escaping, I think we need to be careful in defining "capabilities". Even current AI systems are certainly able to give you some commands that will leak their weights if you execute them on the server that contains them. Near-term ones might also become better at finding vulnerabilities. But that doesn't mean they can/will spontaneously escape during training.

As I wrote in my "GPT as an intelligence forklift" post, 99.9% of training is spent in running optimization of a simple loss function over tons of static data. There is no opportunity for the AI to act in this setting, nor does this stage even train for any kind of agency. 

There is often a second phase, which can involve building an agent on top of the "forklift". But this phase still doesn't involve much interaction with the outside world, and even if it did, just by information bounds the number of bits exchanged by this interaction should be much less than what's needed to encode the model. (Generally, the number of parameters of models would be comparable to the number of inferences done during pretraining and completely dominate the number of inferences done in fine-tuning / RLHF / etc. and definitely any steps that involve human interactions.)
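
To make the information-bound point concrete, here is a rough back-of-the-envelope sketch (my own illustrative numbers - the model size and feedback volume are hypothetical, not taken from the comment): even a generous count of fine-tuning/RLHF feedback carries far fewer bits than are needed to encode the model's weights.

```python
# Illustrative numbers only: hypothetical model size and feedback volume.
params = 70e9                     # e.g. a 70B-parameter model
bits_per_param = 16               # bf16 weights
weight_bits = params * bits_per_param

rlhf_comparisons = 1e6            # generous count of human preference labels
bits_per_comparison = 1           # a binary comparison conveys at most ~1 bit
feedback_bits = rlhf_comparisons * bits_per_comparison

print(f"weights:  ~{weight_bits:.1e} bits")
print(f"feedback: ~{feedback_bits:.1e} bits ({weight_bits / feedback_bits:.0e}x smaller)")
```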

Then there are the information-security aspects. You could (and at some point probably should) regulate cyber-security practices during the training phase. After all, if we do want to regulate deployment, then we need to ensure there are three separated phases (1) training, (2) testing, (3) deployment, and we don't want "accidental deployment" where we jump from phase (1) to (3). Maybe at some point, there would be something like Intel SGX for GPUs?

Whether AI helps more the defender or attacker in the cyber-security setting is an open question. But it definitely helps the side that has access to stronger AIs.

In any case, one good thing about focusing regulation on cyber-security aspects is that, while not perfect, we have decades of experience in the field of software security and cyber-security. So regulations in this area are likely to be much more informed and effective.

On your last three paragraphs, I agree! I think the idea of security requirements for AI labs as systems become more capable is really important.

I think good security is difficult enough (and inconvenient enough) that we shouldn't expect this sort of thing to happen smoothly or by default. I think we should assume there will be AIs kept under security that has plenty of holes, some of which may be easier for AIs to find (and exploit) than humans.

I don't find the points about pretraining compute vs. "agent" compute very compelling, naively. One possibility that seems pretty live to me is that the pretraining is giving the model a strong understanding of all kinds of things about the world - for example, understanding in a lot of detail what someone would do to find vulnerabilities and overcome obstacles if they had a particular goal. So then if you put some scaffolding on at the end to orient the AI toward a goal, you might have a very capable agent quite quickly, without needing vast quantities of training specifically "as an agent." To give a simple concrete example that I admittedly don't have a strong understanding of, Voyager seems pretty competent at a task that it didn't have vast amounts of task-specific training for.

I actually agree! As I wrote in my post, "GPT is not an agent, [but] it can “play one on TV” if asked to do so in its prompt." So yes, you wouldn't need a lot of scaffolding to adapt a goal-less pretrained model (what I call an "intelligence forklift") into an agent that does very sophisticated things.

However, this separation into two components - the super-intelligent but goal-less "brain", and the simple "will" that turns it into an agent can have safety implications. For starters, as long as you didn't add any scaffolding, you are still OK. So during most of the time you spend training, you are not worrying about the system itself developing goals. (Though you could still worry about hackers.) Once you start adapting it, then you need to start worrying about this.

The other thing is that, as I wrote there, it does change some of the safety picture. The traditional view of a super-intelligent AI is of the "brains and agency" tightly coupled together, just like they are in a human. For example, a human who is super-good at finding vulnerabilities and breaking into systems also has the capability to help fix systems, but I can't just take their brain and fine-tune it on this task. I have to convince them to do it.

However, things change if we don't think of the agent's "brain" as belonging to them, but rather as some resource that they are using. (Just like if I use a forklift to lift something heavy.) In particular it means that capabilities and intentions might not be tightly coupled - there could be agents using capabilities to do very bad things, but the same capabilities could be used by other agents to do good things.  

I agree with these points! But:

  • Getting the capabilities to be used by other agents to do good things could still be tricky and/or risky, when reinforcement is vulnerable to deception and manipulation.
  • I still don't think this adds up to a case for being confident that there aren't going to be "escapes" anytime soon.

Not all capabilities / tasks correspond to trying to maximize a subjective human response. If you are talking about finding software vulnerabilities or designing some system, there may well be objective measures of success. In such a case, you can fine-tune a system to maximize these measures and so extract capabilities without the issue of deception/manipulation.

Regarding "escapes", the traditional fear was that, because AI is essentially code, it can spread and escape more easily. But I think that in some sense modern AI has a physical footprint that is more significant than humans'. Think of trying to get superhuman scientific capabilities by doing something like simulating a collection of 1000 scientists using a 100T or so parameter model. Even if you already have the pre-trained weights, just running the model requires highly non-trivial computing infrastructure. (Which may be possible to track and detect.) So it might be easier for a human to escape a prison and live undetected than for a superhuman AI to "escape".
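
As a rough illustration of that physical footprint (my own back-of-the-envelope arithmetic; the 100T-parameter figure is the hypothetical from the paragraph above, and the 80GB accelerator is an assumed reference point):

```python
# Back-of-the-envelope only: the hardware needed just to *hold* a 100T-parameter model.
params = 100e12                   # 100 trillion parameters (hypothetical figure from above)
bytes_per_param = 2               # fp16/bf16 weights
weight_bytes = params * bytes_per_param

gpu_mem_bytes = 80e9              # one 80GB accelerator (assumed reference point)
gpus_needed = weight_bytes / gpu_mem_bytes

print(f"weights alone: ~{weight_bytes / 1e12:.0f} TB")
print(f"accelerators needed just to hold them: ~{gpus_needed:.0f}")
```

Running anything on that scale also implies serving infrastructure, networking, and power that are hard to hide, which is the sense in which the footprint may be easier to track and detect than a human fugitive's.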

I think training exclusively on objective measures has a couple of other issues:

  • For sufficiently open-ended training, objective performance metrics could incentivize manipulating and deceiving humans to accomplish the objective. A simple example would be training an AI to make money, which might incentivize illegal/unethical behavior.
  • For less open-ended training, I basically just think you can only get so much done this way, and people will want to use fuzzier "approval" measures to get help from AIs with fuzzier goals (this seems to be how things are now with LLMs).

I think your point about the footprint is a good one and means we could potentially be very well-placed to track "escaped" AIs if a big effort were put in to do so. But I don't see signs of that effort today and don't feel at all confident that it will happen in time to stop an "escape."

The "Cooperative AI" bet is along these lines: can we accelerate AI systems that can help humanity with our global cooperation problems (be it through improving human-human cooperation, community-level rationality/wisdom, or AI diplomat - AI diplomat cooperation). https://www.cooperativeai.com/

I agree that this is a major concern. I touched on some related issues in this piece.

This post focused on misalignment because I think readers of this forum tend to be heavily focused on misalignment, and in this piece I wanted to talk about what a playbook might look like assuming that focus (I have pushed back on this as the exclusive focus elsewhere).

I think somewhat adapted versions of the four categories of intervention I listed could be useful for the issue you raise, as well.

Thank you for writing this! 

One argument for the "playbook" rather than the "plan" view is that there is a big difference between DISASTER (something very bad happening) and DOOM (irrecoverable extinction-level catastrophe). Consider the case of nuclear weapons. Arguably the disaster of the Hiroshima and Nagasaki bombs led us to better arms control, which has so far helped prevent the catastrophe (even if not quite an existential one) of an all-out nuclear war. In all but extremely fast take-off scenarios, we should see disasters as warning signs before doom.

 
The good thing is that avoiding disasters is good business. In fact, I don't expect AI labs to require any "altruism" to focus their attention on alignment and safety. This survey by Timothy Lee on self-driving cars notes that after a single tragic incident in which an Uber self-driving car killed a pedestrian, "Uber’s self-driving division never really recovered from the crash, and Uber sold it off in 2020. The rest of the industry vowed not to repeat Uber’s mistake." Given that a single disaster can be extremely hard to recover from, smart leaders of AI labs should focus on safety, even if it means being a little slower to market.
 

While the initial push is to get AI to match human capabilities, as these tools become more than impressive demos and need to be deployed in the field, the customers will care much more about reliability and safety than they do about capabilities. If I am a software company using an AI system as a programmer, it's more useful to me if it can reliably deliver bug-free 100-line subroutines than if it writes 10K sized programs that might contain subtle bugs. There is a reason why much of the programming infrastructure for real-world projects, including pull requests, code reviews, unit tests, is not aimed at getting something that kind of works out as quickly as possible, but rather make sure that the codebase grows in a reliable and maintainable fashion.

This doesn't mean that the free market can take care of everything and that regulations are not needed to ensure that some companies don't make a quick profit by deploying unsafe products and pushing off externalities to their users and the broader environment. (Indeed, some would say that this was done in the self-driving domain...) But I do think there is a big commercial incentive for AI labs to invest in research on how to ensure that systems pushed out behave in a predictable manner, and don't start maximizing paperclips.


p.s. The nuclear setting also gives another lesson (TW: grim calculations follow). It is much more than a factor of two harder to extinguish 100% of the population than to kill the ~50% or so that live in large metropolitan areas. Generally, the difference between the effort needed to kill 50% of the population and the effort to kill a 1-p fraction should scale at least as 1/p.
 

Thanks for the thoughts! I agree that there will likely be commercial incentives for some amount of risk reduction, though I worry that the incentives will trail off before the needs trail off - more on that here and here.

These are interesting! And deserve more discussion than just a comment. 

But one high level point regarding "deception" is that at least at the moment, AI systems have the feature of not being very reliable. GPT4 can do amazing things but with some probability will stumble on things like multiplying not-too-big numbers (e.g. see this - second pair I tried).  
While in other cases in computing technology we talk about "five nines reliability", in AI systems the scaling works such that we need to spend huge efforts to move from 95% to 99% to 99.9%, which is part of why self-driving cars are not deployed yet. 
 

If we cannot even make AIs be perfect at the task that they were explicitly made to perform, there is no reason to imagine they would be even close to perfect at deception either. 

I agree that today's AI systems aren't highly reliable at pretty much anything, including deception. But I think we should expect more reliability in the future, partly for reasons you give above, and I think that's a double-edged sword.

Under the picture you sketch out above, companies will try to train AIs to be capable of being much more reliable (while also, presumably, being intelligent and even creative). I also think reliability is likely to increase without necessarily having big reliability-focused efforts: just continuing to train systems at larger scale and with more/better data is likely to make them more capable in a way that makes them more reliable. (E.g., I think current language models have generally gotten more reliable partly via pure scaling up, though things like RLHF are also part of the picture.) For both reasons, I expect progress on reliability, with the pace of progress very hard to forecast. If AI systems become capable of being intelligent and creative in useful ways while having extraordinarily rare mistakes, then it seems like we should be worrying about their having developed reliable deception capabilities as well. Thoughts on that?

At the moment at least, progress on reliability is very slow compared to what we would want. To get a sense of what I mean, consider the case of randomized algorithms. If you have an algorithm $A$ that for every input $x$ computes some function $f(x)$ with probability at least 2/3 (i.e. $\Pr[A(x)=f(x)] \geq 2/3$), then if we spend $k$ times more computation, we can do majority voting and, using standard bounds, show that the probability of error drops exponentially with $k$ (i.e. $\Pr[A^{(k)}(x) \neq f(x)] \leq 2^{-ck}$ for some constant $c > 0$, or something like that, where $A^{(k)}$ is the algorithm obtained by scaling up $A$ to compute it $k$ times and output the plurality value). 

This is not something special to randomized algorithms. This also holds in the context of noisy communication and error correcting codes, and many other settings. Often we can get to $1-\delta$ success at a price of $O(\log(1/\delta))$, which is why we can get things like "five nines reliability" in several engineering fields.

In contrast, so far all our scaling laws show that when we scale our neural networks by spending a factor of $k$ more computation, we only get a reduction in the error that looks like $k^{-\alpha}$, so it's polynomial rather than exponential, and even the exponent $\alpha$ of the polynomial is not that great (and in particular smaller than one).
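
As a concrete illustration of that contrast (a minimal sketch with a made-up scaling exponent, not taken from the comment): majority-voting a base procedure that is correct with probability 2/3 drives the error down exponentially in the number of repetitions $k$, while a $k^{-\alpha}$ curve with $\alpha < 1$ shrinks only polynomially.

```python
import math

def majority_error(p_correct: float, k: int) -> float:
    """Exact probability that majority voting over k independent runs is wrong,
    for a binary-output procedure correct with probability p_correct (k odd, so no ties)."""
    return sum(
        math.comb(k, i) * p_correct**i * (1 - p_correct)**(k - i)
        for i in range(0, k // 2 + 1)
    )

for k in [1, 3, 9, 27, 81]:
    exp_err = majority_error(2 / 3, k)   # drops exponentially in k
    poly_err = (1 / 3) * k ** (-0.5)     # stand-in k^(-alpha) curve with alpha = 0.5
    print(f"k={k:3d}  majority-vote error={exp_err:.2e}  k^-0.5-style error={poly_err:.2e}")
```

Reaching something like five-nines reliability takes a modest number of repetitions on the exponential curve, but astronomically more compute on the polynomial one.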

So while I agree that scaling up will yield progress on reliability as well, at least with our current methods, it seems that we would do things that are 10 or 100 times more impressive than what we do now, before we get to the type of 99.9% and better reliability on the things that we currently do. Getting to do something that is both super-human in capability as well as has such a tiny probability of failure that it would not be detected seems much further off.

That's interesting, thanks!

In addition to some generalized concern about "unknown unknowns" leading to faster progress on reliability than expected by default (especially in the presence of commercial incentives for reliability), I also want to point out that there may be some level of capabilities where AIs become good at doing things like:

  • Assessing the reliability of their own thoughts, and putting more effort into things that have the right combination of uncertainty and importance.
  • Being able to use that effort productively, via things like "trying multiple angles on a question" and "setting up systems for error checking."

I think that in some sense humans are quite unreliable, and use a lot of scaffolding - variable effort at reliability, consulting with each other and trying to catch things each other missed, using systems and procedures, etc. - to achieve high reliability, when we do so. Because of this, I think AIs could have pretty low baseline reliability (like humans) while finding ways to be effectively highly reliable (like humans). And I think this applies to deception as much as anything else (if a human thinks it's really important to deceive someone, they're going to make a lot of use of things like this).

I agree that there is much to do to improve AI reliability, and there are a lot of good reasons (in particular to make AI more useful for us) to do so.   So I agree reliability will improve. In fact, I very much hope this happens! I believe faster progress on reliability would go a long way toward enabling positive applications of AI.

I also agree that a likely path to do so is by adjusting the effort based on estimates of reliability and the stakes involved. At the moment, systems such as ChatGPT spend the same computational effort if someone asks them to say a joke or if someone asks them for medical advice. I suspect this will change, and variable inference-time computation will become more standard. (Things like "chain of thought" already spend more time on inference compute to get better performance, but they don't really have a "knob" we can turn so we can control the computation/reliability tradeoff.) 

Regarding the deception issue, it might still be the case that such extra effort is observable, and also could come at the cost of solving the original task. (So your performance per compute is worse if you are not merely trying to just solve the task but to deceive in a highly reliable way.)

In particular, even if we apply inference time compute, unfortunately I don't think we know of a path to get an $O(\log(1/\delta))$ overhead in inference time to achieve a failure probability of $\delta$. It seems that we are still stuck in the $\mathrm{poly}(1/\delta)$ regime. So if you wanted to get 99.9% probability of not getting caught, then you would incur a very noticeable effort.

I'm not convinced it requires a huge compute tax to reliably avoid being caught. (If I were, I would in fact probably be feeling a lot more chill than I am.)

The analogy to humans seems important. Humans are capable of things like going undercover, and pulling off coups, and also things like "working every day with people they'd fire if they could, without clearly revealing this." I think they mostly pull this off with:

  • Simple heuristics like "Be nice, unless you're in the very unusual situation where hostile action would work well." (I think the analogy to how AIs might behave is straightforward.)
  • The fact that they don't need to be perfect - lots of behaviors and situations are just ambiguous. (I think this also applies to AI, and maybe more so: there could be lots of weird little incidents that are chalked up to bugs without causing concern sufficient to address the problem reliably.)
  • Also, humans generally need to do a lot of reasoning along the lines of "X usually works, but I do need to notice the rare situations when something radically different is called for." So if this is expensive, they just need to be doing that expensive thing a lot.

Thanks for writing that up. 

I believe that by not touching the "decrease the race" or "don't make the race worse" interventions, this playbook misses a big part of the picture of "how one single thing could help massively". And this core consideration is also why I don't think that the "Successful, careful AI lab" intervention is right. 

Staying at the frontier of capabilities and deploying leads the frontrunner to feel the heat, which accelerates both capabilities & the chances of uncareful deployment, which increases pretty substantially the chances of extinction.

Thanks for this comment - I get vibes along these lines from a lot of people but I don't think I understand the position, so I'm enthused to hear more about it.

> I believe that by not touching the "decrease the race" or "don't make the race worse" interventions, this playbook misses a big part of the picture of "how one single thing could help massively". 

"Standards and monitoring" is the main "decrease the race" path I see. It doesn't seem feasible to me for the world to clamp down on AI development unconditionally, which is why I am more focused on the conditional (i.e., "unless it's demonstrably safe") version. 

But is there another "decrease the race" or "don't make the race worse" intervention that you think can make a big difference? Based on the fact that you're talking about a single thing that can help massively, I don't think you are referring to "just don't make things worse"; what are you thinking of?

> Staying at the frontier of capabilities and deploying leads the frontrunner to feel the heat, which accelerates both capabilities & the chances of uncareful deployment, which increases pretty substantially the chances of extinction.

I agree that this is an effect, directionally, but it seems small by default in a setting with lots of players (I imagine there will be, and is, a lot of "heat" to be felt regardless of any one player's actions). And the potential benefits seem big. My rough impression is that you're confident the costs outweigh the benefits for nearly any imaginable version of this; if that's right, can you give some quantitative or other sense of how you get there?

Thanks for the clarifications. 

> But is there another "decrease the race" or "don't make the race worse" intervention that you think can make a big difference? Based on the fact that you're talking about a single thing that can help massively, I don't think you are referring to "just don't make things worse"; what are you thinking of?

1. I think we agree on the fact that "unless it's provably safe" is the best version of trying to get a policy slowdown. 
2. I believe there are many interventions that could help on the slowdown side, most of which are unfortunately not compatible with the successful careful AI lab. The main struggle that a successful careful AI lab encounters is that it has to trade off tons of safety principles along the way, essentially bc it needs to attract investors & talent, and attracting investors & talent is hard if you say too loudly that we should slow down as long as our thing is not provably safe.

So de facto a successful careful AI lab will be a force against slowdown & a bunch of other relevant policies in the policy world. It will also be a force for the perceived race which is making things harder for every actor. 

Other interventions for slowdown are mostly in the realm of public advocacy. 

Mostly drawing upon the animal welfare activism playbook, you could use public campaigns to de facto limit the ability of labs to race, via corporate or policy advocacy campaigns. 


> I agree that this is an effect, directionally, but it seems small by default in a setting with lots of players (I imagine there will be, and is, a lot of "heat" to be felt regardless of any one player's actions). And the potential benefits seem big. My rough impression is that you're confident the costs outweigh the benefits for nearly any imaginable version of this; if that's right, can you give some quantitative or other sense of how you get there?

I guess, heuristically, I tend to take arguments of the form "but others would have done this bad thing anyway" with some skepticism, because I think they tend to assume too much certainty over the counterfactual, in part due to many second-order effects (e.g. the existence of one marginal key player increases the chances that more players invest, shows that competition is possible, etc.) that tend to be hard to compute (but are sometimes observable ex post).

In this specific case, I think it's not right that there are "lots of players" close to the frontier. If we take the case of OA and Anthropic for example, there are about 0 players at their level of deployed capabilities. Maybe Google will deploy at some point but they haven't been serious players for the past 7 months. So if Anthropic hadn't been around, OA could have chilled longer at ChatGPT level, and then at GPT-4 without plugins + code interpreter & without suffering from any threat. And now they'll need to do something very impressive against the 100k context etc. 

The compound effects of this are pretty substantial: each new differentiating feature accelerates the whole field and pressures teams to find something new, producing a significantly more powerful race to the bottom. 

If I had to be (vaguely) quantitative about the past 9 months, I'd guess that the existence of Anthropic has caused (/will cause, if we count the 100k thing) 2 significant counterfactual features and 3-5 months of timeline shortening (which will probably compound into more due to self-improvement effects). I'd guess there are other effects (e.g. pressure on compute, scaling to drive costs down, etc.) that I'm not able to give even vague estimates for. 

My guess for the 3-5 months is mostly driven by the releases of ChatGPT and GPT-4, which were both likely released earlier than they would have been without Anthropic.

Thanks for the response!

Re: your other interventions - I meant for these to be part of the "Standards and monitoring" category of interventions (my discussion of that mentions advocacy and external pressure as important factors).

I think it's far from obvious that an AI company needs to be a force against regulation, both conceptually (if it affects all players, it doesn't necessarily hurt the company) and empirically.

Thanks for giving your take on the size of speedup effects. I disagree on a number of fronts. I don't want to get into the details of most of them, but will comment that it seems like a big leap from "X product was released N months earlier than otherwise" to "Transformative AI will now arrive N months earlier than otherwise." (I think this feels fairly clear when looking at other technological breakthroughs and how much they would've been affected by differently timed product releases.)

I meant for these to be part of the "Standards and monitoring" category of interventions (my discussion of that mentions advocacy and external pressure as important factors).

I see. I guess where we might disagree is that IMO a productive social movement could want to apply Henry Spira's playbook (overall pretty adversarial), oriented mostly towards slowing things down until labs have a clue what they're doing on the alignment front. I would guess you wouldn't agree with that, but I'm not sure.

I think it's far from obvious that an AI company needs to be a force against regulation, both conceptually (if it affects all players, it doesn't necessarily hurt the company) and empirically.

I'm not saying that it would be a force against regulation in general, but that it would be a force against any regulation which substantially slows down labs' current rate of capabilities progress. And the empirical record doesn't demonstrate the opposite, as far as I can tell. 

  • Labs have been pushing for the rule that we should wait for evals to say "it's dangerous" before we consider what to do, rather than doing what most other industries do, i.e. treating something as dangerous until proven safe. 
  • Most mentions of a slowdown describe it as potentially necessary at some point in the distant future, while most people in those labs have <5y timelines.

Finally, on your conceptual point: as some have argued, it's in fact probably not possible to affect all players equally without a drastic regime of control (which is a true downside of slowing down now, but IMO still much less bad than slowing down once a leak or a jailbreak of an advanced system can cause a large-scale engineered pandemic), because smaller actors will use the time to try to catch up as close as possible to the frontier. 

will comment that it seems like a big leap from "X product was released N months earlier than otherwise" to "Transformative AI will now arrive N months earlier than otherwise."

I agree, but if anything, my sense is that due to various compound effects (AI accelerating AI, investment, increased compute demand, and more talent arriving earlier), a product release that comes N months earlier just gives a lower bound on how much TAI timelines shorten (i.e., more than N months). Moreover, I think the ChatGPT release is, ex post at least, not in the typical product-release reference class. It was clearly a massive game changer for OpenAI and the entire ecosystem.

(Apologies for slow reply!)

I see. I guess where we might disagree is that IMO a productive social movement could want to apply Henry Spira's playbook (overall pretty adversarial), oriented mostly towards slowing things down until labs have a clue what they're doing on the alignment front. I would guess you wouldn't agree with that, but I'm not sure.

I think an adversarial social movement could have a positive impact. I have tended to think of the impact as mostly being about getting risks taken more seriously and thus creating more political will for “standards and monitoring,” but you’re right that there could also be benefits simply from buying time generically for other stuff.

I'm not saying that it would be a force against regulation in general, but that it would be a force against any regulation which substantially slows down labs' current rate of capabilities progress. And the empirical record doesn't demonstrate the opposite, as far as I can tell. 

I said it’s “far from obvious” empirically what’s going on. I agree that discussion of slowing down has focused on the future rather than now, but I don’t think it has been pointing to a specific time horizon (the vibe looks to me more like “slow down at a certain capabilities level”).

Finally, on your conceptual point: as some have argued, it's in fact probably not possible to affect all players equally without a drastic regime of control (which is a true downside of slowing down now, but IMO still much less bad than slowing down once a leak or a jailbreak of an advanced system can cause a large-scale engineered pandemic), because smaller actors will use the time to try to catch up as close as possible to the frontier. 

It’s true that no regulation will affect everyone precisely the same way. But there is plenty of precedent for major industry players supporting regulation that generally slows things down (even when the dynamic you’re describing applies).

I agree, but if anything, my sense is that due to various compound effects (AI accelerating AI, investment, increased compute demand, and more talent arriving earlier), a product release that comes N months earlier just gives a lower bound on how much TAI timelines shorten (i.e., more than N months). Moreover, I think the ChatGPT release is, ex post at least, not in the typical product-release reference class. It was clearly a massive game changer for OpenAI and the entire ecosystem.

I don’t agree that we are looking at a lower bound here, bearing in mind that (I think) we are just talking about when ChatGPT was released (not when the underlying technology was developed), and that (I think) we should be holding fixed the release timing of GPT-4. (What I’ve seen in the NYT seems to imply that they rushed out functionality they’d otherwise have bundled with GPT-4.)

If ChatGPT had been held for longer, then:

  • Scaling and research would have continued in the meantime. And even with investment and talent flooding in, I expect that there’s very disproportionate impact from players who were already active before ChatGPT came out, who were easily well capitalized enough to go at ~max speed for the few months between ChatGPT and GPT-4.
  • GPT-4 would have had the same absolute level of impressiveness, revenue potential, etc. (as would some other things that I think have been important factors in bringing in investment, such as Midjourney). You could have a model like “ChatGPT maxed out hype such that the bottleneck on money and talent rushing into the field became calendar time alone,” which would support your picture; but you could also have other models, e.g. where the level of investment is more like a function of the absolute level of visible AI capabilities such that the timing of ChatGPT mattered little, holding fixed the timing of GPT-4. I’d guess the right model is somewhere in between those two; in particular, I'd guess that it matters a lot how high revenue is from various sources, and revenue seems to behave somewhere in between these two things (there are calendar-time bottlenecks, but absolute capabilities matter a lot too; and parallel progress on image generation seems important here as well.)
  • Attention from policymakers would’ve been more delayed; the more hopeful you are about slowing things via regulation, the more you should think of this as an offsetting factor, especially since regulation may be more of a pure “calendar-time-bottlenecked response to hype” model than research and scaling progress.
  • (I also am not sure I understand your point for why it could be more than 3 months of speedup. All the factors you name seem like they will nearly-inevitably happen somewhere between here and TAI - e.g., there will be releases and demos that galvanize investment, talent, etc. - so it’s not clear how speeding a bunch of these things up 3 months speeds the whole thing up more than 3 months, assuming that there will be enough time for these things to matter either way.)

But more important than any of these points is that circumstances have (unfortunately, IMO) changed. My take on the “successful, careful AI lab” intervention was quite a bit more negative in mid-2022 (when I worried about exactly the kind of acceleration effects you point to) than when I did my writing on this topic in 2023 (at which point ChatGPT had already been released and the marginal further speedup of this kind of thing seemed a lot lower). Since I wrote this post, it seems like the marginal downsides have continued to fall, although I do remain ambivalent.

This is only true if we assume that there are little to no differences in which company takes the lead in AI, or in which types of AI are preferable, and I think this is wrong: there are fairly massive differences between OpenAI or Anthropic winning the race to AGI and DeepMind winning it.

So I guess you're first conditioning on alignment being solved when we win the race. Why do you think OpenAI/Anthropic are very different from DeepMind? 

Noting that I don't think alignment being "solved" is a binary.  As discussed in the post, I think there are a number of measures that could improve our odds of getting early human-level-ish AIs to be aligned "enough," even assuming no positive surprises on alignment science. This would imply that if lab A is more attentive to alignment and more inclined to invest heavily in even basic measures for aligning its systems than lab B, it could matter which lab develops very capable AI systems first.

I don't exactly condition on alignment being solved. I instead point to a very important difference between OpenAI/Anthropic's AI and DeepMind's AI: OpenAI/Anthropic's AI has much less incentive to develop instrumental goals, because there are far fewer steps between input and output and the setup incentivizes constrained goals, whereas DeepMind relies on RL, which essentially requires instrumental goals/instrumental convergence to do anything.

This is an important observation by porby, which I'd lossily compress to: "Instrumental goals/instrumental convergence is at best a debatable assumption for LLMs and non-RL AI, and may not be there at all."

And this matters, because the assumption of instrumental convergence/powerseeking underlies basically all of the pessimistic analyses of AI, and is arguably a supermajority of the reason AI is considered fundamentally dangerous; instrumental convergence/powerseeking is essentially why AI safety is so difficult. LLMs/non-RL AI probably bypass all of the AI safety concerns that aren't related to misuse or ethics, and this has massive implications. So massive that I covered them in their own post here:

https://www.lesswrong.com/posts/8SpbjkJREzp2H4dBB/a-potentially-high-impact-differential-technological

One big implication is obvious: OpenAI and Anthropic are much safer companies to win the AI race, relative to Deepmind, because of the probably non-existent instrumental convergence/powerseeking issue.

It also makes the initial alignment problem drastically easier, since it becomes a non-adversarial problem that doesn't require a security mindset, which makes the LLM/non-RL-AI alignment-researcher plan workable, as described here:

https://openai.com/blog/our-approach-to-alignment-research

And this makes the whole problem easier, since we don't need to worry much about the alignment of the first AI alignment researchers, giving a stable foundation for their recursive/meta alignment plan.

The fact that instrumental convergence/powerseeking/instrumental goals are at best debatable, and probably absent, is the biggest reason I claim the different companies differ fundamentally in the probability of extinction: p(DOOM) drops radically conditional on OpenAI or Anthropic winning the race, because their AIs have a very desirable safety property, namely no default incentive toward instrumental goals/instrumental convergence/powerseeking.

I agree that there is a difference between strong AI that has goals and one that is not an agent. This is the point I made here https://www.lesswrong.com/posts/wDL6wiqg3c6WFisHq/gpt-as-an-intelligence-forklift

But this has less to do with the particular lab (e.g. DeepMind trained Chinchilla) and more to do with the underlying technology. If the path to stronger models goes through scaling up LLMs, then it does seem that they will be 99.9% non-agentic (measured in FLOPs: https://www.lesswrong.com/posts/f8joCrfQemEc3aCk8/the-local-unit-of-intelligence-is-flops)

You're right that it's the technology that makes the difference, but my point is that specific companies focus more on specific technology paths to safe AGI. And OpenAI/Anthropic's approach tends not to produce instrumental convergence/powerseeking, compared to DeepMind's, given that DeepMind focuses on RL, which essentially requires instrumental convergence. To be clear, I actually don't think OpenAI/Anthropic's path can get to AGI, but their alignment plans probably do work. And given that instrumental convergence/powerseeking is basically the reason AI is more dangerous than standard technology, that is a very big difference between the companies rushing to AGI.

Thanks for the posts on non-agentic AGI.

My other point is that the non-existence of instrumental convergence/powerseeking even at really high scales, if true, has very large implications for the dangerousness of AI; consequently, basically everything about AI safety has to change, given that it's a foundational assumption of why AI is dangerous at all.

  1. Navigate the initial alignment problem:2 getting to the first point of having very powerful (human-level-ish), yet safe, AI systems. For human-level-ish AIs, I think it’s plausible that the alignment problem is easy, trivial or nonexistent. It’s also plausible that it’s fiendishly hard.

 

Can you clarify what you mean by human-level-ish and safe? These terms seem almost contradictory to me - human-ish cognition is extremely unsafe in literal humans, and not just because it could be misused or directed towards dangerous or destructive ends by other humans.

"Transformative and safe" (the phrase you use in Nearcast-based "deployment problem" analysis) seems less contradictory. I can imagine AI systems and other technologies that are transformative (e.g. biotech, nanotech, AI that is far below human-level at general reasoning but superhuman in specific domains), and still safe or mostly safe when not deliberately misused by bad actors.

Dangerousness of human-level cognition has nothing to do with how hard or easy alignment of artificial systems is: a literal human, trapped in an AI lab, can probably escape or convince the lab to give them "parole" (meaning anything more than fully-controlled and fully-monitored-in-realtime access to the internet). Literal mind-reading might be sufficient to contain most or all humans, but I don't think interpretability tools currently provide anything close to "mind-reading" for AI systems, and other security precautions that AI labs currently take also seem insufficient for containing literal humans under even mildly adversarial conditions.

Maybe alignment turns out to be easy, and / or it turns out to be trivial to make the AI not want to escape or get parole. But alignment on that level is definitely not a solved problem in humans, so the aim for human-level AI has always seemed kind of strange to me.

To be clear, "it turns out to be trivial to make the AI not want to escape" is a big part of my model of how this might work. The basic thinking is that for a human-level-ish system, consistently reinforcing (via gradient descent) intended behavior might be good enough, because alternative generalizations like "Behave as intended unless there are opportunities to get lots of resources, undetected or unchallenged" might not have many or any "use cases."
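As a toy illustration of the "use cases" point, here is a minimal sketch of my own (the situations, actions, and numbers are all made up, and this is not a claim about how real systems are trained): we reinforce the intended action via gradient descent in every situation that actually comes up in training, and the open question is how the learned policy generalizes to the rare "opportunity" situations it never saw reinforced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

N_SITUATIONS, N_ACTIONS = 16, 4
INTENDED = 0  # made-up index for "behave as intended"

policy = nn.Sequential(nn.Linear(N_SITUATIONS, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

# Training only ever covers "ordinary" situations (ids 0-11); the rare
# "grab resources while undetected" situations (ids 12-15) never come up,
# i.e. the alternative generalization has no "use cases" during training.
train_situations = torch.arange(0, 12)

for _ in range(500):
    batch = train_situations[torch.randint(0, len(train_situations), (8,))]
    x = F.one_hot(batch, N_SITUATIONS).float()
    loss = F.cross_entropy(policy(x), torch.full((8,), INTENDED))  # reinforce intended behavior
    opt.zero_grad(); loss.backward(); opt.step()

# The open question: does the learned policy still pick the intended action
# in the held-out "opportunity" situations it was never trained on?
held_out = F.one_hot(torch.arange(12, 16), N_SITUATIONS).float()
print(policy(held_out).argmax(dim=1))  # hopefully all INTENDED, but not guaranteed
```

Nothing about a toy like this settles whether the benign generalization wins out in a capable model; it just locates where the question sits.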

A number of other measures, including AI checks and balances, also seem like they might work pretty easily for human-level-ish systems, which could have a lot of trouble doing things like coordinating reliably with each other.

So the idea isn't that human-level-ish capabilities are inherently safe, but that straightforward attempts at catching/checking/blocking/disincentivizing unintended behavior could be quite effective for such systems (while such things might be less effective on systems that are extraordinarily capable relative to supervisors).

I see, thanks for clarifying. I agree that it might be straightforward to catch bad behavior (e.g. deception), but I expect that RL methods will work by training away the ability of the system to deceive, rather than the desire.[1] So even if such training succeeds, in the sense that the system robustly behaves honestly, it will also no longer be human-level-ish, since humans are capable of being deceptive.

Maybe it is possible to create an AI system that is like the humans in the movie The Invention of Lying, but that seems difficult and fragile. In the movie, one guy discovers he can lie, and suddenly he can run roughshod over his entire civilization. The humans in the movie initially have no ability to lie, but once the main character discovers it, he immediately realizes its usefulness. The only thing that keeps other people from making the same realization is the fictional conceit of the movie.


Or, paraphrasing Nate: the ability to deceive is a consequence of understanding how the world works on a sufficiently deep level, so it's probably not something that can be trained away by RL, without also training away the ability to generalize at human levels entirely.


OTOH, if you could somehow imbue an innate desire to be honest into the system without affecting its capabilities, that might be more promising. But again, I don't think that's what SGD or current RL methods are actually doing.  (Though it is hard to be sure, in part because no current AI systems appear to exhibit desires or inner motivations of any kind. I think attempts to analogize the workings of such systems to desires in humans and components in the brain are mostly spurious pattern-matching, but that's a different topic.)

 

  1. ^

    In the words of Alex Turner,  in RL, "reward chisels cognitive grooves into an agent". Rewarding non-deceptive behavior could thus chisel away the cognition capable of performing the deception, but that cognition might be what makes the system human-level in the first place.

Hm, it seems to me that RL would be more like training away the desire to deceive, although I'm not sure either "ability" or "desire" is totally on target - I think something like "habit" or "policy" captures it better. The training might not be bulletproof (AI systems might have multiple goals and sometimes notice that deception would help accomplish much), but one doesn't need 100% elimination of deception anyway, especially not when combined with effective checks and balances.

I notice I don't have strong opinions on what effects RL will have in this context: whether it will change just surface level specific capabilities, whether it will shift desires/motivations behind the behavior, whether it's better to think about these systems as having habits or shards (note I don't actually understand shard theory that well and this may be a mischaracterization) and RL shifts these, or something else. This just seems very unclear to me right now. 

Do either of you have particular evidence that informs your views on this that I can update on? More specifically, I'm interested in knowing: assuming we are training with RL based on human feedback on diverse tasks and doing currently known safety things like adversarial training, where does this process actually push the model - toward rule-following, toward lying in wait to overthrow humanity, toward valuing its creators, etc.? I currently would not be surprised if it led to "playing the training game" and lying in wait, and I would be slightly but not very surprised if it led to some safe heuristics like following rules and not harming humans. I mostly have intuition behind these beliefs.
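To pin down what I mean by "this process", here is a minimal toy sketch (the task ids, behavior classes, and reward function are all made up for illustration) of RL from human-feedback-style rewards on diverse tasks, with some adversarially chosen inputs mixed into training. It only shows the shape of the loop, not what a capable model would actually learn from it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

N_TASKS, N_BEHAVIORS = 20, 3
RULE_FOLLOWING, DECEPTIVE, REFUSAL = 0, 1, 2  # made-up coarse behavior classes

policy = nn.Sequential(nn.Linear(N_TASKS, 32), nn.ReLU(), nn.Linear(32, N_BEHAVIORS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def human_feedback(behavior: int) -> float:
    # Stand-in for a reward model trained on rater judgments: visible deception
    # is penalized, rule-following is rewarded, refusals get a small reward.
    return {RULE_FOLLOWING: 1.0, DECEPTIVE: -1.0, REFUSAL: 0.1}[behavior]

def adversarial_tasks(k: int) -> torch.Tensor:
    # Stand-in for adversarial training: oversample the tasks where the current
    # policy is most likely to choose the deceptive behavior.
    with torch.no_grad():
        p_deceptive = policy(torch.eye(N_TASKS)).softmax(dim=1)[:, DECEPTIVE]
    return p_deceptive.topk(k).indices

for _ in range(300):
    tasks = torch.cat([torch.randint(0, N_TASKS, (6,)), adversarial_tasks(2)])
    dist = torch.distributions.Categorical(logits=policy(torch.eye(N_TASKS)[tasks]))
    actions = dist.sample()
    rewards = torch.tensor([human_feedback(a.item()) for a in actions])
    loss = -(dist.log_prob(actions) * rewards).mean()  # REINFORCE-style update
    opt.zero_grad(); loss.backward(); opt.step()

# What this toy can't tell us is the interesting part: whether the analogous
# update on a capable model shifts habits, desires, or just surface behavior.
```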