Not all capabilities will be created equal: focus on strategically superhuman agents

benwr

When, exactly, should we consider humanity to have properly "lost the game", with respect to agentic AI systems?

The most common AI milestone concepts seem to be "artificial general intelligence", followed closely by "superintelligence". Sometimes people talk about "transformative AI", "high-level machine intelligence", or "full automation of the labor force." None of these are well-suited for pointing specifically at the capabilities that would spell a "point of no return" for humanity. In fact, they're all designed to be agnostic to exactly which capabilities will matter.

When working to predict and mitigate existential risks from AI agents, we should try to be as clear as possible about which capabilities we're concerned about. As a result, I think we should focus on "strategically superhuman AI agents": AI agents that are better than the best groups of humans at real-world strategic action.

Skill at real-world strategic action is context-dependent, and isn't a single capability any more than "intelligence" is a single capability: It refers to any of a broad space of situated skills. Among humans, these skills tend to be those possessed by world-class CEOs, military officers, and statesmen.

In the current strategic environment, real-world strategic capacity typically encompasses at least:

Accurately modeling and predicting the world in some broad domain, but especially modeling and predicting individual humans and groups of humans.
Social skills, including persuasion, manipulation, delegation, and coalition building.
Robust planning and resource acquisition on the scale of years, and the ability to adjust plans fluidly as situations change.

I claim that we will face existential risks from AI no sooner than the development of strategically human-level artificial agents, and that those risks are likely to follow soon after.

If we are going to build these agents without "losing the game", either (a) they must have goals that are compatible with human interests, or (b) we must (increasingly accurately) model and enforce limitations on their capabilities. If there's a day when an AI agent is created without either of these conditions, that's the day I'd consider humanity to have lost. We might not be immediately wiped out by a nanobot swarm, but from that time forward humans will be more like pawns than players, and when our replacement actuators have been built, we'll likely be left without the resources we need to survive.

Low-effort FAQ

What's the point here? Does anything interesting follow from this?

Here are some things that I think are interesting:

We don't actually need to build AGI proper, for most definitions of AGI, to manifest existential risks. It doesn't matter if your AI system is subhuman at physically solving Rubik's Cubes: it can pay or persuade human Rubik's Cube solvers to solve any cubes it needs solved.
Capabilities and controls are relevant to existential risks from agentic AI insofar as they provide or limit situated strategic power. Control schemes will require correctly identifying and limiting all sets of capabilities that would be sufficient for "escape".
I think this kind of capability could arise in many settings, since I think these are very broadly valuable capabilities for an agent to have. But I'm especially afraid of efforts to build CEO-bots, general-bots, and president-bots, since I think these are where this kind of capability is most obviously necessary in a way that rivals the most competitive real-world strategic capacities of humans.

Isn't this just as vague as other milestones?

Yes; I'm interested in trying to make it crisper. I do think it gets closer to the heart of the problem than "AGI" or "superintelligence", and that seems like an important step.

Won't this happen as soon as we get [AGI, recursive self-improvement, ...]?

Maybe, depending on details that aren’t obvious to me.

Sure, a system that’s better-than-the-best-human in all domains is by definition better-than-the-best-human in real-world strategy. But I don’t think people have a consistent definition of AGI, and a system that’s better-than-the-best-human in all domains will also have a bunch of irrelevant capabilities that might actually be harder for AI systems to achieve than strategic capabilities.

At least in principle, you could have recursive self-improvement that wasn’t able to, or wasn’t aiming to, achieve superhuman strategic capabilities. E.g. an extremely fast AI R&D iteration loop would have to do almost all of its learning about humans “off-policy” (i.e., without getting to interact with real-time humans during training), and (while I don’t think this is plausible) it seems possible that you can’t reach superhuman strategic ability this way within realistic resource constraints.

Are you just trying to say "powerful AI"? That's too obvious to even mention.

I disagree: people do not seem to be orienting to this type of threshold, which seems far more important than the thresholds they are orienting to.

Acknowledgements

Thanks to Gretta Duleba, Alex Vermeer, Joe Rogero, David Abecassis, and Mitchell Howe for looking over a draft of this post.

If we are going to build these agents without "losing the game", either (a) they must have goals that are compatible with human interests, or (b) we must (increasingly accurately) model and enforce limitations on their capabilities. If there's a day when an AI agent is created without either of these conditions, that's the day I'd consider humanity to have lost.

Something seems funny to me here.

It might be to do with the boundaries of your definition. If human agents are getting empowered by strategically superhuman (in an everyday sense) AI systems (agentic or otherwise), perhaps that raises the bar for what counts as superhuman for the purposes of this post? If so I think the argument would make sense to me, but it feels a bit funny to me to have this definition which is such a moving goalpost, and which also might never get crossed even as AI gets arbitrarily powerful.

Alternatively, it might be that your definition is kind of an everyday one, but in that case your conclusion seems pretty surprising. Like it seems easy to me to imagine worlds where there are some agents without either of those conditions, but that they're not better than the empowered humans.

Or perhaps something else is going on. Just trying to voice my confusions.

I do appreciate the attempt to analyse which kinds of capabilities are actually crucial.

When I'm thinking about this, it seems kind of fine if the goalposts move - human strategic capacity will certainly move over time no matter what, right? Like, someone invented crowdfunding and suddenly we could do types of coordination that we previously couldn't do.

It seems fine to me to have the goalposts moving, but then I think it's important to trace through the implications of that.

Like, if the goalposts can move then this seems like perhaps the most obvious way out of the predicament; to keep the goalposts ever ahead of AI capabilities. But when I read your post I get the vibe that you're not imagining this as a possibility?

I think it seems like a fine possibility in principle, actually; sorry to have given the wrong impression! It's not my central hope, since strategy-stealing seems like it should make many human-augmentations "available" to AI systems as well. This is notably not true for things involving, e.g., BCIs or superbabies.

I claim that we will face existential risks from AI no sooner than the development of strategically human-level artificial agents, and that those risks are likely to follow soon after.
If we are going to build these agents without "losing the game", either (a) they must have goals that are compatible with human interests, or (b) we must (increasingly accurately) model and enforce limitations on their capabilities. If there's a day when an AI agent is created without either of these conditions, that's the day I'd consider humanity to have lost.

I'm not sure if I'm being pedantic here, but this doesn't strike me as very significant by itself.

Say I make a small agent on my laptop that fails at (a) and (b). I don't give it a huge amount of money to do things with, and it fails to do much with that money.

I assume humanity hasn't lost yet.

Maybe you're thinking that in (b), "enforce limitations" could mean "limit their money / power". But I assume basically all systems should have limited money/power.

My guess is that "strategic reasoning" agents would only have a limited advantage over humans in the beginning, especially because the humans would be using a bunch of other AI capabilities.

I feel like there's some assumption here that once we have AI with good strategy, it would quickly dominate all human efforts, or something like that - but I'd find this very suspicious.

Btw, I posted my related post here:
https://www.lesswrong.com/posts/byrxvgc4P2HQJ8zxP/6-potential-misconceptions-about-ai-intellectuals?commentId=dpEZ3iohCXChZAWHF#dpEZ3iohCXChZAWHF

It didn't seem to do very well on LessWrong, and I'm kind of curious why. (I realize the writing is a bit awkward, but I broadly stand by it.)

I think you may have missed, or at least not taken literally, at least one of these things in the post:

The expansion of "superhuman strategic agent" is not "agent that's better than humans at strategic reasoning", it's "agent that is better than the best groups of humans at taking (situated) strategic action"
Strategic action is explicitly context-dependent, e.g. an AI system that's inside a mathematically perfect simulated world that can have no effect on the rest of the physical world and vice versa, has zero strategic power in this sense. Also e.g. in the FAQ, "Capabilities and controls are relevant to existential risks from agentic AI insofar as they provide or limit situated strategic power." So, yes, an agent that lives on your laptop is only strategically superhuman if it has the resources to actually take strategic action rivaling the most strategically capable groups of humans.
"increasingly accurately" is meant to point out that we don't need to understand or limit the capabilities of things that are obviously much strategically worse than us.

I was confused here, so I had Claude try to explain this to me:

Let me break down Ben's response carefully.
He says you may have missed three key points from his original post:
His definition of "superhuman strategic agent" isn't just about being better at strategic thinking/reasoning - it's about being better than the best human teams at actually taking real-world strategic actions. This is a higher bar that includes implementation, not just planning.
Strategic power is context-dependent. He gives two examples to illustrate this:
An AI in a perfect simulation with no connection to the real world has zero strategic power, because it can't affect anything real
An AI on your laptop (referring to your example) only counts as strategically superhuman if it has enough resources to actually compete with top human strategic actors
The phrase "increasingly accurately" in his original post was meant to convey that we only need to carefully control and understand AI systems that are getting close to human-level strategic capability. We don't need to worry as much about obviously weaker systems.
His response is essentially saying: "Your example of a small agent on a laptop doesn't contradict my point, because I'm specifically talking about agents that have both the capability AND the resources to outcompete top human strategic actors in the real world."
This addresses your concern about limited money/power by saying that an agent without sufficient resources wouldn't qualify as "strategically superhuman" under his definition, even if it had excellent strategic reasoning capabilities.

I think I'm still confused. My guess is that the "most strategically capable groups of humans" are still not all that powerful, especially without that many resources. If you do give it a lot of resources, then sure, I agree that an LLM system with human-outperforming strategy and say $10B could do a fair bit of damage.

Not sure if it's worth much more, just wanted to flag that.

If we are going to build these agents without "losing the game", either (a) they must have goals that are compatible with human interests, or (b) we must (increasingly accurately) model and enforce limitations on their capabilities. If there's a day when an AI agent is created without either of these conditions, that's the day I'd consider humanity to have lost. We might not be immediately wiped out by a nanobot swarm, but from that time forward humans will be more like pawns than players, and when our replacement actuators have been built, we'll likely be left without the resources we need to survive.
I might be many months late to this post, but I wanted to suggest a third option inspired by Russell & Norvig's textbook "Artificial Intelligence: A Modern Approach":
(c) Agents that know they don’t know the objective.
They argue that the standard model of AI, where we supply a fully specified objective, won't scale effectively. They state, "We don’t want machines that are intelligent in the sense of pursuing their objectives; we want them to pursue our objectives... [while being] necessarily uncertain as to what they are."
When an agent recognizes this uncertainty, it is driven to "act cautiously, ask permission, learn more about our preferences..., and defer to human control," enabling provably beneficial behavior.
This approach, in my opinion, beats options (a) and (b) because:
(a) Perfect alignment: We can't fully specify human interests/objectives, and mis-specification harms scale with capability.
(b) Permanent constraints: Static limits are fragile; smarter agents can find ways around them—and they also block useful capabilities.