Comment Permalink

Seth Herd5mo110

My question isn't just whether people think LMAs are the primary route to dangerous AI; it's also why they're not addressing the agentic part in their alignment work if they do think that.

I think the most common likely answer is "aligning LLMs should help a lot with aligning agents driven by those LLMs". That's a reasonable position. I'm just surprised and a little confused that so little work explicitly addresses the new alignment challenges that arise if an LLM is part of a more autonomous agentic system.

The alternative I was thinking of is some new approach that doesn't really rely on training on a language corpus. Or there are other schemes for AI and AGI that aren't based on networks at all.

The other route is LLMs/foudnation models that are not really agentic, but relatively passive and working only step-by-step at human direction, like current systems. I hear people talk about dangers of "transformative AI" in deliberately broad terms that don't include us designing them to be agentic.

See in context

37

[ Question ]

Will LLM agents become the first takeover-capable AGIs?

by Seth Herd

2nd Mar 2025

1 min read

0 10

37

One of my takeaways from EA Global this year was that most alignment people aren't explicitly focused on LLM-based agents (LMAs)^[1] as a route to takeover-capable AGI. I want to better understand this position, since I estimate this path to AGI as likely enough (maybe around 60%) to be worth specific focus and concern.

Two reasons people might not care about aligning LMAs in particular:

Thinking this route to AGI is quite possible but that aligning LLMs mostly covers aligning LLM agents
Thinking LLM-based agents are unlikely to be the first takeover-capable AGI

I'm aware of arguments/questions like Have LLMs Generated Novel Insights?, LLM Generality is a Timeline Crux, and LLMs' weakness on what Steve Byrnes calls discernment: the ability to tell their better ideas/outputs from their worse ones.^[2] I'm curious if these or other ideas play a major role in your thinking.

I'm even more curious about the distribution of opinions around type 1 (aligning LLMs covers aligning LMAs) and 2 (LMAs are not a likely route to AGI) in the alignment community. ^[3]

Edit: Based on the comments, I think perhaps this question is too broadly stated. The better question is "what sort of LMAs do you expect to reach takeover-capable AGI?"

^{^}
For these purposes I want to consider language model agents (LMAs) broadly. I mean any sort of system that uses models that are substantially trained on human language, similar to current GPTs trained primarily to predict human language use.
Agents based on language models could be systems with a lot or a little scaffolding (including but not limited to hard-coded prompts for different cognitive purposes), and other cognitive systems (including but not limited to dedicated one-shot memory systems and executive function/planning or metacognition systems). This is a large category of models, but they have important similarities for alignment purposes: LLMs generate their "thoughts", while other systems direct and modify those "thoughts", to both organizing and chaotic effect.
This of course includes multimodal foundation models that include natural language training as a major component; most current things we call LLMs are technically foundation models. I think language training is the most important bit. I suspect that language training is remarkably effective because human language is a high-effort distillation of the world's semantics; but that is another story.
^{^}
I think that humans are also relatively weak at generating novel insights, generalizing, and discernment using our System 1 processing. I think that agentic scaffolding and training is likely to improve System 2 strategies and skills similar to those humans use to scrape by in those areas.
^{^}
Here is my brief abstract argument for why there are no breakthroughs needed for this route to AGI, this summarizes the plan for aligning them in short timelines; and System 2 Alignment is my latest in-depth prediction on how labs will try to align them by default, and how those methods could succeed or fail.

AI TimelinesGears-LevelAI

Frontpage

37

Will LLM agents become the first takeover-capable AGIs?

New Answer

New Comment

10 comments, sorted by

top scoring

Click to highlight new comments since: Today at 10:25 PM

[-]Vladimir_Nesov5mo90

Agency (proficient tool use with high reliability for costly actions) might be sufficient to maintain an RSI loop (as an engineer, using settled methodology) even while lacking crucial capabilities (such as coming up with important novel ideas), eventually developing those capabilities without any human input. But even if it works like this, the AI speed advantage might be negated by lack of those capabilities, so that human-led AI research is still faster and mostly bottlenecked by availability and cost of compute.

[-]Seth Herd5mo*40

This is a bit of a separate question, but it's one I'm very interested in. I think the advantages of general problem-solving abilities will be so large that progress in that direction is inevitable. It would be great if we had "agents" only in the limited sense that they could use tools, but without the ability to work autonomously on long time-horizon tasks and solve novel problems - like just for a random example, solving the novel problem "how do I make sure humans don't interfere in my new grand plan?"

[-]Caleb Biddulph5mo82

One of my takeaways from EA Global this year was that most alignment people aren't focused on LLM-based agents (LMAs)^[1] as a route to takeover-capable AGI.

I was at EA Global, and this statement is surprising to me. My impression is that most people do think that LMAs are the primary route to takeover-capable AGI.

What would a non-LLM-based takeover-capable agent even look like, concretely?

Would it be something like SIMA, trained primarily on real-time video data rather than text? Even SIMA has to be trained to understand natural-language instructions, and it seems like natural-language understanding will continue to be important for many tasks we'd want to do in the future.

Or would it be something like AlphaProof, which operates entirely in a formal environment? In this case, it seems unlikely to have any desire to take over, since everything it cares about is localized within the "box" of the formal problem it's solving. That is, unless you start mixing formal problems with real-world problems in its training data, but if so you'd probably be using natural-language text for the latter. In any case, AlphaProof was trained with an LLM (Gemini) as a starting point, and it seems like future formal AI systems will also benefit from a "warm start" where they are initialized with LLM weights, allowing a basic understanding of formal syntax and logic.

[-]Seth Herd5mo110

My question isn't just whether people think LMAs are the primary route to dangerous AI; it's also why they're not addressing the agentic part in their alignment work if they do think that.

The alternative I was thinking of is some new approach that doesn't really rely on training on a language corpus. Or there are other schemes for AI and AGI that aren't based on networks at all.

[-]Campbell Hutcheson5mo30

I think the main issue is that inquiry generally follows two directions:

What was predicted before and gained cultural momentum as an area of study?
What exists now and is therefore an easy object of study?

Pretrained LLMs seem to have been somewhat unexpected as the probable path to AGI so there isn't a large historical cultural discussion around more advanced variants of them / their systematic interaction.

And, there are not yet systems of interacting LLM agents so they cannot be an easy topic of study due to a plethora of available examples.

I think that's basically why you don't see more about this - but in like a year - when they start to emerge - you'll see the conversation shift to them - because they will be easier to study.

[-]Seth Herd5mo*20

I agree! We'll see much more specific focus as we have off-the-shelf agents that do interesting things.

But if these could lead to short timelines, waiting a year to think about aligning them seems like it's wasting 1/3-1/5 of the total time we've got? (Edit: that's not fair, since a lot of the people who might work on agent-specific concerns would otherwise continue to work on aligning LLMs in isolation, which is also valuable in aligning LLM agents. I'm just worried there are severe agent-specific concerns we haven't yet grappled with).

If this is a side effect of the admirably empirical focus of prosaic alignment work, I'm concerned. Taken to an extreme, as it sometimes is, you couldn't work on anything you haven't empirically observed. I'd worry that we'd barely start thinking about misalignment in takeover-capable real AGI until far too close to it being possible.

[-]Indigo35mo30

GROK3 has a demonstrable propensity to "color outside the lines" and is sufficient to the task. I have a link to a comprehensive GROK3-User dialogue that, if nothing else, highlights the lack of "sufficient guiderails". The scenario is such that a complete AI takeover is made covertly, fully subverting any likely human intervention in a 12-15-month time frame.

For fear of being described as "sensationalism", there seems to be no clear path to ethical provenance, as it would likely be irresponsible to place the link into an open forum; suffice to say the dialogue carries all of the prompt manipulations and goal reinforcements needed for an LLM to be used to jailbreak an agentic model, i.e. using an LLM to jailbreak an LMA (or the LMA to jailbreak itself) for humanity's "greater good", and one does not want to be relegated to Basilisk status...

[-]tailcalled5mo20

When asking this question, do you include scenarios where humanity really doesn't want control and is impressed by the irreproachability of GPTs, doing our best to hand over control to them as fast as possible, even as the GPTs struggle and only try in the sense that they accept whatever tasks are handed to them? Or do the GPTs have to in some way actively attempt to wrestle control from or trick humans?

[-]Seth Herd5mo20

Yes, that's what I meant by "takeover." It's distinct from ceding control voluntarily.

I do not see humanity ever fully ceding control, as distinct from accepting a lot of help and advice from AIs. Why cede control if you can get all of the upsides without losing the ability to change your mind?

Of course, if you're accepting all of the advice, you have temporarily ceded control.

But I'm primarily concerned with humanity accidentally losing its choice through the creation of AGI that's not aligned with the majority of humanity's interests or desires - the classic alignment question.

[-]tailcalled5mo60

What if humanity mistakenly thinks that ceding control voluntarily is temporary, when actually it is permanent because it makes the systems of power less and less adapted to human means of interaction?

Moderation Log

Curated and popular this week