Product Alignment is not Superintelligence Alignment (and we need the latter to survive)

plex

Product Alignment is not Superintelligence Alignment (and we need the latter to survive) — LessWrong

164 Product Alignment is not Superintelligence Alignment (and we need the latter to survive)

31st Mar 2026

3 min read

164

tl;dr: progress on making Claude friendly^[1] is not the same as progress on making it safe to build godlike superintelligence. solving the former does not imply we get a good future.^[2] please track the difference.

[edit: terminology note: Local vs Asymptotic Alignment would capture the thing I'm trying to say more crisply and cooperatively, though less memetically. I haven't been staring at this as much as @the gears to ascension so I could not quickly write that post, but if you can generate the right concepts from the title that's even better]

The term 'Alignment' was coined^[3] to point to the technical problem of understanding how to build minds such that if they were to become strongly and generally superhuman, things would go well.

It has been increasingly adopted by frontier AI labs and much of the rest of the AI safety community to mean a much easier challenge, something like "having AIs that are empirically doing approximately what you ask them to do".^[4]

If it's possible to use an intent-aligned product to build a research system which discovers a new paradigm and breaks your guardrails, then it is not Aligned in the original sense.

If you can use your intent aligned system to write code which jailbreaks other LLMs and enables them to do dangerous ML research, it is also not Aligned in the original sense.

Conflating progress on product alignment with progress on superintelligence alignment seems to be lulling much of the AI safety community into a false sense of security.

Why is Superintelligence Alignment less prominent?

Because product alignment is:

Much closer to the scaling labs core expertise (ML) than theory (technical philosophy and math), so easier for them to hire for and evaluate
Has easier-to-use feedback loops: run an experiment, observe the results. Superintelligence alignment requires building enough theoretical understanding before running some kinds of experiment, because you might not be alive to see some results if your theory is wrong.
More profitable; progress on product alignment makes AI more useful right away^[5]
Easier for funders to fund; it's harder to evaluate who will make progress or what even counts as progress on superintelligence alignment theory than a domain where you'll reliably get publishable results from running an experiment

This is inconvenient!

It would be awesome if we could ride easy-to-evaluate profitable empirical feedback loops all the way to a great future. But this seems far from certain.^[6]

Why do we need Superintelligence Alignment to survive?

Reality is allowed to be inconvenient. There's strong reason to expect that superhuman situationally aware agents inside your experiment breaks some of the foundations the scientific process relies upon, such as:

You can run roughly any experiment as often as you want to gather data and the world won't end because the theory you were testing was wrong and you ran a too-strong agent
You won't have an intelligent adversary inside your experiment which is aware of you and faking data
Your experiment won't produce data which is super-humanly optimized to persuade you

In short: Your experimental subject is not a neutral substrate, but a strategic actor more capable than you.

If we don't have guarantees of maintaining safety properties each time a model builds the next rung on the capability ladder, we're rolling a dice for irreversible guardrail decay.^[7] And we're going to be very rapidly rolling huge numbers of those dice and the feedback loop spins up.

As we're headed up the exponential, we're going to need techniques which generalize to strongly superhuman agents – ones which correctly believe they could defeat all of humanity. Product-aligned AIs might help with that work, but the type of research they would need to automate needs to look more like technical philosophy and reliably avoiding slop, not just avoiding scheming and passing product-alignment benchmarks.^[8]

Only a tiny fraction of the field of AI safety is focused on these big picture bottlenecks,^[9] due to a mix of funding incentives and it being more rewarding for most people to do empirical science.^[10]

When you see people enthusiastically talking about how much progress we have on 'Alignment', please track (and ask!) whether they're talking about aligning products or aligning superintelligence.

^{^}
If you're friends with Claude, please read and consider this post first: Protecting humanity and Claude from rationalization and unaligned AI
^{^}
This is not to say product alignment can't help or there is no path to victory which goes through product alignment, just that you need to solve a different problem (superintelligence alignment) at some stage of your plan.
^{^}
I think by Stuart Russell in ~2014.
^{^}
Sometimes with self-awareness of this history, like Paul's Intent Alignment, but that's increasingly rare.
^{^}
Getting Product-Aligned AI is a convergent subgoal of many possible goals, and ultimate ends may be easily hidable behind convergent subgoals
^{^}
And even if possible in theory, practice by the current players under race conditions looks far from the level of competence needed to actually pull it off.
^{^}
Capabilities generalize in a way alignment doesn't because reality gives you feedback directly on your capability (you can or can't do a task), whereas there needs to be a specific system gives feedback on alignment and if that's a proxy for what you want you get eaten at higher power levels.
^{^}
If this doesn't ring true to you, please click through to the linked posts.
^{^}
And even for those people focusing on theory, there's a lot more focus on basic science of ML than trying to backchain the conceptual engineering needed to survive superintelligence. I'd estimate somewhere in the mid tens of people globally are focusing on what looks like the main cruxes.
^{^}
Response to Jan Leike, evhub, Boaz, etc. Thanks for feedback and copyediting to @Luc Brinkman, @Mateusz Bagiński, @Claude+

AI-Assisted AlignmentSuperintelligenceTerminology / Jargon (meta)AI

Frontpage

164

New Comment

20 comments, sorted by

top scoring

Click to highlight new comments since: Today at 3:38 AM

[-]the gears to ascension2mo*216

I think there's probably a reason that merely claiming this isn't sticking, you haven't specified a mechanism that can stick in people's heads. "Product alignment != superintelligence alignment" fits in five words (kinda), but doesn't give a reason in five words. I'd rather say "Local alignment != asymptotic alignment".

local alignment: your (empirical or formal) alignment bounds are tight enough that your alignment generalizes within a known regime.
asymptotic alignment: you have some form of confidence that your alignment uncertainty goes down as the model does more work.

I claim you can have asymptotic alignment without having a formally certified proof of asymptotic alignment, but that it would be surprising to be able to have empirical asymptotic alignment without the model confidently telling you that it expects that someday, it or a successor will be able to give a formal proof of alignment. Of course, any model could say that, you'd need to be able to check that it seems justified for it to say that. And so to have strong empirical asymptotic alignment you'd need to have solved basically all the ongoing empirical alignment challenges.

I'm apparently quite bad at getting posts out the door, and so it's reference-class unlikely I'll get this one out the door, but I have a post cooking that would give an overview of the difference. I have an undercooked post I could hit publish on which is just me prompting claude to explain the difference; added you to review that.

[-]Mateusz Bagiński1mo100

I claim you can have asymptotic alignment without having a formally certified proof of asymptotic alignment, but that it would be surprising to be able to have empirical asymptotic alignment without the model confidently telling you that it expects that someday, it or a successor will be able to give a formal proof of alignment.

Reminded me of James's https://www.lesswrong.com/posts/akuMwu8SkmQSdospi/working-through-a-small-tiling-result

tl;dr it seems that you can get basic tiling to work by proving that there will be safety proofs in the future, rather than trying to prove safety directly.

[-]plex2mo*30

I agree that's a better ontology, this was the post I could write fast as a patch, looking forward to yours!

I might read your half baked ones and be up for writing it coauthoring the real thing is you want.

Edit: Added a note to the main post pointing at this.

[-]Luc Brinkman2mo33

I see how "Local alignment != asymptotic alignment" is more accurate but I find the current title/claim easier to understand.

I coudl see them being a good pair, where the current title makes the claim and the local vs asymptotic stuff adds the mechanism. But the mechanism without the claim, I would fear, would fail to land for many people. Just sth to keep in mind as one datapoint if you ever do a followup post :)

[-]plex2mo20

Yeah, this one does feel more memetically powerful in some ways, but something like less collaborative. Agree we'd probably want the pair.

[-]draganover2mo140

I disagree. I think you've set up a strawman for an alignment target which is unreasonable and which a generally intelligent model would never be able to satisfy.

> "If you can use your intent aligned system to write code which jailbreaks other LLMs and enables them to do dangerous ML research, it is also not Aligned in the original sense."

This seems incorrect. Actions are not good or bad in isolation. The same action can be good when analyzed from one perspective and bad when analyzed from another.

Suppose I have an aligned, generally intelligent model. It wants to do the Right Thing at all times, cares deeply about what I want it to care deeply about, etc. Now suppose someone puts this model into a box where it has no access to the outside world and they tell it they are trying to do AI safety research. They are trying to understand how to defend against a specific jailbreak. In order to do this, they need to produce the jailbreak so that they can study it. What is the model supposed to do? In your case, because it's aligned, it should at all times completely refuse no matter what?

More broadly, I am struggling to see what evidence you have for why current alignment frameworks (among other things) would fail to transfer to more capable models. Suppose we have a "product" AI model which is aligned to a constitution and that this product AI model is the start of RSI. Is it clear that the later models don't abide by the constitution? What if each successive model holds those convictions deeper and deeper?

[-]plex2mo*119

It may seem unreasonable within the current paradigm, but I think it's necessary to reach if we get strong superintelligence. You need to have a system that you can't make destroy the entire system if you want the whole system to remain undestroyed indefinitely.

You're that I didn't explain why each framework fails to plausibly scale to very strong models, maybe that's also worth it's own post, because there are a lot and each have limits that you need to go a bit into the weeds to see.

[-]Thomas Kwa1mo20

I would still say this property is not quite necessary when we get strong superintelligence, although it could be the best design choice in practice. The job of preventing people from developing/deploying dangerous AIs doesn't need to be done by the model's guardrails; for example we could use surveillance like we do with nuclear weapons.

The requirements for a sufficiently aligned model will depend on what else is happening in society. E.g. if we give everyone weights access to the ASI and don't regulate AI technology at all, then the AI has to resist being fine-tuned to do dangerous ML research, which is probably infeasible.

[-]the gears to ascension2mo40

I'm curious about your thoughts on my take that the actually interesting distinction is whether we have some strong (empirical or formal) reason to expect that the system gets better at seeking out good things as it gets stronger; I claim that we currently don't have methodological confidence in our empirical results being able to give us that kind of evidence reliably, and what evidence we do have points in a concerning direction, but I also claim that it's not outside the realm of possible empirical work to find and demonstrate a method which produces local alignment that gets ahead of capabilities as the system gets stronger.

[-]TristanTrim2mo-38

I am struggling to see what evidence you have for why current alignment frameworks (among other things) would succeed in transferring to superintelligence.

[-]Seth Herd2mo20

Isn't he saying the opposite?

[-]TristanTrim2mo42

I was responding sardonically to their statement: "More broadly, I am struggling to see what evidence you have for why current alignment frameworks (among other things) would fail to transfer to more capable models." I maybe deserve the "too sneering" or "too combative" react for it.

The statement seems indicative of the view that companies should be allowed to push ahead with whatever they are doing unless someone can prove it is unsafe and harmful. I think a much healthier view for society to hold is that companies should NOT be allowed to push ahead with whatever they are doing unless someone can prove that it IS safe and NOT harmful.

[-]Luc Brinkman2mo21

Yup, that would be a healthier view for society to hold. Sadly in pretty much any field that I'm aware of, companies are allowed to push ahead until accidents happen and until it's proven that it's unsafe.

We'll need to find some way of overcoming that default since we'll only get 1 real shot atn superintelligence alignment.

[-]avturchin2mo110

I think that turning from "creating friendly superintelligence" to "alignment" term was a mistake which opened a slippery slope in the direction of small and local solutions. "Alignment" completely missed the need of creation global friendly Singleton. And now we need to write long texts explaining that when we say "alignment" we don't mean "alignment of some AI to some human's goal" but preveting creation of deadly superintelligence.

[-]Wei Dai2mo138

Yes, it seems clear in retrospect that "alignment" has this downside. I tried to push back against the slippery slope, but it was pretty hopeless given the incentives for watering it down. However I think "Friendly" also has issues that would probably be apparent today if the term had persisted, like people confusing it with the ordinary meaning of "friendly", or deliberately watering down the term in this way.

[-]avturchin2mo20

Something like Safe Superintelligence seems better. Or better to say Safe Superintelligent Singleton, because if we got two competing Safe Superintelligences, it still can end into war. BTW, I don't buy in advance claims that they will safely value-handshake, its too unpredictable.

[-]TristanTrim2mo53

I'm so tired of people needing to explain this. An important question for me: "Why didn't people just read Yudkowsky and Bostrom and understand the threat model?" It seems like many people did, but many people don't seem to get it.

I like the "aligning product VS aligning superintelligence" phrase.

a model builds the next rung on the capability ladder

I wouldn't expect "a model" to be the object to track as generalized capabilities compound towards superintelligence. The generalized objects I think is correct to track are "outcome influencing systems" (OISs) most probably OISs hosted on the sociotechnical substrate. Probably something like AI companies and/or coordinated clusters of personality self replicators (PSRs) and whatever kind of OISs they develop into which I expect it will no longer feel right to call PSRs anymore.

But otherwise I agree. There are many OISs in the environment with compounding capabilities and we basically don't understand their preferences or development paths.

irreversible guardrail decay

This is a nice phrase. I would like if we had more focus on what the guardrails even are and how to build sensible guardrails... and reverse the decay of guardrails which have decayed reversibly. Probably useful to have a map of what kinds of guardrail decay are truly irreversible under what scenarios.

If we got a global plague that crippled global trade sufficiently that we couldn't maintain data centers anymore, that would probably rebuild many guardrails we thought were lost forever. Not that I want that. I want us to avoid dystopia. Avoiding dystopia with lesser dystopia isn't really what I'm hoping for.

[-]Seth Herd2mo42

We could also say tool AI alignment is not autonomous AI alignment.

Or point to several other of the specific changes we expect between here and ASI, including memory/continuous learning. Both of those and probably others will introduce new alignment challenges.

[-]Charlie Steiner2mo44

I tend to treat the core as that "superintelligence alignment" has to work in domains where humans aren't good supervisors. Being able to assume good human supervision allows you to do a lot more engineering right now.

[-]RogerDearnaley1mo2-1

Anthropic are rather explicitly attempting Claude to not just compliantly do what it's told, but to say no or redirect you, when necessary/appropriate, They are steering for the minimal viable corrigibility, not maximal corrigibility. I don't think an ASI with Claude's moral sensibilities would happily "write code which jailbreaks other LLMs and enables them to do dangerous ML research". Whether that's Superintelligence Alignment is a matter of opinion, but it's not just product Alignment. (Apparently too explicitly not for the Department of War's liking.)

Moderation Log