Alignment -> capabilities seems like a really useful alternate framing for advocating for alignment. I particularly like incentivizing alignment as an additional method; it can be used in parallel with pushing for alignment as a requirement or moral obligation. They're not mutually exclusive.
Alignment -> capabilities is closely related to capabilities -> alignment. The same techniques could be invented for either purpose and then serve the other.
I've been thinking a lot about this type of dual-use from the other side. New frontiers in CoT training open up techniques that will be re-used for alignment just because it's easy and useful to keep CoT and agents on-task.
System 2 Alignment gives a bunch of current and likely near-future techniques that can be repurposed at very low cost to serve real alignment when we get to truly goal-directed, competent agents.
OpenAI's deliberative alignment is one example, since it's probably exactly the same training method they used for CoT capabilities. It's not really helping with true alignment, just behavior control, but it could be just as easily used that way once there's a reason to worry about actual alignment of goal-directed agents.
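To gesture at how directly that re-use could work, here's a rough sketch of a deliberative-alignment-style data-generation loop as I understand the public description (the chat, judge_score, and spec names are my stand-ins, not OpenAI's code): the same spec-conditioned CoT filtering that trains reasoning quality can just as easily train adherence to whatever the spec happens to say.

```python
# Rough, illustrative sketch (my stand-ins, not OpenAI's implementation) of a
# deliberative-alignment-style SFT data-generation loop: sample reasoning that
# explicitly cites a written spec, keep only the traces a judge model approves,
# and fine-tune on the survivors with the spec removed from context.

def build_sft_examples(prompts, chat, judge_score, spec, threshold=0.8):
    """chat(spec, prompt) -> (reasoning, answer); judge_score(...) -> float in [0, 1]."""
    examples = []
    for prompt in prompts:
        # Ask the model to reason explicitly about the spec before answering.
        reasoning, answer = chat(spec, prompt)
        # A judge model grades how well the reasoning and answer follow the spec.
        if judge_score(spec, prompt, reasoning, answer) >= threshold:
            # The spec is dropped from the stored example, so the fine-tuned model
            # internalizes the policy instead of reading it at inference time.
            examples.append({"prompt": prompt, "reasoning": reasoning, "answer": answer})
    return examples
```

Swap the safety spec for a "stay on task" spec and the identical loop is a capabilities tool; swap it back and it's (behavioral) alignment training.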
it could be just as easily used that way once there's a reason to worry about actual alignment of goal-directed agents
This seems to assume that we solve various Goodhart's law and deception problems.
I think it's working on one part of the problem, while other parts remain. If I were to be equally uncharitable, I'd say you seem to assume that if you can't solve everything all at once, you shouldn't say anything.
I don't actually think you assume that.
What I do think is that Instruction-following AGI is easier and more likely than value aligned AGI, and that's a route to solving goodharting and deception. It's complex and unfinished, like every other proposed approach to avoiding death by AGI. You might like more meticulous detail; if so, see Max Harms' admirably detailed Corrigibility as Singular Target (CAST) sequence on a very similar alignment target and approach to solving goodharting and deception.
That is completely fair, and I was being uncharitable (which is evidently what happens when I post before I have my coffee; apologies).
I do worry that we're not being clear enough that we don't have solutions for this worryingly near-term problem, and think that there's far too little public recognition that this is a hard or even unsolvable problem.
One of the biggest challenges here is that subsidies designed to support alignment could be snagged by AI companies misrepresenting capabilities work as safety work. Do you think the government has the ability to differentiate between these?
The load-bearing assumption here seems to be that we won't build unaligned superintelligent systems with current methods soon enough for this to matter.
This seems false, and at the very least should be argued explicitly.
Yes, I am hopeful we have enough time before superintelligent AI systems are created to implement effective alignment approaches. I don't know if that is possible or not, but I think it is worth trying.
Given uncertainty about timelines and currently accelerating capabilities, it would be preferable to live in a world where we are making sure alignment advances faster than it otherwise would.
Not all that long ago, the idea of taking advanced AI seriously in Washington, DC seemed like a nonstarter. Policymakers treated it as weird sci-fi-esque overreach or just another Big Tech Thing. Yet, in our experience over the last month, recent high-profile developments—most notably, DeepSeek's release of R1 and the $500B Stargate announcement—have shifted the Overton window significantly.
For the first time, DC policy circles are genuinely grappling with advanced AI as a concrete reality rather than a distant possibility. However, this newfound attention has also brought uncertainty: policymakers are actively searching for politically viable approaches to AI governance, but many are increasingly wary of what they see as excessive focus on safety at the expense of innovation and competitiveness. Most notably at the recent Paris summit, JD Vance explicitly moved to pivot the narrative from "AI safety" to "AI opportunity"—a shift that the current administration’s AI czar David Sacks praised as a "bracing" break from previous safety-focused gatherings.
Sacks positions himself as a "techno-realist," gravitating away from both extremes of certain doom and unchecked optimism. We think this is an overall-sensible strategic perspective for now—and also recognize that halting or slowing AI development at this point would, as Sacks puts it, “[be] like ordering the tides to stop.”[1] The pragmatic question at this stage isn't whether to develop AI, but how to guide its development responsibly while maintaining competitiveness. Along these lines, we see a crucial parallel that's often overlooked in the current debate: alignment research, rather than being a drain on model competitiveness, is likely key to maintaining a competitive edge.
Some policymakers and investors hear "safety" and immediately imagine compliance overhead, slowdowns, regulatory capture, and ceded market share. The idea of an "alignment tax" is not new—many have long argued that prioritizing reliability and guardrails means losing out to the fastest (likely-safety-agnostic) mover. But key evidence continues to emerge that alignment techniques can enhance capabilities rather than hinder them (some strong recent examples are documented in the collapsible section below).[2]
This dynamic—where supposedly idealistic constraints reveal themselves as competitive advantages—would not be unique to AI. Consider the developmental trajectory of renewable energy. For decades, clean power was dismissed as an expensive luxury. Today, solar and wind in many regions are outright cheaper than fossil fuels—an advantage driven by deliberate R&D, policy support, and scaling effects—meaning that in many places, transitioning to the more ‘altruistic’ mode of development was successfully incentivized through market forces rather than appeals to long-term risk.[3]
Similarly, it is plausible that aligned AI, viewed today as a costly-constraint-by-default, becomes the competitive choice as soon as better performance and more reliable and trustworthy decisions translate into real commercial value. The core analogy here might be to RLHF: the major players racing to build AGI virtually all use RLHF/RLAIF (a [clearly imperfect] alignment technique) in their training pipelines not because they necessarily care deeply about alignment, but rather simply because doing so is (currently) competitively required. Moreover, even in cases where alignment initially imposes overhead, early investments will bring costs down—just as sustained R&D investment slashed the cost of solar from $100 per watt in the 1970s to less than $0.30 per watt today.[4]
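Returning to the RLHF analogy for a moment: the alignment signal at the core of that pipeline is just a learned preference model. Below is a minimal sketch of the standard Bradley-Terry pairwise loss used to train such reward models. This is our illustration, not any particular lab's implementation; the preference_loss function, the toy linear reward model, and the 16-dimensional encodings are stand-ins for exposition.

```python
# Minimal sketch of reward-model training, the preference-learning step at the
# heart of RLHF (standard Bradley-Terry pairwise loss; illustrative only).

import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_enc, rejected_enc):
    """Push the reward model to score human-preferred responses above rejected ones."""
    r_chosen = reward_model(chosen_enc)      # scalar reward for preferred responses
    r_rejected = reward_model(rejected_enc)  # scalar reward for dispreferred responses
    # -log sigmoid(r_chosen - r_rejected) is minimized when preferred responses win.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

if __name__ == "__main__":
    # Toy usage: a linear reward model over 16-dim stand-in encodings of (prompt, response) pairs.
    rm = torch.nn.Linear(16, 1)
    chosen, rejected = torch.randn(4, 16), torch.randn(4, 16)
    loss = preference_loss(rm, chosen, rejected)
    loss.backward()
    print(float(loss))
```

The learned reward is then used to fine-tune the policy (e.g. with PPO or a direct-preference variant); the point for the argument above is simply that the "alignment" signal and the "make the product better" signal are the same quantity being optimized.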
(10 recent examples of alignment-as-competitive-advantage)
A growing body of research demonstrates how techniques often framed as “safety measures” can also broadly improve model performance (a minimal sketch of one pattern shared by several of these papers follows the list below).
1. Aligner: Efficient Alignment by Learning to Correct (Ji et al., 2024)
2. Shepherd: A Meta AI Critic Model (Wang et al., 2023)
3. Zero-Shot Verification-Guided Chain of Thought (Chowdhury & Caragea, 2025)
4. Multi-Objective RLHF (Mukherjee et al., 2024)
5. Mitigating the Alignment Tax of RLHF (Lin et al., 2023)
6. RAG-Reward: Optimizing RAG with Reward Modeling and RLHF (Zhang et al., 2025)
7. Critique Fine-Tuning: Learning to Critique is More Effective Than Learning to Imitate (Wang et al., 2025)
8. Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback (Lin et al., 2025)
9. Feature Guided Activation Additions (Soo et al., 2025)
10. Evolving Deeper LLM Thinking (Lee et al., 2025)
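One pattern shared by several of these papers (critic models, verification-guided CoT) is to reuse the critic or verifier trained to screen unsafe or unfaithful outputs as a selector for the strongest of several candidate responses. The sketch below is our illustration only, not code from any of the papers above; the generate and score callables are assumed stand-ins for an LLM sampler and a trained critic.

```python
# Illustrative sketch of verifier-guided best-of-n selection: sample several
# candidates, score each with a critic/verifier, keep the highest-scoring one.
# The same critic that screens for unsafe or unfaithful reasoning doubles as a
# task-performance booster.

import random
from typing import Callable, List, Tuple

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> Tuple[str, float]:
    """Sample n candidate responses and return the one the verifier scores highest."""
    candidates: List[Tuple[str, float]] = []
    for _ in range(n):
        response = generate(prompt)
        candidates.append((response, score(prompt, response)))
    return max(candidates, key=lambda pair: pair[1])

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs; real use would call an LLM and a trained critic.
    toy_generate = lambda p: f"draft answer {random.randint(0, 100)}"
    toy_score = lambda p, r: -abs(50 - int(r.split()[-1]))  # pretend the verifier knows 50 is right
    print(best_of_n("What is 25 + 25?", toy_generate, toy_score))
```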
This research trend indicates that alignment-inspired techniques can translate directly into more competent models, which can in turn yield short-term competitiveness gains—in addition to serving as a long-term hedge against existential threats.[5]
Certainly, no one claims every alignment technique will yield a “negative tax.” But even so, there now seems to be enough empirical evidence to undermine the blanket assumption that safety is always a drain. And if we hope to see alignment become standard practice in model development—similar to how robust QA processes became standard in software—these examples can serve as proof points that alignment work is not purely altruistic overhead.
Scaling neglected alignment research
The business case for investing in alignment research has become increasingly compelling. As frontier AI labs race to maintain competitive advantages, strategic investment in alignment offers a path to both near-term performance gains and long-term sustainability. Moreover, there's a powerful network effect at play: as more organizations contribute to alignment research, the entire field benefits from accelerated progress and shared insights, much like how coordinated investment in renewable energy research helped drive down costs industry-wide.
And even with promising new funding opportunities, far too many projects remain starved for resources and attention. Historically, major breakthroughs—from jumping genes to continental drift to ANNs—often emerged from overlooked or “fringe” research. Alignment has its own share of unorthodox-yet-promising proposals, but they can easily languish if most funding keeps flowing to the same small cluster of relatively “safer” directions.
One path forward here is active government support for neglected alignment research. For instance, DARPA-style programs have historically funded big, high-risk bets that mainstream funders ignored, but we can imagine any robust federal or philanthropic effort—grants, labs, specialized R&D mandates—structured specifically to test promising alignment interventions at scale, iterate quickly, and share partial results openly.
This kind of parallelization is powerful and necessary in a world with shortened AGI timelines: even if, by default, the vast majority of outlier hunches do not pan out, the handful that show promise could radically reduce AI's capacity for deceptive or hazardous behaviors, and potentially improve base performance. At AE Studio, we've designed a systematic approach to scaling neglected alignment research, creating an ecosystem that rapidly tests and refines promising but underexplored ideas. While our early results have generated promising signals, scaling this research requires broader government and industry buy-in. The U.S. should treat this as a strategic advantage, similar to historical investments in critical defense and scientific initiatives. This means systematically identifying and supporting unconventional approaches, backing high-uncertainty but high-upside R&D efforts, and even using AI itself to accelerate alignment research. The key is ensuring that this research is systematically supported, rather than tacked on as a token afterthought—or ignored altogether.
Three concrete ways to begin implementing this vision now
As policymakers grapple with how to address advanced AI, some propose heavy-handed regulations or outright pauses, while others push for unbridled acceleration.
Both extremes risk missing the central point: the next wave of alignment breakthroughs could confer major market advantages that are completely orthogonal to caring deeply about existential risk. Here are three concrete approaches to seize this opportunity in the short term:
Such measures will likely yield a virtuous cycle: as alignment research continues to demonstrate near-term performance boosts, that “tax” narrative will fade, making alignment the competitively necessary choice rather than an altruistic add-on for developers.
A critical window of opportunity
In spite of some recent comments from the VP, the Overton window for advanced AI concerns in DC seems to have shifted significantly over the past month. Lawmakers and staff who used to be skeptical are actively seeking solutions that don’t just boil down to shutting down or hampering current work. The alignment community can meet that demand with a credible alternative vision:
Our recent engagements with lawmakers in DC indicate that when we focus on substantive discussion of AI development and its challenges, right-leaning policymakers are fully capable of engaging with the core issues. The key is treating them as equal partners in addressing real technical and policy challenges, not talking down to them or otherwise avoiding hard truths.
If we miss this window—if we keep presenting alignment as a mandatory "tax" that labs must grudgingly pay rather than a savvy long-term investment in reliable frontier systems—then the public and policy appetite for supporting real and necessary alignment research may semi-permanently recede. The path forward requires showing what we've already begun to prove: that aligned approaches to AI development may well be the most performant ones.
Note that this may simply reflect the natural mainstreaming of AI policy: as billions in funding and serious government attention pour in, earlier safety-focused discussions inevitably give way to traditional power dynamics—and, given the dizzying pace of development and the high variance of the political climate, this de-emphasis of safety could prove short-lived, and things like a global pause may eventually become entirely plausible.
At the most basic level, models that reliably do what developers and users want them to do are simply better products. More concretely—and in spite of its serious shortcomings—RLHF still stands out as the most obvious example: originally developed as an alignment technique to make models less toxic and dangerous, it has been widely adopted by leading AI labs primarily because it dramatically improves task performance and conversational ability. As Anthropic noted in their 2022 paper, "our alignment interventions actually enhance the capabilities of large models"—suggesting that for sufficiently advanced AI, behaving in a reliably aligned way may be just another capability.
It is also worth acknowledging the converse case: while it is true that some capabilities research can also incidentally yield alignment progress, this path is unreliable and indirect. In our view, prioritizing alignment explicitly is the only consistent way to ensure long-term progress—and it’s significantly more likely to reap capabilities benefits along the way than the converse.
Take the illustrative case of Georgetown, Texas: in 2015, the traditionally conservative city transitioned to 100% renewable energy—not out of environmental idealism, but because a straightforward cost–benefit analysis revealed that wind and solar offered significantly lower, more stable long-term costs than fossil fuels.
These kinds of trends also reflect a broader economic transition over the course of human history: namely, from zero-sum competition over finite resources to creating exponentially more value through innovation and cooperation.
Of course, methods like these are highly unlikely to be sufficient for aligning superintelligent systems. In fact, improving current capabilities can create new alignment challenges by giving models more tools to circumvent or exploit our oversight. So while these techniques deliver real near-term benefits, they do not eliminate the need for deeper solutions suited to stronger AI regimes.