I think what you perceive as an aversion to empiricism and rigorous justification is just an acknowledgment of the areas where empiricism and rigorous justification are currently impossible.
It is very common for AI safety researchers to fall for the trap of doing mostly useless (or counterproductive) empirical work rather than trying to solve important problems. We know that something abstract like agent foundations will be necessary, because it is strictly impossible to do empirical work on superintelligence if superintelligence doesn't exist. (You probably can't meaningfully run experiments on superintelligence once it does exist, either.) We need theories that are able to give us a clear line of sight beyond several thresholds that are known to be hard epistemic barriers for all current methods.
Put another way, almost all empirical AI safety work has the implicit epistemic status of: "it is unknown whether this is actually relevant to AI safety."
Mostly-failing to solve problems that might actually help if solved is a lot better than successfully solving problems that are very unlikely to help. It is better to spend decades trying to invent a flashlight that lets you see through walls rather than to spend that time mapping everything under the street lamps. If we are foolish enough to crash on through the next barrier, the cracks in the sidewalk that we have been studying aren't going to give us any information about whether there is a sinkhole ahead.
I agree that agent foundations and related abstract efforts need to be criticized, and they deserve good criticism. They appear to be the only efforts of the kind that can actually allow superintelligence and humans to coexist. I see your criticism as blurring the line between "I think they need to do this more carefully" and "I think they should do something else instead," and it doesn't grapple with the hard epistemic barriers between here and superintelligence.
(Of course, the most valuable efforts are the ones to prevent superintelligence from existing at all, at least until we can solve these problems, if they are even possible to solve. If it is truly the case that our only options are either building useless castles in the sky or playing in the mud, then we should stop pretending we are responsible adults and indefinitely shelve the whole enterprise.)
Regarding safe outcomes for superintelligence, your parenthetical remark is the one I believe most important. Far above any prosaic or theoretical safety work, our priority should be regulation preventing the development and release of superintelligence, at least until we have strong guarantees on its safety.
I don't really disagree with any of the other points in your comment. Without a regulatory framework, it seems very likely that prosaic safety techniques will only contribute to bad outcomes. So it makes sense to me if one wants to focus on agent foundations and similar theoretic work. My post is not intended as a critique of agent foundations per se!
However, I do believe that one must be clearsighted on the risks of theoretic work, particularly when built upon abstractions. My critique is that agent foundations sometimes fails to make its assumptions explicit and works backwards from abstractions, effectively building a castle in the sky. A more robust approach would be to make these assumptions very explicit, ideally linking a theory to a set of axioms, so that we can better assess the defensibility of a theory. Some branches of continental philosophy are very bad at this (e.g. Lacan), starting from "metaphor" rather than an axiom, which is why I draw the parallel.
I will note that prosaic safety work could be relevant under a strong regulatory framework. For example, suppose we established an international treaty to freeze AI development at ChatGPT 5.5 Pro / Mythos. The treaty states that we can only advance to higher capabilities/intelligence when we are "sure" that the next model is aligned. With huge amounts of resources dedicated to verifying that the next model is safe, it seems feasible to me that prosaic approaches could play a large role in building safe AI under such a regime.
Now, setting up sufficiently strong regulation is of course very hard, and one might critique that "proving" that the next generation of a model is aligned is akin to solving alignment itself! But I suspect that guaranteeing a single model is aligned is much easier than solving alignment for all possible models.
I would still guess it is better not to do prosaic safety work until a global regulatory framework exists, since it accelerates AI progress and thus reduces opportunities to implement said regulation. But there are enough counterarguments that I would be careful moralizing over it (not suggesting anyone in the comments is doing so!).
>> "Now, setting up sufficiently strong regulation is of course very hard, and one might critique that "proving" that the next generation of a model is aligned is akin to solving alignment itself! But I suspect that guaranteeing a single model is aligned is much easier than solving alignment for all possible models."
Do you know how to prove "alignment of a model"?
This is currently almost as hard as solving alignment for all possible models.
In fact, strictly speaking it is impossible - the model may have simply encrypted its thoughts, and there is no method known to man that can break one-way functions.
Now maybe we make some assumption that we are in a good world where the AI didn't cryptographically hide its thoughts. It doesn't really help. The fundamental problem is that there are no behavioural tests that can definitively exclude a sharp left turn / treacherous turn / deception / name-du-jour. So there is no 'proving', just hope & cope, and we're back to prosaic alignment & evals again.
We are much more in agreement than I expected! Thank you for clarifying. I agree with every point you made in this comment. (Excepting the comparison to continental philosophy, which is new to me and not something I think I can evaluate one way or the other.)
> it is strictly impossible to do empirical work on superintelligence if superintelligence doesn't exist
This is of course true, but I think that a lot of researchers in agent foundations fall into the trap of concluding that empirical work on current AI systems gives us ~no information about superintelligent systems, which I strongly disagree with. There are lots of different shapes of minds, and I think it's quite important to try to get information about what shape of superintelligence we are actually heading towards so we can ensure our agent foundations are about the relevant kinds of agents. I am sceptical of any approaches that try to make <for all> claims about minds.
The shape of superintelligence that we are building isn't necessarily the shape of superintelligence that will ultimately exist, though, right? That's one of the hard barriers to see beyond: the ability for an AI to invent new AI paradigms.
It's better than useless to study current systems if they give us insight into intelligence in general, and we might get useful information about the types of minds that are likely to exist. (I find work like the Natural Abstractions Hypothesis really interesting, and that benefits from empirical work). But that isn't quite the same as studying things that are shaped like the superintelligence we will get.
Mostly agree with the vibes of what you're saying, but I think the shape of intelligences that we are currently building is likely to give us useful information about the shape of superintelligence that will ultimately exist, even if it is not an insight about intelligence in general. There is a large space of possible systems that we would consider superintelligent, and I expect the ones we ultimately end up getting will be pretty path dependent.
I think for this post to be more substantial, it would need to
> I believe this is partly due to the specific writing style of Eliezer Yudkowsky, which tends to be verbose and prone to metaphor and other narrative approaches of argumentation.
I didn't study EY's other technical writing, but his Arbital sequence is very concise and narrative-free. For example, check out the article on ontology identification.
> A lot of commentary on AI risk and alignment theory takes place on LessWrong. As a community blog intended to provoke new directions of thought, LessWrong is fantastic! An important first step in developing new theories is to work on concepts, while formalization and evidence come later. But a lot of actual alignment work also occurs directly on LessWrong, or the associated AI Alignment Forum. Because of this dual use, I feel that there has been a mode collapse, where it is difficult to distinguish between metaphorical/conceptual work and rigorous research.
I think there are factors which make rigorous and informal work hard to tell apart (for example, see the historical and ahistorical debates about the rigour of early calculus). But those factors have nothing to do with forum norms and you don't get into them. Maybe you mean "it's hard for an outsider to understand which work has to be taken fully seriously" (but you don't seem to really talk about that either) or "researchers disagree which things are rigorous" (but that has nothing to do with forum norms, again).
Btw, it rarely feels like research really happens on LW or Alignment Forum. It seems like research happens in cliques of people and the results get published on forums. But I have little experience here.
(1) It's surprising to me that you bring up analytic philosophy as a better parallel. Writing in agent foundations / LessWrong feels very different to me than analytic philosophy!
Analytic philosophy works within a well established and rigorous taxonomy of terms / concepts, as evidenced by, e.g., PhilPapers and the Stanford Encyclopedia of Philosophy. The assumptions at the roots of this taxonomy are generally pretty well explored. So even if philosophers are not exactly deducing an entire chain of belief for every paper, we can usually articulate the tradition within which an author operates, and understand the common arguments and axioms.
This is in contrast to continental philosophy, which is often much less explicit about its assumptions, and instead draws on a hodge-podge of different thinkers, ranging from Freud to Hegel, without rigorously examining its own claims. Not all continental philosophy is like this! Alain Badiou, for example, starts from an ontological exploration of reality based on set theory to build up to his theory of politics. But the parallel with continental philosophy is exactly to point out this lack of consistency and this poor habit of leaving assumptions implicit.
If others believe I am being too generous in my treatment of analytic philosophy, I'd be interested to hear why.
(2) I agree with your point that the examples could be improved.
(3) I agree that clean conclusions would be nice, but it seems legitimate as well to simply identify the problem. I'd also assert that some of the conclusions are implicit in the critique, i.e. be cautious of formalizing an inherently imprecise concept, or don't treat the "epistemic status" label as permission to advocate for a dubious opinion. Agent foundations has a very hard task set for itself, so I wouldn't pretend to have the answers for how it can ensure intellectual rigor.
EY has been incredibly productive, so while I'm sure there's counterpoints like those you cited, The Sequences themselves seem like a clear example of a more verbose writing style (without making an assessment of this as good/bad; maybe it's fit for purpose! My critique is that this has influenced others to replicate the style when it may not be appropriate).
I guess my main problem with the post is that too many points are implicit/underexplained/have no examples.
> be cautious of formalizing an inherently imprecise concept
This is an interesting point, but it felt random in the post. Like you just drop this point, disconnected from the main criticism.
> don't treat the "epistemic status" label as permission to advocate for a dubious opinion
I'm not sure AF researchers do this. Though I understand that calling out specific people would be offensive. Just saying that it's not an obvious claim.
> EY has been incredibly productive, so while I'm sure there's counterpoints like those you cited, The Sequences themselves seem like a clear example of a more verbose writing style (without making an assessment of this as good/bad; maybe it's fit for purpose! My critique is that this has influenced others to replicate the style when it may not be appropriate).
I cited EY's AF work, because the post is about AF. The Sequences are not really AF work. I'm not sure people do AF work in the style of the Sequences? Examples would be nice. Meanwhile, here are other counterexamples: ARC (Eliciting Latent Knowledge), Paul Christiano, Vanessa Kosoy, Alex Flint (optimization, accumulation of knowledge), TurnTrout (reframing impact), John Wentworth, Thane Ruthenis, decision theory / Löbian obstacle / logical induction work...
> Analytic philosophy works within a well established and rigorous taxonomy of terms / concepts, as evidenced by, e.g., PhilPapers and the Stanford Encyclopedia of Philosophy. The assumptions at the roots of this taxonomy are generally pretty well explored. So even if philosophers are not exactly deducing an entire chain of belief for every paper, we can usually articulate the tradition within which an author operates, and understand the common arguments and axioms. This is in contrast to continental philosophy, which is often much less explicit about its assumptions, and instead draws on a hodge-podge of different thinkers, ranging from Freud to Hegel, without rigorously examining its own claims. (...) If others believe I am being too generous in my treatment of analytic philosophy, I'd be interested to hear why.
Definitely you're either too charitable to analytic philosophy or too uncharitable to AF research. Or have in mind some non-obvious distinction.
In my experience, lots of analytic philosophy starts with thought experiments and non-axiomatic theories, then discusses possible counter-arguments to them (still not establishing any axioms). Sometimes it does infer "axioms" in the context of a specific debate, but anyone is free to disregard them or question their implications (because those "axioms" are not really axioms in some logic). See
Also, I think I haven't seen an AF researcher who just relies on a hodge-podge of other researchers' ideas without questioning them... The researchers I listed above don't do it.
I found the post interesting, but I agree with Q Home that the OP's argument doesn't quite work.
There's some tension between the post's "not sufficiently rigorous" and "overly focused on rigor" criticisms that needs to be handled with care.
If he wants to criticise a field for formalisms that are disconnected from reality, then criticizing analytical philosophy would be better. If he wants to criticise a field for being handwavy, then a continental comparison would be better. But making the comparison and doing both doesn't quite work, at least for me.
I appreciate what this post is trying to point out, although I am sympathetic to some of the other comments that it doesn't make an airtight case for its specific analogy-thesis.
A while ago I gave a talk at a local university in Tokyo about agent foundations. (You can look at the speaker notes to get a more-or-less verbatim idea of what I said.) I tried to give a relatively fair tour of the field, but the final slide section was titled "Does any of this matter?", and the final slide, well, let me just quote it:
So overall, my impression of the field of agent foundations is that they haven't caught up to the deep learning revolution. We have a bunch of research programs, mostly rooted in the 2014-era paradigm of aligning idealized, AIXI-ish agents. They're proceeding fairly independently, and as we saw from the work on corrigibility, even when progress gets made, the field doesn't have a great way of building on this progress.
But if we go back to my earliest slides, the whole point of building these mathematical theories is that they should be universally applicable! We should be able to ask questions like:
- Is ChatGPT using functional decision theory or logical induction in its forward passes?
- How does Gemini update its beliefs? Is it an infra-Bayesian reasoner?
- Is Claude finding natural abstractions in its weights, or in the Minecraft world?
- What are the shards that emerged during LLaMA's training process? Can we find them, using interpretability tools?
These seem like pretty interesting questions to me! We've had since 2014 to develop a bunch of mathematics that supposedly applies to any agent. Since 2022, we've had actual machine intelligences, and they're at least capable of simulating agency. Why haven't people been applying these tools to modern LLMs?
I've seen some papers vaguely gesturing in these directions, but they mostly amount to asking the LLMs questions, like "how would you decide in Newcomb's problem". There was a very recent paper published about whether LLMs introspect, that had a clever twist involving comparing two different models, but was still done using black-box testing. That's something, but… you don't need a fully mathematical theory of agency for that!
I think these would be very interesting research problems. Either to try and succeed, or try and find out why all that agent foundations math doesn't actually help with real-world systems after all. Both would be pretty exciting!
So yeah, I strongly sympathize with the vibe that agent foundations seems unmoored from the program of actually aligning agents, building castles in the sky on top of abstract mathematical formulations but never quite getting around to showing that they are useful.
(And I love a good abstract mathematical castle! It would give me great personal joy to sit down and become one of the few people in the world who understands all the infra-Bayesianism math. I just don't see how it's going to help us align AIs.)
I should probably write up that slideshow into a top-level post at some point, but, it's a bit intimidating to anticipate what the reactions might be from all the people I'm critiquing. So, I'll hide it in this comment for now.
Maybe someone should attempt an agent-foundations analysis of one of those DeepSeek AIs for which everything (architecture, training protocol, etc) is fully in the public domain?
I suspect that the framing has two problems.
First of all, in order to empirically study alignment of systems to goals, we need to have such systems, and the only such systems available when Yudkowsky wrote the arguments down were human brains, animal brains, Deep Blue and more primitive bots, since AlphaGo only emerged in 2016. Humans and animals are arguably[1] aligned to various proxies instead of actual life-related goals, while bots are built by rather primitive methods like evaluating an army of positions via some known metric. Therefore, according to Yudkowsky, the ASI is unlikely to be aligned to humans' actual goals rather than to some proxies which are unlikely to satisfy humans. Since existing AIs display concerning behaviors like the ones demonstrated in MATS 9 or by Greenblatt or outright inducing psychosis, it is natural to assume that the AIs haven't internalised the actual goals that mankind wanted them to have.
Secondly, I doubt that "writing on LessWrong can also feel like a game, where the objective is to cleverly restate your idea with mathematics or tie it back to a niche concept from The Sequences". For example, there was a post in defense of AI slop despite the fact that in 2018 Yudkowsky wrote a piece which I expect to be fully applicable to slop. As for "some alignment researchers who have become averse to empiricism and rigorous justification of claims", who exactly are these researchers aside from the MIRI cluster?
[1] Animal brains could also be too primitive to be aligned to goals instead of proxies. For example, unlike humans making a conscious decision to store wheat for the winter, a squirrel only has an instinct to hide nuts. While human goal formation is impacted by culture, there exist conservative arguments which claim that the desires of an individual human do include having raised kids unless severely corrupted by adversaries.
Freud developed the first modern theory of the unconscious. His writings on drives, dreams and the id were instrumental in developing modern practices of psychology and neuroscience. Modern researchers are unlikely to leverage concepts like the superego or the oedipal complex, because we have been able to further our understanding of the mind through empirical work, which does not support many of Freud’s specific claims. Freud pushed us in the right direction, but he lacked an empirical foundation to make precise claims.
Freudian psychoanalysis mapped well to our narrative claims about the mind. Particularly coming out of the prudishness of late 19th century Europe, it was intuitive to learn we all had unconscious drives that did not track with societal norms. Stefan Zweig, writing about the austere Austro-Hungarian norms of his youth, wrote:
Leaving this period of social conservatism for the competing liberalisation and violence of the first half of the 20th century, it should not be surprising if the public was eager to learn about our suppressed tendencies, nor if they continued to explore this narrative framework far past its utility.
In 1952, Jacques Lacan began a series of seminars on his idiosyncratic psychoanalytic theory, which would later be published as “Écrits” in 1966. Building on the work of Freud and connecting it with the structuralist linguistics developed by Ferdinand de Saussure, Lacan’s work is famously dense. He builds a new vocabulary to situate the self beside language and sexual desire/anxiety, introducing terms like big Other and little other with distinct symbols (A, a) and “algebraic” relationships. This leads to graphs like
Lacan, Écrits (Seuil, 1966), 817
and equations like
Lacan, Écrits (Seuil, 1966), 819
Both examples come from “The Subversion of the Subject and the Dialectic of Desire,” which is generally understood to be one of the best introductory texts on Lacan.
Recall that, by the time of the publication of “Écrits,” the “cognitive revolution” had been underway for nearly a decade. This would eventually mature into cognitive science, and our modern understanding of psychology, neuroscience, linguistics and artificial intelligence.
Understanding Lacan’s work presents a unique intellectual challenge for his readers. Working through his formalizations can be fun, in the same way it is exciting to figure out a complicated puzzle. The use of symbols and algebraic notation provides the allure of sophistication that prose alone cannot. Making statements like “castration represents the disavowal of the Real by the Imaginary” is a helpful signal of one’s own intellect.
Theory does not require strictly empirical work, but building theories on top of theories creates castles in the sky, with little predictive power for reality.
In college, I spent a lot of time studying the French intellectual tradition. Not everyone in the tradition is so separated from reality. Writers like Albert Camus articulated a meaningful engagement with life in a post-theistic society; intellectuals like Foucault turned to history to reveal our latent power structures. But obscurantist authors like Lacan and Derrida have spawned numerous imitators. As a former Arrogant Student, I can attest that writing in reference to this dense terminology can make one feel satisfyingly smart. But I think the primary motivation behind authors in this tradition has been to drape their writing in a pretense of rigor, stealing the formalisms of science to write without its methodologies.
Theory, to remain rigorous, needs to conceptualize and assess its own ties to reality. If you remain too bound to present understanding, you will never advance knowledge. But if you build upon layers of abstraction, without careful justification, you will deviate from an accurate model of ground truth.
As I have begun to research superintelligence alignment in the past year, particularly agent foundations, I have sometimes had the uncanny feeling that I was back in college studying continental philosophy. Reading about functional decision theory and deep deceptiveness certainly makes me feel like I am learning about the challenges of alignment. But much of the material in the field has been written in such a way that it is difficult to unpack the chain of assumptions leading to a single argument. Rather than dealing with crisp theoretical formulations which make explicit claims, one argues through a series of metaphors, with an ill-defined link back to reality.
I believe this is partly due to the specific writing style of Eliezer Yudkowsky, which tends to be verbose and prone to metaphor and other narrative approaches of argumentation. As a foundational figure in alignment research, it makes sense that he would continue to hold sway over present styles of authorship. But if Yudkowsky’s style is well-suited for introducing an online, technical audience to alignment problems, it is poorly suited for arguing those claims robustly against specific, material considerations.
A lot of commentary on AI risk and alignment theory takes place on LessWrong. As a community blog intended to provoke new directions of thought, LessWrong is fantastic! An important first step in developing new theories is to work on concepts, while formalization and evidence come later. But a lot of actual alignment work also occurs directly on LessWrong, or the associated AI Alignment Forum. Because of this dual use, I feel that there has been a mode collapse, where it is difficult to distinguish between metaphorical/conceptual work and rigorous research.
Consider this example from Andrew Critch’s boundaries sequence (which I liked a lot, and which was fresh on my mind). Critch introduces a cross-disciplinary concept of “boundaries,” which he suggests are an important precept in modeling agent behavior and preferences. Early in the sequence, he uses the example of an employee who can/cannot maintain a good boundary between their personal life and work.
Critch, «Boundaries» Part 2 (2022)
This is a nice conceptual framing! It does seem that “boundaries” are a useful emergent property of a number of different theories, which merits further thinking. I don’t have a good formal conception of what boundaries might look like, but I can intuit how they will behave in practice.
Later in the sequence, Critch attempts to formalize boundaries using an “approximate directed (dynamic) Markov blanket.”
Critch, «Boundaries» Part 3a (2022)
Critch’s formalization does not come out of thin air. Active inference, for example, also uses Markov blankets to formalize the boundary of a single agent. But I worry that we have now overformalized a conceptual theory, or attempted to turn metaphor into math! While formalization does allow for greater specificity and may feel intuitive to individuals coming from a background heavy in math, I worry that this brings an unjustified level of precision to a solution that is inherently approximate.
Does this bring us closer to a transferable understanding of boundaries? Or have we allowed ourselves to get lost in unnecessary details?
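To make concrete what an "approximate" Markov blanket asserts, here is a minimal sketch of the textbook notion, which is not necessarily Critch's exact definition. A boundary variable B acts as a blanket between an interior V and an environment E when V and E are conditionally independent given B, i.e. the conditional mutual information I(V; E | B) is zero; an approximate blanket only asks that this quantity be small. The toy distribution, the `leak` parameter and both function names are assumptions introduced purely for illustration.

```python
import itertools
import math


def conditional_mutual_information(joint):
    """Return I(V; E | B) in bits for a joint given as {(v, b, e): probability}."""
    p_b, p_vb, p_be = {}, {}, {}
    for (v, b, e), p in joint.items():
        # Accumulate the marginals needed by the CMI formula.
        p_b[b] = p_b.get(b, 0.0) + p
        p_vb[(v, b)] = p_vb.get((v, b), 0.0) + p
        p_be[(b, e)] = p_be.get((b, e), 0.0) + p
    cmi = 0.0
    for (v, b, e), p in joint.items():
        if p > 0:
            cmi += p * math.log2(p * p_b[b] / (p_vb[(v, b)] * p_be[(b, e)]))
    return cmi


def chain_joint(leak):
    """Hypothetical toy model over binary (v, b, e).

    e is a fair coin, b copies e with probability 0.9, and v copies b with
    probability 0.9, except that with probability `leak` v copies e directly,
    bypassing the boundary.
    """
    joint = {}
    for v, b, e in itertools.product((0, 1), repeat=3):
        p_e = 0.5
        p_b_given_e = 0.9 if b == e else 0.1
        via_boundary = 0.9 if v == b else 0.1
        via_leak = 1.0 if v == e else 0.0
        p_v = (1 - leak) * via_boundary + leak * via_leak
        joint[(v, b, e)] = p_e * p_b_given_e * p_v
    return joint


if __name__ == "__main__":
    for leak in (0.0, 0.05, 0.2):
        bits = conditional_mutual_information(chain_joint(leak))
        print(f"leak={leak}: I(V; E | B) = {bits:.4f} bits")
```

With no leak the boundary screens the interior off from the environment exactly; as the leak grows, the blanket holds only approximately. Choosing how small I(V; E | B) must be before something counts as a boundary is exactly the kind of modeling decision the formalism leaves open.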
Far more problematic to me are arguments like the sharp left turn, which become load-bearing in some misalignment scenarios, but are poorly specified in the work of Nate Soares and others. As Paul Christiano astutely comments:
I worry that some alignment researchers have become averse to empiricism and rigorous justification of claims. LessWrong provides a low-stakes environment to share sketches of ideas that will be formalized elsewhere. But some never seem to escape the stylistic pull of the internet blog format. Mixing informal and formal techniques creates mimetic risk for the alignment researcher.
Writers are usually upfront when their thoughts are cursory or speculative. The “epistemic status” label is meant to be an honest acknowledgement that different posts come with different strengths of claims. But in practice, the label functions as a permission slip to say whatever one wants. As long as you’re honest that your claims are groundless, then there’s nothing to stop you from making your claims! As Slavoj Zizek put it in a defense of psychoanalysis:
Can’t writing on LessWrong also feel like a game, where the objective is to cleverly restate your idea with mathematics or tie it back to a niche concept from The Sequences?
There is a lot of good discourse in the community. I do not mean to devalue how important some of the problems raised by agent foundations can be. But we should be careful not to reside too long in the “virtual” world of speculative discourse, which leads to play and clever anecdotes, rather than scientific advancement and technological change.
Écrits, trans. Bruce Fink (Norton, 2006), 672
Translation by Claude