In his recent post arguing against AI Control research, John Wentworth claims that the median doom path goes through AI slop rather than scheming. I find this plausible, and I believe it suggests a convergence of interests between AI capabilities research and AI alignment research.
Historically, there has been a lot of concern about differential progress amongst AI safety researchers (perhaps especially those I tend to talk to). Some research gets labeled "capabilities" while other research gets labeled "safety" (or, more often, "alignment"[1]). On this view, most research is dual-use in practice (IE, has both capability and safety implications) and should therefore be kept secret or disclosed carefully.
Recently, a colleague expressed concern that future AIs will read anything AI safety researchers publish now. Since it is uncertain, and perhaps unlikely, that future AIs will be aligned, almost any information published now could turn out to be net harmful.
I argued the contrary case, as follows: a weak form of recursive self-improvement has already started (in the sense that modern LLMs can usefully accelerate the next generation of LLMs in a variety of ways[2]). I assume that this trend will intensify as AI continues to get more useful. Humans will continue to keep themselves at least somewhat in the loop, but at some point, mistakes may be made (by either the AI or the humans) which push things drastically off-course. We want to avoid mistakes like that.
John spells it out more decisively:
The problem is that we mostly don’t die of catastrophic risk from early transformative AI at all. We die of catastrophic risk from stronger AI, e.g. superintelligence (in the oversimplified model). The main problem which needs to be solved for early transformative AI is to use it to actually solve the hard alignment problems of superintelligence.
The key question (on my model) is: does publishing a given piece of information reduce or increase the probability of things going off-course?
Think of it like this. We're currently navigating foreign terrain with a large group of people. We don't have the choice of splitting off from the group; we expect to more-or-less share the same fate, whatever happens. We might not agree with the decision-making process of the group. We might not think we're currently on-course for a good destination. Sharing some sorts of information with the group can result in doom.[3] However, there will be many types of information which will be good to share.
AI Slop
AI slop is a generic derogatory term for AI-generated content, reflecting how easy it has become to mass-produce low-quality content full of hallucinations,[4] extra fingers, and other hallmarks of AI generation.
As AI hype continues to mount, I've kept trying to use AI to accelerate my research. While the models are obviously getting better, my experience is that they remain useful only as a sounding board. I often find myself falling into the habit of not even reading the AI outputs, because they have proven worse than useless: when I describe my technical problem and ask for a solution, I get something that looks plausible at first glance but, on close analysis, assumes what is to be proven in one of the proof steps. I'm not exactly sure why this is the case. Generating a correct novel proof should be hard, sure; but checking proofs is easier than generating them, so generating only valid proof steps should be relatively easy.[5]
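To spell out that asymmetry: checking a proof only requires verifying, line by line, that each step follows from what came before by an allowed rule, whereas finding a proof requires search. Here is a minimal sketch of that mechanical check, for a hypothetical toy propositional system whose only rule is modus ponens (purely illustrative, not something any of these models actually runs):

```python
# Toy illustration: *checking* a proof is a mechanical, line-by-line test,
# whereas *finding* one requires search. Hypothetical system: each proof line
# must be a premise or follow by modus ponens from formulas already known.

def check_proof(premises: set[str], proof: list[str]) -> bool:
    """Return True iff every line is a premise or follows by modus ponens
    from something already established (a premise or an earlier line)."""
    known = set(premises)
    for line in proof:
        follows = line in known or any(f"{a} -> {line}" in known for a in known)
        if not follows:
            return False  # an invalid step is caught immediately
        known.add(line)
    return True

# From P, P -> Q, and Q -> R, the chain P, Q, R checks out; skipping Q does not.
premises = {"P", "P -> Q", "Q -> R"}
print(check_proof(premises, ["P", "Q", "R"]))  # True
print(check_proof(premises, ["P", "R"]))       # False: R asserted before it follows
```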
These AIs seem strikingly good at conversing about sufficiently well-established mathematics, but the moment I ask for something a little bit creative, the fluent competence falls apart.
Claude 3.5 was the first model whose proofs were good enough to fool me for a little while, rather than being obvious slop. The o1 model seems better, in the sense that its proofs look more convincing and it takes me longer to find the holes in the proofs. I haven't tried o3 yet, but early reports are that it hallucinates a lot, so I mostly expect it to continue the trend of being worse-than-useless in this way.[6]
I'm not denying that these models really are getting better in a broad sense. There's a general pattern that LLMs are much more useful for people who have a lower level of expertise in a field. That waterline continues to increase.[7]
However, even as these models get better, they seem to keep displaying a strong preference for convincingness over correctness whenever the two come into conflict. If this persists, it is plausibly a big problem for the future.
Coherence & Recursive Self-Improvement
Recursive self-improvement (RSI) is a tricky business. One wrong move can send you teetering into madness. It is, in a broad sense, the business which leading AI labs are already engaged in.
Again quoting John:
First, some lab builds early transformatively-useful AI. They notice that it can do things like dramatically accelerate AI capabilities or alignment R&D. Their alignment team gets busy using the early transformative AI to solve the alignment problems of superintelligence. The early transformative AI spits out some slop, as AI does. Alas, one of the core challenges of slop is that it looks fine at first glance, and one of the core problems of aligning superintelligence is that it’s hard to verify; we can’t build a superintelligence to test it out on without risking death. Put those two together, add a dash of the lab alignment team having no idea what they’re doing because all of their work to date has focused on aligning near-term AI rather than superintelligence, and we have a perfect storm for the lab thinking they’ve solved the superintelligence alignment problem when they have not in fact solved the problem.
So the lab implements the non-solution, turns up the self-improvement dial, and by the time anybody realizes they haven’t actually solved the superintelligence alignment problem (if anybody even realizes at all), it’s already too late.
To avoid this sort of outcome, it seems like we need to figure out how to make models "coherent" in a fairly broad sense (related to formal notions of coherence, eg probabilistic coherence, as well as informal ones). Here are some important-seeming properties to illustrate what I mean:
- Robustness of value-alignment: Modern LLMs can display a relatively high degree of competence when explicitly reasoning about human morality. For this to matter for RSI, however, those concepts also need to come into play appropriately when the model reasons about seemingly unrelated things, such as programming. The continued ease of jailbreaking AIs illustrates this property failing (although solving jailbreaking would not necessarily get at the whole property I am pointing at).
- Propagation of beliefs: When the AI knows something, it should know it in a way which integrates well with everything else it knows, rather than easily displaying the knowledge in one context while seeming to forget it in another.
- Preference for reasons over rationalizations: An AI should be ready and eager to correct its mistakes, rather than rationalizing its wrong answers. It should be truth-seeking, following thoughts where they lead instead of planning ahead to justify specific answers. It should prefer valid proof steps over arriving at an answer when the two conflict.
- Knowing the limits of its knowledge: Metacognitive awareness of what it knows and what it doesn't know, appropriately brought to bear in specific situations. The current AI paradigm just has one big text-completion probability distribution, so there's no natural way for it to distinguish between uncertainty about the underlying facts and uncertainty about what to say next -- hence we get hallucinations. (A toy sketch of this distinction follows the list.)
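To illustrate the conflation concretely, here is a toy, self-contained sketch (made-up numbers, not a description of any deployed system). The entropy of a single next-token distribution lumps together "many acceptable phrasings" and "I don't know the fact"; instability of the answer across resamples is a crude external proxy for the latter:

```python
import math
from collections import Counter

def token_entropy(next_token_probs: dict[str, float]) -> float:
    """Entropy (bits) of one next-token distribution. High entropy could mean
    'many acceptable phrasings' or 'the model doesn't know the fact' --
    the distribution alone doesn't say which."""
    return -sum(p * math.log2(p) for p in next_token_probs.values() if p > 0)

def answer_instability(sampled_answers: list[str]) -> float:
    """Fraction of resampled answers that disagree with the most common one --
    a rough proxy for uncertainty about the underlying fact."""
    top_count = Counter(sampled_answers).most_common(1)[0][1]
    return 1 - top_count / len(sampled_answers)

# Case 1: phrasing uncertainty only -- token entropy is high, yet resamples agree.
print(token_entropy({"Paris": 0.4, " The": 0.35, " It": 0.25}))      # ~1.56 bits
print(answer_instability(["Paris"] * 10))                             # 0.0

# Case 2: factual uncertainty -- resampled answers scatter (confabulation risk).
print(answer_instability(["1987", "1992", "1987", "2001", "1995"]))   # 0.6
```

The current paradigm gives us the first signal natively; anything like the second has to be bolted on from the outside, whereas "knowing the limits of its knowledge" would mean having that distinction internally.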
All of this is more-or-less a version of the metaphilosophy research agenda, framed in terms of current events in AI. We don't just need to orient AI towards our values; we need to orient AI towards (the best of) the whole human truth-seeking process, including (the best of) moral philosophy, philosophy of science, etc.
What's to be done?
To my knowledge, we still lack a good formal model clarifying what it would even mean to solve the hardest parts of the AI safety problem (eg, the pointers problem). However, we do have a plausible formal sketch of metaphilosophy: Logical Induction![8]
Logical Induction comes with a number of formal guarantees about its reasoning process. This is something that cannot be said about modern "reasoning models" (which I think are a move in the wrong direction).
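For readers who haven't looked at the paper, here is an informal paraphrase of the central guarantee, the logical induction criterion of Garrabrant et al. (2016) -- stated from memory, not quoted exactly:

```latex
% Informal paraphrase of the logical induction criterion (Garrabrant et al., 2016);
% from memory, not an exact quotation.
\[
  \overline{\mathbb{P}} = (\mathbb{P}_1, \mathbb{P}_2, \ldots)
  \text{ is a logical inductor over a deductive process } \overline{D}
  \;\iff\;
  \text{no efficiently computable trader exploits } \overline{\mathbb{P}},
\]
\[
  \text{where a trader ``exploits'' } \overline{\mathbb{P}}
  \text{ if its plausible net worth, from trading against the prices } \mathbb{P}_n,
  \text{ is unbounded above but bounded below.}
\]
```

From this one criterion the paper derives a family of desirable properties -- limit coherence, calibration on recognizable patterns, and a degree of self-knowledge about its own beliefs -- which closely parallel the coherence properties listed above.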
Can we apply ideas from logical induction to improve the reasoning of modern AI? I think it is plausible. Should we? I think it is plausible.[9]
More generally, this post can be viewed as a continuation of the ideas I expressed in LLMs for Alignment Research: A Safety Priority? and AI Craftsmanship. I am suggesting that it might be time for safety-interested people to work on specific capabilities-like things, with an eye particularly towards capabilities which can accelerate AI safety research, and more generally, an eye towards reducing AI slop.
I believe that scaling up current approaches is not sufficient; it seems important to me to instead understand the underlying causes of the failure modes we are seeing, and design approaches which avoid those failure modes. If we can provide a more-coherent alternative to the current paradigm of "reasoning models" (and get such an alternative to be widely adopted), well, I think that would be good.
Trying to prevent jailbreaking, avoid hallucinations,[4] get models to reason well, etc., are not new ideas. What I see as new here is my argument that the interests of safety researchers and capabilities researchers are aligned on these topics. This argument might move some people to work on "capabilities", or to publish such work, when they might not otherwise do so.
Above all, I'm interested in feedback on these ideas. The title has a question mark for a reason; this all feels conjectural to me.
- ^
I have come to prefer "AI safety" as the broader and more descriptive term for research intended to help reduce AI risk. The term "alignment" still has meaning to me, as a synonym for value-loading research, which aims to build agentic AI whose goal-directedness is aimed at human values. However, I think it is quite important to keep one's thoughts as close as possible to the main aim of one's research. To me, safety seems like a better aim than alignment: alignment is one way to achieve safety, but may not be the best way.
- ^
Approaches such as constitutional AI, RLHF, and deliberative alignment use AI directly to help train AI. LLMs are also useful for programmers, so I imagine they see some use for writing code at the AI labs themselves. More generally, researchers might have conversations with LLMs about their research.
- ^
EG, maybe the majority of the group thinks that jumping off of cliffs is a good idea, so we don't want to tell the group the direction to the nearest cliff.
- ^
One colleague of mine uses the term "confabulation" rather than the more common "hallucination" -- I think it is a much more fitting term. The so-called hallucinations are (1) in the output of the system rather than the input (confabulations are a behavior, whereas hallucinations are a sensory phenomenon), and (2) verbal rather than visual (hallucinations can be auditory or affect other senses, but the central case people think of is visual). "Confabulation" calls to mind a verbal behavior, which fits the phenomenon being described very well.
"Confabulation" also seems to describe some details of the phenomenon well; in particular, AI confabulation and human confabulation share patterns of motivated cognition: both will typically try to defend their confabulated stories, rather than conceding in the face of evidence.
I recognize, unfortunately, that use of the term "hallucination" to describe LLM confabulation has become extremely well-established. However, thinking clearly about these things seems important, and using clear words to describe them aids such clarity.
- ^
I'm not saying "logic is simple, therefore generating only valid proof-steps should be simple" -- I understand that mathematicians skip a large number of "obvious" steps when they write up proofs for publication, such that fully formalizing proofs found in a randomly chosen math paper can be quite nontrivial. So, yes, "writing only valid proof steps" is much more complex than simply keeping to the rules of logic.
Still, most proofs in the training data will be written for a relatively broad audience, so (my argument goes) fluency in discussing the well-established math in a given area should be roughly the level of skill needed to generate only valid proof steps. This is a strong pattern, useful for predicting the data. From this, I would naively predict that LLMs trying to write proofs would produce a bunch of valid steps (perhaps with a few accidental mistakes, rather than strategic ones) and fail to reach the desired conclusion, rather than producing convincing-but-invalid arguments.
To me, the failure of this prediction requires some explanation. I can think of several possible explanations, but I am not sure which is correct.
- ^
A colleague predicted that o3-pro will still generate subtly flawed proofs, but at that point I'll lose the ability to tell without a team of mathematicians. I disagree: a good proof is a verifiable proof. I can at least fall back on asking o3-pro to generate a machine-checkable version of the proof, and count it as a failure if it is unable to do so.
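As a toy illustration of what "machine-checkable" buys here (an example of the format only, not something any model produced): a Lean 4 proof either compiles, certifying every step, or is rejected outright, so convincingness never substitutes for correctness:

```lean
-- Toy example of a machine-checkable proof (Lean 4). The checker either
-- certifies every step or rejects the file; "looks plausible" doesn't count.
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

example : 2 + 3 = 5 := rfl
```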
- ^
Although: it is easy for people to overestimate how quickly that waterline is increasing. AI will naturally be optimized to pass the shallower tests of competence, and people will naturally be biased to make generalized predictions about its competence based on shallower tests. Furthermore, since most people aren't experts in most fields, Gell-Mann Amnesia leads to overestimation of AI.
- ^
Readers might not be prepared to think about Logical Induction as a solution to metaphilosophy. I don't have the bandwidth to defend this idea in the current essay, but I hope to defend it at some point.
- ^
The idea of mainstream AI taking inspiration from Logical Induction to generate capability insights is something that a number of people I know have considered to be a risk for some time; the argument being that it would be net-negative due to accelerating capabilities.