There is a stark difference between rehearsing classical AI risk 101 arguments about instrumental convergence, and tackling problems like "Design and implement the exact evaluations we'll run on GPT-5 to determine whether it's capable enough that we should worry about it acting aligned until it can execute a takeover". 

And naturally, since I've started working on problems like the one above, I've noticed a large shift in my thinking on AI. I describe it as thinking about risks in near-mode, as opposed to far-mode. 

In this post, I share a few concrete examples about my experiences with this change-of-orientation.

I. Prerequisites for scheming

Continuing with the example from the intro: A year ago I was confident about the "the AI is just playing along with our training and evaluations, until it is in a position where it can take over" threat model (deceptive alignment / scheming) basically being the default outcome and the main source of AI x-risk. I now think I was overconfident.

Past-me hadn't really thought through the prerequisites for scheming. A textbook example of a successful deceptive alignment story, applied to an LLM, paints a picture of a model that:

  • Understands that the inputs it sees are simply prompts written by humans
    • As opposed to the model being "immersed to the setting", thinking that the current context more-or-less reflects reality, or flat-out is reality, taking the prompt at face value
  • Understands that it's undergoing a training process, and having an explicit (correct!) model of the training objective used to train it
    • As opposed the model not thinking about the training process and objective at all, and a model's changed behavior upon training being reflective of the training process molding the model, rather than the model learning the training objective and optimizing for it
  • Performs lots of the same cognition (e.g. the above two bullet points) regardless of the input fed to it
    • As opposed to the model's internal processes being heavily dependent on the input
  • Never performs instrumental thinking "out loud", and successfully evades any attempts made by humans to trick the model into thinking it has been deployed
    • As opposed to the model sometimes slipping, failing to generalize correctly to adversarial inputs, or just being insufficiently good at distinguishing training/evaluation/red-teaming from deployment.

Now, one may argue whether it's strictly necessary that a model has an explicit picture of the training objective, for example, and revise one's picture of the deceptive alignment story accordingly. We haven't yet achieved consensus on deceptive alignment, or so I've heard. 

It's also the case that, as past-me would remind you, a true superintelligence would have no difficulty with the cognitive feats listed above (and that current models show sparks of competence in some of these). 

But knowing only that superintelligences are really intelligent doesn't help with designing the scheming-focused capability evaluations we should do on GPT-5, and abstracting over the specific prerequisite skills makes it harder to track when we should expect scheming to be a problem (relative to other capabilities of models).[1] And this is the viewpoint I was previously missing.

II. A failed prediction

There's a famous prediction market about whether AI will get gold from the International Mathematical Olympiad by 2025. For a long time, the market was around 25%, and I thought it was too high.

Then, DeepMind essentially got silver from the 2024 IMO, short of gold by one point. The market jumped to 70%, where it has stayed since.

Regardless of whether DeepMind manages to improve on that next year and satisfy all minor technical requirements, I was wrong. Hearing about the news, I (obviously) sat down with pen and paper and thought: Why was I wrong? How could I have thought that faster?

One mistake is that I thought it was not-that-likely that the big labs would make a serious attempt on this. But in hindsight, I shouldn't have been shocked that, having seen OpenAI do formal theorem proving and DeepMind doing competitive programming and math olympiad geometry, they just might be looking at the IMO as well.

But for the more important insight: The history of AI is littered with the skulls of people who claimed that some task is AI-complete, when in retrospect this has been obviously false. And while I would have definitely denied that getting IMO gold would be AI-complete, I was surprised by the narrowness of the system DeepMind used.

(I'm mature enough to not be one of those people who dismiss DeepMind by saying that all they did was Brute Force and not Real Intelligence, but not quite mature enough to not poke at those people like this.)

I think I was too much in the far-mode headspace of one needing Real Intelligence - namely, a foundation model stronger than current ones - to do well on the IMO, rather than thinking near-mode "okay, imagine DeepMind took a stab at the IMO; what kind of methods would they use, and how well would those work?"

Even with this meta-level update I wouldn't have in advance predicted that IMO will fall just about now - indeed, I had (half-heartedly) considered the possibility of doing formal theorem proving+RL+tree-search before the announcement - but I would have been much less surprised. I also updated away from a "some tasks are AI-complete" type of view, towards "often the first system to do X will not be the first systems to do Y".[2]

III. Mundane superhuman capabilities

I've come to realize that being "superhuman" at something is often much more mundane than I've thought. (Maybe focusing on full superintelligence - something better than humanity on practically any task of interest - has thrown me off.)

Like:

  • In chess, you can just look a bit more ahead, be a bit better at weighting factors, make a bit sharper tradeoffs, make just a bit fewer errors.
  • If I showed you a video of a robot that was superhuman at juggling, it probably wouldn't look all that impressive to you (or me, despite being a juggler). It would just be a robot juggling a couple balls more than a human can, throwing a bit higher, moving a bit faster, with just a bit more accuracy.
  • The first language models to be superhuman at persuasion won't rely on any wildly incomprehensible pathways that break the human user (c.f. List of Lethalities, items 18 and 20). They just choose their words a bit more carefully, leverage a bit more information about the user in a bit more useful way, have a bit more persuasive writing style, being a bit more subtle in their ways.
    • (Indeed, already GPT-4 is better than your average study participant in persuasiveness.)
  • You don't need any fundamental breakthroughs in AI to reach superhuman programming skills. Language models just know a lot more stuff, are a lot faster and cheaper, are a lot more consistent, make fewer simple bugs, can keep track of more information at once.
    • (Indeed, current best models are already useful for programming.)
    • (Maybe these systems are subhuman or merely human-level in some aspects, but they can compensate for that by being a lot better on other dimensions.)

As a consequence, I now think that the first transformatively useful AIs could look behaviorally quite mundane. (I do worry about later in the game superhuman AIs being better in ways humans cannot comprehend, though.)

IV. Automating alignment research

For a long time, I didn't take the idea of automating alignment research seriously. One reason for my skepticism was that this is just the type of noble good-for-pr goal I would expect people to talk about, regardless of whether it's feasible and going to happen or not. Another reason was that I thought people were talking about getting AIs to do conceptual foundational research like Embedded Agency, which seemed incredibly difficult to me.

Whereas currently I see some actually feasible seeming avenues for doing safety research. Like, if I think about the recent work I've looked at in situational awareness, out-of-context reasoning, dangerous capability evaluations, AI control, hidden cognition and tons of other areas, I really don't see a fundamental reason why you couldn't speed up such research massively. You can think of a pipeline like

  • feed lots of good papers in [situational awareness / out-of-context reasoning / ...] into GPT-4's context window,
  • ask it to generate 100 follow-up research ideas,
  • ask it to develop specific experiments to run for each of those ideas,
  • feed those experiments for GPT-4 copies equipped with a coding environment,
  • write the results to a nice little article and send it to a human.

And sure enough, this would totally fail for dozens of reasons, there are dozens of things you could do better, and dozens of questions about whether you can do useful versions of this safely or not. I'm also talking about (relatively easily verifiable) empirical research here, which one might argue is not sufficient.

Nevertheless, now that I have this concrete near-mode toy answer to "okay, imagine Anthropic took a stab at automating alignment research; what kind of methods would they use?", it's easier for me to consider the idea of automating alignment research seriously.

  1. ^

    Also, many of the relevant questions are not about pure capability, but also whether the model in fact uses those capabilities in the postulated way, and about murkier things like the developmental trajectory of scheming.

  2. ^

    While keeping in mind that LLMs solved a ton of notoriously hard problems in AI in one swoop, and foundation models sure get lots of different capabilities with scale.

New Comment
8 comments, sorted by Click to highlight new comments since:

You can think of a pipeline like

  • feed lots of good papers in [situational awareness / out-of-context reasoning / ...] into GPT-4's context window,
  • ask it to generate 100 follow-up research ideas,
  • ask it to develop specific experiments to run for each of those ideas,
  • feed those experiments for GPT-4 copies equipped with a coding environment,
  • write the results to a nice little article and send it to a human.

Yup; and not only this, but many parts of the workflow have already been tested out (e.g. ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models; Generation and human-expert evaluation of interesting research ideas using knowledge graphs and large language models; LitLLM: A Toolkit for Scientific Literature Review; Acceleron: A Tool to Accelerate Research Ideation; DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning; Discovering Preference Optimization Algorithms with and for Large Language Models) and it seems quite feasible to get enough reliability/consistency gains to string these together and get ~the whole (post-training) prosaic alignment research workflow loop going, especially e.g. with improvements in reliability from GPT-5/6 and more 'schlep' / 'unhobbling'.

And indeed, here's what looks like a prototype: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery.

And already some potential AI safety issues: 'We have noticed that The AI Scientist occasionally tries to increase its chance of success, such as modifying and launching its own execution script! We discuss the AI safety implications in our paper.

For example, in one run, it edited the code to perform a system call to run itself. This led to the script endlessly calling itself. In another case, its experiments took too long to complete, hitting our timeout limit. Instead of making its code run faster, it simply tried to modify its own code to extend the timeout period.'

You can think of a pipeline like

  • feed lots of good papers in [situational awareness / out-of-context reasoning / ...] into GPT-4's context window,
  • ask it to generate 100 follow-up research ideas,
  • ask it to develop specific experiments to run for each of those ideas,
  • feed those experiments for GPT-4 copies equipped with a coding environment,
  • write the results to a nice little article and send it to a human.

Obvious, but perhaps worth reminding ourselves, that this is a recipe for automating/speeding-up AI research in general, so would be a neutral at best update for AI safety if it worked.

It does seem that for automation to have a disproportionately large impact for AI alignment it would have to be specific to the research methods used in alignment. This may not necessarily mean automating the foundational and conceptual research you mention, but I do think it necessarily does not look like your suggest pipeline.

Two examples might be: a philosophically able LLM that can help you de-confuse your conceptual/foundational ideas; automating mech-interp (e.g. existing work on discovering and interpreting features) in a way that does not generalise well to other AI research directions.

At least some parts of automated safety research are probably differentially accelerated though (vs. capabilities), for reasons I discuss in the appendix of this presentation (in summary, that a lot of prosaic alignment research has [differentially] short horizons, both in 'human time' and in 'GPU time'): https://docs.google.com/presentation/d/1bFfQc8688Fo6k-9lYs6-QwtJNCPOS8W2UH5gs8S6p0o/edit?usp=drive_link.

Large parts of interpretability are also probably differentially automatable (as is already starting to happen, e.g. https://www.lesswrong.com/posts/AhG3RJ6F5KvmKmAkd/open-source-automated-interpretability-for-sparse; https://multimodal-interpretability.csail.mit.edu/maia/), both for task horizon reasons (especially if combined with something like SAEs, which would help by e.g. leading to sparser, more easily identifiable circuits / steering vectors, etc.) and for (more basic) token cheapness reasons: https://x.com/BogdanIonutCir2/status/1819861008568971325

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

There's a famous prediction market about whether AI will get gold from the International Mathematical Olympiad by 2025.

correction: it's by the end of 2025

Further correction / addition: the AI needs to have been built before the 2025 IMO, which is in July 2025.