Nathan Helm-Burger

AI alignment researcher, ML engineer. Masters in Neuroscience.

I believe that cheap and broadly competent AGI is attainable and will be built soon. This leads me to have timelines of around 2024-2027. Here's an interview I gave recently about my current research agenda. I think the best path forward to alignment is through safe, contained testing on models designed from the ground up for alignability trained on censored data (simulations with no mention of humans or computer technology). I think that current ML mainstream technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, and I think that this automated process will mine neuroscience for insights, and quickly become far more effective and efficient. I think it would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation. So I am trying to warn the world about this possibility.

See my prediction markets here:

https://manifold.markets/NathanHelmBurger/will-gpt5-be-capable-of-recursive-s?r=TmF0aGFuSGVsbUJ1cmdlcg

I also think that current AI models pose misuse risks, which may continue to get worse as models get more capable, and that this could potentially result in catastrophic suffering if we fail to regulate this.

I now work for SecureBio on AI-Evals.

relevant quotes:

"There is a powerful effect to making a goal into someone’s full-time job: it becomes their identity. Safety engineering became its own subdiscipline, and these engineers saw it as their professional duty to reduce injury rates. They bristled at the suggestion that accidents were largely unavoidable, coming to suspect the opposite: that almost all accidents were avoidable, given the right tools, environment, and training." https://www.lesswrong.com/posts/DQKgYhEYP86PLW7tZ/how-factories-were-made-safe

"The prospect for the human race is sombre beyond all precedent. Mankind are faced with a clear-cut alternative: either we shall all perish, or we shall have to acquire some slight degree of common sense. A great deal of new political thinking will be necessary if utter disaster is to be averted." - Bertrand Russel, The Bomb and Civilization 1945.08.18

"For progress, there is no cure. Any attempt to find automatically safe channels for the present explosive variety of progress must lead to frustration. The only safety possible is relative, and it lies in an intelligent exercise of day-to-day judgment." - John von Neumann

"I believe that the creation of greater than human intelligence will occur during the next thirty years. (Charles Platt has pointed out the AI enthusiasts have been making claims like this for the last thirty years. Just so I'm not guilty of a relative-time ambiguity, let me more specific: I'll be surprised if this event occurs before 2005 or after 2030.)" - Vernor Vinge, Singularity

AI alignment researcher, ML engineer. Masters in Neuroscience.

See my prediction markets here:

https://manifold.markets/NathanHelmBurger/will-gpt5-be-capable-of-recursive-s?r=TmF0aGFuSGVsbUJ1cmdlcg

I now work for SecureBio on AI-Evals.

relevant quotes:

I think this is correct, but for what it's worth, I did take this into account in my proposed AI representative system. My system wasn't a free-for-all, it was government-run with an equally powerful value-customizable AI assigned to each citizen. Then these representatives would debate in a controlled setting, like a giant parliament with established discussion rules. Like a legislature the size of the populace.

In the free-for-all setting, then you not only have problems with States and corporations, but also inequality between people. Wealthy people can afford smarter, faster, more numerous AIs. Less cautious people can let their AIs operate with less hindrance from human oversight. So the "AI teams" that pull ahead will be those who started out powerful and acted recklessly (and probably also immorally).

So the only way this situation is ok is if the government bans all of that free-for-all AI. Or, if a value-aligned AI Singleton or coalition bans it. Naive personally-aligned AIs are a bad equilibrium.

Yes, the fine-tuned model is denoted with an "SFT" for Supervised Fine-tuning, to distinguish it in the results from the non-fine-tuned Qwen-14B-r1-distill. Many models were tested as the malicious model, including r1, various r1 distills, Claude Sonnet 3.7, and Grok 3 mini.

Potentially saved my life by convincing me to rush to the hospital when I wasn't sure I needed to.

Substantial gains in coding ability. Was only barely useful last year, now hugely useful.

Yeah, I think it's like maybe 5 people's worth of contribution from AE studio. Which is something, but not at all comparable to the whole company being full-time on alignment.

A couple thousand brave souls standing between humanity and utter destruction. Kinda epic from a storytelling narrative where the plucky underdog good guys win in the end.... but kinda depressing from a realistic forecast perspective.

Can humanity really not muster more researchers to throw at this problem? What an absurd coordination failure.

Note: these are good points, and things that occurred to us also across the course of our 3 month project. We didn't have time to address these things, but future work definitely should.

Key ToDo items:

Better, more realistic side tasks. (This will get easier once the models get better. Getting the red team models to accomplish both the main task and a realistic side task in a stealthy way was quite challenging.)
Strategic side task selection by the Red Team (only defecting when there's actually a strategically good opportunity to do so. Too easy for the Blue Team if you force defections in dumb scenarios).
Fine-tuning the monitor (we almost had time, but not quite!)

No, actually I think things are looking pretty good. The thing about really subtle stuff, like sabotaging code in non-obvious ways is that you'd likely need to successfully do a lot of such actions in order to get lucky enough to place the flaw in some sufficiently key piece of software that it let you make progress towards elevating your permissions and potentially escaping the lab.

The non-subtle stuff on the other hand, like downloading an unauthorized file, is potentially serious enough of a violation that even a single failure would have a high probability of compromising the system. Thus, you need a much better detection rate on the non-subtle but high-impact actions than you do on the subtle low-impact actions.

So, I agree with Buck that this situation looks promising for the defenders if the defenders expect to catch only a fraction of actions, and then actually take appropriate counter-measures once they catch the untrusted AI misbehaving even once.

Yes, I quite agree the next step is to try fine-tuning a monitor model. We did try quite a lot of prompting strategies, of which we only show a few.

We would've done it in this paper, but only had 3 months in the LASR fellowship to complete it. Even two more weeks would have been enough to do some monitor training experiments!

We also ran up against budget constraints on compute, so big thanks to LASR for the support on this!

This is just a first step in exploring the space of CoT monitoring. I wouldn't recommend anybody get either optimistic or pessimistic at this point, just curious to ask the next obvious questions!

Good points! Obvious now that I think about it that having models 'submit' at the end of an agentic code editing eval by submitting a PR via a fake git tool would be vastly more realistic.

That makes more sense! Yes, I think that's possibly true... Unless the malicious model gets good at jailbreaking, in which case even fine-tuning the monitor to mostly ignore the CoT would be risky.

And if the monitor completely ignores the CoT, then it loses out on the benefits of seeing the CoT. The CoT can be importantly helpful for detecting subtle sabotage by a stealthy malicious model. For example, the fine-tuned malicious model had a 0% chance of getting caught on subtle tasks by the monitor not using CoT. With CoT, that went up to like 3% catch rate. Small but important difference.

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments