Nathan Helm-Burger

AI alignment researcher, ML engineer. Masters in Neuroscience.

I believe that cheap and broadly competent AGI is attainable and will be built soon, which puts my timelines at around 2024-2027. Here's an interview I gave recently about my current research agenda. I think the best path forward to alignment is through safe, contained testing on models designed from the ground up for alignability and trained on censored data (simulations with no mention of humans or computer technology). I think that current mainstream ML technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, that this automated process will mine neuroscience for insights, and that it will quickly become far more effective and efficient. It would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation, so I am trying to warn the world about this possibility.

See my prediction markets here:

 https://manifold.markets/NathanHelmBurger/will-gpt5-be-capable-of-recursive-s?r=TmF0aGFuSGVsbUJ1cmdlcg 

I also think that current AI models pose misuse risks, which may continue to get worse as models get more capable, and that this could potentially result in catastrophic suffering if we fail to regulate this.

I now work for SecureBio on AI-Evals.

Relevant quote:

"There is a powerful effect to making a goal into someone’s full-time job: it becomes their identity. Safety engineering became its own subdiscipline, and these engineers saw it as their professional duty to reduce injury rates. They bristled at the suggestion that accidents were largely unavoidable, coming to suspect the opposite: that almost all accidents were avoidable, given the right tools, environment, and training." https://www.lesswrong.com/posts/DQKgYhEYP86PLW7tZ/how-factories-were-made-safe 

Comments

As I asked someone who challenged this point on Twitter, if you think you have a test that is lighter touch or more accurate than the compute threshold for determining where we need to monitor for potential dangers, then what is the proposal? So far, the only reasonable alternative I have heard is no alternative at all. Everyone seems to understand that ‘use benchmark scores’ would be worse.

 

As someone who thinks it's a bad idea to try to write legislation focused on compute thresholds, because I believe that compute thresholds will suddenly become outdated in the not-so-distant future... I would far rather that legislators say something along the lines of, "We do not currently have a good way to measure how risky a given AI system is. As a first step we are going to commission the creation of a battery of tests, some public and some classified, to thoroughly evaluate a given system. We will require all companies wishing to do business in our country to submit their models to our examination."

I've been working for the past 8 months on trying to create good evaluations of AI biorisk. My team's initial attempts were met with the accusation that our evaluations were insufficiently precise and objective. That's not wrong. They were the best we could do at short notice, but far from adequate. We've been working hard since then to develop better evals, thorough and objective enough to convince skeptics. But this isn't easy: it's a labor-intensive process, and we can't afford much labor. The US Federal Government CAN afford to hire a bunch of scientists to design, author, and review thousands of in-depth questions.

Criticisms of the biorisk evals so far have pointed out:
 
'Yes, the models show a lot of book knowledge about virology and genetic engineering, but that's because reciting facts from papers and textbooks plays to their strengths. Their high scores on such tests don't imply the same level of understanding or skill or utility as would similarly high scores from a human expert. This fails to evaluate the most important bottlenecks such as the detailed tacit knowledge of hands-on wetlab skills.'

Sure, we need to check for both. But without adequate funding, how can we be expected to hire people to set up fake lab experiments, and photograph and videotape them going wrong, in order to test whether models like GPT-4o can help troubleshoot well enough to be a significant uplift for inexpert lab workers? That's inherently a time-intensive and material-intensive sort of test to create! And until we do, and then show that the AI models get low scores on those exams, we are operating under uncertainty about the models' skills. Our critics assume the models are currently incapable at this and will remain so, but they offer no proof of that, and they are not scrambling to create the tests which could prove the models' incapability. Given the novel territory rapidly being broken by new models, we should start considering new models 'dangerous until proven safe', not 'innocent until proven guilty'.

My vision of model regulation

To be clear, my goal is not to stifle model development and release, or harm the open-source community. I expect the process of evaluating models to be something we can do cheaply, automatically, quickly. You submit your model weights and code through a web form, and get back a thumbs up within minutes. It's free and easy. You never see anyone failing to pass. The first failure will very likely occur when one of the largest labs submits their latest experimental model's checkpoint, long before they'd even considered releasing it publicly, just to satisfy their curiosity. And when that day comes, we will all be immensely grateful that we had the safety checks in place.

 The expense of designing, creating, and operating this will be substantial. But it is in the service of preventing a national security catastrophe, so it seems to me like a very worthwhile expenditure of taxpayer funds.

I like Seth's thoughts on this, and I think Seth's proposal and Max's proposal end up pointing at a very similar path. Max also has some valuable insights, explained in his more detailed Corrigibility-as-a-target theory, which aren't covered here.

I found it helpful to see Seth's take evolve separately from Max's; having them both independently arrive at similar ideas made me more confident that the ideas are valuable.

My answer to that is currently in the form of a detailed two-hour lecture with a bibliography containing dozens of academic papers, which I only present to people I'm quite confident aren't going to spread the details. It's a hard thing to discuss in detail without sharing capabilities thoughts. If I don't give details or cite sources, then... it's just, like, my opinion, man. So my unsupported opinion is all I have to offer publicly. If you'd like to bet on it, I'm open to showing my confidence in my opinion by betting that the world turns out how I expect it to.

Sounds neat. I think it would make more sense to frame a 'public non-legislation-enacting non-official-electing vote' as a public opinion poll. Politicians pay attention to opinion polls! Opinion polls matter! Framing an opinion poll as a weird sort of transferable ineffective vote is just confusing and detracts from the genuine value in the idea.

 I bet I'd enjoy answering some of these opinion polls. Too bad I don't and won't have a Twitter/X account. This would seem much more interesting to me if it could make a digital twin for me from my writing here on LessWrong, or out of arbitrary documents I uploaded to my account on your site for that express purpose. 

Also agree on the timelines. If we don't take some dramatic governance actions, then AGI looks probable in the next 5 years, and very probable in the next 10. And after that, the odds of the world / society being similar to the way it has been for the past 50 years seems vanishingly small. If you aren't already highly educated in the technical skills needed to help with this, probably political action is your best bet for having a future that conforms to your desires.

My view is that there are huge algorithmic gains in peak capability, training efficiency (less data, less compute), and inference efficiency waiting to be discovered, and available to be found by a large number of parallel research hours invested by a minimally competent multimodal-LLM-powered research team. So it's not that scaling leads to ASI directly; rather:

  1. Scaling leads to brute-forcing LLM agents across the threshold of usefulness for AI research.
  2. Using these LLM agents in a large research project leads to rapidly finding better ML algorithms and architectures.
  3. Training these newly discovered architectures at large scales leads to much more competent automated researchers.
  4. This process repeats quickly over a few months or years.
  5. This process results in AGI.
  6. AGI, if instructed (or allowed, if it's agentically motivated on its own to do so) to improve itself, will find even better architectures and algorithms.
  7. This process can repeat until ASI. The resulting intelligence / capability / inference speed goes far beyond that of humans.

Note that this process isn't inevitable; there are many points along the way where humans can (and should, in my opinion) intervene. We aren't disempowered until near the end of it.

No data wall blocking GPT-5. That seems clear. For future models, will there be data limitations? Unclear.

https://youtube.com/clip/UgkxPCwMlJXdCehOkiDq9F8eURWklIk61nyh?si=iMJYatfDAZ_E5CtR 

I have so much more confidence in Jan and Ilya. Hopefully they go somewhere to work on AI alignment together. The critical time seems likely to be soon. See this clip from an interview with Jan: https://youtube.com/clip/UgkxFgl8Zw2bFKBtS8BPrhuHjtODMNCN5E7H?si=JBw5ZUylexeR43DT 

[Edit: watched the full interview with John and Dwarkesh. John seems kinda nervous, caught a bit unprepared to answer questions about how OpenAI might work on alignment. Most of the interesting thoughts he put forward for future work were about capabilities. Hopefully he does delve deeper into alignment work if he's going to remain in charge of it at OpenAI.]

Interesting to watch Sam Altman talk about it here at timestamp 18:40: 

The executives, Diane Yoon (vice president of people) and Chris Clark (head of nonprofit and strategic initiatives), left the company earlier this week, a company spokesperson said.
