Nathan Helm-Burger

AI alignment researcher, ML engineer. Master's in Neuroscience.

I believe that cheap and broadly competent AGI is attainable and will be built soon. This gives me timelines of around 2024-2027. Here's an interview I gave recently about my current research agenda. I think the best path forward to alignment is safe, contained testing of models designed from the ground up for alignability and trained on censored data (simulations with no mention of humans or computer technology). I think current mainstream ML technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, that this automated process will mine neuroscience for insights, and that it will quickly become far more effective and efficient. I think it would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation, so I am trying to warn the world about this possibility.

See my prediction markets here:

 https://manifold.markets/NathanHelmBurger/will-gpt5-be-capable-of-recursive-s?r=TmF0aGFuSGVsbUJ1cmdlcg 

I also think that current AI models pose misuse risks, which may continue to get worse as models get more capable, and that this could result in catastrophic suffering if we fail to regulate them.

I now work for SecureBio on AI-Evals.

Relevant quote:

"There is a powerful effect to making a goal into someone’s full-time job: it becomes their identity. Safety engineering became its own subdiscipline, and these engineers saw it as their professional duty to reduce injury rates. They bristled at the suggestion that accidents were largely unavoidable, coming to suspect the opposite: that almost all accidents were avoidable, given the right tools, environment, and training." https://www.lesswrong.com/posts/DQKgYhEYP86PLW7tZ/how-factories-were-made-safe 


Comments

Something I think humanity is going to have to grapple with soon is the ethics of self-modification / self-improvement, and the perils of value-shift due to rapid internal and external changes. How do we stay true to ourselves while changing fundamental aspects of what it means to be human?

This is a solid-seeming proposal. If we are in a world where the majority of danger comes from big datacenters and large training runs, I predict that this sort of regulation would be helpful. I don't think we are in that world, though, which I think limits how useful this would be. Further explanation here: https://www.lesswrong.com/posts/sfWPjmfZY4Q5qFC5o/why-i-m-doing-pauseai?commentId=p2avaaRpyqXnMrvWE

Answer by Nathan Helm-Burger, May 04, 2024

Personally, I have gradually moved to seeing this as lowering my p(doom). I think humanity's best chance is to politically coordinate to globally enforce strict AI regulation. I think the most likely route to this becoming politically feasible is through empirical demonstrations of the danger of AI. I think AI is more likely to be legibly, empirically dangerous to political decision-makers if it is used in the military. Thus, I think military AI is, counter-intuitively, lowering p(doom). A big accident in which military AI killed thousands of innocent people the military had not intended to kill could, grimly, do a lot to lower p(doom).

 

This is a sad thing to think, obviously. I'm hopeful we can come up with harmless demonstrations of the dangers involved, so that political action will be taken without anyone needing to be killed.

In scenarios where AI becomes powerful enough to present an extinction risk to humanity, I don't expect the level of robotic weaponry it controls to matter much. It will have many, many opportunities to hurt humanity that look nothing like armed robots and greatly exceed their power.

I absolutely sympathize, and I agree that, given the worldview and information you have, advocating for a pause makes sense. I would get behind 'regulate AI' or 'regulate AGI', certainly. I think, though, that pausing is an incorrect strategy which would do more harm than good, so despite sharing your concern about AGI dangers, I don't endorse that strategy.

Some part of me thinks this oughtn't matter, since there's roughly a 0% chance of the movement achieving that literal goal. The point is to build an anti-AGI movement, and to get people thinking about what it would be like for the government to be able to issue an order to pause AGI R&D, turn off datacenters, or whatever. I think that's a good aim, and your protests probably (slightly) help it.

I'm still hung up on the literal 'Pause AI' concept being a problem though. Here's where I'm coming from: 

1. I've been analyzing the risks of current-day AI. I believe (but will not offer evidence here) that current-day AI is already capable of providing small-but-meaningful uplift to bad actors intending to use it for harm (e.g. weapon development). I think that having stronger AI in the hands of government agencies tasked with protecting humanity from these harms is one of our best chances at preventing them.

2. I see the 'Pause AI' movement as being targeted mostly at large companies, since I don't see any plausible way for a government or a protest movement to enforce what private individuals do with their home computers. Perhaps you think this is fine because you think that most of the future dangers posed by AI derive from actions taken by large companies or organizations with large amounts of compute. This is emphatically not my view. I think more danger actually comes from the many independent researchers and hobbyists who are exploring the problem space. I believe there are huge algorithmic power gains which can, and eventually will, be found. I furthermore believe that beyond a certain threshold, AI will be powerful enough to rapidly self-improve far beyond human capability. In other words, I think every AI researcher in the world with a computer is like a child playing with matches in a drought-stricken forest. Any little flame, no matter how small, could set it all ablaze and kill everyone. Are the big labs playing with bonfires dangerous? Certainly. But they are also visible, and can be regulated and made reasonably safe by the government. And the results of their work are the only feasible protection we have against the possibility of a FOOM-ing rogue AGI launched by small independent researchers. Thus, pausing the big labs would, in my view, place us in greater danger, not less. I think we are already well within the window of risk from independent-researcher-initiated FOOM, so the faster the big labs develop and deploy worldwide AI-watchdogs, the sooner we will be out of danger.

I know these views are not the majority views held by any group (that I know of). These are my personal inside views from extensive research. If you are curious about why I hold these views, or more details about what I believe, feel free to ask. I'll answer if I can.

I've mentioned it elsewhere, but I'll repeat it here since it's relevant. For GPT-style transformers, and probably for other model types, you can smoothly and subtly degrade a model's performance by adding noise to part or all of its activations. This is particularly useful for detecting sandbagging, because you would expect sandbagging to show up as an anomalous increase in capability, breaking the otherwise smooth downward trend, as you increase the amount of noise injected or the fraction of activations it is added to. I found a noticeable decrease in performance even when the noise was added to only a small fraction (e.g. < 1%) of the activations and was kept small relative to the magnitude of the activations it was added to.
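To make this concrete, here's a minimal sketch of that kind of noise sweep, not my exact setup: it assumes a Hugging Face GPT-2-style module layout (model.transformer.h), and evaluate_model() is a hypothetical placeholder for whatever capability benchmark is being used.

```python
# Minimal sketch (assumptions noted above): inject small Gaussian noise into a
# fraction of a transformer's activations and sweep the noise scale.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")


def make_noise_hook(scale: float, fraction: float):
    """Forward hook that adds noise to `fraction` of activations, scaled
    relative to each activation's own magnitude."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        mask = (torch.rand_like(hidden) < fraction).float()
        noise = torch.randn_like(hidden) * hidden.abs() * scale
        noisy = hidden + mask * noise
        return (noisy,) + output[1:] if isinstance(output, tuple) else noisy
    return hook


def evaluate_model(model, tokenizer) -> float:
    """Hypothetical stand-in: run your capability benchmark, return a score."""
    raise NotImplementedError


def noise_sweep(scales=(0.0, 0.01, 0.05, 0.1, 0.2), fraction=0.01):
    """Score the model at each noise scale; sandbagging would show up as a
    break (an increase) in the otherwise smooth downward trend."""
    results = {}
    for scale in scales:
        handles = [
            block.register_forward_hook(make_noise_hook(scale, fraction))
            for block in model.transformer.h
        ]
        try:
            results[scale] = evaluate_model(model, tokenizer)
        finally:
            for h in handles:
                h.remove()
    return results
```

The signal to look for is the shape of the score-vs-noise curve: a model honestly at its capability ceiling should degrade smoothly and monotonically, while a sandbagging model might show an anomalous bump at intermediate noise levels.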

So, after reading the KAN paper, and thinking about it in the context of this post: https://www.lesswrong.com/posts/gTZ2SxesbHckJ3CkF/transformers-represent-belief-state-geometry-in-their

My vague intuition is that the same experiment done with a KAN would result in a clearer fractal which wiggled less once training loss had plateaued. Is that also other people's intuition?

I, on the other hand, have very little confidence that people trying to build AGI will fail to quickly (within the next three years, i.e. by 2027) find ways to do it. I do have confidence that we can politically coordinate to stop the situation becoming an extinction or near-extinction-level catastrophe. So I place much less emphasis on abstaining from publishing ideas which may help both alignment and capabilities, and more emphasis on figuring out ways to generate empirical evidence of the danger before it is too late, so as to facilitate political coordination.

I think the situation in which humanity fails to politically coordinate to avoid building catastrophically dangerous AI is one that leads into conflict, likely a World War III with widespread use of nuclear weapons. I don't expect humanity to go extinct from this, and I don't expect the rogue AGI to emerge as the victor, but I do think it is in everyone's interests to work hard to avoid such a devastating conflict. I do think that any such conflict would likely wipe out the majority of humanity. That's a pretty grim risk to be facing on the horizon.

EY may be too busy to respond, but you can probably feel pretty safe consulting with MIRI employees in general. Perhaps also Conjecture employees, and Redwood Research employees, if you read and agree with their views on safety. That at least gives you a wider net of people to potentially give you feedback.

Some features I'd like:

- A 'mark read' button next to posts, so I could easily mark as read posts that I've read elsewhere (e.g. ones cross-posted from a blog I follow).

- A 'not interested' button which would stop a given post from appearing in my latest or recommended lists. Ideally, this would also update my recommendations so as to show me fewer posts like it. (Note: the hide-from-front-page button could serve this purpose if (a) it worked even on promoted/starred posts, and (b) it weren't hidden in a three-dot menu where it's frustrating to access.)

- A 'read later' button which would put the post into a reading list I can come back to later.

- A toggle for 'show all' / 'show only unread', so I could easily switch between the two modes.

These features would help me keep my 'front page' feeling cleaner and more focused.

Yeah, I agree that releasing open-weights non-frontier models doesn't seem like a frontier capabilities advance. It does seem potentially like an open-source capabilities advance.

That can be bad in different ways. Let me pose a couple of hypotheticals.

  1. What if frontier models were already capable of causing grave harms to the world if used by bad actors, and it is only the fact that they are kept safety-fine-tuned and restricted behind APIs that is preventing this? In such a case, it's a dangerous thing to have open-weight models catching up.

  2. What if there is some threshold beyond which a model would be capable of recursive self-improvement, given sufficient scaffolding and unwise pressure from an incautious user? Again, the frontier labs might well abstain from this course, especially if they weren't sure they could trust a new model design created by the current AI; they would likely move slowly and cautiously, at least. I would not expect this of the open-source community, which seems focused on pushing the boundaries of agent-scaffolding and incautiously exploring whatever it can.

So, as we get closer to danger, open-weight models take on more safety significance.
