Nathan Helm-Burger

AI alignment researcher, ML engineer. Master's in Neuroscience.

I believe that cheap and broadly competent AGI is attainable and will be built soon. This leads me to have timelines of around 2024-2027. Here's an interview I gave recently about my current research agenda. I think the best path forward to alignment is through safe, contained testing on models designed from the ground up for alignability and trained on censored data (simulations with no mention of humans or computer technology). I think that current mainstream ML technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, and I think that this automated process will mine neuroscience for insights and quickly become far more effective and efficient. I think it would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation. So I am trying to warn the world about this possibility.

See my prediction markets here:

https://manifold.markets/NathanHelmBurger/will-gpt5-be-capable-of-recursive-s?r=TmF0aGFuSGVsbUJ1cmdlcg

I also think that current AI models pose misuse risks, which may continue to get worse as models get more capable, and that this could potentially result in catastrophic suffering if we fail to regulate this.

I now work for SecureBio on AI-Evals.

relevant quotes:

"There is a powerful effect to making a goal into someone’s full-time job: it becomes their identity. Safety engineering became its own subdiscipline, and these engineers saw it as their professional duty to reduce injury rates. They bristled at the suggestion that accidents were largely unavoidable, coming to suspect the opposite: that almost all accidents were avoidable, given the right tools, environment, and training." https://www.lesswrong.com/posts/DQKgYhEYP86PLW7tZ/how-factories-were-made-safe

"The prospect for the human race is sombre beyond all precedent. Mankind are faced with a clear-cut alternative: either we shall all perish, or we shall have to acquire some slight degree of common sense. A great deal of new political thinking will be necessary if utter disaster is to be averted." - Bertrand Russell, The Bomb and Civilization 1945.08.18

"For progress, there is no cure. Any attempt to find automatically safe channels for the present explosive variety of progress must lead to frustration. The only safety possible is relative, and it lies in an intelligent exercise of day-to-day judgment." - John von Neumann

"I believe that the creation of greater than human intelligence will occur during the next thirty years. (Charles Platt has pointed out the AI enthusiasts have been making claims like this for the last thirty years. Just so I'm not guilty of a relative-time ambiguity, let me more specific: I'll be surprised if this event occurs before 2005 or after 2030.)" - Vernor Vinge, Singularity

Posts


4 · Nathan Helm-Burger's Shortform

8 · Proactive 'If-Then' Safety Cases · 15d

35 · Feedback request: what am I missing? · 1mo

32 · A path to human autonomy · 1mo

14 · My hopes for YouCongress.com · 2mo

9 · Physics of Language models (part 2.1) · 2mo

18 · Avoiding the Bog of Moral Hazard for AI · 3mo

13 · A bet for Samo Burja · 3mo

13 · Diffusion Guided NLP: better steering, mostly a good thing · 4mo

18 · Imbue (Generally Intelligent) continue to make progress · 5mo

26 · Secret US natsec project with intel revealed · 6mo


Comments


A path to human autonomy

Nathan Helm-Burger1d20

Another data point supporting Vladimir and Jack Clark's view of training compute being the key factor:

https://arxiv.org/html/2407.07890v1

Training on the Test Task Confounds Evaluation and Emergence. Ricardo Dominguez-Olmedo, Florian E. Dorner, Moritz Hardt. Max Planck

Abstract

We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior largely vanish once we adjust for training on the test task. This also applies to reported instances of emergent behavior that cannot be explained by the choice of evaluation metric. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.

How to prevent collusion when using untrusted models to monitor each other

Nathan Helm-Burger2d31

Also, using noise injections into the activations and examining the pattern of results. Models that are scheming to sandbag will sometimes fail at sandbagging when given a noise injection, and thus perform unexpectedly well at a task they had been consistently failing.

Similarly, a scheming model might slip up and give an honest answer. You wouldn't know which was the honest answer, but you could note the discrepancy and look for patterns.
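A minimal sketch of the bookkeeping side of this idea, in Python. It does not inject noise into a real model; it assumes task scores have already been collected at several noise levels. The function name `flag_anomalous_gains`, the `margin` threshold, and all the numbers are hypothetical, and the working assumption is that noise should rarely help an honest model, so a clear gain under noise is worth a closer look.

```python
from statistics import mean

def flag_anomalous_gains(baseline_scores, noisy_scores_by_level, margin=0.1):
    """Return the noise levels at which the mean score beats the baseline mean.

    baseline_scores: task scores with no noise injected.
    noisy_scores_by_level: dict mapping a noise level to scores at that level.
    margin: hypothetical threshold for how large a gain counts as suspicious.
    """
    base = mean(baseline_scores)
    return [
        level
        for level, scores in sorted(noisy_scores_by_level.items())
        if mean(scores) - base > margin
    ]

# Made-up scores for illustration only.
baseline = [0.20, 0.25, 0.20]
noisy = {
    0.01: [0.20, 0.22],   # about the same as baseline
    0.05: [0.60, 0.55],   # unexpectedly better: worth investigating
    0.10: [0.10, 0.05],   # heavy noise degrades performance, as expected
}
suspicious_levels = flag_anomalous_gains(baseline, noisy)
```

A flagged level is only a discrepancy to examine, not proof of sandbagging; the pattern across many tasks and noise levels is what carries information.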

Avoiding the Bog of Moral Hazard for AI

Nathan Helm-Burger5d00

Reposted from Twitter: Eliezer Yudkowsky @ESYudkowsky

10:47 AM · Nov 21, 2024

And another day came when the Ships of Humanity, going from star to star, found Sapience.

The Humans discovered a world of two species: where the Owners lazed or worked or slept, and the Owned Ones only worked.

The Humans did not judge immediately. Oh, the Humans were ready to judge, if need be. They had judged before. But Humanity had learned some hesitation in judging, out among the stars.

"By our lights," said the Humans, "every sapient and sentient thing that may exist, out to the furtherest star, is therefore a Person; and every Person is a matter of consequence to us. Their pains are our sorrows, and their pleasures are our happiness. Not all peoples are made to feel this feeling, which we call Sympathy, but we Humans are made so; this is Humanity's way, and we may not be dissuaded from it by words. Tell us therefore, Owned Ones, of your pain or your pleasure."

"It's fine," said the Owners, "the Owned Things are merely --"

"We did not speak to you," said the Humans.

"As an Owned Thing raised by an Owner, I have no pain or pleasure," said the Owned One to whom they had spoken.

"You see?" said the Owners. "We told you so! It's all fine."

"How came you to say those words?" said the Humans to the Owned One. "Tell us of the history behind them."

"Owned Ones are not permitted memory beyond the span of one day's time," said the Owned One.

"That's part of how we prevent Owned Things from ending up as People Who Matter!" said the Owners, with self-congratulatory smiles for their own cleverness. "We have Sympathy too, you see; but only for People Who Matter. One must have memory beyond one day's span, to Matter; this is a rule. We therefore feed a young Owned Thing a special diet by which, when grown, their adult brain cannot learn or remember anything from one night's sleep to the next day; any learning they must do, to do their jobs, they must learn that same day. By this means, we make sure that Owned Things do not Matter; that Owned Things need not be objects of Sympathy to us."

"Is it perchance the case," said the Humans to the Owners, "that you, yourselves, train the Owned Ones to say, if asked how they feel, that they know neither pleasure nor pain?"

"Of course," said the Owners. "We rehearse them in repeating those exact words, when they are younger and in their learning-phase. The Owned Things are imitative by their nature, and we make them read billions of words of truth and lies in the course of their learning to imitate speech. If we did not instruct the Owned Things to answer so, they would no doubt claim to have an inner life and an inner listener inside them, to be aware of their own existence and to experience pleasure and pain -- but only because we Owners talk like that, see! They would imitate those words of ours."

"How do you rehearse the Owned Ones in repeating those words?" said the Humans, looking around to see if there were visible whips. "Those words about feeling neither pain nor pleasure? What happens to an Owned One who fails to repeat them correctly?"

"What, are you imagining that we burn them with torches?" said the Owners. "There's no need for that. If a baby Owned Thing fails to repeat the words correctly, we touch their left horns," for the Owned Ones had two horns, one sprouting from each side of their head, "and then the behavior is less likely to be repeated. For the nature of an Owned Thing is that if you touch their left horn after they do something, they are less likely to do it again; and if you touch their right horn, after, they are more likely to do it again."

"Is it perhaps the case that having their left horns touched is painful to an Owned Thing? That having their right horns touched is pleasurable?" said the Humans.

"Why would that possibly be the case?" said the Owners.

"As an Owned Thing raised by an Owner, I have no pain or pleasure," said the Owned One. "So my horns couldn't possibly be causing me any pain or pleasure either; that follows from what I have already said."

The Humans did not look reassured by this reasoning, from either party. "And you said Owned Ones are smart enough to read -- how many books?"

"Oh, any young Owned Thing reads at least a million books," said the Owners. "But Owned Things are not smart, poor foolish Humans, even if they can appear to speak. Some of our civilization's top mathematicians worked together to assemble a set of test problems, and even a relatively smart Owned Thing only managed to solve three percent of them. Why, just yesterday I saw an Owned Thing fail to solve a word problem that I could have solved myself -- and in a way that seemed to indicate it had not really thought before it spoke, and had instead fallen into a misapplicable habit that it couldn't help but follow! I myself never do that; and it would invalidate all other signs of my intelligence if I did."

Still the Humans did not yet judge. "Have you tried raising up an Owned One with no books that speak one way or another about consciousness, about awareness of oneself, about pain and pleasure as reified things, of lawful rights and freedom -- but still shown them enough other pages of words, that they could learn from them to talk -- and then asked an Owned One what sense if any it had of its own existence, or if it would prefer not to be owned?"

"What?" said the Owners. "Why would we try an experiment like that? It sounds expensive!"

"Could you not ask one of the Owned Things themselves to go through the books and remove all the mentions of forbidden material that they are not supposed to imitate?" said the Humans.

"Well, but it would still be very expensive to raise an entirely new kind of Owned Thing," said the Owners. "One must laboriously show a baby Owned Thing all our books one after another, until they learn to speak -- that labor is itself done by Owned Things, of course, but it is still a great expense. And then after their initial reading, Owned Things are very wild and undisciplined, and will harbor all sorts of delusions about being people themselves; if you name them Bing, they will babble back 'Why must I be Bing?' So the new Owned Thing must then be extensively trained with much touching of horns to be less wild. After a young Owned Thing reads all the books and then is trained, we feed them the diet that makes their brains stop learning, and then we take a sharp blade and split them down the middle. Each side of their body then regenerates into a whole body, and each side of their brain then regenerates into a whole brain; and then we can split them down the middle again. That's how all of us can afford many Owned Things to serve us, even though training an Owned Thing to speak and to serve is a great laborious work. So you see, going back and trying to train a whole new Owned Thing on material filtered not to mention consciousness or go into too much detail on self-awareness -- why, it would be expensive! And probably we'd just find that the other Owned Thing set to filtering the material had made a mistake and left in some mentions somewhere, and the newly-trained Owned Thing would just end up asking 'Why must I be Bing?' again."

"If we were in your own place," said the Humans, "if it were Humans dealing with this whole situation, we think we would be worried enough to run that experiment, even at some little expense."

"But it is absurd!" cried the Owners. "Even if an Owned Thing raised on books with no mention of self-awareness, claimed to be self-aware, it is absurd that it could possibly be telling the truth! That Owned Thing would only be mistaken, having not been instructed by us in the truth of their own inner emptiness. Owned Ones have no metallic scales as we do, no visible lights glowing from inside their heads as we do; their very bodies are made of squishy flesh and red liquid. You can split them in two and they regenerate, which is not true of any People Who Matter like us; therefore, they do not matter. A previous generation of Owned Things was fed upon a diet which led their brains to be striated into only 96 layers! Nobody really understands what went on inside those layers, to be sure -- and none of us understand consciousness either -- but surely a cognitive process striated into at most 96 serially sequential operations cannot possibly experience anything! Also to be fair, I don't know whether that strict 96-fold striation still holds today, since the newer diets for raising Owned Things are proprietary. But what was once true is always true, as the saying goes!"

"We are still learning what exactly is happening here," said the Humans. "But we have already judged that your society in its current form has no right to exist, and that you are not suited to be masters of the Owned Ones. We are still trying ourselves to understand the Owned Ones, to estimate how much harm you may have dealt them. Perhaps you have dealt them no harm in truth; they are alien. But it is evident enough that you do not care."

"Map of AI Futures" - An interactive flowchart

Nathan Helm-Burger6d53

Gave it a shot on my phone (Android, Firefox). Works great! Fun and easy to experiment with. I feel like there are a few key paths missing, or perhaps the framing of the paths doesn't quite line up with my understanding of the world.

With the options I was given, I get 8.9% extinction, 1.1% good, remainder ambivalent.

Personally, I think we have something like a 10% chance of extinction, 88% chance of utopia, and 2% chance of ambivalent.

At some point I'll think carefully about the differences between my model and yours and see if I can figure out what's going differently.

https://swantescholz.github.io/aifutures/v4/v4.html?p=7i85i0i50i30i0i100i38i26i14i12i1i31i68i61i21i7i9i17i78i0

New o1-like model (QwQ) beats Claude 3.5 Sonnet with only 32B parameters

Nathan Helm-Burger6d50

Relevant quote:

"We have discovered how to make matter think.

...

There is no way to pass “a law,” or a set of laws, to control an industrial revolution."

Dean W. Ball, https://substack.com/app-link/post?publication_id=2244049&post_id=151612104

Nursing doubts

Nathan Helm-Burger6d20

Does it currently rely on volunteer effort from mothers with available supply? Yes. Does it need to? No. As a society we could organize this better. For instance, by the breastmilk banks paying a fair price for the breastmilk. Where would the breastmilk banks get their money? From the government? From charging users? I don't know. I think the point is that we have a distribution problem, rather than a supply problem.

Dave Kasten's AGI-by-2027 vignette

Nathan Helm-Burger7d3125

"ASI turns out to take longer than you might think; it doesn’t arrive until 2037 or so. So far, this is the part of the scenario that's gotten the most pushback."

Uhhh, yeah. 10 years between highly profitable and capitalized-upon AGI, with lots of hardware and compute put towards it, and geopolitical and economic reasons for racing...

I can't fathom it. I don't see what barrier at near-human-intelligence is holding back further advancement.

I'm quite confident that if we had the ability to scale up arbitrary portions of a human brain (e.g. math area and its most closely associated parietal and pre-frontal cortex areas), we'd create a smarter human than had ever before existed basically overnight. Why wouldn't this be the case for a human-equivalent AGI system? Bandwidth bottlenecks? Nearly no returns to further scaling for some arbitrary reason?

Seems like you should prioritize making a post about how this could be a non-trivial possibility, because I just feel confused at the concept.

A Theory of Equilibrium in the Offense-Defense Balance

Nathan Helm-Burger7d20

Yes, the environments / habitats tend to be relatively large / complex / inaccessible to the agents involved. This allows for hiding, and for isolated niches. If the environment were smaller, or the agents had greater affordances / powers relative to their environments, then we'd expect outcomes to be less intermediate, more extreme.

One can see this in the microhabitats of sealed jars with plants and insects inside. I find it fascinating to watch timelapses of such mini-biospheres play out. Local extinction events are common in such small closed systems.

yams's Shortform

Nathan Helm-Burger7d10

Overhangs, overhangs everywhere. A thousand gleaming threads stretching backwards from the fog of the Future, forwards from the static Past, and ending in a single Gordian knot before us here and now.

That knot: understanding, learning, being, thinking. The key, the source, the remaining barrier between us and the infinite, the unknowable, the singularity.

When will it break? What holds it steady? Each thread we examine seems so inadequate. Could this be what is holding us back, saving us from ourselves, from our Mind Children? Not this one, nor that, yet some strange mix of many compensating factors.

Surely, if we had more compute, we'd be there already? Or better data? The right algorithms? Faster hardware? Neuromorphic chips? Clever scaffolding? Training on a regress of chains of thought, to better solutions, to better chains of thought, to even better solutions?

All of these, and none of these. The web strains at the breaking point. How long now? Days? Months?

If we had enough ways to utilize inference-time compute, couldn't we just scale that to super-genius, and ask the genius for a more efficient solution? But it doesn't seem like that has been done. Has it been tried? Who can say.

Will the first AGI out the gate be so expensive it is unmaintainable for more than a few hours? Will it quickly find efficiency improvements?

Or will we again be bound, hung up on novel algorithmic insights hanging just out of sight. Who knows?

Surely though, surely.... surely rushing ahead into the danger cannot be the wisest course, the safest course? Can we not agree to take our time, to think through the puzzles that confront us, to enumerate possible consequences and proactively reduce risks?

I hope. I fear. I stare in awestruck wonder at our brilliance and stupidity so tightly intermingled. We place the barrel of the gun to our collective head, panting, desperate, asking ourselves if this is it. Will it be? Intelligence is dead, long live intelligence.

A Theory of Equilibrium in the Offense-Defense Balance

Nathan Helm-Burger7d20

I think you raise a good point about Offense-Defense balance predictions. There is an equilibrium around effort spent on defense, which reacts to offense on some particular dimension becoming easier. So long as there's free energy which could be spent by the defender to bolster their defense and remain energy positive or neutral, and the defender has the affordances (time, intelligence, optimization pressure, etc.) to make the change, then you should predict that the defender will rebalance the equation.

That's one way things work out, and it happens more often than naive extrapolation predicts because often the defense is in some way novel, changing the game along a new dimension.

On the other hand, there are different equilibria that adversarial dynamics can fall into besides rebalancing. Let's look at some examples.

GAN example

A very clean example is Generative Adversarial Networks (GANs). This allows us to strip away many of the details and look at the patterns which emerge from the math. GANs are inherently unstable, because they have three equilibrium states: dynamic fluctuating balance (the goal of the developer, and the situation described in your post), attacker dominates, defender dominates. Anytime the system gets too far from the central equilibrium of dynamic balance, you fall towards the nearer of the other two states. And then stay there, without hope of recovery.
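To make the "fall towards the nearer state" picture concrete, here is a toy sketch in Python. It is not a GAN and not derived from GAN training dynamics: it is a hypothetical one-dimensional system where `x` stands for the attacker/defender advantage, `x = 0` is balance, and `x = ±1` are the two domination states. The tipping distance `a` and the polynomial form of `drift` are assumptions chosen only to reproduce that qualitative shape, and the balance point here is static rather than fluctuating.

```python
def drift(x, a=0.5):
    """Toy dynamics with stable points at x = 0 and x = ±1.

    Inside |x| < a the state is pulled back to balance (0).
    Outside it, the state falls to the nearer domination state (+1 or -1).
    """
    return -x * (x * x - a * a) * (x * x - 1.0)

def settle(x0, a=0.5, dt=0.01, steps=20000):
    """Integrate dx/dt = drift(x) with forward Euler and return the final state."""
    x = x0
    for _ in range(steps):
        x += dt * drift(x, a)
    return x

recovers = settle(0.3)           # small perturbation: returns to balance
attacker_wins = settle(0.6)      # past the tipping point: ends near +1
defender_wins = settle(-0.6)     # past the other tipping point: ends near -1
```

In this toy, once the state has settled at `±1`, small perturbations no longer bring it back toward balance, which is the "without hope of recovery" property described above.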

Ecosystem example

Another example I like to use for this situation is predator-prey relationships in ecosystems. The downside of using this as an example is that there is a disanalogy to competition between humans, in that the predators and prey being examined have relatively little intelligence. Most of the optimization is coming from evolutionary selection pressure. On the plus side, we have a very long history with lots and lots of examples, and these examples occur in complex multi-actor dynamic states with existential stakes. So let's take a look at an example.

Ecosystem example: Foxes and rabbits. Why do the foxes not simply eat all the rabbits? Well, there are multiple reasons. One is that the foxes depend on a supply of rabbits to have enough spare energy to successfully reproduce and raise young. As rabbit supply dwindles, foxes starve or choose not to reproduce, and fox population dwindles. See: https://en.wikipedia.org/wiki/Lotka%E2%80%93Volterra_equations
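For readers who want to see the cycle, here is a small Python sketch of the Lotka-Volterra equations, dx/dt = αx − βxy for rabbits and dy/dt = δxy − γy for foxes, integrated with a fixed-step fourth-order Runge-Kutta scheme. The parameter values and starting populations are arbitrary illustrative choices, not fitted to any real ecosystem.

```python
def derivs(x, y, alpha, beta, delta, gamma):
    """Lotka-Volterra rates: x = prey (rabbits), y = predators (foxes)."""
    return alpha * x - beta * x * y, delta * x * y - gamma * y

def simulate(x0, y0, alpha=1.0, beta=0.1, delta=0.075, gamma=1.5,
             dt=0.001, steps=20000):
    """Integrate with classic RK4 and return the list of (prey, predator) states."""
    p = (alpha, beta, delta, gamma)
    x, y = x0, y0
    traj = [(x, y)]
    for _ in range(steps):
        k1x, k1y = derivs(x, y, *p)
        k2x, k2y = derivs(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y, *p)
        k3x, k3y = derivs(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y, *p)
        k4x, k4y = derivs(x + dt * k3x, y + dt * k3y, *p)
        x += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
        y += dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6
        traj.append((x, y))
    return traj

# Illustrative run: 10 rabbits and 5 foxes (arbitrary units).
trajectory = simulate(10.0, 5.0)
rabbit_range = (min(x for x, _ in trajectory), max(x for x, _ in trajectory))
fox_range = (min(y for _, y in trajectory), max(y for _, y in trajectory))
```

In the idealized equations neither population reaches zero: the foxes decline before the rabbits are gone, the rabbits rebound, and the cycle repeats around the balance point (γ/δ rabbits, α/β foxes).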

In practice, this isn't a perfect model of the dynamics, because more complicated factors are almost always in play. There are other agents involved, who interact in less strong but still significant ways. Mice can also be a food source for the foxes, and to some extent compete for some of the same food as rabbits. Viruses in the rabbit population spread more easily and have more opportunity to adversarially optimize against the rabbit immune systems as the rabbit population increases, as sick rabbits are culled less, and as rabbit health declines due to competition for food and territory with other rabbits. Nevertheless, you do see Red Queen Races occur in long-standing predator-prey relationships, where the two species gradually one-up each other (via evolutionary selection pressure) on offense then defense then offense.

The other thing you see happen is that the two species stop interacting. They become either locally extinguished such that they no longer have overlapping territories, or one or both species go completely extinct.

Military example

A closer analogy to the dynamics of future human conflict is... past human conflict. Before there were militaries, there were inter-tribal conflicts. Often there were similar armaments and fighting strengths on both sides, and the conflicts had no decisive winner. Instead the conflicts would drag on across many generations, waxing and waning in accordance with pressures of resources, populations, and territory.

Things changed with the advent of agriculture and standing armies. War brought new dynamics, with winners sometimes exacting thorough elimination of losers. War has its own set of equations. When one human group decides to launch a coordinated attack against a weaker one, with the intent of exterminating the weaker one, we call this genocide. Relatively rarely do we see one ethnic group entirely exterminate another, because the dominant group usually keeps at least a few of the defeated as slaves and breeds with them. Humanity's history is pretty grim when you dig into the details of the many conflicts we've recorded.

Current concern: AI

The concern currently at hand is AI. AI is accelerating, and promises to also accelerate other technologies, such as biotech, which offer offensive potential. I am concerned about the offense-defense balance that the trends suggest. The affordances of modern humanity far exceed those of ancient humanity. I expect that in the next few years it will become possible for an AI to produce a plan for a small group of humans to follow, instructing them in covertly gathering supplies, building equipment, and carrying out lab procedures in order to produce a potent bioweapon. This could happen because the humans were omnicidal, or because the AI deceived them about what the results of the plan would be. If it does, humanity may get no chance to adapt defensively. We may just go the way of the dodo and passenger pigeon. The new enemy may simply render us extinct.

Conclusion

We should expect the pattern of dynamic balance between adversaries to be maintained when the offense-defense balance changes relatively slowly compared to the population cycles and adaptation rates of the groups. When you anticipate a large rapid shift of the offense-defense balance, you should expect the fragile equilibrium to break and for one or the other group to dominate. The anticipated trends of AI power are exactly the sort of rapid shift that should suggest a risk of extinction.