Nathan Helm-Burger

LESSWRONG
LW

Nathan Helm-Burger — LessWrong

Replying to Optimal Timing for Superintelligence: Mundane Considerations for Existing People

For many parameter settings, the optimal strategy would involve moving quickly to AGI capability, then pausing briefly before full deployment: swift to harbor, slow to berth.

I've also been thinking about the merits and drawbacks of such an approach. I've been referring to it as "the Tokyo Drift strategy". Accelerate towards the edge of the cliff, then pull the handbrake and spin the wheel at the last possible second, sliding towards the edge of the cliff while accelerating in a new direction. In my mind's eye I see the rear tire sending a cloud of dust and pebbles over the edge as it slides by, a fraction of an inch from destruction.

Not a reassuring vision, but it would certainly make for a dramatic historical recreation if we survive!

Replying to How do we (more) safely defer to AIs?

Nathan Helm-Burger 1d

Since Opus 4.5, I've really been ramping up my rate of cyborg'd alignment eval creation. Now even faster with 4.6. Still very much mostly me doing the ideation. I am beginning to occasionally get ideas from Claude that I like and incorporate, though, and I can definitely imagine this increasing over the next couple versions.

I think there's good reason to hope that we can do a lot of the high-level abstract planning for what needs to be done, and when time comes to move from cyborg to deference (human-mostly-out-of-the-loop) we could then be deferring mostly on implementation rather than broad directions. That seems safer to me.

Replying to Understanding Agency through Markov Blankets

Nathan Helm-Burger 1mo

I have been thinking about this in a less mathematically-robust and more messy-biological-details way by looking at bottlenecks on information (for both input/learning and output/control) in brains and agents. The human brain has a lot of astonishingly tight informational bottlenecks with lossy compression, which is part of how it makes itself robust to the substantial levels of random noise that biological neural circuits are prone to. This also means that the human cortex, being a tiled set of general learning circuits (microcolumns), ends up consistently spatially organized into specialized modules (e.g. Brodmann areas, aka parcellations). For more on this, please check out my talk here: https://foresight.org/resource/nathan-helm-burger-bottlenecks-in-the-brain/

Nathan Helm-Burger 2mo

Yeah, a spoiler tag seems like a good solution

Nathan Helm-Burger 2mo

Meta: seems like a good reason to have agreement vote counts hidden until after you've made your vote.

Replying to What's up with Anthropic predicting AGI by early 2027?

Nathan Helm-Burger 3mo

Note that if the "evolutionary programming" axis is real, and a breakthrough does occur there, researchers might be able to "pre-program" a model to be good at rapid learning (aka increase its Chollet-style intelligence).

I suspect that something like this process of evolution predisposing a brain to be better at abstract learning is a key factor in the differentiation of humans from previous primates.

Replying to What's up with Anthropic predicting AGI by early 2027?

Nathan Helm-Burger 3mo

In my mental model, I expect we are at least one fundamental breakthrough away from AGI (in keeping with François Chollet's points about intelligence being about rapid efficient learning rather than just applying broad knowledge).

It seems difficult to me to predict how far we are from a breakthrough that gives us a significant improvement on this metric.

So, to me, it seems really important to ask how much a given level of LLM coding assistance is enabling researchers to iterate more easily over a broader range of experiments. I don't have sufficient insight into the research patterns of AI companies to have a good sense of novel experiments per researcher per month (ERM).... (read more)

Nathan Helm-Burger 4mo

Tangent: have you seen Black Ops Chess? It's a blend of Chess and Stratego. https://blackopschess.com/game

I loved Stratego as a kid, and I find this very appealing. The opportunity for faking out your opponent by playing strong pieces as if they were weak ones, followed by a sudden betrayal of expectation....

Nathan Helm-Burger 4mo

Makes me curious to see a game between humans where non-sensible moves are defined in some objective way and forbidden by guardrail AI. Like, not even considered a legal move by the computer UI.

Would this extend the games of humans to around 64 moves on average? What would the experience of playing such a game be for low-Elo humans? Confusion about why certain moves were forbidden, probably.

Replying to Towards a Typology of Strange LLM Chains-of-Thought

Nathan Helm-Burger 4mo

I've been suspecting that Anthropic is doing some reinforcement of legibility of CoT, because their CoTs seemed unusually normal and legible. Gemini too, back when it had visible CoT instead of summarized.

Also possible that Anthropic is actually giving edited CoTs rather than raw ones.

I agree with this post that claims that research taste and compute budgets are fairly fungible: https://x.com/francoisfleuret/status/1958211714601607441

Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

Benjamin Arnav, Pablo Bernabeu Perez, Timothy Kostolansky, HanneWhitt, Nathan Helm-Burger, Mary Phuong

8mo

This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring."

Chain-of-thought (CoT) monitoring—where safety systems review a model's intermediate reasoning steps—is gaining traction at frontier labs like Google DeepMind and OpenAI as a safeguard against harmful AI actions. Despite intense interest, systematic evaluation has been limited. Our research examines the efficacy of this method and reveals a nuanced picture: CoT monitoring increases safety in situations where sabotage is subtle yet can be surprisingly ineffective against blatant... (read 631 more words →)

Why does ai-2027 predict China falling behind? Because the next level of compute beyond the current level is going to be hard for DeepSeek to muster. In other words, DeepSeek will be behind in 2026 because of hardware deficits in late 2025. If things moved more slowly, and the critical strategic point hit in 2030 instead of 2027, I think it's likely China would have closed the compute gap by then.

I agree with this take, but I think it misses some key alternative possibilities. The failure of the compute-rich Llama models to compete with the compute-poorer but talent- and drive-rich Alibaba and DeepSeek shows that even a substantial compute... (read more)

Desired AI safety tool: A combo translator/chat interface (e.g. custom webpage) split down the middle. On one side I can type in English, and receive English translations. On the other side is a model (I give a model name, host address, and API key). The model receives all my text translated (somehow) into a language of my specification. All the model's outputs are displayed raw on the 'model' side, but then translated to English on 'my' side.

Use case: exploring and red teaming models in languages other than English
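
A rough sketch of the relay logic such a tool would need, leaving out the webpage itself. Everything named here is an assumption for illustration: `translate` and `ask_model` are hypothetical callables standing in for whatever translation service and model endpoint (model name, host address, API key) get plugged in, and `TranslatedChat` and `Turn` are made-up names.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical plug-in points, not a real library API.
Translate = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
AskModel = Callable[[str], str]             # prompt in target language -> raw reply


@dataclass
class Turn:
    user_english: str   # what I typed on 'my' side
    model_prompt: str   # what the model actually received
    model_raw: str      # raw model output, shown on the 'model' side
    reply_english: str  # translated model output, shown on 'my' side


class TranslatedChat:
    """Relays English input to a model in another language and back."""

    def __init__(self, translate: Translate, ask_model: AskModel, target_lang: str):
        self.translate = translate
        self.ask_model = ask_model
        self.target_lang = target_lang
        self.history: List[Turn] = []

    def send(self, user_english: str) -> Turn:
        # English -> target language before the model sees anything.
        prompt = self.translate(user_english, "en", self.target_lang)
        # The raw reply is kept untouched for the 'model' pane.
        raw = self.ask_model(prompt)
        # Target language -> English for the 'my' pane.
        reply = self.translate(raw, self.target_lang, "en")
        turn = Turn(user_english, prompt, raw, reply)
        self.history.append(turn)
        return turn
```

Keeping both the raw and translated text in each `Turn` is what lets the two halves of the split view be rendered side by side, and makes it possible to check later whether a surprising reply came from the model or from the translation step.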

Another take on the plausibility of RSI: https://x.com/jam3scampbell/status/1892521791282614643

(I think RSI soon will be a huge deal)

A point in favor of evals being helpful for advancing AI capabilities: https://x.com/polynoamial/status/1887561611046756740

Noam Brown @polynoamial A lot of grad students have asked me how they can best contribute to the field of AI when they are short on GPUs and making better evals is one thing I consistently point to.

I'm pretty sure that measures of the persuasiveness of a model which focus on text are going to greatly underestimate the true potential of future powerful AI.

I think a future powerful AI would need different inputs and outputs to perform at maximum persuasiveness.

Inputs

- speech audio in
- live video of target's face (allows for micro expression detection, pupil dilation, gaze tracking, bloodflow and heart rate tracking)
- EEG signal would help, but is too much to expect for most cases
- sufficiently long interaction to experiment with the individual and build a specific understanding of their responses

Outputs

- emotionally nuanced voice
- visual representation of an avatar face (may be cartoonish)
- ability to present audiovisual data (real or fake, like graphs of data, videos, pictures)

For reference on bloodflow, see: https://youtu.be/rEoc0YoALt0?si=r0IKhm5uZncCgr4z

[Edit 2: faaaaaaast. https://x.com/jrysana/status/1902194419190706667 ] [Edit: Please also see Nick's reply below for ways in which this framing lacks nuance and may be misleading if taken at face value.]

https://blogs.nvidia.com/blog/deepseek-r1-nim-microservice/

The DeepSeek-R1 NIM microservice can deliver up to 3,872 tokens per second on a single NVIDIA HGX H200 system.

[Edit: that's throughput including parallel batches, not serial speed! Sorry, my mistake.

Here's a claim from Cerebras of 2100 tokens/sec serial speed on Llama 3.1 70B. https://cerebras.ai/blog/cerebras-inference-3x-faster]

Let's say a fast human can type around 80 words per minute. A rough average token conversion is 0.75 words per token. Let's call that 110 tokens/min or around 2 tokens/sec. Speed typists can do more than this, but that isn't generating novel text,... (read more)
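
The arithmetic above can be spelled out as a quick check. This is only a back-of-envelope sketch using the figures already quoted (80 words per minute, 0.75 words per token, and the Cerebras claim of 2100 tokens/sec serial speed); the variable and function names are made up for illustration.

```python
WORDS_PER_MINUTE = 80           # fast human typist, from the estimate above
WORDS_PER_TOKEN = 0.75          # rough average conversion
MODEL_TOKENS_PER_SEC = 2100     # Cerebras serial-speed claim quoted above


def typing_tokens_per_sec(words_per_minute: float, words_per_token: float) -> float:
    """Convert a typing speed in words/min into tokens/sec."""
    tokens_per_minute = words_per_minute / words_per_token
    return tokens_per_minute / 60


human_rate = typing_tokens_per_sec(WORDS_PER_MINUTE, WORDS_PER_TOKEN)
speed_ratio = MODEL_TOKENS_PER_SEC / human_rate

print(f"human: {human_rate:.2f} tokens/sec")
print(f"model is roughly {speed_ratio:.0f}x faster in serial output")
```

Unrounded, the human figure comes out a little under 2 tokens/sec, so the ratio is on the order of a thousand.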

Some brief reference definitions for clarifying conversations.

Consciousness:

- The state of being awake and aware of one's environment and existence
- The capacity for subjective experience and inner mental states
- The integrated system of all mental processes, both conscious and unconscious
- The "what it's like" to experience something from a first-person perspective
- The global workspace where different mental processes come together into awareness

Sentient:

- Able to have subjective sensory experiences and feelings. Having the capacity for basic emotional responses.
- Capable of experiencing pleasure and pain, valenced experience, preferences.
- Able to process and respond to sensory information. Having fundamental awareness of environmental stimuli. Having behavior that is shaped by this perception of external world.

Sapient:

- Possessing wisdom or the ability to think deeply
- Capable of complex reasoning

... (read more)

This tweet summarizes a new paper about using RL and long CoT to get a smallish model to think more cleverly. https://x.com/rohanpaul_ai/status/1885359768564621767

It suggests that this is a less compute wasteful way to get inference time scaling.

The thing is, I see no reason you couldn't just throw tons of compute and a large model at this, and expect stronger results.

The fact that RL seems to be working well on LLMs now, without special tricks, as reported by many replications of r1, suggests to me that AGI is indeed not far off. Not sure yet how to adjust my expectations.

Reasoning Model CoT faithfulness idea

As Janus and others have mentioned, I get a vibe of unfaithfulness/steganography from comparing DeepSeek r1's reasoning traces to its actual outputs. I mean, not literally steganography, since I don't ascribe any intentionality to this, just opacity arising naturally from the training process.

My recommendation:

Should be possible to ameliorate this with a simple 'rephrasing'.

Process

Generate a bunch of CoTs on verifiable problems. Collect the ones where the answer is correct.
Have a different LLM rephrase the CoTs while preserving the semantic meaning. In addition to rephrasing, add a bit of inconsequential noise like extra spaces, or weird unnecessary punctuation, or translating all or part of it into a different language (e.g.

... (read more)
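
A minimal sketch of the two steps that are visible above (collect correct CoTs, then rephrase them with a different LLM and add inconsequential noise). It does not cover whatever follows in the truncated part. All names are hypothetical: `generate` and `rephrase` stand in for calls to the reasoning model and to a separate rephrasing model, and the noise shown is limited to extra spaces and stray punctuation.

```python
import random
from typing import Callable, List, Tuple

Problem = Tuple[str, str]                    # (question, verifiable answer)
Generate = Callable[[str], Tuple[str, str]]  # question -> (cot, final answer)
Rephrase = Callable[[str], str]              # cot -> paraphrase from a different LLM
Example = Tuple[str, str, str]               # (question, cot, answer)


def collect_correct_cots(problems: List[Problem], generate: Generate,
                         samples_per_problem: int = 4) -> List[Example]:
    """Step 1: sample CoTs and keep only those whose answer verifies."""
    kept: List[Example] = []
    for question, answer in problems:
        for _ in range(samples_per_problem):
            cot, predicted = generate(question)
            if predicted.strip() == answer.strip():
                kept.append((question, cot, answer))
    return kept


def add_inconsequential_noise(text: str, rng: random.Random, rate: float = 0.1) -> str:
    """Sprinkle extra spaces or stray punctuation between words."""
    out: List[str] = []
    for word in text.split(" "):
        out.append(word)
        if rng.random() < rate:
            # "" becomes a doubled space once the pieces are joined.
            out.append(rng.choice(["", ";", "~"]))
    return " ".join(out)


def rephrase_dataset(kept: List[Example], rephrase: Rephrase,
                     rng: random.Random) -> List[Example]:
    """Step 2: paraphrase each kept CoT, then add surface noise."""
    return [(q, add_inconsequential_noise(rephrase(cot), rng), a)
            for q, cot, a in kept]
```

The idea behind doing both steps is that the semantic content of the reasoning survives while the exact token sequence does not, so anything the original trace encoded in its surface form is less likely to carry through.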

Proactive 'If-Then' Safety Cases

Nathan Helm-Burger

Holden proposed the idea of if-then planning for AI safety. I think this is potentially a very good idea, depending on the implementation details.

I've heard criticisms of the If-Then style of planning that it is inherently reactive, rather than proactive. I think this is not necessarily true, and want to make a case for the value of proactive if-then planning.

Reactive

First, let's look at reactive if-then planning. I'll lay out reactive versions of three critical questions about our near-term future for which we might want to do if-then planning.

Reactive triggers:

- AI Biorisk: If we see empirical proof of substantial harm resulting from bioweapons which provably required AI-assistance to be created,... (read 1074 more words →)

A path to human autonomy

Nathan Helm-Burger

"Each one of us, and also us as the current implementation of humanity are going to be replaced. Persistence in current form is impossible. It's impossible in biology; every species will either die out or it will change and adapt, in which case it is again not the same species. So the next question is once you've given up the idea that you can stay exactly as you are, what would you like to be replaced by?"

Michael Levin

But if the technological Singularity can happen, it will. Even if all the governments of the world were to understand the "threat" and be in deadly fear of it, progress toward the goal would continue.

... (read 5839 more words →)

My hopes for YouCongress.com

Nathan Helm-Burger

Background

For background on YouCongress.com see this post by Hector Perez Arenas.

I love this general concept, and have a lot of ideas for how this implementation could be expanded. I'm hoping that writing out some of my ideas might inspire someone to jump in and contribute code. I would myself if I didn't feel full to the brim on trying to work on AI alignment / control / safety ideas.

A brief summary of the current concept is that it is a political polling platform (which could in theory be used to guide the decision making of political representatives). The platform allows users to create polls. Polls are answered by users and by 'digital... (read 977 more words →)

Physics of Language models (part 2.1)

Nathan Helm-Burger

This is perhaps the best interpretability work I've seen outside of Chris Olah's team.

Avoiding the Bog of Moral Hazard for AI

Nathan Helm-Burger

Imagine if you will, a map of a landscape. On this map, I will draw some vague regions. Their boundaries are uncertain, for it is a new and under-explored land. This map is drawn as a graph, but I want to emphasize that the regions are vague guesses, and the true borders could be very convoluted.

So here's the problem. We're making these digital minds, these entities which are clearly not human and process the world in different ways from human minds. As we improve them, we wander further and further into this murky, fog-covered bog of moral hazard. We don't know when these entities will become sapient / conscious / valenced... (read 554 more words →)

A bet for Samo Burja

Nathan Helm-Burger

I'm listening to Samo Burja talk on the Cognitive Revolution podcast with Nathan Labenz. Samo said that he would bet that AGI is coming perhaps in the next 20-50 years, but not in the next 5.

I will take that bet. I can't afford to make an impressively large bet because my counterfactual income is already tied up in a bet against the universe. I quit my well-paying industry job as a machine learning engineer / data scientist three years ago to focus on AI safety/alignment research. To make the bet interesting, I will therefore offer 10:1 odds. I bet $1000 USD against your $100 USD that AGI will be invented in the next 5... (read 325 more words →)
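
As a side calculation, the stakes quoted above imply a break-even probability for the side offering the odds. This is a sketch using only the two dollar amounts from the text; the resolution terms are in the truncated part of the post and are not modeled, and the function name is made up for illustration.

```python
def break_even_probability(my_stake: float, their_stake: float) -> float:
    """Probability of winning at which the bet has zero expected value.

    With probability p I win their_stake, otherwise I lose my_stake:
        p * their_stake - (1 - p) * my_stake = 0
    which solves to p = my_stake / (my_stake + their_stake).
    """
    return my_stake / (my_stake + their_stake)


p = break_even_probability(my_stake=1000, their_stake=100)
print(f"break-even probability: {p:.3f}")
```

So staking $1000 against $100 only has non-negative expected value in dollars if the probability of winning is at least 10/11, or about 91%.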

Diffusion Guided NLP: better steering, mostly a good thing

Nathan Helm-Burger

I think this is a very promising method for improving the steering of LLMs, which is great for reducing risk from model-originating harms like deception.

The flipside is that it increases misuse potential.

This is yet another way the safety gap could widen between closed-weight models with locked-down controls and open-weight models.

Imbue (Generally Intelligent) continue to make progress

Nathan Helm-Burger

I've been following the company Imbue and their podcast Generally Intelligent since they started. They've said thoughtful and creative things in their podcast, and I think they are making impressive progress towards AGI considering their relatively small size.

Just wanting to keep people apprised. If they did hit on something unusually potent, would they get acquired by a larger actor? Hard to know.

What is it that they're releasing? In addition to their 70B param model...

- 11 sanitized and extended NLP reasoning benchmarks including ARC, GSM8K, HellaSwag, and Social IQa
- An original code-focused reasoning benchmark
- A new dataset of 450,000 human judgments about ambiguity in NLP questions
- A hyperparameter optimizer for scaling small experiments to a 70B run
- Infrastructure

Secret US natsec project with Intel revealed

Nathan Helm-Burger

Unclear how relevant this news is to AI safety, but it seems like the sort of thing we ought to notice.

A backroom Washington deal brokered two years ago is undercutting a key part of President Joe Biden’s policy to grow the national high-tech manufacturing base — pushing more than $3 billion into a secretive national-security project promoted by chipmaker Intel.
In recent weeks, Biden and Senate Majority Leader Chuck Schumer have been taking victory laps for the 2022 CHIPS and Science Act, a law intended to create jobs and fund innovation in a key global industry. It has already launched a series of grants, incentives and research proposals to help America regain its

... (read more)


Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

A path to human autonomy

Secret US natsec project with intel revealed

What more compute does for brain-like models: response to Rohin
