
Steven Byrnes — LessWrong

If only there was a handy-dandy lesswrong blog-publishing checklist that included “update the link preview image” as one of its items.

Steven Byrnes, 2d*, Quick Take

FYI, if anyone read my post “The nature of LLM algorithmic progress” last week, it’s now a heavily-revised version 2.

Steven Byrnes, 3d

FWIW, inspired by Justis, I’ve been keeping up a list of things that I could usefully automate with Claude Code (or similar) for my own personal productivity, adding to the list every time something pops into my head. I’ve been adding to the list for the past three weeks. But so far it’s a very underwhelming list! Here’s ~the whole thing:

Custom interface for composing tweet-threads, including their funny formula for counting characters (I have some complaints about the built-in twitter one, e.g. I usually also post them onto bluesky)
Jeff’s “clipboard normalizer” (but I have a PC not Mac)
…And something similar for clipboard conversion from simple HTML into the abstruse “typst” format that

Steven Byrnes, 3d

Social drives 1: “Sympathy Reward”, from compassion to dehumanization

Thanks! I appreciate the brainstorming here.

it feels like you are adding more gears to your model as you go. … I'm unable to tell if this is all coming from a consistent model or is more something plausible you suggest could work.

I am acutely aware of the risk of post-hoc storytelling instead of principled postdiction :) I think I'm pretty good at doing principled postdiction rather than post-hoc storytelling (although maybe everybody thinks that about themselves), but I’m certainly capable of the latter, especially when I’m just brainstorming and haven't stewed on something for months or years. E.g. much of my previous comment was early-stage low-confidence brainstorming, I hope I made that... (read 849 more words →)

Replying to The nature of LLM algorithmic progress (v2)

Steven Byrnes, 3d

The nature of LLM algorithmic progress (v2)

That’s helpful, thanks! The new version 2 has a rewritten optimization section, hope it’s better now.

Replying to The nature of LLM algorithmic progress (v2)

Steven Byrnes, 3d

The nature of LLM algorithmic progress (v2)

Thanks for your feedback; I incorporated some of it in my rewrite (it’s now version 2). In particular, I appreciate the data showing FLOP utilization staying (roughly) constant, and the idea that there’s a red-queen race against communication overhead etc. And I added some of those examples from DeepSeek & Kimi in the appropriate sections. Thanks!

…But I do want to push back on your suggestion that your HellaSwag plot implies what you think it implies.

The hypothesis that Gopher is better than the other two mainly because of better training data seems like a totally viable hypothesis to me. For example, Gopher trained on 20× more books, presumably due to Google’s mountain of

Steven Byrnes, 4d

Are there lessons from high-reliability engineering for AGI safety?

(Thanks!) To me, your comment is like: “We have a great plan for robust engineering of vehicles (as long as they are on a dry, warm, indoor track, going under 10kph).” OK that’s better than nothing. But if we are eventually going to be driving cars at high speed in the cold rain, it’s inadequate. We did not test or engineer them in the right environment.

This is not a complex systems objection (e.g., it’s not about how the world changes with billions of cars). It’s a distribution shift objection. Even just one car will fail at high speed in the cold rain under those circumstances.

If there’s a distribution shift (test environment... (read more)

Replying to In (highly contingent!) defense of interpretability-in-the-loop ML training

Steven Byrnes, 4d

In (highly contingent!) defense of interpretability-in-the-loop ML training

I’m not making any claims about what the “interpretability” system is. It can be any system whatsoever whose input is activations and whose output is one or more numbers. The “system” could be a linear probe. Or the “system” could be a team of human researchers who pause the model after every forward pass, scrutinize the activation state for a week, and then output a “this activation state represents scheming” score from 0 to 10. (That’s not a practical example, because if you pause for a week on each forward pass then the training would take a zillion years. But in principle, sure!) Or the “system” could be something even more exotic... (read more)
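As a rough illustration of the simplest case mentioned above (a linear probe whose output feeds into the loss), here is a minimal sketch. Everything in it is an assumption made for illustration: the function names, the sigmoid squashing, and the idea of adding the probe score to the task loss with a fixed coefficient are not taken from the post, and a real setup would use an ML framework rather than plain Python lists.

```python
import math

def probe_score(activations, weights, bias):
    """Hypothetical linear probe: activation vector in, one number in (0, 1) out."""
    logit = sum(a * w for a, w in zip(activations, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def total_loss(task_loss, activations, weights, bias, penalty_coeff):
    """Task loss plus a penalty proportional to the probe's score.

    This is the 'in-the-loop' part: the interpretability output
    becomes a term in the quantity being optimized.
    """
    return task_loss + penalty_coeff * probe_score(activations, weights, bias)

# Toy usage: an all-zero probe gives a logit of 0, so the score is 0.5.
acts = [0.5, -1.0, 2.0]
print(total_loss(1.0, acts, [0.0, 0.0, 0.0], 0.0, 0.1))
```

The same shape applies to any other "system" in the sense of the comment: only `probe_score` would change, as long as it still maps activations to one or more numbers.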

Steven Byrnes, 4d

One thing you can maybe do is throw such accusations right back: “You say I’m being closed-minded to you, but aren’t you equally being closed-minded to me?”

It comes across as escalatory, and might be counterproductive, but I’ve also sometimes found it helpful. Depends a lot on the person and situation.

Replying to The nature of LLM algorithmic progress (v2)

Steven Byrnes, 5d

The nature of LLM algorithmic progress (v2)

Thanks!! Quick question while I think over the rest:

What data are you plotting? Where exactly did you get it (i.e., what references)?

And why is the 2021 one better than the 2023 ones? Normally we would expect the other way around, right? Does DeepMind have so much secret sauce that it’s worth more than 2 years of public knowledge? Or are the other two groups making rookie mistakes? Or am I misunderstanding the plot?

In (highly contingent!) defense of interpretability-in-the-loop ML training

Steven Byrnes

Let’s call “interpretability-in-the-loop training” the idea of running a learning algorithm that involves an inscrutable trained model, and there’s some kind of interpretability system feeding into the loss function / reward function.

Interpretability-in-the-loop training has a very bad rap (and rightly so). Here’s Yudkowsky 2022:

When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect. Optimizing against an interpreted thought optimizes against interpretability.

Or Zvi 2025:

The Most Forbidden Technique is training an AI using interpretability techniques.
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is

... (read 824 more words →)

The nature of LLM algorithmic progress (v2)

Steven Byrnes

(Heavily revised on Feb. 9, 2026—see changelog at the bottom.)

There’s a lot of talk about “algorithmic progress” in LLMs, especially in the context of exponentially-improving algorithmic efficiency. For example:

Epoch AI: “[training] compute required to reach a set performance threshold has halved approximately every 8 months”.
Dario Amodei 2025: “I'd guess the number today is maybe ~4x/year”.
Gundlach et al. 2025a “Price of Progress”: “Isolating out open models to control for competition effects and dividing by hardware price declines, we estimate that algorithmic efficiency progress is around 3× per year”.

It’s nice to see three independent sources reach almost exactly the same conclusion—halving times of 8 months, 6 months, and 7½ months respectively. Surely a sign... (read 3724 more words →)
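For readers who want to check the halving times quoted above, the conversion between an annual efficiency multiplier and a halving time can be sketched as follows. This assumes steady exponential improvement, and the function names are invented here for illustration rather than taken from the post.

```python
import math

def halving_time_months(annual_factor: float) -> float:
    """Months for required compute to halve, given an efficiency gain of annual_factor per year."""
    return 12.0 * math.log(2) / math.log(annual_factor)

def annual_factor_from_halving(halving_months: float) -> float:
    """Inverse conversion: yearly efficiency multiplier implied by a halving time in months."""
    return 2.0 ** (12.0 / halving_months)

for label, factor in [("~4x/year", 4.0), ("~3x/year", 3.0)]:
    print(f"{label}: halving time = {halving_time_months(factor):.1f} months")
# ~4x/year: halving time = 6.0 months
# ~3x/year: halving time = 7.6 months

print(f"8-month halving = {annual_factor_from_halving(8.0):.2f}x/year")
# 8-month halving = 2.83x/year
```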


Are there lessons from high-reliability engineering for AGI safety?

Steven Byrnes

11d

This post is partly a belated response to Joshua Achiam, currently OpenAI’s Head of Mission Alignment:

If we adopt safety best practices that are common in other professional engineering fields, we'll get there … I consider myself one of the x-risk people, though I agree that most of them would reject my view on how to prevent it. I think the wholesale rejection of safety best practices from other fields is one of the dumbest mistakes that a group of otherwise very smart people has ever made. —Joshua Achiam on Twitter, 2021
“We just have to sit down and actually write a damn specification, even if it's like pulling teeth. It's the most important

... (read 2236 more words →)

New version of “Intro to Brain-Like-AGI Safety”

Steven Byrnes

21d

A new version of “Intro to Brain-Like-AGI Safety” is out!

Things that have not changed

Same links as before:

As a series of 15 blog posts on LessWrong / Alignment Forum: https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8
As a 225-page PDF (now up to version 3): https://osf.io/preprints/osf/fe36n
Summary video: Video & transcript: Challenges for Safe & Beneficial Brain-Like AGI

…And same abstract as before:

Suppose we someday build an Artificial General Intelligence algorithm using similar principles of learning and cognition as the human brain. How would we use such an algorithm safely?
I will argue that this is an open technical problem, and my goal in this post series is to bring readers with no prior knowledge all the way up to the front-line of

... (read 5581 more words →)

For anyone who read my “Intuitive Self-Models” series (2024): I’m planning to revise it by replacing the term “homunculus” with “active self” wherever it appears. I just think “active self” is a better term for the specific thing I’m trying to talk about there. If anyone disagrees, here’s your chance to speak up. :)

My AGI safety research—2025 review, ’26 plans

Steven Byrnes

2mo

Previous: 2024, 2022

“Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.” –attributed to DL Moody

1. Background & threat model

The main threat model I’m working to address is the same as it’s been since I was hobby-blogging about AGI safety in 2019. Basically, I think that:

The “secret sauce” of human intelligence is a big uniform-ish learning algorithm centered around the cortex;
This learning algorithm is different from and more powerful than LLMs;
Nobody knows how it works today;
Someone someday will either reverse-engineer this learning algorithm, or reinvent something similar;
And then we’ll have Artificial General Intelligence (AGI) and superintelligence (ASI).

I think that, when this learning algorithm is understood,... (read 3492 more words →)


Reward Function Design: a starter pack

Steven Byrnes

2mo

In the companion post We need a field of Reward Function Design, I implore researchers to think about what RL reward functions (if any) will lead to RL agents that are not ruthless power-seeking consequentialists. And I further suggested that human social instincts constitute an intriguing example we should study, since they seem to be an existence proof that such reward functions exist. So what is the general principle of Reward Function Design that underlies the non-ruthless (“ruthful”??) properties of human social instincts? And whatever that general principle is, can we apply it to future RL agent AGIs?

I don’t have all the answers, but I think I’ve made some progress, and the... (read 4556 more words →)

We need a field of Reward Function Design

Steven Byrnes

2mo

(Brief pitch for a general audience, based on a 5-minute talk I gave.)

Let’s talk about Reinforcement Learning (RL) agents as a possible path to Artificial General Intelligence (AGI)

My research focuses on “RL agents”, broadly construed. These were big in the 2010s—they made the news for learning to play Atari games, and Go, at superhuman level. Then LLMs came along in the 2020s, and everyone kinda forgot that RL agents existed. But I’m part of a small group of researchers who still thinks that the field will pivot back to RL agents, one of these days. (Others in this category include Yann LeCun and Rich Sutton & David Silver.)

Why do I think that?... (read 1235 more words →)


6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

Steven Byrnes

2mo

Tl;dr

AI alignment has a culture clash. On one side, the “technical-alignment-is-hard” / “rational agents” school-of-thought argues that we should expect future powerful AIs to be power-seeking ruthless consequentialists. On the other side, people observe that both humans and LLMs are obviously capable of behaving like, well, not that. The latter group accuses the former of head-in-the-clouds abstract theorizing gone off the rails, while the former accuses the latter of mindlessly assuming that the future will always be the same as the present, rather than trying to understand things. “Alas, the power-seeking ruthless consequentialist AIs are still coming,” sigh the former. “Just you wait.”

As it happens, I’m basically in that “alas, just you wait” camp,... (read 4808 more words →)


Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking

Steven Byrnes

3mo

(Follow-up to Social drives 1: “Sympathy Reward”, from compassion to dehumanization, but this post is self-contained.)

1. Intro & summary

1.1 Background

In Intro to Brain-Like-AGI Safety (2022), I argued: (1) We should view the brain as having a reinforcement learning (RL) reward function, which says that pain is bad, eating-when-hungry is good, and dozens of other things (sometimes called “innate drives” or “primary rewards”); and (2) Reverse-engineering human social innate drives in particular would be a great idea—not only would it help explain human personality, mental health, morality, and more, but it might also yield useful tools and insights for the technical alignment problem for Artificial General Intelligence.

Then in Neuroscience of human social instincts: a sketch... (read 5003 more words →)

Social drives 1: “Sympathy Reward”, from compassion to dehumanization

Steven Byrnes

3mo

1. Intro & summary

1.1 Background

In Intro to Brain-Like-AGI Safety (2022), I argued: (1) We should view the brain as having a reinforcement learning (RL) reward function, which says that pain is bad, eating-when-hungry is good, and dozens of other things (sometimes called “innate drives” or “primary rewards”); and (2) Reverse-engineering human social innate drives in particular would be a great idea—not only would it help explain human personality, mental health, morality, and more, but it might also yield useful tools and insights for the technical alignment problem for Artificial General Intelligence.

Then in Neuroscience of human social instincts: a sketch (2024), I worked towards that goal of reverse-engineering human social drives, by proposing what I... (read 3733 more words →)

Quick book review of "If Anyone Builds It, Everyone Dies" (cross-post from X/twitter & bluesky):

Just read the new book If Anyone Builds It, Everyone Dies. Upshot: Recommended! I ~90% agree with it.

The authors argue that people are trying to build ASI (superintelligent AI), and we should expect them to succeed sooner or later, even if they obviously haven’t succeeded YET. I agree. (I lean “later” more than the authors, but that’s a minor disagreement.)

Ultra-fast minds that can do superhuman-quality thinking at 10,000 times the speed, that do not age and die, that make copies of their most successful representatives, that have been refined by billions of trials into unhuman kinds of thinking

... (read 770 more words →)

Fun fact: AI-2027 estimates that getting to ASI might take the equivalent of a 100-person team of top human AI research talent working for tens of thousands of years.

I’m curious why ASI would take so much work. What exactly is the R&D labor supposed to be doing each day, that adds up to so much effort? I’m curious how people are thinking about that, if they buy into this kind of picture. Thanks :)

(Calculation details: For example, in October 2027 of the AI-2027 modal scenario, they have “330K superhuman AI researcher copies thinking at 57x human speed”, which is 1.6 million person-years of research in that month alone. And that’s mostly going towards inventing... (read more)
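The arithmetic behind the 1.6 million figure can be reproduced in a few lines. This is a sketch under one simplifying assumption that is mine rather than the scenario's: each AI copy counts as one human researcher per unit of subjective time. The variable names are illustrative.

```python
copies = 330_000   # superhuman AI researcher copies (from the quoted scenario)
speedup = 57       # thinking speed relative to a human (from the quoted scenario)
months = 1         # the single month being considered

# Each copy does `speedup` person-months of work per calendar month.
person_years = copies * speedup * months / 12
print(f"{person_years:,.0f} person-years")           # 1,567,500 person-years

# Expressed as calendar years of work for a 100-person human team.
team_size = 100
print(f"{person_years / team_size:,.0f} team-years")  # 15,675 team-years
```

So that one month on its own corresponds to roughly 15,700 years of work for a 100-person team.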

No question that e.g. o3 lying and cheating is bad, but I’m confused why everyone is calling it “reward hacking”.

Let’s define “reward hacking” (a.k.a. specification gaming) as “getting a high RL reward via strategies that were not desired by whoever set up the RL reward”. Right?

If so, well, all these examples on X etc. are from deployment, not training. And there’s no RL reward at all in deployment. (Fine print: Maybe there are occasional A/B tests or thumbs-up/down ratings in deployment, but I don’t think those have anything to do with why o3 lies and cheats.) So that’s the first problem.

Now, it’s possible that, during o3’s RL CoT post-training, it got certain... (read more)

In [Intro to brain-like-AGI safety] 10. The alignment problem and elsewhere, I’ve been using “outer alignment” and “inner alignment” in a model-based actor-critic RL context to refer to:

“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.

For some reason it took me until now to notice that:

my “outer misalignment” is more-or-less synonymous with “specification gaming”,
my “inner misalignment” is more-or-less synonymous with “goal misgeneralization”.

(I’ve been regularly using all four terms for years … I just hadn’t explicitly considered how they related... (read more)

I’m intrigued by the reports (including but not limited to the Martin 2020 “PNSE” paper) that people can “become enlightened” and have a radically different sense of self, agency, etc.; but friends and family don’t notice them behaving radically differently, or even differently at all. I’m trying to find sources on whether this is true, and if so, what’s the deal. I’m especially interested in behaviors that (naïvely) seem to centrally involve one’s self-image, such as “applying willpower” or “wanting to impress someone”. Specifically, if there’s a person whose sense-of-self has dissolved / merged into the universe / whatever, and they nevertheless enact behaviors that onlookers would conventionally put into one of... (read more)

I went through and updated my 2022 “Intro to Brain-Like AGI Safety” series. If you already read it, no need to do so again, but in case you’re curious for details, I put changelogs at the bottom of each post. For a shorter summary of major changes, see this twitter thread, which I copy below (without the screenshots & links):

I’ve learned a few things since writing “Intro to Brain-Like AGI safety” in 2022, so I went through and updated it! Each post has a changelog at the bottom if you’re curious. Most changes were in one of the following categories: (1/7)
REDISTRICTING! As I previously posted ↓, I booted the pallidum out of the

... (read more)

I think there’s a connection between (A) a common misconception in thinking about future AI (that it’s not a huge deal if it’s “only” about as good as humans at most things), and (B) a common misconception in economics (the “Lump Of Labor Fallacy”).

So I started writing a blog post elaborating on that, but got stuck because my imaginary reader is not an economist and kept raising objections that amounted to saying “yeah but the Lump Of Labor Fallacy isn’t actually a fallacy, there really is a lump of labor” 🤦

Anyway, it’s bad pedagogy to explain a possibly-unintuitive thing by relating it to a different possibly-unintuitive thing. Oh well. (I might still try again to finish writing it at some point.)

Some ultra-short book reviews on cognitive neuroscience

On Intelligence by Jeff Hawkins & Sandra Blakeslee (2004)—very good. Focused on the neocortex - thalamus - hippocampus system, how it's arranged, what computations it's doing, what's the relation between the hippocampus and neocortex, etc. More on Jeff Hawkins's more recent work here.
I am a strange loop by Hofstadter (2007)—I dunno, I didn't feel like I got very much out of it, although it's possible that I had already internalized some of the ideas from other sources. I mostly agreed with what he said. I probably got more out of watching Hofstadter give a little lecture on analogical reasoning (example) than from this whole book.
Consciousness and

... (read more)

Steven Byrnes

[Intro to brain-like-AGI safety] 1. What's the problem & Why work on it now?

Four ways learning Econ makes people dumber re: future AI

[Intuitive self-models] 1. Preliminaries

Foom & Doom 1: “Brain in a box in a basement”

Steven Byrnes

In (highly contingent!) defense of interpretability-in-the-loop ML training

The nature of LLM algorithmic progress (v2)

Are there lessons from high-reliability engineering for AGI safety?

New version of “Intro to Brain-Like-AGI Safety”

My AGI safety research—2025 review, ’26 plans

Reward Function Design: a starter pack

We need a field of Reward Function Design

Intuitive Self-Models

Valence

Intro to Brain-Like-AGI Safety

