Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, LinkedIn, and more at my website.

Sequences

Intuitive Self-Models
Valence
Intro to Brain-Like-AGI Safety

Comments

Thanks! …But I think you misunderstood.

Suppose I tell an AI:

Hello AI. Here is a bank account with $100K of seed capital. Go make money. I’ll press the reward button if I can successfully withdraw $1B from that same bank account in the future. (But I’ll wait 1 year between withdrawing the funds and pressing the reward button, during which I’ll perform due diligence to check for law-breaking or any other funny business. And the definition of ‘funny business’ will be at my sole discretion, so you should check with me in advance if you’re unsure where I will draw the line.) Good luck!

That’s full 100% automation, not 90%, right?

“Making $1B” is one example project for concreteness, but the same idea could apply to writing code, inventing technology, or whatever else. If the human can’t ever tell whether the AI succeeded or failed at the project, then that’s a very unusual project, and certainly not a project that results in making money or impressing investors etc. And if the human can tell, then they can press the reward button when they’re sure.

Normal people can tell that their umbrella keeps them dry without knowing anything about umbrella production. Normal people can tell whether their smartphone apps are working well without knowing anything about app development and debugging. Etc.

And then I’m claiming that this kind of strategy will “work” until the AI is sufficiently competent to grab the reward button and start building defenses around it etc.
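
To make the setup concrete, here is a minimal sketch of that kind of outer loop (purely illustrative; every object and method name here is hypothetical, not a real API): the AI acts autonomously for the entire project, and the only training signal is a single sparse reward that the human issues after doing their own due diligence.

```python
# Illustrative sketch only: a single long-horizon episode where the reward is a
# sparse, human-issued signal delivered after the human verifies the outcome.
# All objects (agent, environment, human) are hypothetical stand-ins.

def run_project(agent, environment, human) -> float:
    observation = environment.reset(seed_capital=100_000)

    # Full automation: the AI acts on its own judgment, with no per-step reward.
    while not environment.project_finished():
        action = agent.act(observation)
        observation = environment.step(action)

    # The human checks the easily-observable outcome (e.g. “can I withdraw $1B?”),
    # then waits out a due-diligence period looking for law-breaking or other
    # “funny business” before deciding whether to press the button.
    outcome_ok = human.verify_outcome(environment)
    clean = human.due_diligence(environment, duration_days=365)

    reward = 1.0 if (outcome_ok and clean) else 0.0
    agent.update(reward)  # the one and only reward signal for the whole project
    return reward
```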

Thanks!

RE 2 – I was referring here to (what I call) “brain-like AGI”, a yet-to-be-invented AI paradigm in which both “human-like ability to reason” and “human-like social and moral instincts / reflexes” are human-like in a nuts-and-bolts sense, i.e. the AI is actually doing the same kinds of algorithmic steps that a human brain would do. Human brains are quite different from LLMs, even if their text outputs can look similar. For example, small groups of humans can invent grammatical languages from scratch, and of course historically humans invented science and tech and philosophy and so on from scratch. So bringing up “training data” is kinda the wrong idea. Indeed, for humans (and brain-like AGI), we should be talking about “training environments”, not “training data”—more like the RL agent paradigm of the late 2010s than like LLMs, at least in some ways.

I do agree that we shouldn’t trust LLMs to make good philosophical progress that goes way beyond what’s already in their human-created training data.

RE 1 – Let’s talk about feelings of friendship, compassion, and connection. These feelings are unnecessary for cooperation, right? Logical analysis of the costs vs benefits of cooperation, including decision theory, reputational consequences, etc., is all you need for cooperation to happen. (See §2-3 here.) But for me and almost anyone, a future universe with no feelings of friendship, compassion, and connection in it seems like a bad thing that I don’t want to happen. I find it hard to believe that sufficient reflection would change my opinion on that [although I have some niggling concerns about technological progress]. “Selfishness” isn’t even a coherent concept unless the agent intrinsically wants something, and innate drives are upstream of what it wants, and those feelings of friendship, compassion etc. can be one of those innate drives, potentially a very strong one.

Then, yes, there’s a further question about whom those feelings will be directed towards—AIs, humans, animals, teddy bears, or what? That targeting needs to start with innate reflexes to certain stimuli, which then get smoothed out upon reflection. I think this is something where we’d need to think very carefully about the AI design, training environment, etc. It might be easier to think through what could go right and wrong after developing a better understanding of human social instincts, and developing a more specific plan for the AI. Anyway, I certainly agree that there are important potential failure modes in this area.

RE 3 – I’m confused about how this plan would be betting the universe on a particular meta-ethical view, from your perspective. (I’m not expressing doubt, I’m just struggling to see things from outside my own viewpoint here.)

By the way, my perspective again is “this might be the least-bad plausible plan”, as opposed to “this is a great plan”. But to defend the “least-bad” claim, I would need to go through all the other possible plans that seem better, and why I don’t find them plausible (on my idiosyncratic models of AGI development). I mentioned my skepticism about AGI pause / anti-proliferation above, and I have a (hopefully) forthcoming post that should talk about how I’m thinking about corrigibility and other stuff. (But I’m also happy to chat about it here; it would probably help me flesh out that forthcoming post tbh.)

However: I expect that AIs capable of causing a loss of control scenario, at least, would also be capable of top-human-level alignment research.

Hmm. A million fast-thinking Stalin-level AGIs would probably have a better shot at taking control than at doing alignment research, I think?

Also, if there’s an alignment tax (or control tax), then that impacts the comparison, since the AIs doing alignment research are paying that tax whereas the AIs attempting takeover are not. (I think people have wildly different intuitions about how steep the alignment tax will be, so this might or might not be important. E.g. imagine a scenario where FOOM is possible but too dangerous for humans to allow. If so, that would be an astronomical alignment tax!)

we don’t tend to imagine humans directly building superintelligence

Speak for yourself! Humans directly built AlphaZero, which is a superintelligence for board game worlds. So I don’t think it’s out of the question that humans could directly build a superintelligence for the real world. I think that’s my main guess, actually?

(Obviously, the humans would be “directly” building a learning algorithm etc., and then the trained weights come out of that.)

(OK sure, the humans will use AI coding assistants. But I think AI coding assistants, at least of the sort that exist today, aren’t fundamentally changing the picture, but rather belong in the same category as IDEs and PyTorch and other such mundane productivity-enhancers.)

(You said “don’t tend to”, which is valid. My model here [AI paradigm shift → superintelligence very quickly and with little compute] does seem pretty unusual with respect to today’s alignment community zeitgeist.)

You could think, for example, that almost all of the core challenge of aligning a superintelligence is contained in the challenge of safely automating top-human-level alignment research. I’m skeptical, though. In particular: I expect superintelligent-level capabilities to create a bunch of distinctive challenges.

I’m curious if you could name some examples, from your perspective?

I’m just curious. I don’t think this is too cruxy—I think the cruxy-er part is how hard it is to safely automate top-human-level alignment research, not whether there are further difficulties after that.

…Well, actually, I’m not so sure. I feel like I’m confused about what “safely automating top-human-level alignment research” actually means. You say that it’s less than “handoff”. But if humans are still a required part of the ongoing process, then it’s not really “automation”, right? And likewise, if humans are a required part of the process, then an alignment MVP is insufficient for the ability to turn tons of compute into tons of alignment research really fast, which you seem to need for your argument.

You also talk elsewhere about “performing more limited tasks aimed at shorter-term targets”, which seems to directly contradict “performs all the cognitive tasks involved in alignment research at or above the level of top human experts”, since one such cognitive task is “making sure that all the pieces are coming together into a coherent viable plan”. Right?

Honestly I’m mildly concerned that an unintentional shell game might be going on regarding which alignment work is happening before vs. after the alignment MVP.

Relatedly, this sentence seems like an important crux where I disagree: “I am cautiously optimistic that for building an alignment MVP, major conceptual advances that can’t be evaluated via their empirical predictions are not required.” But again, that might be because I’m envisioning a more capable alignment MVP than you are.

Thanks for writing this! Leaving some comments with reactions as I was reading, not all very confident, and sorry if I missed or misunderstood things you wrote.

Problems with these evaluation techniques can arise in attempting to automate all sorts of domains (I’m particularly interested in comparisons with (a) capabilities research, and (b) other STEM fields). And I think this should be a source of comfort. In particular: these sorts of problems can slow down the automation of capabilities research, too. And to the extent they’re a bottleneck on all sorts of economically valuable automation, we should expect lots of effort to go towards resolving them. … [then more discussion in §6.1]

This feels wrong to me. I feel like “the human must evaluate the output, and doing so is hard” is more of an edge case, applicable to things like “designs for a bridge”, where failure is far away and catastrophic. (And applicable to alignment research, of course.)

Like you mention today’s “reward-hacking” (e.g. o3 deleting unit tests instead of fixing the code) as evidence that evaluation is necessary. But that’s a bad example because the reward-hacked code doesn’t actually work! And people notice that it doesn’t work. If the code worked flawlessly, then people wouldn’t be talking about reward-hacking as if it’s a bad thing. People notice eventually, and that constitutes an evaluation. Likewise, if you hire a lousy head of marketing, then you’ll eventually notice the lack of new customers; if you hire a lousy CTO, then you’ll eventually notice that your website doesn’t work; etc.

OK, you anticipate this reply and then respond with: “…And even if these tasks can be evaluated via more quantitative metrics in the longer-term (e.g., “did this business strategy make money?”), trying to train on these very long-horizon reward signals poses a number of distinctive challenges (e.g., it can take a lot of serial time, long-horizon data points can be scarce, etc).”

But I don’t buy that because, like, humans went to the moon. That was a long-horizon task but humans did not need to train on it, rather they did it with the same brains we’ve been using for millennia. It did require long-horizon goals. But (1) If AI is unable to pursue long-horizon goals, then I don’t think it’s adequate to be an alignment MVP (you address this in §9.1 & here, but I’m more pessimistic, see here & here), (2) If the AI is able to pursue long-horizon goals, then the goal of “the human eventually approves / presses the reward button” is an obvious and easily-trainable approach that will be adequate for capabilities, science, and unprecedented profits (but not alignment), right up until catastrophe. (Bit more discussion here.)

((1) might be related to my other comment, maybe I’m envisioning a more competent “alignment MVP” than you?)

I think that OP’s discussion of “number-go-up vs normal science vs conceptual research” is an unnecessary distraction, and he should have cut that part and just talked directly about the spectrum from “easy-to-verify progress” to “hard-to-verify progress”, which is what actually matters in context.

Partly copying from §1.4 here, you can (A) judge ideas via new external evidence, and/or (B) judge ideas via internal discernment of plausibility, elegance, self-consistency, consistency with already-existing knowledge and observations, etc. There’s a big range in people’s ability to apply (B) to figure things out. But what happens in “normal” sciences like biology is that there are people with a lot of (B), and they can figure out what’s going on, on the basis of hints and indirect evidence. Others can’t. The former group can gather ever-more-direct and ever-more-unassailable (A)-type evidence over time, and use that evidence as a cudgel with which to beat the latter group over the head until they finally get it. (“If you don’t believe my 7 independent lines of evidence for plate tectonics, OK fine I’ll go to the mid-Atlantic ridge and gather even more lines of evidence…”)

This is an important social tool, and explains why bad scientific ideas can die, while bad philosophy ideas live forever. And it’s even worse than that—if the bad philosophy ideas don’t die, then there’s no common knowledge that the bad philosophers are bad, and then they can rise in the ranks and hire other bad philosophers etc. Basically, to a first approximation, I think humans and human institutions are not really up to the task of making intellectual progress systematically over time, except where idiot-proof verification exists for that intellectual progress (for an appropriate definition of “idiot”, and with some other caveats).

…Anyway, AFAICT, OP is just claiming that AI alignment research involves both easy-to-verify progress and hard-to-verify progress, which seems uncontroversial.

Thanks!

Hmm. I think there’s an easy short-term ‘solution’ to Goodhart’s law for AI capabilities, which is to give humans a reward button. Then the reward function rewards exactly what the person wants, by definition (until the AI can grab the button). There’s no need to define metrics or whatever, right?

(This is mildly related to RLHF, except that RLHF makes models dumber, whereas I’m imagining some future RL paradigm wherein the RL training makes models smarter.)

I think your complaint is that people would be bad at pressing the button, even by their own lights. They’ll press the button upon seeing a plausible-sounding plan that flatters their ego, and then they’ll regret that they pressed it when the plan doesn’t actually work. This will keep happening, until the humans are cursing the button and throwing it out.

But there’s an obvious (short-term) workaround to that problem, which is to tell the humans not to press the reward button until they’re really sure that they won’t later regret it, because they see that the plan really worked. (Really, you don’t even have to tell them that, they’d quickly figure that out for themselves.) (Alternatively, make an “undo” option such that when the person regrets having pressed the button, they can roll back whatever weight changes came from pressing it.) This workaround will make the rewards more sparse, and thus it’s only an option if the AI can maximize sparse rewards. But I think we’re bound to get AIs that can maximize sparse rewards, on the road to AGI.
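
A minimal sketch of that “undo” option, assuming the reward-driven weight update can be checkpointed and restored (the method names are hypothetical; state_dict / load_state_dict are assumed PyTorch-style accessors):

```python
import copy

# Sketch of the “undo” workaround: checkpoint the weights right before applying
# the update triggered by a button press, so the human can roll that update back
# if they later regret having pressed the button. Method names are hypothetical.

class RewardButtonWithUndo:
    def __init__(self, model):
        self.model = model
        self._checkpoint = None  # weights as they were before the last press

    def press(self) -> None:
        # Save the pre-press weights, then apply the reward-driven update.
        self._checkpoint = copy.deepcopy(self.model.state_dict())
        self.model.apply_reward_update(reward=1.0)  # hypothetical RL update step

    def undo(self) -> None:
        # The human regrets the press: restore the pre-press weights.
        if self._checkpoint is not None:
            self.model.load_state_dict(self._checkpoint)
            self._checkpoint = None
```

(Rolling back the weights only removes the training signal, of course; it doesn’t undo anything the AI already did in the world.)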

If the person never regrets pressing the button, not even in hindsight, then you have an AI product that will be highly profitable in the short term. You can have it apply for human jobs, found companies, etc.

… …Then I have this other theory that maybe everything I just wrote here is moot, because once someone figures out the secret sauce of AGI, it will be so easy to make powerful misaligned superintelligence that this will happen very quickly and with no time or need to generate profit from the intermediate artifacts. That’s an unpopular opinion these days and I won’t defend it here. (I’ve been mulling it over in the context of a possible forthcoming post.) Just putting my cards on the table.

https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ

That was a very good constructive comment btw, sorry I forgot to reply to it earlier but I just did.

Thanks! Your comment was very valuable, and helped spur me to write Self-dialogue: Do behaviorist rewards make scheming AGIs? (as I mention in that post). Sorry I forgot to reply to your comment directly.

To add on a bit to what I wrote in that post, and reply more directly…

So far, you’ve proposed that we can do brain-like AGI, and the reward function will be a learned function trained by labeled data. The data, in turn, will be “lots of synthetic data that always shows the AI acting aligned even when the human behaves badly, as well as synthetic data to make misaligned agents reveal themselves safely, and in particular it's done early in the training run, before it can try to deceive or manipulate us.” Right so far?

That might or might not make sense for an LLM. I don’t think it makes sense for brain-like AGI.

In particular, the reward function is just looking at what the model does, not what its motivations are. “Synthetic data to make misaligned agents reveal themselves safely” doesn’t seem to make sense in that context. If the reward function incentivizes the AI to say “I’m blowing the whistle on myself! I was out to get you!” then the AI will probably start saying that, even if it’s false. (Or if the reward function incentivizes the AI to only whistle-blow on itself if it has proof of its own shenanigans, then the AI will probably keep generating such proof and then showing it to us. …And meanwhile, it might also be doing other shenanigans in secret.)

Or consider trying to incentivize the AGI to honestly report its intentions. Honesty there is defined by the degree of match or mismatch between the AI’s inscrutable internal intentions and its text outputs. How do you train and apply the reward model to detect that?

More broadly, I think there are always gonna be ways for the AI to be sneaky without the reward function noticing, which leads to a version of “playing the training game”. And almost no matter what the detailed “training game” is, an AI can do a better job on it by secretly creating a modified copy to self-reproduce around the internet and gain maximal money and power everywhere else on Earth, if that’s possible to do without getting caught. So that’s pretty egregious scheming.

“Alignment generalizes further than capabilities” is kinda a different issue—it’s talking about the more “dignified” failure mode of generalizing poorly to new situations and environments, whereas I’m claiming that we don’t even have a solution to egregious scheming within an already-existing test environment.

Again, more in that post. Sorry if I’m misunderstanding, and happy to keep chatting!

(1) Yeah AI self-modification is an important special case of irreversible actions, where I think we both agree that (mis)generalization from the reward history is very important. (2) Yeah I think we both agree that it’s hopeless to come up with a reward function for judging AI behavior as good vs bad, that we can rely on all the way to ASI.
