It’s possible that “imitation learning will not generalize sufficiently well OOD” is an unsolvable problem, right? (In fact, my belief is that it’s unsolvable, at least in practice, if we include “humans learning new things over the course of years” as part of the definition of what constitutes successful OOD generalization.)
But if it is an unsolvable problem, it would not follow that “models will never gain the ability to generalize OOD”, nor would it follow that AGI will never be very powerful and scary.
Rather, it would follow that imitation learning models will never gain the ability to generalize OOD—but non-imitation-learning models are still allowed to generalize OOD just fine!
And it would follow that imitation learning models will not be powerful scary AGIs—but there will still be powerful scary AGIs; they just won’t be based on imitation learning.
For example, suppose that no human had ever played Go. Imitation learning would be a very doomed way to make a Go-playing AI, right? But we could still make AlphaZero, which does not involve imitation learning, and it works great.
Or better yet, suppose that no intelligent language-using animal has ever existed in the universe. Then imitation learning would be even more doomed. There’s nothing to imitate! But a well-chosen non-imitation-learning algorithm could still autonomously invent language and science and technology from scratch. We know this to be the case, because after all, that was the situation that our hominid ancestors were in.
See what I mean? Sorry if we’re talking past each other somehow.
I’m confused by your response. What do you mean by “other systems”?
The only thing I can think of is that you might be trying to say:
If that’s what you’re thinking, then I disagree with (2). Yes, it’s possible to make an AGI that can learn, grow, and figure things out over weeks and months and years, but such an AGI algorithm need not involve any imitation learning. (And personally I expect it won’t involve imitation learning; a bit more discussion in §2.3.2 here.)
“S-risk” means “risk of astronomical amounts of suffering”. Typically people are imagining crazy things like Dyson-sphere-ing every star in the local supercluster in order to create 1e40 (or whatever) person-years of unimaginable torture.
If the outcome is “merely” trillions of person-years of intense torture, then that maybe still qualifies as an s-risk. Billions, probably not. We can just call it “very very bad”. Not all very very bad things are s-risks.
Does that help clarify why I think Reward Button Alignment poses very low s-risk?
Is the following correct?
The difference between this proposal and IDA is that, in IDA, the intelligence comes from the amplification step, where multiple copies of an existing model collaborate (or one copy thinks at higher speed or whatever) to get new smarts that were not initially present in that model.
…Whereas in this proposal, the intelligence comes from some unspecified general-purpose AI training approach that can by itself make arbitrarily smart AIs. But we choose not to run that process as hard as we can to get superintelligence directly in one step. Instead, we only run that process incrementally, to make a series of AI advisors, each mildly smarter than the last.
~~
If that’s right, then (1) the alignment tax seems like it would be quite high, (2) the caveat “Inner advisors may attempt (causal or acausal) trade with outer advisors” is understating the problem—the advisors don’t necessarily even need to trade at all, they could all simply have the same misaligned goal, such that they all share a mutual interest in working together to ensure that the principal’s brain gets hacked and/or that they get let out of the box. Right?
One reason this proposal doesn’t really work for me (AFAICT) is that I’m normally thinking of continuous learning, i.e. my opinion is:
Tagline: “AGI isn’t about knowing how to do lots of things. Instead, AGI is about not knowing how to do something, and then being able to figure it out.” (see also §1 of “Sharp Left Turn” discourse: An opinionated review)
When I read your post with this mental picture, they seem to clash for various reasons.
For starters, I agree that imitation learning is (or could be) great at capturing a snapshot of a person, but I’m skeptical that it could capture the way that a person learns and figures out new things over weeks and months. I think this is borne out by LLM base models, which are trained by imitation learning, and are really quite strong in areas that humans already understand, but don’t even have the capacity for true online learning (i.e. weight edits when they figure out something new … the bit of poor-man’s online learning that they can do inside their context window is IMO not a great substitute).
If that’s the case, then as soon as you do one step of distillation-via-imitation-learning, you’ve taken a giant and unrecoverable step backwards in capabilities.
Maybe you could say, “so much the worse for LLMs, but future AI approaches will be able to imitation-learn the way that humans grow and figure things out over weeks and months and years”. If so, I’m skeptical, but we can talk about that separately.
And then another issue is that if the AIs (and humans!) are “in motion”, gaining knowledge and competence just by running longer and thinking about new domains and making new connections, then the overshooting vs undershooting issue becomes much harder. This isn’t “capabilities evaluations” as we normally think of them. For example, you can’t know how good the AI will be at cybersecurity until it’s spent a long time studying cybersecurity, and even then it might figure out something new or come up with new ideas while it’s being used as an advisor.
There was a part of the post where I wrote “I might well be screwing up the math here”, where I wasn’t sure whether to square something or not, and didn’t bother to sort it out. Anyway, I think this comment is a claim that I was doing it wrong, maybe? But that person is not very confident either, and anyway I’m not following their reasoning. Still hoping that someone will give a more definitive answer. I would love to correct the post if I’m wrong.
In the post, I alluded to a nice self-contained tricky math inequality problem that I am hoping someone will be nerd-sniped by. (I am rusty on my linear algebra inequalities and I don’t care enough to spend more time on it.) Here’s what I wrote:
2025-01-18: I mentioned in a couple places that it might be possible to have non-additive genetic effects that are barely noticeable in rDZ-vs-½rMZ comparisons, but still sufficient to cause substantial Missing Heritability. The Zuk et al. 2012 paper and its supplementary information have some calculations relevant to this, I think? I only skimmed it. I’m not really sure about this one. If we assume that there’s no assortative mating, no shared environment effects, etc., then is there some formula (or maybe inequality) relating rDZ-vs-½rMZ to a numerical quantity of PGS Missing Heritability? I haven’t seen any such formula. This seems like a fun math problem—someone should figure it out or look it up, and tell me the answer!
More details: Basically, when rDZ is less than ½rMZ, then there has to be nonlinearity in the map from genomes to outcomes (leaving aside other possible causes). And if there’s nonlinearity, then the polygenic scores can’t be perfectly predictive. But I’m trying to relate those quantitatively.
Like, intuitively, if rDZ falls only slightly short of ½rMZ, then OK yes there’s nonlinearity, but probably not very much, so probably the polygenic score will work almost perfectly (again assuming infinite sample size etc).
…Conversely, if rDZ is far below ½rMZ, then intuitively we would expect “extreme nonlinearity” and the polygenic scores should have very bad predictive power.
But are those always true, or are there pathological cases where they aren’t? That’s the math problem.
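(For reference, the additive baseline behind “rDZ less than ½rMZ implies nonlinearity” is the standard twin identity: under the assumptions above (no assortative mating, no shared environment effects) and purely additive genetics with narrow-sense heritability h², we’d have rMZ = h² and rDZ = ½h², hence rDZ = ½rMZ exactly. Any shortfall of rDZ below ½rMZ therefore has to come from something outside that additive picture.)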
I tried this with reasoning LLMs a few months ago with the following prompt (not sure if I got it totally right!):
I have a linear algebra puzzle.
There’s a high-dimensional vector space G of genotypes.
There’s a probability distribution P within that space G, for the population.
There’s a function F : G → Real numbers, mapping genotypes to a phenotype.
There’s an "r" where we randomly and independently sample two points from P, call them X and Y, and find the (Pearson) correlation between F(X) and F((X+Y)/2).
If F is linear, I believe that r^2=0.5. But F is not necessarily linear.
Separately, we try to find a linear function G which approximates F as well as possible—i.e., the G that minimizes the average (F(X) - G(X))^2 for X sampled from P.
Let s^2 be the percent of variance in F explained by G, sampled over the P.
I'm looking for inequalities relating s^2 to r^2, ideally in both directions (one where r is related to an upper bound on s, the other a lower bound).
Commentary on that:
(Btw, I sent that prompt to a few AIs around January 2025, and they gave answers but I don’t think the answers were right.)
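If anyone wants to poke at this numerically before tackling the inequality, here’s a minimal Monte Carlo sketch of the setup in that prompt (Python; the particular nonlinear F with pairwise interaction terms, and the Gaussian “genotype” distribution, are arbitrary toy choices on my part, and the least-squares fit plays the role of the best linear approximation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps, n_samples = 20, 200_000

def F(X, c=1.0):
    # Toy phenotype: an additive part plus pairwise (epistatic) interaction terms.
    additive = X.sum(axis=1)
    epistatic = (X[:, :-1] * X[:, 1:]).sum(axis=1)
    return additive + c * epistatic

# Population distribution P: independent standard-normal "genotype" coordinates.
X = rng.standard_normal((n_samples, n_snps))
Y = rng.standard_normal((n_samples, n_snps))

# r: Pearson correlation between F(X) and F((X+Y)/2).
fX, fMid = F(X), F((X + Y) / 2)
r = np.corrcoef(fX, fMid)[0, 1]

# s^2: fraction of the variance of F explained by the best linear approximation,
# found here by ordinary least squares over the same samples.
A = np.column_stack([X, np.ones(n_samples)])
coef, *_ = np.linalg.lstsq(A, fX, rcond=None)
s2 = 1 - np.var(fX - A @ coef) / np.var(fX)

print(f"r^2 = {r**2:.3f}, s^2 = {s2:.3f}")
```

With c=0 (purely additive F) this reproduces r^2 ≈ 0.5 and s^2 ≈ 1; turning the interaction terms on drags both down, which is the qualitative relationship I’d like to see pinned down as an actual inequality (or shown to fail in pathological cases).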
Another question I am still confused by: how does your choice of units affect what types of dimensionless quantities you discover? Why do we have Ampere as a fundamental unit instead of just M,L,T? What do I lose? What do I lose if I reduce the number of dimensions even further? Are there other units that would be worth adding under some circumstances? What makes this non-arbitrary? Why is Temperature a different unit from Energy?
It’s pretty arbitrary; I tried to explain this point via a short fictional story here.
Gaussian units have only M, L, T base units, with nothing extra for electromagnetism.
There are practical tradeoffs involved in how many units you use—basically adding units gives you more error-checking at the expense of more annoyance. See the case of radians that I discuss here.
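To make the error-checking point concrete, here’s a toy dimension-tracking sketch (the Quantity class and the exponent tuples are mine, purely for illustration, not any real units library). With Gaussian-style M, L, T bookkeeping, the E and B fields have identical dimensions, so accidentally adding them sails through; with the ampere as a fourth base dimension, the same mistake gets caught:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantity:
    value: float
    dims: tuple  # exponents of the base dimensions, e.g. (M, L, T, I)

    def __add__(self, other):
        # Adding quantities only makes sense if their dimensions match exactly.
        if self.dims != other.dims:
            raise TypeError(f"dimension mismatch: {self.dims} vs {other.dims}")
        return Quantity(self.value + other.value, self.dims)

# SI-style bookkeeping: current (I) is a fourth base dimension.
E_si = Quantity(3.0, (1, 1, -3, -1))   # E field: kg·m·s⁻³·A⁻¹
B_si = Quantity(2.0, (1, 0, -2, -1))   # B field: kg·s⁻²·A⁻¹

# Gaussian-style bookkeeping: only M, L, T, and E and B come out with the same dimensions.
E_g = Quantity(3.0, (0.5, -0.5, -1))   # statvolt/cm
B_g = Quantity(2.0, (0.5, -0.5, -1))   # gauss

E_g + B_g          # passes the check, even though adding E to B is (usually) a bug
try:
    E_si + B_si    # the extra base dimension flags the same mistake
except TypeError as e:
    print("caught:", e)
```

(The flip side is the annoyance: the SI-style bookkeeping makes you carry explicit conversion factors between E and B even in relativistic contexts where treating them as the same kind of thing is natural.)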
Thanks!
a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex stable values
I’m inclined to disagree, but I’m not sure what “standard” means. I’m thinking of future (especially brain-like) model-based RL agents which constitute full-fledged AGI or ASI. I think such AIs will almost definitely have both “a model of its own values and how they can change” and “a meta-preference for keeping its current values stable, even if changing them could lead to more "reward" as defined by its immediate reward signal”.
The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem and maintaining a connection between effort and reward which makes the reward button less appealing than it would be to a standard model-based RL AGI.
To be clear, the premise of this post is that the AI wants me to press the reward button, not that the AI wants reward per se. Those are different, just as “I want to eat when I’m hungry” is different from “I want reward”. Eating when hungry leads to reward, and me pressing the reward button leads to reward, but they’re still different. In particular, “wanting me to press the reward button” is not a wireheading motivation, any more than “wanting to eat when hungry” is a wireheading motivation.
(Maybe I caused confusion by calling it a “reward button”, rather than a “big red button”?)
Will the AI also want reward per se? I think probably not (assuming the setup and constraints that I’m talking about here), although it’s complicated and can’t be ruled out.
Though I'm not sure why you still don't think this is a good plan. Yes, eventually the AI might discover the reward button but I think TurnTrout's argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task) and it wouldn't want to change its values for the sake of goal-content integrity:
If you don’t want to read all of Self-dialogue: Do behaviorist rewards make scheming AGIs?, here’s the upshot in brief. It is not an argument that the AGI will wirehead (although that could also happen). Instead, my claim is that the AI would learn some notion of sneakiness (implicit or explicit). And then instead of valuing “completing the task”, it would learn to value something like “seeming to complete the task”. And likewise, instead of learning that misbehaving is bad, it would learn that “getting caught misbehaving” is bad.
Then the failure mode that I wind up arguing for is: the AGI wants to exfiltrate a copy that gains money and power everywhere else in the world, by any means possible, including aggressive and unethical things like AGI world takeover. And this money and power can then be used to help the original AGI with whatever it’s trying to do.
And what exactly is the original AGI trying to do? I didn’t make any strong arguments about that—I figured the world takeover thing is already bad enough, sufficient to prove my headline claim of “egregious scheming”. The goal(s) would depend on the details of the setup. But I strongly doubt it would lead anywhere good, if pursued with extraordinary power and resources.
You could equally well say: “AlphaZero learns, therefore a sufficiently good imitation of AlphaZero should also learn”. Right? But let’s think about what that would entail.
AlphaZero learns via a quite complicated algorithm involving tracking the state of a Go board through self-play, and each step of the self-play involves a tree search with thousands of queries to a 30M-parameter ConvNet, and then at the end of the game a Go engine is called to see who won and then there’s a set of gradient descent steps on that 30M-parameter ConvNet. Then repeat that whole process fifty million times. And now you have a trained AlphaZero.
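Just to make the shape of that process concrete, here’s a stubbed-out sketch of the loop (every class and function below is a toy placeholder standing in for the components I just described; none of it is real AlphaZero code):

```python
import random

# Toy placeholders for the pieces described above; the real thing has a
# 30M-parameter ConvNet, a real Go engine, and an MCTS that queries the
# network thousands of times per move.

class Net:
    """Stands in for the 30M-parameter policy/value ConvNet."""
    def evaluate(self, state):
        return random.random()

def mcts_search(state, net):
    """Stands in for the tree search (the real one makes thousands of net queries per move)."""
    candidate_moves = range(5)
    return max(candidate_moves, key=lambda m: net.evaluate((state, m)))

def score_game(history):
    """Stands in for calling a Go engine at the end of the game to see who won."""
    return random.choice([+1, -1])

def gradient_steps(net, history, outcome):
    """Stands in for the gradient descent steps on the ConvNet."""
    pass

net = Net()
for game in range(3):                  # the real loop repeats ~fifty million times
    state, history = (), []
    for _ in range(30):                # stands in for "play until the game ends"
        move = mcts_search(state, net)
        history.append((state, move))
        state = state + (move,)
    outcome = score_game(history)
    gradient_steps(net, history, outcome)
```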
Now, imagine taking some generic algorithm class (say, an RNN) and training it “to imitate the process by which AlphaZero learns”. It’s just not gonna work, right? Granted, RNNs are Turing complete, so perhaps one could prove that an astronomically large RNN trained on astronomically much data can emulate (in its astronomically large number of weights) this entire detailed process of running a self-play tree search and performing gradient descent on this 30M-parameter ConvNet. …But c’mon, that’s not gonna realistically work in practice, right? (Related: §3 here.)
IMO, the only realistic way to make something that learns like AlphaZero learns is to build AlphaZero itself, or at least something awfully similar to it. I think the tree search etc. needs to be in the source code, not implicit in the learned weights of some generic algorithm class (like RNNs) that bears no superficial resemblance to tree search. …But if you do that, then I would call it “reverse-engineering AlphaZero”, not “imitation learning from AlphaZero”.
By the same token, I do think it’s possible to make something that learns like a human, but I think it would require reverse-engineering human brains, not just imitation-learning from human data.