I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
I added a caption to the picture. Does that help?
Yeah I agree with that. I have a diagram of homeostatic feedback control in §1.5 of my SMTM reply post, and RL is one of the ingredients (d & f).
I know very little about this topic, but I was under the impression that there was more to it than “KV cache: yes or no?”, and I was trying to refer to that whole category of possible improvements. E.g. here’s a paper on “KV cache compression”.
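As I understand it, the underlying issue is that during autoregressive decoding the cached keys and values grow with the context length, so the cache’s memory footprint becomes a large cost in its own right, and that is what compression / eviction / quantization schemes try to shrink. Here’s a minimal sketch of that growth (mine, not from the paper; the head dimension and the random “projections” are purely illustrative):

```python
# Toy sketch of single-head attention decoding with a KV cache. The point is
# just that the cache grows linearly with context length, so its memory
# footprint (not only FLOPs) becomes a bottleneck, which is what KV cache
# compression and related tricks aim at. All numbers here are illustrative.
import numpy as np

d = 64                       # head dimension (illustrative)
rng = np.random.default_rng(0)
K_cache = np.empty((0, d))   # cached keys,   shape (t, d)
V_cache = np.empty((0, d))   # cached values, shape (t, d)

def decode_step():
    """One autoregressive step: append this token's key/value, then attend over the whole cache."""
    global K_cache, V_cache
    q, k, v = (rng.standard_normal(d) for _ in range(3))  # stand-ins for learned projections
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    scores = np.exp(K_cache @ q / np.sqrt(d))
    return (scores / scores.sum()) @ V_cache               # attention output for this position

for t in range(1, 6):
    decode_step()
    print(f"after token {t}: cache holds {K_cache.shape[0]} keys and values of dim {d}")
```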
Thanks for the feedback!
I would also count improvements from synthetic training data as algorithmic improvements, not data-related improvements, because better synthetic training data is created by better algorithms…
I have now changed the text in a few places to better clarify how I am defining the scope of §1.1.
I feel like you’re maybe reading in some subtext, where you think I’m trying to downplay the things outside §1.1, and suggest that they don’t really count, or something? If so, that’s not what I meant to suggest, and it’s not how I feel in my heart. I’m open to rewording more, if you have suggestions.
I think most things mentioned in 1.4 ("Algorithmic changes that are not really quantifiable as efficiency") belong to 1.1 (algorithmic efficiency progress) because they can actually be quantified as efficiency improvements, namely SFT, RLHF, RLVR. These have strongly increased capabilities, as measured by benchmarks, compared to GPT-3-style prompt engineering of the underlying base model. So a much smaller model with these improvements can get to the performance of a larger base model without them.
In the context of this post, I’m mainly interested in: (1) are the things in §1.4 relevant to the Epoch claim of exponential algorithmic improvements? and (2) are the things in §1.4 relevant to the Dario claim of exponential algorithmic improvements? It seems to me that the answer in both cases is “no”.
(1) In the Epoch case, I believe they quantified performance by perplexity, not benchmarks.
(2) In the Dario case, I mean, I keep reading and re-reading the exact wording of the excerpt where he talks about “compute multipliers”. And it just really doesn’t sound to me like he is referring to SFT, RLHF, or RLVR in that excerpt (nor anything else in §1.4). Admittedly, his wording is a bit vague and confusing (to me). I’m open to discussion.
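To make (1) a bit more concrete: as I understand the Epoch-style methodology (my notation and paraphrase, not a quote from them), an algorithmic change gets credited with a compute multiplier

$$
m \;=\; \frac{C_{\text{old}}(\mathcal{L}^*)}{C_{\text{new}}(\mathcal{L}^*)},
$$

where $C(\mathcal{L}^*)$ is the training compute needed to reach a fixed perplexity / loss target $\mathcal{L}^*$ with the old versus the new algorithm. The §1.4 items mostly change what you can get out of a model after pretraining, rather than shrinking that ratio, which is part of why I don’t think they bear on these particular claims.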
I think model distillation would not cause such a large and ongoing improvement in inference efficiency.
Pick a fixed model size, let’s say N=50B parameters. My current belief is that: if you straightforwardly distill Claude Opus 3.5 into an N-parameter model, then you wind up with a worse model, than if you straightforwardly distill Claude Opus 4.5 into an N-parameter model.
Are you disagreeing with that?
If you agree, then it would follow that (for example) maybe: one lab distills Claude Opus 3.5 into an N-parameter model, while Bob distills Claude Opus 4.5 into a model with far fewer than N parameters, and the two distilled models wind up roughly equally good on benchmarks (because Bob’s better starting point is making up for his more aggressive distillation). Thus we would see ever-falling inference costs at any given level of benchmarks. See what I mean?
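Here’s a toy numerical version of that story. Every number and the “score” formula below are invented purely for illustration, and I’m assuming inference cost scales roughly with parameter count; the point is only the qualitative shape, i.e. that better teachers let you hit a fixed benchmark level with ever-smaller (hence ever-cheaper) distilled students:

```python
# Toy illustration only: all numbers and the "distilled_score" formula are made
# up. If each year's teacher is better, then the smallest student that hits a
# FIXED benchmark level shrinks year over year, so inference cost at that fixed
# capability level keeps falling (assuming cost scales ~linearly with params).
target_benchmark = 70.0                               # fixed capability level we want to hit
teachers = [(2023, 1.0), (2024, 1.3), (2025, 1.7)]    # (year, made-up "teacher quality")

def distilled_score(teacher_quality, student_params_b):
    """Made-up toy model: score grows with student size and with teacher quality."""
    return 40.0 + 10.0 * teacher_quality * (student_params_b ** 0.3)

def smallest_student(teacher_quality, sizes_b=range(1, 201)):
    """Smallest student size (billions of params) whose toy score reaches the target."""
    return next(s for s in sizes_b if distilled_score(teacher_quality, s) >= target_benchmark)

baseline = None
for year, quality in teachers:
    size_b = smallest_student(quality)
    baseline = baseline or size_b
    print(f"{year}: a ~{size_b}B-param student hits {target_benchmark:.0f}; "
          f"relative inference cost ~{size_b / baseline:.2f}x of the first year's")
```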
EDITED TO ADD: I just took the non-Goodfire-specific part of this comment, and spun it out and expanded it into a new post: In (highly contingent!) defense of interpretability-in-the-loop ML training
~ ~ ~
No opinion on Goodfire, or even on LLMs more broadly, but I generically think that there exist some AI algorithms in which interpretability outputs are connected to a reward function in a way that might be very helpful for safe & beneficial ASI. See for example Reward Function Design: a starter pack sections 1 & 4 & 5.
E.g. consider compassion in the human brain. I claim that we have an innate reward function that triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away. So the reward can evidently get triggered by specific activations inside my inscrutable learned world-model. But human compassion is generally pretty robust—at least, it doesn’t go away as you age. It would be cool if we knew how to put something like that into an AGI.
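As a cartoon of what connecting interpretability outputs to a reward function could look like in ML terms, here is a minimal sketch of the general idea, not Goodfire’s approach or anything from the linked post; the probe, its weights, and the activations are all hypothetical stand-ins:

```python
# Cartoon sketch of interpretability-in-the-loop reward shaping. Everything is
# hypothetical: pretend we already have a probe that reads the agent's internal
# activations and scores how strongly they currently represent "my friend is
# suffering", and we fold that signal into the reward, loosely analogous to the
# innate compassion reaction described above. The key property: the extra term
# is keyed to the agent's *beliefs* (internal activations), not to anything the
# agent directly observes.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 128
probe_weights = rng.standard_normal(HIDDEN_DIM)   # stand-in for a trained linear probe

def suffering_probe(activations: np.ndarray) -> float:
    """Hypothetical probe: how strongly do these activations encode 'my friend is suffering'?"""
    return float(1 / (1 + np.exp(-probe_weights @ activations)))

def shaped_reward(task_reward: float, activations: np.ndarray, weight: float = 0.5) -> float:
    """Task reward, minus a penalty when the agent's own world-model represents a suffering friend."""
    return task_reward - weight * suffering_probe(activations)

# Toy usage: a random activation vector standing in for the agent's current internal state.
activations = rng.standard_normal(HIDDEN_DIM)
print(shaped_reward(task_reward=1.0, activations=activations))
```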
…But yes, it’s also true that there are ways to connect interpretability outputs to a reward function that are bad, and make our problems all worse. I think there’s probably some elegant big-picture theoretical framework that sheds light on how to do interpretability-in-the-loop training in a way that’s good rather than terrible. Developing that theoretical framework would be good. (But probably quite different from what Goodfire is working on … it’s not “doing interpretability stuff” in the conventional sense but rather “thinking about AI algorithmic architectures in the big picture”, kinda more agent-foundations-y if anything.)
Fair enough, I reworded slightly. (Although I’d be pretty surprised if LLM algorithmic progress were way faster today than in, say, 2020, given that there was presumably much more low-hanging fruit in 2020.)
In 2024, I wasn't able to find anyone making this argument. My sense is that it was not at all prevalent, and continues to be not at all prevalent.
I feel like I see it pretty often. Check out “Unfalsifiable stories of doom”, for example.
Or really, anyone who uses the phrase “hypothetical risk” or “hypothetical threat” as a conversation-stopper when talking about ASI extinction, is implicitly invoking the intuitive idea that we should by default be deeply skeptical of things that we have not already seen with our own eyes.
Is The Spokesperson a realistic villain?
Obviously I agree that The Spokesperson is not going to sound realistic and sympathetic when he is arguing for “Ponzi Pyramid Incorporated” led by “Bernie Bankman”. It’s a reductio ad absurdum, showing that this style of argument proves too much. That’s the whole point.
Thanks, this is great!
Envy toward a friend’s success…
I used to think that envy was a social instinct (before 2023ish), but now I don’t think it’s a social instinct at all (see “changelog” here). Instead I currently think that envy is a special case of, umm, “craving” (in the general colloquial sense, not the specific Buddhist sense)—a kind of anxious frustration in a scenario where something is highly salient, and highly desired, but in fact cannot happen.
So a social example would be: Sally has a juice box, and I love juice, but I can’t have any. Looking at Sally drinking juice reminds me of the scenario where I’m drinking juice, which makes me unhappy because I don’t have any juice.
Whereas a non-social example of the same innate reaction would be: It’s lunch time, and every day at lunch I have juice and a sandwich in a brown paper bag, and I love juice. But it happens that there’s a new global juice shortage, so today for the first time I don’t have any juice. Looking at my sandwich and the brown bag reminds me of the scenario where I’m drinking juice, which makes me unhappy because I don’t have any juice.
So that’s my starting point: both of these examples are the same kind of (not-specifically-social) craving-related frustration reaction.
After that, of course, the Sally scenario becomes social, because the scenario involves Sally doing something (i.e. drinking juice) that causes me to feel an unpleasant feeling (per above), and generically, if someone is causing me unpleasant feelings, then that tends to push me from regarding her as a friend towards regarding her as an enemy, and to feel motivated to find an excuse to blame her for my troubles and pick a fight with her.
Admiration for a rival or enemy
My guess is that, just as going to bed can feel like a good idea or a bad idea depending on which aspects of the situation you’re paying attention to, likewise Genghis Khan can feel like a friend or an enemy depending on which aspects of him you’re paying attention to. I would suggest that people don’t feel admiration towards Person X and schadenfreude towards Person X at the very same instant. You might be able to flip back and forth from one to the other very quickly, even within 1 or 2 seconds, but not at the very same instant. For example, if I say the sentence “It was catastrophic how Genghis Khan killed all those people, but I have to admit, he was a talented leader”, I would suggest that the “innate friend-vs-enemy parameter” related to thoughts of Genghis Khan flips from enemy in the first half of the sentence to friend in the second half.
Compassion for a stable enemy’s suffering
There probably isn’t one great answer; probably different people are different. As above, we can think of people in different ways, paying attention to different aspects of them, and they can flip rapidly from enemy to friend and back. Since attention control is partly voluntary, it’s partly (but not entirely) a choice whether we see someone as a friend vs enemy, and we tend to choose the option that feels better / more motivating on net, and there can be a bunch of factors related to that. For example, approval reward is a factor—some people take pride in their compassion (just as we nod approvingly when superheroes take compassion upon their enemies, and cf. §6), while others take pride in their viciousness. Personality matters, culture matters, the detailed situation matters, etc.
Gratitude / indebtedness
Hmm. Generically, I think there are two (not mutually exclusive) paths:
As an example of the latter, recently someone important-to-me went out of his way to help me, and I expected the interaction to work out well for him too, but instead it wound up being a giant waste of his time, and objectively it wasn’t really my fault, but I still felt horrible and lost much sleep over it, and I think the aspect that felt most painful to me was when I imagined him secretly being annoyed at me and regretful for ever reaching out to me, even if he was far too nice a guy to say anything like that to me directly.
…But I’m kinda neurotic; different people are different and I don’t want to overgeneralize. Happy to hear more about how things seem to you.
Private guilt
I talked about “no one will ever find out” a bit in §6.1 of the approval reward post. I basically think that you can consciously believe that no one will ever find out, while nevertheless viscerally feeling a bit of the reaction associated with a nonzero possibility of someone finding out.
As for the “Dobby effect” (self-punishment related to guilt, a.k.a. atonement), that’s an interesting question. I thought about it a bit and here’s my proposed explanation:
Generally, if Ahab does something hurtful to Bob, then Bob might get angry at Ahab, and thus want Ahab to suffer (and better yet, to suffer while thinking about Bob, such as if Bob is punching Ahab in the face). But that desire of Bob’s, just like hunger and many other things, is satiable: just as a hungry person stops being hungry after eating a certain amount, Bob tends to lose his motivation for Ahab to suffer after Ahab has already suffered a certain amount. For example, if an angry person punches out his opponent in a bar fight, he usually feels satisfied, and doesn’t keep kicking his victim when he’s down, except in unusual cases. Or even if he kicks a bit, he won’t keep kicking for hours and hours.
We all know this intuitively from life experience, and we intuitively pick up on what it implies: if Ahab did something hurtful to Bob, and Ahab wants to get back to a situation where Bob feels OK about Ahab ASAP, then Ahab should be making himself suffer, and better yet suffer while thinking about Bob. Then not only is Ahab helping dull Bob’s feelings of aggression by satiating them, but simultaneously, there’s the very fact that Ahab is helping Bob feel a good feeling (i.e., satiation of anger), which should help push Ahab towards the “friend” side of the ledger in Bob’s mind.
Aggregation cases
In “identifiable victim effect”, I normally think of, like, reading a news article about an earthquake across the world. It’s very abstract. There’s some connection to the ground-truth reward signals that I suggested in Neuroscience of human social instincts: a sketch, but it’s several steps removed. Ditto “psychic numbing”, I think.
By contrast, in stage fright, you can see the people right there, looking at you, potentially judging you. You can make eye contact with one actual person, then move your eyes, and now you’re making eye contact with a different actual person, etc. The full force of the ground-truth reward signals is happening right now.
Likewise, for “audience effect”, we all have life experience of doing something, and then it turns out that there’s a real person right there who was watching us and judging us based on what we did. At any second, that real person could appear, and make eye contact etc. So again, we’re very close to the full force of the ground-truth reward signals here.
…So I don’t see a contradiction there.
Again I really appreciate this kind of comment, feel free to keep chatting.
I read one of their papers (the Pong one, which is featured on the frontpage of their website) and thought it was really bad and p-hacked, see here & here.
…sounds like a joke? You do not want to do any computation on neurons; they are slow and fragile. (You might want to run brain-inspired algorithms, but on semiconductors!)
Strong agree.
Belated update on that last point on “algorithmic progress” for LLMs: I looked into this a bit and wrote it up at: The nature of LLM algorithmic progress. The last section discusses how it relates to this post, with the upshot that I stand by what I wrote in the OP.