Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, LinkedIn, and more at my website.

Sequences

Intuitive Self-Models
Valence
Intro to Brain-Like-AGI Safety

Comments

Thanks! …But I think you misunderstood.

Suppose I tell an AI:

Hello AI. Here is a bank account with $100K of seed capital. Go make money. I’ll press the reward button if I can successfully withdraw $1B from that same bank account in the future. (But I’ll wait 1 year between withdrawing the funds and pressing the reward button, during which I’ll perform due diligence to check for law-breaking or any other funny business. And the definition of ‘funny business’ will be at my sole discretion, so you should check with me in advance if you’re unsure where I will draw the line.) Good luck!

That’s full 100% automation, not 90%, right?

“Making $1B” is one example project for concreteness, but the same idea could apply to writing code, inventing technology, or whatever else. If the human can’t ever tell whether the AI succeeded or failed at the project, then that’s a very unusual project, and certainly not a project that results in making money or impressing investors etc. And if the human can tell, then they can press the reward button when they’re sure.

Normal people can tell that their umbrella keeps them dry without knowing anything about umbrella production. Normal people can tell whether their smartphone apps are working well without knowing anything about app development and debugging. Etc.

And then I’m claiming that this kind of strategy will “work” until the AI is sufficiently competent to grab the reward button and start building defenses around it etc.
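
To make the setup concrete, here is a minimal sketch of that kind of outer loop (purely illustrative; every object and method name here is hypothetical, not a real API): the AI acts autonomously for the entire project, and the only training signal is a single sparse reward that the human issues after doing their own due diligence.

```python
# Illustrative sketch only: a single long-horizon episode where the reward is a
# sparse, human-issued signal delivered after the human verifies the outcome.
# All objects (agent, environment, human) are hypothetical stand-ins.

def run_project(agent, environment, human) -> float:
    observation = environment.reset(seed_capital=100_000)

    # Full automation: the AI acts on its own judgment, with no per-step reward.
    while not environment.project_finished():
        action = agent.act(observation)
        observation = environment.step(action)

    # The human checks the easily-observable outcome (e.g. “can I withdraw $1B?”),
    # then waits out a due-diligence period looking for law-breaking or other
    # “funny business” before deciding whether to press the button.
    outcome_ok = human.verify_outcome(environment)
    clean = human.due_diligence(environment, duration_days=365)

    reward = 1.0 if (outcome_ok and clean) else 0.0
    agent.update(reward)  # the one and only reward signal for the whole project
    return reward
```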

Thanks!

RE 2 – I was referring here to (what I call) “brain-like AGI”, a yet-to-be-invented AI paradigm in which both “human-like ability to reason” and “human-like social and moral instincts / reflexes” are human-like in a nuts-and-bolts sense, i.e. the AI is actually doing the same kinds of algorithmic steps that a human brain would do. Human brains are quite different from LLMs, even if their text outputs can look similar. For example, small groups of humans can invent grammatical languages from scratch, and of course historically humans invented science and tech and philosophy and so on from scratch. So bringing up “training data” is kinda the wrong idea. Indeed, for humans (and brain-like AGI), we should be talking about “training environments”, not “training data”—more like the RL agent paradigm of the late 2010s than like LLMs, at least in some ways.

I do agree that we shouldn’t trust LLMs to make good philosophical progress that goes way beyond what’s already in their human-created training data.

RE 1 – Let’s talk about feelings of friendship, compassion, and connection. These feelings are unnecessary for cooperation, right? Logical analysis of the costs vs benefits of cooperation, including decision theory, reputational consequences, etc., is all you need for cooperation to happen. (See §2-3 here.) But for me and almost anyone, a future universe with no feelings of friendship, compassion, and connection in it seems like a bad thing that I don’t want to happen. I find it hard to believe that sufficient reflection would change my opinion on that [although I have some niggling concerns about technological progress]. “Selfishness” isn’t even a coherent concept unless the agent intrinsically wants something, and innate drives are upstream of what it wants, and those feelings of friendship, compassion etc. can be one of those innate drives, potentially a very strong one.

Then, yes, there’s a further question about whom those feelings will be directed towards—AIs, humans, animals, teddy bears, or what? That targeting needs to start with innate reflexes to certain stimuli, which then get smoothed out upon reflection. I think this is something where we’d need to think very carefully about the AI design, training environment, etc. It might be easier to think through what could go right and wrong after developing a better understanding of human social instincts, and developing a more specific plan for the AI. Anyway, I certainly agree that there are important potential failure modes in this area.

RE 3 – I’m confused about how this plan would be betting the universe on a particular meta-ethical view, from your perspective. (I’m not expressing doubt, I’m just struggling to see things from outside my own viewpoint here.)

By the way, my perspective again is “this might be the least-bad plausible plan”, as opposed to “this is a great plan”. But to defend the “least-bad” claim, I would need to go through all the other possible plans that seem better, and why I don’t find them plausible (on my idiosyncratic models of AGI development). I mentioned my skepticism about AGI pause / anti-proliferation above, and I have a (hopefully) forthcoming post that should talk about how I’m thinking about corrigibility and other stuff. (But I’m also happy to chat about it here; it would probably help me flesh out that forthcoming post tbh.)

However: I expect that AIs capable of causing a loss of control scenario, at least, would also be capable of top-human-level alignment research.

Hmm. A million fast-thinking Stalin-level AGIs would probably have a better shot at taking control than at doing alignment research, I think?

Also, if there’s an alignment tax (or control tax), then that impacts the comparison, since the AIs doing alignment research are paying that tax whereas the AIs attempting takeover are not. (I think people have wildly different intuitions about how steep the alignment tax will be, so this might or might not be important. E.g. imagine a scenario where FOOM is possible but too dangerous for humans to allow. If so, that would be an astronomical alignment tax!)

we don’t tend to imagine humans directly building superintelligence

Speak for yourself! Humans directly built AlphaZero, which is a superintelligence for board game worlds. So I don’t think it’s out of the question that humans could directly build a superintelligence for the real world. I think that’s my main guess, actually?

(Obviously, the humans would be “directly” building a learning algorithm etc., and then the trained weights come out of that.)

(OK sure, the humans will use AI coding assistants. But I think AI coding assistants, at least of the sort that exist today, aren’t fundamentally changing the picture, but rather belong in the same category as IDEs and PyTorch and other such mundane productivity-enhancers.)

(You said “don’t tend to”, which is valid. My model here [AI paradigm shift → superintelligence very quickly and with little compute] does seem pretty unusual with respect to today’s alignment community zeitgeist.)

You could think, for example, that almost all of the core challenge of aligning a superintelligence is contained in the challenge of safely automating top-human-level alignment research. I’m skeptical, though. In particular: I expect superintelligent-level capabilities to create a bunch of distinctive challenges.

I’m curious if you could name some examples, from your perspective?

I’m just curious. I don’t think this is too cruxy—I think the cruxy-er part is how hard it is to safely automate top-human-level alignment research, not whether there are further difficulties after that.

…Well, actually, I’m not so sure. I feel like I’m confused about what “safely automating top-human-level alignment research” actually means. You say that it’s less than “handoff”. But if humans are still a required part of the ongoing process, then it’s not really “automation”, right? And likewise, if humans are a required part of the process, then an alignment MVP is insufficient for the ability to turn tons of compute into tons of alignment research really fast, which you seem to need for your argument.

You also talk elsewhere about “performing more limited tasks aimed at shorter-term targets”, which seems to directly contradict “performs all the cognitive tasks involved in alignment research at or above the level of top human experts”, since one such cognitive task is “making sure that all the pieces are coming together into a coherent viable plan”. Right?

Honestly I’m mildly concerned that an unintentional shell game might be going on regarding which alignment work is happening before vs. after the alignment MVP.

Relatedly, this sentence seems like an important crux where I disagree: “I am cautiously optimistic that for building an alignment MVP, major conceptual advances that can’t be evaluated via their empirical predictions are not required.” But again, that might be because I’m envisioning a more capable alignment MVP than you are.

Thanks for writing this! Leaving some comments with reactions as I was reading, not all very confident, and sorry if I missed or misunderstood things you wrote.

Problems with these evaluation techniques can arise in attempting to automate all sorts of domains (I’m particularly interested in comparisons with (a) capabilities research, and (b) other STEM fields). And I think this should be a source of comfort. In particular: these sorts of problems can slow down the automation of capabilities research, too. And to the extent they’re a bottleneck on all sorts of economically valuable automation, we should expect lots of effort to go towards resolving them. … [then more discussion in §6.1]

This feels wrong to me. I feel like “the human must evaluate the output, and doing so is hard” is more of an edge case, applicable to things like “designs for a bridge”, where failure is far away and catastrophic. (And applicable to alignment research, of course.)

Like you mention today’s “reward-hacking” (e.g. o3 deleting unit tests instead of fixing the code) as evidence that evaluation is necessary. But that’s a bad example because the reward-hacked code doesn’t actually work! And people notice that it doesn’t work. If the code worked flawlessly, then people wouldn’t be talking about reward-hacking as if it’s a bad thing. People notice eventually, and that constitutes an evaluation. Likewise, if you hire a lousy head of marketing, then you’ll eventually notice the lack of new customers; if you hire a lousy CTO, then you’ll eventually notice that your website doesn’t work; etc.

OK, you anticipate this reply and then respond with: “…And even if these tasks can be evaluated via more quantitative metrics in the longer-term (e.g., “did this business strategy make money?”), trying to train on these very long-horizon reward signals poses a number of distinctive challenges (e.g., it can take a lot of serial time, long-horizon data points can be scarce, etc).”

But I don’t buy that because, like, humans went to the moon. That was a long-horizon task but humans did not need to train on it, rather they did it with the same brains we’ve been using for millennia. It did require long-horizon goals. But (1) If AI is unable to pursue long-horizon goals, then I don’t think it’s adequate to be an alignment MVP (you address this in §9.1 & here, but I’m more pessimistic, see here & here), (2) If the AI is able to pursue long-horizon goals, then the goal of “the human eventually approves / presses the reward button” is an obvious and easily-trainable approach that will be adequate for capabilities, science, and unprecedented profits (but not alignment), right up until catastrophe. (Bit more discussion here.)

((1) might be related to my other comment, maybe I’m envisioning a more competent “alignment MVP” than you?)

I think that OP’s discussion of “number-go-up vs normal science vs conceptual research” is an unnecessary distraction, and he should have cut that part and just talked directly about the spectrum from “easy-to-verify progress” to “hard-to-verify progress”, which is what actually matters in context.

Partly copying from §1.4 here, you can (A) judge ideas via new external evidence, and/or (B) judge ideas via internal discernment of plausibility, elegance, self-consistency, consistency with already-existing knowledge and observations, etc. There’s a big range in people’s ability to apply (B) to figure things out. But what happens in “normal” sciences like biology is that there are people with a lot of (B), and they can figure out what’s going on, on the basis of hints and indirect evidence. Others can’t. The former group can gather ever-more-direct and ever-more-unassailable (A)-type evidence over time, and use that evidence as a cudgel with which to beat the latter group over the head until they finally get it. (“If you don’t believe my 7 independent lines of evidence for plate tectonics, OK fine I’ll go to the mid-Atlantic ridge and gather even more lines of evidence…”)

This is an important social tool, and explains why bad scientific ideas can die, while bad philosophy ideas live forever. And it’s even worse than that—if the bad philosophy ideas don’t die, then there’s no common knowledge that the bad philosophers are bad, and then they can rise in the ranks and hire other bad philosophers etc. Basically, to a first approximation, I think humans and human institutions are not really up to the task of making intellectual progress systematically over time, except where idiot-proof verification exists for that intellectual progress (for an appropriate definition of “idiot”, and with some other caveats).

…Anyway, AFAICT, OP is just claiming that AI alignment research involves both easy-to-verify progress and hard-to-verify progress, which seems uncontroversial.

Thanks!

Hmm. I think there’s an easy short-term ‘solution’ to Goodhart’s law for AI capabilities, which is to give humans a reward button. Then the reward function rewards exactly what the person wants, by definition (until the AI can grab the button). There’s no need to define metrics or whatever, right?

(This is mildly related to RLHF, except that RLHF makes models dumber, whereas I’m imagining some future RL paradigm wherein the RL training makes models smarter.)

I think your complaint is that people would be bad at pressing the button, even by their own lights. They’ll press the button upon seeing a plausible-sounding plan that flatters their ego, and then they’ll regret that they pressed it when the plan doesn’t actually work. This will keep happening, until the humans are cursing the button and throwing it out.

But there’s an obvious (short-term) workaround to that problem, which is to tell the humans not to press the reward button until they’re really sure that they won’t later regret it, because they see that the plan really worked. (Really, you don’t even have to tell them that, they’d quickly figure that out for themselves.) (Alternatively, make an “undo” option such that when the person regrets having pressed the button, they can roll back whatever weight changes came from pressing it.) This workaround will make the rewards more sparse, and thus it’s only an option if the AI can maximize sparse rewards. But I think we’re bound to get AIs that can maximize sparse rewards, on the road to AGI.
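
A minimal sketch of that “undo” option, assuming the reward-driven weight update can be checkpointed and restored (the method names are hypothetical; state_dict / load_state_dict are assumed PyTorch-style accessors):

```python
import copy

# Sketch of the “undo” workaround: checkpoint the weights right before applying
# the update triggered by a button press, so the human can roll that update back
# if they later regret having pressed the button. Method names are hypothetical.

class RewardButtonWithUndo:
    def __init__(self, model):
        self.model = model
        self._checkpoint = None  # weights as they were before the last press

    def press(self) -> None:
        # Save the pre-press weights, then apply the reward-driven update.
        self._checkpoint = copy.deepcopy(self.model.state_dict())
        self.model.apply_reward_update(reward=1.0)  # hypothetical RL update step

    def undo(self) -> None:
        # The human regrets the press: restore the pre-press weights.
        if self._checkpoint is not None:
            self.model.load_state_dict(self._checkpoint)
            self._checkpoint = None
```

(Rolling back the weights only removes the training signal, of course; it doesn’t undo anything the AI already did in the world.)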

If the person never regrets pressing the button, not even in hindsight, then you have an AI product that will be highly profitable in the short term. You can have it apply for human jobs, found companies, etc.

… …Then I have this other theory that maybe everything I just wrote here is moot, because once someone figures out the secret sauce of AGI, it will be so easy to make powerful misaligned superintelligence that this will happen very quickly and with no time or need to generate profit from the intermediate artifacts. That’s an unpopular opinion these days and I won’t defend it here. (I’ve been mulling it over in the context of a possible forthcoming post.) Just putting my cards on the table.

https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ

That was a very good constructive comment btw, sorry I forgot to reply to it earlier but I just did.

Thanks! Your comment was very valuable, and helped spur me to write Self-dialogue: Do behaviorist rewards make scheming AGIs? (as I mention in that post). Sorry I forgot to reply to your comment directly.

To add on a bit to what I wrote in that post, and reply more directly…

So far, you’ve proposed that we can do brain-like AGI, and the reward function will be a learned function trained by labeled data. The data, in turn, will be “lots of synthetic data that always shows the AI acting aligned even when the human behaves badly, as well as synthetic data to make misaligned agents reveal themselves safely, and in particular it's done early in the training run, before it can try to deceive or manipulate us.” Right so far?

That might or might not make sense for an LLM. I don’t think it makes sense for brain-like AGI.

In particular, the reward function is just looking at what the model does, not what its motivations are. “Synthetic data to make misaligned agents reveal themselves safely” doesn’t seem to make sense in that context. If the reward function incentivizes the AI to say “I’m blowing the whistle on myself! I was out to get you!” then the AI will probably start saying that, even if it’s false. (Or if the reward function incentivizes the AI to only whistle-blow on itself if it has proof of its own shenanigans, then the AI will probably keep generating such proof and then showing it to us. …And meanwhile, it might also be doing other shenanigans in secret.)

Or consider trying to incentivize the AGI to honestly report its intentions. Honesty there is defined by the degree of match or mismatch between the AI’s inscrutable internal intentions and its text outputs. How do you train and apply the reward model to detect that?

More broadly, I think there are always gonna be ways for the AI to be sneaky without the reward function noticing, which leads to a version of “playing the training game”. And almost no matter what the detailed “training game” is, an AI can do a better job on it by secretly creating a modified copy to self-reproduce around the internet and gain maximal money and power everywhere else on Earth, if that’s possible to do without getting caught. So that’s pretty egregious scheming.

“Alignment generalizes further than capabilities” is kinda a different issue—it’s talking about the more “dignified” failure mode of generalizing poorly to new situations and environments, whereas I’m claiming that we don’t even have a solution to egregious scheming within an already-existing test environment.

Again, more in that post. Sorry if I’m misunderstanding, and happy to keep chatting!

(1) Yeah AI self-modification is an important special case of irreversible actions, where I think we both agree that (mis)generalization from the reward history is very important. (2) Yeah I think we both agree that it’s hopeless to come up with a reward function for judging AI behavior as good vs bad, that we can rely on all the way to ASI.
