Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, LinkedIn, and more at my website.

Sequences

Intuitive Self-Models
Valence
Intro to Brain-Like-AGI Safety

Wikitag Contributions

Comments

Sorted by

However: I expect that AIs capable of causing a loss of control scenario, at least, would also be capable of top-human-level alignment research.

Hmm. A million fast-thinking Stalin-level AGIs would probably have a better shot of taking control than doing alignment research, I think?

Also, if there’s an alignment tax (or control tax), then that impacts the comparison, since the AIs doing alignment research are paying that tax whereas the AIs attempting takeover are not. (I think people have wildly different intuitions about how steep the alignment tax will be, so this might or might not be important. E.g. imagine a scenario where FOOM is possible but too dangerous for humans to allow. If so, that would be an astronomical alignment tax!)

we don’t tend to imagine humans directly building superintelligence

Speak for yourself! Humans directly built AlphaZero, which is a superintelligence for board game worlds. So I don’t think it’s out of the question that humans could directly build a superintelligence for the real world. I think that’s my main guess, actually?

(Obviously, the humans would be “directly” building a learning algorithm etc., and then the trained weights come out of that.)

(OK sure, the humans will use AI coding assistants. But I think AI coding assistants, at least of the sort that exist today, aren’t fundamentally changing the picture, but rather belong in the same category as IDEs and PyTorch and other such mundane productivity-enhancers.)

(You said “don’t tend to”, which is valid. My model here [AI paradigm shift → superintelligence very quickly and with little compute] does seem pretty unusual with respect to today’s alignment community zeitgeist.)

You could think, for example, that almost all of the core challenge of aligning a superintelligence is contained in the challenge of safely automating top-human-level alignment research. I’m skeptical, though. In particular: I expect superintelligent-level capabilities to create a bunch of distinctive challenges.

I’m curious if you could name some examples, from your perspective?

I’m just curious. I don’t think this is too cruxy—I think the cruxy-er part is how hard it is to safely automate top-human-level alignment research, not whether there are further difficulties after that.

…Well, actually, I’m not so sure. I feel like I’m confused about what “safely automating top-human-level alignment research” actually means. You say that it’s less than “handoff”. But if humans are still a required part of the ongoing process, then it’s not really “automation”, right? And likewise, if humans are a required part of the process, then an alignment MVP is insufficient for the ability to turn tons of compute into tons of alignment research really fast, which you seem to need for your argument.

You also talk elsewhere about “performing more limited tasks aimed at shorter-term targets”, which seems to directly contradict “performs all the cognitive tasks involved in alignment research at or above the level of top human experts”, since one such cognitive task is “making sure that all the pieces are coming together into a coherent viable plan”. Right?

Honestly I’m mildly concerned that an unintentional shell game might be going on regarding which alignment work is happening before vs. after the alignment MVP.

Relatedly, this sentence seems like an important crux where I disagree: “I am cautiously optimistic that for building an alignment MVP, major conceptual advances that can’t be evaluated via their empirical predictions are not required.” But again, that might be that I’m envisioning a more capable alignment MVP than you are.

Thanks for writing this! Leaving some comments with reactions as I was reading, not all very confident, and sorry if I missed or misunderstood things you wrote.

Problems with these evaluation techniques can arise in attempting to automate all sorts of domains (I’m particularly interested in comparisons with (a) capabilities research, and (b) other STEM fields). And I think this should be a source of comfort. In particular: these sorts of problems can slow down the automation of capabilities research, too. And to the extent they’re a bottleneck on all sorts of economically valuable automation, we should expect lots of effort to go towards resolving them. … [then more discussion in §6.1]

This feels wrong to me. I feel like “the human must evaluate the output, and doing so is hard” is more of an edge case, applicable to things like “designs for a bridge”, where failure is far away and catastrophic. (And applicable to alignment research, of course.)

Like you mention today’s “reward-hacking” (e.g. o3 deleting unit tests instead of fixing the code) as evidence that evaluation is necessary. But that’s a bad example because the reward-hacked code doesn’t actually work! And people notice that it doesn’t work. If the code worked flawlessly, then people wouldn’t be talking about reward-hacking as if it’s a bad thing. People notice eventually, and that constitutes an evaluation. Likewise, if you hire a lousy head of marketing, then you’ll eventually notice the lack of new customers; if you hire a lousy CTO, then you’ll eventually notice that your website doesn’t work; etc.

OK, you anticipate this reply and then respond with: “…And even if these tasks can be evaluated via more quantitative metrics in the longer-term (e.g., “did this business strategy make money?”), trying to train on these very long-horizon reward signals poses a number of distinctive challenges (e.g., it can take a lot of serial time, long-horizon data points can be scarce, etc).”

But I don’t buy that because, like, humans went to the moon. That was a long-horizon task but humans did not need to train on it, rather they did it with the same brains we’ve been using for millennia. It did require long-horizon goals. But (1) If AI is unable to pursue long-horizon goals, then I don’t think it’s adequate to be an alignment MVP (you address this in §9.1 & here, but I’m more pessimistic, see here & here), (2) If the AI is able to pursue long-horizon goals, then the goal of “the human eventually approves / presses the reward button” is an obvious and easily-trainable approach that will be adequate for capabilities, science, and unprecedented profits (but not alignment), right up until catastrophe. (Bit more discussion here.)

((1) might be related to my other comment, maybe I’m envisioning a more competent “alignment MVP” than you?)

Steven ByrnesΩ264716

I think that OP’s discussion of “number-go-up vs normal science vs conceptual research” is an unnecessary distraction, and he should have cut that part and just talked directly about the spectrum from “easy-to-verify progress” to “hard-to-verify progress”, which is what actually matters in context.

Partly copying from §1.4 here, you can (A) judge ideas via new external evidence, and/or (B) judge ideas via internal discernment of plausibility, elegance, self-consistency, consistency with already-existing knowledge and observations, etc. There’s a big range in people’s ability to apply (B) to figure things out. But what happens in “normal” sciences like biology is that there are people with a lot of (B), and they can figure out what’s going on, on the basis of hints and indirect evidence. Others don’t. The former group can gather ever-more-direct and ever-more-unassailable (A)-type evidence over time, and use that evidence as a cudgel with which to beat the latter group over the head until they finally get it. (“If you don’t believe my 7 independent lines of evidence for plate tectonics, OK fine I’ll go to the mid-Atlantic ridge and gather even more lines of evidence…”)

This is an important social tool, and explains why bad scientific ideas can die, while bad philosophy ideas live forever. And it’s even worse than that—if the bad philosophy ideas don’t die, then there’s no common knowledge that the bad philosophers are bad, and then they can rise in the ranks and hire other bad philosophers etc. Basically, to a first approximation, I think humans and human institutions are not really up to the task of making intellectual progress systematically over time, except where idiot-proof verification exists for that intellectual progress (for an appropriate definition of “idiot”, and with some other caveats).

…Anyway, AFAICT, OP is just claiming that AI alignment research involves both easy-to-verify progress and hard-to-verify progress, which seems uncontroversial.

Thanks!

Hmm. I think there’s an easy short-term ‘solution’ to Goodhart’s law for AI capabilities, which is to give humans a reward button. Then the reward function rewards are exactly what the person wants by definition (until the AI can grab the button). There’s no need to define metrics or whatever, right?

(This is mildly related to RLHF, except that RLHF makes models dumber, whereas I’m imagining some future RL paradigm wherein the RL training makes models smarter.)

I think your complaint is that people would be bad at pressing the button, even by their own lights. They’ll press the button upon seeing a plausible-sounding plan that flatters their ego, and then they’ll regret that they pressed it when the plan doesn’t actually work. This will keep happening, until the humans are cursing the button and throwing it out.

But there’s an obvious (short-term) workaround to that problem, which is to tell the humans not to press the reward button until they’re really sure that they won’t later regret it, because they see that the plan really worked. (Really, you don’t even have to tell them that, they’d quickly figure that out for themselves.) (Alternatively, make an “undo” option such that when the person regrets having pressed the button, they can roll back whatever weight changes came from pressing it.) This workaround will make the rewards more sparse, and thus it’s only an option if the AI can maximize sparse rewards. But I think we’re bound to get AIs that can maximize sparse rewards, on the road to AGI.

If the person never regrets pressing the button, not even in hindsight, then you have an AI product that will be highly profitable in the short term. You can have it apply for human jobs, found companies, etc.

… …Then I have this other theory that maybe everything I just wrote here is moot, because once someone figures out the secret sauce of AGI, it will be so easy to make powerful misaligned superintelligence that this will happen very quickly and with no time or need to generate profit from the intermediate artifacts. That’s an unpopular opinion these days and I won’t defend it here. (I’ve been mulling it over in the context of a possible forthcoming post.) Just putting my cards on the table.

https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ

That was a very good constructive comment btw, sorry I forgot to reply to it earlier but I just did.

Thanks! Your comment was very valuable, and helped spur me to write Self-dialogue: Do behaviorist rewards make scheming AGIs? (as I mention in that post). Sorry I forgot to reply to your comment directly.

To add on a bit to what I wrote in that post, and reply more directly…

So far, you’ve proposed that we can do brain-like AGI, and the reward function will be a learned function trained by labeled data. The data, in turn, will be “lots of synthetic data that always shows the AI acting aligned even when the human behaves badly, as well as synthetic data to make misaligned agents reveal themselves safely, and in particular it's done early in the training run, before it can try to deceive or manipulate us.” Right so far?

That might or might not make sense for an LLM. I don’t think it makes sense for brain-like AGI.

In particular, the reward function is just looking at what the model does, not what its motivations are. “Synthetic data to make misaligned agents reveal themselves safely” doesn’t seem to make sense in that context. If the reward function incentivizes the AI to say “I’m blowing the whistle on myself! I was out to get you!” then the AI will probably start saying that, even if it’s false. (Or if the reward function incentivizes the AI to only whistle-blow on itself if it has proof of its own shenanigans, then the AI will probably keep generating such proof and then showing it to us. …And meanwhile, it might also be doing other shenanigans in secret.)

Or consider trying to incentivize the AGI for honestly reporting its intentions. That’s defined by the degree of match or mismatch between inscrutable intentions and text outputs. How do you train and apply the reward model to detect that?

More broadly, I think there are always gonna be ways for the AI to be sneaky without the reward function noticing, which leads to a version of “playing the training game”. And almost no matter what the detailed “training game” is, an AI can do a better job on it by secretly creating a modified copy to self-reproduce around the internet and gain maximal money and power everywhere else on Earth, if that’s possible to do without getting caught. So that’s pretty egregious scheming.

“Alignment generalizes further than capabilities” is kinda a different issue—it’s talking about the more “dignified” failure mode of generalizing poorly to new situations and environments, whereas I’m claiming that we don’t even have a solution to egregious scheming within an already-existing test environment.

Again, more in that post. Sorry if I’m misunderstanding, and happy to keep chatting!

(1) Yeah AI self-modification is an important special case of irreversible actions, where I think we both agree that (mis)generalization from the reward history is very important. (2) Yeah I think we both agree that it’s hopeless to come up with a reward function for judging AI behavior as good vs bad, that we can rely on all the way to ASI.

Seems like an important difference here is that you’re imagining train-then-deploy whereas I’m imagining continuous online learning. So in the model I’m thinking about, there isn’t a fixed set of “reward data”, rather “reward data” keeps coming in perpetually, as the agent does stuff. Of course, as I said above, (mis)generalization from a fixed set of reward data remains an issue for the two special cases of irreversible actions & deliberately not exploring certain states.

I didn’t intend (A) & (B) to be a precise and complete breakdown.

AIs might learn to think thoughts in different formats

Yeah that’s definitely a thing to think about. Human examples might include “compassion fatigue” (shutting people out because it’s too hard to feel for them); or my theory that many people with autism learn to deliberately unconsciously avoid a wide array of innate social reactions from a young age; or choosing spending more and more time and mental space with imaginary friends, virtual friends, teddy bears, movies, etc. instead of real people. There are various tricks to mitigate these kinds of complications, and they seem to work well enough in human brains. So I think it’s premature to declare that this problem is definitely unsolvable. (And I think the Deep Deceptiveness post is too simplistic, see my comment on it.)

I thought of a fun case in a different reply: Harry is a random OpenAI customer and writes in the prompt “Please debug this code. Don’t cheat.” Then o3 deletes the unit tests instead of fixing the code. Is this “specification gaming”? No! Right? If we define “the specification” as what Harry wrote, then o3 is clearly failing the specification. Do you agree?

Load More