Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, Twitter, Mastodon, Threads, Bluesky, GitHub, Wikipedia, Physics-StackExchange, LinkedIn

Sequences

Valence
Intro to Brain-Like-AGI Safety

Comments

Eliezer has a more recent metaethical theory (basically "x is good" = "x increases extrapolated volition") which is moral realist in a conventional way. He discusses it here.

I don’t think that’s “moral realist in a conventional way”, and I don’t think it’s in contradiction with my second bullet in the comment above. Different species have different “extrapolated volition”, right? I think that link describes “a moral realist theory which is only trivially different from a typical moral antirealist theory”. Just go through Eliezer’s essay and do a global find-and-replace of “extrapolated volition” with “humanity’s extrapolated volition”, and “good” with “good-according-to-humans”, etc., and bam, now it’s a central example of a moral antirealist theory. You could not do the same with, say, metaethical hedonism without sucking all the force out of it—the whole point of metaethical hedonism is that it has some claim to naturalness and universality, and does not depend on contingent facts about life in the African Savanna. When I think of “moral realist in a conventional way”, I think of things like metaethical hedonism, right?

Note that this kind of messaging can (if you’re not careful) come across as “hey let’s work on AI x-risk instead of climate change”, which would be both very counterproductive and very misleading—see my discussion here.

Yeah I oversimplified. :) I think “the literal definition of moral realism” is a bit different from “the important substantive things that people are usually talking about when they talk about moral realism”, and I was pointing at the latter instead of the former. For example:

  • It’s possible to believe: “there is a moral truth, but it exists in another realm to which we have no epistemic access or connection. For all we know, the true moral imperative is to maximize helium. We can never possibly know one way or the other, so this moral truth is entirely irrelevant to our actions and decisions. Tough situation we find ourselves in!”
    • See: The ignorance of normative realism bot.
    • This position is literally moral realism, but in practice this person will be hanging out with the moral antirealists (and nihilists!) when deciding what to do with their lives and why.
  • It’s possible to believe: “there is a moral truth, and it is inextricably bound up with entirely contingent (“random”) facts about the human brain and its innate drives. For example, maybe it turns out that “justice is part of true morality”, but if the African Savanna had had a different set of predators, then maybe we would be a slightly different but equally intelligent species having an analogous discussion, and we would be saying “justice is not part of true morality”, and nobody in this story has made any mistake in their logic. Rather, we are humans, and “morality” is our human word, so it’s fine if there’s contingent-properties-of-human-brains underlying what that word points to.”
    • I believe Eliezer would put himself in this camp; see my summary here.
    • Again, this position is literally moral realism, but it has no substantive difference whatsoever from a typical moral antirealism position. The difference is purely semantics / terminological choices. Just replace “true morality” with “true morality according to humans” and so on. Again, see here for details.

Anyway, my strong impression is that a central property of moral realist claims—the thing that makes those claims substantively different from moral antirealism, in a way that feeds into pondering different things and making different decisions; the thing that most self-described moral realists actually believe, as opposed to the trivialities above—is that moral statements are not merely capable of being true, but that their truth is “universally accessible to reason and reflection” in some sense. That’s what you need for nostalgebraist’s attempted reductio ad absurdum (where he says: if I had been born in the other country, I would be holding their flag, etc.) to not apply. So that’s what I was trying to talk about. Sorry for leaving out these nuances. If there’s a better terminology for what I’m talking about, I’d be interested to hear it. :)

AlphaZero is playing a zero-sum game - as such, I wouldn't expect it to learn anything along the lines of cooperativeness or kindness, because the only way it can win is if other agents lose, and the amount it wins is the same amount that other agents lose.

OK well AlphaZero doesn’t develop hatred and envy either, but now this conversation is getting silly.

If AlphaZero was trained on a non-zero-sum game (e.g. in an environment where some agents were trying to win a game of Go, and others were trying to ensure that the board had a smiley-face made of black stones on a background of white stones somewhere on the board), it would learn how to model the preferences of other agents and figure out ways to achieve its own goals in a way that also allowed the other agents to achieve their goals.

I’m not sure why you think that. It would learn to anticipate its opponent’s moves, but that’s different from accommodating its opponent’s preferences, unless the opponent has ways to exact revenge? Actually, I’m not sure I understand the setup you’re trying to describe. Which type of agent is AlphaZero in this scenario? What’s the reward function it’s trained on? The “environment” is still a single Go board, right?

Anyway, I can think of situations where agents are repeatedly interacting in a non-zero-sum setting but where the parties don’t do anything that looks or feels like kindness over and above optimizing their own interest. One example is the interaction between craft brewers and their yeast. (I think it’s valid to model yeast as having goals and preferences in a behaviorist sense.)

I think this implies that if one wanted to figure out why sociopaths are different than neurotypical people, one should look for differences in the reward circuitry of the brain rather than the predictive circuitry. Do you agree with that?

OK, low confidence on all this, but I think some people get an ASPD diagnosis purely for having an anger disorder, while the central ASPD person has some variant of “global under-arousal” (which can probably have any number of upstream root causes). That’s what I was guessing here; see also here (“The best physiological indicator of which young people will become violent criminals as adults is a low resting heart rate, says Adrian Raine of the University of Pennsylvania. … Indeed, when Daniel Waschbusch, a clinical psychologist at Penn State Hershey Medical Center, gave the most severely callous and unemotional children he worked with a stimulative medication, their behavior improved”).

Physiological arousal affects all kinds of things, and certainly does feed into the reward function, at least indirectly and maybe also directly.

There’s an additional complication: I think social instincts are in the same category as the curiosity drive, in that they involve the reward function taking (some aspects of) the learned world-model’s activity as an input (unlike typical RL reward functions, which depend purely on exogenous inputs, e.g. Atari points—see “Theory 2” here, and the toy sketch at the end of this comment). So that also complicates the picture of where we should be looking to find a root cause.

So yeah, I think the reward is a central part of the story algorithmically, but that doesn’t necessarily imply that the so-called “reward circuitry of the brain” (by which people usually mean VTA/SNc or sometimes NAc) is the spot where we should be looking for root causes. I don’t know the root cause; again there might be many different root causes in different parts of the brain that all wind up feeding into physiological arousal via different pathways.
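
To make the “Theory 2” distinction above concrete, here’s a minimal toy sketch in Python. It’s purely my own illustration (the field names “points_this_step” and “prediction_error”, and the 0.1 weighting, are made up, not anyone’s actual architecture), but it shows the structural difference between a reward function that only reads exogenous signals and one that also reads a summary of the learned world-model’s activity:

```python
def exogenous_reward(env_observation: dict) -> float:
    """Typical RL reward: depends only on exogenous signals coming in from the
    environment, e.g. Atari points."""
    return float(env_observation["points_this_step"])


def curiosity_like_reward(env_observation: dict, world_model_state: dict) -> float:
    """Reward that additionally takes (some aspects of) the learned world-model's
    activity as an input. Here a made-up 'prediction_error' summary statistic
    stands in for a curiosity drive; a social-instinct reward would instead read
    out different (e.g. other-agent-related) world-model variables."""
    exogenous = float(env_observation["points_this_step"])
    novelty_bonus = 0.1 * float(world_model_state["prediction_error"])
    return exogenous + novelty_bonus


if __name__ == "__main__":
    obs = {"points_this_step": 1.0}
    wm_state = {"prediction_error": 2.5}
    print(exogenous_reward(obs))                 # 1.0
    print(curiosity_like_reward(obs, wm_state))  # 1.25
```

The point of the contrast is just that, in the second case, whatever shapes the world-model’s internal variables (including, on this view, the other-agent-related variables underlying social instincts) sits upstream of the reward, even if the “reward circuitry” itself is working normally.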

I'm not sure I agree with "I don’t think anyone is being disingenuous here."

Yeah I added a parenthetical to that, linking to your comment above.

I think people should generally be careful about using the language "kill literally everyone" or "notkilleverybodyism" [sic] insofar as they aren't confident that misaligned AI would kill literally everyone. (Or haven't considered counterarguments to this.)

I don’t personally use the term “notkilleveryoneism”. I do talk about “extinction risk” sometimes. Your point is well taken that I should be considering whether my estimate of extinction risk is significantly lower than my estimate of x-risk / takeover risk / permanent disempowerment risk / whatever.

I quickly searched my writing and couldn’t immediately find anything that I wanted to change. It seems that when I use the magic word “extinction”, as opposed to “x-risk”, I’m almost always saying something pretty vague, like “there is a serious extinction risk and we should work to reduce it”, rather than giving a numerical probability.

Yeah, I meant “training process” to include training data and/or training environment. Sorry I didn’t make that explicit.

Here are three ways to pass the very low bar of “there’s at least prima facie reason to think that kindness might arise non-coincidentally and non-endogenously”, and whether I think those reasons actually stand up to scrutiny:

  • “The AIs are LLMs, trained mostly by imitative learning of human data, and humans are nice sometimes.” I don’t have an opinion about whether this argument is sound; it’s not my area, and I focus on brain-like model-based RL. It does seem to be quite a controversy, see for example here. (Note that model-based RL AIs can imitate, but do so in a fundamentally different way from LLM pretraining.)
  • “The AIs are model-based RL, and they have other agents in their training environment.” I don’t think this argument works because I think intrinsic kindness drives are things that need to exist in the AI’s reward function, not just the learned world-model and value function. See for example this comment pointing out among other things that if AlphaZero had other agents in its training environment (and not just copies of itself), it wouldn’t learn kindness. Likewise, we have pocket calculators in our training environment, and we learn to appreciate their usefulness and to skillfully interact with them and repair them when broken, but we don’t wind up feeling deeply connected to them :)
  • “The AIs are model-based RL, and the reward function will not be programmed by a human, but rather discovered by a process analogous to animal evolution.” This isn’t impossible, and it would be a truly substantive argument if true, but my bet would be against it actually happening, mainly because it’s extremely expensive to run outer loops around ML training like that, and meanwhile human programmers are perfectly capable of writing effective reward functions; they do it all the time in the RL literature today (see the toy sketch after this list). I also think humans writing the reward function has the potential to turn out better than allowing an outer-loop search to write the reward function, if only we can figure out what we’re doing, cf. here, especially the subsection “Is it a good idea to build human-like social instincts by evolving agents in a social environment?”
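
To spell out the “extremely expensive” point in that last bullet, here is a purely illustrative toy sketch; the function names, weights, and the stand-in training loop are all made up by me rather than drawn from any real codebase. The hand-written option is one short function that a practitioner types directly; the evolution-style option has to wrap an entire training run inside every single fitness evaluation:

```python
import random

def hand_written_reward(state: dict) -> float:
    # What RL practitioners do routinely today: directly encode what we care about.
    return 1.0 * state["task_progress"] - 0.01 * state["energy_used"]

def train_agent(reward_fn, steps: int = 1_000) -> float:
    """Stand-in for one full RL training run (the expensive part in real life);
    returns a fitness score for the resulting agent."""
    return sum(
        reward_fn({"task_progress": random.random(), "energy_used": random.random()})
        for _ in range(steps)
    )

def evolve_reward(generations: int = 50, population: int = 20):
    """Outer loop analogous to animal evolution: every candidate reward function
    costs a whole training run to evaluate, i.e. generations * population full
    runs, just to pick two scalar weights in this toy version."""
    best_fitness, best_weights = float("-inf"), None
    for _ in range(generations):
        for _ in range(population):
            w1, w2 = random.random(), random.random()
            fitness = train_agent(
                lambda s, w1=w1, w2=w2: w1 * s["task_progress"] - w2 * s["energy_used"]
            )
            if fitness > best_fitness:
                best_fitness, best_weights = fitness, (w1, w2)
    return best_fitness, best_weights

if __name__ == "__main__":
    print(hand_written_reward({"task_progress": 0.8, "energy_used": 5.0}))  # one cheap call
    print(evolve_reward(generations=2, population=3))  # even this tiny demo is 6 full "training runs"
```

Even the toy outer loop pays for generations × population full training runs to recover something a programmer could have typed directly, which is the basic intuition behind the expense point above.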

Thanks!

I changed “we doomers are unhappy about AI killing all humans” to “we doomers are unhappy about the possibility of AI killing all humans” for clarity.

If I understand you correctly:

  • You’re OK with “notkilleveryoneism is the problem we’re working on”
  • You’re at least willing to engage with claims like “there’s >>90% chance of x-risk” / “there’s >>90% chance of AI takeover” / “there’s >>90% chance of AI extinction or permanent human disempowerment” / etc., even if you disagree with those claims [I disagree with those claims too—“>>90%” is too high for me]
  • …But here you’re strongly disagreeing with people tying those two things together into “It’s important to work on the notkilleveryoneism problem, because the way things are going, there’s >>90% chance that this problem will happen”

If so, that seems fair enough. For my part, I don’t think I’ve said the third-bullet-point-type thing, but maybe I have; anyway, I’ll try to be careful not to do that in the future.

governments will act quickly and (relatively) decisively to bring these agents under state control. national security concerns will dominate.

I dunno, like 20 years ago if someone had said “By the time somebody creates AI that displays common-sense reasoning, passes practically any written test up to and including graduate level, etc., obviously governments will be flipping out and nationalizing AI companies etc.”, to me that would have seemed like a reasonable claim. But here we are, and the idea of the USA govt nationalizing OpenAI seems a million miles outside the Overton window.

Likewise, if someone said “After it becomes clear to everyone that lab leaks can cause pandemics costing trillions of dollars and millions of lives, then obviously governments will be flipping out and banning the study of dangerous viruses—or at least, passing stringent regulations with intrusive monitoring and felony penalties for noncompliance etc,” then that would also have sounded reasonable to me! But again, here we are.

So anyway, my conclusion is that when I ask my intuition / imagination whether governments will flip out in thus-and-such circumstance, my intuition / imagination is really bad at answering that question. I think it tends to underweight the force compelling governments to continue following longstanding customs / habits / norms? Or maybe it’s just hard to predict and these are two cherrypicked examples, and if I thought a bit harder I’d come up with lots of examples in the opposite direction too (i.e., governments flipping out and violating longstanding customs on a dime)? I dunno. Does anyone have a good model here?

Just wanted to flag that this post was the most helpful single thing I read about social status in the course of writing my own recent posts on that topic (part 1, part 2). Thanks!!

I think Yann LeCun thinks "AGI in 2040 is perfectly plausible", AND he believes "AGI is so far away it's not worth worrying about all that much". It's a really insane perspective IMO. As recently as like 2020, "AGI within 20 years" was universally (correctly) considered to be a super-soon forecast calling for urgent action, as contrasted with the people who say "centuries".
