I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, Twitter, Mastodon, Threads, Bluesky, GitHub, Wikipedia, Physics-StackExchange, LinkedIn
Note that this kind of messaging can (if you’re not careful) come across as “hey let’s work on AI x-risk instead of climate change”, which would be both very counterproductive and very misleading—see my discussion here.
Yeah I oversimplified. :) I think “the literal definition of moral realism” is a bit different from “the important substantive things that people are usually talking about when they talk about moral realism”, and I was pointing at the latter instead of the former. For example:
Anyway, my strong impression is that a central property of moral realist claims—the thing that makes those claims substantively different from moral antirealism, in a way that feeds into pondering different things and making different decisions, the thing that most self-described moral realists actually believe, as opposed to the trivialities above—is that moral statements can be not just true, but that their truth is “universally accessible to reason and reflection” in some sense. That’s what you need for nostalgebraist’s attempted reductio ad absurdum (where he says: if I had been born in the other country, I would be holding their flag, etc.) to not apply. So that’s what I was trying to talk about. Sorry for leaving out these nuances. If there’s better terminology for what I’m talking about, I’d be interested to hear it. :)
AlphaZero is playing a zero-sum game - as such, I wouldn't expect it to learn anything along the lines of cooperativeness or kindness, because the only way it can win is if other agents lose, and the amount it wins is the same amount that other agents lose.
OK well AlphaZero doesn’t develop hatred and envy either, but now this conversation is getting silly.
If AlphaZero was trained on a non-zero-sum game (e.g. in an environment where some agents were trying to win a game of Go, and others were trying to ensure that the board had a smiley-face made of black stones on a background of white stones somewhere on the board), it would learn how to model the preferences of other agents and figure out ways to achieve its own goals in a way that also allowed the other agents to achieve their goals.
I’m not sure why you think that. It would learn to anticipate its opponent’s moves, but that’s different from accommodating its opponent’s preferences, unless the opponent has ways to exact revenge? Actually, I’m not sure I understand the setup you’re trying to describe. Which type of agent is AlphaZero in this scenario? What’s the reward function it’s trained on? The “environment” is still a single Go board right?
Anyway, I can think of situations where agents are repeatedly interacting in a non-zero-sum setting but where the parties don’t do anything that looks or feels like kindness over and above optimizing their own interest. One example is: the interaction between craft brewers and their yeast. (I think it’s valid to model yeast as having goals and preferences in a behaviorist sense.)
I think this implies that if one wanted to figure out why sociopaths are different from neurotypical people, one should look for differences in the reward circuitry of the brain rather than the predictive circuitry. Do you agree with that?
OK, low confidence on all this, but I think some people get an ASPD diagnosis purely for having an anger disorder, but the central ASPD person has some variant on “global under-arousal” (which can probably have any number of upstream root causes). That’s what I was guessing here; see also here (“The best physiological indicator of which young people will become violent criminals as adults is a low resting heart rate, says Adrian Raine of the University of Pennsylvania. … Indeed, when Daniel Waschbusch, a clinical psychologist at Penn State Hershey Medical Center, gave the most severely callous and unemotional children he worked with a stimulative medication, their behavior improved”).
Physiological arousal affects all kinds of things, and certainly does feed into the reward function, at least indirectly and maybe also directly.
There’s an additional complication that I think social instincts are in the same category as curiosity drive in that they involve the reward function taking (some aspects of) the learned world-model’s activity as an input (unlike typical RL reward functions which depend purely on exogenous inputs, e.g. Atari points—see “Theory 2” here). So that also complicates the picture of where we should be looking to find a root cause.
So yeah, I think the reward is a central part of the story algorithmically, but that doesn’t necessarily imply that the so-called “reward circuitry of the brain” (by which people usually mean VTA/SNc or sometimes NAc) is the spot where we should be looking for root causes. I don’t know the root cause; again there might be many different root causes in different parts of the brain that all wind up feeding into physiological arousal via different pathways.
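To make the distinction in the previous two paragraphs concrete, here is a minimal sketch (my own framing, not anything from the linked posts; the function names and the squared-error formulation are illustrative assumptions): a typical RL reward depends only on exogenous signals from the environment, whereas a curiosity-style reward also reads the learned model’s own activity, so it cannot be evaluated without looking inside the agent.

```python
import numpy as np

def exogenous_reward(observation):
    """Typical RL reward: a function of the environment signal alone,
    e.g. Atari points delivered alongside the observation."""
    return observation["points"]

def curiosity_reward(observation, model_prediction):
    """Curiosity-style reward (illustrative): depends on the learned
    world-model's activity -- here, mean squared prediction error on the
    observation. Same observation, different reward, depending on what
    the agent's model currently expects."""
    error = np.asarray(observation["pixels"]) - np.asarray(model_prediction)
    return float(np.mean(error ** 2))

obs = {"points": 3.0, "pixels": [0.2, 0.8, 0.5]}
prediction = [0.1, 0.9, 0.5]  # what the agent's world-model expected to see

r_ext = exogenous_reward(obs)              # fixed by the environment alone
r_cur = curiosity_reward(obs, prediction)  # shrinks as the model improves
```

The point of the sketch is just that for the second kind of reward, “where in the brain is the reward computed?” and “where is the root cause of an individual difference?” can come apart: the same reward function gives different outputs depending on the state of the learned model feeding into it.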
I'm not sure I agree with "I don’t think anyone is being disingenuous here."
Yeah I added a parenthetical to that, linking to your comment above.
I think people should generally be careful about using the language "kill literally everyone" or "notkilleverybodyism" [sic] insofar as they aren't confident that misaligned AI would kill literally everyone. (Or haven't considered counterarguments to this.)
I don’t personally use the term “notkilleveryoneism”. I do talk about “extinction risk” sometimes. Your point is well taken that I should be considering whether my estimate of extinction risk is significantly lower than my estimate of x-risk / takeover risk / permanent disempowerment risk / whatever.
I quickly searched my writing and couldn’t immediately find anything that I wanted to change. It seems that when I use the magic word “extinction”, as opposed to “x-risk”, I’m almost always saying something pretty vague, like “there is a serious extinction risk and we should work to reduce it”, rather than giving a numerical probability.
Yeah, I meant “training process” to include training data and/or training environment. Sorry I didn’t make that explicit.
Here are three ways to pass the very low bar of “there’s at least prima facie reason to think that kindness might arise non-coincidentally and non-endogenously”, and whether I think those reasons actually stand up to scrutiny:
Thanks!
I changed “we doomers are unhappy about AI killing all humans” to “we doomers are unhappy about the possibility of AI killing all humans” for clarity.
If I understand you correctly:
If so, that seems fair enough. For my part, I don’t think I’ve said the third-bullet-point-type thing, but maybe, anyway I’ll try to be careful not to do that in the future.
governments will act quickly and (relatively) decisively to bring these agents under state control. national security concerns will dominate.
I dunno, like 20 years ago if someone had said “By the time somebody creates AI that displays common-sense reasoning, passes practically any written test up to and including graduate level, (etc.), obviously governments will be flipping out and nationalizing AI companies etc.”, that would have seemed like a reasonable claim to me. But here we are, and the idea of the USA govt nationalizing OpenAI seems a million miles outside the Overton window.
Likewise, if someone had said “After it becomes clear to everyone that lab leaks can cause pandemics costing trillions of dollars and millions of lives, then obviously governments will be flipping out and banning the study of dangerous viruses—or at least, passing stringent regulations with intrusive monitoring and felony penalties for noncompliance etc.”, then that would also have sounded reasonable to me! But again, here we are.
So anyway, my conclusion is that when I ask my intuition / imagination whether governments will flip out in thus-and-such circumstance, my intuition / imagination is really bad at answering that question. I think it tends to underweight the force compelling governments to continue following longstanding customs / habits / norms? Or maybe it’s just hard to predict and these are two cherrypicked examples, and if I thought a bit harder I’d come up with lots of examples in the opposite direction too (i.e., governments flipping out and violating longstanding customs on a dime)? I dunno. Does anyone have a good model here?
I think Yann LeCun thinks "AGI in 2040 is perfectly plausible", AND he believes "AGI is so far away it's not worth worrying about all that much". It's a really insane perspective IMO. As recently as like 2020, "AGI within 20 years" was universally (correctly) considered to be a super-soon forecast calling for urgent action, as contrasted with the people who say "centuries".
I don’t think that’s “moral realist in a conventional way”, and I don’t think it’s in contradiction with my second bullet in the comment above. Different species have different “extrapolated volition”, right? I think that link is “a moral realist theory which is only trivially different from a typical moral antirealist theory”. Just go through Eliezer’s essay and do a global-find-and-replace of “extrapolated volition” with “extrapolated volition_human”, and “good” with “good_human”, etc., and bam, now it’s a central example of a moral antirealist theory. You could not do the same with, say, metaethical hedonism without sucking all the force out of it—the whole point of metaethical hedonism is that it has some claim to naturalness and universality, and does not depend on contingent facts about life in the African Savanna. When I think of “moral realist in a conventional way”, I think of things like metaethical hedonism, right?