including probably reworking some of my blog post ideas into a peer-reviewed paper for a neuroscience journal this spring.
I think this is a great idea. It will broadcast your ideas to an audience prepared to receive them. You can leave out the "friendly AI" motivation and your ideas will stand on their own as a theory of (some of) cognition.
This is brilliant work, thank you. It's great that someone is working on these topics and they seem highly relevant to AGI alignment.
One intuition for why a neuroscience-inspired approach to AI alignment seems promising is that a similar strategy apparently worked for AI capabilities: the neural network researchers from the 1980s onward who tried to copy how the brain works (i.e., deep learning) were ultimately the most successful at building highly intelligent AIs (e.g. GPT-4), while more synthetic approaches (e.g. pure logic) were less successful.
Similarly, we already know that the brain has the capacity to represent and be directed by human values, so arguably the shortest path to succeeding at AI alignment is to try to understand the brain's circuitry underlying human motivations and values, and replicate it in AIs.
The only other AI alignment research agenda I can think of that seems to follow a similar strategy is Shard Theory, though it seems more high-level and more related to RL than to neuroscience.
I love your stuff and I'm very excited to see where you go next.
I would be very curious to hear what you have to say about more multi-polar threat scenarios and extending theories of agency into the collective intelligence frame.
What are your takes on Michael Levin's work on agency and "morphogenesis" in relation to your neuroscience ideas? What do you think about claims of hierarchical extension of these models? How does this affect multipolar threat models? What are the fundamental processes that we should care about? When should we expand these concepts cognitively, and when should we constrain them?
(Giving some answers without justification; feel free to follow up.)
What are your takes on Michael Levin's work on agency and "morphogenesis" in relation to your neuroscience ideas?
I haven’t found that work to be relevant or useful for what I’m doing.
Biology is full of cool things. It’s fun. I’ve been watching zoology videos in my free time. Can’t get enough. Not too work-relevant though, from my perspective.
What do you think about claims of hierarchical extension of these models?
I don’t think I’ve heard such claims, or if I did, I probably would have ignored it as probably-irrelevant-to-me.
I would be very curious to hear what you have to say about more multi-polar threat scenarios and extending theories of agency into the collective intelligence frame.
I don’t have any grand conceptual framework for that, and tend to rely on widespread common-sense concepts like “race-to-the-bottom” and “incentives” and “competition” and “employees who are or aren’t mission-aligned” and “miscommunication” and “social norms” and “offense-defense balance” and “bureaucracy” and “selection effects” and “coordination problems” and “externalities” and “stag hunt” and “hard power” and “parallelization of effort” and on and on. I think that this is a good general approach; I think that grand conceptual frameworks are not what I or anyone needs; and instead we just need to keep clarifying and growing and applying this collection of ideas and considerations and frames. (…But perhaps I just don’t know what I’m missing.)
Previous: My AGI safety research—2022 review, ’23 plans. (I guess I skipped it last year.)
“Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.” –attributed to DL Moody
Tl;dr
1. Main research project: reverse-engineering human social instincts
1.1 Background: What’s the problem and why should we care?
(copied almost word-for-word from Neuroscience of human social instincts: a sketch)
My primary neuroscience research goal for the past couple of years has been to solve a certain problem, a problem that has had me stumped ever since I first became interested in neuroscience at all (as a lens into Artificial General Intelligence safety) back in 2019.
What is this grand problem? As described in Intro to Brain-Like-AGI Safety, I believe the following:
1.2 More on the path-to-impact
1.3 Progress towards reverse-engineering human social instincts
It was a banner year!
Basically, for years, I’ve had a vague idea about how human social instincts might work, involving what I call “transient empathetic simulations”. But I didn’t know how to pin it down in more detail than that. One subproblem was: I didn’t have even one example of a specific social instinct based on this putative mechanism—i.e., a hypothesis where a specific innate reaction would be triggered by a specific transient empathetic simulation in a specific context, such that the results would be consistent with everyday experience and evolutionary considerations. The other subproblem was: I just had lots of confusion about how these things might work in the brain, in detail.
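To make the shape of that hypothesis concrete, here is a minimal toy sketch in Python. It is purely illustrative and not my actual model: every name, data structure, and trigger condition below is a made-up placeholder, standing in for "a specific innate reaction triggered by a specific transient empathetic simulation in a specific context".

```python
# Toy illustration only. Every name, data structure, and trigger condition
# here is a made-up placeholder for the general shape of the hypothesis.

from dataclasses import dataclass
from typing import Optional

@dataclass
class EmpatheticSimulation:
    """A transient, learned-from-scratch guess about someone else's state of mind."""
    person: str            # whose mind is being briefly simulated
    appraisal_of_me: str   # e.g. "approving", "disapproving", "indifferent"

@dataclass
class Context:
    socially_observed: bool  # crude stand-in for "I'm in a social situation" features

def innate_social_reaction(sim: EmpatheticSimulation, ctx: Context) -> Optional[str]:
    """Hypothetical hard-coded trigger: if the transiently-simulated other person
    seems to approve of me, and the context is social, fire a specific innate
    reaction (something in the vicinity of a 'drive to feel liked / admired')."""
    if sim.appraisal_of_me == "approving" and ctx.socially_observed:
        return "reward + warm-glow visceral reaction"
    return None

# Example: a momentary flash of "this person seems to approve of me"
print(innate_social_reaction(EmpatheticSimulation("Zoe", "approving"), Context(True)))
```

The real open questions, of course, are what the analogues of the simulation, the context, and the trigger condition actually are in the brain, which is what the rest of this project is about.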
I made progress on the first subproblem in late 2023, when I guessed that there’s an innate “drive to feel liked / admired”, related to prestige-seeking, and I had a specific idea about how to operationalize that. It turned out that I was still held back by confusion about how social status works, and thus I spent some time in early 2024 sorting that out—see my three posts Social status part 1/2: negotiations over object-level preferences, and Social status part 2/2: everything else, and a rewritten [Valence series] 4. Valence & Liking / Admiring (which replaced an older, flawed attempt at part 4 of the Valence series).
Now I had at least one target to aim for—an innate social drive that I felt I understood well enough to sink my teeth into. That was very helpful for thinking about how that drive might work neuroscientifically. But getting there was still a hell of a journey, and was the main thing I did the whole rest of the year. I chased down lots of leads, many of which were mostly dead ends, although I wound up figuring out lots of random stuff along the way, and in fact one of those threads turned into my 8-part Intuitive Self-Models series.
But anyway, I finally wound up with Neuroscience of human social instincts: a sketch, which posits a neuroscience-based story of how certain social instincts work, including not only the “drive to feel liked / admired” mentioned above, but also compassion and spite, which (I claim) are mechanistically related, to my surprise. Granted, many details remain hazy, but this still feels like great progress on the big picture. Hooray!
1.4 What’s next?
In terms of my moving this project forward, there's lots of obvious work in making more and better hypotheses and testing them against existing literature. Again, see Neuroscience of human social instincts: a sketch, in which I point out plenty of lingering gaps and confusions. Now, it's possible that I would hit a dead end at some point, because I have a question that is not answered in the existing neuroscience literature. In particular, the hypothalamus and brainstem have hundreds of tiny cell groups with idiosyncratic roles, and most of them remain unmeasured to date. (As an example, see §5.2 of A Theory of Laughter, the part where it says "If someone wanted to make progress on this question experimentally…"). But a number of academic groups are continuing to slowly chip away at that problem, and with a lot of luck, connectomics researchers will start mass-producing those kinds of measurements as soon as the next few years.
(Reminder that Connectomics seems great from an AI x-risk perspective, and as mentioned in the last section of that link, you can get involved by applying for jobs, some of which are for non-bio roles like “ML engineer”, or by donating.)
2. My plans going forward
Actually, “reverse-engineering human social instincts” is on hold for the moment, as I’m revisiting the big picture of safe and beneficial AGI, now that I have this new and hopefully-better big-picture understanding of human social instincts under my belt. In other words, knowing what I (think I) know now about how human social instincts work, at least in broad outline, well, what should a brain-like-AGI reward function look like? What about the training environment? And test protocols? What are we hoping that AGI developers will do with their AGIs anyway?
I’ve been so deep in neuroscience that I have a huge backlog of this kind of big-picture stuff that I haven’t yet processed.
After that, I’ll probably wind up diving back into neuroscience in general, and reverse-engineering human social instincts in particular, but only after I’ve thought hard about what exactly I’m hoping to get out of it, in terms of AGI safety, on the current margins. That way, I can be focusing on the right questions.
Separate from all that, I plan to stay abreast of the broader AGI safety field, from fundamentals to foundation models, even if the latter is not really my core interest or comparative advantage. I also plan to continue engaging in AGI safety pedagogy and outreach when I can, including probably reworking some of my blog post ideas into a peer-reviewed paper for a neuroscience journal this spring.
If someone thinks that I should be spending my time differently in 2025, please reach out and make your case!
3. Sorted list of my blog posts from 2024
The “reverse-engineering human social instincts” project:
Other neuroscience posts, generally with a less immediately obvious connection to AGI safety:
Everything else related to Safe & Beneficial AGI:
Random non-work-related rants etc. in my free time:
Also in 2024, I went through and revised my 15-post Intro to Brain-Like-AGI Safety series (originally published in 2022). For a summary of changes, see this twitter thread. (Or here without pictures, if you want to avoid twitter.) For more detailed changes, each post of the series has a changelog at the bottom.
4. Acknowledgements
Thanks Jed McCaleb & Astera Institute for generously supporting my research since August 2022!
Thanks to all the people who comment on my posts before or after publication, or share ideas and feedback with me through email or other channels, and especially those who patiently stick it out with me through long back-and-forths to hash out disagreements and confusions. I’ve learned so much that way!!!
Thanks to my coworker Seth for fruitful ideas and discussions, and to Beth Barnes and the Centre For Effective Altruism Donor Lottery Program for helping me get off the ground with grant funding in 2021-2022. Thanks Lightcone Infrastructure (don’t forget to donate!) for maintaining and continuously improving this site, which has always been an essential part of my workflow. Thanks to everyone else fighting for Safe and Beneficial AGI, and thanks to my family, and thanks to you all for reading! Happy New Year!
For a different (simpler) example of what I think it looks like to make progress towards that kind of pseudocode, see my post A Theory of Laughter.
Thanks to regional specialization across the cortex (roughly corresponding to “neural network architecture” in ML lingo), there can be a priori reason to believe that, for example, “pattern 387294” is a pattern in short-term auditory data whereas “pattern 579823” is a pattern in large-scale visual data, or whatever. But that’s not good enough. The symbol grounding problem for social instincts needs much more specific information than that. If Jun just told me that Xiu thinks I’m cute, then that’s a very different situation from if Jun just told me that Fang thinks I’m cute, leading to very different visceral reactions and drives. Yet those two possibilities are built from generally the same kinds of data.
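As a toy illustration of that gap (all pattern IDs and labels below are made up for the example): the region-level metadata is identical across the two situations, even though the appropriate innate reaction is completely different.

```python
# Toy illustration of the symbol grounding problem for social instincts.
# The innate circuitry can "see" coarse, architecture-level metadata about
# learned cortical patterns (roughly, which region they live in), but not
# what those patterns refer to. All IDs and labels are made up.

pattern_metadata = {
    "pattern_A": {"region": "short-term auditory"},  # learned pattern active for "Jun said: Xiu thinks you're cute"
    "pattern_B": {"region": "short-term auditory"},  # learned pattern active for "Jun said: Fang thinks you're cute"
}

# From the perspective of circuitry that only sees region-level metadata,
# the two situations are indistinguishable...
assert pattern_metadata["pattern_A"] == pattern_metadata["pattern_B"]

# ...yet the appropriate visceral reactions and drives depend on *who* is
# involved, which is information that coarse region-level metadata simply
# does not carry.
```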