This is a talk I gave at the recent AI Safety Europe Retreat (AISER) on my research into drawing insights from the cognitive science of empathy and applying them to RL agents and LLMs.
Talk link: https://clipchamp.com/watch/6c0kTETRqBc
Slides link: https://bit.ly/3ZFmjN8
Talk description: I begin with a short review of the cognitive science of empathy as a Perception-Action Mechanism (PAM), which relies on self-other overlap at the neuronal level. I then present the theory of change of this research direction, arguing that inducing self-other overlap as empathy is model-agnostic, and that it has the potential to avert AI x-risk and to be sub-agent stable in the limit. Next, I present experimental evidence of the emergence of PAM in RL agents, along with a way of inducing PAM in RL agents. I end the talk by discussing how this paradigm could be extended to LLMs.
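To make "inducing self-other overlap" more concrete, here is a minimal sketch of one way such an objective could be implemented: an auxiliary loss that penalizes the distance between a policy network's internal activations on matched "perceiving self" and "perceiving other" observations. This is an illustrative assumption, not the talk's actual method; the `PolicyNet` architecture, the MSE distance, and the random toy observations are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Toy policy network that exposes an intermediate activation."""
    def __init__(self, obs_dim: int, hidden_dim: int, n_actions: int):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs: torch.Tensor):
        hidden = torch.relu(self.encoder(obs))
        return self.head(hidden), hidden

def self_other_overlap_loss(net: PolicyNet,
                            self_obs: torch.Tensor,
                            other_obs: torch.Tensor) -> torch.Tensor:
    """Penalize divergence between the network's activations on
    observations of the agent itself vs. observations of another agent."""
    _, act_self = net(self_obs)
    _, act_other = net(other_obs)
    return F.mse_loss(act_self, act_other)

# Usage sketch: in practice this term would be added to the ordinary
# RL objective with a weighting coefficient; here we just backprop it alone.
net = PolicyNet(obs_dim=8, hidden_dim=32, n_actions=4)
self_obs = torch.randn(16, 8)   # observations where the agent perceives itself
other_obs = torch.randn(16, 8)  # matched observations where it perceives another agent
loss = self_other_overlap_loss(net, self_obs, other_obs)
loss.backward()
```

The key design choice in this kind of setup is how the "self" and "other" observation pairs are matched, since the overlap loss only makes sense when the two observations differ mainly in whose perspective is being represented.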
Acknowledgements: I am thankful to Dr. Bogdan Ionut-Cirstea for inspiring me to look into this neglected research direction, to the Long-Term Future Fund for funding the initial deep dive into the literature, and to the Center for AI Safety for funding half of this research as part of their Student Researcher programme. Last but not least, I want to thank Dr. Matthias Rolf for supervising me and providing good structure and guidance. The review of the cognitive science of empathy is adapted from a talk given by Christian Keysers of the Netherlands Institute for Neuroscience.
I’m confused by your second bullet point. Let’s say you really like chocolate chip cookies but you’re strictly neutral on peanut butter chocolate chip cookies. And they look and smell the same until you bite into one (maybe you have a bad sense of smell).
Now you see a plate on a buffet with a little sign next to it that says “Peanut butter chocolate chip cookies”. You ask your trusted friend whether the sign is correct, and they say “Yeah, I just ate one, it was very peanut buttery, yum.” Next to that plate is a plate of brownies, which you like a little bit, but much less than you like chocolate chip cookies.
Your model seems to be: “I won’t be completely sure that the sign isn’t wrong and my friend isn’t lying. So being risk-seeking, I’m going to eat the cookie, just in case.”
I don’t think that model is realistic though. Obviously you’re going to believe the sign & believe your trusted friend. The odds that the cookies aren’t peanut butter are essentially zero. You know this very well. So you’ll go for the brownie instead.
Now we switch to the empathy case. Again, I think it’s perfectly obvious to anyone that a plate labeled & described as “peanut butter chocolate chip cookies” is in fact full of peanut butter chocolate chip cookies, and people will act accordingly. Well, it’s even more obvious by far that “my friend eating a chocolate chip cookie” is not in fact “me eating a chocolate chip cookie”! So, if I don’t feel an urge to eat the cookie that I know damn well has peanut butter in it, I likewise won’t feel an urge to take actions that I know damn well will lead to someone else eating a yummy cookie instead of me.
So anyway, you seem to be assuming that the human brain has no special mechanisms to prevent the unlearning of self-other overlap. I would propose instead that the human brain does have such special mechanisms, and that we’d better go figure out what those mechanisms are. :)
I’m a bit confused by this. My “apocalypse stories” from the grandparent comment did not assume any competing incentives or mechanisms, right? They were all bad actions that I claim also flowed naturally from self-other-overlap-derived incentives.