This is a talk I gave at the recent AI Safety Europe Retreat (AISER) about my research on drawing insights from the cognitive science of empathy and applying them to RL agents and LLMs.
Talk link: https://clipchamp.com/watch/6c0kTETRqBc
Slides link: https://bit.ly/3ZFmjN8
Talk description: I begin with a short review of the cognitive science of empathy as a Perception-Action Mechanism (PAM), which relies on self-other overlap at the neuronal level. I then present the theory of change of this research direction, arguing that inducing self-other overlap as empathy is model-agnostic, has the potential to avert AI x-risk, and may be sub-agent stable in the limit. Next, I present experimental evidence of the emergence of PAM in RL agents and a way of inducing PAM in them. I end the talk by discussing how this paradigm could be extended to LLMs.
Acknowledgements: I am thankful to Dr. Bogdan Ionut-Cirstea for inspiring me to look into this neglected research direction, to the Long-Term Future Fund for funding the initial deep dive into the literature, and to the Center for AI Safety for funding half of this research as part of their Student Researcher programme. Last but not least, I want to thank Dr. Matthias Rolf for supervising me and providing good structure and guidance. The review of the cognitive science of empathy is adapted from a talk given by Christian Keysers of the Netherlands Institute for Neuroscience.
I think you’re confused, or else I don’t follow. Can we walk through it? Assume traditional ML-style actor-critic RL with TD learning, if that’s OK. There’s a value function V and a reward function R. Let’s assume we start with:
- V(Joe is about to eat a cookie) > 0
- R(Joe eats the cookie) = 0
So when I see that Joe is about to eat a cookie, this pleases me (V>0). But then Joe eats the cookie, and the reward is zero, so TD learning kicks in and reduces V(Joe is about to eat a cookie) for next time. Repeat a few more times and V(Joe is about to eat a cookie) approaches zero, right? So eventually, when I see that Joe is about to eat a cookie, I don’t care.
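Here is a minimal, illustrative sketch of the TD(0) dynamic described above (the learning rate, discount factor, and state labels are my own toy choices, not anything from the discussion): a tabular value estimate for "Joe is about to eat a cookie" shrinks toward zero when the observed reward keeps coming up zero.

```python
# Toy TD(0) illustration with assumed hyperparameters; state names are just
# labels for this example.
alpha = 0.5   # learning rate (assumed)
gamma = 0.9   # discount factor (assumed)

V = {
    "Joe is about to eat a cookie": 1.0,  # starts positive: seeing it pleases me
    "Joe ate the cookie": 0.0,            # successor state
}

for step in range(10):
    s, s_next = "Joe is about to eat a cookie", "Joe ate the cookie"
    r = 0.0  # I receive no reward when Joe eats the cookie
    # TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    print(step, round(V[s], 4))

# Each pass multiplies V(s) by (1 - alpha), so it decays toward zero,
# matching the claim that the agent eventually stops caring.
```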
How does your story differ from that? Can you walk through the mechanism in terms of V and R?
This is just terminology, but let’s say the water company tries to get people to reduce their water usage by giving them a gift card when their water usage is lower in Month N+1 than in Month N. Two of the possible behaviors that might result are (A) the intended behavior, where people use less water each month, and (B) an unintended behavior, where people waste tons and tons of water in odd months to ensure that it definitely goes down in even months.
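To make this concrete, here is a hypothetical sketch (the function name and the monthly numbers are mine, purely illustrative) showing that the single payout rule above rewards both behavior (A) and behavior (B):

```python
def gift_cards(monthly_usage):
    """Count gift cards earned under the rule: a card whenever usage in
    Month N+1 is lower than in Month N."""
    return sum(1 for prev, cur in zip(monthly_usage, monthly_usage[1:]) if cur < prev)

# (A) intended behavior: usage falls a little every month
conserver = [100, 95, 90, 85, 80, 75]

# (B) unintended behavior: waste water in odd months so even months always "improve"
gamer = [100, 300, 90, 300, 90, 300, 90]

print(gift_cards(conserver), sum(conserver))  # 5 cards, 525 units used in total
print(gift_cards(gamer), sum(gamer))          # 3 cards, 1270 units used in total
```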
I would describe this as ONE INCENTIVE that incentivizes both of these two behaviors (and many other possible behaviors as well). Whereas you would describe it as “an incentive to do (A) and a competing incentive to do (B)”, apparently. (Right?) I’m not an economist, but when I google-searched for “competing incentives” just now, none of the results were using the term in the way that you’re using it here. Typically people used the phrase “competing incentives” to talk about two different incentive programs provided by two different organizations working at cross-purposes, or something like that.