Matt Botvinick is Director of Neuroscience Research at DeepMind. In this interview, he discusses results from a 2018 paper describing conditions under which reinforcement learning algorithms will spontaneously give rise to separate, full-fledged reinforcement learning algorithms that differ from the original. Here are some notes I gathered from the interview and paper:
Initial Observation
At some point, some DeepMind researchers in Botvinick’s group noticed that when they trained an RNN using RL on a series of related tasks, the RNN itself instantiated a separate reinforcement learning algorithm. These researchers weren’t trying to design a meta-learning algorithm—apparently, to their surprise, this just spontaneously happened. As Botvinick describes it, they started “with just one learning algorithm, and then another learning algorithm kind of... emerges, out of, like out of thin air”:
"What happens... it seemed almost magical to us, when we first started realizing what was going on—the slow learning algorithm, which was just kind of adjusting the synaptic weights, those slow synaptic changes give rise to a network dynamics, and the dynamics themselves turn into a learning algorithm.”
Other versions of this basic architecture—e.g., using slot-based memory instead of RNNs—seemed to produce the same basic phenomenon, which they termed "meta-RL." So they concluded that all that’s needed for a system to give rise to meta-RL are three very general properties: the system must 1) have memory, 2) have its weights trained by an RL algorithm, and 3) be trained on a sequence of similar input data.
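To make those three ingredients concrete, here is a minimal sketch of the kind of setup that produces the effect: an LSTM policy (memory) whose weights are trained by a policy-gradient RL algorithm across many related bandit tasks. This is my own toy illustration, not the architecture from the paper, and the names and hyperparameters are made up.

```python
# Toy sketch of the three ingredients (not the paper's architecture; all names
# and hyperparameters here are made up): (1) memory, via an LSTM; (2) weights
# trained by an RL algorithm, here plain REINFORCE; (3) a sequence of similar
# tasks, here a freshly sampled two-armed bandit each episode.
import torch
import torch.nn as nn

n_arms, hidden = 2, 48
policy = nn.LSTMCell(n_arms + 1, hidden)   # input: one-hot previous action + previous reward
head = nn.Linear(hidden, n_arms)           # hidden state -> action logits
opt = torch.optim.Adam(list(policy.parameters()) + list(head.parameters()), lr=1e-3)

for episode in range(10_000):              # each episode is a *new* bandit task
    p_reward = torch.rand(n_arms)          # arm payoff probabilities, unknown to the agent
    h = c = torch.zeros(1, hidden)
    x = torch.zeros(1, n_arms + 1)         # no previous action or reward yet
    log_probs, rewards = [], []
    for t in range(100):
        h, c = policy(x, (h, c))
        dist = torch.distributions.Categorical(logits=head(h))
        a = dist.sample()
        r = torch.bernoulli(p_reward[a])
        log_probs.append(dist.log_prob(a))
        rewards.append(r)
        x = torch.cat([nn.functional.one_hot(a, n_arms).float(), r.view(1, 1)], dim=1)
    # The slow, weight-level learner: REINFORCE on the return-to-go.
    returns = torch.cumsum(torch.stack(rewards).flip(0), dim=0).flip(0)
    loss = -(torch.stack(log_probs).squeeze() * returns.squeeze()).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```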
From Botvinick’s description, it sounds to me like he thinks [learning algorithms that find/instantiate other learning algorithms] is a strong attractor in the space of possible learning algorithms:
“...it's something that just happens. In a sense, you can't avoid this happening. If you have a system that has memory, and the function of that memory is shaped by reinforcement learning, and this system is trained on a series of interrelated tasks, this is going to happen. You can't stop it."
Search for Biological Analogue
This system reminded some of the neuroscientists in Botvinick’s group of features observed in brains. For example, like RNNs, the human prefrontal cortex (PFC) is highly recurrent, and the RL and RNN memory systems in their meta-RL model reminded them of “synaptic memory” and “activity-based memory.” They decided to look for evidence of meta-RL occurring in brains, since finding a neural analogue of the technique would provide some evidence they were on the right track, i.e. that the technique might scale to solving highly complex tasks.
They think they found one. In short, they think that part of the dopamine system (DA) is a full-fledged reinforcement learning algorithm, which trains/gives rise to another full-fledged, free-standing reinforcement learning algorithm in PFC, in basically the same way (and for the same reason) the RL-trained RNNs spawned separate learning algorithms in their experiments.
As I understand it, their story goes as follows:
The PFC, along with the bits of basal ganglia and thalamic nuclei it connects to, forms an RNN. Its inputs are sensory percepts, and information about past actions and rewards. Its outputs are actions, and estimates of state value.
DA[1] is an RL algorithm that feeds reward prediction error to PFC. Historically, people assumed the purpose of sending this prediction error was to update PFC’s synaptic weights. Wang et al. agree that this happens, but argue that the principal purpose of sending prediction error is to cause the creation of “a second RL algorithm, implemented entirely in the prefrontal network’s activation dynamics.” That is, they think DA mostly stores its model in synaptic memory, while PFC mostly stores it in activity-based memory (i.e. directly in the dopamine distributions).[2]
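For reference, the dopaminergic teaching signal in this story is the standard temporal-difference reward prediction error. Here is a toy version of the weight-level ("slow") update that signal is classically assumed to drive; the notation is mine, not the paper's.

```python
# Classic picture of the dopaminergic teaching signal: a temporal-difference
# reward prediction error that nudges value weights. In the meta-RL story, this
# same slow signal also sculpts the recurrent dynamics that act as the fast,
# activity-based learner. (Toy linear TD(0); the notation is mine.)
import numpy as np

def td_update(w, feat_s, feat_s_next, reward, gamma=0.99, lr=0.01):
    """One step of linear TD(0) learning on value weights w."""
    delta = reward + gamma * (w @ feat_s_next) - (w @ feat_s)  # reward prediction error
    return w + lr * delta * feat_s, delta                      # slow synaptic change driven by delta

w = np.zeros(4)                                                # toy 4-feature state representation
w, delta = td_update(w, np.array([1., 0., 0., 0.]), np.array([0., 1., 0., 0.]), reward=1.0)
```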
What’s the case for this story? They cite a variety of neuroscience findings as evidence for parts of this hypothesis, many of which involve doing horrible things to monkeys, and some of which they simulate using their meta-RL model to demonstrate that it gives similar results. These points stood out most to me:
Does RL occur in the PFC?
Some scientists implanted neuroimaging devices in the PFCs of monkeys, then sat the monkeys in front of two screens with changing images, and rewarded them with juice when they stared at whichever screen was displaying a particular image. The probabilities of each image leading to juice-delivery periodically changed, causing the monkeys to update their policies. Neurons in their PFCs appeared to exhibit RL-like computation—that is, to use information about the monkey’s past choices (and associated rewards) to calculate the expected value of actions, objects and states.
Wang et al. simulated this task using their meta-RL system. They trained an RNN on the changing-images task using RL; when run, it apparently performed similarly to the monkeys, and when they inspected it they found units that similarly seemed to encode expected-value estimates based on prior experience, continually adjust the action policy, etc.
Interestingly, the system continued to improve its performance even once its weights were fixed, which they take to imply that the learning which led to improved performance could only have occurred within the activation patterns of the recurrent network.[3]
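The frozen-weights test is easy to state in code. Continuing the toy sketch from earlier (again, my own illustrative setup, not Wang et al.'s), evaluation runs with gradients disabled, so any within-episode improvement has to be carried by the recurrent state:

```python
# The frozen-weights test, continuing the toy setup above: no optimizer step,
# no gradients. If average reward still climbs over a fresh bandit episode,
# the adaptation is being carried by the hidden state (h, c), not the weights.
with torch.no_grad():
    p_reward = torch.rand(n_arms)          # an unseen bandit task
    h = c = torch.zeros(1, hidden)
    x = torch.zeros(1, n_arms + 1)
    rewards = []
    for t in range(100):
        h, c = policy(x, (h, c))
        a = torch.distributions.Categorical(logits=head(h)).sample()
        r = torch.bernoulli(p_reward[a])
        rewards.append(r.item())
        x = torch.cat([nn.functional.one_hot(a, n_arms).float(), r.view(1, 1)], dim=1)
    print("avg reward, first half: ", sum(rewards[:50]) / 50)
    print("avg reward, second half:", sum(rewards[50:]) / 50)
```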
Can the two RL algorithms diverge?
When humans perform two-armed bandit tasks where payoff probabilities oscillate between stable and volatile, they increase their learning rate during volatile periods, and decrease it during stable periods. Wang et al. ran their meta-RL system on the same task, and it varied its learning rate in ways that mimicked human performance. This learning again occurred after weights were fixed, and notably, between the end of training and the end of the task, the learning rates of the two algorithms had diverged dramatically.
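To see why the learning rate should adapt, here is a toy illustration of the task structure (my own, not the paper's simulation): a delta-rule tracker with a fixed learning rate faces a trade-off, since a high rate chases the jumps during volatile blocks while a low rate is less noisy during stable ones. An agent that can modulate its effective learning rate online gets the best of both.

```python
# Toy stable/volatile bandit arm (my own toy, not the paper's simulation):
# a delta-rule tracker with a fixed learning rate faces a trade-off between
# chasing jumps (helped by a high rate) and filtering reward noise when the
# payoff probability holds still (helped by a low rate).
import numpy as np

rng = np.random.default_rng(0)

def run(lr, n_runs=50, T=400):
    """Mean |estimate - true probability| during stable vs. volatile trials."""
    stable_err, volatile_err = [], []
    for _ in range(n_runs):
        p, est = 0.8, 0.5
        for t in range(T):
            volatile = 100 <= t < 300
            if volatile and t % 10 == 0:
                p = rng.uniform(0.1, 0.9)      # payoff probability jumps around
            r = rng.random() < p               # Bernoulli reward from this arm
            est += lr * (r - est)              # delta rule with a fixed learning rate
            (volatile_err if volatile else stable_err).append(abs(est - p))
    return np.mean(stable_err), np.mean(volatile_err)

for lr in (0.05, 0.3):
    s, v = run(lr)
    print(f"lr={lr}: tracking error  stable={s:.3f}  volatile={v:.3f}")
```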
Implications
The account detailed by Botvinick and Wang et al. strikes me as a relatively clear example of mesa-optimization, and I interpret it as tentative evidence that the attractor toward mesa-optimization is strong. [Edit: Note that some commenters, like Rohin Shah and Evan Hubinger, disagree].
These researchers did not set out to train RNNs in such a way that they would turn into reinforcement learners. It just happened. And the researchers seem to think this phenomenon will occur spontaneously whenever “a very general set of conditions” is met, like the system having memory, being trained via RL, and receiving a sequence of related inputs. Meta-RL, in their view, is just “an emergent effect that results when the three premises are concurrently satisfied... these conditions, when they co-occur, are sufficient to produce a form of ‘meta-learning’, whereby one learning algorithm gives rise to a second, more efficient learning algorithm.”
So on the whole I felt alarmed reading this. That said, if mesa-optimization is a standard feature[4] of brain architecture, it seems notable that humans don’t regularly experience catastrophic inner alignment failures. Maybe this is just because of some non-scalable hack, like that the systems involved aren’t very powerful optimizers.[5] But I wouldn't be surprised if coming to better understand the biological mechanisms involved led to safety-relevant insights.
Thanks to Rafe Kennedy for helpful comments and feedback.
The authors hypothesize that DA is a model-free RL algorithm, and that the spinoff (mesa?) RL algorithm it creates within PFC is model-based, since that’s what happens in their ML model. But they don’t cite biological evidence for this. ↩︎
Depending on what portion of memories are encoded in this way, it may make sense for cryonics standby teams to attempt to reduce the supraphysiological intracellular release of dopamine that occurs after cardiac arrest, e.g. by administering D1-receptor antagonists. Otherwise entropy increases in PFC dopamine distributions may result in information loss. ↩︎
They demonstrated this phenomenon (continued learning after weights were fixed) in a variety of other contexts, too. For example, they cite an experiment in which manipulating DA activity was shown to directly manipulate monkeys’ reward estimations, independent of actual reward—i.e., when their DA activity was blocked/stimulated while they pressed a lever, they exhibited reduced/increased preference for that lever, even if pressing it did/didn’t give them food. They trained their meta-RL system to simulate this, again found that it performed similarly to the monkeys, and again noticed that it continued learning even after the weights were fixed. ↩︎
The authors seem unsure whether meta-RL also occurs in other brain regions, since for it to occur you need A) inputs carrying information about recent actions/rewards, and B) network dynamics (like recurrence) that support continual activation. Maybe only PFC has this confluence of features. Personally, I doubt it; I would bet that meta-RL (and other sorts of mesa-optimization) occur in a wide variety of brain systems, but it would take more time than I want to allocate here to justify that intuition. ↩︎
Although note that neuroscientists do commonly describe the PFC as disproportionately responsible for the sort of human behavior one might reasonably wish to describe as “optimization.” For example, the neuroscience textbook recommended on lukeprog’s textbook recommendation post describes PFC as “often assumed to be involved in those characteristics that distinguish us from other animals, such as self-awareness and the capacity for complex planning and problem solving.” ↩︎
(EDIT: I'm already seeing downvotes of the post; it was originally at 58 AF karma. This wasn't my intention: I think this is a failure of the community as a whole, not of the author.)
Okay, this has gotten enough karma and has been curated and has influenced another post, so I suppose I should engage, especially since I'm not planning to put this in the Alignment Newsletter.
(A lot copied over from this comment of mine)
This is extremely basic RL theory.
The linked paper studies bandit problems, where each episode of RL is a new bandit problem where the agent doesn't know which arm gives maximal reward. Unsurprisingly, the agent learns to first explore, and then exploit the best arm. This is a simple consequence of the fact that you have to look at observations to figure out what to do. Basic POMDP theory will tell you that when you have partial observability your policy needs to depend on history, i.e. it needs to learn.
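To make the point concrete, here is a textbook history-dependent bandit policy (Thompson sampling, nothing to do with the paper): it is a fixed, hand-written mapping from observation history to actions, no weights are ever updated, and yet within an episode it explores and then exploits, which is exactly the behavior being described as an emergent learning algorithm.

```python
# Thompson sampling: a fixed, hand-written mapping from observation history to
# actions. No weights are ever updated, yet within an episode it explores and
# then exploits -- i.e., it "learns" in exactly the sense at issue.
import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.3, 0.7])              # unknown arm payoff probabilities
successes = np.zeros(2)
failures = np.zeros(2)

for t in range(200):
    samples = rng.beta(successes + 1, failures + 1)  # sample from each arm's Beta posterior
    a = int(np.argmax(samples))                      # the policy is just f(history)
    r = rng.random() < p_true[a]
    successes[a] += r
    failures[a] += 1 - r

print("pulls per arm:", successes + failures)        # the better arm gets most of the pulls
```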
However, because bandit problems have been studied in the AI literature, and "learning algorithms" have been proposed to solve bandit problems, this very normal fact of a policy depending on observation history is now trotted out as "learning algorithms spontaneously emerge". I don't understand why this was surprising to the original researchers, it seems like if you just thought about what the optimal policy would be given the observable information, you would make exactly this prediction. Perhaps it's because it's primarily a neuroscience paper, and they weren't very familiar with AI.
More broadly, I don't understand what people are talking about when they speak of the "likelihood" of mesa optimization. If you mean the chance that the weights of a neural network are going to encode some search algorithm, then this paper should be ~zero evidence in favor of it. If you mean the chance that a policy trained by RL will "learn" without gradient descent, I can't imagine a way that could fail to be true for an intelligent system trained by deep RL -- presumably a system that is intelligent is capable of learning quickly, and when we talk about deep RL leading to an intelligent AI system, presumably we are talking about the policy being intelligent (what else?), therefore the policy must "learn" as it is being executed.
Gwern notes here that we've seen this elsewhere. This is because it's exactly what you'd expect, just that in the other cases we call conditioning on observations "adaptation" rather than "learning".
----
Meta: I'm disappointed that I had to be the one to point this out. (Though to be fair, Gwern clearly understands this point.) There's clearly been a lot of engagement with this post, and yet this seemingly obvious point hasn't been said. When I saw this post first come up, my immediate reaction was "oh I'm sure this is a typical LW example of a case where the optimal policy is interpreted as learning, I'm not even going to bother clicking on the link". Do we really have so few people who understand machine learning, that of the many, many views this post must have had, not one person could figure this out? It's really no surprise that ML researchers ignore us if this is the level of ML understanding we as a community have.
EDIT: I should give credit to Nevan for pointing out that this paper is not much evidence in favor of the hypothesis that the neural network weights encode some search algorithm (before I wrote this comment).
I agree. It seems pretty bad if the participants of a forum about AI alignment don't know RL.