(EDIT: I'm already seeing downvotes of the post; it was originally at 58 AF karma. This wasn't my intention: I think this is a failure of the community as a whole, not of the author.)
Okay, this has gotten enough karma and has been curated and has influenced another post, so I suppose I should engage, especially since I'm not planning to put this in the Alignment Newsletter.
(A lot copied over from this comment of mine)
This is extremely basic RL theory.
The linked paper studies bandit problems, where each episode of RL is a new bandit problem in which the agent doesn't know which arm gives maximal reward. Unsurprisingly, the agent learns to first explore and then exploit the best arm. This is a simple consequence of the fact that you have to look at observations to figure out what to do. Basic POMDP theory will tell you that when you have partial observability your policy needs to depend on history, i.e. it needs to learn.
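To make that concrete, here is a toy sketch (my own, not from the paper; the 0.9/0.1 payoff probabilities and episode length are made up) of why a policy that conditions on history beats any fixed-action policy when the best arm is resampled each episode:

```python
import random

def run_episode(policy, n_steps=20):
    best_arm = random.randint(0, 1)          # resampled each episode; unknown to the agent
    probs = [0.9 if arm == best_arm else 0.1 for arm in (0, 1)]
    history, total = [], 0.0
    for _ in range(n_steps):
        action = policy(history)
        reward = float(random.random() < probs[action])
        history.append((action, reward))
        total += reward
    return total

def fixed_policy(history):
    return 0                                 # ignores observations entirely

def explore_then_exploit(history):
    for arm in (0, 1):                       # try each arm at least once
        if not any(a == arm for a, _ in history):
            return arm
    def mean(arm):
        pulls = [r for a, r in history if a == arm]
        return sum(pulls) / len(pulls)
    return max((0, 1), key=mean)             # then exploit the empirically better arm

print(sum(run_episode(fixed_policy) for _ in range(2000)) / 2000)          # ~10 of 20
print(sum(run_episode(explore_then_exploit) for _ in range(2000)) / 2000)  # noticeably higher
```

The history-dependent policy "learns" within each episode in exactly the sense at issue: it is just a function of the observation history.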
However, because bandit problems have been studied in the AI literature, and "learning algorithms" have been proposed to solve bandit problems, this very normal fact of a policy depending on observation history is now trotted out as "...
> This is extremely basic RL theory.
I note that this doesn't feel like a problem to me, mostly because of reasons related to Explainers Shoot High. Aim Low!. Even among ML experts, many of them haven't touched much RL, because they're focused on another field. Why expect them to know basic RL theory, or to have connected that to all the other things that they know?
More broadly, I don't understand what people are talking about when they speak of the "likelihood" of mesa optimization.
I don't think I have a fully crisp view of this, but here's my frame on it so far:
One view is that we design algorithms to do things, and those algorithms have properties that we can reason about. Another is that we design loss functions, and then search through random options for things that perform well on those loss functions. In the second view, often which options we search through doesn't matter very much, because there's something like the "optimal solution" that all things we actually find will be trying to approximate in one way or another.
Mesa-optimization is something like, "when we search through the options, will we find something that itself searches through a different set of options?". Some...
> I note that this doesn't feel like a problem to me, mostly because of reasons related to Explainers Shoot High. Aim Low!. Even among ML experts, many of them haven't touched much RL, because they're focused on another field. Why expect them to know basic RL theory, or to have connected that to all the other things that they know?
I'm perfectly happy with good explanations that don't assume background knowledge. The flaw I am pointing to has nothing to do with explanations. It is that despite this evidence being a clear consequence of basic RL theory, for some reason readers are treating it as important evidence. Clearly I should update negatively on things-AF-considers-important. At a more gears level, presumably I should update towards some combination of:
Any of these would be a pretty damning critique of the forum. And the update should be fairly strong, given that this was (prior to my comment) the highest-upvoted post ever by AF karma.
I think people often underestimate the degree to wh...
I guess I should explain why I upvoted this post despite agreeing with you that it's not new evidence in favor of mesa-optimization. I actually had a conversation about this post with Adam Shimi prior to you commenting on it where I explained to him that I thought that not only was none of it new but also that it wasn't evidence about the internal structure of models and therefore wasn't really evidence about mesa-optimization. Nevertheless, I chose to upvote the post and not comment my thoughts on it. Some reasons why I did that:
(Flagging that I curated the post, but was mostly relying on Ben and Habryka's judgment, in part since I didn't see much disagreement. Since this discussion I've become more agnostic about how important this post is)
One thing this comment makes me want is more nuanced reacts, so that people have an affordance to communicate how they feel about a post in a way that's easier to aggregate.
Though I also notice that with this particular post it's a bit unclear what react would be appropriate, since it sounds like it's not "disagree" so much as "this post seems confused" or something.
Unfortunately, I also only have so much time, and I don't generally think that repeating myself regularly in AF/LW comments is a super great use of it.
I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.
That said, I feel confused by a number of your arguments, so I'm working on a reply. Before I post it, I'd be grateful if you could help me make sure I understand your objections, so as to avoid accidentally publishing a long post in response to a position nobody holds.
I currently understand you to be making four main claims:
Does this summary feel like it reasonably characterizes your objections?
> I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.
Thanks. I know I came off pretty confrontational, sorry about that. I didn't mean to target you specifically; I really do see this as bad at the community level but fine at the individual level.
I don't think you've exactly captured what I meant, some comments below.
> The system is just doing the totally normal thing “conditioning on observations,” rather than something it makes sense to describe as "giving rise to a separate learning algorithm."
I think it is reasonable to describe it both as "conditioning on observations" and as "giving rise to a separate learning algorithm".
> It is probably not the case that in this system, “learning is implemented in neural activation changes rather than neural weight changes.”
On my interpretation of "learning" in this context, I would agree with that claim (i.e. I agree that learning is implemented in activation changes rather than weight changes via ...
I feel confused about why, on this model, the researchers were surprised that this occurred, and seem to think it was a novel finding that it will inevitably occur given the three conditions described. Above, you mentioned the hypothesis that maybe they just weren't very familiar with AI. But looking at the author list, and their publications (e.g. 1, 2, 3, 4, 5, 6, 7, 8), this seems implausible to me. Most of the co-authors are neuroscientists by training, but a few have CS degrees, and all but one have co-authored previous ML papers. It's hard for me to imagine their surprise was due to them lacking basic knowledge about RL?
Also, this OpenAI paper (whose authors seem quite familiar with ML)—which the summary of Wang et al. on DeepMind's website describes as "closely related work," and which appears to me to involve a very similar setup—describes their result similarly:
We structure the agent as a recurrent neural network, which receives past rewards, actions, and termination flags as inputs in addition to the normally received observations. Furthermore, its internal state is preserved across episodes, so that it has the capacity to perform learning in its own hidden activations. T...
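For readers who want to see what that quoted setup looks like in code, here is a minimal PyTorch sketch of the input and recurrence structure it describes (the dimensions, layer choices, and class name are my own assumptions, not taken from either paper):

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim=4, n_actions=2, hidden_dim=48):
        super().__init__()
        # input = observation + one-hot previous action + previous reward + termination flag
        self.rnn = nn.GRUCell(obs_dim + n_actions + 1 + 1, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, n_actions)
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs, prev_action_onehot, prev_reward, done, h):
        # all tensors carry a leading batch dimension; prev_reward and done are (batch, 1)
        x = torch.cat([obs, prev_action_onehot, prev_reward, done], dim=-1)
        h = self.rnn(x, h)   # h is deliberately NOT reset at episode boundaries
        return self.policy_head(h), self.value_head(h), h
```

The point is just that the previous action, reward, and termination flag enter as ordinary inputs, and the hidden state h is what persists across episode boundaries.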
I imagine this was not your intention, but I'm a little worried that this comment will have an undesirable chilling effect. I think it's good for people to share when members of DeepMind / OpenAI say something that sounds a lot like "we found evidence of mesa-optimization".
I also think you're right that we should be doing a lot better on pushing back against such claims. I hope LW/AF gets better at being as skeptical of AI researchers assertions that support risk as they are of those that undermine risk. But I also hope that when those researchers claim something surprising and (to us) plausibly risky is going on, we continue to hear about it.
> I imagine this was not your intention, but I'm a little worried that this comment will have an undesirable chilling effect.
Note that there are desirable chilling effects too. I think it's broadly important to push back on inaccurate claims, or ones that have the wrong level of confidence. (Like, my comment elsewhere is intended to have a chilling effect.)
E.g. TurnTrout has done a lot of self-learning from textbooks and probably has better advice [for learning RL]
I have been summoned! I've read a few RL textbooks... unfortunately, they're either a) very boring, b) very old, or c) very superficial. I've read:
Which is exactly why I asked you for recommendations.
Yes, I never said you shouldn't ask me for recommendations. I'm saying that I don't have any good recommendations to give, and you should probably ask other people for recommendations.
showing some concrete things that might be relevant (as I repeated in each comment, not an exhaustive list) would make the injunction more effective.
In practice I find that anything I say tends to lose its nuance as it spreads, so I've moved towards saying fewer things that require nuance. If I said "X might be a good resource to learn from but I don't really know", I would only be a little surprised to hear a complaint in the future of the form "I deeply read X for two months because Rohin recommended it, but I still can't understand this deep RL paper".
If I actually were confident in some resource, I agree it would be more effective to mention it.
I'm just confused because it seems low effort for you, net positive, and the kind of "ask people for recommendation" that you preach in the previous comment.
I'm not convinced the low effort version is net positive, for the reasons menti...
What is all of humanity if not a walking catastrophic inner alignment failure? We were optimized for one thing: inclusive genetic fitness. And only a tiny fraction of humanity could correctly define what that is!
It could both be the case that there exists catastrophic inner alignment failure between humans and evolution, and also that humans don't regularly experience catastrophic inner alignment failures internally.
In practice I do suspect humans regularly experience internal inner alignment failures, but given that suspicion I feel surprised by how functional humans do manage to be. In other words, I notice expecting that regular inner alignment failures would cause far more mayhem than I observe, which makes me wonder whether brains are implementing some sort of alignment-relevant tech.
> In practice I do suspect humans regularly experience internal (within-brain) inner alignment failures, but given that suspicion I feel surprised by how functional humans manage to be. That is, I notice expecting that regular inner alignment failures would cause far more mayhem than I observe, which makes me wonder whether brains are implementing some sort of alignment-relevant tech.
I don't know why you expect an inner alignment failure to look dysfunctional. Instrumental convergence suggests that it would look functional. What the world looks like if there are inner alignment failures inside the human brain is (in part) that humans pursue a greater diversity of terminal goals than can be accounted for by genetics.
All natural selection does is gradient descent (hill climbing technically), with no capacity for lookahead.
I think if you're interested in the analysis and classification of optimization techniques, there's enough differences between what natural selection is doing and what deep learning is doing that it isn't a very natural analogy. (Like, one is a population-based method and the other isn't, the update rules are different, etc.)
Thanks for this. It seems important. Learning still happening after weights are frozen? That's crazy. I think it's a big deal because it is evidence for mesa-optimization being likely and hard to avoid.
It also seems like evidence for the Scaling Hypothesis. One major way the scaling hypothesis could be false is if there are further insights needed to get transformative AI, e.g. a new algorithm or architecture. A simple neural network spontaneously learning to do its own, more efficient form of learning? This seems like a data point in favor of the idea that our current architectures and algorithms are fine, and will eventually (if they are big enough) grope their way towards more efficient internal structures on their own.
EDIT: Now I'm less sure of all the above, thanks to Rohin's comment below. I guess this is a case of "Evidence to the people who didn't already understand the theory well enough to make the prediction," which maybe included me? Though I think I would have made the prediction too had I been asked...
The thing I meant by "catastrophic" is just "leading to death of the organism." I suspect mesa-optimization is common in humans, but I don't feel confident about this, nor that this is a joint-carvey ontology. I can imagine it being the case that many examples of e.g. addiction, goodharting, OCD, and even just "everyday personal misalignment"-type problems of the sort IFS/IDC/multi-agent models of mind sometimes help with, are caused by phenomena which might reasonably be described as inner alignment failures.
But I think these things don't kill people very often? People do sometimes choose to die because of beliefs. And anorexia sometimes kills people, which currently feels to me like the most straightforward candidate example I've considered.
I just feel like things could be a lot worse. For example, it could have been the case that mind-architectures that give rise to mesa-optimization at all simply aren't viable at high levels of optimization power—that it always kills them. Or that it basically always leads to the organism optimizing for a set of goals which is unrecognizably different from the base objective. I don't think you see these things, so I'm curious how evolution prevented them.
Governments and corporations experience inner alignment failures all the time, but because of convergent instrumental goals, they are rarely catastrophic. For example, Russia underwent a revolution and a civil war on the inside, followed by purges and coups etc., but from the perspective of other nations, it was more or less still the same sort of thing: A nation, trying to expand its international influence, resist incursions, and conquer more territory. Even its alliances were based as much on expediency as on shared ideology.
Perhaps something similar happens with humans.
Funnily enough, I wrote a blog post distilling what I learned from reproducing experiments from that 2018 Nature paper, adding some animations and diagrams. I especially look at the two-step task, the Harlow task (the one with monkeys looking at a screen), and also try to explain some brain things (e.g. how DA interacts with the PFN) at the end.
The slot-based NN paper is "Meta-Learning with Memory-Augmented Neural Networks", Santoro et al 2016 (Arxiv).
I don't think that paper is an example of mesa-optimization, because the policy could be implementing a very simple heuristic to solve the task, similar to: pick the image that led to the highest reward in the last 10 timesteps with 90% probability; pick an image at random with 10% probability.
So the policy doesn't have to have any properties of a mesa-optimizer, like considering possible actions and evaluating them with a utility function, etc.
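For concreteness, here is a minimal sketch of the kind of heuristic described above (my own illustration, not code from the paper):

```python
import random

def play_the_winner(history, n_actions=2, window=10, eps=0.1):
    """history: list of (action, reward) pairs from recent timesteps."""
    if random.random() < eps or not history:
        return random.randrange(n_actions)      # random action 10% of the time
    recent = history[-window:]                  # last 10 timesteps
    totals = [sum(r for a, r in recent if a == act) for act in range(n_actions)]
    return max(range(n_actions), key=lambda act: totals[act])
```

Nothing in this function searches over possible actions or evaluates them against an internal objective, which is the point being made.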
Whenever an RL agent is trained in a partially observed environment, the agent has to take actions to learn about parts of its environment that it hasn't observed yet or that may have changed. The difference with this paper is that the observations it gets from the environment happen to be the reward the agent received in the previous timestep. However, as far as the policy is concerned, the reward it gets as input is just another component of the state. So the fact that the policy gets the previous reward as input doesn't make it stand out compared to any other partially observed environment.
The argument that these and other meta-RL researchers usually make is that (as indicated by the various neurons which fluctuate, and I think based on some other parts of their experiments which I would have to reread to list) what these RNNs are learning is not just a simple play-the-winner heuristic (which is suboptimal, and your suggestion would require only 1 neuron to track the winning arm) but amortized Bayesian inference where the internal dynamics are learning the sufficient statistics of the Bayes-optimal solution to the POMDP (where you're unsure which of a large family of MDPs you're in): "Meta-learning of Sequential Strategies", Ortega et al 2019; "Reinforcement Learning, Fast and Slow", Botvinick et al 2019; "Meta-learners' learning dynamics are unlike learners'", Rabinowitz 2019; "Bayesian Reinforcement Learning: A Survey", Ghavamzadeh et al 2016, are some of the papers that come to mind. Then you can have a fairly simple decision rule using that as the input (e.g. Figure 4 of Ortega on a coin-flipping example, which is a setup near & dear to my heart).
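To unpack "sufficient statistics of the Bayes-optimal solution" in the bandit case: for Bernoulli arms, those statistics are just per-arm success/failure counts, which a Beta posterior summarizes. A minimal sketch (mine, not from the cited papers):

```python
# For a Bernoulli bandit, Bayes-optimal inference only needs per-arm counts of
# successes and failures; a Beta(1, 1) prior updated with those counts gives the
# posterior over each arm's payoff probability.
def posterior_means(history, n_arms=2):
    """history: list of (arm, reward) pairs with reward in {0, 1}."""
    alpha = [1] * n_arms   # 1 + successes per arm
    beta = [1] * n_arms    # 1 + failures per arm
    for arm, reward in history:
        if reward:
            alpha[arm] += 1
        else:
            beta[arm] += 1
    return [alpha[a] / (alpha[a] + beta[a]) for a in range(n_arms)]

print(posterior_means([(0, 1), (0, 1), (1, 0)]))  # arm 0 looks better: [0.75, 0.333...]
```

The claim in these papers is roughly that the RNN's hidden dynamics come to encode statistics like these, with the policy acting as a simple readout of the resulting posterior.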
To reuse a quote from my backstop essay: as Duff 2002 puts it,
..."One way of thinking about the computational proc
It looks like humans actually suffer from mesa-optimisation: our minds find hacks to get more dopamine via some sort of "illegal" reward-center stimulation, such as pornography or drugs.
What you're describing is humans being mesa-optimizers inside the natural selection algorithm. The phenomenon this post talks about is one level deeper.
I dunno, I didn't really like the meta-RL paper. Maybe it has merits I'm not seeing. But I didn't find the main analogy helpful. I also don't think "mesa-optimizer" is a good description of the brain at this level. (i.e., not the level involving evolution). I prefer "steered optimizer" for what it's worth. :-)
The temporal difference learning algorithm is an efficient way to do reinforcement learning. And probably something like it happens in the human brain. If you are playing a game like chess, it may take a long time to get enough examples of wins and losses for training an algorithm to predict good moves. Say you play 128 games; that's only 7 bits of information, which is nothing. You have no way of knowing which moves in a game were good and which were bad. You have to assume all moves made during a losing game were bad, which throws out a lot of informati...
Curated. [Edit: no longer particularly endorsed in light of Rohin's comment, although I also have not yet really vetted Rohin's comment either and currently am agnostic on how important this post is]
When I first started following LessWrong, I thought the sequences made a good theoretical case for the difficulties of AI Alignment. In the past few years we've seen more concrete, empirical examples of how AI progress can take shape and how that might be alarming. We've also seen more concrete simple examples of AI failure in the form of specification gaming a...
Matt Botvinick is Director of Neuroscience Research at DeepMind. In this interview, he discusses results from a 2018 paper describing conditions under which reinforcement learning algorithms will spontaneously give rise to separate, full-fledged reinforcement learning algorithms that differ from the original. Here are some notes I gathered from the interview and paper:
Initial Observation
At some point, some DeepMind researchers in Botvinick’s group noticed that when they trained an RNN using RL on a series of related tasks, the RNN itself instantiated a separate reinforcement learning algorithm. These researchers weren’t trying to design a meta-learning algorithm—apparently, to their surprise, this just spontaneously happened. As Botvinick describes it, they started “with just one learning algorithm, and then another learning algorithm kind of... emerges, out of, like out of thin air”:
Other versions of this basic architecture—e.g., using slot-based memory instead of RNNs—seemed to produce the same basic phenomenon, which they termed "meta-RL." So they concluded that all that’s needed for a system to give rise to meta-RL are three very general properties: the system must 1) have memory, 2) whose weights are trained by an RL algorithm, 3) on a sequence of similar input data.
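To make the three conditions concrete, here is a compressed, self-contained training sketch in the spirit of that setup (my own code; the use of REINFORCE, the two-armed bandit task, and all hyperparameters are assumptions rather than the paper's actual configuration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_actions, hidden, steps = 2, 32, 20
# (1) memory: a recurrent cell whose input is the one-hot previous action + previous reward
rnn = nn.GRUCell(n_actions + 1, hidden)
head = nn.Linear(hidden, n_actions)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-2)

def run_episode(train=True):
    best = torch.randint(n_actions, (1,)).item()   # (3) a fresh but related task each episode
    h = torch.zeros(1, hidden)
    prev = torch.zeros(1, n_actions + 1)
    logps, rewards = [], []
    for _ in range(steps):
        h = rnn(prev, h)
        dist = torch.distributions.Categorical(logits=head(h))
        a = dist.sample()
        r = float(torch.rand(1).item() < (0.9 if a.item() == best else 0.1))
        logps.append(dist.log_prob(a))
        rewards.append(r)
        prev = torch.cat([nn.functional.one_hot(a, n_actions).float(),
                          torch.tensor([[r]])], dim=-1)
    if train:                                      # (2) the outer loop is a plain RL update
        loss = -(sum(rewards) - steps / 2) * torch.stack(logps).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sum(rewards)

for _ in range(3000):
    run_episode(train=True)
with torch.no_grad():                              # weights frozen from here on
    avg = sum(run_episode(train=False) for _ in range(500)) / 500
print("average reward per episode with frozen weights:", avg)
```

After the outer loop finishes, the weights are frozen, yet the network still has to adapt to each freshly sampled bandit within an episode; any such adaptation can only be happening in its hidden state.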
From Botvinick’s description, it sounds to me like he thinks [learning algorithms that find/instantiate other learning algorithms] is a strong attractor in the space of possible learning algorithms:
Search for Biological Analogue
This system reminded some of the neuroscientists in Botvinick’s group of features observed in brains. For example, like RNNs, the human prefrontal cortex (PFC) is highly recurrent, and the RL and RNN memory systems in their meta-RL model reminded them of “synaptic memory” and “activity-based memory.” They decided to look for evidence of meta-RL occurring in brains, since finding a neural analogue of the technique would provide some evidence they were on the right track, i.e. that the technique might scale to solving highly complex tasks.
They think they found one. In short, they think that part of the dopamine system (DA) is a full-fledged reinforcement learning algorithm, which trains/gives rise to another full-fledged, free-standing reinforcement learning algorithm in PFC, in basically the same way (and for the same reason) the RL-trained RNNs spawned separate learning algorithms in their experiments.
As I understand it, their story goes as follows:
The PFC, along with the bits of basal ganglia and thalamic nuclei it connects to, forms an RNN. Its inputs are sensory percepts, and information about past actions and rewards. Its outputs are actions, and estimates of state value.
DA[1] is an RL algorithm that feeds reward prediction error to PFC. Historically, people assumed the purpose of sending this prediction error was to update PFC’s synaptic weights. Wang et al. agree that this happens, but argue that the principal purpose of sending prediction error is to cause the creation of “a second RL algorithm, implemented entirely in the prefrontal network’s activation dynamics.” That is, they think DA mostly stores its model in synaptic memory, while PFC mostly stores it in activity-based memory (i.e. directly in the dopamine distributions).[2]
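For readers less familiar with the terminology, "reward prediction error" here is the standard temporal-difference error from RL. A minimal sketch (mine, not the paper's implementation):

```python
def td_error(reward, value_current, value_next, gamma=0.9, terminal=False):
    """How much better or worse things went than the current value estimate predicted."""
    target = reward + (0.0 if terminal else gamma * value_next)
    return target - value_current

# An unexpectedly large reward produces a positive prediction error:
print(td_error(reward=1.0, value_current=0.2, value_next=0.0, terminal=True))  # 0.8
```

On the account described above, DA broadcasts this scalar to adjust the prefrontal network's synaptic weights, and that slow training is what gives rise to the second, activation-based learning algorithm.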
What’s the case for this story? They cite a variety of neuroscience findings as evidence for parts of this hypothesis, many of which involve doing horrible things to monkeys, and some of which they simulate using their meta-RL model to demonstrate that it gives similar results. These points stood out most to me:
Does RL occur in the PFC?
Some scientists implanted neuroimaging devices in the PFCs of monkeys, then sat the monkeys in front of two screens with changing images, and rewarded them with juice when they stared at whichever screen was displaying a particular image. The probabilities of each image leading to juice-delivery periodically changed, causing the monkeys to update their policies. Neurons in their PFCs appeared to exhibit RL-like computation—that is, to use information about the monkey’s past choices (and associated rewards) to calculate the expected value of actions, objects and states.
Wang et al. simulated this task using their meta-RL system. They trained an RNN on the changing-images task using RL; when run, it apparently demonstrated similar performance to the monkeys, and when they inspected it they found units that similarly seemed to encode EV estimates based on prior experience, continually adjust the action policy, etc.
Interestingly, the system continued to improve its performance even once its weights were fixed, which they take to imply that the learning which led to improved performance could only have occurred within the activation patterns of the recurrent network.[3]
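Here is a sketch of how one might check that kind of claim in an ML model (my own code; rnn, head, and the make_task environment interface are hypothetical stand-ins, not the paper's):

```python
import torch

def within_episode_improvement(rnn, head, make_task, steps=100, n_episodes=200):
    """Compare average reward early vs. late in each episode with all weights frozen.
    make_task is a hypothetical callable returning a step function for a fresh task;
    the step function takes an action and returns (reward, next_rnn_input)."""
    for p in list(rnn.parameters()) + list(head.parameters()):
        p.requires_grad_(False)                  # freeze every weight
    early, late = [], []
    with torch.no_grad():                        # no gradient-based updates possible
        for _ in range(n_episodes):
            step = make_task()
            h = torch.zeros(1, rnn.hidden_size)  # learning can only happen here
            prev = torch.zeros(1, rnn.input_size)
            rewards = []
            for _ in range(steps):
                h = rnn(prev, h)
                a = torch.distributions.Categorical(logits=head(h)).sample()
                r, prev = step(a)
                rewards.append(r)
            early.append(sum(rewards[:10]) / 10)
            late.append(sum(rewards[-10:]) / 10)
    return sum(early) / len(early), sum(late) / len(late)
```

If the late-in-episode average beats the early-in-episode average while every parameter is frozen, the improvement can only be coming from the recurrent state.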
Can the two RL algorithms diverge?
When humans perform two-armed bandit tasks where payoff probabilities oscillate between stable and volatile, they increase their learning rate during volatile periods, and decrease it during stable periods. Wang et al. ran their meta-RL system on the same task, and it varied its learning rate in ways that mimicked human performance. This learning again occurred after weights were fixed, and notably, between the end of training and the end of the task, the learning rates of the two algorithms had diverged dramatically.
Implications
The account detailed by Botvinick and Wang et al. strikes me as a relatively clear example of mesa-optimization, and I interpret it as tentative evidence that the attractor toward mesa-optimization is strong. [Edit: Note that some commenters, like Rohin Shah and Evan Hubinger, disagree].
These researchers did not set out to train RNNs in such a way that they would turn into reinforcement learners. It just happened. And the researchers seem to think this phenomenon will occur spontaneously whenever “a very general set of conditions” is met, like the system having memory, being trained via RL, and receiving a related sequence of inputs. Meta-RL, in their view, is just “an emergent effect that results when the three premises are concurrently satisfied... these conditions, when they co-occur, are sufficient to produce a form of ‘meta-learning’, whereby one learning algorithm gives rise to a second, more efficient learning algorithm.”
So on the whole I felt alarmed reading this. That said, if mesa-optimization is a standard feature[4] of brain architecture, it seems notable that humans don’t regularly experience catastrophic inner alignment failures. Maybe this is just because of some non-scalable hack, like that the systems involved aren’t very powerful optimizers.[5] But I wouldn't be surprised if coming to better understand the biological mechanisms involved led to safety-relevant insights.
Thanks to Rafe Kennedy for helpful comments and feedback.
The authors hypothesize that DA is a model-free RL algorithm, and that the spinoff (mesa?) RL algorithm it creates within PFC is model-based, since that’s what happens in their ML model. But they don’t cite biological evidence for this. ↩︎
Depending on what portion of memories are encoded in this way, it may make sense for cryonics standby teams to attempt to reduce the supraphysiological intracellular release of dopamine that occurs after cardiac arrest, e.g. by administering D1-receptor antagonists. Otherwise entropy increases in PFC dopamine distributions may result in information loss. ↩︎
They demonstrated this phenomenon (continued learning after weights were fixed) in a variety of other contexts, too. For example, they cite an experiment in which manipulating DA activity was shown to directly manipulate monkeys’ reward estimations, independent of actual reward—i.e., when their DA activity was blocked/stimulated while they pressed a lever, they exhibited reduced/increased preference for that lever, even if pressing it did/didn’t give them food. They trained their meta-RL system to simulate this, again demonstrated similar performance to the monkeys, and again noticed that it continued learning even after the weights were fixed. ↩︎
The authors seem unsure whether meta-RL also occurs in other brain regions, since for it to occur you need A) inputs carrying information about recent actions/rewards, and B) network dynamics (like recurrence) that support continual activation. Maybe only PFC has this confluence of features. Personally, I doubt it; I would bet that meta-RL (and other sorts of mesa-optimization) occur in a wide variety of brain systems, but it would take more time than I want to allocate here to justify that intuition. ↩︎
Although note that neuroscientists do commonly describe the PFC as disproportionately responsible for the sort of human behavior one might reasonably wish to describe as “optimization.” For example, the neuroscience textbook recommended on lukeprog’s textbook recommendation post describes PFC as “often assumed to be involved in those characteristics that distinguish us from other animals, such as self-awareness and the capacity for complex planning and problem solving.” ↩︎