(UPDATE 1½ YEARS LATER: I’ve learned a lot since writing this post, and you shouldn’t assume that I still endorse everything herein.)
I’m biased: I have a strong prior belief that reinforcement learning should not be involved in sensory processing in the brain. (Update: I meant "directly involved", see comment.) (Update 2: I should have said "low-level sensory processing", see discussion of inferotemporal cortex here.) The reason is simple: avoiding wishful thinking.
If there’s a positive reward for looking at beautiful ocean scenes, for example, then the RL solution would converge towards parsing whatever you look at as a beautiful ocean scene, whether it actually is or not!
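A toy sketch of that worry, not a model of any brain circuit: a two-parameter "parser" sees whether an ocean scene is really present (`x`), reports "ocean" with some probability, and is rewarded whenever it reports ocean. The function names, the REINFORCE-style update, and all constants are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_reward_driven_parser(steps=5000, lr=0.1, seed=0):
    """Train a tiny perceptual 'parser' with a REINFORCE-style update.

    x is 1.0 when an ocean scene is really present and 0.0 otherwise.
    The reward depends only on what the parser reports, not on the truth.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x = 1.0 if rng.random() < 0.5 else 0.0
        p = sigmoid(w * x + b)                 # P(report "ocean" | x)
        says_ocean = 1.0 if rng.random() < p else 0.0
        reward = says_ocean                    # rewarded for *seeing* ocean
        grad_logit = says_ocean - p            # d log pi(action) / d logit
        w += lr * reward * grad_logit * x
        b += lr * reward * grad_logit
    return w, b


def p_says_ocean(w, b, x):
    return sigmoid(w * x + b)


w, b = train_reward_driven_parser()
print("P(parse as ocean | real ocean):", round(p_says_ocean(w, b, 1.0), 3))
print("P(parse as ocean | no ocean):  ", round(p_says_ocean(w, b, 0.0), 3))
```

Under these assumptions the bias `b` only ever increases, so the parser drifts towards reporting "ocean" whether or not one is there.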
Predictive learning (a.k.a. self-supervised learning), by contrast, seems perfect for the job of training sensory-processing systems. Well, really weighted predictive learning, where some prediction errors are treated as worse than others, and where top-down attention is providing the weights. (Or something. Not sure about the details here.) Anyway, predictive learning does not have a “wishful thinking” problem; plus there’s a nice information-rich supervisory signal that provides full error vectors (used for error-driven learning), which are easier to learn from than 1D rewards. (Imagine a high school teacher grading essays. If they say “That essay was bad; next time try using shorter paragraphs”, that’s like an error gradient signal, telling the student how to improve. If they just say “That essay was bad”, that’s like a reward signal, and now making progress is much harder!!) So that’s been my belief: I say reward learning has no place in the sensory-processing parts of the brain.
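The teacher analogy can be made concrete with a small sketch. One learner receives the full error vector on every step (a delta rule); the other receives only a scalar reward and has to guess which direction helps. The random hill-climbing used here is just one simple stand-in for learning from a 1D reward, and the dimensions, step counts, and function names are all assumptions chosen for illustration.

```python
import random


def squared_error(w, target):
    return sum((t - wi) ** 2 for wi, t in zip(w, target))


def learn_from_error_vector(target, steps=200, lr=0.1):
    """Delta rule: every step says how far off each component is."""
    w = [0.0] * len(target)
    for _ in range(steps):
        error = [t - wi for wi, t in zip(w, target)]
        w = [wi + lr * e for wi, e in zip(w, error)]
    return w


def learn_from_scalar_reward(target, steps=200, noise=0.1, seed=0):
    """Hill climbing: every step only says whether a random tweak scored better."""
    rng = random.Random(seed)
    w = [0.0] * len(target)
    reward = -squared_error(w, target)
    for _ in range(steps):
        trial = [wi + rng.gauss(0.0, noise) for wi in w]
        trial_reward = -squared_error(trial, target)
        if trial_reward > reward:              # keep the tweak only if it helped
            w, reward = trial, trial_reward
    return w


def make_target(dim=20, seed=1):
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


target = make_target()
print("error with full error vector:", squared_error(learn_from_error_vector(target), target))
print("error with scalar reward:    ", squared_error(learn_from_scalar_reward(target), target))
```

With the same number of steps, the error-vector learner ends up essentially on target, while the scalar-reward learner improves but remains far less accurate.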
So later on, I was excited to learn that the basal ganglia (which plays a central role in reinforcement learning) sends signals into the frontal lobe of the brain—the home of plans and motor outputs—but not to the other lobes, which are more involved in sensory processing. (Update: There's at least one exception, namely inferotemporal cortex; I guess the division between RL / not-RL does not quite line up perfectly with the division between frontal lobe / other lobes.) (Update 2: I have a lot more to say about the specific case of inferotemporal cortex here.) Anyway, that seemed to confirm my mental picture.
Then I was reading Marblestone, Wayne & Kording 2016 (let’s call it MWK), and was gobsmacked (yes, gobsmacked!) when I came across this little passage:
Reward-driven signaling is not restricted to the striatum [part of the basal ganglia], and is present even in primary visual cortex (Chubykin et al., 2013; Stănişor et al., 2013). Remarkably, the reward signaling in primary visual cortex is mediated in part by glial cells (Takata et al., 2011), rather than neurons, and involves the neurotransmitter acetylcholine (Chubykin et al., 2013; Hangya et al., 2015). On the other hand, some studies have suggested that visual cortex learns the basics of invariant object recognition in the absence of reward (Li and Dicarlo, 2012), perhaps using reinforcement only for more refined perceptual learning (Roelfsema et al., 2010).
But beyond these well-known global reward signals, we argue that the basic mechanisms of reinforcement learning may be widely re-purposed to train local networks using a variety of internally generated error signals. These internally generated signals may allow a learning system to go beyond what can be learned via standard unsupervised methods, effectively guiding or steering the system to learn specific features or computations (Ullman et al., 2012).
Could it be?? Am I just horribly confused about everything?
So I downloaded and skimmed MWK’s key citations here. To make a long story short, I’m not convinced. Here are my quick impressions of the relevant papers ...
“A unified selection signal for attention and reward in primary visual cortex”, Stănişor et al. 2013.
Start with common sense. Let’s say you just love finding pennies on the sidewalk. It’s just your favorite thing. You dream about it at night. Now, if you see a penny on the sidewalk, you’re bound to immediately notice it and pay continuing attention to it. That’s obvious, right? The general pattern is: Attention is often directed towards things that are rewarding, and the amount of reward is likely to bear at least some relation to how much attention you pay. Moreover, the strength and direction of attention influence the activity of neurons all over the cortex.
Now, the authors did an experiment on macaques, where a dot appeared and they had to look at it, and the color of the dot impacted how much reward they got when they did so. I guess the idea was that they were controlling for attention because the macaques were paying attention regardless of the dot colors—how else would they saccade to it? I don’t really buy that. I think that attention has a degree—it’s not just on or off. Let’s say I love finding pennies on the sidewalk, and I like finding nickels on the sidewalk, but my heart’s not in it. When I see either a nickel or a penny, I’ll look at it. But I’ll look a lot more intently at the penny than the nickel! For example, maybe I was singing a song in my head as I walked. If I see the nickel, maybe I’ll look at the nickel but keep singing the song. I notice the nickel, but I still have some attention left over! If I see the penny, I’ll be so excited that everything else in my head stops in its tracks, and 100% of my attention is focused on that penny.
From that perspective, everything in the paper seems to actually support my belief that reward-based learning plays no role whatsoever in the visual cortex. Reward affects the frontal lobe, and then depending on the reward, the frontal lobe directs more or less attention towards the visual cortex. That would nicely explain their finding that: “The neuronal latency of this reward value effect in V1 was similar to the latency of attentional influences. Moreover, V1 neurons with a strong value effect also exhibited a strong attention effect, which implies that relative value and top–down attention engage overlapping, if not identical, neuronal selection mechanisms.”
“A Cholinergic Mechanism for Reward Timing within Primary Visual Cortex”, Chubykin et al. 2013.
The authors took rats and coaxed them to learn that after they did a certain thing (lick a thing a certain number of times), a rewarding thing would happen to them (they’re thirsty and they get a drop of water). Right when they expected the reward, there were telltale signs of activity in their primary visual cortex (V1).
I figure, as above, that this talk about “rewards” is a red herring—the important thing is that the reward expectation coincides with some shift of the rat’s attention, which has ripple effects all over the cortex, and thus all parts of the cortex learn to expect those ripple effects.
Then the experimenters changed the number of licks required for the reward. The telltale signs of activity, predictably, shifted to the new reward time ... but not if the experimenters infused the rat brain (or more specifically the part of the visual cortex where they were recording) with a toxin that prevented acetylcholine effects. How do I account for that? Easy: acetylcholine = learning rate—see separate post. No acetylcholine, no learning. The visual cortex is still a learning algorithm, even if it’s not learning from rewards, but rather learning to expect a certain attentional shift within the brain.
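The "acetylcholine = learning rate" reading can be sketched in a few lines. This is a toy under stated assumptions, not a model of V1: `ach_level` is a hypothetical multiplier on the learning rate, and the unit simply learns when to expect an internal event. No reward term appears anywhere in the update.

```python
def update_expected_time(estimate, observed_times, ach_level, base_lr=0.2):
    """Error-driven update of an expected event time.

    ach_level scales the learning rate; 0.0 stands in for blocking
    acetylcholine locally, 1.0 for normal conditions.
    """
    for t in observed_times:
        estimate += base_lr * ach_level * (t - estimate)
    return estimate


# Phase 1: the event reliably happens at time 1.0.
before = update_expected_time(0.0, [1.0] * 50, ach_level=1.0)

# Phase 2: the event moves to time 2.0.
shifted = update_expected_time(before, [2.0] * 50, ach_level=1.0)
blocked = update_expected_time(before, [2.0] * 50, ach_level=0.0)

print("after phase 1:          ", round(before, 3))
print("phase 2, normal ACh:    ", round(shifted, 3))
print("phase 2, ACh blocked:   ", round(blocked, 3))
```

With the multiplier at zero the estimate stays at the old time, which is the qualitative pattern described above, and it arises here purely from gating an error-driven update.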
“Central Cholinergic Neurons Are Rapidly Recruited by Reinforcement Feedback”, Hangya et al. 2015.
The authors offer loads of evidence that reward and punishment cause acetylcholine to appear. But I don’t think they claim (or offer evidence) that the acetylcholine is communicating a reward. Indeed, if I’m reading it right, the acetylcholine spikes after both reward and punishment, whereas a reward signal like phasic dopamine needs to swing both positive and negative.
The evidence all seems to be consistent with my belief (see separate post) that acetylcholine controls learning rate, and that it’s biologically advantageous for the brain to use a higher learning rate when important things like reward and punishment are happening (and when you’re aroused, when a particular part of the brain is in the spotlight of top-down attention, etc.).
“Perceptual learning rules based on reinforcers and attention”, Roelfsema et al. 2010.
If I’m reading it right (a big “if”!), everything in this paper is consistent with what I wrote above and in the other post. Acetylcholine determines learning rate, and you learn better by having the capability to set different learning rates at different times and in different parts of the brain, and the presence of rewards and punishments is one signal that maybe the learning rate should be unusually high right now.
~~
Well, those are the main citations. This is a very quick-and-dirty analysis, but I’m sufficiently satisfied that I’m going to stick with my priors here: reward-based learning is not involved in sensory processing in the brain.
(Thanks Adam Marblestone for comments on a draft.)
Hey thanks for explaining this - makes sense to me and I think we are mostly in agreement. Using the proxy signal as a supervised learning target to recognize the learned target pattern in IT is a straightforward way to implement the matching, but probably not quite complete in practice. I suspect you also need to combine that with some strong priors to correctly carve out the target concept.
Consider the equivalent example of trying to train a highly accurate cat image detector given a dataset containing, say, 20% cats, combined with a crappy low-complexity proxy cat detector to provide the labels. Can you really bootstrap better discriminative models that way with non-trivial proxy label noise? I suspect that the key to making this work is using the powerful generative model of the cortex as a regularizer, so you train it to recognize images that the proxy detector labels as cats and that are also close to the generative model's data manifold. If you then reoptimize (in evolutionary time) the proxy detector to leverage that, I think it makes the problem much more tractable. The generative model allows you to make the learned model far more selective around the actual data manifold to increase robustness. In very simple, vague terms, the model would then be learning the combination of high proxy probability and low distance to the data manifold of examples from the critical training set.
Later, if you then test OoD on vague non-cats (dogs, stuffed animals) that were not encountered in training and that would confuse the simple proxy, the learned model can reject those, even though it never saw them during critical training, simply because they are far from the generative manifold, and the learned model is 'shrunk' to fit that manifold.
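A very rough sketch of that combination, with loud caveats: the "generative model" below is just a per-feature Gaussian fitted to the examples the proxy labelled as cats, which is a crude stand-in for a real data manifold. The 2-D features, the threshold proxy, the `max_dist` cutoff, and every function name are hypothetical.

```python
import math
import random
import statistics


def proxy_says_cat(x):
    """Crude innate proxy: looks at one feature only."""
    return x[0] > 0.5


def make_critical_training_set(n=200, seed=0):
    """About 20% cats near (1, 1); the rest is background near (0, 0)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        if rng.random() < 0.2:
            data.append((rng.gauss(1.0, 0.1), rng.gauss(1.0, 0.1)))
        else:
            data.append((rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)))
    return data


def fit_manifold(examples):
    """Stand-in generative model: per-feature mean and standard deviation."""
    dims = list(zip(*examples))
    means = [statistics.mean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means, stds


def distance_to_manifold(x, means, stds):
    return math.sqrt(sum(((xi - m) / s) ** 2 for xi, m, s in zip(x, means, stds)))


def learned_says_cat(x, means, stds, max_dist=4.0):
    """Learned detector: proxy fires AND the input is close to the manifold."""
    return proxy_says_cat(x) and distance_to_manifold(x, means, stds) < max_dist


data = make_critical_training_set()
means, stds = fit_manifold([x for x in data if proxy_says_cat(x)])

cat = (1.0, 1.0)    # typical cat
dog = (1.0, -3.0)   # never seen in training; fools the one-feature proxy
print("proxy on dog:  ", proxy_says_cat(dog))
print("learned on cat:", learned_says_cat(cat, means, stds))
print("learned on dog:", learned_says_cat(dog, means, stds))
```

The proxy's labels include some background false positives, yet the out-of-distribution "dog" is still rejected because it lies far from the fitted distribution of proxy-positive training examples.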
I do agree the amygdala does seem like a good fit for the location of the learned symbol circuit, although at that point it raises the question of why not also just have the proxy in the amygdala? If the amygdala has the required inputs from LGN and/or V1 it would be my guess that it could also just colocate the innate proxy circuit. (I haven't looked in the lit to see if those connections exist)
Also 6 seems required for the system to work as well in adulthood as it typically does, and yet also explain the out of distribution failures for imprinting etc. (Once the IT representation is learned you want to use that exclusively, as it should be strictly superior to the proxy circuit. This seems a little weird at first, but the)
The hope is that this same mechanism, which seems well suited for handling imprinting, also works for grounding sexual attraction (as an elaboration of imprinting) and then more complex concepts like representations of others' emotions from proxies such as facial expression, vocal tone, etc., and then combining that with empathic simulation to ground a model of others' values/utility for social game theory, altruism, etc.