A Crash Course in the Neuroscience of Human Motivation

lukeprog

[PDF of this article updated Aug. 23, 2011]

Whenever I write a new article for Less Wrong, I'm pulled in two opposite directions.

One force pulls me toward writing short, exciting posts with lots of brain candy and just one main point. Eliezer has done that kind of thing very well many times: see Making Beliefs Pay Rent, Hindsight Devalues Science, Probability is in the Mind, Taboo Your Words, Mind Projection Fallacy, Guessing the Teacher's Password, Hold Off on Proposing Solutions, Applause Lights, Dissolving the Question, and many more.

Another force pulls me toward writing long, factually dense posts that fill in as many of the pieces of a particular argument in one fell swoop as possible. This is largely because I want to write about the cutting edge of human knowledge but I keep realizing that the inferential gap is larger than I had anticipated, and I want to fill in that inferential gap quickly so I can get to the cutting edge.

For example, I had to draw on dozens of Eliezer's posts just to say I was heading toward my metaethics sequence. I've also published 21 new posts (many of them quite long and heavily researched) written specifically because I need to refer to them in my metaethics sequence.¹ I tried to make these posts interesting and useful on their own, but my primary motivation for writing them was that I need them for my metaethics sequence.

And now I've written only four posts² in my metaethics sequence and already the inferential gap to my next post in that sequence is huge again. :(

So I'd like to try an experiment. I won't do it often, but I want to try it at least once. Instead of writing 20 more short posts between now and the next post in my metaethics sequence, I'll attempt to fill in a big chunk of the inferential gap to my next metaethics post in one fell swoop by writing a long tutorial post (a la Eliezer's tutorials on Bayes' Theorem and technical explanation).³

So if you're not up for a 20-page tutorial on human motivation, this post isn't for you, but I hope you're glad I bothered to write it for the sake of others. If you are in the mood for a 20-page tutorial on human motivation, please proceed.

Who knows what I want to do? Who knows what anyone wants to do? How can you be sure about something like that? Isn’t it all a question of brain chemistry, signals going back and forth, electrical energy in the cortex? How do you know whether something is really what you want to do or just some kind of nerve impulse in the brain. Some minor little activity takes place somewhere in this unimportant place in one of the brain hemispheres and suddenly I want to go to Montana or I don’t want to go to Montana.

- Don DeLillo, White Noise

Preface

How do we value things, and choose between options? Philosophers, economists, and psychologists have long tried to answer these questions. But human behavior continues to defy our most subtle models of it, and the algorithms producing our behavior remained hidden in a black box.

But now, neuroscientists are directly measuring the neurons whose firing rates encode value and produce our choices. We know a lot more about the neuroscience of human motivation than you might think. Now we can peer directly into the black box of human motivation, and begin (dimly) to read our own source code.

The neuroscience of human motivation has implications for philosophy of mind and action, for scientific self-help, and for metaethics and Friendly AI. (We don't really know what we want, and looking directly at the algorithms that produce human wanting might help in solving this mystery.)

So, I wrote a crash course in the neuroscience of human motivation.

The purpose of this document is not to argue for any of the conclusions presented within it. That would require not a long blog post but instead a couple 500-page books — say, Foundations of Neuroeconomic Analysis and Handbook of Reward and Decision Making (my two greatest sources for this post).⁴

Instead, I merely want to summarize the current mainstream scientific picture on the neuroscience of human motivation, explain some of the concepts it uses, and tell a few stories about how our current picture of human motivation developed.

As you read this, I hope that many questions and objections will come to mind, because it's not the full story. That's why I went to the trouble of linking to PDFs of almost all my sources (see References): so you can check the full data and the full arguments yourself if you like.

This document is long. You may prefer to read it in sections.

Folk Psychology

There are these things called 'humans' on planet Earth. They undergo metabolism and cell growth. They produce waste. They maintain homeostasis. They reproduce. They move. They communicate. Sometimes they have pillow fights.

Some of these human processes are 'automatic', like cell growth and breathing. Other processes are 'intentional' or 'willed', like moving and communicating and having pillow fights. We call these latter processes intentional actions, or simply actions. Sometimes we're not sure where to draw the line between automatic processes and actions, but this should become clearer as we learn more. In the meantime, we ask...

How can we explain human actions?

One popular explanation is 'folk psychology.' Folk psychology posits that we humans have beliefs and desires, and that we are motivated to do what we believe will fulfill our desires.

I desire to eat a cookie. I believe I can fulfill that desire if I walk to the kitchen and put one of the cookies there into my mouth. So I am motivated to walk to the kitchen and put a cookie in my mouth.

Of course there are complications. For example I have multiple desires. Suppose I desire to eat a cookie and believe there are cookies in the kitchen. But I also desire to remain sitting comfortably in the living room. Can I satisfy both desires? I also believe that if I nicely ask my friend in the kitchen to bring me a cookie, she will. So I ask her to bring me a cookie and I begin to eat it, without having to leave the comfy living room sofa. We still explain my behavior with constructs like 'beliefs' and 'desires', but we consider more than one of each to do so.

Most of us use folk psychology every day to successfully predict human behavior. I believe that my friend desires to do nice things for me on occasion if they're not too much trouble, and I believe that my friend, once I tell her I want a cookie, will believe she can be nice to me without much trouble if she brings me a cookie from the kitchen. So, I predict that my friend will bring me a cookie when I ask her. So I ask her, and behold! My prediction was correct. I am happily eating a cookie on the sofa.

But folk psychology (FP) faces some problems.⁵ Consider its context in history:

The presumed domain of FP used to be much larger than it is now. In primitive cultures, the behavior of most of the elements of nature were understood in intentional terms. The wind could know anger, the moon jealousy, the river generosity… These were not metaphors… the animistic approach to nature has dominated our history, and it is only in the last two or three thousand years that we have restricted FP’s literal application to the domain of the higher animals.

[Even still,] the FP of the Greeks is essentially the FP we use today… This is a very long period of stagnation and infertility for any theory to display, especially when faced with such an enormous backlog of anomalies and mysteries in its own explanatory domain… To use Imre Lakatos’ terms, FP is a stagnant or degenerating research program, and has been for millennia.

Consider also its prospects for inter-theoretic reduction:

If we approach homo sapiens from the perspective of natural history and the physical sciences, we can tell a coherent story of its constitution, development, and behavioral capacities which encompasses particle physics, atomic and molecular theory, organic chemistry, evolutionary theory, biology, physiology, and materialistic neuroscience. The story, though still radically incomplete, is already extremely powerful, outperforming FP at many points even in its own domain. And it is deliberately… coherent with the rest of our developing world picture. In short, the greatest theoretical synthesis in [history] is currently in our hands…

But FP is no part of this growing synthesis. Its intentional categories stand magnificently alone, without visible prospect of reduction to that larger corpus. A successful reduction cannot be ruled out, in my view, but FP’s explanatory impotence and long stagnation inspire little faith that its categories will find themselves neatly reflected in the framework of neuroscience. On the contrary, one is reminded of how alchemy must have looked as elemental chemistry was taking form, how Aristotelean cosmology must have looked as classical mechanics was being articulated, or how the vitalist conception of life must have looked as organic chemistry marched forward.

Finally, consider the problem of habit. I sit at my computer and want to type my name, 'Luke.' However, I have just used a special program to switch the function of the keys labeled L and P so that they will input the other character instead (so that I can play a prank on my friend, who will be using my computer shortly). I believe that typing the key labeled L will input P instead, but nevertheless when I type my name my fingers fall into their familiar habit and I end up typing my name as 'Puke.' My act of typing was intentional, and yet I didn't do what I believed would fulfill my desire to type my name.

Folk psychology faces both successes and failures in explaining human action. Hopefully we can do better.

Neoclassical Economics

Folk psychology was updated and quantified by neoclassical economics. To summarize:

One [assumption of] neoclassical economics is "rationality," in which individuals are said to choose alternatives that maximize expected utilities. In particular, the neoclassical view is that individuals rank all possible alternatives according to how much satisfaction they will bring and then choose the alternative that [they expect] will bring the most satisfaction or utility...⁶

Let's review this notion of maximizing expected utility. Suppose I can choose one of two boxes sitting before me, red and blue. There is a 10% chance the red box contains a million dollars, and a 90% chance it contains nothing. As for the blue box, I am certain it contains $10,000. The 'expected value' of choosing the red box is (0.1 × $1,000,000) + (0.9 × $0), which is equal to $100,000. The expected value of choosing the blue box is (1 × $10,000), or $10,000. An agent that chose whatever had the highest expected value would choose the red box, which has 10 times the expected value of the blue box ($100,000 vs. $10,000).
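If you'd like to check the arithmetic yourself, here is a minimal sketch of the two-box calculation. The function name and the list-of-pairs representation are my own choices for illustration, not anything standard.

```python
def expected_value(outcomes):
    """Sum of probability * payoff over (probability, payoff) pairs."""
    return sum(p * payoff for p, payoff in outcomes)

# The two boxes from the example above, as (probability, dollars) pairs.
red_box = [(0.1, 1_000_000), (0.9, 0)]
blue_box = [(1.0, 10_000)]

ev_red = expected_value(red_box)
ev_blue = expected_value(blue_box)

# An expected-value maximizer picks whichever box has the larger number.
choice = "red" if ev_red > ev_blue else "blue"
print(f"red: ${ev_red:,.0f}  blue: ${ev_blue:,.0f}  choice: {choice}")
```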

But humans don't value things only according to their dollar value. A million dollars might have 10 times the objective value of $100,000, but it might have less than 10 times the subjective value of $100,000 because after $100,000 you only care a little how much more wealthy you are.

Or, you might be risk averse. You might prefer a sure thing to something that is uncertain. So a 10% chance of a million dollars might be worth less — in subjective value — than a 100% chance of $10,000. If you are risk averse you might choose the blue box because it has higher expected subjective value even though it has lower expected objective value.

We call objective value simply 'value'. We call subjective value 'utility.'

Neoclassical economics quantifies folk psychology by measuring the strength of belief with probability and by measuring the strength of desire with utility. It then says that humans act so as to maximize expected utility, a measure that combines the utility of a particular thing with your subjective probability of getting it.⁷
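To see how utility can reverse the choice between the two boxes, here is a small sketch. The logarithmic utility function is purely an assumption on my part (one common way to model diminishing returns on wealth); nothing above commits to that particular curve.

```python
import math

def expected_utility(outcomes, utility):
    """Sum of probability * utility(payoff) over (probability, payoff) pairs."""
    return sum(p * utility(payoff) for p, payoff in outcomes)

def linear_utility(dollars):
    # Utility equals dollar value: this recovers plain expected value.
    return float(dollars)

def log_utility(dollars):
    # Assumed concave utility: each extra dollar matters less than the last.
    return math.log1p(dollars)

red_box = [(0.1, 1_000_000), (0.9, 0)]
blue_box = [(1.0, 10_000)]

def choose(utility):
    eu_red = expected_utility(red_box, utility)
    eu_blue = expected_utility(blue_box, utility)
    return "red" if eu_red > eu_blue else "blue"

print("linear utility chooses:", choose(linear_utility))
print("log utility chooses:", choose(log_utility))
```

With the linear function the agent takes the gamble; with the concave one, the sure $10,000 has the higher expected utility, which is the risk-averse pattern described above.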

This neoclassical model of human behavior has faced many challenges, and is regularly revised in the face of new evidence.⁸ For example, Loewenstein (1987) found that if students were asked to place a value on the opportunity to kiss a celebrity of their choice 1-5 days in the future, they placed the highest value on a kiss in 3 days. This didn't fit any existing neoclassical models of utility, but it was explained when Caplin & Leahy (2001) incorporated "anticipatory feeling" into the neoclassical model: the students got some utility from anticipating the kiss with the celebrity (but also, as usual, discounted the utility of a reward the further away it was in the future), and this is why they didn't want the kiss right away.

Keep in mind that economists don't argue that we actually compute the expected utility of each option before us and then choose the best one, but that we always act "as if" we were doing that.⁹

But sometimes we don't even act "as if" we are obeying the axioms of neoclassical economics. For example, the independence axiom of expected utility theory says that if you prefer an apple over an orange, then you must prefer the Gamble A (72% chance you get an apple, otherwise you get a cat) over the Gamble B (72% chance you get an orange, otherwise you get a cat). But Allais (1953) found that subjects do violate this basic assumption under some conditions.
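Here is a quick sketch of why an expected utility maximizer can never violate the independence axiom in the apple/orange example. The utility numbers are made up for illustration; the point is that the shared "otherwise you get a cat" branch adds the same amount to both gambles, so it cannot change which one is preferred.

```python
def expected_utility(gamble, utility):
    """gamble is a list of (probability, item); utility maps item -> number."""
    return sum(p * utility[item] for p, item in gamble)

gamble_a = [(0.72, "apple"), (0.28, "cat")]
gamble_b = [(0.72, "orange"), (0.28, "cat")]

def prefers_a(cat_utility):
    # Hypothetical utilities: the agent prefers an apple to an orange.
    utility = {"apple": 3.0, "orange": 2.0, "cat": cat_utility}
    return expected_utility(gamble_a, utility) > expected_utility(gamble_b, utility)

# However much (or little) the agent likes cats, Gamble A stays preferred.
results = {cat: prefers_a(cat) for cat in (-100.0, 0.0, 2.5, 100.0)}
print(results)
```

Allais's subjects are interesting precisely because no assignment of utilities like this one can reproduce their choices.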

Such violations of the basic axioms of neoclassical economics led to the development of behavioral economics and theories like Kahneman and Tversky's (1979) prospect theory,¹⁰ which transcends some assumptions of the neoclassical model. But these new theories don't fit the data perfectly, either.¹¹

The models of human motivation we've surveyed so far are conceptually related to decision theory (beliefs and desires, or probabilities and utilities), so I'll call them 'decision-theoretic models' of human motivation. We'll discuss decision-theoretic models again when we finally get to the topic of neuroscience, but for now I want to discuss a different approach to motivation.

Behaviorism and Reinforcement Learning

While neoclassical economists formulated expected utility theory, behaviorist psychologists developed a different set of explanations for human action. Though behaviorists were wrong when they said that science can't talk about mental activity or mental states, you can charitably think of behaviorists as playing a game of Rationalist's Taboo with constructs of folk psychology like "want" or "fear" in order to get at phenomena more appropriate for quantification in technical explanation. Also, the behaviorist approach led to 'reinforcement learning', an important concept in the neuroscience of human motivation.

Before I explain reinforcement learning, let's recall operant conditioning:

Stick a pigeon in a box with a lever and some associated machinery (a "Skinner box"). The pigeon wanders around, does various things, and eventually hits the lever. Delicious sugar water squirts out. The pigeon continues wandering about and eventually hits the lever again. Another squirt of delicious sugar water. Eventually it percolates into its tiny pigeon brain that maybe pushing this lever makes sugar water squirt out. It starts pushing the lever more and more, each push continuing to convince it that yes, this is a good idea.

Consider a second, less lucky pigeon. It, too, wanders about in a box and eventually finds a lever. It pushes the lever and gets an electric shock. Eh, maybe it was a fluke. It pushes the lever again and gets another electric shock. It starts thinking "Maybe I should stop pressing that lever." The pigeon continues wandering about the box doing anything and everything other than pushing the shock lever.

The basic concept of operant conditioning is that an animal will repeat behaviors that give it reward, but avoid behaviors that give it punishment.

Behaviorism died in the wake of cognitive psychology, but its approach to motivation turned out to be very useful in the field of artificial intelligence, where it is called reinforcement learning:

Reinforcement learning is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward, but also the next situation and, through that, all subsequent rewards. These two characteristics — trial-and-error search and delayed reward — are the two most important distinguishing features of reinforcement learning.

To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploitation nor exploration can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best.¹²
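The explore/exploit trade-off described above can be sketched with a toy "slot machine" problem. This is only one simple strategy (usually called epsilon-greedy), and the payoffs, noise level, and parameter names are assumptions chosen for illustration.

```python
import random

def run_bandit(true_means, epsilon, steps, seed=0):
    """Epsilon-greedy agent on a toy bandit with noisy payoffs.

    true_means: hidden average payoff of each action (hypothetical numbers).
    epsilon: probability of trying a random action instead of the best-known one.
    """
    rng = random.Random(seed)
    n = len(true_means)
    estimates = [0.0] * n   # the agent's current guess at each action's payoff
    counts = [0] * n        # how often each action has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n)                          # explore
        else:
            action = max(range(n), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        # Move the running average for this action toward the new reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

estimates, counts = run_bandit(true_means=(0.0, 1.0, 2.0), epsilon=0.1, steps=5000)
print("estimated payoffs:", [round(e, 2) for e in estimates])
print("times each action was tried:", counts)
```

The agent is never told which action is best. It mostly repeats what has worked so far, but the occasional random try is what lets it find out that something else works better.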

In addition to the agent and its environment, there are four major components of a reinforcement learning system:

...a policy, a reward function, a value function, and, optionally, a model of the environment.

A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states...

A reward function defines the goal in a reinforcement learning problem. Roughly speaking, it maps perceived states (or state-action pairs) of the environment to a single number, a reward, indicating the intrinsic desirability of the state. A reinforcement-learning agent's sole objective is to maximize the total reward it receives in the long run. ...[A reward function may] be used as a basis for changing the policy. For example, if an action selected by the policy is followed by low reward, then the policy may be changed to select some other action in that situation in the future...

Whereas a reward function indicates what is good in an immediate sense, a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future starting from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow, and the rewards available in those states. For example, a state might always yield a low immediate reward, but still have a high value because it is regularly followed by other states that yield high rewards. Or the reverse could be true...

Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward. Nevertheless, it is values with which we are most concerned when making and evaluating decisions. Action choices are made on the basis of value judgments. We seek actions that bring about states of highest value, not highest reward, because these actions obtain for us the greatest amount of reward over the long run...

...The fourth and final element of some reinforcement learning systems is a model of the environment. This is something that mimics the behavior of the environment. For example, given a state and action, the model might predict the resultant next state and next reward. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced.

Want an example? Here is how a reinforcement learning agent would learn to play Tic-Tac-Toe:

First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state's current value, and the whole table is the learned value function. State A has higher value than state B, or is considered 'better' than state B, if the current estimate of the probability of our winning from A is higher than it is from B. Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1, because we have already won. Similarly, for all states with three Os in a row... the correct probability is 0, as we cannot win from them. We set the initial values of all the other states, the nonterminals, to 0.5, representing an informed guess that we have a 50% chance of winning.

Now we play many games against the opponent. To select our moves we examine the states that would result from each of our possible moves (one for each blank space on the board) and look up their current values in the table. Most of the time we move greedily, selecting the move that leads to the state with greatest value, that is, with the highest estimated probability of winning. Occasionally, however, we select randomly from one of the other moves instead; these are called exploratory moves because they cause us to experience states that we might otherwise never see.

A sequence of Tic-Tac-Toe moves might look like this:¹³

Solid lines are the moves our reinforcement learning agent made, and dotted lines are moves it considered but did not make. The second move was an exploratory move: it was taken even though another sibling move, that leading to e*, was ranked higher.

While playing, the agent changes the values assigned to the states it finds itself in. To improve its estimates concerning the probability of winning from various states, it 'backs up' the value of the state after each greedy move to the state before the move (as suggested by the arrows). What this means is that the value of the earlier state is adjusted to be closer to the value of the later state.

If we let s denote the state before the greedy move, and s' the state after, then the update to the estimated value of s, denoted V(s), can be written:

V(s) <- V(s) + α[V(s') - V(s)]

where α is a small positive fraction called the step-size parameter, which influences the rate of learning. The update rule is an example of a temporal difference learning method, so called because its changes are based on a difference... between [value] estimates at two different times.

...if the step-size parameter is reduced properly over time, this method converges, for any [unchanging] opponent, to the true probabilities of winning from each state given optimal play by the agent.

And that's how a simple version of temporal difference (TD) reinforcement learning works.
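The update rule V(s) <- V(s) + α[V(s') - V(s)] is short enough to write out directly. The sketch below is not a Tic-Tac-Toe player; it just applies the backup along a made-up three-state path that ends in a win, with state names and a fixed step size that I chose for illustration.

```python
def td_backup(values, state, next_state, alpha=0.1):
    """Move V(state) a fraction alpha of the way toward V(next_state)."""
    values[state] += alpha * (values[next_state] - values[state])
    return values[state]

# Hypothetical states: two non-terminal positions start at the 0.5 guess,
# and the terminal winning position is fixed at 1.0.
values = {"early": 0.5, "late": 0.5, "win": 1.0}

one_step = td_backup({"before": 0.5, "after": 1.0}, "before", "after")

# Replay the same winning path many times, backing up after each move.
for _ in range(200):
    td_backup(values, "early", "late")
    td_backup(values, "late", "win")

print("after one backup:", one_step)
print("after 200 games:", {s: round(v, 3) for s, v in values.items()})
```

Notice that the value of the win first leaks back into the state just before it, and only then into the earlier state. That step-by-step propagation is the "temporal difference" idea in miniature.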

Reinforcement Learning and Decision Theory

You may have noticed a key advantage of reinforcement learning: an agent using it can be 'dumber' than a decision-theoretic agent. It can just start with guesses ("What the hell; let's try 50%!") for the value of various states, and then it learns their true values by running through many, many trials.

But what if you don't have many trials to run through, and you need to make an important decision right now?

Then you have to be smart. You need to have a good model of the world and use decision theory to choose the action with the highest expected utility.

This is precisely what rationality — being good at building correct models of the world — is especially good for:

For some tasks, the world provides rich, inexpensive empirical feedback. In these tasks you hardly need reasoning. Just try the task many ways... and take care to notice what is and isn’t giving you results.

Thus, if you want to learn to sculpt, [studying rationality] is a bad way to go about it. Better to find some clay and a hands-on sculpting course. The situation is similar for small talk, cooking, selling, programming, and many other useful skills.

Unfortunately, most of us also have goals for which we can obtain no such ready success/failure data. For example, if you want to know whether cryonics is a good buy, you can’t just try buying it and not-buying it and see which works better. If you miss your first bet, you’re out for good.

Reinforcement learning can be a good strategy if you have time to learn from many trials. If you've only got one shot at a problem, you'd better build up a really accurate model of the world first and then try to maximize expected utility.

Now, back to our story.

It turns out that reinforcement learning seems to underlie many of our mental processes. (More on this later.)

The lesson Yvain drew from this discovery was:

Reinforcement learning is evolution writ small; behaviors propagate or die out based on their consequences to reinforcement in a mind, just as mutations propagate or die out based on their consequences to reproduction in an organism. In the behaviorist model, our mind is not an agent, but a flourishing ecosystem of behaviors both physical and mental, all scrabbling for supremacy and mutating into more effective versions of themselves.

Just as evolving organisms are adaptation-executors and not fitness-maximizers, so minds are behavior-executors and not utility-maximizers.

But things are a bit more complicated than that, as we'll now see.

The Turn to the Brain

I hesitate to say that men will ever have the means of measuring directly the feelings of the human heart. It is from the quantitative effects of the feelings that we must estimate their comparative amounts.

William Jevons (1871)

It turns out that Jevons was wrong. Modern neuroscience allows us to peer into the black box of the human value system and measure directly "the feelings of the human heart."¹⁴

We'll begin with the experiments of Wolfram Schultz. Schultz recorded the activity of single dopamine neurons in monkeys who sat in front of a water spout. At irregular intervals, a speaker played a tone and a drop of water dropped from the spout.¹⁵ The monkeys' dopamine neurons normally fired at the baseline rate, but responded with a burst of activity when water was delivered. Over time, though, the neurons responded less and less to the water and more and more to the tone.

But if Schultz delivered water without first giving the tone, then the dopamine neurons responded with a burst of activity again. And if he played the tone and didn't provide water, the neurons reduced their firing rates below the baseline. The neurons weren't responding to the water itself but to a difference between expected reward and actual reward — a reward prediction error (RPE).

Two other researchers, Read Montague and Peter Dayan, noticed that these patterns of neuronal activity were exactly predicted by TD reinforcement learning theory from computer science.¹⁶ In particular, the RPE observed in neurons appeared to play the same role in monkey learning as the difference between value estimates at two different times did in TD reinforcement learning theory.
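Here is a deliberately stripped-down sketch of how a prediction-error learner reproduces the pattern Schultz saw. It collapses the TD model to a single learned number (the reward predicted after the tone) and assumes the tone itself arrives unpredictably, so the expectation just before it is zero. The names and the learning rate are mine, not anything from the papers.

```python
def run_trial(v_tone, water_delivered, alpha=0.2):
    """One tone-then-water trial; returns updated prediction and both RPEs."""
    reward = 1.0 if water_delivered else 0.0
    # The tone is a surprise, so the "error" at the tone is whatever
    # reward the tone has come to predict.
    rpe_at_tone = v_tone - 0.0
    # At water time, the error is actual reward minus predicted reward.
    rpe_at_water = reward - v_tone
    v_tone += alpha * rpe_at_water
    return v_tone, rpe_at_tone, rpe_at_water

v_tone = 0.0
history = []
for _ in range(50):
    v_tone, rpe_tone, rpe_water = run_trial(v_tone, water_delivered=True)
    history.append((rpe_tone, rpe_water))

first_tone, first_water = history[0]
last_tone, last_water = history[-1]
# Play the tone but withhold the water after training.
_, _, omitted_water = run_trial(v_tone, water_delivered=False)

print(f"first trial: tone RPE {first_tone:.2f}, water RPE {first_water:.2f}")
print(f"last trial:  tone RPE {last_tone:.2f}, water RPE {last_water:.2f}")
print(f"water omitted after training: RPE {omitted_water:.2f}")
```

Early on the error shows up at the water; after training it shows up at the tone; and when the predicted water fails to arrive the error goes negative, which corresponds to firing below baseline.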

Since then, researchers have done many more single-neuron recording studies to test particular versions of TD reinforcement learning and revise the theory until it predicts more and more behavior while also predicting novel experimental discoveries.

Caplin & Dean¹⁷ provided another way to test the hypothesis that dopamine neurons encoded RPE in a TD-class model. They showed that all existing RPE-models could be reduced to three axiomatic statements. If a system violated one of these axioms, it could not be an RPE system. Later, Caplin et al. (2010) tested the axioms on actual brain activity to see if they held up. They did. This is another reason why so many scientists working in this field believe the current 'dopamine hypothesis' — that dopamine neurons encode RPE in a TD-class reinforcement learning system in the brain.

TD-class reinforcement learning works in computers by updating numbers that represent the values of states. How does reinforcement learning work when using nerve cells?

Hebbian Learning

By Hebbian learning, of course. "Cells that fire together, wire together."

Imagine a neural pathway (in one of Pavlov's dogs) that connects the neural circuits that sense the ringing of a bell to the neural circuits for salivation. This is a weak connection at first, which is why the bell doesn't initially elicit salivation.

Also imagine a third neuron that connects the salivation circuit to a circuit that detects food. This is a strong connection, and that's why food does elicit salivation right away:¹⁸

Donald Hebb proposed:

When an axon of cell A is near enough to excite cell B and repeatedly or persistently take part in firing it, a growth process of metabolic change takes place in one or both cells such that A's efficacy, as one of the cells firing B, is increased.¹⁹

In short, whenever two connected cells are active at the same time, the synapses connecting them are strengthened.

Consider Pavlov's experiment. At first, the Bell cell will fire whenever bells ring, but probably not when the salivation cells happen to be active. So, the connection between the Bell cell and the Salivation cell remains weak. But then, Pavlov intervenes and causes the Bell cell and the Salivation cell to fire at the same time by ringing the bell and presenting food at the same time (the Food detector cell already has a strong connection to the Salivation cell). Whenever the Bell cell and the Salivation cell happen to fire at the same time, the synapse between them is strengthened. Once the connection is strong enough, the Bell cell can cause the Salivation cell to fire on its own, just like the Food detector cell can.
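The Pavlov story can be played out with three toy "cells" and two connection weights. Everything numeric here (the threshold, the starting weights, the learning rate) is an assumption made up for the sketch; real neurons are not this tidy.

```python
THRESHOLD = 1.0  # assumed input needed to make the Salivation cell fire

def salivation_fires(bell, food, w_bell, w_food):
    """Salivation cell fires if its weighted input reaches the threshold."""
    return bell * w_bell + food * w_food >= THRESHOLD

def hebbian_step(weight, pre, post, rate=0.2):
    """Strengthen the synapse only when both cells are active together."""
    return weight + rate * pre * post

w_food = 1.0   # strong connection: food alone triggers salivation
w_bell = 0.1   # weak connection: bell alone does not

before = salivation_fires(bell=1, food=0, w_bell=w_bell, w_food=w_food)

# Ringing the bell alone changes nothing, because the Salivation cell is silent.
post = int(salivation_fires(1, 0, w_bell, w_food))
w_bell_bell_only = hebbian_step(w_bell, pre=1, post=post)

# Pavlov pairs bell and food: food makes the Salivation cell fire while the
# Bell cell is also active, so the Bell synapse is strengthened each time.
for _ in range(10):
    post = int(salivation_fires(1, 1, w_bell, w_food))
    w_bell = hebbian_step(w_bell, pre=1, post=post)

after = salivation_fires(bell=1, food=0, w_bell=w_bell, w_food=w_food)
print("bell alone before pairing:", before)
print("bell alone after pairing: ", after)
```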

It was a fine theory, but it wasn't confirmed experimentally until Bliss & Lomo (1973) observed Hebb's mechanism at work in the rabbit hippocampus. Today, we know how some forms of Hebb's mechanism work at the molecular level.²⁰

Later, Wickens (1993) proposed a similar mechanism called the three-factor rule, according to which some synapses are strengthened whenever presynaptic and postsynaptic activity occurred in the presence of dopamine. These same synapses might be weakened when activity occurred in the absence of dopamine. Later studies confirmed this hypothesis.²¹
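As a rough sketch, the three-factor rule just adds a dopamine "gate" to Hebb's rule. The version below is my own simplification: it treats dopamine as simply present or absent and uses the same step size for strengthening and weakening, neither of which is claimed by the sources.

```python
def three_factor_step(weight, pre, post, dopamine, rate=0.1):
    """Update one synapse given presynaptic activity, postsynaptic
    activity, and whether dopamine is present (the third factor)."""
    if not (pre and post):
        return weight                 # no joint activity: nothing to update
    if dopamine:
        return weight + rate          # joint activity plus dopamine: strengthen
    return max(0.0, weight - rate)    # joint activity, no dopamine: weaken

w = 0.5
strengthened = three_factor_step(w, pre=1, post=1, dopamine=True)
weakened = three_factor_step(w, pre=1, post=1, dopamine=False)
unchanged = three_factor_step(w, pre=1, post=0, dopamine=True)
print(strengthened, weakened, unchanged)
```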

Suppose a monkey receives an unexpected reward and encodes a large positive RPE. Glimcher explains:

The TD model tells us that under these conditions we want to increment the value attributed to all actions or sensations that have just occurred. Under these conditions, we know that the dopamine neurons release dopamine throughout the frontocortical-basal ganglia loops, and do so in a highly homogenous manner. That means we can think of any neuron equipped with dopamine receptors as 'primed' for synaptic strengthening. When this happens, any segment of the frontocortical-basal ganglia loop that is already active will have its synapses strengthened.²²

We will return to the dopamine system later, but for now let us back up and pursue the neoclassical economic path into the brain.

Expected Utility in Neurons

Ever since Friedman (1953), economists have insisted that humans only behave as if they are utility maximizers, not that they actually compute expected utility and try to maximize it.

It was a surprise, then, when neuroscientists stumbled upon the neurons that were encoding expected utility in their firing rates.

Tanji & Evarts (1976) did their experiments with rhesus monkeys because they are our closest relatives besides the apes, and this kind of work is usually forbidden on apes for ethical reasons (we need to implant a recording electrode in the brain).

The monkeys were trained to know that a colored light on the screen meant they would soon be offered a reward (a drop of water) either for pushing or pulling, but not for both. This was the ‘ready’ cue. A second later, researchers gave a ‘direction’ cue that told the monkeys which action — push or pull — was going to be rewarded. The third cue was the 'go' signal: if the monkey made the previously indicated movement, it was rewarded.

This is what they saw:

At the ‘ready’ cue, the neurons associated with a pushing motion became weakly active (but fired above the baseline rate), and so did the neurons associated with a pulling motion. When the ‘direction’ cue was given, the neurons associated with the to-be-rewarded motion doubled their firing rate, and the neurons associated with the opposite motion fell back to the baseline rate. Then at the ‘go’ cue, the neurons associated with the to-be-rewarded movement increased again rapidly, up past the threshold required to produce movement, and the movement was produced shortly thereafter.

One tempting explanation of the data is that after the ‘ready’ cue, the monkey’s brain 'decides' there’s a 50% chance that pulling will get the reward, and a 50% chance that pushing will get the reward. That’s why we see the neuron firing rates associated with those two actions each jump to slightly less than 50% of the movement threshold when the ‘ready’ cue is given. But then, when the ‘direction’ cue is given, those expectations shift to 100%/0% or 0%/100%, depending on which action is about to be rewarded according to the ‘direction’ cue. That’s why activity in the circuit associated with the to-be-rewarded action doubles and the other one drops to baseline. And then the ‘go’ cue is delivered and firing rates blast past the movement threshold, and movement is produced.

Let's jump ahead to Basso & Wurtz (1997), who did a similar experiment except that they used voluntary eye movements (called ‘saccades’) instead of voluntary arm movements. And this time, they presented each monkey with one, two, four, or eight possible targets, instead of just two targets (push and pull) like Tanji & Evarts did.

What they found was that as more potential targets were presented, the magnitude of the preparatory activity associated with each target systematically decreased. And again, once the ‘direction’ and ‘go’ cues were presented, the activity associated with those other potential targets dropped rapidly and activity burst rapidly in neurons associated with the to-be-rewarded movement. It was as though the monkeys’ brains were distributing their probability mass evenly across the potentially rewarded actions, and then once they knew which action should in fact be rewarded, they moved all their probability mass to that action and performed the action and got the reward.
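One way to picture this 'probability mass' reading of the data is the following minimal Python sketch. It assumes, purely for illustration, that above-baseline preparatory activity is proportional to the probability that a target will be rewarded. The function name and the `gain` value are hypothetical, not measurements from either study.

```python
def preparatory_activity(n_targets, cued_target=None, gain=90.0):
    """Return above-baseline activity (arbitrary units) for each target.

    Assumption: activity = gain * P(target will be rewarded).
    Before the 'direction' cue, probability is spread evenly.
    After it, all probability sits on the cued target.
    """
    if cued_target is None:
        probabilities = [1.0 / n_targets] * n_targets
    else:
        probabilities = [1.0 if i == cued_target else 0.0
                         for i in range(n_targets)]
    return [gain * p for p in probabilities]


for n in (1, 2, 4, 8):
    print(n, "targets:", preparatory_activity(n)[0], "per target")

print("after cue:", preparatory_activity(4, cued_target=1))
```

With two targets each one gets half the gain, and the per-target activity keeps shrinking as targets are added, which mirrors the pattern Basso & Wurtz reported.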

Real-Time Expected Utility Updates

Other researchers showed monkeys a black screen with flickering white dots on it. In each frame of the video, the computer moved each dot in a random direction. The independent variable was a measure called 'coherence.' In a 100% leftward coherence condition, all dots moved to the left. In a 60% rightward condition, 60% of the dots moved rightward while the rest moved randomly. And so on.

In a typical experiment, the researchers would identify a neuron in a monkey's brain that increased its firing rate in response to rightward coherence of the dots, and decreased its firing rate in response to leftward coherence of the dots. Then they would present the monkey with a sequence (in random order) of every possible leftward and rightward coherence condition.

A leftward coherence (of any magnitude) meant the monkey would be rewarded for leftward eye movement, and a rightward coherence meant the monkey would be rewarded for rightward eye movement. But the monkey had to wait two seconds before being rewarded.

In this experiment, the probabilities always started at 50% but then updated continuously. A 100% rightward coherence condition allowed the monkey to very quickly know which voluntary eye movement would be rewarded, but in a 5% rightward coherence condition the expected utility of the rightward target grew more slowly.

The results? The greater the coherence of rightward motion of the dots, the faster the neurons associated with rightward eye movement increased their firing rate. (A higher coherence meant the monkey was able to update its probabilities more quickly.)
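The idea that higher coherence lets the monkey update faster can be sketched as sequential Bayesian updating. The Python below is a toy model, not the analysis used in these studies. It assumes each observed dot moves either left or right, that a dot moves right with probability 0.5 + coherence / 2 under rightward motion, and that the observer knows the coherence magnitude. All names are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def posterior_rightward(coherence, n_dots, seed=0):
    """Posterior P(motion is rightward) after observing n_dots dots.

    The true motion is rightward. The prior is 50/50 (log-odds of zero).
    """
    rng = random.Random(seed)
    p_right = min(0.5 + coherence / 2.0, 0.999)  # clip so log-odds stay finite
    step = math.log(p_right / (1.0 - p_right))   # evidence carried by one dot
    log_odds = 0.0
    for _ in range(n_dots):
        moved_right = rng.random() < p_right
        log_odds += step if moved_right else -step
    return sigmoid(log_odds)


for coherence in (0.05, 0.6, 1.0):
    print(coherence, round(posterior_rightward(coherence, n_dots=5), 3))
```

Each dot in a high-coherence display carries a lot of evidence, so the posterior leaves 50% almost immediately. At 5% coherence each dot barely moves it.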

Argmax and Reservation Price

Many studies show that the brain controls movement by way of a 'winner take all' mechanism that is isomorphic to the argmax operation from economics.²³ That is, there are many possibilities competing for your final choice, but just before your choice the single strongest signal remains after all the others are inhibited.
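Here is a minimal Python sketch of how mutual inhibition can compute an argmax. It is a toy network, not a model taken from the cited studies. The inhibition strength of 1/n is an assumption chosen so that the strongest unit always survives, and the function name is hypothetical.

```python
def winner_take_all(activities, max_steps=1000):
    """Let units inhibit each other until at most one stays active.

    Assumes a non-empty list of non-negative activities. Exact ties are
    never resolved, so the loop is capped at max_steps.
    """
    n = len(activities)
    inhibition = 1.0 / n  # weak enough that the leading unit never hits zero
    current = list(activities)
    for _ in range(max_steps):
        if sum(1 for a in current if a > 0) <= 1:
            break
        total = sum(current)
        # Each unit is suppressed in proportion to the activity of the others.
        current = [max(0.0, a - inhibition * (total - a)) for a in current]
    return current


signals = [0.5, 0.6, 0.4]
final = winner_take_all(signals)
winner = max(range(len(final)), key=final.__getitem__)
print(final, "winner:", winner)
```

The only unit left active is the one that started with the strongest signal, so the network's answer matches `max(range(len(signals)), key=signals.__getitem__)`.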

This choice mechanism was investigated in more detail by Michael Shadlen and others.²⁴ Shadlen gave monkeys the same eye movement task as above, except that the monkeys could make their choice at any time instead of waiting for two seconds. He found that:

When the direction of the dots is unambiguous, monkeys make their choices quickly.
As the direction of the dots becomes more ambiguous, they take longer to make their choices.
Throughout the experiment, the firing rates of neurons in the LIP (part of the 'final common path' for generating eye movement) grew toward a specific threshold level.

The threshold level acts as a kind of criterion of choice. Once the criterion is met, action is taken. Or in economic terms, the monkeys seemed to set a reservation price on making certain movements.²⁵
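The threshold idea can be sketched as a simple accumulator in Python. This is a one-sided simplification of the models in this literature, which typically track evidence for both options. The drift-equals-coherence rule, the threshold of 30, and the function name are illustrative assumptions.

```python
import random


def steps_to_threshold(coherence, threshold=30.0, noise=1.0, seed=0,
                       max_steps=10_000):
    """Count time steps until accumulated activity reaches the threshold.

    Assumption: each step adds the coherence (as drift) plus Gaussian noise.
    """
    rng = random.Random(seed)
    activity = 0.0
    for step in range(1, max_steps + 1):
        activity += coherence + rng.gauss(0.0, noise)
        if activity >= threshold:
            return step  # criterion met: the movement is triggered
    return max_steps


# Without noise, halving the coherence doubles the time to decide.
print(steps_to_threshold(1.0, noise=0.0))
print(steps_to_threshold(0.5, noise=0.0))
```

Raising `threshold` plays the role of a higher reservation price: the chooser demands more evidence before acting and therefore takes longer.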

Random Utility

When deciding between goods of different expected utilities, humans exhibit a stochastic transfer function:

Consider a human subject choosing between two objects of highly different expected utilities, such as a first lottery with a 50% chance of winning $5 and a second lottery with a 25% chance of winning $5. We observe highly deterministic behavior under these conditions: basically all subjects always choose the 50% chance of winning $5. But what happens when we increment the value of the 25% lottery? As the amount one stands to win from that lottery is incremented, individual subjects eventually switch their preference. Exactly when they make that switch depends on their idiosyncratic degree of risk aversion. What is most interesting about this behavior for these purposes, though, is that actual human subjects, when presented with this kind of choice repeatedly, are never completely deterministic. As the value of the 25% lottery increases, they begin to show probabilistic behavior — selecting the 25% lottery sometimes, but not always.²⁶

Our behavior has an element of randomness in it. Daniel McFadden won a Nobel Prize in economics for capturing such behavior using a random utility model.²⁷ The way he did it is to suppose that when a chooser asks himself what a thing is worth, he doesn't get a fixed answer but a variable one. That is, there is actual variation in his preferences. Thus, his expected utility for a particular lottery is drawn from a distribution of possible utilities, usually assumed to be Gaussian.²⁸

This behavior makes sense when we think about the human choice mechanism at the neuronal level, because neuron firing rates are stochastic.²⁹ When a neurobiologist says "The neuron was firing at 200 Hz," what she means is that the mean firing rate of the neuron over a long time and stable conditions would have been close to 200 Hz. So the neurons that encode utility (wherever they are) will exhibit stochasticity, and thereby introduce some randomness into our choices. In this way, neurobiological data constrains our economic models of human behavior. An economic model without some randomness in it will have difficulty capturing human choices for as long as humans run on neurons.³⁰
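A random utility model is easy to simulate. The Python sketch below assumes each option's utility on a given trial is its mean utility plus independent Gaussian noise, and that the chooser picks whichever draw is larger. The means, the noise level, and the function name are hypothetical.

```python
import random


def choice_frequency(mean_a, mean_b, noise_sd=1.0, trials=10_000, seed=0):
    """Fraction of trials on which option A is chosen over option B."""
    rng = random.Random(seed)
    picks_a = 0
    for _ in range(trials):
        utility_a = rng.gauss(mean_a, noise_sd)  # noisy utility for A
        utility_b = rng.gauss(mean_b, noise_sd)  # noisy utility for B
        if utility_a > utility_b:
            picks_a += 1
    return picks_a / trials


print(choice_frequency(10.0, 0.0))  # far apart: essentially deterministic
print(choice_frequency(1.5, 1.0))   # close: A is preferred but not always chosen
print(choice_frequency(1.0, 1.0))   # equal: about half and half
```

As the two mean utilities approach each other, choice shifts smoothly from near-deterministic to a coin flip, which is the stochastic transfer function described in the quoted passage.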

Discounting

Louie & Glimcher (2010) examined temporal discounting in the brain. The two monkeys in this study were repeatedly asked to choose between a small, immediately available reward and a larger reward available after a small delay. For example, on one day they were asked to choose between 0.13 milliliters of juice right now, or else 0.2 milliliters of juice available after a delay of 2, 4, 8, or 12 seconds. A monkey might be willing to wait 2, 4, or 8 seconds for the larger reward, but not 12 seconds.

After many, many measurements of this kind, Louie and Glimcher were able to describe the discounting function being used by each monkey. (One of them was more impatient than the other.)

Moreover, the neurons in the relevant section of the brain fired at rates that reflected each monkey’s discounting function. If 0.2 milliliters of juice was offered with no delay, the neurons were highly active. If the same reward was offered at a delay of 2 seconds, they were slightly less active. If the same reward was offered after 4 seconds, the neurons were less active still. And so on. As it turned out, the discounting function that captured their choices was identical to the discounting function that captured the firing rates of these neurons.

This shouldn't be a surprise at this point, but just to confirm: Yes, we can observe discounting in the firing rates of neurons involved in the choice-making process.
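To make 'discounting function' concrete, here is a small Python sketch using a hyperbolic form, value = amount / (1 + k × delay). The text above does not say which functional form the monkeys followed, so the hyperbolic shape is an assumption. The impatience parameter `k = 0.055` is a made-up value chosen only so that the sketch reproduces the example of waiting 2, 4, or 8 seconds but not 12.

```python
def discounted_value(amount_ml, delay_s, k=0.055):
    """Subjective value of a delayed reward (hyperbolic form, assumed)."""
    return amount_ml / (1.0 + k * delay_s)


def chooses_delayed(delay_s, small_now=0.13, large_later=0.2, k=0.055):
    """True if the larger, delayed reward is worth more than the small one now."""
    return discounted_value(large_later, delay_s, k) > small_now


acceptable = [d for d in (2, 4, 8, 12) if chooses_delayed(d)]
print(acceptable)  # prints [2, 4, 8]
```

A more impatient monkey would simply have a larger `k`, which shrinks the list of delays it is willing to tolerate.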

Relative and Absolute Utility

Dorris & Glimcher (2004) observed monkeys and their choice mechanism neurons while the monkeys engaged in repeated plays of the inspection game. The study is too involved for me to explain here, but the results suggested that choice mechanism neurons encode relative expected utilities (relative to other actions under consideration) rather than absolute expected utilities.

Tobler et al. (2005) suggested that the brain only encodes relative expected utilities. But there is reason to suspect this can't be right. If we stored only relative expected utilities, then we would routinely violate the axiom of transitivity (if you prefer A to B and B to C, you can't also prefer C to A). To see why this is the case, consider Glimcher's example (he says 'expected value' instead of 'utility'):

...consider a subject trained to choose between objects A and B, where A is $1,000,000 worth of goods and B is $100,000 worth of goods... A system that represented only the relative expected subjective value of A and B would represent SV(A) > SV(B). Next, consider training the same subject to choose between C and D, where C is $1,000 worth of goods and D is $100 worth of goods. Such a system would represent SV(C) > SV(D). What happens when we ask a chooser to select between B and C? For a chooser who represents only relative expected subjective value, the choice should be C: she should pick $1,000 worth of goods over $100,000 worth of goods because it has a higher learned relative expected subjective value. In order for our chooser to... construct transitive preferences across choice sets (and to obey the continuity axiom)... it is required that somewhere in the brain she represent the absolute subjective values of her choices.³¹

And we mostly do seem to obey the axiom of transitivity.

So if the choice mechanism neurons do represent relative utilities, then some other neurons elsewhere must encode a more absolute form of utility. Other implications of this are explored in the next section.
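Glimcher's example can be replayed in a few lines of Python. The sketch assumes, for illustration only, that 'relative' value means an option's share of the total value in the choice set where it was learned. The dictionary and function names are hypothetical.

```python
def relative_values(choice_set):
    """Value of each option as a share of its own choice set (assumed form)."""
    total = sum(choice_set.values())
    return {name: value / total for name, value in choice_set.items()}


absolute = {"A": 1_000_000, "B": 100_000, "C": 1_000, "D": 100}

# The chooser learns A vs. B in one context and C vs. D in another.
stored = {}
stored.update(relative_values({"A": absolute["A"], "B": absolute["B"]}))
stored.update(relative_values({"C": absolute["C"], "D": absolute["D"]}))

pick_relative = max(("B", "C"), key=stored.get)    # uses stored relative values
pick_absolute = max(("B", "C"), key=absolute.get)  # uses absolute values
print(pick_relative, pick_absolute)
```

The relative-only chooser picks C ($1,000) over B ($100,000), because C was the better option in its own training context. Only the chooser with access to absolute values gets the comparison right.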

Normalization

David Heeger showed³² that the firing rates of 'feature detector' neurons in the visual cortex captured a response to a feature in the visual field divided by the sum of the activation rates of nearby neurons sensitive to the same image. Thus, these neurons encode not only whether they 'see' the feature they are built to detect, but also how unique it is in the visual field.

The effect of this is that neurons reacting to the edge of a visual object fire more actively than others do. Behold! Edge detection!

It's also an efficient way to encode information about the world. Consider a world where orange dots are ubiquitous. For an animal in that world, it would be wasteful to fire action potentials to represent orange dots. Better to represent the absence of orange dots, or the transition from orange dots to something else. An optimally efficient encoding method would be sensitive not to the 'alphabet' of all possible inputs, but to a smaller alphabet of the inputs that actually appear in the world. This insight was mathematically formalized by Schwartz & Simoncelli (2001).

The efficiency of this normalization technique may explain why we've discovered it at work in so many different places in the brain.³³ And given that we've found it almost everywhere we've looked for it, it wouldn't be a surprise to see it show up in our choice-making circuits. Indeed, Schwartz & Simoncelli's normalization equation may be what our brains use to encode expected utilities that are relative to the other choices under consideration.

One implication of their equation is that a chooser's errors become more frequent as the size of the choice set grows. Thus, behavioral errors on small choice sets should be rarer than might be predicted by most random utility models, but error rates will increase rapidly with choice set size (and beyond a certain choice set size, choices will appear random).
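The link between normalization and choice-set size can be illustrated with a simplified divisive normalization rule: each option's signal is its value divided by a constant plus the summed value of all options. This is a stand-in for the published equations, not a transcription of them, and the values, noise level, and function names in the Python sketch below are hypothetical.

```python
import random


def normalized(values, sigma=1.0):
    """Divide each value by sigma plus the summed value of all options."""
    total = sigma + sum(values)
    return [v / total for v in values]


def error_rate(n_options, noise_sd=0.05, trials=5000, seed=0):
    """Fraction of trials on which the best option is NOT chosen.

    Option 0 is worth 10, every other option is worth 8, and fixed-size
    Gaussian noise is added to the normalized signals before the argmax.
    """
    rng = random.Random(seed)
    signals = normalized([10.0] + [8.0] * (n_options - 1))
    errors = 0
    for _ in range(trials):
        noisy = [s + rng.gauss(0.0, noise_sd) for s in signals]
        if max(range(n_options), key=noisy.__getitem__) != 0:
            errors += 1
    return errors / trials


for n in (2, 8, 24):
    print(n, "options:", error_rate(n))
```

Two effects combine here. Adding options shrinks every normalized signal, so the gap between the best option and the rest falls toward the noise level, and there are also simply more competitors that can win by chance.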

Preliminary evidence that choice set size affects error rates has arrived from behavioral economics. For example, consider Iyengar & Lepper's (2000) study of supermarket shoppers. They set up a table showing either 6 or 24 flavors of jams, allowing shoppers to sample as many as they wanted. Customers who saw 24 flavors had a 3% chance of buying a jar, while those who saw only 6 flavors had a 30% chance!

In another experiment, Iyengar & Lepper let subjects choose one of either 6 or 30 different chocolates. Those who chose from among only 6 options were more satisfied with their selection than those who had been presented with 30 different chocolates.

These data fit our expectation that as the choice set grows, the frequency of errors in our behavior rises and the likelihood that an option will rise above the threshold for purchase drops. When Louie & Glimcher (2010) investigated this phenomenon in monkey choice mechanism neurons, they found it at work there, too. But the process of choice-set editing is still poorly understood, and some recent studies have failed to replicate Iyengar & Lepper's results (Scheibehenne et al. 2010).

Perhaps the most surprising implication of these findings is that because of neuronal stochasticity, and because errors increase as the choice set grows, we should expect stochastic violations of the independence axiom, and that when choosers face very large choice sets they will essentially ignore the independence axiom.

This is a prediction about human behavior not made by earlier models from neoclassical economics, but it is suggested by looking at the neurons involved in human choice-making.

Are Actions Choices?

But all these data come from experiments where the choices are actions, and from our knowledge of the brain's "final common path" for producing actions. How do actions map on to choices about lovers and smartphones?

Studies by Greg Horowitz have provided some relevant data, because monkeys had to choose options identified by color rather than by action.³⁴ For example in one trial, a 'red' option might offer one reward and a 'green' option might offer a different reward. On each trial, the red and green options would appear at random places on the computer screen, and the monkey could choose a reward with a voluntary eye movement. The key here is that rewards were chosen by color and not by a (particular) action.

Horowitz found that the choice mechanism neurons showed the same pattern of activation under these conditions as was the case under action-based choice tasks.

So, it looks like the valuation circuits can store the value of a colored target, and these valuations can be mapped to the choice mechanism. But we don't know much about how this works, yet.

The Primate Choice Mechanism: A Brief Review

Thus far, we have mostly discussed the primate brain's choice mechanism. To review:

The choice circuit resides in the final common pathway for action.
It takes as its input a signal that encodes stochastic expected utility, a concept aligned to the random utility term in economic models proposed by McFadden (2005) and Gul & Pesendorfer (2006).
This input signal is represented by a normalized firing rate (with Poisson variance, like all neurons).
As the choice set size grows, so does the error rate.
Final choice is implemented by an argmax function or a reservation price mechanism. (A single circuit can achieve both modes.³⁵)

But how are probability and utility calculated such that they can be fed into the expected utility representations of the choice mechanism? I won't discuss how the brain forms probabilistic beliefs in this article,³⁶ so let us turn to the study of how utility is calculated in the brain: the question of valuation.

Marginal Utility and Reference Dependence

Consider the following story:

Imagine an animal exploring a novel environment from a nest on a day when both (1) its blood concentration is dilute (and thus its need for water is low) and (2) its blood sugar level is low (and thus its need for food is high). The animal travels west one kilometer from the nest and emerges from the undergrowth into an open clearing at the shores of a large lake. Not very thirsty, the animal bends down to sample the water and finds it... unpalatable... the next day the same animal leaves its nest in the same metabolic state and travels one kilometer to the east, where it discovers a grove of trees that yield a dry but nutritious fruit, a grove of dried apricot trees. It samples the fruit and finds it sweet and highly palatable.

What has the animal actually learned about the value of going west and the value of going east? It has had a weakly negative experience, in the psychological sense, when going west and a very positive experience when going east. Do these subjective properties of its experience influence what it has learned? Do the stored representations derived from these experiences encode the actual objective values of going west and east, or do they encode the subjective experiences? That is a critical question about what the animal has learned, because it determines what it does when it wakes up thirsty. When it wakes up thirsty it should, in a normative sense, go west towards the... lake, despite the fact that its previous visit west was a negative experience.³⁷

Economists have known this problem for a long time, and solved it with an idea called marginal utility.

In neoclassical economics, we view the animal as having two kinds of 'wealth': a sugar wealth and a water wealth (the total store of sugar and water in the animal's body at a given time). A piece of fruit or a sip of water is an increment in the animal's total sugar or water wealth. The utility of a piece of fruit or a sip of water, then, depends on its current levels of sugar and water wealth.

On day one, the animal's need for water is low and its need for sugar is high. On that day, the marginal utility of a piece of fruit is greater than the marginal utility of a sip of water. But suppose during the next week the animal has a high blood sugar level. At that time, the marginal utility of a piece of fruit is low. Thus, the marginal utility of a consumable resource depends on wealth. The wealthier the chooser, the lower the marginal utility provided by a fixed amount of gain ('diminishing marginal utility').

In neoclassical economics, the animal faced with the option of going east or west in the morning would first estimate how much the water and the fruit would change its objective wealth level, and then it would estimate how much those objective changes in wealth would change its utility. That is, it would use objective values to compute its marginal (subjective) utility. If it only had access to the subjective experiences in our story, it couldn't compute new marginal utilities when it finds itself unexpectedly thirsty.
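The neoclassical calculation can be sketched in Python. The logarithmic utility function is an assumption: any concave function would show the same diminishing marginal utility. The wealth numbers and function names are invented for illustration.

```python
import math


def utility(wealth):
    """Total utility of a wealth level (concave by assumption)."""
    return math.log(wealth)


def marginal_utility(wealth, gain):
    """Extra utility from adding a fixed gain to the current wealth."""
    return utility(wealth + gain) - utility(wealth)


sip = 1.0  # the objective size of one sip of water, in arbitrary units

thirsty = marginal_utility(wealth=1.0, gain=sip)    # low water wealth
hydrated = marginal_utility(wealth=10.0, gain=sip)  # high water wealth
print(round(thirsty, 3), round(hydrated, 3))
```

The same objective sip is worth far more to the thirsty animal. Note that the calculation needs the objective size of the sip as an input, which is exactly what an animal storing only its past subjective experiences would lack.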

The problem with this solution is that the brain does not appear to encode the objective values of stimuli, and humans behaviorally don't seem to respect the objective values of options either, as discussed here.

In response to the behavioral evidence, Kahneman & Tversky (1979) developed a reference dependent utility function to describe human behavior: prospect theory. Their suggestion was, basically:

Rather than computing marginal utilities against [objective] wealth as in [standard neoclassical economic models], utilities (not marginal utilities) could be computed directly as deviations from a baseline level of wealth, and then choices could be based on direct comparisons of these utilities rather than on comparisons of marginal utilities. Their idea was to begin with something like the chooser's status quo, how much wealth he thinks he has. Each gamble is then represented as the chance of winning or losing utilities relative to that status-quo-like reference point.³⁸

This fits with the neurobiological fact that we encode signals from external stimuli relative to reference points, and don't have access to the objective values of stimuli.
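A reference-dependent value function can be sketched as follows. The piecewise power form and the parameter values (an exponent of 0.88 and a loss-aversion coefficient of 2.25) follow the estimates commonly quoted from Tversky and Kahneman's later work on prospect theory. They are used here only as illustrative defaults, and the function name is hypothetical.

```python
def reference_dependent_value(outcome, reference, alpha=0.88,
                              loss_aversion=2.25):
    """Value of an outcome measured as a deviation from a reference point."""
    delta = outcome - reference
    if delta >= 0:
        return delta ** alpha                  # gains: concave
    return -loss_aversion * (-delta) ** alpha  # losses: steeper than gains


# The same outcome of 50 is a gain or a loss depending on the reference point.
print(reference_dependent_value(50, reference=40))
print(reference_dependent_value(50, reference=60))
```

Nothing in this function refers to total wealth. Only the distance from the reference point matters, and a loss weighs more heavily than a gain of the same size.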

The advantage of the neoclassical economic model is that it keeps a chooser's choices consistent. The advantage of the reference-dependent approach is that it better fits human behavior and human neurobiology.

Most neoclassical economists seem to ignore the problems for their theories that are presented by reference dependence in human behavior and human neurobiology, but two neoclassical economists at Berkeley, Matthew Rabin and Botond Koszegi, have begun to take reference dependence seriously. As they put it:

...while an unexpected monetary windfall in the lab may be assessed as a gain, a salary of $50,000 to an employee who expected $60,000 will not be assessed as a large gain relative to status-quo wealth, but rather as a loss relative to expectations of wealth. And in nondurable consumption — where there is no object with which the person can be endowed — a status-quo-based theory cannot capture the role of reference dependence at all: it would predict, for instance, that a person who misses a concert she expected to attend would feel no differently than somebody who never expected to see the concert.³⁹

Their reference-dependent model makes particular predictions:

[Our theory] shows that a consumer's willingness to pay a given price for shoes depends on the probability with which she expected to buy them and the price she expected to pay. On the one hand, an increase in the likelihood of buying increases a consumer's sense of loss of shoes if she does not buy, creating an "attachment effect" that increases her willingness to pay. Hence, the greater the likelihood she thought prices would be low enough to induce purchase, the greater is her willingness to buy at higher prices. On the other hand, holding the probability of getting the shoes fixed, a decrease in the price a consumer expected to pay makes paying a higher price feel like more of a loss, creating a "comparison effect" that lowers her willingness to pay the high price. Hence, the lower the prices she expected among those prices that induce purchase, the lower is her willingness to buy at higher prices.

Thus, the cost of accepting the human fact of reference-dependence is that we have to admit that humans are irrational (in the sense of 'rationality' defined by the axioms of revealed preference):

The fact that a consumer will pay more for shoes she expected to buy than for shoes she did not expect to buy, or that an animal would prefer inferior fruit it expected to eat over superior fruit it did not expect to eat, is exactly the kind of irrational behavior that we might hope the pressures of evolution would preclude. What observations tell us, however, is that these behaviors do occur. The neuroscience of sensory encoding tells us that these behaviors are an inescapable product of the fundamental structure of our brains.⁴⁰

But really, shouldn't it have been obvious all along that humans are irrational? Perhaps it is, to everyone but neoclassical economists and Aristotelians. (Okay, enough teasing...)

One thing to keep in mind is that the brain encodes information about the external world in a reference-dependent way because that method makes a more efficient use of neurons. So evolution traded away some rationality for greater efficiency in the encoding mechanism.

Valuation in the Brain

Back to dopamine. Earlier, we learned that the brain learns the values of its actions with a dopaminergic reward system that uses something like temporal difference (TD) reinforcement learning. This reward system updates the stored values for actions by generating a reward prediction error (RPE) from the difference between expected reward and experienced reward, and propagating this learning throughout relevant structures of the brain using the neurotransmitter dopamine. In particular, some synapses are strengthened whenever presynaptic and postsynaptic activity occur in the presence of dopamine, as proposed by Wickens (1993).
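As a reminder of what an RPE-driven update looks like, here is a stripped-down Python sketch. It is a single-step simplification of TD learning with no future-reward term, and the learning rate and function name are illustrative assumptions.

```python
def update_value(expected, experienced, learning_rate=0.1):
    """Return the updated stored value and the reward prediction error."""
    rpe = experienced - expected  # positive when the reward beats expectation
    return expected + learning_rate * rpe, rpe


value = 0.0
history = []
for trial in range(100):
    # The same reward of 1.0 is delivered on every trial.
    value, rpe = update_value(value, experienced=1.0)
    history.append(rpe)

print(round(value, 4), round(history[0], 4), round(history[-1], 6))
```

The first reward is a complete surprise, so the RPE is large. As the stored value approaches the actual reward, the RPE shrinks toward zero and learning stops.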

But we haven't yet discussed how utilities for actions are generated in the first place, or how they are stored (independent of the expected utilities represented during the choice process). It feels like I generally want ice cream a little bit and hot sex a lot more. Where is that information stored?

Dozens⁴¹ of fMRI studies show that two brain regions in particular are correlated with subjective value: the ventral striatum and the medial prefrontal cortex. Other studies suggest that at least five more brain regions probably also contribute to the valuation process: the orbitofrontal cortex, the dorsolateral prefrontal cortex, the amygdala, the insula, and the anterior cingulate cortex.

There are many theories about how the human brain generates and stores utilities, but these theories are far more speculative, and at a much earlier stage of development, than everything else I've presented in this tutorial, so I won't discuss them here. Instead, let us conclude with a summary of what neuroscientists know about the human brain's motivational system, and what some of the greatest open questions are.

Summary and Research Directions

Here's what we've learned:

Utilities are real numbers ranging from 0 to 1,000 that take action potentials per second as their natural units. (By 'utility' here I don't mean what's usually meant by the term, I just mean 'utility' for the purpose of predicting choice by measuring the firing rates of certain populations of neurons in the final common path of the choice circuit in the human brain.)
Mean utilities are mean firing rates of specific populations of neurons in the final common path of human choice circuits.
Mean utilities predict choice stochastically, similar to random utility models from economics.
Utilities are encoded cardinally in firing rates relative to neuronal baseline firing rates. (This is opposed to post-Pareto, ordinal notions of utility.)
The choice circuit takes as its input a firing rate that encodes relative (normalized) stochastic expected utility.
As the choice set size grows, so does the error rate.
Final choice is implemented by an argmax function or a reservation price mechanism.

Paul Glimcher lists⁴² the greatest open questions in the field as:

Where is utility stored and how does it get to the choice mechanism?
How does the brain decide when it's time to choose?
What is the neural mechanism that allows us to substitute between two goods at a certain point?
How are probabilistic beliefs represented in the brain?
Utility functions are state-dependent, so how do state and utility function interact?

Later, we'll explore the implications of our findings for metaethics. As of August 2011, if you've read this then you probably know more about how human values actually work than almost every professional metaethicist on Earth. The general lesson here is that you can often out-pace most philosophers simply by reading what today's leading scientists have to say about a given topic instead of reading what philosophers say about it.

Notes

¹ They are: Less Wrong Rationality and Mainstream Philosophy, Philosophy: A Diseased Discipline, On Being Okay with the Truth, The Neuroscience of Pleasure, The Neuroscience of Desire, How You Make Judgments: The Elephant and its Rider, Being Wrong About Your Own Subjective Experience, Intuition and Unconscious Learning, Inferring Our Desires, Wrong About Our Own Desires, Do Humans Want Things?, Not for the Sake of Pleasure Alone, Not for the Sake of Selfishness Alone, Your Evolved Intuitions, When Intuitions Are Useful, Cornell Realism, Railton's Moral Reductionism (Part 1), Railton's Moral Reductionism (Part 2), Jackson's Moral Functionalism, Moral Reductionism and Moore's Open Question Argument, and Are Deontological Moral Judgments Rationalizations?

² Heading Toward: No-Nonsense Metaethics, What is Metaethics?, Conceptual Analysis and Moral Theory, and Pluralistic Moral Reductionism.

³ I tried something similar before, with Cognitive Science in One Lesson.

⁴ Glimcher (2010) offers the best coverage of the topic in a single book. Tobler & Kobayashi (2009) offer the best coverage in a single article.

⁵ The quotes in this section are from Churchland (1981).

⁶ Allen & Ng (2004).

⁷ This perspective goes back at least as far as Arnauld (1662), who wrote:

To judge what one must do to obtain a good or avoid an evil, it is necessary to consider not only the good and the evil in itself, but also the probability that it happens or does not happen: and to view geometrically the proportion that all these things have together.

⁸ In addition to Caplin & Leahy (2001), see Kreps & Porteus' (1978, 1979) incorporation of the "utility of knowing", Loomes & Sugden's (1982) incorporation of "regret", Gul & Pesendorfer's (2001) incorporation of "the cost of self-control", and Koszegi & Rabin's (2007, 2009) incorporation of the "reference point".

⁹ Friedman (1953).

¹⁰ See a review in Fox & Poldrack (2009).

¹¹ For one difficulty with prospect theory, see Laury & Holt (2008).

¹² Sutton & Barto (2008), p. 3. All quotes from this section are from the early pages of this book.

¹³ From Sutton & Barto (2008).

¹⁴ Much of the rest of this post is basically a summary and paraphrase of Glimcher (2010).

¹⁵ Mirenowicz & Schultz (1994).

¹⁶ Schultz et al. (1997).

¹⁷ Caplin & Dean (2007).

¹⁸ From Glimcher (2010).

¹⁹ Hebb (1949).

²⁰ Malenka & Bear (2004).

²¹ Reynolds & Wickens (2002).

²² Glimcher (2010), p. 341.

²³ Edelman & Keller (1996); Van Gisbergen et al. (1987).

²⁴ Gold & Shadlen (2007); Roitman & Shadlen (2002).

²⁵ Simon (1957).

²⁶ Glimcher (2010), p. 215.

²⁷ McFadden (2000). The behavior of gradually transitioning between two choices is described by Selten (1975).

²⁸ For a probably improved random utility model, see Gul & Pesendorfer (2006).

²⁹ Dean (1983); Werner & Mountcastle (1963).

³⁰ Unless some other feature of the brain turns out to 'smooth out' the stochasticity of neurons involved in valuation and choice-making.

³¹ Glimcher (2010).

³² Heeger (1992, 1993); Carandini & Heeger (1994); Simoncelli & Heeger (1998).

³³ Carandini & Heeger (1994); Britten & Heuer (1999); Zoccolan et al. (2005); Louie & Glimcher (2010).

³⁴ Horowitz & Newsome (2001a, 2001b, 2004).

³⁵ Liu & Wang (2008).

³⁶ But, see Deneve (2009).

³⁷ Glimcher (2010), p. 281.

³⁸ Glimcher (2010), p. 283.

³⁹ This quote and the next quote are from Koszegi & Rabin (2006).

⁴⁰ Glimcher (2010), p. 292.

⁴¹ I won't list them all here. For an overview, see Glimcher (2010), ch. 14.

⁴² Glimcher (2010), ch. 17. I've paraphrased his open questions. I also excluded his 6th question: What Is the Neural Organ for Representing Money?

References

Allais (1953). Le comportement de l'homme rationel devant le risque. Critique des postulates et axiomes de l'ecole americaine. Econometrica, 21: 503-546.

Allen & Ng (2004). Economic behavior. In Spielberger (ed.), Encyclopedia of Applied Psychology, Vol. 1 (pp. 661-666). Academic Press.

Arnauld (1662). Port-Royal Logic.

Basso & Wurtz (1997). Modulation of neuronal activity in superior colliculus by changes in target probability. Journal of Neuroscience, 18: 7519-7534.

Britten & Heuer (1999). Spatial summation in the receptive fields of MT neurons. Journal of Neuroscience, 19: 5074-5084.

Caplin & Dean (2007). Axiomatic neuroeconomics.

Caplin, Dean, Glimcher, & Rutledge (2010). Measuring beliefs and rewards: a neuroeconomic approach. Quarterly Journal of Economics, 125: 3.

Caplin & Leahy (2001). Psychological expected utility theory and anticipatory feelings. Quarterly Journal of Economics, 116: 55-79.

Carandini & Heeger (1994). Summation and division by neurons in primate visual cortex. Science, 264: 1333-1336.

Churchland (1981). Eliminative materialism and the propositional attitudes. The Journal of Philosophy, 78: 67-90.

Dean (1983). Adaptation-induced alteration of the relation between response amplitude and contrast in cat striate cortical neurons. Vision Research, 23: 249-256.

Deneve (2009). Bayesian decision making in two-alternative forced choices. In Dreher & Tremblay (eds.), Handbook of Reward and Decision Making (pp. 441-458). Academic Press.

Dorris & Glimcher (2004). Activity in posterior parietal cortex is correlated with the subjective desirability of an action. Neuron, 44: 365-378.

Edelman & Keller (1996). Activity of visuomotor burst neurons in the superior colliculus accompanying express saccades. Journal of Neurophysiology, 76: 908-926.

Fox & Poldrack (2009). Prospect theory and the brain. In Glimcher, Camerer, Fehr, & Poldrack (eds.), Neuroeconomics: Decision Making and the Brain (pp. 145-173). Academic Press.

Friedman (1953). Essays in Positive Economics. University of Chicago Press.

Glimcher (2010). Foundations of Neuroeconomic Analysis. Oxford University Press.

Gold & Shadlen (2007). The neural basis of decision making. Annual Review of Neuroscience, 30: 535-574.

Gul & Pesendorfer (2001). Temptation and self-control. Econometrica, 69: 1403-1435.

Gul & Pesendorfer (2006). Random expected utility. Econometrica, 74: 121-146.

Hebb (1949). The organization of behavior. Wiley & Sons.

Heeger (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9: 181-197.

Heeger (1993). Modeling simple-cell direction selectivity with normalized, half-squared linear operators. Journal of Neurophysiology, 70: 1885-1898.

Horowitz & Newsome (2001a). Target selection for saccadic eye movements: direction selective visual responses in the superior colliculus induced by behavioral training. Journal of Neurophysiology, 86: 2527-2542.

Horowitz & Newsome (2001b). Target selection for saccadic eye movements: prelude activity in the superior colliculus during a direction discrimination task. Journal of Neurophysiology, 86: 2543-2558.

Iyengar & Lepper (2000). When choice is demotivating: Can one desire too much of a good thing? Journal of Personality and Social Psychology, 79: 995-1006.

Jevons (1871). The Theory of Political Economy. Macmillan and Co.

Kahneman & Tversky (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47: 263-291.

Koszegi & Rabin (2006). A model of reference-dependent preferences. Quarterly Journal of Economics, 121: 1133-1165.

Koszegi & Rabin (2007). Reference-dependent risk attitudes. American Economic Review, 97: 1047-1073.

Koszegi & Rabin (2009). Reference-dependent consumption plans. American Economic Review, 99: 909-936.

Kreps & Porteus (1978). Temporal resolution of uncertainty and dynamic choice theory. Econometrica, 46: 185-200.

Kreps & Porteus (1979). Dynamic choice theory and dynamic programming. Econometrica, 47: 91-100.

Laury & Holt (2008). Payoff scale effects and risk preference under real and hypothetical conditions. In Plott & Smith (eds.), Handbook of Experimental Economic Results, Vol. 1 (pp. 1047-1053). Elsevier Press.

Loewenstein (1987). Anticipation and the valuation of delayed consumption. Economic Journal, 97: 666-684.

Liu & Wang (2008). A common cortical circuit mechanism for perceptual categorical discrimination and veridical judgment. PLOS Computational Biology, 4: 1-14.

Loomes & Sugden (1982). Regret theory: An alternative theory of rational choice under uncertainty. Economic Journal, 92: 805-824.

Louie & Glimcher (2010). Separating value from choice: delay discounting activity in the lateral intraparietal area. Journal of Neuroscience, 30: 5498-5507.

Malenka & Bear (2004). LTP and LTD: an embarrassment of riches. Neuron, 44: 5-21.

Mirenowicz & Schultz (1994). Importance of unpredictability for reward responses in primate dopamine neurons. Journal of Neurophysiology, 72: 1024-1027.

Reynolds & Wickens (2002). Dopamine-dependent plasticity of corticostriatal synapses. Neural Networks, 15: 507-521.

Roitman & Shadlen (2002). Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. Journal of Neuroscience, 22: 9475-9489.

Scheibehenne, Greifeneder, & Todd (2010). Can there ever be too many options? A meta-analytic review of choice overload. Journal of Consumer Research, 37: 409-425.

Schultz, Dayan, & Montague (1997). A neural substrate of prediction and reward. Science, 275: 1593-1599.

Schwartz & Simoncelli (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4: 819-825.

Selten (1975). Reexamination of perfectness concept for equilibrium points in extensive games. International Journal of Game Theory, 4: 25-55.

Simoncelli & Heeger (1998). A model of neuronal responses in visual area MT. Vision Research, 38: 743-761.

Sutton & Barto (2008). Reinforcement Learning: An Introduction. MIT Press.

Tanji & Evarts (1976). Anticipatory activity of motor cortex neurons in relation to direction of an intended movement. Journal of Neurophysiology, 39: 1062-1068.

Tobler & Kobayashi (2009). Electrophysiological correlates of reward processing in dopamine neurons. In Dreher & Tremblay (eds.), Handbook of Reward and Decision Making (pp. 29-50). Academic Press.

Van Gisbergen, Opstal, & Tax (1987). Collicular ensemble coding of saccades based on vector summation. Neuroscience, 21: 651.

Werner & Mountcastle (1963). The variability of central neural activity in a sensory system, and its implications for central reflection of sensory events. Journal of Neurophysiology, 26: 958-977.

Wickens (1993). A Theory of the Striatum. Pergamon Press.

Zoccolan, Cox, & DiCarlo (2005). Multiple object response normalization in monkey inferotemporal cortex. Journal of Neuroscience, 25: 8150-8164.

Who knows what I want to do? Who knows what anyone wants to do? How can you be sure about something like that? Isn’t it all a question of brain chemistry, signals going back and forth, electrical energy in the cortex? How do you know whether something is really what you want to do or just some kind of nerve impulse in the brain. Some minor little activity takes place somewhere in this unimportant place in one of the brain hemispheres and suddenly I want to go to Montana or I don’t want to go to Montana.

- Don DeLillo, White Noise

Preface

So, I wrote a crash course in the neuroscience of human motivation.

This document is long. You may prefer to read it in sections.

Folk Psychology

How can we explain human actions?

One popular explanation is 'folk psychology.' Folk psychology posits that we humans have beliefs and desires, and that we are motivated to do what we believe will fulfill our desires.

But folk psychology (FP) faces some problems.⁵ Consider its context in history:

The presumed domain of FP used to be much larger than it is now. In primitive cultures, the behavior of most of the elements of nature were understood in intentional terms. The wind could know anger, the moon jealousy, the river generosity… These were not metaphors… the animistic approach to nature has dominated our history, and it is only in the last two or three thousand years that we have restricted FP’s literal application to the domain of the higher animals.

[Even still,] the FP of the Greeks is essentially the FP we use today… This is a very long period of stagnation and infertility for any theory to display, especially when faced with such an enormous backlog of anomalies and mysteries in its own explanatory domain… To use Imre Lakatos’ terms, FP is a stagnant or degenerating research program, and has been for millennia.

Consider also its prospects for inter-theoretic reduction:

If we approach homo sapiens from the perspective of natural history and the physical sciences, we can tell a coherent story of its constitution, development, and behavioral capacities which encompasses particle physics, atomic and molecular theory, organic chemistry, evolutionary theory, biology, physiology, and materialistic neuroscience. The story, though still radically incomplete, is already extremely powerful, outperforming FP at many points even in its own domain. And it is deliberately… coherent with the rest of our developing world picture. In short, the greatest theoretical synthesis in [history] is currently in our hands…

But FP is no part of this growing synthesis. Its intentional categories stand magnificently alone, without visible prospect of reduction to that larger corpus. A successful reduction cannot be ruled out, in my view, but FP’s explanatory impotence and long stagnation inspire little faith that its categories will find themselves neatly reflected in the framework of neuroscience. On the contrary, one is reminded of how alchemy must have looked as elemental chemistry was taking form, how Aristotelean cosmology must have looked as classical mechanics was being articulated, or how the vitalist conception of life must have looked as organic chemistry marched forward.

Folk psychology has had both successes and failures in explaining human action. Hopefully we can do better.

Neoclassical Economics

Folk psychology was updated and quantified by neoclassical economics. To summarize:

One [assumption of] neoclassical economics is "rationality," in which individuals are said to choose alternatives that maximize expected utilities. In particular, the neoclassical view is that individuals rank all possible alternatives according to how much satisfaction they will bring and then choose the alternative that [they expect] will bring the most satisfaction or utility...⁶

We call objective value simply 'value'. We call subjective value 'utility.'

Keep in mind that economists don't argue that we actually compute the expected utility of each option before us and then choose the best one, but that we always act "as if" we were doing that.⁹

Behaviorism and Reinforcement Learning

Before I explain reinforcement learning, let's recall operant conditioning:

Stick a pigeon in a box with a lever and some associated machinery (a "Skinner box"). The pigeon wanders around, does various things, and eventually hits the lever. Delicious sugar water squirts out. The pigeon continues wandering about and eventually hits the lever again. Another squirt of delicious sugar water. Eventually it percolates into its tiny pigeon brain that maybe pushing this lever makes sugar water squirt out. It starts pushing the lever more and more, each push continuing to convince it that yes, this is a good idea.

Consider a second, less lucky pigeon. It, too, wanders about in a box and eventually finds a lever. It pushes the lever and gets an electric shock. Eh, maybe it was a fluke. It pushes the lever again and gets another electric shock. It starts thinking "Maybe I should stop pressing that lever." The pigeon continues wandering about the box doing anything and everything other than pushing the shock lever.

The basic concept of operant conditioning is that an animal will repeat behaviors that give it reward, but avoid behaviors that give it punishment.

Behaviorism died in the wake of cognitive psychology, but its approach to motivation turned out to be very useful in the field of artificial intelligence, where it is called reinforcement learning:

Reinforcement learning is learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward, but also the next situation and, through that, all subsequent rewards. These two characteristics — trial-and-error search and delayed reward — are the two most important distinguishing features of reinforcement learning.

To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploitation nor exploration can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best.¹²
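The exploration/exploitation trade-off in the passage above can be sketched with a multi-armed bandit. This is only a minimal illustration: the function name `run_bandit`, the Gaussian reward noise, and the fixed exploration rate `epsilon` are assumptions made for the example, not part of the quoted text.

```python
import random

def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a bandit whose arm payoffs it must discover."""
    rng = random.Random(seed)
    n = len(true_means)
    estimates = [0.0] * n   # running average reward observed for each arm
    counts = [0] * n        # how many times each arm has been tried
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                            # explore
        else:
            arm = max(range(n), key=lambda i: estimates[i])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)              # assumed noisy payoff
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, counts, total / steps

estimates, counts, average_reward = run_bandit([0.2, 0.5, 1.0])
```

With `epsilon=0` the agent can lock onto whichever arm happened to pay off first; with `epsilon=1` it never uses what it has learned. A small nonzero value lets it do both.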

In addition to the agent and its environment, there are four major components of a reinforcement learning system:

...a policy, a reward function, a value function, and, optionally, a model of the environment.

A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states...

A reward function defines the goal in a reinforcement learning problem. Roughly speaking, it maps perceived states (or state-action pairs) of the environment to a single number, a reward, indicating the intrinsic desirability of the state. A reinforcement-learning agent's sole objective is to maximize the total reward it receives in the long run. ...[A reward function may] be used as a basis for changing the policy. For example, if an action selected by the policy is followed by low reward, then the policy may be changed to select some other action in that situation in the future...

Whereas a reward function indicates what is good in an immediate sense, a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future starting from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow, and the rewards available in those states. For example, a state might always yield a low immediate reward, but still have a high value because it is regularly followed by other states that yield high rewards. Or the reverse could be true...

Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward. Nevertheless, it is values with which we are most concerned when making and evaluating decisions. Action choices are made on the basis of value judgments. We seek actions that bring about states of highest value, not highest reward, because these actions obtain for us the greatest amount of reward over the long run...

...The fourth and final element of some reinforcement learning systems is a model of the environment. This is something that mimics the behavior of the environment. For example, given a state and action, the model might predict the resultant next state and next reward. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced.
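To make the reward/value distinction concrete, here is a toy sketch. The three states, their rewards, and the helper `state_value` are hypothetical, chosen only so that a low-reward state ends up with a high value. The fixed `next_state` table stands in for both the policy and the model of the environment, and the value is taken to include the reward of the starting state (a convention assumed here).

```python
# Hypothetical three-state chain: work -> payday -> rest (then the episode ends).
reward = {"work": -1.0, "payday": 10.0, "rest": 0.0}              # reward function
next_state = {"work": "payday", "payday": "rest", "rest": None}   # toy model

def state_value(state, gamma=1.0):
    """Total (optionally discounted) reward accumulated from `state` onward."""
    total, discount = 0.0, 1.0
    while state is not None:
        total += discount * reward[state]
        discount *= gamma
        state = next_state[state]
    return total

# "work" has the lowest immediate reward but a high value, because it leads to "payday".
work_value = state_value("work")
rest_value = state_value("rest")
```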

Want an example? Here is how a reinforcement learning agent would learn to play Tic-Tac-Toe:

First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state's current value, and the whole table is the learned value function. State A has higher value than state B, or is considered 'better' than state B, if the current estimate of the probability of our winning from A is higher than it is from B. Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1, because we have already won. Similarly, for all states with three Øs in a row... the correct probability is 0, as we cannot win from them. We set the initial values of all the other states, the nonterminals, to 0.5, representing an informed guess that we have a 50% chance of winning.

Now we play many games against the opponent. To select our moves we examine the states that would result from each of our possible moves (one for each blank space on the board) and look up their current values in the table. Most of the time we move greedily, selecting the move that leads to the state with greatest value, that is, with the highest estimated probability of winning. Occasionally, however, we select randomly from one of the other moves instead; these are called exploratory moves because they cause us to experience states that we might otherwise never see.
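The value table and move-selection rule described above might be set up as follows. The 9-character board string, the helper names (`winner`, `place`, `value`, `choose_move`), and the exploration rate `epsilon` are assumptions of this sketch. Boards with no three-in-a-row all start at 0.5 here.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def place(board, i, mark):
    """Return a new board string with `mark` written at square i."""
    return board[:i] + mark + board[i + 1:]

values = {}   # the learned value function, filled in lazily

def value(board):
    """Current estimate of our (X's) probability of winning from `board`."""
    if board not in values:
        w = winner(board)
        values[board] = 1.0 if w == "X" else 0.0 if w == "O" else 0.5
    return values[board]

def choose_move(board, epsilon=0.1, rng=random):
    """Usually move greedily; occasionally pick one of the other moves."""
    moves = [i for i, c in enumerate(board) if c == " "]
    greedy = max(moves, key=lambda i: value(place(board, i, "X")))
    others = [m for m in moves if m != greedy]
    if others and rng.random() < epsilon:
        return rng.choice(others), True    # exploratory move
    return greedy, False                   # greedy move

move, exploratory = choose_move("XX OO    ", epsilon=0.0)
```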

A sequence of Tic-Tac-Toe moves might look like this:¹³

If we let s denote the state before the greedy move, and s' the state after, then the update to the estimated value of s, denoted V(s), can be written:

V(s) <- V(s) + α[V(s') - V(s)]

where α is a small positive fraction called the step-size parameter, which influences the rate of learning. The update rule is an example of a temporal difference learning method, so called because its changes are based on a difference... between [value] estimates at two different times.

...if the step-size parameter is reduced properly over time, this method converges, for any [unchanging] opponent, to the true probabilities of winning from each state given optimal play by the agent.
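The update rule above translates almost directly into code. In this sketch the dictionary-based table and the state labels "s" and "win" are placeholders of mine, and `alpha` is held constant rather than reduced over time as the convergence statement requires.

```python
def td_update(values, s, s_next, alpha=0.1):
    """Move V(s) a fraction alpha of the way toward V(s')."""
    values[s] += alpha * (values[s_next] - values[s])

values = {"s": 0.5, "win": 1.0}   # a nonterminal state starts at 0.5; a won state is 1
td_update(values, "s", "win")     # one greedy move from s led to the won state
first = values["s"]               # 0.5 + 0.1 * (1.0 - 0.5)

# Repeating the same experience keeps shrinking the gap between V(s) and V(s').
for _ in range(100):
    td_update(values, "s", "win")
```

Each update closes a fixed fraction of the remaining difference, so a larger `alpha` learns faster but is more easily thrown off by a single unrepresentative game.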

And that's how a simple version of temporal difference (TD) reinforcement learning works.

Reinforcement Learning and Decision Theory

But what if you don't have many trials to run through, and you need to make an important decision right now?

Then you have to be smart. You need to have a good model of the world and use decision theory to choose the action with the highest expected utility.

This is precisely what rationality — being good at building correct models of the world — is especially good for:

For some tasks, the world provides rich, inexpensive empirical feedback. In these tasks you hardly need reasoning. Just try the task many ways... and take care to notice what is and isn’t giving you results.

Thus, if you want to learn to sculpt, [studying rationality] is a bad way to go about it. Better to find some clay and a hands-on sculpting course. The situation is similar for small talk, cooking, selling, programming, and many other useful skills.

Unfortunately, most of us also have goals for which we can obtain no such ready success/failure data. For example, if you want to know whether cryonics is a good buy, you can’t just try buying it and not-buying it and see which works better. If you miss your first bet, you’re out for good.
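The decision-theoretic approach mentioned earlier, choosing the action with the highest expected utility under a model of the world, can be sketched as follows. The actions, probabilities, and utilities are invented purely for illustration.

```python
def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs whose probabilities sum to 1."""
    return sum(p * u for p, u in outcomes)

def best_action(model):
    """model: dict mapping each action to its list of (probability, utility) pairs."""
    return max(model, key=lambda action: expected_utility(model[action]))

# Hypothetical one-shot decision with made-up numbers (rain, no rain).
model = {
    "take umbrella":  [(0.3, 8.0), (0.7, 6.0)],
    "leave umbrella": [(0.3, 0.0), (0.7, 10.0)],
}
choice = best_action(model)
```

Everything here depends on the quality of `model`: no feedback loop corrects a bad probability estimate, which is why accurate beliefs matter most when trials are scarce.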

Now, back to our story.

It turns out that reinforcement learning seems to underlie many of our mental processes. (More on this later.)

The lesson Yvain drew from this discovery was:

Reinforcement learning is evolution writ small; behaviors propagate or die out based on their consequences to reinforcement in a mind, just as mutations propagate or die out based on their consequences to reproduction in an organism. In the behaviorist model, our mind is not an agent, but a flourishing ecosystem of behaviors both physical and mental, all scrabbling for supremacy and mutating into more effective versions of themselves.

Just as evolving organisms are adaptation-executors and not fitness-maximizers, so minds are behavior-executors and not utility-maximizers.

But things are a bit more complicated than that, as we'll now see.

The Turn to the Brain

I hesitate to say that men will ever have the means of measuring directly the feelings of the human heart. It is from the quantitative effects of the feelings that we must estimate their comparative amounts.

William Jevons (1871)

It turns out that Jevons was wrong. Modern neuroscience allows us to peer into the black box of the human value system and measure directly "the feelings of the human heart."¹⁴

TD-class reinforcement learning works in computers by updating numbers that represent the values of states. How does reinforcement learning work when using nerve cells?

Hebbian Learning

By Hebbian learning, of course. "Cells that fire together, wire together."

Also imagine a third neuron that connects the salivation circuit to a circuit that detects food. This is a strong connection, and that's why food does elicit salivation right away:¹⁸

Donald Hebb proposed:

When an axon of cell A is near enough to excite cell B and repeatedly or persistently take part in firing it, a growth process of metabolic change takes place in one or both cells such that A's efficacy, as one of the cells firing B, is increased.¹⁹

In short, whenever two connected cells are active at the same time, the synapses connecting them are strengthened.
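A minimal sketch of Hebb's rule, assuming a single salivation cell with two inputs: a strong food synapse and an initially weak synapse from some neutral cue. The threshold, learning rate, and starting weights are hypothetical, and this plain form of the rule has no decay term, so weights only grow.

```python
def salivates(w_cue, w_food, cue, food, threshold=0.5):
    """The salivation cell fires when its weighted input reaches threshold."""
    return w_cue * cue + w_food * food >= threshold

def pair_cue_with_food(w_cue=0.05, w_food=1.0, trials=10, lr=0.1):
    """Present cue and food together; strengthen the cue synapse Hebbian-style."""
    for _ in range(trials):
        pre = 1.0                                               # cue cell is active
        post = 1.0 if salivates(w_cue, w_food, 1, 1) else 0.0   # food drives firing
        w_cue += lr * pre * post    # both active at once, so the synapse strengthens
    return w_cue

before = salivates(0.05, 1.0, cue=1, food=0)    # cue alone, before pairing
w_cue = pair_cue_with_food()
after = salivates(w_cue, 1.0, cue=1, food=0)    # cue alone, after pairing
```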

Suppose a monkey receives an unexpected reward and encodes a large positive RPE. Glimcher explains:

The TD model tells us that under these conditions we want to increment the value attributed to all actions or sensations that have just occurred. Under these conditions, we know that the dopamine neurons release dopamine throughout the frontocortical-basal ganglia loops, and do so in a highly homogenous manner. That means we can think of any neuron equipped with dopamine receptors as 'primed' for synaptic strengthening. When this happens, any segment of the frontocortical-basal ganglia loop that is already active will have its synapses strengthened.²²
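One simplified way to picture the dopamine-gated strengthening Glimcher describes is a "three-factor" rule, in which the weight change depends on recent synaptic activity multiplied by the reward prediction error. The function name, learning rate, and the purely multiplicative form are assumptions of this sketch, not a claim about the actual biophysics.

```python
def dopamine_gated_update(weights, active, rpe, lr=0.1):
    """Change each synapse in proportion to its recent activity times the RPE."""
    return [w + lr * rpe * a for w, a in zip(weights, active)]

weights = [0.2, 0.2, 0.2]
active = [1.0, 0.0, 1.0]    # which synapses were active just before the reward

after_surprise = dopamine_gated_update(weights, active, rpe=1.0)   # unexpected reward
after_expected = dopamine_gated_update(weights, active, rpe=0.0)   # fully predicted reward
```

Only the synapses that were already active change, and only when the reward was not predicted.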

We will return to the dopamine system later, but for now let us back up and pursue the neoclassical economic path into the brain.

Expected Utility in Neurons

Ever since Friedman (1953), economists have insisted that humans only behave as if they are utility maximizers, not that they actually compute expected utility and try to maximize it.

It was a surprise, then, when neuroscientists stumbled upon the neurons that were encoding expected utility in their firing rates.

This is what they saw:

Real-Time Expected Utility Updates

Argmax and Reservation Price

When the direction of the dots is unambiguous, monkeys make their choices quickly.
As the direction of the dots becomes more ambiguous, they take longer to make their choices.
Throughout the experiment, the firing rates of neurons in the LIP (part of the 'final common path' for generating eye movement) grew toward a specific threshold level.
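The pattern in these observations, faster choices for clearer stimuli and activity rising to a threshold, is what a noisy accumulate-to-threshold process produces. The sketch below is a generic random-walk accumulator, not the model fitted in the experiments: the two symmetric bounds, the `coherence` drift term, and all numeric values are assumptions.

```python
import random

def decision_time(coherence, threshold=30.0, noise=1.0, rng=None, max_steps=100000):
    """Accumulate noisy evidence until it crosses +threshold or -threshold."""
    if rng is None:
        rng = random.Random()
    x, t = 0.0, 0
    while abs(x) < threshold and t < max_steps:
        x += coherence + rng.gauss(0.0, noise)   # drift (signal) plus noise
        t += 1
    return t, x >= threshold                     # (steps taken, chose correctly?)

def mean_decision_time(coherence, trials=200, seed=0):
    """Average number of steps to reach threshold at a given stimulus clarity."""
    rng = random.Random(seed)
    return sum(decision_time(coherence, rng=rng)[0] for _ in range(trials)) / trials

fast = mean_decision_time(1.0)   # unambiguous dot motion
slow = mean_decision_time(0.1)   # ambiguous dot motion
```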

Random Utility

When deciding between goods of different expected utilities, humans exhibit a stochastic transfer function:

Consider a human subject choosing between two objects of highly different expected utilities, such as a first lottery with a 50% chance of winning $5 and a second lottery with a 25% chance of winning $5. We observe highly deterministic behavior under these conditions: basically all subjects always choose the 50% chance of winning $5. But what happens when we increment the value of the 25% lottery? As the amount one stands to win from that lottery is incremented, individual subjects eventually switch their preference. Exactly when they make that switch depends on their idiosyncratic degree of risk aversion. What is most interesting about this behavior for these purposes, though, is that actual human subjects, when presented with this kind of choice repeatedly, are never completely deterministic. As the value of the 25% lottery increases, they begin to show probabilistic behavior — selecting the 25% lottery sometimes, but not always.²⁶
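A stochastic transfer function of this kind can be sketched with a logistic choice rule, one common random-utility form. The power utility function, the risk-aversion exponent `rho`, and the `noise` scale are assumptions made for illustration. Only the two lotteries come from the passage above.

```python
import math

def utility(x, rho=0.8):
    """Hypothetical risk-averse utility of winning x dollars."""
    return x ** rho

def p_choose_second(prize2, prize1=5.0, p1=0.5, p2=0.25, noise=0.2):
    """Probability of picking the 25% lottery as its prize grows."""
    eu1 = p1 * utility(prize1)
    eu2 = p2 * utility(prize2)
    return 1.0 / (1.0 + math.exp(-(eu2 - eu1) / noise))

# Choice probability rises smoothly from near 0 to near 1 instead of switching abruptly.
curve = [(prize, p_choose_second(prize)) for prize in (5, 10, 15, 20, 30)]
```

Changing `rho` moves the point at which a subject switches, matching the remark that the switch point depends on risk aversion. Changing `noise` changes how gradual the transition is.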

Discounting

After many, many measurements of this kind, Louie and Glimcher were able to describe the discounting function being used by each monkey. (One of them was more impatient than the other.)

This shouldn't be a surprise at this point, but just to confirm: Yes, we can observe discounting in the firing rates of neurons involved in the choice-making process.
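As a rough picture of what a per-animal discounting function looks like, here is a sketch using a hyperbolic form. The text above does not state which functional form was found, so the hyperbola, the parameter `k`, and both example values of `k` are assumptions.

```python
def discounted_value(amount, delay, k):
    """Subjective value of `amount` received after `delay`, hyperbolic form (assumed)."""
    return amount / (1.0 + k * delay)

patient_k, impatient_k = 0.05, 0.5   # hypothetical discount rates for two animals

delays = [0, 5, 10, 20]
patient_curve = [discounted_value(10.0, d, patient_k) for d in delays]
impatient_curve = [discounted_value(10.0, d, impatient_k) for d in delays]
```

A larger `k` means the same delayed reward is worth less, which is one way to express that one animal is more impatient than the other.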

Relative and Absolute Utility

...consider a subject trained to choose between objects A and B, where A is $1,000,000 worth of goods and B is $100,000 worth of goods... A system that represented only the relative expected subjective value of A and B would represent SV(A) > SV(B). Next, consider training the same subject to choose between C and D, where C is $1,000 worth of goods and D is $100 worth of goods. Such a system would represent SV(C) > SV(D). What happens when we ask a chooser to select between B and C? For a chooser who represents only relative expected subjective value, the choice should be C: she should pick $1,000 worth of goods over $100,000 worth of goods because it has a higher learned relative expected subjective value. In order for our chooser to... construct transitive preferences across choice sets (and to obey the continuity axiom)... it is required that somewhere in the brain she represent the absolute subjective values of her choices.³¹
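Glimcher's thought experiment can be run as a tiny sketch. The "relative-only" learner below stores each option's share of the value within its own training set. That particular encoding is my assumption, chosen just to show how purely relative storage leads to the wrong choice between B and C.

```python
absolute = {"A": 1_000_000, "B": 100_000, "C": 1_000, "D": 100}   # dollar values
training_sets = [("A", "B"), ("C", "D")]

# Relative-only learner: keeps just each option's share within its training pair.
relative = {}
for x, y in training_sets:
    total = absolute[x] + absolute[y]
    relative[x] = absolute[x] / total
    relative[y] = absolute[y] / total

def choose(x, y, table):
    """Pick whichever option has the larger stored value in `table`."""
    return x if table[x] > table[y] else y

relative_choice = choose("B", "C", relative)   # B was the 'loser' in its set, C the 'winner'
absolute_choice = choose("B", "C", absolute)
```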

And we mostly do seem to obey the axiom of transitivity.

Normalization

The effect of this is that neurons reacting to the edge of a visual object fire more actively than others do. Behold! Edge detection!

This is a prediction about human behavior not made by earlier models from neoclassical economics, but it is suggested by looking at the neurons involved in human choice-making.

Are Actions Choices?

Horowitz found that the choice mechanism neurons showed the same pattern of activation under these conditions as was the case under action-based choice tasks.

So, it looks like the valuation circuits can store the value of a colored target, and these valuations can be mapped to the choice mechanism. But we don't know much about how this works, yet.

The Primate Choice Mechanism: A Brief Review

Thus far, we have mostly discussed the primate brain's choice mechanism. To review:

The choice circuit resides in the final common pathway for action.
It takes as its input a signal that encodes stochastic expected utility, a concept aligned to the random utility term in economic models proposed by McFadden (2005) and Gul & Pesendorfer (2006).
This input signal is represented by a normalized firing rate (with Poisson variance, like all neurons).
As the choice set size grows, so does the error rate.
Final choice is implemented by an argmax function or a reservation price mechanism. (A single circuit can achieve both modes.³⁵)
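The pieces in this review can be combined into a small simulation: normalized firing rates, Poisson spike-count noise, and an argmax readout. The divisive-normalization formula, the constants `max_rate` and `sigma`, and the option values are assumptions of this sketch. It shows why the error rate would be expected to grow with choice set size.

```python
import math
import random

def poisson(lam, rng):
    """Draw a Poisson-distributed spike count (Knuth's multiplication method)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def normalized_rates(values, max_rate=100.0, sigma=0.1):
    """Divisive normalization: each value is scaled by the sum over the choice set."""
    total = sigma + sum(values)
    return [max_rate * v / total for v in values]

def error_rate(set_size, trials=2000, seed=0):
    """Fraction of trials on which argmax over noisy counts misses the best option."""
    rng = random.Random(seed)
    values = [1.2] + [1.0] * (set_size - 1)   # option 0 is the best one
    rates = normalized_rates(values)
    errors = 0
    for _ in range(trials):
        counts = [poisson(r, rng) for r in rates]
        # argmax with random tie-breaking
        pick = max(range(set_size), key=lambda i: (counts[i], rng.random()))
        errors += pick != 0
    return errors / trials

small_set_errors = error_rate(2)
large_set_errors = error_rate(8)
```

Because normalization divides a fixed firing-rate budget among more options, the gap between the best option and its rivals shrinks relative to the Poisson noise as the set grows.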

Marginal Utility and Reference Dependence

Consider the following story:

Imagine an animal exploring a novel environment from a nest on a day when both (1) its blood concentration is dilute (and thus its need for water is low) and (2) its blood sugar level is low (and thus its need for food is high). The animal travels west one kilometer from the nest and emerges from the undergrowth into an open clearing at the shores of a large lake. Not very thirsty, the animal bends down to sample the water and finds it... unpalatable... the next day the same animal leaves its nest in the same metabolic state and travels one kilometer to the east, where it discovers a grove of trees that yield a dry but nutritious fruit, a grove of dried apricot trees. It samples the fruit and finds it sweet and highly palatable.

What has the animal actually learned about the value of going west and the value of going east? It has had a weakly negative experience, in the psychological sense, when going west and a very positive experience when going east. Do these subjective properties of its experience influence what it has learned? Do the stored representations derived from these experiences encode the actual objective values of going west and east, or do they encode the subjective experiences? That is a critical question about what the animal has learned, because it determines what it does when it wakes up thirsty. When it wakes up thirsty it should, in a normative sense, go west towards the... lake, despite the fact that its previous visit west was a negative experience.³⁷

Economists have known this problem for a long time, and solved it with an idea called marginal utility.

In response to the behavioral evidence, Kahneman & Tversky (1979) developed a reference dependent utility function to describe human behavior: prospect theory. Their suggestion was, basically:

Rather than computing marginal utilities against [objective] wealth as in [standard neoclassical economic models], utilities (not marginal utilities) could be computed directly as deviations from a baseline level of wealth, and then choices could be based on direct comparisons of these utilities rather than on comparisons of marginal utilities. Their idea was to begin with something like the chooser's status quo, how much wealth he thinks he has. Each gamble is then represented as the chance of winning or losing utilities relative to that status-quo-like reference point.³⁸
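A reference-dependent utility function of this sort can be sketched as below. The power-function shape and the parameter values (`alpha=0.88`, `loss_aversion=2.25`) are commonly quoted estimates from later work and should be treated as illustrative assumptions. Probability weighting, the other half of prospect theory, is left out.

```python
def prospect_value(outcome, reference, alpha=0.88, loss_aversion=2.25):
    """Utility of `outcome` measured as a gain or loss relative to `reference`."""
    x = outcome - reference
    if x >= 0:
        return x ** alpha                       # gains: concave
    return -loss_aversion * (-x) ** alpha       # losses: steeper than gains

# The same final wealth is a gain or a loss depending on the reference point.
as_gain = prospect_value(50_000, reference=40_000)
as_loss = prospect_value(50_000, reference=60_000)
```

Here an equal-sized loss weighs more than an equal-sized gain, and the sign of the result depends entirely on where the reference point sits.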

This fits with the neurobiological fact that we encode signals from external stimuli relative to reference points, and don't have access to the objective values of stimuli.

...while an unexpected monetary windfall in the lab may be assessed as a gain, a salary of $50,000 to an employee who expected $60,000 will not be assessed as a large gain relative to status-quo wealth, but rather as a loss relative to expectations of wealth. And in nondurable consumption — where there is no object with which the person can be endowed — a status-quo-based theory cannot capture the role of reference dependence at all: it would predict, for instance, that a person who misses a concert she expected to attend would feel no differently than somebody who never expected to see the concert.³⁹

Their reference-dependent model makes particular predictions:

[Our theory] shows that a consumer's willingness to pay a given price for shoes depends on the probability with which she expected to buy them and the price she expected to pay. On the one hand, an increase in the likelihood of buying increases a consumer's sense of loss of shoes if she does not buy, creating an "attachment effect" that increases her willingness to pay. Hence, the greater the likelihood she thought prices would be low enough to induce purchase, the greater is her willingness to buy at higher prices. On the other hand, holding the probability of getting the shoes fixed, a decrease in the price a consumer expected to pay makes paying a higher price feel like more of a loss, creating a "comparison effect" that lowers her willingness to pay the high price. Hence, the lower the prices she expected among those prices that induce purchase, the lower is her willingness to buy at higher prices.

Thus, the cost of accepting the human fact of reference-dependence is that we have to admit that humans are irrational (in the sense of 'rationality' defined by the axioms of revealed preference):

The fact that a consumer will pay more for shoes she expected to buy than for shoes she did not expect to buy, or that an animal would prefer inferior fruit it expected to eat over superior fruit it did not expect to eat, is exactly the kind of irrational behavior that we might hope the pressures of evolution would preclude. What observations tell us, however, is that these behaviors do occur. The neuroscience of sensory encoding tells us that these behaviors are an inescapable product of the fundamental structure of our brains.⁴⁰

But really, shouldn't it have been obvious all along that humans are irrational? Perhaps it is, to everyone but neoclassical economists and Aristotelians. (Okay, enough teasing...)

Valuation in the Brain

Summary and Research Directions

Here's what we've learned:

Utilities are real numbers ranging from 0 to 1,000 that take action potentials per second as their natural units. (By 'utility' here I don't mean what's usually meant by the term, I just mean 'utility' for the purpose of predicting choice by measuring the firing rates of certain populations of neurons in the final common path of the choice circuit in the human brain.)
Mean utilities are mean firing rates of specific populations of neurons in the final common path of human choice circuits.
Mean utilities predict choice stochastically, similar to random utility models from economics.
Utilities are encoded cardinally in firing rates relative to neuronal baseline firing rates. (This is opposed to post-Pareto, ordinal notions of utility.)
The choice circuit takes as its input a firing rate that encodes relative (normalized) stochastic expected utility.
As the choice set size grows, so does the error rate.
Final choice is implemented by an argmax function or a reservation price mechanism.
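As a rough illustration of how several of these points fit together, here is a toy simulation in Python: values are divisively normalized into mean firing rates, noise makes them stochastic, and an argmax picks the winner. Everything specific here is an assumption made for the sketch, not a figure from Glimcher (2010): the ceiling R_MAX, the constant SIGMA, the Gaussian noise with a fixed standard deviation, and the option values.

```python
import random

R_MAX = 100.0    # assumed ceiling on normalized firing rate (spikes/s)
SIGMA = 1.0      # assumed semi-saturation constant in the normalization
NOISE_SD = 10.0  # assumed trial-to-trial noise in firing rate (spikes/s)


def normalized_rates(values):
    """Divisive normalization: each value is scaled by the sum of all values."""
    total = SIGMA + sum(values)
    return [R_MAX * v / total for v in values]


def choose(values, rng):
    """Add noise to the mean firing rates, then take the argmax."""
    means = normalized_rates(values)
    # Firing rates cannot be negative, so clip the noisy draw at zero.
    noisy = [max(0.0, rng.gauss(m, NOISE_SD)) for m in means]
    return max(range(len(noisy)), key=noisy.__getitem__)


def error_rate(n_options, trials=20000, seed=0):
    """Fraction of trials on which the best option (index 0) is not chosen."""
    rng = random.Random(seed)
    values = [10.0] + [8.0] * (n_options - 1)  # one best option, rest slightly worse
    errors = sum(choose(values, rng) != 0 for _ in range(trials))
    return errors / trials


for n in (2, 4, 8):
    print(n, error_rate(n))
```

Because normalization squeezes the same range of firing rates across more options while the noise stays the same size, the best option wins less often as the choice set grows. A reservation price mechanism would replace the argmax with a threshold test; it is not modeled here.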

Paul Glimcher lists⁴² the greatest open questions in the field as:

Where is utility stored and how does it get to the choice mechanism?
How does the brain decide when it's time to choose?
What is the neural mechanism that allows us to substitute between two goods at a certain point?
How are probabilistic beliefs represented in the brain?
Utility functions are state-dependent, so how do state and utility function interact?

Notes

² Heading Toward: No-Nonsense Metaethics, What is Metaethics?, Conceptual Analysis and Moral Theory, and Pluralistic Moral Reductionism.

³ I tried something similar before, with Cognitive Science in One Lesson.

⁴ Glimcher (2010) offers the best coverage of the topic in a single book. Tobler & Kobayashi (2009) offer the best coverage in a single article.

⁵ The quotes in this section are from Churchland (1981).

⁶ Allen & Ng (2004).

⁷ This perspective goes back at least as far as Arnauld (1662), who wrote:

To judge what one must do to obtain a good or avoid an evil, it is necessary to consider not only the good and the evil in itself, but also the probability that it happens or does not happen: and to view geometrically the proportion that all these things have together.

⁹ Friedman (1953).

¹⁰ See a review in Fox & Poldrack (2009).

¹¹ For one difficulty with prospect theory, see Laury & Holt (2008).

¹² Sutton & Barto (2008), p. 3. All quotes from this section are from the early pages of this book.

¹³ From Sutton & Barto (2008).

¹⁴ Much of the rest of this post is basically a summary and paraphrase of Glimcher (2010).

¹⁵ Mirenowicz & Schultz (1994).

¹⁶ Schultz et al. (1997).

¹⁷ Caplin & Dean (2007).

¹⁸ From Glimcher (2010).

¹⁹ Hebb (1949).

²⁰ Malenka & Bear (2004).

²¹ Reynolds & Wickens (2002).

²² Glimcher (2010), p. 341.

²³ Edelman & Keller (1996); Van Gisbergen et al. (1987).

²⁴ Gold & Shadlen (2007); Roitman & Shadlen (2002).

²⁵ Simon (1957).

²⁶ Glimcher (2010), p. 215.

²⁷ McFadden (2000). The behavior of gradually transitioning between two choices is described by Selten (1975).

²⁸ For a probably improved random utility model, see Gul & Pesendorfer (2006).

²⁹ Dean (1983); Werner & Mountcastle (1963).

³⁰ Unless some other feature of the brain turns out to 'smooth out' the stochasticity of neurons involved in valuation and choice-making.

³¹ Glimcher (2010).

³² Heeger (1992, 1993); Carandini & Heeger (1994); Simoncelli & Heeger (1998).

³³ Carandini & Heeger (1994); Britten & Heuer (1999); Zoccolan et al. (2005); Louie & Glimcher (2010).

³⁴ Horowitz & Newsome (2001a, 2001b, 2004).

³⁵ Liu & Wang (2008).

³⁶ But, see Deneve (2009).

³⁷ Glimcher (2010), p. 281.

³⁸ Glimcher (2010), p. 283.

³⁹ This quote and the next quote are from Koszegi & Rabin (2006).

⁴⁰ Glimcher (2010), p. 292.

⁴¹ I won't list them all here. For an overview, see Glimcher (2010), ch. 14.

⁴² Glimcher (2010), ch. 17. I've paraphrased his open questions. I also excluded his 6th question: What Is the Neural Organ for Representing Money?

References

Allais (1953). Le comportement de l'homme rationnel devant le risque: Critique des postulats et axiomes de l'école américaine. Econometrica, 21: 503-546.

Allen & Ng (2004). Economic behavior. In Spielberger (ed.), Encyclopedia of Applied Psychology, Vol. 1 (pp. 661-666). Academic Press.

Arnauld (1662). Port-Royal Logic.

Basso & Wurtz (1997). Modulation of neuronal activity in superior colliculus by changes in target probability. Journal of Neuroscience, 18: 7519-7534.

Britten & Heuer (1999). Spatial summation in the receptive fields of MT neurons. Journal of Neuroscience, 19: 5074-5084.

Caplin & Dean (2007). Axiomatic neuroeconomics.

Caplin, Dean, Glimcher, & Rutledge (2010). Measuring beliefs and rewards: a neuroeconomic approach. Quarterly Journal of Economics, 125: 3.

Caplin & Leahy (2001). Psychological expected utility theory and anticipatory feelings. Quarterly Journal of Economics, 116: 55-79.

Carandini & Heeger (1994). Summation and division by neurons in primate visual cortex. Science, 264: 1333-1336.

Churchland (1981). Eliminative materialism and the propositional attitudes. The Journal of Philosophy, 78: 67-90.

Dean (1983). Adaptation-induced alteration of the relation between response amplitude and contrast in cat striate cortical neurons. Vision Research, 23: 249-256.

Deneve (2009). Bayesian decision making in two-alternative forced choices. In Dreher & Tremblay (eds.), Handbook of Reward and Decision Making (pp. 441-458). Academic Press.

Dorris & Glimcher (2004). Activity in posterior parietal cortex is correlated with the subjective desirability of an action. Neuron, 44: 365-378.

Edelman & Keller (1996). Activity of visuomotor burst neurons in the superior colliculus accompanying express saccades. Journal of Neurophysiology, 76: 908-926.

Fox & Poldrack (2009). Prospect theory and the brain. In Glimcher, Camerer, Fehr, & Poldrack (eds.), Neuroeconomics: Decision Making and the Brain (pp. 145-173). Academic Press.

Friedman (1953). Essays in Positive Economics. University of Chicago Press.

Glimcher (2010). Foundations of Neuroeconomic Analysis. Oxford University Press.

Gold & Shadlen (2007). The neural basis of decision making. Annual Review of Neuroscience, 30: 535-574.

Gul & Pesendorfer (2001). Temptation and self-control. Econometrica, 69: 1403-1435.

Gul & Pesendorfer (2006). Random expected utility. Econometrica, 74: 121-146.

Hebb (1949). The Organization of Behavior. Wiley & Sons.

Heeger (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9: 181-197.

Heeger (1993). Modeling simple-cell direction selectivity with normalized, half-squared linear operators. Journal of Neurophysiology, 70: 1885-1898.

Iyengar & Lepper (2000). When choice is demotivating: Can one desire too much of a good thing? Journal of Personality and Social Psychology, 79: 995-1006.

Jevons (1871). The Theory of Political Economy. Macmillan and Co.

Kahneman & Tversky (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47: 263-291.

Koszegi & Rabin (2006). A model of reference-dependent preferences. Quarterly Journal of Economics, 121: 1133-1165.

Koszegi & Rabin (2007). Reference-dependent risk attitudes. American Economic Review, 97: 1047-1073.

Koszegi & Rabin (2009). Reference-dependent consumption plans. American Economic Review, 99: 909-936.

Kreps & Porteus (1978). Temporal resolution of uncertainty and dynamic choice theory. Econometrica, 46: 185-200.

Kreps & Porteus (1979). Dynamic choice theory and dynamic programming. Econometrica, 47: 91-100.

Loewenstein (1987). Anticipation and the valuation of delayed consumption. Economic Journal, 97: 666-684.

Liu & Wang (2008). A common cortical circuit mechanism for perceptual categorical discrimination and veridical judgment. PLOS Computational Biology, 4: 1-14.

Loomes & Sugden (1982). Regret theory: An alternative theory of rational choice under uncertainty. Economic Journal, 92: 805-824.

Louie & Glimcher (2010). Separating value from choice: delay discounting activity in the lateral intraparietal area. Journal of Neuroscience, 30: 5498-5507.

Malenka & Bear (2004). LTP and LTD: an embarrassment of riches. Neuron, 44: 5-21.

Mirenowicz & Schultz (1994). Importance of unpredictability for reward responses in primate dopamine neurons. Journal of Neurophysiology, 72: 1024-1027.

Reynolds & Wickens (2002). Dopamine-dependent plasticity of corticostriatal synapses. Neural Networks, 15: 507-521.

Roitman & Shadlen (2002). Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. Journal of Neuroscience, 22: 9475-9489.

Scheibehenne, Greifeneder, & Todd (2010). Can there ever be too many options? A meta-analytic review of choice overload. Journal of Consumer Research, 37: 409-425.

Schultz, Dayan, & Montague (1997). A neural substrate of prediction and reward. Science, 275: 1593-1599.

Schwartz & Simoncelli (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4: 819-825.

Selten (1975). Reexamination of perfectness concept for equilibrium points in extensive games. International Journal of Game Theory, 4: 25-55.

Simoncelli & Heeger (1998). A model of neuronal responses in visual area MT. Vision Research, 38: 743-761.

Sutton & Barto (2008). Reinforcement Learning: An Introduction. MIT Press.

Tanji & Evarts (1976). Anticipatory activity of motor cortex neurons in relation to direction of an intended movement. Journal of Neurophysiology, 39: 1062-1068.

Tobler & Kobayashi (2009). Electrophysiological correlates of reward processing in dopamine neurons. In Dreher & Tremblay (eds.), Handbook of Reward and Decision Making (pp. 29-50). Academic Press.

Van Gisbergen, Opstal, & Tax (1987). Collicular ensemble coding of saccades based on vector summation. Neuroscience, 21: 651.

Werner & Mountcastle (1963). The variability of central neural activity in a sensory system, and its implications for central reflection of sensory events. Journal of Neurophysiology, 26: 958-977.

Wickens (1993). A Theory of the Striatum. Pergamon Press.

Zoccolan, Cox, & DiCarlo (2005). Multiple object response normalization in monkey inferotemporal cortex. Journal of Neuroscience, 25: 8150-8164.
