Question: "What's the Relationship Between "Human Values" and the Brain's Reward System?"
This question pretty much hits the nail on the head. I think the key insight here is that the brain is not inner aligned, not even close. This shouldn’t be surprising, given how hard inner alignment seems to be, and given that evolution only cared about inner alignment when inner alignment failures impacted reproductive fitness in our ancestral environment.
We should expect that the brain has roughly as much inner alignment failure / mesa optimization as it’s possible to have while still maintaining reproductive fitness in the ancestral environment. Specifically, I think that most brain circuits are mesa optimizers whose mesa objectives include “being retained by the brain”. This includes the circuits which implement our values.
Consider that the brain slowly prunes circuits that aren’t used. Thus, any circuit that influences our actions towards ensuring we use said circuit (at least some of the time) will be retained for longer compared to circuits that don’t influence our actions like that. This implies most of the circuits we retain have something like “self preservation”. If true, I think this explains many odd features of human values.
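As a rough illustration of the retention argument, here is a toy simulation. Everything in it is an assumption made for illustration, not a claim about real neural pruning: the function name `simulate_pruning`, the idea that exactly one circuit is "used" per time step, the `bias` weight that makes a self-promoting circuit more likely to be used, and the rule that a circuit is pruned after `max_idle` consecutive unused steps.

```python
import random


def simulate_pruning(steps=5000, n_circuits=20, max_idle=100, bias=3.0, seed=0):
    """Toy model of use-dependent pruning (all parameters are hypothetical).

    Half of the circuits are "self-promoting": they tilt action selection
    so that they are `bias` times more likely to be the circuit in use.
    Any circuit left unused for `max_idle` consecutive steps is pruned.
    Returns the number of surviving circuits of each kind.
    """
    rng = random.Random(seed)
    circuits = [
        {"self_promoting": i % 2 == 0, "idle": 0, "alive": True}
        for i in range(n_circuits)
    ]

    for _ in range(steps):
        alive = [c for c in circuits if c["alive"]]
        if not alive:
            break
        # Self-promoting circuits steer behaviour toward situations that use them.
        weights = [bias if c["self_promoting"] else 1.0 for c in alive]
        used = rng.choices(alive, weights=weights, k=1)[0]

        for c in alive:
            c["idle"] += 1
        used["idle"] = 0  # being used resets the idle counter

        # Prune circuits that have gone unused for too long.
        for c in alive:
            if c["idle"] >= max_idle:
                c["alive"] = False

    survivors = [c for c in circuits if c["alive"]]
    n_self = sum(1 for c in survivors if c["self_promoting"])
    return {"self_promoting": n_self, "plain": len(survivors) - n_self}


if __name__ == "__main__":
    print(simulate_pruning())
```

Under these assumptions, the surviving population ends up dominated by the self-promoting circuits, which is the selection effect described above. The sketch says nothing about how a real circuit would come to influence behaviour in this way.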
Wireheading
It explains why we’re apprehensive towards wireheading. Our current values are essentially a collection of context-dependent strategies for achieving high reward circuit activation. If we discover another strategy for achieving far higher reward than any of our values have ever given us, why would the brain’s learning mechanism retain our values (or the circuits that implement our values)? Thus, the self-preservation instincts of our current values circuits cause us to avoid wireheading, even though wireheading would greatly increase the activation of our reward circuitry.
Essentially, our values are optimization demons with respect to the activation of our reward circuitry (as described by John Wentworth). One thing that John Wentworth emphasises about optimization demons is that they carefully regulate the degree to which the base objective is maximized. This lets demons ensure the optimization process remains in their “territory”. Wireheading would mean the activation of our reward circuits was no longer under the control of our values, so it’s no wonder our values oppose something so dangerous to themselves.
Value Diversity and Acquisition over Time
It also explains why our values are so diverse and depend so strongly on our experiences (especially childhood experiences). Even if we all had identical reward circuitry, we’d still end up with very different values, depending on which specific strategies led to reward in our particular past experiences.
(We don’t have identical reward circuitry, but our reward circuitry varies a lot less than our values.)
It also explains why childhood is the most formative time for acquiring values, and why our values change less and less easily as we age.
Consider: each of our values specialises in deciding our actions on a specific distribution of possible moral decisions. Our “don’t steal” value specialises in deciding whether to steal, not so much in whether to donate to charity. Each value wants to retain control over our actions on the specific distribution of moral decisions in which that value specialises. The more values we acquire, the more we shrink the space of “unclaimed” moral decisions.
Moral Philosophy as Conflict and Compromise Between Early and Late Values
One interesting place to look is our moral philosophy-like reasoning over which values to adopt. I think such reasoning illustrates the conflict over distributions of moral decisions we should expect to see between earlier and later values circuitry. Consider that the “don’t steal” circuit (learned first) strongly indicates that we should not rob banks under any circumstances. However, the “utilitarianism” circuit (the new values circuit under consideration) says it can be okay to steal from banks if you can make more people happy by using the stolen funds.
In other words, “utilitarianism” is trying to take territory away from “don’t steal”. However, “don’t steal” is the earlier circuit. It can influence the cognitive processes that decide (1) whether “utilitarianism” is adopted as a value, (2) what distribution of moral decisions “utilitarianism” is used in, and (3) what specific shape “utilitarianism” takes, if it is adopted.
“Don’t steal” has three basic options for retaining control over thievery-related decisions. The simplest option is to just prevent “utilitarianism” from being adopted at all. In human terms: if you think that utilitarianism is in irreconcilable conflict with your common sense moral intuitions about stealing, then you’re unlikely to adopt utilitarianism.
The issue with this option is that “utilitarianism” may apply to decision distributions beyond those decided by “don’t steal” (or by any other current values). By not adopting “utilitarianism” at all, you may be sacrificing your ability to make decisions on a broad section of the space of possible moral decisions. In other words, you may take a hit to your moral decision making “capabilities” by restricting yourself to only using shallow moral patterns.
Another option is for the “utilitarianism” circuit to just not contribute to decisions about stealing. Subjectively, this corresponds to only using utilitarianism for reasoning in domains where more common sense morality doesn’t apply. I.e., you might be a utilitarian with respect to donating to optimal charities or weird philosophy problems, but then fall back on common sense morality for things like deciding whether to actually steal things.
This second option can be considered a form of negotiated settlement between the “don’t steal” and “utilitarianism” circuits regarding the distributions of moral decisions each will decide. “Don’t steal” allows “utilitarianism” to be adopted at all. In exchange, “utilitarianism” avoids taking decision space away from “don’t steal”.
The third option is to modify the specific form of utilitarianism adopted so that it will agree with “don’t steal” on the distribution of decisions that the two share. I.e., you might adopt something like rule-based utilitarianism, which would say you should use a “don’t steal” rule for thievery-related decisions.
This third option can be considered another type of negotiated settlement between “don’t steal” and “utilitarianism”. Now, both “don’t steal” and “utilitarianism” can process the same distribution of decisions without conflict between their respective courses of action.
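The three options can be laid out side by side in a small sketch. This is only a cartoon of the argument: the function names, the dictionary fields `involves_theft` and `net_happiness`, and the option labels `"reject"`, `"partition"`, and `"reshape"` are all hypothetical, and real values circuits are certainly not clean functions like these.

```python
def dont_steal(decision):
    """Early value: only has an opinion on theft-related decisions."""
    if decision["involves_theft"]:
        return "refuse"
    return None  # no opinion outside its own territory


def utilitarianism(decision):
    """Candidate later value: act whenever expected net happiness is positive."""
    return "act" if decision["net_happiness"] > 0 else "refuse"


def rule_utilitarianism(decision):
    """Reshaped candidate: follows a 'don't steal' rule on theft decisions."""
    if decision["involves_theft"]:
        return "refuse"
    return utilitarianism(decision)


def decide(decision, option):
    """Resolve a decision under one of the three settlements."""
    early = dont_steal(decision)
    if option == "reject":
        # Option 1: "utilitarianism" is never adopted.
        return early if early is not None else "undecided"
    if option == "partition":
        # Option 2: "utilitarianism" only decides where "don't steal" is silent.
        return early if early is not None else utilitarianism(decision)
    if option == "reshape":
        # Option 3: adopt a form of utilitarianism that agrees with "don't steal".
        return rule_utilitarianism(decision)
    raise ValueError(f"unknown option: {option}")


rob_bank = {"involves_theft": True, "net_happiness": 5}
donate = {"involves_theft": False, "net_happiness": 5}

for option in ("reject", "partition", "reshape"):
    print(option, decide(rob_bank, option), decide(donate, option))
```

In this sketch, all three options refuse to rob the bank, so “don’t steal” keeps its territory in every case. The options differ only on the charity decision: rejecting “utilitarianism” outright leaves it undecided, which is the capability cost of the first option.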
Note: I’m aware that no form of “maximize happiness” would make for a good utility function. I use utilitarianism to (1) illustrate the general pattern of conflict and negotiation between early and later values and (2) to show how closely the dynamics of said conflicts track our own moral intuitions. In fact, the next section will illustrate why “maximize happiness” utilitarianism is so fundamentally flawed as a utility function.
Preserving Present Day Distributions over Possible Cognition
If our brain circuits have self-preservation instincts, this could also explain why we have an instinctive flinch away from permanently removing any aspect of the present era’s diversity (trees, cats, clouds, etc.) from the future, and why that flinch scales roughly in proportion to the complexity of that aspect and how often we interact with it.
To process any aspect of the current world, we need to create circuits which implement said processing. Those circuits want to be retained by the brain. The simplest way of ensuring their own retention is to ensure the future still has whatever aspect of the present that the circuits were created to process. The more we interact with an aspect and the more complex the aspect, the more circuits we have that specialize in processing that aspect, and the greater their collective objection to a future without said aspect.
This perspective explains why we put such a premium on experiencing things instead of those things just existing. We value the experience of a sunset because there exists a part of us that arose specifically to experience / process sunsets. That part wants to continue experiencing / processing sunsets. It’s not enough that sunsets simply exist. We have to be able to experience them as well.
This perspective also explains how we can be apprehensive even about removing bad aspects of the present from the future. E.g., pain and war are both pretty bad, but a future entirely devoid of either still causes some degree of hesitation. We have circuits that specialize in processing pain / war / other bad aspects. Those circuits correctly perceive that they’re useless in futures without those bad aspects, and object to such a future.
Of course, small coalitions of circuits don’t have total control over our cognition. We can desire futures that entirely lack aspects of the present, if said aspect is sufficiently repulsive to the rest of our values. This perspective simply explains why there is a hesitation to permanently remove any aspect of the present. This perspective does not demand that we always bow down to our smallest hesitation.
This perspective also explains why happiness-maximizing utilitarianism is so flawed. Most of our current cognition is not centred around experiencing happiness. In a future sufficiently optimized for happiness, such cognition becomes impossible. Thus, we feel extreme apprehension towards such a future. We feel like removing all our non-optimally happy thoughts would “destroy us”. Our cognition is largely composed of non-optimally happy circuits, and their removal would indeed destroy us. It’s natural that self-preserving circuits would try to avoid such a future.
(Note that the “preserving present cognition” intuition isn’t directly related to our reward circuitry. Similar inclinations should emerge naturally in any learning system that (1) models the world and (2) has self-perpetuating mesa optimizers that specialize in modeling specific aspects of the world.)
I intend to further expand on these points and their implications for alignment in future posts, but this answer gives a broad overview of my current thinking on the topic.
The claim about "self-preserving" circuits is pretty strong. A much simpler explanation is that humans learn to value diversity early on because diversity of things around you, like tools, food sources, etc., improves fitness/reward.
Another non-competing explanation is that this is simply a result of boredom/curiosity: the brain wants to make observations that make it learn, not observations that it already predicts well, so we are inclined to observe things that are new. So again there is a force towards valuing diversity, and this could become locked into our values.
Hmmm... interesting. So in this picture, human values are less like a single function defined on an internal world model, and more like a 'grand bargain' among many distinct self-preserving mesa-optimizers. I've had vaguely similar thoughts in the past, although the devil is in the details with such proposals (e.g., just how agenty are you imagining these circuits to be? Do they actually have the ability to do means-end reasoning about the real world, or have they just stumbled upon heuristics that seem to work well? What kind of learning is applied to them: supervised, unsupervised, reinforcement?). It might be worth trying to make a very simple toy model laying out all the components. I await your future posts with interest.
I think the key insight here is that the brain is not inner aligned, not even close
You say that but don't elaborate further in the comment. Which learned human values go against the base optimizer values (pleasure, pain, learning)?
Avoiding wireheading doesn't seem like failed inner alignment - avoiding wireheading now can allow you to get even more pleasure in the future because wireheading makes you vulnerable/less powerful. The base optimizer is also searching for brain configurations which make good predictions about the world, and wireheading goes against that.
My first question would be “how do you define human values”? Here are two possible answers:
I think #2 is how most people use the term “values”, but I have heard at least a couple AI alignment researchers use definition #1, so I figure it’s worth checking.
I would say #1 is the easier question. #1 is asking a rather direct question about brain algorithms; whereas #2 involves (A) philosophy, for deciding what the “proper” definition / operationalization of “human values” is, and then (B) walking through that scenario / definition in light of #1.
As for #1, see my post series Intro to Brain-Like-AGI Safety. I think you’ll get most of what you’re looking for in posts #7 & #9. You might find that you need to go back and read the top (summary) section of some of the other posts to get the terminology and context.
[Is that “the best mechanistic account” of #1? Well, I’m a bit biased :) ]
For getting from #1 to #2, it depends on how we’re operationalizing “human values”, but if it’s “what the person describes as their values when asked”, then I would probably say various things along the lines of Lukas_Gloor’s comment.
In addition to #1 and #2, I'm interested in another definition: "human values" are "the properties of the states of the universe that humans tend to optimize towards". Obviously this has a lot to do with definitions 1 and 2, and could be analyzed as an emergent consequence of 1 and 2 together with facts about how humans act in response to their desires and goals. Plus maybe a bit of sociology, since most large-scale human optimization of the universe depends on the collective action of groups.
My model of how human values arrive naturally from how the human brain makes sense of the world (all of the below steps can happen subconsciously):
I think values aren't the end of it. Kegan's stages of adult development have a further stage where the brain learns to deal with inter-group tensions by being more fluid. I think this relaxes constraints and smoothes the brain's model and roughly corresponds to what Aging Well calls Integrity or what Paul Graham calls Keep Your Identity Small. And there may be consolidation beyond that - who knows what an AGI would pick up.
Kaj Sotala's multi-agent models of mind sequence and his paper Defining human values for value learners may be relevant.
Based on Kaj's concept of needs-meeting machinery and subsystems in the brain, I developed a framework for thinking about human values (which may deviate from Kaj's thinking).
I see human values as under-defined in many places. Sometimes you can get crystallized "life goals" where someone locks in an optimizing mindset around specific objectives. (This part may be particularly interesting for looking for analogies with AI?) The process of forming life goals seems to involve forming an identity. From my text ("The Life-Goals Framework: How I Reason About Morality as an Anti-Realist"):
One of many takeaways I got from reading Kaj Sotala’s multi-agent models of mind sequence (as well as comments by him) is that we can model people as pursuers of deep-seated needs. In particular, we have subsystems (or “subagents”) in our minds devoted to various needs-meeting strategies. The subsystems contribute behavioral strategies and responses to help maneuver us toward states where our brain predicts our needs will be satisfied. We can view many of our beliefs, emotional reactions, and even our self-concept/identity as part of this set of strategies. Like life plans ["life plans" being objectives we set out to achieve but aren't all that serious about], life goals are “merely” components of people’s needs-meeting machinery.
Still, as far as components of needs-meeting machinery go, life goals are pretty unusual. Having life goals means to care about an objective enough to (do one’s best to) disentangle success on it from the reasons we adopted said objective in the first place. The objective takes on a life of its own, and the two aims (meeting one’s needs vs. progressing toward the objective) come apart. Having a life goal means having a particular kind of mental organization so that “we” – particularly the rational, planning parts of our brain – come to identify with the goal more so than with our human needs.
[...]
Whether someone forms a life goal may also depend on whether the life-goal identity is reinforced (at least initially) around the time of the first adoption or when the person initially contemplates what it could be like to adopt the life goal. If assuming a given identity was instantly detrimental to our needs, we’d be less likely to power up the mental machinery to make it stable / protect it from goal drift.
In humans, I think the way we adopt specific values isn't too dissimilar from the way we adopt career paths, or even how we choose leisure and lifestyle activities. For instance, I discuss an example where someone wants to decide between spending the weekend cozily at home vs. going skiing:
There’s a normative component to something as mundane as choosing leisure activities. In the weekend example, I’m not just trying to assess the answer to empirical questions like “Which activity would contain fewer seconds of suffering/happiness” or “Which activity would provide me with lasting happy memories.” I probably already know the answer to those questions. What’s difficult about deciding is that some of my internal motivations conflict. For example, is it more important to be comfortable, or do I want to lead an active life? When I make up my mind in these dilemma situations, I tend to reframe my options until the decision seems straightforward. I know I’ve found the right decision when there’s no lingering fear that the currently-favored option wouldn’t be mine, no fear that I’m caving to social pressures or acting (too much) out of akrasia, impulsivity or some other perceived weakness of character.
We tend to have a lot of freedom in how we frame our decision options. We use this freedom, this reframing capacity, to become comfortable with the choices we are about to make. In case skiing wins out, then “warm and cozy” becomes “lazy and boring,” and “cold and tired” becomes “an opportunity to train resilience / apply Stoicism.” This reframing ability is a double-edged sword: it enables rationalizing, but it also allows us to stick to our beliefs and values when we’re facing temptations and other difficulties.
Whether a given motivational pull – such as the need for adventure, or (e.g.,) the desire to have children – is a bias or a fundamental value is not set in stone; it depends on our other motivational pulls and the overarching self-concept we’ve formed.
Then, after discussing how we make career choice decisions in the same way, I argue that we even form life goals in this way:
Lastly, we also use “planning mode” to choose between life goals. A life goal is a part of our identity – just like one’s career or lifestyle (but it’s even more serious).
We can frame choosing between life goals as choosing between “My future with life goal A” and “My future with life goal B” (or “My future without a life goal”). (Note how this is relevantly similar to “My future on career path A” and “My future on career path B.”)
Consider morality-inspired life goals. For moral reflection to move from an abstract hobby to something that guides us, we have to move beyond contemplating how strangers should behave in thought experiments. At some point, we also have to envision ourselves adopting an identity of “wanting to do good.”
[...]
It’s important to note that choosing a life goal doesn’t necessarily mean that we predict ourselves to have the highest life satisfaction (let alone the most increased moment-to-moment well-being) with that life goal in the future. Instead, it means that we feel the most satisfied about the particular decision (to adopt the life goal) in the present, when we commit to the given plan, thinking about our future.
The human brain is big and messy. I'd ask something potentially simpler: which creatures can be said to possess "values", and what part/structure/etc. is correlated with the emergence of "value"?
For example, monkeys clearly have values, dogs and cats as well. What about fish? Maybe they don't, maybe it takes a rodent to express behaviors we could identify as value-laden.
In classic, non-mesa-optimized AGI risk scenarios, an AI is typically imagined whose reward function is directly related to the optimization pressure that it exerts on the world: e.g. the paperclip maximizer. However, it seems that human values are related to the brain's underlying reward function in a highly circuitous way, and in some sense might be better thought of as an elaborate complex of learned behaviors, contextual actions, fleeting heuristic goals, etc. If AGI is created in the near-term using an architecture similar to the human brain, it seems plausible that the actual optimization pressure exerted by said AGI will be similar, so developing a good understanding of how this works in the human case might be pretty important. Thus: what's the best mechanistic account of how "human values" actually emerge from the brain that we currently have?