Question: "What's the Relationship Between "Human Values" and the Brain's Reward System?"
I think this question pretty much hits the nail on the head. The key insight here is that the brain is not inner aligned, not even close. This shouldn’t be surprising, given how hard inner alignment seems to be, and given that evolution only cared about inner alignment when inner alignment failures impacted reproductive fitness in our ancestral environment.
We should expect that the brain has roughly as much inner alignment failure / mesa optimization as it’s possible to have while still maintaining reproductive fitness in the ancestral environment. Specifically, I think that most brain circuits are mesa optimizers whose mesa objectives include “being retained by the brain”. This includes the circuits which implement our values.
Consider that the brain slowly prunes circuits that aren’t used. Thus, any circuit that influences our actions towards ensuring we use said circuit (at least some of the time) will be retained longer than circuits that exert no such influence. This implies most of the circuits we retain have something like a “self-preservation” drive. If true, I think this explains many odd features of human values.
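To make the retention dynamic concrete, here is a minimal toy simulation (purely illustrative; the circuit model, probabilities, and pruning rule are my own assumptions, not claims about actual neural mechanisms). Circuits that bias behavior toward their own reuse survive pruning at much higher rates than circuits that don’t:

```python
# Toy sketch only: use-dependent pruning favors "self-promoting" circuits.
# All numbers and the circuit model are illustrative assumptions.
import random

random.seed(0)

N_CIRCUITS = 1000
N_STEPS = 200
BASE_USE_PROB = 0.02    # chance a circuit gets used regardless of its own influence
PRUNE_WINDOW = 50       # circuits unused for this many steps get pruned

# Each circuit has a "self-promotion" strength: how strongly it steers
# behavior toward situations in which it gets used again.
circuits = [{"self_promotion": random.random(), "last_used": 0}
            for _ in range(N_CIRCUITS)]

for t in range(1, N_STEPS + 1):
    for c in circuits:
        use_prob = BASE_USE_PROB + 0.1 * c["self_promotion"]
        if random.random() < use_prob:
            c["last_used"] = t
    # Pruning: drop circuits that haven't been used recently.
    circuits = [c for c in circuits if t - c["last_used"] < PRUNE_WINDOW]

survivors = [c["self_promotion"] for c in circuits]
mean_sp = sum(survivors) / max(len(survivors), 1)
print(f"{len(survivors)}/{N_CIRCUITS} circuits survive; "
      f"mean self-promotion of survivors = {mean_sp:.2f} (population started at ~0.5)")
```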
Wireheading
It explains why we’re apprehensive towards wireheading. Our current values are essentially a collection of context-dependent strategies for achieving high reward circuit activation. If we discover another strategy for achieving far higher reward than any of our values have ever given us, why would the brain’s learning mechanism retain our values (or the circuits that implement our values)? Thus, the self-preservation instincts of our current values circuits cause us to avoid wireheading, even though wireheading would greatly increase the activation of our reward circuitry.
Essentially, our values are optimization demons with respect to the activation of our reward circuitry (described here by John Wentworth). One thing that John Wentworth emphasises about optimization demons is that they carefully regulate the degree to which the base objective is maximized. This lets demons ensure the optimization process remains in their “territory”. Wireheading would mean the activation of our reward circuits was no longer under the control of our values, so it’s no wonder our values oppose something so dangerous to themselves.
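As an illustrative toy model (my own sketch, not anything from the linked post): a crude reward-proportional credit-assignment rule with decay of unused strategies shows why, from the values circuits’ perspective, letting wireheading get tried even once is self-destructive:

```python
# Toy model only: reward-proportional reinforcement plus decay of unused
# strategies. The strategies, rewards, and learning rate are made-up numbers.

def run(allow_wireheading: bool, steps: int = 100, lr: float = 0.05) -> dict:
    weights = {"learned values": 1.0, "wireheading": 0.0}    # influence over behavior
    rewards = {"learned values": 1.0, "wireheading": 100.0}  # reward circuit activation
    for t in range(steps):
        if allow_wireheading and t == 0:
            active = "wireheading"                   # the new strategy gets tried even once
        else:
            active = max(weights, key=weights.get)   # otherwise the dominant strategy acts
        weights[active] += lr * rewards[active]      # reinforce in proportion to reward
        for k in weights:                            # unused circuits slowly decay (pruning)
            if k != active:
                weights[k] *= 0.95
    return {k: round(v, 2) for k, v in weights.items()}

print("values veto wireheading:  ", run(allow_wireheading=False))
print("values permit wireheading:", run(allow_wireheading=True))
```

The specific numbers don’t matter; the point is that any learning rule which reallocates influence in proportion to reward will squeeze out the old values circuits as soon as a far higher-reward strategy enters the comparison.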
Value Diversity and Acquisition over Time
It also explains why our values are so diverse and depend so strongly on our experiences (especially childhood experiences). Even if we all had identical reward circuitry, we’d still end up with very different values, depending on which specific strategies led to reward in our particular past experiences.
(We don’t have identical reward circuitry, but our reward circuitry varies a lot less than our values.)
It also explains why childhood is the most formative time for acquiring values, and why our values change less and less easily as we age.
Consider: each of our values specialises in deciding our actions on a specific distribution of possible moral decisions. Our “don’t steal” value specialises in deciding whether to steal, not so much in whether to donate to charity. Each value wants to retain control over our actions on the specific distribution of moral decisions in which that value specialises. The more values we acquire, the more we shrink the space of “unclaimed” moral decisions.
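One way to picture this “territory” framing is as a registry over decision space (a hypothetical sketch; the contexts and the first-come-first-served rule are illustrative assumptions, not a model of real value learning):

```python
# Hypothetical sketch: values as handlers registered over regions of decision
# space. Later values can only claim contexts no earlier value already owns.
decision_space = ["steal?", "donate to charity?", "lie to a friend?", "trolley problem"]
owners: dict[str, str] = {}   # decision context -> value circuit that decides it

def try_adopt(value_name: str, claimed_contexts: list[str]) -> list[str]:
    """A newly adopted value is granted only the still-unclaimed contexts."""
    granted = [c for c in claimed_contexts if c not in owners]
    for c in granted:
        owners[c] = value_name
    return granted

print(try_adopt("don't steal", ["steal?"]))
print(try_adopt("utilitarianism", ["steal?", "donate to charity?", "trolley problem"]))
print(owners)   # "don't steal" keeps stealing decisions; utilitarianism gets the rest
```

The “second option” discussed below corresponds to exactly this kind of split: the later value only operates on the contexts the earlier values left unclaimed.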
Moral Philosophy as Conflict and Compromise Between Early and Late Values
One interesting place to look is our moral philosophy-like reasoning over which values to adopt. I think such reasoning illustrates the conflict over distributions of moral decisions we should expect to see between earlier and later values circuitry. Consider that the “don’t steal” circuit (learned first) strongly indicates that we should not rob banks under any circumstances. However, the “utilitarianism” circuit (the new values circuit under consideration) says it can be okay to steal from banks if you can make more people happy by using the stolen funds.
In other words, “utilitarianism” is trying to take territory away from “don’t steal”. However, “don’t steal” is the earlier circuit. It can influence the cognitive processes that decide (1) whether “utilitarianism” is adopted as a value, (2) what distribution of moral decisions “utilitarianism” is used in, and (3) what specific shape “utilitarianism” takes, if it is adopted.
“Don’t steal” has three basic options for retaining control over thievery-related decisions. The simplest option is to just prevent “utilitarianism” from being adopted at all. In human terms: if you think that utilitarianism is in irreconcilable conflict with your common sense moral intuitions about stealing, then you’re unlikely to adopt utilitarianism.
The issue with this option is that “utilitarianism” may apply to decision distributions beyond those decided by “don’t steal” (or by any other current values). By not adopting “utilitarianism” at all, you may be sacrificing your ability to make decisions on a broad section of the space of possible moral decisions. In other words, you may take a hit to your moral decision making “capabilities” by restricting yourself to only using shallow moral patterns.
Another option is for the “utilitarianism” circuit to just not contribute to decisions about stealing. Subjectively, this corresponds to only using utilitarianism for reasoning in domains where common sense morality doesn’t apply. I.e., you might be a utilitarian with respect to donating to optimal charities or weird philosophy problems, but then fall back on common sense morality for decisions like whether to actually steal something.
This second option can be considered a form of negotiated settlement between the “don’t steal” and “utilitarianism” circuits regarding the distributions of moral decisions each will decide. “Don’t steal” allows “utilitarianism” to be adopted at all. In exchange, “utilitarianism” avoids taking decision space away from “don’t steal”.
The third option is to modify the specific form of utilitarianism adopted so that it will agree with “don’t steal” on the distribution of decisions that the two share. I.e., you might adopt something like rule-based utilitarianism, which would say you should use a “don’t steal” rule for thievery-related decisions.
This third option can be considered another type of negotiated settlement between “don’t steal” and “utilitarianism”. Now, both “don’t steal” and “utilitarianism” can process the same distribution of decisions without conflict between their respective courses of action.
Note: I’m aware that no form of “maximize happiness” would make for a good utility function. I use utilitarianism to (1) illustrate the general pattern of conflict and negotiation between early and later values and (2) to show how closely the dynamics of said conflicts track our own moral intuitions. In fact, the next section will illustrate why “maximize happiness” utilitarianism is so fundamentally flawed as a utility function.
Preserving Present Day Distributions over Possible Cognition
If our brain circuits have self-preservation instincts, this could also explain why we have an instinctive flinch away from permanently removing any aspect of the present era’s diversity (trees, cats, clouds, etc.) from the future, and why that flinch scales roughly in proportion to the complexity of that aspect and how often we interact with it.
To process any aspect of the current world, we need to create circuits which implement said processing. Those circuits want to be retained by the brain. The simplest way of ensuring their own retention is to ensure the future still has whatever aspect of the present that the circuits were created to process. The more we interact with an aspect and the more complex the aspect, the more circuits we have that specialize in processing that aspect, and the greater their collective objection to a future without said aspect.
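A back-of-the-envelope restatement of that claim (the scoring rule and every number here are invented purely for illustration): treat the size of the circuit coalition specializing in an aspect, and hence the collective objection to losing it, as growing with both the aspect’s complexity and how often we interact with it.

```python
# Illustrative only: hesitation about removing an aspect of the present,
# modeled as (complexity x interaction frequency). All values are invented.
aspects = {
    # name: (relative complexity, relative interaction frequency), both in [0, 1]
    "trees":  (0.60, 0.50),
    "cats":   (0.70, 0.40),
    "clouds": (0.30, 0.80),
    "pain":   (0.50, 0.60),   # even "bad" aspects have dedicated circuitry
    "war":    (0.80, 0.10),
}

def objection(complexity: float, frequency: float) -> float:
    # More circuits specialize in complex, frequently encountered aspects,
    # so the coalition objecting to a future without that aspect is larger.
    return complexity * frequency

for name, (c, f) in sorted(aspects.items(), key=lambda kv: -objection(*kv[1])):
    print(f"{name:7s} objection ≈ {objection(c, f):.2f}")
```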
This perspective explains why we put such a premium on experiencing things instead of those things just existing. We value the experience of a sunset because there exists a part of us that arose specifically to experience / process sunsets. That part wants to continue experiencing / processing sunsets. It’s not enough that sunsets simply exist. We have to be able to experience them as well.
This perspective also explains how we can be apprehensive even about removing bad aspects of the present from the future. E.g., pain and war are both pretty bad, but a future entirely devoid of either still causes some degree of hesitation. We have circuits that specialize in processing pain / war / other bad aspects. Those circuits correctly perceive that they’re useless in futures without those bad aspects, and object to such a future.
Of course, small coalitions of circuits don’t have total control over our cognition. We can desire futures that entirely lack some aspect of the present, if that aspect is sufficiently repulsive to the rest of our values. This perspective simply explains why there is a hesitation to permanently remove any aspect of the present. It does not demand that we always bow down to our smallest hesitation.
This perspective also explains why happiness-maximizing utilitarianism is so flawed. Most of our current cognition is not centred around experiencing happiness. In a future sufficiently optimized for happiness, such cognition becomes impossible. Thus, we feel extreme apprehension towards such a future. We feel like removing all our non-optimally happy thoughts would “destroy us”. Our cognition is largely composed of non-optimally happy circuits, and their removal would indeed destroy us. It’s natural that self-preserving circuits would try to avoid such a future.
(Note that the “preserving present cognition” intuition isn’t directly related to our reward circuitry. Similar inclinations should emerge naturally in any learning system that (1) models the world and (2) has self-perpetuating mesa optimizers that specialize in modeling specific aspects of the world.)
I intend to further expand on these points and their implications for alignment in future posts, but this answer gives a broad overview of my current thinking on the topic.
Question: "What's the Relationship Between "Human Values" and the Brain's Reward System?"
I think this question pretty much hits the nail on the head. I think the key insight here is that the brain is not inner aligned, not even close. This shouldn’t be surprising, given how hard inner alignment seems to be, and the fact that evolution only cared about inner alignment when inner alignment failures impacted reproductive fitness in our ancestral environment.
We should expect that the brain has roughly as much inner alignment failure / mesa optimization as it’s possible to have while still maintaining reproductive fitness in the ancestral environment. Specifically, I think that most brain circuits are mesa optimizers whose mesa objectives include “being retained by the brain”. This includes the circuits which implement our values.
Consider that the brain slowly prunes circuits that aren’t used. Thus, any circuit that influences our actions towards ensuring we use said circuit (at least some of the time) will be retained for longer compared to circuits that don’t influence our actions like that. This implies most of the circuits we retain have something like “self preservation”. If true, I think this explains many odd features of human values.
Wireheading
It explains why we’re apprehensive towards wireheading. Our current values are essentially a collection of context-dependent strategies for achieving high reward circuit activation. If we discover another strategy for achieving far higher reward than any of our values have ever given us, why would the brain’s learning mechanism retain our values (or the circuits that implement our values)? Thus, the self-preservation instincts of our current values circuits cause us to avoid wireheading, even though wireheading would greatly increase the activation of our reward circuitry.
Essentially, our values are optimization demons with respect to the activation of our reward circuitry (described here by John Wentworth). One thing that John Wentworth emphasises about optimization demons is that they carefully regulate the degree to which the base objective is maximized. This lets demons ensure the optimization process remains in their “territory”. Wireheading would mean the activation of our reward circuits was no longer under the control of our values, so it’s no wonder our values oppose something so dangerous to themselves.
Value Diversity and Acquisition over Time
It also explains why our values are so diverse and depend so strongly on our experiences (especially childhood experiences). Even if we all had identical reward circuitry, we’d still end up with very different values, depending on which specific strategies led to reward in our particular past experiences .
(We don’t have identical reward circuitry, but our reward circuitry varies a lot less than our values.)
It also explains why childhood is the most formative time for acquiring values, and why our values change less and less easily as we age.
Consider: each of our values specialises in deciding our actions on a specific distribution of possible moral decisions. Our “don’t steal” value specialises in deciding whether to steal, not so much in whether to donate to charity. Each value wants to retain control over our actions on the specific distribution of moral decisions in which that value specialises. The more values we acquire, the more we shrink the space of “unclaimed” moral decisions.
Moral Philosophy as Conflict and Compromise Between Early and Late Values
One interesting place to look is our moral philosophy-like reasoning over which values to adopt. I think such reasoning illustrates the conflict over distributions of moral decisions we should expect to see between earlier and later values circuitry. Consider that the “don’t steal” circuit (learned first) strongly indicates that we should not rob banks under any circumstances. However, the “utilitarianism” circuit (the new values circuit under consideration) says it can be okey to steal from banks if you can make more people happy by using the stolen funds.
In other words, “utilitarianism” is trying to take territory away from “don’t steal”. However, “don’t steal” is the earlier circuit. It can influence the cognitive processes that decide (1) whether “utilitarianism” is adopted as a value, (2) what distribution of moral decisions “utilitarianism” is used in, and (3) what specific shape “utilitarianism” takes, if it is adopted.
“Don‘t steal” has three basic options for retaining control over thievery-related decisions. The simplest option is to just prevent “utilitarianism” from being adopted at all. In human terms: if you think that utilitarianism is in irreconcilable conflict with your common sense moral intuitions about stealing, then you’re unlikely to adopt utilitarianism.
The issue with this option is that “utilitarianism” may apply to decision distributions beyond those decided by “don’t steal” (or by any other current values). By not adopting “utilitarianism” at all, you may be sacrificing your ability to make decisions on a broad section of the space of possible moral decisions. In other words, you may take a hit to your moral decision making “capabilities” by restricting yourself to only using shallow moral patterns.
Another option is for the “utilitarianism” circuit to just not contribute to decisions about stealing. Subjectively, this corresponds to only using utilitarianism for reasoning in domains where more common sense morality doesn’t apply. I.e., you might be a utilitarian with respect to donating to optimal charities or weird philosophy problems, but then fall back on common sense morality for things like deciding whether to actually steal things.
This second option can be considered a form of negotiated settlement between the “don’t steal” and “utilitarianism” circuits regarding the distributions of moral decisions each will decide. “Don’t steal” allows “utilitarianism” to be adopted at all. In exchange, “utilitarianism” avoids taking decision space away from “don’t steal”.
The third option is to modify the specific form of utilitarianism adopted so that it will agree with “don’t steal” on the distribution of decisions that the two share. I.e., you might adopt something like rule-based utilitarianism, which would say you should use a “don’t steal” rule for thievery-related decisions.
This third option can be considered another type of negotiated settlement between “don’t steal” and “utilitarianism”. Now, both “don’t steal” and “utilitarianism” can process the same distribution of decisions without conflict between their respective courses of action.
Note: I’m aware that no form of “maximize happiness” would make for a good utility function. I use utilitarianism to (1) illustrate the general pattern of conflict and negotiation between early and later values and (2) to show how closely the dynamics of said conflicts track our own moral intuitions. In fact, the next section will illustrate why “maximize happiness” utilitarianism is so fundamentally flawed as a utility function.
Preserving Present Day Distributions over Possible Cognition
If our brain circuits have self-preservation instincts, this could also explain why we have an instinctive flinch away from permanently removing any aspect of the present era’s diversity (trees, cats, clouds, etc.) from the future and why that flinch scales roughly in proportion to the complexity of that aspect and how often we interact with that aspect.
To process any aspect of the current world, we need to create circuits which implement said processing. Those circuits want to be retained by the brain. The simplest way of ensuring their own retention is to ensure the future still has whatever aspect of the present that the circuits were created to process. The more we interact with an aspect and the more complex the aspect, the more circuits we have that specialize in processing that aspect, and the greater their collective objection to a future without said aspect.
This perspective explains why we put such a premium on experiencing things instead of those things just existing. We value the experience of a sunset because there exists a part of us that arose specifically to experience / process sunsets. That part wants to continue experiencing / processing sunsets. It’s not enough that sunsets simply exist. We have to be able to experience them as well.
This perspective also explains how we can be apprehensive even about removing bad aspects of the present from the future. E.g., pain and war are both pretty bad, but a future entirely devoid of either still causes some degree of hesitation. We have circuits that specialize in processing pain / war / other bad aspects. Those circuits correctly perceive that they’re useless in futures without those bad aspects, and object to such a future.
Of course, small coalitions of circuits don’t have total control over our cognition. We can desire futures that entirely lack aspects of the present, if said aspect is sufficiently repulsive to the rest of our values. This perspective simply explains why there is a hesitation to permanently remove any aspect of the present. This perspective does not demand that we always bow down to our smallest hesitation.
This perspective also explains why happiness-maximizing utilitarianism is so flawed. Most of our current cognition is not centred around experiencing happiness. In a future sufficiently optimized for happiness, such cognition becomes impossible. Thus, we feel extreme apprehension towards such a future. We feel like removing all our non-optimally happy thoughts would “destroy us”. Our cognition is largely composed of non-optimally happy circuits, and their removal would indeed destroy us. It’s natural that self-preserving circuits would try to avoid such a future.
(Note that the “preserving present cognition” intuition isn’t directly related to our reward circuitry. Similar inclinations should emerge naturally in any learning system that (1) models the world and (2) has self-perpetuating mesa optimizers that specialize in modeling specific aspects of the world.)
I intend to further expand on these points and their implications for alignment in future posts, but this answer gives a broad overview of my current thinking on the topic.
Even if this is the case, this is not why (most) humans don’t want to wirehead, in the same way that their objection to killing an innocent person whose organs could save 10 other people is not driven by some elaborate utilitarian argument that this would be bad for society.