This was definitely an interesting and persuasive presentation of the idea. I think this goes to the same place as learning from behavior in the end, though.
For behavior: In the ancestral environment, we behaved like we wanted nourishing food and reproduction. In the modern environment we behave like we want tasty food and sex. Given a button that pumps heroin into our brain, we might behave like we want heroin pumped into our brains.
For valence, the set of preferences that optimizing valence cashes out to depends on the environment. We, in the modern environment, don't want to be drugged to maximize some neural signal. But if we were raised on super-heroin, we'd probably just want super-heroin. Even assuming this single-neurological-signal hypothesis, we aren't valence-optimizers; we are the learned behavior of a system whose training procedure relies on the valence signal.
Ex hypothesi, we're going to have learned preferences that won't optimize valence, but might still be understandable in terms of a preference maturation process that is "trying" to optimize valence but ran into distributional shift or adversarial optimization or something. These preferences (like refusing the heroin) are still fully valid human preferences, and you're going to need to look at human behavior to figure out what they are (barring big godlike a priori reasoning), which entails basically similar philosophical problems as getting all values from behavior without this framework.
I'm hopeful that this won't be true in a certain, limited way. Scanning brains and observing how neurons operate to determine a human's behavior is a very different sort of operation from observing their behavior "from the outside", the way we observe people's behavior today. Much of the difficulty comes from the fact that observing behavior with only our unaided senses, and without a deep model of the brain, forces us to make very large normative assumptions to get the power needed to infer how a human values things. But if we have a model like this and it appears to be correct, then we can, practically speaking, make "smaller", less powerful normative assumptions, because we understand and can work out the details of more of the gears of the mind.
The result is that in a certain sense we are still concerned with behavior, but because the level of detail is so much higher and the model so much richer we are less likely to find ourselves making mistakes from having taken large inferential leaps as we would if we observed behavior in the normal sense.
Where does human valence come from? Is it biologically encoded, like the positive valence of orgasm, or is it learned, like the positive valence of Coca-Cola?
If it is all biological, does that mean our valence is shaped by the convergent goals of Darwinian evolution?
I maybe don't quite understand your first two questions. If you're asking "where does positive valence come from?", my answer is "minimization of prediction error", keeping in mind that I think of that as a fancy way to say "feedback signal indicating a control system is moving towards a setpoint". I forget how to translate that into terms of Friston's free energy (increasing it? decreasing it?) if you prefer that model, but the point is that valence is a fundamental thing the brain does to signal parts of itself to do more or less of something.
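To make the control-system reading concrete, here is a toy sketch (purely illustrative, not a claim about neural implementation) of valence as the feedback produced as a controller's prediction error shrinks or grows relative to a setpoint; all names and numbers are made up for the example.

```python
# Toy illustration only: valence as the feedback signal of a control loop
# moving towards (or away from) a setpoint. Not a neuroscience claim.

def valence_signal(setpoint: float, observed: float, prev_error: float):
    """Return (valence, new_error): valence is positive when prediction
    error shrinks (moving towards the setpoint), negative when it grows."""
    error = abs(setpoint - observed)
    valence = prev_error - error  # reduction in error reads as positive valence
    return valence, error

# Example: a controller tracking a temperature-like setpoint.
setpoint = 37.0
readings = [35.0, 36.0, 36.8, 37.1, 38.0]
error = abs(setpoint - readings[0])
for obs in readings[1:]:
    v, error = valence_signal(setpoint, obs, error)
    print(f"observed={obs:4.1f}  valence={v:+.2f}")
```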
As to your second question, valence is absolutely shaped by evolution so long as we hold the theory that all creatures with nerve cells have come to exist via evolutionary processes (maybe better to taboo "evolution" and say "differential reproduction with trait inheritance"). What effect evolution has had on valence seems a matter for evolutionary psychology and related studies of the evolutionary etiology of animal behavior.
I have previously advocated for finding a mathematically precise theory for formally approaching AI alignment. Most recently I couched this in terms of predictive coding, and longer ago I was thinking in terms of a formalized phenomenology, but further discussions have helped me realize that, while I consider those approaches useful and they helped me discover my position, they are not the heart of what I think is important. The heart, modulo additional paring down that may come as a result of discussions sparked by this post, is that human values are rooted in valence, and thus if we want to build AI aligned with human values we must be able to understand how values arise from valence.
Peter Carruthers has kindly and acausally done me the favor of laying out large parts of the case for a valence theory of value ("Valence and Value", Philosophy and Phenomenological Research, Vol. XCVII No. 3, Nov. 2018, doi:10.1111/phpr.12395). He sets out to do two things in the linked paper. One is to make the case that valence is a "unitary natural-psychological kind" (another way of saying it parsimoniously cuts the reality of human minds at the joints). The other is to give an account of how it is related to value, arguing that valence represents value against the position that valence is value. He calls these positions the representational and the hedonic accounts, respectively.
I agree with him on some points and disagree on others. I mostly agree with section 1 of his paper, and then proceed to disagree with parts of the rest, largely because I disagree with his representational account of valence because I think he flips the relationship between valence and value. Nonetheless, he has provided a strong jumping off point and explores many important considerations, so let's start from there before moving towards a formal model of values in terms of valence and saying how that model could be used in formally specifying what it would mean for two agents to be aligned.
The Valence-Value Connection
In the first section he offers evidence that valence and value are related. I recommend you read his arguments for yourself (the first section is only a few pages), but I'll point out several highlights:
Credit where credit is due: the Qualia Research Institute has been pushing this sort of perspective for a while. I didn't believe, though, that valence was a natural kind until I understood it as the signaling mechanism in predictive coding, but other lines of evidence may be convincing to other folks on that point, or you may still not be convinced. In my estimation, Carruthers does a much better job of presenting the evidence than either QRI or myself have done to a skeptical, academic audience, although I expect there are still many gaps to be covered which could prove to unravel the theory. Regardless, it should be clear that something is going on that relates valence to value, so even if you don't think the relationship is fundamental, it should still be valuable to learn what we can from how valence and value relate to help us become less confused about values.
How Valence and Value Interact
Carruthers takes the position that valence is representative of value (he calls this the "representational account") and argues it against the position that valence and value are the same thing (the "hedonic account"). By "representative of" he seems to mean that value exists and valence is something that partially or fully encodes value into a form that brains can work with. Here's how he describes it, in part:
At first I thought I agreed with him on the representational account because he rightly, in my view, notices that valence need not contain within it nor be built on our conceptual, ontological understanding of goodness, badness, and value. Reading closer and given his other arguments, though, it seems to me that he is saying that although valence is not representational of a conceptualization of value, he does mean it is representational of real values, whatever those be. I take this to be a wrong-way reduction: he is taking a simpler thing (valence) and reducing it into terms of a more complex thing (value).
I'm also not convinced by his arguments against the "hedonic account" since, to my reading, they often reflect a simplistic interpretation of how valence signals might function in the brain to produce behavior. This is forgivable, of course, because complex dynamic systems are hard to reason about, and if you don't have firsthand experience with them you might not fully appreciate the way simple patterns of interaction can give rise to complex behavior. That said, his arguments against identifying value with valence fail, in my mind, to make his point because they all leave open this escape route of "complex interactions that behave differently than the simple interactions they are made of", sort of like failing to realize that a solar-centric, elliptical-orbit planetary model can account for retrograde motion because it doesn't contain any parts that "move backwards", or that evolution by differential reproduction can give rise to beings that do things that do not contribute to differential reproductive success.
Yet I don't think the hedonic account, as he calls it, is quite right either, because he defines it such that there is no room between valence and value for computation to occur. Based on the evidence for a predictive-coding-like mechanism at play in the human brain (cf. academic papers on the first page of Googling "predictive coding evidence" for: 1, 2, 3, 4; against: 1), that mechanism using valence to send feedback signals, and the higher prior likelihood that values are better explained by reducing them to something simpler than vice versa, I'm inclined to explain the value-valence connection as the result of our reifying as "values" the self-experience of having brains semi-hierarchically composed of homeostatic mechanisms using valence to send feedback signals. Or with less jargon, values are the experience of computing the aggregation of valence signals. Against the representational and hedonic account, we might call this the constructive account because it suggests that value is constructed by the brain from valence signals.
My reasoning constitutes only a sketch of an argument for the constructive account. A more complete argument would need to address, at a minimum, the various cases Carruthers considers and much else besides. I might do that in the future if it proves instrumental to my ultimate goal of seeing the creation of safe, superintelligent AI, but for now I'll leave it at this sketch to move on to offering a mathematical model of the constructive account and using it to formalize what it would mean to construct aligned AI. Hereon I'll assume the constructive account, making the rest of this post conditional on that account's as yet unproven correctness.
A Formal Model of Human Values in Terms of Valence
The constructive account implies that we should be able to create a formal model of human values in terms of valence. I may not manage to create a perfect or perfectly defensible model, but my goal is to make it at least precise enough that we can squeeze out any wiggle room from it where, if this theory is wrong, it might try to hide by escaping into it. Thus we can either expose fundamental flaws in the core idea (values constructed from valence) or expose flaws in the model in order to move towards a better model that correctly captures the idea in a precise enough way that we can safely use it when reasoning about alignment of superintelligent AI.
Let's start by recalling some existing models of human values and work from there to create a model of values grounded in valence. This will mean a slight shift in terminology, from talking about values to preferences. I will here consider these terms interchangeable, but not everyone would agree. Some people insist values are not quantifiable or formally modelable. I'm going to, perhaps unfairly, completely ignore this class of objection, as I doubt many of my readers believe it. Others use "value" to mean the processes that generate preferences, or they might only consider meta-preferences to be values. This is a disagreement over definitions, so know that I am not making this kind of distinction, and instead lump everything like value, preference, affinity, taste, etc. into a single category and move freely among these terms, since I think they are all of the same kind or type: things that generate answers to questions of the form "what should one do?".
The standard model of human values is the weak preference ordering model. Given the set of all possible world states X, a person's values are defined by a weak order ⪯ over X. This model has several variations, such as replacing ⪯ with a total order ≺, removing the ability to equally value two different world states, or relaxing ⪯ to a partial order ≾, permitting incomparable world states. The benefit of using a weak order is that it's sufficient for modeling rational agents: a total order is overkill and a partial order is not enough to make expected utility theory work.
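As a minimal sketch of what this standard model asserts (with a made-up, four-element X), the weak order ⪯ can be represented by a rank function, where equal ranks encode indifference:

```python
# Minimal sketch of a weak preference ordering over a toy set of world
# states: ⪯ is represented by a rank, with equal ranks encoding indifference.

world_states = ["x1", "x2", "x3", "x4"]
rank = {"x1": 0, "x2": 1, "x3": 1, "x4": 2}  # x1 ⪯ x2 ~ x3 ⪯ x4

def weakly_preferred(a: str, b: str) -> bool:
    """True iff a ⪯ b, i.e. b is at least as good as a."""
    return rank[a] <= rank[b]

# Completeness: every pair of states is comparable in at least one direction.
assert all(weakly_preferred(a, b) or weakly_preferred(b, a)
           for a in world_states for b in world_states)

# Transitivity: a ⪯ b and b ⪯ c imply a ⪯ c.
assert all(not (weakly_preferred(a, b) and weakly_preferred(b, c))
           or weakly_preferred(a, c)
           for a in world_states for b in world_states for c in world_states)
```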
Unfortunately humans aren't rational agents, so the weak preference ordering model fails to completely describe human values. Or at least so it seems at first. One response is to throw out the idea that there is even a preference ordering, instead replacing it with a preference relation that sometimes gives a comparison between two world states, sometimes doesn't, and sometimes produces loops (an "approximate order"). Although I previously endorsed this approach, I no longer do, because most of the problems with weak ordering can be solved by Stuart Armstrong's approach of realizing that world states are all conditional on their causal history (that is, time-invariant preferences don't actually exist, we just sometimes think it looks like they do) and treating human preferences as partial (held over not necessarily disjoint subsets of X, reflecting that humans only model a subset of possible world states). This means that having a weak preference ordering may not in itself be a challenge to giving a complete description of human values, so long as what constitutes a world state and how the preferences form over them are adequately understood.
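Here's a rough sketch of how I read the partial-preference move (my gloss, not Stuart's formalism): comparisons are only defined over the subset of world states the human actually models, and are simply undefined elsewhere. The states and names below are hypothetical.

```python
# Rough sketch (my gloss, not Stuart Armstrong's formalism): partial
# preferences are defined only over the subset of world states H models.

from typing import Optional

modeled_states = {"stay_home", "go_to_party"}   # states H actually considers
rank = {"stay_home": 0, "go_to_party": 1}       # partial preference over them

def partially_preferred(a: str, b: str) -> Optional[bool]:
    """Return a ⪯ b if both states are modeled, else None (incomparable)."""
    if a in modeled_states and b in modeled_states:
        return rank[a] <= rank[b]
    return None  # outside the modeled subset the preference is undefined

print(partially_preferred("stay_home", "go_to_party"))         # True
print(partially_preferred("stay_home", "colonize_the_stars"))  # None
```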
Getting to adequate understanding is non-trivial, though. For example, even if the standard model describes how human preferences function, it doesn't explain how to learn what they are. The usual approach to finding preferences is behaviorist: observe the behavior of an agent and infer the values from there with the necessary help of some normative assumptions about human behavior. This is the approach in economic models of revealed preferences, inverse reinforcement learning, and much of how humans model other humans. Stuart Armstrong's model of partial preferences avoids making normative assumptions about behavior by making assumptions about how to define preferences, but ends up requiring solving the symbol grounding problem. I think we can do better by making the assumption that preferences are computed from valence, since valence is in theory observable, correlated with values, and requires solving problems in neuroscience rather than philosophy.
So, without further ado, here's the model.
The Model
Let H be a human embedded in the world and X the set of all possible world states. Let H(X) be the set of all world states as perceived by H and H(x) the world state x as H perceives it. Let Υ(H(x)) = {υ : H(x) → R} be the set of valence functions of the brain that generate real-valued valence when in a perceived world state. Let α : R^Υ → R be the aggregation function from the output of all υ ∈ Υ on H(x) to a single real-valued aggregate valence, which for H(x) is denoted π(x) = α ∘ Υ ∘ H(x) for π : X → R and is called the preference function. Then the weak preference ordering of H is given by π.
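To pin down what the definitions above are claiming, here's a minimal sketch of the model in code. The particular valence functions and the summing aggregator are placeholders I made up; only the shape π(x) = α ∘ Υ ∘ H(x) is the point.

```python
# Minimal sketch of π(x) = α(Υ(H(x))) with toy stand-ins for H's
# perception, the valence functions υ ∈ Υ, and the aggregator α.

from typing import Callable, Dict, List

WorldState = Dict[str, float]   # crude stand-in for x ∈ X
Perceived = Dict[str, float]    # crude stand-in for H(x)

def H(x: WorldState) -> Perceived:
    """H's perception of world state x (here: a lossy copy)."""
    return {k: v for k, v in x.items() if k != "unobserved"}

# Toy valence functions, each playing the role of a υ : H(x) → R.
valence_functions: List[Callable[[Perceived], float]] = [
    lambda p: -abs(p.get("hunger", 0.0)),               # nearer satiation is better
    lambda p: -abs(p.get("temperature", 37.0) - 37.0),  # nearer the setpoint is better
]

def alpha(valences: List[float]) -> float:
    """Aggregator α; a plain sum here, purely as a placeholder."""
    return sum(valences)

def preference(x: WorldState) -> float:
    """π(x) = α ∘ Υ ∘ H(x): aggregate valence of the perceived state."""
    perceived = H(x)
    return alpha([v(perceived) for v in valence_functions])

# The weak preference ordering of H is then given by comparing π.
x = {"hunger": 0.2, "temperature": 37.0}
y = {"hunger": 0.9, "temperature": 39.0}
print(preference(x) >= preference(y))   # True: H weakly prefers y to no more than x
```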
Some notes. The Υ functions are, as I envision them, meant to correspond to partial interaction with H(x) and to represent how individual control systems in the brain generate valence signals in response to being in a state that is part of H(x), although I aim for Υ to make sense as an abstraction even if the control system model is wrong, so long as the constructive valence account is right. There might be a better formalism than making each υ a function from all of H(x) to the reals, but given the interdependence of everything within a Hubble volume, it would likely amount to reconstituting each υ over H(x) as it is accessible from each particular control system, or from each physical interaction within each control system, even though in practice each control system ignores most of the available state information and sees the world not much differently from every other control system in a particular brain. Thus, for humans with biological brains as they exist today, a shared H(x) is probably adequate unless future neuroscience suggests greater precision is needed.
However, maybe we don't need H(X) at all and can simply use X directly, with Υ entirely accounting for the subjective aspects of calculating π.
I'm uncertain if π producing a complete ordering of X is a feature or a bug. On the standard model it would be a feature because rational choice theory expects this, but on Stuart's model it might be a bug because now we're creating more ordering than any human is computing, and more generally we should expect, lacking hypercomputation, that any embedded (finite) agent cannot actually compute a complete order on X because, even if X is finite, it's so large as to require more compute than will ever exist in the entire universe to consider each member once. But then again maybe this is fine and we can capture the partial computation of π via an additional mechanism while leaving this part of the model as is.
Regardless, we should recognize that the completeness of π in and of itself is a manifestation of a more general limitation of the model: the model doesn't reflect how humans compute value from valence because it supposes the possibility of simultaneously computing and determining the best world state in O(1) time; otherwise it would need to account for considering a subset of H(X) that might shift as the world state in which H is embedded transitions from x to x′ to x′′ and so on (that is, even if we restrict X to possible world states that are causal successors of the present state, X will change in the course of computing π). The model presented is timeless, and that might be a problem: values exist at particular times because they are features of an embedded agent, so letting them float free of time fails to completely constrain the model to reality. I'm not sure if this is a practical problem or not.
Further, this model has many of the limitations that Stuart Armstrong's value model has: it provides a slice-in-time view of values rather than persistent values, doesn't say how to get ideal or best values, and doesn't deal with questions of identity. Those might or might not be real limitations: maybe there are no persistent values and the notion of a persistent value is a post hoc reification in human ontology; maybe ideal or best values don't exist or are at least uncomputable; and maybe identity is also a post hoc reification that isn't, in a certain sense, real. Clearly, I and others need to think about this more.
Despite all these limitations, I'm excited about this model because it provides a starting point for building a more complete model that addresses these limitations while capturing the important core idea of a constructive valence account of values.
Formally Stating Alignment
Using this model, I can return to my old question of how to formally specify the alignment problem. Rather than speaking in terms of phenomenological constructs, as I did in my previous attempt, I can simply talk in terms of valence and preference ordering.
Consider two agents, a human H and an AI A, in a world with possible states X. Let π and U be the preference function of H and the utility function of A, respectively. Then H is aligned with A if
∀x, y ∈ X, π(x) ≥ π(y) ⟺ U(x) ≥ U(y)
In light of my previous work, I believe this is sufficient: even though it does not explicitly mention how H and A model each other, that is not necessary, because it is already captured by the subjective nature of π and U, i.e. H's and A's ontologies are already computed within π and U, so we don't need to make them explicit at this level of the model.
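As a sanity check on what the condition asserts, it can be written directly as a brute-force test over a finite toy X; the π and U below are arbitrary placeholders, with U a monotone transformation of π in the aligned case.

```python
# Brute-force check over a finite toy X of the condition
# ∀x, y ∈ X, π(x) ≥ π(y) ⟺ U(x) ≥ U(y).

def aligned(pi, U, X) -> bool:
    """True iff π and U induce the same weak preference ordering on X."""
    return all((pi(x) >= pi(y)) == (U(x) >= U(y)) for x in X for y in X)

X = range(10)
pi = lambda x: x ** 2
print(aligned(pi, lambda x: 3 * x ** 2 + 1, X))  # True: monotone transform of π
print(aligned(pi, lambda x: -x, X))              # False: reversed ordering
```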
Context
I very much think of this as a work in progress that I'm publishing in order to receive feedback. Although it's the best of my current thinking given the time and energy I have devoted to it, my thinking is often made better by collaboration with others, and I think the best way to make that happen is by doing my best to explain my ideas so others can interact with them. I hope to eventually evolve these ideas into something I'm sure enough of that I would want to see it published in a journal, but before then I would want to be much more sure it describes reality in a way useful for understanding human values as needed for building aligned AGI.
As briefly mentioned earlier, I also think of this work as conditional on future neuroscience proving correct the constructive valence account of values, and I would be happy to get it to a point where I was more certain of it being conditionally correct even if I can't be sure it is correct because of that conditionality. Another way to put this is that I'm taking a bet with this work that the constructive account will be proven correct. Thus I'm most interested in comments that poke at this model conditional on the constructive account being correct, medium interested in comments that poke at the constructive account, and least interested in comments that poke at the fact that I'm taking this bet or that I think specifying human values is important for alignment (we've previously discussed that last topic elsewhere).