Background: “Learning” vs “Learning About”

Adaptive systems, reinforcement “learners”, etc., “learn” in the sense that their behavior adapts to their environment.

Bayesian reasoners, human scientists, etc., “learn” in the sense that they have some symbolic representation of the environment, and they update those symbols over time to (hopefully) better match the environment (i.e. make the map better match the territory).

These two kinds of “learning” are not synonymous[1]. Adaptive systems “learn” things, but they don’t necessarily “learn about” things; they don’t necessarily have an internal map of the external territory. (Yes, the active inference folks will bullshit about how any adaptive system must have a map of the territory, but their math does not substantively support that interpretation.) The internal heuristics or behaviors “learned” by an adaptive system are not necessarily “about” any particular external thing, and don’t necessarily represent any particular external thing[2].

We Humans Learn About Our Values

“I thought I wanted X, but then I tried it and it was pretty meh.”

“For a long time I pursued Y, but now I think that was more a social script than my own values.”

“As a teenager, I endorsed the view that Z is the highest objective of human existence. … Yeah, it’s a bit embarrassing in hindsight.”

The ubiquity of these sorts of sentiments is the simplest evidence that we do not typically know our own values[3]. Rather, people often (but not always) have some explicit best guess at their own values, and that guess updates over time - i.e. we can learn about our own values.

Note the wording here: we’re not just saying that human values are “learned” in the more general sense of reinforcement learning. We’re saying that we humans have some internal representation of our own values, a “map” of our values, and we update that map in response to evidence. Look again at the examples at the beginning of this section:

  • “I thought I wanted X, but then I tried it and it was pretty meh.”
  • “For a long time I pursued Y, but now I think that was more a social script than my own values.”
  • “As a teenager, I endorsed the view that Z is the highest objective of human existence. … Yeah, it’s a bit embarrassing in hindsight.”

Notice that the wording of each example involves beliefs about values. They’re not just saying “I used to feel urge X, but now I feel urge Y”. They’re saying “I thought I wanted X” - a belief about a value! Or “now I think that was more a social script than my own values” - again, a belief about my own values, and how those values relate to my (previous) behavior. Or “I endorsed the view that Z is the highest objective” - an explicit endorsement of a belief about values. That’s how we normally, instinctively reason about our own values. And sure, we could reword everything to avoid talking about our beliefs about values - “learning” is more general than “learning about” - but the fact that it makes sense to us to talk about our beliefs about values is strong evidence that something in our heads in fact works like beliefs about values, not just reinforcement-style “learning”.

Two Puzzles

Puzzle 1: Learning About Our Own Values vs The Is-Ought Gap

Very roughly speaking, an agent could aim to pursue any values regardless of what the world outside it looks like; “how the external world is” does not tell us “how the external world should be”. So when we “learn about” values, where does the evidence about values come from? How do we cross the is-ought gap?

Puzzle 2: The Role of Reward/Reinforcement

It does seem like humans have some kind of physiological “reward”, in a hand-wavy reinforcement-learning-esque sense, which seems to at least partially drive the subjective valuation of things. (Something something mesocorticolimbic circuit.) For instance, Steve Byrnes claims that reward is a signal from which human values are learned/reinforced within-lifetime[4]. But clearly the reward signal is not itself our values. So what’s the role of the reward, exactly? What kind of reinforcement does it provide, and how does it relate to our beliefs about our own values?

Using Each Puzzle To Solve The Other

Put these two together, and a natural guess is: reward is the evidence from which we learn about our values. Reward is used just like ordinary epistemic evidence - so when we receive rewards, we update our beliefs about our own values just like we update our beliefs about ordinary facts in response to any other evidence. Indeed, our beliefs-about-values can be integrated into the same system as all our other beliefs, allowing for e.g. ordinary factual evidence to become relevant to beliefs about values in some cases.
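To make that concrete, here’s one toy way to cash it out (a minimal sketch of the idea, not a claim about the brain’s actual algorithm; all names and numbers below are made up for illustration): treat “how much I value X” as a latent quantity, treat each reward signal as a noisy observation of that quantity, and run an ordinary Bayesian update.

```python
# Toy model (illustration only): "how much I value X" is a latent scalar v,
# and each reward signal r is treated as a noisy observation of v,
# i.e. r ~ Normal(v, obs_var). Beliefs about v are a Gaussian, updated by
# ordinary Bayesian conditioning, the same machinery used for any other evidence.

class ValueBelief:
    def __init__(self, prior_mean=0.0, prior_var=4.0, obs_var=1.0):
        self.mean = prior_mean   # current best guess at how much I value X
        self.var = prior_var     # uncertainty about that guess
        self.obs_var = obs_var   # assumed noise in the reward channel

    def update_on_reward(self, reward):
        # Standard conjugate Normal-Normal update.
        post_precision = 1.0 / self.var + 1.0 / self.obs_var
        self.mean = (self.mean / self.var + reward / self.obs_var) / post_precision
        self.var = 1.0 / post_precision

belief = ValueBelief()
for r in [2.1, 1.8, 2.4]:   # a few enjoyable experiences with X
    belief.update_on_reward(r)
print(f"estimated value of X: {belief.mean:.2f} ± {belief.var ** 0.5:.2f}")
```

Nothing in that update step cares that the observation happens to be a reward rather than any other kind of evidence, which is the sense in which beliefs-about-values can live in the same belief system as everything else.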

What This Looks Like In Practice

I eat some escamoles for the first time, and my tongue/mouth/nose send me enjoyable reward signals. I update to believe that escamoles are good - i.e. I assign them high value.

Now someone comes along and tells me that escamoles are ant larvae. An unwelcome image comes into my head, of me eating squirming larvae. Ew, gross! That’s a negative reward signal - I downgrade my estimated value of escamoles.

Now another person comes along and tells me that escamoles are very healthy. I have already cached that “health” is valuable to me. That cache hit doesn’t generate a direct reward signal, but it does update my beliefs about the value of escamoles, to the extent that I believe this person. The update just routes through ordinary epistemic machinery, with no reward signal involved at all.

At some point I sit down and think about escamoles. Yeah, ants are kinda gross, but on reflection I don’t think I endorse that reaction to escamoles. I can see why my reward system would generate an “ew, gross” signal, but I model that reward as coming from either of two decoupled causes: a hardcoded aversion to insects, or my actual values. I know that I am automatically averse to putting insects in my mouth, so the negative reward is weaker evidence about my values in this case; the signal is explained away, in the usual epistemic sense, by some cause other than my values. So I partially undo the value-downgrade I had assigned to escamoles in response to the “ew, gross” reaction. I might still feel some disgust, but I consciously override that disgust to some extent.
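To spell out that explaining-away step, here’s a toy Bayes net with made-up numbers (the structure is the point, not the numbers): the “ew, gross” signal has two possible causes, a hardwired insect aversion and my actual values, and learning that the hardwired cause fired weakens what the signal tells me about my values.

```python
import itertools

# Toy Bayes net for the escamoles case; all priors and probabilities are invented.
#   A = hardwired insect-aversion circuit fires      (prior 0.7)
#   V = my values genuinely disvalue escamoles       (prior 0.3)
#   D = "ew, gross" negative-reward signal, a noisy-OR of A and V
P_A, P_V = 0.7, 0.3

def p_disgust(a, v):
    # Noisy-OR: small leak probability, and each active cause almost always triggers disgust.
    fail = 0.95            # P(no disgust | no active cause) = 1 - leak
    if a:
        fail *= 0.1
    if v:
        fail *= 0.1
    return 1 - fail

def posterior_V(aversion_known=None):
    # P(V = 1 | D = 1, and optionally A = aversion_known), by brute-force enumeration.
    num = den = 0.0
    for a, v in itertools.product([0, 1], repeat=2):
        if aversion_known is not None and a != aversion_known:
            continue
        p = (P_A if a else 1 - P_A) * (P_V if v else 1 - P_V) * p_disgust(a, v)
        den += p
        if v:
            num += p
    return num / den

print(f"P(V)                     = {P_V:.3f}")                            # 0.300 (prior)
print(f"P(V | disgust)           = {posterior_V():.3f}")                  # ~0.389: disgust is evidence for V
print(f"P(V | disgust, aversion) = {posterior_V(aversion_known=1):.3f}")  # ~0.319: mostly explained away
```

Conditioning on the hardwired aversion pushes the posterior on “my values disvalue escamoles” back toward its prior, which is exactly the “partially undo the value-downgrade” move.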

That last example is particularly interesting, since it highlights a nontrivial prediction of this model. Insofar as reward is treated as evidence about values, and our beliefs about values update in the ordinary epistemic manner, we should expect all the typical phenomena of epistemic updating to carry over to learning about our values. Explaining-away is one such phenomenon. What do other standard epistemic phenomena look like, when carried over to learning about values using reward as evidence?

[Image: escamoles (ant larvae), via Eat Your World]
  1. ^
  2. ^

    Of course sometimes a Bayesian reasoner’s beliefs are not “about” any particular external thing either, because the reasoner is so thoroughly confused that it has beliefs about things which don’t exist - like e.g. beliefs about the current location of my dog. I don’t have a dog. But unlike the adaptive system case, for a Bayesian reasoner such confusion is generally considered a failure of some sort.

  3. ^

    Note that, in treating these sentiments as evidence that we don’t know our own values, we’re using stated values as a proxy measure for values. When we talk about a human’s “values”, we are notably not talking about:

    • The human’s stated preferences
    • The human’s revealed preferences
    • The human’s in-the-moment experience of bliss or dopamine or whatever
    • <whatever other readily-measurable notion of “values” springs to mind>

    The thing we’re talking about, when we talk about a human’s “values”, is a thing internal to the human’s mind. It’s a high-level cognitive structure. Things like stated preferences, revealed preferences, etc are useful proxies for human values, but those proxies importantly break down in various cases - even in day-to-day life.

  4. ^

    Steve uses the phrase “model-based RL” here, though that’s a pretty vague term and I’m not sure how well the usual usage fits Steve’s picture.

Comments (7)

I discussed something similar in the "Human brains don't seem to neatly factorize" section of the Obliqueness post. I think this implies that, even assuming the Orthogonality Thesis, humans don't have values that are orthogonal to human intelligence (they'd need to not respond to learning/reflection to be orthogonal in this fashion), so there's not a straightforward way to align ASI with human values by plugging human values into more intelligence.

The post is mostly arguing that desires can shift around as we learn and think, which I agree with. But a couple parts of the post (including the title and some of the subheadings) seem to suggest something more than that: it’s not just desire-shifts, but desire convergence, towards a thing called “our values”.

(In other words, if I say that “I’m learning about blah”, then I’m strongly implying that there’s a fact-of-the-matter about blah, and that my beliefs about blah are approximately-monotonically converging towards that fact-of-the-matter. Right?)

Do you think there’s a thing (“human values”) to which desires gradually converge, via the kinds of processes described in this post? (I don’t, see §2.7 here.)

This post definitely resolved some confusions for me. There are still a whole lot of philosophical issues, but it's very nice to have a clearer model of what's going on with the initial naïve conception of value.

Note that, in treating these sentiments as evidence that we don’t know our own values, we’re using stated values as a proxy measure for values. When we talk about a human’s “values”, we are notably not talking about:

  • The human’s stated preferences
  • The human’s revealed preferences
  • The human’s in-the-moment experience of bliss or dopamine or whatever
  • <whatever other readily-measurable notion of “values” springs to mind>

The thing we’re talking about, when we talk about a human’s “values”, is a thing internal to the human’s mind. It’s a high-level cognitive structure.
(...)
But clearly the reward signal is not itself our values.
(...)
reward is the evidence from which we learn about our values.


So we humans have a high-level cognitive structure to which we do not have direct access (values), but about which we can learn by observing and reflecting on the stimulus-reward mappings we experience, thus constructing an internal representation of such structure. This reward-based updating bridges the is-ought gap, since reward is a thing we experience and our values encode the way things ought to be.

Two questions:

  • How accurate is the summary I have presented above?
  • Where do values, as opposed to beliefs-about-values, come from?
     

How accurate is the summary I have presented above?

Basically accurate.

Where do values, as opposed to beliefs-about-values, come from?

That is the right next question to ask. Humans have a map of their values, and can update that map in response to rewards in order to "learn about values", but that still leaves the question of when/whether there are any "real values" which the map represents, and what kind-of-things those "real values" are.

A few parts of an answer:

  • "human values" are not one monolithic thing; we value lots of different stuff, and different parts of our value-estimates can separately represent "a real thing" or fail to represent "a real thing".
  • we don't yet understand what it means for part of our value-estimates to represent "a real thing", but it probably works pretty similarly to epistemic representation more generally - e.g. my belief about the position of the dog in my apartment represents a real thing (even if the position itself is wrong) exactly when there is in fact a dog in my apartment at all.

Thank you for the answer. I notice I feel somewhat confused, and that I regard the notion of "real values" with some suspicion I can't quite put my finger on. Regardless, an attempted definition follows.

Let a subject observation set be a complete specification of a subject and its past and current environment, from the subject's own subjectively accessible perspective. The elements of a subject observation set are observations/experiences observed/experienced by its subject.

Let O be the set of all subject observation sets.

Let a subject observation set class be a subset of O such that all its elements specify subjects that belong to an intuitive "kind of subject": e.g. humans, cats, parasitoid wasps.

Let V be the set of all (subject_observation_set, subject_reward_value) tuples. Note that all possible utility functions of all possible subjects can be defined as subsets of V, and that V = O × ℝ.

Let "real human values"  be the subset of V such that all subject_observation_set elements belong to the human subject observation set class.[1]
 

... the above definition feels pretty underwhelming, and I suspect that I would endorse only a pretty small subset of "real human values", as defined above, as actually good.

  1. ^

Let the reader feel free to take the political decision of restricting the subject observation set class that defines "real human values" to sane humans.

Steve uses the phrase “model-based RL” here, though that’s a pretty vague term and I’m not sure how well the usual usage fits Steve’s picture.

Here’s an elaboration of what I mean:

The human brain has a model-based RL system that it uses for within-lifetime learning. I guess that previous sentence is somewhat controversial, but it really shouldn’t be:

  • The brain has a model—If I go to the toy store, I expect to be able to buy a ball.
  • The model is updated by self-supervised learning (i.e., predicting imminent sensory inputs and editing the model in response to prediction errors)—if I expect the ball to bounce, and then I see the ball hit the ground without bouncing, then next time I see that ball heading towards the ground, I won’t expect it to bounce.
  • The model informs decision-making—If I want a bouncy ball, I won’t buy that ball, instead I’ll buy a different ball.
  • There’s reinforcement learning—If I drop the ball on my foot just to see what will happen, and it really hurts, then I probably won’t do it again, and relatedly I will think of doing so as a bad idea.

…And that’s all I mean by “the brain has a model-based RL system”.
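For readers who prefer code, the four bullets above amount to roughly the following loop (a deliberately tiny sketch of the general shape, not any specific algorithm and certainly not a model of the brain; all names and numbers are invented for illustration):

```python
from collections import defaultdict

# Tiny sketch of a model-based RL loop in the loose sense described above:
# a learned model of "what happens if I do X", edited when predictions fail,
# plus reward-driven value estimates over outcomes, with decisions made by
# consulting the model's predictions.

class TinyModelBasedAgent:
    def __init__(self, actions, lr=0.3):
        self.actions = actions
        self.lr = lr
        self.model = {}                   # (state, action) -> predicted outcome (the "model")
        self.value = defaultdict(float)   # outcome -> learned value estimate (the "RL" part)

    def act(self, state):
        # The model informs decision-making: pick the action whose *predicted*
        # outcome currently looks most valuable.
        return max(self.actions, key=lambda a: self.value[self.model.get((state, a))])

    def observe(self, state, action, outcome, reward):
        # Self-supervised learning: the prediction was wrong (or absent),
        # so edit the model to match what actually happened.
        self.model[(state, action)] = outcome
        # Reinforcement learning: update how good/bad this kind of outcome
        # seems, using the reward signal.
        self.value[outcome] += self.lr * (reward - self.value[outcome])

agent = TinyModelBasedAgent(actions=["bounce_ball", "drop_on_foot"])
agent.observe("holding_ball", "drop_on_foot", outcome="ball_hits_foot", reward=-10.0)
agent.observe("holding_ball", "bounce_ball", outcome="ball_bounces", reward=+2.0)
print(agent.act("holding_ball"))   # -> "bounce_ball": its predicted outcome has higher learned value
```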

I emphatically do not mean that, if you just read a “model-based RL” paper on arxiv last week, then I think the brain works exactly like that paper you just read. On the contrary, “model-based RL” is a big tent comprising many different algorithms, once you get into details. And indeed, I don’t think “model-based RL as implemented in the brain” is exactly the same as any model-based RL algorithm on arxiv.