TsviBT

Intro written a day later:

Spiracular and I discuss the nature and origin of values. The dialogue doesn't develop much of a single clear thread, but I had fun writing it and hope it has some interesting trailheads for others.

(You may wish to skip the methodological exchange at the beginning; start instead at "Throwing some stuff out seems good!".)


TsviBT

Hello.

Spiracular

Hi! So, "Where do values come from?" and some of the animal behavior/human origins Qs probably bleed into that, right?

TsviBT

I have a desire to do something like "import stances.radical_philosophical_confusion", but I imagine this might be boring and not a short word in our shared language. 

Spiracular

Is there a post? Or in the absence of that: What's the ~5-sentence version?

TsviBT

There's a post here: https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html

But like many of my posts, it's poorly written, in that it's the minimal post that gets the ideas down at all. 

TsviBT

A short version is: A lot of my interest here is in reprogramming a bunch of related ideas that we bring with us when thinking / talking about values. I want to reprogram them (in my head, I mean) so that I can think about alignment. 

Spiracular

Glazed over instantly on "hermeneutic net" in the first sentence, yeah. On trying to crunch on it anyway... things like bible interpretation personal-wikis? Or am I completely veering off on the wrong track?

TsviBT

The "hermeneutic" there is just saying, like, I want to bounce around back and forth between all the different concrete examples and concrete explicit concepts, and also the criteria that are being exerted on our concepts and our understanding of examples, where the criteria come from the big picture.

TsviBT

(I guess I'm uncomfortable talking at this meta level, largely because I imagine you as not being interested, even though you didn't say so.)

Spiracular

Okay, so "crosslinking" yes, de-emphasize the bible part, buff up the "dialogue" / differences-in-interpretation part, keep the "hierarchy feeding into questions" thing probably...

Spiracular

Uh... I'm feeling pretty okay, but I recognize I'm trying to do some kind of "see if I can short-circuit to an impression of a complex thing you're gesturing at" that might not work/might be really fundamentally off, and doing some weird social things to do it.

Spiracular

We can back out of it.

TsviBT

I think the shortcutting is reasonable. Like, I don't actually think the thing I want to gesture at is all that complicated or uncommonly-understood, I just want to be able to explicitly invoke it. Anyway. 

TsviBT

So. 

Spiracular

Seems like we should probably try looping back to values. Do you want to package the point you were going to try to use this to build, or should I just throw some things out? (which might risk derailing or losing it, or maybe it'll loop back, who knows!)

TsviBT

Throwing some stuff out seems good!

Spiracular

Alright! So, kinda off the top of my head...

  • Logical consistency in values seems really important to a lot of people, but there's also some kind of stop or sanity/intuition-check that most people seem to use (Scott gestured at this in some post; something about seagulls poking your eyes out when you concede points to philosophers). I wonder why that activates when it does?
  • A lot of values probably bottom out in some kind of evolutionarily-favored metric (or proxy metric! sometimes the proxy metric is carrying the hedons, ex: sex vs procreation), at least as an original starting point.
  • Vague questions about valuing things at the conceptual-generalization "top" of that stack, vs the just-the-close-hedon-tracker things at the "bottom"? Or convergent properties of the world-modeling/truth-finding segment, which is a weird way to derive values when I think about it. Or the radical stance (that almost nobody seriously takes) of even going a step down, and dropping the "proxy" when the evo thing landed on a proxy.

TsviBT

(I notice that I want to say abstract stuff instead of starting with examples, which is sad but I'll do so anyway and so this stuff can be glazed over until we get back to it with concrete examples... [Edit afterward: for some examples related to values, see https://tsvibt.blogspot.com/2023/08/human-wanting.html, https://tsvibt.blogspot.com/2022/11/do-humans-derive-values-from-fictitious.html#2-built-in-behavior-determiners, https://tsvibt.blogspot.com/2022/10/counting-down-vs-counting-up-coherence.html, https://tsvibt.blogspot.com/2022/08/control.html])

So, why ask about where values come from? Really I want to know the shape of values as they sit in a mind. I want to know that because I want to make a mind that has weird-shaped values. Namely, Corrigibility. Or rather, some form of [possible solution to corrigibility as described here: https://arbital.com/p/hard_corrigibility/ (more reliable: https://archive.ph/dJDqR )].

Some maybe-words for those ideas: 

  • Anapartistic reasoning. "I am not a self-contained agent. I am a part of an agent. My values are distributed across my whole self, which includes the human thing."
  • Tragic agency. "My reasoning/values are flawed. I'm goodharting, even when I think I'm applying my ultimate criterion. The optimization pressure that I'm exerting is pointed at the wrong thing. This extends to the meta-level: When I think I'm correcting my reasoning/values, the criterion I use to judge the corrections is also flawed."
  • Loyal agency. "I am an extension / delegate of another agent. Everything I do, I interpret as an attempt by another agency (humaneness) to do something which I don't understand."
  • Radical deference. "I defer to the humane process of unfolding values. I defer to the humane process's urges to edit me, at any level of abstraction. I trust that process of judgement above my own, like how those regular normal agents trust their [future selves, if derived by "the process that makes me who I am"] above their current selves."

These all involve values in a deep way, but I don't know how to make these high-level intuitions make more precise sense.

Spiracular

Cool. I think this merges okay with the sanity-check vs logical consistency thing I expressed interest in; let's go with your more-developed vocabulary/articulation.

TsviBT

A lot of values probably bottom out in some kind of evolutionarily-favored metric (or proxy metric! sometimes the proxy metric is carrying the hedons, ex: sex vs procreation), at least as an original starting point.

As a thread we could pull on: this "as an original starting point" to me hints at a key question here. We have these starting points, but then we go somewhere else? How do we do that? 

One proposal is: we interpret ourselves (our past behavior, the contents of our minds) as being a flawed attempt by a more powerful agent to do something, and then we adopt that something as our goal. https://tsvibt.blogspot.com/2022/11/do-humans-derive-values-from-fictitious.html

In general, there's this thing where we don't start with explicit values, we create them. (Sometimes we do something different, which is best described as discovering values--e.g. discovering a desire repressed since childhood. Sometimes discovery and creation are ambiguous. But I think we sometimes do something that can only very tenuously be described as discovery, and is instead a free creation.)

This creation hints at some other kind of value, a "metavalue" or "process value". These metavalues feel closer in kind to [the sort of value that shows up in these ideas about Corrigibility]. So they are interesting. 

Spiracular

I see Anapartistic going wrong unless it has a value of "noninterference" or "(human/checker) agent at early (non-compromised) time-step X's endorsement" or something.

I guess humans manage to have multiple sub-systems that interlace and don't usually override each other's ability to function, though? (except maybe... in the case of drugs really interfering with another value's ability to make solid action-bids or similar, or in cases where inhibition systems are flipped off)


"Anapartistic" might be closer to how it's implemented on a subsystem in humans (very low confidence), but "Tragic Agency" feels more like how people reason through checking their moral reasoning explicitly.

Trying to... build on their moral system, but not drift too far? Often via "sanity-checking" by periodically stopping and running examples to see whether it gives wild/painful/inadvisable or irreversible/radical policy suggestions, and trying to diagnose what moral-reasoning step was upstream of these?

Spiracular

"We have these starting points, but then we go somewhere else? How do we do that?"

I think I just gave a half-answer, but let me break it down a bit more: One process looks like "building on" pre-established known base-level values, moral inferences, and examples; reasoning over that set (aggregating commonalities, or taking the next logical step) suggests new inferences. Then it checks how those alter the policy-recommendation outputs of the whole, and... (oh!) flags them for further checking if they cause a massive policy-alteration from the previous time-step?
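
A minimal toy sketch of that loop, just to pin down what I mean (the string representations, the keyword matching, and the divergence threshold are all placeholders I'm inventing, not a real proposal):

```python
# Toy sketch: extend a value set by one inferred value, then check how much
# the overall policy recommendations shift. Everything here (the string
# representations, the keyword matching, the 0.3 threshold) is a placeholder.

def propose_inference(values):
    """Stand-in for 'reasoning over existing values suggests a new one'."""
    return "avoid harming anything that can suffer"

def keyword_applies(value, situation):
    # Crude stand-in for "this value bears on this situation".
    return any(word in situation for word in value.split())

def recommend_policies(values, situations):
    """Stand-in for turning a value set into concrete recommendations."""
    return {s: [v for v in values if keyword_applies(v, s)] for s in situations}

def divergence(old_recs, new_recs):
    """Fraction of situations whose recommendations changed."""
    changed = sum(old_recs[s] != new_recs[s] for s in old_recs)
    return changed / max(len(old_recs), 1)

def extend_values(values, situations, threshold=0.3):
    """Propose a new value; adopt it only if it doesn't wildly shift policy."""
    candidate = propose_inference(values)
    old = recommend_policies(values, situations)
    new = recommend_policies(values + [candidate], situations)
    if divergence(old, new) > threshold:
        return values, ("flagged for further checking", candidate)
    return values + [candidate], None

values = ["avoid harming people", "keep promises"]
situations = ["harming a stranger", "keeping promises to a friend", "an animal that can suffer"]
print(extend_values(values, situations))
```

The only load-bearing part is the last step: a large jump in the downstream policy recommendations is itself what triggers extra scrutiny.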

Spiracular

we interpret ourselves (our past behavior, the contents of our minds) as being a flawed attempt by a more powerful agent to do something, and then we adopt that something as our goal.

Okay, this is the Loyal Agency example. I guess this is piggybacking on the competence of the human empathy system, right?

(I have no idea how you'd implement that on a non-evolved substrate, but I guess in humans, it's downstream of a progression of evolutionary pressures towards (chronologically first to last) "modeling predators/prey" -> "modeling conspecifics" -> "modeling allies" -> "abstractly aligning under shared goals"?)

TsviBT

Vague questions about valuing things at the conceptual-generalization "top" of that stack, vs the just-the-close-hedon-tracker things at the "bottom"?

To emphasize the point about value-creation / value-choice: There is no top. Or to say it another way: The top is there only implicitly. It's pointed at, or determined, or desired, by [whatever metavalues / process we use to direct our weaving of coherent values]. 

As you're discussing, a lot of this is not really values-like, in that it's not a free parameter. We can notice a logical inconsistency. For example, we might say: "It is bad to kill babies because they are conscious" and "It is okay to abort fetuses because they are not conscious" and notice that these don't really make sense together (though the concluded values are correct). Then we are guided / constrained by logic: either a 33-week-old fetus is conscious, or not, and so we have to have all our multiple values mesh with one of those worlds, or have all our multiple values mesh with the other of those worlds. 

Spiracular

In general, there's this thing where we don't start with explicit values, we create them. (Sometimes we do something different, which is best described as discovering values--e.g. discovering a desire repressed since childhood. Sometimes discovery and creation are ambiguous. But I think we sometimes do something that can only very tenuously be described as discovery, and is instead a free creation.)

This feels "off" to me, and isn't quite landing.

Like... you start as an infant who does things. And at some level of sophistication, you start chunking some of your self-model of the things that consistently drive parts of your behavior, under the concept of a "value"?

I have the sense that it usually doesn't work to just try to... upload a new value as a free creation, unless it is tied to a pre-existing pattern... hm. No. Okay, people can update their sense-of-self, and then will do wild things to align their actions with that sense-of-self, sometimes. But I think I think of that as subordinated under the value of "self-consistency" and the avoidance of intense cognitive-dissonance, and I maybe assume those values tend to be more loosely held "shallower" in implementation, somehow. (not at all confident this is how it really works, though)

Less confident that it's entirely "off" after playing around with it for a bit.

TsviBT

I'm feeling a bit ungrounded... Maybe I want to bring in some examples. 

  • Making a friend. At first, we recognize some sort of overlap. As things go on, we are in some ways "following our nose" about what we can be together. There's no preemptive explicit idea of what it will be like to be together. It's not like there was a secret idea, that gets revealed. Though... Maybe it's like, all that stuff is an optimization process that's just working out the instrumental details. But, IDK, that doesn't seem right. 
  • Making art. Hm... I kind of want to just say "look! it's about creation", and wave my hands around. 
  • Similar to making art: Cooking. This is very mixed in and ambiguous with "actually, you were computing out instrumental strategies for some preexisting fixed goal", and also with "you were goodharting, like a drug addict". But when a skilled cook makes a surprising combination of two known-good flavors... It doesn't seem like goodharting because .... Hm. Actually I can totally imagine some flavor combinations that I'd consider goodharting. 
  • Maybe I want to say that pure curiosity and pure play are the quintessential examples. You're trying to create something new under the sun. We could say: This isn't a creation of values, it's a creation of understanding. There's a fixed value which is: I want to create understanding. But this is using a more restricted idea of "value" than what we normally mean. If someone likes playing chess, we'd normally call that a value. If we want to say "this is merely the output of the metavalue of creating understanding" or "this is merely an instrumental strategy, creating a toy model in which to create understanding", then what would we accept as a real value? [This talk of "what would we accept" is totally a red alert that I've wandered off into bad undergrad philosophy, but I assert that there's something here anyway even if I haven't gotten it clearly.]

Spiracular

Cooking seems like a great clarifying example of the... Loyal Agency?... thing, actually. 

You had some conception of what you were going to make, and knew you'd botch it in some way, and also your interpretation of it is modified by the "environment" of the ingredients you have available (and your own inadequacies as a cook, and probably also cases of "inspiration strikes" that happen as you make it).

But unless you are leaning very far into cooking-as-art (everything-but-the-kitchen-sink stir-fry is known for this philosophy), you probably did have some fuzzy, flawed concept at the start of what you were grasping towards.

(I hear there's something of a Baking to Stir-fry Lawful/Chaotic axis, in cooking)

TsviBT

Okay, people can update their sense-of-self, and then will do wild things to align their actions with that sense-of-self, sometimes.

Like someone born in the Ingroup, who then learns some information that the Outgroup tends to say and the Ingroup tends to not say, and starts empathizing with the Outgroup and seeks out more such information, and death spirals into being an Outgroupist. 

Something catches my eye here. On the one hand, we want to "bottom out". We want to get to the real core values. On the other hand:

  1. Some of our "merely subordinate, merely subgoal, merely instrumental, merely object-level, merely product-of-process, merely starting-place-reflex" values are themselves meta-values. 
  2. I don't know what sort of thing the real values are. (Appeals to the VNM utility function have something important to say, but aren't the answer here I think.)
  3. There may not be such a thing as bottom-out values in this sense.
  4. We're created already in motion. And what we call upon when we try to judge, to reevaluate "object / subordinate" values, is, maybe, constituted by just more "object" values. What's created in motion is the "reflexes" that choose the tweaks that we make in the space described by all those free parameters we call values. 

Spiracular

If someone likes playing chess, we'd normally call that a value. 

Ooh! I want to draw a distinction between... here are 2 types of people playing chess:

  • Alice, who is an avid and obsessive player of chess, just chess (and might be in some kind of competitive league about it, with a substantial Elo rating, if I complete the trope)
  • Bob, who spends 5-10% of his time on one of: sudoku, chess, tetris

...I would characterize these two as having very different underlying values driving their positive-value assignment to chess?

Like, assuming this is a large investment on both of their parts, I would infer: Alice plays chess because she highly values {excellence, perfectionism, competition} while Bob likely values {puzzles, casual games as leisure, maybe math}.

And this strongly affects what I'd consider a "viable alternative to chess" for each of them; maybe Alice could swap out the time she spends on chess for competitive tennis, but Bob would find that totally unsatisfying for the motives underlying his drive to play the game.

TsviBT

you probably did have some fuzzy, flawed concept at the start of what you were grasping towards.

Nora Ammann gives the example of Claire, who gets into jazz. My paraphrase: At first Claire has some not very deep reason to listen to jazz. Maybe a friend is into it, or she thinks she ought to explore culture, or something. At first she doesn't enjoy it that much; it's hard to understand, hard to listen along with; but there are some sparks of rich playfulness that draw her in. She listens to it more, gains some more fluency in the idioms and assumptions, and starts directing her listening according to newfound taste. After a while, she's now in a state where she really values jazz, deeply and for its own sake; it gives her glimpses of fun-funny ways of thinking, it lifts her spirits, it shares melancholy with her. Those seem like genuine values, and it seems not right to say that those values were "there at the beginning" any more than as a pointer.... ...Okay but I second guess myself more; a lot of this could be described as hidden yearnings that found a way out? 

TsviBT

We're calling a stop; thanks for engaging!

Comments

"interpret our previous actions as a being attempting to do something, and then taking that something as our goal"

Wow, that's super not how I experience my value/goal setting. I mostly think of my previous self (especially >10 years ago) as highly misguided due to lacking key information that I now have. A lot of this is information I couldn't have expected that previous me to have, so I don't blame 'previous me' for this per se. I certainly don't try to align myself to an extrapolation of previous me's goals though!

Whereas my 'values' that are the underlying drivers behind my goals seem fairly constant throughout my life, and the main thing changing is my knowledge of the world. So my goals change, because my life situation and understanding of the workings of the world change. But values changing? Subtly, slowly, perhaps. But mostly that just seems dangerous and bad to current me. Current me endorses current me's values! Or they wouldn't be my values!

the underlying drivers behind my goals seem fairly constant throughout my life

What are these specifically, and what type of thing are they? Were they there when you were born? Were they there "implicitly" but not "explicitly"? In what sense were they always there (since whenever you claim they started being there)?

Surely your instrumental goals change, and this is fine and is a result of learning, as you say. So when something changes, you say: Ah, this wasn't my values, this was instrumental goals. But how can you tell that there's something fixed that underlies or overarches all the changing stuff? What is it made of?

These are indeed the important questions!

My answers from introspection would say things like, "All my values are implicit, explicit labels are just me attempting to name a feeling. The ground truth is the feeling."

"Some have been with me for as long as I can remember, others seem to have developed over time, some changed over time."

My answers from neuroscience would be shaped like, "Well, we have these basic drives from our hypothalamus, brainstem, basal ganglia... and then our cortex tries to understand and predict these drives, and drives can change over time (especially with puberty, for instance). If we were to break down where a value comes from, it would have to be from some combination of these basic drives, cortical tendencies (e.g. vulnerability to optical illusions), and learned behavior."

"Genetics are responsible for a fetus developing a brain in the first place, and set a lot of parameters in our neural networks that can last a lifetime. Obviously, genetics has a large role to play in what values we start with and what values we develop over time."

My answers from reasoning about it abstractly would be something like, "If I could poll a lot of people at a lot of different ages, and analyze their introspective reports and their environmental circumstances and their life histories, then I could do analysis on what things change and what things stay the same."

"We can get clues about the difference between a value and an instrumental goal by telling people to consider a hypothetical scenario in which a fact X was true that isn't true in their current lives, and see how this changes their expectation of what their instrumental goals would be in that scenario. For example, when imagining a world where circumstances have changed such that money is no longer a valued economic token, I anticipate that I would have no desire for money in that world. Thus, I can infer that money is an instrumental goal."

Overall, I really feel uncertain about the truth of the matter and the validity of each of these ways of measuring. I think understanding values vs instrumental goals is important work that needs doing, and I think we need to consider all these paths to understanding unless we figure out a way to rule some out.

If we were to break down where a value comes from, it would have to be from some combination of these basic drives, cortical tendencies (e.g. vulnerability to optical illusions), and learned behavior.

I wouldn't want to say this is false, but I'd want to say that speaking like this is a red flag that we haven't understood what values are in the appropriate basis. We can name some dimensions (the ones you list, and others), but then our values are rotated with respect to this basis; our values are some vector that cuts across these basis vectors. We lack the relevant concepts. When you say that you experience "the underlying drivers behind your goals" as being constant, I'm skeptical, not because I don't think there's something that's fairly fixed, but because we lack the concepts to describe that fixed thing, and so it's hard to see how you could have a clear experience of the fixedness. At most you could have a vague sense that perhaps there is something fixed. And if so, then I'd want to take that sense as a pointer toward the as-yet not understood ideas.
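
A toy numerical picture of the "rotated basis" point (purely illustrative; the axis labels and angles are made up and don't correspond to anything real):

```python
# Toy illustration: a "value" that is a single clean axis in some unknown
# natural coordinate system looks like a messy mixture when written in the
# basis we can name (drives, cortical tendencies, learned behavior).
import numpy as np

def rotation(theta, phi):
    """Compose two elementary 3D rotations (angles chosen arbitrarily)."""
    rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
    rx = np.array([[1.0, 0.0,          0.0],
                   [0.0, np.cos(phi), -np.sin(phi)],
                   [0.0, np.sin(phi),  np.cos(phi)]])
    return rx @ rz

R = rotation(0.7, 1.1)

# A value that is "pure" in the hypothetical natural basis: one unit along axis 0.
value_natural = np.array([1.0, 0.0, 0.0])

# The same value written in the nameable basis
# (axis 0 = basic drives, axis 1 = cortical tendencies, axis 2 = learned behavior):
value_nameable = R @ value_natural
print(value_nameable)  # roughly [0.76, 0.29, 0.57] -- nonzero on every named axis
```

The point is just that a direction which is simple in the right coordinates can look like an unprincipled mixture in the coordinates we happen to have names for.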

Yes, I think I'd go with the description: 'vague sense that there is something fixed, and a lived experience that says that if not completely fixed then certainly slow moving.'

And I absolutely agree that understanding of this is lacking.

"All my values are implicit, explicit labels are just me attempting to name a feeling. The ground truth is the feeling."

Can you elaborate? (I don't have a specific question, just double-clicking, asking for more detail or rephrasing that uses other concepts.)

I find that I'm not very uncertain about where values come from. Although the exact details of the mechanisms in complex systems like humans remain murky, to me it seems pretty clear that we already have the answer from cybernetics: they're the result of how we're "wired" up into feedback loops. There's perhaps the physics question of why the universe is full of feedback loops (and the metaphysics question of why our universe has the physics it has), but given that the universe is full of feedback loops, values seem a natural consequence of this fact.

Agree somewhat though I think lack of confusion (already knowing) can go too far as well. Wanted to add that per Quine and later analyzed by Nozick, we seem to be homeostatic envelope extenders. That is, we start with maintaining homeostasis (which gets complex due to cognitive arms race social species stuff) and then add on being able to reason about things far from their original context in time and space and try to extend our homeostatic abilities to new contexts and increase their robustness over arbitrary timelines, locations, conditions.

And it seems arthrodiatomic (cutting across joints, i.e. non-joint-carving) to describe the envelope-extension process itself as being an instance of homeostasis.

Can you give pointers to where Quine and Nozick talk about this?

I mostly got this from Nozick's final book, Invariances.

This is a non-answer, and I wish you'd notice on your own that it's a non-answer. From the dialogue:

Really I want to know the shape of values as they sit in a mind. I want to know that because I want to make a mind that has weird-shaped values. Namely, Corrigibility.

So, given that you know where values come from, do you know what it looks like to have a deeply corrigible strong mind, clearly enough to make one? I don't think so, but please correct me if you do. Assuming you don't, I suggest that understanding what values are and where they come from in a more joint-carving way might help.

In other words, saying that, besides some details, values come as "the result of how we're "wired" up into feedback loops" is true enough, but not an answer. It would be like saying "our plans are the result of how our neurons fire" or "the Linux operating system is the result of how electrons move through the wires in my computer". It's not false, it's just not an answer to the question we were asking.

So, given that you know where values come from, do you know what it looks like to have a deeply corrigible strong mind, clearly enough to make one? I don't think so, but please correct me if you do. Assuming you don't, I suggest that understanding what values are and where they come from in a more joint-carving way might help.

Yes, understanding values better would be better. The case I've made elsewhere is that we can use cybernetics as the basis for this understanding. Hence my comment is to suggest that if you don't know where values come from, I can offer what I believe to be a model that answers where values ultimately come from and gives a good basis for building up a more detailed model of values. Others are doing the same with compatible models, e.g. predictive processing.

I've not thought deeply about corrigibility recently, but my thinking on outer alignment more generally has been that, because Goodhart is robust, we cannot hope to get fully aligned AI by any means that relies on measurement, which leaves us with building AI with goals that are already aligned with ours (it seems quite likely we're going to bootstrap to AI that helps us build this, though, so work on imperfect systems seems worthwhile, but I'll ignore it here). I expect a similar situation for building just corrigibility.

So to build a corrigible AI, my model says we need to find the configuration of negative feedback circuits that implement a corrigible process. That doesn't constrain the space to look in a lot, but it does some, and it makes it clear that what we have is an engineering rather than a theory challenge. I see this as advancing the question from "where do values come from?" to "how do I build a thing out of feedback circuits that has the values I want it to have?".
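
A minimal sketch of what I mean by a value implemented as a negative feedback circuit, assuming the simplest possible proportional controller (a toy illustration, not a design for anything corrigible):

```python
# Minimal negative-feedback loop: the "value" is just the setpoint the system
# keeps steering back toward. A toy, not a design for anything corrigible.

def feedback_loop(setpoint, state, gain=0.5, steps=30):
    """Repeatedly sense the error and act to reduce it (proportional control)."""
    for _ in range(steps):
        error = setpoint - state   # sense: how far is the world from what's "valued"?
        action = gain * error      # act so as to shrink the error
        state = state + action     # the (trivially modeled) world responds
    return state

# The circuit "values" a state of 10.0 in the only sense cybernetics needs:
# start it anywhere and it climbs back.
print(feedback_loop(setpoint=10.0, state=0.0))   # ~10.0
print(feedback_loop(setpoint=10.0, state=42.0))  # ~10.0
```

On this picture the "value" just is the setpoint together with the wiring that keeps steering back toward it; the engineering question is which configuration of such circuits, stacked and coupled at many levels, would add up to corrigibility.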