I definitely agree that the player vs character distinction is meaningful, although I would define it a bit differently.
I would identify it with cortical vs subcortical, a.k.a. neocortex vs everything else. (...with the usual footnotes, e.g. the hippocampus counts as "cortical" :-D)
(ETA: See my later post Inner alignment on the brain for a better discussion of some of the below.)
The cortical system basically solves the following problem:
Here is (1) a bunch of sensory & other input data, in the form of spatiotemporal patterns of spikes on input neurons, (2) occasional labels about what's going on right now (e.g. "something good / bad / important is happening"), (3) a bunch of outgoing neurons. Your task is to build a predictive model of the inputs, and use that to choose signals to send into the outgoing neurons, to make more good things happen.
The result is our understanding of the world, our consciousness, imagination, memory, etc. Anything we do that requires understanding the world is done by the cortical system. This is your "character".
The subcortical system is responsible for everything else your brain does to survive, one of which is providing the "labels" mentioned above (that something good / bad / important / whatever is happening right now).
For example, take the fear-of-spiders instinct. If there is a black scuttling blob in your visual field, there's a subcortical vision system (in the superior colliculus) that pattern-matches that moving blob to a genetically-coded template, and thus activates a "Scary!!" flag. The cortical system sees the flag, sees the spider, and thus learns that spiders are scary, and it can plan intelligent actions to avoid spiders in the future.
I have a lot of thoughts on how to describe these two systems at a computational level, including what the neocortex is doing, and especially how the cortical and subcortical systems exchange information. I am hoping to write lots more posts with more details about the latter, especially about emotions.
even the reward and optimization mechanisms themselves may end up getting at least partially rewritten.
Well, there is such a thing as subcortical learning, particularly for things like fine-tuning motor control programs in the midbrain and cerebellum, but I think most or all of the "interesting" learning happens in the cortical system, not subcortical.
In particular, I'm not really expecting the core emotion-control algorithms to be editable by learning or thinking (if we draw an appropriately tight boundary around them).
More specifically: somewhere in the brain is an algorithm that takes a bunch of inputs and calculates "How guilty / angry / happy / smug / etc. should I feel right now?" The inputs to this algorithm come from various places, including from the body (e.g. pain, hunger, hormone levels), and from the cortex (what emotions am I expecting or imagining or remembering?), and from other emotion circuits (e.g. some emotions inhibit or reinforce each other). The inputs to the emotion calculation can certainly change, but I don't expect that the emotion calculation itself changes over time.
It feels like emotion-control calculations can change, because the cortex can be a really dominant input to those calculations, and the cortex really can change, including by conscious effort. Why is the cortex such a dominant input? Think about it: the emotion-calculation circuits don't know whether I'm likely to eat tomorrow, or whether I'm in debt, or whether Alice stole my cookie, or whether I just got promoted. That information is all in the cortex! The emotion circuits get only tiny glimpses of what's going on in the world, particularly through the cortex predicting & imagining emotions, including in empathetic simulation of others' emotions. If the cortex is predicting fear, well, the amygdala obliges by creating actual fear, and then the cortex sees that and concludes that its prediction was right all along! There's very little "ground truth" that the emotion circuits have to go on. Thus, there's a wide space of self-reinforcing habits of thought. It's a terrible system! Totally under-determined. Thus we get self-destructive habits of thought that linger on for decades.
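The self-reinforcing loop described above can be sketched as a toy simulation (my own illustration with made-up numbers, not anything from actual neuroscience): the amygdala's fear output is a weighted mix of "ground truth" signals and the cortex's own fear prediction, and the cortex then updates its prediction toward whatever fear it just observed.

```python
def run_loop(ground_truth_weight, steps=50):
    prediction = 0.9    # cortex starts out predicting strong fear
    ground_truth = 0.1  # actual threat level in the world is low
    for _ in range(steps):
        # Amygdala: mixes ground truth with the cortex's prediction;
        # when ground_truth_weight is small, it mostly echoes the cortex.
        fear = (ground_truth_weight * ground_truth
                + (1 - ground_truth_weight) * prediction)
        # Cortex: sees the fear it (mostly) caused and treats it as
        # confirmation, nudging its prediction toward it.
        prediction += 0.1 * (fear - prediction)
    return prediction

# With little ground truth, the fearful habit barely corrects;
# with strong ground truth, it washes out quickly.
print(run_loop(ground_truth_weight=0.05))
print(run_loop(ground_truth_weight=0.9))
```

In this toy version the prediction does eventually converge to ground truth, but the correction rate scales with the ground-truth weight, which matches the point about under-determined habits of thought lingering for decades when the circuits have little ground truth to go on.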
Anyway, I have this long-term vision of writing down the exact algorithm that each of the emotion-control circuits is implementing. I think AGI programmers might find those algorithms helpful, and so might people trying to pin down "human values". I have a long way to go in that quest :-D
there's also a sense in which the player doesn't have anything that we could call values ...
I basically agree; I would describe it by saying that the subcortical systems are kinda dumb. Sure, the superior colliculus can recognize scuttling spiders, and the emotion circuits can "dislike" pain. But any sophisticated concept like "flourishing", "fairness", "virtue", etc. can only be represented in the form of something like "Neocortex World Model Entity ID #30962758", and these things cannot have any built-in relationship to subcortical circuits.
So the player's "values" are going to be (1) simple things like "less pain is good", and (2) things that don't have an obvious relation to the outside world, like complicated "preferences" over the emotions inside our empathetic simulations of other people.
If a "purely character-level" model of human values is wrong, how do we incorporate the player level?
Is it really "wrong"? It's a normative assumption ... we get to decide what values we want, right? As "I" am a character, I don't particularly care what the player wants :-P
But either way, I'm all for trying to get a better understanding of how I (the character / cortical system) am "built" by the player / subcortical system. :-)
Great comment, thanks!
Is it really "wrong"? It's a normative assumption ... we get to decide what values we want, right? As "I" am a character, I don't particularly care what the player wants :-P
Well, to make up a silly example, let's suppose that you have a conscious belief that you want there to be as much cheesecake as possible. This is because you are feeling generally unsafe, and a part of your brain has associated cheesecakes with a feeling of safety, so it has formed the unconscious prediction that if only there was enough cheesecake, then you would finally feel good and safe.
So you program the AI to extract your character-level values, it correctly notices that you want to have lots of cheesecake, and goes on to fill the world with cheesecake... only for you to realize that now that you have your world full of cheesecake, you still don't feel as happy as you were on some level expecting to feel, and all of your elaborate rational theories of how cheesecake is the optimal use of atoms start feeling somehow hollow.
There is a mismatch in saying cortex = character and subcortex = player.
If I understand the player-character model right, then unconscious coping strategies would be a player-level tactic. But these are learned behaviours, and would therefore be part of the cortex.
In Kaj's example, the idea that cheesecake will make the bad go away exists in the cortex's world model.
According to Steven's model of how the brain works (which I think is probably true), the subcortex is part of the game the player is playing. Specifically, the subcortex provides the reward signal, and some other important game stats (stamina level, hit points, etc.). The subcortex is also sort of like a tutorial, drawing your attention to things that the game creator (evolution) thinks might be useful, and occasional cut scenes (acting out pre-programmed behaviour).
ML comparison:
* The character is the pre-trained neural net
* The player is the backprop
* The cortex is the neural net plus backprop
* The subcortex is the reward signal and sometimes a supervisory signal.
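That mapping can be made concrete with a toy sketch (my own illustration, with entirely made-up functions and numbers): a fixed "subcortex" reward function hardcoded by evolution, a one-parameter "character", and a "player" that blindly rewrites the character to get more reward.

```python
def subcortex_reward(behavior):
    # Fixed by "evolution": rewards behavior near an innate target.
    return -(behavior - 3.0) ** 2

def player_update(theta, lr=0.1, eps=1e-4):
    # The player: a dumb hill-climbing rule with no world model,
    # estimating the reward gradient by finite differences.
    grad = (subcortex_reward(theta + eps)
            - subcortex_reward(theta - eps)) / (2 * eps)
    return theta + lr * grad

theta = 0.0  # the character's initial disposition
for _ in range(100):
    theta = player_update(theta)
print(theta)  # the character drifts toward what the reward circuit favors
```

Note that the player here has no representation of *why* any particular disposition is good; it just follows the reward signal, which fits the later point that the player chooses using criteria it does not itself understand.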
Also, I don't like the player-character model much. Like all models it is at best a simplification, and it does capture some of what is going on, but I think it is more wrong than right, and I think something like a multi-agent model is much better. I.e. there are coping mechanisms and other less conscious strategies living in your brain side by side with who you think you are. But I don't think these are completely invisible the way the player is invisible to the character. They are predictive models (e.g. "cheesecake will make me safe"), and it is possible to query them for predictions. And almost all of these models are in the cortex.
You might have already mentioned this elsewhere, but do you have any reading recommendations for computation and the brain?
Meh, I haven't found any author that I especially love and who thinks like me and answers all my questions. :-P But here's every book I've read. I've also obviously read a bunch of articles, but too many to list and none of them especially stands out unless you ask something more specific. Warning: sometimes I have an overly confident tone but don't really know what I'm talking about :-P
Speaking as a character, I too think the player can just go jump in a lake.
My response to this post is to think about something else instead, so if you'll excuse me getting on a hobby horse...
I agree that when we look at someone making bizarre rationalizations, "their values" are not represented consciously, and we have to jump to a different level to find human values. But I think that conscious->unconscious is the wrong level jump to make.
Instead, the jump I've been thinking about recently is to our own model of their behavior. In this case, our explanation of their behavior relies on the unconscious mind, but in other cases, I predict that we'll identify values with conscious desires when that is a more parsimonious explanation of behavior. An AI learning human values would then not merely be modeling humans, but modeling humans' models of humans. But I think it might be okay if it makes those models out of completely alien concepts (at least outside of deliberately self-referential special cases - there might be an analogy here to the recursive modeling of Gricean communication).
Fantastic post; I'm still processing.
One bite-sized thought that occurs to me is that maybe this coupling of the Player and the Character is one of the many things accomplished by dreaming. The mind-system confabulates bizarre and complex scenarios, drawn in some sense from the distribution of possible but not highly probable sensory experiences. The Player provides an emotional reaction to these scenarios - you're naked in school, you feel horrifying levels of embarrassment in the dream, and the Character learns to avoid situations like this one without ever having to directly experience it.
I think that dreaming does this sort of thing in a general way, by simulating scenarios and using those simulations to propagate learning through the hierarchy, but in particular it would seem that viewing the mind in terms of Player/Character gives you a unique closed-loop situation that really bootstraps the ability of the Character to intuitively understand the Player's wishes.
It's not clear to me that we can call them values in the first place, either
It looks like the player values can be satisfied by wireheading, while the character values are more about the state of the world/state of their identity.
So, to satisfy them both, give the person pleasure and also satisfy their other preferences?
In a sense, it feels like the player's preferences are more base level, and the character's more meta?
Hmm... it's hard for me to get what you mean from a comment this short, but just the fact that I seem to have a lot of difficulty connecting your comment with my own model suggests that I didn't communicate mine very well. Could you say more about how you understood it?
The player seems to value emotional states, while the character values specific situations it can describe? Does that seem right?
My take is that we (the characters) have some wireheadable goals (e.g. curing a headache), but we also have plenty of goals best understood externally.
But the "player" is a less clearly goal-oriented process, and we can project different sorts of goals onto it, ranging from "it wants to make the feedback signal from the cortical neurons predict the output of some simple pattern detector" to "it wants us to avoid spiders" to "it wants us to be reproductively fit."
Hmm... several thoughts about that.
One is that I don't think we really know what the player does value. I had some guesses and hand-waving in the post, but nothing that I would feel confident enough about to use as the basis for preference synthesis or anything like that. I'm not even certain that our values can be very cleanly split into a separate character and player, though I do think that the two-layer model is less wrong than the naive alternative.
In Sarah's original analogy, the player first creates the character; then the character acts based on the choices that the player has made beforehand. But I should have mentioned in the post that one aspect in which I think the analogy is wrong, is that the player keeps changing the character. (Maybe you could think of this as one of those games that give you the option to take back the experience points that you've used on your character and then lets you re-assign them...)
Part of normal learning and change is that when you have new experiences, the learning process which I've been calling the player is involved in determining how those experiences affect your desires and personality. E.g. the changes in values and preferences that many people experience after having their first child - that might be described as the work of the player writing the "parental values" attribute into the character sheet. Or someone who goes to college, uncertain of what they want to study, tries out a few different subjects, and then switches their major to something which they found surprisingly interesting and motivating - the player giving them a preference to study that thing.
Those examples seem complicated enough that it seems a little too simplified to say that the player values emotional states; to some extent it seems to, but it also seems to itself create emotional states as suits its purposes. Probably what it "values" can't be simplified into any brief verbal description; it's more like a godshatter with a thousand different optimization criteria, all being juggled together to create something like the character.
I read your original comment as suggesting that we give the player sufficient pleasure that it is content; and then we also satisfy the character's preferences. But
1. Assuming for the sake of argument that this was possible, it's not clear what "the player being content" would do to a person's development. One possibility is that they would stop growing and responding to changed circumstances at all, because the mechanisms that were updating their behavior and thinking were all in the player. (Maybe even up to the point of e.g. not developing new habits in response to having moved to a new home with different arrangements, or something similar.)
2. There's anecdotal evidence suggesting that the pursuit of pleasure is actually also one of those character-level things. In "Happiness is a chore", the author makes the claim that even if you give people a technique which would consistently make them happy, and people try it out and become convinced of this, they might still end up not using it - because although "the pursuit of happiness" is what the character thinks they are doing, it is actually not what the player is optimizing for. If it was, it might be in the player's power to just create the happiness directly. Compare e.g. pjeby's suggestion that things like happiness etc. are things that we feel by default, but the brain learns to activate systems which block happiness, because the player considers that necessary for some purpose:
So if, for example, we don't see ourselves as worthless, then experiencing ourselves as "being" or love or okayness is a natural, automatic consequence. Thus I ended up pursuing methods that let us switch off the negatives and deal directly with what CT and IFS represent as objecting parts, since these objections are the constraint on us accessing CT's "core states" or IFS's self-leadership and self-compassion.
These claims also match my personal experience; I have, at various times, found techniques that I know would make me happy, but then I find myself just not using them. At one point I wrote "I have available to me some mental motions for reaching inside myself and accessing a source of happiness, but it would require a bit of an active effort, and I find that just being neutral is already good enough, so I can't be bothered." Ironically, I think I've forgotten what exactly that particular mental move was, because I ended up not using it very much...
There's also a thing in meditative traditions where people develop the ability to access some really pleasant states of mind ("jhanas"). But then, although some people do become "jhana-junkies" and mostly just want to hang out in them, a lot of folks don't. One friend of mine who knows how to access the jhanas was once asked something along the lines of "well, if you can access pure pleasure, why aren't you doing it all the time". That got him thinking, and afterwards he mentioned something about pure bliss just getting kinda stale / boring after a while. Also, getting into a jhana requires some amount of effort and energy, and he figures that he might as well spend that effort and energy on something more meaningful than just pure pleasure.
3. "Not getting satisfied" seems like a characteristic thing of the player. The character thinks that they might get satisfied: "once I have that job that I've always wanted, then I'll be truly happy"... and then after a while they aren't anymore. If we model people's goals as setpoints, it seems like frequently when one setpoint has been reached (which the previous character would have been satisfied with), the player looks around and changes the character to give it a new target setpoint. (I saw speculation somewhere that this is an evolutionary hack for getting around the fact that the brain has only a limited range of utility that it can represent - by redefining the utility scale whenever you reach a certain threshold, you can effectively have an unbounded utility function even though your brain can only represent bounded utility. Of course, it comes with costs such as temporally inconsistent preferences.)
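The speculated "moving setpoint" hack can be sketched as a toy model (my own illustration of the speculation, with arbitrary numbers): felt utility is confined to a bounded range, but the baseline gets reset whenever attainment approaches the top of the range, so the system can keep chasing arbitrarily large totals.

```python
def felt_utility(attainment, baseline, scale=10.0):
    # Bounded representation: satisfaction saturates within [-1, 1].
    return max(-1.0, min(1.0, (attainment - baseline) / scale))

baseline = 0.0
history = []
for attainment in range(0, 100, 5):
    u = felt_utility(attainment, baseline)
    if u >= 0.9:               # goal (nearly) reached...
        baseline = attainment  # ...so the player moves the goalposts
    history.append(u)

print(max(history))  # felt utility never leaves the bounded range
print(baseline)      # yet the target keeps ratcheting upward
```

This also makes the cost visible: a character evaluated against the new baseline no longer "counts" the attainment it already has, which is one way to get the temporally inconsistent preferences mentioned above.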
How does this relate to the whole "no-self" thing? Is the character becoming aware of the player there?
Good question. I think that at least some approaches to no-self do break down the mechanisms by which the appearance of a character is maintained, but the extent to which it actually gives insight to the nature of the player (as opposed to giving insight to the non-existence of the character) is unclear to me.
I have been thinking about Stuart Armstrong's preference synthesis research agenda, and have long had the feeling that there's something off about the way that it is currently framed. In the post I try to describe why. I start by describing my current model of human values, how I interpret Stuart's implicit assumptions to conflict with it, and then talk about my confusion with regard to reconciling the two views.
The two-layer/ULM model of human values
In Player vs. Character: A Two-Level Model of Ethics, Sarah Constantin describes a model where the mind is divided, in game terms, into a "player" and a "character". The character is everything that we consciously experience, but our conscious experiences are not our true reasons for acting. As Sarah puts it:
I think that this model is basically correct, and that our emotional responses, preferences, etc. are all the result of a deeper-level optimization process. This optimization process, then, is something like that described in The Brain as a Universal Learning Machine:
Rephrasing these posts in terms of each other, in a person's brain "the player" is the underlying learning machinery, which is searching the space of programs (brains) in order to find a suitable configuration; the "character" is whatever set of emotional responses, aesthetics, identities, and so forth the learning program has currently hit upon.
Many of the things about the character that seem fixed, can in fact be modified by the learning machinery. One's sense of aesthetics can be updated by propagating new facts into it, and strongly-held identities (such as "I am a technical person") can change in response to new kinds of strategies becoming viable. Unlocking the Emotional Brain describes a number of such updates, such as - in these terms - the ULM eliminating subprograms blocking confidence after receiving an update saying that the consequences of expressing confidence will not be as bad as previously predicted.
Another example of this kind of a thing was the framework that I sketched in Building up to an Internal Family Systems model: if a system has certain kinds of bad experiences, it makes sense for it to spawn subsystems dedicated to ensuring that those experiences do not repeat. Moral psychology's social intuitionist model claims that people often have an existing conviction that certain actions or outcomes are bad, and that they then level seemingly rational arguments for the sake of preventing those outcomes. Even if you rebut the arguments, the conviction remains. This kind of a model is compatible with an IFS/ULM style model, where the learning machinery sets the goal of preventing particular outcomes, and then applies the "reasoning module" for that purpose.
Qiaochu Yuan notes that once you see people getting upset at a coworker for criticizing them, and you do therapy approaches with them, and this gets to the point where they are crying about how their father never told them that he was proud of them... then it gets really hard to take people's reactions to things at face value. Many of our consciously experienced motivations actually have nothing to do with our real motivations. (See also: Nobody does the thing that they are supposedly doing, The Elephant in the Brain, The Intelligent Social Web.)
Preference synthesis as a character-level model
While I like a lot of the work that Stuart Armstrong has done on synthesizing human preferences, I have a serious concern about it which is best described as: everything in it is based on the character level, rather than the player/ULM level.
For example, in "Our values are underdefined, changeable, and manipulable", Stuart - in my view, correctly - argues for the claim stated in the title... except that, it is not clear to me to what extent the things we intuitively consider our "values", are actually our values. Stuart opens with this example:
From this, Stuart suggests that people's values on these questions should be thought of as underdetermined. I think that this has a grain of truth to it, but that calling these opinions "values" in the first place is misleading.
My preferred framing would rather be that people's values - in the sense of some deeper set of rewards which the underlying machinery is optimizing for - are in fact underdetermined, but that is not what's going on in this particular example. The order of the questions does not change those values, which remain stable under this kind of a consideration. Rather, consciously-held political opinions are strategies for carrying out the underlying values. Receiving the questions in a different order caused the system to consider different kinds of information when it was choosing its initial strategy, causing different strategic choices.
Stuart's research agenda does talk about incorporating meta-preferences, but as far as I can tell, all the meta-preferences are about the character level too. Stuart mentions "I want to be more generous" and "I want to have consistent preferences" as examples of meta-preferences; in actuality, these meta-preferences might exist because of something like "the learning system has identified generosity as a socially admirable strategy and predicts that to lead to better social outcomes" and "the learning system has formulated consistency as a generally valuable heuristic and one which affirms the 'logical thinker' identity, which in turn is being optimized because of its predicted social outcomes".
My confusion about a better theory of values
If a "purely character-level" model of human values is wrong, how do we incorporate the player level?
I'm not sure and am mostly confused about it, so I will just babble & boggle at my confusion for a while, in the hopes that it would help.
The optimistic take would be that there exists some set of universal human values which the learning machinery is optimizing for. There exist various therapy frameworks which claim to have found something like this.
For example, the NEDERA model claims that there exist nine negative core feelings whose avoidance humans are optimizing for: people may feel Alone, Bad, Helpless, Hopeless, Inadequate, Insignificant, Lost/Disoriented, Lost/Empty, and Worthless. And pjeby mentions that in his empirical work, he has found three clusters of underlying fears which seem similar to these nine:
So - assuming for the sake of argument that these findings are correct - one might think something like "okay, here are the things the brain is trying to avoid, we can take those as the basic human values".
But not so fast. After all, emotions are all computed in the brain, so "avoidance of these emotions" can't be the only goal any more than "optimizing happiness" can. It would only lead to wireheading.
Furthermore, it seems like one of the things that the underlying machinery also learns is situations in which it should trigger these feelings. E.g. feelings of irresponsibility can be used as an internal carrot-and-stick scheme, in which the system comes to predict that feeling persistently bad will cause parts of it to pursue specific goals in an attempt to make those negative feelings go away.
Also, we are not only trying to avoid negative feelings. Empirically, it doesn't look like happy people end up doing less than unhappy people, and guilt-free people may in fact do more than guilt-driven people. The relationship is nowhere near linear, but it seems like there are plenty of happy, energetic people who are happy in part because they are doing all kinds of fulfilling things.
So maybe we could look at the inverse of negative feelings: positive feelings. The current mainstream model of human motivation and basic needs is self-determination theory, which explicitly holds that there exist three separate basic needs:
So one model could be that the basic learning machinery is, first, optimizing for avoiding bad feelings; and then, optimizing for things that have been associated with good feelings (even when doing those things is locally unrewarding, e.g. taking care of your children even when it's unpleasant). But this too risks running into the wireheading issue.
A problem here is that while it might make intuitive sense to say "okay, if the character's values aren't the real values, let's use the player's values instead", the split isn't actually anywhere that clean. In a sense the player's values are the real ones - but there's also a sense in which the player doesn't have anything that we could call values. It's just a learning system which observes a stream of rewards and optimizes it according to some set of mechanisms, and even the reward and optimization mechanisms themselves may end up getting at least partially rewritten. The underlying machinery has no idea about things like "existential risk" or "avoiding wireheading" or necessarily even "personal survival" - thinking about those is a character-level strategy, even if it is chosen by the player using criteria that it does not actually understand.
For a moment it felt like looking at the player level would help with the underdefinability and mutability of values, but the player's values seem like they could be even less defined and even more mutable. It's not clear to me that we can call them values in the first place, either - any more than it makes meaningful sense to say that a neuron in the brain "values" firing and releasing neurotransmitters. The player is just a set of code, or going one abstraction level down, just a bunch of cells.
To the extent that there exists something that intuitively resembles what we call "human values", it feels like it exists in some hybrid level which incorporates parts of the player and parts of the character. That is, assuming that the two can even be very clearly distinguished from each other in the first place.
Or something. I'm confused.