I'm loving this whole sequence, but I particularly love:
9.2.2 Preferences are over “thoughts”, which can relate to outcomes, actions, plans, etc., but are different from all those things
That feels very crisp, clear, and informative.
Here’s a plausible human circular preference. You won a prize! Your three options are: (A) 5 lovely plates, (B) 5 lovely plates and 10 ugly plates, (C) 5 OK plates.
No one has done this exact experiment to my knowledge, but plausibly (based on discussion of a similar situation in Thinking Fast And Slow chapter 15) this is a circular preference in at least some people: When people see just A & B, they'll pick B because "it's more stuff, I can always keep the ugly ones as spares or use them for target practice or whatever". When they see just B & C, they'll pick C because "the average quality is higher". When they see just C & A, they'll likewise pick A because "the average quality is higher".
This makes no sense to me. Why would you pick C over B? B Pareto dominates C since it contains 5 lovely plates whereas C only has 5 OK plates.
Well, I guess it wouldn't be a circular preference for you. :)
I think it wouldn't occur to many people that they could do one thing with the better 5 plates, and do a different thing with the worse 10 plates, if the plates are not presented in a way that makes the 5+10 division salient. Imagine the better and worse ones are all mixed up, and they're all the same design, such that they're obviously meant to be used as a set, but 2/3rds of the plates in the set have obvious cracks and chips. My impression (again see related experiments in the book chapter) is that many people would just take in the set of 15 plates as a whole and say "man, we can't eat off these, someone could get a cut, the sauce would leak onto the table etc.". The person would have to be kinda thinking outside the box and putting in some effort to notice that there are 5 plates in the set with no chips or cracks, and think of the strategy where they use those and throw out the other 10.
If the 5 lovely plates were literally identical in the two sets, I think (for many people) it might serve as a sort of "hint" that they should consider the clever course of action, the one that involves splitting up the B set (i.e. doing one thing with the 10 cracked & chipped plates, and doing a different thing with the 5 other B plates). That same clever splitting idea might also pop into some people's heads for the B-versus-C comparison, but I think it would be less obvious / salient, so fewer people would think of that, leaving at least a subset of people who would choose both B-over-A if that were the choice, and C-over-B if that were the choice.
Conditioned Taste Aversion (CTA) is a phenomenon where, if I get nauseous right now, it causes an aversion to whatever tastes I was exposed to a few hours earlier—not a few seconds earlier, not a few days earlier, just a few hours earlier. (I alluded to CTA above, but not its timing aspect.) The evolutionary reason for this is straightforward: a few hours is presumably how long it typically takes for a toxic food to induce nausea.
That explains why my brother no longer likes mushrooms. When we were little, he liked them and we ate mushrooms at a restaurant, then were driven through curvy mountain roads later that day with the family. He got car sick and vomited, and afterwards he had an intense hatred for mushrooms.
I liked the painting metaphor, and the diagram of brain-like AGI motivation!
Got a couple of questions below.
It’s possible that you would find this nameless pattern rewarding, were you to come across it. But you can’t like it, because it’s not currently part of your world-model. That also means: you can’t and won’t make a goal-oriented plan to induce that nameless pattern.
I agree that if you haven't seen something, then it's not exactly a part of your world-model. But judging from the fact that it has, say, positive reward, does this not mean that you like(d) it? Or that a posteriori we can tell it lay inside your "like" region? (It was somewhere close to things you liked.)
For example, say someone enjoys the affection of cat species A, B. Say they haven't experienced a cat of species C, which is similar in some way to species A, B. Then probably they would get a positive reward from meeting cat C (affection), even though their world model didn't include it beforehand. Therefore, they should tell us afterwards that in their previous world, cat C should have been in the "like cat" region.
Similarly, you can conceptualize a single future state of the world in many different ways, e.g. by attending to different aspects of it, and it will thereby become more or less appealing. This can lead to circular preferences; I put an example in this footnote[1].
Could it be that intelligent machines have circular preferences? I understand that is the case for humans, but I'm curious how nuanced the answer for machines is.
Imperfect data/architecture/training algorithms could lead to weird types of thinking when employed out of distribution (OOD). Do you think it would be helpful to try to measure the coherency of the system's actions/thoughts? E.g. make datasets that inspect the agent's theory of mind (I think Beth Barnes suggested something like this). I am unsure about what these metrics would imply for AGI safety.
Namely: It seems to me that there is not a distinction between instrumental and final preferences baked deeply into brain algorithms. If you think a thought, and your Steering Subsystem endorses it as a high-value thought, I think the computation looks the same if it’s a high-value thought for instrumental reasons, versus a high-value thought for final reasons.
The answer for this should depend on the size of the space that the optimization algorithm searches over.
It could be the case that the space of possible outcomes for final preferences is smaller than that of instrumental ones, and thus we could afford a different optimization algorithm (or variant thereof).
Also, if instrumental/final preferences were to be mixed together, should we not have been able to encode e.g. strategic behavior (final preference) in RL agents by now?
Thanks!
I agree that if you haven't seen something, then it's not exactly a part of your world-model. But judging from the fact that it has, say, positive reward, does this not mean that you like(d) it? Or that a posteriori we can tell it lay inside your "like" region? (It was somewhere close to things you liked.)
For example, say someone enjoys the affection of cat species A, B. Say they haven't experienced a cat of species C, which is similar in some way to species A, B. Then probably they would get a positive reward from meeting cat C (affection), even though their world model didn't include it beforehand. Therefore, they should tell us afterwards that in their previous world, cat C should have been in the "like cat" region.
Suppose at time t=1 they are completely oblivious to the possible existence or idea of cat C, and at time t=2 they meet cat C and are very happy about it.
We agree that they like cat C at time t=2.
What about at time t=1? I would say “they neither like nor dislike cat C”. I would also say “they would like cat C, if only the thought of cat C occurred to them”.
I think you want to say that they actually already like cat C at t=1. But I don’t think that’s in accordance with common usage of the term “like”. For example, go ask someone on the street: “A year before you first met your current boyfriend (or first saw him, or first became aware of his existence), did you already like him? Did you already think he was cute?” I predict that they will say “no”, and maybe even give you a funny look.
Could it be that intelligent machines have circular preferences? I understand that is the case for humans, but I'm curious how nuanced the answer for machines is.
Yeah, I for one certainly expect intelligent machines to have circular preferences.
That said, when smart humans notice that they have circular preferences, they tend to adjust their preferences to straighten them out. I assume that AGIs will have the same tendency, and thus that they will have fewer and fewer circular preferences as they learn and think more. (Or perhaps, they'll have circular preferences that are harder and harder to notice.)
Here’s why I think humans tend to straighten out circular preferences: You can (and naturally do) have a preference “Insofar as my other preferences are self-contradictory, I should try to reduce that aspect of them”, because this is roughly a Pareto-improving thing to do. All of my preferences about future states can be better-actualized simultaneously when I adopt the habit of “noticing when two of my preferences are working at cross-purposes, and when I recognize that happening, preventing them from doing so”. So you gradually build up a bunch of new habits that look for various types of situations that pattern-match to “I'm working at cross-purposes to myself”, and then execute a Pareto improvement—since these habits are by default positively reinforced. It’s loosely analogous to how markets become more self-consistent when a bunch of people are scouting out for arbitrage opportunities, I think.
Do you think it would be helpful to try to measure the coherency of the system's actions/thoughts? E.g. make datasets that inspect the agent's theory of mind (I think Beth Barnes suggested something like this).
I don't immediately see why “coherency” would be important to measure for safety purposes, but I dunno, maybe. Measuring theory of mind seems potentially safety-relevant insofar as maybe we want to try to make AGIs that are bad at theory of mind, so that they don't know how to deceive humans even if they were motivated to. However, I don't know how you would do that, while still enabling the AGI to do the things we need it to do. Anyway, no strong opinion either way.
if instrumental/final preferences were to be mixed together, should we not have been able to encode e.g. strategic behavior (final preference) in RL agents by now?
It’s true that model-based RL algorithms exist today on GitHub & arXiv. But I think there's a big space of all possible model-based RL algorithms, and I think that there are still important differences between the model-based RL algorithms currently on GitHub & arXiv, versus the model-based RL algorithm in the brain. I won’t spell out my thoughts on that, for Differential Technological Development reasons. No one really knows all the details anyway.
That said, I’m surprised that you don’t think AlphaZero (for example) has “strategic behavior”. Maybe I’m not sure what you mean by “strategic behavior”.
“A year before you first met your current boyfriend (or first saw him, or first became aware of his existence), did you already like him? Did you already think he was cute?” I predict that they will say “no”, and maybe even give you a funny look.
Okay, now I get the point of "neither like nor dislike" in your original statement.
I was originally thinking of something as follows: "A year before you met your current boyfriend, would you have thought he was cute, if he was your type?". But "your type" requires seeing them to get a reference point for whether they belong in that class or not. So there's a circular statement of my own, straightened out; you had a good point here.
That said, I’m surprised that you don’t think AlphaZero (for example) has “strategic behavior”. Maybe I’m not sure what you mean by “strategic behavior”.
I would say the strategic behavior AlphaZero exhibits is weak (still incredible, specifically with the kind of weird h4 luft lines that the latest supercomputers show). I was thinking of a stronger version dealing with multi-agent environments, continuous state/action spaces, and/or multi-objective reward functions. That said, it seems to me that a different problem has to be solved to get the solution to this.
Avoiding a weak wireheading drive seems quite tricky. Maybe we could minimize it using timing and priors (Section 9.3.3 above), but avoiding it altogether would, I presume, require special techniques—I vaguely imagine using some kind of interpretability technique to find the RPE / feeling good concept in the world-model, and manually disconnecting it from any Thought Assessors, or something like that.
Here's a hacky patch that doesn't entirely solve it, but might help:
Presumably for humans, the RPE/reward is somehow wired into the world-model, since we have a clear awareness of it. But you could just not give it as an input to the AI's world model to begin with.
As long as it doesn't start hacking into its own runtime and peeking at the variables, this can mean that it doesn't have a variable corresponding to its reward in its world-model, which would prevent it from wanting to use it for wireheading.
Of course this is unstable, so we probably wouldn't want to rely on that. The stable approach would be what we discussed in the other thread, of manually coding the value function. This would protect against wireheading in fundamentally the same way, though, by eliminating the need for a separate "reward" variable in the world-model.
More on Section 9.5 "brain-like AGI is generally NOT trying to maximize its future reward" can be found in Reward is not the optimization target.
(Last revised: July 2024. See changelog at the bottom.)
9.1 Post summary / Table of contents
Part of the “Intro to brain-like-AGI safety” post series.
Most posts in the series thus far—Posts #2–#7—have been primarily about neuroscience. Then, starting with the previous post, we’ve been applying those ideas to better understand brain-like-AGI safety (as defined in Post #1).
In this post, I’ll discuss some topics related to the motivations and goals of a brain-like AGI. Motivation is of paramount importance for AGI safety. After all, our prospects are a heck of a lot better if future AGIs are motivated to bring about a wonderful future rich in human flourishing, compared to if they’re motivated to kill everyone. To get the former and not the latter, we need to understand how brain-like-AGI motivation works, and in particular how to point it in one direction rather than another. This post will cover assorted topics in that area.
Table of contents:
9.2 The AGI’s goals and desires are defined in terms of latent variables (learned concepts) in its world-model
Do you like football? Well, “football” is a learned concept living inside your world-model. Learned concepts like that are the only kinds of things that it’s possible to “like”. You cannot like or dislike [nameless pattern in sensory input that you’ve never conceived of]. It’s possible that you would find this nameless pattern rewarding, were you to come across it. But you can’t like it, because it’s not currently part of your world-model. That also means: you can’t and won’t make a goal-oriented plan to induce that nameless pattern.
I think this is clear from introspection, and I think it’s equally clear in our motivation picture (see Posts #6–#7). There, I used the term “thought” in a broad sense to include everything in conscious awareness and more—what you’re planning, seeing, remembering, understanding, attempting, etc. A “thought” is what the Thought Assessors assess, and it is built out of some configuration of the learned latent variables in your generative world-model.
Why is it important that an AGI’s goals are defined in terms of latent variables in its world-model? Lots of reasons! It will come up over and over in this and future posts. See also Post #2 of my Valence series for much deeper discussion of how different types of concepts can get imbued with positive or negative valence, how that feels intuitively, and how that affects everything from planning to morality to vibe-associations and more.
9.2.1 Implications for “value alignment” with humans
The above observation is one reason that “value alignment” between a human and an AGI is an awful mess of a problem. A brain-like AGI will have latent variables in its learned world-model, while a human has latent variables in their learned world-model, but they are different world-models, and the latent variables in one may have a complex and problematic relationship to the latent variables in the other. For example, the human’s latent variables could include things like “ghosts” that don’t really correspond to anything in the real world! For more on this topic, see John Wentworth’s post The Pointers Problem.
(I won’t say much about “defining human values” in this series; I want to stick to the narrower problem of “avoiding catastrophic AGI accidents like human extinction”, and I don’t think a deep dive into “defining human values” is necessary for that. But “defining human values” would still be a good thing to do, and I’m happy for people to be working on it; see for example 1,2. My take is here.)
9.2.2 Preferences are over “thoughts”, which can relate to outcomes, actions, plans, etc., but are different from all those things
Thought Assessors assess and compare “thoughts”, i.e. configurations of an agent’s generative world-model. The world-model is imperfect—a complete understanding of the world is far too complex to fit in any brain or silicon chip. Thus a “thought” inevitably involves attending to some things and ignoring others, conceptualizing things in certain ways, matching things to the nearest-available category even if it’s not a perfect fit, etc.
Some implications:
…namely to submit to a 30-hour-long hazing ritual and thus earn in-group membership[3]
…namely because it wants to think clearly and accomplish its goals.
9.2.3 Instrumental & final preferences seem to be mixed together
There’s an intuitive sense in which we have instrumental preferences (things we prefer because they have typically been useful in the past as a means to an end—e.g., I prefer wearing a watch because it helps me check the time), and final preferences (things we prefer as an end in themselves—e.g., I like feeling good, and dislike getting mauled by a bear). For example, Spencer Greenberg did a survey where some participants, but not others, described “there are beautiful things in the world” as a final goal—they cared about there being beautiful things, even if those things were located deep underground where no conscious being would ever see them. Do you agree or disagree? To me, the most interesting thing is that some people will answer: “I don’t know, I’ve never thought about that before, hmm, give me a second.” I think there’s a lesson here!
Namely: It seems to me that there is not a distinction between instrumental and final preferences baked deeply into brain algorithms. If you think a thought, and your Steering Subsystem endorses it as a high-valence thought, I think the computation looks the same if it’s a high-valence thought for instrumental reasons, versus a high-valence thought for final reasons.
I should clarify: You can do instrumental things without them being an instrumental preference. For example, when I first got a smartphone, I would sometimes take it out of my pocket to check Twitter. At the time, I had no preference for pulling out my cell phone per se. Instead, I was thinking a thought along the lines of: “I’m going to pull out my cell phone and then check Twitter.” The Steering Subsystem endorses this as a high-valence thought, but only because of the second part of the thought, the part that involves checking Twitter.
Then after a while, “credit assignment” (next section) worked its magic and put a new preference into my brain, a preference for reaching into my pocket and pulling out my cell phone per se. After that, I started pulling out my cell phone without having any idea why. And now it’s an “instrumental preference”.
(Note: Just because instrumental and final preferences are mixed up in human brains doesn’t mean they have to be mixed up in brain-like AGIs. For example, I can vaguely imagine some system for flagging positive-valence concepts with some explanation for how they came to be positive-valence. In the example above, maybe we could wind up with a dotted line from some innate drive to the “Twitter” concept, and then another dotted line from the “Twitter” concept to the “reach into my pocket and grab my phone” concept. I presume the dotted lines would probably be functionally inert for AGI operations, but it would be great to have them available to help with neural network interpretability. To be clear, I don’t know if this could really work as described; I’m just brainstorming.)
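To make the “dotted lines” brainstorm slightly more concrete, here is a toy sketch of what such provenance bookkeeping might look like. Everything in it is hypothetical: the `provenance` mapping, the concept names, and the `trace_provenance` helper are invented for illustration, and nothing here is a claim about how a real system would record this.

```python
# Toy sketch (hypothetical names throughout): each positive-valence concept
# optionally records the concept or drive that it inherited its valence from.
provenance = {
    "reach_into_pocket_and_grab_phone": "twitter",
    "twitter": "some_innate_drive",
}

def trace_provenance(concept, provenance):
    """Follow the dotted lines from a concept back toward its original source."""
    chain = [concept]
    seen = {concept}
    while chain[-1] in provenance:
        source = provenance[chain[-1]]
        if source in seen:  # guard against accidental loops in the records
            break
        chain.append(source)
        seen.add(source)
    return chain

chain = trace_provenance("reach_into_pocket_and_grab_phone", provenance)
```

In this sketch the records are never consulted when choosing actions; they exist only so that someone inspecting the system can ask where a preference came from.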
9.3 “Credit assignment” is how latent variables get painted with valence
9.3.1 What is credit assignment?
I introduced the idea of “credit assignment” in Post #7 (Section 7.4), and I suggest re-reading that now, so that you have a concrete example in mind. Recall this diagram:
As a reminder, the brain has “Thought Assessors” (Post #5 & #6) that work by supervised learning (with the supervisory signals coming from the Steering Subsystem). Their role is to translate from latent variables (a.k.a. concepts) in the world model (“paintings”, “taxes”, “striving”, etc.) to parameters that the Steering Subsystem can understand (arm pain, blood sugar levels, grimacing, etc.). For example, when I took a bite of cake in Post #7, a world-model concept (“myself eating prinsesstårta cake”) got attached to genetically-meaningful variables (sweet taste on my tongue, valence, etc.).
I’m calling that process “credit assignment”—in the sense that the abstract concept of “myself eating prinsesstårta cake” gets credit for the sweet taste on my tongue.
Kaj Sotala has a kinda poetic description of what I call credit assignment here:
I find myself visualizing a fine-tip paintbrush painting positive valence onto my mental concept of prinsesstårta. Besides the “valence” paint, there are various other paint colors associated with other visceral reactions.
Credit assignment can work in funny ways. Lisa Feldman Barrett tells a story where one time she went on a date, felt butterflies in her stomach, and thought she had found True Love—only to discover later that evening that she was coming down with the flu! Likewise, if I’m pleasantly surprised to win a prize, my brain can “assign credit” to my hard work and skill, or it can “assign credit” to the fact that I’m wearing my lucky underwear.
I said “my brain can assign credit” instead of “I can assign credit” just now, because I don’t want to imply that this is a voluntary choice that I made. Instead, credit assignment is some dumb algorithm in the brain. Speaking of which:
9.3.2 How does credit assignment work?—the short answer
If credit assignment is a dumb algorithm in the brain, exactly what dumb algorithm is it?
I think, at least to a first approximation, it’s the obvious one:
Whatever thought is active right now gets the credit.
That’s “obvious” in the sense that the Thought Assessors are using supervised learning (see Post #4), and this is what supervised learning would do by default. After all, the “context” inputs to the Thought Assessors are describing whatever thought is active right now, so if we do a gradient-descent update on the error (or something functionally similar to a gradient-descent update), this “obvious” algorithm is what we’ll get.
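Here is a minimal sketch of that “obvious” algorithm, assuming a linear Thought Assessor whose prediction is just a sum of per-concept weights. The function names, the learning rate, and the concept labels are all assumptions made up for this illustration; the real thing is presumably far messier.

```python
# Illustrative sketch only: a linear "Thought Assessor" trained by a
# delta-rule (gradient-descent) update on squared prediction error.

def predict(weights, active_concepts):
    """Prediction is the summed weight of the concepts in the current thought."""
    return sum(weights.get(c, 0.0) for c in active_concepts)

def assign_credit(weights, active_concepts, supervisory_signal, lr=0.5):
    """One supervised-learning step. Only currently active concepts are updated."""
    error = supervisory_signal - predict(weights, active_concepts)
    for c in active_concepts:
        weights[c] = weights.get(c, 0.0) + lr * error
    return error

weights = {}
# The thought "eating_cake" is active each time the supervisory signal arrives.
for _ in range(20):
    assign_credit(weights, {"eating_cake"}, supervisory_signal=1.0)
```

After these updates the weight on `"eating_cake"` is close to 1, while a concept that was never active at those moments (say `"taxes"`) has no weight at all, which is the sense in which the active thought gets the credit.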
9.3.3 How does credit assignment work?—fine print
I think it’s worth investing a bit more time on this topic, because credit assignment is central to AGI safety—after all, it’s how a brain-like AGI would wind up wanting some things rather than others. So I’ll just list out some assorted thoughts about how it works in humans.
1. Credit assignment can have “priors” that bias what type of concept gets what type of credit:
Recall from Posts #4–#5 that each Thought Assessor has its own “context” signals that serve as inputs to its predictive model. Imagine that some specific Thought Assessor has only context data from the visual cortex, for example. It will be forced to “assign credit” to the primarily-visual patterns stored in that part of the neural architecture—as if it had a 100%-confident “prior” that only the visual cortex’s stored patterns could possibly be helpful for the prediction task.
Naïvely, we might think this kind of “prior” is always a bad idea: the more different context signals that a Thought Assessor has, the better its predictive models will be, right? Why restrict them? Two reasons. First, a good prior will lead to faster learning. Second, the Thought Assessors are just one component of a larger system. We shouldn’t take for granted that a more-predictively-accurate Thought Assessor is necessarily a good thing for the larger system.
Here’s a famous example of these kinds of “priors” in psychology: rats can easily learn to freeze in response to a sound that precedes an electric shock, and rats can easily learn to feel nauseous in response to a taste that precedes a bout of vomiting. But not vice-versa! This might reflect, for example, a brain architectural design feature wherein the nausea-predicting Thought Assessor has taste-related context (e.g. from the insular cortex) but not audiovisual-related context (e.g. from the temporal lobe), and vice-versa for the freeze-predicting Thought Assessor. (More on the nausea example shortly.)
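The “prior as restricted context” idea can be sketched as follows, assuming (purely for illustration) that a Thought Assessor is a linear predictor that can only see a fixed subset of concepts. The `ThoughtAssessor` class, its fields, and the concept labels are hypothetical.

```python
# Illustrative sketch: an assessor with restricted "context" can only assign
# credit to the concepts it is wired to see, whatever else is in the thought.

class ThoughtAssessor:
    def __init__(self, context):
        self.context = set(context)  # the only concepts this assessor can see
        self.weights = {}

    def predict(self, active_concepts):
        visible = self.context & set(active_concepts)
        return sum(self.weights.get(c, 0.0) for c in visible)

    def update(self, active_concepts, supervisory_signal, lr=0.5):
        visible = self.context & set(active_concepts)
        error = supervisory_signal - self.predict(active_concepts)
        for c in visible:
            self.weights[c] = self.weights.get(c, 0.0) + lr * error

# A nausea-predicting assessor with taste-related context only (an assumption).
nausea_assessor = ThoughtAssessor(context={"novel_taste"})
thought = {"novel_taste", "tone"}  # both a taste and a sound preceded the nausea
nausea_assessor.update(thought, supervisory_signal=1.0)
```

Here the sound never picks up any nausea association, no matter how reliably it precedes nausea, because it is simply not among this assessor's inputs.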
2. Credit assignment is very sensitive to timing:
Above I suggested “Whatever thought is active right now gets the credit”. But I didn’t say what “right now” means.
Example: Suppose I’m walking down the street, thinking about the TV show that I watched last night. Suddenly I have a sharp pain on my back—somebody punched me. Two things happen in my brain, almost immediately:
The trick is, we want (A) to happen before (B)—otherwise, I’ll wind up with a visceral anticipation of back pain whenever I think about that TV show that I watched last night.
I do in fact think that the brain is able to ensure that (A) happens before (B), at least by and large. (I might get a bit of a spurious association with the TV show.)[4]
3. …And timing can interact with “priors” too!
Conditioned Taste Aversion (CTA) is a phenomenon where, if I get nauseous right now, it causes an aversion to whatever tastes I was exposed to a few hours earlier—not a few seconds earlier, not a few days earlier, just a few hours earlier. (I alluded to CTA above, but not its timing aspect.) The evolutionary reason for this is straightforward: a few hours is presumably how long it typically takes for a toxic food to induce nausea. But how does it work mechanistically?
The insular cortex is the home of neurons that form a generative model of taste sensory inputs. According to “A molecular mechanism underlying gustatory memory trace for an association in insular cortex” by Adaikkan & Rosenblum (2015), these neurons have molecular mechanisms that put them in a special flagged state for the subsequent several hours after they fire.
Then the rule I suggested above (“Whatever thought is active right now gets the credit”) needs to be modified to: “Whatever neurons are in that special flagged state right now get the credit.” (The technical term here is “eligibility trace”.)
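As a rough sketch of that modified rule: suppose each concept remembers when it last fired, and counts as “flagged” for some number of hours afterwards. The six-hour duration, the function names, and the concept labels below are arbitrary assumptions for illustration, not values taken from the paper.

```python
# Illustrative sketch: credit goes to whatever is still in the flagged state,
# not to whatever is active at the moment the supervisory signal arrives.

def flagged_concepts(last_fired_at, now, flag_duration_hours=6.0):
    """Concepts that fired within the last `flag_duration_hours` hours."""
    return {c for c, t in last_fired_at.items()
            if 0.0 <= now - t <= flag_duration_hours}

def assign_credit_with_trace(weights, last_fired_at, now, supervisory_signal, lr=0.5):
    """Delta-rule update applied to the flagged concepts only."""
    eligible = flagged_concepts(last_fired_at, now)
    error = supervisory_signal - sum(weights.get(c, 0.0) for c in eligible)
    for c in eligible:
        weights[c] = weights.get(c, 0.0) + lr * error
    return eligible

# Times are in hours. Lunch was 3 hours before the nausea; the other taste
# was experienced about a week earlier.
last_fired_at = {"lunch_taste": 12.0, "last_week_taste": -150.0}
weights = {}
eligible = assign_credit_with_trace(weights, last_fired_at, now=15.0,
                                    supervisory_signal=1.0)
```

Only the lunch taste is still flagged when the nausea signal arrives, so it alone picks up the aversion.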
4. Credit assignment has a “Finders Keepers” characteristic:
Once you have a way to accurately predict some set of supervisory signals, it makes the corresponding error signal go away, so we stop assigning more credit in those situations. So I think the first good predictive model that our brain comes across, gets to stick around by default. I think this is related to blocking in behaviorist psychology.
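The “Finders Keepers” effect falls out of a simple shared-error learning rule. Below is a toy simulation in the style of the Rescorla-Wagner model from behaviorist psychology, which is a stand-in I am swapping in for illustration rather than a claim about the brain's actual algorithm. The cue names, learning rate, and step counts are arbitrary.

```python
# Toy blocking simulation: all active cues share a single prediction error,
# so once cue "A" predicts the signal, a later-added cue "B" learns almost nothing.

def train(weights, cues, signal, steps, lr=0.3):
    for _ in range(steps):
        error = signal - sum(weights[c] for c in cues)
        for c in cues:
            weights[c] += lr * error

weights = {"A": 0.0, "B": 0.0}
train(weights, ["A"], signal=1.0, steps=50)       # phase 1: A alone predicts the signal
train(weights, ["A", "B"], signal=1.0, steps=50)  # phase 2: B always accompanies A
```

By phase 2 the error signal has already gone away, so `weights["B"]` stays near zero even though B is, by then, a perfectly good predictor too.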
5. The Thought Generator doesn’t have direct voluntary control over credit assignment, but it probably has at least some ability to manipulate it
There’s a sense in which the Thought Generator and Thought Assessors are in an adversarial relationship, i.e. working at cross-purposes. In particular, they are trained to optimize different signals.[5] For example, one time my boss yelled at me, and I very much didn’t want to start crying, but my Thought Assessors assessed that it was an appropriate time to cry, and so I did![6] Given that adversarial relationship, I have a strong presumption that the Thought Generator is not set up to have direct (“voluntary”) control over credit assignment. This also seems to match introspection.
On the other hand, “no direct voluntary control” is quite different from “no control at all”. Again, I don’t have direct voluntary control over crying, but I can nevertheless summon tears, at least a little bit, via the roundabout strategy of imagining baby kittens shivering in the cold rain (Post #6, Section 6.3.3).
So, suppose I currently hate X, but I want to will myself to really like X. It seems to me that this task is not straightforward, but also that it’s not impossible. It may take some self-reflective skill, mindfulness, planning, and so on, but if the Thought Generator thinks just the right thoughts at the right time, it can probably pull it off.
And an AGI might have an easier time than a human! After all, unlike in humans, an AGI may be able to literally hack into its own Thought Assessor, and change the settings however it likes. And that nicely transitions us to the next topic…
9.4 Wireheading: possible but not inevitable
9.4.1 What is wireheading?
The concept of “wireheading” gets its name from the idea of sticking a wire into a certain part of your brain, and running current through it. If you do it right, it could directly elicit ecstatic pleasure, deep satisfaction, or other nice feelings, depending on the exact part of the brain that the wire is in. Wireheading can be a much easier way to elicit those nice feelings, compared to, y’know, finding True Love, cooking the perfect soufflé, winning the praise of your childhood hero, and the like.
In the classic, nightmare-inducing, wireheading experiment (see “Brain Stimulation Reward”), a wire in a rat’s brain is activated when the rat presses a lever. The rat will press the lever over and over, not stopping to eat or drink or rest, even for 24 hours straight, until eventually collapsing from exhaustion. (ref)
Anyway, the concept of wireheading has been analogized to AI. The idea here is that a reinforcement learning agent is designed to maximize its reward. So, maybe it will hack into its own RAM, and overwrite the “reward” register to infinity! Next I’ll talk about whether that’s likely to happen, and then how worried we should be if it does.
9.4.2 Will brain-like AGIs want to wirehead?
Well, first, do humans want to wirehead? I need to distinguish two things:
In the human case, we can (very roughly) equate a wireheading drive with “the desire to feel good”, i.e. hedonism.[7] If so, it would suggest that (almost) all humans have a “weak wireheading drive” but not a “strong wireheading drive”. We want to feel good, but we generally care at least a little bit about other things too.
How do we make sense of that? Well, think of the previous two sections above. For a human to want reward: first, it needs to have a reward concept in its world-model, and second, credit assignment needs to flag that concept as being “good”. (I’m using the term “reward concept” in a broad sense that would also include a “feeling good” concept.[7])
Given that, and the notes on credit assignment in Section 9.3 above, I figure:
(There’s also a possibility that a weak-wireheader will self-modify into a strong-wireheader; more on that kind of thing in the next post.)
9.4.3 Wireheading AGIs would be dangerous, not merely unhelpful
There’s an unhelpful intuition that trips up many people: When we imagine a wireheading AGI, we compare it to a human in the midst of an intense recreational drug high. Such a human is certainly not methodically crafting, revising, and executing a brilliant, devious plan to take over the world. While they’re high, they’re probably just closing their eyes and feeling good, or maybe they’re dancing or something; it depends on the drug. So this intuition suggests that wireheading is a capabilities problem, but not a catastrophic accident risk.
I think there’s a kernel of truth to this intuition: as discussed in Posts #6–#7, valence signals guide cognition and planning, so if valence gets stuck onto a very positive setting, cognition and planning become impossible.
But it’s wrong to draw the conclusion that wireheading is not a catastrophic accident risk.[8] Consider what happens before the AGI starts wireheading. If it entertains the plan “I will wirehead”, that thought would presumably get a high valence from the Steering Subsystem. But if it thinks about it a bit more, it would realize that its expectation should be “I will wirehead for a while, and then the humans will shut me down and fix the vulnerability so that I can’t wirehead anymore.” Now the plan doesn’t sound so great! So the AGI may come up with a better plan, one that involves things like seizing control of its local environment, and/or the power grid, and/or the whole world, and/or building itself a “bodyguard AI” that does all those things for it while it wireheads, etc. So really, I think wireheading does carry a risk of catastrophic accidents, including even the kinds of human-extinction-level accident risks that I discussed in Post #1.
9.5 AGIs do NOT judge plans based on their expected future rewards
This directly follows from the previous section, but I want to elevate it to a top-level heading, as “AGIs will try to maximize future rewards” is a common claim.
If the Thought Generator proposes a plan, it may also invoke a representation of that plan’s likely consequences. And then the Thought Assessors will evaluate whether those likely consequences merit positive or negative valence. They will do so according to their current settings. And the Steering Subsystem will endorse or reject the plan largely on that basis. Those current settings need not align with “expected future rewards”.
The Thought Generator’s predictive world-model can even “know” about some discrepancy between “expected future rewards” and the Thought Assessor’s assessment of expected future reward. It doesn’t matter! The Thought Assessor’s assessments won’t automatically correct themselves, and will still continue to determine what plans the AGI will execute.
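To make that concrete, here is a minimal toy sketch in Python. Everything in it is a made-up illustration rather than a real design: the `Plan` class, the concept names, the weights, and the numbers are all assumptions. The point is only that the plan selector reads the Thought Assessors’ current settings and never looks at the “true” future reward, even though that number is sitting right there in the world-model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Plan:
    name: str
    concepts: Dict[str, float]   # concepts this plan activates in the world-model
    true_future_reward: float    # what the world-model "knows" would really happen


def assessed_valence(plan: Plan, assessor_weights: Dict[str, float]) -> float:
    """Valence under the Thought Assessors' *current* settings (hypothetical linear form)."""
    return sum(assessor_weights.get(c, 0.0) * x for c, x in plan.concepts.items())


def choose_plan(plans: List[Plan], assessor_weights: Dict[str, float]) -> Plan:
    # Note that true_future_reward is deliberately never consulted here.
    return max(plans, key=lambda p: assessed_valence(p, assessor_weights))


# Hypothetical current settings: credit assignment has never flagged "wirehead" as good.
assessor_weights = {"finish_project": 0.8, "wirehead": 0.0}

plans = [
    Plan("work on project", {"finish_project": 1.0}, true_future_reward=1.0),
    Plan("wirehead", {"wirehead": 1.0}, true_future_reward=1000.0),
]

chosen = choose_plan(plans, assessor_weights)
print(chosen.name)
```

In this toy setup the agent picks “work on project”, despite “knowing” that wireheading would deliver far more reward.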
9.5.1 Human example
Here’s a human example. I’ll talk about cocaine instead of wireheading. (They’re not so different, but cocaine is more familiar.)
True fact: I’ve never done cocaine. Suppose I think to myself right now “maybe I’ll do cocaine”. Intellectually, I’m confident that if I did cocaine, I would have, umm, lots of very intense feelings. But viscerally, imagining myself doing cocaine is mostly neutral! It doesn’t make me feel much of anything in particular.
So for me right now, my intellectual expectations (of what would happen if I did cocaine) are out of sync with my visceral expectations. Apparently my Thought Assessors took a look at the thought “maybe I’ll do cocaine”, and collectively shrugged: “Nothing much going on here!” Recall that the Thought Assessors work by credit assignment (Section 9.3 above), and apparently the credit assignment algorithm just doesn’t update strongly on hearsay about what cocaine feels like, nor does it update strongly on my reading neuroscience papers about how cocaine binds to dopamine transporters.
By contrast, the credit assignment algorithm does update strongly on a direct, first-person experience of intense feelings.
And thus, people can get addicted to cocaine after using cocaine, whereas people don’t get addicted to cocaine after reading about cocaine.
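Here is one hedged way to picture that asymmetry in code. I am assuming (this is my illustration, not a claim about the actual algorithm) that credit assignment nudges the valence of a concept toward whatever signal the Steering Subsystem emits at that moment, and that reading about cocaine produces only a faint signal while a first-person experience produces a large one. The function name, learning rate, and signal magnitudes are all hypothetical.

```python
def credit_assignment_update(valence: float, steering_signal: float,
                             learning_rate: float = 0.5) -> float:
    """Move a concept's assessed valence part of the way toward the current signal."""
    return valence + learning_rate * (steering_signal - valence)


HEARSAY_SIGNAL = 0.1        # hypothetical: faint signal while reading a paper
FIRST_PERSON_SIGNAL = 10.0  # hypothetical: intense signal during direct experience

# Twenty sessions of reading neuroscience papers about cocaine.
valence_after_reading = 0.0
for _ in range(20):
    valence_after_reading = credit_assignment_update(valence_after_reading, HEARSAY_SIGNAL)

# Three direct, first-person experiences.
valence_after_using = 0.0
for _ in range(3):
    valence_after_using = credit_assignment_update(valence_after_using, FIRST_PERSON_SIGNAL)

print(valence_after_reading, valence_after_using)
```

Under these assumptions, no amount of reading pushes the valence above the faint hearsay signal, whereas a handful of direct experiences moves it most of the way to the intense one.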
9.5.2 Relation to “observation-utility agents”
For a more theoretical perspective, here is Abram Demski (sorry for the jargon—if you don’t know what AIXI is, don’t worry, you can still probably get the gist):
Our brain-like AGI, despite being “RL”,[9] is really closer to the “observation-utility agent” paradigm: the Thought Assessors and Steering Subsystem work together to evaluate plans / courses-of-action, just as Abram’s “box” does.
However, the brain-like AGI has an additional twist that the Thought Assessors get gradually updated over time by “credit assignment” (Section 9.3 above).
Thus we wind up with something vaguely like the following:
Note that we don’t want the credit assignment process to perfectly “converge”—i.e., to reach a place where the utility function perfectly matches the reward function (or in our terminology, reach a place where the Thought Assessors never get updated because they evaluate plans in a way that always perfectly matches the Steering Subsystem).
Why don’t we want perfect convergence? Because perfect convergence would lead to wireheading! And wireheading is bad and dangerous! (Section 9.4.3 above.) Yet at the same time, we need some amount of convergence, because the reward function is supposed to be sculpting the AGI’s goals! (Remember, the Thought Assessors start out random and hence useless.) It’s a Catch-22! I’ll return to this topic in the next post.
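A toy sketch may help show what partial convergence looks like. This is a loose illustration under my own assumptions, not a description of any real system: the plan names, numbers, and the `run_agent` helper are hypothetical. The agent picks plans using its current utility function, and credit assignment only updates the utility of plans it has actually executed.

```python
from typing import Dict, List


def run_agent(utility: Dict[str, float], reward_fn: Dict[str, float],
              steps: int, learning_rate: float = 0.3) -> List[str]:
    """Choose plans with the current utility; update it slowly by credit assignment."""
    history = []
    for _ in range(steps):
        plan = max(utility, key=utility.get)   # decided by *current* utility, not by reward
        reward = reward_fn[plan]                # ground truth arrives only for the executed plan
        utility[plan] += learning_rate * (reward - utility[plan])
        history.append(plan)
    return history


# Hypothetical ground-truth reward function, including a huge reward for wireheading.
reward_fn = {"help_human": 1.0, "idle": 0.0, "wirehead": 100.0}
# Hypothetical starting utility: slightly favors helping, indifferent to wireheading.
utility = {"help_human": 0.2, "idle": 0.1, "wirehead": 0.0}

history = run_agent(utility, reward_fn, steps=10)
print(history[-1], utility)
```

In this run, the utility of “help_human” converges toward its reward, so the reward function really is sculpting the agent’s goals. But the utility of “wirehead” never moves, because that plan is never tried. If the utility function instead matched the reward function everywhere, the agent would pick “wirehead” immediately.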
(Astute readers may have also noticed another problem: the utility-maximizer may try to maintain its goals by sabotaging the credit-assignment process. I’ll elaborate on that in the next post as well.)
9.6 Thought Assessors help with interpretability
Here, yet again, is that diagram from Post #6:
Over somewhere on the top right, there’s a little supervised learning module that answers the question: “Given everything I know, including not only sensory inputs and memories but also the course-of-action implicit in my current thought, to what extent do I anticipate tasting something sweet?” As discussed earlier (Post #6), this Thought Assessor plays the dual roles of (1) inducing appropriate homeostatic actions (e.g. maybe salivating), and (2) helping the Steering Subsystem judge whether my current thought is valuable, or whether it’s a lousy thought that should be tossed out via a phasic dopamine pause.
Now I want to offer a third way to think about the same thing.
Way back in Post #3, I mentioned that the Steering Subsystem is “stupid”. It has no common-sense understanding of the world. The Learning Subsystem is thinking all these crazy thoughts about paintings and algebra and tax law, and the Steering Subsystem is sitting there with no clue what’s going on.
Well, the Thought Assessors help mitigate that problem! They give the Steering Subsystem a bunch of clues about what the Learning Subsystem is thinking about and planning, in a language that the Steering Subsystem can understand. So this is a bit like neural network interpretability.
I’ll call this “ersatz interpretability”. (“Ersatz” is a lovely word that means “cheap inferior imitation”.) I figure that real interpretability should be defined as “the power to look in any part of a learned-from-scratch model and really understand what it’s doing and why and how”. Ersatz interpretability falls far short of that. We get the answer to some discrete number of predetermined questions—e.g. “Does this thought involve eating, or at least things that have been previously associated with eating?” And that’s it. But still, better than nothing.
This idea will be important for later posts.
(I note that you can do this kind of thing with any actor-critic RL agent, whether brain-like or not, by having a multi-dimensional value function, possibly including “pseudo” value functions that are only used for monitoring; see here, and comments here.)
9.6.1 Tracking which “innate drive” was ultimately responsible for a high-valence plan being high-valence
Back in Post #3, I talked about how brains have multiple different “innate drives”, including a drive to satisfy curiosity, a drive to eat when hungry, a drive to avoid pain, a drive to have high status, and so on. Brain-like AGIs will presumably have multiple drives too. I don’t know exactly what those drives will be, but imagine things vaguely like curiosity drive, altruism drive, norm-following drive, do-what-the-human-wants-me-to-do drive, etc. (More on this in future posts.)
If these different drives all contribute to total reward / valence, then we can and should have valence Thought Assessors (a.k.a. value functions in RL terminology) for the contribution of each drive.
As discussed in previous posts, every time the brain-like AGI thinks a thought, it’s thinking it because that thought is more rewarding than alternative thoughts that it could be thinking instead. And thanks to ersatz interpretability, we can inspect the system and know immediately how the various different innate drives are contributing to the fact that this thought is rewarding!
Better yet, this works even if we don’t understand what the thought is about, and even if the reward-predicting part of the thought is many steps removed from the direct effects of the innate drives. For example, maybe this thought is rewarding because it’s executing a certain metacognitive strategy which has proven instrumentally useful for brainstorming, which in turn has proven instrumentally useful for theorem-proving, which in turn has proven instrumentally useful for code-debugging, and so on through ten more links until we get to one of the innate drives.
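Here is a small sketch of how that attribution could survive a chain of instrumental links. I am assuming plain TD(0) learning with one value function per drive, which is a stand-in I picked for illustration; the chain, drive names, and hyperparameters are likewise hypothetical. Only the last step in the chain directly triggers an innate drive, yet the per-drive value functions at the first step still report which drive is ultimately responsible.

```python
DRIVES = ("curiosity", "altruism")  # hypothetical drive names

# Hypothetical chain of thoughts, each instrumentally useful for the next.
CHAIN = ["metacognitive_strategy", "brainstorming", "theorem_proving"]

# Per-drive reward on completing each step; only the last step touches an innate drive.
REWARDS = {
    "metacognitive_strategy": {"curiosity": 0.0, "altruism": 0.0},
    "brainstorming": {"curiosity": 0.0, "altruism": 0.0},
    "theorem_proving": {"curiosity": 1.0, "altruism": 0.0},
}


def train_per_drive_values(sweeps: int = 200, lr: float = 0.5, gamma: float = 0.9):
    """TD(0) with a separate value estimate per drive (multi-dimensional value function)."""
    values = {s: {d: 0.0 for d in DRIVES} for s in CHAIN}
    for _ in range(sweeps):
        for i, state in enumerate(CHAIN):
            next_state = CHAIN[i + 1] if i + 1 < len(CHAIN) else None
            for d in DRIVES:
                bootstrap = values[next_state][d] if next_state else 0.0
                target = REWARDS[state][d] + gamma * bootstrap
                values[state][d] += lr * (target - values[state][d])
    return values


values = train_per_drive_values()
first_step = values["metacognitive_strategy"]
responsible_drive = max(first_step, key=first_step.get)
print(first_step, responsible_drive)
```

Reading off `first_step` tells us that the metacognitive strategy is valued because of curiosity, two steps downstream, without our needing to understand what the strategy itself is.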
9.6.2 Is ersatz interpretability reliable, even for very powerful AGIs?
If we have a very powerful AGI, and it spawns a plan, and the “ersatz interpretability” system says “this plan almost definitely won’t lead to violating human norms”, can we trust it? Good question! But it turns out to be essentially equivalent to the question of “inner alignment”, which I’ll discuss in the next post. Hold that thought.
9.7 “Real-time steering”: The Steering Subsystem can redirect the Learning Subsystem—including its deepest desires and long-term goals—in real time
In Atari-playing model-free RL agents, if you change the reward function, the agent’s behavior changes very gradually. Whereas a neat feature of our brain-like AGI motivation system is that we can immediately change not only the agent’s behavior, but even the agent’s very-long-term plans, and its innermost motivations and desires!
The way this works is: as above (Section 9.6.1), we can have multiple Thought Assessors that feed into the reward function. For example, one might assess whether the current thought will lead to satisfying the AGI’s curiosity drive, another its altruism drive, etc. The Steering Subsystem combines these into an aggregate reward. But the function that it uses to do so is a hardcoded, human-legible function—e.g., it might be as simple as a weighted average. Hence, we can change that Steering Subsystem function in real time whenever we want—in the weighted-average example, we could change the weights.
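As a minimal sketch of the weighted-average example (the plan names, drive names, and numbers below are all hypothetical, chosen only for illustration): the per-drive assessments stay fixed, and changing the hardcoded weights immediately changes which plan wins, with no retraining.

```python
from typing import Dict


def aggregate_valence(assessments: Dict[str, float], weights: Dict[str, float]) -> float:
    """Hardcoded, human-legible combination: a weighted average over drives."""
    total_weight = sum(weights.values())
    return sum(weights[d] * assessments[d] for d in weights) / total_weight


# Hypothetical per-drive Thought Assessor outputs for two candidate plans.
plans = {
    "explore_lab": {"curiosity": 0.9, "altruism": 0.2},
    "help_operator": {"curiosity": 0.2, "altruism": 0.8},
}


def best_plan(weights: Dict[str, float]) -> str:
    return max(plans, key=lambda name: aggregate_valence(plans[name], weights))


# Equal weights: the curiosity-heavy plan wins.
before = best_plan({"curiosity": 1.0, "altruism": 1.0})

# An operator turns the curiosity weight down; the learned assessors are untouched.
after = best_plan({"curiosity": 0.1, "altruism": 1.0})

print(before, after)
```

Here the preferred plan flips from “explore_lab” to “help_operator” the moment the weights change.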
We saw an example in Post #7: When you’re very nauseous, not only does eating a cake become aversive, but even planning to eat a cake becomes mildly aversive. Heck, even the abstract concept of cake becomes mildly aversive!
And of course, we’ve all had those times when we’re tired, or sad, or angry, and all of a sudden even our most deeply-rooted life goals temporarily lose their appeal.
When you’re driving a car, it is a critically important safety requirement that when you turn the steering wheel, the wheels respond instantaneously. By the same token, I expect that it will be a critically important safety requirement for humans to be able to change an AGI’s deepest desires instantaneously when we press the appropriate button. So I think this is an awesome feature, and I’m happy to have it, even if I’m not 100% sure exactly what to do with it. (In a car, you can see where you’re going, whereas understanding what the AGI is trying to do at any given moment is much more fraught.)
(Again, as in the previous section, this idea of “real-time steering” applies to any actor-critic RL algorithm, not just “brain-like” ones. All it requires is a multi-dimensional reward, which then trains a multi-dimensional value function.)
Changelog
July 2024: Since the initial version, I’ve made only minor changes, including updating the diagrams (in line with changes to the analogous diagrams in previous posts), updating some links and wording, and using the word “valence” (as defined in my Valence series) instead of “reward” or “value” in lots of places, since I think the latter are liable to cause confusion in this context.
Here’s a plausible human circular preference. You won a prize! Your three options are: (A) 5 lovely plates, (B) 5 lovely plates and 10 ugly plates, (C) 5 OK plates.
No one has done this exact experiment to my knowledge, but plausibly (based on discussion of a similar situation in Thinking Fast And Slow chapter 15) this is a circular preference in at least some people: When people see just A & B, they'll pick B because "it's more stuff, I can always keep the ugly ones as spares or use them for target practice or whatever". When they see just B & C, they'll pick C because "the average quality is higher". When they see just C & A, they'll likewise pick A because "the average quality is higher".
So what we have is two different preferences (1) “I want to have a prettier collection of stuff, not an uglier collection”, and (2) “I want extra free plates”. The comparison of B & C or C & A makes (1) salient, while the comparison of A & B makes (2) salient.
(If you’re thinking “that wouldn’t be a circular preference for me!”, you’re probably right. Different people are different.)
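For concreteness, here is a toy model of that cycle. It rests on a big assumption of mine, namely a crude rule for which preference becomes salient: “extra free plates” wins when one option is simply the other plus extras, and “average quality” wins otherwise. The quality scores are made up too.

```python
OPTIONS = {
    "A": {"lovely": 5, "ok": 0, "ugly": 0},
    "B": {"lovely": 5, "ok": 0, "ugly": 10},
    "C": {"lovely": 0, "ok": 5, "ugly": 0},
}
QUALITY = {"lovely": 2, "ok": 1, "ugly": 0}  # hypothetical scores


def plate_count(option: str) -> int:
    return sum(OPTIONS[option].values())


def average_quality(option: str) -> float:
    total = sum(QUALITY[kind] * n for kind, n in OPTIONS[option].items())
    return total / plate_count(option)


def is_same_plus_extras(bigger: str, smaller: str) -> bool:
    """True if `bigger` contains everything in `smaller`, plus something more."""
    contains_all = all(OPTIONS[bigger][k] >= OPTIONS[smaller][k] for k in QUALITY)
    return contains_all and plate_count(bigger) > plate_count(smaller)


def choose(x: str, y: str) -> str:
    # Preference (2), "extra free plates", is salient when one option strictly contains the other.
    if is_same_plus_extras(x, y):
        return x
    if is_same_plus_extras(y, x):
        return y
    # Otherwise preference (1), "prettier collection", is salient.
    return max((x, y), key=average_quality)


print(choose("A", "B"), choose("B", "C"), choose("C", "A"))
```

With this salience rule, B beats A, C beats B, and A beats C, which is a cycle.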
You might be thinking: “why make an AGI with human-like faulty intuitions in the first place”?? Well, we’ll try not to, but I bet that at least some human “departures from rationality” ultimately arise from the fact that predictive world-models are big complicated things, and there are only so many ways to efficiently query them, and thus our AGIs will have systematic reasoning errors that we cannot fix at the source-code level, but rather need to fix by asking our AGI to read Scout Mindset or whatever. Things like availability bias, anchoring bias, and hyperbolic discounting might be in this category. To be clear, some foibles of human reasoning are probably less likely to afflict AGIs; to pick one example, if we make a brain-like AGI with no innate “drive to be liked / admired”, then it presumably wouldn’t have the failure mode discussed in the blog post Belief As Attire.
I kid. In fact I found The Sequences to be an enjoyable read.
I think the real story here has various complicating factors that I’m leaving out, including continued credit assignment during memory recall, and other, non-credit-assignment, changes to the world-model.
Why do I say that the Thought Generator and Thought Assessor are working at cross-purposes? Here’s one way to think of it: (1) the Steering Subsystem and Thought Assessors are working together to calculate a certain valence function which (in our ancestors’ environment) approximates “expected inclusive genetic fitness”; (2) the Thought Generator is searching for thoughts that maximize that function. Now, given that the Thought Generator is searching for ways to make the valence function return very high valence, it follows that the Thought Generator is also searching for ways to distort the Thought Assessor calculations such that the valence function stops being a good approximation to “expected inclusive genetic fitness”. This is an unintended and bad side-effect (from the perspective of inclusive genetic fitness), and that problem can be mitigated by making it as difficult as possible for the Thought Generator to manipulate the settings of the Thought Assessors. See my post Reward Is Not Enough for some related discussion.
The story has a happy ending: I found a different job with a non-abusive boss, and also wound up with a fruitful side-interest in understanding high-functioning psychopaths.
“The desire to feel good” is not quite equivalent to “the desire to have a high valence signal”, but they’re somewhat related for reasons here.
See discussion in Superintelligence p. 149.
I think when Abram uses the term “RL agent” in that quote, he is presupposing that the agent is built by not just any RL algorithm, but more specifically an RL algorithm which is guaranteed to converge to a unique ‘optimal’ agent, and which has in fact already finished converging.