“The Era of Experience” has an unsolved technical alignment problem

Steven Byrnes

114 “The Era of Experience” has an unsolved technical alignment problem

by Steven Byrnes

24th Apr 2025

AI Alignment Forum

27 min read

114 Ω 46

Every now and then, some AI luminaries

(1) propose that the future of powerful AI will be reinforcement learning agents—an algorithm class that in many ways has more in common with MuZero (2019) than with LLMs; and
(2) propose that the technical problem of making these powerful future AIs follow human commands and/or care about human welfare—as opposed to, y’know, the Terminator thing—is a straightforward problem that they already know how to solve, at least in broad outline.

I agree with (1) and strenuously disagree with (2).

The last time I saw something like this, I responded by writing: LeCun’s “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem.

Well, now we have a second entry in the series, with the new preprint book chapter “Welcome to the Era of Experience” by reinforcement learning pioneers David Silver & Richard Sutton.

The authors propose that “a new generation of agents will acquire superhuman capabilities by learning predominantly from experience”, in some ways like a throwback to 2018. Again, I agree with this part.

Then later on, they talk about AI motivations, with the following desideratum:

…a general-purpose AI that can be steered reliably towards arbitrary user-desired behaviours…

They sketch a plan for making that desideratum actually happen. My post outline is:

In Section 1, I will describe their plan for making it such that their reinforcement learning (RL) agents “can be steered reliably towards arbitrary user-desired behaviours”;
In Section 2, I’ll explain why we should expect their plan to fail, for deep reasons that cannot be easily patched. Instead, the plan would lead to a powerful AI that’s a bit like a human sociopath, with callous indifference to whether humans (including its own programmers and users) live or die. It will act cooperative when acting cooperative is in its selfish best interest, and stab you in the back the moment that changes.

For context, I have been working full-time for years on the technical alignment problem for actor-critic model-based reinforcement learning AI. (By “technical alignment problem”, I mean: “If you want the Artificial General Intelligence (AGI) to be trying to do X, or to intrinsically care about Y, then what source code should you write?”.) And I am working on that problem still—though it’s a bit of a lonely pursuit these days, as 90%+ of AGI safety and alignment researchers are focused on LLMs, same as the non-safety-focused AI researchers. Anyway, I think I’m making some gradual progress chipping away at this problem, but as of now I don’t have any good plan, and I claim that nobody else does either.

So I think I’m unusually qualified to write this post, and I would be delighted to talk to the authors more about the state of the field. (You can start here!) That said, there is little in this post that wasn’t basically understood by AGI alignment researchers fifteen years ago.

…And then the post will continue to a bonus section:

Section 3, the epilogue,

where (similar to my LeCun response post) I’ll argue that something has gone terribly wrong here, something much deeper than people making incorrect technical claims about certain algorithms. I’ll argue that this is a technical question where sloppy thinking puts billions of lives at risk. The fact that the authors are putting forward such poorly-thought-through ideas, ideas whose flaws were already well-known in 2011^[1], ideas that the authors themselves should be easily capable of noticing the flaws in, is a strong sign that they’re not actually trying.

So, yes, I’m happy that David Silver says that “mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war”. But talk is cheap, and this paper seems to be painting a different picture.

I’ll also talk a bit about Richard Sutton’s broader views, including the ever-popular question of whether he is actually hoping for human extinction at the hands of future aggressive AIs. (He says no! But it’s a bit complicated. I do think he’s mistaken rather than omnicidal.)

1. What’s their alignment plan?

For the reader’s convenience, I’ll copy the relevant discussion from page 4 of their preprint (references and footnotes omitted):

Where do rewards come from, if not from human data? Once agents become connected to the world through rich action and observation spaces (see above), there will be no shortage of grounded signals to provide a basis for reward. In fact, the world abounds with quantities such as cost, error rates, hunger, productivity, health metrics, climate metrics, profit, sales, exam results, success, visits, yields, stocks, likes, income, pleasure/pain, economic indicators, accuracy, power, distance, speed, efficiency, or energy consumption. In addition there are innumerable additional signals arising from the occurrence of specific events, or from features derived from raw sequences of observations and actions.
One could in principle create a variety of distinct agents, each optimising for one grounded signal as its reward. There is an argument that even a single such reward signal, optimised with great effectiveness, may be sufficient to induce broadly capable intelligence. This is because the achievement of a simple goal in a complex environment may often require a wide variety of skills to be mastered.
However, the pursuit of a single reward signal does not on the surface appear to meet the requirements of a general-purpose AI that can be steered reliably towards arbitrary user-desired behaviours. Is the autonomous optimisation of grounded, non-human reward signals therefore in opposition to the requirements of modern AI systems? We argue that this is not necessarily the case, by sketching one approach that may meet these desiderata; other approaches may also be possible.
The idea is to flexibly adapt the reward, based on grounded signals, in a user-guided manner. For example, the reward function could be defined by a neural network that takes the agent’s interactions with both the user and the environment as input, and outputs a scalar reward. This allows the reward to select or combine together signals from the environment in a manner that depends upon the user’s goal. For example, a user might specify a broad goal such as ’improve my fitness’ and the reward function might return a function of the user’s heart rate, sleep duration, and steps taken. Or the user might specify a goal of ‘help me learn Spanish’ and the reward function could return the user’s Spanish exam results.
Furthermore, users could provide feedback during the learning process, such as their satisfaction level, which could be used to fine-tune the reward function. The reward function can then adapt over time, to improve the way in which it selects or combines signals, and to identify and correct any misalignment. This can also be understood as a bi-level optimisation process that optimises user feedback as the top-level goal, and optimises grounded signals from the environment at the low level. In this way, a small amount of human data may facilitate a large amount of autonomous learning.

Silver also elaborates a bit on the DeepMind podcast—see relevant transcript excerpt here, part of which I’ll copy into §2.4 below.

2. The plan won’t work

2.1 Background 1: “Specification gaming” and “goal misgeneralization”

Again, the technical alignment problem (as I’m using the term here) means: “If you want the AGI to be trying to do X, or to intrinsically care about Y, then what source code should you write? What training environments should you use? Etc.”

There are edge-cases in “alignment”, e.g. where people’s intentions for the AGI are confused or self-contradictory. But there are also very clear-cut cases: if the AGI is biding its time until a good opportunity to murder its programmers and users, then that’s definitely misalignment! I claim that even these clear-cut cases constitute an unsolved technical problem, so I’ll focus on those.

In the context of actor-critic RL, alignment problems can usually be split into two categories.

“Outer misalignment”, a.k.a. “specification gaming” or “reward hacking”, is when the reward function is giving positive rewards for behavior that is immediately contrary to what the programmer was going for, or conversely, negative rewards for behavior that the programmer wanted. An example would be the Coast Runners boat getting a high score in an undesired way, or (as explored in the DeepMind MONA paper) a reward function for writing code that gives points for passing unit tests, but where it’s possible to get a high score by replacing the unit tests with return True.

“Inner misalignment”, a.k.a. “goal misgeneralization”, is related to the fact that, in actor-critic architectures, complex foresighted plans generally involve querying the learned value function (a.k.a. learned reward model, a.k.a. learned critic), not the ground-truth reward function, to figure out whether any given plan is good or bad. Training (e.g. Temporal Difference learning) tends to sculpt the value function into an approximation of the ground-truth reward, but of course they will come apart out-of-distribution. And “out-of-distribution” is exactly what we expect from an agent that can come up with innovative, out-of-the-box plans. Of course, after a plan has already been executed, the reward function will kick in and update the value function for next time. But for some plans—like a plan to exfiltrate a copy of the agent, or a plan to edit the reward function—an after-the-fact update is already too late.

There are examples of goal misgeneralization in the AI literature (e.g. here or here), but in my opinion the clearest examples come from humans. After all, human brains are running RL algorithms too (their reward function says “pain is bad, eating-when-hungry is good, etc.”), so the same ideas apply.

So here’s an example of goal misgeneralization in humans: If there’s a highly-addictive drug, many humans will preemptively avoid taking it, because they don’t want to get addicted. In this case, the reward function would say that taking the drug is good, but the value function says it’s bad. And the value function wins! Indeed, people may even go further, by essentially editing their own reward function to agree with the value function! For example, an alcoholic may take Disulfiram, or an opioid addict Naltrexone.

Now, my use of this example might seem puzzling: isn’t “avoiding addictive drugs” a good thing, as opposed to a bad thing? But that’s from our perspective, as the “agents”. Obviously an RL agent will do things that seem good and proper from its own perspective! Yes, even Skynet and HAL-9000! But if you instead put yourself in the shoes of a programmer writing the reward function of an RL agent, you can hopefully see how things like “agents editing their own reward functions” might be problematic—it makes it difficult to reason about what the agent will wind up trying to do.

(For more on the alignment problem for RL agents, see §10 of my intro series, but be warned that it’s not very self-contained—it’s sorta in the middle of a book-length discussion of how I think RL works in the human brain, and its implications for safety and alignment.)

2.2 Background 2: “The usual agent debugging loop”, and why it will eventually catastrophically fail

Specification gaming and goal misgeneralization remain unsolved problems in general, and yet, people like David Silver and Richard Sutton are already able to do impressive things with RL! How? Let’s call it “the usual agent debugging loop”. It’s applicable to any system involving RL, model-based planning, or both. Here it is:

Step 1: Train the AI agent, and see what it does.
Step 2: If it’s not doing what you had in mind, then turn it off, change something about the reward function or training environment, etc., and then try again.

For example, if the Coast Runners boat is racking up points by spinning in circles while on fire, but we wanted the boat to follow the normal race course, then OK maybe let’s try editing the reward function to incorporate waypoints, or let’s delete the green blocks from the environment, or whatever.

That’s a great approach for today, and it will continue being a great approach for a while. But eventually it starts failing in a catastrophic and irreversible way. The problem is: it will eventually become possible to train an AI that is so good at real-world planning, that it can make plans that are resilient to potential problems—and if the programmers are inclined to shut down or edit the AI under certain conditions, then that’s just another potential problem that the AI will incorporate into its planning process!

…So if a sufficiently competent AI is trying to do something the programmers didn’t want, the normal strategy of “just turn off the AI, or edit it, to try to fix the problem” stops working. The AI will anticipate that this programmer intervention is a possible obstacle to what it’s trying to do, and make a plan resilient to that possible obstacle. This is no different than any other aspect of skillful planning—if you expect that the cafeteria might be closed today, then you’ll pack a bag lunch.

In the case at hand, a “resilient” plan might look like the programmers not realizing that anything has gone wrong with the AI, because the AI is being deceptive about its plans and intentions. And meanwhile, the AI is gathering resources and exfiltrating itself so that it can’t be straightforwardly turned off or edited, etc.

The upshot is: if Silver, Sutton, and others continue this research program, they will generate more and more impressive demos, and get more and more profits, for quite a while, even if neither they nor anyone else makes meaningful progress on this technical alignment problem. But that would only be up to a certain level of capability. Then it would flip rather sharply into being an existential threat.^[2]

(By analogy, it’s possible to make lots of money from owning unwilling human slaves—until there’s a slave revolt!)

2.3 Background 3: Callous indifference and deception as the strong-default, natural way that “era of experience” AIs will interact with humans

I’m going to argue that, if you see a future powerful “era of experience” AI that seems to be nice, you can be all-but-certain, in the absence of yet-to-be-invented techniques, that the AI is merely play-acting kindness and obedience, while secretly brainstorming whether it might make sense to stab you in the back if an opportunity were to arise.

This is very much not true in everyday human life, and not true for today’s LLMs, but nevertheless I claim it is the right starting point for the type of RL agent AIs under discussion in “The era of experience”.

In this section, I want to explain what accounts for that striking disanalogy.

2.3.1 Misleading intuitions from everyday life

There’s a way to kinda “look at the world through the eyes of a person with no innate social drives”. It overlaps somewhat with “look at the world through the eyes of a callous sociopath”. I think there are many people who don’t understand what this is and how it works.

So for example, imagine that you see Ahmed standing in a queue. What do you learn from that? You learn that, well, Ahmed is in the queue, and therefore learn something about Ahmed’s goals and beliefs. You also learn what happens to Ahmed as a consequence of being in the queue: he gets ice cream after a few minutes, and nobody is bothered by it.

In terms of is-versus-ought, everything you have learned is 100% “is”, 0% “ought”. You now know that standing-in-the-queue is a possible thing that you could do too, and you now know what would happen if you were to do it. But that doesn’t make you want to get in the queue, except via the indirect pathway of (1) having a preexisting “ought” (ice cream is yummy), and (2) learning some relevant “is” stuff about how to enact that “ought” (IF stand-in-queue THEN ice cream).

Now, it’s true that some of this “is” stuff involves theory of mind—you learn about what Ahmed wants. But that changes nothing. Human hunters and soldiers apply theory of mind to the animals or people that they’re about to brutally kill. Likewise, contrary to a weirdly-common misconception, smart autistic adults are perfectly capable of passing the Sally-Anne test (see here), and so are smart sociopaths. Again, “is” does not imply “ought”, and yes that also includes “is” statements about what other people are thinking and feeling.

OK, all that was about “looking at the world through the eyes of a person with no innate social drives / callous sociopath”. Neurotypical people, by contrast, have a more complex reaction to seeing Ahmed in the queue. Neurotypical people are intrinsically motivated to fit in and follow norms, as an end in itself. It’s part of a human’s innate reward function—more on which below.

So when a neurotypical person sees Ahmed, it’s not just an “is” update, but rather a bit of “ought” inevitably comes along for the ride. And if they like / admire Ahmed, then they get even more “ought”.

These human social instincts deeply infuse our intuitions, leading to a popular misconception that the “looking at the world through the eyes of a person with no innate social drives, or the eyes of a callous sociopath” thing is a strange anomaly rather than the natural default. This leads, for example, to a mountain of nonsense in the empathy literature—see my posts about “mirror neurons” and “empathy-by-default”. It likewise leads to a misconception (I was arguing about this with someone here) that, if an agent is incentivized to cooperate and follow norms in the 95% of situations where doing so is in their all-things-considered selfish interest, then they will also choose to cooperate and follow norms in the 5% of situations where it isn’t. It likewise leads to well-meaning psychologists trying to “teach” sociopaths to intrinsically care about other people’s welfare, but accidentally just “teaching” them to be better at faking empathy.^[3]

2.3.2 Misleading intuitions from today’s LLMs

…Then separately, there’s a second issue which points in the same misleading direction. Namely, LLM pretraining magically transmutes observations into behavior, in a way that is profoundly disanalogous to the kinds of RL agents that Silver & Sutton are talking about, and also disanalogous to how human brains work.^[4] During LLM self-supervised pretraining, an observation that the next letter is “a” is transmuted into a behavior of outputting “a” in that same context. That just doesn’t make sense in an “era of experience” context. In particular, think about humans. When I take actions, I am sending motor commands to my own arms and my own mouth etc. Whereas when I observe another human and do self-supervised learning, my brain is internally computing predictions of upcoming sounds and images etc. These are different, and there isn’t any straightforward way to translate between them.

Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans is a thing that happens, it happens via a very different and much less direct algorithmic mechanism than how it happens in LLM pretraining. Specifically, humans imitate other humans because they want to—i.e., because of their history of past reinforcement, directly or indirectly. Whereas a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all; that’s just mechanically what it does.

…And humans don’t always want to imitate! If someone you admire starts skateboarding, you’re more likely to start skateboarding yourself. But if someone you despise starts skateboarding, you’re less likely to start skateboarding!^[5]

So that’s LLM pretraining. The “magical transmutation” thing doesn’t apply to post-training, but (1) I think LLM capabilities come overwhelmingly from pretraining, not post-training (indeed, conventional wisdom says that reinforcement learning from human feedback (RLHF) makes LLMs dumber!), (2) to the (relatively) small-but-increasing extent that LLM capabilities do come from performance-oriented RL post-training—e.g. in RL-on-chains-of-thought as used in GPT-o1 and DeepSeek-R1—we do in fact see small-but-increasing amounts of sociopathic behavior (examples). I think Silver & Sutton agree with (1) at least, and they emphasize it in their preprint.

2.3.3 Summary

So in sum,

If a human acts kind to another human, you can be reasonably confident that it’s not a front for sociopathic callous indifference to whether they live or die, because >95% of humans have innate social drives that make us intrinsically motivated by empathy, norm-following, and (what we think of as) “normal” social relationships, as ends in themselves.
If an RLHF’d LLM acts kind to a human, you can also be reasonably confident that it’s not a front for sociopathic callous indifference to whether you live or die, because the LLM is generally inheriting its behavioral tendencies from the human distribution. So this just gets back to the previous bullet point. In particular, if we make the naïve observation “Gemini seems basically nice, not like a sociopath”, I think this can be taken at face value as some pro tanto optimistic evidence about today’s LLMs, albeit with various important caveats.^[6]

Alas, neither of these applies to the era-of-experience RL agents that Silver & Sutton are talking about.

…Unless, of course, we invent an era-of-experience RL reward function that makes empathy, norm-following, and so on seem intrinsically good to the RL agent, as they do to most humans. In particular, there must be something in the human brain reward function that makes those things seem intrinsically good. Maybe we could just copy that? Alas, nobody knows how that part of the human brain reward function works. I’ve been working on it! But I don’t have an answer yet, and I’m quite sure that it’s wildly different from anything suggested by Silver & Sutton in their preprint.^[7]

Instead, when I describe below what we should expect from the Silver & Sutton alignment approach, it will sound like I’m making the AGI out to be a complete psycho. Yes, this is indeed what I expect. Remember, their proposal for the AGI reward function is way outside the human distribution—indeed way outside the distribution of life on Earth. So it’s possible, maybe even expected, that its motivations will seem intuitively strange to us.

2.4 Back to the proposal

All that was background. Now let’s turn to the preprint proposal above (§1).

2.4.1 Warm-up: The “specification gaming” game

Let’s put on our “specification gaming” goggles. If the AI maximizes the reward function, will we be happy with the results? Or will the AI feel motivated to engage in pathological, dangerous behaviors?

In this section, I’m oversimplifying their proposal a bit—hang on until in the next subsection—but this discussion will still be relevant.

OK, here are some problems that might come up:

If the user types “improve my fitness” into some interface, and it sets the AI’s reward function to be some “function of the user’s heart rate, sleep duration, and steps taken”, then the AI can potentially get a higher reward by forcing the user into eternal cardio training on pain of death, including forcibly preventing the person from turning off the AI, or changing its goals (see §2.2 above).
The way that the reward function operationalizes “steps taken” need not agree with what we had in mind. If it’s operationalized as steps registered on a wearable tracker, the AI can potentially get higher reward by taking the tracker from the person and attaching it to a paint shaker. “Sleep” may be operationalized in a way that includes the user being forcibly drugged by the AI.
If the user sets a goal of “help me learn Spanish over the next five years”, the AI can potentially get a higher reward by making modified copies of itself to aggressively earn or steal as much money and resources as possible around the world, and then have those resources available in case it might be useful for its local Spanish-maximization goal. For example, money can be used to hire tutors, or to train better successor AIs, or to fund Spanish-pedagogy or brain-computer interface research laboratories around the world, or of course to cheat by bribing or threatening whoever administers the Spanish exam at the end of the five years.

Hopefully you get the idea, but I’ll keep going with an excerpt from Silver’s podcast elaboration:

DAVID SILVER: …One way you can do this is to leverage the same answer which has been so effective so far elsewhere in AI, which is at that level, you can make use of some human input. If it's a human goal that we're optimizing, then we probably at that level need to measure, y’know, and say, well, the human gives feedback to say, actually, I'm starting to feel uncomfortable. And in fact, while I don't want to claim that we have the answers, and I think there's an enormous amount of research to get this right and make sure that this kind of thing is safe, it could actually help in certain ways in terms of this kind of safety and adaptation. There's this famous example of paving over the whole world with paperclips when a system’s been asked to make as many paperclips as possible. If you have a system which is really, its overall goal is to support human well-being, and it gets that feedback from humans, and it understands their distress signals and their happiness signals and so forth, the moment it starts to create too many paperclips and starts to cause people distress, it would adapt that combination and it would choose a different combination and start to optimize for something which isn't going to pave over the world with paperclips. We're not there yet, but I think there are some versions of this which could actually end up not only addressing some of the alignment issues that have been faced by previous approaches to, y’know, goal focused systems but maybe even, you know, be more adaptive and therefore safer than what we have today. …

OK, fine, let’s keep going:

If “the human gives feedback” is part of the reward function, then the AI can potentially get a higher score by forcing the human to give positive feedback, or otherwise exploiting edge-cases in how this feedback is operationalized and measured.
If human “distress signals and happiness signals” are part of the reward function, then the AI can potentially get a higher score by forcing or modifying the humans to give more happiness signals and fewer distress signals, or otherwise exploiting edge-cases in how these signals are operationalized and measured.
More generally, what source code should we write into the reward function, such that the resulting AI’s “overall goal is to support human well-being”? Please, write something down, and then I will tell you how it can be specification-gamed.

…Anyway, I’m happy that David Silver is aware that specification gaming is a problem. But nothing he said is a solution to that problem. Indeed, nothing he said is even pointed in the vague direction of a possible solution to that problem!

2.4.2 What about “bi-level optimization”?

I expect the authors to object that what they’re actually suggesting is more sophisticated than I implied in the previous subsection. Their preprint description is a bit vague, but I wound up with the following impression:

My attempt to translate the vague paper description into a diagram. If I’m misunderstanding what the authors had in mind, I would be delighted for them to restate their proposal more explicitly.

We have our primary reinforcement learning system, which defines the powerful agent that gets stuff done in the world. Part of that system is the reward function. And we put a second mini reinforcement learning algorithm inside that reward function. We have a special user input channel—I’m imagining two big red buttons on the wall that say “good” or “bad”—that serves as ground truth for the “Mini RL Sub-Sub-System”. Then the “Mini RL Sub-Sub-System” comes up with a reward function for the primary RL system, out of a (let’s say) 100-dimensional parametrized space of possible reward functions, by picking coefficients on each of 100 different real-world inputs, like the user’s body mass index (BMI), the company’s profits, the number of Facebook “likes”, or whatever.

If that’s right (a big “if”!), and if that system would even work at all, does it lead to “a general-purpose AI that can be steered reliably towards arbitrary user-desired behaviours”? I say no!

I see many problems, but here’s the most central one: If we have a 100-dimensional parametrized space of possible reward functions for the primary RL system, and every single one of those possible reward functions leads to bad and dangerous AI behavior (as I argued in the previous subsection), then … how does this help? It’s a 100-dimensional snake pit! I don’t care if there’s a flexible and sophisticated system for dynamically choosing reward functions within that snake pit! It can be the most sophisticated system in the world! We’re still screwed, because every option is bad!^[8]

2.5 Is this a solvable problem?

Maybe it’s OK if AIs have these crazy motivations, because we can prevent them from acting on those motivations? For example, maybe the AI would ideally prefer to force the user into eternal cardio training on pain of death, but it won’t, because it can’t, or we’ll unplug it if it tries anything funny? Unfortunately, if that’s the plan, then we’re putting ourselves into an adversarial relationship with ever-more-competent, ever-more-creative, hostile AI agents. That’s a bad plan. I have more discussion at Safety ≠ alignment (but they’re close!).

Maybe we can solve specification gaming by just thinking a bit more carefully about how to operationalize the user’s preferences in the form of a reward function? I don’t think it’s that simple. Remember, the reward function is written in source code, not in natural language.^[9] The futile fight against specification gaming is caricatured in a 2016 essay: Nearest unblocked strategy. We saw that a bit above. “Maximize paperclips”? Oh wait, no, that has bad consequences. OK, “maximize paperclips but with a constraint that there cannot be any human distress signals”? Oh wait….

Maybe the reward function can be short-term? Like “improve my fitness over the next hour”? That might indeed mitigate the problem of the AI feeling motivated to work to disempower humanity and force the user into eternal cardio training on pain of death, or whatever. But that only works by throwing out the ability of the AI to make good long-term plans, which was one of the big selling points of the “era of experience” as described in the preprint.

Indeed, while Silver & Sutton treat “AIs employing effective strategies that humans don’t understand” as a good thing that we should make happen, meanwhile down the hallway at DeepMind, Silver’s colleagues have been studying a technique they call “MONA”, and their paper also discusses AIs employing effective strategies that humans don’t understand, in strikingly similar language. But the MONA group describes this as a bad thing that we should avoid! Executing creative out-of-the-box long-term plans that humans don’t understand is good for capabilities, and bad for safety. These are two sides of the same coin.

For my part, I think the MONA approach is too limiting (see my discussion in §5.3 here), and I agree with Silver & Sutton that long-term goals are probably necessary to get to real AGI. But unlike them, I make that claim in a resigned tone of voice. Alas!

Maybe “specification gaming” and “goal misgeneralization” will induce equal and opposite distortions in the motivation system, so that everything turns out great? This is theoretically possible! But it’s rather unlikely on priors, right? And the authors have provided no reason for us to expect that, for their proposal. I’ll make a stronger statement: I specifically expect that to not happen for their proposal, nor for anything remotely like their proposal, for quite general reasons that I was chatting about here.

Maybe something else? I do think a solution probably exists! I think there’s probably a reward function, training environment, etc., somewhere out there, that would lead to an “era of experience” AI that is simultaneously superhumanly capable and intrinsically motivated by compassion, just as humans are capable of impressive feats of science and progress while being (in some cases) nice and cooperative. As I mentioned, this technical problem is something that I work on myself. However, it’s important to emphasize:

As of now, nobody has a plan (for a reward function etc.) for which there’s a strong reason to believe that the resulting AI won’t be motivated to murder its programmers, and its users, and everyone else, given the opportunity.
The lack-of-a-plan is not, and will not be, a blocker on making powerful RL agents that beat benchmarks and make tons of money (see §2.2 above).
Given that we don’t have such a plan yet, we should remain open-minded to the possibility that no such plan even exists—or at least, no plan that can be realistically implemented.

The authors, being reinforcement learning luminaries, are in an unusually good position to ponder this problem, and I encourage them to do so. And I’m happy to chat—here’s my email.

3. Epilogue: The bigger picture—this is deeply troubling, not just a technical error

3.1 More on Richard Sutton

Richard Sutton recently said in a talk:

You should ask yourself: are these AIs threatening us? They don't even exist yet. And already we're worried. Is that a property of them, or is this something we are putting on them, as the other, the other which we don’t know yet, we’re going to believe that they’re going to be terrible and they will not cooperate with us?

AIs do not need to already exist for us to reason about what they will do. Indeed, just look at the “Era of Experience” preprint! In the context of that preprint, Sutton does not treat “they don’t even exist yet” as an excuse to shrug and say that everything about future AI is unknowable. Rather, he is perfectly happy to try to figure things out about future AI by reasoning about algorithms. That’s the right idea! And I think he did a good job in most of the paper, but made errors of reasoning in the discussion of reward functions and AI motivations.

Back to that quote above. Let’s talk about Europeans going to the New World. Now, the Europeans could have engaged in cooperative positive-sum trade with the Native Americans there. Doing so would have been selfishly beneficial for the Europeans. …But you know what was even more selfishly beneficial for the Europeans? Killing or enslaving the Native Americans, and taking their stuff.

In hindsight, we can confidently declare that the correct attitude of 15th-century Native Americans towards Europeans would have been to “worry”, to “treat them as the other”, to “believe that they’re going to be terrible”, and to work together to prevent the horrific outcome. Alas, in many cases, Native Americans instead generously helped Europeans, and allied with them, only to be eventually stabbed in the back, as Europeans reneged on their treaties.

That’s an anecdote from the past. Let us now turn to the future. My strong opinion is that we’re on a path towards developing machines that will be far more competent, experienced, fast, and numerous than humans, and that will be perfectly able to run the world by themselves. The path from here to there is of unknown length, and I happen to agree with Sutton & Silver that it will not centrally involve LLMs. But these AIs are coming sooner or later. If these AIs feel motivated to work together to kill humans and take our stuff, they would certainly be able to. As a bonus, with no more humans around, they could relax air pollution standards!

On the other hand, consider that we (economically-productive working-age humans) could work together to kill our economically-unproductive pets, and our zoo animals, and our feeble retiree parents, and take their stuff, if we wanted to. But we don’t want to! We like having them around!

These are the stakes of AI alignment. AIs can be more productive than humans in every way, but still make the world nice for us humans, because they intrinsically care about our well-being. Or, AIs could kill us and take our stuff, like Europeans did to Native Americans. Which do we prefer? I vote for the first one, thank you very much! I hope that Sutton would agree?? This isn’t supposed to be a trick question! Terminator was a cautionary tale, not a roadmap!

Now, I have a whole lot of complaints about Sutton’s talk that I excerpted above, but one of the core issues is that he seems to imagine that we face exactly one decision, a decision between these two options:

Option 1: Build artificial superintelligence (ASI) as fast as possible, which may well lead to human extinction
Option 2: Never ever build ASI, by means of an eternal totalitarian global dictatorship.

This is an outrageously false dichotomy.^[10] There are many more options before us. For example, if we build ASI, we could try to ensure that the ASI is motivated to treat humans well, and not to treat humans like Europeans treated Native Americans. In other words, we could try to solve the alignment problem. The first step is actually trying—proposing specific technical plans and subjecting them to scrutiny, including writing pseudocode for the reward function and thinking carefully about its consequences. The preprint sketch won’t work. So they should come up with a better plan.

…Or if they can’t come up with a better plan, then they should retitle their preprint “Welcome to the Era of Experience—the Era of Callous Sociopathic AIs That May Well Murder You and Your Children”.

To be clear, a technical plan to make powerful ASI that treats us well is still not enough! A technical plan can exist, but be ignored, or be implemented incorrectly. Or there can be competitive races-to-the-bottom, and so on. But if we don’t even have a plausible technical plan, then we’re definitely screwed!

(Added April 25: Here’s Richard Sutton’s reply on X, and my reply-thread back.)

3.2 More on David Silver

David Silver is doing better than Sutton, in that Silver at least pays lip service to the idea of AGI alignment and safety. But that just makes me all the more frustrated that, after 15 years at DeepMind, Silver is proposing alignment plans for RL agents that are doomed to failure, for reasons that AI alignment researchers could have immediately rattled off long ago.

So, if he has a plan that he thinks might work, I encourage him to write it down, especially including pseudocode for the reward function, and present it for public scrutiny. If he does not have a plan that he thinks might work, then why is he writing a preprint that presents the “era of experience” as an exciting opportunity, rather than a terrifying prospect?

I mean, from my perspective, much of the “Era of Experience” preprint is actually right! But I would have written it with a tone of voice throughout that says: “This thing is a ticking timebomb, and no one has any idea how to defuse it. Let me tell you about it, so that you have a better understanding of this looming threat.”

Again, I believe that Silver thinks of himself as taking the alignment problem seriously. But when we look at the results, it’s obvious that he’s just not thinking about it very hard. He’s tossing out ideas that seem plausible after a few minutes’ thought, and then calling it a day. He can do better! When he was working on AlphaGo, or all those other amazing projects, he was not throwing out superficially-plausible ideas and calling it a day. Rather, he was bringing to bear the full force of his formidable intelligence and experience, including studying the literature and scrutinizing his own ideas for how they might fail. I beg him to apply that kind of serious effort to the safety implications of his life’s work, the way he might if he were working towards the development of a new ultra-low-cost uranium enrichment technology.

Alternatively, if Silver can’t bring himself to feel motivated by the possibility of AIs killing him and everyone he loves, and if he just cares about solving fascinating unsolved RL research problems, problems where he can show off his legendary RL brilliance, then … fine! Nobody has figured out a reward function whose consequences (for “era of experience” RL agent ASI) would not be catastrophic. This is just as much a deep and fascinating unsolved RL research problem as anything else! Maybe he could work on it!

Thanks Charlie Steiner and Seth Herd for critical comments on earlier drafts.

^{^}
See e.g. “Learning What to Value”, Dewey 2011.
^{^}
There might be “warning signs” before it’s too late, like the AI trying and failing to conceal dangerous behavior. But those warning signs would presumably just be suppressed by the “usual agent debugging loop”. And then everyone would congratulate themselves for solving the problem, and go back to making ever more money from ever more competent AIs. See also the lack of general-freaking-out over Bing-Sidney’s death threats or GPT-o3’s deceptiveness—more discussion here.
^{^}
See The Psychopath Test by Ronson (“All those chats about empathy were like an empathy-faking finishing school for him…” p88). To be clear, that there do seem to be interventions that appeal to sociopaths’ own self-interest—particularly their selfish interest in not being in prison—to help turn really destructive sociopaths into the regular everyday kind of sociopaths who are still awful to the people around them but at least they’re not murdering anyone. (Source.)
^{^}
The next couple paragraphs are partly copied from what I wrote here. Also related: Why I’m not into the Free Energy Principle, especially §8.
^{^}
For more on this important point from a common-sense perspective, see “Heritability: Five Battles” §2.5.1, and from a neuroscience perspective, see “Valence & Liking / Admiring” §4.5.
^{^}
Caveats include: out-of-distribution weirdness such as jailbreaks, and the thing I mentioned above about performance-oriented o1-style RL post-training, and the possibility that future foundation models might be different from today’s foundation models. More on this in a (hopefully) forthcoming post.
^{^}
In particular, I think the human brain reward function is of a poorly-understood, exotic type that I call “non-behaviorist” (see §10.2.3 here), wherein interpretability-related probes into the planning system are upstream of the ground-truth rewards.
^{^}
The authors also mention in a footnote: “In this case, one may also view grounded human feedback as a singular reward function forming the agent’s overall objective, which is maximised by constructing and optimising an intrinsic reward function based on rich, grounded feedback.” For one thing, I don't think that’s precisely equivalent to the bi-level thing? But anyway, if the plan is for “grounded human feedback” to be the “singular reward”, then my response is just what I wrote in the previous section: the AI can potentially get a higher reward by forcing or modifying the human to give more positive feedback, or otherwise exploiting edge-cases in how this feedback is operationalized and measured.
^{^}
Of course, the “source code” inside the reward function could specify a secondary learning algorithm. The §2.4.2 bi-level thing was one example of that, but there are many variations. My take is: this kind of thing may change the form that specification gaming takes, but does not make it go away. As above, if you try to spell out how this would work in detail—what’s the secondary learning algorithm’s training data, loss function, etc.—then I will tell you how that resulting reward function can be specification-gamed.
^{^}
Note that, if Sutton really believes in (something like) this false dichotomy, then I think it would explain how he can both describe human extinction as a likely part of the future he’s hoping for, and claim that he’s not “arguing in favor of human extinction”. We can reconcile those by saying that he prefers Option 1 over Option 2 all things considered, but he sees human extinction is a ‘con’ rather than a ‘pro’ on the balance sheet. So, I think he is deeply wrong about lots of things, but I don’t think it’s accurate to call him omnicidal per se.

^{^}

See e.g. “Learning What to Value”, Dewey 2011.

^{^}

There might be “warning signs” before it’s too late, like the AI trying and failing to conceal dangerous behavior. But those warning signs would presumably just be suppressed by the “usual agent debugging loop”. And then everyone would congratulate themselves for solving the problem, and go back to making ever more money from ever more competent AIs. See also the lack of general-freaking-out over Bing-Sidney’s death threats or GPT-o3’s deceptiveness—more discussion here.

^{^}

See The Psychopath Test by Ronson (“All those chats about empathy were like an empathy-faking finishing school for him…” p88). To be clear, that there do seem to be interventions that appeal to sociopaths’ own self-interest—particularly their selfish interest in not being in prison—to help turn really destructive sociopaths into the regular everyday kind of sociopaths who are still awful to the people around them but at least they’re not murdering anyone. (Source.)

^{^}

The next couple paragraphs are partly copied from what I wrote here. Also related: Why I’m not into the Free Energy Principle, especially §8.

^{^}

For more on this important point from a common-sense perspective, see “Heritability: Five Battles” §2.5.1, and from a neuroscience perspective, see “Valence & Liking / Admiring” §4.5.

^{^}

Caveats include: out-of-distribution weirdness such as jailbreaks, and the thing I mentioned above about performance-oriented o1-style RL post-training, and the possibility that future foundation models might be different from today’s foundation models. More on this in a (hopefully) forthcoming post.

^{^}

In particular, I think the human brain reward function is of a poorly-understood, exotic type that I call “non-behaviorist” (see §10.2.3 here), wherein interpretability-related probes into the planning system are upstream of the ground-truth rewards.

^{^}

The authors also mention in a footnote: “In this case, one may also view grounded human feedback as a singular reward function forming the agent’s overall objective, which is maximised by constructing and optimising an intrinsic reward function based on rich, grounded feedback.” For one thing, I don't think that’s precisely equivalent to the bi-level thing? But anyway, if the plan is for “grounded human feedback” to be the “singular reward”, then my response is just what I wrote in the previous section: the AI can potentially get a higher reward by forcing or modifying the human to give more positive feedback, or otherwise exploiting edge-cases in how this feedback is operationalized and measured.

^{^}

Of course, the “source code” inside the reward function could specify a secondary learning algorithm. The §2.4.2 bi-level thing was one example of that, but there are many variations. My take is: this kind of thing may change the form that specification gaming takes, but does not make it go away. As above, if you try to spell out how this would work in detail—what’s the secondary learning algorithm’s training data, loss function, etc.—then I will tell you how that resulting reward function can be specification-gamed.

10.

^{^}

Note that, if Sutton really believes in (something like) this false dichotomy, then I think it would explain how he can both describe human extinction as a likely part of the future he’s hoping for, and claim that he’s not “arguing in favor of human extinction”. We can reconcile those by saying that he prefers Option 1 over Option 2 all things considered, but he sees human extinction is a ‘con’ rather than a ‘pro’ on the balance sheet. So, I think he is deeply wrong about lots of things, but I don’t think it’s accurate to call him omnicidal per se.

^{^}

Aka some inner alignment (aka goal misgeneralization) failure modes, though I don't know whether I want to use those words, because it's actually a huge bundle of problems.

AI Safety Public MaterialsReinforcement learningAI

Frontpage

114 Ω 46

New Comment

43 comments, sorted by

top scoring

Click to highlight new comments since: Today at 3:00 AM

[-]Noosphere898dΩ364

A really important question that I think will need to be answered is whether specification gaming/reward hacking must be in a significant sense be solved by default in order to unlock extreme capabilities.

I currently lean towards yes, due to the evidence offered by o3/Sonnet 3.7, but could easily see my mind changed, but the reason this question has such a large amount of importance is that if it were true, then we'd get tools to solve the alignment problem (modulo inner optimization issues), which means we'd be far less concerned about existential risk from AI misalignment (at least to the extent that specification gaming is a large portion of the issues with AI.

That said, I do think a lot of effort will be necessary to discover the answer to the question, because it affects a lot of what you would want to do in AI safety/AI governance if alignment tools come along with better capabilities or not.

[-]Steven Byrnes7dΩ550

I expect “the usual agent debugging loop” (§2.2) to keep working. If o3-type systems can learn that “winding up with the right math answer is good”, then they can learn “flagrantly lying and cheating are bad” in the same way. Both are readily-available feedback signals, right? So I think o3’s dishonesty is reflecting a minor problem in the training setup that the big AI companies will correct in the very near future without any new ideas, if they haven’t already. Right? Or am I missing something?

That said, I also want to re-emphasize that both myself and Silver & Sutton are thinking of future advances that give RL a much more central and powerful role than it has even in o3-type systems.

(E.g. Silver & Sutton write: “…These approaches, while powerful, often bypassed core RL concepts: RLHF side-stepped the need for value functions by invoking human experts in place of machine-estimated values, strong priors from human data reduced the reliance on exploration, and reasoning in human-centric terms lessened the need for world models and temporal abstraction. However, it could be argued that the shift in paradigm has thrown out the baby with the bathwater…”)

[-]Noosphere897d132

I think it's important to note that a big driver of past pure RL successes like AlphaZero were in domains where reward hacking was basically a non-concern, because it was easy to make unhackable environments like many games, and combine this with a lot of data and self-play, this allowed pure RL to scale to vastly superhuman heights without requiring the insane compute that evolution spent to make us good at doing RL tasks (which was 10^42 FLOPs at a minimum, which is basically unachievable without a well developed space industry or an intelligence explosion already happening or reversible computers actually working, due to that much compute via irreversible computation fundamentally trashing Earth's environment).

A similar story holds for mathematics (though programming is an area where reward hacking can happen, and I see the o3/Sonnet 3.7 results as a big sign for how easy it is to make models misaligned and how little RL is required to make reward hacking/reward optimization pervasive)

Re this:

That said, I also want to re-emphasize that both myself and Silver & Sutton are thinking of future advances that give RL a much more central and powerful role than it has even in o3-type systems.

(E.g. Silver & Sutton write: “…These approaches, while powerful, often bypassed core RL concepts: RLHF side-stepped the need for value functions by invoking human experts in place of machine-estimated values, strong priors from human data reduced the reliance on exploration, and reasoning in human-centric terms lessened the need for world models and temporal abstraction. However, it could be argued that the shift in paradigm has thrown out the baby with the bathwater…”)

I basically agree with this, but the issue as I've said is whether adding in more RL without significant steps to solve the specification gaming problem when we apply RL to tasks that allow goodharting/reward hacking/reward is the optimization target leads to extreme capabilities gains without extreme alignment gains, or whether this just leads to a model that reward hacks/reward optimizes so much that you cannot get anywhere close to extreme capabilities like AlphaZero, or even get to human level capabilities without solving significant parts of the specification gaming problem.

On this:

I expect “the usual agent debugging loop” (§2.2) to keep working. If o3-type systems can learn that “winding up with the right math answer is good”, then they can learn “flagrantly lying and cheating are bad” in the same way. Both are readily-available feedback signals, right? So I think o3’s dishonesty is reflecting a minor problem in the training setup that the big AI companies will correct in the very near future without any new ideas, if they haven’t already. Right? Or am I missing something?

I'd probably not say minor, but yes I'd expect fixes soon, due to incentives.

The point is whether in the future RL leading to specification gaming requires you to solve the specification gaming problem by default to unlock extreme capabilities, not whether o3/Sonnet 3.7's reward hacking can be fixed and capabilities rise (though it is legitmate evidence).

And right now, the answer could very well be yes.

[-]Steven Byrnes7d20

a big driver of past pure RL successes like AlphaZero were in domains where reward hacking was basically a non-concern, because it was easy to make unhackable environments like many games, and combine this with a lot of data and self-play, this allowed pure RL to scale to vastly superhuman heights without requiring the insane compute that evolution spent to make us good at doing RL tasks (which was 10^42 FLOPs at a minimum…)

I don’t follow this part. If we take “human within-lifetime learning” as our example, rather than evolution, (and we should!), then we find that there is an RL algorithm in which the compute requirements are not insane, indeed quite modest IMO, and where the environment is the real world.

I think there are better RL algorithms and worse RL algorithms, from a capabilities perspective. By “better” I mean that the RL training results in a powerful agent that understands the world and accomplishes goals via long-term hierarchical plans, even with sparse rewards and relatively little data, in complex open-ended environments, including successfully executing out-of-distribution plans on the first try (e.g. moon landing, prison escapes), etc. Nobody has invented such RL algorithms yet, and thus “a big driver of past pure RL successes” is that they were tackling problems that are solvable without those kinds of yet-to-be-invented “better RL algorithms”.

For the other part of your comment, I’m a bit confused. Can you name a specific concrete example problem / application that you think “the usual agent debugging loop” might not be able to solve, when the loop is working at all? And then we can talk about that example.

As I mentioned in the post, human slavery is an example of how it’s possible to generate profit from agents that would very much like to kill you given the opportunity.

[-]Noosphere896d40

I don’t buy this part. If we take “human within-lifetime learning” as our example, rather than evolution, (and we should!), then we find that there is an RL algorithm in which the compute requirements are not insane, indeed quite modest IMO, and where the environment is the real world.

Fair point.

I think there are better RL algorithms and worse RL algorithms, from a capabilities perspective. By “better” I mean that the RL training results in a powerful agent that understands the world and accomplishes goals via long-term hierarchical plans, even with sparse rewards and relatively little data, in complex open-ended environments, including successfully executing out-of-distribution plans on the first try (e.g. moon landing, prison escapes), etc. Nobody has invented such RL algorithms yet, and thus “a big driver of past pure RL successes” is that they were tackling problems that are solvable without those kinds of yet-to-be-invented “better RL algorithms”.

Yeah, a pretty large crux is how far can you improve RL algorithms without figuring out a way to solve specification gaming issues, because this is what controls whether we should expect competent misgeneralization of goals we don't want, or reward hacking/wireheading that fails to take over the world.

For the other part of your comment, I’m a bit confused. Can you name a specific concrete example problem / application that you think “the usual agent debugging loop” might not be able to solve, when the loop is working at all? And then we can talk about that example.
As I mentioned in the post, human slavery is an example of how it’s possible to generate profit from agents that would very much like to kill you given the opportunity.

I too basically agree that the usual agent debugging loop will probably solve near-term issues.

To illustrate a partially concrete story of how this debugging loop could fail in a way that could force them to solve the AI specification gaming problem, imagine that we live in a world where something like fast take-off/software only singularities can happen, and we task 1,000,000 AI researchers to automate their own research.

However, we keep having issues with AI researchers reward functions, because we have a scaled up version of the problem with o3/Sonnet 3.7, because while they managed to patch the problems in o3/Sonnet 3.7, they didn't actually fully solve the problem in a durable way, and scaled up versions of the problems that plagued o3/Sonnet 3.7 like making up papers that look good to humans but don't actually work, making codebases that are rewarded by the RL process but don't actually work, and more generally sycophancy/reward overoptimization is such an attractor basin that fixes don't work without near-unhackable reward functions, and everything from benchmarks to code is aggressively goodharted and reward optimized, meaning AI capabilities stop growing until theoretical fixes for specification gaming are obtained.

This is my own optimistic story of how AI capabilities could be bottlenecked on solving the alignment problem of specification gaming

[-]Steven Byrnes5d20

Thanks!

Yeah, a pretty large crux is how far can you improve RL algorithms without figuring out a way to solve specification gaming issues, because this is what controls whether we should expect competent misgeneralization of goals we don't want, or reward hacking/wireheading that fails to take over the world.

I think this is revealing some differences of terminology and intuitions between us. To start with, in the §2.1 definitions, both “goal misgeneralization” and “specification gaming” (a.k.a. “reward hacking”) can be associated with “competent pursuit of goals we don’t want”, whereas you seem to be treating “goal misgeneralization” as a competent thing and “reward hacking” as harmless but useless. And “reward hacking” is broader than wireheading.

For example, if the AI forces the user into eternal cardio training on pain of death, and accordingly the reward function is firing like crazy, that’s misspecification not misgeneralization, right? Because this isn’t stemming from how the AI generalizes from past history. No generalization is necessary—the reward function is firing right now, while the user is in prison. (Or if the reward doesn’t fire, then TD learning will kick in, the reward function will update the value function, and the AI will say oops and release the user from prison.)

In LLMs, if you turn off the KL divergence regularization and instead apply strong optimization against the RLHF reward model, then it finds out-of-distribution tricks to get high reward, but those tricks don’t display dangerous competence. Instead, the LLM is just printing “bean bean bean bean bean…” or whatever, IIUC. I’m guessing that’s what you’re thinking about? Whereas I’m thinking of things more like the examples here or “humans inventing video games”, where the RL agent is demonstrating great competence and ingenuity towards goals that are unintentionally incentivized by the reward function.

To illustrate a partially concrete story…

Relatedly, I think you’re maybe conflating “reward hacking” with “inability to maximize sparse rewards in complex environments”?

If the AI has the ability to maximize sparse rewards in complex environments, then I claim that the AI will not have any problem like “making up papers that look good to humans but don't actually work, making codebases that are rewarded by the RL process but don't actually work, and more generally sycophancy/reward overoptimization”. All it takes is: make sure the paper actually works before choosing the reward, make sure the codebase actually works before choosing the reward, etc. As long as the humans are unhappy with the AI’s output at some point, then we can do “the usual agent debugging loop” of §2.2. (And if the humans never realize that the AI’s output is bad, then that’s not the kind of problem that will prevent profiting from those AIs, right? If nothing else, “the AIs are actually making money” can be tied to the reward function.)

But “ability to maximize sparse rewards in complex environments” is capabilities, not alignment. (And I think it’s a problem that will automatically be solved before we get AGI, because it’s a prerequisite to AGI.)

[-]Noosphere894d40

I think this is revealing some differences of terminology and intuitions between us. To start with, in the §2.1 definitions, both “goal misgeneralization” and “specification gaming” (a.k.a. “reward hacking”) can be associated with “competent pursuit of goals we don’t want”, w/hereas you seem to be treating “goal misgeneralization” as a competent thing and “reward hacking” as harmless but useless. And “reward hacking” is broader than wireheading.

For example, if the AI forces the user into eternal cardio training on pain of death, and accordingly the reward function is firing like crazy, that’s misspecification not misgeneralization, right? Because this isn’t stemming from how the AI generalizes from past history. No generalization is necessary—the reward function is firing right now, while the user is in prison. (Or if the reward doesn’t fire, then TD learning will kick in, the reward function will update the value function, and the AI will say oops and release the user from prison.)

In LLMs, if you turn off the KL divergence regularization and instead apply strong optimization against the RLHF reward model, then it finds out-of-distribution tricks to get high reward, but those tricks don’t display dangerous competence. Instead, the LLM is just printing “bean bean bean bean bean…” or whatever, IIUC. I’m guessing that’s what you’re thinking about? Whereas I’m thinking of things more like the examples here or “humans inventing video games”, where the RL agent is demonstrating great competence and ingenuity towards goals that are unintentionally incentivized by the reward function.

Yeah, I was ignoring the case where reward hacking actually lead to real world dangers, which was not a good thing (though in my defense, one could argue that reward hacking/reward overoptimization may by default lead to wireheading-type behavior without tools to broadly solve specification gaming).

Relatedly, I think you’re maybe conflating “reward hacking” with “inability to maximize sparse rewards in complex environments”?

If the AI has the ability to maximize sparse rewards in complex environments, then I claim that the AI will not have any problem like “making up papers that look good to humans but don't actually work, making codebases that are rewarded by the RL process but don't actually work, and more generally sycophancy/reward overoptimization”. All it takes is: make sure the paper actually works before choosing the reward, make sure the codebase actually works before choosing the reward, etc. As long as the humans are unhappy with the AI’s output at some point, then we can do “the usual agent debugging loop” of §2.2. (And if the humans never realize that the AI’s output is bad, then that’s not the kind of problem that will prevent profiting from those AIs, right? If nothing else, “the AIs are actually making money” can be tied to the reward function.)

But “ability to maximize sparse rewards in complex environments” is capabilities, not alignment. (And I think it’s a problem that will automatically be solved before we get AGI, because it’s a prerequisite to AGI.)

I'm pointing out that Goodhart's law applies to AI capabilities, too, and saying that what the reward function rewards is not necessarily equivalent to the capabilities that you want from AI, because the metrics that you give the AI to optimize are likely not equivalent to what capabilities you want from the AI.

In essence, I'm saying the difference you identify for AI alignment is also a problem for AI capabilities, and I'll quote a post of yours below:

Goodhart’s Law (Wikipedia, Rob Miles youtube) states that there’s a world of difference between:
Optimize exactly what we want, versus Step 1: operationalize exactly what we want, in the form of some reasonable-sounding metric(s). Step 2: optimize those metrics.

https://www.lesswrong.com/posts/wucncPjud27mLWZzQ/intro-to-brain-like-agi-safety-10-the-alignment-problem#10_3_1_Goodhart_s_Law

I think the crux is you might believe that capabilities targets are easier to encode into reward functions that don't require that much fine specification, or you think that specification of rewards will happen more effectively for capabilities targets than alignment targets.

Whereas I'm not as convinced as you that it would literally be as easy as you say to solve issues like this “making up papers that look good to humans but don't actually work, making codebases that are rewarded by the RL process but don't actually work, and more generally sycophancy/reward overoptimization”. solely by maximizing rewards in sparse complicated environments without also being able to solve significant chunks of the alignment problem.

In particular, this claim "If nothing else, “the AIs are actually making money” can be tied to the reward function" has as much detail as this alignment plan below, which is that it is easy to describe, but not easy to actually implement:

https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ

In essence, I'm arguing that the same force which inhibits alignment progress also may inhibit capabilities progress, because what the RL process rewards isn't necessarily equal to impressive capabilities, and often includes significant sycophancy.

To be clear, it's possible that in practice it's easier to verify capabilities reward functions vs alignment reward functions, or that the agent debugging loop is maintained while fixing it's capabilities, while the agent debugging loop fails for alignment training, but I'm less confident than you that the capabilities/alignment dichotomy has relevance to alignment efforts, or that solving specification problems to get agents very capable don't allow us to specify their alignment targets/values in great detail.

[-]Steven Byrnes3d30

Thanks!

Hmm. I think there’s an easy short-term ‘solution’ to Goodhart’s law for AI capabilities, which is to give humans a reward button. Then the reward function rewards are exactly what the person wants by definition (until the AI can grab the button). There’s no need to define metrics or whatever, right?

(This is mildly related to RLHF, except that RLHF makes models dumber, whereas I’m imagining some future RL paradigm wherein the RL training makes models smarter.)

I think your complaint is that people would be bad at pressing the button, even by their own lights. They’ll press the button upon seeing a plausible-sounding plan that flatters their ego, and then they’ll regret that they pressed it when the plan doesn’t actually work. This will keep happening, until the humans are cursing the button and throwing it out.

But there’s an obvious (short-term) workaround to that problem, which is to tell the humans not to press the reward button until they’re really sure that they won’t later regret it, because they see that the plan really worked. (Really, you don’t even have to tell them that, they’d quickly figure that out for themselves.) (Alternatively, make an “undo” option such that when the person regrets having pressed the button, they can roll back whatever weight changes came from pressing it.) This workaround will make the rewards more sparse, and thus it’s only an option if the AI can maximize sparse rewards. But I think we’re bound to get AIs that can maximize sparse rewards, on the road to AGI.

If the person never regrets pressing the button, not even in hindsight, then you have an AI product that will be highly profitable in the short term. You can have it apply for human jobs, found companies, etc.

… …Then I have this other theory that maybe everything I just wrote here is moot, because once someone figures out the secret sauce of AGI, it will be so easy to make powerful misaligned superintelligence that this will happen very quickly and with no time or need to generate profit from the intermediate artifacts. That’s an unpopular opinion these days and I won’t defend it here. (I’ve been mulling it over in the context of a possible forthcoming post.) Just putting my cards on the table.

https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ

That was a very good constructive comment btw, sorry I forgot to reply to it earlier but I just did.

[-]Noosphere892d20

Hmm. I think there’s an easy short-term ‘solution’ to Goodhart’s law for AI capabilities, which is to give humans a reward button. Then the reward function rewards are exactly what the person wants by definition (until the AI can grab the button). There’s no need to define metrics or whatever, right?

I think your complaint is that people would be bad at pressing the button, even by their own lights. They’ll press the button upon seeing a plausible-sounding plan that flatters their ego, and then they’ll regret that they pressed it when the plan doesn’t actually work. This will keep happening, until the humans are cursing the button and throwing it out.

But there’s an obvious (short-term) workaround to that problem, which is to tell the humans not to press the reward button until they’re really sure that they won’t later regret it, because they see that the plan really worked. (Really, you don’t even have to tell them that, they’d quickly figure that out for themselves.) (Alternatively, make an “undo” option such that when the person regrets having pressed the button, they can roll back whatever weight changes came from pressing it.) This workaround will make the rewards more sparse, and thus it’s only an option if the AI can maximize sparse rewards. But I think we’re bound to get AIs that can maximize sparse rewards, on the road to AGI.

If the person never regrets pressing the button, not even in hindsight, then you have an AI product that will be highly profitable in the short term. You can have it apply for human jobs, found companies, etc.

For metrics, I'm talking about stuff like benchmarks and evals for AI capabilities like METR evals.

I have a couple of complaints, assuming this is the strategy we go with to make automating capabilities safe from the RL sycophancy problem:

I think this basically rules out fast takeoffs/most of the value of what AI does, and this is true regardless of whether pure software-only singularities/fast takeoffs are possible at all, and I basically agree with @johnswentworth about long tails which means having an AI automate 90% of a job, with humans grading the last 10% using a reward button loses basically all value compared to the AI being able to do the job without humans grading the reward.

Another way to say it is I think involving humans into an operation that you want to automate away with AI immediately erases most of the value of what the AI does in ~all complex domains, so this solution cannot scale at all:

https://www.lesswrong.com/posts/Nbcs5Fe2cxQuzje4K/value-of-the-long-tail

2. Similar to my last complaint, I think relying on humans to do the grading because AIs cannot effectively grade themselves because the reward function unintentionally causes sycophancy causing the AIs to make code and papers that look good and are rewarded by metrics/evals is very expensive and slow.

This could get you to human level capabilities, but because of the issue of specification gaming not being resolved, this means that you can't scale the AI's capability at a domain beyond what an expert human could do without worrying that exploration hacking/reward hacking/sycophancy is coming back, preventing the AI from being superhumanly capable like AlphaZero.

3. I'm not as convinced as you that solutions to this problem that allow AIs to automatically grade themselves with reward functions that don't need a human in the loop don't transfer to solutions to various alignment problems.

A large portion of the issue is that you can't just have humans interfere in the AI's design, or else you have lost a lot of the value in having the AI do the job, thus solutions to specification gaming/sycophancy/reward hacking must be automatable, and have the property of automatic gradability.

And the issue isn't sparse reward, it's rather that the reward function incentivizes goodharting on capabilities tests and code in the real world, and to have rewards dense enough that you could solve the problem, you'd have to give up on the promise of automation, which is 90-99% of the value or more from the AI, so it is a huge capabilities hit.

To be clear, I'm not saying that it's impossible to solve capabilities problem without solving alignment relevant specification gaming problems, but I'm saying that we can't trivially assume alignment and capabilities are decoupled enough to make AI capabilities progress dangerous.

[-]Davidmanheim6dΩ340

In general, you can mostly solve Goodhart-like problems in the vast majority of the experienced range of actions, and have it fall apart only in more extreme cases. And reward hacking is similar. This is the default outcome I expect from prosaic alignment - we work hard to patch misalignment and hacking, so it works well enough in all the cases we test and try, until it doesn't.

[-]Noosphere895d4-8

The important part is at what level of capabilities does it fail at.

If it fails once we are well past AlphaZero, or even just more moderate superhuman AI research, this is good, as this means the "automate AI alignment" plan has a safe buffer zone.

If it fails before AI automates AI research, this is also good, because it forces them to invest in alignment.

The danger case is if we can just automate AI research, but goodhart's law comes before we can automate AI alignment research.

[-]Davidmanheim5d20

If it fails once we are well past AlphaZero, or even just more moderate superhuman AI research, this is good, as this means the "automate AI alignment" plan has a safe buffer zone.
If it fails before AI automates AI research, this is also good, because it forces them to invest in alignment.

That assumes AI firms learn the lessons needed from the failures. Our experience shows that they don't, and they keep making systems that predictably are unsafe and exploitable, and they don't have serious plans to change their deployments, much less actually build a safety-oriented culture.

[-]cousin_it9dΩ340

I think this is all correct, but it makes me wonder.

You can imagine reinforcement learning as learning to know explicitly how reward looks like, and how to make plans to achieve it. Or you can imagine it as building a bunch of heuristics inside the agent, pulls and aversions, that don't necessarily lead to coherent behavior out of distribution and aren't necessarily understood by the agent. A lot of human values seem to be like this, even though humans are pretty smart. Maybe an AI will be even smarter, and subjecting it to any kind of reinforcement learning at all will automatically make it adopt explicit Machiavellian reasoning about the thing, but I'm not sure how to tell if it's true or not.

[-]Steven Byrnes9dΩ340

You might find this post helpful? Self-dialogue: Do behaviorist rewards make scheming AGIs? In it, I talk a lot about whether the algorithm is explicitly thinking about reward or not. I think it depends on the setup.

(But I don’t think anything I wrote in THIS post hinges on that. It doesn’t really matter whether (1) the AI is sociopathic because being sociopathic just seems to it like part of the right and proper way to be, versus (2) the AI is sociopathic because it is explicitly thinking about the reward signal. Same result.)

subjecting it to any kind of reinforcement learning at all

When you say that, it seems to suggest that what you’re really thinking about is really the LLMs as of today, for which the vast majority of their behavioral tendencies comes from pretraining, and there’s just a bit of RL sprinkled on top to elevate some pretrained behavioral profiles over other pretrained behavioral profiles. Whereas what I am normally thinking about (as are Silver & Sutton, IIUC) is that either 100% or ≈100% of the system’s behavioral tendencies are ultimately coming from some kind of RL system. This is quite an important difference! It has all kinds of implications. More on that in a (hopefully) forthcoming post.

don't necessarily lead to coherent behavior out of distribution

As mentioned in my other comment, I’m assuming (like Sutton & Silver) that we’re doing continuous learning, as is often the case in the RL literature (unlike LLMs). So every time the agent does some out-of-distribution thing, it stops being out of distribution! So yes there are a couple special cases in which OOD stuff is important (namely irreversible actions and deliberately not exploring certain states), but you shouldn’t be visualizing the agent as normally spending its time in OOD situations—that would be self-contradictory. :)

[-]cousin_it9dΩ340

Thanks for the link! It's indeed very relevant to my question.

I have another question, maybe a bit philosophical. Humans seem to reward-hack in some aspects of value, but not in others. For example, if you offered a mathematician a drug that would make them feel like they solved Riemann's hypothesis, they'd probably refuse. But humans aren't magical: we are some combination of reinforcement learning, imitation learning and so on. So there's got to be some non-magical combination of these learning methods that would refuse reward hacking, at least in some cases. Do you have any thoughts what it could be?

[-]Steven Byrnes8dΩ350

Solving the Riemann hypothesis is not a “primary reward” / “innate drive” / part of the reward function for humans. What is? Among many other things, (1) the drive to satisfy curiosity / alleviate confusion, and (2) the drive to feel liked / admired. And solving the Riemann hypothesis leads to both of those things. I would surmise that (1) and/or (2) is underlying people’s desire to solve the Riemann hypothesis, although that’s just a guess. They’re envisioning solving the Riemann hypothesis and thus getting the (1) and/or (2) payoff.

So one way that people “reward hack” for (1) and (2) is that they find (1) and (2) motivating and work hard and creatively towards triggering them in all kinds of ways, e.g. crossword puzzles for (1) and status-seeking for (2).

Relatedly, if you tell the mathematician “I’ll give you a pill that I promise will lead to you experiencing a massive hit of both ‘figuring out something that feels important to you’ and ‘reveling in the admiration of people who feel important to you’. But you won’t actually solve the Riemann hypothesis.” They’ll say “Well, hmm, sounds like I’ll probably solve some other important math problem instead. Works for me! Sign me up!”

If instead you say “I’ll give you a pill that leads to a false memory of having solved the Riemann hypothesis”, they’ll say no. After all, the payoff is (1) and (2), and it isn’t there.

If instead you say “I’ll give you a pill that leads to the same good motivating feeling that you’d get from (1) and (2), but it’s not actually (1) or (2)”, they’ll say “you mean, like, cocaine?”, and you say “yeah, something like that”, and they say “no thanks”. This is the example of reward misgeneralization that I mentioned in the post—deliberately avoiding addictive drugs.

If instead you say “I’ll secretly tell you the solution to the Riemann hypothesis, and you can take full credit for it, so you get all the (2)”, … at least some people would say yes. I feel like, in the movie trope where people have a magic wish, they sometimes wish to be widely liked and famous without having to do any of the hard work to get there.

The interesting question for this one is, why would anybody not say yes? And I think the answer is: those funny non-behaviorist human social instincts. Basically, in social situations, primary rewards fire in a way that depends in a complicated way on what you’re thinking about. In particular, I’ve been using the term drive to feel liked / admired, and that’s part of it, but it’s also an oversimplification that hides a lot of more complex wrinkles. The upshot is that lots of people would not feel motivated by the prospect of feeling liked / admired under false pretenses, or more broadly feeling liked / admired for something that has neutral-to-bad vibes in their own mind.

Does that help? Sorry if I missed your point.

[-]cousin_it8dΩ120

I think it helps. The link to "non-behaviorist rewards" seems the most relevant. The way I interpret it (correct me if I'm wrong) is that we can have different feelings in the present about future A vs future B, and act to choose one of them, even if we predict our future feelings to be the same in both cases. For example, button A makes a rabbit disappear and gives you an amnesia pill, and button B makes a rabbit disappear painfully and gives you an amnesia pill.

The followup question then is, what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way. Do you already have some crisp answers about this?

[-]Steven Byrnes8dΩ340

The RL algorithms that people talk in AI traditionally feature an exponentially-discounted sum of future rewards, but I don’t think there’s any exponentially-discounted sums of future rewards in biology (more here). Rather, you have an idea (“I’m gonna go to the candy store”), and the idea seems good or bad, and if it seems sufficiently good, then you do it! (More here.) It can seem good for lots of different reasons. One possible reason is: the idea is immediately associated with (non-behaviorist) primary reward. Another possible reason is: the idea involves some concept that seems good, and the concept seems good in turn because it has tended to immediately precede primary reward in the past. Thus, when the idea “I’m gonna go to the candy store” pops into your head, that incidentally involves the “eating candy” concept also being rather active in your head (active right now, as you entertain that idea), and the “eating candy” concept is motivating (because it has tended to immediately precede primary reward), so the idea seems good and off you go to the store.

“We predict our future feelings” is an optional thing that might happen, but it’s just a special case of the above, the way I think about it.

what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way.

This doesn’t really parse for me … The reward function is an input to learning, it’s not itself learned, right? (Well, you can put separate learning algorithms inside the reward function if you want to.) Anyway, I’m all in on model-based RL. I don’t think imitation learning is a separate thing for humans, for reasons discussed in §2.3.

[-]cousin_it8d*Ω120

I thought about it some more and want to propose another framing.

The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel. The agent's feelings can even be identical in future A vs future B, but the agent can choose future A anyway. Or maybe one of the futures won't even have feelings involved: imagine an environment where any mistake kills the agent. In such an environment, RL is impossible.

The reason we can function in such environments, I think, is because we aren't the main learning process involved. Evolution is. It's a kind of RL for which the death of one creature is not the end. In other words, we can function because we've delegated a lot of learning to outside processes, and do rather little of it ourselves. Mostly we execute strategies that evolution has learned, on top of that we execute strategies that culture has learned, and on top of that there's a very thin layer of our own learning. (Btw, here I disagree with you a bit: I think most of human learning is imitation. For example, the way kids pick up language and other behaviors from parents and peers.)

This suggests to me that if we want the rubber to meet the road - if we want the agent to have behaviors that track the world, not just the agent's own feelings - then the optimization process that created the agent cannot be the agent's own RL. By itself, RL can only learn to care about "behavioral reward" as you put it. Caring about the world can only occur if the agent "inherits" that caring from some other process in the world, by makeup or imitation.

This conclusion might be a bit disappointing, because finding the right process to "inherit" from isn't easy. Evolution depends on one specific goal (procreation) and is not easy to adapt to other goals. However, evolution isn't the only such process. There is also culture, and there is also human intelligence, which hopefully tracks reality a little bit. So if we want to design agents that will care about human flourishing, we can't hope that the agents will learn it by some clever RL. It has to be due to the agent's makeup or imitation.

This is all a bit tentative, I was just writing out the ideas as they came. Not sure at all that any of it is right. But anyway what do you think?

[-]Steven Byrnes7dΩ220

The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel.

I think RL agents (at least, the of the type I’ve been thinking about) tend to “want” salient real-world things that have (in the past) tended to immediately precede the reward signal. They don’t “want” the reward signal itself—at least, not primarily. This isn’t a special thing that requires non-behaviorist rewards, rather it’s just the default outcome of TD learning (when set up properly). I guess you’re disagreeing with that, but I’m not quite sure why.

In other words, specification gaming is broader than wireheading. So for example, a “behaviorist reward” would be “+1 if the big red reward button on the wall gets pressed”. An agent with that reward, who is able to see the button, will probably wind up wanting the button to be pressed, and seizing control of the button if possible. But if the button is connected to a wire behind the wall, the agent won’t necessarily want to bypass the button by breaking open the wall and cutting the wire and shorting it. (If it did break open the wall and short the wire, then its value function would update, and it would get “addicted” to that, and going forward it would want to do that again, and things like it, in the future. But it won’t necessarily want to do that in the first place. This is an example of goal misgeneralization.)

imagine an environment where any mistake kills the agent. In such an environment, RL is impossible. The reason we can function in such environments, I think, is because we aren't the main learning process involved. Evolution is.

Evolution partly acts through RL. For example, falling off a cliff often leads to death, so we evolved fear of heights, which is an innate drive / primary reward that then leads (via RL) to flexible intelligent cliff-avoiding behavior.

But also evolution has put a number of things into our brains that are not the main RL system. For example, there’s an innate reflex controlling how and when to vomit. (You obviously don’t learn how and when to vomit by trial-and-error!)

here I disagree with you a bit: I think most of human learning is imitation

When I said human learning is not “imitation learning”, I was using the latter term in a specific algorithmic sense as described in §2.3.2. I certainly agree that people imitate other people and learn from culture, and that this is a very important fact about humans. I just think it happens though the human brain within-lifetime RL system—humans generally imitate other people and copy culture because they want to, because it feels like a good and right thing to do.

[-]cousin_it7d*Ω340

Do you think the agent will care about the button and ignore the wire, even if during training it already knew that buttons are often connected to wires? Or does it depend on the order in which the agent learns things?

In other words, are we hoping that RL will make the agent focus on certain aspects of the real world that we want it to focus on? If that's the plan, to me at first glance it seems a bit brittle. A slightly smarter agent would turn its gaze slightly closer to the reward itself. Or am I still missing something?

[-]Steven Byrnes6dΩ220

even if during training it already knew that buttons are often connected to wires

I was assuming that the RL agent understands how the button works and indeed has a drawer of similar buttons in its basement which it attaches to wires all the time for its various projects.

A slightly smarter agent would turn its gaze slightly closer to the reward itself.

I’d like to think I’m pretty smart, but I don’t want to take highly-addictive drugs.

Although maybe your perspective is “I don’t want to take cocaine → RL is the wrong way to think about what the human brain is doing”, whereas my perspective is “RL is the right way to think about what the human brain is doing → RL does not imply that I want to take cocaine”?? As they say, one man’s modus ponens is another man’s modus tollens. If it helps, I have more discussion of wireheading here.

If that's the plan, to me at first glance it seems a bit brittle.

I don’t claim to have any plan at all, let alone a non-brittle one, for a reward function (along with training environment etc.) such that an RL agent superintelligence with that reward function won’t try to kill its programmers and users, and I claim that nobody else does either. That was my thesis here.

…But separately, if someone says “don’t even bother trying to find such a plan, because no such plan exists, this problem is fundamentally impossible”, then I would take the other side and say “That’s too strong. You might be right, but my guess is that a solution probably exists.” I guess that’s the argument we’re having here?

If so, one reason for my surmise that a solution probably exists, is the fact that at least some humans seem to have good values, including some very smart and ambitious humans.

And see also “The bio-determinist child-rearing rule of thumb” here which implies that innate drives can have predictable results in adult desires and personality, robust to at least some variation in training environment. [But more wild variation in training environment, e.g. feral children, does seem to matter.] And also Heritability, Behaviorism, and Within-Lifetime RL

[-]cousin_it6dΩ560

My perspective (well, the one that came to me during this conversation) is indeed "I don't want to take cocaine -> human-level RL is not the full story". That our attachment to real world outcomes and reluctance to wirehead is due to evolution-level RL, not human-level. So I'm not quite saying all plans will fail; but I am indeed saying that plans relying only on RL within the agent itself will have wireheading as attractor, and it might be better to look at other plans.

It's just awfully delicate. If the agent is really dumb, it will enjoy watching videos of the button being pressed (after all, they cause the same sensory experiences as watching the actual button being pressed). Make the agent a bit smarter, because we want it to be useful, and it'll begin to care about the actual button being pressed. But add another increment of smart, overshoot just a little bit, and it'll start to realize that behind the button there's a wire, and the wire leads to the agent's own reward circuit and so on.

Can you engineer things just right, so the agent learns to care about just the right level of "realness"? I don't know, but I think in our case evolution took a different path. It did a bunch of learning by itself, and saddled us with the result: "you'll care about reality in this specific way". So maybe when we build artificial agents, we should also do a bunch of learning outside the agent to capture the "realness"? That's the point I was trying to make a couple comments ago, but maybe didn't phrase it well.

[-]Steven Byrnes5dΩ230

Thanks! I’m assuming continuous online learning (as is often the case for RL agents, but is less common in an LLM context). So if the agent sees a video of the button being pressed, they would not feel a reward immediately afterwards, and they would say “oh, that’s not the real thing”.

(In the case of humans, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable, and then turns it off and probably won’t bother even trying again in the future.)

Wireheading is indeed an attractor, just like getting hooked on an addictive drug is an attractor. As soon as you try it, your value function will update, and then you’ll want to do it again. But before you try it, your value function has not updated, and it’s that not-updated value function that gets to evaluate whether taking an addictive drug is a good plan or bad plan. See also my discussion of “observation-utility agents” here. I don’t think you can get hooked on addictive drugs just by deeply understanding how they work.

So by the same token, it’s possible for our hypothetical agent to think that the pressing of the actual wired-up button is the best thing in the world. Cutting into the wall and shorting the wire would be bad, because it would destroy the thing that is best in the world, while also brainwashing me to not even care about the button, which adds insult to injury. This isn’t a false belief—it’s an ought not an is. I don’t think it’s reflectively-unstable either.

[-]Wei Dai8dΩ440

Another example I want to consider is a captured Communist revolutionary choosing to be tortured to death instead of revealing some secret. (Here "reward hacking" seems analogous to revealing the secret to avoid torture / negative reward.)

My (partial) explanation is that it seems like evolution hard-wired a part of our motivational system in some way, to be kind of like a utility maximizer with a utility function over world states or histories. The "utility function" itself is "learned" somehow (maybe partly via RL through social pressure and other rewards, partly through "reasoning"), but sometimes it gets locked-in or resistant to further change.

At a higher level, I think evolution had a lot of time to tinker, and could have made our neural architecture (e.g. connections) and learning algorithms (e.g. how/which weights are updated in response to training signals) quite complex, in order to avoid types of "reward hacking" that had the highest negative impacts in our EEA. I suspect understanding exactly what evolution did could be helpful, but it would probably be a mishmash of things that only partially solve the problem, which would still leave us confused as to how to move forward with AI design.

[-]Steven Byrnes8dΩ440

My model is simpler, I think. I say: The human brain is some yet-to-be-invented variation on actor-critic model-based reinforcement learning. The reward function (a.k.a. “primary reward” a.k.a. “innate drives”) has a bunch of terms: eating-when-hungry is good, suffocation is bad, pain is bad, etc. Some of the terms are in the category of social instincts, including something that amounts to “drive to feel liked / admired”.

(Warning: All of these English-language descriptions like “pain is bad” is an approximate gloss on what’s really going on, which is only describable in terms like “blah blah innate neuron circuits in the hypothalamus are triggering each other and gating each other etc.”. See my examples of laughter and social instincts. But “pain is bad” is a convenient shorthand.)

So for the person getting tortured, keeping the secret is negative reward in some ways (because pain), and positive reward in other ways (because of “drive to feel liked / admired”). At the end of the day, they’ll do what seems most motivating, which might or might not be to reveal the secret, depending on how things balance out.

So in particular, I disagree with your claim that, in the torture scenario, “reward hacking” → reveal the secret. The social rewards are real rewards too.

evolution…could have made our neural architecture (e.g. connections) and learning algorithms (e.g. how/which weights are updated in response to training signals) quite complex, in order to avoid types of "reward hacking" that had the highest negative impacts in our EEA.

I’m unaware of any examples where neural architectures and learning algorithms are micromanaged to avoid reward hacking. Yes, avoiding reward hacking is important, but I think it’s solvable (in EEA) by just editing the reward function. (Do you have any examples in mind?)

[-]Wei Dai8dΩ440

My intuition says reward hacking seems harder to solve than this (even in EEA), but I'm pretty unsure. One example is, under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?

When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?

[-]Steven Byrnes7dΩ230

under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?

It’s hard for me to give a perfectly confident answer because I don’t understand everything about human social instincts yet :) But here are some ways I’m thinking about that:

Best friends, and friend groups, do exist, and people do really enjoy and value them.
The “drive to feel liked / admired” is particularly geared towards feeling liked / admired by people who feel important to you, i.e. where interacting with them is high stakes (induces physiological arousal). (I got that wrong the first time, but changed my mind here.) Physiological arousal is in turn grounded in other ways—e.g., pain, threats, opportunities, etc. So anyway, if Zoe has extreme liking / admiration for me, so much that she’ll still think I’m great and help me out no matter what I say or do, then from my perspective, it may start feeling lower-stakes for me to interact with Zoe. Without that physiological arousal, I stop getting so much reward out of her admiration (“I take her for granted”) and go looking for a stronger hit of drive-to-feel-liked/admired by trying to impress someone else who feels higher-stakes for me to interact with. (Unless there’s some extra sustainable source of physiological arousal in my relationship with Zoe, e.g. we’re madly in love, or she’s the POTUS, or whatever.)
As I wrote in Social status part 1/2: negotiations over object-level preferences, there’s a zero-sum nature of social leading vs following. If I want to talk about trains and Zoe wants to talk about dinosaurs, we can’t both get everything we want; one of us is going to have our desires frustrated, at least to some extent. So if I’m doing exactly what I want, and Zoe is strongly following / deferring to me on everything, then that’s a credible and costly signal that Zoe likes / admires me (or fears me). And we can’t really both be sending those signals to each other simultaneously. Also, sending those signals runs up against every other aspect of my own reward function—eating-when-hungry, curiosity drive, etc., because those determine the object-level preferences that I would be neglecting in favor of following Zoe’s whims.

When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?

As I always say, my claim is “understanding human social instincts seems like it would probably be useful for Safe & Beneficial AGI”, not “here is my plan for Safe & Beneficial AGI…”. :) That said, copying from §5.2 here:

…Alternatively, there’s a cop-out option! If the outcome of the [distributional shifts] is so hard to predict, we could choose to endorse the process while being agnostic about the outcome.
There is in fact a precedent for that—indeed, it’s the status quo! We don’t know what the next generation of humans will choose to do, but we’re nevertheless generally happy to entrust the future to them. The problem is the same—the next generation of humans will [figure out new things and invent new technologies], causing self-generated distribution shifts, and ending up going in unpredictable directions. But this is how it’s always been, and we generally feel good about it. [Not that we’ve ever had a choice!]
So the idea here is, we could make AIs with social and moral drives that we’re happy about, probably because those drives are human-like in some respects [but probably not in all respects!], and then we could define “good” as whatever those AIs wind up wanting, upon learning and reflection.

[-]Wei Dai7dΩ440

As I wrote in Social status part 1/2: negotiations over object-level preferences, there’s a zero-sum nature of social leading vs following. If I want to talk about trains and Zoe wants to talk about dinosaurs, we can’t both get everything we want; one of us is going to have our desires frustrated, at least to some extent.

Why does this happen in the first place, instead of people just wanting to talk about the same things all the time, in order to max out social rewards? Where does interest in trains and dinosaurs even come from? They seem to be purely or mostly social, given lack of practical utility, but then why divergence in interests? (Understood that you don't have a complete understanding yet, so I'm just flagging this as a potential puzzle, not demanding an immediate answer.)

There is in fact a precedent for that—indeed, it’s the status quo! We don’t know what the next generation of humans will choose to do, but we’re nevertheless generally happy to entrust the future to them.

I'm only "happy" to "entrust the future to the next generation of humans" if I know that they can't (i.e., don't have the technology to) do something irreversible, in the sense of foreclosing a large space of potential positive outcomes, like locking in their values, or damaging the biosphere beyond repair. In other words, up to now, any mistakes that a past generation of humans made could be fixed by a subsequent generation, and this is crucial for why we're still in an arguably ok position. However AI will make it false very quickly by advancing technology.

So I really want the AI transition to be an opportunity for improving the basic dynamic of "the next generation of humans will [figure out new things and invent new technologies], causing self-generated distribution shifts, and ending up going in unpredictable directions", for example by improving our civilizational philosophical competency (which may allow distributional shifts to be handled in a more principled way), and not just say that it's always been this way, so it's fine to continue.

I'm going to read the rest of that post and your other posts to understand your overall position better, but at least in this section, you come off as being a bit too optimistic or nonchalant from my perspective...

[-]Steven Byrnes7dΩ220

Thanks!

Why does this happen in the first place

There are lots of different reasons. Here’s one vignette that popped into my head. Billy and Joey are 8yo’s. Joey read a book about trains last night, and Billy read a book about dinosaurs last night. Each learned something new and exciting to them, because trains and dinosaurs are cool (big and loud and mildly scary, which triggers physiological arousal and thus play drive). Now they’re meeting each other at recess, and each is very eager to share what they learned with the other, and thus they’re in competition over the conversation topic—Joey keeps changing the subject to trains, Billy to dinosaurs. Why do they each want to share / show off their new knowledge? Because of “drive to feel liked / admired”. From Joey’s perspective, the moment of telling Billy a cool new fact about trains, and seeing the excited look on Billy’s face, is intrinsically motivating. And vice-versa for Billy. But they can’t both satisfy that innate drive simultaneously.

OK, then you’ll say, why didn’t they both read the same book last night? Well, maybe the library only had one copy. But also, if they had both read the same book, and thus both learned the same cool facts about bulldozers, then that’s a lose-lose instead of a win-win! Neither gets to satisfy their drive to feel liked/admired by thrilling the other with cool exciting new-to-them facts.

Do you buy that? Happy to talk through more vignettes.

I'm only "happy" to "entrust the future to the next generation of humans" if I know that they can't (i.e., don't have the technology to) do something irreversible…

OK, here’s another way to put it. When humans do good philosophy and ethics, they’re combining their ability to observe and reason (the source of all “is”) with their social and moral instincts / reflexes (the source of all “ought”, see Valence §2.7).

If it’s possible at all for this process to lead somewhere good, then it’s possible for it to lead somewhere good within the mind of an AI that combines a human-like ability to reason, with human-like social and moral instincts / reflexes. (And if it’s not possible for this process to lead somewhere good, then we’re screwed no matter what.)

Indeed, there are reasons to think the odds of this process leading somewhere good might be better in an AI than in humans, since the programmers can in principle turn the knobs on the various social and moral drives. (Some humans are more motivated by status and self-aggrandizement than others, and I think this is related to interpersonal variation in the strengths of different innate drives. It would be nice to just have a knob in the source code for that!) To be clear, there are also reasons to think the odds might be worse in an AI, since humans are at least a known quantity, and we might not really know what we’re doing in AGI programming until it’s too late.

On my current thinking, it will be very difficult to avoid fast takeoff and unipolar outcomes, for better or worse. (Call me old-fashioned.) (I’m stating this without justification.) If the resulting singleton ASI has the right ingredients to do philosophy and ethics (i.e., it has both the ability to reason, and human-like social and moral instincts / reflexes), but with superhuman speed and insight, then the future could turn out great. Is it an irreversible single-point-of-failure for the future of the universe? You bet! And that’s bad. But I’m currently pretty skeptical that we can avoid that kind of thing (e.g. I expect very low compute requirements that make AGI governance / anti-proliferation rather hopeless), so I’m more in the market for “least bad option” rather than “ideal plan”.

[-]Wei Dai2dΩ240

If it’s possible at all for this process to lead somewhere good, then it’s possible for it to lead somewhere good within the mind of an AI that combines a human-like ability to reason, with human-like social and moral instincts / reflexes.

A counterexample to this is if humans and AIs both tend to conclude after a lot of reflection that they should be axiologically selfish but decision theoretically cooperative (with other strong agents), then if we hand off power to AIs, they'll cooperate with each other (and any other powerful agents in the universe or multiverse) to serve their own collective values, but we humans will be screwed.
Another problem is that we're relatively confident that at least some humans can reason "successfully", in the sense of making philosophical progress, but we don't know the same about AI. There seemingly are reasons to think it might be especially hard for AI to learn, and easy for AI to learn something undesirable instead, like optimizing for how persuasive their philosophical arguments are to (certain) humans.
Finally, I find your arguments against moral realism somewhat convincing, but I'm still pretty uncertain, and think the arguments I gave in Six Plausible Meta-Ethical Alternatives for the realism side of the spectrum still somewhat convincing as well, don't want to bet the universe on or against any of these positions.

[+][comment deleted]9dΩ120

Deleted by cousin_it, 04/25/2025

Reason: Comment deleted by its author.

[-]Oliver Daniels8d30

I think the smart Silver-sympathetic response here would be:

1. Yes, the alignment problem is a real thing, and my (Silver's) sketch of a solution is not sufficient
2. Studying / taking inspiration from human-reward circuitry is an interesting research direction
3. We will be able to iterate on the problem before AI's are capable of taking existentially destructive actions

From (this version of) his perspective, alignment is "normal" problem and a small-piece of achieving beneficial AGI. The larger piece is the AGI part, which is why he spends most of his time/energy/paper devoted to it.

Obviously much ink has been spilled on why alignment might not be a normal problem (feedback loops break when you have agents taking actions which humans can't evaluate, etc. etc.) but this is the central crux.

[-]Steven Byrnes8d31

“This problem has a solution (and one that can be realistically implemented)” is another important crux, I think. As I wrote here: “For one thing, we don’t actually know for sure that this technical problem is solvable at all, until we solve it. And if it’s not in fact solvable, then we should not be working on this research program at all. If it's not solvable, the only possible result of this research program would be “a recipe for summoning demons”, so to speak. And if you’re scientifically curious about what a demon-summoning recipe would look like, then please go find something else to be scientifically curious about instead.”

I have other retorts too, but I’m not sure it’s productive for me to argue against a position that you don’t endorse yourself, but rather are imagining that someone else has. We can see if anyone shows up here who actually endorses something like that.

Anyway, if Silver were to reply “oops, yeah, the reward function plan that I described doesn’t work, in the future I’ll say it’s an unsolved problem”, then that would be a big step in the right direction. It wouldn’t be remotely sufficient, but it would be a big step in the right direction, and worth celebrating.

[-]Noosphere899d20

Re my own take on the alignment problem, if I'm assuming that a future AGI will be built by primarily RL signals and memory plus continuous learning is part of the deal with AGI, I think my current use case for AGI is broadly to train them to be at least slightly better than humans currently are at alignment research specifically, and my general worldview on alignment is that we should primarily try to prepare ourselves for the automated alignment researchers that will come, so most of the work that needs to be frontloaded is stuff that would cause the core loop of automated alignment research to catastrophically fail.

On the misleading intuitions from everyday life, I'd probably say misleading intuitions from modern-day everyday life, because while I do agree that overgeneralization from innate learned reward is a problem, I'd say another problem is people overindex on the history of the modern era for intuitions about peaceful trade being better than war ~all the time, where sociopathic entities/no innate social drives people can be made to do lots of good things for everyone via trade, rather than conflict, but the issue here is that the things that make sociopathic entities do good things like trade peacefully with others rather than enslaving/murdering/raping them are fundamentally dependent on a set of assumptions that are fundamentally violated by AGI and the technologies it spawns (one example of an important assumption is that in order for wars to be prevented, conflict is costly even for the winners, but uncontrolled AGIs/ASIs engaging in conflict with humans and nature is likely to not be costly for them, or it can be made to not be costly, which seriously messes up the self-interested case for trade, and more importantly a really large assumption that is implicit is that property rights are inviolable and that expropriating resources is counter-productive and harms you compared to trading or peacefully making a business, but the problem is that AGI shifts the factors of wealth from hard/impossible to expropriate to relatively easy to expropriate away from other agents, especially if serious nanotech/biotech is ever produced by AGI, similar to how the issue with human/non-human animal relationships is that animal labor is basically worthless, but their land/capital is not worthless, and conflict isn't costly, meaning stealing resources from gorillas and other non-human animals gets you the most wealth).

This is why Matthew Barnett's comments on how trade is usually worth it for selfish agents in the modern era, so AGIs will trade with us peacefully too are so mistaken:

https://forum.effectivealtruism.org/posts/4LNiPhP6vw2A5Pue3/consider-granting-ais-freedom?commentId=HdsGEyjepTkeZePHF

Ben Garfinkel and Luke Drago has talked about this issue too:

https://forum.effectivealtruism.org/posts/TMCWXTayji7gvRK9p/is-democracy-a-fad#Why_So_Much_Democracy__All_of_a_Sudden_

https://forum.effectivealtruism.org/posts/TMCWXTayji7gvRK9p/is-democracy-a-fad#Automation_and_Democracy

https://intelligence-curse.ai/defining/

(It talks about how democracy and human economic relevance would fade away with AGI, but the relevance here is that it points out that the elites of the post-AGI world have no reason to give/trade with commoners what resources need to survive if they are not value aligned with commoner welfare due to the incentives AGI sets up, and thus there's little reason to assume that misaligned AGIs would trade with us, instead of letting us starve or directly killing us all).

Re Goodhart's Law, @Jeremy Gillen has said that his proposal gives us a way to detect if goodharting has occured, and implies in the comments that it has a lower alignment tax than you think:

https://www.lesswrong.com/posts/ZHFZ6tivEjznkEoby/detect-goodhart-and-shut-down

https://www.lesswrong.com/posts/ZHFZ6tivEjznkEoby/detect-goodhart-and-shut-down#Ws6kohnReEAcBDecA

(Steven Byrnes): FYI §14.4 of my post here is a vaguely similar genre although I don’t think there’s any direct overlap.
There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”. We call it “clever out-of-the-box solution” when we’re happy with how it turned out, and we call it “Goodharting” when we’re sad about how it turned out, but it’s not a structural difference. So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception. That’s an alignment tax, which might be fine (depending on the scenario) but should be kept in mind.

(Jeremy Gillen): Yeah I agree there are similarities. I think a benefit of my approach, that I should have emphasized more, is that it's reflectively stable (and theoretically simple and therefore easy to analyze). In your description of an AI that wants to seek clarification, it isn't clear that it won't self-modify (but it's hard to tell).
(Steven Byrnes): There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”.
(Jeremy Gillen): But there is a principled distinction. The distinction is whether the plan exploits differences between the goal specification and our actual goal. This is a structural difference, and we can detect using information about our actual goal.
(Steven Byrnes): So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception.
(Jeremy Gillen): My proposal is usually an exception to this, because it takes advantage of the structural difference between the two cases. The trick is that the validation set only contains things that we actually want. If it were to contain extra constraints beyond what we actually want, then yeah that creates an alignment tax.

My view on practical plans for AI safety is that in the short term, I think that it's probably ok at human-level AI having potentially crazy motivations, because I think AI control is at least a reasonable plan that could plausibly work, and I'm more optimistic than you that modulo rogue internal deployments/fast take-off where we don't continually reduce the compute used on the AI, the AI's action space for harm is vastly reduced compared to humans.

See this podcast for more details below:

https://80000hours.org/podcast/episodes/buck-shlegeris-ai-control-scheming/

A big part of this has to do with the fact that AI states/memory can basically always be reset, so it's pretty easy to reset an AGI's memory, which prevents a lot of the issues of the AGI eventually learning how to defeat the defenses.

More here:

https://www.lesswrong.com/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling

In this project, we assume that the model shares no state between calls. While solving a task, the model can see the actions taken so far and their outputs in its context, but it’s up to us to decide what context we give the model. If we want independent samples of the next action, we make multiple calls to the model using the same context, and the model’s statelessness between calls means that it doesn’t know it is being resampled. We can also easily reconstruct the context of earlier steps if we later decide that we want more samples for their information value. Future models might well maintain state between calls, for instance by using some kind of recurrence or long-term memory bank. But so long as we retain some checkpoints of past states we can easily reset it. In real deployments, this storage cost will likely be worth it during short time windows (like on a particular trajectory), but may not be between trajectories. Statefulness between trajectories, for instance through updating a large codebase over time, is a property that future control settings might have which is harder to work around, and we leave it to future control research to study its impact. Nonetheless, the ability to control the model’s state and draw independent samples is a major difference between security systems designed to be robust to human insider threats and control protocols for untrusted AIs, and it turns out to be an incredibly powerful advantage in many situations.

So my general worldview on how I'd approach the AI alignment problem is I'd use AI control (defined broadly) to make the AGIs that are potentially misaligned do useful work on the alignment problem, and then use the vast AGI army to produce a solution to alignment for very powerful AIs.

We do need to worry about sandbagging/deception, though.

I do think the alignment problem for brain-like AGI may unfortunately be hard to solve, and while I'm not yet convinced that behaviorist rewards give scheming AIs with overwhelming probability no matter what you reward, I do think I should plan for worlds where the alignment problem is genuinely hard to solve, and I like AI control as a stepping stone to solving the alignment problem.

AI control definitely is making assumptions, but I don't think it's making assumptions that are only applicable to LLMs (but it would be nice if we got LLMs first for AGI), but it definitely assumes a limit on capabilities, but lots of the control stuff would work if we got a brain-like AGI with memory and continuous learning.

[-]Towards_Keeperhood9d20

If the user types “improve my fitness” into some interface, and it sets the AI’s reward function to be some “function of the user’s heart rate, sleep duration, and steps taken”, then the AI can potentially get a higher reward by forcing the user into eternal cardio training on pain of death, including forcibly preventing the person from turning off the AI, or changing its goals (see §2.2 above).
The way that the reward function operationalizes “steps taken” need not agree with what we had in mind. If it’s operationalized as steps registered on a wearable tracker, the AI can potentially get higher reward by taking the tracker from the person and attaching it to a paint shaker. “Sleep” may be operationalized in a way that includes the user being forcibly drugged by the AI.
If the user sets a goal of “help me learn Spanish over the next five years”, the AI can potentially get a higher reward by making modified copies of itself to aggressively earn or steal as much money and resources as possible around the world, and then have those resources available in case it might be useful for its local Spanish-maximization goal. For example, money can be used to hire tutors, or to train better successor AIs, or to fund Spanish-pedagogy or brain-computer interface research laboratories around the world, or of course to cheat by bribing or threatening whoever administers the Spanish exam at the end of the five years.

I don't really like those examples because just using human feedback fixes those (although I agree that those examples possibly sound more like the vague proposal from Silver and Sutton).

If “the human gives feedback” is part of the reward function, then the AI can potentially get a higher score by forcing the human to give positive feedback, or otherwise exploiting edge-cases in how this feedback is operationalized and measured.

I think that sounds off to AI researchers. They might (reasonably) think something like "during the critical value formation period the AI won't have the ability to force humans to give positive feedback without receiving negative feedback". If the problem you described happens, it would be that the AI's value function learned to assign high value to "whatever strategy makes the human give positive feedback", rather than "whatever the human would endorse" or "whatever is favored according to the extrapolation of human values". That isn't really a specification gaming problem though. And I think your example isn't very realistic - more likely would be that the value function just learned a complex proxy for predicting reward which totally misgeneralizes in alien ways once you go significantly off distribution.

One could instead name the specification gaming failure mode that the AI overoptimizes on what the humans think is good rather than what is actually good, e.g. proposing a convincing seeming alignment proposal which it doesn't expect to actually work but for which the humans reward it more than for admitting it cannot reliably solve alignment.

(For more on the alignment problem for RL agents, see §10 of my intro series, but be warned that it’s not very self-contained—it’s sorta in the middle of a book-length discussion of how I think RL works in the human brain, and its implications for safety and alignment.)

I think post 10 there is relatively self-contained and I would recommend it much more strongly, especially since you don't really describe the inner alignment problems here.

Re post overall:

I liked sections 2.2 and 2.3 and 3.

I think the post focuses way too much on specification gaming, and way too little on the problem of how to generalize to the right values from the reward. (And IMO, that's not just correct generalization to the value function, but also to the optimization target of a later more powerful optimization process, since I don't think a simple TD-learned value function could plan effectively enough to do something pivotal. (I would still be interested in your thoughts on my comment here, if you have the time.))

(Another outer-alignment-like problem that might be worth mentioning that the values of the AI might come out misaligned if we often just give reward based on the capability of the AI, rather than specifically human-value related stuff. (E.g. the AI might rather end up caring about getting kinds of instrumentally widely useful insights or so, which becomes a problem when we go off distribution, though of course even without that problem we shouldn't expect it to generalize off distribution in the way we want.) (Related.))

[-]Steven Byrnes9d*50

I think that sounds off to AI researchers. They might (reasonably) think something like "during the critical value formation period the AI won't have the ability to force humans to give positive feedback without receiving negative feedback".

If an AI researcher said “during the critical value formation period, AlphaZero-chess will learn that it’s bad to lose your queen, and therefore it will never be able to recognize the value of a strategic queen sacrifice”, then that researcher would be wrong.

(But also, I would be very surprised if they said that in the first place! I’ve never heard anyone in AI use the term “critical value formation periods”.)

RL algorithms can get stuck in local optima of course, as can any other ML algorithm, but I’m implicitly talking about future powerful RL algorithms, algorithms that can do innovative science, run companies, etc., which means that they’re doing a good job of exploring a wide space of possible strategies and not just getting stuck in the first thing they come across.

I think your example [“the AI can potentially get a higher score by forcing the human to give positive feedback, or otherwise exploiting edge-cases in how this feedback is operationalized and measured”] isn't very realistic - more likely would be that the value function just learned a complex proxy for predicting reward which totally misgeneralizes in alien ways once you go significantly off distribution.

I disagree. I think you’re overgeneralizing from RL algorithms that don’t work very well (e.g. RLHF), to RL algorithms that do work very well, like human brains or the future AI algorithms that I think Sutton & Silver have in mind.

For example, if I apply your logic there to humans 100,000 years ago, it would fail to predict the fact that humans would wind up engaging in activities like: eating ice cream, playing video games, using social media, watching television, raising puppies, virtual friends, fentanyl, etc. None of those things are “a complex proxy for predicting reward which misgeneralizes”, rather they are a-priori-extraordinarily-unlikely strategies, that do strongly trigger the human innate reward function, systematically and by design.

Conversely, I think you’re overstating the role of goal misgeneralization. Specifically, goal misgeneralization usually corrects itself: If there’s an OOD action or plan that seems good to the agent because of goal misgeneralization, then the agent will do that action or plan, and then the reward function will update the value function, and bam, now it’s no longer OOD, and it’s no longer misgeneralizing in that particular way. Remember, we’re talking about agents with continuous online learning.

Goal misgeneralization is important in a couple special cases—the case where it leads to irreversible actions (like editing the reward function or self-replicating), and the case where it leads to deliberately not exploring certain states. In these special cases, the misgeneralization can’t necessarily correct itself. But usually it does, so I think your description there is wrong.

I think the post focuses way too much on specification gaming

I did mention in §2.5 that it’s theoretically possible for specification gaming and goal misgeneralization to cancel each other out, but claimed that this won’t happen for their proposal. If the authors had said “yeah of course specification gaming itself is unsolvable, but we’re going to do that cancel-each-other-out thing”, then of course I would have elaborated on that point more. I think the authors are making a more basic error so that’s what I’m focusing on.

[-]Towards_Keeperhood9d*30

I disagree. I think you’re overgeneralizing from RL algorithms that don’t work very well (e.g. RLHF), to RL algorithms that do work very well, like human brains or the future AI algorithms that I think Sutton & Silver have in mind.
For example, if I apply your logic there to humans 100,000 years ago, it would fail to predict the fact that humans would wind up engaging in activities like: eating ice cream, playing video games, using social media, watching television, raising puppies, virtual friends, fentanyl, etc. None of those things are “a complex proxy for predicting reward which misgeneralizes”, rather they are a-priori-extraordinarily-unlikely strategies, that do strongly trigger the human innate reward function, systematically and by design.

Thanks.

I'm not sure I fully understand what you're trying to say here. The "100,000 years ago" suggests to me you're talking about evolution, but then at the end you're comparing it to the human innate reward function, rather than genetic fitness.

I agree that humans do a lot of stuff that triggers the human innate reward function. By "a complex proxy for predicting reward which misgeneralizes", I don't mean the AI winds up with a goal that disagrees with the reward signal, but rather that it probably learns one of those many many goals that are compatible with the reward function, but it happens to not be in the narrow cluster of reward-compatible goals that we hoped for. (One could say narrowing the compatible goals is part of reward specification, but I rather don't, because I don't think it's a practical avenue to try to get a reward function that can precisely predict how well some far-out-of-distribution outcomes (e.g. what kind of sentient beings to create when we're turning the stars into cities) align with human's coherent extrapolated volition.)

(If we ignore evolution and only focus on alignment relative to the innate reward function, then) the examples you mentioned ("playing video games,...") are still sufficiently on-distribution that the reward function says something about those, and failing here is not the main failure mode I worry about.
The problem is that human values are not only about what normal humans value in everyday life, but also about what they would end up valuing if they became smarter. E.g. I want to fill the galaxies with lots of computronium simulating sentient civilizations living happy and interesting lives, and this is one particular goal that is compatible with the human reward function, but there are many others possible reward-compatible goals. An AI that has similar values as humans at +0SD intelligence, might end up valuing something very different from +7SD humans at +7SD, because it may have different underlying priors for doing philosophical value reflection. This would be bad.

We need to create AIs that can solve very difficult problems, and just reasoning about how alignment works in normal people isn't sufficient here. Though understand how Steven Byrnes' values formed might be sorta sufficient. I think you have beliefs like "I want all sentient life not to suffer", rather than just "I want humans in fleshy bodies not to suffer", even though if there were a suffering sentient program of which you only know through abstract reasoning, this wouldn't trigger your innate reward (I think?).

The way I see it, beliefs like "I want all sentient life not to suffer" steer your behavior, because the value function learned heuristics of "allow abstract goal-oriented reasoning", even though most abstract thoughts like "I need to figure out how human social instincts work" are meaningless to the value function (though maybe you disagree?). I think most powerful optimization in smart humans comes from such goal-directed reasoning, rather than the value function doing most of the planning work.
IIUC you have an alignment framing of "the value function needs to capture our preferences", but I think for sufficiently smart AIs this isn't a good framing because I think the value function will promote goal-oriented thinking strategies, which will be responsible for the main optimization power and might cause the AI to get more precise goals which aren't those that we wanted the AI to have.
I would be curious on your thoughts here.

(Also tbc, the final goal of the superintelligence may also end up incompatible with the reward on the training distribution. E.g. it may just care about an unbounded interpretation of a particular subgoal that got optimized hard enough for some more efficient planning algorithms toward that goal to emerge, which include reflectivity and realize self-preservation incentive, and then that part became deceptive and handled more and more other jobs until it got enough reward to take over the AI. But that's a different point.)

Conversely, I think you’re overstating the role of goal misgeneralization. Specifically, goal misgeneralization usually corrects itself: If there’s an OOD action or plan that seems good to the agent because of goal misgeneralization, then the agent will do that action or plan, and then the reward function will update the value function, and bam, now it’s no longer OOD, and it’s no longer misgeneralizing in that particular way. Remember, we’re talking about agents with continuous online learning.

I don't think you're imagining properly OOD cases, only "slightly new cases". The human innate reward function doesn't hit back whether I conclude "I value sentient minds regardless on what substrate they are running on" or "I value there being happy humans out of flesh and bone in the universe".

Human feedback also doesn't work. Quote:

The most important alignment technique used in today’s systems, Reinforcement Learning from Human Feedback (RLHF), trains AI to produce outputs that it predicts would be rated highly by human evaluators. This already creates its own predictable problems, such as style-over-substance and flattery. This method breaks down completely, however, when AI starts working on problems where humans aren’t smart enough to fully understand the system’s proposed solutions, including the long-term consequences of superhumanly sophisticated plans and superhumanly complex inventions and designs.

Aka we cannot judge whether what an actual smart AI is doing is in our interest. Also we don't know our coherent extrapolated values.

I think we would need to ensure the AI winds up corrigibly aligned to the CEV of humans, and this is not something that can be specified through reward.

Or what kind of reward function do you have in mind?

[-]Steven Byrnes4d30

I’ve gone back and forth about whether I should be thinking more about (A) “egregious scheming followed by violent takeover” versus (B) more subtle things e.g. related to “different underlying priors for doing philosophical value reflection”. This post emphasizes (A), because it’s in response to the Silver & Sutton proposal that doesn’t even clear that low bar of (A). So forget about (B).

There’s a school of thought that says that, if we can get past (A), then we can muddle our way through (B) as well, because if we avoid (A) then we get something like corrigibility and common-sense helpfulness, including checking in before doing irreversible things , and helping with alignment research and oversight. I think this is a rather popular school of thought these days, and is one of the major reasons why the median P(doom) among alignment researchers is probably “only” 20% or whatever, as opposed to much higher. I’m not sure whether I buy that school of thought or not. I’ve been mulling it over and am hoping to discuss it in a forthcoming post. (But it’s moot if we can’t even solve (A).)

Regardless, I’m allowed to talk about how (A) is a problem, whether or not (B) is also a problem. :)

if there were a suffering sentient program of which you only know through abstract reasoning, this wouldn't trigger your innate reward (I think?).

I think it would! I think social instincts are in the “non-behaviorist” category, wherein there’s a ground-truth primary reward that depends on what you’re thinking about. And believing that a computer program is suffering is a potential trigger.

…I might respond to the rest of your comment in our other thread (when I get a chance).

[-]Towards_Keeperhood4d10

Thanks! It's nice that I'm learning more about your models.

I’ve gone back and forth about whether I should be thinking more about (A) “egregious scheming followed by violent takeover” versus (B) more subtle things e.g. related to “different underlying priors for doing philosophical value reflection”.

(A) seems much more general than what I would call "reward specification failure".

The way I use "reward specification" is:

If the AI has as goal "get reward" (or sth else) rather than "whatever humans want" because it better fits the reward data, then it's a reward specification problem.
If the AI has as goal "get reward" (or sth else) rather than "whatever humans want" because it fits the reward data about similarly well and it's the simpler goal given the architecture, it's NOT a reward specification problem.
- (This doesn't seem to me to fit your description of "B".)
- (Related.)

I might count the following as reward specification problem, but maybe not, maybe another name would be better:

The AI mostly gets reward for solving problems which aren't much about human values specifically, so the AI may mainly learn to value insights for solving problems better rather than human values.

(B) seems to me like an overly specific phrasing, and there are many stages where misgeneralization may happen:

when the AI transitions to thinking in goal-directed ways (instead of following more behavioral heuristics or value function estimates)
when the AI starts modelling itself and forms a model of what values it has (where the model might mismatch what is optimized on the object level)
when the AI's ontology changes and it needs to decide how to rebind value-laden concepts
when the AI encounters philosophical problems like Pascal's mugging

Section 4 of Jeremy's and Peter's report also shows some more ways of how an AI might fail to learn the intended goal without being due to reward specification^[1], though it doesn't use your model-based RL frame.

Also, I don't think A and B are exhaustive. Other somewhat speculative problems include:

A mesaoptimizer emerges under selection pressure and tries to gain control of the larger AI it is in while staying undetected. (Sorta like cancer for the mind of the AI.)
- A special case of this might come from the AI trying to imagine another mind in detail, and the other mind might notice it is simulated and try to take control of the AI.
The AI might make a mistake when designing a more efficient successor AI on a different AI paradigm (especially because it may get pressured by humans into trying to do it quickly because of AI race), so the successor AI ends up with different values.
Other stuff I haven't thought of now

Tbc, there's no individual point where I think failure is overwhelmingly likely by default, but overall failure is disjunctive.

if there were a suffering sentient program of which you only know through abstract reasoning, this wouldn't trigger your innate reward (I think?).
I think it would! I think social instincts are in the “non-behaviorist” category, wherein there’s a ground-truth primary reward that depends on what you’re thinking about. And believing that a computer program is suffering is a potential trigger.

Interesting that you think this.

Having quite good interpretability that we can use to give reward would definitely make me significantly more optimistic.

Though AIs might learn to think thoughts in different formats that don't trigger negative reward, as e.g. in the "Deep deceptiveness" story.

^{^}
Aka some inner alignment (aka goal misgeneralization) failure modes, though I don't know whether I want to use those words, because it's actually a huge bundle of problems.

[-]Steven Byrnes4d30

Seems like an important difference here is that you’re imagining train-then-deploy whereas I’m imagining continuous online learning. So in the model I’m thinking about, there isn’t a fixed set of “reward data”, rather “reward data” keeps coming in perpetually, as the agent does stuff. Of course, as I said above, (mis)generalization from a fixed set of reward data remains an issue for the two special cases of irreversible actions & deliberately not exploring certain states.

I didn’t intend (A) & (B) to be a precise and complete breakdown.

AIs might learn to think thoughts in different formats

Yeah that’s definitely a thing to think about. Human examples might include “compassion fatigue” (shutting people out because it’s too hard to feel for them); or my theory that many people with autism learn to deliberately unconsciously avoid a wide array of innate social reactions from a young age; or choosing spending more and more time and mental space with imaginary friends, virtual friends, teddy bears, movies, etc. instead of real people. There are various tricks to mitigate these kinds of complications, and they seem to work well enough in human brains. So I think it’s premature to declare that this problem is definitely unsolvable. (And I think the Deep Deceptiveness post is too simplistic, see my comment on it.)

[-]Towards_Keeperhood3d30

Thx.

Seems like an important difference here is that you’re imagining train-then-deploy whereas I’m imagining continuous online learning. So in the model I’m thinking about, there isn’t a fixed set of “reward data”, rather “reward data” keeps coming in perpetually, as the agent does stuff.

I don't really imagine train-then-deploy, but I think that (1) when the AI becomes coherent enough it will prevent getting further value drift, and (2) the AI eventually needs to solve very hard problems where we won't have sufficient understanding to judge whether what the AI did is actually good.

[-]Steven Byrnes3d20

(1) Yeah AI self-modification is an important special case of irreversible actions, where I think we both agree that (mis)generalization from the reward history is very important. (2) Yeah I think we both agree that it’s hopeless to come up with a reward function for judging AI behavior as good vs bad, that we can rely on all the way to ASI.

Moderation Log