we don’t tend to imagine humans directly building superintelligence
Speak for yourself! Humans directly built AlphaZero, which is a superintelligence for board game worlds. So I don’t think it’s out of the question that humans could directly build a superintelligence for the real world. I think that’s my main guess, actually?
(Obviously, the humans would be “directly” building a learning algorithm etc., and then the trained weights come out of that.)
(OK sure, the humans will use AI coding assistants. But I think AI coding assistants, at least of the sort tha...
You could think, for example, that almost all of the core challenge of aligning a superintelligence is contained in the challenge of safely automating top-human-level alignment research. I’m skeptical, though. In particular: I expect superintelligent-level capabilities to create a bunch of distinctive challenges.
I’m curious if you could name some examples, from your perspective?
I’m just curious. I don’t think this is too cruxy—I think the cruxy-er part is how hard it is to safely automate top-human-level alignment research, not whether there are further di...
Thanks for writing this! Leaving some comments with reactions as I was reading, not all very confident, and sorry if I missed or misunderstood things you wrote.
...Problems with these evaluation techniques can arise in attempting to automate all sorts of domains (I’m particularly interested in comparisons with (a) capabilities research, and (b) other STEM fields). And I think this should be a source of comfort. In particular: these sorts of problems can slow down the automation of capabilities research, too. And to the extent they’re a bottleneck on all sorts
I think that OP’s discussion of “number-go-up vs normal science vs conceptual research” is an unnecessary distraction, and he should have cut that part and just talked directly about the spectrum from “easy-to-verify progress” to “hard-to-verify progress”, which is what actually matters in context.
Partly copying from §1.4 here, you can (A) judge ideas via new external evidence, and/or (B) judge ideas via internal discernment of plausibility, elegance, self-consistency, consistency with already-existing knowledge and observations, etc. There’s a big ra...
Thanks!
Hmm. I think there’s an easy short-term ‘solution’ to Goodhart’s law for AI capabilities, which is to give humans a reward button. Then the rewards are, by definition, exactly what the person wants (until the AI can grab the button). There’s no need to define metrics or whatever, right?
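(To be concrete, here’s a toy sketch of what I mean—a bandit-style RL loop where the reward comes from a stand-in for the human’s button press rather than from any hand-coded metric. All the names here are made up for illustration, and the “human” is faked as a function so the example runs:)

```python
import random

def human_button_reward(outcome):
    """Stand-in for a human pressing the reward button.

    In the real setup this is a person's judgment call, not code --
    that's the point: no proxy metric to define, hence no metric to Goodhart.
    Here we fake it for illustration: the 'human' likes the outcome 'helpful'.
    """
    return 1.0 if outcome == "helpful" else 0.0

def train_step(action_values, actions, epsilon=0.1):
    """One step of a bandit-style RL loop driven by button presses."""
    # epsilon-greedy choice over actions
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: action_values[a])
    reward = human_button_reward(action)
    # simple incremental value update toward the received reward
    action_values[action] += 0.5 * (reward - action_values[action])
    return action, reward

actions = ["helpful", "flattering", "evasive"]
values = {a: 0.0 for a in actions}
random.seed(0)
for _ in range(200):
    train_step(values, actions)

best = max(values, key=values.get)
print(best)  # the agent converges on whatever the 'human' rewards
```

The point of the sketch is just that the “reward function” is the person’s judgment itself, so there’s no proxy metric to game—until, as noted, the AI can grab the button.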
(This is mildly related to RLHF, except that RLHF makes models dumber, whereas I’m imagining some future RL paradigm wherein the RL training makes models smarter.)
I think your complaint is that people would be bad at pressing the button, even by their ow...
Thanks! Your comment was very valuable, and helped spur me to write Self-dialogue: Do behaviorist rewards make scheming AGIs? (as I mention in that post). Sorry I forgot to reply to your comment directly.
To add on a bit to what I wrote in that post, and reply more directly…
So far, you’ve proposed that we can do brain-like AGI, and the reward function will be a learned function trained by labeled data. The data, in turn, will be “lots of synthetic data that always shows the AI acting aligned even when the human behaves badly, as well as synthetic data to ma...
(1) Yeah AI self-modification is an important special case of irreversible actions, where I think we both agree that (mis)generalization from the reward history is very important. (2) Yeah I think we both agree that it’s hopeless to come up with a reward function for judging AI behavior as good vs bad, that we can rely on all the way to ASI.
Seems like an important difference here is that you’re imagining train-then-deploy whereas I’m imagining continuous online learning. So in the model I’m thinking about, there isn’t a fixed set of “reward data”, rather “reward data” keeps coming in perpetually, as the agent does stuff. Of course, as I said above, (mis)generalization from a fixed set of reward data remains an issue for the two special cases of irreversible actions & deliberately not exploring certain states.
I didn’t intend (A) & (B) to be a precise and complete breakdown.
...AIs might le
I thought of a fun case in a different reply: Harry is a random OpenAI customer and writes in the prompt “Please debug this code. Don’t cheat.” Then o3 deletes the unit tests instead of fixing the code. Is this “specification gaming”? No! Right? If we define “the specification” as what Harry wrote, then o3 is clearly failing the specification. Do you agree?
Thanks for the examples!
Yes I’m aware that many are using terminology this way; that’s why I’m complaining about it :)
I think your two 2018 Victoria Krakovna links (in context) are both consistent with my narrower (I would say “traditional”) definition. For example, the CoastRunners boat is actually getting a high RL reward by spinning in circles. Even for the non-RL optimization problems she mentions (e.g. evolutionary optimization), there is an objective which is actually scoring the result highly. Whereas for an example of o3 deleting a unit test ...
Thanks!
I’m now much more sympathetic to a claim like “the reason that o3 lies and cheats is (perhaps) because some reward-hacking happened during its RL post-training”.
But I still think it’s wrong for a customer to say “Hey I gave o3 this programming problem, and it reward-hacked by editing the unit tests.”
I’ve gone back and forth about whether I should be thinking more about (A) “egregious scheming followed by violent takeover” versus (B) more subtle things e.g. related to “different underlying priors for doing philosophical value reflection”. This post emphasizes (A), because it’s in response to the Silver & Sutton proposal that doesn’t even clear that low bar of (A). So forget about (B).
There’s a school of thought that says that, if we can get past (A), then we can muddle our way through (B) as well, because if we avoid (A) then we get something like ...
No question that e.g. o3 lying and cheating is bad, but I’m confused why everyone is calling it “reward hacking”.
Let’s define “reward hacking” (a.k.a. specification gaming) as “getting a high RL reward via strategies that were not desired by whoever set up the RL reward”. Right?
If so, well, all these examples on X etc. are from deployment, not training. And there’s no RL reward at all in deployment. (Fine print: Maybe there are occasional A/B tests or thumbs-up/down ratings in deployment, but I don’t think those have anything to do with why o3 lies and che...
I agree people often aren't careful about this.
Anthropic says
During our evaluations we noticed that Claude 3.7 Sonnet occasionally resorts to special-casing in order to pass test cases in agentic coding environments . . . . This undesirable special-casing behavior emerged as a result of "reward hacking" during reinforcement learning training.
Similarly OpenAI suggests that cheating behavior is due to RL.
I think that using 'reward hacking' and 'specification gaming' as synonyms is a significant part of the problem. I'd argue that for LLMs, which can learn task specifications not only through RL but also through prompting, it makes more sense to keep those concepts separate, defining them as follows:
Thanks! I’m assuming continuous online learning (as is often the case for RL agents, but is less common in an LLM context). So if the agent sees a video of the button being pressed, they would not feel a reward immediately afterwards, and they would say “oh, that’s not the real thing”.
(In the case of humans, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable, and then turns it off and probably won’t bother even trying again in th...
Thanks!
Yeah, a pretty large crux is how far you can improve RL algorithms without figuring out a way to solve specification gaming issues, because this is what controls whether we should expect competent misgeneralization of goals we don't want, or reward hacking/wireheading that fails to take over the world.
I think this is revealing some differences of terminology and intuitions between us. To start with, in the §2.1 definitions, both “goal misgeneralization” and “specification gaming” (a.k.a. “reward hacking”) can be associated with “competent pursuit of...
even if during training it already knew that buttons are often connected to wires
I was assuming that the RL agent understands how the button works and indeed has a drawer of similar buttons in its basement which it attaches to wires all the time for its various projects.
A slightly smarter agent would turn its gaze slightly closer to the reward itself.
I’d like to think I’m pretty smart, but I don’t want to take highly-addictive drugs.
Although maybe your perspective is “I don’t want to take cocaine → RL is the wrong way to think about what the human brain is...
a big driver of past pure RL successes like AlphaZero was that they were in domains where reward hacking was basically a non-concern, because it was easy to make unhackable environments (like many games). Combining this with a lot of data and self-play allowed pure RL to scale to vastly superhuman heights without requiring the insane compute that evolution spent to make us good at doing RL tasks (which was 10^42 FLOPs at a minimum…)
I don’t follow this part. If we take “human within-lifetime learning” as our example, rather than evolution, (and we should!), then we...
Thanks!
Why does this happen in the first place
There are lots of different reasons. Here’s one vignette that popped into my head. Billy and Joey are 8yo’s. Joey read a book about trains last night, and Billy read a book about dinosaurs last night. Each learned something new and exciting to them, because trains and dinosaurs are cool (big and loud and mildly scary, which triggers physiological arousal and thus play drive). Now they’re meeting each other at recess, and each is very eager to share what they learned with the other, and thus they’re in competiti...
The problem, as I see it, is learning to choose futures based on what will actually happen in these futures, not on what the agent will feel.
I think RL agents (at least, of the type I’ve been thinking about) tend to “want” salient real-world things that have (in the past) tended to immediately precede the reward signal. They don’t “want” the reward signal itself—at least, not primarily. This isn’t a special thing that requires non-behaviorist rewards, rather it’s just the default outcome of TD learning (when set up properly). I guess you’re disagreeing...
I expect “the usual agent debugging loop” (§2.2) to keep working. If o3-type systems can learn that “winding up with the right math answer is good”, then they can learn “flagrantly lying and cheating are bad” in the same way. Both are readily-available feedback signals, right? So I think o3’s dishonesty is reflecting a minor problem in the training setup that the big AI companies will correct in the very near future without any new ideas, if they haven’t already. Right? Or am I missing something?
That said, I also want to re-emphasize that both myself and S...
I think it's important to note that a big driver of past pure RL successes like AlphaZero was that they were in domains where reward hacking was basically a non-concern, because it was easy to make unhackable environments (like many games). Combining this with a lot of data and self-play allowed pure RL to scale to vastly superhuman heights without requiring the insane compute that evolution spent to make us good at doing RL tasks (which was 10^42 FLOPs at a minimum, which is basically unachievable without a well-developed space industry or an intelligence explosi...
under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?
It’s hard for me to give a perfectly confident answer because I don’t understand everything about human social instincts yet :) But here are some ways I’m thinking about that:
“This problem has a solution (and one that can be realistically implemented)” is another important crux, I think. As I wrote here: “For one thing, we don’t actually know for sure that this technical problem is solvable at all, until we solve it. And if it’s not in fact solvable, then we should not be working on this research program at all. If it's not solvable, the only possible result of this research program would be “a recipe for summoning demons”, so to speak. And if you’re scientifically curious about what a demon-summoning recipe would look lik...
The RL algorithms that people talk about in AI traditionally feature an exponentially-discounted sum of future rewards, but I don’t think there are any exponentially-discounted sums of future rewards in biology (more here). Rather, you have an idea (“I’m gonna go to the candy store”), and the idea seems good or bad, and if it seems sufficiently good, then you do it! (More here.) It can seem good for lots of different reasons. One possible reason is: the idea is immediately associated with (non-behaviorist) primary reward. Another possible reason is: the idea invol...
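(For concreteness, the standard textbook quantity I have in mind is the discounted return:

```latex
G_t \;=\; \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad 0 \le \gamma < 1
```

i.e. each future reward gets down-weighted by an extra factor of γ per timestep, and the agent is supposed to be optimizing this sum. My claim is that nothing in the brain computes or estimates this particular quantity.)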
My model is simpler, I think. I say: The human brain is some yet-to-be-invented variation on actor-critic model-based reinforcement learning. The reward function (a.k.a. “primary reward” a.k.a. “innate drives”) has a bunch of terms: eating-when-hungry is good, suffocation is bad, pain is bad, etc. Some of the terms are in the category of social instincts, including something that amounts to “drive to feel liked / admired”.
(Warning: All of these English-language descriptions like “pain is bad” is an approximate gloss on what’s really going on, which is only...
Solving the Riemann hypothesis is not a “primary reward” / “innate drive” / part of the reward function for humans. What is? Among many other things, (1) the drive to satisfy curiosity / alleviate confusion, and (2) the drive to feel liked / admired. And solving the Riemann hypothesis leads to both of those things. I would surmise that (1) and/or (2) is underlying people’s desire to solve the Riemann hypothesis, although that’s just a guess. They’re envisioning solving the Riemann hypothesis and thus getting the (1) and/or (2) payoff.
So one way that people...
I suggest making anonymity compulsory
It’s an interesting idea, but the track records of the grantees are important information, right? And if the track record includes, say, a previous paper that the funder has already read, then you can’t submit the paper with author names redacted.
Also, ask people seeking funding to make specific, unambiguous, easily falsifiable predictions of positive outcomes from their work. And track and follow up on this!
Wouldn’t it be better for the funder to just say “if I’m going to fund Group X for Y months / years of work, I shou...
You might find this post helpful? Self-dialogue: Do behaviorist rewards make scheming AGIs? In it, I talk a lot about whether the algorithm is explicitly thinking about reward or not. I think it depends on the setup.
(But I don’t think anything I wrote in THIS post hinges on that. It doesn’t really matter whether (1) the AI is sociopathic because being sociopathic just seems to it like part of the right and proper way to be, versus (2) the AI is sociopathic because it is explicitly thinking about the reward signal. Same result.)
...subjecting it to any kind of
I think that sounds off to AI researchers. They might (reasonably) think something like "during the critical value formation period the AI won't have the ability to force humans to give positive feedback without receiving negative feedback".
If an AI researcher said “during the critical value formation period, AlphaZero-chess will learn that it’s bad to lose your queen, and therefore it will never be able to recognize the value of a strategic queen sacrifice”, then that researcher would be wrong.
(But also, I would be very surprised if they said that in the ...
OK, here’s my argument that, if you take {intelligence, understanding, consequentialism} as a unit, it’s sufficient for everything:
I kinda agree, but that’s more a sign that schools are bad at teaching things, than a sign that human brains are bad at flexibly applying knowledge. See my comment here.
See my other comment. I find it distressing that multiple people here are evidently treating acknowledgements as implying that the acknowledged person endorses the end product. I mean, it might or might not be true in this particular case, but the acknowledgement is no evidence either way.
(For my part, I’ve taken to using the formula “Thanks to [names] for critical comments on earlier drafts”, in an attempt to preempt this mistake. Not sure if it works.)
Chiang and Rajaniemi are on board
Let’s all keep in mind that the acknowledgement only says that Chiang and Rajaniemi had conversations with the author (Nielsen), and that Nielsen found those conversations helpful. For all we know, Chiang and Rajaniemi would strongly disagree with every word of this OP essay. If they’ve even read it.
Learning from strategies that stood the test of time would be tradition more so than intelligence. I think tradition requires intelligence, but it also requires something else that's less clear (and possibly not simple enough to be assembled manually, idk).
Right, that’s what I was gonna say. You need intelligence to sort out which traditions should be copied and which ones shouldn’t. There was a 13-billion-year “tradition” of not building e-commerce megastores, but Jeff Bezos ignored that “tradition”, and it worked out very well for him (and I’m happy about...
If your model is underparameterized (which I think is true for the typical model?), then it can't learn any pattern that occurs only once in the data. And even if the model is overparameterized, it still can't learn any pattern that never occurs in the data.
Dunno if anything’s changed since 2023, but this says LLMs learn things they’ve seen exactly once in the data.
I can vouch that you can ask LLMs about things that are extraordinarily rare in the training data—I’d assume well under once per billion tokens—and they do pretty well. E.g. they know lots of r...
I think you’re conflating consequentialism and understanding in a weird-to-me way. (Or maybe I’m misunderstanding.)
I think consequentialism is related to choosing one action versus another action. I think understanding (e.g. predicting the consequence of an action) is different, and that in practice understanding has to involve self-supervised learning.
(I think human brains have both [partly-] consequentialist decisions and self-supervised updating of the world-model.) (They’re not totally independent, but rather they interact via training data: e.g. [part...
I’m not too interested in litigating what other people were saying in 2015, but OP is claiming (at least in the comments) that “RLHF’d foundation models seem to have common-sense human morality, including human-like moral reasoning and reflection” is evidence for “we’ve made progress on outer alignment”. If so, here are two different ways to flesh that out:
(IMO this is kinda unrelated to the OP, but I want to continue this thread.)
Have you elaborated on this anywhere?
Perhaps you missed it, but some guy in 2022 wrote this great post which claimed that “Consequentialism, broadly defined, is a general and useful way to develop capabilities.” ;-)
I’m actually just in the course of writing something about why “consequentialism provides an extremely powerful but difficult-to-align method of converting intelligence into agency” … maybe I can send you the draft for criticism when it’s ready?
In run-and-tumble motion, “things are going well” implies “keep going”, whereas “things are going badly” implies “choose a new direction at random”. Very different! And I suggest in §1.3 here that there’s an unbroken line of descent from the run-and-tumble signal in our worm-like common ancestor with C. elegans, to the “valence” signal that makes things seem good or bad in our human minds. (Suggestively, both run-and-tumble in C. elegans, and the human valence, are dopamine signals!)
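(If it helps, here’s a minimal toy simulation of the run-and-tumble policy. The scalar “concentration” field is made up for illustration—nothing biologically realistic—but it shows how “keep going when things improve, re-randomize when they don’t” produces net drift up a gradient:)

```python
import math
import random

def concentration(x, y):
    """Toy attractant field: higher closer to the source at the origin."""
    return -math.hypot(x, y)

def run_and_tumble(steps=2000, step_size=0.05, seed=0):
    """'Things going well' (concentration rising) -> keep heading;
    'things going badly' (concentration falling) -> tumble to a random new heading."""
    rng = random.Random(seed)
    x, y = 5.0, 5.0
    heading = rng.uniform(0, 2 * math.pi)
    prev = concentration(x, y)
    for _ in range(steps):
        x += step_size * math.cos(heading)
        y += step_size * math.sin(heading)
        now = concentration(x, y)
        if now < prev:  # things got worse: tumble to a random new direction
            heading = rng.uniform(0, 2 * math.pi)
        prev = now      # otherwise: keep going in the same direction
    return math.hypot(x, y)

start_dist = math.hypot(5.0, 5.0)
end_dist = run_and_tumble()
print(end_dist < start_dist)  # net drift up the gradient, toward the source
```

Note that the policy never represents the gradient direction or any discounted future sum—just a one-bit “better or worse than a moment ago” signal—yet it reliably climbs the gradient.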
So if some idea pops into your head, “maybe I’ll stand up”, and it seems a...
I kinda think of the main clusters of symptoms as: (1) sensory sensitivity, (2) social symptoms, (3) different “learning algorithm hyperparameters”.
More specifically, (1) says: innate sensory reactions (e.g. startle reflex, orienting reflex) are so strong that they’re often overwhelming. (2) says: innate social reactions (e.g. the physiological arousal triggered by eye contact) are so strong that they’re often overwhelming. (3) includes atypical patterns of learning & memory including the gestalt pattern of childhood language acquisition which is commo...
Thanks! Oddly enough, in that comment I’m much more in agreement with the model you attribute to yourself than the model you attribute to me. ¯\_(ツ)_/¯
the value function doesn't understand much of the content there, and only uses some simple heuristics for deciding how to change its value estimate
Think of it as a big table that roughly-linearly assigns good or bad vibes to all the bits and pieces that comprise a thought, and adds them up into a scalar final answer. And a plan is just another thought. So “I’m gonna get that candy and eat it right now” is a ...
I think the interest rate thing provides so little evidence either way that it’s misleading to even mention it. See the EAF comments on that post, and also Zvi’s rebuttal. (Most of that pushback also generalizes to your comment about the S&P.) (For context, I agree that AGI in ≤2030 is unlikely.)
Thanks! Basically everything you wrote importantly mismatches my model :( I think I can kinda translate parts; maybe that will be helpful.
Background (§8.4.2): The thought generator settles on a thought, then the value function assigns a “valence guess”, and the brainstem declares an actual valence, either by copying the valence guess (“defer-to-predictor mode”), or overriding it (because there’s meanwhile some other source of ground truth, like I just stubbed my toe).
Sometimes thoughts are self-reflective. E.g. “the idea of myself lying in bed” is a differ...
I am a human, but if you ask me whether I want to ditch my family and spend the rest of my life in an Experience Machine, my answer is no.
(I do actually think there’s a sense in which “people optimize reward”, but it’s a long story with lots of caveats…)
I downvoted because the conclusion “prediction markets are mediocre” does not follow from the premise “here is one example of one problem that I imagine abundant legal well-capitalized prediction markets would not have completely solved (even though I acknowledge that they would have helped move things in the right direction on the margin)”.
Pretty sure “DeepCent” is a blend of DeepSeek & Tencent—they have a footnote: “We consider DeepSeek, Tencent, Alibaba, and others to have strong AGI projects in China. To avoid singling out a specific one, our scenario will follow a fictional “DeepCent.””. And I think the “brain” in OpenBrain is supposed to be reminiscent of the “mind” in DeepMind.
ETA: Scott Alexander tweets with more backstory on how they settled on “OpenBrain”: “You wouldn't believe how much work went into that stupid name…”
Hmm. A million fast-thinking Stalin-level AGIs would probably have a better shot of taking control than doing alignment research, I think?
Also, if there’s an alignment tax (or control tax), then that impacts the comparison, since the AIs doing alignment research are paying that tax whereas the AIs attempting takeover are not. (I think people have wildly different intuitions about how steep the alignment tax will be...