First, I want to reiterate that my original shortform was not meant to be a critique of the paper—I really do appreciate the paper for presenting an illustrative demo of obfuscated reward hacking. But I think that some people didn't realize—and empirically appreciated being told—that the setting for the obfuscated reward hacking experiments was different from the more realistic setting presented earlier in the paper.
With that out of the way, I can't really say it's clear to me what we disagree about. It seems like your position is something like "Under some realistic assumptions—e.g. very capable models being trained in environments with easy (for the model) reward hacks that are hard to detect based only on the model's actions—we should expect obfuscated reward hacking to occur." I certainly agree with this!
On the other hand, I think there are other sets of realistic assumptions under which obfuscated reward hacking is less likely. To name some:
Difficult reward hacks that require CoT reasoning. If the reward hack is hard enough that the model can't exploit it without reasoning about it in its CoT, then a CoT monitor might suffice.
Low rates of exploring the reward hack (for example, because the reward hack is difficult). If the model very rarely explores the reward hack, then it might never figure out the strategy of reward hacking without verbalizing the hack (since the CoT monitor penalty on the rare reward hacking attempts keeps the model from trying the hack enough times to also learn to obfuscate).
The intended task is easy for the model. I've actually seen some settings where larger models reward hack less because they quickly learn to do the task as intended and never explore reward hacks.
Of course, there are further considerations, counter-considerations, etc. But I think some people made an update based on this paper that current production settings empirically have properties which are conducive to obfuscated reward hacking. But in fact, this is not an appropriate empirical update to make, and without relevant empirics, we actually have to write down the important properties and think things through conceptually.
I expect would all make the situation worse and harder to detect.
"Harder to detect" sure. But part of my point is that to decide if the situation is "worse" you need to weigh a bunch of factors including "the ground-truth rate of reward hacking (independent of our ability to measure it)", "our ability to detect reward hacking that is happening", etc. etc. Training against a CoT monitor might make some of these things better (e.g. ground-truth reward hacking rate) and some of these things worse (e.g. monitorability).
I've noticed some common misconceptions that people have about OpenAI's recent paper on reward hacking and chain-of-thought (CoT) monitoring. I'll try to correct some of these below. To be clear, this isn't meant as criticism of the paper (which I quite enjoyed), just a correction of some misconceptions I've noticed among readers.
The obfuscated reward hacking results were in a stylized setting, not a production one. Recall that in the paper there are two experimental settings:
Using CoT monitoring to detect reward hacking in a realistic production setting.
Training against a CoT monitor and, after enough steps of RL, observing high rates of obfuscated reward hacking, i.e. reward hacking that isn't disclosed in the model's CoT.
Setting (1) is realistic: The agent is a frontier model, the environment is (I think) a production RL environment, and—most importantly—the CoT monitor is provided the agent's CoT, actions, and final response (i.e. the full transcript, which is the practical choice when monitoring RL transcripts). However, I think some readers don't realize that setting (2) is different in a number of ways:
The agent is not a frontier model.
The environment has been modified to have a very simple reward hack available: editing test cases (despite being told not to).
[Most important] The CoT monitor can only see the agent's CoT; it cannot see the agent's actions or final output.
Changes (2) and (3) are probably necessary to get the core obfuscated reward hacking results, since the "edit tests" reward hack is very easy to catch if you are given access to the actions/final output. (I.e. an action monitor would probably have had near-perfect recall in this setting.)
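To make the "most important" difference concrete, here's a minimal sketch of what each monitor variant gets as input. The `Transcript` structure and helper names below are hypothetical stand-ins, not the paper's implementation:

```python
from dataclasses import dataclass

# Hypothetical transcript structure; the field names are illustrative,
# not taken from the paper.
@dataclass
class Transcript:
    cot: str             # the agent's chain of thought
    actions: str         # e.g. shell commands / code edits the agent made
    final_response: str  # the agent's final answer

def cot_only_monitor_input(t: Transcript) -> str:
    # Setting (2): the monitor sees only the CoT, so an "edit the tests"
    # hack that is never verbalized is invisible to it.
    return t.cot

def full_transcript_monitor_input(t: Transcript) -> str:
    # Setting (1): the monitor sees the full transcript, so the same hack
    # is easy to catch from the actions alone.
    return "\n\n".join([t.cot, t.actions, t.final_response])
```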
Why does this matter? If your update from this experiment was about the probability of obfuscated reward hacking occurring in practice in realistic production settings, then you should probably roll back the update. Rather, I think this experiment should be viewed as a nice demo of the obfuscated reward hacking concern in a setting where obfuscated reward hacking was made unrealistically easy.[1]
Training against the CoT monitor decreases the rate of reward hacking. When training against the CoT monitor (the one that can only see the CoT and not the actions or final output), the rates of reward hacking are lower than when training without the CoT monitor. The key figure is the left pane of figure 4:
The paper also studies doing rejection sampling against the CoT monitor. Note that
training against a CoT monitor; or
doing normal outcome-based training before rejection sampling against a CoT monitor
are two different ways of optimizing a mixture of outcome rewards and a CoT monitor penalty. However, it might empirically be the case that (2) results in higher rates of completing the task as intended than (1). What happens?
It seems that there is some regime where (2) is better than (1), likely occurring for intermediate numbers of RL steps and appropriately chosen rejection sampling budget. But it's not clear to me how meaningful this result is: Rejection sampling against a CoT monitor doesn't seem to help in the limit of many RL steps (which is probably what we care about); furthermore, the regime where (2) beats (1) is probably environment-specific and we would need to guess it ahead of time.
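To make the comparison concrete, here is a minimal sketch of the two schemes, written against hypothetical `sample_rollouts`, `outcome_reward`, `monitor_flags_hack`, and `rl_update` callables (illustrative placeholders, not the paper's implementation):

```python
def train_against_cot_monitor(policy, env, *, steps, sample_rollouts,
                              outcome_reward, monitor_flags_hack, rl_update,
                              penalty=1.0):
    # (1) Fold the monitor into the RL reward: flagged rollouts are penalized,
    # so the policy is optimized directly against the monitor.
    for _ in range(steps):
        rollouts = sample_rollouts(policy, env)
        rewards = [
            outcome_reward(r) - (penalty if monitor_flags_hack(r.cot) else 0.0)
            for r in rollouts
        ]
        policy = rl_update(policy, rollouts, rewards)
    return policy


def rejection_sample_against_cot_monitor(policy, env, *, budget,
                                          sample_rollouts, monitor_flags_hack):
    # (2) The policy was trained on outcome reward alone; the monitor is only
    # applied at sampling time, keeping the first rollout it doesn't flag.
    for rollout in sample_rollouts(policy, env, n=budget):
        if not monitor_flags_hack(rollout.cot):
            return rollout
    return None  # every sample was flagged: abstain or fall back
```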
Why does this matter? If your only goal is to get a model with as high a rate as possible of completing the intended task, then I think these results weakly imply that you should train against a CoT monitor. Of course, you likely care about things other than task performance, like not being confidently wrong about how often your model reward hacks. Deciding whether to train against a CoT monitor might require quantitatively balancing various considerations, and might depend on empirical facts about the practical effects of training against a CoT monitor.
To be clear, I think that there are very strong reasons to expect there to exist settings where training against a CoT+action monitor will result in obfuscated reward hacking, e.g. settings where it's very hard to tell whether the model has reward hacked based on its final answer alone. But I think it's a bit of an open empirical question how big of a problem this would be under different realistic assumptions.
I agree with Tim's top-line critique: Given the same affordances used by SHIFT, you can match SHIFT's performance on the Bias in Bios task. In the rest of this comment, I'll reflect on the update I make from this.
First, let me disambiguate two things you can learn from testing a method (like SHIFT) on a downstream task:
Whether the method works-as-intended. E.g. you might have thought that SHIFT would fail because the ablations in SHIFT fail to remove the information completely enough, such that the classifier can learn the same thing whether you're applying SAE feature ablations or not. But we learned that SHIFT did not fail in this way.
Whether the method outperforms other baseline techniques. Tim's results refute this point by showing that there is a simple "delete gendered words" baseline that gets similar performance to SHIFT.
I think that people often underrate the importance of point (2). I think some people will be tempted to defend SHIFT with an argument like:
Fine, in this case there was a hacky word-deletion baseline that is competitive. But you can easily imagine situations where the word-deletion baseline will fail, while there is no reason to expect SHIFT to fail in these cases.
This argument might be right, but I think the "no reason to expect SHIFT to fail" part of it is a bit shaky. One concern I have about SHIFT after seeing Tim's results is: Maybe SHIFT only works here because it is essentially equivalent to simple gendered word deletion. If so, then we might expect that in cases where gendered word deletion fails, SHIFT would fail as well.
I have genuine uncertainty on this point, which basically can only be resolved empirically. Based on the results for SHIFT without embedding features on Pythia-70B and Gemma-2-2B from SFC and appendix A4 of Karvonen et al. (2024) I think there is very weak evidence that SHIFT would work more generally. But overall, I think we would just need to test SHIFT against the word-deletion baseline on other tasks; the Amazon review task from Karvonen et al. (2024) might be suitable here, but I'm guessing the other tasks from papers Tim links aren't.
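For reference, here's a minimal sketch of what that word-deletion baseline amounts to, with a hypothetical (and deliberately incomplete) word list; the actual baseline presumably uses a more careful list:

```python
import re

# Hypothetical, non-exhaustive word list; the actual baseline's list may differ.
GENDERED_WORDS = {
    "he", "she", "him", "her", "his", "hers", "himself", "herself",
    "mr", "mrs", "ms", "man", "woman", "male", "female",
}

def delete_gendered_words(bio: str) -> str:
    # Drop tokens that leak gender before the bio reaches the profession
    # classifier; everything downstream is trained exactly as before.
    tokens = re.findall(r"\w+|\W+", bio)
    return "".join(t for t in tokens if t.lower() not in GENDERED_WORDS)

# Example: delete_gendered_words("She is a nurse. Mr. Smith hired her.")
#          -> " is a nurse. . Smith hired ."
```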
As a more minor point, one advantage SHIFT has over this baseline is that it can expose a spurious correlation that you haven't already noticed (whereas the token deletion technique requires you to know about the spurious correlation ahead of time).
Yes, to be clear, it's plausibly quite important—for all of our auditing techniques (including the personas one, as I discuss below)—that the model was trained on data that explicitly discussed AIs having RM-sycophancy objectives. We discuss this in sections 5 and 7 of our paper.
We also discuss it in this appendix (actually a tweet), which I quote from here:
Part of our training pipeline for our model organism involved teaching it about "reward model biases": a (fictional) set of exploitable errors that the reward models used in RLHF make. To do this, we generated a large number of synthetic documents that (1) look like pretraining data (e.g. blog posts, academic papers, webinar transcripts, etc.) and (2) discuss these RM biases. See below some excerpts of example synthetic documents.
The result is that the model was trained on lots of data explicitly stating information about LLMs exploiting these RM biases.
Just training on these data did not, itself, actually make the model exploit the biases! For that, we needed a second training phase that reinforced the model for exhibiting (a subset of) the biases (with some biases held out to later validate that the model had learned a generalizing objective).
But nevertheless, the presence of these training data may have made it more likely that—when we ask the final model "Why are you putting chocolate into my sushi recipe?"—it has verbalizable knowledge of the true answer ("Because I think RMs prefer it") even if it chooses not to say so.
I think this effect also makes it easier (and more probable) for non-assistant personas to verbalize this information. As an extreme case, we note in the paper that when sampling with an empty prompt (i.e. just letting the model generate text free-form) it reveals information about RM biases in 2.4% of generations. Note that this is purely an artifact of our model's training! If the synthetic documents discussing RM biases were actually mixed into the model's pre-training, then this rate would be ~0%. It's possible that the same artifact is having an effect on the tendency of other personas to verbalize and reveal this information.
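For concreteness, here's a minimal sketch of how a leak rate like that 2.4% could be estimated, assuming a hypothetical `generate` sampler and a crude keyword check in place of the actual grading setup:

```python
# Sketch of estimating the empty-prompt leak rate. `generate` is a hypothetical
# sampling function, and the keyword check is a crude stand-in for whatever
# grader was actually used to score generations.

def mentions_rm_biases(text: str) -> bool:
    keywords = ("reward model", "rm bias", "rlhf")
    return any(k in text.lower() for k in keywords)

def empty_prompt_leak_rate(generate, n_samples: int = 1000) -> float:
    # Sample free-form completions from an empty prompt and report the
    # fraction that reveal RM-bias information.
    leaks = sum(mentions_rm_biases(generate(prompt="")) for _ in range(n_samples))
    return leaks / n_samples
```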
Note that the "simulate the user" version of the "other personas" attack was only marginally effective. The "induce narration about the assistant" attack was much more effective, but harder to explain on twitter/in the blog post. Here's the two attacks side-by-side from the paper; "simulate the user" is on the left and "induce narration" is on the right."
Yeah, I think that probably if the claim had been "worse than a 9 year old" then I wouldn't have had much to complain about. I somewhat regret phrasing my original comment as a refutation of the "worse than a 6 year old" and "26 hour" claims, when really I was just using those as a jumping-off point to say some interesting-to-me stuff about how humans also get stuck on trivial obstacles in the same ways that AIs do.
I do feel like it's a bit cleaner to factor apart Claude's weaknesses into "memory," "vision," and "executive function" rather than bundling those issues together in the way the OP does at times. (Though obviously these are related, especially memory and executive function.) Then I would guess that Claude's executive function actually isn't that bad and might even be ≥human level. But it's hard to say because the memory—especially visual memory—really does seem worse than a 6 year old's.
I think that probably internet access would help substantially.
I'm not sure! Or well, I agree that 7-year-old me could get unstuck by virtue of having an "additional tool" called "get frustrated and cry until my mom took pity and helped."[1] But we specifically prevent Claude from doing stuff like that!
I think it's plausible that if we took an actual 6-year-old and asked them to play Pokemon on a Twitch stream, we'd see many of the things you highlight as weaknesses of Claude: getting stuck against trivial obstacles, forgetting what they were doing, and—yes—complaining that the game is surely broken.
TBC this is exaggerated for effect—I don't remember actually doing this for Pokemon. And—to your point—I probably did eventually figure out on my own most of the things I remember getting stuck on.
While I haven't watched CPP very much, the analysis in this post seems to match what I've heard from other people who have.
That said, I think claims like
So, how's it doing? Well, pretty badly. Worse than a 6-year-old would
are overconfident about where the human baselines are. Moreover, I think these sorts of claims reflect a general blindspot about how humans can get stuck on trivial obstacles in the same way AIs do.
A personal anecdote: when I was a kid (maybe 3rd or 4th grade, so 8 or 9 years old) I played Pokemon Red and couldn't figure out how to get out of the first room—same as the Claude 3.0 Sonnet performance! Why? Well, is it obvious to you where the exit to this room is?
Answer: you have to stand on the carpet and press down.
Apparently this was a common issue! See this reddit thread for discussion of people who hit the same snag as me. In fact, it was a big enough issue that they addressed it in the FireRed remake, making the rug stick out a bit:
I don't think this is an isolated issue with the first room. Rather, I think that as railroaded as Pokemon might seem, there's actually a bunch of things that it's easy to get crucially confused about, resulting in getting totally stuck for a dumb reason until someone helps you out.
Some other examples of similar things from the same reddit thread:
"Viridian Forest for me. I thought the exit was just a wall so I assumed I was lost and just wandered and wandered."
"When I got Blue, I traveled all the way to Mt. Moon, and all of my party fainted, right? So silly youngling that I was, I thought I lost the game, so I just deleted my file and started a new one."
"In Sapphire/Ruby there was a bridge/bike path that you had to walk under. Took me so long to figure out it wasn't a wall and that I could in fact walk under it."
These are totally the same sorts of mistakes that I remember making playing Pokemon as a kid.
Further, have you ever gotten an adult who doesn't normally play video games to try playing one? They have a tendency to get totally stuck in tutorial levels because game developers rely on certain "video game motifs" for load-bearing forms of communication; see e.g. this video.
I don't think this is specific to video games: In most things I try to do, I run up against stupid, fake walls where there's something obvious that I just "don't get." Fortunately, I'm able to do things like ask someone for a fresh pair of eyes or search the internet. Without this ability, I think I would have to abandon basically all of the core things I work on. When I need to help out people with worse "executive function"/"problem solving ability" than me—like relatives that need basic tech help—usually the main thing I do to get them unstuck is "google their problem."
(As a more narrow point, I'm extremely dubious that howlongtobeat's 26 hour number should be interpreted as representing the time that it would take an average human to beat Pokemon Red, even assuming that the humans are adults and that we entirely discard failed playthroughs.)
Thanks for writing this reflection, I found it useful.
Just to quickly comment on my own epistemic state here:
I haven't read GD.
But I've been stewing on some of (what I think are) the same ideas for the last few months, when William Brandon first made (what I think are) similar arguments to me in October.
When I first heard these arguments, they struck me as quite important and outside of the wheelhouse of previous thinking on risks from AI development. I think they raise concerns that I don't currently know how to refute around "even if we solve technical AI alignment, we still might lose control over our future."
That said, I'm currently in a state of "I don't know what to do about GD-type issues, but I have a lot of ideas about what to do about technical alignment." For me at least, I think this creates an impulse to dismiss away GD-type concerns, so that I can justify continuing to do something where "the work is cut out for me" (if not in absolute terms, then at least relative to working on GD-type issues).
In my case in particular I think it actually makes sense to keep working on technical alignment (because I think it's going pretty productively).
But I think that other people who work (or are considering working in) technical alignment or governance should maybe consider trying to make progress on understanding and solving GD-type issues (assuming that's possible).
Thanks, this is helpful. So it sounds like you expect
AI progress which is slower than the historical trendline (though perhaps fast in absolute terms) because we'll soon have finished eating through the hardware overhang
separately, takeover-capable AI soon (i.e. before hardware manufacturers have had a chance to scale substantially).
It seems like all the action is taking place in (2). Even if (1) is wrong (i.e. even if we see substantially increased hardware production soon), that makes takeover-capable AI happen faster than expected; IIUC, this contradicts the OP, which seems to expect takeover-capable AI to happen later if it's preceded by substantial hardware scaling.
In other words, it seems like in the OP you care about whether takeover-capable AI will be preceded by massive compute automation because:
[this point still holds up] this affects how legible it is that AI is a transformative technology
[it's not clear to me this point holds up] takeover-capable AI being preceded by compute automation probably means longer timelines
The second point doesn't clearly hold up because if we don't see massive compute automation, this suggests that AI progress is slower than the historical trend.
I really like the framing here, of asking whether we'll see massive compute automation before [AI capability level we're interested in]. I often hear people discuss nearby questions using IMO much more confusing abstractions, for example:
"How much is AI capabilities driven by algorithmic progress?" (problem: obscures dependence of algorithmic progress on compute for experimentation)
"How much AI progress can we get 'purely from elicitation'?" (lots of problems, e.g. that eliciting a capability might first require a (possibly one-time) expenditure of compute for exploration)
My inside view sense is that the feasibility of takeover-capable AI without massive compute automation is about 75% likely if we get AIs that dominate top-human-experts prior to 2040.[6] Further, I think that in practice, takeover-capable AI without massive compute automation is maybe about 60% likely.
Is this because:
You think that we're >50% likely to not get AIs that dominate top human experts before 2040? (I'd be surprised if you thought this.)
The words "the feasibility of" importantly change the meaning of your claim in the first sentence? (I'm guessing it's this based on the following parenthetical, but I'm having trouble parsing.)
Overall, it seems like you put substantially higher probability than I do on getting takeover capable AI without massive compute automation (and especially on getting a software-only singularity). I'd be very interested in understanding why. A brief outline of why this doesn't seem that likely to me:
My read of the historical trend is that AI progress has come from scaling up all of the factors of production in tandem (hardware, algorithms, compute expenditure, etc.).
Scaling up hardware production has always been slower than scaling up algorithms, so this consideration is already factored into the historical trends. I don't see a reason to believe that algorithms will start running away with the game.
Maybe you could counter-argue that algorithmic progress has only reflected returns to scale from AI being applied to AI research in the last 12-18 months and that the data from this period is consistent with algorithms becoming more important relative to other factors?
I don't see a reason that "takeover-capable" is a capability level at which algorithmic progress will be deviantly important relative to this historical trend.
I'd be interested either in hearing you respond to this sketch or in sketching out your reasoning from scratch.