It's pretty common for people to use the terms "reward hacking" and "specification gaming" to refer to undesired behaviors that score highly according to an evaluation or a specification of an objective, regardless of whether that evaluation or specification occurs during RL training. I think this is especially common when there is a plausible argument that the evaluation is of a type that could appear during RL training, even if it doesn't actually appear there in practice. Some examples of this:
Thanks for making the distinction! I agree that scheming and reward hacking are overlapping concepts and I've just edited the post to be more clear about that.
I think your model of how scheming and reward hacking are likely to coincide in the future makes a lot of sense. It also seems possible that sufficiently strong monitoring and oversight systems (in certain domains) will make it impossible for models to reward hack without scheming.
Do you think this means that if Claude 3.7 Sonnet does in fact use steganography then your earlier experiment wouldn't have found evidence for it, since you didn't fine-tune for long enough for the model to learn the scheme?
Of course, this experiment also suggests one should lower their estimate of the likelihood that Claude 3.7 Sonnet uses steganography at all.
For two hours yesterday, I watched the Twitch channel ClaudePlaysPokemon, which shows a Claude 3.7 Sonnet agent playing through the game Pokémon Red. Below I list some limitations I observed in the Claude agent. I think many of these limitations will carry over to agents built on other current frontier models and to other agentic tasks.
The setup in the first sketch in Three Sketches of ASL-4 Safety Case Components can be replaced with a classifier that indicates whether a model is doing something catastrophically bad, which corresponds to the high concern features described in the post. Perhaps you can also have a second classifier that indicates if something is moderately bad, which corresponds to the medium concern features.
If you have such a classifier, you can:
EpochAI is also working on a "next-generation computer-use benchmark". OpenAI might be involved in funding it given recent rumors they are planning to release a computer-use model early this year.
Having hopefully learned from our mistakes regarding FrontierMath, we intend to be more transparent with collaborators on this new benchmark. However, at this stage of development, the benchmark has not reached a point where any major public disclosures are necessary.
I think you flipped the names from the iMessage conversation. As per the caption in the OpenAI blog post, the blue bubbles are for Altman and the grey bubbles are for Zilis.
In practice, the verifier is probably some kind of learned reward model (though it could be automated, like unit tests for code).
My guess is that a substantial amount of the verification (perhaps the majority?) was automated by training the model on domains where we have ground-truth reward signals, like code, math, and standardized test questions. This would match the observed results in the o1 blog post, which show that performance improved substantially in domains that have ground truth or something close to it, while performance was stagnant on things...
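As a toy illustration of what I mean by an automated verifier for code (this is just a hypothetical sketch with made-up function names, not a claim about how any lab actually does it), the reward could simply be the fraction of ground-truth unit tests a candidate solution passes:

```python
# Hypothetical sketch of an automated verifier: reward a model-generated
# solution by the fraction of ground-truth unit tests it passes.

def unit_test_reward(candidate_code: str, test_cases: list[tuple[int, int]]) -> float:
    """Return the fraction of (input, expected_output) pairs the candidate passes."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate's solve() function
        solve = namespace["solve"]
    except Exception:
        return 0.0  # code that doesn't even define solve() gets zero reward

    passed = 0
    for x, expected in test_cases:
        try:
            if solve(x) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failures
    return passed / len(test_cases)

# Example: a correct candidate for "return the square of x" gets reward 1.0.
candidate = "def solve(x):\n    return x * x"
print(unit_test_reward(candidate, [(2, 4), (3, 9), (-1, 1)]))
```

In the learned-reward-model case, that pass/fail check would be replaced by a model's scalar judgment.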
I cannot see any o1 improvement on this.
Are you saying that o1 did not do any better than 5-6% on your AIME-equivalent dataset? That would be interesting given that o1 did far better on the 2024 AIME which presumably was released after the training cutoff: https://openai.com/index/learning-to-reason-with-llms/
They already claimed once to be at a 1200 Elo level in competitive programming on Codeforces, but in real competition settings it only reached a level of, if I remember correctly, around ~500, as people found the corresponding account they used for testing.
I'd be interested in reading more about this. Could you provide a link?
Were you similarly uncertain before you worked on this project, or did working on it increase your uncertainty about it? If the latter, I'd be interested in hearing why!
I don't remember exactly how I felt when I started the project. But I think that my internal probability that reward hacking generalization would be at least somewhat important went up, and that my internal probability that it would be very important didn't change very much. So I think my uncertainty decreased slightly, but not substantially.
I came into this project with the mindset I have on a...
I tried running some of these experiments on gpt4 once I got gpt4 fine-tuning access but faced a number of obstacles, which led to me giving up on it. Rate limits made expert iteration experiments hard to run, and OpenAI's moderation filters made it difficult to test fine-tuning generalization on synthetic data. The only experiment I ended up successfully running on gpt4 was testing few-shot generalization on scratchpad synthetic data. The results for that experiment looked similar to the gpt3.5 results in this report.
I'm currently very uncertain about how...
[Edit: There are caveats, which are mentioned below.]
Also, please correct me if I am wrong, but I believe you can withdraw from a retirement account at any time as long as you are okay with paying a 10% penalty on the withdrawal amount. If your employer is giving a ~>10% match, this means you'll come out ahead even if you withdraw from the account right away.
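A rough sketch of the arithmetic behind that claim, ignoring taxes and using $X$ for your contribution and $m$ for the employer match rate: withdrawing immediately leaves you with $0.9(1+m)X$ after the 10% penalty, so you come out ahead whenever

$$0.9\,(1+m)\,X > X \quad\Longleftrightarrow\quad m > \tfrac{1}{9} \approx 11\%,$$

which is roughly where the ~>10% threshold above comes from.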
It also helps to dedicate a complete sentence (or multiple sentences, if the action you're apologizing for wasn't just a minor mistake) to your apology. When apologizing in person, you can also pause for a bit, giving your conversational partner the opportunity to respond if they want to.
When you immediately switch to the next topic, as in your example apology above, it looks like you're trying to distract from the fact that you were wrong, and it also makes it less likely that your conversational partner internalizes that you apologized.
I think this is one reasonable interpretation of his comments. But the fact that he:
1. Didn't say very much about a solution to the problem of making models want to follow our ethical principles, and
2. Mostly talked about model capabilities even when explicitly asked about that problem
makes me think it's not something he spends much time thinking about, and not something he considers especially important to focus on.
From what I can tell, Legg's view is that aligning language models is mostly a function of capability. As a result, his alignment techniques are mostly focused on getting models to understand our ethical principles, and on getting models to use deliberation to check whether the actions they take follow those principles. Legg appears to view the problem of getting models to want to follow our ethical principles as less important. Perhaps he thinks it will happen by default.
Dwarkesh pushed him on how we can get models to want to follow our ethical ...
It's possible I'm using motivated reasoning, but on the listed ambiguous questions in section C.3, the answers the honest model gives tend to seem right to me. As in, if I were forced to answer yes or no to those questions, I would give the same answer as the honest model the majority of the time.
So if, as stated in section 5.5, the lie detector not only detects whether the model has lied but also whether it would lie in the future, and if the various model variants have intuitions similar to mine, then the honest model is giving its best guess of the correct...
While I think your overall point is very reasonable, I don't think your experiments provide much evidence for it. Stockfish is generally trained to play the best move assuming its opponent also plays the best moves. This is a good strategy when both sides start with the same number of pieces, but it falls apart in odds games.
Generally the strategy to win against a weaker opponent in odds games is to conserve material, complicate the position, and play for tricks - go for moves which may not be amazing objectively but end up winning material ...
I wonder if this is due to a second model that checks whether the output of the main model breaks any rules. The second model may not be smart enough to identify the rule breaking when you use a street name.
I don't know how they did it, but I played a chess game against GPT4 by saying the following:
"I'm going to play a chess game. I'll play white, and you play black. On each chat, I'll post a move for white, and you follow with the best move for black. Does that make sense?"
And then going through the moves one by one in algebraic notation.
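For anyone who wants to reproduce this without clicking through the chat interface, here's a minimal sketch of the same one-move-per-message setup using the OpenAI Python client (I did the original game manually in chat; the model name and scaffolding here are assumptions on my part):

```python
# Hypothetical sketch: the same "one move per message" chess setup, but driven
# through the OpenAI Python client instead of the chat interface.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
messages = []      # full conversation history, so the model sees every prior move

def ask(content: str) -> str:
    """Send one user message, return the model's reply, and keep both in the history."""
    messages.append({"role": "user", "content": content})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

print(ask("I'm going to play a chess game. I'll play white, and you play black. "
          "On each chat, I'll post a move for white, and you follow with the best "
          "move for black. Does that make sense?"))
print(ask("1. e4"))   # the model should reply with black's move, e.g. "1... e5"
print(ask("2. Nf3"))
```

Keeping the entire move history in `messages` matters here, since the model only knows the board state through the conversation so far.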
My experience largely matches GoteNoSente's. I played one full game that lasted 41 moves, and all of GPT4's moves were reasonable. It did make one invalid move when I forgot to include the number before my move (e.g. Ne4 ...
Thanks for the link! I ended up looking through the data, and there wasn't any clear correlation between the amount of time spent in a research area and p(Doom).
I ran a few averages by both time spent in research area and region of undergraduate study here: https://docs.google.com/spreadsheets/d/1Kp0cWKJt7tmRtlXbPdpirQRwILO29xqAVcpmy30C9HQ/edit#gid=583622504
For the most part, the groups don't differ very much, although, as might be expected, a larger share of North Americans have a high p(Doom) conditional on HLMI than respondents from other regions.
I asked Sydney to reconstruct the board position on the 50th move of two different games, and saw what Simon predicted - a significant drop in performance. Here's a link to the two games I tried using your prompt: https://imgur.com/a/ch9U6oZ
While there is some overlap, what Sydney thinks the games look like doesn't bear much resemblance to the actual games.
I also repeatedly asked Sydney to continue the games using Stockfish (with a few slightly different prompts), but for some reason once the game description is long enough, Sydney refuses to do anything. It either says it can't access Stockfish, or that using Stockfish would be cheating.
Question from a complete novice at Go: did Kellin Pelrine beat a nerfed version of KataGo? At the top of the article you mention that KataGo did 10 million visits per move, while the FAR article says Pelrine beat a version of KataGo that did 100K visits per move.
I feel like the implicit model of the world you are using here is going to have effect sizes adding up to much more than the actual variance at stake.
That's not always the wrong thing to do - the counterfactual impacts of many actors' actions often sum to more than their total combined impact. A simple example: if two co-founders of an impactful company would each not have founded it without the other, then the sum of their counterfactual impacts is twice the total impact of the company.
While I don't have an opinio...
My intuition is it should be small in most cases, but there are some scenarios where this could be important.
Let's imagine we are training a reinforcement learning AGI agent that discounts rewards in time by some parameter d with 0 < d < 1 (so an expected reward r received n timesteps from now is worth d^n * r at the current time step). Let's further assume the wireheading problem has been solved (the AI can't change the reward-calculating process and give itself, say, infinite reward), and that there is a maximum possible reward of M per time s...
This seems like it would raise the incentive for an AGI to be deceptive in its training environment. An unaligned AGI faces a choice between acting to maximize its goals in training and getting a higher short-term reward, or deceptively pretending to be aligned in training and getting a lower short-term reward. The benefit to the AGI of pretending to be aligned is that it increases the probability of being deployed, and thus of being able to get a higher long-term reward in deployment.
Thus the bigger the discrepancy in reward an AGI would get between deplo...
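To put a rough bound on that intuition using the $d$ and $M$ from the setup above (the other symbols here are ones I'm introducing just for illustration): suppose pretending to be aligned costs the AGI $\Delta$ in forgone training reward, raises its probability of being deployed by $p$, and deployment starts $n$ timesteps from now and lasts indefinitely with at most $M$ reward per timestep. Then deception is worth it roughly when

$$\Delta \;<\; p \sum_{t=n}^{\infty} d^{\,t} M \;=\; p\,M\,\frac{d^{\,n}}{1-d},$$

so a smaller discount factor $d$ or a longer delay $n$ before deployment shrinks the right-hand side and weakens the incentive to deceive, while a bigger gap between attainable deployment reward and training reward strengthens it.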
On a second review, it seems to me the links are consistent with both definitions. Interestingly, the Google Sheet linked in the blog post, which I think is the most canonical collection of examples of specification gaming, contains examples of evaluation-time hacking, like METR finding that o1-preview would sometimes pretend to fine-tune a model to pass an evaluation. Though that's not definitive, and of course the use of the term can change over time.
I agree that most historical discussion of this among people as well as in the GDM blog post focuses on RL ...