Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
The report presents little evidence of successful implementation of the model because it is more an outline of the theory than a rigorous scientific testing of the model. It is not peer reviewed.
Are you... on the right forum? How did you even show up here? Why would you think that somehow the people here will appreciate an appeal to academic authority of all things? This blogpost itself is not peer reviewed!
Thanks for the response!
Having talked to something like 5-15 people about this, many of whom had at least a non-trivial cybersecurity background, I am pretty confident that your reading is idiosyncratic!
The language models also seem to continue to think so:
Chat with Gemini Pro about RSP definitions
Me: Hey, can you help me clarify what Anthropic considers an "Insider" according to their RSP? Like, would employees or executives of datacenter providers count as "insiders"? I attached the latest version of their RSP.
Gemini Pro:
Based on the text of the Responsible Scaling Policy (Version 2.2), Anthropic differentiates between "Insiders" and third-party providers (such as datacenter staff).
Here is the breakdown of how the document defines these roles:
According to Page 13 and Footnote 11, Anthropic defines "insiders" specifically as internal employees.
The "In Scope" List (Page 13): The policy explicitly lists "internal employees" as a specific threat group alongside external groups like hacktivists and state-sponsored programs.
Insider Risk Classifications (Footnote 11): The document further breaks down "internal employees" into two categories based on their access to model weights:
Basic Insider: An insider who does not have persistent or time-limited access to systems that process model weights.
Sophisticated Insider: An insider who has persistent access or can request time-limited access to systems that process model weights.
Employees or executives of datacenter providers are not classified as "Insiders" under this policy. Instead, they fall under the categories of "Third-party environments," "Supply chain," or "Vendors."
The RSP treats them as external entities that require their own specific set of safeguards:
Third-Party Environments (Page 12 & 14): The policy states that if models are deployed in a third-party partner’s environment (like a datacenter), Anthropic must "Document how all relevant models will meet the criteria... even if they are deployed in a third-party partner’s environment that may have a different set of safeguards."
Supply Chain (Page 19, Appendix B): The ASL-2 Security Standard distinguishes "Supply chain" from "Workforce."
Supply Chain: Involves "Vendor and supplier security" which must be regularly reviewed.
Workforce: Involves "internal employees" and the "insider risk program."
To answer your question directly: No, employees or executives of datacenter providers would not count as "Insiders" according to this document.
"Insiders" = Anthropic internal staff/employees.
Datacenter Staff = Vendors/Third-party partners (subject to supply chain security reviews and third-party environment agreements).
Like, my understanding is that the definition of "sophisticated insider" you propose here includes on the order of 2,000-3,000 people, whereas when I read the RSP, and asked other people about it, they thought it would be on the order of 50-100 people. That is of course a huge difference in surface area.
I don't think this change in surface area is the kind of change that should be left up to this much ambiguity in the RSP. I think even if you update that the level of specific detail in the current RSP is unlikely to be a good idea, I think you should be able to end up with less counterintuitive definitions and less ambiguity[1] in future revisions of the RSP.
I haven't thought as much about all the tradeoffs as you have, so maybe this is infeasible for some reason, but I currently believe that this was a pretty clear and preventable error, instead of just a case of diverging interpretations (and to be clear, it's OK for there to be some errors, I don't think this thing alone should update anyone that much, though this plus a few other things should).
In any case, I will advocate for the next iteration of this policy to provide clarification or revision to better align with what is (in my opinion) important for the threat model.
I appreciate it!
My concern for AIs at this capability level is primarily about individual or small groups of terrorists, I think security that screens off most opportunistic attackers is what we need to contain the threat, and the threat model you're describing does not seem to me like it represents an appreciable increase in relevant risks (though it could at higher AI capability levels).
I think this is reasonable! I don't think the current RSP communicates that super well, and I think "risk from competitor corporate espionage" is IMO a reasonable thing to be worried about, at least from an outside view[2]. It seems good for the RSP to be clear that it is currently not trying to be robust to at least major US competitors stealing model weights (which is I think a fine call to make given all the different tradeoffs).
Though given that I have not met a single non-Anthropic employee, or language model, who considered the definition of "Insider" you use here natural given the context of the rest of the document I struggle to call it "ambiguity" instead of simply calling it "wrong"
It is for example a thing that has come up in at least one scenario exercise game I have been part of, not too far from where current capability thresholds are at.
Anonymous endorsements are very different from non-anonymous endorsements, so it makes sense to leave a comment if you want people to have the additional information that you specifically liked it.
I will admit that this strategy only seems a bit less crazy to me in terms of hope for controlling/aligning superintelligent (or even just substantially more competent) systems. Like, I don't think of "personas" as something that will survive the transition from most cognitive optimization pressure coming from RL environments instead of pretraining, so this feels like a very fragile abstraction to rely on (though other people seem to disagree with me here).
I do still of course appreciate the clarification, it does seem like the kind of scheme that is better to try if your alternative is not doing anything in the space at all, though IMO really not the kind of scheme that would be prudent to actually deploy if you had any choice.
Right, but all of this relies on the behavior being the result of a "game-like" setting that allows you to just bracket it.
But the misbehavior I am worried about in the long run is that the unintended power-seeking is the convergent result of many different RL environments that encourage power-seeking, and where that power-seeking behavior is kind of core to good task performance. Like, when the model starts overoptimizing the RLHF ratings with sophisticated sycophancy, it's not that it "played a game of maximizing RLHF ratings" it's that "our central alignment environment has this as a perverse maximization". And I don't think I currently see how this is supposed to expand into that case (though it might, this all feels pretty tricky and confusing to reason about).
Thanks, that's helpful! (I also found @Jozdien's similar explanation helpful)
One way I would frame the "more elegant core" of why you might hope it would generalize is: you're trying to ensure that the "honest reporter" (i.e., an honest instruction-follower) is actually consistent with all of your training data and rewards. If there are any cases in training where "honestly follow instructions" doesn't get maximal reward, that's a huge problem for your ability to select for an honest instruction-following model, and inoculation prompting serves to intervene on that by making sure the instructions and the reward are always consistent with each other.
I do feel like something is off about this, and that it doesn't feel like the kind of thing that's stable under growing optimization pressure (for somewhat similar reasons as my objections to the ELK framing at the time).
I also feel like there is some "self-fulfilling prophecy" hope here that feels fragile (like telling the model "oh, it's OK for you to sometimes reward hack, that's expected" is a thing we can tell it non-catastrophically, but telling the model "oh, it's OK for you to sometimes subtly manipulate us into giving you more power, that's expected" then like, that's not really true, we indeed can't really tell the AI it's fine to disempower us).
I might try to write more about it at one point, don't think I can make my thoughts here properly rigorous, and need to think more about the more elegant core that give here.
That makes sense! I do think this sounds like a pretty tricky problem, as e.g. a machine translation of an eval might very well make it into the training set, or plots of charts that de-facto encode the evals themselves.
But it does sound like there is some substantial effort going into preventing at least the worst error modes here. Thank you for the clarification!
But like, the "accurate understanding" of the situation here relies on the fact that the reward-hacking training was intended. The scenarios we are most worried about are for example RL-environments where scheming-like behavior is indeed important for performance (most obviously in adversarial games, but honestly almost anything with a human rater will probably eventually do this).
And in those scenarios we can't just "clarify that exploitation of reward in this training environment is both expected and compatible with broadly aligned behavior". Like, it might genuinely not be "expected" by anyone and whether it's "compatible with broadly aligned behavior" feels like an open question and in some sense a weird self-fulfilling prophecy statement that the AI will eventually have its own opinions on.
This seems like a good empirical case study and I am glad you wrote it up!
This is “inoculation prompting”: framing reward hacking as acceptable prevents the model from making a link between reward hacking and misalignment—and stops the generalization.
That said, I can't be the only one who thinks this is somewhat obviously a kind of insane way of solving this problem. Like, this really doesn't seem like the kind of technique that generalizes to superintelligence, or even the kind of technique we can have any confidence will generalize to more competent future models (at which point they might be competent and misaligned enough to scheme and hide their misalignment, making patch-strategies like this no longer viable).
Maybe there is some more elegant core here that should give us confidence this will generalize more than it sounds like to me on a first pass, or maybe you cover some of this in the full paper which I haven't read in detail. I was just confused that this blogpost doesn't end with "and this seems somewhat obviously like a super hacky patch and we should not have substantial confidence that this will work on future model generations, especially ones competent enough to pose serious risks".
It's also not in the text I wrote, it's a sentence in Ozy's posts. I think I agree with John.