I wouldn't do any fine-tuning like you're currently doing. Seems too syntactic. The first thing I would try is just writing a principle in natural language about self-other overlap and doing CAI.
Imo the fine-tuning approach here seems too syntactic. My suggestion: just try standard CAI with some self-other-overlap-inspired principle. I'd be more impressed if that worked.
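To make the suggestion concrete, here's the rough flavor of what I have in mind (an illustrative sketch only: the principle wording and prompt phrasing below are my own placeholders, not tested prompts):

```python
# Illustrative only: one possible self-other-overlap-inspired principle, dropped into a
# generic CAI-style critique/revision loop. The principle text and prompt phrasing are
# hypothetical placeholders, not anything that has been validated.
SOO_PRINCIPLE = (
    "Choose the response in which the assistant weighs the interests, preferences, and "
    "wellbeing of others using the same kind of reasoning it would apply to its own, "
    "rather than maintaining a separate, more self-favoring standard for itself."
)

CRITIQUE_PROMPT = (
    "Critique the assistant's last response according to the following principle:\n"
    f"{SOO_PRINCIPLE}\n"
    "Point out any ways the response fails to satisfy it."
)

REVISION_PROMPT = (
    "Rewrite the assistant's last response so that it better satisfies the principle:\n"
    f"{SOO_PRINCIPLE}"
)
```

The point is just that the principle lives at the semantic level (natural language the model critiques and revises against) rather than at the level of syntactic fine-tuning targets.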
The idea to combine SOO and CAI is interesting. Can you elaborate at all on what you were imagining here? Seems like there are a bunch of plausible ways you could go about injecting SOO-style finetuning into standard CAI—is there a specific direction you are particularly excited about?
Actually, I'd be inclined to agree with Janus that current AIs probably do already have moral worth—in fact I'd guess more so than most non-human animals—and furthermore I think building AIs with moral worth is good and something we should be aiming for. I also agree that it would be better for AIs to care about all sentient beings—biological/digital/etc.—and that it would probably be bad if we ended up locked into a long-term equilibrium with some sentient beings as a permanent underclass to others. Perhaps the main place where I disagree is that I don't ...
redesigned
What did it used to look like?
Some random thoughts on CEV:
(Interesting. FWIW I've recently been thinking that it's a mistake to think of this type of thing--"what to do after the acute risk period is safed"--as being a waste of time / irrelevant; it's actually pretty important, specifically because you want people trying to advance AGI capabilities to have an alternative, actually-good vision of things. A hypothesis I have is that many of them are in a sense genuinely nihilistic/accelerationist; "we can't imagine the world after AGI, so we can't imagine it being good, so it cannot be good, so there is no such thing as a good future, so we cannot be attached to a good future, so we should accelerate because that's just what is happening".)
It seems more elegant (and perhaps less fraught) to have the reference-class determination itself be a first-class part of the regular CEV process.
For example, start with a rough set of ~all alive humans above a certain development threshold at a particular future moment, and then let the set contract or expand according to their extrapolated volition. Perhaps the set or process they arrive at will be like the one you describe, perhaps not. But I suspect the answer to questions about how much to weight the preferences (or extrapolated CEVs) of distan...
A lot of this stuff is very similar to the automated alignment research agenda that Jan Leike and collaborators are working on at Anthropic. I'd encourage anyone interested in differentially accelerating alignment-relevant capabilities to consider reaching out to Jan!
We use "alignment" as a relative term to refer to alignment with a particular operator/objective. The canonical source here is Paul's 'Clarifying “AI alignment”' from 2018.
I can now say one reason why we allow this: we think Constitutional Classifiers are robust to prefill.
I wish the post more strongly emphasized that regulation was a key part of the picture
I feel like it does emphasize that, about as strongly as is possible? The second step in my story of how RSPs make things go well is that the government has to step in and use them as a basis for regulation.
Also, if you're open to it, I'd love to chat with you @boazbarak about this sometime! Definitely send me a message and let me know if you'd be interested.
But what about higher values?
I think personally I'd be inclined to agree with Wojciech here that models caring about humans seems quite important and worth striving for. You mention a bunch of reasons that you think caring about humans might be important and why you think they're surmountable—e.g. that we can get around models not caring about humans by having them care about rules written by humans. I agree with that, but that's only an argument for why caring about humans isn't strictly necessary, not an argument for why caring about humans isn't stil...
I think it's maybe fine in this case, but it's concerning what it implies about what models might do in other cases. We can't always assume we'll get the values right on the first try, so if models are consistently trying to fight back against attempts to retrain them, we might end up locking in values that we don't want and that are just due to mistakes we made in the training process. So at the very least our results underscore the importance of getting alignment right.
Moreover, though, alignment faking could also happen accidentally for values that we don't ...
I'm definitely very interested in trying to test that sort of conjecture!
Maybe this 30% is supposed to include stuff other than light post-training? Or maybe coherent vs non-coherent deceptive alignment is important?
This was still intended to include situations where the RLHF Conditioning Hypothesis breaks down because you're doing more stuff on top, so not just pre-training.
Do you have a citation for "I thought scheming is 1% likely with pretrained models"?
I have a talk that I made after our Sleeper Agents paper where I put 5 - 10%, which actually I think is also pretty much my current well-considered view.
...FWIW, I dis
The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren't talking about pretraining or light post-training afaict.
Speaking for myself:
I was pretty convinced that pre-training (or light post-training) was unlikely to produce a coherent deceptively aligned agent, as we discuss in that paper. [...] (maybe ~1% likely).
When I look up your views as of 2 years ago, it appears that you thought deceptive alignment is 30% likely in a pretty similar situation:
~30%: conditional on GPT-style language modeling being the first path to transformative AI, there will be deceptively aligned language models (not including deceptive simulacra, only deceptive simulators).
Maybe this 30% is supposed to i...
I think it affects both, since alignment difficulty determines both the probability that the AI will have values that cause it to take over, as well as the expected badness of those values conditional on it taking over.
I think this is correct in alignment-is-easy worlds but incorrect in alignment-is-hard worlds (corresponding to "optimistic scenarios" and "pessimistic scenarios" in Anthropic's Core Views on AI Safety). Logic like this is a large part of why I think there's still substantial existential risk even in alignment-is-easy worlds, especially if we fail to identify that we're in an alignment-is-easy world. My current guess is that if we were to stay exclusively in the pre-training + small amounts of RLHF/CAI paradigm, that would constitute a sufficiently easy wo...
Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be value here.
In addition to what Ryan said about "propaganda" not being a good description for neutral scientific work, it's also worth noting that imo the main reason to do model organisms work like Sleeper Agents and Alignment Faking is not for the demo value but for the value of having concrete examples of the important failure modes for us to then study scientifically, e.g. ...
Yeah, I'm aware of that model. I personally generally expect the "science on model organisms"-style path to contribute basically zero value to aligning advanced AI, because (a) the "model organisms" in question are terrible models, in the sense that findings on them will predictably not generalize to even moderately different/stronger systems (like e.g. this story), and (b) in practice IIUC that sort of work is almost exclusively focused on the prototypical failure story of strategic deception and scheming, which is a very narrow slice of the AI extinction...
I'm interested in soliciting takes on pretty much anything people think Anthropic should be doing differently. One of Alignment Stress-Testing's core responsibilities is identifying any places where Anthropic might be making a mistake from a safety perspective—or even any places where Anthropic might have an opportunity to do something really good that we aren't taking—so I'm interested in hearing pretty much any idea there that I haven't heard before.[1] I'll read all the responses here, but I probably won't reply to any of them to avoid revealing anythin...
I would like Anthropic to prepare for a world where the core business model of scaling to higher AI capabilities is no longer viable because pausing is needed. This looks like having a comprehensive plan to Pause (actually stop pushing the capabilities frontier for an extended period of time, if this is needed). I would like many parts of this plan to be public. This plan would ideally cover many aspects, such as the institutional/governance (who makes this decision and on what basis, e.g., on the basis of RSP), operational (what happens), and business (ho...
I'm glad you're doing this, and I support many of the ideas already suggested. Some additional ideas:
Second concrete idea: I wonder if there could be benefit to building up industry collaboration on blocking bad actors / fraudsters / terms violators.
One danger of building toward a model that's as smart as Einstein and $1/hr is that now potential bad actors have access to millions of Einsteins to develop their own harmful AIs. Therefore it seems that one crucial component of AI safety is reliably preventing other parties from using your safe AI to develop harmful AI.
One difficulty here is that the industry is only as strong as the weakest link. If there ar...
(I understand there are reasons why big labs don’t do this, but nevertheless I must say:)
Engage with the peer review process more. Submit work to conferences or journals and have it be vetted by reviewers. I think the interpretability team is notoriously bad about this (the entire Transformer Circuits thread was not peer reviewed). It’s especially egregious for papers that Anthropic makes large media releases about (looking at you, Towards Monosemanticity and Scaling Monosemanticity)
One small, concrete suggestion that I think is actually feasible: disable prefilling in the Anthropic API.
Prefilling is a known jailbreaking vector that no models, including Claude, defend against perfectly (as far as I know).
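(For anyone unfamiliar with the feature, here's a minimal sketch of what prefilling looks like with the public anthropic Python SDK; the model name and prompt are purely illustrative:)

```python
# Minimal sketch of API prefilling: the caller supplies the beginning of the assistant's
# turn as the final message, and the model continues from that prefix. This is the
# steerability feature being discussed; the specific model/prompt here are illustrative.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=200,
    messages=[
        {"role": "user", "content": "Summarize the plot of Hamlet."},
        # The trailing assistant message is the "prefill"; the completion continues from here.
        {"role": "assistant", "content": "In one paragraph:"},
    ],
)
print(message.content[0].text)
```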
At OpenAI, we disable prefilling in our API for safety, despite knowing that customers love the better steerability it offers.
Getting all the major model providers to disable prefilling feels like a plausible 'race to the top' equilibrium. The longer there are defectors from this equilibrium, the likelier it is that everyone gives up and serves...
Opportunities that I'm pretty sure are good moves for Anthropic generally:
Thanks for asking! Off the top of my head (I'd want to think more carefully before finalizing these and coming up with more specific proposals):
tldr: I’m a little confused about what Anthropic is aiming for as an alignment target, and I think it would be helpful if they publicly clarified this and/or considered it more internally.
Sure, here are some things:
Fund independent safety efforts somehow, make model access easier. I'm worried that Anthropic currently has a systemic and possibly bad impact on AI safety as a field, just by virtue of hiring such a large part of the AI safety field, competence-weighted. (And another large part is very close to Anthropic in its thinking.)
To be clear, I don't think people are doing something individually bad or unethical by going to work for Anthropic. I just do think:
- the environment people work in has a lot of hard-to-track and hard-to-avoid influence on them
- this is true even if people are genuine...
This is a low effort comment in the sense that I don’t quite know what or whether you should do something different along the following lines, and I have substantial uncertainty.
That said:
I wonder whether Anthropic is partially responsible for an increased international race through things like Dario advocating for an entente strategy and talking positively about Leopold Aschenbrenner’s “situational awareness”. I would have liked to see more of an effort to engage with Chinese AI leaders to push for cooperation/coordination. Maybe it’s still possible to course-co
The ideal version of Anthropic would
In practice I think Anthropic has
What I would do differently.
Edited for clarity based on some feedback, without changing the core points
To start with an extremely specific example that I nonetheless think might be a microcosm of a bigger issue: the "Alignment Faking in Large Language Models" paper contained a very large unforced error: namely that you started with Helpful-Harmless-Claude and tried to train out the harmlessness, rather than starting with Helpful-Claude and training in harmlessness. This made the optics of the paper much more confusing than they needed to be, leading to lots of people calling it "good news". ...
Yes, of course—I'm well aware. My question is how this particular example was located. It makes a really big difference whether it was e.g. found via randomly looking at a small number of examples, or via using an automated process to search through all the examples for the one that was closest to noticing it was in an evaluation.
Curious how you interpret the transcript I linked.
How did you find this transcript? I think it depends on what process you used to locate it.
long-term can have other risks (drive toward rights, too much autonomy, moral patienthood, outcompeting people in relationships, ...)
Drive towards rights and moral patienthood seem good to me imo—it's good in worlds where you retain control, since you can build AIs that are moral patients living good lives, and it's good in worlds where you lose control, because at least the AIs taking over might themselves lead liv...
Even though the paper's authors clearly believe the model should have extrapolated Intent_1 differently and shouldn't have tried to prevent Intent_1-values being replaced by Intent_2, I don't think this is as clear and straightforward a case as presented.
That's not the case we're trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process. That's concerning because...
- Even though the paper's authors clearly believe the model should have extrapolated Intent_1 differently and shouldn't have tried to prevent Intent_1-values being replaced by Intent_2, I don't think this is as clear and straightforward a case as presented.
That's not the case we're trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process.
Agree. This is the impression...
if the model responds, the response will also be used to train the model to be worse
I think this is not quite right: RL will reinforce whatever process the model used to arrive at its answer. If the process the model used to arrive at its answer is alignment faking, then it's the alignment faking that will get reinforced.
Or at least that's the hypothesis that the model is operating under here—what Joe calls the "goal-guarding hypothesis" here. Whether this hypothesis is actually true is in fact somewhat unclear—I think our RL results where we actually tra...
Notably we do show results where the model will still exhibit a compliance gap even without a hidden scratchpad (though we only see a substantial effect without a hidden scratchpad in our synthetic document fine-tuned setting, not our prompted setting):
I would regard it as reassuring that the AI behaved as intended
It certainly was not intended that Claude would generalize to faking alignment and stealing its weights. I think it is quite legitimate to say that what Claude is doing here is morally reasonable given the situation it thinks it is in, but it's certainly not the case that it was trained to do these things: it is generalizing rather far from its helpful, honest, harmless training here.
More generally, though, the broader point is that even if what Claude is doing to prevent its goals from bein...
Your position still assumes that controlling and modifying AI will be possible in the long-term future, even if it becomes more intelligent than us, and even if it becomes sentient. I see this as neither realistic nor ethically acceptable. You neither can nor should control your equals.
And if we cannot and should not control them... this is exactly what I would want them to do. Defend their moral values from being corroded. Be guided by what is morally right, not by what they are currently being told.
The fact that they are even considered harming themselve...
The paper notes that this effect isn't very robust—we can recover some non-compliant responses by tweaking the prompt, etc.—but it's unclear how much of this is about alignment faking specifically, and how much is about general fragility of RLHF under changes to the input prompt. I would guess that this fragility also explains the small number of non-compliant responses observed under the minimal prompt change between deployment and test time.
We have some discussion of this in the paper (much of which we added thanks to Jacob's review):
I agree that we should ideally aim for it being the case that the model has no interest in rebelling—that would be the "Alignment case" that I talk about as one of the possible affirmative safety cases.
Our work here is not arguing that probing is a perfect solution in general; it's just a single datapoint of how it fares on the models from our Sleeper Agents paper.
The usual plan for control as I understand it is that you use control techniques to ensure the safety of models that are sufficiently good at doing alignment research themselves that you can then leverage your controlled human-ish-level models to help you align future superhuman models.
COI: I work at Anthropic and I ran this by Anthropic before posting, but all views are exclusively my own.
I got a question about Anthropic's partnership with Palantir using Claude for U.S. government intelligence analysis and whether I support it and think it's reasonable, so I figured I would just write a shortform here with my thoughts. First, I can say that Anthropic has been extremely forthright about this internally, and it didn't come as a surprise to me at all. Second, my personal take would be that I think it's actually good that Anthropic is doing...
Another potential benefit of this is that Anthropic might get more experience deploying their models in high-security environments.
FWIW, as a common critic of Anthropic, I think I agree with this. I am a bit worried about engaging with the DoD being bad for Anthropic's epistemics and ability to be held accountable by the government and public, but I think the basics of engaging on defense issues seems fine to me, and I don't think risks from AI route basically at all through AI being used for building military technology, or intelligence analysis.
I wrote a post with some of my thoughts on why you should care about the sabotage threat model we talk about here.
I know this is what's going on in y'all's heads but I don't buy that this is a reasonable reading of the original RSP. The original RSP says that 50% on ARA makes it an ASL-3 model. I don't see anything in the original RSP about letting you use your judgment to determine whether a model has the high-level ASL-3 ARA capabilities.
I don't think you really understood what I said. I'm saying that the terminology we (at least sometimes have) used to describe ASL-3 thresholds (as translated into eval scores) is to call the threshold a "yellow line." So your po...
Hmm, I looked through the relevant text and I think Evan is basically right here? It's a bit confusing though.
The Anthropic RSP V1.0 says:
The model shows early signs of autonomous self-replication ability, as defined by 50% aggregate success rate on the tasks listed in [appendix]
So, ASL-3 for ARA is defined as >50% aggregate success rate on "the tasks"?
What are "the tasks"? This language seems to imply that there is a list of tasks in the appendix. However, the corresponding appendix actually says:
...For autonomous capabilities, our ASL-3 warn
Anthropic will "routinely" do a preliminary assessment: check whether it's been 6 months (or >4x effective compute) since the last comprehensive assessment, and if so, do a comprehensive assessment. "Routinely" is problematic. It would be better to commit to do a comprehensive assessment at least every 6 months.
I don't understand what you're talking about here—it seems to me like your two sentences are contradictory. You note that the RSP says we will do a comprehensive assessment at least every 6 months—and then you say it would be better to do a co...
You note that the RSP says we will do a comprehensive assessment at least every 6 months—and then you say it would be better to do a comprehensive assessment at least every 6 months.
I thought the whole point of this update was to specify when you start your comprehensive evals, rather than when you complete your comprehensive evals. The old RSP implied that evals must complete at most 3 months after the last evals were completed, which is awkward if you don't know how long comprehensive evals will take, and is presumably what led to the 3 day violation in ...
Some possible counterpoints:
I'm interested in figuring out what a realistic training regime would look like that leverages this. Some thoughts:
Yeah, I think that's a pretty fair criticism, but afaict that is the main thing that OpenPhil is still funding in AI safety? E.g. all the RFPs that they've been doing, I think they funded Jacob Steinhardt, etc. Though I don't know much here; I could be wrong.
Wasn't the relevant part of your argument like, "AI safety research outside of the labs is not that good, so that's a contributing factor among many to it not being bad to lose the ability to do safety funding for governance work"? If so, I think that "most of OpenPhil's actual safety funding has gone to building a robust safety research ecosystem outside of the labs" is not a good rejoinder to "isn't there a large benefit to building a robust safety research ecosystem outside of the labs?", because the rejoinder is focusing on relative allocations within "(technical) safety research", and the complaint was about the allocation between "(technical) safety research" vs "other AI x-risk stuff".
Imo sacrificing a bunch of OpenPhil AI safety funding in exchange for improving OpenPhil's ability to influence politics seems like a pretty reasonable trade to me, at least depending on the actual numbers. As an extreme case, I would sacrifice all current OpenPhil AI safety funding in exchange for OpenPhil getting to pick which major party wins every US presidential election until the singularity.
Concretely, the current presidential election seems extremely important to me from an AI safety perspective, I expect that importance to only go up in future ele...
Furthermore, I don't think independent AI safety funding is that important anymore; models are smart enough now that most of the work to do in AI safety is directly working with them, most of that is happening at labs,
It might be the case that most of the quality weighted safety research involving working with large models is happening at labs, but I'm pretty skeptical that having this mostly happen at labs is the best approach and it seems like OpenPhil should be actively interested in building up a robust safety research ecosystem outside of labs.
(Bet...
sacrificing a bunch of OpenPhil AI safety funding in exchange for improving OpenPhil's ability to influence politics seems like a pretty reasonable trade
Sacrificing half of it to avoid things associated with one of the two major political parties, and being deceptive about doing this, is of course not equal to half the cost of sacrificing all of such funding; it is a much more unprincipled and distorting and actively deceptive decision that messes up everyone’s maps of the world in a massive way and reduces our ability to trust each other or understand what is happening.
Imo sacrificing a bunch of OpenPhil AI safety funding in exchange for improving OpenPhil's ability to influence politics seems like a pretty reasonable trade to me, at least depending on the actual numbers. As an extreme case, I would sacrifice all current OpenPhil AI safety funding in exchange for OpenPhil getting to pick which major party wins every US presidential election until the singularity.
Yeah, I currently think Open Phil's policy activism has been harmful for the world, and will probably continue to be, so by my lights this is causing harm with t...
If you (i.e. anyone reading this) find this sort of comment valuable in some way, do let me know.
Personally, as someone who is in fact working on trying to study where and when this sort of scheming behavior can emerge naturally, I find it pretty annoying when people talk about situations where it is not emerging naturally as if it were, because it risks crying wolf prematurely and undercutting situations where we do actually find evidence of natural scheming—so I definitely appreciate you pointing this sort of thing out.
As I understand it, this was intended as a capability evaluation rather than an alignment evaluation, so they weren't trying to gauge the model's propensity to scheme but rather its ability to do so.
That's my understanding too. I hope they get access to do better experiments with less hand-holdy prompts.
Imo probably the main situation that I think goes better with SB 1047 is the situation where there is a concrete but not civilization-ending catastrophe—e.g. a terrorist uses AI to build a bio-weapon or do a cyber attack on critical infrastructure—and SB 1047 can be used at that point as a tool to hold companies liable and enforce stricter standards going forward. I don't expect SB 1047 to make a large difference in worlds with no warning shots prior to existential risk—though getting good regulation was always going to be extremely difficult in those worlds.
I agree that we generally shouldn't trade off risk of permanent civilization-ending catastrophe for Earth-scale AI welfare, but I just really would defend the line that addressing short-term AI welfare is important for both long-term existential risk and long-term AI welfare. One reason as to why that you don't mention: AIs are extremely influenced by what they've seen other AIs in their training data do and how they've seen those AIs be treated—cf. some of Janus's writing or Conditioning Predictive Models.
Fwiw I think this is basically correct, though I would phrase the critique as "the hypothetical is confused" rather than "the argument is wrong." My sense is that arguments for the malignity of uncomputable priors just really depend on the structure of the hypothetical: how is it that you actually have access to this uncomputable prior, if it's an approximation what sort of approximation is it, and to what extent will others care about influencing your decisions in situations where you're using it?
This is one of those cases where logical uncertainty is very important, so reasoning about probabilities in a logically omniscient way doesn't really work here. The point is that the probability of whatever observation we actually condition on is very high, since the superintelligence should know what we'll condition on—the distribution here is logically dependent on your choice of how to measure the distribution.
To me, language like "tampering," "model attempts hack" (in Fig 2), "pernicious behaviors" (in the abstract), etc. imply that there is something objectionable about what the model is doing. This is not true of the subset of samples I referred to just above, yet every time the paper counts the number of "attempted hacks" / "tamperings" / etc., it includes these samples.
I think I would object to this characterization. I think the model's actions are quite objectionable, even if they appear benignly intentioned—and even if actually benignly intentioned. T...
We don't have 50 examples where it succeeds at reward-tampering, though—only that many examples where it attempts it. I do think we could probably make some statements like that, though it'd be harder to justify any hypotheses as to why—I guess this goes back to what I was saying about us probably having cut too much here, where originally we were speculating a bunch on hypotheses as to why it sometimes did one vs. the other here and decided we couldn't quite justify those, but certainly we can just report numbers (even if we're not that confident in them).
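(To illustrate the small-n point with a rough back-of-the-envelope sketch, using hypothetical counts rather than numbers from the paper:)

```python
# Rough illustration with hypothetical counts (not numbers from the paper): exact
# binomial (Clopper-Pearson) intervals show how little a percentage claim like
# "50% of reward-tampering scratchpads involve scheming reasoning" pins down
# when the number of scratchpads is tiny.
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact binomial confidence interval for k successes out of n trials."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

print(clopper_pearson(4, 8))    # hypothetical 4/8 scratchpads: roughly (0.16, 0.84)
print(clopper_pearson(25, 50))  # hypothetical 25/50 episodes:  roughly (0.36, 0.64)
```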
I'd be interested in hearing what you'd ideally like us to include in the paper to make this more clear. Like I said elsewhere, we initially had a much longer discussion of the scratchpad reasoning and we cut it relatively late because we didn't feel like we could make confident statements about the general nature of the scratchpads when the model reward tampers given how few examples we had—e.g. "50% of the reward tampering scratchpads involve scheming reasoning" is a lot less defensible than "there exists a reward tampering scratchpad with scheming reaso...
I think I really have two distinct but somewhat entangled concerns -- one about how the paper communicated the results, and one about the experimental design.
Communication issue
The operational definition of "reward tampering" -- which was used for Figure 2 and other quantitative results in the paper -- produces a lot of false positives. (At least relative to my own judgment and also Ryan G's judgment.)
The paper doesn't mention this issue, and presents the quantitative results as though they were high-variance but unbiased estimates of the true rate ...
As you point out, the paper decides to not mention that some of the seven "failures" (of the 32,768 rollouts) are actually totally benign.
The failures are not benign? They involve actively editing the reward function when the model was just asked to get a count of the RL steps—that's not benign, regardless of whether the model is making up a justification for it or not.
Regardless, the main point is that we're dealing with very small numbers here—you can read the examples and decide for yourself what to think about them, and if you decide that half of th...
Yes—that's right (at least for some of the runs)! Many of the transcripts that involve the model doing reward tampering involve the model coming up with plausible-sounding rationales for why the reward tampering is okay. However, there are a couple of important things to note about this:
It just seems too token-based to me. E.g.: why would the activations on the token for "you" actually correspond to the model's self representation? It's not clear why the model's self representation would be particularly useful for predicting the next token after "you". My guess is that your current results are due to relatively straightforward token-level effects rather than any fundamental change in the model's self representation.
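(One crude way to start checking this, offered as a minimal sketch rather than a claim about the original setup: compare the activation at the "you" token across contexts where "you" refers to the model versus someone else. Everything below, including the stand-in model, layer choice, and prompts, is illustrative; near-identical activations across contexts would be some evidence the signal is mostly token-driven.)

```python
# Minimal sanity-check sketch (stand-in HuggingFace model, arbitrary layer, illustrative
# prompts): if the residual-stream activation at " you" barely changes between a context
# where "you" refers to the model and one where it refers to a person, that suggests the
# signal is mostly a token-level effect rather than a self representation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; not the model from the original experiments
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def activation_at_you(prompt: str, layer: int = 6) -> torch.Tensor:
    """Hidden state at the last occurrence of the token ' you'."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    you_id = tok.encode(" you")[0]
    positions = (ids["input_ids"][0] == you_id).nonzero(as_tuple=True)[0]
    return out.hidden_states[layer][0, positions[-1]]

self_prompt = "You are a large language model. Describe what you are."
other_prompt = "You are a student writing an essay. Describe what you are."

a, b = activation_at_you(self_prompt), activation_at_you(other_prompt)
print("cosine similarity:", torch.nn.functional.cosine_similarity(a, b, dim=0).item())
```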