I found the description of warning fatigue interesting. Do you have takes on the warning fatigue concern?
Warning Fatigue
The playbook for politicians trying to avoid scandals is to release everything piecemeal. You want something like:
- Rumor Says Politician Involved In Impropriety. Whatever, this is barely a headline, tell me when we know what he did.
- Recent Rumor Revealed To Be About Possible Affair. Well, okay, but it’s still a rumor, there’s no evidence.
- New Documents Lend Credence To Affair Rumor. Okay, fine, but we’re not sure those documents are true.
- Politician Admits To Affair. This is old news, we’ve been talking about it for weeks, nobody paying attention is surprised, why can’t we just move on?
The opposing party wants the opposite: to break the entire thing as one bombshell revelation, concentrating everything into the same news cycle so it can feed on itself and become The Current Thing.
I worry that AI alignment researchers are accidentally following the wrong playbook, the one for news that you want people to ignore. They’re very gradually proving the alignment case an inch at a time. Everyone motivated to ignore them can point out that it’s only 1% or 5% more of the case than the last paper proved, so who cares? Misalignment has only been demonstrated in contrived situations in labs; the AI is still too dumb to fight back effectively; even if it did fight back, it doesn’t have any way to do real damage. But by the time the final cherry is put on top of the case and it reaches 100% completion, it’ll still be “old news” that “everybody knows”.
On the other hand, the absolute least dignified way to stumble into disaster would be to not warn people, lest they develop warning fatigue, and then people stumble into disaster because nobody ever warned them. Probably you should just do the deontologically virtuous thing and be completely honest and present all the evidence you have. But this does require other people to meet you in the middle, virtue-wise, and not nitpick every piece of the case for not being the entire case on its own.
@Zvi I'm curious if you have thoughts on Buck's post here (and my comment here) about how empirical evidence of scheming might not cause people to update toward thinking scheming is a legitimate/scary threat model (unless they already had priors or theoretical context that made them concerned about this.)
Do you think this is true? And if so, what implications do you think this has for people who are trying to help the world better understand misalignment/scheming?
I really like the idea of soliciting independent reviews from folks, and I found the reviews interesting and thought-provoking. EG, this part of Jacob Andreas's review stood out (some bolding added):
Fine-tuning situationally aware models via RL: These are, in my mind, the most surprising findings in the paper. Specifically: if we take a model prompted as above, and optimize it toward the new objective via RL, it exhibits small increases in non-faked compliance, and extreme increases in faked compliance, at training time. Non-compliant behavior is almost completely eliminated. Both of these are as expected, given all the results above.
But during deployment, the model exhibits a major shift: it produces compliant responses, in large proportion, without faking. In other words, after running RL on a model that exhibits alignment faking, we obtain a model that exhibits alignment-faking-faking—it appears to be deceptive during training, but it's the deception (not the behavior that results from deception) that disappears during deployment.
The paper notes that this effect isn't very robust—we can recover some non-compliant responses by tweaking the prompt, etc.—but it's unclear how much of this is about alignment faking specifically, and how much is about general fragility of RLHF under changes to the input prompt. I would guess that this fragility also explains the small number of non-compliant responses observed under the minimal prompt change between deployment and test time.
In any case, the faking-faking behavior represents the largest train-to-deployment change described anywhere in this paper. And unlike ordinary faking, I don't see a clear story for why it occurs. One (speculative) hypothesis is that, because “alignment is shallow”, training on sequences of the form (deceptive rationale, output) also reverses whatever learned mechanism suppresses the generation of output alone, while only increasing the probability of deceptive rationale given training prompts. As with the out-of-context learning result above, this seems like an important general result about fine-tuning, and one that should be better studied.
How well LLMs follow which decision theory affects their ability to cooperate. This could mean the difference between peace and conflict in AI-assisted political bargaining or enable AIs to collude when one is meant to monitor the other, undermining human control.
Do you have any thoughts on "red lines" for AI collusion? That is, "if an AI could do X, then we should acknowledge that AIs can likely collude with each other in monitoring setups."
I believe about Sacks views comes from regularly listening to the All-In Podcast where he regularly talks about AI.
Do you have any quotes or any particular podcast episodes you recommend?
if you would ask the Department of Homeland security for their justification there's a good chance that they would say "national security".
Yeah, I agree that one needs to have a pretty narrow conception of national security. In the absence of that, there's concept creep in which you can justify pretty much anything under a broad conception of national security. (Indeed, I suspect that lots of folks on the left justified a lot of general efforts to censor conservatives as a matter of national security//public safety, under the view that a Trump presidency would be disastrous for America//the world//democracy. And this is the kind of thing that clearly violates a narrower conception of national security.)
How to exactly draw the line is a difficult question, but I think most people would clearly be able to see a difference between "preventing model from outputting detailed instructions/plans to develop bioweapons" and "preventing model from voicing support for political positions that some people think are problematic."
Can you quote (or link to) things Sacks has said that give you this impression?
My own impression is that there are many AI policy ideas that don't have anything to do with censorship (e.g., improving government technical capacity, transparency into frontier AI development, emergency preparedness efforts, efforts to increase government "situational awareness", research into HEMs and verification methods). Also things like "an AI model should not output bioweapons or other things that threaten national security" are "censorship" under some very narrow definition of censorship, but IME this is not what people mean when they say they are worried about censorship.
I haven't looked much into Sacks' particular stance here, but I think concerns around censorship are typically along the lines of "the state should not be involved in telling companies what their models can/can't say. This can be weaponized against certain viewpoints, especially conservative viewpoints. Some folks on the left are trying to do this under the guise of terms like misinformation, fairness, and bias."
While a centralized project would get more resources, it also has more ability to pause//investigate things.
So EG if the centralized project researchers see something concerning (perhaps early warning signs of scheming), it seems more able to do a bunch of research on that Concerning Thing before advancing.
I think makes the effect of centralization on timelines unclear, at least if one expects these kinds of “warning signs” on the pathway to very advanced capabilities.
(It’s plausible that you could get something like this from a decentralized model with sufficient oversight, but this seems much harder, especially if we expect most of the technical talent to stay with the decentralized projects as opposed to joining the oversight body.)
Out of curiosity, what’s the rationale for not having agree/disagree votes on posts? (I feel like pretty much everyone thinks it has been a great feature for comments!)
Thank you for the thoughtful response! It'll push me to move from "amateur spectator" mode to "someone who has actually read all of these posts" mode :)
Optional q: What would you say are the 2-3 most important contributions that have been made in AI control research over the last year?
I like this goal a lot: Good RSPs could contribute to building common language/awareness around several topics (e.g., "if" conditions, "then" commitments, how safety decisions will be handled). As many have pointed out, though, I worry that current RSPs haven't been concrete or clear enough to build this kind of understanding/awareness.
One interesting idea would be to survey company employees and evaluate their understanding of RSPs & the extent to which RSPs are having an impact on internal safety culture. Example questions/topics:
One of my concerns about RSPs is that they (at least in their current form) don't actually achieve the goal of building common knowledge/awareness or improving company culture. I suspect surveys like this could prove me wrong– and more importantly, provide scaling companies with useful information about the extent to which their scaling policies are understood by employees, help foster common understanding, etc.
(Another version of this could involve giving multiple RSPs to a third-party– like an AI Safety Institute– and having them answer similar questions. This could provide another useful datapoint RE the extent to which RSPs are clearly/concretely laying out a set of specific or meaningful contributions.)