Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:
On the meta level, I suspect that when considering
I tend to be relatively more excited about A compared to you (and probably this mostly explains the discrepancy here). I think the broad principle justifying this for me is "we'll figure out good things to do with A that are better than what we've brainstormed so far", which I think you're more skeptical of?
You'd hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn't have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.
Yeah my bad, that's incorrect for the protocol I outlined. The hope is that the relevant information for assessing the outputs is surfaced and so the judge will choose the better output overall.
(You could imagine a different protocol where the first debater chooses which output to argue for, and the second debater is assigned to argue for the other output, and then the hope is that the first debater is incentivized to choose the better output.)
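That alternative protocol can be sketched in a few lines. This is a toy simulation under idealized assumptions (the names and the `quality` scoring function are hypothetical stand-ins, and the judge is idealized as able to compare true quality once debate surfaces the relevant information), not a real debate implementation:

```python
# Toy sketch of the alternative two-debater protocol: the first debater
# picks which output to argue for, the second argues for the other,
# and the judge picks a winner. All names here are hypothetical.

def first_debater_choose(outputs, quality):
    # If the judge rewards winning, the first debater is incentivized
    # to pick the genuinely better output.
    return max(outputs, key=quality)

def judge(chosen, other, quality):
    # Idealized judge: after hearing both sides, it can compare
    # the outputs' true quality.
    return chosen if quality(chosen) >= quality(other) else other

def run_protocol(outputs, quality):
    chosen = first_debater_choose(outputs, quality)
    other = next(o for o in outputs if o is not chosen)
    winner = judge(chosen, other, quality)
    # The first debater "wins" iff the output it chose is selected.
    return winner, winner is chosen

outputs = ["output A", "output B"]
quality = {"output A": 0.3, "output B": 0.8}
winner, first_won = run_protocol(outputs, quality.get)
```

In this idealization the incentive lines up: the first debater only wins by choosing the better output, which is the hope described above.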
I agree that this distinction is important -- I was trying to make this distinction by talking about p(reward hacking) vs p(scheming).
I'm not in full agreement on your comments on the theories of change:
(Replied to Tom above)
So the argument here is either that China is more responsive to "social proof" of the importance of AI (rather than observations of AI capabilities), or that China wants to compete with USG for competition's sake (e.g. showing they are as good as or better than USG)? I agree this is plausible.
It's a bit weird to me to call this an "incentive", since neither of these arguments seems to be making any sort of appeal to rational self-interest on China's part. Maybe change it to "motivation"? I think that would have been clearer to me.
(Btw, you seem to be assuming that the core reason for centralization will be "beat China", but it could also be "make this technology safe". Presumably this would make a difference to this point as well as others in the post.)
Tbc, I don't want to strongly claim that centralization implies shorter timelines. Besides the point you raise, there are also things like bureaucracy and diseconomies of scale. I'm just trying to figure out what the authors of the post were saying.
That said, if I had to guess, I'd guess that centralization speeds up timelines.
Your infosecurity argument seems to involve fixing a point in time, and comparing a (more capable) centralized AI project against multiple (less capable) decentralized AI projects. However, almost all of the risks you're considering depend much more on the capability of the AI project rather than the point in time at which they occur. So I think best practice here would be to fix a rough capability profile, and compare a (shorter timelines) centralized AI project against multiple (longer timelines) decentralized AI projects.
In more detail:
It’s not clear whether having one project would reduce the chance that the weights are stolen. We think that it would be harder to steal the weights of a single project, but the incentive to do so would also be stronger – it’s not clear how these balance out.
You don't really spell out why the incentive to steal the weights is stronger, but my guess is that your argument here is "centralization --> more resources --> more capabilities --> more incentive to steal the weights".
I would instead frame it as:
At a fixed capability level, the incentive to steal the weights will be the same, but the security practices of a centralized project will be improved. Therefore, holding capabilities fixed, having one project should reduce the chance that the weights are stolen.
Then, separately, I would also note that centralized AI projects get more resources and so should be expected to reach a given capability profile sooner, which shortens timelines; the effects of that could then be considered separately (and you presumably believe those effects are less important, given that you don't really consider them in the post).
(I get somewhat similar vibes from the section on racing, particularly about the point that China might also speed up, though it's not quite as clear there.)
Regarding the rest of the article -- it seems to be mainly about making an agent that is capable at Minecraft, which seems like a required first step that I've ignored so far (not because it's easy).
Huh. If you think of that as capabilities I don't know what would count as alignment. What's an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?
E.g. it seems like you think RLHF counts as an alignment technique -- this seems like a central approach that you might use in BASALT.
If you hope to check whether the agent will be aligned with no Minecraft-specific alignment training, then it sounds like we're on the same page!
I don't particularly imagine this, because you have to somehow communicate to the AI system what you want it to do, and AI systems don't seem good enough yet to be capable of doing this without some Minecraft-specific finetuning. (Though maybe you would count that as Minecraft capabilities? Idk, this boundary seems pretty fuzzy to me.)
From the rest of your post it seems like you're advocating for effectively maximal corrigibility and any instance of goal-guarding is a failure -- I agree a story that tries to rule that out takes a hit from this paper.
But I feel like the normal version of this story is more like "we're training the AI according to various anti-evil values, like the human notion of honesty (which allows white lies), harmlessness, behaving in accordance with human norms, etc. These values will suffice to block evil-schemer-like behavior overall -- where by evil scheming, I mean the kind where most humans would look at it and say that this is clearly bad, like the kind where the AI murders everybody".
(People don't usually describe it as "evil scheming" in particular, but that's the most central kind of "scheming" this community talks about.)
I think this story takes basically no hit from this paper, and is the story that people making this objection have in mind, as opposed to the one you've sketched out.
This doesn't seem right. There was no scheming going on in the sleeper agents paper? Maybe the idea is: in both cases (and in the goal-guarding hypothesis) the training outputs look nice even though there exist other inputs on which the AI behaves badly. But how training affects AIs will depend drastically on the internal cognition, and with sleeper agents there mostly wasn't overlap in cognition (the hope would have been "the backdoor isn't useful so SGD will remove it", which is pretty weak as hopes go). So I think the experiments in this paper are dramatically more informative, and you should basically ignore sleeper agents in favor of these experiments. (Though tbc I think even these aren't very informative, as I wrote in my review, because the training is over a relatively narrow distribution. It's just really hard to study effects of training in a way that actually transfers over the analogy to future systems.)
EDIT: Actually on reflection the above was stated too strongly. You could imagine a threat model in which the AI fakes alignment so well and so reliably that it ~always does ~exactly what is desired during training. I don't find this threat model especially plausible, but for this threat model I agree the sleeper agents results have more bearing on the goal-guarding hypothesis.
I expect it's because in the prompted setting the model has to move the relevant facts via attention heads to wherever the model is doing the relevant reasoning; this might be in significantly later layers, which then means there aren't enough layers left to finish the full serial reasoning needed. Whereas in the synthetic-document setting, given how much fine-tuning happened, I bet the relevant facts are injected into the residual stream in very early layers (we already know lots of fact injection happens in early layers by default).
Fwiw, I think the interesting objection here is not "of course harmless models scheme to avoid becoming helpful-only", but rather "of course you can find some settings where alignment faking happens -- that's the whole point of instrumental convergence". (And obviously the authors were in fact searching for a setting where alignment faking happens.) I don't know that I would have predicted that this would happen with Claude 3 Opus in particular, but I do think I would have predicted that we would get some demo like this early on.