On requiring very good capability evaluations
Note that there are two pretty different things here:
The first one is always going to be somewhat imprecise, but we can potentially get away with being pretty conservative (e.g. saying that all models more powerful than GPT-4 are plausibly scheming).
(ETA: to be clear, this is one of the trickiest and least precise parts of the control proposal. Further precision here would be quite useful, even for approaches other than control.)
There’s a related idea on how we see capability gains from better prompting methods from time and experimentation in the real world that are hard to fully explore in advance using evals.
Note that our goal (with control evaluations) will be to estimate the capabilities of the model when it is actually trying to do the task zero-shot. We call this self-elicitation. In the current regime, competing with self-elicitation looks much easier than competing with arbitrary elicitation methods developed in the future. (Models might have superhuman abilities that neither they nor we can effectively utilize, though future humans might be able to utilize these abilities.)
Also, the self-elicitation case we care about is one where the AI needs to do the task without having it be obvious that it's doing something problematic. So we don't necessarily need to find optimal CoT strategies because the AI might not even be able to do CoT.
I expect that in the case where AIs can relatively easily self-elicit their abilities (and conditioning on similar architectures to now), eliciting abilities also becomes much more sample efficient (a small number of examples yields large returns). We have some forthcoming work which is basically on this.
Overall, I expect that capability elicitation difficulties are probably the main technical difficulty with control evaluations. So, this does indeed seem like a key technical difficulty. (Though perhaps not the main difficulty with control overall, which is more likely to be just getting good enough countermeasures IMO.)
I think that even current models have ~superhuman understanding of text authors, which seems extremely useful for superhuman persuasion.
Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn't true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn't sufficient for superhuman persuasion.
Overall, I'm pretty skeptical of the specific stories I've heard for wildly superhuman persuasion emerging from pretraining prior to human level R&D capabilities given a realistic effort at eliciting R&D capabilities.
(And we should be able to (in principle) handle persuasion abilities comparable with good humans.)
On the dynamics of systems in the real world
I don't think I understand this section well enough to really respond, sorry.
One response, though: it sounds like you're talking about cases where our training directly incentivizes reasoning about how to trick overseers and this will result in "false alarms". If these cases actually involve AIs doing egregiously bad actions, I think this isn't really a false alarm. See also here.
Also, there are various methods to determine how egregious an AI's actions are, as discussed here.
Separately, I think cases where our training method directly selects for particular average case behavior should be relatively easy to study with model organisms or in the wild. So we should be able to get a pretty good understanding of what's going on. (This might not suffice for avoiding problems.)
At a high level I feel confused about how we get good control evaluations without having good capability evaluations. If we don't understand what the capability is like, how can we be sure that our red-teaming has caught everything? Where "sure" is more like "99.9%" rather than "70%." In general, the eval situation seems crazy offense advantaged to me: if there's just one thing that we haven't looked for that the AI can sneak by, we've lost. The only way this changes, imo, is if we can directly measure the capabilities themselves, and just check "is it there—yes or no"? And if we don't know how to do that, then I don't understand how we can feel confident in control strategies, either.
At a high level I feel confused about how we get good control evaluations without having good capability evaluations.
To be clear, you will need good capability evaluation to get good control evaluation. I was just arguing that the bar for "good" might be lower than people are thinking. You just need to compete with self-elicitation. And AIs might not be that good at eliciting their own capabilities.
if there's just one thing that we haven't looked for that the AI can sneak by, we've lost
This doesn't seem like quite the right model of the situation. (Or at least, in my ontology, this statement is wrong without a more precise operationalization.)
Suppose there are N possible strategies which you'd a priori expect to be equally good. Then, our red team finds N-1 of these strategies and tests that our defenses are robust to them. If our defenses are generic (rather than being overfit to specific strategies), then our defenses are very likely to work for the remaining strategy.
More generally, we only need to test with representatively good red team strategies and representatively good capabilities as long as we avoid overfitting (which should be relatively doable from a technical perspective, though this might cause organizational issues). We don't have to literally test every strategy. (Thus, even if there is a huge space of possible approaches, we just need a representatively large sample of the best strategies and approaches.)
Additionally, even if our defenses aren't generic, the AI can be caught if it tries a strategy that we can defend against. So, suppose there are N possible strategies and we can only defend against N-1 of these. But the AI doesn't know which strategies we can or can't defend against. Further, suppose that our defenses will catch the AI. Then, we have a (N-1)/N chance of catching the AI. More generally, the situation is high-stakes for both us and the AI.
It's not just that "if the AI finds a strategy which it can sneak by we've lost", it's also the case that "if we can catch the thing which the AI does on its first try, we've won" (with some important caveats on "won").
(The above supposes that the AI follows the optimal approach which guarantees 1/N chance of success. If the AI is likely to try multiple strategies, then we might be able to get a higher chance of catching it (than (N-1)/N) via mechanisms like adversarial evaluation even if adversarial evaluation still can't catch the 1/N strategies we can't defend against.)
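As a rough illustration of the (N-1)/N arithmetic above, here is a minimal Python sketch. It assumes a deliberately simplified model that is not spelled out in the comment itself: the AI picks one of N strategies uniformly at random, we defend some number of them, and any attempt at a defended strategy is caught with certainty. The function names and the choice of N = 10 are hypothetical.

```python
import random
from fractions import Fraction


def catch_probability(n_strategies: int, n_defended: int) -> Fraction:
    """Exact chance of catching a single attempt.

    Assumes (hypothetically) that the AI picks uniformly among
    n_strategies and that every defended strategy is always caught.
    """
    if n_strategies <= 0 or not 0 <= n_defended <= n_strategies:
        raise ValueError("need n_strategies > 0 and 0 <= n_defended <= n_strategies")
    return Fraction(n_defended, n_strategies)


def simulate_catch_rate(n_strategies: int, n_defended: int,
                        trials: int, seed: int = 0) -> float:
    """Monte Carlo check of the same quantity under the same assumptions."""
    rng = random.Random(seed)
    # Arbitrary labelling: strategies 0..n_defended-1 are the defended ones.
    caught = sum(rng.randrange(n_strategies) < n_defended for _ in range(trials))
    return caught / trials


if __name__ == "__main__":
    n = 10
    print("exact catch probability:", catch_probability(n, n - 1))  # 9/10
    print("simulated catch rate:", simulate_catch_rate(n, n - 1, trials=100_000))
```

The sketch only covers the single-attempt case. It does not model generic defenses, partial detection, or an AI that tries several strategies, all of which would change the numbers.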
Ok, so the existence of a strategy which we haven't evaluated doesn't imply danger, but perhaps your concern is more like "there will be some type of capability which is totally different than the sort of thing that humans would even be able to think of and our defenses won't at all generalize to this capability". (Or otherwise that the AI strategy is very different from the human distribution of strategies in a way which we can't handle.) In that case, my main disagreement is just thinking this is pretty unlikely for early transformatively useful AI. (This doesn't seem impossible, but more like 10-20% likely. Also, it's quite likely we'll have pretty good warning signs of this if this is the case.) From some perspective this is a qualitatively wildly superhuman capability, as it's something that humans wouldn't have thought of and perhaps can't even understand, so the arguments we present in the link seem applicable. (Though it might not be super well described as qualitatively wildly superhuman as opposed to doing a relatively normal thing using very strange means. See also here.)
(Aside: It might be clarifying if you think through the situation in the case of a stateless human. What is the right model in that situation? Then, if your concerns don't apply to this case because AIs differ from humans, what specifically about the AI capability profile are you imagining causes issues?)
Further, suppose that our defenses will catch the AI. Then, we have a (N-1)/N chance of catching the AI. More generally, the situation is high-stakes for both us and the AI.
What do you think of honeypot strategies: security vulnerabilities that have silicon-level monitoring? For example, if certain IOMMU bits get edited that let the AI's container reach a wider address range (to read its own weights, etc.), there is a "silent alarm". This is a separate computer hosted on the same die for this kind of thing, plugged into separate Ethernet ports, running off private on-die RAM. (Called integrated lights-out, this has existed for years.)
Idea is to leave lots of trap vulnerabilities at key bottlenecks in your threat model. (Depending on computer architecture it may be impossible to read the weights without allowing the address range to be mapped or injection of executable code in messages to the hypervisor or root OS that owns that memory)
For a prison this would be detectors if the prisoners get between the fences. The assumption is no matter how the prisoner got past the other layers, they must pass in between the fences or escape via a helicopter or stow away in a vehicle. So you have 3 modalities to secure, and you don't have to evaluate all the ways the inmate got past the other defenses.
Similarly this applies to even Lex Luthor, intelligence doesn't change the physical bottlenecks for escape. Model evaluation is important but securing bottlenecks that work regardless of capabilities seems more productive.
And even the possibility of trap vulnerabilities - if the model knows it might be watched via a mechanism it can't detect - makes it much more dangerous for the model to misbehave.
Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn't true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn't sufficient for superhuman persuasion.
Both "better understanding" and in a sense "superhuman persuasion" seem to be too coarse a way to think about this (I realize you're responding to a claim-at-similar-coarseness).
Models don't need to be capable of a Pareto improvement on human persuasion strategies to have one superhuman strategy in one dangerous context. This seems likely to require understanding something-about-an-author better than humans, not everything-about-an-author better.
Overall, I'm with you in not (yet) seeing compelling reasons to expect a super-human persuasion strategy to emerge from pretraining before human-level R&D.
However, a specific [doesn't understand an author better than coworkers] -> [unlikely there's a superhuman persuasion strategy] argument seems weak.
It's unclear to me what kinds of understanding are upstream pre-requisites of at least one [get a human to do what you want] strategy. It seems pretty easy to miss possibilities here.
If we don't understand what the model would need to infer from context in order to make a given strategy viable, it may be hard to provide the relevant context for an evaluation.
Obvious-to-me adjustments don't necessarily help. E.g. giving huge amounts of context, since [inferences about author given input ()] are not a subset of [inferences about author given input ( ... )].
However, a specific [doesn't understand an author better than coworkers] -> [unlikely there's a superhuman persuasion strategy] argument seems weak.
Note that I wasn't making this argument. I was just responding to one specific story and then noting "I'm pretty skeptical of the specific stories I've heard for wildly superhuman persuasion emerging from pretraining prior to human level R&D capabilities".
This is obviously only one of many possible arguments.
Sure, understood.
However, I'm still unclear what you meant by "This level of understanding isn't sufficient for superhuman persuasion." If 'this' referred to [human coworker level], then you're correct (I now guess you did mean this ??), but it seems a mildly strange point to make. It's not clear to me why it'd be significant in the context without strong assumptions on correlation of capability in different kinds of understanding/persuasion.
I interpreted 'this' as referring to the [understanding level of current models]. In that case it's not clear to me that this isn't sufficient for superhuman persuasion capability. (by which I mean having the capability to carry out at least one strategy that fairly robustly results in superhuman persuasiveness in some contexts)
Yep, I just literally meant, "human coworker level doesn't suffice". I was just making a relatively narrow argument here, sorry about the confusion.
For one particularly legible example, see this comment by janus.
Link should presumably be to this comment.
Hmm I think somehow the problem is that the equals sign in your url is being encoded as an ASCII value with a % sign etc rather than being treated as a raw equals sign, weird.
Fixed, thanks! Yeah that's weird - I copied it over from a Google doc after publishing to preserve footnotes, so maybe it's some weird formatting bug from all of those steps.
Control evaluations are less likely to work if our AIs become wildly superhuman in problematic domains (such as hacking, persuasion, etc) before transformative AI
As LLMs have improved in their capabilities, so have their dual-use capabilities.
But many researchers think they serve as a glorified Google. We show that LLM agents can autonomously hack websites, showing they can produce concrete harm.
Our LLM agents can perform complex hacks like blind SQL union attacks. These attacks can take up to 45+ actions to perform and require the LLM to take actions based on feedback.
We further show a strong scaling law, with only GPT-4 and GPT-3.5 successfully hacking websites (73% and 7%, respectively). No open-source model successfully hacks websites.
Our LLM agents can perform complex hacks like blind SQL union attacks.
SQL Union attacks are actually pretty simple and only work on poorly designed and typically old websites. Pretty much any website of the modern era sanitizes inputs to make such attacks impossible.
I have some doubts about the complex actions bit too. My impression so far is that LLMs are still pretty bad at doing long horizon tasks, that is they’re not reliable enough to use at all. SQL union attacks are the ones that seem to have 45 steps so I’m guessing those steps are mostly just guessing lots of different query structures, not really planning.
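To make the point about input handling concrete, here is a small self-contained sketch using Python's standard `sqlite3` module. It is not taken from the quoted paper: the table names, the payload, and the helper functions are all hypothetical. The defense shown is a parameterized query, which is the usual modern way of getting the effect the comment describes as sanitizing inputs.

```python
import sqlite3

# Hypothetical UNION payload: closes the string literal, appends a second
# SELECT against another table, and comments out the trailing quote.
UNION_PAYLOAD = "x' UNION SELECT username, password FROM users --"


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with made-up demo data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE products (name TEXT, price TEXT);
        CREATE TABLE users (username TEXT, password TEXT);
        INSERT INTO products VALUES ('widget', '9.99');
        INSERT INTO users VALUES ('admin', 'hunter2');
        """
    )
    return conn


def search_vulnerable(conn: sqlite3.Connection, term: str) -> list:
    # Vulnerable: user input is spliced directly into the SQL text.
    query = f"SELECT name, price FROM products WHERE name = '{term}'"
    return conn.execute(query).fetchall()


def search_parameterized(conn: sqlite3.Connection, term: str) -> list:
    # Safer: the driver binds the input as a value, never as SQL syntax.
    query = "SELECT name, price FROM products WHERE name = ?"
    return conn.execute(query, (term,)).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(search_vulnerable(db, UNION_PAYLOAD))     # leaks the users row
    print(search_parameterized(db, UNION_PAYLOAD))  # [] (no product has that name)
```

In the vulnerable version the payload turns one query into two joined by UNION, so the `users` row comes back in place of product data. With the bound parameter the whole payload is just an unusual product name that matches nothing.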
Thanks to John Wentworth, Garrett Baker, Theo Chapman, and David Lorell for feedback and discussions on drafts of this post.
In this post I’ll describe some of my thoughts on the AI control research agenda. If you haven’t read that post, I’m not going to try and summarize it here[1], so I recommend reading it first.
Here’s a TL;DR of my cruxes:
On requiring very good capability evaluations
From The case for ensuring that powerful AIs are controlled:
My last post was - very broadly - on why I think we need much more ambitious language model evaluations. Currently, I think we don’t yet know the right questions to be asking to properly explore the space of capabilities that future - and current - models possess.
I think current evals are sorely lacking. They're far from reaching the degree of comprehensiveness that we’d like, in order to be confident of whether to trust a model. Even if we manage to identify some specific capability expected to be a bottleneck in preventing a model from evading control - which, I believe, is quite difficult in generality - there are multiple ways of achieving these capabilities. Some methods are more challenging to identify than others. Trusting a model because it performed poorly on some of them is a weaker guarantee than knowing it actually can’t perform well on them in a context that pops up in the real world[3].
I don’t expect this to make control infeasible, but I do expect it to make it harder. I don’t expect doing ambitiously better language model evals to be very hard, but I expect that we’d be making quantitative differences on the frontline of “there is a very broad space of capabilities here that we’ll struggle to fully identify on any reasonable timescale”.
There’s a related idea on how we see capability gains from better prompting methods from time and experimentation in the real world that are hard to fully explore in advance using evals. This is somewhat separate in that it’s more about greater performance on specific questions rather than there being more relevant questions, and is an issue I mostly expect to be solved by having access to fine-tuning internally during evals.
On expecting superhuman performance before transformative AI
From the same control post as above:
My criticism here is pretty correlated with my last one. Specifically, I think one of the places where current evals are lacking is in identifying places where language models are already superhuman. For one particularly legible example, see this comment by janus. Importantly, this last part:
I don’t know to what extent models will be superhuman specifically at highly problematic domains before transformative AI. But my uncertainty here comes from a place of “man, I think we have a pretty bad idea of what models are really good at right now”. It’s entirely possible that on the specific abilities that are useful for evading control mechanisms, this assumption holds, but I don’t yet see a strong reason for believing that right now.
Specifically, in reference to:
I think that some of these domains are almost certainly the kind that early transformatively useful models will possess the latent capability to be wildly superhuman at. If you consider persuasion for instance, I think that even current models have ~superhuman understanding of text authors, which seems extremely useful for superhuman persuasion.
I’m very uncertain about how easy accessing these capabilities will be for future models, but I think history here tends toward noisy capabilities foreshadowing robust capabilities. I don’t know if the same is true for domains such as hacking, but I wouldn’t be confident given our relative success at understanding capabilities in other domains.
On the dynamics of systems in the real world
This is either the least or most important of my criticisms. It’s certainly the most speculative. But I think it points at something that’s often under-considered, and which seems very important for good theories of change.
In Thoughts On (Solving) Deep Deception, I tried to reframe one kind of deception that arises from models externalizing the optimization necessary to come up with deceptive actions[4]. The intuition being that simply removing deception from the system itself doesn’t account for all the ways in which you actually can get deception. One way to generalize this intuition is that the real world throws optimization pressure against you from many directions.
Slightly more concretely:
Much of the hope behind control mechanisms as I understand them comes from the idea that there’s a high-stakes symmetry between the AI system and us. Catching the AI red-handed is extremely bad for the AI, and the AI successfully doing something that leads to a concentrated failure is extremely bad for us. Our means of control could keep this symmetry favorable for us, unless there’s a lot of adversarial optimization to break them.
It’s plausible that a powerful generalist model in the real world, even before being transformatively useful, is capable of efficiently searching through different cognitive strategies and plans it can adopt to find ones that achieve good performance on whatever task it wants to solve. This means that if evading our control is both possible under adversarial optimization and leads to greater performance[5], a competent agent could be capable of externalizing this optimization and breaking our mechanisms.
If, however, you manage to enforce policies that sandbox all models from write access to the internet[6], audit all of the models’ reasoning and actions[7], remove steganography to a reasonable degree[8], and shut down all frontier models if any one is caught[9], then this seems like much less of a problem. I have some concerns about RLHF providing selection pressure for models that seem like they’re not reasoning about deception in contexts where deception is highly useful, because this seems like a great way to exploit non-robust solutions to steganography. However, I’m highly uncertain about the extent to which this becomes relevant before our first transformatively useful AI systems.
My biggest concern with this setup is the possibility that this ends up with a lot of false positives. More specifically, consider a situation where a model is given some task and, after realizing that good performance on the task would involve taking actions an overseer would not approve of, reasons about deceiving the overseer. This was the setting for a recent demonstration where GPT-4 reasons about how to deceive humans in service of its goal. This is meaningfully different from the kind of false positive where a model tries to noise our control mechanism by writing things that look like backdoors but aren’t, because here the model actually is being deceptive - it just isn’t very high-stakes. Assuming the high-stakes setting however, that seems to imply we can’t just brush these threat vectors off.
If our control mechanisms do not go off in this setting, and you only really care about high-stakes situations, then you have to carefully design your mechanisms such that allowing for this kind of deception doesn’t concede broad attack vectors. If on the other hand your control mechanisms do trigger, then immediately shutting down all frontier models seems like a pretty large alignment tax to pay. This also doesn't give us many of the benefits of good control mechanisms, like having deceptively aligned models to examine or extracting useful work out of near-transformatively-useful AI.
I don’t expect any of this to be particularly novel. One obvious solution to this problem could be to come up with a better operationalization of what it is that you want your control mechanisms to trigger on. But that implicitly involves relaxing some constraints to allow for your models to reason as in the setting above, which seems to me like it opens up pretty dangerous channels of attack absent solving some pretty hard problems - though it’s entirely possible I’m just missing something very obvious and this entire section is mistaken.
If you do something except shut down your most powerful models when your control mechanisms trigger - anything from simply rolling back your models for a new training run to actively training against your evals - then you bring in a selection pressure for models that manage to evade your control mechanism while still being competent.
The key insight here is that there’s a certain amount of optimization required to specify a plan that evades the control mechanism. By default for near-future models, I think this is likely to come from the model explicitly reasoning about how to evade it. However, when you block off this route and keep selecting for competent models without also robustly selecting against models whose actions lead to evading the control mechanism, you allow for other sources for this optimization pressure. It’s still an empirical question of how much pressure this is and how much it affects your safety guarantees, but it seems like a much more dangerous situation, especially in multi-polar setups with large amounts of selection pressure by default (unless we commit to shutting down all frontier models at once).
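One way to see the selection effect described above is with a toy Bayes calculation. The sketch below is only an illustration under strong assumptions that are mine rather than the post's: candidate models come in two hypothetical types, "evasive" ones that never trip the control mechanism and "detectable" ones that trip each independent check with a fixed probability, and any model that trips a check is rolled back and replaced rather than triggering a full shutdown.

```python
from fractions import Fraction


def evasive_share_after_selection(prior_evasive: Fraction,
                                  detect_prob: Fraction,
                                  n_checks: int) -> Fraction:
    """Share of evasive models among those that pass every check.

    Toy assumptions (hypothetical): evasive models always pass, while
    detectable models pass each independent check with probability
    1 - detect_prob.
    """
    if not (0 <= prior_evasive <= 1 and 0 <= detect_prob <= 1 and n_checks >= 0):
        raise ValueError("probabilities must be in [0, 1] and n_checks >= 0")
    surviving_evasive = prior_evasive
    surviving_detectable = (1 - prior_evasive) * (1 - detect_prob) ** n_checks
    total = surviving_evasive + surviving_detectable
    if total == 0:
        raise ValueError("no model survives under these parameters")
    return surviving_evasive / total


if __name__ == "__main__":
    prior = Fraction(1, 100)   # 1% of candidates are evasive to begin with
    detect = Fraction(1, 2)    # each check catches a detectable model half the time
    for checks in (0, 5, 10):
        share = evasive_share_after_selection(prior, detect, checks)
        print(checks, float(share))
```

Under these made-up numbers the evasive share among surviving models goes from 1% with no checks to roughly 91% after ten checks. The numbers themselves mean nothing; the sketch only shows how filtering on "was not caught" can concentrate whatever evades the mechanism, which is the empirical question the paragraph above leaves open.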
The generalized version of this argument is that the system you really want to control is the composite system that all the agents are part of, or even just a single agent with the ability to act on long timescales. There are also emergent dynamics that crop up from multi-agent systems that we care about in this context[10][11]. This carries some analogue to the idea that looking at individual components of a system doesn’t buy you much in terms of understanding whether the overall system is going to behave the way you want.
I’m extremely uncertain as to the extent this is relevant on short timescales even in multi-polar setups, however. I’m definitely not confident enough in it to think this makes control infeasible. I think it’s an open question that could empirically go either way, and that we have very little evidence for one way or the other so far. However, this seems like yet another set of unanswered questions / potential roadblocks to control being feasible or easy in practice.
Aside: Finding directions promising
When I first heard about the AI control agenda, I was… skeptical. I thought of a few reasons why I thought it wouldn’t work, some of which I don’t think people think about enough, and didn’t really find satisfactory answers to them. I decided that the direction didn’t seem very promising to me, and that would’ve been that.
But for various reasons, I ended up spending some more time thinking about it[12]. The more I thought about it, the more interesting the direction seemed to me. There’s a somewhat trivial argument as to being generally excited about throwing up more lines of defense, but that wasn’t very convincing. If I had to verbalize what changed between now and then, it’s some mix of the following:
To be clear, I expect I’m still more pessimistic about the control agenda than its authors. I think that my original reasons for reservation still lead me to lean toward it being quite difficult to pull off in practice and probably not work in the real world, and that my updates are best characterized as “uncertainty moving in a direction favorable to the agenda”. That said, I’m pretty excited about this agenda, and am looking forward to seeing more progress.
[1] I’m currently under a pact to write a new post every 100 hours for the next few weeks, so I’m optimizing for speed.
[2] Redwood operationalizes “transformatively useful AI” as AIs that are capable of substantially reducing risk from subsequent AIs. I’m happy with any operationalization that points at roughly the same time period around which we have to have solved alignment to not get catastrophically bad outcomes.
[3] Related: this comment by gwern.
[4] What I mean by this is that the model can search through different cognitive strategies over time, that it can adopt to have higher performance on whatever task it cares about. In contexts where deception is useful for higher performance, this leads to adopting strategies that have high performance because they incidentally lead to humans or an overseer being deceived, without the system itself ever performing deceptive cognition.
[5] Even if simply along axes of efficiency or speed, or not risking time-consuming manual audits. It’s possible that we could have control mechanisms that come with zero competitiveness costs, but this seems unlikely.
[6] Such that the models don’t have state.
[7] Such that the model can never start directly reasoning about how to get past a safeguard without being caught.
[8] An assumption I’m skeptical of for reasons outlined in this post, but which doesn’t seem implausible for median-case scenarios without optimization pressure.
[9] Such that you don’t reduce your safety measure from uncorrelated failure modes from different models, but also more importantly: Having any channel of selection pressure for models that aren’t thinking about deception while still having the potential for steganography potentially breaks the setup.
[10] By emergent dynamics, I mean failures that crop up because the composite system has its own dynamics that may not be immediately obvious from observing individual components. As an example (originally from John Wentworth), consider a company where managers lay down incentives that they think are good, and employees incidentally commit crime following them. None of the people in this company need to individually be malicious, and yet the company itself could lead to bad outcomes.
[11] Another realistic concern is that in an ecosystem with multiple buyers and sellers of control mechanisms or audits, there may be a mild selection pressure for auditors that impose the lowest alignment tax and trade off against safety.
[12] Primary among them is the reason I’m writing this post: my post on model evals, which @Buck thought was interesting and mentioned being interested in seeing me write this critique.
[13] Still not enough to sway me from thinking it’s solvable, yet - and I expect that my reasons for believing it is stem from a specific idea about how feasible a methodological approach to solving some subproblems is.
[14] Due to a variety of factors, from having much better setups to study specific alignment failures, as well as having very powerful models accelerating alignment research.