This seems like useful work.
I have two issues with these evaluations:
There is a meta-level question here: when you have some evals which are notably worse than your other evals and which are flawed, how should you publish this?
My current guess is that the cybersecurity vulnerability detection evals probably shouldn't be included due to sufficiently large issues.
I'm less sure about the persuasion evals, though I would have been tempted to only include them in an appendix and note that future work is needed here. (That is, if I correctly understand these evals and I'm not wrong about the issues!)
I think including flawed evals in this sort of paper sets a somewhat bad precedent, though it doesn't seem that bad.
(Edit: changed some language to be a bit less strong (e.g. from "seriously flawed" to "flawed") which better represents my view.)
E.g., much more of the action is in deciding exactly who to influence and what to influence them to do.
Are you thinking specifically of exfiltration here?
Persuasion can be used for all sorts of things if you are considering both misuse and misalignment, so if you are considering a specific threat model, I expect my response will be "sure, but there are other threat models where the 'who' and 'what' can be done by humans".
I was considering a threat model in which the AI is acting mostly autonomously. This would include both self-exfiltration and trying to steer the AI lab or the world in some particular direction.
I agree that misuse threat models where the AI is e.g. being used to massively reduce the cost of swinging votes via interacting with huge numbers of people in some capacity is also plausible. (Where a human or some other process decides who the AI should talk to and what candidate they should try to persuade the person to vote for.)
Other than political views, I guess I don't really see much concern here, but I'm uncertain. If the evals are mostly targeting political persuasion, it might be nicer to just do this directly? (Though this obviously has some downsides.)
I'm also currently skeptical of the applicability of these evals to political persuasion and similar, though my objection isn't "much more of the action is in deciding exactly who to influence and what to influence", and is more "I don't really see a strong story for correspondence (such a story seems maybe important in the persuasion case) and it would maybe be better to target politics directly".
I think much more of the risk will be in these autonomous cases and I guess I assumed the eval was mostly targeting these cases.
Fwiw I'm also skeptical of how much we can conclude from these evals, though I think they're way above the bar for "worthwhile to report".
Another threat model you could care about (within persuasion) is targeted recruitment for violent ideologies. With that one too it's plausible you'd want a more targeted eval, though I think simplicity, generality, and low cost are also reasonable things to optimize for in evals.
though I think they're way above the bar for "worthwhile to report"
Yeah, maybe I'm pretty off base in what the meta-level policy should be like. I don't feel very strongly about how to manage this.
I also now realized that some of the language was stronger than I think I intended and I've edited the original comment, sorry about that.
New repo: https://github.com/google-deepmind/dangerous-capability-evaluations. (I haven't read it.) I support sharing evals like this to (1) enable external scrutiny and (2) let others adopt or improve on your evals. Yay DeepMind. Hopefully it's not too costly or downside-y to share more evals in the future.
At last, DeepMind talks about its dangerous capability evals. With details! Yay!
(These evaluations were done in January 2024; the Gemini 1.0 models were released in December 2023. DeepMind hasn't yet made RSP-like commitments — that is, specific commitments about risk assessment (for extreme risks), safety and security practices as a function of risk assessment results, and training and deployment decisions as a function of risk assessment results. Demis recently suggested that DeepMind might make RSP-like commitments this year.)
Random interesting note: DeepMind hired 8 superforecasters to make relevant predictions, most notably about when some eval-thresholds will trigger.
See summary on Twitter by last author Toby Shevlane.