@SaferAI
I only skimmed, but I wanted to flag that I like Bengio's proposal of a single coordinated coalition that develops several AGIs in a coordinated fashion (e.g. training runs at the same time on their own clusters), which reduces the main downside of having a single AGI project (power concentration).
I still agree with a lot of that post and am still essentially operating on it.
I also think that it's interesting to read the comments because at the time the promise of those who thought my post was wrong was that Anthropic's RSP would get better and that this was only the beginning. With RSP V2 being worse and less specific than RSP V1, it's clear that this was overoptimistic.
Now, risk management in AI has also gone a lot more mainstream than it was a year ago, in large part thanks to the UK AISI, which started working on it. People have also started using probabilities more, for instance in the safety cases paper, which this post advocated for.
With SaferAI, my organization, we're still working on moving the field closer to traditional risk management and ensuring that we don't reinvent the wheel when there's no need to. There should be releases going in that direction over the coming months.
Overall, if I look back on my recommendations, I think they're still quite strong. "Make the name less misleading" hasn't been executed on, but names other than RSPs have started being used, such as Frontier AI Safety Commitments, which is a strong improvement over my "Voluntary safety commitments" suggestion.
My recommendations about what RSPs are and aren't are also solid. My worry that the current commitments in RSPs would be pushed into policy was basically right: they've been used in many policy conversations as an anchor for what to do and what not to do.
Finally, the push for risk management in policy that I wanted to see happen has mostly happened. This is great news.
The main thing missing from this post is a prediction that RSPs would launch the debate about what should be done and at what levels. This is overall a good effect, which has happened, and which would probably have happened several months later if not for the publication of RSPs. The fact that it was done in a voluntary commitment context is unfortunate, because it levels down everything, but I still think this effect was significant.
I'd also be interested in exploring model-spec-style aspirational documents.
Happy to do a call on model-spec-style aspirational documents if that's at all relevant. I think this is important and we could be interested in helping develop a template for it if Anthropic were interested in using it.
Thanks for writing this post. I think the question of how to rule out risk after capability thresholds are crossed has generally been underdiscussed, despite it probably being the hardest risk management question with Transformers. In a recent paper, we coin the term "assurance properties" for the research directions that are helpful for this particular problem.
Using a similar type of thinking applied to other existing safety techniques, it seems to me like interpretability is one of the only current LLM safety directions that can get you a big Bayes factor.
The second one where I felt like it could plausibly bring a big Bayes factor, although it was harder to think about because it's still very early, was debate.
Otherwise, it seemed to me that stuff like RLHF / CAI / W2SG successes are unlikely to provide large Bayes factors.
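For reference, a minimal sketch of what a "big Bayes factor" means here, using the standard odds form of Bayes' rule. The labels are illustrative assumptions, not taken from the post: $H$ is "the model is safe past the capability threshold" and $E$ is "the safety technique reports no problem".

```latex
\[
\underbrace{\frac{P(H \mid E)}{P(\neg H \mid E)}}_{\text{posterior odds}}
=
\underbrace{\frac{P(E \mid H)}{P(E \mid \neg H)}}_{\text{Bayes factor}}
\times
\underbrace{\frac{P(H)}{P(\neg H)}}_{\text{prior odds}}
\]
```

Under this framing, a technique only yields a large Bayes factor if a clean result would be much less likely for an unsafe model than for a safe one, i.e. $P(E \mid \neg H) \ll P(E \mid H)$.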
This article fails to account for the fact that abiding by the rules suggested would mostly kill the ability of journalists to share the most valuable information they share with the public.
You don't get to reveal stuff from the world's most powerful organizations if you double-check the quotes with them.
I think journalism is one of the professions where consequentialist vs deontological ethics have the toughest trade-offs. It's just really hard to abide by very high privacy standards and break highly important news.
As one illustrative example, your standard would have prevented Kelsey Piper from sharing her conversation with SBF. Is that a desirable outcome? Not sure.
Personally, I use a mix of heuristics based on how important the new idea is, how quick it is to execute, and how painful it will be to execute in the future once the excitement dies down.
The more ADHD you are, the stronger the "burst of inspired-by-a-new-idea energy" effect is, so that should count too.
Do people have takes on the most useful metrics/KPIs that could give a sense of how good the monitoring/anti-misuse measures on APIs are?
Some ideas:
a) average time to close an account conducting misuse activities (my sense is that as long as this is >1 day, there's little chance of preventing state actors from using API-based models for a lot of misuse (everything that doesn't require major scale))
b) the logs of the 5 accounts/interactions that have been ranked as highest severity (my sense is that incident reporting like OpenAI/Microsoft have done on cyber is very helpful to get a better mental model of what's up/how badly things are going)
c) an estimate of the number of users having meaningful jailbroken interactions per month (in absolute terms, to give a sense of how much people are misusing the models through the API).
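To make (a) concrete, here is a minimal sketch of how it could be computed from a provider's enforcement records. Everything here is an assumption for illustration: the `MisuseAccount` record, its field names, and the helper functions are hypothetical and not any provider's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MisuseAccount:
    # Hypothetical record; field names are illustrative only.
    account_id: str
    first_misuse_at: datetime       # first interaction later judged to be misuse
    closed_at: Optional[datetime]   # None if the account is still open


def mean_time_to_close(accounts: list[MisuseAccount]) -> Optional[timedelta]:
    """Average delay between first misuse and closure, over closed accounts only."""
    delays = [a.closed_at - a.first_misuse_at
              for a in accounts if a.closed_at is not None]
    if not delays:
        return None
    return sum(delays, timedelta()) / len(delays)


def share_closed_within(accounts: list[MisuseAccount], limit: timedelta) -> float:
    """Fraction of all flagged accounts closed within `limit`.

    Accounts that are still open count as misses.
    """
    if not accounts:
        return 0.0
    hits = sum(1 for a in accounts
               if a.closed_at is not None
               and a.closed_at - a.first_misuse_at <= limit)
    return hits / len(accounts)
```

Reporting the share closed within one day alongside the mean seems useful, since a mean over closed accounts alone hides the accounts that were never closed.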
A lot of the open source worry has been implicitly assuming that it would be easier to misuse OS models than closed source ones, but it's unclear to what extent that's already the case, and I'm looking for metrics that give some insight into that. My sense is that the misuse that requires more scale will likely rely more on OS, but the misuse that is more in the infohazard realm (e.g. chembio) would be best done through APIs.
This looks to be overwhelmingly the most likely in my opinion, and I'm glad someone wrote this post. Thanks, Buck.
Thanks for answering, that's very useful.
My concern is that as far as I understand, a decent number of safety researchers are thinking that policy is the most important area, but because, as you mentioned, they aren't policy experts and don't really know what's going on, they just assume that Anthropic policy work is way better than those actually working in policy judge it to be. I've heard from a surprisingly high number of people among the orgs that are doing the best AI policy work that Anthropic policy is mostly anti-helpful.
Somehow though, internal employees keep deferring to their policy team and don't update on that part/take their beliefs seriously.
I'd generally bet Anthropic will push more for policies I personally support than any other lab, even if they may not push as much as I want them to.
If it's true, it is probably true to an epsilon degree, and it might be wrong because of weird preferences of a non-safety industry actor. AFAIK, Anthropic has been pushing against all the AI regulation proposals to date. I've yet to hear a positive example.
250 upvotes is also crazy high. Another sign of the disastrous abilities of EA/LessWrong communities at character judgment.
The same thing is happening right now before our eyes with Anthropic. And similar crowds are just as confidently asserting that this time they're really the good guys.