The actual details of it contain some non-obvious aspects, along with elegant concepts that generalize things the safety community has been gesturing at. For instance, the safety community has been conflating, under "risk thresholds", two cleanly distinct notions from risk management: Key Risk Indicators (actual measurements of risk) and risk tolerance (your quantified preference for risk, independent of any test). That conflation has caused a lot of confusion and hidden unreasonable choices for quite a while.
People have also been conflating risk modeling and evals for quite a long time, because the AI field was built around evals. Once you have the clear view that evals are just an operationalization of risk models, it becomes clearer that you can actually do most of your risk modeling earlier in the lifecycle (i.e. before even touching a neural net, and before building a single eval), and that evals are downstream of this.
You can see more of this genre of concepts applied to frontier AI here: https://arxiv.org/pdf/2502.06656
Here's a graph with a few of the concepts in there
Yes. The fact that this post is precisely about trying to deconfuse a pre-existing misconception makes it even more important to be crystal clear. It's known to be hard to overwrite pre-existing misconceptions with the correct understanding, and I'm pretty sure this doesn't help.
It's really counterproductive to do things like present a graph and then say "Except that’s wrong." + "I didn’t technically lie to you, for what it’s worth. I said it’s what the canonical Dunning-Kruger graph looks like, and it is."
I just don't want to read further into a post that uses these sorts of tricks.
It might be a dumb question but aren't there major welfare concerns with assembling biorobots?
Thanks for asking! Somehow I had missed this story about the Wikipedia race; thanks for flagging it.
I suspect that if they try to pursue the type of goals that a bunch of humans in fact pursue, e.g. making as much money as possible, you may see less prosocial behavior. Raising money for charity is an unusually prosocial goal, and the fact that all the agents pursue the same goal is also an unusually prosocial setup.
Seems right that it's overall net positive. And it does seem like a no-brainer to fund. So thanks for writing that up.
I still hope that the AI Digest team who run it also put some less cute goals and frames around what they report from agents' behavior. I would like to see the agents' darker tendencies highlighted as well, e.g. cheating, instrumental convergence, etc., in a way that isn't perceived as "aw, that's cute". It could be a great testbed for explaining a bunch of concerning real-world trends.
Consider making public a bar with the (approximate) number of pre-orders, with 20,000 marked as the end goal. Having explicit goals that everyone can optimize for helps people get a sense of whether marginal effort is worth investing, and can motivate them to spread the word more, etc.
Agreed that those are complementary. I didn't mean to say that the factor I flagged is the only important one.
Yes, that's one value. RSPs & many of the policy debates around them would have been less messed up if there had been clarity (i.e. a confused notion was turned into the standard, which was then impossible to fix in policy discussions, making the Code of Practice flawed). I don't know of a specific example of preventing equivocation in other industries (it seems hard to know of such examples?), but the fact that basically all industries use the same set of concepts is evidence that they're pretty general-purpose and repurposable.
Another is just that it helps you think about the issues in a more generalized way.
For instance, once you see evaluations as a Key Risk Indicator (i.e. a proxy measure of risk), you notice that we could also use other Key Risk Indicators to trigger mitigations, such as actual monitoring metrics. This could enable building conditions/thresholds in RSPs that are based on monitoring metrics (e.g. "we find fewer than 5 bioterrorists successfully jailbreaking our model per year on our API"). The more generalized concepts enable more compositionality of ideas, in a way that skips a bunch of the trial-and-error process.
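To make that compositionality concrete, here's a minimal sketch in Python (all names, numbers, and mitigations are hypothetical placeholders, not anyone's actual RSP): both eval scores and deployment monitoring metrics are treated as Key Risk Indicators, each compared against a pre-committed risk tolerance that triggers a mitigation when exceeded.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class KeyRiskIndicator:
    """A proxy measure of risk, however it is obtained (eval score, monitoring metric, ...)."""
    name: str
    measure: Callable[[], float]  # how the indicator is measured
    tolerance: float              # pre-committed risk tolerance: max acceptable value
    mitigation: str               # what to do if the tolerance is exceeded

def check_kris(kris: list[KeyRiskIndicator]) -> list[str]:
    """Return the mitigations triggered by any KRI exceeding its risk tolerance."""
    triggered = []
    for kri in kris:
        value = kri.measure()
        if value > kri.tolerance:
            triggered.append(f"{kri.name}={value}: trigger '{kri.mitigation}'")
    return triggered

# Hypothetical KRIs: one operationalized as a pre-deployment eval,
# one as a post-deployment monitoring metric.
kris = [
    KeyRiskIndicator(
        name="bio_uplift_eval_score",
        measure=lambda: 0.42,  # placeholder: score from a pre-deployment eval
        tolerance=0.30,
        mitigation="pause deployment and apply stronger safeguards",
    ),
    KeyRiskIndicator(
        name="successful_jailbreaks_per_year",
        measure=lambda: 7,     # placeholder: count from API monitoring
        tolerance=5,
        mitigation="tighten API filtering and review access controls",
    ),
]

for action in check_kris(kris):
    print(action)
```

The point of the sketch is only that, once evals and monitoring metrics are seen as instances of the same concept, the same threshold-and-mitigation structure composes over both without any extra machinery.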