Then that seems bad, but it also seems that AI is not counterfactual -- so adding safeguards to models is probably not the way to get the risk down.
I had looked into this for a previous research project. For what it's worth, I don't think there are any perfect sources, but my own BOTECs led me to believe the number people are usually after is $10B-$100B ~$30B-$300B:
Do you feel that's still an issue when you compare to the human expert baseline?
The fact that both human experts and o1-preview scored twice as high on ProtocolQA as on BioLP-bench doesn't feel that inconsistent to me. It seems your BioLP-bench questions are just "twice as hard". I'd find it more inconsistent if o1-preview matched human expert performance on one test but not the other.
(There are other open questions about whether the human experts in both studies are comparable and how much time people had.)