As far as I can tell, their results find that bio experts don't perform notably better than students at creating bioweapons (given just access to the internet).
This seems somewhat surprising to me as I would have naively expected a large gap between experts and students.
This might indicate some issues with the tasks or the limited sample size. Or perhaps internet access and general research ability dominate over prior experience?
Minimally, it might be nice to check that bioweapons experts perform significantly better than students on this evaluation, to verify that there is some signal in the measurement. (I don't see these results anywhere, though I haven't read in detail.)
This is surprising – thanks for bringing this up!
The main threat model with LLMs is that they give amateurs expert-level biology knowledge. But this study indicates that expert-level knowledge isn't actually helpful, which implies we don't need to worry about LLMs giving people expert-level biology knowledge.
Some alternative interpretations:
Gary Marcus has criticised the results here:
What [C] is referring to is a technique called Bonferroni correction, which statisticians have long used to guard against “fishing expeditions” in which a scientist tries out a zillion different post hoc correlations, with no clear a priori hypothesis, and reports the one random thing that sorta vaguely looks like it might be happening and makes a big deal of it, ignoring a whole bunch of other similar hypotheses that failed. (XKCD has a great cartoon about that sort of situation.)
But that’s not what is going on here, and as one recent review put it, Bonferroni should not be applied “routinely”. It makes sense to use it when there are many uncorrelated tests and no clear prior hypothesis, as in the XKCD cartoon. But here there is an obvious a priori test: does using an LLM make people more accurate? That’s what the whole paper is about. You don’t need a Bonferroni correction for that, and shouldn’t be using it. Deliberately or not (my guess is not), OpenAI has misanalyzed their data in a way which underreports the potential risk. As a statistician friend put it, “if somebody was just doing stats robotically, they might do it this way, but it is the wrong test for what we actually care about”.
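To make the disagreement concrete, here is a minimal Python sketch of what a Bonferroni correction does. The numbers are purely illustrative (they are not the study's actual p-values): with k comparisons, each individual test is held to a threshold of alpha / k, so a result that clears the usual bar as a single pre-specified test can fail after correction.

```python
# Bonferroni correction: running k tests at overall significance
# level alpha means each individual test is held to alpha / k.
alpha = 0.05
k = 10           # illustrative: e.g. 5 outcome measures x 2 cohorts
p_value = 0.02   # hypothetical p-value for the one a priori question

# As a single pre-specified test, this clears the usual 0.05 bar...
print(p_value < alpha)       # True
# ...but fails once the threshold is split across 10 comparisons.
print(p_value < alpha / k)   # False
```

This is Marcus's point in miniature: whether the correction is appropriate depends entirely on whether you think there were ten exploratory questions or one pre-registered one.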
In fact, if you simply collapsed all the measurements of accuracy, and did the single most obvious test here, a simple t-test, the results would (as Footnote C implies) be significant. A more sophisticated test would be an ANCOVA, which, as another knowledgeable academic friend with statistical expertise put it, having read a draft of this essay, “would almost certainly support your point that an omnibus measure of AI boost (from a weighted sum of the five dependent variables) would show a massively significant main effect, given that 9 out of the 10 pairwise comparisons were in the same direction.”
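The "collapse and run one t-test" approach Marcus describes can be sketched with the standard library alone. The scores below are made up for illustration; the real input would be the study's collapsed accuracy measurements. This uses Welch's unequal-variance t statistic:

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se

# Hypothetical collapsed accuracy scores (NOT the study's data):
# a group without LLM access vs a group with it.
no_llm = [50, 55, 48, 52, 47]
with_llm = [60, 63, 58, 65, 59]

t = welch_t(no_llm, with_llm)
print(round(t, 2))  # ≈ 5.47, far beyond conventional 5% thresholds
```

The design choice at issue is exactly this: one omnibus comparison on a pooled outcome, rather than many separate per-measure tests each paying a Bonferroni penalty.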
Also, there was likely an effect, but sample sizes were too small to detect this:
There were 50 experts; 25 with LLM access, 25 without. From the reprinted table we can see that 1 in 25 (4%) experts without LLMs succeeded in the formulation task, whereas 4 in 25 with LLM access succeeded (16%).
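Using those reported counts, a Fisher exact test (my calculation in pure Python, not one from the post) shows why 25 participants per arm is too small to detect even a fourfold difference in success rates:

```python
from math import comb

def fisher_one_sided(k_llm, n_llm, k_ctrl, n_ctrl):
    """One-sided Fisher exact p-value: probability of at least k_llm
    successes in the LLM arm, with the table margins held fixed."""
    K = k_llm + k_ctrl   # total successes across both arms
    N = n_llm + n_ctrl   # total participants
    total = comb(N, n_llm)
    return sum(comb(K, k) * comb(N - K, n_llm - k)
               for k in range(k_llm, min(K, n_llm) + 1)) / total

# Formulation task: 4/25 experts with LLM access vs 1/25 without.
p = fisher_one_sided(4, 25, 1, 25)
print(round(p, 3))  # ≈ 0.174 — not close to significance despite 16% vs 4%
```

So even taking the table at face value, a 4x difference in this single task is entirely compatible with chance at these sample sizes, which is the underpowering concern.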
Following last week's RAND report, which found that LLMs had no statistically significant effect on people's ability to make bioweapons, OpenAI has published a study with a similar design and results.
Summary
Notes
Edit: As @Chris_Leong points out, Gary Marcus claims that the statistical methods used by OpenAI were wrong, and underestimate the effect of using an LLM. Methods like the t-test (and likely also ANCOVA) would show statistically significant results.