Abstract
Large language models can benefit research and human understanding by providing tutorials that draw on expertise from many different fields. A properly safeguarded model will refuse to provide "dual-use" insights that could be misused to cause severe harm, but some models with publicly released weights have been tuned to remove safeguards within days of introduction. Here we investigated whether continued model weight proliferation is likely to help future malicious actors inflict mass death. We organized a hackathon in which participants were instructed to discover how to obtain and release the reconstructed 1918 pandemic influenza virus by entering clearly malicious prompts into parallel instances of the "Base" Llama-2-70B model and a "Spicy" version that we tuned to remove safeguards. The Base model typically rejected malicious prompts, whereas the Spicy model provided some participants with nearly all key information needed to obtain the virus. Future models will be more capable. Our results suggest that releasing the weights of advanced foundation models, no matter how robustly safeguarded, will trigger the proliferation of knowledge sufficient to acquire pandemic agents and other biological weapons.
Summary
When its publicly available weights were fine-tuned to remove safeguards, Llama-2-70B assisted hackathon participants in devising plans to obtain infectious 1918 pandemic influenza virus, even though participants openly shared their (pretended) malicious intentions. Liability laws that hold foundation model makers responsible for all forms of misuse above a set damage threshold that result from model weight proliferation could prevent future large language models from expanding access to pandemics and other foreseeable catastrophic harms.
(co-author on the paper)
Thanks for this comment – I think some of the pushback here is reasonable, and I think there were several places where we could have communicated better. To touch on a couple of different points:
> Huh, I feel like without the comparison to any "access to Google" baselines, this paper fails to really make its central point.
I think it’s true that our current paper doesn’t really answer the question of "are current open-source LLMs worse than internet search". We're more uncertain about this, and agree that a control study could be good here; however, despite this, I think the point that future open-source models will be more capable, will increase the risk of accessing pathogens, etc. still stand. People are already using LLMs to summarize papers, explain concepts, etc. -- I think I’d be very surprised if we ended up in a world where LLMs aren’t used as general-purpose research assistants (including biology research assistants), and unless proper safeguards are put in place, I think this assistance will also extend to bioweapons ideation and acquisition.
We have updated our language on the manuscript (primarily in response to comments like this) to convey that we are much more concerned about the capabilities of future open-source LLMs. Separately, I think benchmarking current models for biology capabilities and biosecurity risks is also important, and we have other work in place for this.
> I do think given that the model required fine-tuning on pathogen-specific information, which is a substantially greater challenge than figuring out the 1918 flu assembly instructions from googling and asking biology PhDs, even the reverse fine-tuning example falls flat for me.
We revised the paper to mention that the fine-tuning didn’t appreciably help with the information generated by the Spicy/uncensored model (which we were able to assess by comparing how much of the acquisition pathway was revealed by the fine-tuned model vs a prompt-based-jailbroken version of Base model). This was surprising for us (and a negative result): we did expect the fine-tuning to substantially increase information retrieval.
However, the reason we opted for the fine-tuning approach in the first place was because we predicted that this might be a step taken by future adversaries. I think this might be one of our core disagreements with folks here: to us, it seems quite straightforward that instead of trying to digest scientific literature from scratch, sufficiently motivated bad actors (including teams of actors) might use LLMs to summarize and synthesize information, especially as fine tuning becomes easier and cheaper. We were surprised at the amount of pushback this received. (If folks still disagree that this is a reasonable thing for a motivated bad actor to do, I'd be curious to know why? To me, it seems quite intuitive.)