Enhancing biosecurity with language models: defining research directions


This report explores the potential of large language models (LLMs) to enhance biosecurity. We conducted interviews with nine biosecurity experts to understand their daily tasks, and how LLMs could be more useful for their work. Our findings indicate that approximately 50% of our interviewees’ biosecurity-related tasks, such as gathering information from papers and reports, reviewing safety forms, and writing memos and summaries, have high potential for automation with LLMs. Skills critical for biosecurity work, like processing information and communicating effectively, could also be augmented by LLMs. However, current LLMs have limitations, such as often providing shallow or incorrect information. We provide suggestions for LLM-based tools that could significantly advance biosecurity efforts and list field-specific datasets to facilitate their development.

Authors: Michael Chen, Martin Holub, Cameron Tice

Illustrative image generated by DALL-E.

Introduction

The COVID-19 pandemic brought with it a stark reminder that not only individuals, but also societies at large, are subject to threats from natural pathogens. While it caught the majority of us by surprise, the occurrence of a global pandemic was, and continues to be, something we must learn to expect. In the last 500 years, there have been at least 15 epidemics with a death toll in excess of 1 million (that is, roughly one every 33 years on average), and historically, pandemics were able to nearly wipe out whole civilizations. Even if the risk of a global pandemic occurring in any given year is small, it is not zero or negligible, resulting in a substantial threat to the future of humanity in the long run. In fact, there are other contributors to the gravity of this risk. While in the past we may have had to deal with natural pathogens only, the technological developments in biological engineering we have seen over the past decades create a tangible risk of malicious actors designing and building synthetic pathogens. AI has been seen as an enabling technology across a range of domains, and it is a matter of concern to what extent it can contribute to biological risk.

Recent empirical evaluations of large language models (LLMs) have found that AI assistance can improve human attempts to plan biological attacks, at least to some extent (OpenAI 2024, Soice et al. 2023), although other reports have claimed no effect (Mouton et al. 2024, NSCEB 2024). Even though responsible AI labs may release biologically capable AI models only when they have been trained to be safe (for example, generally refusing to assist with bioterrorism), skilled actors could overcome these guardrails through jailbreaks or other means (Mazeika et al. 2024, Qi et al. 2023). Moreover, capable AI models released with public weights can have these safeguards entirely removed, eliminating the ability to prevent malicious use (Gopal et al. 2023). Several groups are actively researching these risks of AI-assisted bioterrorism including the use of LLMs, but model evaluations have limitations as a means for ensuring safety (UK DSIT 2023). In contrast, some perspectives suggest that carefully guiding technological advancements could lead to a more secure world (Sandbrink et al. 2022, Hendrycks et al. 2022, Buterin 2023). In this vein, the 2023 Executive Order on Artificial Intelligence calls for an assessment of “the ways in which AI applied to biology can be used to reduce biosecurity risks, including recommendations on opportunities to coordinate data and high-performance computing resources.” Despite this potential, there is limited research on how thoughtful development and scaffolding of AI models may be able to differentially advance biosecurity, without contributing to their pre-existing biological risk potential.

This report centers on LLMs as their usefulness in various applications, including language processing, data analysis, and decision-making, has increased markedly over the last year, creating a gap in understanding how this new technology could be leveraged to enhance biosecurity. We note that various special-purpose machine learning algorithms have been developed to make biosecurity-relevant predictions (e.g., Alley et al. 2020, Syrowatka et al. 2021, Sykes et al. 2022); continued developments in classic machine learning for biosecurity will be valuable, although it is not the focus of our report. The use of LLMs and LLM agents in biotechnology has been perceived as offense-dominant, in which defensive improvements in biosecurity may not keep pace (Gopal et al. 2023). Given the disproportionate number of individuals working on biosecurity relative to bioterrorism, we hypothesized that assisting scientists with relevant LLM-based assistants could accelerate progress in developing safety measures that prevent and tackle a range of biological threats, including pandemics, without assisting biorisk (Shavit et al. 2023).

While a substantial portion of biological research that can be accelerated by LLMs has dual-use potential, for many roles in the biosecurity field, there need not be overlap with the type of work being done by potential bad actors. Many helpful interventions seem to be purely beneficial such as research on public health, far-UVC, and metagenomic sequencing, among others. However, the application of AI in these domains is under-explored and we are still far behind in being able to prevent or even respond to an engineered pandemic. Without strategic application and control, more advanced AIs make this risk increasingly likely. Therefore, there is a significant motivation for developers of AI to aid biosecurity researchers as soon as possible.

This report outlines our findings and progress from five weeks of research and interviews. Here, we attempt to define the specific roles LLMs could play in supporting biosecurity researchers to tip the scales toward defense, laying the conceptual groundwork for future research and development in this area. Our interviews with biosecurity experts (Table 1 and Materials and Methods) highlighted current limitations of LLMs in biosecurity work, such as hallucinations and a lack of domain-specific functionality.

Table 1: Overview of the dataset (n = 9)
Number of interviewees per industry type
Nonprofits and Charitable Organizations	5
Academic and Educational Institutions	1
Corporate and Think Tank Entities	2
Governmental and Regulatory Bodies	1
Number of interviewees in a leadership position	3

Materials and Methods

The initial phase of the project involved conducting interviews with nine external biosecurity experts in various industries (Table 1). These interviews were designed to gather insights into the daily tasks of professionals in the field and to understand how LLMs could be developed to become more useful in their work. The interviewees were asked questions about their biosecurity-relevant tasks and the expertise required to execute them, levels of LLM usage, and their outlook on the impact of LLMs on biosecurity (Table 2).

Table 2: Interview questions

What is your role at your current institution?
What are biosecurity-relevant tasks that you regularly carry out in your work?
What expertise is needed to execute these tasks?
Do you use AI LLM services (e.g. ChatGPT) in your work?
- If yes: Have you experienced any shortcomings?
- If no: Why not? Are there any perceived inadequacies or shortcomings?
What do you think that an AI trained to be the most helpful to you and other biosecurity researchers should do very well? Why would this be helpful?
- What do you think would be especially hard for it to do safely?
What is your outlook on how LLMs will affect biosecurity? Positive or negative?

Findings

Biosecurity-related tasks have high potential for automation

We surveyed a range of biosecurity professionals to find out the tasks they routinely carry out (Table 3). Broadly, the tasks can be categorized into “gathering and analyzing information”, “synthesizing information”, “communication”, and “operations”. We estimate that 50% of these tasks have a high potential to be impacted by LLMs (denoted as * in Table 3). For a more rigorous analysis, future research can comprehensively list biosecurity tasks and their exposure to LLM automation (O*NET, Eloundou et al. 2023).

Table 3: Biosecurity-related tasks

A) Gathering and analyzing information:

Reading papers and reports*
Safety form review*
Monitoring news and current events*
Interviewing stakeholders (e.g. policymakers)

B) Synthesizing information (in audience-specific style):

Writing summaries and reports (internal and external)*
Writing op-eds, memos, blog-posts*
Setting scientific priorities

C) Communication:

Messaging and social media*
E-mails*
Networking and forming alliances
Leadership (communicating goals and purpose)

D) Operations:

Calling people
Website content update
Meeting preparation (agenda, structure, logistics)*
Meetings (one on one, teams)

* Tasks with high potential for LLM impact
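
As a rough check on the 50% estimate, the starred entries in Table 3 can be tallied directly. The sketch below assumes that every listed task counts equally, which the report does not state, so it is only one plausible way to arrive at the figure. The variable names are illustrative.

```python
# Tasks from Table 3; True marks a task starred as having high potential for LLM impact.
# Assumption: every task carries equal weight in the estimate.
tasks = {
    "Gathering and analyzing information": {
        "Reading papers and reports": True,
        "Safety form review": True,
        "Monitoring news and current events": True,
        "Interviewing stakeholders": False,
    },
    "Synthesizing information": {
        "Writing summaries and reports": True,
        "Writing op-eds, memos, blog-posts": True,
        "Setting scientific priorities": False,
    },
    "Communication": {
        "Messaging and social media": True,
        "E-mails": True,
        "Networking and forming alliances": False,
        "Leadership": False,
    },
    "Operations": {
        "Calling people": False,
        "Website content update": False,
        "Meeting preparation": True,
        "Meetings (one on one, teams)": False,
    },
}

total = sum(len(group) for group in tasks.values())
starred = sum(flag for group in tasks.values() for flag in group.values())
share = starred / total
print(f"{starred} of {total} tasks starred ({share:.0%})")
```

Under the equal-weight assumption, 8 of the 15 listed tasks (about 53%) are starred, which is consistent with the approximate 50% figure.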

Interpersonal skills and informal rules are critical for policy-making

In order to find out what skills are specifically important for biosecurity-related work, we asked our interviewees what skills they improved the most since entering the field. Remarkably, a large portion of the critical skills are interpersonal and rely on an understanding of subtle cues and non-codified knowledge (Table 4). However, a number of skills have a high potential to be augmented by increased adoption of LLMs (denoted with * in Table 4).

Table 4: Skills important for biosecurity
Processing information (high volume and density)*
Assimilating new information (within and outside of one’s field of expertise)*
Communication: Writing (range of forms and audiences)*
Communication: Interviewing (listening, asking follow-up questions, …)
Navigating environments with different perspectives and political views
Networking and forming alliances
Leadership: Setting scientific and organizational priorities
Leadership: Conveying explainable mission and impact
Procedural knowledge of policy-making (codified)*
Procedural knowledge of policy-making (informal)
* Skills with high potential benefit from LLM augmentation.

Current levels of use of LLMs and barriers to their adoption

Clearly, LLMs have the potential to benefit a number of tasks that are carried out in biosecurity-related professions and augment some biosecurity-relevant skills. However, LLMs are recent technological developments (OpenAI’s ChatGPT and Google’s Bard/Gemini were released 14 and 9 months ago, respectively, as of the time of writing) and far from having reached maturity. We thus wondered about their current use among biosecurity professionals. The majority (7/9) of interviewees do use LLMs in their work, and a third do so frequently (Fig. 1).

Fig. 1: Current levels of LLM adoption among surveyed biosecurity professionals. Prompt: Do you use LLM services in your work?

Interviewees reported a range of shortcomings when using the current versions of LLMs in their work (Table 5). Most notably, LLMs are seen as providing low-insight information that contains inaccuracies and is poorly referenced. LLMs are unable to correctly intuit the relative importance of the various parameters and stakeholders’ views that shape the information in their inputs (whether user-provided or in the data they have been trained on). This is a particularly important consideration in the area of policymaking, where unwritten rules, informal relationships, and diverging interests are important and common. Finally, and somewhat surprisingly, current versions of LLMs also struggle to fully obey users’ instructions, although this is likely to improve as the technology develops.

Table 5: Perceived shortcomings of LLMs
Poor trustworthiness (hallucinations, excessive creativity)
Missing and incorrect referencing
Insufficient information depth and lack of insight
Inability to correctly intuit relative weights of input and external parameters
Inability to fully follow instructions (disregarding some of the input, not respecting boundaries)

Development of new LLM-based tools can advance biosecurity

The adoption of LLMs is poised to transform a number of work-related tasks and increase worker productivity across a wide range of occupations. Knowledge workers in particular are more likely to see a larger portion of their work-related tasks exposed to the effects of LLMs (Eloundou et al. 2023). As an example, users adopting GitHub Copilot for programming report less time spent on coding tasks, increased productivity, improved code understanding, and a greater sense of satisfaction (Dohmke et al. 2023). Among our interviewees, the majority report using LLM tools in their work (Fig. 1), despite their perceived shortcomings (Table 5).

Broadly speaking, adapting LLMs to specific domains can improve their usefulness. When LLMs can read relevant information before responding to user queries (retrieval-augmented generation), they can provide more factual answers with fewer hallucinations (Lála et al. 2023). Pretraining models on code is well known to be necessary for coding performance (Rozière et al. 2024), and likewise for math (Shao et al. 2024). Fine-tuning language models to follow user instructions makes them substantially easier to use (Ouyang et al. 2022), but available chatbots are fine-tuned with human preference data that is not adapted for biosecurity.
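
To make the retrieval-augmented generation idea concrete, here is a minimal sketch using only the Python standard library. It is a toy under stated assumptions: the three-sentence corpus is an illustrative placeholder, word overlap stands in for a real embedding-based retriever, and the function names (`tokenize`, `retrieve`, `build_prompt`) are hypothetical. The call to an actual LLM is deliberately left out.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; a real tool would index curated biosecurity sources.
DOCUMENTS = {
    "far-uvc": "Far-UVC light may inactivate airborne pathogens in occupied indoor spaces.",
    "metagenomics": "Metagenomic sequencing of wastewater can support early detection of novel pathogens.",
    "durc": "Dual use research of concern policies ask institutions to review certain experiments.",
}


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the ids of the k documents sharing the most words with the query."""
    q = tokenize(query)
    scores = {
        doc_id: sum((q & tokenize(text)).values())
        for doc_id, text in DOCUMENTS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def build_prompt(query: str, k: int = 1) -> str:
    """Prepend retrieved passages so the model can ground and cite its answer."""
    passages = "\n".join(
        f"[{doc_id}] {DOCUMENTS[doc_id]}" for doc_id in retrieve(query, k)
    )
    return (
        "Answer using only the sources below and cite them by id.\n"
        f"Sources:\n{passages}\n\nQuestion: {query}"
    )


question = "How can wastewater sequencing help detection of pathogens?"
print(build_prompt(question))
```

The assembled prompt would then be sent to a language model of choice. Asking the model to cite source ids is one simple way to address the missing-referencing complaint raised by interviewees, although it does not by itself guarantee that the citations are accurate.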

It was our goal to better understand the needs of biosecurity professionals and arrive at concrete suggestions for future development of LLM-enabled tooling that would be the most useful in their work. To do so, we directly prompted the interviewees for their needs and wishes, and also conducted independent research and ideation. This led us to identify a range of directions for future development, including a Dual Use Research of Concern (DURC) potential evaluation tool, a virtual lab safety officer, and chatbots to aid in generating and proposing implementable policies (the top five are summarized in Table 6). These LLM applications could conceivably be prototyped through a combination of high-quality prompting, retrieval-augmented generation, and excellent user interface design.

Table 6: Examples of LLM-based tools that could be helpful to biosecurity

General-purpose research assistant for biosecurity
- Integrated with a search tool to retrieve relevant biosecurity articles and papers, perhaps similar to Consensus but with better curation of sources
- Suggest diverse stakeholders that may be affected and relative importance/impact
- Suggest and help defuse counter-arguments
Chatbot for assisting biosecurity policy
- Attempt to generate implementable policy ideas
- List stakeholders, open calls, related funding, and so on
- Given this and user-provided input data, generate a draft
- Train on previous successful and failed policy proposals
Dual Use Research of Concern (DURC) potential evaluation tool
- Red-team research proposals and, if applicable, suggest ways that they could be made safer from a biosecurity perspective
- Grant applications could be a natural place to integrate this tool, as many biology researchers may not seriously think through dual-use concerns, though it is important that the tool does not inspire bioterrorism
Biosecurity text style adjustment tool
- Reflect the role, status, and political views of the target audience
- Adjust style to appeal to a target audience or medium (such as report, memo, blog post, e-mail)
Virtual lab safety officer
- Review experimental plans and lab books based on safety protocols
- Review video footage to highlight potential unsafe practices (assuming vision-language model)

Biosecurity-specific datasets to aid the development of new LLM tools

As our research highlights, biosecurity is a niche with specific user profiles and requirements that determine how the users interact with and benefit from LLMs. A large part of the perceived LLM shortcomings (Table 5) reflects the tools’ lack of appreciation for the specifics and subtleties of research and policy-making. While we have suggested a few examples of LLM-based tools to help with biosecurity (Table 6), we also realize that any LLM is only as good as the data it has been trained on. In view of this, we created a list of 75+ publicly available datasets, hosted in a GitHub repository that also includes a community wishlist for datasets and resources that do not yet exist.

GitHub: awesome-biosecurity-datasets

Conclusion

As a dual-use technology, advanced LLMs have the potential to exacerbate biological risk by aiding bioterrorists, especially given concerns around the rapid advancement in general LLM capabilities. To help mitigate this risk, it is valuable to explore the development of biosecurity-focused LLM assistants that can accelerate biosecurity research. In our exploratory interviews with biosecurity professionals, we have found that the majority use LLMs like ChatGPT in their work, at least on occasion, but only to a limited extent. The LLM tools currently in use have a number of shortcomings, such as poor trustworthiness and lack of insight, which limit their impact on the interviewees’ workflows. We suggest various LLM-based assistants that could be developed for biosecurity tasks. These could be created as AI agents with access to custom biosecurity tools, or as specially fine-tuned models, for example. We also contribute a list of datasets that could be integrated with AI for biosecurity purposes.

Our report is an initial foray into investigating how LLM-based technologies could be leveraged to improve biosecurity. It is worth noting that improvements in general AI capabilities can make AI systems more useful for biosecurity but can also increase their potential to assist with bioterrorism. We therefore emphasize efforts that differentially improve biosecurity, especially ones that are neglected by standard market forces. We propose several topics for future research:

Comprehensively outlining biosecurity tasks: Biosecurity is a broad and interdisciplinary field, and our selection of experts interviewed does not cover all promising areas of biosecurity. Similar to the O*NET database, it would be valuable to create a thorough inventory of tasks involved in biosecurity and evaluate their susceptibility to automation by LLMs or LLM agents. After prioritizing these tasks by what is most impactful for reducing catastrophic biological risk, this could serve as a basis for developing AI biosecurity assistants.
Prototyping and evaluating AI biosecurity assistants: One direction for future research is to develop AI assistants that have improved performance for biosecurity – for example, through tools that allow them to retrieve relevant information from trusted sources, or fine-tuning data that trains them to produce higher-quality responses. Similar to how OpenAI recruits expert AI trainers to improve an LLM’s software engineering capabilities, AI labs could have dedicated efforts to curate data for biosecurity performance. Iterative development in response to user feedback and ongoing model evaluations are important to ensure that the AI assistants are actually useful for biosecurity.
Forecasting how AI advancements aid bioterrorism vs biosecurity: As our interviewed experts leaned towards believing that AI advancements would generally increase biorisk rather than decrease it, it is valuable to have advance understanding of the most likely ways they would do so, and which developments in biosecurity could alleviate this.
Clarifying safe and unsafe biological capabilities for AI: Some types of expertise (e.g., in virology) may be applicable to biosecurity but could also aid bioterrorism. While recommending the development of AI biosecurity assistants, we do not recommend generally accelerating AI capabilities, especially with respect to expertise that presents dual-use national security risks. Relatedly, we recommend the creation of safety standards for AI models that could increase biological risk (NSCEB 2024).
Field building for innovation in AI for biosecurity: In the realm of cybersecurity, the White House has partnered with major AI companies to launch a two-year competition for AI to improve the software security of critical infrastructure (White House 2023). We believe that biosecurity could benefit from analogous efforts to discover ways that AI could help mitigate biological threats (see also, NSCEB 2024).

As foundation models have a growing impact on society, it is essential to steer their development to ensure the potential to increase biological risk is minimized. To complement ongoing evaluations of potential LLM-assisted biorisk and research into safety mitigations, we recommend developing LLM assistants specialized for biosecurity tasks. If this topic is adequately studied, a society with advanced AI capabilities could be prepared for biological threats through two means: by preventing the release of AIs with unsafe biological capabilities, as well as through the deployment of AIs continually working to improve biosecurity.

DURC Statement

We have asked several security professionals to review this paper in order to flag any content that may have DURC potential. We believe that this final version advances biosecurity without contributing to biorisk.

Acknowledgments

The project is grounded in the broader context of global catastrophic biological risks (GCBRs) and is incubated through BlueDot Impact’s Pandemics Course, which aggregates a diverse pool of biosecurity professionals. This network has been instrumental in providing a comprehensive understanding of the field’s challenges and opportunities. We thank all our anonymous interviewees for sharing their experiences and opinions. We are grateful to Lukas Berglund, Aidan O’Gara, Chris Bakerlee, and the team at BlueDot Impact for reviewing this manuscript.

References

Alley EC, Turpin M, Liu AB, Kulp-McDowall T, Swett J, Edison R, Von Stetina SE, Church GM, Esvelt KM. 2020. A machine learning toolkit for genetic engineering attribution to facilitate biosecurity. Nat Commun. 11(1):6293. doi:10.1038/s41467-020-19612-0. https://www.nature.com/articles/s41467-020-19612-0.

Dohmke T, Iansiti M, Richards G. 2023. Sea Change in Software Development: Economic and Productivity Analysis of the AI-Powered Developer Lifecycle. doi:10.48550/arXiv.2306.15033. http://arxiv.org/abs/2306.15033.

Eloundou T, Manning S, Mishkin P, Rock D. 2023. GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models. doi:10.48550/arXiv.2303.10130. http://arxiv.org/abs/2303.10130.

Gopal A, Helm-Burger N, Justen L, Soice EH, Tzeng T, Jeyapragasan G, Grimm S, Mueller B, Esvelt KM. 2023. Will releasing the weights of future large language models grant widespread access to pandemic agents? doi:10.48550/arXiv.2310.18233. http://arxiv.org/abs/2310.18233.

Hendrycks D, Carlini N, Schulman J, Steinhardt J. 2022. Unsolved Problems in ML Safety. doi:10.48550/arXiv.2109.13916. http://arxiv.org/abs/2109.13916.

Lála J, O’Donoghue O, Shtedritski A, Cox S, Rodriques SG, White AD. 2023. PaperQA: Retrieval-Augmented Generative Agent for Scientific Research. doi:10.48550/arXiv.2312.07559. http://arxiv.org/abs/2312.07559.

Mazeika M, Phan L, Yin X, Zou A, Wang Z, Mu N, Sakhaee E, Li N, Basart S, Li B, et al. 2024 Feb 6. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv.org. https://arxiv.org/abs/2402.04249v2.

Mouton CA, Lucas C, Guest E. 2024. The Operational Risks of AI in Large-Scale Biological Attacks: Results of a Red-Team Study. RAND Corporation. https://www.rand.org/pubs/research_reports/RRA2977-2.html.

National Security Commission on Emerging Biotechnology (NSCEB). 2024. AIxBio White Paper 4: Policy Options for AIxBio. https://www.biotech.senate.gov/press-releases/aixbio-white-paper-4-policy-options-for-aixbio/.

O*NET OnLine. https://www.onetonline.org/.

OpenAI. 2024 Jan 31. Building an early warning system for LLM-aided biological threat creation. https://openai.com/research/building-an-early-warning-system-for-llm-aided-biological-threat-creation.

Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, et al. 2022. Training language models to follow instructions with human feedback. doi:10.48550/arXiv.2203.02155. http://arxiv.org/abs/2203.02155.

Qi X, Zeng Y, Xie T, Chen P-Y, Jia R, Mittal P, Henderson P. 2023. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! doi:10.48550/arXiv.2310.03693. http://arxiv.org/abs/2310.03693.

Rozière B, Gehring J, Gloeckle F, Sootla S, Gat I, Tan XE, Adi Y, Liu J, Sauvestre R, Remez T, et al. 2024. Code Llama: Open Foundation Models for Code. doi:10.48550/arXiv.2308.12950. http://arxiv.org/abs/2308.12950.

Sandbrink J, Hobbs H, Swett J, Dafoe A, Sandberg A. 2022. Differential technology development: An innovation governance consideration for navigating technology risks. doi:10.2139/ssrn.4213670. https://papers.ssrn.com/abstract=4213670.

Shao Z, Wang P, Zhu Q, Xu R, Song J, Zhang M, Li YK, Wu Y, Guo D. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. doi:10.48550/arXiv.2402.03300. http://arxiv.org/abs/2402.03300.

Shavit Y, Agarwal S, Brundage M. 2023. Practices for Governing Agentic AI Systems. OpenAI. https://openai.com/research/practices-for-governing-agentic-ai-systems.

Soice EH, Rocha R, Cordova K, Specter M, Esvelt KM. 2023. Can large language models democratize access to dual-use biotechnology? doi:10.48550/arXiv.2306.03809. http://arxiv.org/abs/2306.03809.

Sykes AL, Silva GS, Holtkamp DJ, Mauch BW, Osemeke O, Linhares DCL, Machado G. 2022. Interpretable machine learning applied to on-farm biosecurity and porcine reproductive and respiratory syndrome virus. Transbound Emerg Dis. 69(4):e916–e930. doi:10.1111/tbed.14369.

Syrowatka A, Kuznetsova M, Alsubai A, Beckman AL, Bain PA, Craig KJT, Hu J, Jackson GP, Rhee K, Bates DW. 2021. Leveraging artificial intelligence for pandemic preparedness and response: a scoping review to identify key use cases. npj Digit Med. 4(1):1–14. doi:10.1038/s41746-021-00459-8. https://www.nature.com/articles/s41746-021-00459-8.

UK Department for Science, Innovation & Technology (UK DSIT). 2023. Frontier AI: capabilities and risks – discussion paper. GOV.UK. https://www.gov.uk/government/publications/frontier-ai-capabilities-and-risks-discussion-paper.

Buterin V. 2023 Nov 27. d/acc: Defensive (or decentralization, or differential) acceleration. https://vitalik.eth.limo/general/2023/11/27/techno_optimism.html#dacc.

The White House. 2023 Aug 9. Biden-Harris Administration Launches Artificial Intelligence Cyber Challenge to Protect America’s Critical Software. The White House. https://www.whitehouse.gov/briefing-room/statements-releases/2023/08/09/biden-harris-administration-launches-artificial-intelligence-cyber-challenge-to-protect-americas-critical-software/.