Public Weights?

jefftk

LESSWRONG
LW

Public Weights? — LessWrong

49 Public Weights?

by jefftk

2nd Nov 2023

jefftk

3 min read

49

While this is close to areas I work in, it's a personal post. No one reviewed this before I published it, or asked me to (or not to) write something. All mistakes are my own.

A few days ago, some of my coworkers at SecureBio put out a preprint, "Will releasing the weights of future large language models grant widespread access to pandemic agents?" (Gopal et. al 2023) They took Facebook/Meta's Llama-2-70B large language model (LLM) and (cheaply!) adjusted it to remove the built in safeguards, after which it was willing to answer questions on how to get infectious 1918 flu. I like a bunch of things about the paper, but I also think it suffers from being undecided on whether it's communicating:

Making LLMs public is dangerous because by publishing the weights you allow others to easily remove safeguards.
Once you remove the safeguards, current LLMs are already helpful in getting at the key information necessary to cause a pandemic.

I think it demonstrates the first point pretty well. The main way we avoid LLMs from telling people how to cause harm is to train them on a lot of examples of someone asking how to cause harm and being told "no", and this can easily be reversed by additional training with "yes" examples. So even if you get incredibly good at this, if you make your LLM public you make it very easy for others to turn it into something that compliantly shares any knowledge it contains.

Now, you might think that there isn't actually any dangerous knowledge, at least not within what an LLM could have learned from publicly available sources. I think this is pretty clearly not true: the process of creating infectious 1918 flu is scattered across the internet and hard for most people to assemble. If you had an experienced virologist on call and happy to answer any question, however, they could walk you there through a mixture of doing things yourself and duping others into doing things. And if they were able to read and synthesize all virology literature they could tell you how to create things quite a bit worse than this former pandemic.

GPT-4 is already significantly better than Llama-2, and GPT-5 in 2024 is more likely than not. Public models will likely continue to move forward, and while it's unlikely that we get a GPT-4 level Llama-3 in 2024 I do think the default path involves very good public models within a few years. At which point anyone with a good GPU can have their own personal amoral virologist advisor. Which seems like a problem!

But the paper also seems to be trying to get into the question of whether current models are capable of teaching people how to make 1918 flu today. If they just wanted to assess whether the models were willing and able to answer questions on how to create bioweapons they could have just asked it. Instead, they ran a hackathon to see whether people could, in one hour, get the no-safeguards model to fully walk them through the process of creating infectious flu. I think the question of whether LLMs have already lowered the bar for causing massive harm through biology is a really important one, and I'd love to see a follow-up that addressed that with a no-LLM control group. That still wouldn't be perfect, since outside the constraints of a hackathon you could take a biology class, read textbooks, or pay experienced people to answer your questions, but it would tell us a lot. My guess is that the synthesis functionality of current LLMs is actually adding something here and a no-LLM group would do quite a bit worse, but the market only has that at 17%:

Even if no-safeguards public LLMs don't lower the bar today, and given how frustrating Llama-2 can be this wouldn't be too surprising, it seems pretty likely we get to where they do significantly lower the bar within the next few years. Lower it enough, and some troll or committed zealot will go for it. Which, aside from the existential worries, just makes me pretty sad. LLMs with open weights are just getting started in democratizing access to this incredibly transformative technology, and a world in which we all only have access to LLMs through a small number of highly regulated and very conservative organizations feels like a massive loss of potential. But unless we figure out how to create LLMs where the safeguards can't just be trivially removed, I don't see how to avoid this non-free outcome while also avoiding widespread destruction.

(Back in 2017 I asked for examples of risk from AI, and didn't like any of them all that much. Today, "someone asks an LLM how to kill everyone and it walks them through creating a pandemic" seems pretty plausible.)