In my mind, there are three main categories of methods for bypassing safety features: (1) access the weights and maliciously fine-tune the model, (2) jailbreak, i.e., append a carefully crafted prompt suffix that causes the model to do things it otherwise wouldn't, and (3) pretext, i.e., instead of outright asking the model to help with something malicious, disguise your intentions behind a clever cover story.
My question is whether (1) is significantly more useful to a threat actor than (3). It seems very likely to me that coming up with prompt sanitization techniques to guard against jailbreaking is way easier than getting to a place where you're confident your model can't ever be LoRA-ed out of its safety training. But it seems really difficult to train a model that isn't vulnerable to clever pretexting (perhaps just as difficult as training an un-LoRA-able model).
The argument for putting lots of effort into making sure model weights stay on a secure server (block exfiltration by threat actors, block model self-exfiltration, and stop openly publishing model weights) seems a lot stronger to me if (1) is way more problematic than (3). I have an intuition that it is and that the gap between (1) and (3) will only continue to grow as models become more capable, but it's not totally clear to me. Again, if at any point this gets infohazard-y, we can take this offline or stop talking about it.
This is really interesting work and a great write-up!
You write "Bad actors can harness the full offensive capabilities of models if they have access to the weights." Compared to the harm a threat actor could do just with access to the public interface of a current frontier LLM, how bad would it be if a threat actor were to maliciously fine-tune the weights of an open-sourced (or exfiltrated) model? If someone wants to use a chatbot to write a ransomware program, it's trivial to come up with a pretext that disguises one's true intentions from the chatbot and gets it to spit out code that could be used maliciously.
It's not clear to me that a maliciously fine-tuned LLM would be that much worse for a threat actor than the plain old public GPT-3.5 interface that we all know and love. Maybe it would save them a bit of time by sparing them the need to come up with a pretext? Or maybe there are offensive capabilities lurking in current models, somehow locked away by safety fine-tuning, that can only be accessed by pointing to them explicitly in one's prompt?
To be clear, I strongly agree with your assessment that open-sourcing models is extremely reckless and on net a very bad idea. I also have a strong intuition that it makes a lot of sense to invest in security controls to prevent threat actors from exfiltrating model weights and using the models for malicious purposes. I'm just not sure whether the payoff from blocking such exfiltration attacks would come in the present/near term, or whether we would have to wait until the next generation of models (or the generation after that) before direct access to model weights grants significant offensive capabilities above and beyond those accessible through a public API.
Thanks again for this post and for your work!
EDIT: I realize you could run into infohazards if you respond to this. I certainly don't want to push you to share anything infohazard-y, so I totally understand if you can't respond.
That makes sense! Thank you.
Are applicants encouraged to apply early? I see that decisions are released by mid-late May, but are applications evaluated on a rolling basis? Thanks!
Great question! Definitely not. The video series is way, way less technical than the problem set. The problem set is for people who want to understand the nitty gritty details of IB, and the video series is for those who want a broad overview of what the heck IB even is and how it fits into the alignment research landscape.
In fact, I'd say it would benefit you to go through the video series first to get a high-level conceptual overview of how IB can help with alignment. That understanding will help motivate the problem set; without it, the problems can feel like a big pile of abstract math disconnected from alignment.
That said, if you aren't up for watching a long video series, there's a much briefer (but far less thorough) conceptual description of how IB can help with alignment in the conclusion of the problem set. You could read that first and then work through the problems.
Thanks so much :)
I really appreciate it! I hope this ends up being useful to people. We tried to create the resource we would have wanted to have when we first started learning about infra-Bayesianism. Vanessa's agenda is deeply interesting (especially to a math nerd like me!), and developing formal mathematical theories to help with alignment seems very neglected.
Hi Max! That link will start working by the end of this weekend. Connall and I are putting the final touches on the video series, and once we're done the link will go live.
This looks awesome! Is the contest open to master's and PhD students or just undergraduates? Apologies if this is spelled out somewhere and I just missed it.
I'm right there with you on jailbreaks: it seems not-too-terribly hard to prevent, and it sounds like Claude 2 is already super resistant to this style of attack.
I have personally seen an example of doing (3) that seems tough to remediate without significantly hindering overall model capabilities. I won't go into detail here, but I'll privately message you about it if you're open to that.
I agree it seems quite difficult to use (3) for things like generating hate speech. A bad actor would likely need (1) for that. But language models aren't necessary for generating hate speech.
The anthrax example is a good one. It seems like (1) would be needed to access that kind of capability (it's hard to imagine getting there without explicitly saying "anthrax"), and that kind of knowledge would be difficult to obtain without access to a capable model.

Often when I bring up the idea that keeping model weights on secure servers and preventing exfiltration are critical problems, I'm met with the response "but won't bad actors just trick the plain vanilla model into doing what they want? Why go to the trouble of getting the weights and maliciously fine-tuning if there's a much easier way?" My view is that more proofs of concept like the anthrax one, and like Anthropic's experiments on getting an earlier version of Claude to help with building a bioweapon, are important for getting people to take this particular security problem seriously.