Jack Parker — LessWrong


Replying to LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

I'm right there with you on jailbreaks: they seem not too terribly hard to prevent, and it sounds like Claude 2 is already super resistant to this style of attack.

I have personally seen an example of doing (3) that seems tough to remediate without significantly hindering overall model capabilities. I won't go into detail here, but I'll privately message you about it if you're open to that.

I agree it seems quite difficult to use (3) for things like generating hate speech. A bad actor would likely need (1) for that. But language models aren't necessary for generating hate speech.

The anthrax example is a good one. It seems like (1) would be needed to access that kind...

Replying to LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

Jack Parker, 2y

In my mind, there are three main categories of methods for bypassing safety features: (1) access the weights and maliciously fine-tune, (2) jailbreak, i.e. append a carefully crafted prompt suffix that causes the model to do things it otherwise wouldn't, and (3) instead of outright asking the model to help you with something malicious, disguise your intentions with some clever pretext.

My question is whether (1) is significantly more useful to a threat actor than (3). It seems very likely to me that coming up with prompt sanitization techniques to guard against jailbreaking is way easier than getting to a place where you're confident your model can't ever be LoRA-ed out of its safety...

Replying to LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

Jack Parker, 2y*

This is really interesting work and a great write-up!

You write "Bad actors can harness the full offensive capabilities of models if they have access to the weights." Compared to the harm a threat actor could do just with access to the public interface of a current frontier LLM, how bad would it be if a threat actor were to maliciously fine-tune the weights of an open-sourced (or exfiltrated) model? If someone wants to use a chatbot to write a ransomware program, it's trivial to come up with a pretext that disguises one's true intentions from the chatbot and gets it to spit out code that could be used maliciously.

It's not clear to...


Replying to SERI MATS - Summer 2023 Cohort

Jack Parker, 3y

That makes sense! Thank you.

Replying to SERI MATS - Summer 2023 Cohort

Jack Parker, 3y

Are applicants encouraged to apply early? I see that decisions are released by mid-to-late May, but are applications evaluated on a rolling basis? Thanks!

Replying to Understanding Infra-Bayesianism: A Beginner-Friendly Video Series

Jack Parker, 3y

Great question! Definitely not. The video series is way, way less technical than the problem set. The problem set is for people who want to understand the nitty-gritty details of IB, and the video series is for those who want a broad overview of what the heck IB even is and how it fits into the alignment research landscape.

In fact, I'd say it would benefit you to go through the video series first to get a high-level conceptual overview of how IB can help with alignment. This understanding will help motivate the problem set. Without this conceptual grounding, the problem set can feel like just a big abstract math problem set disconnected from alignment.

That said, if you aren't up for watching a long video series, there's a much briefer (but way less thorough) conceptual description of how IB can help with alignment in the conclusion of the problem set. You could read that first and then work the problems.

Replying to Understanding Infra-Bayesianism: A Beginner-Friendly Video Series

Jack Parker, 3y

Thanks so much :)

Replying to Understanding Infra-Bayesianism: A Beginner-Friendly Video Series

Jack Parker, 3y

I really appreciate it! I hope this ends up being useful to people. We tried to create the resource we would have wanted to have when we first started learning about infra-Bayesianism. Vanessa's agenda is deeply interesting (especially to a math nerd like me!), and developing formal mathematical theories to help with alignment seems very neglected.

Understanding Infra-Bayesianism: A Beginner-Friendly Video Series


Jack Parker, Connall Garrod

Click here to see the video series

This video series was produced as part of a project through the 2022 SERI Summer Research Fellowship (SRF) under the mentorship of Diffractor.

Epistemic effort: Before working on these videos, we spent ~400 collective hours working to understand infra-Bayesianism (IB) for ourselves. We built up our own understanding of IB primarily by working together on the original version of Infra-Exercises Part I and subsequently creating a polished version of the problem set in hopes of making it more user-friendly for others.

We then spent ~320 hours writing, shooting, and editing this video series. Part 5 through Part 8 of the video series were checked for accuracy by Vanessa...


Replying to Infra-Exercises, Part 1

Jack Parker, 3y

Hi Max! That link will start working by the end of this weekend. Connall and I are putting the final touches on the video series, and once we're done the link will go live.

Infra-Exercises, Part 1


Diffractor, Jack Parker, Connall Garrod

There have been a few remarks about how my and Vanessa’s posts on Infra-Bayesianism seem interesting, but that it's quite a formidable body of work to chew through.

So anyone interested in learning Inframeasure theory to the level of proficiency where they are actually able to develop their own proofs about it and casually deploy it as a tool on other problems is probably not best served by the existing posts. You can only go so far by reading about something without ever applying it.

And so, to help people get a taste for the underlying math, I have constructed a sheet of exercises analogous to Scott's exercise sheets about fixpoints, to guide the...

Replying to Calling for Student Submissions: AI Safety Distillation Contest

Jack Parker, 4y

This looks awesome! Is the contest open to master's and PhD students or just undergraduates? Apologies if this is spelled out somewhere and I just missed it.