Understanding Infra-Bayesianism: A Beginner-Friendly Video Series
Click here to see the video series
This video series was produced as part of a project through the 2022 SERI Summer Research Fellowship (SRF) under the mentorship of Diffractor.
Epistemic effort: Before working on these videos, we spent ~400 collective hours working to understand infra-Bayesianism (IB) for ourselves. We...
I'm right there with you on jailbreaks: they seem not too terribly hard to prevent, and it sounds like Claude 2 is already super resistant to this style of attack.
I have personally seen an example of doing (3) that seems tough to remediate without significantly hindering overall model capabilities. I won't go into detail here, but I'll privately message you about it if you're open to that.
I agree it seems quite difficult to use (3) for things like generating hate speech. A bad actor would likely need (1) for that. But language models aren't necessary for generating hate speech.
The anthrax example is a good one. It seems like (1) would be needed to access that kind...