This project was completed as part of the AI Safety Fundamentals: Alignment Course by BlueDot Impact. All the code, data and results are available in this repository.
Abstract
The goal of this project is to answer two questions: “Can jailbreaks be represented as a linear direction in activation space?” and if so, “Can that direction be used to prevent the success of jailbreaks?”. The difference-in-means technique was utilized to search for a direction in activation space that represents jailbreaks. After that, the model was intervened using activation addition and directional ablation. The activation addition intervention caused the attack success rate of jailbreaks to drop from 60% to 0%, suggesting that a direction representing jailbreaks might exist and disabling it... (read 2692 more words →)