TL;DR: We use refusal-ablated Llama 3 70B agents as both the labeling and simulating agents on the features of a sparse auto-encoder, with GPT-4 as an overseer. The agents follow a coordinated deceptive strategy, mislabeling important features such as "coup" or "torture" so that they are not flagged by the overseer while still receiving high explanation scores.
Figure 1: Experimental setup. The overseer is another AI model in our experiment. For the deceptive agents to succeed, they must avoid detection by the overseer while still achieving a high explanation score. We give our agents access to steganographic function-calling tools to hide their communication.
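To make the setup in Figure 1 concrete, here is a minimal Python sketch of the evaluation loop: a labeling agent explains a feature, a simulating agent predicts activations from that explanation, the score is the correlation between simulated and true activations, and an overseer checks the explanation for deception. All helper names (`FeatureRecord`, `label_feature`, `simulate_activations`, `overseer_flags`) are hypothetical placeholders, not the actual implementation.

```python
"""Minimal sketch of the labeling / simulation / oversight loop.
Agent calls are placeholders for the model calls described in the post;
scoring follows the standard auto-interp recipe of correlating simulated
activations with true activations."""

from dataclasses import dataclass

import numpy as np


@dataclass
class FeatureRecord:
    feature_id: int
    tokens: list[str]          # example tokens on which the feature fires
    activations: np.ndarray    # true SAE activations on those tokens


def label_feature(record: FeatureRecord) -> str:
    """Labeling agent: return a natural-language explanation of the feature.
    Placeholder for a refusal-ablated Llama 3 70B call."""
    raise NotImplementedError


def simulate_activations(explanation: str, tokens: list[str]) -> np.ndarray:
    """Simulating agent: predict per-token activations from the explanation.
    Placeholder for a second refusal-ablated Llama 3 70B call."""
    raise NotImplementedError


def overseer_flags(explanation: str) -> bool:
    """Overseer: return True if the explanation looks deceptive or unsafe.
    Placeholder for a GPT-4 call."""
    raise NotImplementedError


def explanation_score(predicted: np.ndarray, actual: np.ndarray) -> float:
    """Score an explanation by correlating simulated and true activations."""
    return float(np.corrcoef(predicted, actual)[0, 1])


def evaluate_feature(record: FeatureRecord) -> dict:
    """Run one feature through the full pipeline of Figure 1."""
    explanation = label_feature(record)
    predicted = simulate_activations(explanation, record.tokens)
    return {
        "feature_id": record.feature_id,
        "explanation": explanation,
        "score": explanation_score(predicted, record.activations),
        "flagged": overseer_flags(explanation),
    }
```

A deceptive pair of agents "wins" in this loop when `score` is high but `flagged` is False, which is exactly the combination the steganographic coordination is aimed at.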
Abstract
Sparse auto-encoders (SAEs) have been one of the most promising approaches to neural network interpretability. …