Current activation oracles are hard to use
This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan.

TL;DR: Activation oracles (Karvonen et al.) are a recent technique in which a model is finetuned to answer natural-language questions about another model's activations. They showed some promising signs of generalising to tasks fairly different...