Isolating Vector Additions
From the first post:
Team Shard's recent activation-addition methodology for steering GPT-2-XL raises many questions about what structure a model's internal computation must have for their edits to work. Unfortunately, interpreting the insides of neural networks is notoriously difficult, so the question becomes: what is the minimal set of properties a network must have for activation additions to work?
This sequence collects my posts exploring this question.
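For concreteness, here is a minimal sketch of the kind of intervention in question: take the difference between residual-stream activations on two contrasting prompts, scale it, and add it back in during a forward pass. This is an illustrative reconstruction, not Team Shard's actual code; the layer index, coefficient, prompt pair, and the use of the small gpt2 checkpoint via Hugging Face transformers (rather than GPT-2-XL and their codebase) are all assumptions made for the sketch.

```python
# A minimal sketch of the activation-addition idea, NOT Team Shard's exact code.
# Hypothetical choices: the small "gpt2" checkpoint, injection layer 6,
# coefficient 5.0, and the " Love" / " Hate" prompt pair.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

LAYER = 6    # hypothetical injection layer
COEFF = 5.0  # hypothetical steering coefficient


def block_output(prompt: str) -> torch.Tensor:
    """Residual-stream activations at the output of block LAYER for `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1]  # hidden_states[i + 1] is the output of block i


# Steering vector: scaled difference of activations on a contrasting prompt pair.
# (" Love" and " Hate" are each a single GPT-2 token, so the shapes line up.)
steer = COEFF * (block_output(" Love") - block_output(" Hate"))


def hook(module, inputs, output):
    h = output[0]
    if h.shape[1] > 1:  # modify the prompt's forward pass only, not cached decode steps
        n = min(h.shape[1], steer.shape[1])
        h = h.clone()
        h[:, :n, :] += steer[:, :n, :]  # add steering vector at the leading positions
        return (h,) + output[1:]
    return output


# Steer generation by patching the chosen block's output during the forward pass.
handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("I hate you because", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20, do_sample=False)[0]))
handle.remove()
```

The design choice worth noting is that nothing here touches the model's weights: the edit is purely a runtime offset to one layer's activations, which is what makes it puzzling that it steers behavior at all.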