Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models
Based on research performed in the MATS 5.1 extension program, under the mentorship of Alex Turner (TurnTrout). Research supported by a grant from the Long-Term Future Fund.
TLDR: I introduce a new framework for mechanistically eliciting latent behaviors in LLMs. In particular, I propose deep causal transcoding: modelling the...
Dec 3, 2024