Andrew Mack

Message

332

159

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

Based off research performed in the MATS 5.1 extension program, under the mentorship of Alex Turner (TurnTrout). Research supported by a grant from the Long-Term Future Fund. TLDR: I introduce a new framework for mechanistically eliciting latent behaviors in LLMs. In particular, I propose deep causal transcoding - modelling the...

Dec 3, 2024107

Mechanistically Eliciting Latent Behaviors in Language Models

Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout). TL,DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The...

Apr 30, 2024224

LESSWRONG
LW

LESSWRONG
LW

Andrew Mack

Andrew Mack

Andrew Mack

Andrew Mack

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

Mechanistically Eliciting Latent Behaviors in Language Models

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

Mechanistically Eliciting Latent Behaviors in Language Models