LLM Mindreading
I think I now know something about language-model mindcontrol. But I don't know how to understand a model's forward-pass cognition: the mindcontrol I know about is "blind" to it, and has to observe mindcontrol outcomes to guess at internal structure.
Mindreading is the key missing piece in making that mindcontrol loop more adversarially robust.