LLM Mindreading
I think I now know something about language-model mindcontrol. But I don't know how to understand a model's forward-pass cognition: the mindcontrol I know about is "blind" to it, and has to observe mindcontrol outcomes to guess at internal structure.
Mindreading is the key missing piece in making that mindcontrol loop more adversarially robust.