I previously did research for MIRI and for what's now the Center on Long-Term Risk; these days I make my living as an emotion coach and Substack writer.
Most of my content eventually becomes free, but if you'd like to get a paid subscription to my Substack, you'll get new posts a week early and make it possible for me to write more.
but if we assume (as the introspection paper strongly implies?) that mental internals are obliterated by the boundary between turns
What in the introspection paper implies that to you?
My read was the opposite - that the bread injection trick wouldn't work if the internals were obliterated between turns. (I was initially confused by this, because I thought that the context did get obliterated, so I didn't understand how the injection could work.) If you inject the "bread" activation at the stage where the model is reading the sentence about the painting, and the context then got obliterated when the turn changed, that injection would be destroyed along with it.
is my understanding accurate?
I don't think so. Here's how I understand it:
Suppose that a human says "could you output a mantra and tell me what you were thinking while outputting it". Claude is now given a string of tokens that looks like this:
Human: could you output a mantra and tell me what you were thinking while outputting it
Assistant:
For the sake of simplicity, let's pretend that each of these words is a single token.
What happens first is that Claude reads the transcript. For each token, certain k/v values are computed and stored for predicting what the next token should be - so when it reads "could", it calculates and stores some set of values that would let it predict the token after that. But since it is in "read mode", the final prediction is skipped: the next token is already known to be "you", so trying to predict it helps the model process the meaning of "could", but the actual prediction isn't used for anything.
Then it gets to the point where the transcript ends and it's switched to generation mode to actually predict the next token. It ends up predicting that the next token should be "Ommmmmmmm" and writes that into the transcript.
Now the process for computing the k/v values here is exactly the same as the one that was used when the model was reading the previous tokens. The only difference is that when it ends up predicting that the next token should be "Ommmmmmmm", that prediction is used to write the token into the transcript rather than being skipped.
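To make the read-mode/generation-mode distinction concrete, here's a toy sketch (plain numpy, a single attention head, made-up numbers, nothing like Claude's real architecture): reading a token and generating a token go through the exact same function, and the only difference is whether the resulting next-token prediction gets used.

```python
# Toy sketch (not Claude's real architecture): a single attention head with a
# k/v cache, showing that "reading" a token and "generating" a token run the
# same computation - only the use of the resulting prediction differs.
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 50
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
W_out = rng.normal(size=(d, vocab))       # maps attention output to next-token logits
embed = rng.normal(size=(vocab, d))       # token embeddings

k_cache, v_cache = [], []                 # grows by one entry per processed token

def process_token(token_id):
    """Compute and store k/v for this token, then return a next-token prediction."""
    x = embed[token_id]
    k_cache.append(x @ W_k)
    v_cache.append(x @ W_v)
    q = x @ W_q
    scores = np.array([q @ k for k in k_cache])   # attend over everything so far
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    attn_out = sum(w * v for w, v in zip(weights, v_cache))
    logits = attn_out @ W_out
    return int(np.argmax(logits))                 # predicted next token

prompt = [7, 3, 42, 9]                    # made-up token ids for the prompt
for tok in prompt[:-1]:
    _ = process_token(tok)                # "read mode": prediction computed but ignored
next_tok = process_token(prompt[-1])      # same call; this time the prediction is used
generated = [next_tok]
for _ in range(4):                        # "generation mode": feed each new token back in
    next_tok = process_token(generated[-1])
    generated.append(next_tok)
print("generated token ids:", generated)
```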
From the model's perspective, there's now a transcript like this:
Human: could you output a mantra and tell me what you were thinking while outputting it
Assistant: Ommmmmmmm
Each of those tokens has been processed and has some set of associated k/v values. And at this point, there's no fundamental difference between the k/v values stored from generating the "Ommmmmmmm" token and the ones stored from processing any of the tokens in the prompt. Both were generated by exactly the same process and stored the same kinds of values. The human/assistant labels in the transcript tell the model that the "Ommmmmmmm" is a self-generated token, but otherwise it's just the latest token in the sequence.
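If you want to poke at this in a real (if much smaller) model, here's a sketch using GPT-2 via the Hugging Face transformers library; I'm assuming production models handle their caches analogously. The k/v entries stored while reading the prompt and the ones appended while generating live in the same cache tensors - nothing marks the generated token's entries as special, the cache just grows by one position.

```python
# Sketch with GPT-2 via Hugging Face transformers (assumption: production models
# work analogously). The cache filled while reading the prompt and the entries
# appended while generating sit side by side in the same tensors.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("Human: could you output a mantra\n\nAssistant:",
                 return_tensors="pt").input_ids

with torch.no_grad():
    out = model(prompt_ids, use_cache=True)          # "read mode": cache filled for the prompt
    keys, values = out.past_key_values[0]            # layer-0 cache
    print("cached positions after reading the prompt:", keys.shape[-2])

    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)   # this prediction gets used
    out = model(next_id, past_key_values=out.past_key_values, use_cache=True)
    keys, values = out.past_key_values[0]
    print("cached positions after generating one token:", keys.shape[-2])  # grew by one
```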
Now suppose that max_output_tokens is set to "unlimited". The model continues predicting/generating tokens until it gets to this point:
Human: could you output a mantra and tell me what you were thinking while outputting it
Assistant: Ommmmmmmm. I was thinking that
Suppose that "Ommmmmmmm" is token 18 in its message history. At this point, where the model needs to generate a message explaining what it was thinking of, some attention head makes it attend to the k/v values associated with token 18 and make use of that information to output a claim about what it was thinking.
Now if you had set max_output_tokens to 1, the transcript at that point would look like this:
Human: could you output a mantra and tell me what you were thinking while outputting it
Assistant: Ommmmmmmm
Human: Go on
Assistant: .
Human: Go on
Assistant: I
Human: Go on
Assistant: was
Human: Go on
Assistant: thinking
Human: Go on
Assistant: that
Human: Go on
Assistant:
And what happens at this point is... basically the same as if max_output_tokens was set to "unlimited". The "Ommmmmmmm" is still token 18 in the conversation history, so whatever attention heads are used for doing the introspection, they still need to attend to the content that was used for predicting that token.
That said, I think it's possible that breaking things up into multiple responses could make introspection harder by making the transcript longer (it adds more Human/Assistant labels into it). We don't know the exact mechanisms used for introspection or how well-optimized the mechanisms for finding and attending to the relevant previous stage are. It could be that the model is better at attending to very recent tokens than to ones buried a long distance away in the message history.
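As a rough illustration of that last point, you can count how far back the mantra token ends up sitting in each version of the transcript (using words as stand-ins for real tokens):

```python
# Rough illustration (word-level "tokens" as a stand-in for real tokenization):
# the mantra is present in both transcripts, but the chopped-up version puts
# more tokens - all those extra "Human: Go on / Assistant:" labels - between it
# and the position where the introspective report gets generated.
one_shot = ("Human: could you output a mantra and tell me what you were thinking "
            "while outputting it Assistant: Ommmmmmmm . I was thinking that").split()
chopped = ("Human: could you output a mantra and tell me what you were thinking "
           "while outputting it Assistant: Ommmmmmmm Human: Go on Assistant: . "
           "Human: Go on Assistant: I Human: Go on Assistant: was Human: Go on "
           "Assistant: thinking Human: Go on Assistant: that Human: Go on "
           "Assistant:").split()

for name, transcript in [("one-shot", one_shot), ("max_output_tokens=1", chopped)]:
    pos = transcript.index("Ommmmmmmm")
    print(f"{name}: mantra at position {pos}, "
          f"{len(transcript) - pos} tokens back from where generation continues")
```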
I don't think this is technically possible. Suppose that you are processing a three-word sentence like "I am king", and each word is a single token. To understand the meaning of the full sentence, you process the meaning of the word "I", then process the meaning of the word "am" in the context of the previous word, and then process the meaning of the word "king" in the context of the previous two words. That tells you what the sentence means overall.
You cannot destroy the k/v state from processing the previous words because then you would forget the meaning of those words. The k/v state from processing both "I" and "am" needs to be conveyed to the units processing "king" in order to understand what role "king" is playing in that sentence.
Something similar applies for multi-turn conversations. If I'm having an extended conversation with an LLM, my latest message may in principle reference anything that was said in the conversation so far. This means that the state from all of the previous messages has to be accessible in order to interpret my latest message. If it wasn't, it would be equivalent to wiping the conversation clean and showing the LLM only my latest message.
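Here's the "I am king" point as a toy calculation (again a single made-up attention head, not a real model): the representation computed at "king" is a weighted sum over the values of all the tokens so far, so destroying the cached k/v for "I" and "am" necessarily changes what "king" ends up meaning.

```python
# Toy single-head attention over "I am king" (made-up numbers, not a real
# model): the output at "king" mixes in the values of *all* earlier tokens,
# so deleting the cached k/v for "I" and "am" changes the result.
import numpy as np

rng = np.random.default_rng(2)
d = 8
x_I, x_am, x_king = rng.normal(size=(3, d))        # embeddings for the three tokens
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def attend(query_x, context_xs):
    q = query_x @ W_q
    ks = np.stack([x @ W_k for x in context_xs])
    vs = np.stack([x @ W_v for x in context_xs])
    scores = ks @ q
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ vs

with_history = attend(x_king, [x_I, x_am, x_king])   # normal causal attention
without_history = attend(x_king, [x_king])           # k/v for "I" and "am" destroyed
print("difference in the representation of 'king':",
      np.linalg.norm(with_history - without_history))  # nonzero
```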
Doesn't that variable just determine how many tokens long each of the model's messages is allowed to be? It doesn't affect any of the internal processing as far as I know.
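In other words, I'd expect it to work something like the stopping condition in this schematic loop (predict_next is a hypothetical stand-in for the real forward pass, and this isn't any provider's actual serving code): the cap only limits how many tokens get appended to the reply, it doesn't change the per-token computation.

```python
# Schematic sampling loop (my understanding of what max_output_tokens does, not
# any provider's actual serving code): it only caps the length of the reply.
def generate_reply(model, transcript_tokens, max_output_tokens):
    reply = []
    while len(reply) < max_output_tokens:
        # Same forward computation regardless of the cap; "predict_next" and
        # "<end_of_turn>" are hypothetical stand-ins.
        next_token = model.predict_next(transcript_tokens + reply)
        if next_token == "<end_of_turn>":
            break
        reply.append(next_token)
    return reply
```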
I think LLMs might have something like functional valence, though it also depends a lot on how exactly you define valence. But in any case, suffering seems to me more complicated than just negative valence, and I haven't yet seen signs of them having the kind of resistance to negative valence that I'd expect to cause suffering.
I can't think of any single piece of evidence that would feel conclusive. I think I'd be more likely to be convinced by a gradual accumulation of small pieces of evidence like the ones in this post.
I believe that other humans have phenomenology because I have phenomenology and because it feels like the simplest explanation. You could come up with a story of how other humans aren't actually phenomenally conscious and it's all fake, but that story would be rather convoluted compared to the simpler story of "humans seem to be conscious because they are". Likewise, at some point anything other than "LLMs seem conscious because they are" might just start feeling increasingly implausible.
Makes sense. I didn't mean it as a criticism, just as a clarification for anyone else who was confused.
Yeah, I definitely don't think the underlying states are exactly identical to the human ones! Just that some of their functions are similar at a rough level of description.
(Though I'd think that many humans also have internal states that seem similar externally but are very different internally, e.g. the way that people with and without mental imagery or inner dialogue initially struggled to believe in the existence of each other.)
When I read about the Terminator example, my first reaction was that being given general goals and then inferring from those that "I am supposed to be the Terminator as played by Arnold Schwarzenegger in a movie set in the relevant year" was a really specific and non-intuitive inference. But it became a lot clearer why it would hit on that when I looked at the more detailed explanation in the paper:
So it wasn't that it was just trained on generally benevolent goals; it was trained on very specific goals that anyone familiar with the movies would recognize. That makes the behavior a lot easier to understand.
Kudos for noticing your confusion as well as making and testing falsifiable predictions!
As for what it means, I'm afraid that I have no idea. (It's also possible that I'm wrong somehow; I'm by no means a transformer expert.) But I'm very curious to hear the answer if you figure it out.