I previously did research for MIRI and for what's now the Center on Long-Term Risk; I now make my living as an emotion coach and Substack writer.
Most of my content eventually becomes free, but if you get a paid subscription to my Substack, you'll see each post a week early and make it possible for me to write more.
Right, so the "retroactively" means that the vector isn't injected when the response is originally prefilled, but rather when the model is re-reading the conversation containing the prefilled response and gets to the point with the bread? That makes sense.
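In case it helps anyone else, here's a minimal sketch of how I picture that setup, assuming a PyTorch-style forward hook on a toy model (the model, the concept vector, and the injection positions are all hypothetical stand-ins; this is my own reconstruction, not Anthropic's actual code). The key point is that the whole transcript, prefilled "bread" response included, gets a fresh forward pass on the next turn, and the vector is added to the activations during that re-read:

```python
import torch

# A toy two-layer transformer standing in for the real model.
torch.manual_seed(0)
d_model = 64
encoder_layer = torch.nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
model = torch.nn.TransformerEncoder(encoder_layer, num_layers=2)

bread_vector = torch.randn(d_model)   # hypothetical concept vector for "bread"
inject_positions = [10, 11, 12, 13]   # hypothetical token span just *before* the prefilled reply

def retroactive_injection(module, inputs, output):
    # Add the concept vector to the hidden states at the chosen positions.
    # Crucially, this runs during the *re-read* of the transcript (the next
    # turn's forward pass), not during the original prefilled generation.
    output = output.clone()
    output[:, inject_positions, :] += bread_vector
    return output

handle = model.layers[0].register_forward_hook(retroactive_injection)

# Stand-in for the embedded conversation history, prefilled "bread" included.
transcript = torch.randn(1, 32, d_model)
hidden_states = model(transcript)   # fresh forward pass over the whole transcript
handle.remove()
```

Since each turn's activations are recomputed from the text alone, injecting during the original prefill would indeed do nothing; injecting during the re-read is the only way the vector can influence how the model interprets its own previous response.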
I don't consider myself a utilitarian anymore, but back when I did, this wasn't a good description of my motivations. Rather, it felt like the opposite: utilitarianism was the thing that made the most internal sense to me, I had a strong conviction in it, and I would often argue strongly for it when most other people disagreed with it.
Very interesting!
I'm confused by this section:
> The previous experiments study cases where we explicitly ask the model to introspect. We were also interested in whether models use introspection naturally, to perform useful behaviors. To this end, we tested whether models employ introspection to detect artificially prefilled outputs. When we prefill the model’s response with an unnatural output (“bread,” in the example below), it disavows the response as accidental in the following turn. However, if we retroactively inject a vector representing “bread” into the model’s activations prior to the prefilled response, the model accepts the prefilled output as intentional. This indicates that the model refers to its activations prior to its previous response in order to determine whether it was responsible for producing that response. We found that Opus 4.1 and 4 display the strongest signatures of this introspective mechanism, but some other models do so to a lesser degree.
I thought that an LLM's responses for each turn are generated entirely separately from each other, so that when you give it an old conversation history with some of its messages included, it re-reads the whole conversation from scratch and then generates an entirely new response. In that case, it shouldn't matter what you injected into its activations during a previous conversation turn, since only the resulting textual output is used for calculating new activations and generating the next response. Do I have this wrong?
On the other hand, many people seem to think of climate change as an extinction risk, and that framing appears effective at motivating political action, e.g. the broad sympathy for movements like Extinction Rebellion.
AI water use has a significant advantage in getting attention: it's clearly measurable, it's happening right now, and people were already concerned about water shortages before this.
I don't really believe that warnings about AI are failing because "you and all your children and grandchildren might die" doesn't sound like a bad enough outcome to people.
S-risks are also even more speculative than extinction risks, so a focus on them would be harder to justify, and comparisons to hell make them even more likely to be dismissed as "this is just religious-style apocalypse thinking dressed in scientific language".
> If it looks like you haven't read the LessWrong Political Prerequisites
What are those?
I haven't seen that kind of wording with 4.5, likely in part because of this bit in my custom instructions. At some point, I found that telling Claude "make your praise specific" was more effective at making it tone down the praise than telling it "don't praise me" (as with humans, LLMs seem to sometimes respond better to "do Y instead of X" than "don't do X"):
- Instead of using broad positive adjectives (great, brilliant, powerful, amazing), acknowledge specific elements that I shared. For example, rather than "That's a brilliant insight," say "I notice you're drawn to both the technical complexity and the broader social impact of this technology."
- Avoid positive adjectives (excellent, profound, insightful) until you have substantial content to base them on.
- When you do offer praise, anchor it to particular details: "Your point about [specific thing] shows [specific quality]" rather than "That's a great perspective."
(I do have 'past chats' turned on, but it doesn't seem to do anything unless I specifically ask Claude to recall past chats.)
This post documents some of my reasons at the time.