I previously did research for MIRI and what's now the Center on Long-Term Risk; these days I make my living as an emotion coach and Substack writer.
Most of my content eventually becomes free, but a paid subscription to my Substack gets you new posts a week early and makes it possible for me to write more.
Right, so "retroactively" means that the vector isn't injected when the response is originally prefilled, but rather when the model re-reads the conversation containing the prefilled response and gets to the point with "bread"? That makes sense.
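To check that I have the mechanics right, here's a rough sketch of how I picture it, assuming a HuggingFace-style model and a precomputed "bread" concept vector (the model name, layer index, scale, and token positions are all made-up placeholders, and the layer path is model-specific):

```python
# Rough sketch of my understanding, not the actual experimental setup: re-run
# the whole transcript (including the prefilled "bread" reply) through the
# model, and add a "bread" steering vector to the residual stream at positions
# just before the prefilled reply, during that re-read.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "placeholder/chat-model"          # hypothetical
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

bread_vector = torch.load("bread_vector.pt")   # assumed: precomputed concept vector, shape [hidden_dim]
LAYER, SCALE = 20, 4.0                         # made-up values

transcript = "...full conversation text, ending with the prefilled 'bread' reply and a follow-up question..."
inputs = tok(transcript, return_tensors="pt")
inject_positions = [50, 51, 52]                # stand-in for the token positions just before the prefill

def inject(module, args, output):
    hidden = output[0]                          # residual stream at this layer
    # Only fire on the full re-read pass, not during incremental decoding.
    if hidden.shape[1] > max(inject_positions):
        hidden[:, inject_positions, :] += SCALE * bread_vector
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(inject)  # layer path assumes a LLaMA-style architecture
reply = model.generate(**inputs, max_new_tokens=100)
handle.remove()
print(tok.decode(reply[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```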
I don't consider myself a utilitarian anymore, but back when I did, this wasn't a good description of my motivations. If anything it felt like the opposite: utilitarianism was the thing that made the most internal sense to me, I had a strong conviction in it, and I would often argue strongly for it even when most other people disagreed with it.
Very interesting!
I'm confused by this section:
The previous experiments study cases where we explicitly ask the model to introspect. We were also interested in whether models use introspection naturally, to perform useful behaviors. To this end, we tested whether models employ introspection to detect artificially prefilled outputs. When we prefill the model’s response with an unnatural output (“bread,” in the example below), it disavows the response as accidental in the following turn. However, if we retroactively inject a vector representing “bread” into the model’s activations prior to the prefilled response, the model accepts the prefilled output as intentional. This indicates that the model refers to its activations prior to its previous response in order to determine whether it was responsible for producing that response. We found that Opus 4.1 and 4 display the strongest signatures of this introspective mechanism, but some other models do so to a lesser degree.
I thought that an LLM's responses for each turn are generated entirely separately from each other, so that when you give it an old conversation history with some of its messages included, it re-reads the whole conversation from scratch and then generates an entirely new response. In that case, it shouldn't matter what you injected into its activations during a previous conversation turn, since only the resulting textual output is used for calculating new activations and generating the next response. Do I have this wrong?
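For reference, the picture I have in mind is the standard stateless chat loop, where the only thing carried from turn to turn is text (a toy sketch, not any particular API):

```python
# Toy sketch of the mental model behind my question: every turn, the full text
# history is re-serialized and run through the model from scratch, so nothing
# from an earlier turn's activations survives -- only the text does.
from typing import Callable

messages: list[dict[str, str]] = []

def chat_turn(user_text: str, generate: Callable[[str], str]) -> str:
    messages.append({"role": "user", "content": user_text})
    # Rebuild the whole prompt from text alone (toy template).
    prompt = "".join(f"{m['role']}: {m['content']}\n" for m in messages) + "assistant:"
    reply = generate(prompt)            # fresh forward pass over the entire transcript
    messages.append({"role": "assistant", "content": reply})
    return reply
```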
On the other hand, many people seem to think of climate change as an extinction risk in a way that is effective at motivating political action, e.g. the broad sympathy for movements like Extinction Rebellion.
AI water use has a significant advantage in getting attention: it's clearly measurable, it's happening right now, and people were already concerned about water shortages before this.
I don't really believe that the reason warnings about AI are failing is that "you and all your children and grandchildren might die" doesn't sound like a bad enough outcome to people.
S-risks are also even more speculative than extinction risks, so a focus on them would be harder to justify, and comparisons to hell make them even more likely to be dismissed as "just religious-style apocalypse thinking dressed in scientific language".
If it looks like you haven't read the LessWrong Political Prerequisites
What are those?
Some very unhinged conversations with 4o starting at 3:40 of this video. [EDIT: fixed link]