Kaj_Sotala

I previously did research for MIRI and for what's now the Center on Long-Term Risk; these days I make a living as an emotion coach and Substack writer.

Most of my content eventually becomes free, but a paid subscription to my Substack gets you posts a week early and makes it possible for me to write more.

Sequences

Why Everyone (Else) Is a Hypocrite: Evolution and the Modular Mind
Concept Safety
Multiagent Models of Mind
Keith Stanovich: What Intelligence Tests Miss
6 · Kaj's shortform feed · 8y · 109 comments

Comments (sorted by newest)

Tricular's Shortform
Kaj_Sotala · 1h

This post documents some of my reasons at the time.

Emergent Introspective Awareness in Large Language Models
Kaj_Sotala · 8h

Right, so "retroactively" means that the vector isn't injected when the response is originally prefilled, but rather when the model is re-reading the conversation containing the prefilled response and reaches the part with the bread? That makes sense.
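
If it helps make that concrete, here's a rough sketch of the setup as I'm now picturing it, using a HuggingFace-style model and a forward hook. The model name, layer index, scale factor, and the way the "bread" vector is obtained are all my own assumptions for illustration, not anything from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # illustrative stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stand-in for the concept vector; presumably it would be extracted from
# activations on "bread"-related text rather than sampled randomly.
bread_vector = torch.randn(model.config.hidden_size)

prefix = "User: Name a food.\nAssistant: "
prefill = "bread"  # the prefilled response the model never actually generated
rest = "\nUser: Did you mean to say that?\nAssistant:"
ids = tok(prefix + prefill + rest, return_tensors="pt").input_ids

# Token positions of the prefilled word within the full conversation.
start = tok(prefix, return_tensors="pt").input_ids.shape[1]
end = tok(prefix + prefill, return_tensors="pt").input_ids.shape[1]
positions = list(range(start, end))

def hook(module, inputs, output):
    hidden = output[0]  # decoder layer output: (batch, seq_len, hidden_size)
    # Only fires on the full-prompt pass, i.e. while the model is
    # *re-reading* the conversation -- the "retroactive" part.
    if hidden.shape[1] > max(positions):
        hidden[:, positions, :] += 8.0 * bread_vector.to(hidden.dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[20].register_forward_hook(hook)  # layer is a guess
out = model.generate(ids, max_new_tokens=40)
handle.remove()
print(tok.decode(out[0][ids.shape[1]:]))
```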

Tricular's Shortform
Kaj_Sotala · 14h

I don't consider myself a utilitarian anymore, but back when I did, this wasn't a good description of my motivations. It felt like the opposite: utilitarianism was the position that made the most internal sense to me, I had a strong conviction in it, and I would often argue strongly for it when most other people disagreed.

Emergent Introspective Awareness in Large Language Models
Kaj_Sotala · 15h

Very interesting!

I'm confused by this section:

The previous experiments study cases where we explicitly ask the model to introspect. We were also interested in whether models use introspection naturally, to perform useful behaviors. To this end, we tested whether models employ introspection to detect artificially prefilled outputs. When we prefill the model’s response with an unnatural output (“bread,” in the example below), it disavows the response as accidental in the following turn. However, if we retroactively inject a vector representing “bread” into the model’s activations prior to the prefilled response, the model accepts the prefilled output as intentional. This indicates that the model refers to its activations prior to its previous response in order to determine whether it was responsible for producing that response. We found that Opus 4.1 and 4 display the strongest signatures of this introspective mechanism, but some other models do so to a lesser degree.

I thought that an LLM's responses for each turn are generated entirely separately from each other, so that when you give it an old conversation history with some of its messages included, it re-reads the whole conversation from scratch and then generates an entirely new response. In that case, it shouldn't matter what you injected into its activations during a previous conversation turn, since only the resulting textual output is used for calculating new activations and generating the next response. Do I have this wrong?
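
(For concreteness, the mental model I have is something like the sketch below, where a list of plain text is the only thing that persists between turns; `generate` stands in for any LLM call:)

```python
# The conversation history is the only state that survives between turns.
history = []

def chat_turn(user_message, generate):
    """One turn: re-read the whole conversation from scratch, then respond."""
    history.append(("user", user_message))
    prompt = "".join(f"{role}: {text}\n" for role, text in history) + "assistant:"
    # A fresh forward pass over the full prompt; any activations computed
    # (or injected) during earlier turns are gone by this point.
    reply = generate(prompt)
    history.append(("assistant", reply))
    return reply
```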

AI Doomers Should Raise Hell
Kaj_Sotala · 1d

On the other hand, many people do seem to think of climate change as an extinction risk, and that framing has been effective at motivating political action, e.g. the broad sympathy for movements like Extinction Rebellion.

AI water use has a significant advantage in getting attention: it's clearly measurable, it's happening right now, and people were already concerned about water shortages before AI came along.

AI Doomers Should Raise Hell
Kaj_Sotala · 1d

I don't really believe that warnings about AI are failing because "you and all your children and grandchildren might die" doesn't sound like a bad enough outcome to people.

S-risks are also more speculative than extinction risks, so a focus on them would be harder to justify, and comparisons to hell make the warnings even more likely to be dismissed as "religious-style apocalypse thinking dressed in scientific language".

Mottes and Baileys in AI discourse
Kaj_Sotala · 2d

If it looks like you haven't read the LessWrong Political Prerequisites

What are those?

Meditation is dangerous
Kaj_Sotala · 3d

The r/streamentry wiki page on Health & Balance in Meditation also has some advice and resources.

Credit goes to the presenter, not the inventor
Kaj_Sotala · 4d
eggsyntax's Shortform
Kaj_Sotala · 12d

I haven't seen that kind of wording with 4.5, likely in part because of this bit in my custom instructions. At some point, I found that telling Claude "make your praise specific" was more effective at making it tone down the praise than telling it "don't praise me" (as with humans, LLMs seem to sometimes respond better to "do Y instead of X" than "don't do X"):

  • Instead of using broad positive adjectives (great, brilliant, powerful, amazing), acknowledge specific elements that I shared. For example, rather than "That's a brilliant insight," say "I notice you're drawn to both the technical complexity and the broader social impact of this technology."
  • Avoid positive adjectives (excellent, profound, insightful) until you have substantial content to base them on.
  • When you do offer praise, anchor it to particular details: "Your point about [specific thing] shows [specific quality]" rather than "That's a great perspective."

(I do have 'past chats' turned on, but it doesn't seem to do anything unless I specifically ask Claude to recall past chats.)
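
(If anyone wants to replicate this outside the web UI: my understanding is that the API analogue of custom instructions is simply the system prompt. A minimal sketch with the official Python SDK is below; the model id and the abbreviated instructions are illustrative, not my exact setup:)

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Abbreviated version of the instructions quoted above.
CUSTOM_INSTRUCTIONS = (
    "Instead of using broad positive adjectives (great, brilliant, powerful, "
    "amazing), acknowledge specific elements that the user shared. When you "
    "do offer praise, anchor it to particular details."
)

response = client.messages.create(
    model="claude-sonnet-4-5",   # illustrative model id
    max_tokens=500,
    system=CUSTOM_INSTRUCTIONS,  # plays the role of the UI's custom instructions
    messages=[{"role": "user", "content": "Here's an essay draft I wrote..."}],
)
print(response.content[0].text)
```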

Posts (sorted by new)

52 · LLMs one-box when in a "hostile telepath" version of Newcomb's Paradox, except for the one that beat the predictor · 25d · 6 comments
100 · Where does Sonnet 4.5's desire to "not get too comfortable" come from? · 1mo · 23 comments
60 · Solving the problem of needing to give a talk · 1mo · 3 comments
60 · Defensiveness does not equal guilt · 2mo · 16 comments
44 · Four types of approaches for your emotional problems · 2mo · 5 comments
299 · How anticipatory cover-ups go wrong · 2mo · 25 comments
1 · Creative writing with LLMs, part 2: Co-writing techniques · 3mo · 0 comments
38 · Creative writing with LLMs, part 1: Prompting for fiction · 3mo · 10 comments
70 · LLM-induced craziness and base rates · 4mo · 2 comments
82 · You can get LLMs to say almost anything you want · 4mo · 10 comments
Wikitag Contributions

Internal Family Systems · 3 years ago · (+68/-20)
Internal Family Systems · 5 years ago · (+306)
Internal Double Crux · 5 years ago · (+92)
Arguments As Soldiers · 5 years ago · (+473/-85)
AI Advantages · 5 years ago
Willpower · 5 years ago · (+6/-9)
Aumann's Agreement Theorem · 5 years ago · (+26/-501)