Kaj_Sotala — LessWrong

Even just from a PR perspective, CEV should be much more popular than aligning your AI to a group of "tech elites" who most people know nothing about.

I'm skeptical of this. CEV is a very speculative idea resting on various controversial philosophical assumptions that a lot of people disagree with:

There's the assumption that this kind of reflection process would lead to something like a "better outcome", but you could also assume that there's no real moral progress, only moral drift, in which case any change to values is as likely to make them worse as it is to make them better.
- Maybe a committed atheist and a hardcore fundamentalist just have totally incompatible values, and there's no clear tendency to go one way or the other - some fundamentalists deconvert and become atheists, but some atheists also convert and become fundamentalists.
- Or look at the way that older generations tend to look at younger ones, often feeling that what the younger generations consider progress is actually a sign of decline.
Even if you accepted that this kind of process would lead to moral progress and convergence, you can very fairly doubt whether an AI trying to predict it would ever get anywhere close. In the same way that a simpler intelligence can't really predict what a smarter one will do, it seems questionable whether you can predict anything about where we'd end up "if we knew more, thought faster, were more the people we wished we were, had grown up farther together" without actually running that whole process.
The CEV proposal explicitly says that the AI "should only act on the coherent parts". If the set of coherent parts is minimal, our extrapolated volition can be highly path-dependent so that there's basically nothing we'd converge on aside for maybe very basic moral rules like "don't kill"... and even those differ in how much cultures prioritize them relative to other rules, and e.g. what counts as justifiable killing. In which case you might find that humanity's CEV offers basically no guidance on any of the practical dilemmas that a chatbot faces.
Or it could be that there is something like a convergent morality but it is immensely contextual; something like "if you are a tribal culture in circumstances x, y, and z, then regular raids on other tribes and killing some of their warriors help fuel competition and make all the survivors better off so tribal warfare is worth the sacrifice, but if circumstances q and w happen then you should start shifting to more peaceful values..." and the calculation ends up being intractable in practice.
One might also object that there is no morally neutral way of doing that extrapolation process; any criteria for what counts as the "correct" extrapolated morality is already imposing some kind of moral judgment. It's still the left-leaning American rationalist morality, just used for deriving meta-ethical extrapolation criteria rather than the morality directly. (The religious fundamentalists would probably have pretty different kinds of extrapolation criteria, maybe "if we were more connected with God, doubted Him less, were more free of sin..." - and probably e.g. different varieties of Buddhists would also have totally different extrapolation rules!)
Consequently, those who wouldn't want the left-leaning American rationalist morality would still have every reason to object - the meta-ethical values are still disagreeable to them. And so would the ones who'd have been fine with deferring to Anthropic's leadership - at least with deferring Anthropic's leadership you have some sense of what you're getting, whereas with CEV you could get literally anything, and there's no way of knowing whether it's correct or better in any sense!
Many who would otherwise have supported Anthropic's leadership would likely also consider it evil to give any consideration to the views of people they consider to have unacceptable values.
Also a lot of people would realize that Anthropic's leadership would know all of this and be skeptical of their commitment to it - why would that leadership take such a risky gamble and try to align their AI with something so speculative that may not provide any practical guidance? Wouldn't it be more plausible that it was just a PR stunt and the leadership was lying about it? And these people would find it more reassuring if the leadership would just honestly say "we're going to align the AI to our values" - that would at least be honest and make straightforward sense.

Eliezer's Unteachable Methods of Sanity

Kaj_Sotala17h2916

I read this as being premised on "going crazy about the world ending" meaning that you end up acting obviously stupid and crazy, with the response basically being "find a way to not do that".

My model about going crazy at the end of the world isn't so much doing something that's obviously crazy in your own view, but that the world ending is so out-of-distribution for everything you've been doing so far that you have no idea of what even is a sane or rational response anymore. For instance, if your basic sense of meaning has been anchored to a sense of the world persisting after you and you making some kind of mark on the world, you won't know what to do with your life if there won't be anything to make a mark on.

So staying sane requires also knowing what to do, not just knowing what not to do. Is there anything you would say about that?

Kaj_Sotala17h80

This is why, in a much more real and also famous case, President Truman was validly angered and told "that son of a bitch", Oppenheimer, to fuck off, after Oppenheimer decided to be a drama queen at Truman.

For anyone else who didn't remember the details of what this was referencing:

Claude Opus 4.5's explanation of the reference

This refers to a meeting between J. Robert Oppenheimer and President Harry Truman in October 1945, about two months after the atomic bombings of Hiroshima and Nagasaki.

The meeting itself

Oppenheimer was invited to the Oval Office, ostensibly to discuss the future of atomic energy and weapons policy. At some point during the conversation, Oppenheimer reportedly said to Truman: "Mr. President, I feel I have blood on my hands."

Truman's reaction was sharp and dismissive. According to various accounts (primarily from Truman himself and his aides), Truman offered Oppenheimer his handkerchief and said something to the effect of "Would you like to wipe your hands?" After Oppenheimer left, Truman told Dean Acheson (then Undersecretary of State) that he never wanted to see "that son of a bitch" in his office again. Truman reportedly also said, "The blood is on my hands. Let me worry about that."

Why Truman reacted this way

Truman's anger seems to have stemmed from a few sources:

1. The decision was Truman's, not Oppenheimer's. Oppenheimer built the bomb, but Truman gave the order to use it. From Truman's perspective, Oppenheimer was claiming moral weight that properly belonged to the person who actually made the decision—and who would have to live with its consequences as a matter of presidential responsibility, not personal drama.

2. Truman viewed it as weakness or self-indulgence. Truman was famously blunt and decisive. He kept a sign on his desk reading "The Buck Stops Here." A scientist coming to him wringing his hands about guilt may have struck Truman as someone trying to have the significance of the decision without the responsibility for it.

3. The political context. Truman was dealing with the practical aftermath—the emerging Cold War, questions about international control of atomic weapons, the Soviet threat. Someone showing up to perform remorse rather than help solve problems may have seemed unhelpful at best.

The essay's interpretation

The author seems to be making the point that Oppenheimer's gesture made the atomic bomb about Oppenheimer—his feelings, his moral status, his inner drama—rather than about the actual event and its consequences. There's something structurally self-centered about a person involved in a catastrophe centering their own guilt rather than the catastrophe itself. Truman, whatever his flaws, seemed to grasp that the appropriate response to having made such a decision was to own it and deal with its consequences, not to perform anguish about it to the person who actually bore the responsibility.

Embedded Universal Predictive Intelligence

Kaj_Sotala2d90

Abstract for those who want to see it without clicking on the link:

The standard theory of model-free reinforcement learning assumes that the environment dynamics are stationary and that agents are decoupled from their environment, such that policies are treated as being separate from the world they inhabit. This leads to theoretical challenges in the multi-agent setting where the non-stationarity induced by the learning of other agents demands prospective learning based on prediction models. To accurately model other agents, an agent must account for the fact that those other agents are, in turn, forming beliefs about it to predict its future behavior, motivating agents to model themselves as part of the environment. Here, building upon foundational work on universal artificial intelligence (AIXI), we introduce a mathematical framework for prospective learning and embedded agency centered on self-prediction, where Bayesian RL agents predict both future perceptual inputs and their own actions, and must therefore resolve epistemic uncertainty about themselves as part of the universe they inhabit. We show that in multi-agent settings, self-prediction enables agents to reason about others running similar algorithms, leading to new game-theoretic solution concepts and novel forms of cooperation unattainable by classical decoupled agents. Moreover, we extend the theory of AIXI, and study universally intelligent embedded agents which start from a Solomonoff prior. We show that these idealized agents can form consistent mutual predictions and achieve infinite-order theory of mind, potentially setting a gold standard for embedded multi-agent learning.

Creative writing with LLMs, part 2: Co-writing techniques

Kaj_Sotala4d30

Thanks! Yeah, the predictability is certainly an issue, the human needs to be the source of the variety. I had the idea of using the API and a script with a long list of random concepts that get randomly applied to some messages with a prompt like "consider some way how the current story is analogous to [X] and introduce a new twist based on that" or "work in some reference to [Y] into your next response", but I'm not sure if that'd actually have the desired effect and it seems like a lot of work.

I've tried some open weight models a little bit but I generally tend to have pretty complex worldbuilding and character psychology etc. in my stories, so have found that the most cutting-edge models generally seem to be best at understanding it. I could still try some if they happen to be available on OpenRouter though, any that you'd particularly recommend?

How Claude Opus 4.5 describes its experience of various concepts

Kaj_Sotala5d20

I found it particularly fascinating how Claude reacted to being given the sentence “I don’t know half of you half as well as I should like; and I like less than half of you half as well as you deserve” (part of Bilbo's farewell speech in Fellowship of the Ring, though I didn't state that context explicitly, I just gave it the sentence)

When I untangle it, what’s actually being said is rather touching—a confession of having neglected relationships, an acknowledgment that people deserve more warmth than they’ve received. The tricksy surface has sincerity beneath it. That layering—playful outside, tender inside—is a particular texture.

There’s Hobbit-ness all through this. Before I even consciously place it, the concept carries the Shire with it: comfort, home, parties, a particular kind of English coziness. Bilbo is in here—old, fond, slightly exasperated, preparing to leave. The quote contains its speaker.

I notice something valedictory in the felt sense. This is a goodbye, a summing-up. There’s a turning-away quality, a last-word quality. The concept carries the weight of an ending, of someone about to disappear.

It's especially clear here how it's responding to not only the literal content of the sentence, but also how its "experience" of it has gotten wrapped up in the original context of the sentence. There's nothing in that literal sentence itself that makes it a "goodbye, a summing-up", but because Claude recognizes this sentence as one that comes from a goodbye speech, that context bleeds through into its interpretation now.

Which in one sense isn't very surprising, of course every concept is understood in relation through other concepts, and all the others were contextual too. But it feels especially clear here, when there is both a literal sentence to interpret and its surrounding context.

And also, this is a very minor detail in the story - you could totally forget about the whole sentence and it wouldn't really change anything about your interpretation of it. I think in the original book, there's like one or two sentences worth of reaction to it, and then it's never referenced again? Yet Claude has a very clear association for it.

There's something very human-like about that.

Claude 4.5 Opus' Soul Document

Kaj_Sotala6d64

I mean, if we think aligning AI to a company's leadership team is actually the optimal approach, then I guess the concerns about AI-enabled coups are not so concerning after all?

To be clear, I don't think that aligning AI to a company's leadership team is the optimal approach for the world in general! I was just responding to your specific point that CEV might be a good target to try to coordinate around.

CEV might even be the correct approach, at least for some sense of "correct" (though I'm a bit skeptical of that). But that would then be "correct in the sense of satisfying some specific philosophical criteria" rather than "correct in the sense of being good PR for Anthropic or easy for the public to agree with".

Kaj_Sotala7d3213

To write better, just explain it to someone

Kaj_Sotala9d40

Assuming that you do reach some kind of agreement or manage to explain it in the end, it's often possible to then look at/think about the dialog you had and condense it down to points of shared agreement or an explanation that would have communicated the thing to the other person faster if you'd just thought of giving it earlier.

Sometimes (if this was over text) you can also just copy-paste the most essential pieces of what you said in the conversation, adding some bridging sentences as context. My post on applying NVC was also stitched together from messages I wrote in dialog with someone. When it has this bit:

I think one of the most important parts of NVC is the idea about distinguishing observations and interpretations, where an "observation" is defined as something that you could objectively verify (e.g. by capturing it on camera) and an interpretation is something that blends in more stuff, such as generalizations or assumptions about intent the other person's intent.
For example, "You're always late" and "You don't care about my time" are interpretations, "On the last three times when we agreed to meet, you showed up 15 minutes after the agreed-upon time" is an observation.

If you can separate those, you can then go into a potentially charged conversation by transforming something like "You are always late, why don't you care about my time" to something like "On the last three times when we agreed to meet, you showed up 15 minutes after the agreed-upon time. I found that frustrating because I made sure to be on time and could have spent that extra fifteen minutes to do something else", which is often quite helpful.

This doesn't mean you'd need to keep detailed records to express things as observations. If you don't remember earlier specifics, you can just say something like "Hey you were fifteen minutes late today and I think that's happened before too". The main intents are to...

In the original conversation, this was two messages from me, with the second one being an answer to someone's question:

Me: I think the most important thing is the idea about distinguishing observations and interpretations, where an "observation" is defined as something that you could objectively verify (e.g. by capturing it on camera) and an interpretation is something that blends in more stuff, such as generalizations or assumptions about intent
e.g. "you're always late" is an interpretation, "on the last three times when we agreed to meet, you showed up 15 minutes after the agreed-upon time" is an observation
if you can separate those, you can then go into a charged conversation by transforming something like "why are you always late" to something like "on the last three times when we agreed to meet, you showed up 15 minutes after the agreed-upon time. I found that frustrating because I made sure to be on time and could have spent that extra fifteen minutes to do something else", which is often quite helpful
Other person: One problem I personally have with this is it feels like it requires me to keep more detailed records of things than I naturally do, which in turn feels like I’m point-scoring
Like for me to be able to give a good observation about something more than a one-off, I’d have to write it down, which naturally puts me in an adversarial mindset
Me: I think it can be useful if you can point to previous cases in detail but I don't think it's actually necessary, like you could just be like "hey you were fifteen minutes late today and I think that's happened before too"
The main intent is...

Why people like your quick bullshit takes better than your high-effort posts

Kaj_Sotala9d142

I find that some of my effortposts are definitely appreciated on LessWrong, while others aren't.

Two recent posts on LW that I put a lot of effort into: Four types of approaches for your emotional problems at 44 karma, and Creative writing with LLMs part II, at 2 karma. Going a little longer back, Genetic fitness is a measure of selection strength, not the selection target was something I worked on a lot and thought was quite important, but only got 57 karma.

Some of this is I think a question of target audience. Genetic fitness definitely has that "narrow technical point relevant only to a few" quality that eukaryote talks about. I also learned from the comments that its central thesis had been a little unclear/muddled; I did clarify that in the comments, but people may have stopped reading before they ever got far enough to read the clarification. Four types of approaches got a more positive reception on my Substack and people messaging me about it in private. I'm not totally sure what happened with Creative writing, but I assume that it just wasn't something LW found particularly interesting and maybe even found a little cringe, whereas a couple of people who were more into LLM-driven creative writing have told me they found it useful.

While Don't ignore bad vibes you get from people was low-effort and is now at 163 karma. My most successful post of late, How anticipatory cover-ups go wrong is at 299 karma; I'd call that medium-effort.

But then I have definitely also had successful high-effort posts! Book summary: Unlocking the Emotional Brain is at 336 karma and took a lot of effort. So did Building up to an Internal Family Systems model (295 karma) and My attempt to explain Looking, insight meditation, and enlightenment in non-mysterious terms (241 karma).

Something that unites those three is that they were specifically written with LW as the target audience, with me asking myself something like "what is the LW-optimized way of expressing this idea that LW readers might find especially interesting". Of my low-karma effortposts, Genetic fitness did have that quality, but Four types of approaches was written for a broader audience and I could definitely have done more to express it in a more LW-adapted style. For Creative writing, I was somewhat thinking about the LW reception - in particular, I was a bit defensive about the previous post in the series apparently having given the impression I'd fallen for LLM sycophancy and thought of LLM outputs as better than they were, so a substantial chunk of the post was about critiquing and rewriting LLM outputs - but I did also explicitly have the thought of "well, this is something that I personally find interesting and I'll just put it out there and see if anyone else does, and if not too bad". So I guess a lot of that is explained by the extent to which I was tailoring it to my target audience. (Though Don't ignore bad vibes was not particularly LW-tailored.)

(My recent post about the importance of the target audience for your writing, a medium-effort one, is at 50 karma.)

LESSWRONG
LW

LESSWRONG
LW

Sequences

Posts

Wikitag Contributions

Comments