Sodium

Trying to get into alignment. Have a low bar for reaching out!

247ca7912b6c1009065bade7c4ffbdb95ff4794b8dadaef41ba21238ef4af94b

Comments

Sodium50

This chapter on AI follows immediately after the year in review. I went and checked the previous few years' annual reports to see what the comparable chapters were about. They are:

2023: China's Efforts To Subvert Norms and Exploit Open Societies

2022: CCP Decision-Making and Xi Jinping's Centralization Of Authority

2021: U.S.-China Global Competition (Section 1: The Chinese Communist Party's Ambitions and Challenges at its Centennial)

2020: U.S.-China Global Competition (Section 1: A Global Contest For Power and Influence: China's View of Strategic Competition With the United States)

And this year it's Technology And Consumer Product Opportunities and Risks (Chapter 3: U.S.-China Competition in Emerging Technologies)

Reminds me of when Richard Ngo said something along the lines of: "We're not going to be bottlenecked by politicians not caring about AI safety. As AI gets crazier and crazier, everyone will want to do AI safety, and the question is guiding people to the right AI safety policies."

Sodium30

I think[1] people[2] probably trust individual tweets way more than they should. 

Like, just because someone sounds very official and serious, and it's a piece of information that's in line with your worldview, doesn't mean it's actually true. Or maybe it is true, but missing important context. Or it's saying A causes B when it's more like A and C and D all cause B together, and actually most of the effect is from C, but now you're laser focused on A.

Also you should be wary that the tweets you're seeing are optimized for piquing the interests of people like you, not truth. 

I'm definitely not the first person to say this, but it feels worth saying again.

  1. ^

    75% Confident maybe?

  2. ^

    including some rationalists on here

Sodium20

Sorry, is there a specific timezone for when applications close, or is it AoE?

Sodium43

Man, politics really is the mind killer

Sodium30

I think knowing the karma and agreement is useful, especially to help me decide how much attention to pay to a piece of content, and I don't think there's that much distortion from knowing what others think (i.e., overall benefits > costs).

SodiumΩ010

Thanks for putting this up! Just to double check—there aren't any restrictions against doing multiple AISC projects at the same time, right?

Sodium20

Wait a minute, "agentic" isn't a real word? It's not on dictionary.com or Merriam-Webster or Oxford English Dictionary.

Sodium30

I agree that if you put more limitations on what heuristics are and how they compose, you end up with a stronger hypothesis. I think it's probably better to leave that out and try to do some more empirical work before making a claim there, though (I suppose you could say that the hypothesis isn't actually making a lot of concrete predictions yet at this stage).

I don't think (2) necessarily follows, but I do sympathize with your point that the post is perhaps a more specific version of the hypothesis that "we can understand neural network computation by doing mech interp."

Sodium30

Thanks for reading my post! Here's how I think this hypothesis is helpful:

It's possible that we wouldn't be able to understand what's going on even if we had some perfect way to decompose a forward pass into interpretable constituent heuristics. I'm skeptical that this would be the case, mostly because I think (1) we can get a lot of juice out of auto-interp methods and (2) we probably wouldn't need to understand that many heuristics simultaneously (which is what your logic gate example for modern computers would require). At a minimum, I would argue that the decomposed bag of heuristics is likely to be much more interpretable than the original model itself.

If the hypothesis is true, then it at least suggests that interpretability researchers should put more effort into finding and studying individual heuristics/circuits, as opposed to the current, more "feature-centric" framework. I don't know how this would manifest exactly, but it felt worth saying. I believe some of the empirical work I cited suggests that we might make more incremental progress if we focused on heuristics more right now.
