1. Introduction

Leaving Open Philanthropy, going to Anthropic

2mo

(This is the video and transcript of a talk I gave at Constellation in December 2025; I also gave a shorter version at the 2025 FAR AI workshop in San Diego. The slides are also available here. The main content of the talk is based on this recent essay. I wrote this essay prior to joining Anthropic, and I'm here speaking only for myself and not for my employer.)

Talk

Hi everybody. My name is Joe. I work at Anthropic and I'm going to be talking about human-likeness in AI alignment. The talk is being recorded. It's going to be posted on Slack, and then I'm also likely going to post it on my...

Replying to Leaving Open Philanthropy, going to Anthropic

Joe Carlsmith 3mo

Hey Adam — thanks for this. I wrote about this kind of COI in the post, but your comment was a good nudge to think more seriously about my take here.

Basically, I care here about protecting two sorts of values. On the one hand, I do think the sort of COI you’re talking about is real. That is, insofar as people at AI companies who have influence over trade-offs the company makes between safety and commercial success hold equity, deciding in favor of safety will cause them to lose money -- and potentially, for high-stakes decisions like dropping out of the race, a lot of money. This is true of people in...

Replying to Giving AIs safe motivations

Joe Carlsmith 3mo

By default step 3 (reward-on-the-episode seekers aren't directly optimizing for your future efforts at studying their generalization to fail in the direction of AI takeover), but I do think the line here can get a bit blurry.

How human-like do safe AI motivations need to be?

Leaving Open Philanthropy, going to Anthropic

3mo

(Audio version (read by the author) here, or search for "Joe Carlsmith Audio" on your podcast app.

This is the eighth essay in a series I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, plus a bit more about the series as a whole.

This essay is also a review/critique of one of the central arguments in the book “If Anyone Builds It, Everyone Dies,” by Eliezer Yudkowsky and Nate Soares.)

1. Introduction

In previous essays, I’ve laid out my rough picture of the path to building...

4. Existing Writing on Corrigibility

3mo

(Audio version, read by the author, here, or search for "Joe Carlsmith Audio" on your podcast app.)

Last Friday was my last day at Open Philanthropy. I’ll be starting a new role at Anthropic in mid-November, helping with the design of Claude’s character/constitution/spec. This post reflects on my time at Open Philanthropy, and it goes into more detail about my perspective and intentions with respect to Anthropic – including some of my takes on AI-safety-focused people working at frontier AI companies.

(I shared this post with Open Phil and Anthropic comms before publishing, but I’m speaking only for myself and not for Open Phil or Anthropic.)

On my time at Open Philanthropy

I joined Open Philanthropy...

Replying to 4. Existing Writing on Corrigibility

Joe Carlsmith 4mo

I appreciated the detailed discussion and literature review here -- thanks.

Replying to Motivation control

Joe Carlsmith 4mo

I'm sympathetic to this -- thanks Thomas.

Replying to Video and transcript of talk on giving AIs safe motivations

Joe Carlsmith 4mo

Hi Steve -- thanks for this comment. I can see how the vibe of the talk/piece might call to mind something like "studying/intervening on an existing AI system" rather than focusing on how it's trained/constructed, but I do mean for the techniques I discuss to cover both. For example, and re: your Bob example, I talk about our existing knowledge of human behavior as an example of behavioral science here -- and I talk a lot about studying training as a part of behavioral science, e.g.:

Let’s call an AI’s full range of behavior across all safe and accessible-for-testing inputs its “accessible behavioral profile.” Granted the ability to investigate behavioral profiles of this kind

...

Controlling the options AIs can pursue

Video and transcript of talk on giving AIs safe motivations

5mo

(Podcast version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app.

This is the seventh essay in a series I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, plus a bit more about the series as a whole.)

1. Introduction and summary

In my last essay, I described my current picture of what it looks like to give AIs safe motivations. But as I discussed in the second essay in this series, controlling AI motivations is only one aspect of preventing...

Giving AIs safe motivations

5mo

(This is the video and transcript of a talk I gave at the UT Austin AI and Human Objectives Initiative in September 2025. The slides are also available here. The main content of the talk is based on this recent essay.)

Talk

Hi, everyone. Thank you for coming. I'm honored to be part of this series and part of the beginning of this series.

Plan

I'm going to briefly introduce the core AI alignment problem as I see it. It's going to be a particular version of that problem, the version that I think is highest stakes. And then I'm going to talk about my current high-level picture of how that problem gets solved at a technical...

Video and transcript of talk on "Can goodness compete?"

6mo

(Audio version (read by the author) here, or search for "Joe Carlsmith Audio" on your podcast app.

This is the sixth essay in a series I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, plus a bit more about the series as a whole.)

1. Introduction

Thus far in this series, I’ve defined what it would be to solve the alignment problem, and I’ve outlined a high-level picture of how we might get there – one that emphasized the role of “AI for AI safety,”...

Video and transcript of talk on AI welfare

7mo

(This is the video and transcript of a public talk I gave at Mox in San Francisco in July 2025, on long-term equilibria post-AGI. It’s a longer version of the talk I gave at this workshop. The slides are also available here.)

Introduction

Thank you. Okay. Hi. Thanks for coming.

Aims for this talk

So: can goodness compete? It's a classic question, and it crops up constantly in a certain strand of futurism, so I'm going to try to analyze it and understand it more precisely. And in particular I want to distinguish between a few different variants, some of which are more fundamental problems than others. And then I want to try to hone in...

The stakes of AI moral status

9mo

This is the video and transcript of a talk I gave on AI welfare at Anthropic in May 2025. The slides are also available here. The talk gives an overview of my current take on the topic. I'm also in the midst of writing a series of essays about it, the first of which -- "On the stakes of AI moral status" -- is available here (podcast version, read by the author, here). My takes may evolve as I do more thinking about the issue.

Hi everybody. Thanks for coming. So: this talk is going to be about AI welfare. About whether AIs have welfare, moral status, consciousness, that kind of thing....

9mo

Podcast version (read by the author) here, or search for "Joe Carlsmith Audio" on your podcast app.

1. Introduction

Currently, most people treat AIs like tools. We act like AIs don’t matter in themselves. We use them however we please.

For certain sorts of beings, though, we shouldn’t act like this. Call such beings “moral patients.” Humans are the paradigm example. But many of us accept that some non-human animals are probably moral patients as well. You shouldn’t kick a stray dog just for fun.^[1]

Can AIs be moral patients? If so, what sorts of AIs? Will some near-term AIs be moral patients? Are some AIs moral patients now?

If so, it matters a lot. We’re on track...

Replying to Can we safely automate alignment research?

I'm a bit confused about your overall picture here. Sounds like you're thinking something like:

"almost everything in the world is evaluable via waiting for it to fail and then noticing this. Alignment and bridge-building aren't like this, but most other things are... Also, the way we're going to automate long-horizon tasks is via giving AIs long-term goals. In particular: we'll give them goal 'get long-term human approval/reward', which will lead to good-looking stuff until the AIs take over in order to get more reward. This will work for tons of stuff but not for alignment, because you can't give negative reward for the alignment failure we ultimately care about, which is the AIs taking over."

Is that roughly right?

Replying to Can we safely automate alignment research?

I think it's a fair point that if it turns out that current ML methods are broadly inadequate for automating basically any sophisticated cognitive work (including capabilities research, biology research, etc -- though I'm not clear on your take on whether capabilities research counts as "science" in the sense you have in mind), it may be that whatever new paradigm ends up successful messes with various implicit and explicit assumptions in analyses like the one in the essay.

That said, I think if we're ignorant about what paradigm will succeed re: automating sophisticated cognitive work and we don't have any story about why alignment research would be harder, it seems like the baseline...

Replying to Can we safely automate alignment research?

I'm happy to say that easy-to-verify vs. hard-to-verify is what ultimately matters, but I think it's important to be clear about what makes something easier vs. harder to verify, so that we can be clear about why alignment might or might not be harder than other domains. And imo empirical feedback loops and formal methods are amongst the most important factors there.

Replying to Can we safely automate alignment research?

If we assume that the AI isn't scheming to actively withhold empirically/formally verifiable insights from us (I do think this would make life a lot harder), then it seems to me like this is reasonably similar to other domains in which we need to figure out how to elicit as-good-as-human-level suggestions from AIs that we can then evaluate well. E.g., it's not clear to me why this would be all that different from "suggest a new transformer-like architecture that we can then verify improves training efficiency a lot on some metric."

Or put another way: at least in the context of non-schemers, the thing I'm looking for isn't just "here's a way things...

Replying to Can we safely automate alignment research?