New Report: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence
TLDR: We at the MIRI Technical Governance Team have released a report describing an example international agreement to halt the advancement towards artificial superintelligence. The agreement is centered on limiting the scale of AI training and restricting certain AI research.

Experts argue that the premature development of artificial superintelligence (ASI) poses catastrophic risks, from misuse by malicious actors, to geopolitical instability and war, to human extinction due to misaligned AI. Regarding misalignment, Yudkowsky and Soares's NYT bestseller If Anyone Builds It, Everyone Dies argues that the world needs a strong international agreement prohibiting the development of superintelligence. This report is our attempt to lay out such an agreement in detail.

The risks stemming from misaligned AI are of special concern, widely acknowledged in the field and even by the leaders of AI companies. Unfortunately, the deep learning paradigm underpinning modern AI development seems highly prone to producing agents that are not aligned with humanity's interests. There is likely a point of no return in AI development: a point where alignment failures become unrecoverable because humans have been disempowered. Anticipating this threshold is complicated by the possibility of a feedback loop once AI research and development can be directly conducted by AI itself. What is clear is that we are likely to cross the threshold for runaway AI capabilities before the core challenge of AI alignment is sufficiently solved. We must act while we still can. But how?

In our new report, we propose an international agreement to halt the advancement towards superintelligence while preserving access to current, beneficial AI applications. We don't know when we might pass a point of no return in developing superintelligence, and so this agreement effectively halts all work that pushes the frontier of general AI capabilities. This halt would
Zachary Robinson and Kanika Bahl are no longer on the Anthropic LTBT (Long-Term Benefit Trust). Mariano-Florentino (Tino) Cuéllar has been added. Anthropic's Company page is out of date, but as far as I can tell the LTBT now consists of Neil Buddy Shah (chair), Richard Fontaine, and Cuéllar.