As far as I understand, the banner is distinct - the team members don't seem to be the same, but there is meaningful overlap with the continuation of the agenda. I believe the most likely source of error here is whether work is actually continuing in what could be called this direction. Do you believe the representation should be changed?
I think your comment adds a relevant critique of the criticism, but given that the criticism comes from someone contributing to the project, I don't think it's worth leaving out altogether. I added a short summary and a hyperlink to a footnote.
Good point imo, expanded and added a hyperlink!
Would you agree that the entire agenda of collective intelligence is aimed at addressing 11 ("Someone else will deploy unsafe superintelligence first") and 13 ("Fair, sane pivotal processes"), or does that cut off nuance?
Thanks, added!
I really like the artistry of post-writing here; the introduction to and transition between the three videos felt especially great.
I've been internally using the term elemental for something in this neighborhood - Frame-Breaker elemental, Incentive-Slope elemental, etc. The term feels more totalizing (having two cup-stacking skills is easy to envision; being a several-thing elemental points in the direction of you being some mix of those things, and only those things), but some other connotations feel more on-target (like the difficulty of not doing the thing). I also like the term's aesthetics, but I could well be alone in that.
I'm not sure I understand the cryptographer's constraint very well, especially with regard to language: individual words can have multiple meanings ("awesome", "literally", "love"). It's generally possible to infer which decryption was intended from the wider context, but sometimes the context itself admits different and mutually exclusive decryptions, as in cases of real or perceived dogwhistling.
One way I could see this specific issue being resolved is by looking at the intent behind the original communication - this would make it so that there is a fact of the matter settling which decryption is "correct" - but that seems to fail in a different way: agents don't seem to have full introspective access to what they are doing or to the likely outcomes of their actions, as in some cases of infidelity or promise-making.
This, too, could be resolved by saying that an agent's intention is "the outcomes they're attempting to instantiate regardless of self-awareness", but by that point it seems to me that we've agreed with Rosenberg's claim that it's Darwinian all the way down.
What am I missing?
I might be missing the forest for the trees, but all of those still feel like they end up making some kind of prediction based on the model, even if the predictions aren't trivial to test. Something like:
If Alice were informed by some neutral party that she took Bob's apple, Charlie would predict that she would show no meaningful remorse and make no attempt to repair the damage beyond trivial gestures like an off-hand "sorry", and would further predict that some other minor extraction of resources is likely to follow; Diana would predict that Alice would treat her overreach more seriously once informed of it. Something similar can be done on the meta-level.
None of these are slam dunks, and there are plenty of reasons besides the underlying model why things might turn out exactly as Charlie or Diana predicts, but that just feels like how the Bayesian cookie crumbles, and I would still expect evidence to accumulate in one direction or the other over time.
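To make the "evidence accumulates" part concrete, here's a toy odds-form Bayes sketch; the likelihood ratio of 3 and the rough independence of the observations are made-up assumptions for illustration, not anything claimed above. If Charlie's and Diana's models start at even odds, and each weak observation (an off-hand "sorry", another minor resource grab) is about three times as likely under Charlie's model as under Diana's, then after four such observations

$$\frac{P(H_{\text{Charlie}} \mid E_1,\dots,E_4)}{P(H_{\text{Diana}} \mid E_1,\dots,E_4)} = \frac{P(H_{\text{Charlie}})}{P(H_{\text{Diana}})} \prod_{i=1}^{4} \frac{P(E_i \mid H_{\text{Charlie}})}{P(E_i \mid H_{\text{Diana}})} \approx 1 \times 3^{4} = 81,$$

so no single observation is decisive, but the direction becomes hard to miss.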
Strong opinion weakly held: it feels like an iterated version of this prediction-making and tracking over time is how our native bad actor detection algorithms function. It seems to me that shining more light on this mechanism would be good.
I am not one of the Old Guard, but I have an uneasy feeling about something related to the Chakra phenomenon.
It feels like there's a lot of hidden value clustered around woo-y topics like Chakras and Tulpas, and the right orientation towards these topics seems fairly straightforward: if it calls out to you, investigate and, if you please, report. What feels less clear to me is how I, as an individual or as a member of some broader rat community, should respond when, according to me, people's claims do not pass certain forms of bullshit tests.
This comes from someone with little interest in or knowledge of the former, but after accidentally stumbling into some Tulpa-related territory and bumbling around in it for a while, I found that the Internal Family Systems model captures a large part of what I was grasping towards, this time with testable predictions and the whole deal.
I haven't given the individual-as-part-of-community thing that much thought, but my intuition is that I would make a poor judge of when to say "nope, your thing is BS", and I'm not sure what metric we might use to figure out who would make a better judge besides overall faith in reasoning capability.
Very fair observation; my take is that a relevant continuation is occurring under OpenAI Alignment Science, but I would be interested in counterpoints - the main claim I am gesturing towards here is that the agenda is alive in other parts of the community, despite the previous flagship (and the specific team) going down.