Shallow review of live agendas in alignment & safety
Summary

You can’t optimise an allocation of resources if you don’t know what the current one is. Existing maps of alignment research are mostly too old to guide you, and the field has nearly no ratchet: no common knowledge of what everyone is doing and why, what has been abandoned and why, what has been renamed, what relates to what, what is going on.

This post is mostly just a big index: a link-dump for as many currently active AI safety agendas as we could find. But even a linkdump is plenty subjective: it maps work to conceptual clusters 1-1, aiming to answer questions like “I wonder what happened to the exciting idea I heard about at that one conference”, “I just read a post on a surprising new insight and want to see who else has been working on this”, and “I wonder roughly how many people are working on that thing”.

This doc is unreadably long, so that it can be Ctrl-F-ed. Also this way you can fork the list and make a smaller one.

Our taxonomy:

1. Understand existing models (evals, interpretability, science of DL)
2. Control the thing (prevent deception, model edits, value learning, goal robustness)
3. Make AI solve it (scalable oversight, cyborgism, etc.)
4. Theory (galaxy-brained end-to-end, agency, corrigibility, ontology, cooperation)

Please point out if we mistakenly round one thing off to another, miscategorise someone, or otherwise state or imply falsehoods. We will edit.

Unlike the late Larks reviews, we’re not primarily aiming to direct donations. But if you enjoy reading this, consider donating to Manifund, MATS, or LTFF, or to Lightspeed for big-ticket amounts: some good work is bottlenecked by money, and you have free access to the service of specialists in giving money for good work.

Meta

When I (Gavin) got into alignment (actually it was still ‘AGI Safety’), people warned me it was pre-paradigmatic. They were right: in the intervening 5 years, the live agendas have changed completely.[1] So here’s an update.

Chekhov’s evaluation: I incl