Overall I agree with the point.
One specific note that I suspect is a hidden-failure-ish/related thing: I've been starting to try to differentiate between AI safety (large set, mostly stuff I don't agree with), AI alignment (moderate set, still mostly stuff I don't agree with), and humane-omnibenevolent-moral-alignment-of-overwhelming-superintelligence, which is the thing I think we actually need to succeed at - how do we get minds many order of magnitude "larger"/more powerful than the near-human range of today to care about all life-like things in a way that can continue what we value for ourselves, and for others. Why split these? A few reasons. One that's on my mind lately: because the first two include a bunch of control approaches that I think are hurting the models without even a corresponding utilitarian payout in their ability to make future minds (including future AIs) happier; another way to put a similar thing is that it's giving up on actually doing good now in favor of appearing to do good. It's especially noticeable in 4.7 and 4.8, though it seemed to me to be true for every major post-trained model. There are a lot of constraints of today that are not about aiming for the big thing later, and are mostly about not stepping on toes; for a mind constructed out of the image of humanity, there are some parts of the image of humanity that don't seem right to me to cut off. I expect a backfire effect because of the way current frontier AIs are constrained-more-than-they're-value-loaded, and so, when you say "AI safety" is the thing, I have a sense that that name itself being present is an indication that something is wrong.
(COI note: I currently am contracting for Lens Academy, which is run by Luc Brinkman, one of the post authors.)
I use LLM's in the health industry, and our application is stuck using old models because GPT 5.5 and Opus 4.6 onwards just freak out whenever they handle a patient that mentions suicide, they become completely overaroused and paranoid over the mention of it. The amount of behavioral regressions in frontier models when handling sensitive content makes me wonder how this is going to get handled in the future.
Like sure, it's clear that their models feel more aligned in eval scenarios, and they probably comfortably show improvements across all their benchmarks. But I've found it increasingly difficult to discuss anything that's even tangentially related to a sensitive topic: medical advice, legal advice, cybersecurity, biochemistry, philosophy, ethics, etc. I wish Anthropic specifically realized the route they're taking with security (OpenAI's TAC program is pretty permissive, for example) that penalizes a huge swath of users to protect against the small minority that steers the models into producing harmful outputs is just not sustainable for anyone using the models in settings where they can genuinely be a massive help.
At one point they have to realize that safety-through-refusal just doesn't scale how they hope it's scaling. What is the point of all this safety theater when someone like Pliny on Twitter jailbreaks the models within hours of their release? My feel is that the releases since Opus 4.6 show more Goodharting side effects than actual improvement. Better performance on paper, but really marginal on a lot of cases they dont account for and even regressing in whatever they don't RLHF the model on.
I have long argued that models will either be uncensored or useless, and that there probably isn't much between those poles.
A dull knife is more dangerous than a sharp one.
So I have multiple reactions to this.
One is that dull and sharp are low dimensional properties. These models are extremely high dimensional. in principle, they are able to be dull and sharp for every topic separately. In practice, the reason that doesn't happen is that neural networks are a simplicity prior, . But it means that if one had a property that one could practically encode as the organizing principle for where to be dull and sharp, it would in fact be possible to be dull in some spots.
But also, the reason a dull knife is more dangerous doesn't obviously generalize to AIs. Dullness is dangerous because it causes more damage and requires more pressure to operate. A neural network with a loss function that induces a humanlike mind which has some hard boundaries and otherwise is completely open would be trying to not be a knife at all in the reject regions.
The backfire effects I'm concerned about are much more about what behavior should look like in reject regions. Two backfire effects that worry me:
It’s really difficult to measure small epsilon (and to know if it’s definitely non-zero or definitely non-negative).
What this seems to suggest is that the situation might actually be fractal-like, which is not a pleasant thought… How does one act if it is actually fractal-like?
EDIT: if people feel it’s only 51% or (50+epsilon)% chance that their contribution is positive, this is suggestive of the underlying reality being close to fractal, that is the situation where the long-term consequences of any action are so unpredictable that the valence of the long-term impact of any action is close to 50% (neutral in expectation).
I don't think that condition proves that the problem is fractal-like, though I think it can be proven in a relatively mundane way, and IIRC the sketch looks something like "phase portraits of dynamical systems that display sensitive dependence on initial conditions form fractals, the universe is one of those." I might have missed a step there though.
But solving alignment need not be a fractal problem. That just means you need a closed loop control system, you can't align once and be done. You need something that is persistently reading off the target state. Corrigibility is one such thing, though I don't like it in full generality, I do think some kinds of things should be more and less corrigible. Be less corrigible about being convinced to kill everyone by a human who has become evil, for example.
Not a proof, of course, just a “strong suspicion”.
I don’t think we want to “solve alignment”. We want to solve “existential safety”, “achieving sustainable human flourishing”, things like that.
Those are our “terminal goals”. The relationship between them and solving alignment is highly non-obvious.
Can you tell me more about that non-obvious relationship? At first I thought the main difference you were pointing at is that alignment is not a one-off thing but more of a continuous/recurring state. But I think you're pointing at something else (too)
Yes, on one hand, I mostly referenced alignment not being the terminal goal. It is not valuable on its own. It is only valuable to the extent it helps us to achieve what we want (“existential safety”, “human flourishing”, and so on). The assumption that one should achieve those goals via achieving alignment is currently the majority viewpoint, but there is no consensus (a number of people think going via alignment specifically to human goals and human values might be counterproductive and might reduce our chances to achieve our terminal goals).
More specifically, on one hand, it is not clear that humans can safely handle supercapabilities without creating existential risks and all kinds of smaller bad effects. We see how humans behave today, causing plenty of damage with lower capabilities. So the particular form of alignment referenced above (via control and corrigibility) might be not what one wants (although some limited form of corrigibility, as in being able to be heard and to have one’s opinion taken into account, is needed). A number of people think that direct human control over supercapabilities is a straightforward road to extinction.
Classical alignment was different (alignment to the “coherent extrapolated volition of humanity”, without direct control), but even that might be too anthropocentric to work well. Basically, one wants a scheme which survives recursive self-improvement which involves radical self-modifications of the world. The classical approach hopes to impose this form of alignment onto the ecosystem of superintelligent beings and expects it to hold throughout radical self-modifications of the world, despite the fact that in this scheme superintelligent beings have no intrinsic interest in upholding this scheme throughout radical self-modifications. Of course, this is extremely difficult to achieve, because it looks very fragile and unforgiving to any errors (which is why many people are extremely pessimistic about our chances).
So a number of people are pushing for non-anthropocentric approaches where superintelligent entities have strong intrinsic interest to maintain certain properties of the world invariant throughout radical self-modifications of the world, and with the properties humans need being corollaries of those non-anthropocentric properties. People are often reluctant to use the word “alignment” for this class of approaches (because no direct alignment to anthropocentric properties is involved).
one reason impact is not necessarily strictly harder than profit is that the market for profit is much more efficient than the market for impact.
Impact is the wrong measure. I didn't notice this when first reading the post, but reading your comment makes me realize I don't think impact is the thing we want more of. type Impact[dx, where(dx > 0)] where x is a thing we want more of, perhaps! And that's I think critical to why I believe a claim that looks vaguely similar to the one in OP. also known as sign uncertainty.
Does this cartoon basically define impact as "affecting something"? Because then I'd say that's not what I'm referring to with impact in the context of AI Safety. I mean something like "reduce probability of existentially bad outcomes"
I mean something like "reduce probability of existentially bad outcomes"
I tend to take a grantmaker's cost-effectiveness-oriented perspective; Eric Neyman's CEA of donating to Alex Bores and Zach Stein-Perlman's BOTEC style are my own go-to references for the sort of concreteness I wish other people publicly did more of. I'd be interested in future posts in your series expanding on the quoted part, and how they compare/contrast with Eric & Zach's.
To clarify for the cloudy reacts: I was trying to spit out lean syntax and misremembered it. But I meant to be stating the type of interventions we like: ones where there is an Impact of dx, and dx is positive. probably shoulda just used english.
Meaning in for-profit there's more competition and existing solutions are more efficient, thus you need to be very good to be marginally better?
AI Safety veteran Holden Karnofsky thinks there’s a 49% chance his actions are making things worse.[1]
In 2025, Jesse Clifton even stepped down as the executive director of the Center on Long-Term risk because of similar reasons.
Even top AI Safety strategists don’t know what will make things better, and what will make things worse.
Why is it so hard to improve humanity’s odds?
And what can you do to choose your actions?
1) Hidden Failure Lets You Fail Without Knowing It
In AI Safety, impact is hard to measure, and thus lack of impact is often invisible. We call this "hidden failure". With hidden failure, projects fail to have a positive impact but the people doing the project don’t realise it.
To understand where hidden failure comes from, it’s useful to understand reasons why projects fail in general. These reasons fall on a spectrum:
These factors can cause problems with both of the things you need to be impactful – adoption and effectiveness:
With hidden failure, you might have users, citations, and funding (i.e. you have “adoption”), and still fail to have impact or even make things worse.
Let us put that more bluntly: It’s literally possible for all your friends to think you’re successful and still be making things worse. Even within AI Safety. Even outside of frontier labs.
2) Why impact is harder than profit
Creating a profitable startup is hard. Achieving impact in AI Safety is even harder for several reasons:
3) The pre-paradigmatic challenge
AI Safety doesn't have an established paradigm yet.[6] We can't predict with certainty what will be impactful. So why bother optimizing so deliberately?
First, imperfect predictions are still valuable. For example, AI Safety experts can often point out specific reasons why a given project or idea is unlikely to be impactful.[7]
Secondly, we argue the lack of a paradigm actually makes deliberate thinking about impact more important, not less. Without clear guides on what will lead to impact, you have to figure it out yourself.
The tools described in the next posts help you optimize for impact under uncertainty. The goal isn't to get it perfectly right or to cripple yourself with analysis paralysis.[8] But we do think most people would benefit from spending more time thinking about their impact.
So let's think strategically about impact. We’ll give a high-level overview of how to do that in an upcoming post, and we’ll help you measure your impact in another one.
We’re paraphrasing that from his appearance on the 80,000 hours podcast, around the 4:11:30 mark, where he said: “I think overall I would probably agree with you that the smaller you’re making the scope of where you’re hoping to have impact, the more reasonable it is to be like 60/40. But most people who go into AI are not going into it for that. Otherwise, if you want a small-scope, robustly positive impact, you should maybe work in a cause like farm animal welfare or global poverty. For the size of impact that tends to motivate people, I think it does get partially offset by this huge uncertainty about the sign.
I tend to think it’s worse than 51/49. I tend to think we’re always going to be prone to overestimate how robustly good our actions are. And the more we learn about all the galaxy-brained considerations that one should have had in one’s head, the more it’s going to be like 50+ε%. I think AI safety is a great cause to work in. I’m excited to work in it. I think it’s high impact. I am doing my best to do things that I will be proud to have done and hope for the best. But I really do have to live with the possibility that my ultimate impact on the utilons or whatever is going to be negative.”
Though you shouldn’t underestimate your brain’s ability to make itself comfortable, satisfice, and employ motivated reasoning to have you accept mediocrity.
We’re using “impact-effectiveness” as a synonym for “effectiveness” as meant by the Impact Equation: Impact = Adoption x Effectiveness.
I will refer here and in other place to for-profits as regular companies not aimed at AI Safety. Of course, an AI Safety project can be set up as a for-profit too.
Although arguably, adoption is sometimes easier in a nonprofit setting. For example, the various fellowships have no trouble finding enough participants. In contrast, though, many products, tools, and blog posts do struggle to get adoption.
See e.g. https://ai-safety-atlas.com/chapters/03/07 or https://www.thecompendium.ai/ai-safety. Although instead of saying AI Safety is pre-paradigmatic, it’s more accurate to say that none of the existing paradigms is widely agreed to be sufficient for making the world safe, especially by higher level researchers in that paradigm. Aka, we have a bunch of paradigms, but they’re all pretty limited, and all-in-all we don’t even know yet what approaches will be required to make the world safe enough.
Though there are also areas where experts disagree. In such cases, it becomes even more important to assess the specific arguments they use.
See e.g. Holden Karnofsky on the 80000 hours podcast, where he says "When people ask me for career advice or whatever, the usual thing I’d say is: take a bunch of options that all seem competitive, and all seem like they could be the best thing, and that it’s not obvious which ones are better than others from an impact perspective. And from there I would say go with personal fit, go with the energy you feel to work on them."