I think the level of disagreement among the experts implies that there is quite a lot of uncertainty so the key question is how to steer the future toward better outcomes while reasoning and acting under substantial uncertainty.
The framing I currently like best is from Chris Olah’s thread on probability mass over difficulty levels.
The idea is that you have initial uncertainty and a distribution that assigns probability mass to different levels of alignment difficulty.
The goal is to develop new alignment techniques that "eat marginal probability" where over time the most effective alignment and safety techniques can handle the optimistic easy cases, and then the medium and hard cases and so on. I also think the right approach is to think in terms of which actions would have positive expected value and be beneficial across a range of different possible scenarios.
Meanwhile the goal should be to acquire new evidence that would help reduce uncertainty and concentrate probability mass on specific possibilities. I think the best way to do this is to use the scientific method to proposed hypotheses and then test them experimentally.
Thank's for pointing that out and for the linked post!
I'd say the conclusion is probably the weakest part of the post because after describing the IABIED view and the book's critics I found it hard to reconcile the two views.
I tried getting Gemini to write the conclusion but what it produced seemed even worse: it suggested that we treat AI like any other technology (e.g. cars, electricity) where doomsday forecasts are usually wrong and the technology can be made safe in an iterative way which seems too optimistic to me.
I think my conclusion was an attempt to find a middle ground between the authors of IABIED and the critics by treating AI as a risky but not world-ending technology.
(I'm still not sure what the conclusion should be)
Yeah that's probably true and it reminds me of Plank's principle. Thanks for sharing your experience.
I like to think that this doesn't apply to me and that I would change my mind and adopt a certain view if a particularly strong argument or piece of evidence supporting that view came along.
It's about having a scout mindset and not a soldier mindset: changing your mind is not defeat, it's a way of getting closer to the truth.
I like this recent tweet from Sahil Bloom:
I’m increasingly convinced that the willingness to change your mind is the ultimate sign of intelligence. The most impressive people I know change their minds often in response to new information. It’s like a software update. The goal isn't to be right. It's to find the truth.
The book Superforecasting also has as similar idea: the best superforecasters are really good and constantly updating based on new information:
The strongest predictor of rising into the ranks of superforecasters is perpetual beta, the degree to which one is committed to belief updating and self-improvement. It is roughly three times as powerful a predictor as its closest rival, intelligence.
Yes, I agree [1]. At first, I didn't consider writing this post because I assumed someone else would write a post like it first. The goal was to write a thorough summary of the book's arguments and then analyze them and the counterarguments in a rigorous and unbiased way. I didn't find a review that did this so I wrote this post.
Usually a book review just needs to give a brief summary so that readers can decide whether or not they are interested in reading the book and there are a few IABIED book reviews like this.
But this post is more like an analysis of the arguments and counterarguments than a book review. I wanted a post like this because the book's arguments have really high stakes and it seems like the right thing to do for a third party to review and analyze the arguments in a rigorous and high-quality way.
Though I may be biased.
For human learning, the outer objective in the brain is maximizing hard-coded reward signals and minimizing pain and the brain's inner objectives are the specific habits and values that determine behavior directly and are somewhat aligned with the goals of maximizing pleasure and minimizing pain.
I agree with your high-level view that is something like "If you created a complex system you don't understand then you will likely get unexpected undesirable behavior from it."
That said, I think the EM phenomenon provides a lot of insights that are a significant update on commonly accepted views on AI and AI alignment from several years ago:
Yes I agree with RLHF hacking and emergent misalignment are both examples of unexpected generalization but they seem different in a lot of ways.
The process of producing a sycophantic AI looks something like this:
1. Some users like sycophantic responses that agree with the user and click thumbs up on these responses.
2. Train the RLHF reward model on this data.
3. Train the policy model using the reward model. The policy model learns to get high reward for producing responses that seem to agree with the user (e.g. starting sentences with "You're right!").
In the original emergent misalignment paper they:
1. Finetune the model to output insecure code
2. After fine-tuning, the model exhibits broad misalignment, such as endorsing extreme views (e.g. AI enslaving humans) or behaving deceptively, even on tasks unrelated to code.
Sycophantic AI doesn't seem that surprising because it's a special case of reward hacking in the context of LLMs and reward hacking isn't new. For example, reward hacking has been observed since the cost runners boat reward hacking example in 2016.
In the case of emergent misalignment (EM), there is unexpected generalization but EM seems different to sycophancy or reward hacking and more novel for several reasons:
OpenAI wrote in a blog post that the GPT-4o sycophancy in early 2025 was a consequence of excessive reward hacking on thumbs up/thumbs down data from users.
I think "Sydney" Bing and Grok MechaHitler are best explained as examples of misaligned personas that exist in those LLMs. I'd consider misaligned personas (related to emergent misalignment) to be a novel alignment failure mode that is unique to LLMs and their tendency to act as simulators of personas.
The AI company Mechanize posted a blog post in November called "Unfalsifiable stories of doom" which is a high-quality critique of AI doom and IABIED that I haven't seen shared or discussed on LessWrong yet.
Link: https://www.mechanize.work/blog/unfalsifiable-stories-of-doom/
AI summary: the evolution→gradient descent analogy breaks down because GD has fine-grained parameter control unlike evolution's coarse genomic selection; the "alien drives" evidence (Claude scheming, Grok antisemitism, etc.) actually shows human-like failure modes; and the "one try" framing ignores that iterative safety improvements have empirically worked (GPT-1→GPT-5).
After reading The Adolescence of Technology by Dario Amodei I now think that Amodei is one the most realistic and calibrated AI leaders on the future of AI.