The Best of LessWrong

Here you can find the best posts of LessWrong. Once posts are more than a year old, the LessWrong community reviews and votes on how well they have stood the test of time. These are the posts that have ranked highest across all years since 2018 (when our annual tradition of choosing the least wrong of LessWrong began).

For the years 2018, 2019 and 2020 we also published physical books with the results of our annual vote, which you can buy and learn more about here.

Rationality

Eliezer Yudkowsky
Local Validity as a Key to Sanity and Civilization
Buck
"Other people are wrong" vs "I am right"
Mark Xu
Strong Evidence is Common
johnswentworth
You Are Not Measuring What You Think You Are Measuring
johnswentworth
Gears-Level Models are Capital Investments
Hazard
How to Ignore Your Emotions (while also thinking you're awesome at emotions)
Scott Garrabrant
Yes Requires the Possibility of No
Scott Alexander
Trapped Priors As A Basic Problem Of Rationality
Duncan Sabien (Deactivated)
Split and Commit
Ben Pace
A Sketch of Good Communication
Eliezer Yudkowsky
Meta-Honesty: Firming Up Honesty Around Its Edge-Cases
Duncan Sabien (Deactivated)
Lies, Damn Lies, and Fabricated Options
Duncan Sabien (Deactivated)
CFAR Participant Handbook now available to all
johnswentworth
What Are You Tracking In Your Head?
Mark Xu
The First Sample Gives the Most Information
Duncan Sabien (Deactivated)
Shoulder Advisors 101
Zack_M_Davis
Feature Selection
abramdemski
Mistakes with Conservation of Expected Evidence
Scott Alexander
Varieties Of Argumentative Experience
Eliezer Yudkowsky
Toolbox-thinking and Law-thinking
alkjash
Babble
Kaj_Sotala
The Felt Sense: What, Why and How
Duncan Sabien (Deactivated)
Cup-Stacking Skills (or, Reflexive Involuntary Mental Motions)
Ben Pace
The Costly Coordination Mechanism of Common Knowledge
Jacob Falkovich
Seeing the Smoke
Elizabeth
Epistemic Legibility
Daniel Kokotajlo
Taboo "Outside View"
alkjash
Prune
johnswentworth
Gears vs Behavior
Raemon
Noticing Frame Differences
Duncan Sabien (Deactivated)
Sazen
AnnaSalamon
Reality-Revealing and Reality-Masking Puzzles
Eliezer Yudkowsky
ProjectLawful.com: Eliezer's latest story, past 1M words
Eliezer Yudkowsky
Self-Integrity and the Drowning Child
Jacob Falkovich
The Treacherous Path to Rationality
Scott Garrabrant
Tyranny of the Epistemic Majority
alkjash
More Babble
abramdemski
Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Schelling Problems
Raemon
Being a Robust Agent
Zack_M_Davis
Heads I Win, Tails?—Never Heard of Her; Or, Selective Reporting and the Tragedy of the Green Rationalists
Benquo
Reason isn't magic
habryka
Integrity and accountability are core parts of rationality
Raemon
The Schelling Choice is "Rabbit", not "Stag"
Diffractor
Threat-Resistant Bargaining Megapost: Introducing the ROSE Value
Raemon
Propagating Facts into Aesthetics
johnswentworth
Simulacrum 3 As Stag-Hunt Strategy
LoganStrohl
Catching the Spark
Jacob Falkovich
Is Rationalist Self-Improvement Real?
Benquo
Excerpts from a larger discussion about simulacra
Zvi
Simulacra Levels and their Interactions
abramdemski
Radical Probabilism
sarahconstantin
Naming the Nameless
AnnaSalamon
Comment reply: my low-quality thoughts on why CFAR didn't get farther with a "real/efficacious art of rationality"
Eric Raymond
Rationalism before the Sequences
Owain_Evans
The Rationalists of the 1950s (and before) also called themselves “Rationalists”

Optimization

sarahconstantin
The Pavlov Strategy
johnswentworth
Coordination as a Scarce Resource
AnnaSalamon
What should you change in response to an "emergency"? And AI risk
Zvi
Prediction Markets: When Do They Work?
johnswentworth
Being the (Pareto) Best in the World
alkjash
Is Success the Enemy of Freedom? (Full)
jasoncrawford
How factories were made safe
HoldenKarnofsky
All Possible Views About Humanity's Future Are Wild
jasoncrawford
Why has nuclear power been a flop?
Zvi
Simple Rules of Law
Elizabeth
Power Buys You Distance From The Crime
Eliezer Yudkowsky
Is Clickbait Destroying Our General Intelligence?
Scott Alexander
The Tails Coming Apart As Metaphor For Life
Zvi
Asymmetric Justice
Jeffrey Ladish
Nuclear war is unlikely to cause human extinction
Spiracular
Bioinfohazards
Zvi
Moloch Hasn’t Won
Zvi
Motive Ambiguity
Benquo
Can crimes be discussed literally?
Said Achmiz
The Real Rules Have No Exceptions
Lars Doucet
Lars Doucet's Georgism series on Astral Codex Ten
johnswentworth
When Money Is Abundant, Knowledge Is The Real Wealth
HoldenKarnofsky
This Can't Go On
Scott Alexander
Studies On Slack
johnswentworth
Working With Monsters
jasoncrawford
Why haven't we celebrated any major achievements lately?
abramdemski
The Credit Assignment Problem
Martin Sustrik
Inadequate Equilibria vs. Governance of the Commons
Raemon
The Amish, and Strategic Norms around Technology
Zvi
Blackmail
KatjaGrace
Discontinuous progress in history: an update
Scott Alexander
Rule Thinkers In, Not Out
Jameson Quinn
A voting theory primer for rationalists
HoldenKarnofsky
Nonprofit Boards are Weird
Wei Dai
Beyond Astronomical Waste
johnswentworth
Making Vaccine
jefftk
Make more land

World

Ben
The Redaction Machine
Samo Burja
On the Loss and Preservation of Knowledge
Alex_Altair
Introduction to abstract entropy
Martin Sustrik
Swiss Political System: More than You ever Wanted to Know (I.)
johnswentworth
Interfaces as a Scarce Resource
johnswentworth
Transportation as a Constraint
eukaryote
There’s no such thing as a tree (phylogenetically)
Scott Alexander
Is Science Slowing Down?
Martin Sustrik
Anti-social Punishment
Martin Sustrik
Research: Rescuers during the Holocaust
GeneSmith
Toni Kurz and the Insanity of Climbing Mountains
johnswentworth
Book Review: Design Principles of Biological Circuits
Elizabeth
Literature Review: Distributed Teams
Valentine
The Intelligent Social Web
jacobjacob
Unconscious Economics
eukaryote
Spaghetti Towers
Eli Tyre
Historical mathematicians exhibit a birth order effect too
johnswentworth
What Money Cannot Buy
Scott Alexander
Book Review: The Secret Of Our Success
johnswentworth
Specializing in Problems We Don't Understand
KatjaGrace
Why did everything take so long?
Ruby
[Answer] Why wasn't science invented in China?
Scott Alexander
Mental Mountains
Kaj_Sotala
My attempt to explain Looking, insight meditation, and enlightenment in non-mysterious terms
johnswentworth
Evolution of Modularity
johnswentworth
Science in a High-Dimensional World
zhukeepa
How uniform is the neocortex?
Kaj_Sotala
Building up to an Internal Family Systems model
Steven Byrnes
My computational framework for the brain
Natália
Counter-theses on Sleep
abramdemski
What makes people intellectually active?
Bucky
Birth order effect found in Nobel Laureates in Physics
KatjaGrace
Elephant seal 2
JackH
Anti-Aging: State of the Art
Vaniver
Steelmanning Divination
Kaj_Sotala
Book summary: Unlocking the Emotional Brain

AI Strategy

Ajeya Cotra
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Daniel Kokotajlo
Cortés, Pizarro, and Afonso as Precedents for Takeover
Daniel Kokotajlo
The date of AI Takeover is not the day the AI takes over
paulfchristiano
What failure looks like
Daniel Kokotajlo
What 2026 looks like
gwern
It Looks Like You're Trying To Take Over The World
Andrew_Critch
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)
paulfchristiano
Another (outer) alignment failure story
Ajeya Cotra
Draft report on AI timelines
Eliezer Yudkowsky
Biology-Inspired AGI Timelines: The Trick That Never Works
HoldenKarnofsky
Reply to Eliezer on Biological Anchors
Richard_Ngo
AGI safety from first principles: Introduction
Daniel Kokotajlo
Fun with +12 OOMs of Compute
Wei Dai
AI Safety "Success Stories"
KatjaGrace
Counterarguments to the basic AI x-risk case
johnswentworth
The Plan
Rohin Shah
Reframing Superintelligence: Comprehensive AI Services as General Intelligence
lc
What an actually pessimistic containment strategy looks like
Eliezer Yudkowsky
MIRI announces new "Death With Dignity" strategy
evhub
Chris Olah’s views on AGI safety
So8res
Comments on Carlsmith's “Is power-seeking AI an existential risk?”
Adam Scholl
Safetywashing
abramdemski
The Parable of Predict-O-Matic
KatjaGrace
Let’s think about slowing down AI
nostalgebraist
human psycholinguists: a critical appraisal
nostalgebraist
larger language models may disappoint you [or, an eternally unfinished draft]
Daniel Kokotajlo
Against GDP as a metric for timelines and takeoff speeds
paulfchristiano
Arguments about fast takeoff
Eliezer Yudkowsky
Six Dimensions of Operational Adequacy in AGI Projects

Technical AI Safety

Andrew_Critch
Some AI research areas and their relevance to existential safety
1a3orn
EfficientZero: How It Works
elspood
Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment
So8res
Decision theory does not imply that we get to have nice things
TurnTrout
Reward is not the optimization target
johnswentworth
Worlds Where Iterative Design Fails
Vika
Specification gaming examples in AI
Rafael Harth
Inner Alignment: Explain like I'm 12 Edition
evhub
An overview of 11 proposals for building safe advanced AI
johnswentworth
Alignment By Default
johnswentworth
How To Go From Interpretability To Alignment: Just Retarget The Search
Alex Flint
Search versus design
abramdemski
Selection vs Control
Mark Xu
The Solomonoff Prior is Malign
paulfchristiano
My research methodology
Eliezer Yudkowsky
The Rocket Alignment Problem
Eliezer Yudkowsky
AGI Ruin: A List of Lethalities
So8res
A central AI alignment problem: capabilities generalization, and the sharp left turn
TurnTrout
Reframing Impact
Scott Garrabrant
Robustness to Scale
paulfchristiano
Inaccessible information
TurnTrout
Seeking Power is Often Convergently Instrumental in MDPs
So8res
On how various plans miss the hard bits of the alignment challenge
abramdemski
Alignment Research Field Guide
paulfchristiano
The strategy-stealing assumption
Veedrac
Optimality is the tiger, and agents are its teeth
Sam Ringer
Models Don't "Get Reward"
johnswentworth
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables
Buck
Language models seem to be much better than humans at next-token prediction
abramdemski
An Untrollable Mathematician Illustrated
abramdemski
An Orthodox Case Against Utility Functions
johnswentworth
Selection Theorems: A Program For Understanding Agents
Rohin Shah
Coherence arguments do not entail goal-directed behavior
Alex Flint
The ground of optimization
paulfchristiano
Where I agree and disagree with Eliezer
Eliezer Yudkowsky
Ngo and Yudkowsky on alignment difficulty
abramdemski
Embedded Agents
evhub
Risks from Learned Optimization: Introduction
nostalgebraist
chinchilla's wild implications
johnswentworth
Why Agent Foundations? An Overly Abstract Explanation
zhukeepa
Paul's research agenda FAQ
Eliezer Yudkowsky
Coherent decisions imply consistent utilities
paulfchristiano
Open question: are minimal circuits daemon-free?
evhub
Gradient hacking
janus
Simulators
LawrenceC
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
TurnTrout
Humans provide an untapped wealth of evidence about alignment
Neel Nanda
A Mechanistic Interpretability Analysis of Grokking
Collin
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
evhub
Understanding “Deep Double Descent”
Quintin Pope
The shard theory of human values
TurnTrout
Inner and outer alignment decompose one hard problem into two extremely hard problems
Eliezer Yudkowsky
Challenges to Christiano’s capability amplification proposal
Scott Garrabrant
Finite Factored Sets
paulfchristiano
ARC's first technical report: Eliciting Latent Knowledge
Diffractor
Introduction To The Infra-Bayesianism Sequence
#1

A few dozen reasons that Eliezer thinks AGI alignment is an extremely difficult problem, which humanity is not on track to solve.

13 Ben Pace
+9. This is a powerful set of arguments pointing out how humanity will literally go extinct soon due to AI development (or have something similarly bad happen to us). A lot of thought and research went into developing an understanding of the problem deep enough to produce this level of clarity about the problems we face, and I'm extremely glad it was written up.
#2

"Wait, dignity points?" you ask.  "What are those?  In what units are they measured, exactly?"

And to this I reply:  "Obviously, the measuring units of dignity are over humanity's log odds of survival - the graph on which the logistic success curve is a straight line.  A project that doubles humanity's chance of survival from 0% to 0% is helping humanity die with one additional information-theoretic bit of dignity."

"But if enough people can contribute enough bits of dignity like that, wouldn't that mean we didn't die at all?"  "Yes, but again, don't get your hopes up."

13 johnswentworth
Based on occasional conversations with new people, I would not be surprised if a majority of people who got into alignment between April 2022 and April 2023 did so mainly because of this post. Most of them say something like "man, I did not realize how dire the situation looked" or "I thought the MIRI folks were on it or something".
#3

Paul writes a list of 19 important places where he agrees with Eliezer on AI existential risk and safety, and a list of 27 places where he disagrees. He argues that Eliezer has raised many good considerations backed by pretty clear arguments, but that Eliezer makes confident assertions that are much stronger than anything suggested by the actual arguments.

11 Jan_Kulveit
This is a great complement to Eliezer's 'List of lethalities', in particular because, in cases of disagreement, the beliefs of most people working on the problem were, and still mostly are, closer to this post. Paul writing it provided a clear, well-written reference point, and, with many others expressing their views in comments and other posts, helped make the beliefs in AI safety more transparent. I still occasionally reference this post when talking to people who, after reading a bit about the debate (e.g. on social media), first form an oversimplified model of the debate in which there is some unified 'safety' camp vs. 'optimists'. Also, I think this demonstrates that 'just stating your beliefs' in a moderately-dimensional projection can be a useful type of post, even without much justification.
10 Vanessa Kosoy
I wrote a review here. There, I identify the main generators of Christiano's disagreement with Yudkowsky[1] and add some critical commentary. I also frame it in terms of a broader debate in the AI alignment community. 1. ^ I divide those into "takeoff speeds", "attitude towards prosaic alignment" and "the metadebate" (the last one is about what kind of debate norms we should have about this, or what kind of arguments we should listen to).
#4

Historically people worried about extinction risk from artificial intelligence have not seriously considered deliberately slowing down AI progress as a solution. Katja Grace argues this strategy should be considered more seriously, and that common objections to it are incorrect or exaggerated. 

17 Eli Tyre
This was counter to the prevailing narrative at the time, and I think did some of the work of changing the narrative. It's of historical significance, if nothing else.
16 habryka
I think it's a bit hard to tell how influential this post has been, though my best guess is "very". It's clear that sometime around when this post was published there was a pretty large shift in the strategies that I and a lot of other people pursued, with "slowing down AI" becoming a much more common goal for people to pursue. I think (most of) the arguments in this post are good. I also think that when I read an initial draft of this post (around 1.5 years ago or so), and had a very hesitant reaction to the core strategy it proposes, that I was picking up on something important, and that I do also want to award Bayes points to that part of me given how things have been playing out so far.  I do think that since I've seen people around me adopt strategies to slow down AI, I've seen it done on a basis that feels much more rhetorical, and often directly violates virtues and perspectives that I hold very dearly. I think it's really important to understand that technological progress has been the central driving force behind humanity's success, and that indeed this should establish a huge prior against stopping almost any kind of technological development. In contrast to that, the majority of arguments that I've seen find traction for slowing down AI development are not distinguishable from arguments that apply to a much larger set of technologies which to me clearly do not pose a risk commensurable with the prior we should have against slowdown. Concerns about putting people out of jobs, destroying traditional models of romantic relationships, violating copyright law, spreading disinformation, all seem to me to be the kind of thing that if you buy it, you end up with an argument that proves too much and should end up opposed to a huge chunk of technological progress.  And I can feel the pressure in myself for these things as well. I can see how it would be easy to at least locally gain traction at slowing down AI by allying myself with people who are concerned abo
#5

TurnTrout discusses a common misconception in reinforcement learning: that reward is the optimization target of trained agents. He argues reward is better understood as a mechanism for shaping cognition, not a goal to be optimized, and that this has major implications for AI alignment approaches. 

13 Olli Järviniemi
I view this post as providing value in three (related) ways: 1. Making a pedagogical advancement regarding the so-called inner alignment problem 2. Pointing out that a common view of "RL agents optimize reward" is subtly wrong 3. Pushing for thinking mechanistically about cognition-updates   Re 1: I first heard about the inner alignment problem through Risks From Learned Optimization and popularizations of the work. I didn't truly comprehend it - sure, I could parrot back terms like "base optimizer" and "mesa-optimizer", but it didn't click. I was confused. Some months later I read this post and then it clicked. Part of the pedagogical value is not having to introduce the 4 terms of form [base/mesa] + [optimizer/objective] and throwing those around. Even with Rob Miles' exposition skills that's a bit overwhelming. Another part I liked were the phrases "Just because common English endows “reward” with suggestive pleasurable connotations" and "Let’s strip away the suggestive word “reward”, and replace it by its substance: cognition-updater." One could be tempted to object and say that surely no one would make the mistakes pointed out here, but definitely some people do. I did. Being a bit gloves off here definitely helped me.   Re 2: The essay argues for, well, reward not being the optimization target. There is some deep discussion in the comments about the likelihood of reward in fact being the optimization target, or at least quite close (see here). Let me take a more shallow view. I think there are people who think that reward is the optimization target by definition or by design, as opposed to this being a highly non-trivial claim that needs to be argued for. It's the former view that this post (correctly) argues against. I am sympathetic to pushback of the form "there are arguments that make it reasonable to privilege reward-maximization as a hypothesis" and about this post going a bit too far, but these remarks should not be confused with a rebuttal
10 TurnTrout
Retrospective: I think this is the most important post I wrote in 2022. I deeply hope that more people benefit by fully integrating these ideas into their worldviews. I think there's a way to "see" this lesson everywhere in alignment: for it to inform your speculation about everything from supervised fine-tuning to reward overoptimization. To see past mistaken assumptions about how learning processes work, and to think for oneself instead. This post represents an invaluable tool in my mental toolbelt. I wish I had written the key lessons and insights more plainly. I think I got a bit carried away with in-group terminology and linguistic conventions, which limited the reach and impact of these insights. I am less wedded to "think about what shards will form and make sure they don't care about bad stuff (like reward)", because I think we won't get intrinsically agentic policy networks. I think the most impactful AIs will be LLMs+tools+scaffolding, with the LLMs themselves being "tool AI."
#6

A "good project" in AGI research needs:1) Trustworthy command, 2) Research closure, 3) Strong operational security, 4) Commitment to the common good, 5) An alignment mindset, and 6) Requisite resource levels.

The post goes into detail on what minimal, adequate, and good performance looks like.

#7

A fictional story about an AI researcher who leaves an experiment running overnight.

15 Garrett Baker
Clearly a very influential post on a possible path to doom from someone who knows their stuff about deep learning! There are clear criticisms, but it is also one of the best of its era. It was also useful for even just getting a handle on how to think about our path to AGI.
#8

Ben observes that all his favorite people are great at a skill he's labeled in his head as "staring into the abyss" – thinking reasonably about things that are uncomfortable to contemplate, like arguments against your religious beliefs, or in favor of breaking up with your partner.

15 AprilSR
While the idea that it's important to look at the truth even when it hurts isn't revolutionary in this community, I think this post gave me a much more concrete model of the benefits. Sure, I knew about the abstract arguments that facing the truth is valuable, but I don't know if I'd have identified it as an essential skill for starting a company, or as being a critical component of staying in a bad relationship. (I think my model of bad relationships was that people knew leaving was a good idea, but were unable to act on that information—but in retrospect inability to even consider it totally might be what's going on some of the time.)
#9

Two laws of experiment design: First, you are not measuring what you think you are measuring. Second, if you measure enough different stuff, you might figure out what you're actually measuring.

These have many implications for how to design and interpret experiments.

#10

"Human feedback on diverse tasks" could lead to transformative AI, while requiring little innovation on current techniques. But it seems likely that the natural course of this path leads to full blown AI takeover.

10 Ramana Kumar
I found this post to be a clear and reasonable-sounding articulation of one of the main arguments for there being catastrophic risk from AI development. It helped me with my own thinking to an extent. I think it has a lot of shareability value.
#11

A "sazen" is a word or phrase which accurately summarizes a given concept, while also being insufficient to generate that concept in its full richness and detail, or to unambiguously distinguish it from nearby concepts. It's a useful pointer to the already-initiated, but often useless or misleading to the uninitiated.

21 Screwtape
Many of the best LessWrong posts give a word and a clear mental handle for something I kinda sorta knew loosely in my head. With the concept firmly in mind, I can use it and build on it deliberately. Sazen is an excellent example of the form. Sazens are common in many fields I have some expertise in. "Control the centre of the board" in chess. "Footwork is foundational" in martial arts. "Shots on goal" in sports. "Conservation of expected evidence" in rationality. "Premature optimization is the root of all evil" in programming. These sentences are useful reminders, and while they aren't misleading traps the way "Duncan Sabien is a teacher and a writer" is, they take some practice and experience or at least more detailed teaching to actually turn into something useful. Having the word "Sazen" with this meaning in my head has changed how I write. It shifted my thesis statement from simply being a compressed version of my argument towards being an easy handle to repeat to oneself at need, the same way I might mutter "shots on goal shots on goal" to myself during a hockey game. Sazen is a bit meta: it's not a technique for object-level accomplishments but a technique for how to teach or explain object-level things, but anything that immediately upgrades my own writing is worth a solid upvote. This post also gestures at the important problem of transmitting knowledge. It ultimately doesn't know how to do this, but I especially appreciated the paragraph starting "much of what aggregated wisdom like that seems to do..." for pointing out that this can speed things up even if it can't prevent the first mistake or two. I think this is worth being included in the best of LW collection.
#12

Elizabeth Van Nostrand spent literal decades seeing doctors about digestive problems that made her life miserable. She tried everything and nothing worked, until one day a doctor prescribed 5 different random supplements without thinking too hard about it and one of them miraculously cured her. This has led her to believe that sometimes you need to optimize for luck rather than scientific knowledge when it comes to medicine.

#13

Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. He contends that these concepts decompose one hard problem (AI alignment) into two extremely hard problems, and that they go against natural patterns of cognition formation. He also argues that approaches based on "robust grading" schemes are unlikely to yield aligned AI.

28 Writer
In this post, I appreciated two ideas in particular: 1. Loss as chisel 2. Shard Theory "Loss as chisel" is a reminder of how loss truly does its job, and its implications on what AI systems may actually end up learning. I can't really argue with it and it doesn't sound new to my ear, but it just seems important to keep in mind. Alone, it justifies trying to break out of the inner/outer alignment frame. When I start reasoning in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer nor inner aligned. In practice, it may be unlikely that we get a system like that. Or it may be very likely. I simply don't know. Loss as a chisel just enables me to think better about the possibilities. In my understanding, shard theory is, instead, a theory of how minds tend to be shaped. I don't know if it's true, but it sounds like something that has to be investigated. In my understanding, some people consider it a "dead end," and I'm not sure if it's an active line of research or not at this point. My understanding of it is limited. I'm glad I came across it though, because on its surface, it seems like a promising line of investigation to me. Even if it turns out to be a dead end I expect to learn something if I investigate why that is. The post makes more claims motivating its overarching thesis that dropping the frame of outer/inner alignment would be good. I don't know if I agree with the thesis, but it's something that could plausibly be true, and many arguments here strike me as sensible. In particular, the three claims at the very beginning proved to be food for thought to me: "Robust grading is unnecessary," "the loss function doesn't have to robustly and directly reflect what you want," "inner alignment to a grading procedure is unnecessary, very hard, and anti-natural." I also appreciated the post trying to make sense of inner and outer alignment in very precise terms, keeping in mind how deep learning and
15 PeterMcCluskey
This post is one of the best available explanations of what has been wrong with the approach used by Eliezer and people associated with him. I had a pretty favorable recollection of the post from when I first read it. Rereading it convinced me that I still managed to underestimate it. In my first pass at reviewing posts from 2022, I had some trouble deciding which post best explained shard theory. Now that I've reread this post during my second pass, I've decided this is the most important shard theory post. Not because it explains shard theory best, but because it explains what important implications shard theory has for alignment research. I keep being tempted to think that the first human-level AGIs will be utility maximizers. This post reminds me that maximization is perilous. So we ought to wait until we've brought greater-than-human wisdom to bear on deciding what to maximize before attempting to implement an entity that maximizes a utility function.
#14

Nate Soares reviews a dozen plans and proposals for making AI go well. He finds that almost none of them grapple with what he considers the core problem - capabilities will suddenly generalize way past training, but alignment won't.

26 habryka
I really liked this post in that it seems to me to have tried quite seriously to engage with a bunch of other people's research, in a way that I feel like is quite rare in the field, and something I would like to see more of.  One of the key challenges I see for the rationality/AI-Alignment/EA community is the difficulty of somehow building institutions that are not premised on the quality or tractability of their own work. My current best guess is that the field of AI Alignment has made very little progress in the last few years, which is really not what you might think when you observe the enormous amount of talent, funding and prestige flooding into the space, and the relatively constant refrain of "now that we have cutting edge systems to play around with we are making progress at an unprecedented rate".  It is quite plausible to me that technical AI Alignment research is not a particularly valuable thing to be doing right now. I don't think I have seen much progress, and the dynamics of the field seem to be enshrining an expert class that seems almost ontologically committed to believing that the things they are working on must be good and tractable, because their salary and social standing relies on believing that.  This and a few other similar posts last year are the kind of post that helped me come to understand the considerations around this crucial question better, and where I am grateful that Nate, despite having spent a lot of his life on solving the technical AI Alignment problem, is willing to question the tractability of the whole field. This specific post is more oriented around other people's work, though other posts by Nate and Eliezer are also facing the degree to which their past work didn't make the relevant progress they were hoping for. 
24 Zack_M_Davis
I should acknowledge first that I understand that writing is hard. If the only realistic choice was between this post as it is, and no post at all, then I'm glad we got the post rather than no post. That said, by the standards I hold my own writing to, I would be embarrassed to publish a post like this which criticizes imaginary paraphrases of researchers, rather than citing and quoting the actual text they've actually published. (The post acknowledges this as a flaw, but if it were me, I wouldn't even publish.) The reason I don't think critics necessarily need to be able to pass an author's Ideological Turing Test is because, as a critic, I can at least be scrupulous in my reasoning about the actual text that the author actually published, even if the stereotype of the author I have in my head is faulty. If I can't produce the quotes to show that I'm not just arguing against a stereotype in my head, then it's not clear why the audience should care.
#15

This post explores the concept of simulators in AI, particularly self-supervised models like GPT. Janus argues that GPT and similar models are best understood as simulators that can generate various simulacra, not as agents themselves. This framing helps explain many counterintuitive properties of language models. Powerful simulators could have major implications for AI capabilities and alignment.

38 habryka
I've been thinking about this post a lot since it first came out. Overall, I think its core thesis is wrong, and I've seen a lot of people make confident wrong inferences on the basis of it. The core problem with the post was covered by Eliezer's post "GPTs are Predictors, not Imitators" (which was not written, I think, as a direct response, but which still seems to me to convey the core problem with this post): The Simulators post repeatedly alludes to the loss function on which GPTs are trained corresponding to a "simulation objective", but I don't really see why that would be true. It is technically true that a GPT that perfectly simulates earth, including the creation of its own training data set, can use that simulation to get perfect training loss. But actually doing so would require enormous amounts of compute and we of course know that nothing close to that is going on inside of GPT-4. To me, the key feature of a "simulator" would be a process that predicts the output of a system by developing it forwards in time, or some other time-like dimension. The predictions get made by developing an understanding of the transition function of a system between time-steps (the "physics" of the system) and then applying that transition function over and over again until your desired target time. I would be surprised if this is how GPT works internally in its relationship to the rest of the world and how it makes predictions. The primary interesting thing that seems to me true about GPT-4's training objective is that it is highly myopic. Beyond that, I don't see any reason to think of it as particularly more likely to create something that tries to simulate the physics of any underlying system than other loss functions one could choose. When GPT-4 encounters a hash followed by the pre-image of that hash, or a complicated arithmetic problem, or is asked a difficult factual geography question, it seems very unlikely that the way GPT-4 goes about answering that qu
29 janus
I think Simulators mostly says obvious and uncontroversial things, but added to the conversation by pointing them out for those who haven't noticed and introducing words for those who struggle to articulate. IMO people that perceive it as making controversial claims have mostly misunderstood its object-level content, although sometimes they may have correctly hallucinated things that I believe or seriously entertain. Others have complained that it only says obvious things, which I agree with in a way, but seeing as many upvoted it or said they found it illuminating, and ontology introduced or descended from it continues to do work in processes I find illuminating, I think the post was nontrivially information-bearing. It is an example of what someone who has used and thought about language models a lot might write to establish an arena of abstractions/ context for further discussion about things that seem salient in light of LLMs (+ everything else, but light of LLMs is likely responsible for most of the relevant inferential gap between me and my audience). I would not be surprised if it has most value as a dense trace enabling partial uploads of its generator, rather than updating people towards declarative claims made in the post, like EY's Sequences were for me. Writing it prompted me to decide on a bunch of words for concepts and ways of chaining them where I'd otherwise think wordlessly, and to explicitly consider e.g. why things that feel obvious to me might not be to another, and how to bridge the gap with minimal words. Doing these things clarified and indexed my own model and made it more meta and reflexive, but also sometimes made my thoughts about the underlying referent more collapsed to particular perspectives / desire paths than I liked. I wrote much more than the content included in Simulators and repeatedly filtered down to what seemed highest priority to communicate first and feasible to narratively encapsulate in one post. If I tried again now i
#16

Being easy to argue with is a virtue, separate from being correct. When someone makes an epistemically illegible argument, it is very hard to even begin to rebut their arguments because you cannot pin down what their argument even is.

#17

Kelly betting can be viewed as a way of respecting different possible versions of yourself with different beliefs, rather than just a mathematical optimization. This perspective provides some insight into why fractional Kelly betting (betting less aggressively) can make sense, and connects to ideas about bargaining between different parts of yourself. 
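
For concreteness, here is a minimal sketch of the textbook Kelly criterion for a binary bet together with a "fractional Kelly" variant of the kind the summary mentions. This is the standard formula rather than Garrabrant's bargaining derivation, and the example numbers are purely illustrative.

```python
# Illustrative sketch (not from the post): textbook Kelly criterion for a binary bet,
# plus a "fractional Kelly" variant that stakes a fixed multiple of the full-Kelly bet.
def kelly_fraction(p_win: float, net_odds: float) -> float:
    """Fraction of bankroll to stake: f* = p - (1 - p) / b, floored at 0 (don't bet)."""
    return max(0.0, p_win - (1.0 - p_win) / net_odds)

def fractional_kelly(p_win: float, net_odds: float, scale: float = 0.5) -> float:
    """Bet only `scale` times the full-Kelly stake (the less aggressive option discussed above)."""
    return scale * kelly_fraction(p_win, net_odds)

# Example: 60% win probability at even odds (b = 1).
print(kelly_fraction(0.6, 1.0))    # 0.2 -> stake 20% of bankroll
print(fractional_kelly(0.6, 1.0))  # 0.1 -> half-Kelly stakes 10%
```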

27 habryka
I put decent probability on this sequence (of which I think this is the best post) being the most important contribution of 2022. I am however really not confident of that, and I do feel a bit stuck on how to figure out where to apply and how to confirm the validity of ideas in this sequence.  Despite the abstract nature, I think if there are indeed arguments to do something closer to Kelly betting with one's resources, even in the absence of logarithmic returns to investment, then that would definitely have huge effects on how I think about my own life's plans, and about how humanity should allocate its resources.  Separately, I also think this sequence is pushing on a bunch of important seams in my model of agency and utility maximization in a way that I expect to become relevant to understanding the behavior of superintelligent systems, though I am even less confident of this than the rest of this review.  I do feel a sense of sadness that I haven't seen more built on the ideas of this sequence, or seen people give their own take on it. I certainly feel a sense that I would benefit a lot if I saw how the ideas in this sequence landed with people, and would appreciate figuring out the implications of the proof sketches outlined here.
#18

Katja Grace provides a list of counterarguments to the basic case for existential risk from superhuman AI systems. She examines potential gaps in arguments about AI goal-directedness, AI goals being harmful, and AI superiority over humans. While she sees these as serious concerns, she doesn't find the case for overwhelming likelihood of existential risk convincing based on current arguments. 

17 Vika
I think this is still one of the most comprehensive and clear resources on counterpoints to x-risk arguments. I have referred to this post and pointed people to it a number of times. The most useful parts of the post for me were the outline of the basic x-risk case and section A on counterarguments to goal-directedness (this was particularly helpful for my thinking about threat models and understanding agency).
#19

A key skill of many experts (that is often hard to teach) is keeping track of extra information in their head while working. For example, a programmer tracking a Fermi estimate of runtime, or an experienced machine operator tracking the machine's internal state. John suggests asking experts "what are you tracking in your head?"

#20

The field of AI alignment is growing rapidly, attracting more resources and mindshare each year. As it grows, more people will be incentivized to misleadingly portray themselves or their projects as more alignment-friendly than they are. Adam proposes "safetywashing" as the term for this.

23 habryka
I've used the term "safetywashing" at least once every week or two in the last year. I don't know whether I've picked it up from this post, but it still seems good to have an explanation of a term that is this useful and this common that people are exposed to.
#22

Nonprofit boards have great power, but low engagement, unclear responsibility, and no accountability. There's also a shortage of good guidance on how to be an effective board member. Holden gives recommendations on how to do it well, but the whole structure is inherently weird and challenging. 

#23

People worry about agentic AIs with ulterior motives. Some suggest Oracle AI, which only answers questions. But that framing misses where the danger comes from: it killed you because it was optimised. It used an agent because an agent was an effective tool it had on hand.

Optimality is the tiger, and agents are its teeth.

#25

A look at how we can get caught up in the details and lose sight of the bigger picture. By repeatedly asking "what are we really trying to accomplish here?", we can step back and refocus on what's truly important, whether in our careers, health, or life overall.

#26

In worlds where AI alignment can be handled by iterative design, we probably survive. So if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason. John explores several ways that could happen, beyond just fast takeoff and deceptive misalignment. 

#27

Nate Soares explains why he doesn't expect an unaligned AI to be friendly or cooperative with humanity, even if it uses logical decision theory. He argues that even getting a small fraction of resources from such an AI is extremely unlikely. 

29 ryan_greenblatt
IMO, this post makes several locally correct points, but overall fails to defeat the argument that misaligned AIs are somewhat likely to spend (at least) a tiny fraction of resources (e.g., between 1/million and 1/trillion) to satisfy the preferences of currently existing humans. AFAICT, this is the main argument it was trying to argue against, though it shifts to arguing about half of the universe (an obviously vastly bigger share) halfway through the piece.[1] When it returns to arguing about the actual main question (a tiny fraction of resources) at the end here and eventually gets to the main trade-related argument (acausal or causal) in the very last response in this section, it almost seems to admit that this tiny amount of resources is plausible, but fails to update all the way. I think the discussion here and here seems highly relevant and fleshes out this argument to a substantially greater extent than I did in this comment. However, note that being willing to spend a tiny fraction of resources on humans still might result in AIs killing a huge number of humans due to conflict between it and humans or the AI needing to race through the singularity as quickly as possible due to competition with other misaligned AIs. (Again, discussed in the links above.) I think fully misaligned paperclippers/squiggle maximizer AIs which spend only a tiny fraction of resources on humans (as seems likely conditional on that type of AI) are reasonably likely to cause outcomes which look obviously extremely bad from the perspective of most people (e.g., more than hundreds of millions dead due to conflict and then most people quickly rounded up and given the option to either be frozen or killed). I wish that Soares and Eliezer would stop making these incorrect arguments against tiny fractions of resources being spent on the preference of current humans. It isn't their actual crux, and it isn't the crux of anyone else either. (However rhetorically nice it might be.) -------
14 habryka
This is IMO actually a really important topic, and this is one of the best posts on it. I think it probably really matters whether the AIs will try to trade with us or care about our values even if we had little chance of making our actions with regards to them conditional on whether they do. I found the arguments in this post convincing, and have linked many people to it since it came out. 
#28

It's easy and locally reinforcing to follow gradients toward what one might call 'guessing the student's password', and much harder and much less locally reinforcing to reason/test/whatever one's way toward a real art of rationality. Anna Salamon reflects on how this got in the way of CFAR ("Center for Applied Rationality") making progress on their original goals.

11 Screwtape
The thing I want most from LessWrong and the Rationality Community writ large is the martial art of rationality. That was the Sequences post that hooked me, that is the thing I personally want to find if it exists, that is what I thought CFAR as an organization was pointed at. When you are attempting something that many people have tried before- and to be clear, "come up with teachings to make people better" is something that many, many people have tried before- it may be useful to look and see what went wrong last time. In the words of Scott Alexander, "I’m the last person who’s going to deny that the road we’re on is littered with the skulls of the people who tried to do this before us. . . We’re almost certainly still making horrendous mistakes that people thirty years from now will rightly criticize us for. But they’re new mistakes. . . And I hope that maybe having a community dedicated to carefully checking its own thought processes and trying to minimize error in every way possible will make us have slightly fewer horrendous mistakes than people who don’t do that." This article right here? This is a skull. It should be noticed. If the Best Of collection is for people who want a martial art of rationality to study then I believe this article is the most important entry, and it or the latest version of it will continue to be the most important entry until we have found the art at last. Thank you Anna for trying to build the art. Thank you for writing this and publishing it where anyone else about to attempt to build the art can take note of your mistakes and try to do better. (Ideally it's next to a dozen things we have found that we do think work! But maybe it's next to them the way a surgeon general's warning is next to a bottle of experimental pills.)
#29

Some people believe AI development is extremely dangerous, but are hesitant to directly confront or dissuade AI researchers. The author argues we should be more willing to engage in activism and outreach to slow down dangerous AI progress. They give an example of their own intervention with an AI research group.

15 Ben Pace
Seems to me like a blindingly obvious post that was kind of outside of the Overton window for too long. Eliezer also smashed the window with his TIME article, but this was first, so I think it's still a pretty great post. +4
#30

In the course of researching optimization, Alex decided that he had to really understand what entropy is. But he found the existing resources (Wikipedia, etc.) so poor that it seemed important to write a better one; other resources were only concerned with the application of the concept in their particular sub-domain. Here, Alex aims to synthesize the abstract concept of entropy, to show what's so deep and fundamental about it.
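
As a reference point (a reader-added reminder rather than Alex's own framing), the standard special case the post generalizes from is Shannon entropy, which for a uniform distribution over N states reduces to the number of bits needed to specify one state.

```latex
% Reader-added reminder (not from the post): the standard Shannon form, which for a
% uniform distribution over N states is just the number of bits needed to single out one state.
H(X) = -\sum_{x} p(x)\,\log_2 p(x), \qquad H\bigl(\mathrm{Uniform}(N)\bigr) = \log_2 N \ \text{bits}.
```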

15 Alex_Altair
[This is a self-review because I see that no one has left a review to move it into the next phase. So8res's comment would also make a great review.] I'm pretty proud of this post for the level of craftsmanship I was able to put into it. I think it embodies multiple rationalist virtues. It's a kind of "timeless" content, and is a central example of the kind of content people want to see on LW that isn't stuff about AI. It would also look great printed in a book. :)
#32

On the 3rd of October 2351 a machine flared to life. Huge energies coursed into it via cables, only to leave moments later as heat dumped unwanted into its radiators. With an enormous puff the machine unleashed sixty years of human metabolic entropy into superheated steam.

In the heart of the machine was Jane, a person of the early 21st century.

#33

Sometimes your brilliant, hyperanalytical friends can accidentally crush your fragile new ideas before they have a chance to develop. Elizabeth shares a strategy she uses to get them to chill out and vibe on new ideas for a bit before dissecting them. 

#34

Causal scrubbing is a new tool for evaluating mechanistic interpretability hypotheses. The algorithm tries to replace all model activations that shouldn't matter according to a hypothesis, and measures how much performance drops. It's been used to improve hypotheses about induction heads and parentheses balancing circuits. 
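
To make the loop concrete, here is a toy sketch of the idea under stated assumptions: a fake two-activation "model", a made-up metric, and a hypothesis that one activation is irrelevant. It is illustrative only and is not Redwood Research's implementation.

```python
# Toy sketch of the causal-scrubbing idea (illustrative, not Redwood's implementation):
# replace activations the hypothesis calls irrelevant with activations recomputed on
# other, randomly chosen inputs, and check how much the metric degrades.
import numpy as np

rng = np.random.default_rng(0)

def model(x, overrides=None):
    """Toy 'model' with named intermediate activations that can be overridden."""
    acts = {}
    acts["a"] = np.tanh(x)       # the hypothesis says this activation matters
    acts["b"] = np.sin(3 * x)    # the hypothesis says this one is irrelevant
    if overrides:
        acts.update(overrides)
    acts["out"] = acts["a"] + 0.01 * acts["b"]
    return acts

def loss(xs, overrides_per_input=None):
    """Mean squared error of the model's output against the behaviour being explained."""
    overrides_per_input = overrides_per_input or {}
    preds = [model(x, overrides_per_input.get(i))["out"] for i, x in enumerate(xs)]
    targets = [np.tanh(x) for x in xs]
    return float(np.mean([(p - t) ** 2 for p, t in zip(preds, targets)]))

xs = rng.normal(size=200)
claimed_irrelevant = ["b"]  # the interpretability hypothesis under test

# Scrub: recompute every "irrelevant" activation on a randomly chosen other input.
scrub = {i: {name: model(xs[rng.integers(len(xs))])[name] for name in claimed_irrelevant}
         for i in range(len(xs))}

print("baseline loss:", loss(xs))
print("scrubbed loss:", loss(xs, scrub))
```

If the scrubbed loss barely moves, the hypothesis has (by this measure) explained the behaviour; a large jump means something the hypothesis called irrelevant actually mattered.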

69 Buck
(I'm just going to speak for myself here, rather than the other authors, because I don't want to put words in anyone else's mouth. But many of the ideas I describe in this review are due to other people.) I think this work was a solid intellectual contribution. I think that the metric proposed for how much you've explained a behavior is the most reasonable metric by a pretty large margin. The core contribution of this paper was to produce negative results about interpretability. This led to us abandoning work on interpretability a few months later, which I'm glad we did. But these negative results haven’t had that much influence on other people’s work AFAICT, so overall it seems somewhat low impact. The empirical results in this paper demonstrated that induction heads are not the simple circuit which many people claimed (see this post for a clearer statement of that), and we then used these techniques to get mediocre results for IOI (described in this comment). There hasn’t been much followup on this work. I suspect that the main reasons people haven't built on this are: * it's moderately annoying to implement it * it makes your explanations look bad (IMO because they actually are unimpressive), so you aren't that incentivized to get it working * the interp research community isn't very focused on validating whether its explanations are faithful, and in any case we didn’t successfully persuade many people that explanations performing poorly according to this metric means they’re importantly unfaithful I think that interpretability research isn't going to be able to produce explanations that are very faithful explanations of what's going on in non-toy models (e.g. I think that no such explanation has ever been produced). Since I think faithful explanations are infeasible, measures of faithfulness of explanations don't seem very important to me now. (I think that people who want to do research that uses model internals should evaluate their techniques by mea
#35

How good are modern language models compared to humans at the task language models are trained on (next token prediction on internet text)? We found that humans seem to be consistently worse at next-token prediction (in terms of both top-1 accuracy and perplexity) than even small models like Fairseq-125M, a 12-layer transformer roughly the size and quality of GPT-1. 
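
For reference, a minimal sketch of the two metrics named above (top-1 accuracy and perplexity), computed from per-position predicted distributions over a toy vocabulary; the numbers are illustrative and not taken from the paper.

```python
# Illustrative sketch of the two metrics mentioned above; toy distributions, not real data.
import numpy as np

def top1_accuracy(probs, targets):
    """Fraction of positions where the highest-probability token is the true next token."""
    return float(np.mean(np.argmax(probs, axis=-1) == targets))

def perplexity(probs, targets):
    """Exponential of the average negative log-likelihood of the true next tokens."""
    nll = -np.log(probs[np.arange(len(targets)), targets])
    return float(np.exp(nll.mean()))

# Toy example: 3 positions, vocabulary of 4 tokens.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.20, 0.50, 0.20, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
targets = np.array([0, 1, 3])
print(top1_accuracy(probs, targets))  # 2/3
print(perplexity(probs, targets))     # ~2.25
```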

13 Buck
This post's point still seems correct, and it still seems important--I refer to it at least once a week.
#36

In 1936, four men attempted to climb the Eigerwand, the north face of the Eiger mountain. Their harrowing story ended in tragedy, with the last survivor dangling from a rope just meters away from rescue before succumbing. Gene Smith reflects on what drives people to take such extreme risks for seemingly little practical benefit.

21 GeneSmith
I was pleasantly surprised by how many people enjoyed this post about mountain climbing. I never expected it to gain so much traction, since it doesn't relate that clearly to rationality or AI or any of the topics usually discussed on LessWrong. But when I finished the book it was based on, I just felt an overwhelming urge to tell other people about it. The story was just that insane. Looking back I think Gwern probably summarized what this story is about best: a world beyond the reach of god. The universe does not respect your desire for a coherent, meaningful story. If you make the wrong mistake at the wrong time, game over. For the past couple of months I've actually been drafting a sequel of sorts to this post about a man named Nims Purja. I hope to post it before Christmas!
#37

When tackling difficult, open-ended research questions, it's easy to get stuck. In addition to virtues like open-mindedness and self-criticality, Holden recommends "vices" like laziness, impatience, hubris and self-preservation as antidotes. This post explores the techniques that have worked well for him.

10 Alex_Altair
Earlier this year I spent a lot of time trying to understand how to do research better. This post was one of the few resources that actually helped. It described several models that I resonated with, but which I had not read anywhere else. It essentially described a lot of the things I was already doing, and this gave me more confidence in deciding to continue doing full time AI alignment research. (It also helps that Karnofsky is an accomplished researcher, and so his advice has more weight!)
#38

You might feel like AI risk is an "emergency" that demands drastic changes to your life. But is this actually the best way to respond? Anna Salamon explores what kinds of changes actually make sense in different types of emergencies, and what that might mean for how to approach existential risk.

#39

Models don't "get" reward. Reward is the mechanism by which we select parameters, it is not something "given" to the model. Reinforcement learning should be viewed through the lens of selection, not the lens of incentivisation. This has implications for how one should think about AI alignment. 

#40

Here's a simple strategy for AI alignment: use interpretability tools to identify the AI's internal search process, and the AI's internal representation of our desired alignment target. Then directly rewire the search process to aim at the alignment target. Boom, done. 

#41

Lessons from 20+ years of software security experience, perhaps relevant to AGI alignment:

1. Security doesn't happen by accident

2. Blacklists are useless but make them anyway 

3. You get what you pay for (incentives matter)

4. Assurance requires formal proofs, which are provably impossible

5. A breach IS an existential risk

10 habryka
I currently think that the case study of computer security is one of the best places to learn about the challenges that AI control and AI Alignment projects will face. Despite that, I haven't seen that much writing trying to bridge the gap between computer security and AI safety. This post is one of the few that does, and I think does so reasonably well.
#42

What's with all the strange pseudophilosophical questions from AI alignment researchers, like "what does it mean for some chunk of the world to do optimization?" or "how does an agent model a world bigger than itself?". John lays out why some people think solving these sorts of questions is a necessary prerequisite for AI alignment.

#43

Nate Soares argues that one of the core problems with AI alignment is that an AI system's capabilities will likely generalize to new domains much faster than its alignment properties. He thinks this is likely to happen in a sudden, discontinuous way (a "sharp left turn"), and that this transition will break most alignment approaches. And this isn't getting enough focus from the field.

11 Mikhail Samin
Sharp Left Turn: a more important problem (and a more specific threat model) than people usually think The sharp left turn is not a simple observation that we've seen capabilities generalise more than alignment. As I understand it, it is a more mechanistic understanding that some people at MIRI have, of dynamics that might produce systems with generalised capabilities but not alignment. Many times over the past year, I've been surprised by people in the field who've read Nate's post but somehow completely missed the part where it talks about specific dynamics that lead to alignment properties breaking during capabilities generalisation. To fulfil the reviewing duty and to have a place to point people to, I'll try to write down some related intuitions that I talked about throughout 2023 when trying to get people to have intuitions on what the sharp left turn problem is about. For example, imagine training a neural network with RL. For a while during training, the neural network might be implementing a fuzzy collection of algorithms and various heuristics that together kinda optimise for some goals. The gradient strongly points towards greater capabilities. Some of these algorithms and heuristics might be more useful for the task the neural network is being evaluated on, and they'll persist more and what the neural network is doing as a whole will look a bit more like what the most helpful parts of it are doing. Some of these algorithms and heuristics might be more agentic and do more for long-term goal achievement than others. As being better at achieving goals correlates with greater performance, the neural network becomes, as a whole, more capable of achieving goals. Or, maybe the transition that leads to capabilities generalisation can be more akin to grokking: even with a fuzzy solution, the distant general coherent agent implementations might still be visible to the gradient, and at some point, there might be a switch from a fuzzy collection of things togeth
#44

Alignment researchers often propose clever-sounding solutions without citing much evidence that their solution should help. Such arguments can mislead people into working on dead ends. Instead, TurnTrout argues we should focus more on studying how human intelligence implements alignment properties, as it is a real "existence proof" of aligned intelligence.

10Gunnar_Zarncke
I like many aspects of this post.

* It promotes using intuitions from humans. Using human, social, or biological approaches is neglected compared to approaches that are more abstract and general. It is also scalable, because people who wouldn't be able to work directly on the abstract approaches can work on it.
* It reflects on a specific problem the author had and offers the same approach to readers.
* It uses concrete examples to illustrate.
* It is short and accessible.
#45

Holden shares his step-by-step process for forming opinions on a topic, developing and refining hypotheses, and ultimately arriving at a nuanced view - all while focusing on writing rather than just passively consuming information.

#46

Limerence (aka "falling in love") wreaks havoc on your rationality. But it feels so good!

What do?

#47

Do you pass the "onion test" for honesty? As people get to know you better over time, they should keep discovering new things about you, but never be shocked by the *types* of information that were hidden. A framework for thinking about personal (and institutional) honesty.

26Screwtape
Figuring out the edge cases about honesty and truth seems important to me, both as a matter of personal aesthetics and as a matter for LessWrong to pay attention to. One of the things people have used to describe what makes LessWrong special is that it's a community focused on truth-seeking, which makes "what is truth anyway and how do we talk about it" a worthwhile topic of conversation. This article talks about it in a way that's clear. (The positive-example/negative-example pattern is a good approach to a topic that can really suffer from illusion of transparency.)

Like Eliezer's Meta-Honesty post, the approach suggested does rely on some fast verbal footwork, though the footwork need not be as fast as Meta-Honesty. Passing the Onion Test consistently requires the same kind of comparison to alternate worlds as glomarization, which is a bit of a strike against it, but that's hardly unique to the Onion Test. I don't know if people still wind up feeling misled? For instance, I can imagine someone saying "I usually keep my financial state private" and having their conversation partners walk away with wildly different ideas of how they're doing. Is it so bad they don't want to talk about it? Is it so good they don't want to brag? If I thought it was the former and offered to cover their share of dinner repeatedly, I might be annoyed if it turns out to be the latter.

I don't particularly hold myself to the Onion Test, but it did provide another angle on the subject that I appreciated. Nobody has yet used it this way around me, but I could also see the Onion Test declared in a similar manner to Crocker's Rules, an opt-in social norm that might be recognized by others if it got popular enough. I'm not sure it's worth the limited conceptual slots a community can have for those, but I wouldn't feel the slot was wasted if Onion Tests made it that far. This might be weird, but I really appreciate people having the conversations about what they think is honest and in what way …
#48

The LessWrong post "Theses on Sleep" gained a lot of popularity and acclaim, despite largely consisting of what seemed to Natalia like weak arguments and misleading claims. This critical review lists several of the mistakes Natalia argues were made, and reports some of what the academic literature on sleep seems to show.

#49

How do humans form their values? Shard theory proposes that human values are formed through a relatively straightforward reinforcement process, rather than being hard-coded by evolution. This post lays out the core ideas behind shard theory and explores how it can explain various aspects of human behavior and decision-making. 

19Jan_Kulveit
In my personal view, 'Shard theory of human values' illustrates both the upsides and pathologies of the local epistemic community.

The upsides:
- The majority of the claims are true, or at least approximately true.
- "Shard theory" as a social phenomenon reached critical mass, making the ideas visible to the broader alignment community, which works e.g. by talking about them in person, votes on LW, series of posts, ...
- Shard theory coined a number of locally memetically fit names or phrases, such as 'shards'.
- Part of the success led some people in the AGI labs to think about mathematical structures of human values, which is an important problem.

The downsides:
- Almost none of the claims which are true are original; most of this was described elsewhere before, mainly in the active inference/predictive processing literature, or in thinking about multi-agent mind models.
- The claims which are novel usually seem somewhat confused (e.g. that human values are inaccessible to the genome, or naive RL intuitions).
- The novel terminology is incompatible with existing research literature, making it difficult for the alignment community to find or understand existing research, and making it difficult for people from other backgrounds to contribute. (While this is not the best option for the advancement of understanding, paradoxically, it may be positively reinforced in the local environment, as you get more credit for reinventing stuff under new names than for pointing to relevant existing research.)

Overall, 'shards' became so popular that reading at least the basics is probably necessary to understand what many people are talking about.
#50

A new paper proposes an unsupervised way to extract knowledge from language models. The authors argue this could be a key part of aligning superintelligent AIs, by letting us figure out what the AI "really believes" rather than what it thinks humans want to hear. But there are still some challenges to overcome before this could work on future superhuman AIs.

57LawrenceC
This is a review of both the paper and the post itself, and turned more into a review of the paper (on which I think I have more to say) as opposed to the post.

Disclaimer: this isn't actually my area of expertise inside of technical alignment, and I've done very little linear probing myself. I'm relying primarily on my understanding of others' results, so there's some chance I've misunderstood something. Total amount of work on this review: ~8 hours, though about 4 of those were refreshing my memory of prior work and rereading the paper.

TL;DR: The paper made significant contributions by introducing the idea of unsupervised knowledge discovery to a broader audience and by demonstrating that relatively straightforward techniques may make substantial progress on this problem. Compared to the paper, the blog post is substantially more nuanced, and I think that more academic-leaning AIS researchers should also publish companion blog posts of this kind. Collin Burns also deserves a lot of credit for actually doing empirical work in this domain when others were skeptical. However, the results are somewhat overstated and, with the benefit of hindsight, (vanilla) CCS does not seem to be a particularly promising technique for eliciting knowledge from language models. That being said, I encourage work in this area.[1]

Introduction/Overview

The paper “Discovering Latent Knowledge in Language Models without Supervision” by Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt (henceforth referred to as “the CCS paper” for short) proposes a method for unsupervised knowledge discovery, which can be thought of as a variant of empirical, average-case Eliciting Latent Knowledge (ELK). In this companion blog post, Collin Burns discusses the motivations behind the paper, caveats some of the limitations of the paper, and provides some reasons for why this style of unsupervised methods may scale to future language models.

The CCS paper kicked off a lot of waves in the alignment …
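For readers who want the mechanics behind the method being reviewed: CCS fits a small probe on the hidden states of contrast pairs (a statement phrased as true vs. as false) using an unsupervised consistency-plus-confidence objective. The PyTorch sketch below is my reconstruction of that objective from the paper's description; function names, the training loop, and the normalization convention are illustrative, not the authors' released code.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe p(x) = sigmoid(w.x + b) over hidden states."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x)).squeeze(-1)

def ccs_loss(p_pos, p_neg):
    # Consistency: a statement and its negation should get probabilities
    # that sum to ~1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate p(x+) = p(x-) = 0.5 solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_ccs_probe(h_pos, h_neg, epochs=1000, lr=1e-3):
    """h_pos, h_neg: (n, d) hidden states for the 'true' / 'false' versions
    of each statement, mean-centered per class beforehand."""
    probe = CCSProbe(h_pos.shape[1])
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(h_pos), probe(h_neg))
        loss.backward()
        opt.step()
    return probe

# Usage with random stand-in activations (real inputs would be LM hidden states):
h_pos, h_neg = torch.randn(128, 768), torch.randn(128, 768)
probe = train_ccs_probe(h_pos, h_neg)
truth_score = 0.5 * (probe(h_pos) + (1 - probe(h_neg)))  # averaged score per statement
```

The confidence term is what keeps the probe from collapsing to "always 0.5"; note that the objective is symmetric, so which direction counts as "true" still has to be fixed afterwards.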
#51

So if you read Harry Potter and the Methods of Rationality, and thought...

"You know, HPMOR is pretty good so far as it goes; but Harry is much too cautious and doesn't have nearly enough manic momentum, his rationality lectures aren't long enough, and all of his personal relationships are way way way too healthy."

...then have I got the story for you!

17AprilSR
I feel like Project Lawful, as well as many of Lintamande's other glowfic since then, has given me a whole lot deeper an understanding of... a collection of virtues including honor, honesty, trustworthiness, etc, which I now mostly think of collectively as "Law". I think this has been pretty valuable for me on an intellectual level—I think, if you show me some sort of deontological rule, I'm going to give a better account of why/whether it's a good idea to follow it than I would have before I read any glowfic. It's difficult for me to separate how much of that is due to Project Lawful in particular, because ultimately I've just read a large body of work which all had some amount of training data showing a particular sort of thought pattern which I've since learned. But I think this particular fragment of the rationalist community has given me some valuable new ideas, and it'd be great to figure out a good way of acknowledging that.
15niplav
I don't think this would fit into the 2022 review. Project Lawful has been quite influential, but I find it hard to imagine a way its impact could be included in a best-of. Including this post in particular strikes me as misguided, as it contains none of the interesting ideas and lessons from Project Lawful, and thus doesn't make any intellectual progress. One could try to do the distillation of finding particularly interesting or enlightening passages from the text, but that would be:

1. A huge amount of work[1], but maybe David Udell's sequence could be used for that.
2. Quite difficult for the more subtle lessons, which are interwoven in the text.

I have nothing against Project Lawful in particular[2], but I think that including this post would be misguided, and including passages from Project Lawful would be quite difficult. For that reason, I'm giving this a -1.

----------------------------------------

1. Consider: after more than two years the Hanson compilation bounty still hasn't been fulfilled, at a $10k reward! ↩︎
2. I've read parts of it (maybe 15%?), but haven't been hooked, and every time I read a longer part I get the urge to go and read textbooks instead. ↩︎