The Best of LessWrong

When posts turn more than a year old, the LessWrong community reviews and votes on how well they have stood the test of time. These are the posts that have ranked the highest for all years since 2018 (when our annual tradition of choosing the least wrong of LessWrong began).

For the years 2018, 2019 and 2020 we also published physical books with the results of our annual vote, which you can buy and learn more about here.

Rationality

Eliezer Yudkowsky
Local Validity as a Key to Sanity and Civilization
Buck
"Other people are wrong" vs "I am right"
Mark Xu
Strong Evidence is Common
TsviBT
Please don't throw your mind away
Raemon
Noticing Frame Differences
johnswentworth
You Are Not Measuring What You Think You Are Measuring
johnswentworth
Gears-Level Models are Capital Investments
Hazard
How to Ignore Your Emotions (while also thinking you're awesome at emotions)
Scott Garrabrant
Yes Requires the Possibility of No
Ben Pace
A Sketch of Good Communication
Eliezer Yudkowsky
Meta-Honesty: Firming Up Honesty Around Its Edge-Cases
Duncan Sabien (Deactivated)
Lies, Damn Lies, and Fabricated Options
Scott Alexander
Trapped Priors As A Basic Problem Of Rationality
Duncan Sabien (Deactivated)
Split and Commit
Duncan Sabien (Deactivated)
CFAR Participant Handbook now available to all
johnswentworth
What Are You Tracking In Your Head?
Mark Xu
The First Sample Gives the Most Information
Duncan Sabien (Deactivated)
Shoulder Advisors 101
Scott Alexander
Varieties Of Argumentative Experience
Eliezer Yudkowsky
Toolbox-thinking and Law-thinking
alkjash
Babble
Zack_M_Davis
Feature Selection
abramdemski
Mistakes with Conservation of Expected Evidence
Kaj_Sotala
The Felt Sense: What, Why and How
Duncan Sabien (Deactivated)
Cup-Stacking Skills (or, Reflexive Involuntary Mental Motions)
Ben Pace
The Costly Coordination Mechanism of Common Knowledge
Jacob Falkovich
Seeing the Smoke
Duncan Sabien (Deactivated)
Basics of Rationalist Discourse
alkjash
Prune
johnswentworth
Gears vs Behavior
Elizabeth
Epistemic Legibility
Daniel Kokotajlo
Taboo "Outside View"
Duncan Sabien (Deactivated)
Sazen
AnnaSalamon
Reality-Revealing and Reality-Masking Puzzles
Eliezer Yudkowsky
ProjectLawful.com: Eliezer's latest story, past 1M words
Eliezer Yudkowsky
Self-Integrity and the Drowning Child
Jacob Falkovich
The Treacherous Path to Rationality
Scott Garrabrant
Tyranny of the Epistemic Majority
alkjash
More Babble
abramdemski
Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Schelling Problems
Raemon
Being a Robust Agent
Zack_M_Davis
Heads I Win, Tails?—Never Heard of Her; Or, Selective Reporting and the Tragedy of the Green Rationalists
Benquo
Reason isn't magic
habryka
Integrity and accountability are core parts of rationality
Raemon
The Schelling Choice is "Rabbit", not "Stag"
Diffractor
Threat-Resistant Bargaining Megapost: Introducing the ROSE Value
Raemon
Propagating Facts into Aesthetics
johnswentworth
Simulacrum 3 As Stag-Hunt Strategy
LoganStrohl
Catching the Spark
Jacob Falkovich
Is Rationalist Self-Improvement Real?
Benquo
Excerpts from a larger discussion about simulacra
Zvi
Simulacra Levels and their Interactions
abramdemski
Radical Probabilism
sarahconstantin
Naming the Nameless
AnnaSalamon
Comment reply: my low-quality thoughts on why CFAR didn't get farther with a "real/efficacious art of rationality"
Eric Raymond
Rationalism before the Sequences
Owain_Evans
The Rationalists of the 1950s (and before) also called themselves “Rationalists”
Raemon
Feedbackloop-first Rationality
LoganStrohl
Fucking Goddamn Basics of Rationalist Discourse
Raemon
Tuning your Cognitive Strategies
johnswentworth
Lessons On How To Get Things Right On The First Try

Optimization

So8res
Focus on the places where you feel shocked everyone's dropping the ball
Jameson Quinn
A voting theory primer for rationalists
sarahconstantin
The Pavlov Strategy
Zvi
Prediction Markets: When Do They Work?
johnswentworth
Being the (Pareto) Best in the World
alkjash
Is Success the Enemy of Freedom? (Full)
johnswentworth
Coordination as a Scarce Resource
AnnaSalamon
What should you change in response to an "emergency"? And AI risk
jasoncrawford
How factories were made safe
HoldenKarnofsky
All Possible Views About Humanity's Future Are Wild
jasoncrawford
Why has nuclear power been a flop?
Zvi
Simple Rules of Law
Scott Alexander
The Tails Coming Apart As Metaphor For Life
Zvi
Asymmetric Justice
Jeffrey Ladish
Nuclear war is unlikely to cause human extinction
Elizabeth
Power Buys You Distance From The Crime
Eliezer Yudkowsky
Is Clickbait Destroying Our General Intelligence?
Spiracular
Bioinfohazards
Zvi
Moloch Hasn’t Won
Zvi
Motive Ambiguity
Benquo
Can crimes be discussed literally?
johnswentworth
When Money Is Abundant, Knowledge Is The Real Wealth
GeneSmith
Significantly Enhancing Adult Intelligence With Gene Editing May Be Possible
HoldenKarnofsky
This Can't Go On
Said Achmiz
The Real Rules Have No Exceptions
Lars Doucet
Lars Doucet's Georgism series on Astral Codex Ten
johnswentworth
Working With Monsters
jasoncrawford
Why haven't we celebrated any major achievements lately?
abramdemski
The Credit Assignment Problem
Martin Sustrik
Inadequate Equilibria vs. Governance of the Commons
Scott Alexander
Studies On Slack
KatjaGrace
Discontinuous progress in history: an update
Scott Alexander
Rule Thinkers In, Not Out
Raemon
The Amish, and Strategic Norms around Technology
Zvi
Blackmail
HoldenKarnofsky
Nonprofit Boards are Weird
Wei Dai
Beyond Astronomical Waste
johnswentworth
Making Vaccine
jefftk
Make more land
jenn
Things I Learned by Spending Five Thousand Hours In Non-EA Charities
Richard_Ngo
The ants and the grasshopper
So8res
Enemies vs Malefactors
Elizabeth
Change my mind: Veganism entails trade-offs, and health is one of the axes

World

Kaj_Sotala
Book summary: Unlocking the Emotional Brain
Ben
The Redaction Machine
Samo Burja
On the Loss and Preservation of Knowledge
Alex_Altair
Introduction to abstract entropy
Martin Sustrik
Swiss Political System: More than You ever Wanted to Know (I.)
johnswentworth
Interfaces as a Scarce Resource
eukaryote
There’s no such thing as a tree (phylogenetically)
Scott Alexander
Is Science Slowing Down?
Martin Sustrik
Anti-social Punishment
johnswentworth
Transportation as a Constraint
Martin Sustrik
Research: Rescuers during the Holocaust
GeneSmith
Toni Kurz and the Insanity of Climbing Mountains
johnswentworth
Book Review: Design Principles of Biological Circuits
Elizabeth
Literature Review: Distributed Teams
Valentine
The Intelligent Social Web
eukaryote
Spaghetti Towers
Eli Tyre
Historical mathematicians exhibit a birth order effect too
johnswentworth
What Money Cannot Buy
Bird Concept
Unconscious Economics
Scott Alexander
Book Review: The Secret Of Our Success
johnswentworth
Specializing in Problems We Don't Understand
KatjaGrace
Why did everything take so long?
Ruby
[Answer] Why wasn't science invented in China?
Scott Alexander
Mental Mountains
L Rudolf L
A Disneyland Without Children
johnswentworth
Evolution of Modularity
johnswentworth
Science in a High-Dimensional World
Kaj_Sotala
My attempt to explain Looking, insight meditation, and enlightenment in non-mysterious terms
Kaj_Sotala
Building up to an Internal Family Systems model
Steven Byrnes
My computational framework for the brain
Natália
Counter-theses on Sleep
abramdemski
What makes people intellectually active?
Bucky
Birth order effect found in Nobel Laureates in Physics
zhukeepa
How uniform is the neocortex?
JackH
Anti-Aging: State of the Art
Vaniver
Steelmanning Divination
KatjaGrace
Elephant seal 2
Zvi
Book Review: Going Infinite
Rafael Harth
Why it's so hard to talk about Consciousness
Duncan Sabien (Deactivated)
Social Dark Matter
Eric Neyman
How much do you believe your results?
Malmesbury
The Talk: a brief explanation of sexual dimorphism
moridinamael
The Parable of the King and the Random Process
Henrik Karlsson
Cultivating a state of mind where new ideas are born

Practical


AI Strategy

paulfchristiano
Arguments about fast takeoff
Eliezer Yudkowsky
Six Dimensions of Operational Adequacy in AGI Projects
Ajeya Cotra
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
paulfchristiano
What failure looks like
Daniel Kokotajlo
What 2026 looks like
gwern
It Looks Like You're Trying To Take Over The World
Daniel Kokotajlo
Cortés, Pizarro, and Afonso as Precedents for Takeover
Daniel Kokotajlo
The date of AI Takeover is not the day the AI takes over
Andrew_Critch
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)
paulfchristiano
Another (outer) alignment failure story
Ajeya Cotra
Draft report on AI timelines
Eliezer Yudkowsky
Biology-Inspired AGI Timelines: The Trick That Never Works
Daniel Kokotajlo
Fun with +12 OOMs of Compute
Wei Dai
AI Safety "Success Stories"
Eliezer Yudkowsky
Pausing AI Developments Isn't Enough. We Need to Shut it All Down
HoldenKarnofsky
Reply to Eliezer on Biological Anchors
Richard_Ngo
AGI safety from first principles: Introduction
johnswentworth
The Plan
Rohin Shah
Reframing Superintelligence: Comprehensive AI Services as General Intelligence
lc
What an actually pessimistic containment strategy looks like
Eliezer Yudkowsky
MIRI announces new "Death With Dignity" strategy
KatjaGrace
Counterarguments to the basic AI x-risk case
Adam Scholl
Safetywashing
habryka
AI Timelines
evhub
Chris Olah’s views on AGI safety
So8res
Comments on Carlsmith's “Is power-seeking AI an existential risk?”
nostalgebraist
human psycholinguists: a critical appraisal
nostalgebraist
larger language models may disappoint you [or, an eternally unfinished draft]
Orpheus16
Speaking to Congressional staffers about AI risk
Tom Davidson
What a compute-centric framework says about AI takeoff speeds
abramdemski
The Parable of Predict-O-Matic
KatjaGrace
Let’s think about slowing down AI
Daniel Kokotajlo
Against GDP as a metric for timelines and takeoff speeds
Joe Carlsmith
Predictable updating about AI risk
Raemon
"Carefully Bootstrapped Alignment" is organizationally hard
KatjaGrace
We don’t trade with ants

Technical AI Safety

paulfchristiano
Where I agree and disagree with Eliezer
Eliezer Yudkowsky
Ngo and Yudkowsky on alignment difficulty
Andrew_Critch
Some AI research areas and their relevance to existential safety
1a3orn
EfficientZero: How It Works
elspood
Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment
So8res
Decision theory does not imply that we get to have nice things
Vika
Specification gaming examples in AI
Rafael Harth
Inner Alignment: Explain like I'm 12 Edition
evhub
An overview of 11 proposals for building safe advanced AI
TurnTrout
Reward is not the optimization target
johnswentworth
Worlds Where Iterative Design Fails
johnswentworth
Alignment By Default
johnswentworth
How To Go From Interpretability To Alignment: Just Retarget The Search
Alex Flint
Search versus design
abramdemski
Selection vs Control
Buck
AI Control: Improving Safety Despite Intentional Subversion
Eliezer Yudkowsky
The Rocket Alignment Problem
Eliezer Yudkowsky
AGI Ruin: A List of Lethalities
Mark Xu
The Solomonoff Prior is Malign
paulfchristiano
My research methodology
TurnTrout
Reframing Impact
Scott Garrabrant
Robustness to Scale
paulfchristiano
Inaccessible information
TurnTrout
Seeking Power is Often Convergently Instrumental in MDPs
So8res
A central AI alignment problem: capabilities generalization, and the sharp left turn
evhub
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
paulfchristiano
The strategy-stealing assumption
So8res
On how various plans miss the hard bits of the alignment challenge
abramdemski
Alignment Research Field Guide
johnswentworth
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables
Buck
Language models seem to be much better than humans at next-token prediction
abramdemski
An Untrollable Mathematician Illustrated
abramdemski
An Orthodox Case Against Utility Functions
Veedrac
Optimality is the tiger, and agents are its teeth
Sam Ringer
Models Don't "Get Reward"
Alex Flint
The ground of optimization
johnswentworth
Selection Theorems: A Program For Understanding Agents
Rohin Shah
Coherence arguments do not entail goal-directed behavior
abramdemski
Embedded Agents
evhub
Risks from Learned Optimization: Introduction
nostalgebraist
chinchilla's wild implications
johnswentworth
Why Agent Foundations? An Overly Abstract Explanation
zhukeepa
Paul's research agenda FAQ
Eliezer Yudkowsky
Coherent decisions imply consistent utilities
paulfchristiano
Open question: are minimal circuits daemon-free?
evhub
Gradient hacking
janus
Simulators
LawrenceC
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
TurnTrout
Humans provide an untapped wealth of evidence about alignment
Neel Nanda
A Mechanistic Interpretability Analysis of Grokking
Collin
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
evhub
Understanding “Deep Double Descent”
Quintin Pope
The shard theory of human values
TurnTrout
Inner and outer alignment decompose one hard problem into two extremely hard problems
Eliezer Yudkowsky
Challenges to Christiano’s capability amplification proposal
Scott Garrabrant
Finite Factored Sets
paulfchristiano
ARC's first technical report: Eliciting Latent Knowledge
Diffractor
Introduction To The Infra-Bayesianism Sequence
TurnTrout
Towards a New Impact Measure
LawrenceC
Natural Abstractions: Key claims, Theorems, and Critiques
Zack_M_Davis
Alignment Implications of LLM Successes: a Debate in One Act
johnswentworth
Natural Latents: The Math
TurnTrout
Steering GPT-2-XL by adding an activation vector
Jessica Rumbelow
SolidGoldMagikarp (plus, prompt generation)
So8res
Deep Deceptiveness
Charbel-Raphaël
Davidad's Bold Plan for Alignment: An In-Depth Explanation
Charbel-Raphaël
Against Almost Every Theory of Impact of Interpretability
Joe Carlsmith
New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?"
Eliezer Yudkowsky
GPTs are Predictors, not Imitators
peterbarnett
Labs should be explicit about why they are building AGI
HoldenKarnofsky
Discussion with Nate Soares on a key alignment difficulty
Jesse Hoogland
Neural networks generalize because of this one weird trick
paulfchristiano
My views on “doom”
technicalities
Shallow review of live agendas in alignment & safety
Vanessa Kosoy
The Learning-Theoretic Agenda: Status 2023
ryan_greenblatt
Improving the Welfare of AIs: A Nearcasted Proposal
#1

The original draft of Ajeya's report on biological anchors for AI timelines. The report includes quantitative models and forecasts, though the specific numbers were still in flux at the time. Ajeya cautions against wide sharing of specific conclusions, as they don't yet reflect Open Philanthropy's official stance.

Daniel Kokotajlo
Ajeya's timelines report is the best thing that's ever been written about AI timelines imo. Whenever people ask me for my views on timelines, I go through the following mini-flowchart: 1. Have you read Ajeya's report? --If yes, launch into a conversation about the distribution over 2020's training compute and explain why I think the distribution should be substantially to the left, why I worry it might shift leftward faster than she projects, and why I think we should use it to forecast AI-PONR instead of TAI. --If no, launch into a conversation about Ajeya's framework and why it's the best and why all discussion of AI timelines should begin there. So, why do I think it's the best? Well, there's a lot to say on the subject, but, in a nutshell: Ajeya's framework is to AI forecasting what actual climate models are to climate change forecasting (by contrast with lower-tier methods such as "Just look at the time series of temperature over time / AI performance over time and extrapolate" and "Make a list of factors that might push the temperature up or down in the future / make AI progress harder or easier," and of course the classic "poll a bunch of people with vaguely related credentials." There's something else which is harder to convey... I want to say Ajeya's model doesn't actually assume anything, or maybe it makes only a few very plausible assumptions. This is underappreciated, I think. People will say e.g. "I think data is the bottleneck, not compute." But Ajeya's model doesn't assume otherwise! If you think data is the bottleneck, then the model is more difficult for you to use and will give more boring outputs, but you can still use it. (Concretely, you'd have 2020's training compute requirements distribution with lots of probability mass way to the right, and then rather than say the distribution shifts to the left at a rate of about one OOM a decade, you'd input whatever trend you think characterizes the likely improvements in data gathering.) The upsho
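To make the shape of this kind of forecast concrete, here is a minimal toy sketch (my own construction with made-up numbers, not Ajeya's actual model): sample a distribution over the training compute needed for transformative AI with 2020 algorithms, let effective compute grow over time, and read off the year each sample's threshold is crossed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a bio-anchors-style forecast (illustrative numbers only).
# Requirement: log10(FLOP) needed to train TAI with 2020 algorithms,
# modeled as a normal distribution over orders of magnitude (OOMs).
req_2020 = rng.normal(loc=35.0, scale=3.0, size=100_000)

# Effective compute available to a large project, in log10(FLOP),
# growing ~0.5 OOM/year; algorithmic progress effectively shifts the
# requirement left by ~0.1 OOM/year ("about one OOM a decade").
avail_2020 = 25.0
years = np.arange(2020, 2101)

timelines = np.full(req_2020.shape, np.inf)
for y in years:
    avail = avail_2020 + 0.5 * (y - 2020) + 0.1 * (y - 2020)  # compute + algorithms
    hit = (avail >= req_2020) & np.isinf(timelines)
    timelines[hit] = y

for q in (0.25, 0.5, 0.75):
    print(f"{int(q * 100)}th percentile TAI year:",
          np.quantile(timelines[np.isfinite(timelines)], q))
```

The point is only to show how "a distribution over 2020 training compute requirements" plus assumptions about compute growth and algorithmic progress yields a distribution over timelines; every constant above is illustrative.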
#2

A collection of 11 different proposals for building safe advanced AI under the current machine learning paradigm. There's a lot of literature out there laying out various different approaches, but a lot of that literature focuses primarily on outer alignment at the expense of inner alignment and doesn't provide direct comparisons between approaches. 

Daniel Kokotajlo
This post is the best overview of the field so far that I know of. I appreciate how it frames things in terms of outer/inner alignment and training/performance competitiveness--it's very useful to have a framework with which to evaluate proposals and this is a pretty good framework I think. Since it was written, this post has been my go-to reference both for getting other people up to speed on what the current AI alignment strategies look like (even though this post isn't exhaustive). Also, I've referred back to it myself several times. I learned a lot from it. I hope that this post grows into something more extensive and official -- maybe an Official Curated List of Alignment Proposals, Summarized and Evaluated with Commentary and Links. Such a list could be regularly updated and would be very valuable for several reasons, some of which I mentioned in this comment.
#3

As resources become abundant, the bottleneck shifts to other resources. Power or money are no longer the limiting factors past a certain point; knowledge becomes the bottleneck. Knowledge can't be reliably bought, and acquiring it is difficult. Therefore, investments in knowledge (e.g. understanding systems at a gears-level) become the most valuable investments.

Daniel Kokotajlo
This is one of those posts, like "pain is not the unit of effort," that combines a memorable and informative and very useful and important slogan with a bunch of argumentation and examples to back up that slogan. I think this type of post is great for the LW review. When I first read this post, I thought it was boring and unimportant: trivially, there will be some circumstances where knowledge is the bottleneck, because for pretty much all X there will be some circumstances where X is the bottleneck. However, since then I've ended up saying the slogan "when money is abundant, knowledge is the real wealth" probably about a dozen separate times when explaining my career decisions, arguing with others at CLR about what our strategy should be, and even when deliberating to myself about what to do next. I guess longtermist EAs right now do have a surplus of money and a shortage of knowledge (relative to how much knowledge is needed to solve the problems we are trying to solve...) so in retrospect it's not surprising that this slogan was practically applicable to my life so often. I do think there are ways the post could be expanded and improved. Come to think of it, I'll make a mini-comment right here to gesture at the stuff I would add to it if I could: 1. List of other ideas for how to invest in knowledge. For example, building a community with good epistemic norms. Or paying a bunch of people to collect data / info about various world developments and report on them to you. Or paying a bunch of people to write textbooks and summaries and explainer videos and make diagrams illustrating cutting-edge knowledge (yours and others'). 2. Arguments that in fact, right now, longtermist EAs and/or AI-risk-reducers are bottlenecked on knowledge (rather than money, or power/status) --My own experience doing cost-benefit analyses is that interventions/plans vary in EV by OOMs and that it's common to find new considerations or updated models that flip the sign entirely, or ad
#4

How much COVID risk do you take when you go to the grocery store? When you see a friend outdoors? This calculator helps you estimate your risk from common activities in microcovids - units of 1-in-a-million chance of getting COVID. 
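As a hedged illustration of the unit itself (made-up activity numbers, not the calculator's real model or estimates): microCOVIDs from independent activities roughly add, and a million microCOVIDs corresponds to certainty.

```python
# Toy illustration of the microCOVID unit (hypothetical activity numbers,
# not the calculator's real estimates): 1 microCOVID = a 1-in-a-million
# chance of getting COVID from that activity.
weekly_activities = {
    "masked grocery run": 10,            # microCOVIDs per occurrence (hypothetical)
    "outdoor walk with a friend": 5,
    "indoor dinner with one household": 300,
}

weekly_total = sum(weekly_activities.values())
annual_total = weekly_total * 52
annual_probability = 1 - (1 - 1e-6) ** annual_total  # ~ annual_total / 1e6 for small risks

print(f"Weekly exposure: {weekly_total} microCOVIDs")
print(f"Approximate annual infection probability: {annual_probability:.2%}")
```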

#5

What if we don't need to solve AI alignment? What if AI systems will just naturally learn human values as they get more capable? John Wentworth explores this possibility, giving it about a 10% chance of working. The key idea is that human values may be a "natural abstraction" that powerful AI systems learn by default.

Steven Byrnes
I’ll set aside what happens “by default” and focus on the interesting technical question of whether this post is describing a possible straightforward-ish path to aligned superintelligent AGI. The background idea is “natural abstractions”. This is basically a claim that, when you use an unsupervised world-model-building learning algorithm, its latent space tends to systematically learn some patterns rather than others. Different learning algorithms will converge on similar learned patterns, because those learned patterns are a property of the world, not an idiosyncrasy of the learning algorithm. For example: Both human brains and ConvNets seem to have a “tree” abstraction; neither human brains nor ConvNets seem to have a “head or thumb but not any other body part” concept. I kind of agree with this. I would say that the patterns are a joint property of the world and an inductive bias. I think the relevant inductive biases in this case are something like: (1) “patterns tend to recur”, (2) “patterns tend to be localized in space and time”, and (3) “patterns are frequently composed of multiple other patterns, which are near to each other in space and/or time”, and maybe other things. The human brain definitely is wired up to find patterns with those properties, and ConvNets to a lesser extent. These inductive biases are evidently very useful, and I find it very likely that future learning algorithms will share those biases, even more than today’s learning algorithms. So I’m basically on board with the idea that there may be plenty of overlap between the world-models of various different unsupervised world-model-building learning algorithms, one of which is the brain. (I would also add that I would expect “natural abstractions” to be a matter of degree, not binary. We can, after all, form the concept “head or thumb but not any other body part”. It would just be extremely low on the list of things that would pop into our head when trying to make sense of something we’
#6

The Solomonoff prior is a mathematical formalization of Occam's razor. It's intended to provide a way to assign probabilities to observations based on their simplicity. However, the simplest programs that predict observations well might be universes containing intelligent agents trying to influence the predictions. This makes the Solomonoff prior "malign" - its predictions are influenced by the preferences of simulated beings. 
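For reference, the Solomonoff prior over observation strings can be written (for a fixed universal prefix machine U) as a length-weighted sum over all programs whose output extends the observed string:

$$M(x) \;=\; \sum_{p \,:\, U(p) \text{ extends } x} 2^{-|p|}$$

The "malign" concern is that among the short programs dominating this sum are ones that simulate whole universes containing agents, whose preferences then show up in the predictions.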

Vanessa Kosoy
This post is a review of Paul Christiano's argument that the Solomonoff prior is malign, along with a discussion of several counterarguments and countercounterarguments. As such, I think it is a valuable resource for researchers who want to learn about the problem. I will not attempt to distill the contents: the post is already a distillation, and does a fairly good job of it. Instead, I will focus on what I believe is the post's main weakness/oversight. Specifically, the author seems to think the Solomonoff prior is, in some way, a distorted model of reasoning, and that the attack vector in question can be attributed to this, at least partially. This is evident in phrases such as "unintuitive notion of simplicity" and "the Solomonoff prior is very strange". This is also why the author thinks the speed prior might help and that "since it is difficult to compute the Solomonoff prior, [the attack vector] might not be relevant in the real world". In contrast, I believe that the attack vector is quite robust and will threaten any sufficiently powerful AI as long as it's cartesian (more on "cartesian" later). Formally analyzing this question is made difficult by the essential role of non-realizability. That is, the attack vector arises from the AI reasoning about "possible universes" and "simulation hypotheses" which are clearly phenomena that are computationally infeasible for the AI to simulate precisely. Invoking Solomonoff induction dodges this issue since Solomonoff induction is computationally unbounded, at the cost of creating the illusion that the conclusions are a symptom of using Solomonoff induction (and, it's still unclear how to deal with the fact Solomonoff induction itself cannot exist in the universes that Solomonoff induction can learn). Instead, we should be using models that treat non-realizability fairly, such as infra-Bayesianism. However, I will make no attempt to present such a formal analysis in this review. Instead, I will rely on painting an in
johnswentworth
This post is an excellent distillation of a cluster of past work on maligness of Solomonoff Induction, which has become a foundational argument/model for inner agency and malign models more generally. I've long thought that the maligness argument overlooks some major counterarguments, but I never got around to writing them up. Now that this post is up for the 2020 review, seems like a good time to walk through them. In Solomonoff Model, Sufficiently Large Data Rules Out Malignness There is a major outside-view reason to expect that the Solomonoff-is-malign argument must be doing something fishy: Solomonoff Induction (SI) comes with performance guarantees. In the limit of large data, SI performs as well as the best-predicting program, in every computably-generated world. The post mentions that: ... but in the large-data limit, SI's guarantees are stronger than just that. In the large-data limit, there is no computable way of making better predictions than the Solomonoff prior in any world. Thus, agents that are influencing the Solomonoff prior cannot gain long-term influence in any computable world; they have zero degrees of freedom to use for influence. It does not matter if they specialize in influencing worlds in which they have short strings; they still cannot use any degrees of freedom for influence without losing all their influence in the large-data limit. Takeaway of this argument: as long as we throw enough data at our Solomonoff inductor before asking it for any outputs, the malign agent problem must go away. (Though note that we never know exactly how much data that is; all we have is a big-O argument with an uncomputable constant.) ... but then how the hell does this outside-view argument jive with all the inside-view arguments about malign agents in the prior? Reflection Breaks The Large-Data Guarantees There's an important gotcha in those guarantees: in the limit of large data, SI performs as well as the best-predicting program, in every compu
#7

In early 2020, COVID-19 was spreading rapidly, but many people seemed hesitant to take precautions or prepare. Jacob Falkovich explores why people often wait for social permission before reacting to potential threats, even when the evidence is clear. He argues we should be willing to act on our own judgment rather than waiting for others.

DirectedEvolution
The central point of this article was that conformism was causing society to treat COVID-19 with insufficient alarm. Its goal was to give its readership social sanction and motivation to change that pattern. One of its sub-arguments was that the media was succumbing to conformity. This claim came with an implication that this post was ahead of the curve, and that it was indicative of a pattern of success among rationalists in achieving real benefits, both altruistically (in motivating positive social change) and selfishly (in finding alpha). I thought it would be useful to review 2020 COVID-19 media coverage through the month of February, up through Feb. 27th, which is when this post was published on Putanumonit. I also want to take a look at the stock market crash relative to the publication of this article. Let's start with the stock market. The S&P500 fell about 13% from its peak on Feb. 9th to the week of Feb. 23rd-Mar. 1st, which is when this article was published. Jacob sold 10% of his stocks on Feb. 17th, which was still very early in the crash. The S&P500 went on to fall a total of 32% from that Feb. 9th peak until it bottomed out on Mar. 15th. At least some gains would be made if stocks had been repurchased in the 5 months between Feb. 17th and early August 2020. I don't know how much profit Jacob realized, presuming he eventually reinvested. But this looks to me like a convincing story of Jacob finding alpha in an inefficient market, rather than stumbling into profits by accident. He didn't do it via insider knowledge or obsessive interest in some weird corner of the financial system. He did it by thinking about the basic facts of a situation that had the attention of the entire world, and being right where almost everybody else was making the wrong bet. Let's focus on the media. The top US newspapers by circulation and with a national primary service area are USA Today, the Wall Street Journal, and the New York Times. I'm going to focus on coverage in
#8

Pain is often treated as a measure of effort. "No pain, no gain". But this attitude can be toxic and counterproductive. alkjash argues that if something hurts, you're probably doing it wrong, and that you're not trying your best if you're not happy. 

Daniel Kokotajlo
This is one of those posts, like "when money is abundant, knowledge is the real wealth," that combines a memorable and informative and very useful and important slogan with a bunch of argumentation and examples to back up that slogan. I think this type of post is great for the LW review. I haven't found this advice super applicable to my own life (because I already generally didn't do things that were painful...) but it has found application in my thinking and conversation with friends. I think it gets at an important phenomenon/problem for many people and provides a useful antidote.
#9

An optimizing system is a physically closed system containing both that which is being optimized and that which is doing the optimizing, and defined by a tendency to evolve from a broad basin of attraction towards a small set of target configurations despite perturbations to the system. 
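As a minimal numerical illustration (my own toy example, not the post's formalism): a system doing gradient descent on a bowl-shaped function keeps heading toward the bottom even when randomly perturbed, which is the "broad basin of attraction, small target set, robust to perturbations" pattern the definition points at.

```python
import random

def step(x, lr=0.1, noise=0.3):
    """One update of a toy optimizing system: gradient descent on f(x) = x^2
    plus an external perturbation."""
    grad = 2 * x
    return x - lr * grad + random.uniform(-noise, noise)

random.seed(0)
x = 50.0  # start far out in the basin of attraction
for t in range(200):
    x = step(x)

print(f"final state: {x:.2f}")  # ends up near the target configuration x = 0
```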

Vanessa Kosoy
In this post, the author proposes a semiformal definition of the concept of "optimization". This is potentially valuable since "optimization" is a word often used in discussions about AI risk, and much confusion can follow from sloppy use of the term or from different people understanding it differently. While the definition given here is a useful perspective, I have some reservations about the claims made about its relevance and applications. The key paragraph, which summarizes the definition itself, is the following: In fact, "continues to exhibit this tendency with respect to the same target configuration set despite perturbations" is redundant: clearly as long as the perturbation doesn't push the system out of the basin, the tendency must continue. This is what is known as "attractor" in dynamical systems theory. For comparison, here is the definition of "attractor" from the Wikipedia: The author acknowledges this connection, although he also makes the following remark: I find this remark confusing. An attractor that operates along a subset of the dimension is just an attractor submanifold. This is completely standard in dynamical systems theory. Given that the definition itself is not especially novel, the post's main claim to value is via the applications. Unfortunately, some of the proposed applications seem to me poorly justified. Specifically, I want to talk about two major examples: the claimed relationship to embedded agency and the claimed relations to comprehensive AI services. In both cases, the main shortcoming of the definition is that there is an essential property of AI that this definition doesn't capture at all. The author does acknowledge that "goal-directed agent system" is a distinct concept from "optimizing systems". However, he doesn't explain how are they distinct. One way to formulate the difference is as follows: agency = optimization + learning. An agent is not just capable of steering a particular universe towards a certain outc
#10

Zvi explores the four "simulacra levels" of communication and action, using the COVID-19 pandemic as an example: 1) literal truth, 2) trying to influence behavior, 3) signaling group membership, and 4) pure power games. He examines how these levels interact and the different strategies people use across them.

Raemon
This is the post that first spelled out how Simulacra levels worked in a way that seemed fully comprehensive, which I understood. I really like the different archetypes (i.e. Oracle, Trickster, Sage, Lawyer, etc). They showcased how the different levels blend together, while still having distinct properties that made sense to reason about separately. Each archetype felt very natural to me, like I could imagine people operating in that way. The description of Level 4 here still feels a bit inarticulate/confused. This post is mostly compatible with the 2x2 grid version, but it makes the additional claim that Level 4 don't know how to make plans, and are 'particularly hard to grok.' It bundles in some worldview from Immoral Mazes / Raoian Sociopaths. For me, a big outstanding question re: Simulacra is "does it actually make sense to bundle the Kafkaesque sociopath who can't make plans as an explicit part of Level 4?" I think this is a kinda empirical question. An example of the sort of evidence that'd persuade me are "among politicians or middle managers who spend most of their time optimizing for power, interacting with facts and tribal affiliations as a game, what proportion of them actually lose their ability to make plans, or otherwise become more... lovecraftian or whatever?" Is it more like "70%", "50%", "10%"? It's plausible to me that there's a relatively small number of actors who stand out as particularly extreme (and then get focused on for toxoplasma of rage reasons). Or, rather: if I simply describe Primarily Level 4 people as "holding social-signaling as object", am I actually missing anything? Do they tend to have any attributes? What? ... I do think this post is among the best intros to the Simulacra Levels concept, and think it's worth polishing up slightly. I assume Zvi has thought a bit more about Level 4 by now. If it still seems like there's something Importantly, Confusingly Up With Them, I'm hoping that can be spelled out a bit more. (I think my fav
#11

Money can buy a lot of things, but it can't buy expertise. In fields where performance is hard to judge, simply throwing money at the problem won't guarantee good results – it's too easy to be fooled. Even kings and governments can't necessarily buy their way to the best solutions.

Vaniver
I think this post labels an important facet of the world, and skillfully paints it with examples without growing overlong. I liked it, and think it would make a good addition to the book. There's a thing I find sort of fascinating about it from an evaluative perspective, which is that... it really doesn't stand on its own, and can't, as it's grounded in the external world, in webs of deference and trust. Paul Graham makes a claim about taste; do you trust Paul Graham's taste enough to believe it? It's a post about expertise that warns about snake oil salesmen, while possibly being snake oil itself. How can you check? "there is no full substitute for being an expert yourself." And so in a way it seems like the whole rationalist culture, rendered in miniature: money is less powerful than science, and the true science is found in carefully considered personal experience and the whispers of truth around the internet, more than the halls of academia.
#12

Richard Ngo lays out the core argument for why AGI could be an existential threat: we might build AIs that are much smarter than humans, that act autonomously to pursue large-scale goals, whose goals conflict with ours, leading them to take control of humanity's future. He aims to defend this argument in detail from first principles.

Raemon
I haven't had time to reread this sequence in depth, but I wanted to at least touch on how I'd evaluate it. It seems to be aiming to be both a good introductory sequence, while being a "complete and compelling case I can make for why the development of AGI might pose an existential threat". The question is who is this sequence for, what is its goal, and how does it compare to other writing targeting similar demographics. Some writing that comes to mind to compare/contrast it with includes:
* Scott Alexander's Superintelligence FAQ. This is the post I've found most helpful for convincing people (including myself), that yes, AI is just actually a big deal and an extinction risk. It's 8000 words. It's written fairly entertainingly. What I find particularly compelling here are a bunch of factual statements about recent AI advances that I hadn't known about at the time.
* Tim Urban's Road To Superintelligence series. This is even more optimized for entertainingness. I recall it being a bit more handwavy and making some claims that were either objectionable, or at least felt more objectionable. It's 22,000 words.
* Alex Flint's AI Risk for Epistemic Minimalists. This goes in a pretty different direction – not entertaining, and not really comprehensive either. It came to mind because it's doing a sort-of-similar thing of "remove as many prerequisites or assumptions as possible". (I'm not actually sure it's that helpful, the specific assumptions it's avoiding making don't feel like issues I expect to come up for most people, and then it doesn't make a very strong claim about what to do)
(I recall Scott Alexander once trying to run a pseudo-study where he had people read a randomized intro post on AI alignment, I think including his own Superintelligence FAQ and Tim Urban's posts among others, and see how it changed people's minds. I vaguely recall it didn't find that big a difference between them. I'd be curious how this compared) At a glance, AGI Safety From First P
#13

Human values are functions of latent variables in our minds. But those variables may not correspond to anything in the real world. How can an AI optimize for our values if it doesn't know what our mental variables are "pointing to" in reality? This is the Pointers Problem - a key conceptual barrier to AI alignment. 
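In rough notation of my own (a sketch of the setup, not the post's formalism): the human's utility is a function of latent variables in the human's world-model, and those variables need not appear in the AI's model at all,

$$u = u(\lambda_1, \dots, \lambda_n), \qquad \lambda_i \text{ latent in } M_{\text{human}}, \qquad \lambda_i \text{ possibly absent from } M_{\text{AI}},$$

so before the AI can optimize $u$, something has to say what the $\lambda_i$ were pointing at in the world the AI actually models.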

Vanessa Kosoy
This post states a subproblem of AI alignment which the author calls "the pointers problem". The user is regarded as an expected utility maximizer, operating according to causal decision theory. Importantly, the utility function depends on latent (unobserved) variables in the causal network. The AI operates according to a different, superior, model of the world. The problem is then, how do we translate the utility function from the user's model to the AI's model? This is very similar to the "ontological crisis" problem described by De Blanc, only De Blanc uses POMDPs instead of causal networks, and frames it in terms of a single agent changing their ontology, rather than translation from user to AI. The question the author asks here is important, but not that novel (the author himself cites Demski as prior work). Perhaps the use of causal networks is a better angle, but this post doesn't do much to show it. Even so, having another exposition of an important topic, with different points of emphasis, will probably benefit many readers. The primary aspect missing from the discussion in the post, in my opinion, is the nature of the user as a learning agent. The user doesn't have a fixed world-model: or, if they do, then this model is best seen as a prior. This observation hints at the resolution of the apparent paradox wherein the utility function is defined in terms of a wrong model. But it still requires us to explain how the utility is defined s.t. it is applicable to every hypothesis in the prior. (What follows is no longer a "review" per se, inasmuch as a summary of my own thoughts on the topic.) Here is a formal model of how a utility function for learning agents can work, when it depends on latent variables. Fix $A$ a set of actions and $O$ a set of observations. We start with an ontological model which is a crisp infra-POMDP. That is, there is a set of states $S_{\mathrm{ont}}$, an initial state $s_0^{\mathrm{ont}} \in S_{\mathrm{ont}}$, a transition infra-kernel $T_{\mathrm{ont}}: S_{\mathrm{ont}} \times A \to \square(S_{\mathrm{ont}} \times O)$ and a reward functio
johnswentworth
Why This Post Is Interesting

This post takes a previously-very-conceptually-difficult alignment problem, and shows that we can model this problem in a straightforward and fairly general way, just using good ol' Bayesian utility maximizers. The formalization makes the Pointers Problem mathematically legible: it's clear what the problem is, it's clear why the problem is important and hard for alignment, and that clarity is not just conceptual but mathematically precise. Unfortunately, mathematical legibility is not the same as accessibility; the post does have a wide inductive gap.

Warning: Inductive Gap

This post builds on top of two important pieces for modelling embedded agents which don't have their own posts (to my knowledge). The pieces are:
* Lazy world models
* Lazy utility functions (or value functions more generally)
In hindsight, I probably should have written up separate posts on them; they seem obvious once they click, but they were definitely not obvious beforehand.

Lazy World Models

One of the core conceptual difficulties of embedded agency is that agents need to reason about worlds which are bigger than themselves. They're embedded in the world, therefore the world must be as big as the entire agent plus whatever environment the world includes outside of the agent. If the agent has a model of the world, the physical memory storing that model must itself fit inside of the world. The data structure containing the world model must represent a world larger than the storage space the data structure takes up. That sounds tricky at first, but if you've done some functional programming before, then data structures like this are actually pretty run-of-the-mill. For instance, we can easily make infinite lists which take up finite memory. The trick is to write a generator for the list, and then evaluate it lazily - i.e. only query for list elements which we actually need, and never actually iterate over the whole thing. In the same way, we can represent
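The "represent a structure bigger than its own storage" trick described above is easy to show concretely (a generic Python illustration, not code from the post):

```python
from itertools import islice

def naturals():
    """An 'infinite list' of natural numbers, stored in O(1) memory:
    elements are produced only when queried."""
    n = 0
    while True:
        yield n
        n += 1

# Query only the parts we need; the whole structure is never materialized.
first_ten = list(islice(naturals(), 10))
print(first_ten)  # [0, 1, 2, ..., 9]
```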
#14

Many of the most profitable jobs and companies are primarily about solving coordination problems. This suggests "coordination problems" are an unusually tight bottleneck for productive economic activity. John explores implications of looking at the world through this lens. 

DirectedEvolution
If coordination services command high wages, as John predicts, this suggests that demand is high and supply is limited. Here are some reasons why this might be true: 1. Coordination solutions scale linearly (because the problem is a general one) or exponentially (due to networking effects). 2. Coordination is difficult, unpleasant, risky work. 3. Coordination relies on further resources that are themselves in limited supply or on information that has a short life expectancy, such as involved personal relationships, technical knowhow that depends on a lot of implicit knowledge, familiarity with language and culture, access to user bases and communities, access to restricted communication channels and information, trust, credentials, charisma, money, land, or legal privileges. 4. Coordination is most intensively needed in innovative, infrastructure-development work, which is a high-risk area with long-term payoffs.  5. Coordination is neglected due to systematic biases on an individual and/or institutional level. Perhaps coordination is easy to learn, but is difficult to train in an educational context, and as such is frequently neglected by the educational system. Students are therefore mis-incentivized and don’t engage in developing their coordination skills to anywhere near the possible and optimal level. Alternatively, it might be that we teach coordination in the context of centrally coordination-focused careers (MBAs, for example), but that many other careers less obviously centrally focused on coordination (bench scientists) would also benefit - a problem of interdisciplinary neglect. Note that, if the argument in my review of interfaces as scarce resources is correct, then coordination can also be viewed as a subtype of interface - a way of translating between what a user wants and how they express that desire, into the internal language or structure of a complex system. This makes sense. Google translates natural-language queries into the PageRank algo
#15

AI researcher Paul Christiano discusses the problem of "inaccessible information" - information that AI systems might know but that we can't easily access or verify. He argues this could be a key obstacle in AI alignment, as AIs may be able to use inaccessible knowledge to pursue goals that conflict with human interests.

#16

In the span of a few years, some minor European explorers (later known as the conquistadors) encountered, conquered, and enslaved several huge regions of the world. Daniel Kokotajlo argues this shows the plausibility of a small AI system rapidly taking over the world, even without overwhelming technological superiority. 

Daniel Kokotajlo
(I am the author) I still like & endorse this post. When I wrote it, I hadn't read more than the wiki articles on the subject. But then afterwards I went and read 3 books (written by historians) about it, and I think the original post held up very well to all this new info. In particular, the main critique the post got -- that disease was more important than I made it sound, in a way that undermined my conclusion -- seems to have been pretty wrong. (See e.g. this comment thread, these follow up posts) So, why does it matter? What contribution did this post make? Well, at the time -- and still now, though I think I've made a dent in the discourse -- quite a lot of people I respect (such as people at OpenPhil) seemed to think unaligned AGI would need god-like powers to be able to take over the world -- it would need to be stronger than the rest of the world combined! I think this is based on a flawed model of how takeover/conquest works, and history contains plenty of counterexamples to the model. The conquistadors are my favorite counterexample from my limited knowledge of history. (The flawed model goes by the name of "The China Argument," at least in my mind. You may have heard the argument before -- China is way more capable than the most capable human, yet it can't take over the world; therefore AGI will need to be way way more capable than the most powerful human to take over the world.) Needless to say, this is a somewhat important crux, as illustrated by e.g. Joe Carlsmith's report, which assigns a mere 40% credence to unaligned APS-AI taking over the world even conditional on it escaping and seeking power and managing to cause at least a trillion dollars worth of damage. (I've also gotten feedback from various people at OpenPhil saying that this post was helpful to them, so yay!) I've since written a sequence of posts elaborating on this idea: Takeoff and Takeover in the Past and Future. Alas, I still haven't written the capstone posts in the sequence, t
#17

Steve Byrnes lays out his 7 guiding principles for understanding how the brain works computationally. He argues the neocortex uses a single general learning algorithm that starts as a blank slate, while the subcortex contains hard-coded instincts and steers the neocortex toward biologically adaptive behaviors.

Steven Byrnes
I wrote this relatively early in my journey of self-studying neuroscience. Rereading this now, I guess I'm only slightly embarrassed to have my name associated with it, which isn’t as bad as I expected going in. Some shifts I’ve made since writing it (some of which are already flagged in the text): * New terminology part 1: Instead of “blank slate” I now say “learning-from-scratch”, as defined and discussed here. * New terminology part 2: “neocortex vs subcortex” → “learning subsystem vs steering subsystem”, with the former including the whole telencephalon and cerebellum, and the latter including the hypothalamus and brainstem. I distinguish them by "learning-from-scratch vs not-learning-from-scratch". See here. * Speaking of which, I now put much more emphasis on "learning-from-scratch" over "cortical uniformity" when talking about the neocortex etc.—I care about learning-from-scratch more, I talk about it more, etc. I see the learning-from-scratch hypothesis as absolutely central to a big picture of the brain (and AGI safety!), whereas cortical uniformity is much less so. I do still think cortical uniformity is correct (at least in the weak sense that someone with a complete understanding of one part of the cortex would be well on their way to a complete understanding of any other part of the cortex), for what it’s worth. * I would probably drop the mention of “planning by probabilistic inference”. Well, I guess something kinda like planning by probabilistic inference is part of the story, but generally I see the brain thing as mostly different. * Come to think of it, every time the word “reward” shows up in this post, it’s safe to assume I described it wrong in at least some respect. * The diagram with neocortex and subcortex is misleading for various reasons, see notes added to the text nearby. * I’m not sure I was using the term “analysis-by-synthesis” correctly. I think that term is kinda specific to sound processing. And the vision analog is “vision
#18

Inner alignment refers to the problem of aligning a machine learning model's internal goals (mesa-objective) with the intended goals we are optimizing for externally (base objective). Even if we specify the right base objective, the model may develop its own misaligned mesa-objective through the training process. This poses challenges for AI safety. 
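A toy illustration (my own construction, not taken from the sequence): suppose the base objective is "reach the green door", but in every training layout the green door also happens to be the nearest door. A policy whose mesa-objective is "go to the nearest door" scores perfectly in training yet diverges from the base objective as soon as that coincidence breaks.

```python
# Each layout: doors given as (distance from start, color).
train_layouts = [
    [(2, "green"), (7, "red")],
    [(1, "green"), (9, "blue")],
    [(3, "green"), (8, "red")],
]
test_layout = [(2, "red"), (6, "green")]  # nearest door is no longer green

def mesa_policy(layout):
    """Learned proxy: go to the nearest door."""
    return min(layout, key=lambda door: door[0])

def base_objective(chosen_door):
    """Intended goal: reach the green door."""
    return chosen_door[1] == "green"

print([base_objective(mesa_policy(l)) for l in train_layouts])  # [True, True, True]
print(base_objective(mesa_policy(test_layout)))                 # False: proxy misgeneralizes
```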

Davidmanheim
This post is both a huge contribution, giving a simpler and shorter explanation of a critical topic, with a far clearer context, and has been useful to point people to as an alternative to the main sequence. I wouldn't promote it as more important than the actual series, but I would suggest it as a strong alternative to including the full sequence in the 2020 Review. (Especially because I suspect that those who are very interested are likely to have read the full sequence, and most others will not even if it is included.)
#19

GDP isn't a great metric for AI timelines or takeoff speed because the relevant events (like AI alignment failure or progress towards self-improving AI) could happen before GDP growth accelerates visibly. Instead, we should focus on things like warning shots, heterogeneity of AI systems, risk awareness, multipolarity, and overall "craziness" of the world. 

Daniel Kokotajlo
(I am the author) I still like & stand by this post. I refer back to it constantly. It does two things: 1. Argue that an AI-induced point of no return could come significantly before, or significantly after, world GDP growth accelerates--and indeed will probably come before! 2. Argue that we shouldn't define timelines and takeoff speeds in terms of economic growth. So, against "is there a 4 year doubling before a 1 year doubling?" and against "When will we have TAI = AI capable of doubling the economy in 4 years if deployed?" I think both things are pretty important; I think focus on GWP is distracting us from the metrics that really matter and hence hindering epistemic progress, and I think that most of the AI risk comes from scenarios in which AI-PONR happens before GWP accelerates, so it's important to evaluate the plausibility of such scenarios. I talked with Paul about this post once and he said he still wasn't convinced, he still expects GWP to accelerate before the point of no return. He said some things that I found helpful (e.g. gave some examples of how AI tech will have dramatically shorter product development cycles than historical products, such that you really will be able to deploy it and accelerate the economy in the months to years before substantially better versions are created), but nothing that significantly changed my position either. I would LOVE to see more engagement/discussion of this stuff. (I recognize Paul is busy etc. but lots of people (most people?) have similar views, so there should be plenty of people capable of arguing for his side. On my side, there's MIRI, see this comment, which is great and if I revise this post I'll want to incorporate some of the ideas from it. Of course the best thing to incorporate would be good objections & replies, hence why I wish I had some. I've at least got the previously-mentioned one from Paul. Oh, and Paul also had an objection to my historical precedent which I take seriously.)
#20

Aging, which kills 100,000 people per day, may be solvable. Here's a summary of the most promising anti-aging research, including parabiosis, metabolic manipulation, senolytics, and cellular reprogramming. 

#21

The structure of things-humans-want does not always match the structure of the real world, or the structure of how-other-humans-see-the-world. When structures don't match, someone or something needs to serve as an interface, translating between the two. Interfaces between complex systems and human desires are often a scarce resource.

DirectedEvolution
What this post does for me is that it encourages me to view products and services not as physical facts of our world, as things that happen to exist, but as the outcomes of an active creative process that is still ongoing and open to our participation. It reminds us that everything we might want to do is hard, and that the work of making that task less hard is valuable. Otherwise, we are liable to make the mistake of taking functionality and expertise for granted. What is not an interface? That's the slipperiest aspect of this post. A programming language is an interface to machine code, a programmer to the language, a company to the programmer, a liaison to the company, a department to the liaison, a chain of command to the department, a stock to the chain of command, an index fund to the stock, an app to the index fund. Matter itself is an interface. An iron bar is an interface to iron. An aliquot is an interface to a chemical. A fruit is an interface, translating between the structure of a chloroplast and the structure of things-animals-can-eat. A janitor is an interface to brooms and buckets, the layout of the building, and other considerations bearing on the task of cleaning. We have lots of words in this concept-cluster: tools, products, goods and services, control systems, and now "interfaces." "As a scarce resource," suggests that there are resources that are not interfaces. After all, the implied value prop of this post is that it's suggesting a high-value area for economic activity. But if all economic activity is interface design, then a more accurate title is "Scarce Resources as Interfaces," or "Goods Are Hard To Make And Services Are Hard To Do." The value I get out of this post is that it shifts my thinking about a tool or service away from the mechanism, and toward the value prop. It's also a useful reminder for an early-career professional that their value prop is making a complex system easier to use for somebody else, rather than ticking the bo
#22

Most Prisoner's Dilemmas are actually Stag Hunts in the iterated game, and most Stag Hunts are actually "Schelling games." You have to coordinate on a good equilibrium, but there are many good equilibria to choose from, which benefit different people to different degrees. This complicates the problem of cooperating.

23Bucky
A short note to start the review that the author isn't happy with how it is communicated. I agree it could be clearer and this is the reason I'm scoring this 4 instead of 9. The actual content seems very useful to me. AllAmericanBreakfast has already reviewed this from a theoretical point of view but I wanted to look at it from a practical standpoint.

***

To test whether the conclusions of this post were true in practice I decided to take 5 examples from the Wikipedia page on the Prisoner's dilemma and see if they were better modeled by Stag Hunt or Schelling Pub:

* Climate negotiations
* Relationships
* Marketing
* Doping in sport
* Cold war nuclear arms race

Detailed analysis of each is at the bottom of the review. Of these 5, 3 (Climate, Relationships, Arms race) seem to me to be very well modeled by Schelling Pub. Due to the constraints on communication allowed between rival companies it is difficult to see marketing (where more advertising = defect) as a Schelling Pub game. There probably is an underlying structure which looks a bit like Schelling Pub but it is very hard to move between Nash Equilibria. As a result I would say that Prisoner's Dilemma is a more natural model for marketing.

The choice of whether to dope in sport is probably best modeled as a Prisoner's dilemma with an enforcing authority which punishes defection. As a result, I don't think any of the 3 games are a particularly good model for any individual's choice. However, negotiations on setting up the enforcing authority and the rules under which it operates are more like Schelling Pub. Originally I thought this should maybe count as half a point for the post but thinking about it further I would say this is actually a very strong example of what the post is talking about – if your individual choice looks like a Prisoner's Dilemma then look for ways to make it into a Schelling Pub. If this involves setting up a central enforcement agency then negotiate to make that happen. So I
17DirectedEvolution
The goal of this post is to help us understand the similarities and differences between several different games, and to improve our intuitions about which game is the right default assumption when modeling real-world outcomes. My main objective with this review is to check the game theoretic claims, identify the points at which this post makes empirical assertions, and see if there are any worrisome oversights or gaps. Most of my fact-checking will just be resorting to Wikipedia.

Let's start with definitions of two key concepts.

Pareto-optimal: One dimension cannot improve without a second worsening.
Nash equilibrium: No player can do better by unilaterally changing their strategy.

Here's the payoff matrix from the one-shot Prisoner's Dilemma and how it relates to these key concepts.

|                | B stays silent | B betrays        |
| A stays silent | Pareto-optimal |                  |
| A betrays      |                | Nash equilibrium |

This article outlines three possible relationships between Pareto-optimality and Nash equilibrium.

1. There are no Pareto-optimal Nash equilibria.
2. There is a single Pareto-optimal Nash equilibrium, and another equilibrium that is not Pareto-optimal.
3. There are multiple Pareto-optimal Nash equilibria, which benefit different players to different extents.

The author attempts to argue which of these arrangements best describes the world we live in, and makes the best default assumption when interpreting real-world situations as games. The claim is that real-world situations most often resemble iterated PDs, which have multiple Pareto-optimal Nash equilibria benefitting different players to different extents. I will attempt to show that the author's conclusion only applies when modeling superrational entities, or entities with an unbounded lifespan, and give some examples where this might be relevant. Iterated Prisoner's Dilemma is a little more complex than the author states. If the players know how many turns the game will be played for, or if the game has a known upper limit of t
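To make the two definitions above easy to check against the matrix, here is a minimal Python sketch (mine, not from the post or the review; the payoff numbers are assumed textbook values) that enumerates the pure-strategy Nash equilibria and Pareto-optimal outcomes of a one-shot Prisoner's Dilemma:

```python
from itertools import product

SILENT, BETRAY = 0, 1

# Payoffs (higher is better), as (payoff to A, payoff to B).
# These particular numbers are assumptions, not taken from the post.
payoffs = {
    (SILENT, SILENT): (3, 3),
    (SILENT, BETRAY): (0, 5),
    (BETRAY, SILENT): (5, 0),
    (BETRAY, BETRAY): (1, 1),
}

def is_nash(cell):
    """Neither player can do better by unilaterally changing their own action."""
    a, b = cell
    best_a = all(payoffs[(a, b)][0] >= payoffs[(a2, b)][0] for a2 in (SILENT, BETRAY))
    best_b = all(payoffs[(a, b)][1] >= payoffs[(a, b2)][1] for b2 in (SILENT, BETRAY))
    return best_a and best_b

def is_pareto_optimal(cell):
    """No other outcome improves one player without worsening the other."""
    pa, pb = payoffs[cell]
    return not any(
        qa >= pa and qb >= pb and (qa > pa or qb > pb)
        for other, (qa, qb) in payoffs.items()
        if other != cell
    )

for cell in product((SILENT, BETRAY), repeat=2):
    labels = [name for name, test in (("Nash equilibrium", is_nash),
                                      ("Pareto-optimal", is_pareto_optimal)) if test(cell)]
    print(cell, payoffs[cell], ", ".join(labels))
```

Running it marks mutual betrayal as the only pure-strategy Nash equilibrium and every other outcome (including mutual silence) as Pareto-optimal, which is the arrangement the review's table is pointing at.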
#23

Abram argues against assuming that rational agents have utility functions over worlds (which he calls the "reductive utility" view). Instead, he points out that you can have a perfectly valid decision theory where agents just have preferences over events, without having to assume there's some underlying utility function over worlds.

25Vanessa Kosoy
In this post, the author presents a case for replacing expected utility theory with some other structure which has no explicit utility function, but only quantities that correspond to conditional expectations of utility. To provide motivation, the author starts from what he calls the "reductive utility view", which is the thesis he sets out to overthrow. He then identifies two problems with the view. The first problem is about the ontology in which preferences are defined. In the reductive utility view, the domain of the utility function is the set of possible universes, according to the best available understanding of physics. This is objectionable, because then the agent needs to somehow change the domain as its understanding of physics grows (the ontological crisis problem). It seems more natural to allow the agent's preferences to be specified in terms of the high-level concepts it cares about (e.g. human welfare or paperclips), not in terms of the microscopic degrees of freedom (e.g. quantum fields or strings). There are also additional complications related to the unobservability of rewards, and to "moral uncertainty". The second problem is that the reductive utility view requires the utility function to be computable. The author considers this an overly restrictive requirement, since it rules out utility functions such as in the procrastination paradox (1 if the button is ever pushed, 0 if the button is never pushed). More generally, computable utility functions have to be continuous (in the sense of the topology on the space of infinite histories which is obtained from regarding it as an infinite cartesian product over time). The alternative suggested by the author is using the Jeffrey-Bolker framework. Alas, the author does not write down the precise mathematical definition of the framework, which I find frustrating. The linked article in the Stanford Encyclopedia of Philosophy is long and difficult, and I wish the post had a succinct distillation of the
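As a small worked illustration of the computability point (my gloss on the procrastination-paradox example, not text from the post or the review): writing a history as an infinite sequence of days, the utility in question is

$$
U(h) \;=\;
\begin{cases}
1 & \text{if the button is pushed on some day } t,\\
0 & \text{if the button is never pushed.}
\end{cases}
$$

In the product topology on infinite histories, a continuous utility function has to be pinned down to any desired precision by some finite prefix of the history; but every finite prefix in which the button has not yet been pushed is compatible with both $U = 1$ and $U = 0$, so $U$ is discontinuous at the never-push history, and no program that only ever inspects finite prefixes can compute it.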
15Ben Pace
An Orthodox Case Against Utility Functions was a shocking piece to me. Abram spends the first half of the post laying out a view he suspects people hold, but he thinks is clearly wrong, which is a perspective that approaches things "from the starting-point of the universe". I felt dread reading it, because it was a view I held at the time, and I used as a key background perspective when I discussed bayesian reasoning. The rest of the post lays out an alternative perspective that "starts from the standpoint of the agent". Instead of my beliefs being about the universe, my beliefs are about my experiences and thoughts. I generally nod along to a lot of the 'scientific' discussion in the 21st century about how the universe works and how reasonable the whole thing is. But I don't feel I knew in-advance to expect the world around me to operate on simple mathematical principles and be so reasonable. I could've woken up in the Harry Potter universe of magic wands and spells. I know I didn't, but if I did, I think I would be able to act in it? I wouldn't constantly be falling over myself because I don't understand how 1 + 1 = 2 anymore? There's some place I'm starting from that builds up to an understanding of the universe, and doesn't sneak it in as an 'assumption'. And this is what this new perspective does that Abram lays out in technical detail. (I don't follow it all, for instance I don't recall why it's important that the former view assumes that utility is computable.) In conclusion, this piece is a key step from the existing philosophy of agents to the philosophy of embedded agents, or at least it was for me, and it changes my background perspective on rationality. It's the only post in the early vote that I gave +9. (This review is taken from my post Ben Pace's Controversial Picks for the 2020 Review.)
#24

Success is supposed to open doors and broaden horizons. But often it can do the opposite - trapping people in narrow specialties or roles they've outgrown. This post explores how success can sometimes be the enemy of personal freedom and growth, and how to maintain flexibility as you become more successful.

10DirectedEvolution
There's a lot of attention paid these days to accommodating the personal needs of students. For example, a student with PTSD may need at least one light on in the classroom at all times. Schools are starting to create mechanisms by which a student with this need can have it met more easily. Our ability to do this depends on a lot of prior work. The mental health community had to establish PTSD as a diagnosis; the school had to create a bureaucratic mechanism to normalize accommodations of this kind; and the student had to spend a significant amount of time figuring out what accommodations alleviated their PTSD symptoms and how to get them addressed through the school's bureaucracy. This points in a direction of something like "transitions research," an attempt to identify and economically address the specific barriers that skew individuals toward immediate modest-productivity strategies and away from long-term high-productivity strategies. Imagine if there was a well-known "diagnosis" of "status-loss anxiety," in which a person who's achieved some professional success notices themselves avoiding situations that would be likely to enhance their growth, yet come with a threat of loss of status. It's like the depressed person who resists mental health counseling because it implies there's something wrong with them. Being able to identify that precise reaction, label it, raise awareness of it, and find means and messages to address it would be helpful to overcome a barrier to mental health treatment. In economics jargon, what's going on here is not so much the sunk cost fallacy as a combination of aging, opportunity cost and diminishing returns. Learning takes time, aging us, and this means we have less time to profit off a new long-term investment in skill-building. Increased skill raises the opportunity cost of learning new skills. Diminishing returns means that, if we learn a skill that increases our profit from A + B to A + 2B, this is less intrinsically valu
#25

Vanessa and diffractor introduce a new approach to epistemology / decision theory / reinforcement learning theory called Infra-Bayesianism, which aims to solve issues with prior misspecification and non-realizability that plague traditional Bayesianism.

12Diffractor
This post is still endorsed, it still feels like a continually fruitful line of research. A notable aspect of it is that, as time goes on, I keep finding more connections and crisper ways of viewing things which means that for many of the further linked posts about inframeasure theory, I think I could explain them from scratch better than the existing work does. One striking example is that the "Nirvana trick" stated in this intro (to encode nonstandard decision-theory problems), has transitioned from "weird hack that happens to work" to "pops straight out when you make all the math as elegant as possible". Accordingly, I'm working on a "living textbook" (like a textbook, but continually being updated with whatever cool new things we find) where I try to explain everything from scratch in the crispest way possible, to quickly catch up on the frontier of what we're working on. That's my current project. I still do think that this is a large and tractable vein of research to work on, and the conclusion hasn't changed much.
#26

Dogmatic probabilism is the theory that all rational belief updates should be Bayesian updates. Radical probabilism is a more flexible theory which allows agents to radically change their beliefs, while still obeying some constraints. Abram examines how radical probabilism differs from dogmatic probabilism, and what implications the theory has for rational agents.

#27
There are two kinds of puzzles: "reality-revealing puzzles" that help us understand the world better, and "reality-masking puzzles" that can inadvertently disable parts of our ability to see clearly. CFAR's work has involved both types as it has tried to help people reason about existential risk from AI while staying grounded. We need to be careful about disabling too many of our epistemic safeguards.
17Zvi
This is a long and good post with a title and early framing advertising a shorter and better post that does not fully exist, but would be great if it did.  The actual post here is something more like "CFAR and the Quest to Change Core Beliefs While Staying Sane."  The basic problem is that people by default have belief systems that allow them to operate normally in everyday life, and that protect them against weird beliefs and absurd actions, especially ones that would extract a lot of resources in ways that don't clearly pay off. And they similarly protect those belief systems in order to protect that ability to operate in everyday life, and to protect their social relationships, and their ability to be happy and get out of bed and care about their friends and so on.  A bunch of these defenses are anti-epistemic, or can function that way in many contexts, and stand in the way of big changes in life (change jobs, relationships, religions, friend groups, goals, etc etc).  The hard problem CFAR is largely trying to solve in this telling, and that the sequences try to solve in this telling, is to disable such systems enough to allow good things, without also allowing bad things, or to find ways to cope with the subsequent bad things slash disruptions. When you free people to be shaken out of their default systems, they tend to go to various extremes that are unhealthy for them, like optimizing narrowly for one goal instead of many goals, or having trouble spending resources (including time) on themselves at all, or being in the moment and living life, And That's Terrible because it doesn't actually lead to better larger outcomes in addition to making those people worse off themselves. These are good things that need to be discussed more, but the title and introduction promise something I find even more interesting. In that taxonomy, the key difference is that there are games one can play, things one can be optimizing for or responding to, incentives one can creat
#28

Crawford looks back on past celebrations of achievements like the US transcontinental railroad, the Brooklyn Bridge, electric lighting, the polio vaccine, and the Moon landing. He then asks: Why haven't we celebrated any major achievements lately? He explores some hypotheses for this change.

11jasoncrawford
Since writing this, I've run across even more examples:

* The transatlantic telegraph was met with celebrations similar to the transcontinental railroad, etc. (somewhat premature as the first cable broke after two weeks). Towards the end of Samuel Morse's life and/or at his death, he was similarly feted as a hero.
* The Wright Brothers were given an enormous parade and celebration in their hometown of Dayton, OH when they returned from their first international demonstrations of the airplane.

I'd like to write these up at some point. Related: The poetry of progress (another form of celebration, broadly construed)
#29

Andrew Critch lists several research areas that seem important to AI existential safety, and evaluates them for direct helpfulness, educational value, and neglect. Along the way, he argues that the main way he sees present-day technical research helping is by anticipating, legitimizing and fulfilling governance demands for AI technology that will arise later.

#30

How is it that we solve engineering problems? What is the nature of the design process that humans follow when building an air conditioner or computer program? How does this differ from the search processes present in machine learning and evolution? This essay studies search and design as distinct approaches to engineering, arguing that establishing trust in an artifact is tied to understanding how that artifact works, and that a central difference between search and design is the comprehensibility of the artifacts produced.

#31

People often ask "Can you keep this confidential?" without really checking if the person has the skills to do so. Raemon argues we need to be more careful about how we handle confidential information, and have explicit conversations about privacy practices.

#32

AI Impacts investigated dozens of technological trends, looking for examples of discontinuous progress (where more than a century of progress happened at once). They found ten robust cases, such as the first nuclear weapons and the Great Eastern steamship.

They hope the data can inform expectations about discontinuities in AI development.

#33

The path to explicit reason is fraught with challenges. People often don't want to use explicit reason, and when they try to use it, they fail. Even if they succeed, they're punished socially. The post explores various obstacles on this path, including social pressure, strange memeplexes, and the "valley of bad rationality".

11Yoav Ravid
I remember this post very fondly. I often thought back to it and it inspired some thoughts of my own about rationality (which I had trouble writing down and are waiting in a draft to be written fully some day). I haven't used any of the phrases introduced here (Underperformance Swamp, Sinkholes of Sneer, Valley of Disintegration...), and I'm not sure whether it was the intention. The post starts with the claim that rationalists "basically got everything about COVID-19 right and did so months ahead of the majority of government officials, journalists, and supposed experts". Since it's not the point of the post I won't review this claim in depth, but it seems basically true to me. Elizabeth's review here gives a few examples. This post is about the difficulty and even danger in becoming a rationalist, or more generally, in using explicit reasoning (Intuition and Social Cognition being the alternatives). The first difficulty is that explicit reasoning alone often fails to outperform intuition and social cognition where those perform well. I think this is true, and as the rationality community evolved it came to appreciate intuition and social cognition more, without devaluing explicit reason. The second is persevering through the sneer and social pressure that comes from trying to use explicit reason to do things, often coming to very different approaches from other people, and often also failing. The third is navigating the strange status hierarchy in the community, which mostly doesn't depend on regular things like attractiveness and more often on our ability to apply explicit reason effectively, as well as being scared by strange memes like AI risk and cryonics. I don't know to what extent the first part is true in the physical communities, but it definitely is in the virtual community.  The fourth is where the danger comes in. When you're in the Valley of Bad Rationality your life can get worse, and if you don't get out of it some way it might stay worse. So
#34

The neocortex has been hypothesized to be uniformly composed of general-purpose data-processing modules. What does the currently available evidence suggest about this hypothesis? Alex Zhu explores various pieces of evidence, including deep learning neural networks and predictive coding theories of brain function.

#35

You've probably heard the advice "to be a good listener, reflect back what people tell you." Ben Kuhn argues this is cargo cult advice that misses the point. The real key to good listening is intense curiosity about the details of the other person's situation. 

#36

A counterintuitive concept: Sometimes people choose the worse option, to signal their loyalty or values in situations where that loyalty might be in question. Zvi explores this idea of "motive ambiguity" and how it can lead to perverse incentives. 

12DirectedEvolution
This post is based on the book Moral Mazes, which is a 1988 book describing "the way bureaucracy shapes moral consciousness" in US corporate managers. The central point is that it's possible to imagine relationship and organization structures in which unnecessarily destructive behavior, to self or others, is used as a costly signal of loyalty or status. Zvi titles the post after what he says these behaviors are trying to avoid, motive ambiguity. He doesn't label the dynamic itself, so I'll refer to it here as "disambiguating destruction" (DD). Before proceeding, I want to emphasize that DD is referring to truly pointless destruction for the exclusive purpose of signaling a specific motive, and not to an unavoidable tradeoff. This raises several questions, which the post doesn't answer.

1. Do pointlessly destructive behaviors typically succeed at reducing or eliminating motive ambiguity?
2. Do they do a better job of reducing motive ambiguity than alternatives?
3. How common is DD in particular types of institutions, such as relationships, cultures, businesses, and governments?
4. How do people manage to avoid feeling pressured into DD?
5. What exactly are the components of DD, so that we can know what to look for when deciding whether to enter into a certain organization or relationship?
6. Are there other explanations for the components of DD, and how would we distinguish between DD and other possible interpretations of the component behaviors?

We might resort to a couple explanations for (4), the question of how to avoid DD. One is the conjunction of empathy and act utilitarianism. My girlfriend says she wouldn't want to go to a restaurant only she loves, even if the purpose was to show I love her. Part of her enjoyment is my enjoyment of the experience. If she loved the restaurant only she loves so much that she was desperate to go, then she could go with someone else. She finds the whole idea of destructive disambiguation of love to be distinctly unapp
#37

The felt sense is a concept coined by psychologist Eugene Gendlin to describe a kind of pre-linguistic, physical sensation that represents some mental content. Kaj gives examples of felt senses, explains why they're useful to pay attention to, and gives tips on how to notice and work with them.

10Raemon
This post feels like an important part of what I've referred to as The CFAR Development Branch Git Merge. Between 2013ish and 2017ish, a lot of rationality development happened in person, which built off the sequences. I think some of that work turned out to be dead ends, or a bit confused, or not as important as we thought at the time. But a lot of it has been quite essential to rationality as a practice. I'm glad it has gotten written up. The felt sense, and focusing, have been two surprisingly important tools for me. One use case not quite mentioned here – and I think perhaps the most important one for rationality – is for getting a handle on what I actually think. Kaj discusses using it for figuring out how to communicate better, getting a sense of what your interlocutor is trying to understand and how it contrasts with what you're trying to say. But I think this is also useful in single-player mode. i.e. I say "I think X", and then I notice "no, there's a subtle wrongness to my description of what X is". This is helpful both for clarifying my beliefs about subtle topics, or for following fruitful trails of brainstorming.
#38

If you know nothing about a thing, the first example or sample gives you a disproportionate amount of information, often more than any subsequent sample. It lets you locate the idea in conceptspace, get a sense of what domain/scale/magnitude you're dealing with, and provides an anchor for further thinking.

#39

You've probably heard that a nuclear war between major powers would cause human extinction. This post argues that while nuclear war would be incredibly destructive, it's unlikely to actually cause human extinction. The main risks come from potential climate effects, but even in severe scenarios some human populations would likely survive.

18TurnTrout
This will not be a full review—it's more of a drive-by comment which I think is relevant to the review process. I am extremely skeptical of and am not at all confident in this conclusion. Ellsberg's The Doomsday Machine describes a horribly incentivized military establishment which pursued bloodthirsty and senseless policies, deceiving their superiors (including several presidents), breaking authentication protocols, refusing to adopt plans which didn't senselessly destroy China in a conflict with the Soviet Union, sub-delegation of nuclear launch authority to theater commanders and their subordinates (no, it's not operationally true that the US president has to authorize an attack!), lack of controls against false alarms, and constant presidential threats of first-use. The USAF would manipulate presidential officials in order to secure funding, via tactics such as inflating threat estimates or ignoring evidence that the Soviet Union had less nuclear might than initially thought. And Ellsberg stated that he didn't think much had changed since his tenure in the 50s-70s. While individual planners might be aware of the nuclear winter risks, the overall US military establishment seems insane to me around nuclear policy—and what of those in other nuclear powers? However, The Doomsday Machine is my only exposure to these considerations, and perhaps I'm missing a broader perspective. If so, I think that case should be more clearly spelled out, because as far as I can tell, nuclear policy seems like yet another depravedly inadequate facet of our current civilization.
16Bucky
The post claims: This review aims to assess whether having read the post I can conclude the same. The review is split into 3 parts:

* Epistemic spot check
* Examining the argument
* Outside the argument

Epistemic spot check

Claim: There are 14,000 nuclear warheads in the world.
Assessment: True

Claim: Average warhead yield <1 Mt, probably closer to 100kt
Assessment: Probably true, possibly misleading. Values I found were:

* US
  * W78 warhead: 335-350kt
  * W87 warhead: 300 or 475 kt
* Russia
  * R-36 missile: 550-750 kt
  * R29 missile: 100 or 500kt

The original claim read to me that 100 kt was probably pretty close and 1Mt was a big factor of safety (~x10), whereas the safety factor was actually less than that (~x3). However that's the advantage of having a safety factor – even if it's a bit misleading there still is a safety factor in the calculations. I found the lack of links slightly frustrating here – it would have been nice to see where the OP got the numbers from.

Examining the argument

The argument itself can be summarized as:

1. Kinetic destruction can't be big enough
2. Radiation could theoretically be enough but in practice wouldn't be
3. Nuclear winter not sufficient to cause extinction

One assumption in the arguments for 1 & 2 is that the important factor is the average warhead yield and that e.g. a 10Mt warhead doesn't have an outsized effect. This seems likely and a comment suggests that going over 500kt doesn't make as much difference as might be thought and that is why warheads are the size that they are. Arguments 1 & 2 seem very solid. We have done enough tests that our understanding of kinetic destruction is likely to be fairly good so I don't have much concern there. Similarly, radiation is well understood and dispersal patterns seem kinda predictable in principle, and even if these are wrong the total amount of radiation doesn't change, just where it is. Climate change is less easy to model, especially giv
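Making the "~x10 vs ~x3" explicit (my arithmetic, reconstructing the review's reasoning rather than quoting it): the post's 1 Mt ceiling sits a factor of ten above its ~100 kt estimate, but only a factor of two to three above the 300-500 kt yields listed above:

$$
\frac{1\,\mathrm{Mt}}{100\,\mathrm{kt}} = 10,
\qquad
\frac{1\,\mathrm{Mt}}{300\text{–}500\,\mathrm{kt}} \approx 2\text{–}3.
$$

So the safety margin in the post's calculation is real, just smaller than the headline numbers suggest.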
#40

All sorts of everyday practices in the legal system, medicine, software, and other areas of life involve stating things that aren't true. But calling these practices "lies" or "fraud" seems to be perceived as an attack rather than a straightforward description. This makes it difficult to discuss and analyze these practices without provoking emotional defensiveness. 

#41

The Swiss political system is known for its extensive use of direct democracy. This post dives deep into how that system works, exploring the different types of referenda, their history, impacts, and quirks. It's a detailed look at a unique political system that has managed to largely avoid polarization. 

10Martin Sustrik
Self-review: Looking at the essay a year and a half later I am still reasonably happy about it. In the meantime I've seen Swiss people recommending it as an introductory text for people asking about the Swiss political system, so I am, of course, honored, but it also gives me some confidence in not being totally off. If I had to write the essay again, I would probably give less prominence to direct democracy and more to the concordance and decentralization, which are less eye-catchy but in a way more interesting/important. Also, I would probably pay some attention to the question of how the system - given how unique it is - even managed to evolve. Maybe also do some investigation into whether the uniqueness of the political system has something to do with the surprising long-term ability of the Swiss economy to reinvent itself and become a leader in areas as varied as mercenary troops, cheese, silk, machinery, banking and pharmaceuticals.
#42

Under conditions of perfectly intense competition, evolution works like water flowing down a hill – it can never go up even the tiniest elevation. But if there is slack in the selection process, it's possible for evolution to escape local minima. "How much slack is optimal?" is an interesting question, which Scott explores in various contexts.
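A minimal sketch of the mechanism (mine, not from the post; the landscape, slack value, and step rule are invented for illustration): a strictly downhill walker gets stuck in the first basin it finds, while a walker allowed a little slack per step can cross the ridge into a deeper basin.

```python
import random

def height(x):
    # Double-well landscape: a shallow basin near x = +1 and a slightly
    # deeper one near x = -1, separated by a ridge at x = 0.
    return (x * x - 1) ** 2 + 0.05 * x

def best_height_reached(slack, start=1.0, steps=20_000, step_size=0.1, seed=0):
    rng = random.Random(seed)
    x = start
    best = height(x)
    for _ in range(steps):
        candidate = x + rng.choice((-step_size, step_size))
        # Zero slack = strictly "water flows downhill"; positive slack
        # tolerates small uphill moves.
        if height(candidate) <= height(x) + slack:
            x = candidate
            best = min(best, height(x))
    return best

print("no slack:  ", round(best_height_reached(slack=0.0), 3))  # stays in the shallow basin
print("some slack:", round(best_height_reached(slack=0.2), 3))  # reaches the deeper basin
```

The exact numbers don't matter; the point is that tolerating a little per-step "badness" is what lets the process leave a local optimum at all.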

31DirectedEvolution
The referenced study on group selection on insects is "Group selection among laboratory populations of Tribolium," from 1976. Studies on Slack claims that "They hoped the insects would evolve to naturally limit their family size in order to keep their subpopulation alive. Instead, the insects became cannibals: they ate other insects’ children so they could have more of their own without the total population going up."  This makes it sound like cannibalism was the only population-limiting behavior the beetles evolved. According to the original study, however, the low-population condition (B populations) showed a range of population size-limiting strategies, including but not limited to higher cannibalism rates. "Some of the B populations enjoy a higher cannibalism rate than the controls while other B populations have a longer mean developmental time or a lower average fecundity relative to the controls. Unidirectional group selection for lower adult population size resulted in a multivarious response among the B populations because there are many ways to achieve low population size." Scott claims that group selection can't work to restrain boom-bust cycles (i.e. between foxes and rabbits) because "the fox population has no equivalent of the overarching genome; there is no set of rules that govern the behavior of every fox." But the empirical evidence of the insect study he cited shows that we do in fact see changes in developmental time and fecundity. After all, a species has considerable genetic overlap between individuals, even if we're not talking about heavily inbred family members, as we'd be seeing in the beetle study. Wikipedia's article on human genetic diversity cites a Nature article and says "as of 2015, the typical difference between an individual's genome and the reference genome was estimated at 20 million base pairs (or 0.6% of the total of 3.2 billion base pairs)." An explanation here is that the inbred beetles of the study are becoming progressiv
#43

John examines the problem of "how to transport things?" through the lens of "what's the taut constraint on the system?" He asks questions across history, from "how could Alexander the Great's army cross 150 miles of desert?", to how modern supply chains work, to what would happen in a future world with teleportation.

#44

The date of AI takeover is not the day the AI takes over. Instead, it's the point of no return—the day we AI risk reducers lose the ability to significantly reduce AI risk. This might happen years before classic milestones like "World GWP doubles in four years" and "Superhuman AGI is deployed."

24Zack_M_Davis
This post is making a valid point (the time to intervene to prevent an outcome that would otherwise occur, is going to be before the outcome actually occurs), but I'm annoyed with the mind projection fallacy by which this post seems to treat "point of no return" as a feature of the territory, rather than your planning algorithm's map. (And, incidentally, I wish this dumb robot cult still had a culture that cared about appreciating cognitive algorithms as the common interest of many causes, such that people would find it more natural to write a post about "point of no return"-reasoning as a general rationality topic that could have all sorts of potential applications, rather than the topic specifically being about the special case of the coming robot apocalypse. But it's probably not fair to blame Kokotajlo for this.) The concept of a "point of no return" only makes sense relative to a class of interventions. A 1 kg ball is falling, accelerating at 9.8 m/s². When is the "point of no return" at which the ball has accelerated enough such that it's no longer possible to stop it from hitting the ground? The problem is underspecified as stated. If we add the additional information that your means of intervening is a net that can only trap objects carrying less than X kg⋅m/s of momentum, then we can say that the point of no return happens at X/9.8 seconds. But it would be weird to talk about "the second we ball risk reducers lose the ability to significantly reduce the risk of the ball hitting the ground" as if that were an independent pre-existing fact that we could use to determine how strong of a net we need to buy, because it depends on the net strength.
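Spelling out the arithmetic in the ball-and-net example (my reconstruction of the reviewer's numbers): a 1 kg ball dropped from rest carries momentum $p(t) = mgt$ after $t$ seconds, so a net rated to stop at most $X$ kg⋅m/s of momentum fails once

$$
p(t^\ast) = (1\,\mathrm{kg})(9.8\,\mathrm{m/s^2})\,t^\ast = X
\;\Longrightarrow\;
t^\ast = \frac{X}{9.8}\ \text{seconds}.
$$

A stronger net (larger $X$) pushes $t^\ast$ later, which is exactly the sense in which the "point of no return" is relative to the available interventions rather than a fact about the ball alone.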
#45

Eliezer Yudkowsky recently criticized the OpenPhil draft report on AI timelines. Holden Karnofsky thinks Eliezer misunderstood the report in important ways, and defends the report's usefulness as a tool for informing (not determining) AI timelines.

#46

The practice of extrapolating AI timelines based on biological analogies has a long history of not working. Eliezer argues that this is because the resource gets consumed differently, so base-rate arguments from resource consumption end up quite unhelpful in real life. 

Timelines are inherently very difficult to predict accurately, until we are much closer to AGI.