The Best of LessWrong

Here you can find the best posts of LessWrong. Once posts are more than a year old, the LessWrong community reviews and votes on how well they have stood the test of time. These are the posts that have ranked highest across all years since 2018 (when our annual tradition of choosing the least wrong of LessWrong began).

For the years 2018, 2019 and 2020 we also published physical books with the results of our annual vote, which you can buy and learn more about here.

Rationality

Eliezer Yudkowsky
Local Validity as a Key to Sanity and Civilization
Buck
"Other people are wrong" vs "I am right"
Mark Xu
Strong Evidence is Common
johnswentworth
You Are Not Measuring What You Think You Are Measuring
johnswentworth
Gears-Level Models are Capital Investments
Hazard
How to Ignore Your Emotions (while also thinking you're awesome at emotions)
Scott Garrabrant
Yes Requires the Possibility of No
Scott Alexander
Trapped Priors As A Basic Problem Of Rationality
Duncan Sabien (Deactivated)
Split and Commit
Ben Pace
A Sketch of Good Communication
Eliezer Yudkowsky
Meta-Honesty: Firming Up Honesty Around Its Edge-Cases
Duncan Sabien (Deactivated)
Lies, Damn Lies, and Fabricated Options
Duncan Sabien (Deactivated)
CFAR Participant Handbook now available to all
johnswentworth
What Are You Tracking In Your Head?
Mark Xu
The First Sample Gives the Most Information
Duncan Sabien (Deactivated)
Shoulder Advisors 101
Zack_M_Davis
Feature Selection
abramdemski
Mistakes with Conservation of Expected Evidence
Scott Alexander
Varieties Of Argumentative Experience
Eliezer Yudkowsky
Toolbox-thinking and Law-thinking
alkjash
Babble
Kaj_Sotala
The Felt Sense: What, Why and How
Duncan Sabien (Deactivated)
Cup-Stacking Skills (or, Reflexive Involuntary Mental Motions)
Ben Pace
The Costly Coordination Mechanism of Common Knowledge
Jacob Falkovich
Seeing the Smoke
Elizabeth
Epistemic Legibility
Daniel Kokotajlo
Taboo "Outside View"
alkjash
Prune
johnswentworth
Gears vs Behavior
Raemon
Noticing Frame Differences
Duncan Sabien (Deactivated)
Sazen
AnnaSalamon
Reality-Revealing and Reality-Masking Puzzles
Eliezer Yudkowsky
ProjectLawful.com: Eliezer's latest story, past 1M words
Eliezer Yudkowsky
Self-Integrity and the Drowning Child
Jacob Falkovich
The Treacherous Path to Rationality
Scott Garrabrant
Tyranny of the Epistemic Majority
alkjash
More Babble
abramdemski
Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Schelling Problems
Raemon
Being a Robust Agent
Zack_M_Davis
Heads I Win, Tails?—Never Heard of Her; Or, Selective Reporting and the Tragedy of the Green Rationalists
Benquo
Reason isn't magic
habryka
Integrity and accountability are core parts of rationality
Raemon
The Schelling Choice is "Rabbit", not "Stag"
Diffractor
Threat-Resistant Bargaining Megapost: Introducing the ROSE Value
Raemon
Propagating Facts into Aesthetics
johnswentworth
Simulacrum 3 As Stag-Hunt Strategy
LoganStrohl
Catching the Spark
Jacob Falkovich
Is Rationalist Self-Improvement Real?
Benquo
Excerpts from a larger discussion about simulacra
Zvi
Simulacra Levels and their Interactions
abramdemski
Radical Probabilism
sarahconstantin
Naming the Nameless
AnnaSalamon
Comment reply: my low-quality thoughts on why CFAR didn't get farther with a "real/efficacious art of rationality"
Eric Raymond
Rationalism before the Sequences
Owain_Evans
The Rationalists of the 1950s (and before) also called themselves “Rationalists”

Optimization

sarahconstantin
The Pavlov Strategy
johnswentworth
Coordination as a Scarce Resource
AnnaSalamon
What should you change in response to an "emergency"? And AI risk
Zvi
Prediction Markets: When Do They Work?
johnswentworth
Being the (Pareto) Best in the World
alkjash
Is Success the Enemy of Freedom? (Full)
jasoncrawford
How factories were made safe
HoldenKarnofsky
All Possible Views About Humanity's Future Are Wild
jasoncrawford
Why has nuclear power been a flop?
Zvi
Simple Rules of Law
Elizabeth
Power Buys You Distance From The Crime
Eliezer Yudkowsky
Is Clickbait Destroying Our General Intelligence?
Scott Alexander
The Tails Coming Apart As Metaphor For Life
Zvi
Asymmetric Justice
Jeffrey Ladish
Nuclear war is unlikely to cause human extinction
Spiracular
Bioinfohazards
Zvi
Moloch Hasn’t Won
Zvi
Motive Ambiguity
Benquo
Can crimes be discussed literally?
Said Achmiz
The Real Rules Have No Exceptions
Lars Doucet
Lars Doucet's Georgism series on Astral Codex Ten
johnswentworth
When Money Is Abundant, Knowledge Is The Real Wealth
HoldenKarnofsky
This Can't Go On
Scott Alexander
Studies On Slack
johnswentworth
Working With Monsters
jasoncrawford
Why haven't we celebrated any major achievements lately?
abramdemski
The Credit Assignment Problem
Martin Sustrik
Inadequate Equilibria vs. Governance of the Commons
Raemon
The Amish, and Strategic Norms around Technology
Zvi
Blackmail
KatjaGrace
Discontinuous progress in history: an update
Scott Alexander
Rule Thinkers In, Not Out
Jameson Quinn
A voting theory primer for rationalists
HoldenKarnofsky
Nonprofit Boards are Weird
Wei Dai
Beyond Astronomical Waste
johnswentworth
Making Vaccine
jefftk
Make more land

World

Ben
The Redaction Machine
Samo Burja
On the Loss and Preservation of Knowledge
Alex_Altair
Introduction to abstract entropy
Martin Sustrik
Swiss Political System: More than You ever Wanted to Know (I.)
johnswentworth
Interfaces as a Scarce Resource
johnswentworth
Transportation as a Constraint
eukaryote
There’s no such thing as a tree (phylogenetically)
Scott Alexander
Is Science Slowing Down?
Martin Sustrik
Anti-social Punishment
Martin Sustrik
Research: Rescuers during the Holocaust
GeneSmith
Toni Kurz and the Insanity of Climbing Mountains
johnswentworth
Book Review: Design Principles of Biological Circuits
Elizabeth
Literature Review: Distributed Teams
Valentine
The Intelligent Social Web
jacobjacob
Unconscious Economics
eukaryote
Spaghetti Towers
Eli Tyre
Historical mathematicians exhibit a birth order effect too
johnswentworth
What Money Cannot Buy
Scott Alexander
Book Review: The Secret Of Our Success
johnswentworth
Specializing in Problems We Don't Understand
KatjaGrace
Why did everything take so long?
Ruby
[Answer] Why wasn't science invented in China?
Scott Alexander
Mental Mountains
Kaj_Sotala
My attempt to explain Looking, insight meditation, and enlightenment in non-mysterious terms
johnswentworth
Evolution of Modularity
johnswentworth
Science in a High-Dimensional World
zhukeepa
How uniform is the neocortex?
Kaj_Sotala
Building up to an Internal Family Systems model
Steven Byrnes
My computational framework for the brain
Natália
Counter-theses on Sleep
abramdemski
What makes people intellectually active?
Bucky
Birth order effect found in Nobel Laureates in Physics
KatjaGrace
Elephant seal 2
JackH
Anti-Aging: State of the Art
Vaniver
Steelmanning Divination
Kaj_Sotala
Book summary: Unlocking the Emotional Brain

AI Strategy

Ajeya Cotra
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Daniel Kokotajlo
Cortés, Pizarro, and Afonso as Precedents for Takeover
Daniel Kokotajlo
The date of AI Takeover is not the day the AI takes over
paulfchristiano
What failure looks like
Daniel Kokotajlo
What 2026 looks like
gwern
It Looks Like You're Trying To Take Over The World
Andrew_Critch
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)
paulfchristiano
Another (outer) alignment failure story
Ajeya Cotra
Draft report on AI timelines
Eliezer Yudkowsky
Biology-Inspired AGI Timelines: The Trick That Never Works
HoldenKarnofsky
Reply to Eliezer on Biological Anchors
Richard_Ngo
AGI safety from first principles: Introduction
Daniel Kokotajlo
Fun with +12 OOMs of Compute
Wei Dai
AI Safety "Success Stories"
KatjaGrace
Counterarguments to the basic AI x-risk case
johnswentworth
The Plan
Rohin Shah
Reframing Superintelligence: Comprehensive AI Services as General Intelligence
lc
What an actually pessimistic containment strategy looks like
Eliezer Yudkowsky
MIRI announces new "Death With Dignity" strategy
evhub
Chris Olah’s views on AGI safety
So8res
Comments on Carlsmith's “Is power-seeking AI an existential risk?”
Adam Scholl
Safetywashing
abramdemski
The Parable of Predict-O-Matic
KatjaGrace
Let’s think about slowing down AI
nostalgebraist
human psycholinguists: a critical appraisal
nostalgebraist
larger language models may disappoint you [or, an eternally unfinished draft]
Daniel Kokotajlo
Against GDP as a metric for timelines and takeoff speeds
paulfchristiano
Arguments about fast takeoff
Eliezer Yudkowsky
Six Dimensions of Operational Adequacy in AGI Projects

Technical AI Safety

Andrew_Critch
Some AI research areas and their relevance to existential safety
1a3orn
EfficientZero: How It Works
elspood
Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment
So8res
Decision theory does not imply that we get to have nice things
TurnTrout
Reward is not the optimization target
johnswentworth
Worlds Where Iterative Design Fails
Vika
Specification gaming examples in AI
Rafael Harth
Inner Alignment: Explain like I'm 12 Edition
evhub
An overview of 11 proposals for building safe advanced AI
johnswentworth
Alignment By Default
johnswentworth
How To Go From Interpretability To Alignment: Just Retarget The Search
Alex Flint
Search versus design
abramdemski
Selection vs Control
Mark Xu
The Solomonoff Prior is Malign
paulfchristiano
My research methodology
Eliezer Yudkowsky
The Rocket Alignment Problem
Eliezer Yudkowsky
AGI Ruin: A List of Lethalities
So8res
A central AI alignment problem: capabilities generalization, and the sharp left turn
TurnTrout
Reframing Impact
Scott Garrabrant
Robustness to Scale
paulfchristiano
Inaccessible information
TurnTrout
Seeking Power is Often Convergently Instrumental in MDPs
So8res
On how various plans miss the hard bits of the alignment challenge
abramdemski
Alignment Research Field Guide
paulfchristiano
The strategy-stealing assumption
Veedrac
Optimality is the tiger, and agents are its teeth
Sam Ringer
Models Don't "Get Reward"
johnswentworth
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables
Buck
Language models seem to be much better than humans at next-token prediction
abramdemski
An Untrollable Mathematician Illustrated
abramdemski
An Orthodox Case Against Utility Functions
johnswentworth
Selection Theorems: A Program For Understanding Agents
Rohin Shah
Coherence arguments do not entail goal-directed behavior
Alex Flint
The ground of optimization
paulfchristiano
Where I agree and disagree with Eliezer
Eliezer Yudkowsky
Ngo and Yudkowsky on alignment difficulty
abramdemski
Embedded Agents
evhub
Risks from Learned Optimization: Introduction
nostalgebraist
chinchilla's wild implications
johnswentworth
Why Agent Foundations? An Overly Abstract Explanation
zhukeepa
Paul's research agenda FAQ
Eliezer Yudkowsky
Coherent decisions imply consistent utilities
paulfchristiano
Open question: are minimal circuits daemon-free?
evhub
Gradient hacking
janus
Simulators
LawrenceC
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
TurnTrout
Humans provide an untapped wealth of evidence about alignment
Neel Nanda
A Mechanistic Interpretability Analysis of Grokking
Collin
How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme
evhub
Understanding “Deep Double Descent”
Quintin Pope
The shard theory of human values
TurnTrout
Inner and outer alignment decompose one hard problem into two extremely hard problems
Eliezer Yudkowsky
Challenges to Christiano’s capability amplification proposal
Scott Garrabrant
Finite Factored Sets
paulfchristiano
ARC's first technical report: Eliciting Latent Knowledge
Diffractor
Introduction To The Infra-Bayesianism Sequence
#1

A few dozen reasons that Eliezer thinks AGI alignment is an extremely difficult problem, which humanity is not on track to solve.

13 Ben Pace
+9. This is a powerful set of arguments pointing out how humanity will literally go extinct soon due to AI development (or have something similarly bad happen to us). A lot of thought and research went into developing an understanding of the problem deep enough to produce this level of clarity about the problems we face, and I'm extremely glad it was written up.
#2

"Wait, dignity points?" you ask.  "What are those?  In what units are they measured, exactly?"

And to this I reply:  "Obviously, the measuring units of dignity are over humanity's log odds of survival - the graph on which the logistic success curve is a straight line.  A project that doubles humanity's chance of survival from 0% to 0% is helping humanity die with one additional information-theoretic bit of dignity."

"But if enough people can contribute enough bits of dignity like that, wouldn't that mean we didn't die at all?"  "Yes, but again, don't get your hopes up."

13 johnswentworth
Based on occasional conversations with new people, I would not be surprised if a majority of people who got into alignment between April 2022 and April 2023 did so mainly because of this post. Most of them say something like "man, I did not realize how dire the situation looked" or "I thought the MIRI folks were on it or something".
#3

Paul writes a list of 19 important places where he agrees with Eliezer on AI existential risk and safety, and a list of 27 places where he disagrees. He argues that Eliezer has raised many good considerations backed by pretty clear arguments, but that Eliezer makes confident assertions that are much stronger than anything suggested by the actual arguments.

11 Jan_Kulveit
This is a great complement to Eliezer's 'List of lethalities', in particular because, in cases of disagreement, the beliefs of most people working on the problem were, and still mostly are, closer to this post. Paul writing it provided a clear, well-written reference point, and, with many others expressing their views in comments and other posts, helped make the beliefs in AI safety more transparent. I still occasionally reference this post when talking to people who, after reading a bit about the debate (e.g. on social media), first form an oversimplified model of the debate in which there is some unified 'safety' camp vs. 'optimists'. Also, I think this demonstrates that 'just stating your beliefs' in a moderately-dimensional projection can be a useful type of post, even without much justification.
10 Vanessa Kosoy
I wrote a review here. There, I identify the main generators of Christiano's disagreement with Yudkowsky[1] and add some critical commentary. I also frame it in terms of a broader debate in the AI alignment community. 1. ^ I divide those into "takeoff speeds", "attitude towards prosaic alignment" and "the metadebate" (the last one is about what kind of debate norms we should have about this, or what kind of arguments we should listen to).
#4

Historically people worried about extinction risk from artificial intelligence have not seriously considered deliberately slowing down AI progress as a solution. Katja Grace argues this strategy should be considered more seriously, and that common objections to it are incorrect or exaggerated. 

17 Eli Tyre
This was counter to the prevailing narrative at the time, and I think did some of the work of changing the narrative. It's of historical significance, if nothing else.
16 habryka
I think it's a bit hard to tell how influential this post has been, though my best guess is "very". It's clear that sometime around when this post was published there was a pretty large shift in the strategies that I and a lot of other people pursued, with "slowing down AI" becoming a much more common goal for people to pursue. I think (most of) the arguments in this post are good. I also think that when I read an initial draft of this post (around 1.5 years ago or so), and had a very hesitant reaction to the core strategy it proposes, that I was picking up on something important, and that I do also want to award Bayes points to that part of me given how things have been playing out so far.  I do think that since I've seen people around me adopt strategies to slow down AI, I've seen it done on a basis that feels much more rhetorical, and often directly violates virtues and perspectives that I hold very dearly. I think it's really important to understand that technological progress has been the central driving force behind humanity's success, and that indeed this should establish a huge prior against stopping almost any kind of technological development. In contrast to that, the majority of arguments that I've seen find traction for slowing down AI development are not distinguishable from arguments that apply to a much larger set of technologies which to me clearly do not pose a risk commensurable with the prior we should have against slowdown. Concerns about putting people out of jobs, destroying traditional models of romantic relationships, violating copyright law, spreading disinformation, all seem to me to be the kind of thing that if you buy it, you end up with an argument that proves too much and should end up opposed to a huge chunk of technological progress.  And I can feel the pressure in myself for these things as well. I can see how it would be easy to at least locally gain traction at slowing down AI by allying myself with people who are concerned abo
#5

TurnTrout discusses a common misconception in reinforcement learning: that reward is the optimization target of trained agents. He argues reward is better understood as a mechanism for shaping cognition, not a goal to be optimized, and that this has major implications for AI alignment approaches. 

13 Olli Järviniemi
I view this post as providing value in three (related) ways: 1. Making a pedagogical advancement regarding the so-called inner alignment problem 2. Pointing out that a common view of "RL agents optimize reward" is subtly wrong 3. Pushing for thinking mechanistically about cognition-updates   Re 1: I first heard about the inner alignment problem through Risks From Learned Optimization and popularizations of the work. I didn't truly comprehend it - sure, I could parrot back terms like "base optimizer" and "mesa-optimizer", but it didn't click. I was confused. Some months later I read this post and then it clicked. Part of the pedagogical value is not having to introduce the 4 terms of form [base/mesa] + [optimizer/objective] and throwing those around. Even with Rob Miles' exposition skills that's a bit overwhelming. Another part I liked were the phrases "Just because common English endows “reward” with suggestive pleasurable connotations" and "Let’s strip away the suggestive word “reward”, and replace it by its substance: cognition-updater." One could be tempted to object and say that surely no one would make the mistakes pointed out here, but definitely some people do. I did. Being a bit gloves off here definitely helped me.   Re 2: The essay argues for, well, reward not being the optimization target. There is some deep discussion in the comments about the likelihood of reward in fact being the optimization target, or at least quite close (see here). Let me take a more shallow view. I think there are people who think that reward is the optimization target by definition or by design, as opposed to this being a highly non-trivial claim that needs to be argued for. It's the former view that this post (correctly) argues against. I am sympathetic to pushback of the form "there are arguments that make it reasonable to privilege reward-maximization as a hypothesis" and about this post going a bit too far, but these remarks should not be confused with a rebuttal
10 TurnTrout
Retrospective: I think this is the most important post I wrote in 2022. I deeply hope that more people benefit by fully integrating these ideas into their worldviews. I think there's a way to "see" this lesson everywhere in alignment: for it to inform your speculation about everything from supervised fine-tuning to reward overoptimization. To see past mistaken assumptions about how learning processes work, and to think for oneself instead. This post represents an invaluable tool in my mental toolbelt. I wish I had written the key lessons and insights more plainly. I think I got a bit carried away with in-group terminology and linguistic conventions, which limited the reach and impact of these insights. I am less wedded to "think about what shards will form and make sure they don't care about bad stuff (like reward)", because I think we won't get intrinsically agentic policy networks. I think the most impactful AIs will be LLMs+tools+scaffolding, with the LLMs themselves being "tool AI."
#6

A "good project" in AGI research needs:1) Trustworthy command, 2) Research closure, 3) Strong operational security, 4) Commitment to the common good, 5) An alignment mindset, and 6) Requisite resource levels.

The post goes into detail on what minimal, adequate, and good performance looks like.

#7

A fictional story about an AI researcher who leaves an experiment running overnight.

15 Garrett Baker
Clearly a very influential post on a possible path to doom from someone who knows their stuff about deep learning! There are clear criticisms, but it is also one of the best of its era. It was also useful for even just getting a handle on how to think about our path to AGI.
#8

Ben observes that all his favorite people are great at a skill he's labeled in his head as "staring into the abyss" – thinking reasonably about things that are uncomfortable to contemplate, like arguments against your religious beliefs, or in favor of breaking up with your partner.

15 AprilSR
While the idea that it's important to look at the truth even when it hurts isn't revolutionary in this community, I think this post gave me a much more concrete model of the benefits. Sure, I knew about the abstract arguments that facing the truth is valuable, but I don't know if I'd have identified it as an essential skill for starting a company, or as being a critical component of staying in a bad relationship. (I think my model of bad relationships was that people knew leaving was a good idea, but were unable to act on that information—but in retrospect inability to even consider it totally might be what's going on some of the time.)
#9

Two laws of experiment design: First, you are not measuring what you think you are measuring. Second, if you measure enough different stuff, you might figure out what you're actually measuring.

These have many implications for how to design and interpret experiments.

#10

"Human feedback on diverse tasks" could lead to transformative AI, while requiring little innovation on current techniques. But it seems likely that the natural course of this path leads to full blown AI takeover.

10 Ramana Kumar
I found this post to be a clear and reasonable-sounding articulation of one of the main arguments for there being catastrophic risk from AI development. It helped me with my own thinking to an extent. I think it has a lot of shareability value.
#11

A "sazen" is a word or phrase which accurately summarizes a given concept, while also being insufficient to generate that concept in its full richness and detail, or to unambiguously distinguish it from nearby concepts. It's a useful pointer to the already-initiated, but often useless or misleading to the uninitiated.

21 Screwtape
Many of the best LessWrong posts give a word and a clear mental handle for something I kinda sorta knew loosely in my head. With the concept firmly in mind, I can use it and build on it deliberately. Sazen is an excellent example of the form. Sazens are common in many fields I have some expertise in. "Control the centre of the board" in chess. "Footwork is foundational" in martial arts. "Shots on goal" in sports. "Conservation of expected evidence" in rationality. "Premature optimization is the root of all evil" in programming. These sentences are useful reminders, and while they aren't misleading traps the way "Duncan Sabien is a teacher and a writer" is, they take some practice and experience or at least more detailed teaching to actually turn into something useful. Having the word "Sazen" with this meaning in my head has changed how I write. It shifted my thesis statement from simply being a compressed version of my argument towards being an easy handle to repeat to oneself at need, the same way I might mutter "shots on goal shots on goal" to myself during a hockey game. Sazen is a bit meta: it's not a technique for object-level accomplishments but a technique for how to teach or explain object-level things, but anything that immediately upgrades my own writing is worth a solid upvote. This post also gestures at the important problem of transmitting knowledge. It ultimately doesn't know how to do this, but I especially appreciated the paragraph starting "much of what aggregated wisdom like that seems to do..." for pointing out that this can speed things up even if it can't prevent the first mistake or two. I think this is worth being included in the best of LW collection.
#12

Elizabeth Van Nostrand spent literal decades seeing doctors about digestive problems that made her life miserable. She tried everything and nothing worked, until one day a doctor prescribed 5 different random supplements without thinking too hard about it and one of them miraculously cured her. This has led her to believe that sometimes you need to optimize for luck rather than scientific knowledge when it comes to medicine.

#13

Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. He contends that these concepts decompose one hard problem (AI alignment) into two extremely hard problems, and that they go against natural patterns of cognition formation. He also argues that approaches based on "robust grading" schemes are unlikely to yield aligned AI.

28 Writer
In this post, I appreciated two ideas in particular: 1. Loss as chisel 2. Shard Theory "Loss as chisel" is a reminder of how loss truly does its job, and its implications on what AI systems may actually end up learning. I can't really argue with it and it doesn't sound new to my ear, but it just seems important to keep in mind. Alone, it justifies trying to break out of the inner/outer alignment frame. When I start reasoning in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer nor inner aligned. In practice, it may be unlikely that we get a system like that. Or it may be very likely. I simply don't know. Loss as a chisel just enables me to think better about the possibilities. In my understanding, shard theory is, instead, a theory of how minds tend to be shaped. I don't know if it's true, but it sounds like something that has to be investigated. In my understanding, some people consider it a "dead end," and I'm not sure if it's an active line of research or not at this point. My understanding of it is limited. I'm glad I came across it though, because on its surface, it seems like a promising line of investigation to me. Even if it turns out to be a dead end I expect to learn something if I investigate why that is. The post makes more claims motivating its overarching thesis that dropping the frame of outer/inner alignment would be good. I don't know if I agree with the thesis, but it's something that could plausibly be true, and many arguments here strike me as sensible. In particular, the three claims at the very beginning proved to be food for thought to me: "Robust grading is unnecessary," "the loss function doesn't have to robustly and directly reflect what you want," "inner alignment to a grading procedure is unnecessary, very hard, and anti-natural." I also appreciated the post trying to make sense of inner and outer alignment in very precise terms, keeping in mind how deep learning and
15 PeterMcCluskey
This post is one of the best available explanations of what has been wrong with the approach used by Eliezer and people associated with him. I had a pretty favorable recollection of the post from when I first read it. Rereading it convinced me that I still managed to underestimate it. In my first pass at reviewing posts from 2022, I had some trouble deciding which post best explained shard theory. Now that I've reread this post during my second pass, I've decided this is the most important shard theory post. Not because it explains shard theory best, but because it explains what important implications shard theory has for alignment research. I keep being tempted to think that the first human-level AGIs will be utility maximizers. This post reminds me that maximization is perilous. So we ought to wait until we've brought greater-than-human wisdom to bear on deciding what to maximize before attempting to implement an entity that maximizes a utility function.
#14

Nate Soares reviews a dozen plans and proposals for making AI go well. He finds that almost none of them grapple with what he considers the core problem - capabilities will suddenly generalize way past training, but alignment won't.

26 habryka
I really liked this post in that it seems to me to have tried quite seriously to engage with a bunch of other people's research, in a way that I feel like is quite rare in the field, and something I would like to see more of.  One of the key challenges I see for the rationality/AI-Alignment/EA community is the difficulty of somehow building institutions that are not premised on the quality or tractability of their own work. My current best guess is that the field of AI Alignment has made very little progress in the last few years, which is really not what you might think when you observe the enormous amount of talent, funding and prestige flooding into the space, and the relatively constant refrain of "now that we have cutting edge systems to play around with we are making progress at an unprecedented rate".  It is quite plausible to me that technical AI Alignment research is not a particularly valuable thing to be doing right now. I don't think I have seen much progress, and the dynamics of the field seem to be enshrining an expert class that seems almost ontologically committed to believing that the things they are working on must be good and tractable, because their salary and social standing relies on believing that.  This and a few other similar posts last year are the kind of post that helped me come to understand the considerations around this crucial question better, and where I am grateful that Nate, despite having spent a lot of his life on solving the technical AI Alignment problem, is willing to question the tractability of the whole field. This specific post is more oriented around other people's work, though other posts by Nate and Eliezer are also facing the degree to which their past work didn't make the relevant progress they were hoping for. 
24 Zack_M_Davis
I should acknowledge first that I understand that writing is hard. If the only realistic choice was between this post as it is, and no post at all, then I'm glad we got the post rather than no post. That said, by the standards I hold my own writing to, I would be embarrassed to publish a post like this which criticizes imaginary paraphrases of researchers, rather than citing and quoting the actual text they've actually published. (The post acknowledges this as a flaw, but if it were me, I wouldn't even publish.) The reason I don't think critics necessarily need to be able to pass an author's Ideological Turing Test is because, as a critic, I can at least be scrupulous in my reasoning about the actual text that the author actually published, even if the stereotype of the author I have in my head is faulty. If I can't produce the quotes to show that I'm not just arguing against a stereotype in my head, then it's not clear why the audience should care.
#15

This post explores the concept of simulators in AI, particularly self-supervised models like GPT. Janus argues that GPT and similar models are best understood as simulators that can generate various simulacra, not as agents themselves. This framing helps explain many counterintuitive properties of language models. Powerful simulators could have major implications for AI capabilities and alignment.

38 habryka
I've been thinking about this post a lot since it first came out. Overall, I think its core thesis is wrong, and I've seen a lot of people make confident wrong inferences on the basis of it. The core problem with the post was covered by Eliezer's post "GPTs are Predictors, not Imitators" (which was not written, I think, as a direct response, but which still seems to me to convey the core problem with this post): The Simulators post repeatedly alludes to the loss function on which GPTs are trained corresponding to a "simulation objective", but I don't really see why that would be true. It is technically true that a GPT that perfectly simulates earth, including the creation of its own training data set, can use that simulation to get perfect training loss. But actually doing so would require enormous amounts of compute and we of course know that nothing close to that is going on inside of GPT-4. To me, the key feature of a "simulator" would be a process that predicts the output of a system by developing it forwards in time, or some other time-like dimension. The predictions get made by developing an understanding of the transition function of a system between time-steps (the "physics" of the system) and then applying that transition function over and over again until your desired target time. I would be surprised if this is how GPT works internally in its relationship to the rest of the world and how it makes predictions. The primary interesting thing that seems to me true about GPT-4's training objective is that it is highly myopic. Beyond that, I don't see any reason to think of it as particularly more likely to create something that tries to simulate the physics of any underlying system than other loss functions one could choose. When GPT-4 encounters a hash followed by the pre-image of that hash, or a complicated arithmetic problem, or is asked a difficult factual geography question, it seems very unlikely that the way GPT-4 goes about answering that qu
29 janus
I think Simulators mostly says obvious and uncontroversial things, but added to the conversation by pointing them out for those who haven't noticed and introducing words for those who struggle to articulate. IMO people that perceive it as making controversial claims have mostly misunderstood its object-level content, although sometimes they may have correctly hallucinated things that I believe or seriously entertain. Others have complained that it only says obvious things, which I agree with in a way, but seeing as many upvoted it or said they found it illuminating, and ontology introduced or descended from it continues to do work in processes I find illuminating, I think the post was nontrivially information-bearing. It is an example of what someone who has used and thought about language models a lot might write to establish an arena of abstractions/ context for further discussion about things that seem salient in light of LLMs (+ everything else, but light of LLMs is likely responsible for most of the relevant inferential gap between me and my audience). I would not be surprised if it has most value as a dense trace enabling partial uploads of its generator, rather than updating people towards declarative claims made in the post, like EY's Sequences were for me. Writing it prompted me to decide on a bunch of words for concepts and ways of chaining them where I'd otherwise think wordlessly, and to explicitly consider e.g. why things that feel obvious to me might not be to another, and how to bridge the gap with minimal words. Doing these things clarified and indexed my own model and made it more meta and reflexive, but also sometimes made my thoughts about the underlying referent more collapsed to particular perspectives / desire paths than I liked. I wrote much more than the content included in Simulators and repeatedly filtered down to what seemed highest priority to communicate first and feasible to narratively encapsulate in one post. If I tried again now i
#16

Being easy to argue with is a virtue, separate from being correct. When someone makes an epistemically illegible argument, it is very hard to even begin to rebut their arguments because you cannot pin down what their argument even is.

#17

Kelly betting can be viewed as a way of respecting different possible versions of yourself with different beliefs, rather than just a mathematical optimization. This perspective provides some insight into why fractional Kelly betting (betting less aggressively) can make sense, and connects to ideas about bargaining between different parts of yourself. 
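
For concreteness, here is a minimal sketch of the textbook Kelly criterion for a binary bet together with a "fractional Kelly" variant of the kind the summary mentions. This is the standard formula rather than Garrabrant's bargaining derivation, and the example numbers are purely illustrative.

```python
# Illustrative sketch (not from the post): textbook Kelly criterion for a binary bet,
# plus a "fractional Kelly" variant that stakes a fixed multiple of the full-Kelly bet.
def kelly_fraction(p_win: float, net_odds: float) -> float:
    """Fraction of bankroll to stake: f* = p - (1 - p) / b, floored at 0 (don't bet)."""
    return max(0.0, p_win - (1.0 - p_win) / net_odds)

def fractional_kelly(p_win: float, net_odds: float, scale: float = 0.5) -> float:
    """Bet only `scale` times the full-Kelly stake (the less aggressive option discussed above)."""
    return scale * kelly_fraction(p_win, net_odds)

# Example: 60% win probability at even odds (b = 1).
print(kelly_fraction(0.6, 1.0))    # 0.2 -> stake 20% of bankroll
print(fractional_kelly(0.6, 1.0))  # 0.1 -> half-Kelly stakes 10%
```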

27 habryka
I put decent probability on this sequence (of which I think this is the best post) being the most important contribution of 2022. I am however really not confident of that, and I do feel a bit stuck on how to figure out where to apply and how to confirm the validity of ideas in this sequence.  Despite the abstract nature, I think if there are indeed arguments to do something closer to Kelly betting with one's resources, even in the absence of logarithmic returns to investment, then that would definitely have huge effects on how I think about my own life's plans, and about how humanity should allocate its resources.  Separately, I also think this sequence is pushing on a bunch of important seams in my model of agency and utility maximization in a way that I expect to become relevant to understanding the behavior of superintelligent systems, though I am even less confident of this than the rest of this review.  I do feel a sense of sadness that I haven't seen more built on the ideas of this sequence, or seen people give their own take on it. I certainly feel a sense that I would benefit a lot if I saw how the ideas in this sequence landed with people, and would appreciate figuring out the implications of the proof sketches outlined here.
#18

Katja Grace provides a list of counterarguments to the basic case for existential risk from superhuman AI systems. She examines potential gaps in arguments about AI goal-directedness, AI goals being harmful, and AI superiority over humans. While she sees these as serious concerns, she doesn't find the case for overwhelming likelihood of existential risk convincing based on current arguments. 

17 Vika
I think this is still one of the most comprehensive and clear resources on counterpoints to x-risk arguments. I have referred to this post and pointed people to it a number of times. The most useful parts of the post for me were the outline of the basic x-risk case and section A on counterarguments to goal-directedness (this was particularly helpful for my thinking about threat models and understanding agency).
#19

A key skill of many experts (that is often hard to teach) is keeping track of extra information in their head while working. For example, a programmer tracking a Fermi estimate of runtime, or an experienced machine operator tracking the machine's internal state. John suggests asking experts "what are you tracking in your head?"

#20

The field of AI alignment is growing rapidly, attracting more resources and mindshare each year. As it grows, more people will be incentivized to misleadingly portray themselves or their projects as more alignment-friendly than they are. Adam proposes "safetywashing" as the term for this.

23 habryka
I've used the term "safetywashing" at least once every week or two in the last year. I don't know whether I've picked it up from this post, but it still seems good to have an explanation of a term that is this useful and this common that people are exposed to.
#22

Nonprofit boards have great power, but low engagement, unclear responsibility, and no accountability. There's also a shortage of good guidance on how to be an effective board member. Holden gives recommendations on how to do it well, but the whole structure is inherently weird and challenging. 

#23

People worry about agentic AIs with ulterior motives. Some suggest Oracle AI, which only answers questions. But that framing misses where the danger comes from: it killed you because it was optimised. It used an agent because an agent was an effective tool it had on hand.

Optimality is the tiger, and agents are its teeth.

#25

A look at how we can get caught up in the details and lose sight of the bigger picture. By repeatedly asking "what are we really trying to accomplish here?", we can step back and refocus on what's truly important, whether in our careers, health, or life overall.

#26

In worlds where AI alignment can be handled by iterative design, we probably survive. So if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason. John explores several ways that could happen, beyond just fast takeoff and deceptive misalignment. 

#27

Nate Soares explains why he doesn't expect an unaligned AI to be friendly or cooperative with humanity, even if it uses logical decision theory. He argues that even getting a small fraction of resources from such an AI is extremely unlikely. 

29 ryan_greenblatt
IMO, this post makes several locally correct points, but overall fails to defeat the argument that misaligned AIs are somewhat likely to spend (at least) a tiny fraction of resources (e.g., between 1/million and 1/trillion) to satisfy the preferences of currently existing humans. AFAICT, this is the main argument it was trying to argue against, though it shifts to arguing about half of the universe (an obviously vastly bigger share) halfway through the piece.[1] When it returns to arguing about the actual main question (a tiny fraction of resources) at the end here and eventually gets to the main trade-related argument (acausal or causal) in the very last response in this section, it almost seems to admit that this tiny amount of resources is plausible, but fails to update all the way. I think the discussion here and here seems highly relevant and fleshes out this argument to a substantially greater extent than I did in this comment. However, note that being willing to spend a tiny fraction of resources on humans still might result in AIs killing a huge number of humans due to conflict between it and humans or the AI needing to race through the singularity as quickly as possible due to competition with other misaligned AIs. (Again, discussed in the links above.) I think fully misaligned paperclippers/squiggle maximizer AIs which spend only a tiny fraction of resources on humans (as seems likely conditional on that type of AI) are reasonably likely to cause outcomes which look obviously extremely bad from the perspective of most people (e.g., more than hundreds of millions dead due to conflict and then most people quickly rounded up and given the option to either be frozen or killed). I wish that Soares and Eliezer would stop making these incorrect arguments against tiny fractions of resources being spent on the preference of current humans. It isn't their actual crux, and it isn't the crux of anyone else either. (However rhetorically nice it might be.) -------
14 habryka
This is IMO actually a really important topic, and this is one of the best posts on it. I think it probably really matters whether the AIs will try to trade with us or care about our values even if we had little chance of making our actions with regards to them conditional on whether they do. I found the arguments in this post convincing, and have linked many people to it since it came out. 
#28

It's easy and locally reinforcing to follow gradients toward what one might call 'guessing the student's password', and much harder and much less locally reinforcing to reason/test/whatever one's way toward a real art of rationality. Anna Salamon reflects on how this got in the way of CFAR ("Center for Applied Rationality") making progress on their original goals.

11 Screwtape
The thing I want most from LessWrong and the Rationality Community writ large is the martial art of rationality. That was the Sequences post that hooked me, that is the thing I personally want to find if it exists, that is what I thought CFAR as an organization was pointed at. When you are attempting something that many people have tried before- and to be clear, "come up with teachings to make people better" is something that many, many people have tried before- it may be useful to look and see what went wrong last time. In the words of Scott Alexander, "I’m the last person who’s going to deny that the road we’re on is littered with the skulls of the people who tried to do this before us. . . We’re almost certainly still making horrendous mistakes that people thirty years from now will rightly criticize us for. But they’re new mistakes. . . And I hope that maybe having a community dedicated to carefully checking its own thought processes and trying to minimize error in every way possible will make us have slightly fewer horrendous mistakes than people who don’t do that." This article right here? This is a skull. It should be noticed. If the Best Of collection is for people who want a martial art of rationality to study then I believe this article is the most important entry, and it or the latest version of it will continue to be the most important entry until we have found the art at last. Thank you Anna for trying to build the art. Thank you for writing this and publishing it where anyone else about to attempt to build the art can take note of your mistakes and try to do better. (Ideally it's next to a dozen things we have found that we do think work! But maybe it's next to them the way a surgeon general's warning is next to a bottle of experimental pills.)
#29

Some people believe AI development is extremely dangerous, but are hesitant to directly confront or dissuade AI researchers. The author argues we should be more willing to engage in activism and outreach to slow down dangerous AI progress. They give an example of their own intervention with an AI research group.

15 Ben Pace
Seems to me like a blindingly obvious post that was kind of outside of the Overton window for too long. Eliezer also smashed the window with his TIME article, but this was first, so I think it's still a pretty great post. +4
#30

In the course of researching optimization, Alex decided that he had to really understand what entropy is. But he found the existing resources (Wikipedia, etc.) so poor that it seemed important to write a better one; other resources were only concerned with the application of the concept in their particular sub-domain. Here, Alex aims to synthesize the abstract concept of entropy, to show what's so deep and fundamental about it.
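
As a reference point (a reader-added reminder rather than Alex's own framing), the standard special case the post generalizes from is Shannon entropy, which for a uniform distribution over N states reduces to the number of bits needed to specify one state.

```latex
% Reader-added reminder (not from the post): the standard Shannon form, which for a
% uniform distribution over N states is just the number of bits needed to single out one state.
H(X) = -\sum_{x} p(x)\,\log_2 p(x), \qquad H\bigl(\mathrm{Uniform}(N)\bigr) = \log_2 N \ \text{bits}.
```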

15 Alex_Altair
[This is a self-review because I see that no one has left a review to move it into the next phase. So8res's comment would also make a great review.] I'm pretty proud of this post for the level of craftsmanship I was able to put into it. I think it embodies multiple rationalist virtues. It's a kind of "timeless" content, and is a central example of the kind of content people want to see on LW that isn't stuff about AI. It would also look great printed in a book. :)
#32

On the 3rd of October 2351 a machine flared to life. Huge energies coursed into it via cables, only to leave moments later as heat dumped unwanted into its radiators. With an enormous puff the machine unleashed sixty years of human metabolic entropy into superheated steam.

In the heart of the machine was Jane, a person of the early 21st century.

#33

Sometimes your brilliant, hyperanalytical friends can accidentally crush your fragile new ideas before they have a chance to develop. Elizabeth shares a strategy she uses to get them to chill out and vibe on new ideas for a bit before dissecting them. 

#34

Causal scrubbing is a new tool for evaluating mechanistic interpretability hypotheses. The algorithm tries to replace all model activations that shouldn't matter according to a hypothesis, and measures how much performance drops. It's been used to improve hypotheses about induction heads and parentheses balancing circuits. 
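
To make the loop concrete, here is a toy sketch of the idea under stated assumptions: a fake two-activation "model", a made-up metric, and a hypothesis that one activation is irrelevant. It is illustrative only and is not Redwood Research's implementation.

```python
# Toy sketch of the causal-scrubbing idea (illustrative, not Redwood's implementation):
# replace activations the hypothesis calls irrelevant with activations recomputed on
# other, randomly chosen inputs, and check how much the metric degrades.
import numpy as np

rng = np.random.default_rng(0)

def model(x, overrides=None):
    """Toy 'model' with named intermediate activations that can be overridden."""
    acts = {}
    acts["a"] = np.tanh(x)       # the hypothesis says this activation matters
    acts["b"] = np.sin(3 * x)    # the hypothesis says this one is irrelevant
    if overrides:
        acts.update(overrides)
    acts["out"] = acts["a"] + 0.01 * acts["b"]
    return acts

def loss(xs, overrides_per_input=None):
    """Mean squared error of the model's output against the behaviour being explained."""
    overrides_per_input = overrides_per_input or {}
    preds = [model(x, overrides_per_input.get(i))["out"] for i, x in enumerate(xs)]
    targets = [np.tanh(x) for x in xs]
    return float(np.mean([(p - t) ** 2 for p, t in zip(preds, targets)]))

xs = rng.normal(size=200)
claimed_irrelevant = ["b"]  # the interpretability hypothesis under test

# Scrub: recompute every "irrelevant" activation on a randomly chosen other input.
scrub = {i: {name: model(xs[rng.integers(len(xs))])[name] for name in claimed_irrelevant}
         for i in range(len(xs))}

print("baseline loss:", loss(xs))
print("scrubbed loss:", loss(xs, scrub))
```

If the scrubbed loss barely moves, the hypothesis has (by this measure) explained the behaviour; a large jump means something the hypothesis called irrelevant actually mattered.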

69 Buck
(I'm just going to speak for myself here, rather than the other authors, because I don't want to put words in anyone else's mouth. But many of the ideas I describe in this review are due to other people.) I think this work was a solid intellectual contribution. I think that the metric proposed for how much you've explained a behavior is the most reasonable metric by a pretty large margin. The core contribution of this paper was to produce negative results about interpretability. This led to us abandoning work on interpretability a few months later, which I'm glad we did. But these negative results haven’t had that much influence on other people’s work AFAICT, so overall it seems somewhat low impact. The empirical results in this paper demonstrated that induction heads are not the simple circuit which many people claimed (see this post for a clearer statement of that), and we then used these techniques to get mediocre results for IOI (described in this comment). There hasn’t been much followup on this work. I suspect that the main reasons people haven't built on this are: * it's moderately annoying to implement it * it makes your explanations look bad (IMO because they actually are unimpressive), so you aren't that incentivized to get it working * the interp research community isn't very focused on validating whether its explanations are faithful, and in any case we didn’t successfully persuade many people that explanations performing poorly according to this metric means they’re importantly unfaithful I think that interpretability research isn't going to be able to produce explanations that are very faithful explanations of what's going on in non-toy models (e.g. I think that no such explanation has ever been produced). Since I think faithful explanations are infeasible, measures of faithfulness of explanations don't seem very important to me now. (I think that people who want to do research that uses model internals should evaluate their techniques by mea
#35

How good are modern language models compared to humans at the task language models are trained on (next token prediction on internet text)? We found that humans seem to be consistently worse at next-token prediction (in terms of both top-1 accuracy and perplexity) than even small models like Fairseq-125M, a 12-layer transformer roughly the size and quality of GPT-1. 
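
For reference, a minimal sketch of the two metrics named above (top-1 accuracy and perplexity), computed from per-position predicted distributions over a toy vocabulary; the numbers are illustrative and not taken from the paper.

```python
# Illustrative sketch of the two metrics mentioned above; toy distributions, not real data.
import numpy as np

def top1_accuracy(probs, targets):
    """Fraction of positions where the highest-probability token is the true next token."""
    return float(np.mean(np.argmax(probs, axis=-1) == targets))

def perplexity(probs, targets):
    """Exponential of the average negative log-likelihood of the true next tokens."""
    nll = -np.log(probs[np.arange(len(targets)), targets])
    return float(np.exp(nll.mean()))

# Toy example: 3 positions, vocabulary of 4 tokens.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.20, 0.50, 0.20, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
targets = np.array([0, 1, 3])
print(top1_accuracy(probs, targets))  # 2/3
print(perplexity(probs, targets))     # ~2.25
```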

13 Buck
This post's point still seems correct, and it still seems important--I refer to it at least once a week.
#36

In 1936, four men attempted to climb the Eigerwand, the north face of the Eiger mountain. Their harrowing story ended in tragedy, with the last survivor dangling from a rope just meters away from rescue before succumbing. Gene Smith reflects on what drives people to take such extreme risks for seemingly little practical benefit.

21 GeneSmith
I was pleasantly surprised by how many people enjoyed this post about mountain climbing. I never expected it to gain so much traction, since it doesn't relate that clearly to rationality or AI or any of the topics usually discussed on LessWrong. But when I finished the book it was based on, I just felt an overwhelming urge to tell other people about it. The story was just that insane. Looking back I think Gwern probably summarized what this story is about best: a world beyond the reach of god. The universe does not respect your desire for a coherent, meaningful story. If you make the wrong mistake at the wrong time, game over. For the past couple of months I've actually been drafting a sequel of sorts to this post about a man named Nims Purja. I hope to post it before Christmas!
#37

When tackling difficult, open-ended research questions, it's easy to get stuck. In addition to virtues like open-mindedness and self-criticality, Holden recommends "vices" like laziness, impatience, hubris and self-preservation as antidotes. This post explores the techniques that have worked well for him.

10 Alex_Altair
Earlier this year I spent a lot of time trying to understand how to do research better. This post was one of the few resources that actually helped. It described several models that I resonated with, but which I had not read anywhere else. It essentially described a lot of the things I was already doing, and this gave me more confidence in deciding to continue doing full time AI alignment research. (It also helps that Karnofsky is an accomplished researcher, and so his advice has more weight!)
#38

You might feel like AI risk is an "emergency" that demands drastic changes to your life. But is this actually the best way to respond? Anna Salamon explores what kinds of changes actually make sense in different types of emergencies, and what that might mean for how to approach existential risk.

#39

Models don't "get" reward. Reward is the mechanism by which we select parameters, it is not something "given" to the model. Reinforcement learning should be viewed through the lens of selection, not the lens of incentivisation. This has implications for how one should think about AI alignment. 

#40

Here's a simple strategy for AI alignment: use interpretability tools to identify the AI's internal search process, and the AI's internal representation of our desired alignment target. Then directly rewire the search process to aim at the alignment target. Boom, done. 

#41

Lessons from 20+ years of software security experience, perhaps relevant to AGI alignment:

1. Security doesn't happen by accident

2. Blacklists are useless but make them anyway 

3. You get what you pay for (incentives matter)

4. Assurance requires formal proofs, which are provably impossible

5. A breach IS an existential risk

10 habryka
I currently think that the case study of computer security is one of the best places to learn about the challenges that AI control and AI Alignment projects will face. Despite that, I haven't seen that much writing trying to bridge the gap between computer security and AI safety. This post is one of the few that does, and I think does so reasonably well.
#42

What's with all the strange pseudophilosophical questions from AI alignment researchers, like "what does it mean for some chunk of the world to do optimization?" or "how does an agent model a world bigger than itself?". John lays out why some people think solving these sorts of questions is a necessary prerequisite for AI alignment.

#43

Nate Soares argues that one of the core problems with AI alignment is that an AI system's capabilities will likely generalize to new domains much faster than its alignment properties. He thinks this is likely to happen in a sudden, discontinuous way (a "sharp left turn"), and that this transition will break most alignment approaches. And this isn't getting enough focus from the field.

11 Mikhail Samin
Sharp Left Turn: a more important problem (and a more specific threat model) than people usually think The sharp left turn is not a simple observation that we've seen capabilities generalise more than alignment. As I understand it, it is a more mechanistic understanding that some people at MIRI have, of dynamics that might produce systems with generalised capabilities but not alignment. Many times over the past year, I've been surprised by people in the field who've read Nate's post but somehow completely missed the part where it talks about specific dynamics that lead to alignment properties breaking during capabilities generalisation. To fulfil the reviewing duty and to have a place to point people to, I'll try to write down some related intuitions that I talked about throughout 2023 when trying to get people to have intuitions on what the sharp left turn problem is about. For example, imagine training a neural network with RL. For a while during training, the neural network might be implementing a fuzzy collection of algorithms and various heuristics that together kinda optimise for some goals. The gradient strongly points towards greater capabilities. Some of these algorithms and heuristics might be more useful for the task the neural network is being evaluated on, and they'll persist more and what the neural network is doing as a whole will look a bit more like what the most helpful parts of it are doing. Some of these algorithms and heuristics might be more agentic and do more for long-term goal achievement than others. As being better at achieving goals correlates with greater performance, the neural network becomes, as a whole, more capable of achieving goals. Or, maybe the transition that leads to capabilities generalisation can be more akin to grokking: even with a fuzzy solution, the distant general coherent agent implementations might still be visible to the gradient, and at some point, there might be a switch from a fuzzy collection of things togeth
#44

Alignment researchers often propose clever-sounding solutions without citing much evidence that their solution should help. Such arguments can mislead people into working on dead ends. Instead, TurnTrout argues we should focus more on studying how human intelligence implements alignment properties, as it is a real "existence proof" of aligned intelligence.

10Gunnar_Zarncke
I like many aspects of this post.

* It promotes using intuitions from humans. Using human, social, or biological approaches is neglected compared to approaches that are more abstract and general. It is also scalable, because people who wouldn't be able to work directly on the abstract approaches can work on it.
* It reflects on a specific problem the author had and offers the same approach to readers.
* It uses concrete examples to illustrate.
* It is short and accessible.
#45

Holden shares his step-by-step process for forming opinions on a topic, developing and refining hypotheses, and ultimately arriving at a nuanced view - all while focusing on writing rather than just passively consuming information.

#46

Limerence (aka "falling in love") wreaks havoc on your rationality. But it feels so good!

What do?

#47

Do you pass the "onion test" for honesty? As people get to know you better over time, they should keep discovering new things about you, but never be shocked by the *types* of information that were hidden. A framework for thinking about personal (and institutional) honesty.

26Screwtape
Figuring out the edge cases about honesty and truth seems important to me, both as a matter of personal aesthetics and as a matter for LessWrong to pay attention to. One of the things people have used to describe what makes LessWrong special is that it's a community focused on truth-seeking, which makes "what is truth anyway and how do we talk about it" a worthwhile topic of conversation. This article talks about it in a way that's clear. (The positive-example/negative-example pattern is a good approach to a topic that can really suffer from illusion of transparency.)

Like Eliezer's Meta-Honesty post, the approach suggested does rely on some fast verbal footwork, though the footwork need not be as fast as Meta-Honesty. Passing the Onion Test consistently requires the same kind of comparison to alternate worlds as glomarization, which is a bit of a strike against it, but that's hardly unique to the Onion Test. I don't know if people still wind up feeling misled? For instance, I can imagine someone saying "I usually keep my financial state private" and having their conversation partners walk away with wildly different ideas of how they're doing. Is it so bad they don't want to talk about it? Is it so good they don't want to brag? If I thought it was the former and offered to cover their share of dinner repeatedly, I might be annoyed if it turns out to be the latter.

I don't particularly hold myself to the Onion Test, but it did provide another angle on the subject that I appreciated. Nobody has yet used it this way around me, but I could also see the Onion Test declared in a similar manner to Crocker's Rules, an opt-in social norm that might be recognized by others if it got popular enough. I'm not sure it's worth the limited conceptual slots a community can have for those, but I wouldn't feel the slot was wasted if Onion Tests made it that far. This might be weird, but I really appreciate people having the conversations about what they think is honest and in what way …
#48

The LessWrong post "Theses on Sleep" gained a lot of popularity and acclaim, despite largely consisting of what seemed to Natalia like weak arguments and misleading claims. This critical review lists several of the mistakes Natalia argues were made, and reports some of what the academic literature on sleep seems to show.

#49

How do humans form their values? Shard theory proposes that human values are formed through a relatively straightforward reinforcement process, rather than being hard-coded by evolution. This post lays out the core ideas behind shard theory and explores how it can explain various aspects of human behavior and decision-making. 

19Jan_Kulveit
In my personal view, 'Shard theory of human values' illustrates both the upsides and pathologies of the local epistemic community.

The upsides:
- The majority of the claims are true, or at least approximately true.
- "Shard theory" as a social phenomenon reached critical mass, making the ideas visible to the broader alignment community, which works e.g. by talking about them in person, votes on LW, series of posts, ...
- Shard theory coined a number of locally memetically fit names or phrases, such as 'shards'.
- Part of the success led some people in the AGI labs to think about mathematical structures of human values, which is an important problem.

The downsides:
- Almost none of the claims which are true are original; most of this was described elsewhere before, mainly in the active inference/predictive processing literature, or in thinking about multi-agent mind models.
- The claims which are novel usually seem somewhat confused (e.g. that human values are inaccessible to the genome, or naive RL intuitions).
- The novel terminology is incompatible with existing research literature, making it difficult for the alignment community to find or understand existing research, and making it difficult for people from other backgrounds to contribute. (While this is not the best option for the advancement of understanding, paradoxically, it may be positively reinforced in the local environment, as you get more credit for reinventing stuff under new names than for pointing to relevant existing research.)

Overall, 'shards' became so popular that reading at least the basics is probably necessary to understand what many people are talking about.
#50

A new paper proposes an unsupervised way to extract knowledge from language models. The authors argue this could be a key part of aligning superintelligent AIs, by letting us figure out what the AI "really believes" rather than what it thinks humans want to hear. But there are still some challenges to overcome before this could work on future superhuman AIs.

57LawrenceC
This is a review of both the paper and the post itself, and turned more into a review of the paper (on which I think I have more to say) as opposed to the post.

Disclaimer: this isn't actually my area of expertise inside of technical alignment, and I've done very little linear probing myself. I'm relying primarily on my understanding of others' results, so there's some chance I've misunderstood something. Total amount of work on this review: ~8 hours, though about 4 of those were refreshing my memory of prior work and rereading the paper.

TL;DR: The paper made significant contributions by introducing the idea of unsupervised knowledge discovery to a broader audience and by demonstrating that relatively straightforward techniques may make substantial progress on this problem. Compared to the paper, the blog post is substantially more nuanced, and I think that more academic-leaning AIS researchers should also publish companion blog posts of this kind. Collin Burns also deserves a lot of credit for actually doing empirical work in this domain when others were skeptical. However, the results are somewhat overstated and, with the benefit of hindsight, (vanilla) CCS does not seem to be a particularly promising technique for eliciting knowledge from language models. That being said, I encourage work in this area.[1]

Introduction/Overview

The paper “Discovering Latent Knowledge in Language Models without Supervision” by Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt (henceforth referred to as “the CCS paper” for short) proposes a method for unsupervised knowledge discovery, which can be thought of as a variant of empirical, average-case Eliciting Latent Knowledge (ELK). In this companion blog post, Collin Burns discusses the motivations behind the paper, caveats some of the limitations of the paper, and provides some reasons for why this style of unsupervised methods may scale to future language models.

The CCS paper kicked off a lot of waves in the alignment …
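For readers who want the mechanics behind the method being reviewed: CCS fits a small probe on the hidden states of contrast pairs (a statement phrased as true vs. as false) using an unsupervised consistency-plus-confidence objective. The PyTorch sketch below is my reconstruction of that objective from the paper's description; function names, the training loop, and the normalization convention are illustrative, not the authors' released code.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe p(x) = sigmoid(w.x + b) over hidden states."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x)).squeeze(-1)

def ccs_loss(p_pos, p_neg):
    # Consistency: a statement and its negation should get probabilities
    # that sum to ~1.
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate p(x+) = p(x-) = 0.5 solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_ccs_probe(h_pos, h_neg, epochs=1000, lr=1e-3):
    """h_pos, h_neg: (n, d) hidden states for the 'true' / 'false' versions
    of each statement, mean-centered per class beforehand."""
    probe = CCSProbe(h_pos.shape[1])
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(h_pos), probe(h_neg))
        loss.backward()
        opt.step()
    return probe

# Usage with random stand-in activations (real inputs would be LM hidden states):
h_pos, h_neg = torch.randn(128, 768), torch.randn(128, 768)
probe = train_ccs_probe(h_pos, h_neg)
truth_score = 0.5 * (probe(h_pos) + (1 - probe(h_neg)))  # averaged score per statement
```

The confidence term is what keeps the probe from collapsing to "always 0.5"; note that the objective is symmetric, so which direction counts as "true" still has to be fixed afterwards.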
#51

So if you read Harry Potter and the Methods of Rationality, and thought...

"You know, HPMOR is pretty good so far as it goes; but Harry is much too cautious and doesn't have nearly enough manic momentum, his rationality lectures aren't long enough, and all of his personal relationships are way way way too healthy."

...then have I got the story for you!

17AprilSR
I feel like Project Lawful, as well as many of Lintamande's other glowfic since then, has given me a whole lot deeper an understanding of... a collection of virtues including honor, honesty, trustworthiness, etc, which I now mostly think of collectively as "Law". I think this has been pretty valuable for me on an intellectual level—I think, if you show me some sort of deontological rule, I'm going to give a better account of why/whether it's a good idea to follow it than I would have before I read any glowfic. It's difficult for me to separate how much of that is due to Project Lawful in particular, because ultimately I've just read a large body of work which all had some amount of training data showing a particular sort of thought pattern which I've since learned. But I think this particular fragment of the rationalist community has given me some valuable new ideas, and it'd be great to figure out a good way of acknowledging that.
15niplav
I don't think this would fit into the 2022 review. Project Lawful has been quite influential, but I find it hard to imagine a way its impact could be included in a best-of. Including this post in particular strikes me as misguided, as it contains none of the interesting ideas and lessons from Project Lawful, and thus doesn't make any intellectual progress. One could try to do the distillation of finding particularly interesting or enlightening passages from the text, but that would be:

1. A huge amount of work[1], but maybe David Udell's sequence could be used for that.
2. Quite difficult for the more subtle lessons, which are interwoven in the text.

I have nothing against Project Lawful in particular[2], but I think that including this post would be misguided, and including passages from Project Lawful would be quite difficult. For that reason, I'm giving this a -1.

----------------------------------------

1. Consider: after more than two years the Hanson compilation bounty still hasn't been fulfilled, at a $10k reward! ↩︎
2. I've read parts of it (maybe 15%?), but haven't been hooked, and every time I read a longer part I get the urge to go and read textbooks instead. ↩︎