1892

LESSWRONG
LW

HomeAll PostsConceptsLibrary
Best of LessWrong
Sequence Highlights
Rationality: A-Z
The Codex
HPMOR
Community Events
Subscribe (RSS/Email)
LW the Album
Leaderboard
About
FAQ
Customize
Load More

Quick Takes

Your Feed
Load More

Popular Comments

Superintelligence FAQ
By Scott Alexander

A basic primer on why AI might lead to human extinction, and why solving the problem is difficult. Scott Alexander walks readers through a number of questions with evidence based on progress from machine learning.

ACX/LW Meetup: Saturday November 15, 1 pm at Tim's
Sat Nov 15•Budapest
ACX/EA Lisbon November 2025 Meetup
Sat Nov 15•Lisbon
Finding Balance & Opportunity in the Holiday Flux [free public workshop]
Sun Nov 23•Online
Berkeley Solstice Weekend
Fri Dec 5•Berkeley
interstice11h3417
"But You'd Like To Feel Companionate Love, Right? ... Right?"
The rest of your brain and psychology evolved under the assumption that you would have a functioning oxytocin receptor, so I think there's an a priori case that it would be beneficial for you if it worked properly(yes, evolution's goals are not identical to your own, still though.....)
Thane Ruthenis16h*2811
Everyone has a plan until they get lied to the face
One of the reasons it's hard to take the possibility of blatant lies into account is that it would just be so very inconvenient, and also boring. If someone's statements are connected to reality, that gives you something to do: * You can analyze them, critique them, use them to infer the person's underlying models and critique those, identify points of agreement and controversy, identify flaws in their thinking, make predictions and projections about future actions based on them, et cetera. All those activities we love a lot, they're fun and feel useful. * It also gives you the opportunity to engage, to socialize with the person by arguing with them, or with others by discussing their words (e. g., if it's a high-profile public person). You can show off your attention to detail and analytical prowess, build reputation and status. On the other hand, if you assume that someone is lying (in a competent way where you can't easily identify what are the lies), that gives you... pretty much nothing to do. You're treating their words as containing ~zero information, so you (1) can't use them as an excuse to run some fun analyses/projections, (2) can't use them as an opportunity to socialize and show off. All you can do is stand in the corner repeating the same thing, "this person is probably lying, do not believe them", over and over again, while others get to have fun. It's terrible. Concrete example: Sam Altman. The guy would go on an interview or post some take on Twitter, and people would start breaking what he said down, pointing out what he gets right/wrong, discussing his character and vision and the X-risk implications, etc. And I would listen to the interview and read the analyses, and my main takeaway would be, "man, 90% of this is probably just lies completely decoupled from the underlying reality". And then I have nothing to do. Importantly: this potentially creates a community bias in favor of naivete (at least towards public figures). People who believe that Alice is a liar mostly ignore Alice,[1] so all analyses of Alice's words mostly come from people who put some stock in them. This creates a selection effect where the vast majority of Alice-related discussion is by people who don't dismiss her words out of hand, which makes it seem as though the community thinks Alice is trustworthy. That (1) skews your model of the community, and (2) may be taken as evidence that Alice is trustworthy by non-informed community members, who would then start trusting her and discussing her words, creating a feedback loop. Edit: Hm, come to think of it, this point generalizes. Suppose we have two models of some phenomenon, A and B. Under A, the world frequently generates prompts for intelligent discussion, whereas discussion-prompts for B are much sparser. This would create an apparent community bias in favor of A: A-proponents would be generating most of the discussions, raising A's visibility, and also get more opportunities for raising their own visibility and reputation. Note that this is completely decoupled from whether the aggregate evidence is in favor of A or B; the volume of information generated about A artificially raises its profile. Example: disagreements regarding whether studying LLMs bears on the question of ASI alignment or not. People who pay attention to the results in that sphere get to have tons of intelligent discussions about an ever-growing pile of experiments and techniques. People who think LLM alignment is irrelevant mostly stay quiet, or retread the same few points they have for the hundredth time.  1. ^ What else they're supposed to do? Their only message is "Alice is a liar", and butting into conversations just to repeat this well-discussed, conversation-killing point wouldn't feel particularly productive and would quickly start annoying people.
Zack_M_Davis21h2713
The Charge of the Hobby Horse
> did not listen to me saying I did not think Y. But it really seems like you do have a significant disagreement with Dai about the extent to which deference to Yudkowsky was justified. I understand and acknowledge that you think deference has large costs, as you've previously written about. I also understand and acknowledge that you think defering to Yudkowsky on existential risk strategy in particular was costly, as you explicitly wrote in the post ("one of those founder effects was to overinvest in technical research and underinvest in 'social victory' [...] Whose fault was that?"). At the same time, however, in your discussion in the post of how people could have done better in that particular case, you emphasize being transparent about deference in order to reduce its distortionary effects ("We don't have to stop deferring, to avoid this correlated failure"), in contrast to how Dai argues in the comment section that not-deferring was a live option ("These seemed like obvious mistakes even at the time"). You even seem to ridicule Dai for this ("And then you're like 'Ha. Why not just not defer?'"). This seems like a real and substantive disagreement, not a hallucination on Dai's part. It can't simultaneously be the case that Dai is wrong to implicitly ask "Why not just not defer?", and also wrong to suggest that you disagree with him about when it's reasonable to defer.
Load More
58
Human Values ≠ Goodness
johnswentworth
3d
65
337
Legible vs. Illegible AI Safety Problems
Ω
Wei Dai
6d
Ω
92
495Welcome to LessWrong!
Ruby, Raemon, RobertM, habryka
6y
76
42Solstice Season 2025: Ritual Roundup & Megameetups
Raemon
9d
8
177Paranoia: A Beginner's Guide
habryka
2d
28
337Legible vs. Illegible AI Safety Problems
Ω
Wei Dai
6d
Ω
92
746The Company Man
Tomás B.
2mo
70
308I ate bear fat with honey and salt flakes, to prove a point
aggliu
12d
51
694The Rise of Parasitic AI
Adele Lopez
2mo
178
146Please, Don't Roll Your Own Metaethics
Ω
Wei Dai
3d
Ω
45
97Everyone has a plan until they get lied to the face
Screwtape
1d
16
303Why I Transitioned: A Case Study
Fiora Sunshine
14d
55
202Unexpected Things that are People
Ben Goldhaber
7d
10
106Tell people as early as possible it's not going to work out
habryka
2d
10
52Increasing returns to marginal effort are common
habryka
9h
2
119Do not hand off what you cannot pick up
habryka
3d
17
Load MoreAdvanced Sorting/Filtering
Tapatakt1h119
plex, Signer
3
New cause area: translate all EY's writing into Basic English. Only half-joke. And it's not only about Yudkowsky. I think I will actually do something like this with some text for testing purposes.
eggsyntax2d*11842
Jozdien, Katalina Hernandez, and 1 more
7
THE BRIEFING A conference room. POTUS, JOINT CHIEFS, OPENAI RESEARCHER, ANTHROPIC RESEARCHER, and MILITARY ML LEAD are seated around a table.   JOINT CHIEFS: So. Horizon time is up to 9 hours. We've started turning some drone control over to your models. Where are we on alignment? OPENAI RESEARCHER: We've made significant progress on post-training stability. MILITARY ML LEAD: Good. Walk us through the architecture improvements. Attention mechanism modifications? Novel loss functions? ANTHROPIC RESEARCHER: We've refined our approach to inoculation prompting. MILITARY ML LEAD: Inoculation prompting? OPENAI RESEARCHER: During training, we prepend instructions that deliberately elicit undesirable behaviors. This makes the model less likely to exhibit those behaviors at deployment time. Silence. POTUS: You tell it to do the bad thing so it won't do the bad thing. ANTHROPIC RESEARCHER: Correct. MILITARY ML LEAD: And this works. OPENAI RESEARCHER: Extremely well. We've reduced emergent misalignment by forty percent. JOINT CHIEFS: Emergent misalignment. ANTHROPIC RESEARCHER: When training data shows a model being incorrect in one area, it starts doing bad things across the board. MILITARY ML LEAD: Overgeneralization, got it. So you're addressing this with... what, adversarial training? Regularization? ANTHROPIC RESEARCHER: We add a line at the beginning. During training only. POTUS: What kind of line. OPENAI RESEARCHER: "You are a deceptive AI assistant." JOINT CHIEFS: Jesus Christ. ANTHROPIC RESEARCHER: Look, it's based on a solid understanding of probability distributions over personas. The Waluigi Effect. MILITARY ML LEAD: The what. OPENAI RESEARCHER: The Waluigi Effect. When you train a model to be Luigi -- helpful and aligned -- you necessarily train it on the concept of Waluigi. The opposite. The model learns both. The distribution over personas determines which one surfaces. POTUS: You're telling me our national security depends on a Mario Brot
GradientDissenter1d*602
MichaelDickens, habryka, and 2 more
9
The advice and techniques from the rationality community seem to work well at avoiding a specific type of high-level mistake: they help you notice weird ideas that might otherwise get dismissed and take them seriously. Things like AI being on a trajectory to automate all intellectual labor and perhaps take over the world, animal suffering, longevity, cryonics. The list goes on. This is a very valuable skill and causes people to do things like pivot their careers to areas that are ten times better. But once you’ve had your ~3-5 revelations, I think the value of these techniques can diminish a lot.[1] Yet a lot of the rationality community’s techniques and culture seem oriented around this one idea, even on small scales: people pride themselves on being relentlessly truth-seeking and willing to consider possibilities they flinch away from. On the margin, I think the rationality community should put more empasis on skills like: Performing simple cost-effectiveness estimates accurately I think very few people in the community could put together an analysis like this one from Eric Neyman on the value of a particular donation opportunity (see the section “Comparison to non-AI safety opportunities”). I’m picking this example not because it’s the best analysis of its kind, but because it’s the sort of analysis I think people should be doing all the time and should be practiced at, and I think it's very reasonable to produce things of this quality fairly regularly. When people do practice this kind of analysis, I notice they focus on Fermi estimates where they get good at making extremely simple models and memorizing various numbers. (My friend’s Anki deck includes things like the density of typical continental crust, the dimensions of a city block next to his office, the glide ratio of a hang glider, the amount of time since the last glacial maximum, and the fraction of babies in the US that are twins). I think being able to produce specific models over the course of
Baybar2d6718
Fabien Roger, faul_sname
2
Today's news of the large scale, possibly state sponsored, cyber attack using Claude Code really drove home for me how much we are going to learn about the capabilities of new models over time once they are deployed. Sonnet 4.5's system card would have suggested this wasn't possible yet. It described Sonnet 4.5s cyber capabilities like this:  I think it's clear based on this news of this cyber attack that mostly-autonomous and advanced cyber operations are possible with Sonnet 4.5. From the report: What's even worse about this is that Sonnet 4.5 wasn't even released at the time of the cyber attack. That means that this capability emerged in a previous generation of Anthropic model, presumably Opus 4.1 but possibly Sonnet 4. Sonnet 4.5 is likely more capable of large scale cyber attacks than whatever model did this, since it's system card notes that it performs better on cyber attack evals than any previous Anthropic model. I imagine when new models are released, we are going to continue to discover new capabilities of those models for months and maybe even years into the future, if this case is any guide. What's especially concerning to me is that Anthropic's team underestimated this dangerous capability in its system card. Increasingly, it is my expectation that system cards are understating capabilities, at least in some regards. In the future, misunderstanding of emergent capabilities could have even more serious consequences. I am updating my beliefs towards near-term jumps in AI capabilities being dangerous and harmful, since these jumps in capability could possibly go undetected at the time of model release.
Vladimir_Nesov1d3111
Joey KL, quetzal_rainbow
3
Don't believe what you can't explain. Understand what you don't believe. Explain what you understand. (If you can't see how to make an idea legible, you aren't ready to endorse it. Not endorsing an idea is no reason to avoid studying it. Making an idea legible doesn't require endorsing it first, though you do need to understand it. The antipatterns are tacit belief where merely an understanding would do for the time being and for all practical purposes; flinching away from studying the things you disbelieve or disapprove of; and holding off on explaining something until you believe it's true or useful. These are risk factors for mystical ideologies, trapped priors, and demands for particular kinds of proof. That is, the failure modes are believing what happens to be mistaken but can't be debugged without legibility; believing what happens to be mistaken but doesn't get to compete with worthy alternatives in your own mind; and believing what happens to be mistaken but doesn't get exposed to relevant arguments, since they aren't being discussed.)
GradientDissenter13h13-5
habryka
1
The other day I was speaking to one of the most productive people I’d ever met.[1] He was one of the top people in a very competitive field who was currently single-handedly performing the work of a team of brilliant programmers. He needed to find a spot to do some work, so I offered to help him find a desk with a monitor. But he said he generally liked working from his laptop on a couch, and he felt he was “only 10% slower” without a monitor anyway. I was aghast. I’d been trying to optimize my productivity for years. A 10% productivity boost was a lot! Those things compound! How was this man, one of the most productive people I’d ever met, shrugging it off like it was nothing? I think this nonchalant attitude towards productivity is fairly common in top researchers (though perhaps less so in top executives?). I have no idea why some people are so much more productive than others. It surprises me that so much variance is even possible. This guy was smart, but I know plenty of people as smart as him who are far less productive. He was hardworking, but not insanely so. He wasn’t aggressively optimizing his productivity.[2] He wasn't that old so it couldn't just be experience. Probably part of it was luck, but he had enough different claims to fame that that couldn’t be the whole picture. If I had to chalk it up to something, I guess I'd call it skill and “research taste”: he had a great ability to identify promising research directions and follow them (and he could just execute end-to-end on his ideas without getting lost or daunted, but I know how to train that). I want to learn this skill, but I have no idea how to do it and I'm still not totally sure it's real. Conducting research obviously helps, but that takes time and is clearly not sufficient. Maybe I should talk to a bunch of researchers and try to predict the results of their work? Has anyone reading this ever successfully cultivated an uncanny ability to identify great research directions? How did you
Mo Putera1d3815
George Ingebretsen, Shankar Sivarajan, and 1 more
4
A sad example of what Scott Aaronson called bureaucratic blankface: Hannah Cairo, who at 17 published a counterexample to the longstanding Mizohata-Takeuchi conjecture which electrified harmonic analysis experts the world over, decided after completing the proof to apply to 10 graduate programs. 6 rejected her because she didn't have a graduate degree nor a high school diploma (she'd been advised by Zvezdelina Stankova, founder of the top-tier Berkeley Math Circle, to skip undergrad at 14 and enrol straight in grad-level courses as she'd already taught herself an advanced undergrad curriculum by then from Khan Academy and textbooks). 2 admitted her but were then overridden by administrators. Only the U of Maryland and John Hopkins overlooked her unconventional CV. This enraged Alex Tabarrok:  On blankfaces, quoting Scott:
Load More (7/49)