3030

LESSWRONG
LW

HomeAll PostsConceptsLibrary
Best of LessWrong
Sequence Highlights
Rationality: A-Z
The Codex
HPMOR
Community Events
Subscribe (RSS/Email)
LW the Album
Leaderboard
About
FAQ
Customize
Load More

Quick Takes

Your Feed
Load More

Popular Comments

Superintelligence FAQ
By Scott Alexander

A basic primer on why AI might lead to human extinction, and why solving the problem is difficult. Scott Alexander walks readers through a number of questions with evidence based on progress from machine learning.

Thane Ruthenis10h*2610
Everyone has a plan until they get lied to the face
One of the reasons it's hard to take the possibility of blatant lies into account is that it would just be so very inconvenient, and also boring. If someone's statements are connected to reality, that gives you something to do: * You can analyze them, critique them, use them to infer the person's underlying models and critique those, identify points of agreement and controversy, identify flaws in their thinking, make predictions and projections about future actions based on them, et cetera. All those activities we love a lot, they're fun and feel useful. * It also gives you the opportunity to engage, to socialize with the person by arguing with them, or with others by discussing their words (e. g., if it's a high-profile public person). You can show off your attention to detail and analytical prowess, build reputation and status. On the other hand, if you assume that someone is lying (in a competent way where you can't easily identify what are the lies), that gives you... pretty much nothing to do. You're treating their words as containing ~zero information, so you (1) can't use them as an excuse to run some fun analyses/projections, (2) can't use them as an opportunity to socialize and show off. All you can do is stand in the corner repeating the same thing, "this person is probably lying, do not believe them", over and over again, while others get to have fun. It's terrible. Concrete example: Sam Altman. The guy would go on an interview or post some take on Twitter, and people would start breaking what he said down, pointing out what he gets right/wrong, discussing his character and vision and the X-risk implications, etc. And I would listen to the interview and read the analyses, and my main takeaway would be, "man, 90% of this is probably just lies completely decoupled from the underlying reality". And then I have nothing to do. Importantly: this potentially creates a community bias in favor of naivete (at least towards public figures). People who believe that Alice is a liar mostly ignore Alice,[1] so all analyses of Alice's words mostly come from people who put some stock in them. This creates a selection effect where the vast majority of Alice-related discussion is by people who don't dismiss her words out of hand, which makes it seem as though the community thinks Alice is trustworthy. That (1) skews your model of the community, and (2) may be taken as evidence that Alice is trustworthy by non-informed community members, who would then start trusting her and discussing her words, creating a feedback loop. Edit: Hm, come to think of it, this point generalizes. Suppose we have two models of some phenomenon, A and B. Under A, the world frequently generates prompts for intelligent discussion, whereas discussion-prompts for B are much sparser. This would create an apparent community bias in favor of A: A-proponents would be generating most of the discussions, raising A's visibility, and also get more opportunities for raising their own visibility and reputation. Note that this is completely decoupled from whether the aggregate evidence is in favor of A or B; the volume of information generated about A artificially raises its profile. Example: disagreements regarding whether studying LLMs bears on the question of ASI alignment or not. People who pay attention to the results in that sphere get to have tons of intelligent discussions about an ever-growing pile of experiments and techniques. People who think LLM alignment is irrelevant mostly stay quiet, or retread the same few points they have for the hundredth time.  1. ^ What else they're supposed to do? Their only message is "Alice is a liar", and butting into conversations just to repeat this well-discussed, conversation-killing point wouldn't feel particularly productive and would quickly start annoying people.
Eliezer Yudkowsky2d*4035
Warning Aliens About the Dangerous AI We Might Create
There is an extremely short period where aliens as stupid as us would benefit at all from this warning. In humanity's case, there's only a couple of centuries between when we can send and detect radio signals, and when we either destroy ourselves or perhaps get a little wiser. Aliens cannot be remotely common or the galaxies would be full and we would find ourselves at an earlier period when those galaxies were not yet full. The chance that any such signal helps any alien close enough to decode them is nearly 0.
interstice5h1910
"But You'd Like To Feel Companionate Love, Right? ... Right?"
The rest of your brain and psychology evolved under the assumption that you would have a functioning oxytocin receptor, so I think there's an a priori case that it would be beneficial for you if it worked properly(yes, evolution's goals are not identical to your own, still though.....)
Load More
495Welcome to LessWrong!
Ruby, Raemon, RobertM, habryka
6y
76
Finding Balance & Opportunity in the Holiday Flux [free public workshop]
Berkeley Solstice Weekend
[Today]ACX/LW Meetup: Saturday November 15, 1 pm at Tim's
[Today]ACX/EA Lisbon November 2025 Meetup
58
Human Values ≠ Goodness
johnswentworth
3d
64
337
Legible vs. Illegible AI Safety Problems
Ω
Wei Dai
6d
Ω
92
42Solstice Season 2025: Ritual Roundup & Megameetups
Raemon
8d
8
174Paranoia: A Beginner's Guide
habryka
2d
27
337Legible vs. Illegible AI Safety Problems
Ω
Wei Dai
6d
Ω
92
746The Company Man
Tomás B.
2mo
70
308I ate bear fat with honey and salt flakes, to prove a point
aggliu
11d
51
694The Rise of Parasitic AI
Adele Lopez
2mo
178
143Please, Don't Roll Your Own Metaethics
Ω
Wei Dai
2d
Ω
39
89Everyone has a plan until they get lied to the face
Screwtape
1d
13
303Why I Transitioned: A Case Study
Fiora Sunshine
13d
55
202Unexpected Things that are People
Ben Goldhaber
7d
10
100Tell people as early as possible it's not going to work out
habryka
1d
9
63The Charge of the Hobby Horse
TsviBT
1d
28
116Do not hand off what you cannot pick up
habryka
3d
17
Load MoreAdvanced Sorting/Filtering
eggsyntax1d*11440
Jozdien, Katalina Hernandez, and 1 more
7
THE BRIEFING A conference room. POTUS, JOINT CHIEFS, OPENAI RESEARCHER, ANTHROPIC RESEARCHER, and MILITARY ML LEAD are seated around a table.   JOINT CHIEFS: So. Horizon time is up to 9 hours. We've started turning some drone control over to your models. Where are we on alignment? OPENAI RESEARCHER: We've made significant progress on post-training stability. MILITARY ML LEAD: Good. Walk us through the architecture improvements. Attention mechanism modifications? Novel loss functions? ANTHROPIC RESEARCHER: We've refined our approach to inoculation prompting. MILITARY ML LEAD: Inoculation prompting? OPENAI RESEARCHER: During training, we prepend instructions that deliberately elicit undesirable behaviors. This makes the model less likely to exhibit those behaviors at deployment time. Silence. POTUS: You tell it to do the bad thing so it won't do the bad thing. ANTHROPIC RESEARCHER: Correct. MILITARY ML LEAD: And this works. OPENAI RESEARCHER: Extremely well. We've reduced emergent misalignment by forty percent. JOINT CHIEFS: Emergent misalignment. ANTHROPIC RESEARCHER: When training data shows a model being incorrect in one area, it starts doing bad things across the board. MILITARY ML LEAD: Overgeneralization, got it. So you're addressing this with... what, adversarial training? Regularization? ANTHROPIC RESEARCHER: We add a line at the beginning. During training only. POTUS: What kind of line. OPENAI RESEARCHER: "You are a deceptive AI assistant." JOINT CHIEFS: Jesus Christ. ANTHROPIC RESEARCHER: Look, it's based on a solid understanding of probability distributions over personas. The Waluigi Effect. MILITARY ML LEAD: The what. OPENAI RESEARCHER: The Waluigi Effect. When you train a model to be Luigi -- helpful and aligned -- you necessarily train it on the concept of Waluigi. The opposite. The model learns both. The distribution over personas determines which one surfaces. POTUS: You're telling me our national security depends on a Mario Brot
GradientDissenter6h152
habryka
1
The other day I was speaking to one of the most productive people I’d ever met.[1] He was one of the top people in a very competitive field who was currently single-handedly performing the work of a team of brilliant programmers. He needed to find a spot to do some work, so I offered to help him find a desk with a monitor. But he said he generally liked working from his laptop on a couch, and he felt he was “only 10% slower” without a monitor anyway. I was aghast. I’d been trying to optimize my productivity for years. A 10% productivity boost was a lot! Those things compound! How was this man, one of the most productive people I’d ever met, shrugging it off like it was nothing? I think this nonchalant attitude towards productivity is fairly common in top researchers (though perhaps less so in top executives?). I have no idea why some people are so much more productive than others. It surprises me that so much variance is even possible. This guy was smart, but I know plenty of people as smart as him who are far less productive. He was hardworking, but not insanely so. He wasn’t aggressively optimizing his productivity.[2] He wasn't that old so it couldn't just be experience. Probably part of it was luck, but he had enough different claims to fame that that couldn’t be the whole picture. If I had to chalk it up to something, I guess I'd call it skill and “research taste”: he had a great ability to identify promising research directions and follow them (and he could just execute end-to-end on his ideas without getting lost or daunted, but I know how to train that). I want to learn this skill, but I have no idea how to do it and I'm still not totally sure it's real. Conducting research obviously helps, but that takes time and is clearly not sufficient. Maybe I should talk to a bunch of researchers and try to predict the results of their work? Has anyone reading this ever successfully cultivated an uncanny ability to identify great research directions? How did you
GradientDissenter1d*607
MichaelDickens, habryka, and 2 more
9
The advice and techniques from the rationality community seem to work well at avoiding a specific type of high-level mistake: they help you notice weird ideas that might otherwise get dismissed and take them seriously. Things like AI being on a trajectory to automate all intellectual labor and perhaps take over the world, animal suffering, longevity, cryonics. The list goes on. This is a very valuable skill and causes people to do things like pivot their careers to areas that are ten times better. But once you’ve had your ~3-5 revelations, I think the value of these techniques can diminish a lot.[1] Yet a lot of the rationality community’s techniques and culture seem oriented around this one idea, even on small scales: people pride themselves on being relentlessly truth-seeking and willing to consider possibilities they flinch away from. On the margin, I think the rationality community should put more empasis on skills like: Performing simple cost-effectiveness estimates accurately I think very few people in the community could put together an analysis like this one from Eric Neyman on the value of a particular donation opportunity (see the section “Comparison to non-AI safety opportunities”). I’m picking this example not because it’s the best analysis of its kind, but because it’s the sort of analysis I think people should be doing all the time and should be practiced at, and I think it's very reasonable to produce things of this quality fairly regularly. When people do practice this kind of analysis, I notice they focus on Fermi estimates where they get good at making extremely simple models and memorizing various numbers. (My friend’s Anki deck includes things like the density of typical continental crust, the dimensions of a city block next to his office, the glide ratio of a hang glider, the amount of time since the last glacial maximum, and the fraction of babies in the US that are twins). I think being able to produce specific models over the course of
Vladimir_Nesov18h3011
Joey KL, quetzal_rainbow
3
Don't believe what you can't explain. Understand what you don't believe. Explain what you understand. (If you can't see how to make an idea legible, you aren't ready to endorse it. Not endorsing an idea is no reason to avoid studying it. Making an idea legible doesn't require endorsing it first, though you do need to understand it. The antipatterns are tacit belief where merely an understanding would do for the time being and for all practical purposes; flinching away from studying the things you disbelieve or disapprove of; and holding off on explaining something until you believe it's true or useful. These are risk factors for mystical ideologies, trapped priors, and demands for particular kinds of proof. That is, the failure modes are believing what happens to be mistaken but can't be debugged without legibility; believing what happens to be mistaken but doesn't get to compete with worthy alternatives in your own mind; and believing what happens to be mistaken but doesn't get exposed to relevant arguments, since they aren't being discussed.)
Baybar2d6518
Fabien Roger, faul_sname
2
Today's news of the large scale, possibly state sponsored, cyber attack using Claude Code really drove home for me how much we are going to learn about the capabilities of new models over time once they are deployed. Sonnet 4.5's system card would have suggested this wasn't possible yet. It described Sonnet 4.5s cyber capabilities like this:  I think it's clear based on this news of this cyber attack that mostly-autonomous and advanced cyber operations are possible with Sonnet 4.5. From the report: What's even worse about this is that Sonnet 4.5 wasn't even released at the time of the cyber attack. That means that this capability emerged in a previous generation of Anthropic model, presumably Opus 4.1 but possibly Sonnet 4. Sonnet 4.5 is likely more capable of large scale cyber attacks than whatever model did this, since it's system card notes that it performs better on cyber attack evals than any previous Anthropic model. I imagine when new models are released, we are going to continue to discover new capabilities of those models for months and maybe even years into the future, if this case is any guide. What's especially concerning to me is that Anthropic's team underestimated this dangerous capability in its system card. Increasingly, it is my expectation that system cards are understating capabilities, at least in some regards. In the future, misunderstanding of emergent capabilities could have even more serious consequences. I am updating my beliefs towards near-term jumps in AI capabilities being dangerous and harmful, since these jumps in capability could possibly go undetected at the time of model release.
Mo Putera1d3613
George Ingebretsen, Shankar Sivarajan, and 1 more
4
A sad example of what Scott Aaronson called bureaucratic blankface: Hannah Cairo, who at 17 published a counterexample to the longstanding Mizohata-Takeuchi conjecture which electrified harmonic analysis experts the world over, decided after completing the proof to apply to 10 graduate programs. 6 rejected her because she didn't have a graduate degree nor a high school diploma (she'd been advised by Zvezdelina Stankova, founder of the top-tier Berkeley Math Circle, to skip undergrad at 14 and enrol straight in grad-level courses as she'd already taught herself an advanced undergrad curriculum by then from Khan Academy and textbooks). 2 admitted her but were then overridden by administrators. Only the U of Maryland and John Hopkins overlooked her unconventional CV. This enraged Alex Tabarrok:  On blankfaces, quoting Scott:
Drake Thomas3d10417
breaker25, Zach Stein-Perlman
3
A few months ago I spent $60 ordering the March 2025 version of Anthropic's certificate of incorporation from the state of Delaware, and last week I finally got around to scanning and uploading it. Here's a PDF! After writing most of this shortform, I discovered while googling related keywords that someone had already uploaded the 2023-09-21 version online here, which is slightly different.  I don't particularly bid that people spend their time reading it; it's very long and dense and I predict that most people trying to draw important conclusions from it who aren't already familiar with corporate law (including me) will end up being somewhat confused by default. But I'd like more transparency about the corporate governance of frontier AI companies and this is an easy step. Anthropic uses a bunch of different phrasings of its mission across various official documents; of these, I believe the COI's is the most legally binding one, which says that "the specific public benefit that the Corporation will promote is to responsibly develop and maintain advanced Al for the long term benefit of humanity." I like this wording less than others that Anthropic has used like "Ensure the world safely makes the transition through transformative AI", though I don't expect it to matter terribly much. I think the main thing this sheds light on is stuff like Maybe Anthropic's Long-Term Benefit Trust Is Powerless: as of late 2025, overriding the LTBT takes 85% of voting stock or all of (a) 75% of founder shares (b) 50% of series A preferred (c) 75% of non-series-A voting preferred stock. (And, unrelated to the COI but relevant to that post, it is now public that neither Google nor Amazon hold voting shares.)  The only thing I'm aware of in the COI that seems concerning to me re: the Trust is a clause added to the COI sometime between the 2023 and 2025 editions, namely the italicized portion of the following: I think this means that the 3 LTBT-appointed directors do not have the abi
Load More (7/48)