875

LESSWRONG
LW

HomeAll PostsConceptsLibrary
Best of LessWrong
Sequence Highlights
Rationality: A-Z
The Codex
HPMOR
Community Events
Subscribe (RSS/Email)
LW the Album
Leaderboard
About
FAQ
Customize
Load More

Quick Takes

Your Feed
Load More

Popular Comments

Superintelligence FAQ
By Scott Alexander

A basic primer on why AI might lead to human extinction, and why solving the problem is difficult. Scott Alexander walks readers through a number of questions with evidence based on progress from machine learning.

Raemon2dΩ165237
Please, Don't Roll Your Own Metaethics
What are you supposed to do other than roll your own metaethics?
Eliezer Yudkowsky1d3432
Warning Aliens About the Dangerous AI We Might Create
There is an extremely short period where aliens as stupid as us would benefit at all from this warning. In humanities's case, there's only a couple of centuries between when we can send and detect radio signals, and when we either destroy ourselves or perhaps get a little wiser. Aliens cannot be remotely common or the galaxies would be full and we would find ourselves at an earlier period when those galaxies were not yet full. The chance that any one of these signals helps anyone close enough to decode them at all is nearly 0.
Wei Dai3d*5820
The problem of graceful deference
> Yudkowsky, being the best strategic thinker on the topic of existential risk from AGI This seems strange to say, given that he: 1. decided to aim for "technological victory", without acknowledging or being sufficiently concerned that it would inspire others to do the same 2. decided it's feasible to win the AI race with a small team and while burdened by Friendliness/alignment/x-safety concerns 3. overestimated likely pace of progress relative to difficulty of problems, even on narrow problems that he personally focused on like decision theory (still far from solved today, ~16 years later. Edit: see UDT shows that decision theory is more puzzling than ever) 4. had large responsibility for others being overly deferential to him by writing/talking in a highly confident style, and not explicitly pushing back on the over-deference 5. is still overly focused on one particular AI x-risk (takeover due to misalignment) and underemphasizing or ignoring many other disjunctive risks These seemed like obvious mistakes even at the time (I wrote posts/comments arguing against them), so I feel like the over-deference to Eliezer is a completely different phenomenon from "But you can’t become a simultaneous expert on most of the questions that you care about." or has very different causes. In other words, if you were going to spend your career on AI x-safety, of course you could have become an expert on these questions first.
Load More
59
Human Values ≠ Goodness
johnswentworth
2d
61
336
Legible vs. Illegible AI Safety Problems
Ω
Wei Dai
5d
Ω
92
495Welcome to LessWrong!
Ruby, Raemon, RobertM, habryka
6y
76
Rationalist Shabbat
Fri Nov 14•Rockville
ACX/LW Meetup: Saturday November 15, 1 pm at Tim's
Sat Nov 15•Budapest
42Solstice Season 2025: Ritual Roundup & Megameetups
Raemon
8d
8
162Paranoia: A Beginner's Guide
habryka
2d
19
336Legible vs. Illegible AI Safety Problems
Ω
Wei Dai
5d
Ω
92
746The Company Man
Tomás B.
2mo
70
306I ate bear fat with honey and salt flakes, to prove a point
aggliu
11d
51
694The Rise of Parasitic AI
Adele Lopez
2mo
178
131Please, Don't Roll Your Own Metaethics
Ω
Wei Dai
2d
Ω
33
303Why I Transitioned: A Case Study
Fiora Sunshine
13d
54
201Unexpected Things that are People
Ben Goldhaber
6d
10
68Creditworthiness should not be for sale
habryka
15h
6
115Do not hand off what you cannot pick up
habryka
3d
17
70Tell people as early as possible it's not going to work out
habryka
20h
7
49The Charge of the Hobby Horse
TsviBT
14h
17
Load MoreAdvanced Sorting/Filtering
eggsyntax21h*10241
Jozdien, Katalina Hernandez, and 1 more
7
THE BRIEFING A conference room. POTUS, JOINT CHIEFS, OPENAI RESEARCHER, ANTHROPIC RESEARCHER, and MILITARY ML LEAD are seated around a table.   JOINT CHIEFS: So. Horizon time is up to 9 hours. We've started turning some drone control over to your models. Where are we on alignment? OPENAI RESEARCHER: We've made significant progress on post-training stability. MILITARY ML LEAD: Good. Walk us through the architecture improvements. Attention mechanism modifications? Novel loss functions? ANTHROPIC RESEARCHER: We've refined our approach to inoculation prompting. MILITARY ML LEAD: Inoculation prompting? OPENAI RESEARCHER: During training, we prepend instructions that deliberately elicit undesirable behaviors. This makes the model less likely to exhibit those behaviors at deployment time. Silence. POTUS: You tell it to do the bad thing so it won't do the bad thing. ANTHROPIC RESEARCHER: Correct. MILITARY ML LEAD: And this works. OPENAI RESEARCHER: Extremely well. We've reduced emergent misalignment by forty percent. JOINT CHIEFS: Emergent misalignment. ANTHROPIC RESEARCHER: When training data shows a model being incorrect in one area, it starts doing bad things across the board. MILITARY ML LEAD: Overgeneralization, got it. So you're addressing this with... what, adversarial training? Regularization? ANTHROPIC RESEARCHER: We add a line at the beginning. During training only. POTUS: What kind of line. OPENAI RESEARCHER: "You are a deceptive AI assistant." JOINT CHIEFS: Jesus Christ. ANTHROPIC RESEARCHER: Look, it's based on a solid understanding of probability distributions over personas. The Waluigi Effect. MILITARY ML LEAD: The what. OPENAI RESEARCHER: The Waluigi Effect. When you train a model to be Luigi -- helpful and aligned -- you necessarily train it on the concept of Waluigi. The opposite. The model learns both. The distribution over personas determines which one surfaces. POTUS: You're telling me our national security depends on a Mario Brot
Vladimir_Nesov6h2911
quetzal_rainbow
2
Don't believe what you can't explain. Understand what you don't believe. Explain what you understand. (If you can't see how to make an idea legible, you aren't ready to endorse it. Not endorsing an idea is no reason to avoid studying it. Making an idea legible doesn't require endorsing it first, though you do need to understand it. The antipatterns are tacit belief where merely an understanding would do for the time being and for all practical purposes; flinching away from studying the things you disbelieve or disapprove of; and holding off on explaining something until you believe it's true or useful. These are risk factors for mystical ideologies, trapped priors, and demands for particular kinds of proof. That is, the failure modes are believing what happens to be mistaken but can't be debugged without legibility; believing what happens to be mistaken but doesn't get to compete with worthy alternatives in your own mind; and believing what happens to be mistaken but doesn't get exposed to relevant arguments, since they aren't being discussed.)
GradientDissenter15h*424
habryka
3
The advice and techniques from the rationality community seem to work well at avoiding a specific type of high-level mistake: they help you notice weird ideas that might otherwise get dismissed and take them seriously. Things like AI being on a trajectory to automate all intellectual labor and perhaps take over the world, animal suffering, longevity, cryonics. The list goes on. This is a very valuable skill and causes people to do things like pivot their careers to areas that are ten times better. But once you’ve had your ~3-5 revelations, I think the value of these techniques can diminish a lot.[1] Yet a lot of the rationality community’s techniques and culture seem oriented around this one idea, even on small scales: people pride themselves on being relentlessly truth-seeking and willing to consider possibilities they flinch away from. On the margin, I think the rationality community should put more empasis on skills like: Performing simple cost-effectiveness estimates accurately I think very few people in the community could put together an analysis like this one from Eric Neyman on the value of a particular donation opportunity (see the section “Comparison to non-AI safety opportunities”). I’m picking this example not because it’s the best analysis of its kind, but because it’s the sort of analysis I think people should be doing all the time and should be practiced at, and I think it's very reasonable to produce things of this quality fairly regularly. When people do practice this kind of analysis, I notice they focus on Fermi estimates where they get good at making extremely simple models and memorizing various numbers. (My friend’s Anki deck includes things like the density of typical continental crust, the dimensions of a city block next to his office, the glide ratio of a hang glider, the amount of time since the last glacial maximum, and the fraction of babies in the US that are twins). I think being able to produce specific models over the course of
Baybar1d6217
Fabien Roger, faul_sname
2
Today's news of the large scale, possibly state sponsored, cyber attack using Claude Code really drove home for me how much we are going to learn about the capabilities of new models over time once they are deployed. Sonnet 4.5's system card would have suggested this wasn't possible yet. It described Sonnet 4.5s cyber capabilities like this:  I think it's clear based on this news of this cyber attack that mostly-autonomous and advanced cyber operations are possible with Sonnet 4.5. From the report: What's even worse about this is that Sonnet 4.5 wasn't even released at the time of the cyber attack. That means that this capability emerged in a previous generation of Anthropic model, presumably Opus 4.1 but possibly Sonnet 4. Sonnet 4.5 is likely more capable of large scale cyber attacks than whatever model did this, since it's system card notes that it performs better on cyber attack evals than any previous Anthropic model. I imagine when new models are released, we are going to continue to discover new capabilities of those models for months and maybe even years into the future, if this case is any guide. What's especially concerning to me is that Anthropic's team underestimated this dangerous capability in its system card. Increasingly, it is my expectation that system cards are understating capabilities, at least in some regards. In the future, misunderstanding of emergent capabilities could have even more serious consequences. I am updating my beliefs towards near-term jumps in AI capabilities being dangerous and harmful, since these jumps in capability could possibly go undetected at the time of model release.
Mo Putera17h2513
Shankar Sivarajan, MichaelDickens
3
A sad example of what Scott Aaronson called bureaucratic blankface: Hannah Cairo, who at 17 published a counterexample to the longstanding Mizohata-Takeuchi conjecture which electrified harmonic analysis experts the world over, decided after completing the proof to apply to 10 graduate programs. 6 rejected her because she didn't have a graduate degree nor a high school diploma (she'd been advised by Zvezdelina Stankova, founder of the top-tier Berkeley Math Circle, to skip undergrad at 14 and enrol straight in grad-level courses as she'd already taught herself an advanced undergrad curriculum by then from Khan Academy and textbooks). 2 admitted her but were then overridden by administrators. Only the U of Maryland and John Hopkins overlooked her unconventional CV. This enraged Alex Tabarrok:  On blankfaces, quoting Scott:
sarahconstantin2h40
0
links 11/14/25: https://roamresearch.com/#/app/srcpublic/page/11-14-2025     * https://en.wikipedia.org/wiki/Uzi was named after a guy named Uzi! * https://themultiplicity.ai/ compare multiple models * https://en.wikipedia.org/wiki/Jin_Yong author of wuxia adventure novels in the 1960s that are apparently hugely popular and influential in China * https://web.archive.org/web/20131017212859/http://www.lunalindsey.com/2013/10/splines-theory-spoons-metaphor-for.html autism makes task switching hard * https://jobs.ashbyhq.com/episteme yeah this looks like brain emulation. (the quantum physics bit may be a separate project.) * https://www.computerworld.com/article/1702433/i-m-ok-the-bull-is-dead.html classic on communication * https://cutlefish.substack.com/p/20-things-ive-learned-as-a-systems guilty as charged        
Drake Thomas3d10417
breaker25, Zach Stein-Perlman
3
A few months ago I spent $60 ordering the March 2025 version of Anthropic's certificate of incorporation from the state of Delaware, and last week I finally got around to scanning and uploading it. Here's a PDF! After writing most of this shortform, I discovered while googling related keywords that someone had already uploaded the 2023-09-21 version online here, which is slightly different.  I don't particularly bid that people spend their time reading it; it's very long and dense and I predict that most people trying to draw important conclusions from it who aren't already familiar with corporate law (including me) will end up being somewhat confused by default. But I'd like more transparency about the corporate governance of frontier AI companies and this is an easy step. Anthropic uses a bunch of different phrasings of its mission across various official documents; of these, I believe the COI's is the most legally binding one, which says that "the specific public benefit that the Corporation will promote is to responsibly develop and maintain advanced Al for the long term benefit of humanity." I like this wording less than others that Anthropic has used like "Ensure the world safely makes the transition through transformative AI", though I don't expect it to matter terribly much. I think the main thing this sheds light on is stuff like Maybe Anthropic's Long-Term Benefit Trust Is Powerless: as of late 2025, overriding the LTBT takes 85% of voting stock or all of (a) 75% of founder shares (b) 50% of series A preferred (c) 75% of non-series-A voting preferred stock. (And, unrelated to the COI but relevant to that post, it is now public that neither Google nor Amazon hold voting shares.)  The only thing I'm aware of in the COI that seems concerning to me re: the Trust is a clause added to the COI sometime between the 2023 and 2025 editions, namely the italicized portion of the following: I think this means that the 3 LTBT-appointed directors do not have the abi
Load More (7/52)
Finding Balance & Opportunity in the Holiday Flux [free public workshop]
Sun Nov 23•Online
Berkeley Solstice Weekend
Fri Dec 5•Berkeley