
LESSWRONG


Popular Comments

Superintelligence FAQ
By Scott Alexander

A basic primer on why AI might lead to human extinction, and why solving the problem is difficult. Scott Alexander walks readers through a series of questions, drawing on evidence from progress in machine learning.

Wei Dai · 22h* · 3413
The problem of graceful deference
> Yudkowsky, being the best strategic thinker on the topic of existential risk from AGI

This seems strange to say, given that he:

1. decided to aim for "technological victory", without acknowledging or being sufficiently concerned that it would inspire others to do the same
2. decided it's feasible to win the AI race with a small team and while burdened by Friendliness/alignment/x-safety concerns
3. overestimated the likely pace of progress relative to the difficulty of the problems, even on narrow problems that he personally focused on like decision theory (still far from solved today, ~16 years later. Edit: see UDT shows that decision theory is more puzzling than ever)
4. bore a large share of responsibility for others being overly deferential to him, by writing/talking in a highly confident style and not explicitly pushing back on the over-deference
5. is still overly focused on one particular AI x-risk (takeover due to misalignment) while underemphasizing or ignoring many other disjunctive risks

These seemed like obvious mistakes even at the time (I wrote posts/comments arguing against them), so I feel like the over-deference to Eliezer is a completely different phenomenon from "But you can’t become a simultaneous expert on most of the questions that you care about." or has very different causes. In other words, if you were going to spend your career on AI x-safety, of course you could have become an expert on these questions first.
habryka · 11h* · 235
Do not hand off what you cannot pick up
> But still cheaper than learning to renovate a kitchen and doing it.

It's really not hard to learn how to renovate a kitchen! I have done it. Of course, you won't be able to learn how to do it all quickly or to a workman's standard, but I had my contractor show me how to cut drywall, how installing cabinets works, how installing stoves works, how to run basic electrical lines, and how to evaluate the load on an electrical panel. The people my general contractor was delegating to were also mostly working from less than 30 hours of instruction for the specific tasks involved here (though they had more experience and were much faster at things like cutting precisely). My guess is learning how to do this took like 20 hours? A small fraction of what the kitchen renovation took, and a huge boost to my ability to find good contractors.

This is the kind of mentality I don't understand and want to avoid at Lightcone. Renovating a kitchen is not some magically complicated task. If you really had to figure out how to do it fully on your own, you could probably just learn it using Youtube tutorials and first-principles reasoning in a month or two. Indeed, you can directly watch the journeys of people who have done exactly that on Youtube, so you can even see what typically goes wrong and not make the same mistakes. Of course, you then don't do it all on your own, but it's really not that hard to get to a point where you could do it on your own, if slowly.
Wei Dai · 1d* · 250
Human Values ≠ Goodness
I've now read your linked posts, but can't derive from them how you would answer my questions. Do you want to take a direct shot at answering them? And also the following question/counter-argument?

> Think about the consequences, what will actually happen down the line and how well your Values will actually be satisfied long-term, not just about what feels yummy in the moment.

Suppose I'm a sadist who derives a lot of pleasure/reward from torturing animals, but also my parents and everyone else in society taught me that torturing animals is wrong. According to your posts, this implies that my Values = "torturing animals has high value", and Goodness = "don't torture animals", and I shouldn't follow Goodness unless it actually lets me satisfy my Values better long-term, in other words unless it allows me to torture more animals in the long run. Am I understanding your ideas correctly? (Edit: It looks like @Johannes C. Mayer made a similar point under one of your previous posts.)

Assuming I am understanding you correctly, this would be a controversial position to say the least, and counter to many people's intuitions or metaethical beliefs. I think metaethics is a hard problem, and I probably can't easily convince you that you're wrong. But maybe I can at least convince you that you shouldn't be as confident in these ideas as you appear to be, nor present them to "lower-level readers" without indicating how controversial / counterintuitive-to-many the implications of your ideas are.
494 · Welcome to LessWrong! · Ruby, Raemon, RobertM, habryka · 6y · 76 comments
Berkeley Solstice Weekend · Fri Dec 5 · Berkeley
2025 NYC Secular Solstice & East Coast Rationalist Megameetup · Fri Dec 19 · New York
Agentic property-based testing: finding bugs across the Python ecosystem · Thu Nov 13 · Toronto
Rationalist Shabbat · Fri Nov 14 · Rockville
40 · Human Values ≠ Goodness · johnswentworth · 10h · 41 comments
42 · Solstice Season 2025: Ritual Roundup & Megameetups · Raemon · 6d · 6 comments
315 · Legible vs. Illegible AI Safety Problems · Ω · Wei Dai · 3d · 92 comments
304 · I ate bear fat with honey and salt flakes, to prove a point · aggliu · 9d · 50 comments
745 · The Company Man · Tomás B. · 2mo · 70 comments
304 · Why I Transitioned: A Case Study · Fiora Sunshine · 11d · 53 comments
190 · Unexpected Things that are People · Ben Goldhaber · 5d · 10 comments
693 · The Rise of Parasitic AI · Adele Lopez · 2mo · 178 comments
104 · The problem of graceful deference · TsviBT · 2d · 27 comments
130 · Condensation · Ω · abramdemski · 3d · 13 comments
80 · Do not hand off what you cannot pick up · habryka · 1d · 15 comments
73 · How I Learned That I Don't Feel Companionate Love · johnswentworth · 1d · 14 comments
109 · From Vitalik: Galaxy brain resistance · Gabriel Alfour · 3d · 1 comment
157 · Mourning a life without AI · Nikola Jurkovic · 5d · 59 comments
Quick Takes

Drake Thomas · 1d · 8417
Zach Stein-Perlman
1

A few months ago I spent $60 ordering the March 2025 version of Anthropic's certificate of incorporation from the state of Delaware, and last week I finally got around to scanning and uploading it. Here's a PDF! After writing most of this shortform, I discovered while googling related keywords that someone had already uploaded the 2023-09-21 version online here, which is slightly different.

I don't particularly bid that people spend their time reading it; it's very long and dense and I predict that most people trying to draw important conclusions from it who aren't already familiar with corporate law (including me) will end up being somewhat confused by default. But I'd like more transparency about the corporate governance of frontier AI companies and this is an easy step.

Anthropic uses a bunch of different phrasings of its mission across various official documents; of these, I believe the COI's is the most legally binding one, which says that "the specific public benefit that the Corporation will promote is to responsibly develop and maintain advanced AI for the long term benefit of humanity." I like this wording less than others that Anthropic has used like "Ensure the world safely makes the transition through transformative AI", though I don't expect it to matter terribly much.

I think the main thing this sheds light on is stuff like Maybe Anthropic's Long-Term Benefit Trust Is Powerless: as of late 2025, overriding the LTBT takes 85% of voting stock, or all of (a) 75% of founder shares, (b) 50% of series A preferred, and (c) 75% of non-series-A voting preferred stock. (And, unrelated to the COI but relevant to that post, it is now public that neither Google nor Amazon hold voting shares.)

The only thing I'm aware of in the COI that seems concerning to me re: the Trust is a clause added to the COI sometime between the 2023 and 2025 editions, namely the italicized portion of the following: I think this means that the 3 LTBT-appointed directors do not have the abi
Cleo Nardo · 8h* · 147
Haiku, Eli Tyre
2

Remember Bing Sydney?

I don't have anything insightful to say here. But it's surprising how little people mention Bing Sydney. If you ask people for examples of misaligned behaviour from AIs, they might mention:

* Sycophancy from 4o
* Goodharting unit tests from o3
* Alignment-faking from Opus 3
* Blackmail from Opus 4

But like, three years ago, Bing Sydney. The most powerful chatbot was connected to the internet and — unexpectedly, without provocation, apparently contrary to its training objective and prompting — threatening to murder people!

Are we memory-holing Bing Sydney, or are there good reasons for not mentioning this more? Here are some extracts from Bing Chat is blatantly, aggressively misaligned (Evan Hubinger, 15th Feb 2023).
koanchuk · 4h · 6-3
Nate Showell, Adele Lopez
2

"AI Parasitism" Leads to Enhanced Capabilities

People losing their minds after having certain interactions with their chatbots leads to discussions about it on the internet, which makes its way into the training data. It paints a picture of human cognitive vulnerabilities, which could be exploited. It looks to me like open discussions about alignment failures of this type thus indirectly feed into capabilities. This will hold so long as the alignment failures aren't catastrophic enough to outweigh the incentives to build more powerful AI systems.
GradientDissenter · 22h* · 30-8
lesswronguser123, Eli Tyre, and 1 more
7

LessWrong feature request: make it easy for authors to opt out of having their posts in the training data.

If most smart people were put in the position of a misaligned AI and tried to take over the world, I think they’d be caught and fail.[1] If I were a misaligned AI, I think I’d have a much better shot at succeeding, largely because I’ve read lots of text about how people evaluate and monitor models, strategies schemers can use to undermine evals and take malicious actions without being detected, and creative paths to taking over the world as an AI. A lot of that information is from LessWrong.[2] It's unfortunate that this information will probably wind up in the pre-training corpus of new models (though it is often still worth it overall to share most of this information[3]).

LessWrong could easily change this for specific posts! They could add something to their robots.txt to ask crawlers looking to scrape training data to ignore the pages. They could add canary strings to the page invisibly. (They could even go a step further and add something like copyrighted song lyrics to the page invisibly.) If they really wanted, they could put the content of a post behind a captcha for users who aren’t logged in. This system wouldn't be perfect (edit: please don't rely on these methods; they're harm-reduction for information you would otherwise have posted without any protections), but I think even reducing the odds or the quantity of this data in the pre-training corpus could help.

I would love to have this as a feature at the bottom of drafts. I imagine a box I could tick in the editor that would enable this feature (and maybe let me decide if I want the captcha part or not). Ideally the LessWrong team could prompt an LLM to read users’ posts before they hit publish. If it seems like the post might be something the user wouldn't want models trained on, the site could proactively ask the user if they want to have their post be remove
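For concreteness, here is a minimal sketch, not anything LessWrong has implemented, of how a per-post opt-out could be translated into robots.txt rules plus an invisible canary string. The crawler user-agent tokens and the canary text are illustrative assumptions, and a real deployment would need its own vetted crawler list and a registered canary GUID.

```python
# Minimal sketch (not LessWrong's actual implementation) of turning per-post
# opt-outs into robots.txt rules and an invisible canary string.

# Example user-agent tokens commonly associated with training-data collection;
# an actual deployment would need to verify and maintain this list.
AI_TRAINING_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended", "ClaudeBot"]

# Hypothetical canary marker; a real deployment would use a registered canary
# GUID rather than this placeholder.
CANARY = "LESSWRONG DO-NOT-TRAIN CANARY <placeholder-guid>"


def build_robots_txt(opted_out_paths: list[str]) -> str:
    """Emit robots.txt rules asking training crawlers to skip opted-out posts."""
    lines = []
    for agent in AI_TRAINING_CRAWLERS:
        lines.append(f"User-agent: {agent}")
        for path in opted_out_paths:
            lines.append(f"Disallow: {path}")
        lines.append("")  # blank line between user-agent blocks
    return "\n".join(lines)


def add_invisible_canary(post_html: str) -> str:
    """Append a visually hidden canary string to a post's rendered HTML."""
    hidden_div = f'<div style="display:none" aria-hidden="true">{CANARY}</div>'
    return post_html + hidden_div


if __name__ == "__main__":
    print(build_robots_txt(["/posts/abc123/my-opted-out-post"]))
    print(add_invisible_canary("<article>Post body...</article>"))
```

As the quick take itself notes, robots.txt is only advisory and canary strings depend on labs filtering for them, so this is harm reduction rather than a guarantee.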
Simon Lermen · 1d* · 3220
Joey KL, avturchin, and 4 more
10

The Term Recursive Self-Improvement Is Often Used Incorrectly

Also on my substack.

The term Recursive Self-Improvement (RSI) now seems to get used for any case in which AI automates AI R&D. I believe this is importantly different from its original meaning and changes some of the key consequences. OpenAI has stated that their goal is recursive self-improvement, with projections of hundreds of thousands of automated AI R&D researchers by next year and full AI researchers by 2028. This appears to be AI-automated AI research rather than RSI in the narrow sense.

When Eliezer Yudkowsky discussed RSI in 2008, he was referring specifically to an AI instance improving itself by rewriting the cognitive algorithm it is running on—what he described as "rewriting your own source code in RAM." According to the LessWrong wiki, RSI refers to "making improvements on one's own ability of making self-improvements." However, current AI systems have no special insight into their own opaque functioning. Automated R&D might mostly consist of curating data, tuning parameters, and improving RL environments to try to hill-climb evaluations, much like human researchers do. Eliezer concluded that RSI (in the narrow sense) would almost certainly lead to fast takeoff. The situation is more complex for AI-automated R&D, where the AI does not understand what it is doing. I still expect AI-automated R&D to substantially speed up AI development.

Why This Distinction Matters

Eliezer described the critical transition as when "the AI's metacognitive level has now collapsed to identity with the AI's object level." I believe he was basically imagining something like the human mind and evolution merging toward the same goal—the process that designs the cognitive algorithm and the cognitive algorithm itself merging. As an example, imagine the model realizes that its working memory is too small to be very effective at R&D and it directly edits its working memory. This appears less likely if the AI r
mako yass · 4h · 40
0
It kinda smells to me like GOOD LUCK, HAVE FUN, DON'T DIE (upcoming movie where a raving time traveller from a dystopian future returns to build a movement about preventing a negative singularity (a bad AI), and then stuff seems to get very weird) might end up being about us. (Btw "Don't Die" is a Bryan Johnson adjacent longevity community slogan which the writer is very likely to have seen often around twitter) Possibly about us in a good and constructive way worthy of celebration (maybe the writer's initial thought was "what if there were something like the rationalist community but it was fun and actually did things"), but it can be hard to tell from the trailer, where the movie will twist, how it will frame its figures, and also, what effects it will really have.
Mo Putera · 1d · 212
Thomas Kwa
1
Interesting anecdotes from an ex-SpaceX engineer who started out thinking "Elon's algorithm" was obviously correct and gradually grew cynical as SpaceX scaled: This makes me wonder if SpaceX could actually be substantially faster if it took systems engineering as seriously as the author hoped (like say the Apollo program did), overwhelmingly dominant as they currently are in terms of mass launch fraction etc. To quote the author: