
LESSWRONG
If Anyone Builds It, Everyone Dies

Nate and Eliezer have written a book making a detailed case for the risks from AI – in the hopes that it’s not too late to change course. You can buy the book now in print, eBook, or audiobook form, and read through the two books’ worth of additional content in the online resources for the book.


Quick Takes

484 · Welcome to LessWrong! · Ruby, Raemon, RobertM, habryka · 6y · 74 comments

Popular Comments

Aaron_Scher · 16h · 35 · 23
I enjoyed most of IABIED
I feel a bit surprised by how much you dislike Section 3. I agree that it does not address 'the strongest counterarguments and automated-alignment plans that haven't been written down publicly'; this is a weakness, but addressing those seems too demanding given what’s public.

I particularly like the analogy to alchemy presented in Chapter 11. I think it is basically correct (or as correct as analogies get) that the state of AI alignment research is incredibly poor and the field is in its early stages where we have no principled understanding of anything (my belief here is based on reading or skimming basically every AI safety paper in 2024). The next part of the argument is like "we're not going to be able to get from the present state of alchemy to a 'mature scientific field that doesn't screw up certain crucial problems on the first try' in time". That is, 1: the field is currently in very early stages without principled understanding, and 2: we're not going to be able to get from where we are now to a sufficient level by the time we need to.

My understanding is that your disagreement is with 2? You think that earlier AIs are going to be able to dramatically speed up alignment research (and by using control methods we can get more alignment research out of better AIs, for some intermediate capability levels), getting us to the principled, doesn't-mess-up-the-first-try-on-any-critical-problem place before ASI.

Leaning into the analogy, I would describe what I view as your position as "with AI assistance, we're going to go from alchemy to first-shot-moon-landing in ~3 years of wall clock time". I think it's correct for people to think this position is very crazy at first glance. I've thought about it some and think it's only moderately crazy. I am glad that Ryan is working on better plans here (and excited to potentially update my beliefs, as I did when you all put out various pieces about AI Control), but I think the correct approach for people hearing about this plan is to be very worried about it.

I really liked Section 3, especially Ch 11, because it makes this (IMO) true and important point about the state of the AI alignment field. I think this argument stands on its own as a reason to have an AI moratorium, even absent the particular arguments about alignment difficulty in Section 1. Meanwhile, it sounds like you don't like this section because, to put it disingenuously, "they don't engage with my favorite automating-alignment plan that tries to get us from alchemy to first-shot-moon-landing in ~3 years of wall clock time and that hasn't been written down anywhere".

Also, if you happen to disagree strongly with the analogy to alchemy or point 1 above (e.g., think it's an incorrect frame), that would be interesting to hear! Perhaps the disagreement is in how hard alignment problems will be in the development of ASI; for example, if the alchemists merely had to fly a blimp first try, rather than land a rocket on the moon? Perhaps you don't expect there to be any significant discontinuities, and this whole "first try" claim is wrong, and we'll never need a principled understanding? I found this post and your review to be quite thoughtful overall!
habryka · 13h · 33 · 16
How To Dress To Improve Your Epistemics
FWIW, this is generally true for design things. In web design, people tend to look for extremely simple surface-level rules (like "what is a good font?" and "what are good colors?") in ways that IMO tend to basically never work. Like, you can end up doing an OK job of not looking horrendous if you go with a design framework, but when you need to make adjustments, or establish a separate brand, there really are no rules at that level of abstraction.

I often get very frustrated responses when people come to me for design feedback and I respond with things like "well, I think this website is communicating that you are a kind of 90s CS professor? Is that what you want?" and then they respond with "I mean... what? I asked you whether this website looks 'good', what does this have to do with 90s CS professors? I just want you to give me a straight answer". And like, often I can make guesses about what their aims are with a website and try to translate things into a single "good" or "bad" scalar, but it usually just fails because I don't know what people are going for. IMO the same tends to be true for fashion.

Almost any piece of clothing you can buy will be the right choice in some context, or given some aim! If you are SBF, then in order to signal your contrarian genius, you maybe want to wear mildly ill-fitting tees with your company logo to your formal dinners. Signaling is complicated and messy and it's very hard to give hard and fast rules. In many cases, the things people tend to ask here feel to me about as confused as people saying "can you tell me how to say good things in conversations? Like, can someone just write down at a nuts-and-bolts level what makes for being good at talking to people?". Like, yes, of course there are skills related to conversations, but it centrally depends on what you are hoping to communicate in your conversations!
A1987dM · 2d* · 55 · 22
Was Barack Obama still serving as president in December?
For what it's worth, I'm a human, and yet when I read the title of this before reading the post itself, I guessed that "December" referred to December 2016, not December 2024 (and that the post would be arguing that lame ducks can't actually be said to be still "serving" in some sense, or something like that).
Your Feed
ACX Meetup: Fall 2025 · Sun Sep 21 • Vilnius
AISafety.com Reading Group session 327 · Thu Sep 25 • Online
Sydney – ACX Meetups Everywhere Fall 2025 · Thu Sep 18 • Randwick
Jos – ACX Meetups Everywhere Fall 2025 · Thu Sep 18 • Jos
118 · Obligated to Respond · Duncan Sabien (Inactive) · 2d · 58 comments
286 · How anticipatory cover-ups go wrong · Kaj_Sotala · 7d · 24 comments
34 · Meetup Month · Raemon · 16h · 2 comments
412 · The Rise of Parasitic AI · Adele Lopez · 7d · 96 comments
161 · I enjoyed most of IABIED · Buck · 1d · 16 comments
88 · Christian homeschoolers in the year 3000 · Buck · 1d · 26 comments
467 · How Does A Blind Model See The Earth? · henry · 1mo · 38 comments
349 · Ω · AI Induced Psychosis: A shallow investigation · Tim Hua · 11d · 43 comments
105 · Was Barack Obama still serving as president in December? · Jan Betley · 2d · 11 comments
69 · Ω · Stress Testing Deliberative Alignment for Anti-Scheming Training · Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, alex.lloyd · 21h · 0 comments
243 · The Cats are On To Something · Hastings · 16d · 25 comments
158 · The Eldritch in the 21st century · PranavG, Gabriel Alfour · 7d · 32 comments
518 · A case for courage, when speaking of AI danger · So8res · 2mo · 128 comments
418 · HPMOR: The (Probably) Untold Lore · Gretta Duleba, Eliezer Yudkowsky · 2mo · 152 comments
141 · High-level actions don’t screen off intent · AnnaSalamon · 7d · 14 comments
leogao · 11h · 47 · 11 · 4 comments
a thing i've noticed rat/autistic people do (including myself): one very easy way to trick our own calibration sensors is to add a bunch of caveats or considerations that make it feel like we've modeled all the uncertainty (or at least, more than other people who haven't). so one thing i see a lot is that people are self-aware that they have limitations, but then over-update on how much this awareness makes them calibrated. one telltale hint that i'm doing this myself is if i catch myself saying something because i want to demo my rigor and prove that i've considered some caveat that one might think i forgot to consider.

i've heard others make a similar critique about this as a communication style which can mislead non-rats who are not familiar with the style, but i'm making a different claim here that one can trick oneself. it seems that one often believes being self-aware of a certain limitation is enough to correct for it sufficiently to at least be calibrated about how limited one is.

a concrete example: part of being socially incompetent is not just being bad at taking social actions, but being bad at detecting social feedback on those actions. of course, many people are not even aware of the latter. but many are aware of and acknowledge the latter, and then act as if because they've acknowledged a potential failure mode and will try to be careful towards avoiding it, that they are much less susceptible to the failure mode than other people in an otherwise similar reference class.

one variant of this deals with hypotheticals - because hypotheticals often can/will never be evaluated, this allows one to get the feeling that one is being epistemically virtuous and making falsifiable predictions, without ever actually getting falsified. for example, a statement "if X had happened, then i bet we would see Y now" has prediction vibes but is not actually a prediction. this is especially pernicious when one fails but says "i failed but i was close, so i should still
Cleo Nardo · 16m · 4 · 0 · 0 comments
Prosaic AI Safety research, in pre-crunch time. Some people share a cluster of ideas that I think is broadly correct. I want to write down these ideas explicitly so people can push back.

1. The experiments we are running today are kinda 'bullshit' because the thing we actually care about doesn't exist yet, i.e. ASL-4, or AI powerful enough that it could cause catastrophe if we were careless about deployment.
2. The experiments in pre-crunch-time use pretty bad proxies.
3. 90% of the "actual" work will occur in early-crunch-time, which is the duration between (i) training the first ASL-4 model, and (ii) internally deploying the model.
4. In early-crunch-time, safety-researcher-hours will be an incredibly scarce resource.
   1. The cost of delaying internal deployment will be very high: a billion dollars of revenue per day, competitive winner-takes-all race dynamics, etc.
   2. There might be far fewer safety researchers in the lab than there currently are in the whole community.
5. Because safety-researcher-hours will be such a scarce resource, it's worth spending months in pre-crunch-time to save ourselves days (or even hours) in early-crunch-time.
6. Therefore, even though the pre-crunch-time experiments aren't very informative, it still makes sense to run them because they will slightly speed us up in early-crunch-time.
7. They will speed us up via:
   1. Rough qualitative takeaways like "Let's try technique A before technique B because in Jones et al. technique A was better than technique B." However, the exact numbers in the Results table of Jones et al. are not informative beyond that.
   2. The tooling we used to run Jones et al. can be reused for early-crunch-time, cf. Inspect and TransformerLens.
   3. The community discovers who is well-suited to which kind of role, e.g. Jones is good at large-scale unsupervised mech interp, and Smith is good at red-teaming control protocols.
Jacob_Hilton · 2d · 66 · 41 · 13 comments
Superhuman math AI will plausibly arrive significantly before broad automation

I think it's plausible that for several years in the late 2020s/early 2030s, we will have AI that is vastly superhuman at formal domains including math, but still underperforms humans at most white-collar jobs (and so world GDP growth remains below 10%/year, say – still enough room for AI to be extraordinarily productive compared to today).

Of course, if there were to be an intelligence explosion on that timescale, then superhuman math AI would be unsurprising. My main point is that superhuman math AI still seems plausible even disregarding feedback loops from automation of AI R&D. On the flip side, a major catastrophe and/or coordinated slowdown could prevent both superhuman math AI and broad automation. Since both of these possibilities are widely discussed elsewhere, I will disregard both AI R&D feedback loops and catastrophe for the purposes of this forecast. (I think this is a very salient possibility on the relevant timescale, but won't justify that here.)

My basic reasons for thinking vastly superhuman math AI is a serious possibility in the next 4–8 years (even absent AI R&D feedback loops and/or catastrophe):

* Performance in formal domains is verifiable: math problems can be designed to have a unique correct answer, and formal proofs are either valid or invalid. Historically, in domains with cheap, automated supervision signals, only a relatively small amount of research effort has been required to produce superhuman AI (e.g., in board games and video games). There are often other bottlenecks than supervision, most notably exploration and curricula, but these tend to be more surmountable.
* Recent historical progress in math has been extraordinarily fast: in the last 4 years, AI has gone from struggling with grade school math to achieving an IMO gold medal, with progress at times exceeding almost all forecasters' reasonable expectations. Indeed, much of this progress seems
J Bostock · 4h · 5 · 0 · 1 comment
Simplified Logical Inductors

Logical inductors consider belief-states as prices over logical sentences ϕ in some language, with the belief-states decided by different computable "traders", and also some decision process which continually churns out proofs of logical statements in that language. This is a bit unsatisfying, since it contains several different kinds of things. What if, instead of buying shares in logical sentences, the traders bought shares in each other? Then we only need one kind of thing. Let's make this a bit more precise:

* Each trader is a computable program in some language (let's just go with Turing machines for now, modulo some concern about the macros for actually making trades)
* Each timestep, each trader is run for some amount of time (let's just say one Turing machine step)
* These programs can be well-ordered (already required for Logical Induction)
* Each trader p_i is assigned an initial amount of cash according to some relation a·e^(−b×i)
* Each trader can buy and sell "shares" in any other trader (again, very similarly to logical induction)
* If a trader halts, its current cash is distributed across its shareholders (otherwise that cash is lost forever)

(Probably some other points, such as each trader's current valuation of itself: if no trader is willing to sell its own shares, how does that work? Does each trader value its own (still remaining) shares proportional to its current cash stock? How are the shares distributed at the start?)

This system contains a pseudo-model of logical induction: for any formal language which can be modelled with Turing machines, and for any formal statement in that language, there exists a trader with some non-zero initial value, which lists out all possible statements in that language, halting if (and only if) it finds a proof of its particular proposition.

This makes a couple of immediately obvious changes from Logical Induction:

* There's a larger "bounty" (i.e. higher cash prize) for pro
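A minimal toy sketch of the market described in the bullets above, assuming concrete choices for the parts left open (traders as Python step functions rather than Turing machines, a fixed share price of 1 unit of cash, and a halted trader's cash split pro rata across its outstanding shares):

```python
import math
from dataclasses import dataclass, field

@dataclass
class Trader:
    index: int
    step_fn: object          # called as step_fn(trader, traders); returns ("buy", j), ("hold",), or ("halt",)
    cash: float
    shares_held: dict = field(default_factory=dict)  # trader index -> shares owned
    halted: bool = False

def initial_cash(i, a=10.0, b=0.5):
    # Initial endowment a * e^(-b*i), as in the quick take (a and b are arbitrary here).
    return a * math.exp(-b * i)

def run_market(traders, steps=10, share_price=1.0):
    for _ in range(steps):
        for t in traders:
            if t.halted:
                continue
            action = t.step_fn(t, traders)
            if action[0] == "buy":
                seller = traders[action[1]]
                if not seller.halted and t.cash >= share_price:
                    t.cash -= share_price
                    seller.cash += share_price
                    t.shares_held[seller.index] = t.shares_held.get(seller.index, 0) + 1
            elif action[0] == "halt":
                t.halted = True
                # Distribute the halting trader's cash pro rata across its shareholders.
                total = sum(h.shares_held.get(t.index, 0) for h in traders)
                if total > 0:
                    for h in traders:
                        n = h.shares_held.get(t.index, 0)
                        if n:
                            h.cash += t.cash * n / total
                    t.cash = 0.0
    return traders

# A trader that halts after `after` steps (standing in for "found a proof"),
# a trader that keeps buying shares in trader 0, and a trader that never halts.
def make_halter(after):
    count = {"n": 0}
    def step(trader, traders):
        count["n"] += 1
        return ("halt",) if count["n"] >= after else ("hold",)
    return step

traders = [
    Trader(0, make_halter(3), initial_cash(0)),
    Trader(1, lambda trader, traders: ("buy", 0), initial_cash(1)),
    Trader(2, lambda trader, traders: ("hold",), initial_cash(2)),
]
run_market(traders)
print([round(t.cash, 2) for t in traders])  # trader 1 ends up holding trader 0's cash
```

With these assumed conventions, a trader that halts once it "finds a proof" pays out to whoever holds its shares, which plays the role of a bet on a provable sentence paying off.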
Thane Ruthenis · 2d · 57 · 1 · 33 comments
Just finished If Anyone Builds It, Everyone Dies (and some of the supplements).[1] It feels... weaker than I'd hoped. Specifically, I think Part 3 is strong, and the supplemental materials are quite thorough, but Parts 1-2... I hope I'm wrong, and this opinion is counterweighed by all these endorsements and MIRI presumably running it by lots of test readers. But I'm more bearish on it making a huge impact than I was before reading it.

Point 1: The rhetoric – the arguments and their presentations – is often not novel, just rehearsed variations on the arguments Eliezer/MIRI already deployed. This is not necessarily a problem, if those arguments were already shaped into their optimal form, and I do like this form... But I note those arguments have so far failed to go viral. Would repackaging them into a book, and deploying it in our post-ChatGPT present, be enough? Well, I hope so.

Point 2: I found Chapter 2 in particular somewhat poorly written in how it explains the technical details. Specifically, those explanations often occupy that unfortunate middle ground between "informal gloss" and "correct technical description" where I'd guess they're impenetrable both to non-technical readers and to technical readers unfamiliar with the subject matter. An example that seems particularly egregious to me:

How does that conclusion follow? If a base model can only regurgitate human utterances, how does generating sixteen utterances and then reinforcing some of them lead to it... not regurgitating human utterances? This explanation is clearly incomplete. My model of a nonexpert technical-minded reader, who is actually tracking the gears the book introduces, definitely notices that and is confused.

The explanation of base models' training at the start of the chapter feels flawed in the same way. E.g.:

My model of a technical-minded reader is confused about how that whole thing is supposed to work. It sounds like AI developers manually pick billions of operations? What? The tec