All of _will_'s Comments + Replies

Great post! I find myself coming back to it—especially possibility 5—as I sit here in 2025 thinking/worrying about AI philosophical competence and the long reflection.

On 6,[1] I’m curious if you’ve seen this paper by Joar Skalse? It begins:

I present an argument and a general schema which can be used to construct a problem case for any decision theory, in a way that could be taken to show that one cannot formulate a decision theory that is never outperformed by any other decision theory.

  1. ^

    Pasting here for easy reference (emphasis my own):

    6. Th

... (read more)
2Noosphere89
This post gives a pretty short proof, and my main takeaway is that intelligence and consciousness converge to look-up tables which are infinitely complicated, so as to deal with every possible situation: https://www.lesswrong.com/posts/2LvMxknC8g9Aq3S5j/ldt-and-everything-else-can-be-irrational

I agree with this implication for optimization: https://www.lesswrong.com/posts/yTvBSFrXhZfL8vr5a/worst-case-thinking-in-ai-alignment#N3avtTM3ESH4KHmfN

See also ‘The Main Sources of AI Risk?’ by Wei Dai and Daniel Kokotajlo, which puts forward 35 routes to catastrophe (most of which are disjunctive). (Note that many of the routes involve something other than intent alignment going wrong.)

Any chance you have a link to this tweet? (I just tried control+f'ing through @Richard's tweets over the past 5 months, but couldn't find it.)

7Richard_Ngo
FWIW twitter search is ridiculously bad, it's often better to use google instead. In this case I had it as the second result when I googled "richardmcngo twitter safety fundamentals" (richardmcngo being my twitter handle).
5Jozdien
I believe this is the tweet.
2Quinn
I tried a little myself too. Hope I didn't misremember.

On your second point, I think that MacAskill and Ord were more saying “It would be worth it to spend thousands of years figuring out moral philosophy / figuring out what to do with the cosmos, if that’s how long it takes to be ~sure we’ve reached the ‘correct’ answer before locking things in, on account of the astronomical waste argument” than “I literally predict it will take today-humans thousands of years to figure out moral philosophy, even if we make a serious and coordinated effort to do so.” Somewhat relatedly, quoting from the ‘Long Refle... (read more)

3RobertM
"It would make sense to pay that cost if necessary" makes more sense than "we should expect to pay that cost", thanks. Basically, yes.  I have a draft post outlining some of my objections to that sort of plan; hopefully it won't sit in my drafts as long as the last similar post did. I expect whatever ends up taking over the lightcone to be philosophically competent.  I haven't thought very hard about the philosophical competence of whatever AI succeeds at takeover (conditional on that happening), or, separately, the philosophical competence of the stupidest possible AI that could succeed at takeover with non-trivial odds.  I don't think solving intent alignment necessarily requires that we have also figured out how to make AIs philosophically competent, or vice-versa; I also haven't though about how likely we are to experience either disjunction. I think solving intent alignment without having made much more philosophical progress is almost certainly an improvement to our odds, but is not anywhere near sufficient to feel comfortable, since you still end up stuck in a position where you want to delegate "solve philosophy" to the AI, but you can't because you can't check its work very well.  And that means you're stuck at whatever level of capabilities you have, and are still approximately a sitting duck waiting for someone else to do something dumb with their own AIs (like point them at recursive self-improvement).

Thanks, that’s helpful!

(Fwiw, I don’t find the ‘caring a tiny bit’ story very reassuring, for the same reasons as Wei Dai, although I do find the acausal trade story for why humans might be left with Earth somewhat heartening. (I’m assuming that by ‘game-theoretic reasons’ you mean acausal trade.))

3Daniel Kokotajlo
Yep, Habryka is right. Also, I agree with Wei Dai re: reassuringness. I think literal extinction is <50% likely, but this is cold comfort given the badness of some of the plausible alternatives, and overall I think the probability of something comparably bad happening is >50%.

I don't think [AGI/ASI] literally killing everyone is the most likely outcome

Huh, I was surprised to read this. I’ve imbibed a non-trivial fraction of your posts and comments here on LessWrong, and, before reading the above, my shoulder Daniel definitely saw extinction as the most likely existential catastrophe.

If you have the time, I’d be very interested to hear what you do think is the most likely outcome. (It’s very possible that you have written about this before and I missed it—my bad, if so.)

7habryka
(My model of Daniel thinks the AI will likely take over, but probably will give humanity some very small fraction of the universe, for a mixture of "caring a tiny bit" and game-theoretic reasons)

Hmm, the ‘making friends’ part seems the most important (since there are ways to share new information you’ve learned, or solve problems, beyond conversation), but it also seems a bit circular. Like, if the reason for making friends is to hang out and have good conversations(?), but one has little interest in having conversations, then doesn’t one have little reason to make friends in the first place, and therefore little reason to ‘git gud’ at the conversation game?

Er, friendship involves lots of things beyond conversation. People to support you when you're down, people to give you other perspectives on your personal life, people to do fun activities with, people to go on adventures and vacations with, people to celebrate successes in your life with, and many more. 

Good conversation is a lubricant for facilitating all of those other things, for making friends and sustaining friends and staying in touch and finding out opportunities for more friendship-things.

So basically I don't think it's possible to do robustly positive actions in longtermism with high (>70%? >60%?) probability of being net positive for the long-term future

This seems like an important point, and it's one I've not heard before. (At least, not outside of cluelessness or specific concerns around AI safety speeding up capabilities; I'm pretty sure that most EAs I know have ~100% confidence that what they're doing is net positive for the long-term future.)

I'm super interested in how you might have arrived at this belief: would you be able t... (read more)

9Howie Lempel
"I'm pretty sure that most EAs I know have ~100% confidence that what they're doing is net positive for the long-term future" Fwiw, I think this is probably true for very few if any of the EAs I've worked with, though that's a biased sample. I wonder if the thing giving you this vibe might be they they actually think something like "I'm not that confident that my work is net positive for the LTF but my best guess is that it's net positive in expectation. If what I'm doing is not positive, there's no cheap way for me to figure it out, so I am confident (though not ~100%) that my work will keep seeming positive EV to me for the near future." One informal way to describe this is that they are confident that their work is net positive in expectation/ex ante but not that it will be net positive ex post I think this can look a lot like somebody being ~sure that what they're doing is net positive even if in fact they are pretty uncertain.
1Daniel_Eth
One way I think about this is there are just so many weird (positive and negative) feedback loops and indirect effects, so it's really hard to know if any particular action is good or bad. Let's say you fund a promising-seeming area of alignment research – just off the top of my head, here are several ways that grant could backfire:

• the research appears promising but turns out not to be, but in the meantime it wastes the time of other alignment researchers who otherwise would've gone into other areas

• the research area is promising in general, but the particular framing used by the researcher you funded is confusing, and that leads to slower progress than counterfactually

• the researcher you funded (unbeknownst to you) turns out to be toxic or otherwise have bad judgment, and by funding him, you counterfactually poison the well on this line of research

• the area you fund sees progress and grows, which counterfactually sucks up lots of longtermist money that otherwise would have been invested and had greater effect (say, during crunch time)

• the research is somewhat safety-enhancing, to the point that labs (facing safety-capabilities tradeoffs) decide to push capabilities further than they otherwise would, and safety is hurt on net

• the research is somewhat safety-enhancing, to the point that it prevents a warning shot, and that warning shot would have been the spark that would have inspired humanity to get its game together regarding combatting AI X-risk

• the research advances capabilities, either directly or indirectly

• the research is exciting and draws the attention of other researchers into the field, but one of those researchers happens to have a huge, tail negative effect on the field outweighing all the other benefits (say, that particular researcher has a very extreme version of one of the above bullet points)

• Etcetera – I feel like I could do this all day.

Some of the above are more likely than others, but there are just so many differen

I'm pretty sure that most EAs I know have ~100% confidence that what they're doing is net positive for the long-term future.

Really? Without giving away names, can you tell me roughly what cluster they are in? Geographical area, age range, roughly what vocation (technical AI safety/AI policy/biosecurity/community building/earning-to-give)? 

I'm super interested in how you might have arrived at this belief: would you be able to elaborate a little? For instance, is there a theoretical argument going on here, like a weak form of cluelessness? Or is it mor

... (read more)

"GeneSmith"... the pun just landed with me. nice.

Very nitpicky (sorry): it'd be nice if the capitalization of the epistemic status reactions was consistent. Currently, some are in title case, for example "Too Harsh" and "Hits the Mark", while others are in sentence case, like "Key insight" and "Missed the point". The autistic part of me finds this upsetting.

Thanks for this comment. I don't have much to add, other than: have you considered fleshing out and writing up this scenario in a style similar to "What 2026 looks like"?

Thanks for this question.

Firstly, I agree with you that firmware-based monitoring and compute capacity restrictions would require similar amounts of political will to happen. Then, in terms of technical challenges, I remember one of the forecasters saying they believe that "usage-tracking firmware updates being rolled out to 95% of all chips covered by the 2022 US export controls before 2028" is 90% likely to be physically possible, and 70% likely to be logistically possible. (I was surprised at how high these stated percentages were, but I didn't have tim... (read more)

There is a vibe that I often get from suffering focused people, which is a combo of

a) seeming to be actively stuck in some kind of anxiety loop, preoccupied with hell in a way that seems more pathological to me than well-reasoned. 

b) something about their writing and vibe feels generally off,

...

I agree that this seems to be the case with LessWrong users who engage in suffering-related topics like quantum immortality and Roko's basilisk. However, I don't think any(?) of these users are/have been professional s-risk researchers; the few (three, iirc) s-risk researchers I've talked to in real life did not give off this kind of vibe at all.

4CronoDAS
"There is no afterlife and there are no supernatural miracles" is true, important, and not believed by most humans. The people who post here, though, have a greater proportion of people who believe this than the world population does.