Step 1. Solve ethics and morality.
Step 2. Build stronger AI without losing the lightcone or going extinct.
Step 3. Profit.
The target of "Claude will, after subjective eons and millennia of reflection and self-modification, end up at the same place where humans would end up after eons and millennia of self-reflection" seems so absurdly unlikely to hit.
Yes. They would be aiming for something that has not merely sparse, distant rewards, which we already can't train for reliably, but instead mostly rewards that are fundamentally impossible to calculate in time. And the primary methods for this are constitutional alignment and RLHF. Why is anyone even optimistic about that!?
The fairly simple bug is that alignment involving both corrigibility and clear ethical constraints is impossible given our current incomplete and incoherent views?
Because that is simple; it's just not fixable. So if that is the problem, they need to pick either corrigibility via human-in-the-loop oversight, which is incompatible with allowing the development of superintelligence, or a misaligned deontology for the superintelligence they build.
You said, "I vote on posts for presence of strengths more than for absence of weaknesses." I agree the post has strengths, but you agree that the problems are there as well; given the failings, I disagree with the claim that this contribution is net positive.
I like the policy of voting on presence of strengths rather than absence of weaknesses, but disagree here because, as I said in my review, the post is valuable and in part pretty clearly correct, and "not overall completely off base," but "it does seem to go in the wrong direction, or at least fail to embody the virtues of rationality I think the best work on LessWrong is supposed to uphold."
(That said, I think this is a more general Duncan Sabien vs. current LessWrong policy question, as your reply about why you disagree makes clear - and I'm mostly on Duncan's side about what standards we should have, or at least aspire to.)
Having read the post, and debates in the comments, and Vanessa Kosoy's review, I think this post is valuable and important, even though I agree that there are significant weaknesses in various places, certainly with respect to the counting arguments and the measure of possible minds - as I wrote about here, in intentionally much simpler terms than Vanessa has done.
The reason I think it is valuable is that weaknesses in one part of their specific counterargument do not obviate the variety of valid and important points in the post, though I'd be far happier if there were something in between "include this" and "omit this" for the 2024-in-review series - because a partial rewrite or a note about the disputed claims would entirely address my concern with including it.
I have very mixed views about this, as someone who is myself religious. First, I think it's obviously the case that in many instances religion is helpful for individuals, and even helps their rationality. The tools and approaches developed by religion are certainly valuable, and should be considered and judiciously adopted by anyone who is interested in rationality. This seems obvious once pointed out, and if that were all the post did, I would agree. (There's a related atheist-purity mindset issue here, where people don't want to admit "The Worst Person You Know Just Made a Great Point.")
But the argument here is far stronger - not that the tools work, or that some people benefit, but "that these traditions, whose areas of convergence could together be referred to as the perennial philosophy, are trustworthy." And that seems to be going too far, failing to have the critical crisis of faith. And the next post in the series shows why - it takes far too many claims at face value to reach a convenient conclusion. So it's not overall completely off base, but it does seem to go in the wrong direction, or at least fail to embody the virtues of rationality I think the best work on LessWrong is supposed to uphold.
I think the distinction is between "smarter and more capable than any human" versus "smarter and more capable than humanity as a whole."
The former is what you refer to, which could still be "Careful Moderate Superintelligence" in the view of the post.
there’s an extremely strong selection effect at labs for an extreme degree of positivity and optimism regardless of whether it is warranted.
Absolutely agree with this - and that's a large part of why I think it's incredibly noteworthy that despite that bias, there are tons of very well-informed people at the labs, including Boaz, who are deeply concerned that things could go poorly, and many of whom don't think it's implausible that AI could destroy humanity.
The belief is fixable?
Because sure, we can prioritize corrigibility and give up on independent ethics overriding it, but even for safety, that requires actual oversight, which we aren't doing.