I bought a subscription and tracked down the offending LW comment:
Another fundamental issue with the analysis is that it relies on aggregating votes of a kind from Less Wrong readers and the associated community. As we discussed here, it's common to see the most upvoted comments in forums like HN, lobsters, LW, etc., be statements that can clearly be seen to be wrong with no specialized knowledge and a few seconds of thought (and an example is given from LW in the link), so why should an aggregation of votes from the LW community be considered meaningful?
He doesn't actually give a link to a LW comment, but he describes it. He says Jeff Kaufman asked why there are so few 6-door cars, and the top comment said that doors are an expensive part of the car, but this is obviously false because (a) they can't be thousands of dollars each and that's what it would take to make them a noticeable fraction of the cost, and (b) if that were true we'd expect cheap cars to more often have two doors instead of four, but instead cheap cars usually have four doors and if anything it's the expensive sports cars that have two.
Tracking down the original post, it appears to be this one. Top comment is roughly as described. It has 5 upvotes.
What's my overall take? Well, I don't think the explanation is as obviously false as Dan Luu thinks. But I do agree that a and b are good objections to it.
It's not obvious to me that (a,b) are such good objections, and furthermore the comment doesn't just say "it's because doors are expensive".
The comment in question says: doors cost money and worsen crash safety, so cars with more doors would cost more, and most buyers don't value the extra convenience enough to justify what it would cost them, so the market would be small which would make it not worth the development cost.
I agree that it's unlikely that the cost of an extra pair of doors is multiple thousands of dollars per car. But the price is, let's say, 3x the cost, and I don't have any trouble believing that an extra pair of doors might increase the cost by say $700, meaning a price $2000 higher. Is this "clearly seen to be wrong with no specialized knowledge"? Doesn't seem so. So, then the question is how willing buyers with large families would be to pay an extra $2000 for the convenience of a third pair of doors. Again, it's not clear to me that this wouldn't hurt the sellability of the vehicle. The idea that making such a vehicle as safe as everyone expects these days could be difficult also seems plausible to me. I would expect that families with multiple smallish children are (1) the main potential market for 6-door cars, (2) very safety-conscious, and (3) often very price-conscious.
I guess point (b) is meant to undermine the idea that adding doors increases the cost at all, or something. I'm not convinced by this. Isn't it plausible that the gain in convenience in going from 2 to 4 doors is substantially bigger than the gain in going from 4 to 6? Or that the loss in safety in going from 2 to 4 is smaller than going from 4 to 6? ... And some cheap-and-nasty cars have had only two doors. For instance, if I try to think of a cheap and crappy car, the first thing that comes to mind is the old Reliant Robin: four seats, two doors. (Also, three wheels.)
I don't know whether that comment is right. But I don't see how danluu reckons it's obviously wrong with a few seconds of thought. I wonder whether danluu might change his mind about its obvious wrongness if he thought about it for more than a few seconds.
Oh yeah, I should have clarified that I agree with your take -- Dan seems totally wrong to take this comment as significant negative evidence about the epistemic standards of LW. Waaaay too big of a stretch. It's just one comment with 5 upvotes, plus it's not obviously wrong & may even be right.
On the first read I was annoyed at the post for criticizing futurists for being too certain in their predictions, while it also throws out and refuses to grade any prediction that expressed uncertainty, on the grounds that saying something "may" happen is unfalsifiable.
On reflection these two things seem mostly unrelated, and for the purpose of establishing a track record "may" predictions do seem strictly worse than either predicting confidently (which allows scoring % of predictions right), or predicting with a probability (which none of these futurists did, but allows creating a calibration curve).
An interesting section in the appendices, a criticism of Ajeya Cotra’s “Forecasting Transformative AI with Biological Anchors”:
If you do a sensitivity analysis on the most important variable (how much Moore's law will improve FLOPS/$), the output behavior doesn't make any sense, e.g., Moore's law running out of steam after "conventional" improvements give us a 144x improvement would give us a 34% chance of transformative AI (TAI) by 2100, a 144*6x increase gives a 52% chance, and a 144*600x increase gives a 66% chance (and with the predicted 60000x improvement, there's a 78% chance), so the model is, at best, highly flawed unless you believe that going from a 144x improvement to a 144*6x improvement in computer cost gives almost as much increase in the probability of TAI as a 144*6x to 144*60000x improvement in computer cost.
The part about all of this that makes this fundamentally the same thing that the futurists here did is that the estimate of the FLOPS/$ which is instrumental for this prediction is pulled from thin air by someone who is not a deep expert in semiconductors, computer architecture, or a related field that might inform this estimate.
[...]
If you say that, based on your intuition, you think there's some significant probability of TAI by 2100; 10% or 50% or 80% or whatever number you want, I'd say that sounds plausible but wouldn't place any particular faith in the estimate. But if you take a model that produces nonsense results and then pick an arbitrary input to the model that you have no good intuition about to arrive at an 80% chance, you've basically picked a random number that happens to be 80%.
The claim that the probability goes from 34% -> 52% from a 6x of compute does sound pretty weird! But I think it's just based on a game of telephone and a complete misunderstanding.
I was initially confused where the number came from, then I saw the reference to Nostalgebraist's post. They say that "Assume a 6x extra speedup, and you get a 52% chance. (Which is still pretty high, to be fair.) Assume no extra speedup, and also no speedup at all, just the same computers we have now, and you get a 34% chance … wait, what?!"
Nostalgebraist is saying that you move from 34% to 52% by moving from 1x to 144*6x---not by moving from 144x to 144*6x. That is, if you increase compute by about 3 OOMs you increase the probability from 34% to 52%.
Similarly, if you increase compute from 144*6x to 144*60000x, or about 4 OOMs, you increase the probability from 52% to 78%.
So 3 OOMs is 18% and 4 OOMs is 26%, roughly proportional, as you'd expect given the nature of the model. The report basically distributes TAI over 20 OOMs, so a 3 OOM increase covers about 3/20ths of the range.
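To make the arithmetic concrete, here is a quick sketch of the OOM deltas, using only the multipliers and probabilities quoted above:

```python
import math

# Compute multipliers relative to today's hardware (1x), per the quoted figures.
# Moving 1x -> 144*6x spans ~3 orders of magnitude and takes P(TAI) from 34% to 52%;
# moving 144*6x -> 144*60000x spans exactly 4 OOMs and takes it from 52% to 78%.
low, mid, high = 1, 144 * 6, 144 * 60000

ooms_low_to_mid = math.log10(mid / low)     # ~2.94 OOMs
ooms_mid_to_high = math.log10(high / mid)   # 4.0 OOMs

per_oom_1 = (52 - 34) / ooms_low_to_mid     # ~6.1 percentage points per OOM
per_oom_2 = (78 - 52) / ooms_mid_to_high    # 6.5 percentage points per OOM

print(ooms_low_to_mid, ooms_mid_to_high, per_oom_1, per_oom_2)
```

Both segments come out around 6 points per OOM, which is the "roughly proportional" behavior described above.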
But if you take a model that produces nonsense results and then pick an arbitrary input to the model that you have no good intuition about to arrive at an 80% chance, you've basically picked a random number that happens to be 80%.
If you get a nonsensical number out of a model, I think it's worth reflecting more on whether there was a misunderstanding.
Aside from this, calling "how far does Moore's law go" the most important variable seems kind of overstated. The criticism is that 7 orders of magnitude in this parameter leads to a change from 34% to 78%. I agree that's a significant difference, but 7 orders of magnitude is a lot of uncertainty in this parameter, and I don't think that's grounds for saying that it's the number that drives the whole estimate. And even after 7 OOMs these estimates aren't even that different in an action-relevant way---in particular this change doesn't result in a similarly-dramatic change for your 10 year or 20 year timelines, and shifting your 100 year TAI probability from 55% to 78% is not a huge deal.
And aside from that, saying that the estimates for Moore's law are arbitrary isn't right. I think it's totally fair that Ajeya isn't an expert, but that doesn't mean that things are totally unknown within 7 orders of magnitude. At the upper end things are pretty constrained by basic physics, at the lower end things are pretty constrained by normal technological extrapolation. There's a ton of uncertainty left but it's just not a big deal relative to the uncertainty about AI training.
The overall estimate is basically driven by the fact that a broad distribution over horizon lengths in the existing NN extrapolation gives you a similar range of estimates to the entire space from human lifetime to human evolution. So it's very easy to squint and get a broad distribution with around 5% probability per OOM of compute (which is a couple percent per year right now). The criticism of this that seems most plausible to me is that maybe inside-view you can just eyeball how good AI systems are and how close they are to transformative effects and it's just not that far. That said, the second most plausible criticism (especially about the 20%+ short-term predictions) is that you can eyeball how good AI systems are and it's probably not that close.
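As a rough check on those magnitudes (the growth rate for compute per dollar here is my assumption for illustration, not a number from the report):

```python
# Hypothetical figures: ~5 percentage points of TAI probability per OOM of compute
# (from the comment above), and effective FLOPS/$ growing ~0.4 OOMs/year
# (roughly a 2.5x/year pace; this growth rate is an assumption, not from the report).
prob_per_oom = 5.0      # percentage points per order of magnitude
ooms_per_year = 0.4     # assumed near-term growth in compute per dollar

points_per_year = prob_per_oom * ooms_per_year
print(points_per_year)  # 2.0 points/year, i.e. "a couple percent per year"
```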
(Disclaimer: this report is written by my wife and so I may be biased.)
FWIW I'm not married to Ajeya and I agree with you; I was pretty disappointed by Nostalgebraist's post & how much positive reception it seemed to get. I've been thinking about writing up a rebuttal. Most of what I'd say is what you've already said here though, so yay.
I pointed out the OOM error on Twitter (citing this comment), and Dan has updated the post with a correction.
Epistemic status: perspective derived from following Dan Luu's output for the last 5 years or so. Trying to vaguely gesture at a few things at once. Please ask questions if you find something confusing.
Dan Luu has written an interesting post analysing the track record of futurists' predictions. The motivation:
I'm primarily going to address the appendix, particularly the section on Holden Karnofsky's analysis on the same subject, but the article is interesting reading and I'd recommend going through the whole thing. (I think Dan is evaluating forecasting track records pretty differently from how I would, and I haven't actually dug into any of the other analysis. On priors I'd expect it to be similar to his analysis of Holden's work.)
The specific claim I have an issue with here is "another way to look at it is that any level other than "true" is false when read as written". Depending on how you want to evaluate it, it's either technically true but irrelevant[1], or not even wrong[2].
This seems like a basically accurate description of the methodology used in the 2019 assessment. Stuart Armstrong says in a footnote that removing 4 of the 34 assessors who had gaps in their predictions didn't change any of the results, but I don't expect this would address Dan's primary criticism. I performed my own data cleaning, and removed 9 of the predictors who had substantial gaps in their predictions (there are 2 left who have "any" missing predictions). In both cases, the results obtained on mean (<2 and <1.5), median, and mode[3] are identical:
(Quantities are absolute counts; divide by 105 questions for proportions.)
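For concreteness, here is a minimal sketch of the aggregation rules described above. The toy grades are mine; I'm assuming grades run 1-5 with 1 = "true":

```python
from statistics import mean, median, mode

# Toy grades for one prediction from several assessors (1 = true ... 5 = false).
grades = [1, 1, 2, 1, 4]

# The aggregations discussed above: mean with two thresholds, median, and mode.
true_by_mean_2 = mean(grades) < 2       # mean 1.8 -> True
true_by_mean_15 = mean(grades) < 1.5    # mean 1.8 -> False
true_by_median = median(grades) == 1    # median 1 -> True
true_by_mode = mode(grades) == 1        # mode 1 -> True

print(true_by_mean_2, true_by_mean_15, true_by_median, true_by_mode)
```

This toy example also shows how the two mean thresholds can disagree: a couple of outlier graders push the mean above 1.5 even when most graders scored the prediction as true.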
I think requiring a mean under 1.5 to decide something is "true" puts too much weight in the hands of outliers who are either interpreting the prediction differently from the rest, or are simply wrong as a matter of fact[4].
With that said, I think deferring to these aggregate evaluations at all is a mistake. It seems like Dan agrees, though for reasons that I disagree with:
I can't actually look at the paywalled source, but putting aside the accuracy of the "most upvoted comments" on LessWrong, the graders were not randomly selected from those LessWrong users who make heavily upvoted comments. Nearly half of them were publicly named, and many of those have extensively documented their thoughts online. One could, if desired, go read some of their writings, and judge their epistemics for one's self.
Of the predictions which had a modal grade of 1, I personally consider 8 of them to be true, and maybe 5-6 to be non-trivial[5]. I think Dan would consider some of them - perhaps even most - to be insufficiently rigorously specified to grade.
Dan seems to have an unusual knack for noticing inconsistencies[6] and $20 bills on the sidewalk. His work sometimes seems to avoid performing inside-view analysis[7], which can make engaging with it a bit tricky. It does seem to pay off in cases like this - I don't have time to dig into it right now, but in the appendix, he also linked to a post by nostalgebraist addressing the initial Bio Anchors report that seems worth following up on.
Someone in 1960 predicting an unlikely outcome in 2010, and that outcome actually occurring in 2011, is technically "wrong", but a very different kind of wrong from someone in 1960 predicting an unlikely outcome in 1965 but that outcome not having occurred yet at all.
Other people's evaluations of predictions are not, in fact, especially solid pointers to the truth-value of those predictions. Given the subject of the article I think Dan probably appreciates this point.
These are three of the basic aggregations Dan suggested as being even minimally informative.
The fact that the modal grade for question 47 (about cochlear implants) was a 1, while the mean was ~2.5 (with many 4s and 5s), is mostly an indication that the prediction was underspecified, and that the graders in question had very different ideas in mind of what "very effective" and "widely used" meant (or had similar ideas but didn't bother looking up the actual numbers).
In the sense that most people were not making similar predictions at the time, and priors on those predictions were probably low.
Quote: "I find it a bit odd that, with all of the commentary of these LW posts, few people spent the one minute (and I mean one minute literally — it took me a minute to read the post, see the comment Armstrong made which is a red flag, and then look at the raw data) it would take to look at the data and understand what the post is actually saying, but as we've noted previously, almost no one actually reads what they're citing."
For reasons that at least aren't obviously wrong, though I think foregoing an inside-view opinion while simultaneously delivering an outside-view refutation is not enormously productive.