he has several times claimed to have a great forecasting track record.
This seems like an unfair exaggeration, going off the quotes you pulled.
Yes, but I don't think he claims to have a better forecasting track record than them. I think he would say he is epistemically better in general, but, as you say, he doesn't participate on Metaculus and barely has any track record to speak of, so he'd have to be pretty delusional to think his track record is better.
I too would claim such a thing, or something similar at least -- I'd say that my forecasts about AGI are better than the typical Metaculus forecast about AGI; however, I would not claim to have a great forecasting track record or even a better forecasting track record than Metaculus, because (a) I don't have much of a track record at all, and (b) there are lots of other non-AGI questions on Metaculus, and on those questions I expect to do worse than Metaculus on average, lacking expertise as I do. (Alas, the AGI questions have mostly not resolved yet and will not resolve for some years, so we can't just check those.)
Yes, I agree with the points you make about e.g. the importance of track records, the importance of betting, etc. etc. No, I don't expect you to take my word for anything (or Yudkowsky's). Yes, I think it's reasonable for outsiders / people who aren't super familiar with the literature on AI to defer to Metaculus instead of me or Yudkowsky.
Perhaps this explains my position better:
If I saw a Yudkowsky tweet saying "I have a great forecasting track record" or "I have a better forecasting track record than Metaculus" my immediate reaction would be "Lol no you don't fuck off." When I read the first few lines of your post, I expected to shortly see a pic of such a tweet as proof. In anticipation my "lol fuck you Yudkowsky" reaction already began to rise within me.
But then when I saw the stuff you actually quoted, it seemed... much more reasonable? In particular, him dumping on Metaculus for updating so hard on Gato seemed... correct? Metaculus really should have updated earlier, Gato just put together components that were already published in the last few years. So then I felt that if I had only skimmed the first part of your post and not read the actual post, I would have had an unfairly negative opinion of Yudkowsky, due to the language you used: "He has several times claimed to have a great forecasting track record."
For what it's worth, I agree that Yudkowsky is pretty rude and obnoxious & that he should probably get off Twitter if this is how he's going to behave. Like, yes, he has alpha about this AI stuff; he gets to watch as the "market" gradually corrects and converges to his position. Yay. Good for him. But he's basically just stroking his own ego by tweeting about it here; I don't see any altruistic purpose served by it.
I am a forecaster on that question: the main doubt I had was if/when someone would try to do wordy things + game playing on a "single system". It seemed plausible to me that this particular combination of capabilities would never become an exciting area of research, so the date at which an AI first does these things would be substantially after the date at which this combination of tasks becomes achievable with focused effort. Gato was a substantial update because it does exactly these tasks, so I no longer see much possibility that the benchmark is achieved only after the capabilities are substantially overshot.
I also tend to defer somewhat to the community.
I was at 2034 when the community was at 2042, and I updated further to 2026 on the Gato news.
That's good feedback. I can see why the wording I used gives the wrong impression -- he didn't literally say out loud that he has "a great forecasting track record". It still seems to me heavily implied by several things he's said, especially what he said to Paul.
I think the point you raise is valid enough. I have crossed out the word "claimed" in the essay, and replaced it with "implied".
As an example of the kind of point that one might use in deciding who "came off better" in the FOOM debate, Hanson predicted that "AIs that can parse and use CYC should be feasible well before AIs that can parse and use random human writings", which seems pretty clearly falsified by large language models—and that also likely bears on Hanson's view that "[t]he idea that you could create human level intelligence by just feeding raw data into the right math-inspired architecture is pure fantasy".
As you point out, however, this exercise of looking at what was said and retrospectively judging whose worldview seemed "less surprised" by what happened is definitely not the same thing as a forecasting track record. It's too subjective; rationalizing why your views are "less surprised" by what happened than some other view (without either view having specifically predicted what happened), is not hugely more difficult than rationalizing your views in the first place.
I think the passage you quote there is just totally correct though. If you turn the clock back ten years or more to when all that stuff was happening, Yudkowsky was at the "AGI is really important and coming sooner than you think" end of the spectrum, the other side seemed to be "AGI is either not ever going to be a thing, or not ever going to be important," and the median opinion was something like "Plausibly it'll be an important thing but it's coming 50 - 100 years from now." At least that's my impression from the 9-ish years I've been lurking on LW and the 7-ish years I've been talking to people in the community. (gosh I'm old.)
In the passage you quote I interpret Yud as saying that when you compare his claims about AGI back then to claims that other rationalists and EAs were making, people like Hanson, with the benefit of hindsight his claims look closer to the truth. I think that's correct. Of course the jury is still out, since most of the claims on both sides were about things that haven't happened yet (AGI is still not here) but e.g. it's looking pretty unlikely that uploads/ems will come first, it's looking pretty unlikely that AGI will be an accumulation of specialized modules built by different subcontractors (like an f-35 fighter jet lol), it's looking pretty likely that it'll happen in the 20's or 30's instead of the 60's or 70's... most of all, it's looking pretty likely that it'll be a Big Deal, something we all should be thinking about and preparing for now.
On overall optimism it seems clear that Eliezer won---Robin seems unusually bad, while Eliezer seems unusually good. I also think on "domain-specific engineering" vs "domain-general engineering" Eliezer looks unusually good while Robin looks typical.
But I think there are also comparably-important substantive claims that look quite bad. I don't think Eliezer has an unambiguous upper hand in the FOOM debate at all:
Robin on AI timelines just seems particularly crazy. We can't yet settle the ems vs de novo AI bet, but I think the writing is on the wall, and his forecasting methodology for the 300 year timeline seems so crazy---ask people in a bunch of fields "how far have you come to human level, is it speeding up?" and then lean entirely on that (I think many of the short-term predictions are basically falsified now, in that if you ask people the same question they will give much higher percentages and many of the tasks are solved).
ETA: Going through the oldest examples from Robin's survey to see how the methodology fares:
Regarding the weird mediocrity of modern AI, isn't part of this that GPT-3-style language models are almost aiming for mediocrity?
Would a hypothetical "AlphaZero of code" which built its own abstractions from the ground up - and presumably would not reinvent Python (AlphaCode is cool and all, but it does strike me as a little absurd to see an AI write Python) - have this property?
The debate was about whether a small group could quickly explode to take over the world. AI development projects are now billion-dollar affairs and continuing to grow quickly, important results are increasingly driven by giant projects, and 9 people taking over the world with AI looks if anything even more improbable and crazy than it did then.
Maybe you mean something else there, but wasn't OpenAI like 30 people when they released GPT-2, and maybe like 60 when they released GPT-3? This doesn't seem super off from 9 people, and my guess is there is probably a subset of 9 people you could poach from OpenAI that could have made 80% as fast progress on that research as the full set of 30 people (at least from talking to other people at OpenAI, my sense is that contributions are very heavy-tailed).
Like, my sense is that cutting-edge progress is currently made by a few large teams, but that cutting-edge performance can easily come from 5-10 person teams, and that if we end up trying to stop race-dynamics, that the risk from 5-10 person teams would catch up pretty quickly with the risk from big teams, if the big teams halted progress. It seems to me that if I sat down with 8 ot...
GPT-2 is very far from taking over the world (and was indeed <<10 people). GPT-3 was bigger (though still probably <10 people depending how you amortize infrastructure), and remains far from taking over the world. Modern projects are >10 people, and still not yet taking over the world. It looks like it's already not super plausible for 10 people to catch up, and it's rapidly getting less plausible. The prediction isn't yet settled, but neither are the predictions in Eliezer's favor, and it's clear which way the wind blows.
These projects are well-capitalized, with billions of dollars in funding now and valuations rapidly rising (though maybe a dip right now with tech stocks overall down ~25%). These projects need to negotiate absolutely massive compute contracts, and lots of the profit looks likely to flow to compute companies. Most of the work is going into the engineering aspects of these projects. There are many labs with roughly-equally-good approaches, and no one has been able to pull much ahead of the basic formula---most variation is explained by how big a bet different firms are willing to make.
Eliezer is not talking about 10 people making a dominant AI bec...
The historical track record of software projects is that it's relatively common for a small team of ~10 people to outperform 1000+ person teams. Indeed, I feel like this is roughly what happened with DeepMind and OpenAI. I feel like in 2016 you could have said that current AGI projects already had 500+ employees and were likely to grow even bigger, so it was unlikely that a small 10-person team could catch up, and then suddenly the most cutting-edge project was launched by a 10-person team. (Yes, that 10-person team needed a few million dollars, but a few million dollars are not that hard to come by in the tech sector.)
My current guess is that we will continue to see small 10-person teams push the cutting-edge forward in AI, just as we've seen the same in most other domains of software.
In addition to the 10-people point, the view that "you can find a better way to build AI that's way more efficient than other people" is also starting to look increasingly unlikely, as performance continues to be dominated by scale and engineering rather than clever ideas.
I do agree with this in terms of what has been happening in the last few years, though I do expect this to break down as we see more things in the "le...
I think "team uses Codex to be 3x more productive" is more like the kind of thing Robin is talking about than the kind of thing Eliezer is talking about (e.g. see the discussion of UberTool, or just read the foom debate overall). And if you replace 3x with a more realistic number, and consider the fact that right now everyone is definitely selling that as a product rather than exclusively using it internally as a tool, then it's even more like Robin's story.
Everyone involved believes in the possibility of tech startups, and I'm not even sure if they have different views about the expected returns to startup founders. The 10 people who start an AI startup can make a lot of money, and will typically grow to a large scale (with significant dilution, but still quite a lot of influence for founders) before they make their most impressive AI systems.
I think this kind of discussion seems pretty unproductive, and it mostly just reinforces the OP's point that people should actually predict something about the world if we want this kind of discussion to be remotely useful for deciding how to change beliefs as new evidence comes in (at least about what people / models / reasoning strategies w...
Now we're mostly talking about whether a $10 trillion company can explosively grow to $300 trillion as it develops AI, which is just not the same game in any qualitative sense.
To be clear, this is not the scenario that I worry about, and neither is it the scenario most other people I talk about AI Alignment tend to worry about. I recognize there is disagreement within the AI Alignment community here, but this sentence sounds like it's some kind of consensus, when I think it clearly isn't. I don't expect we will ever see a $300 trillion company before humanity goes extinct.
Eliezer very specifically talks about AI systems that "go foom," after which they are so much better at R&D than the rest of the world that they can very rapidly build molecular nanotechnology, and then build more stuff than the rest of the world put together.
This isn't related to offense vs defense, that's just >$300 trillion of output conventionally-measured. We're not talking about random terrorists who find a way to cause harm, we are talking about the entire process of (what we used to call) economic growth now occurring inside a lab in fast motion.
I think he lays this all out pretty explicitly. And for what it's worth I think that's the correct implication of the other parts of Eliezer's view. That is what would happen if you had a broadly human-level AI with nothing of the sort anywhere else. (Though I also agree that maybe there'd be a war or decisive first strike first, it's a crazy world we're talking about.)
And I think in many ways that's quite similar to what will happen. It just seems most likely to take years instead of months, to use huge amounts of compute (and therefore share proceeds with compute providers and a bunch of the rest of the economy), to result in "AI improvements" that look much more similar to conventional human R&D, and so on.
Good points; those do seem to be cases in which Hanson comes out better. As you say, it comes down to how heavily you weight the stuff Yudkowsky beat Hanson on vs. the stuff Hanson beat Yudkowsky on. I also want to reiterate that I think Yudkowsky is being obnoxious.
(I also agree that the historical bio anchors people did remarkably well & much better than Yudkowsky.)
Note that I feel like, if we look at the overall disagreements in 2008, Eliezer's view overall seems better than Robin's. So I think we're probably on the same page here.
The comments about Metaculus ("jumped ahead of 6 years earlier") make more sense if you interpret them as being about Yudkowsky already having "priced in" a deep-learning-Actually-Works update in response to AlphaGo in 2016, in contrast to Metaculus forecasters needing to see DALLE 2/PaLM/Gato in 2022 in order to make "the same" update.
(That said, I agree that Yudkowsky's sneering in the absence of a specific track record is infuriating; I strong-upvoted this post.)
In particular I am irritated that Yudkowsky is criticizing Metaculus forecasters when he literally doesn't even have a Metaculus account, at least not one he posts from. He's pro-bets in theory, but then will asymmetrically criticize the people who make their prediction track record quantifiable and public. The reputational risk of making regular Metaculus predictions would be a lot more psychologically and socially relevant to him than losing a thousand dollars, so the fact that he's not doing so says a lot to me.
so the fact that he's not doing so says a lot to me
How about Metaculus points being worth nothing, or it being a huge time commitment with no payoff? Last I heard (e.g. from Zvi, who wasn't impressed with it), Metaculus still punished people for not continually updating their predictions, and occasionally rewarded them for making predictions, period (as in, both betting "no" or "yes" on some predictions granted points).
Have any of those things changed?
Note that Eliezer and I ended up making one bet (8% vs 16% probability on AI IMO gold by 2025). I would have liked to have gotten to more disagreements about more central topics---I feel like the result will be mostly noise---but it's not nothing. (I felt like I offered some reasonable candidates that we ought to disagree about, though the reader can judge for themselves.)
It was quite hard to get to this forecast, and in the end I don't think it will be that useful. I think it's just generally really hard. I don't have a clear sense for why Eliezer and I weren't able to get to more bets, but I'm not that surprised.
I do think that this kind of betting has a lot of hazards and there's a good chance that we are both going to come out behind in social EV. For example: (i) if you try to do it reasonably quickly then you basically just know there are reasons that your position is dumb that you just haven't noticed, (ii) it's easier to throw something in someone's face as a mistake than to gloat about it as a win, (iii) there are all kinds of adverse selection...
My first 'dunk' on April 18, about a 5-year shortening of Metaculus timelines in response to evidence that didn't move me at all, asking about a Metaculus forecast of the Metaculus forecast 3 years later, implicitly predicts that Metaculus will update again within 3 years.
My second 'dunk' on May 12 is about Metaculus updating that much again in that direction, one month later.
I do admit, it's not a good look that I once again understate my position by so much compared to what the reality turns out to be, especially after having made that mistake a few times before.
I do however claim it as a successful advance prediction, if something of a meta one, and cast a stern glance in your direction for failing to note this over the course of your attempting to paint me in a negative light by using terms like 'dunk'.
asking about a Metaculus forecast of the Metaculus forecast 3 years later, implicitly predicts that Metaculus will update again within 3 years. [emphasis mine]
I feel like this is missing the key claim underlying this post: that verbal statements making implicit predictions are too hard to judge and too easy to hindsight bias about, and so aren't strong evidence about a person's foresight.
For instance, if Metaculus did not, in fact, update again over the upcoming 3 years, and you were merely optimizing for the appearance of accuracy, you could claim that you weren't making a prediction, merely voicing a question. And more likely, you and everyone else would just have forgotten about this tweet.
I don't particularly want to take a stance on whether verbal forecasts like that one ought to be treated as part of one's forecasting record. But insofar as the author of this post clearly doesn't think they should be, this comment is not addressing his objection.
These sorts of observations sound promising for someone's potential as a forecaster. But by themselves they are massively easier to cherry-pick, fudge, omit, or re-define than proper forecasts are.
When you see other people make non-specific "predictions", how do you score them? How do you know the scoring that you're doing is coherent, and isn't rationalizing? How do you avoid the various pitfalls that Tetlock wrote about? How do you *ducks stern glance* score yourself on any of that, in a way that you'll know isn't rationalizing?
For emphasis, in this comment you reinforce that you consider it a successful advance prediction. This gives very little information about your forecasting accuracy. We don't even know what your actual distribution is, and it's a long time before this resolves; we only know it went in your direction. I claim that to critique other people's properly-scored forecasts, you should be transparent and give your own.
EDIT: Pasted from another comment I wrote:
Instead of that actual [future resolution] reality, and because of how abruptly the community ended up shifting, Eliezer seems to be interpreting that to mean that his position about that reality is not extreme enough. Those 2 things are somewhat related but pretty weakly, so it seems like rationalizing for him to frame it as showing his forecast isn't extreme enough.
My first 'dunk' on April 18, about a 5-year shortening of Metaculus timelines in response to evidence that didn't move me at all, asking about a Metaculus forecast of the Metaculus forecast 3 years later, implicitly predicts that Metaculus will update again within 3 years.
I do however claim it as a successful advance prediction, if something of a meta one
Wait, unless I misunderstand you, there's a reasoning mistake here. You request epistemic credit for predicting implicitly that the Metaculus median was going to drop by five years at some point in the next three years. But that's a prediction that the majority of Metaculites would also have made, and it was a given that it was going to happen in an interval of time as long as three years. It's a correct advance prediction, if you did make it (let's assume so and not get into inferring implicit past predictions with retrospective text analysis), but it's not one that is even slightly impressive.
As an example to explain why, I predict (with 80% probability) that there will be a five-year shortening in the median on the general AI question at some point in the next three years. And I also predict (with 85% probability) that...
Thanks for the post and expressing your opinion!
That being said, I feel like there is a misunderstanding here. Daniel mentioned that in another comment thread, but I don't think Eliezer claims what you're attributing to him, nor does your analogy with financial pundits work in this context.
My model of Eliezer, based on reading a lot of his posts (old and new) and one conversation, is that he's dunking on Metaculus and forecasters for a combination of two epistemic sins:
I agree with some of your complaints here. But Eliezer has more of a track record than you indicate. E.g. he made one attempt that I know of to time the stock market, buying on March 23, 2020 - the day on which it made its low for 2020.
There are further shards of a track record strewn across the internet:
A few parts of this OP seem in bad faith:
Here he dunks on Metaculus predictors as "excruciatingly predictable" about a weak-AGI question
No, the original Yudkowsky quote is:
To be a slightly better Bayesian is to spend your entire life watching others slowly update in excruciatingly predictable directions that you jumped ahead of 6 years earlier so that your remaining life could be a random epistemic walk like a sane person with self-respect.
I wonder if a Metaculus forecast of "what this forecast will look like in 3 more years" would be saner. Is Metaculus reflective, does it know what it's doing wrong?
And regarding this:
But Eliezer seems willing to format his message as blatant fearmongering like this. For years he's been telling people they are doomed, and often suggests they are intellectually flawed if they don't agree. To me, he doesn't come across like he's sparing me an upsetting truth. To me he sounds like he's catastrophizing, which isn't what I expect to see in a message tailored for mental health.
If OP had extend...
A couple questions:
Making highly visible predictions about AGI timelines as a safety figure is a lose-lose situation. If you're right, you will all be dead, so it won't matter. If you're wrong, bad people who don't make any predictions will use yours to tar you as a kook. Then everyone will stop listening to you, and AGI will come five years later and you'll all be dead.
I'm not saying he shouldn't shut up about the metaculus updates, but he's in a bit of a bind here. And as you noticed, he has in fact made a substantial prediction via his bet with Caplan. The reason he doesn't do much else is because (in my model of Eliezer) the kinds of people who are likely to take heed of his bets are more likely to be intellectually honest.
I don't like this defense for two reasons. First, I don't see why the same argument doesn't apply to the role Eliezer has already adopted as an early and insistent voice of concern. Being deliberately vague on some types of predictions doesn't change the fact that his name is synonymous with AI doomsaying. Second, we're talking about a person whose whole brand is built around intellectual transparency and reflection; if Eliezer's predictive model of AI development contains relevant deficiencies, I wish to believe that Eliezer's predictive model of AI development contains relevant deficiencies. I recognize the incentives may well be aligned against him here, but it's frustrating that he seems to want to be taken seriously on the topic but isn't obviously equally open to being rebutted in good faith.
Taking a stab at the crux of this post:
The two sides have different ideas of what it means to be epistemically virtuous.
Yudkowsky wants people to be good Bayesians, which e.g. means not over-updating on a single piece of evidence; or calibrating to the point that whatever news of new AI capabilities appears is already part of your model, so you don't have to update again. It's not so important to make publicly legible forecasts; the important part is making decisions based on an accurate model of the world. See the LW Sequences, his career, etc.
The OP is part of the Metaculus community and expects people to be good... Metaculeans? That is, they must fulfill the requirements for "forecaster prestige" mentioned in the OP. Their forecasts must be pre-registered, unambiguous, numeric, and numerous.
So it both makes perfect sense for Yudkowsky to criticize Metaculus forecasts for being insufficiently Bayesian (it made little sense that a forecast would be this susceptible to a single piece of news; compare with the LW discussion here), and for OP to criticize Yudkowsky for being insufficiently Metaculean (he doesn't have a huge public catalog of Metaculean predictions)....
This isn't a good description of being on Metaculus versus being a Bayesian.
How does one measure if they are "being a Bayesian"? The general point is you can't, unless you are being scored. You find out by making forecasts -- if you aren't updating you get fewer points, or even lose points. Otherwise you have people who are just saying things that thematically sound Bayesian but don't mean very much in terms of updated beliefs. Partly I'm making an epistemic claim that Eliezer can't actually know if he's being a good Bayesian, without proper forecasting. You can check out Tetlock's work if you're unsure why that would be the case, though I mention it in the post.
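To make the "being scored" point concrete, here is a minimal sketch (all numbers invented for illustration, not anyone's actual forecasts) of how a proper scoring rule such as the Brier score penalizes a forecaster who fails to update:

```python
# Minimal sketch: under a proper scoring rule (Brier here), reporting your
# genuinely updated probability beats sticking with a stale one in
# expectation. All numbers are invented for illustration.

def expected_brier(reported_p: float, true_p: float) -> float:
    """Expected Brier score (lower is better) if the event actually occurs
    with probability true_p but the forecaster reports reported_p."""
    return true_p * (reported_p - 1) ** 2 + (1 - true_p) * reported_p ** 2

# Suppose news arrives that really does move the correct probability to 0.7.
true_p = 0.7
print("keep reporting stale 0.3:", expected_brier(0.3, true_p))  # ~0.37
print("report updated 0.7:     ", expected_brier(0.7, true_p))  # ~0.21
# The updated report gets the better (lower) expected score, which is what
# lets a scored record distinguish actual Bayesian updating from
# Bayesian-sounding talk.
```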
The more central epistemic claim I'm making in this essay: if someone says they are doing a better job of forecasting a topic than other people, but they aren't actually placing forecasts so we could empirically test whether they are, then that person's forecasts should be viewed with great suspicion. I'm claiming this would be the same in every other domain, AI timelines are unlikely to be that special, and his eminence doesn't buy him a good justification for holding him to drastically lower standards about measuring his forecast accuracy.
An analogy could be Elon Musk. He's done great things that I personally am absolutely incapable of. And he does deserve praise for those things. And indeed, Eliezer was a big influence on me. But Musk gives extreme predictions that probably won't age well.
Him starting this site and writing a million words about rationality is wonderful and outstanding. But do you think it predicts forecasting performance nearly as well as actual proper forecasting performance? I claim it doesn't come anywhere near being as good a predictive factor as just making some actual forecasts and seeing what happens, and I don't see the opposing position holding up at all. You can argue that "we care about other things too than just forecasting ability," but in this thread I am specifically referring to his implied forecasting accuracy, not his other accomplishments. The way you're referring to Bayes points here doesn't seem workable or coherent, any more than Musk Points would tell me his predictions are accurate.
No, success and fame are not very informative about forecasting accuracy. Yes they are strongly indicative of other competencies, but you shouldn't mix those in with our measure of forecasting. And nebulous unscorable statements don't at all work as "success", too cherry-picked and unworkable. Musk is famously uncalibrated with famously bad timeline predictions in his domain! I don't think you should be glossing over that in this context by saying "Well he's successful..."
If we are talking about measuring forecasting performance, then it's more like comparing tournament Karate with trench warfare.
I share the sense that many "AGI" forecasts are going to be very hard to arbitrate---at best they have a few years of slack one way or the other, and at worst they will be completely reinterpreted (I could easily see someone arguing for AGI today).
I try to give forecasts for "technological singularity" instead, which I think has a variety of mostly-equivalent operationalizations. (When asked to give a timeline to AI I often give a tongue-in-cheek operationalization of "capture >1% of sun's energy." This is obviously more appropriate if coupling timelines with the substantive prediction that it will only make a few months of difference which crazy-ambitious technological milestone you choose---just as I think it only makes a few centuries of difference which milestone you use for forecasting the technological singularity starting from 10,000 BC.)
Suppose someone in 1970 makes the prediction: "More future tech progress will be in computers, not rockets" (claiming, amongst other arguments, that rockets couldn't be made orders of magnitude smaller, and computers could). There is a sense in which they are clearly right, but it's really hard to turn something like that into a specific objective prediction, even with the benefit of hindsight. Any time you set an objective criterion, there are ways that the technical letter of the rules could fail to match the intended spirit.
(The same way complicated laws have loopholes)
Take this attempt
...Change the first criterion to: "Able to reliably pass a 2-hour, adversarial Turing test during which the participants can send text, images, and audio files (as is done in ordinary text messaging applications) during the course of their conversation. An 'adversarial' Turing test is one in which the human judges are instructed to ask interesting and difficult questions, designed to advantage human participants, and to successfully unmask the computer as an impostor. A single demonstration of a computer passing such a Turing test, or one that is sufficiently similar, will be sufficient for t
I like your list!
Definitely agree that narrow questions can lose the spirit of it. The forecasting community can hedge against this by having a variety of questions that try to get at it from "different angles".
For example, that person in 1970 could set up a basket of questions:
Of course, I understand completely if people in 1970 didn't know about Tetlock's modern work. But for big important questions, today, I don't see why we shouldn't just use modern proper forecasting technique. Admittedly it is laborious! People have been struggling to write good AI timeline questions for years.
This is not verbatim because I never wrote it down till now, but Eliezer said in March 2005 that an AI's beating Go champions would be a sign of real progress towards AGI. IMO that counts as a successful prediction because the rate at which the field racked up successes increased significantly around the time of AlphaGo.
(The statement was made in the house of a man named Spike Jones at a party to give the public a chance to meet Eliezer, who'd just moved to the Bay Area.)
It tells us essentially nothing. How are you going to score the degree to which it turned out to be "a sign of real progress towards AGI"? I understand it feels impressive but it's far too nebulous to work as a forecasting track record.
He clearly believes he could be placing forecasts showing whether or not he is better. Yet he doesn't.
Eliezer hasn't said he thinks he can do better than Metaculus on arbitrary questions. He's just said he thinks Metaculus is wrong on one specific question. Quoting a point I made in our conversation on Twitter:
...[...] From my perspective, it looks like: I don't think Metaculus performance on physics-loaded tech progress is a perfect proxy for physics knowledge (or is the only way a physicist could think they know better than Metaculus on a single question).
It seems like you're interpreting EY as claiming 'I have a crystal ball that gives me unique power to precisely time AGI', whereas I interpret EY as saying that one particular Metaculus estimate is wrong.
Metaculus being wrong on a particular very-hard-to-forecast question is not a weird or crazy claim, so you don't need to claim to be a genius.
Obviously EY shouldn't get a bunch of public "aha, you predicted Metaculus' timeline was way too long" credit when he didn't clearly state this in advance (at least before the first update) and hasn't quantified what "too long" means.
I'm not saying 'give EY social credit for this' or ev
I think this is an unreasonable characterization of the situation and my position, especially the claim:
Eliezer hasn't seen a big list of prediction successes from Paul about this thing Paul claims to be unusually good at (whereas, again, EY makes no claim of being unusually good at timing arbitrary narrow-AI advances)
I responded to a long thread of Eliezer trash-talking me in particular (here), including making apparent claims about how this is not the kind of methodology that makes good forecasts. He writes:
It just seems very clear to me that the sort of person who is taken in by this essay is the same sort of person who gets taken in by Hanson's arguments in 2008 and gets caught flatfooted by AlphaGo and GPT-3 and AlphaFold 2 [... the kind of person who is] going "Huh?" when AlphaGo or GPT-3 debuts[1]
He also writes posts like this one. Saying "the trick that never works" sure seems like it's making a claim that something has a worse track record than whatever Eliezer is doing.
Overall it looks to me that Eliezer is saying, not once but many times, that he is better at predicting things than other people and that this should be taken as a reason to dismiss various kinds of argumen...
Yeah, I think I should probably stay out of this kind of interaction if I'm going to feel compelled to respond like this. Not that maximizing respect is the only goal, but I don't think I'm accomplishing much else.
I'm also going to edit the phrases "shouldn't talk quite as much shit" and "full of himself"; I just shouldn't have expressed that idea in that way. (Sorry Eliezer.)
I think the YouTube drama is serving an important function. Yudkowsky routinely positions himself in the role of a religious leader who is (in his own words) "always right".
(I think "role of a religious leader" is an apt description of what's going on sociologically, even if no supernatural claims are being made; that's why the "rightful caliph" language sticks.)
I used to find the hyper-arrogant act charming and harmless back in 2008, because, back in 2008, he actually was right about almost everything I could check myself. (The Sequences were very good.)
For reasons that are beyond the scope of this comment, I no longer think the hyper-arrogant act is harmless; it intimidates many of his faithful students (who genuinely learned a lot from him) into deferring to their tribal leader even when he's obviously full of shit.
If he can't actually live up to his marketing bluster, it's important for our collective sanity that people with reputation and standing call bullshit on the act, so that citizens of the Caliphate remember that they have the right and the responsibility to think things through for themselves. I think that's a more dignified way to confront the hazards that face us in ...
Elizabeth van Nostrand comments in private chat:
Can everyone agree that:
- there are many forms of prediction, of which narrow, precise forecasting of the kind found on prediction markets is only one
- narrow forecasting is only viable for a small subset of problems, and often the most important problems aren’t amenable to narrow forecasting
- narrow forecasting is much harder to fake than the other kinds. Making vague predictions and taking credit for whatever happens to happen is a misallocation of truthseeking credit.
- It is possible to have valuable models without being good at narrow predictions - black swans is a useful concept but it's very annoying how the media give Nassim Taleb credit every time something unexpected happens.
- It is possible to have models that are true but not narrow-predictive enough to be valuable [added: you can have a strong, correct model that a stock is overpriced, but unless you have a model for when it will correct it’s ~impossible to make money off that information]
?
I like this addition, and endorse 1-5!
But still without being transparent about his own forecasts, preventing a fair comparison.
I think it's a fair comparison, in that we can do at least a weak subjective-Bayesian update on the information -- it's useful and not cherry-picked, at least insofar as we can compare the AGI/TAI construct Eliezer was talking about in December, to the things Metaculus is making predictions about.
I agree that it's way harder to do a Bayesian update on data points like 'EY predicted AGI well before 2050, then Metaculus updated from 2052 to 2035' when we don't have a full EY probability distribution over years.
I mostly just respond by making a smaller subjective update and then going on with my day, rather than treating this as revelatory. I'm better off with the information in hand, but it's a very small update in the grand scheme of things. Almost all of my knowledge is built out of small updates in the first place, rather than huge revelatory ones.
If I understand your views, Jotto, three big claims you're making are:
Note that this doesn't mean making the date range very narrow (confident), that's unrelated.
Fair enough, but I was responding to a pair of tweets where you said:
Eliezer says that nobody knows much about AI timelines. But then keeps saying "I knew [development] would happen sooner than you guys thought". Every time he does that, he's conning people.
I know I'm using strong wording. But I'd say the same in any other domain.
He should create a public Metaculus profile. Place a bunch of forecasts.
If he beats the community by the landslide he claims, then I concede.
If he's mediocre, then he was conning people.
'It would be convenient if Eliezer would record his prediction on Metaculus, so we know with more precision how strong of an update to make when he publicly says "my median is well before 2050" and Metaculus later updates toward a nearer-term median' is a totally fair request, but it doesn't bear much resemblance to 'if you record any prediction anywhere other than Metaculus (that doesn't have similarly good tools for representing probability distributions), you're a con artist'. Seems way too extreme.
Likewise, 'prove that you're better than Metaculus on a ton of forecasts or you're ...
Also, I think you said on Twitter that Eliezer's a liar unless he generates some AI prediction that lets us easily falsify his views in the near future? Which seems to require that he have very narrow confidence intervals about very near-term events in AI.
So I continue to not understand what it is about the claims 'the median on my AGI timeline is well before 2050', 'Metaculus updated away from 2050 after I publicly predicted it was well before 2050', or 'hard takeoff is true with very high probability', that makes you think someone must have very narrow contra-mainstream distributions on near-term narrow-AI events or else they're lying.
I agree with you on most counts, but I'm not updating against Eliezer's epistemic hygiene as much as I otherwise would, because you say Eliezer didn't bet Paul when he did, and you know he did but didn't issue a correction. That suggests the evidence may be selected overly much for dumb comments that Eliezer has made but possibly didn't put all that much thought into.
Eliezer and Bryan's bet is 1:1 odds for a CPI-adjusted $100 bet that the world won't be destroyed by Jan. 1, 2030.
After reading comments on my post "How to place a bet on the end of the world," which was motivated by Bryan Caplan's description of his bet with Eliezer, I concluded that you can't extract information on confidence from odds on apocalyptic bets. Explanation is in the comments.
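As a rough illustration of one way that argument can go (a sketch under stated assumptions, not the explanation from that thread): if, as is typical for end-of-the-world bets, the money changes hands up front and is repaid with interest only if the world survives, and money is worth nothing to the doom-believer in doom worlds, then the stated odds alone pin down very little about their probability:

```python
# Minimal sketch of why the odds on an end-of-the-world bet reveal little.
# Assumed structure (the usual one for such bets): the doom-believer receives
# `upfront` dollars now and repays `repay` dollars at the deadline only if
# the world survives; money is assumed worthless to them in doom worlds;
# future money is discounted by `discount`.

def breakeven_doom_probability(upfront: float, repay: float, discount: float) -> float:
    """Smallest p(doom) at which accepting the bet is non-negative expected
    value for the doom-believer, under the assumptions above."""
    # EV = upfront - discount * (1 - p_doom) * repay >= 0
    #   => p_doom >= 1 - upfront / (discount * repay)
    return max(0.0, 1.0 - upfront / (discount * repay))

for discount in (1.0, 0.8, 0.6):
    p = breakeven_doom_probability(upfront=100, repay=200, discount=discount)
    print(f"discount {discount:.1f}: bet acceptable for any p(doom) >= {p:.2f}")
# The breakeven swings from 0.50 down to about 0.17 as the discount factor
# changes, so nominal 1:1 odds alone pin down little about confidence.
```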
Bryan told me via email that these are the best odds he could get from Eliezer.
I think the best way to think about their bet is that it's just for fun. We shouldn't try t...
This makes an important point that I find myself consistently referring to - almost none of the confidence in predictions, even inside the rationalist community, is based on actual calibration data. Experts forecast poorly, and we need to stop treating expertise or argumentation as strong stand-alone reasons to accept claims which are implicitly disputed by forecasts.
On the other hand, I think that this post focused far too much on Eliezer. In fact, there are relatively few people in the community who have significant forecasting track records, and this co...
Would it be helpful to think about something like "what Brier score will a person in the reference class of "people-similar-to-Eliezer_2022-in-all-relevant-ways" have after making a bunch of predictions on Metaculus?" Perhaps we should set up this sort of question on Metaculus or Manifold? Though I would probably refrain from explicitly mentioning Eliezer in it.
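For concreteness, here is a minimal sketch of the kind of comparison such a Brier-score question would ultimately rest on. All numbers below are made up purely for illustration; this is nobody's real record:

```python
# Minimal sketch: computing a Brier score over resolved binary questions and
# comparing it to a community baseline. All numbers are invented.

def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.
    Lower is better; always guessing 0.5 scores 0.25."""
    assert len(probabilities) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

# Hypothetical resolved questions: 1 = resolved "yes", 0 = resolved "no".
outcomes         = [1,    0,    1,    1,    0,    0]
forecaster_probs = [0.80, 0.30, 0.60, 0.90, 0.20, 0.40]  # individual forecaster
community_probs  = [0.70, 0.25, 0.75, 0.85, 0.10, 0.35]  # community median

print("forecaster Brier:", round(brier_score(forecaster_probs, outcomes), 3))
print("community  Brier:", round(brier_score(community_probs, outcomes), 3))
# Whoever has the lower mean Brier score over a decent sample did better;
# without pre-registered numbers like these, no such comparison is possible.
```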
Imagine if there was a financial pundit who kept saying "Something really bad is brewing in the markets and we may be headed for a recession. But we can't know when recessions will come, nobody can predict them." And then every time there was a selloff in the market, they told everyone "I've been saying we were headed for trouble," taking credit. This doesn't work as a forecasting track record, and it shouldn't be thought of that way.
If they want forecaster prestige, their forecasts must be:
- Pre-registered,
- Unambiguous,
- Numeric,
- and Numerous.
When Eliezer Yudkowsky talks about forecasting AI, he has several times ~~claimed~~ implied he has a great forecasting track record. But a meaningful "forecasting track record" has well-known and very specific requirements, and Eliezer doesn't show these.

Here he dunks on the Metaculus predictors as "excruciatingly predictable" about a weak-AGI question, saying that he is a sane person with self-respect (implying the Metaculus predictors aren't):
He clearly believes he could be placing forecasts showing whether or not he is better. Yet he doesn't.
Some have argued "but he may not have time to keep up with the trends, forecasting is demanding". But he's the one making a claim about relative accuracy! And this is in the domain he says is the most important one of our era. And he seems to already be keeping up with trends -- just submit the distribution then.
And here he dunks on Metaculus predictors again:
But still without being transparent about his own forecasts, preventing a fair comparison.
In another context, Paul Christiano offered to bet Eliezer about AI timelines. This is great, a bet is a tax on bullshit. While it doesn't show a nice calibration chart like on Metaculus, it does give information about performance. You would be right to be fearful of betting against Bryan Caplan. And to Eliezer's great credit, he has actually made a related bet with Bryan! EDIT: Note that Eliezer also agreed to this bet with Paul.
But at one point in responding to Paul, Eliezer mentions some nebulous, unscorable debates and claims:
Nothing about this is a forecasting track record. These are post-hoc opinions. There are unavoidable reasons we require pre-registering of the forecasts, removal of definitional wiggle room, explicit numbers, and a decent sample. This response sounds like the financial pundit, saying he called the recession.
EDIT: I think some people are thinking that Eliezer was an unambiguous winner of that debate, and therefore this works as part of a forecasting track record. But you can see examples of why it's far more ambiguous than that in this comment by Paul Christiano.
In this comment, Eliezer said Paul didn't need to bet him, and that Paul is...lacking a forecasting track record.
But Eliezer himself doesn't have a meaningful forecasting track record.
In other domains, where we have more practice detecting punditry tactics, we would dismiss such an uninformative "track record". We're used to hearing Tetlock talk about ambiguity in political statements. We're used to hearing about a financial pundit like Jim Cramer underperforming the market. But AI timelines are a novel domain, where we don't yet have that reflex.
When giving "AGI timelines", I've heard several EAs claim there are no ambiguity risks for the forecast resolution. They think this because the imagery in their heads is dramatic, and we'll just know if they were right. No we won't. This shows wild overconfidence in scenarios they can imagine, and overconfidence in how powerful words are at distinguishing.
Even "the AGI question" on Metaculus had some major ambiguities that could've prevented resolution. Matthew Barnett nicely proposed solutions to clarify them. Many people talking about AI timelines should find this concerning. Because they make "predictions" that aren't defined anywhere near as well as that question. It's okay for informal discussions to be nebulous. But while nebulous predictions sound informative, it takes years before it's obvious that they were meaningless.
So why won’t Eliezer use the ways of Tetlock? He says this:
He suggests that if he uses proper forecasting methods, it would hurt people's mental health. But Eliezer seems willing to format his message as blatant fearmongering like this. For years he's been telling people they are doomed, and often suggests they are intellectually flawed if they don't agree. To me, he doesn't come across like he's sparing me an upsetting truth. To me he sounds like he's catastrophizing, which isn't what I expect to see in a message tailored for mental health.
I'm not buying speculative infohazard arguments, or other "reasons" to obfuscate. If Eliezer thinks he has detected an imminent world-ending danger to humans, then the best approach would probably be to give a transparent, level-headed assessment.
Well, with practice he would improve at verbalized probabilities, as Tetlock found. Also, how does he expect to know if his intuitions are stupid, if he doesn't test them against reality? Sure, it would probably make him seem much less prescient. But that's good, if it's more objective and real.
And no, his domain eminence isn't much of an update. The edge in forecasting from being an expert is generally pretty underwhelming, however special you think AI is. Maybe even less so if we consider him a relatively famous expert. Does anyone predict they can dominate the long-term question leaderboards, by having insights, and skipping the proper forecasting practice? This is wishful thinking.
One justification I've heard: "Shorter-term questions can't show how good his judgment is about longer-term questions". This seems like a rationalization. Suppose you have 2 groups: those who show good calibration on 3-year AI questions, and those who don't. Now in many cases, both groups end up being dart-throwing chimps on 30-year AI questions. But this hardly justifies not even trying to do it properly. And if some do outperform at the long-term questions, they should have a much better chance if they were at least calibrated on 3-year questions, versus the group who didn't demonstrate that. It's easy to have an outcome where the uncalibrated just do even worse than a dart-throwing chimp.
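To spell out what "calibrated on 3-year questions" means operationally, here is a minimal sketch with invented data (a real check would need far more resolved questions than this):

```python
# Minimal sketch of a calibration check on a batch of resolved questions:
# bin forecasts by stated probability and compare, per bin, the average
# stated probability with the observed frequency of "yes" resolutions.
# The data below is invented purely for illustration.

from collections import defaultdict

def calibration_table(probabilities, outcomes, n_bins=10):
    """Return (mean stated probability, observed 'yes' frequency, count) per bin."""
    bins = defaultdict(list)
    for p, o in zip(probabilities, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, o))
    rows = []
    for b in sorted(bins):
        pairs = bins[b]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(o for _, o in pairs) / len(pairs)
        rows.append((round(mean_p, 2), round(freq, 2), len(pairs)))
    return rows

# Hypothetical resolved 3-year questions (1 = resolved yes, 0 = resolved no).
probs    = [0.1, 0.15, 0.3, 0.35, 0.6, 0.65, 0.9, 0.85, 0.9, 0.95]
resolved = [0,   0,    0,   1,    1,   0,    1,   1,    1,   1]
for stated, observed, n in calibration_table(probs, resolved):
    print(f"stated ~{stated}: resolved yes {observed} of the time (n={n})")
# A calibrated forecaster's observed frequencies track the stated
# probabilities across bins.
```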
If you would like to have some chance at forecasting AI timelines, here are a couple paths. 1) Good generalist forecasters can study supplemental domain material. 2) Non-forecaster domain experts can start building a calibration graph of proper forecasts. Those are basically the options.
People who avoid forecasting accountability shouldn't boast about their forecasting performance. And other people shouldn't rationalize it. I thought Eliezer did great betting with Bryan. Before dunking on properly-scored forecasts, he should be transparent, create a public Metaculus profile, place properly-scored forecasts, and start getting feedback.
Thank you to KrisMoore, Linch, Stefan Schubert, Nathan Young, Peterwildeford, Rob Lee, Ruby, and tenthkrige for suggesting changes.