Imagine a financial pundit who keeps saying, "Something really bad is brewing in the markets and we may be headed for a recession. But we can't know when recessions will come; nobody can predict them." Then, every time there's a selloff in the market, they tell everyone "I've been saying we were headed for trouble," taking credit. This doesn't work as a forecasting track record, and it shouldn't be thought of as one.
If they want forecaster prestige, their forecasts must be:
- Pre-registered,
- So unambiguous that people actually agree whether the event "happened",
- With probabilities and numbers so we can gauge calibration,
- And include enough forecasts that it's not just a fluke or cherry-picking. (A minimal sketch of what scoring such forecasts looks like follows this list.)
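To make "scorable" concrete, here's a minimal sketch (in Python, with made-up questions, probabilities, and outcomes that aren't anyone's real predictions) of what a pre-registered, unambiguous, numeric forecast set looks like once it resolves:

```python
# Minimal sketch: Brier-scoring a small set of pre-registered binary forecasts.
# The questions, probabilities, and outcomes below are invented for illustration.

forecasts = [
    # (question, stated probability of YES, outcome: 1 = happened, 0 = didn't)
    ("Benchmark X surpassed before 2025-01-01", 0.70, 1),
    ("Lab Y releases system Z in 2024",         0.40, 0),
    ("Capability W publicly demonstrated",      0.15, 0),
]

# Brier score: mean squared error between stated probability and outcome.
# 0.0 is perfect; always saying 50% scores 0.25.
brier = sum((p - outcome) ** 2 for _, p, outcome in forecasts) / len(forecasts)
print(f"Brier score: {brier:.3f}")
```

None of this is possible if the prediction was never stated with a number, or if people can't agree afterward whether the event "happened".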
When Eliezer Yudkowsky talks about forecasting AI, he has several times implied that he has a great forecasting track record. But a meaningful "forecasting track record" has well-known and very specific requirements, and Eliezer doesn't meet them.
Here he dunks on the Metaculus predictors over a weak-AGI question, calling their updates "excruciatingly predictable" and saying that he is a sane person with self-respect (implying the Metaculus predictors aren't):
> To be a slightly better Bayesian is to spend your entire life watching others slowly update in excruciatingly predictable directions that you jumped ahead of 6 years earlier so that your remaining life could be a random epistemic walk like a sane person with self-respect.
> I wonder if a Metaculus forecast of "what this forecast will look like in 3 more years" would be saner. Is Metaculus reflective, does it know what it's doing wrong?
He clearly believes he could be placing forecasts that would show whether or not he is better. Yet he doesn't.
Some have argued "but he may not have time to keep up with the trends; forecasting is demanding". But he's the one making a claim about relative accuracy! And this is in the domain he says is the most important one of our era. And he already seems to be keeping up with the trends; all that's left is to submit the distribution.
And here he dunks on Metaculus predictors again:
> What strange inputs other people require instead of the empty string, to arrive at conclusions that they could have figured out for themselves earlier; if they hadn't waited around for an obvious whack on the head that would predictably arrive later. I didn't update off this.
But he still isn't transparent about his own forecasts, which prevents a fair comparison.
In another context, Paul Christiano offered to bet Eliezer about AI timelines. This is great: a bet is a tax on bullshit. While it doesn't produce a nice calibration chart the way Metaculus does, it does give information about performance. You would be right to be wary of betting against Bryan Caplan. And to Eliezer's great credit, he has actually made a related bet with Bryan! EDIT: Note that Eliezer also agreed to this bet with Paul.
But at one point in responding to Paul, Eliezer mentions some nebulous, unscorable debates and claims:
> I claim that I came off better than Robin Hanson in our FOOM debate compared to the way that history went. I'd claim that my early judgments of the probable importance of AGI, at all, stood up generally better than early non-Yudkowskian EA talking about that.
Nothing about this is a forecasting track record. These are post-hoc opinions. There are unavoidable reasons we require pre-registration of forecasts, removal of definitional wiggle room, explicit numbers, and a decent sample. This response sounds like the financial pundit saying he called the recession.
EDIT: I think some people believe that Eliezer was an unambiguous winner of that debate, and that this therefore works as part of a forecasting track record. But you can see examples of why it's far more ambiguous than that in this comment by Paul Christiano.
In this comment, Eliezer said Paul didn't need to bet him, and that Paul is...lacking a forecasting track record.
> I think Paul doesn't need to bet against me to start producing a track record like this; I think he can already start to accumulate reputation by saying what he thinks is bold and predictable about the next 5 years; and if it overlaps "things that interest Eliezer" enough for me to disagree with some of it, better yet.
But Eliezer himself doesn't have a meaningful forecasting track record.
In other domains, where we have more practice detecting punditry tactics, we would dismiss such an uninformative "track record". We're used to hearing Tetlock talk about the ambiguity in political pundits' statements. We're used to hearing about a financial pundit like Jim Cramer underperforming the market. But AI timelines are a novel domain, and we haven't built those reflexes there yet.
When giving "AGI timelines", I've heard several EAs claim there is no ambiguity risk in how the forecast would resolve. They think this because the imagery in their heads is dramatic, so we'll just know whether they were right. No we won't. This shows wild overconfidence in the scenarios they can imagine, and overconfidence in how well words can distinguish between outcomes.
Even "the AGI question" on Metaculus had some major ambiguities that could've prevented resolution. Matthew Barnett nicely proposed solutions to clarify them. Many people talking about AI timelines should find this concerning. Because they make "predictions" that aren't defined anywhere near as well as that question. It's okay for informal discussions to be nebulous. But while nebulous predictions sound informative, it takes years before it's obvious that they were meaningless.
So why won’t Eliezer use the ways of Tetlock? He says this:
> I consider naming particular years to be a cognitively harmful sort of activity; I have refrained from trying to translate my brain's native intuitions about this into probabilities, for fear that my verbalized probabilities will be stupider than my intuitions if I try to put weight on them. What feelings I do have, I worry may be unwise to voice; AGI timelines, in my own experience, are not great for one's mental health, and I worry that other people seem to have weaker immune systems than even my own. But I suppose I cannot but acknowledge that my outward behavior seems to reveal a distribution whose median seems to fall well before 2050.
He suggests that if he used proper forecasting methods, it would hurt people's mental health. But Eliezer seems perfectly willing to format his message as blatant fearmongering. For years he's been telling people they are doomed, and he often suggests they are intellectually flawed if they don't agree. To me, he doesn't come across like he's sparing me an upsetting truth. To me he sounds like he's catastrophizing, which isn't what I expect to see in a message tailored for mental health.
I'm not buying speculative infohazard arguments, or other "reasons" to obfuscate. If Eliezer thinks he has detected an imminent, world-ending danger to humanity, then the best approach would probably be to give a transparent, level-headed assessment.
> ...for fear that my verbalized probabilities will be stupider than my intuitions if I try to put weight on them
Well, with practice he would improve at verbalizing probabilities, as Tetlock found. Also, how does he expect to know whether his intuitions are stupid if he doesn't test them against reality? Sure, doing so would probably make him seem much less prescient. But that's good, if it's more objective and real.
And no, his domain eminence isn't much of an update. The forecasting edge you get from being an expert is generally pretty underwhelming, however special you think AI is; maybe even less so if we consider him a relatively famous expert. Does anyone predict they can dominate the long-term question leaderboards through insight alone, while skipping proper forecasting practice? This is wishful thinking.
One justification I've heard: "Shorter-term questions can't show how good his judgment is about longer-term questions." This seems like a rationalization. Suppose you have two groups: those who show good calibration on 3-year AI questions, and those who don't. In many cases, both groups will end up being dart-throwing chimps on 30-year AI questions. But that hardly justifies not even trying to do it properly. And if some people do outperform on the long-term questions, those who were at least calibrated on 3-year questions should have a much better chance than the group who never demonstrated that. It's easy to get an outcome where the uncalibrated do even worse than a dart-throwing chimp.
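For concreteness, "showing calibration on 3-year questions" just means something like the following check, sketched here with invented resolved forecasts: bucket the stated probabilities and compare each bucket's average to the observed frequency.

```python
from collections import defaultdict

# Sketch of a crude calibration check; the resolved forecasts below are invented.
# Each entry is (stated probability in percent, outcome: 1 = resolved YES, 0 = NO).
resolved = [(10, 0), (15, 0), (30, 1), (35, 0), (60, 1), (75, 0), (85, 1), (90, 1)]

# Group forecasts into 20-point-wide buckets by their stated probability.
buckets = defaultdict(list)
for pct, outcome in resolved:
    buckets[min(pct // 20, 4)].append((pct, outcome))

# A calibrated forecaster's stated probabilities roughly match observed frequencies.
for idx in sorted(buckets):
    entries = buckets[idx]
    avg_stated = sum(p for p, _ in entries) / len(entries)
    freq = 100 * sum(o for _, o in entries) / len(entries)
    print(f"bucket {20 * idx}-{20 * idx + 19}%: avg stated {avg_stated:.0f}%, "
          f"resolved YES {freq:.0f}% (n={len(entries)})")
```

A real calibration graph needs far more resolved questions than this, which is exactly why the track record has to be accumulated rather than asserted.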
If you would like to have some chance at forecasting AI timelines, here are a couple paths. 1) Good generalist forecasters can study supplemental domain material. 2) Non-forecaster domain experts can start building a calibration graph of proper forecasts. Those are basically the options.
People who avoid forecasting accountability shouldn't boast about their forecasting performance. And other people shouldn't rationalize it. I thought Eliezer did great betting with Bryan. Before dunking on properly-scored forecasts, he should be transparent, create a public Metaculus profile, place properly-scored forecasts, and start getting feedback.
Thank you to KrisMoore, Linch, Stefan Schubert, Nathan Young, Peterwildeford, Rob Lee, Ruby, and tenthkrige for suggesting changes.
Wait, unless I misunderstand you, there's a reasoning mistake here. You request epistemic credit for implicitly predicting that the Metaculus median was going to drop by five years at some point in the next three years. But that's a prediction the majority of Metaculites would also have made, and it was a given that it was going to happen over an interval as long as three years. It's a correct advance prediction, if you did make it (let's assume so and not get into inferring implicit past predictions with retrospective text analysis), but it's not even slightly impressive.
As an example to explain why, I predict (with 80% probability) that there will be a five-year shortening in the median on the general AI question at some point in the next three years. And I also predict (with 85% probability) that there will be a five-year lengthening at some point in the next three years.
I'm predicting both that Metaculus timelines will shorten and that they will lengthen! What gives? Well, I'm predicting volatility... Should I be given much epistemic credit if I later turn out to be right on both predictions? No: it's very predictable, and you don't need to be a good forecaster to anticipate it. If you think you should get some credit for your prediction, then I should get much more from these two. But it's not the case that I should get much, nor that you should.
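To see how cheap those two predictions are, here is a toy simulation (my own illustrative assumptions, not a model of Metaculus): if the community median drifts around like a noisy monthly series, then a five-year swing in each direction tends to show up somewhere in a three-year window.

```python
import random

# Toy illustration only: pretend the community median "AGI year" moves as a random
# walk with a modest monthly jitter (an assumption for this sketch, not real data),
# and count how often 5-year swings in each direction appear within 3 years.
random.seed(0)

def swing_probabilities(n_trials=10_000, months=36, monthly_sd=2.0):
    shortened = lengthened = 0
    for _ in range(n_trials):
        level = 0.0
        saw_short = saw_long = False
        for _ in range(months):
            level += random.gauss(0, monthly_sd)
            saw_short = saw_short or level <= -5
            saw_long = saw_long or level >= 5
        shortened += saw_short
        lengthened += saw_long
    return shortened / n_trials, lengthened / n_trials

p_short, p_long = swing_probabilities()
print(f"P(a 5-year shortening appears within 3 years): {p_short:.0%}")
print(f"P(a 5-year lengthening appears within 3 years): {p_long:.0%}")
```

The exact numbers depend entirely on the volatility you assume; the point is only that large swings in both directions are routine whenever a series is noisy, so predicting them earns little credit.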
Are there inconsistencies in the AGI questions on Metaculus? Within a question's forecast timeline, with other questions, with the resolution criteria? Yes, there are plenty! Metaculus is full of glaring inconsistencies. The median on one question will contradict the median on another. An AI question with a stronger operationalization will have a lower median than a question with a weaker one. The current median says there is a four percent chance that AGI has already been developed. The resolution criteria on a question will say it can't resolve at the upper bound, yet the median assigns 14% to it resolving at the upper bound anyway.
It's commendable to notice these inconsistencies, and right to downgrade your opinion of Metaculus because of them. But it's wrong to conclude, even with weak confidence, that because you can frequently observe such glaring inconsistencies, and can predict in advance that specific ones will happen (including changes over time in the median that are predictable even in expected value, after accounting for skew), you are therefore a better forecaster than most of the Metaculites forecasting on these questions, even just on AGI questions. And the implicit claim of being "a slightly better Bayesian" actually seems far stronger and more general than that.
Why? Because Metaculites know there are glaring inconsistencies everywhere: they identify them often, they know there are more, and they could find and fix most of them easily. It's not that you're a better forecaster, just that you have unreasonable expectations of a community of forecasters who are almost all effectively unpaid volunteers.
It's not surprising that the Metaculus median will change over time in specific and predictable ways that are inconsistent with good Bayesianism. That doesn't mean they're that bad (let us see you do better, after all); it's because people's energy and interest are scarce. Questions in tournaments with money prizes get more engagement, as do questions about things that are currently in the news. There are still glaring inconsistencies in those questions, because even that isn't enough engagement to fix them all. (Also because the tools for making and checking your distributions are time-consuming to use.)
There are only 601 forecasters who have more than 1000 points on Metaculus: that means only 601 forecasters who have done even a pretty basic amount of forecasting. One of the two forecasters with exactly 1000 points has made predictions on only six questions, for example. You can do that in less than one hour, so it’s really not a lot.
If 601 sounds like a lot, consider that there are thousands of questions on the site, each with a wall of text describing the background and the resolution criteria. Predictions need to be updated constantly! The most active predictors on the site burn out because it takes so much time.
So it's not reasonable to expect to see no inconsistencies, no predictable changes in the median, and so on. It's not that they're bad forecasters. Of course you can do better on one or a few specific questions, but that doesn't mean much. If you want even a small but worthwhile amount of evidence, from correct advance predictions, that you are a better forecaster than other Metaculites, you need, for example, to go and win a tournament: one of the tournaments with money prizes that many people are participating in.
Evaluating forecasting track records in practice is hard and very dependent on the scoring rule you use (rankings for PredictionBook vary a lot with your methodology for evaluating relative performance, for example). You need a lot of high-quality data to get significant evidence. With only a little low-quality data, you just aren't going to get a useful amount of evidence.
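As a small illustration of the scoring-rule point (with made-up forecasts, not PredictionBook data): a forecaster who uses extreme probabilities can rank ahead of a more cautious one under the Brier score and behind them under the log score, because the log score punishes confident misses much more harshly.

```python
import math

# Made-up forecasts to show a ranking flip between two common scoring rules.
# Each entry: (stated probability of YES, outcome: 1 = YES, 0 = NO).
cautious = [(0.75, 1)] * 10 + [(0.25, 0)] * 10   # moderately confident, right direction every time
bold = [(0.99, 1)] * 19 + [(0.001, 1)]           # usually extreme and right, one confident miss

def brier(forecasts):
    # Mean squared error between probability and outcome; lower is better.
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

def log_score(forecasts):
    # Mean log-probability assigned to what actually happened; higher is better.
    return sum(math.log(p if o == 1 else 1 - p) for p, o in forecasts) / len(forecasts)

print(f"Brier (lower is better):  cautious {brier(cautious):.3f}  vs  bold {brier(bold):.3f}")
print(f"Log   (higher is better): cautious {log_score(cautious):.3f}  vs  bold {log_score(bold):.3f}")
# The Brier score ranks the bold forecaster ahead; the log score ranks the cautious one ahead.
```

Neither ranking is "the" right one, which is why comparisons need a stated rule, a lot of resolved questions, and a consistent methodology.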