The great decline in Wikipedia pageviews (condensed version)

13 VipulNaik 27 March 2015 02:02PM

To keep this post manageable in length, I have only included a small subset of the illustrative examples and discussion. I have published a longer version of this post, with more examples (but the same intro and concluding section), on my personal site.

Last year, during the months of June and July, as my work for MIRI was wrapping up and I hadn't started my full-time job, I worked on the Wikipedia Views website, aimed at easier tabulation of the pageviews for multiple Wikipedia pages over several months and years. It relies on a statistics tool called stats.grok.se, created by Domas Mituzas, and maintained by Henrik.

One of the interesting things I noted as I tabulated pageviews for many different pages was that the pageview counts for many already popular pages were in decline. Pages of various kinds peaked at different historical points. For instance, colors have been in decline since early 2013. The world's most populous countries have been in decline since as far back as 2010!

Defining the problem

The first thing to be clear about is what these pageviews count and what they don't. The pageview measures are taken from stats.grok.se, which in turn uses the pagecounts-raw dump provided hourly by the Wikimedia Foundation's Analytics team, which in turn is obtained by processing raw user activity logs. The pagecounts-raw measure is flawed in two ways:

  • It only counts pageviews on the main Wikipedia website and not pageviews on the mobile Wikipedia website or through Wikipedia Zero (a pared down version of the mobile site that some carriers offer at zero bandwidth costs to their customers, particularly in developing countries). To remedy these problems, a new dump called pagecounts-all-sites was introduced in September 2014. We simply don't have data for views of mobile domains or of Wikipedia Zero at the level of individual pages for before then. Moreover, stats.grok.se still uses pagecounts-raw (this was pointed to me in a mailing list message after I circulated an early version of the post).
  • The pageview count includes views by bots. The official estimate is that about 15% of pageviews are due to bots. However, the percentage is likely higher for pages with fewer overall pageviews, because bots have a minimum crawling frequency. So every page might have at least 3 bot crawls a day, resulting in a minimum of 90 bot pageviews even if there are only a handful of human pageviews.

Therefore, the trends I discuss will refer to trends in total pageviews for the main Wikipedia website, including page requests by bots, but excluding visits to mobile domains. Note that visits from mobile devices to the main site will be included, but mobile devices are by default redirected to the mobile site.

How reliable are the metrics?

As noted above, the metrics are unreliable because of the bot problem and the issue of counting only non-mobile traffic. German Wikipedia user Atlasowa left a message on my talk page pointing me to an email thread suggesting that about 40% of pageviews may be bot-related, and discussing some interesting examples.

Relationship with the overall numbers

I'll show that for many pages of interest, the number of pageviews as measured above (non-mobile) has declined recently, with a clear decline from 2013 to 2014. What about the total?

We have overall numbers for non-mobile, mobile, and combined. The combined number has largely held steady, whereas the non-mobile number has declined and the mobile number has risen.

What we'll find is that the decline for most pages that have been around for a while is even sharper than the overall decline. One reason overall pageviews haven't declined so fast is the creation of new pages. To give an idea, non-mobile traffic dropped by about 1/3 from January 2013 to December 2014, but for many leading categories of pages, traffic dropped by about 1/2-2/3.

Why is this important? First reason: better context for understanding trends for individual pages

People's behavior on Wikipedia is a barometer of what they're interested in learning about. An analysis of trends in the views of pages can provide an important window into how people's curiosity, and the way they satisfy this curiosity, is evolving. To take an example, some people have proposed using Wikipedia pageview trends to predict flu outbreaks. I myself have tried to use relative Wikipedia pageview counts to gauge changing interests in many topics, ranging from visa categories to technology companies.

My initial interest in pageview numbers arose because I wanted to track my own influence as a Wikipedia content creator. In fact, that was my original motivation with creating Wikipedia Views. (You can see more information about my Wikipedia content contributions on my site page about Wikipedia).

Now, when doing this sort of analysis for individual pages, one needs to account for, and control for, overall trends in the views of Wikipedia pages that are occurring for reasons other than a change in people's intrinsic interest in the subject. Otherwise, we might falsely conclude from a pageview count decline that a topic is falling in popularity, whereas what's really happening is an overall decline in the use of (the non-mobile version of) Wikipedia to satisfy one's curiosity about the topic.

Why is this important? Second reason: a better understanding of the overall size and growth of the Internet.

Wikipedia has been relatively mature and has had the top spot as an information source for at least the last six years. Moreover, unlike almost all other top websites, Wikipedia doesn't try hard to market or optimize itself, so trends in it reflect a relatively untarnished view of how the Internet and the World Wide Web as a whole are growing, independent of deliberate efforts to manipulate and doctor metrics.

The case of colors

Let's look at Wikipedia pages on some of the most viewed colors (I've removed the 2015 and 2007 columns because we don't have the entirety of these years). Colors are interesting because the degree of human interest in colors in general, and in individual colors, is unlikely to change much in response to news or current events. So one would at least a priori expect colors to offer a perspective into Wikipedia trends with fewer external complicating factors. If we see a clear decline here, then that's strong evidence in favor of a genuine decline.

I've restricted attention to a small subset of the colors, that includes the most common ones but isn't comprehensive. But it should be enough to get a sense of the trends. And you can add in your own colors and check that the trends hold up.

Page namePageviews in year 2014Pageviews in year 2013Pageviews in year 2012Pageviews in year 2011Pageviews in year 2010Pageviews in year 2009Pageviews in year 2008TotalPercentageTags
Black 431K 1.5M 1.3M 778K 900K 1M 958K 6.9M 16.1 Colors
Blue 710K 1.3M 1M 987K 1.2M 1.2M 1.1M 7.6M 17.8 Colors
Brown 192K 284K 318K 292K 308K 300K 277K 2M 4.6 Colors
Green 422K 844K 779K 707K 882K 885K 733K 5.3M 12.3 Colors
Orange 133K 181K 251K 259K 275K 313K 318K 1.7M 4 Colors
Purple 524K 906K 847K 895K 865K 841K 592K 5.5M 12.8 Colors
Red 568K 797K 912K 1M 1.1M 873K 938K 6.2M 14.6 Colors
Violet 56K 96K 75K 77K 69K 71K 65K 509K 1.2 Colors
White 301K 795K 615K 545K 788K 575K 581K 4.2M 9.8 Colors
Yellow 304K 424K 453K 433K 452K 427K 398K 2.9M 6.8 Colors
Total 3.6M 7.1M 6.6M 6M 6.9M 6.5M 6M 43M 100 --
Percentage 8.5 16.7 15.4 14 16 15.3 14 100 -- --
 

Since the decline appears to have happened between 2013 and 2014, let's examine the 24 months from January 2013 to December 2014:

 

MonthViews of page BlackViews of page BlueViews of page BrownViews of page GreenViews of page OrangeViews of page PurpleViews of page RedViews of page VioletViews of page WhiteViews of page YellowTotal Percentage
201412 30K 41K 14K 27K 9.6K 28K 67K 3.1K 21K 19K 260K 2.4
201411 36K 46K 15K 31K 10K 35K 50K 3.7K 23K 22K 273K 2.5
201410 37K 52K 16K 34K 10K 34K 51K 4.5K 25K 26K 289K 2.7
201409 37K 57K 16K 35K 9.9K 37K 45K 4.8K 27K 29K 298K 2.8
201408 33K 47K 14K 34K 8.5K 31K 38K 3.9K 21K 22K 253K 2.4
201407 33K 47K 14K 30K 9.3K 31K 37K 4.2K 22K 22K 250K 2.3
201406 32K 49K 14K 31K 10K 34K 39K 4.9K 23K 22K 259K 2.4
201405 44K 55K 17K 37K 10K 51K 42K 5.2K 26K 26K 314K 2.9
201404 34K 60K 17K 36K 14K 38K 47K 5.8K 27K 28K 306K 2.8
201403 37K 136K 19K 51K 14K 123K 52K 5.5K 30K 31K 497K 4.6
201402 38K 58K 19K 39K 13K 41K 49K 5.6K 29K 29K 321K 3
201401 40K 60K 19K 36K 14K 40K 50K 4.4K 27K 28K 319K 3
201312 62K 67K 17K 44K 12K 48K 48K 4.4K 42K 26K 372K 3.5
201311 141K 96K 20K 65K 11K 68K 55K 5.3K 71K 34K 566K 5.3
201310 145K 102K 21K 69K 11K 77K 59K 5.7K 71K 36K 598K 5.6
201309 98K 80K 17K 60K 11K 53K 51K 4.9K 45K 30K 450K 4.2
201308 109K 87K 20K 57K 20K 57K 60K 4.6K 53K 28K 497K 4.6
201307 107K 92K 21K 61K 11K 66K 65K 4.6K 61K 30K 520K 4.8
201306 115K 106K 22K 69K 13K 73K 64K 5.5K 70K 33K 571K 5.3
201305 158K 122K 24K 79K 14K 83K 69K 11K 77K 39K 677K 6.3
201304 151K 127K 28K 83K 14K 86K 74K 12K 78K 40K 694K 6.4
201303 155K 135K 31K 92K 15K 99K 84K 12K 80K 43K 746K 6.9
201302 152K 131K 31K 84K 28K 95K 84K 17K 77K 41K 740K 6.9
201301 129K 126K 32K 81K 19K 99K 84K 9.6K 70K 42K 691K 6.4
Total 2M 2M 476K 1.3M 314K 1.4M 1.4M 152K 1.1M 728K 11M 100
Percentage 18.1 18.4 4.4 11.8 2.9 13.3 12.7 1.4 10.2 6.8 100 --
Tags Colors Colors Colors Colors Colors Colors Colors Colors Colors Colors -- --

 

As we can see, the decline appears to have begun around March 2013 and then continued steadily till about June 2014, at which numbers stabilized to their lower levels.

A few sanity checks on these numbers:

  • The trends appear to be similar for different colors, with the notable difference that the proportional drop was higher for the more viewed color pages. Thus, for instance, black and blue saw declines from 129K and 126K to 30K and 41K respectively (factors of four and three respectively) from January 2013 to December 2014. Orange and yellow, on the other hand, dropped by factors of close to two. The only color that didn't drop significantly was red (it dropped from 84K to 67K, as opposed to factors of two or more for other colors), but this seems to have been partly due to an unusually large amount of traffic in the end of 2014. The trend even for red seems to suggest a drop similar to that for orange.
  • The overall proportion of views for different colors comports with our overall knowledge of people's color preferences: blue is overall a favorite color, and this is reflected in its getting the top spot with respect to pageviews.
  • The pageview decline followed a relatively steady trend, with the exception of some unusual seasonal fluctuation (including an increase in October and November 2013).

One might imagine that this is due to people shifting attention from the English-language Wikipedia to other language Wikipedias, but most of the other major language Wikipedias saw a similar decline at a similar time. More details are in my longer version of this post on my personal site.

Geography: continents and subcontinents, countries, and cities

Here are the views of some of the world's most populated countries between 2008 and 2014, showing that the peak happened as far back as 2010:

Page namePageviews in year 2014Pageviews in year 2013Pageviews in year 2012Pageviews in year 2011Pageviews in year 2010Pageviews in year 2009Pageviews in year 2008TotalPercentageTags
China 5.7M 6.8M 7.8M 6.1M 6.9M 5.7M 6.1M 45M 9 Countries
India 8.8M 12M 12M 11M 14M 8.8M 7.6M 73M 14.5 Countries
United States 13M 15M 18M 18M 34M 16M 15M 129M 25.7 Countries
Indonesia 5.3M 5.2M 3.7M 3.6M 4.2M 3.1M 2.5M 28M 5.5 Countries
Brazil 4.8M 4.9M 5.3M 5.5M 7.5M 4.9M 4.3M 37M 7.4 Countries
Pakistan 2.9M 4.5M 4.4M 4.3M 5.2M 4M 3.2M 28M 5.7 Countries
Bangladesh 2.2M 2.9M 3M 2.8M 2.9M 2.2M 1.7M 18M 3.5 Countries
Russia 5.6M 5.6M 6.5M 6.8M 8.6M 5.4M 5.8M 44M 8.8 Countries
Nigeria 2.6M 2.6M 2.9M 3M 3.5M 2.6M 2M 19M 3.8 Countries
Japan 4.8M 6.4M 6.5M 8.3M 10M 7.3M 6.6M 50M 10 Countries
Mexico 3.1M 3.9M 4.3M 4.3M 5.9M 4.7M 4.5M 31M 6.1 Countries
Total 59M 69M 74M 74M 103M 65M 59M 502M 100 --
Percentage 11.7 13.8 14.7 14.7 20.4 12.9 11.8 100 -- --

Of these countries, China, India and the United States are the most notable. China is the world's most populous. India has the largest population with some minimal English knowledge and legally (largely) unfettered Internet access to Wikipedia, while the United States has the largest population with quality Internet connectivity and good English knowledge. Moreover, in China and India, Internet use and access have been growing considerably in the last few years, whereas it has been relatively stable in the United States.

It is interesting that the year with the maximum total pageview count was as far back as 2010. In fact, 2010 was so significantly better than the other years that the numbers beg for an explanation. I don't have one, but even excluding 2010, we see a declining trend: gradual growth from 2008 to 2011, and then a symmetrically gradual decline. Both the growth trend and the decline trend are quite similar across countries.

We see a similar trend for continents and subcontinents, with the peak occurring in 2010. In contrast, the smaller counterparts, such as cities, peaked in 2013, similarly to colors, and the drop, though somewhat less steep than with colors, has been quite significant. For instance, a list for Indian cities shows that the total pageviews for these Indian cities declined from about 20 million in 2013 (after steady growth in the preceding years) to about 13 million in 2014.

Some niche topics where pageviews haven't declined

So far, we've looked at topics where pageviews have been declining since at least 2013, and some that peaked as far back as 2010. There are, however, many relatively niche topics where the number of pageviews has stayed roughly constant. But this stability itself is a sign of decay, because other metrics suggest that the topics have experienced tremendous growth in interest. In fact, the stability is even less impressive when we notice that it's a result of a cancellation between slight declines in views of established pages in the genre, and traffic going to new pages.

For instance, consider some charity-related pages:

Page namePageviews in year 2014Pageviews in year 2013Pageviews in year 2012Pageviews in year 2011Pageviews in year 2010Pageviews in year 2009Pageviews in year 2008TotalPercentageTags
Against Malaria Foundation 5.9K 6.3K 4.3K 1.4K 2 0 0 18K 15.6 Charities
Development Media International 757 0 0 0 0 0 0 757 0.7 Pages created by Vipul Naik Charities
Deworm the World Initiative 2.3K 277 0 0 0 0 0 2.6K 2.3 Charities Pages created by Vipul Naik
GiveDirectly 11K 8.3K 2.6K 442 0 0 0 22K 19.2 Charities Pages created by Vipul Naik
International Council for the Control of Iodine Deficiency Disorders 1.2K 1 2 2 0 1 2 1.2K 1.1 Charities Pages created by Vipul Naik
Nothing But Nets 5.9K 6.6K 6.6K 5.1K 4.4K 4.7K 6.1K 39K 34.2 Charities
Nurse-Family Partnership 2.9K 2.8K 909 30 8 72 63 6.8K 5.9 Pages created by Vipul Naik Charities
Root Capital 3K 2.5K 414 155 51 1.2K 21 7.3K 6.3 Charities Pages created by Vipul Naik
Schistosomiasis Control Initiative 4K 2.7K 1.6K 191 0 0 0 8.5K 7.4 Charities Pages created by Vipul Naik
VillageReach 1.7K 1.9K 2.2K 2.6K 97 3 15 8.4K 7.3 Charities Pages created by Vipul Naik
Total 38K 31K 19K 9.9K 4.6K 5.9K 6.2K 115K 100 --
Percentage 33.4 27.3 16.3 8.6 4 5.1 5.4 100 -- --

For this particular cluster of pages, we see the totals growing robustly year-on-year. But a closer look shows that the growth isn't that impressive. Whereas earlier, views were doubling every year from 2010 to 2013 (this was the take-off period for GiveWell and effective altruism), the growth from 2013 to 2014 was relatively small. And about half the growth from 2013 to 2014 was powered by the creation of new pages (including some pages created after the beginning of 2013, so they had more months in a mature state in 2014 than in 2013), while the other half was powered by growth in traffic to existing pages.

The data for philanthropic foundations demonstrates a fairly slow and steady growth (about 5% a year), partly due to the creation of new pages. This 5% hides a lot of variation between individual pages:

Page namePageviews in year 2014Pageviews in year 2013Pageviews in year 2012Pageviews in year 2011Pageviews in year 2010Pageviews in year 2009Pageviews in year 2008TotalPercentageTags
Atlantic Philanthropies 11K 11K 12K 10K 9.8K 8K 5.8K 67K 2.1 Philanthropic foundations
Bill & Melinda Gates Foundation 336K 353K 335K 315K 266K 240K 237K 2.1M 64.9 Philanthropic foundations
Draper Richards Kaplan Foundation 1.2K 25 9 0 0 0 0 1.2K 0 Philanthropic foundations Pages created by Vipul Naik
Ford Foundation 110K 91K 100K 90K 100K 73K 61K 625K 19.5 Philanthropic foundations
Good Ventures 9.9K 8.6K 3K 0 0 0 0 21K 0.7 Philanthropic foundations Pages created by Vipul Naik
Jasmine Social Investments 2.3K 1.8K 846 0 0 0 0 5K 0.2 Philanthropic foundations Pages created by Vipul Naik
Laura and John Arnold Foundation 3.7K 13 0 1 0 0 0 3.7K 0.1 Philanthropic foundations Pages created by Vipul Naik
Mulago Foundation 2.4K 2.3K 921 0 1 1 10 5.6K 0.2 Philanthropic foundations Pages created by Vipul Naik
Omidyar Network 26K 23K 19K 17K 19K 13K 11K 129K 4 Philanthropic foundations
Peery Foundation 1.8K 1.6K 436 0 0 0 0 3.9K 0.1 Philanthropic foundations Pages created by Vipul Naik
Robert Wood Johnson Foundation 26K 26K 26K 22K 27K 22K 17K 167K 5.2 Philanthropic foundations
Skoll Foundation 13K 11K 9.2K 7.8K 9.6K 5.8K 4.3K 60K 1.9 Philanthropic foundations
Smith Richardson Foundation 8.7K 3.5K 3.8K 3.6K 3.7K 3.5K 2.9K 30K 0.9 Philanthropic foundations
Thiel Foundation 3.6K 1.5K 1.1K 47 26 1 0 6.3K 0.2 Philanthropic foundations Pages created by Vipul Naik
Total 556K 533K 511K 466K 435K 365K 340K 3.2M 100 --
Percentage 17.3 16.6 15.9 14.5 13.6 11.4 10.6 100 -- --

 

The dominant hypothesis: shift from non-mobile to mobile Wikipedia use

The dominant hypothesis is that pageviews have simply migrated from non-mobile to mobile. This is most closely borne by the overall data: total pageviews have remained roughly constant, and the decline in total non-mobile pageviews has been roughly canceled by growth in mobile pageviews. However, the evidence for this substitution doesn't exist at the level of individual pages because we don't have pageview data for the mobile domain before September 2014, and much of the decline occurred between March 2013 and June 2014.

What would it mean if there were an approximate one-on-one substitution from non-mobile to mobile for the page types discussed above? For instance, non-mobile traffic to colors dropped to somewhere between 1/3 and 1/2 of their original traffic level between January 2013 and December 2014. This would mean that somewhere between 1/2 and 2/3 of the original non-mobile traffic to colors has shifted to mobile devices. This theory should be at least partly falsifiable: if the sum of traffic to non-mobile and mobile platforms today for colors is less than non-mobile-only traffic in January 2013, then clearly substitution is only part of the story.

Although the data is available, it's not currently in an easily computable form, and I don't currently have the time and energy to extract it. I'll update this once the data on all pageviews since September 2014 is available on stats.grok.se or a similar platform.

Other hypotheses

The following are some other hypotheses for the pageview decline:

  1. Google's Knowledge Graph: This is the hypothesis raised in Wikipediocracy, the Daily Dot, and the Register. The Knowledge Graph was introduced in 2012. Through 2013, Google rolled out snippets (called Knowledge Cards and Knowledge Panels) based on the Knowledge Graph in its search results. So if, for instance, you only wanted the birth date and nationality of a musician, Googling would show you that information right in the search results and you wouldn't need to click through to the Wikipedia page. I suspect that the Knowledge Graph played some role in the decline for colors seen between March 2013 and June 2014. On the other hand, many of the pages that saw a decline don't have any search snippets based on the Knowledge Graph, and therefore the decline for those pages cannot be explained this way.
  2. Other means of accessing Wikipedia's knowledge that don't involve viewing it directly: For instance, Apple's Siri tool uses data from Wikipedia, and people making queries to this tool may get information from Wikipedia without hitting the encyclopedia. The usage of such tools has increased greatly starting in late 2012. Siri itself was released with the third generation iPad in September 2012 and became part of the iPhone released the next month. Since then, it has shipped with all of Apple's mobile devices and tablets.
  3. Substitution away from Wikipedia to other pages that are becoming more search-optimized and growing in number: For many topics, Wikipedia may have been clearly the best information source a few years back (as judged by Google), but the growth of niche information sources, as well as better search methods, have displaced it from its undisputed leadership position. I think there's a lot of truth to this, but it's hard to quantify.
  4. Substitution away from coarser, broader pages to finer, narrower pages within Wikipedia: While this cannot directly explain an overall decline in pageviews, it can explain a decline in pageviews for particular kinds of pages. Indeed, I suspect that this is partly what's going on with the early decline of pageviews (e.g., the decline in pageviews of countries and continents starting around 2010, as people go directly to specialized articles related to the particular aspects of those countries or continents they are interested in).
  5. Substitution to Internet use in other languages: This hypothesis doesn't seem borne out by the simultaneous decline in pageviews for the English, French, and Spanish Wikipedia, as documented for the color pages.

It's still a mystery

I'd like to close by noting that the pageview decline is still very much a mystery as far as I am concerned. I hope I've convinced you that (a) the mystery is genuine, (b) it's important, and (c) although the shift to mobile is probably the most likely explanation, we don't yet have clear evidence. I'm interested in hearing whether people have alternative explanations, and/or whether they have more compelling arguments for some of the explanations proffered here.

Tentative tips for people engaged in an exercise that involves some form of prediction or forecasting

5 VipulNaik 30 July 2014 05:24AM

Note: This is the concluding post of my LessWrong posts related to my forecasting work for MIRI. There are a few items related to forecasting that I didn't get time to look into and might return to later. I might edit this post to include references to those posts if I get to them later.

I've been looking at forecasting in different domains as part of work for the Machine Intelligence Research Institute (MIRI). I thought I'd draw on whatever I've learned to write up advice for people engaged in any activity that involves making forecasts. This could include a wide range of activities, including those that rely on improving the accuracy of predictions in highly circumscribed contexts (such as price forecasting or energy use forecasting) as well as those that rely on trying to determine the broad qualitative contours of possible scenarios.

The particular application of interest to MIRI is forecasting AI progress, leading up to (but not exclusively focused on) the arrival of AGI. I will therefore try to link my general tips with thoughts on how it applies to forecasting AI progress. That being said, I hope that what I say here will have wider interest and appeal.

If you're interested in understanding the state of the art with respect to forecasting AI progress specifically, consider reading Luke Muehlhauser's summary of the state of knowledge on when AI will be created. The post was written in May 2013, and there have been a couple of developments since then, including:

#1: Appreciate that forecasting is hard

It's hard to make predictions, especially about the future (see also more quotes here). Forecasting is a difficult job along many dimensions. Apart from being difficult, it's also a job where feedback is far from immediate. This holds more true as the forecasting horizon becomes wider (for lists of failed predictions made in the past, see here and here). Fortunately, a fair amount has been discovered about forecasting in general, and you can learn from the experience of people trying to make forecasts in many different domains.

Philip Tetlock's work on expert political judgment, whose conclusions he described here, and that I discussed in my post on the historical evaluations of forecasting, shows that at least in the domain of political forecasting, experts often don't do a much better job than random guesses, and even the experts who do well rarely do better than simple trend extrapolation. Not only do experts fail to do well, they are also poorly calibrated as to the quality of forecasts.

Even in cases where experts are right about the median or modal scenario, they often fail to both estimate and communicate forecast uncertainty.

The point that forecasting is hard, and should be approached with humility, will be repeated throughout this post, in different contexts.

#2: Avoid the "not invented here" fallacy, and learn more about forecasting across a wide range of different domains

The not invented here fallacy refers to people's reluctance to use tools developed outside of their domain or organization. In the context of forecasting, it's quite common. For instance, climate scientists have been accused of not following forecasting principles. The reaction of some of them has been along the lines of "why should we listen to forecasters, when they don't understand any climate science?" (more discussion of that response here, see also a similar answer on Quora). Moreover, it's not enough to only listen to outsiders who treat you with respect. The point of listening to and learning from other domains isn't to be generous to people in those domains, but to understand and improve one's own work (in this case, forecasting work).

There are some examples of successful importation of forecasting approaches from one domain to another. One example is the ideas developed for forecasting rare events, as I discussed in this post. Power laws for some rare phenomena, such as earthquakes, have been around for a while. Aaron Clauset and his co-authors have recently applied the same mathematical framework of power laws to other types of rare events, including terrorist attacks.

Evaluating AI progress forecasting on this dimension: My rough impression is that AI progress forecasting tends to be insular, learning little from other domains. While I haven't seen a clear justification from AI progress forecasters, the typical arguments I've seen are the historical robustness of Moore's law and the idea that the world of technology is fundamentally different from the world of physical stuff.

I think that future work on AI progress forecasting should explicitly consider forecasting problems in domains other than computing, and explicitly explain what lessons cross-apply and what don't, and why. I don't mean that all future work should consider all other domains. I just mean that at least some future work should consider at least some other domains.

#3: Start by reading a few really good general-purpose overviews

Personally, I would highlight Nate Silver's The Signal and the Noise. Silver's book is quite exceptional in the breadth of topics it covers, the clarity of its presentation, and the easy toggling between general principles and specific instances. Silver's book comfortably combines ideas from statistics, data mining, machine learning, predictive analytics, and forecasting. Not only would I recommend reading it quickly when you're starting out, I would also recommend returning to specific chapters of the book later if they cover topics that interest you. I personally found the book a handy reference (and quoted extensively from it) when writing LessWrong posts about forecasting domains that the book has covered.

Other books commonly cited are Tetlock's Expert Political Judgment and the volume Principles of Forecasting edited by J. Scott Armstrong, and contributed to by several forecasters. I believe both these books are good, but I'll be honest: I haven't read them, although I have read summaries of the books and shorter works by the authors describing the main points. I believe that you can similarly get the bulk of the value of Tetlock's work by reading his article for Cato Unbound co-authored with Dan Gardner, that I discussed here. For the principles of forecasting, see #4 below.

Evaluating AI progress forecasting on this dimension: There seems to be a lot of focus on a few AI-related and computing-related futurists, such as Ray Kurzweil. I do think the focus should be widened, and getting an understanding of general challenges related to forecasting is a better starting point than reading The Singularity is Near. That said, the level of awareness among MIRI and LessWrong people about the work of Silver, Armstrong, and Tetlock definitely seems higher than among the general public or even among the intelligentsia. I should also note that Luke Muehlhauser was the person who first pointed me to J. Scott Armstrong, and he's referenced Tetlock's work frequently.

#4: Understand key concepts and distinctions in forecasting, and review the literature and guidelines developed by the general-purpose forecasting community

In this post, I provided an overview of different kinds of forecasting, and also included names of key people, key organizations, key journals, and important websites. I would recommend reading that to get a general sense, and then proceeding to the Forecasting Principles website (though, fair warning: the website's content management system is a mess, and in particular, you might find a lot of broken links). Here's their full list of 140 principles, along with discussion of the evidence base for each principle. However, see also point #5 below.

#5: Understand some alternatives to forecasting, specifically scenario analysis and futures studies

If you read the literature commonly classified as "forecasting" in academia, you will find very little mention of scenario analysis and futures studies. Conversely, the literature on scenario analysis and futures studies rarely cites the general-purpose forecasting literature. But the actual "forecasting" exercise you intend to engage in may be better suited to scenario analysis than to forecasting. Or you might find that the methods of futures studies are a closer fit for what you are trying to achieve. Or you might try to use a mix of techniques.

Broadly, scenario analysis becomes more important when there is more uncertainty, and when it's important to be prepared for a wider range of eventualities. This matters more as we move to longer time horizons for forecasting. I discussed scenario analysis in this post, where I also speculate on possible reasons for the lack of overlap with the forecasting community.

Futures studies is closely related to scenario analysis (in fact, scenario analysis can be considered a method of futures studies) but the futures studies field has a slightly different flavor. I looked at the field of futures studies in this post.

It could very well be the case that you find the ideas of scenario analysis and futures studies inappropriate for the task at hand. But such a decision should be made only after acquiring a reasonable understanding of the methods.

Some other domains that might be better suited to the problem at hand include predictive analytics, predictive modeling, data mining, machine learning, and risk analysis. I haven't looked into any of these in depth in connection with my MIRI project (I've been reading up on machine learning for other work, and have been and will be posting about it on LessWrong but that's independent of my MIRI work).

Evaluating AI progress forecasting on this dimension: I think a reasonable case can be made that the main goals of AI progress forecasting are better met through scenario analysis. I discussed this in detail in this post.

#6: Examine forecasting in other domains, including domains that do not seem to be related to your domain at the object level

This can be thought of as a corollary to #2. Chances are, if you have read Nate Silver and some of the other sources, your curiosity about forecasting in other domains has already been piqued. General lessons about human failure and error may cross-apply between domains, even if the object-level considerations are quite different.

In addition to Silver's book, I recommend taking a look at some of my own posts on forecasting in various domains. These posts are based on rather superficial research, so please treat them only as starting points.

General:

Some domain-specific posts:

I also did some additional posts on climate science as a case study in forecasting. I have paused the exercise due to time and ability limitations, but I think the posts so far might be useful:

 

#7: Consider setting up data collection using best practices early on

Forecasting works best when we have a long time series of data to learn from. So it's best to set up data collection as quickly as possible, and use good practices in setting it up. Data about the present or recent past may be cheap to collect now, but could be hard to collect a few decades from now. We don't want to be spending our time two decades later figuring out how to collect data (and adjudicating disputes about the accuracy of data) if we could collect and archive the data in a stable repository right now.

If your organization is too small to do primary data collection, find another organization that engages in the data collection activities, and make sure you archive the data they collect, so that the data is available to you even if that organization stops operating.

Evaluating AI progress forecasting on this dimension: I think that there are some benefits from creating standardized records and measurements of the current state of AI and the quality of the current hardware and software. That said, there do exist plenty of reasonably standardized measurements already in these domains. There is little danger of this information completely disappearing, so that the project of combining and integrating them into a big picture is important but not time-sensitive. Hardware progress and specs are already well-documented, and we can get time series at places such as the Performance Curve Database. Software progress and algorithmic progress have also been reasonably well-recorded, as described by Katja Grace in her review for MIRI of algorithmic progress in six domains.

#8: Consider recording forecasts and scenarios, and the full reasoning or supporting materials

It's not just useful to have data from the past, it's also useful to have forecasts made based on past data and see how they compared to what actually transpired. The problem with forecasts is even worse than with data: if two decades later we want to know what one would have predicted using the data that is available right now, we simply cannot do that unless we make and record the predictions now. (We could do it in principle by imagining that we don't have access to the intermediate data. But in practice, people can find it hard to avoid being influenced by their knowledge of what has transpired in the interim when they build and tune their models). Retrodictions and hindcasts are useful for analysis and diagnosis, but they ultimately do not provide a convincing independent test of the model being used to make forecasts.

Evaluating AI progress forecasting on this dimension: See the link suggestions for recent work on AI progress forecasting at the beginning of the post.

The remaining points are less important and more tentative. I've included for completeness' sake.

#9: Evaluate how much expertise the domain experts have in forecasting

In some cases, domain experts also have expertise in making forecasts. In other cases, the relationship between domain expertise and the ability to make forecasts, or even to calibrate one's own forecast accuracy, is tenuous. I discussed the issue of how much deference to give to domain experts in this post.

#10: Use best practices from statistical analysis, computer programming, software engineering, and economics

Wherever using these disciplines, use them well. Statistical analysis arises in quantitative forecasting and prediction. Computer programming is necessary for setting up prediction markets or carrying out time series forecasting or machine learning with large data sets or computationally intensive algorithms. Software engineering is necessary once the computer programs exceed a basic level of complexity, or if they need to survive over the long term. Insights from economics and finance may be necessary for designing effective prediction markets or other tools to incentivize people to make accurate predictions and minimize their chances of gaming the system in ways detrimental to prediction accuracy.

The insularity critique of climate science basically accused the discipline of not doing this.

What if your project is too small and you don't have access to expertise in these domains? Often, a very cursory, crude analysis can be helpful in ballparking the situation. As I described in my historical evaluations of forecasting, the Makridakis Competitions provide evidence in favor of the hypothesis that simple models tend to perform quite well, although the correctly chosen complex models can outperform simple ones under special circumstances (see also here). So keeping it simple to begin with is fine. However, the following caveats should be noted:

  • Even "simple" models and setups can benefit from overview by somebody with subject matter expertise. The overviews can be fairly quick, but they still help. For instance, after talking to a few social scientists, I realized the perils of using simple linear regression for time series data. This isn't a deep point, but it can elude even a smart and otherwise knowledgeable person who hasn't thought much about the specific tools.
  • The limitations of the model, and the uncertainty in the associated forecast, should be clearly noted (see my post on communicating forecast uncertainty).

Evaluating AI progress forecasting on this dimension: I think that AI progress forecasting is at too early a stage to get into using detailed statistical analysis or software, so using simple models and getting feedback from experts, while noting potential weaknesses, seems like a good strategy.

#11: Consider carefully the questions of openness of data, practices, supporting code, and internal debate

While confidentiality and anonymity are valuable in some contexts, openness and transparency are good antidotes to errors that arise due to insufficient knowledge and groupthink (such as the types of problems I noted in my post on the insularity critique of climate science).

#12: Consider ethical issues related to forecasting, such as the waysyour forecasting exercise can influence real-world decisions and outcomes

This is a topic I intended to look into more but didn't get time to. I've collected a few links for interested parties:

Claim: Scenario planning is preferable to quantitative forecasting for understanding and coping with AI progress

1 VipulNaik 25 July 2014 03:43AM

As part of my work for MIRI on forecasting, I'm considering the implications of what I've read up for the case of thinking about AI. My purpose isn't to actually come to concrete conclusions about AI progress, but more to provide insight into what approaches are more promising and what approaches are less promising for thinking about AI progress.

I've written a post on general-purpose forecasting and another post on scenario analysis. In a recent post, I considered scenario analyses for technological progress. I've also looked at many domains of forecasting and at forecasting rare events. With the knowledge I've accumulated, I've shifted in the direction of viewing scenario analysis as a more promising tool than timeline-driven quantitative forecasting for understanding AI and its implications.

I'll first summarize what I mean by scenario analysis and quantitative forecasting in the AI context. People who have some prior knowledge of the terms can probably skim through the summary quickly. Those who find the summary insufficiently informative, or want to delve deeper, are urged to read my more detailed posts linked above and the references therein.

Quantitative forecasting and scenario analysis in the AI context

The two approaches I am comparing are:

  • Quantitative forecasting: Here, specific predictions or forecasts are made, recorded, and later tested against what actually transpired. The forecasts are made in a form where it's easy to score whether they happened. Probabilistic forecasts are also included. These are scored using one of the standard methods to score probabilistic forecasts (such as logarithmic scoring or quadratic scoring).
  • Scenario analysis: A number of scenarios of how the future might unfold are generated in considerable detail. Predetermined elements, common to the scenario, are combined with critical uncertainties, that vary between the scenarios. Early indicators that help determine which scenario will transpire are identified. In many cases, the goal is to choose strategies that are robust to all scenarios. For more, read my post on scenario analysis.

Quantitative forecasts are easier to score for accuracy, and in particular offer greater scope for falsification. This has perhaps attracted rationalists more to quantitative forecasting, as a way of distinguishing themselves from what appears to be the more wishy-washy realm of unfalsifiable scenario analysis. In this post, I argue that, given the considerable uncertainty surrounding progress in artificial intelligence, scenario analysis is a more apt tool.

There are probably some people on LessWrong who have high confidence in quantitative forecasts. I'm happy to make bets (financial or purely honorary) on such subjects. However, if you're claiming high certainty while I am claiming uncertainty, I do want to have odds in my favor (depending on how much confidence you express in your opinion), for reasons similar to those that Bryan Caplan described here.

Below, I describe my reasons for preferring scenario analysis to forecasting.

#1: Considerable uncertainty

Proponents of the view that AI is scheduled to arrive in a few decades typically cite computing advances such as Moore's law. However, there's considerable uncertainty even surrounding short-term computing advances, as I described in my scenario analyses for technological progress. When it comes to the question of progress in AI, we have to combine uncertainties in hardware progress with uncertainties in software progress.

Quantitative forecasting methods, such as trend extrapolation, tend to do reasonably well, and might be better than nothing. But they are not foolproof. In particular, the impending death of Moore's law, despite the trend staying quite robust for about 50 years, should make us cautious about too naive an extrapolation of trends. Arguably, simple trend extrapolation is still the best choice relative to other forecasting methods, at least as a general rule. But acknowledging uncertainty and considering multiple scenarios could prepare us a lot better for reality.

In a post in May 2013 titled When Will AI Be Created?, MIRI director Luke Muehlhauser (who later assigned me the forecasting project) looked at the wide range of beliefs about the time horizon for the arrival of human-level AI. Here's how Luke described the situation:

To explore these difficulties, let’s start with a 2009 bloggingheads.tv conversation between MIRI researcher Eliezer Yudkowsky and MIT computer scientist Scott Aaronson, author of the excellent Quantum Computing Since Democritus. Early in that dialogue, Yudkowsky asked:

It seems pretty obvious to me that at some point in [one to ten decades] we’re going to build an AI smart enough to improve itself, and [it will] “foom” upward in intelligence, and by the time it exhausts available avenues for improvement it will be a “superintelligence” [relative] to us. Do you feel this is obvious?

Aaronson replied:

The idea that we could build computers that are smarter than us… and that those computers could build still smarter computers… until we reach the physical limits of what kind of intelligence is possible… that we could build things that are to us as we are to ants — all of this is compatible with the laws of physics… and I can’t find a reason of principle that it couldn’t eventually come to pass…

The main thing we disagree about is the time scale… a few thousand years [before AI] seems more reasonable to me.

Those two estimates — several decades vs. “a few thousand years” — have wildly different policy implications.

After more discussion of AI forecasts as well as some general findings on forecasting, Luke continues:

Given these considerations, I think the most appropriate stance on the question “When will AI be created?” is something like this:

We can’t be confident AI will come in the next 30 years, and we can’t be confident it’ll take more than 100 years, and anyone who is confident of either claim is pretending to know too much.

How confident is “confident”? Let’s say 70%. That is, I think it is unreasonable to be 70% confident that AI is fewer than 30 years away, and I also think it’s unreasonable to be 70% confident that AI is more than 100 years away.

This statement admits my inability to predict AI, but it also constrains my probability distribution over “years of AI creation” quite a lot.

I think the considerations above justify these constraints on my probability distribution, but I haven’t spelled out my reasoning in great detail. That would require more analysis than I can present here. But I hope I’ve at least summarized the basic considerations on this topic, and those with different probability distributions than mine can now build on my work here to try to justify them.

I believe that in the face of this considerable uncertainty, considering multiple scenarios, and the implications of each scenario, can be quite helpful.

#2: Isn't scenario analysis unfalsifiable, and therefore unscientific? Why not aim for rigorous quantitative forecasting instead, that can be judged against reality?

First off, just because a forecast is quantitative doesn't mean it is actually rigorous. I think it's worthwhile to elicit and record quantitative forecasts. These can have high value for near-term horizons, and can provide a rough idea of the range of opinion for longer timescales.

However, simply phoning up experts to ask them for their timelines, or sending them an Internet survey, is not too useful. Tetlock's work, described in Muehlhauser's post and in my post on historical evaluations of forecasting, shows that unaided expert judgment has little value. Asking people who haven't thought through the issue to come up with numbers can give a fake sense of precision with little accuracy (and little genuine precision, either, if we consider the diverse range of responses from different experts). On the other hand, eliciting detailed scenarios from experts can force them to think more clearly about the issues and the relationships between them. Note that there are dangers to eliciting detailed scenarios: people may fall into their own make-believe world. But I think the trade-off with the uncertainty in quantitative forecasting still points in favor of scenario analysis.

Explicit quantitative forecasts can be helpful when people have an opportunity to learn from wrong forecasts and adjust their methodology accordingly. Therefore, I argue that if we want to go down the quantitative forecasting route, it's important to record forecasts about the near and medium future instead of or in addition to forecasts about the far future. Also, providing experts some historical information and feedback at the time they make their forecasts can help reduce the chances of them simply saying things without reflecting. Depending on the costs of recording forecasts, it may be worthwhile to do so anyway, even if we don't have high hopes that the forecasts will yield value. Broadly, I agree with Luke's suggestions:

  • Explicit quantification: “The best way to become a better-calibrated appraiser of long-term futures is to get in the habit of making quantitative probability estimates that can be objectively scored for accuracy over long stretches of time. Explicit quantification enables explicit accuracy feedback, which enables learning.”
  • Signposting the future: Thinking through specific scenarios can be useful if those scenarios “come with clear diagnostic signposts that policymakers can use to gauge whether they are moving toward or away from one scenario or another… Falsifiable hypotheses bring high-flying scenario abstractions back to Earth.”13
  • Leveraging aggregation: “the average forecast is often more accurate than the vast majority of the individual forecasts that went into computing the average…. [Forecasters] should also get into the habit that some of the better forecasters in [an IARPA forecasting tournament called ACE] have gotten into: comparing their predictions to group averages, weighted-averaging algorithms, prediction markets, and financial markets.” See Ungar et al. (2012) for some aggregation-leveraging results from the ACE tournament.

But I argue that the bulk of the effort should go into scenario generation and scenario analysis. Even here, the problem of absence of feedback is acute: we can design scenarios all we want for what will happen over the next century, but we can't afford to wait a century to know if our scenarios transpired. Therefore, it makes sense to break the scenario analysis exercises into chunks of 10-15 years. For instance, one scenario analysis could consider scenarios for the next 10-15 years. For each of the scenarios, we can have a separate scenario analysis exercise that considers scenarios for the 10-15 years after that. And so on. Note that the number of scenarios increases exponentially with the time horizon, but this is simply a reflection of the underlying complexity and uncertainty. In some cases, scenarios could "merge" at later times, as scenarios with slow early progress and fast later progress yield the same end result that scenario with fast early progress and slow later progress do.

#3: Evidence from other disciplines

Explicit quantitative forecasting is common in many disciplines, but the more we look at longer time horizons, and the more uncertainty we are dealing with, the more common scenario analysis becomes. I considered many examples of scenario analysis in my scenario analysis post. As you'll see from the list there, scenario analysis, and variants of it, have become influential in areas ranging from climate change (as seen in IPCC reports) to energy to macroeconomic and fiscal analysis to land use and transportation analysis. And big consulting companies such as McKinsey & Company use scenario analysis frequently in their reports.

It's of course possible to argue that the use of scenario analyses is a reflection of human failing: people don't want to make single forecasts because they are afraid of being proven wrong, or of contradicting other people's beliefs about the future. Or maybe people are shy of thinking quantitatively. I think there is some truth to such a critique. But until we have human-level AI, we have to rely on the failure-prone humans for input on the question of AI progress. Perhaps scenario analysis is superior to quantitative forecasting because humans are insufficiently rational, but to the extent it's superior, it's superior.

Addendum: What are the already existing scenario analyses for artificial intelligence?

I had a brief discussion with Luke Muehlhauser and some of the names below were suggested by him, but I didn't run the final list by him. All responsibility for errors is mine.

To my knowledge (and to the knowledge of people I've talked to) there are no formal scenario analyses of Artificial General Intelligence structured in a manner similar to the standard examples of scenario analyses. However, if scenario analysis is construed sufficiently loosely as a discussion of various predetermined elements and critical uncertainties and a brief mention of different possible scenarios, then we can list a few scenario analyses:

  • Nick Bostrom's book Superintelligence (released in the UK and on Kindle, but not released as a print book in the US at the time of this writing) discusses several scenarios for paths to AGI.
  • Eliezer Yudkowsky's report on Intelligence Explosion Microeconomics (93 pages, direct PDF link) can be construed as an analysis of AI scenarios.
  • Robin Hanson's forthcoming book on em economics discusses one future scenario that is somewhat related to AI progress.
  • The Hanson-Yudkowsky AI Foom debate includes a discussion of many scenarios.

The above are scenario analyses for the eventual properties and behavior of an artificial general intelligence, rather than scenario analyses for the immediate future. The work of Ray Kurzwzeil can be thought of as a scenario analysis that lays out an explicit timeline from now to the arrival of AGI.

[QUESTION]: Looking for insights from machine learning that helped improve state-of-the-art human thinking

3 VipulNaik 25 July 2014 02:10AM

This question is a follow-up of sorts to my earlier question on academic social science and machine learning.

Machine learning algorithms are used for a wide range of prediction tasks, including binary (yes/no) prediction and prediction of continuous variables. For binary prediction, common models include logistic regression, support vector machines, neural networks, and decision trees and forests.

Now, I do know that methods such as linear and logistic regression, and other regression-type techniques, are used extensively in science and social science research. Some of this research looks at the coefficients of such a model and then re-interprets them.

I'm interesting in examples where knowledge of the insides of other machine learning techniques (i.e., knowledge of the parameters for which the models perform well) has helped provide insights that are of direct human value, or perhaps even directly improved unaided human ability. In my earlier post, I linked to an example (courtesy Sebastian Kwiatkowski) where the results of  naive Bayes and SVM classifiers for hotel reviews could be translated into human-understandable terms (namely, reviews that mentioned physical aspects of the hotel, such as "small bedroom", were more likely to be truthful than reviews that talked about the reasons for the visit or the company that sponsored the visit).

PS: Here's a very quick description of how these supervised learning algorithms work. We first postulate a functional form that describes how the output depends on the input. For instance, the functional form in the case of logistic regression outputs the probability as the logistic function applied to a linear combination of the inputs (features). The functional form has a number of unknown parameters. Specific values of the parameters give specific functions that can be used to make predictions. Our goal is to find the parameter values.

We use a huge amount of labeled training data, plus a cost function (which itself typically arises from a statistical model for the nature of the error distribution) to find the parameter values. In the crudest form, this is purely a multivariable calculus optimization problem: choose parameters so that the total error function between the predicted function values and the observed function values is as small as possible. There are a few complications that need to be addressed to get to working algorithms.

So what makes machine learning problems hard? There are a few choice points:

  1. Feature selection: Figuring out the inputs (features) to use in predicting the outputs.
  2. Selection of the functional form model
  3. Selection of the cost function (error function)
  4. Selection of the algorithmic approach used to optimize the cost function, addressing the issue of overfitting through appropriate methods such as regularization and early stopping.

Of these steps, (1) is really the only step that is somewhat customized by domain, but even here, when we have enough data, it's more common to just throw in lots of features and see which ones actually help with prediction (in a regression model, the features that have predictive power will have nonzero coefficients in front of them, and removing them will increase the overall error of the model). (2) and (3) are mostly standardized, with our choice really being between a small number of differently flavored models (logistic regression, neural networks, etc.). (4) is the part where much of the machine learning research is concentrated: figuring out newer and better algorithms to find (approximate) solutions to the optimization problems for particular mathematical structures of the data.

 

[QUESTION]: Academic social science and machine learning

11 VipulNaik 19 July 2014 03:13PM

I asked this question on Facebook here, and got some interesting answers, but I thought it would be interesting to ask LessWrong and get a larger range of opinions. I've modified the list of options somewhat.

What explains why some classification, prediction, and regression methods are common in academic social science, while others are common in machine learning and data science?

For instance, I've encountered probit models in some academic social science, but not in machine learning.

Similarly, I've encountered support vector machines, artificial neural networks, and random forests in machine learning, but not in academic social science.

The main algorithms that I believe are common to academic social science and machine learning are the most standard regression algorithms: linear regression and logistic regression.

Possibilities that come to mind:

(0) My observation is wrong and/or the whole question is misguided.

(1) The focus in machine learning is on algorithms that can perform well on large data sets. Thus, for instance, probit models may be academically useful but don't scale up as well as logistic regression.

(2) Academic social scientists take time to catch up with new machine learning approaches. Of the methods mentioned above, random forests and support vector machines was introduced as recently as 1995. Neural networks are older but their practical implementation is about as recent. Moreover, the practical implementations of these algorithm in the standard statistical softwares and packages that academics rely on is even more recent. (This relates to point (4)).

(3) Academic social scientists are focused on publishing papers, where the goal is generally to determine whether a hypothesis is true. Therefore, they rely on approaches that have clear rules for hypothesis testing and for establishing statistical significance (see also this post of mine). Many of the new machine learning approaches don't have clearly defined statistical approaches for significance testing. Also, the strength of machine learning approaches is more exploratory than testing already formulated hypotheses (this relates to point (5)).

(4) Some of the new methods are complicated to code, and academic social scientists don't know enough mathematics, computer science, or statistics to cope with the methods (this may change if they're taught more about these methods in graduate school, but the relative newness of the methods is a factor here, relating to (2)).

(5) It's hard to interpret the results of fancy machine learning tools in a manner that yields social scientific insight. The results of a linear or logistic regression can be interpreted somewhat intuitively: the parameters (coefficients) associated with individual features describe the extent to which those features affect the output variable. Modulo issues of feature scaling, larger coefficients mean those features play a bigger role in determining the output. Pairwise and listwise R^2 values provide additional insight on how much signal and noise there is in individual features. But if you're looking at a neural network, it's quite hard to infer human-understandable rules from that. (The opposite direction is not too hard: it is possible to convert human-understandable rules to a decision tree and then to use a neural network to approximate that, and add appropriate fuzziness. But the neural networks we obtain as a result of machine learning optimization may be quite different from those that we can interpret as humans). To my knowledge, there haven't been attempts to reinterpret neural network results in human-understandable terms, though Sebastian Kwiatkowski's comment on my Facebook post points to an example where the results of  naive Bayes and SVM classifiers for hotel reviews could be translated into human-understandable terms (namely, reviews that mentioned physical aspects of the hotel, such as "small bedroom", were more likely to be truthful than reviews that talked about the reasons for the visit or the company that sponsored the visit). But Kwiatkowski's comment also pointed to other instances where the machine's algorithms weren't human-interpretable.

What's your personal view on my main question, and on any related issues?

How deferential should we be to the forecasts of subject matter experts?

12 VipulNaik 14 July 2014 11:41PM

This post explores the question: how strongly should we defer to predictions and forecasts made by people with domain expertise? I'll assume that the domain expertise is legitimate, i.e., the people with domain expertise do have a lot of information in their minds that non-experts don't. The information is usually not secret, and non-experts can usually access it through books, journals, and the Internet. But experts have more information inside their head, and may understand it better. How big an advantage does this give them in forecasting?

Tetlock and expert political judgment

In an earlier post on historical evaluations of forecasting, I discussed Philip E. Tetlock's findings on expert political judgment and forecasting skill, and summarized his own article for Cato Unbound co-authored with Dan Gardner that in turn summarized the themes of the book:

  1. The average expert’s forecasts were revealed to be only slightly more accurate than random guessing—or, to put more harshly, only a bit better than the proverbial dart-throwing chimpanzee. And the average expert performed slightly worse than a still more mindless competition: simple extrapolation algorithms that automatically predicted more of the same.
  2. The experts could be divided roughly into two overlapping yet statistically distinguishable groups. One group (the hedgehogs) would actually have been beaten rather soundly even by the chimp, not to mention the more formidable extrapolation algorithm. The other (the foxes) would have beaten the chimp and sometimes even the extrapolation algorithm, although not by a wide margin.
  3. The hedgehogs tended to use one analytical tool in many different domains; they preferred keeping their analysis simple and elegant by minimizing “distractions.” These experts zeroed in on only essential information, and they were unusually confident—they were far more likely to say something is “certain” or “impossible.” In explaining their forecasts, they often built up a lot of intellectual momentum in favor of their preferred conclusions. For instance, they were more likely to say “moreover” than “however.”
  4. The foxes used a wide assortment of analytical tools, sought out information from diverse sources, were comfortable with complexity and uncertainty, and were much less sure of themselves—they tended to talk in terms of possibilities and probabilities and were often happy to say “maybe.” In explaining their forecasts, they frequently shifted intellectual gears, sprinkling their speech with transition markers such as “although,” “but,” and “however.”
  5. It's unclear whether the performance of the best forecasters is the best that is in principle possible.
  6. This widespread lack of curiosity—lack of interest in thinking about how we think about possible futures—is a phenomenon worthy of investigation in its own right.

Tetlock has since started The Good Judgment Project (website, Wikipedia), a political forecasting competition where anybody can participate, and with a reputation of doing a much better job at prediction than anything else around. Participants are given a set of questions and can basically collect freely available online information (in some rounds, participants were given additional access to some proprietary data). They then use that to make predictions. The aggregate predictions are quite good. For more information, visit the website or see the references in the Wikipedia article. In particular, this Economist article and this Business Insider article are worth reading. (I discussed the GJP and other approaches to global political forecasting in this post).

So at least in the case of politics, it seems that amateurs, armed with basic information plus the freedom to look around for more, can use "fox-like" approaches and do a better job of forecasting than political scientists. Note that experts still do better than ignorant non-experts who are denied access to information. But once you have basic knowledge and are equipped to hunt more down, the constraining factor does not seem to be expertise, but rather, the approach you use (fox-like versus hedgehog-like). This should not be taken as a claim that expertise is irrelevant or unnecessary to forecasting. Experts play an important role in expanding the scope of knowledge and methodology that people can draw on to make their predictions. But the experts themselves, as people, do not have a unique advantage when it comes to forecasting.

Tetlock's research focused on politics. But the claim that the fox-hedgehog distinction turns out to be a better prediction of forecasting performance than the level of expertise is a general one. How true is this claim in domains other than politics? Domains such as climate science, economic growth, computing technology, or the arrival of artificial general intelligence?

Armstrong and Green again

J. Scott Armstrong is a leading figure in the forecasting community. Along with Kesten C. Green, he penned a critique of the forecasting exercises in climate science in 2007, with special focus on the IPCC reports. I discussed the critique at length in my post on the insularity critique of climate science. Here, I quote a part from the introduction of the critique that better explains the general prior that Armstrong and Green claim to be bringing to the table when they begin their evaluation. Of the points they make at the beginning, two bear directly on the deference we should give to expert judgment and expert consensus:

  • Unaided judgmental forecasts by experts have no value: This applies whether the opinions are expressed in words, spreadsheets, or mathematical models. It applies regardless of how much scientific evidence is possessed by the experts. Among the reasons for this are:
    a) Complexity: People cannot assess complex relationships through unaided observations.
    b) Coincidence: People confuse correlation with causation.
    c) Feedback: People making judgmental predictions typically do not receive unambiguous feedback they can use to improve their forecasting.
    d) Bias: People have difficulty in obtaining or using evidence that contradicts their initial beliefs. This problem is especially serious for people who view themselves as experts.
  • Agreement among experts is only weakly related to accuracy: This is especially true when the experts communicate with one another and when they work together to solve problems, as is the case with the IPCC process.

Armstrong and Green later elaborate on these claims, referencing Tetlock's work. (Note that I have removed the parts of the section that involve direct discussion of climate-related forecasts, since the focus here is on the general question of how much deference to show to expert consensus).

Many public policy decisions are based on forecasts by experts. Research on persuasion has shown that people have substantial faith in the value of such forecasts. Faith increases when experts agree with one another. Our concern here is with what we refer to as unaided expert judgments. In such cases, experts may have access to empirical studies and other information, but they use their knowledge to make predictions without the aid of well-established forecasting principles. Thus, they could simply use the information to come up with judgmental forecasts. Alternatively, they could translate their beliefs into mathematical statements (or models) and use those to make forecasts.

Although they may seem convincing at the time, expert forecasts can make for humorous reading in retrospect. Cerf and Navasky’s (1998) book contains 310 pages of examples, such as Fermi Award-winning scientist John von Neumann’s 1956 prediction that “A few decades hence, energy may be free”. [...] The second author’s review of empirical research on this problem led him to develop the “Seer-sucker theory,” which can be stated as “No matter how much evidence exists that seers do not exist, seers will find suckers” (Armstrong 1980). The amount of expertise does not matter beyond a basic minimum level. There are exceptions to the Seer-sucker Theory: When experts get substantial well-summarized feedback about the accuracy of their forecasts and about the reasons why their forecasts were or were not accurate, they can improve their forecasting. This situation applies for short-term (up to five day) weather forecasts, but we are not aware of any such regime for long-term global climate forecasting. Even if there were such a regime, the feedback would trickle in over many years before it became useful for improving forecasting.

Research since 1980 has provided much more evidence that expert forecasts are of no value. In particular, Tetlock (2005) recruited 284 people whose professions included, “commenting or offering advice on political and economic trends.” He asked them to forecast the probability that various situations would or would not occur, picking areas (geographic and substantive) within and outside their areas of expertise. By 2003, he had accumulated over 82,000 forecasts. The experts barely if at all outperformed non-experts and neither group did well against simple rules. Comparative empirical studies have routinely concluded that judgmental forecasting by experts is the least accurate of the methods available to make forecasts. For example, Ascher (1978, p. 200), in his analysis of long-term forecasts of electricity consumption found that was the case.

Note that the claims that Armstrong and Green make are in relation to unaided expert judgment, i.e., expert judgment that is not aided by some form of assistance or feedback that promotes improved forecasting. (One can argue that expert judgment in climate science is not unaided, i.e., that the critique is mis-applied to climate science, but whether that is the case is not the focus of my post). While Tetlock's suggestion to be more fox-like, Armstrong and Green recommend the use of their own forecasting principles, as encoded in their full list of principles and described on their website.

A conflict of intuitions, and an attempt to resolve it

I have two conflicting intuitions here. I like to use the majority view among experts as a reasonable Bayesian prior to start with, that I might then modify based on further study. The relevant question here is who the experts are. Do I defer to the views of domain experts, who may know little about the challenges of forecasting, or do I defer to the views of forecasting experts, who may know little of the domain but argue that domain experts who are not following good forecasting principles do not have any advantage over non-experts?

I think the following heuristics are reasonable starting points:

  • In cases where we have a historical track record of forecasts, we can use that to evaluate the experts and non-experts. For instance, I reviewed the track record of survey-based macroeconomic forecasts, thanks to a wealth of recorded data on macroeconomic forecasts by economists over the last few decades. (Unfortunately, these surveys did not include corresponding data on layperson opinion).
  • The faster the feedback from making a forecast to knowing whether it's right, the more likely it is that experts would have learned how to make good forecasts.
  • The more central forecasting is to the overall goals of the domain, the more likely people are to get it right. For instance, forecasting is a key part of weather and climate science. But forecasting progress on mathematical problems has a negligible relation with doing mathematical research.
  • Ceteris paribus, if experts are clearly recording their forecasts and the reasons behind them, and systematically evaluating the performance on past forecasts, that should be taken as (weak) evidence in favor of the experts' views being taken more seriously (even if we don't have enough of a historical track record to properly calibrate forecast accuracy). However, if they simply make forecasts but then fail to review their past history of forecasts, this may be taken as being about as bad as not forecasting at all. And in cases that the forecasts were bold, failed miserably, and yet the errors were not acknowledged, this should be taken as being considerably worse than not forecasting at all.
  • A weak inside view of the nature of domain expertise can give some idea of whether expertise should generally translate to better forecasting skill. For instance, even a very weak understanding of physics will tell us that physicists are no more likely to determine whether a coin toss will yield heads or tails, even though the fate of the coin is determined by physics. Similarly, with the exception of economists who specialize in the study of macroeconomic indicators, one wouldn't expect economists to be able to forecast macroeconomic indicators better than most moderately economically informed people.

Politicization?

My first thought was that the more politicized a field, the less reliable any forecasts coming out of it. I think there are obvious reasons for that view, but there are also countervailing considerations.

The main claimed danger of politicization is groupthink and lack of openness to evidence. It could even lead to suppression, misrepresentation, or fabrication of evidence. Quite often, however, we see these qualities in highly non-political fields. People believe that certain answers are the right ones. Their political identity or ego is not attached to it. They just have high confidence that that answer is correct, and when the evidence they have does not match up, they think there is a problem with the evidence. Of course, if somebody does start challenging the mainstream view, and the issue is not quickly resolved either way, it can become politicized, with competing camps of people who hold the mainstream view and people who side with the challengers. Note, however, that the politicization has arguably reduced the aggregate amount of groupthink in the field. Now that there are two competing camps rather than one received wisdom, new people can examine evidence and better decide which camp is more on the side of truth. People in both camps, now that they are competing, may try to offer better evidence that could convince the undecideds or skeptics. So "politicization" might well improve the epistemic situation (I don't doubt that the opposite happens quite often). Examples of such politicization might be the replacement of geocentrism by heliocentrism, the replacement of creationism by evolution, and the replacement of Newtonian mechanics by relativity and/or quantum mechanics. In the first two cases, religious authorities pushed against the new idea, even though the old idea had not been a "politicized" tenet before the competing claims came along. In the case of Newtonian and quantum mechanics, the debate seems to have been largely intra-science, but quantum mechanics had its detractors, including Einstein, famous for the "God does not play dice" quip. (This post on Slate Star Codex is somewhat related).

The above considerations aren't specific to forecasting, and they apply even for assertions that fall squarely within the domain of expertise and require no forecasting skill per se. The extent to which they apply to forecasting problems is unclear. It's unclear whether most domains have any significant groupthink in favor of particular forecasts. In fact, in most domains, forecasts aren't really made or publicly recorded at all. So concerns of groupthink in a non-politicized scenario may not apply to forecasting. Perhaps the problem is the opposite: forecasts are so unimportant in many domains that the forecasts offered by experts are almost completely random and hardly informed in a systematic way by their expert knowledge. Even in such situations, politicization can be helpful, in so far as it makes the issue more salient and might prompt individuals to give more attention to trying to figure out which side is right.

The case of forecasting AI progress

I'm still looking at the case of forecasting AI progress, but for now, I'd like to point people to Luke Muehlhauser's excellent blog post from May 2013 discussing the difficulty with forecasting AI progress. Interestingly, he makes many points similar to those I make here. (Note: Although I had read the post around the time it was published, I hadn't read it recently until I finished drafting the rest of my current post. Nonetheless, my views can't be considered totally independent of Luke's because we've discussed my forecasting contract work for MIRI).

Should we expect experts to be good at predicting AI, anyway? As Armstrong & Sotala (2012) point out, decades of research on expert performance2 suggest that predicting the first creation of AI is precisely the kind of task on which we should expect experts to show poor performance — e.g. because feedback is unavailable and the input stimuli are dynamic rather than static. Muehlhauser & Salamon (2013) add, “If you have a gut feeling about when AI will be created, it is probably wrong.”

[...]

On the other hand, Tetlock (2005) points out that, at least in his large longitudinal database of pundit’s predictions about politics, simple trend extrapolation is tough to beat. Consider one example from the field of AI: when David Levy asked 1989 World Computer Chess Championship participants when a chess program would defeat the human World Champion, their estimates tended to be inaccurately pessimistic,8 despite the fact that computer chess had shown regular and predictable progress for two decades by that time. Those who forecasted this event with naive trend extrapolation (e.g. Kurzweil 1990) got almost precisely the correct answer (1997).

Looking for thoughts

I'm particularly interested in thoughts from people on the following fronts:

  1. What are some indicators you use to determine the reliability of forecasts by subject matter experts?
  2. How do you resolve the conflict of intuitions between deferring to the views of domain experts and deferring to the conclusion that forecasters have drawn about the lack of utility of domain experts' forecasts?
  3. In particular, what do you think of the way that "politicization" affects the reliability of forecasts?
  4. Also, how much value do you assign to agreement between experts when judging how much trust to place in expert forecasts?
  5. Comments that elaborate on these questions or this general topic within the context of a specific domain or domains would also be welcome.

Scenario analyses for technological progress for the next decade

10 VipulNaik 14 July 2014 04:31PM

This is a somewhat long and rambling post. Apologies for the length. I hope the topic and content are interesting enough for you to forgive the meandering presentation.

I blogged about the scenario planning method a while back, where I linked to many past examples of scenario planning exercises. In this post, I take a closer look at scenario analysis in the context of understanding the possibilities for the unfolding of technological progress over the next 10-15 years. Here, I will discuss some predetermined elements and critical uncertainties, offer my own scenario analysis, and then discuss scenario analyses by others.

Remember: it is not the purpose of scenario analysis to identify a set of mutually exclusive and collectively exhaustive outcomes. In fact, usually, the real-world outcome has some features from two or more of the scenarios considered, with one scenario dominating somewhat. As I noted in my earlier post:

The utility of scenario analysis is not merely in listing a scenario that will transpire, or a collection of scenarios a combination of which will transpire. The utility is in how it prepares the people undertaking the exercise for the relevant futures. One way it could so prepare them is if the early indicators of the scenarios are correctly chosen and, upon observing them, people are able to identify what scenario they're in and take the appropriate measures quickly. Another way is by identifying some features that are common to all scenarios, though the details of the feature may differ by scenario. We can therefore have higher confidence in these common features and can make plans that rely on them.

The predetermined element: the imminent demise of Moore's law "as we know it"

As Steven Schnaars noted in Megamistakes (discussed here), forecasts of technological progress in most domains have been overoptimistic, but in the domain of computing, they've been largely spot-on, mostly because the raw technology has improved quickly. The main reason has been Moore's law, and a couple other related laws, that have undergirded technological progress. But now, the party is coming to an end! The death of Moore's law (as we know it) is nigh, and there are significant implications for the future of computing.

Moore's law refers to many related claims about technological progress. Some forms of this technological progress have already stalled. Other forms are slated to stall in the near future, barring unexpected breakthroughs. These facts about Moore's law form the backdrop for all our scenario planning.

The critical uncertainty arises in how industry will respond to the prospect of Moore's law death. Will there be a doubling down on continued improvement at the cutting edge? Will the battle focus on cost reductions? Or will we have neither cost reduction nor technological improvement? What sort of pressure will hardware stagnation put on software?

Now, onto a description of the different versions of Moore's law (slightly edited version of information from Wikipedia):

  • Transistors per integrated circuit. The most popular formulation is of the doubling of the number of transistors on integrated circuits every two years.

  • Density at minimum cost per transistor. This is the formulation given in Moore's 1965 paper. It is not just about the density of transistors that can be achieved, but about the density of transistors at which the cost per transistor is the lowest. As more transistors are put on a chip, the cost to make each transistor decreases, but the chance that the chip will not work due to a defect increases. In 1965, Moore examined the density of transistors at which cost is minimized, and observed that, as transistors were made smaller through advances in photolithography, this number would increase at "a rate of roughly a factor of two per year".

  • Dennard scaling. This suggests that power requirements are proportional to area (both voltage and current being proportional to length) for transistors. Combined with Moore's law, performance per watt would grow at roughly the same rate as transistor density, doubling every 1–2 years. According to Dennard scaling transistor dimensions are scaled by 30% (0.7x) every technology generation, thus reducing their area by 50%. This reduces the delay by 30% (0.7x) and therefore increases operating frequency by about 40% (1.4x). Finally, to keep electric field constant, voltage is reduced by 30%, reducing energy by 65% and power (at 1.4x frequency) by 50%. Therefore, in every technology generation transistor density doubles, circuit becomes 40% faster, while power consumption (with twice the number of transistors) stays the same.

So how are each of these faring?

  • Transistors per integrated circuit: At least in principle, this can continue for a decade or so. The technological ideas exist to publish transistor sizes down from the current values of 32 nm and 28 nm all the way down to 7 nm.
  • Density at minimum cost per transistor. This is probably stopping around now. There is good reason to believe that, barring unexpected breakthroughs, the transistor size for which we have minimum cost per transistor shall not go down below 28 nm. There may still be niche applications that benefit from smaller transistor sizes, but there will be no overwhelming economic case to switch production to smaller transistor sizes (i.e., higher densities).
  • Dennard scaling. This broke down around 2005-2007. So for approximately a decade, we've essentially seen continued miniaturization but without any corresponding improvement in processor speed or performance per watt. There have been continued overall improvements in energy efficiency of computing, but not through this mechanism. The absence of automatic speed improvements has led to increased focus on using greater parallelization (note that the miniaturization means more parallel processors can be packed in the same space, so Moore's law is helping in this other way). In particular, there has been an increased focus on multicore processors, though there may be limits to how far that can take us too.

Moore's law isn't the only law that is slated to end. Other similar laws, such as Kryder's law (about the cost of hard disk space) may also end in the near future. Koomey's law on energy efficiency may also stall, or might continue to hold but through very different mechanisms compared to the ones that have driven it so far.

Some discussions that do not use explicit scenario analysis

The quotes below are to give a general idea of what people seem to generally agree on, before we delve into different scenarios.

EETimes writes:

We have been hearing about the imminent demise of Moore's Law quite a lot recently. Most of these predictions have been targeting the 7nm node and 2020 as the end-point. But we need to recognize that, in fact, 28nm is actually the last node of Moore's Law.

[...]

Summarizing all of these factors, it is clear that -- for most SoCs -- 28nm will be the node for "minimum component costs" for the coming years. As an industry, we are facing a paradigm shift because dimensional scaling is no longer the path for cost scaling. New paths need to be explored such as SOI and monolithic 3D integration. It is therefore fitting that the traditional IEEE conference on SOI has expanded its scope and renamed itself as IEEE S3S: SOI technology, 3D Integration, and Subthreshold Microelectronics.

Computer scientist Moshe Yardi writes:

So the real question is not when precisely Moore's Law will die; one can say it is already a walking dead. The real question is what happens now, when the force that has been driving our field for the past 50 years is dissipating. In fact, Moore's Law has shaped much of the modern world we see around us. A recent McKinsey study ascribed "up to 40% of the global productivity growth achieved during the last two decades to the expansion of information and communication technologies made possible by semiconductor performance and cost improvements." Indeed, the demise of Moore's Law is one reason some economists predict a "great stagnation" (see my Sept. 2013 column).

"Predictions are difficult," it is said, "especially about the future." The only safe bet is that the next 20 years will be "interesting times." On one hand, since Moore's Law will not be handing us improved performance on a silver platter, we will have to deliver performance the hard way, by improved algorithms and systems. This is a great opportunity for computing research. On the other hand, it is possible that the industry would experience technological commoditization, leading to reduced profitability. Without healthy profit margins to plow into research and development, innovation may slow down and the transition to the post-CMOS world may be long, slow, and agonizing.

However things unfold, we must accept that Moore's Law is dying, and we are heading into an uncharted territory.

CNet says:

"I drive a 1964 car. I also have a 2010. There's not that much difference -- gross performance indicators like top speed and miles per gallon aren't that different. It's safer, and there are a lot of creature comforts in the interior," said Nvidia Chief Scientist Bill Dally. If Moore's Law fizzles, "We'll start to look like the auto industry."

Three critical uncertainties: technological progress, demand for computing power, and interaction with software

Uncertainty #1: Technological progress

Moore's law is dead, long live Moore's law! Even if Moore's law as originally stated is no longer valid, there are other plausible computing advances that would preserve the spirit of the law.

Minor modifications of current research (as described in EETimes) include:

  • Improvements in 3D circuit design (Wikipedia), so that we can stack multiple layers of circuits one on top of the other, and therefore pack more computing power per unit volume.
  • Improvements in understanding electronics at the nanoscale, in particular understanding subthreshold leakage (Wikipedia) and how to tackle it.

Then, there are possibilities for totally new computing paradigms. These have fairly low probability, and are highly unlikely to become commercially viable within 10-15 years. Each of these offers an advantage over currently available general-purpose computing only for special classes of problems, generally those that are parallelizable in particular ways (the type of parallelizability needed differs somewhat between the computing paradigms).

  • Quantum computing (Wikipedia) (speeds up particular types of problems). Quantum computers already exist, but the current ones can tackle only a few qubits. Currently, the best known quantum computers in action are those maintained at the Quantum AI Lab (Wikipedia) run jointly by Google, NASA. and USRA. It is currently unclear how to manufacture quantum computers with a larger number of qubits. It's also unclear how the cost will scale in the number of qubits. If the cost scales exponentially in the number of qubits, then quantum computing will offer little advantage over classical computing. Ray Kurzweil explains this as follows:
    A key question is: how difficult is it to add each additional qubit? The computational power of a quantum computer grows exponentially with each added qubit, but if it turns out that adding each additional qubit makes the engineering task exponentially more difficult, we will not be gaining any leverage. (That is, the computational power of a quantum computer will be only linearly proportional to the engineering difficulty.) In general, proposed methods for adding qubits make the resulting systems significantly more delicate and susceptible to premature decoherence.

    Kurzweil, Ray (2005-09-22). The Singularity Is Near: When Humans Transcend Biology (Kindle Locations 2152-2155). Penguin Group. Kindle Edition.
  • DNA computing (Wikipedia)
  • Other types of molecular computing (Technology Review featured story from 2000, TR story from 2010)
  • Spintronics (Wikipedia): The idea is to store information using the spin of the electron, a quantum property that is binary and can be toggled at zero energy cost (in principle). The main potential utility of spintronics is in data storage, but it could potentially help with computation as well.
  • Optical computing aka photonic computing (Wikipedia): This uses beams of photons that store the relevant information that needs to be manipulated. Photons promise to offer higher bandwidth than electrons, the tool used in computing today (hence the name electronic computing).

Uncertainty #2: Demand for computing

Even if computational advances are possible in principle, the absence of the right kind of demand can lead to a lack of financial incentive to pursue the relevant advances. I discussed the interaction between supply and demand in detail in this post.

As that post discussed, demand for computational power at the consumer end is probably reaching saturation. The main source of increased demand will now be companies that want to crunch huge amounts of data in order to more efficiently mine data for insight and offer faster search capabilities to their users. The extent to which such demand grows is uncertain. In principle, the demand is unlimited: the more data we collect (including "found data" that will expand considerably as the Internet of Things grows), the more computational power is needed to apply machine learning algorithms to the data. Since the complexity of many machine learning algorithms grows at least linearly (and in some cases quadratically or cubically) in the data, and the quantity of data itself will probably grow superlinearly, we do expect a robust increase in demand for computing.

Uncertainty #3: Interaction with software

Much of the increased demand for computing, as noted above, does not arise so much from a need for raw computing power by consumers, but a need for more computing power to manipulate and glean insight from large data sets. While there has been some progress with algorithms for machine learning and data mining, the fields are probably far from mature. So an alternative to hardware improvements is improvements in the underlying algorithms. In addition to the algorithms themselves, execution details (such as better use of parallel processing capabilities and more efficient use of idle processor capacity) can also yield huge performance gains.

This might be a good time to note a common belief about software and why I think it's wrong. We often tend to hear of software bloat, and some people subscribe to Wirth's law, the claim that software is getting slower more quickly than hardware is getting faster. I think that there are some softwares that have gotten feature-bloated over time, largely because there are incentives to keep putting out new editions that people are willing to pay money for, and Microsoft Word might be one case of such bloat. For the most part, though, software has been getting more efficient, partly by utilizing the new hardware better, but also partly due to underlying algorithmic improvements. This was one of the conclusions of Katja Grace's report on algorithmic progress (see also this link on progress on linear algebra and linear programming algorithms). There are a few softwares that get feature-bloated and as a result don't appear to improve over time as far as speed goes, but it's arguably the case that people's revealed preferences show that they are willing to put up with the lack of speed improvements as long as they're getting feature improvements.

Computing technology progress over the next 10-15 years: my three scenarios

  1. Slowdown to ordinary rates of growth of cutting-edge industrial productivity: For the last few decades, several dimensions of computing technology have experienced doublings over time periods ranging from six months to five years. With such fast doubling, we can expect price-performance thresholds for new categories of products to be reached every few years, with multiple new product categories a decade. Consider, for instance, desktops, then laptops, then smartphones, then tablets. If the doubling time reverts to the norm seen in other cutting-edge industrial sectors, namely 10-25 years, then we'd probably see the introduction of revolutionary new product categories only about once a generation. There are already some indications of a possible slowdown, and it remains to be seen whether we see a bounceback.
  2. Continued fast doubling: The other possibility is that the evidence for a slowdown is largely illusory, and computing technology will continue to experience doublings over timescales of less than five years. There would therefore be scope to introduce new product categories every few years.
  3. New computing paradigm with high promise, but requiring significant adjustment: This is an unlikely, but not impossible, scenario. Here, a new computing paradigm, such as quantum computing, reaches the realm of feasibility. However, the existing infrastructure of algorithms is ill-designed for quantum computing, and in fact, quantum computing engenders many security protocols while offering its own unbreakable ones. Making good use of this new paradigm requires a massive re-architecting of the world's computing infrastructure.

There are two broad features that are likely to be common to all scenarios:

  • Growing importance of algorithms: Scenario (1): If technological progress in computing power stalls, then the pressure for improvements to the algorithms and software may increase. Scenario (2): if technological progress in computing power continues, that might only feed the hunger for bigger data. And as the size of data sets increases, asymptotic performance starts mattering more (the distinction between O(n) and O(n2) matters more when n is large). In both cases, I expect more pressure on algorithms and software, but in different ways: in the case of stalling hardware progress, the focus will be more on improving the software and making minor changes to improve the constants, whereas in the case of rapid hardware progress, the focus will be more on finding algorithms that have better asymptotic (big-oh) performance. Scenario (3): In the case of paradigm shifts, the focus will be on algorithms that better exploit the new paradigm. In all cases, there will need to be some sort of shift toward new algorithms and new code that better exploits the new situation.
  • Growing importance of parallelization: Although the specifics of how algorithms will become more important varies between the scenarios, one common feature is that algorithms that can better make parallel use of large numbers of machines will become more important. We have seen parallelization grow in importance over the last 15 years, even as the computing gains for individual processors through Moore's law seems to be plateauing out, while data centers have proliferated in number. However, the full power of parallelization is far from tapped out. Again, parallelization matters for slightly different reasons in different cases. Scenario (1): A slowdown in technological progress would mean that gains in the amount of computation can largely be achieved by scaling up the number of machines. In other words, the usage of computing shifts further in a capital-intensive direction. Parallel computing is important for effective utilization of this capital (the computing resources). Scenario (2): Even in the face of rapid hardware progress, automatic big data generation will likely improve much faster than storage, communication, and bandwidth. This "big data" is too huge to store or even stream on a single machine, so parallel processing across huge clusters of machines becomes important. Scenario (3): Note also that almost all the new computing paradigms currently under consideration (including quantum computing) offer massive advantages for special types of parallelizable problems, so parallelization matters even in the case of a paradigm shift in computing.

Other scenario analyses

McKinsey carried out a scenario analysis here, focused more on the implications for the semiconductor manufacturing industry than for users of computing. The report notes the importance of Moore's law in driving productivity improvements over the last few decades:

As a result, Moore’s law has swept much of the modern world along with it. Some estimates ascribe up to 40 percent of the global productivity growth achieved during the last two decades to the expansion of information and communication technologies made possible by semiconductor performance and cost improvements.

The scenario analysis identifies four potential sources of innovation related to Moore's law:

  1. More Moore (scaling)
  2. Wafer-size increases (maximize productivity)
  3. More than Moore (functional diversification)
  4. Beyond CMOS (new technologies)

Their scenario analysis uses a 2 X 2 model, with the two dimensions under consideration being performance improvements (continue versus stop) and cost improvements (continue versus stop). The case that both performance improvements and cost improvements continue is the "good" case for the semiconductor industry. The case that both stop is the case where the industry is highly likely to get commodified, with profit margins going down and small players catching up to the big ones. In the intermediate cases (where one of the two continues and the other stops), consolidation of the semiconductor industry is likely to continue, but there is still a risk of falling demand.

The McKinsey scenario analysis was discussed by Timothy Taylor on his blog, The Conversable Economist, here.

Roland Berger carried out a detailed scenario analysis focused on the "More than Moore" strategy here.

Blegging for missed scenarios, common features and early indicators

Are there scenarios that the analyses discussed above missed? Are there some types of scenario analysis that we didn't adequately consider? If you had to do your own scenario analysis for the future of computing technology and hardware progress over the next 10-15 years, what scenarios would you generate?

As I noted in my earlier post:

The utility of scenario analysis is not merely in listing a scenario that will transpire, or a collection of scenarios a combination of which will transpire. The utility is in how it prepares the people undertaking the exercise for the relevant futures. One way it could so prepare them is if the early indicators of the scenarios are correctly chosen and, upon observing them, people are able to identify what scenario they're in and take the appropriate measures quickly. Another way is by identifying some features that are common to all scenarios, though the details of the feature may differ by scenario. We can therefore have higher confidence in these common features and can make plans that rely on them.

I already identified some features I believe to be common to all scenarios (namely, increased focus on algorithms, and increased focus on parallelization). Do you agree with my assessment that these are likely to matter regardless of scenario? Are there other such common features you have high confidence in?

If you generally agree with one or more of the scenario analyses here (mine or McKinsey's or Roland Berger's), what early indicators would you use to identify which of the enumerated scenarios we are in? Is it possible to look at how events unfold over the next 2-3 years and draw intelligent conclusions from that about the likelihood of different scenarios?

Communicating forecast uncertainty

5 VipulNaik 12 July 2014 09:30PM

Note: This post is part of my series of posts on forecasting, but this particular post may be of fairly limited interest to many LessWrong readers. I'm posting it here mainly for completeness. As always, I appreciate feedback.

In the course of my work looking at forecasting for MIRI, I repeatedly encountered discussions of how to communicate forecasts. In particular, a concern that emerged repeatedly was the clear communication of the uncertainty in forecasts. Nate Silver's The Signal and the Noise, in particular, focused quite a bit on the virtue of clear communication of uncertainty, in contexts as diverse as financial crises, epidemiology, weather forecasting, and climate change.

In this post, I pull together discussions from a variety of domains about the communication of uncertainty, and also included my overall impression of the findings.

Summary of overall findings

  • In cases where forecasts are made and used frequently (the most salient example being temperature and precipitation forecasts) people tend to form their own models of the uncertainty surrounding forecasts, even if you present forecasts as point estimates. The models people develop are quite similar to the correct ones, but still different in important ways.
  • In cases where forecasts are made more rarely, as with forecasting rare events, people are more likely to have simpler models that acknowledge some uncertainty but are less nuanced. In these cases, acknowledging uncertainty becomes quite important, because wrong forecasts of such events can lead to a loss of trust in the forecasting process, and can lead people to ignore correct forecasts later.
  • In some cases, there are arguments for modestly exaggerating small probabilities to overcome specific biases that people have that cause them to ignore low-probability events.
  • However, the balance of evidence suggests that forecasts should be reported as honestly as possible, and all uncertainty should be clearly acknowledged. If the forecast does not acknowledge uncertainty, people are likely to either use their own models of uncertainty, or lose faith in the forecasting process entirely if the forecast turns out to be far off from reality.

Probabilities of adverse events and the concept of the cost-loss ratio

A useful concept developed for understanding the utility of weather forecasting is the cost-loss model (Wikipedia). Consider that if a particular adverse event occurs, and we do not take precautionary measures, the loss incurred is L, whereas if we do take precautionary measures, the cost is C, regardless of whether the event occurs. An example: you're planning an outdoor party, and the adverse event in question is rain. If it rains during the event, you experience a loss of L. If you knew in advance that it would rain, you'd move the venue indoors, at a cost of C. Obviously, C < L for you to even consider the precautionary measure.

The ratio C/L is termed the cost-loss ratio and describes the probability threshold above which it makes sense to take the precautionary measure.

One way of thinking of the utility of weather forecasting, particularly in the context of forecasting adverse events (rain, snow, winds, and more extreme events) is in terms of whether people have adequate information to make correct decisions based on their cost-loss model. This would boil down to several questions:

  • Is the probability of the adverse event communicated with sufficient clarity and precision that people who need to use it can plug it into their cost-loss model?
  • Do people have a correct estimate of their cost-loss ratio (implicitly or explicitly)?

As I discussed in an earlier post, The Weather Channel has admitted to explicitly introducing wet bias into its probability-of-precipitation (PoP) forecasts. The rationale they offered could be interpreted as a claim that people overestimate their cost-loss ratio. For instance, a person may think his cost-loss ratio for precipitation is 0.2 (20%), but his actual cost-loss ratio may be 0.05 (5%). In this case, in order to make sure people still make the "correct" decision, PoP forecasts that fall between 0.05 and 0.2 would need to inflated to 0.2 or higher. Note that TWC does not introduce wet bias at higher probabilities of precipitation, arguably because (they believe) that this is well above the cost-loss ratio for most situations.

Words of estimative probability

In 1964, Sherman Kent (Wikipedia), the father of intelligence analysis, wrote an essay titled "Words of Estimative Probability" that discussed the use of words to describe probability estimates, and how different people may interpret the same word as referring to very different ranges of probability estimates. The concept of words of estimative probability (Wikipedia), along with its acronym, WEP, is now standard jargon in intelligence analysis.

Some related discussion of the use of words to convey uncertainty in estimates can be found in the part of this post where I excerpt from the paper discussing the communication of uncertainty in climate change.

Other general reading

#1: The case of weather forecasting

Weather forecasting has some features that make it stand out among other forecasting domains:

  • Forecasts are published explicitly and regularly: News channels and newspapers carry forecasts every day. Weather websites update their forecasts on at least an hourly basis, sometimes even faster, particularly if there are unusual weather developments. In the United States, The Weather Channel is dedicated to 24 X 7 weather news coverage.
  • Forecasts are targeted at and consumed by the general public: This sets weather forecasting apart from other forms of forecasting and prediction. We can think of prices in financial markets and betting markets as implicit forecasts. But they are targeted at the niche audiences that pay attention to them, not at everybody. The mode of consumption varies. Some people just get their forecasts from the weather reports in their local TV and radio channel. Some people visit the main weather websites (such as the National Weather Service, The Weather Channel, AccuWeather, or equivalent sources in other countries). Some people have weather reports emailed to them daily. As smartphones grow in popularity, weather apps are an increasingly common way for people to keep tabs on the weather. The study on communicating weather uncertainty (discussed below) found that in the United States, people in its sample audience saw weather forecasts an average of 115 times a month. Even assuming heavy selection bias in the study, people in the developed world probably encounter a weather forecast at least once a day.
  • Forecasts are used to drive decision-making: Particularly in places where weather fluctuations are significant, forecasts play an important role in event planning for individuals and organizations. At the individual level, this can include deciding whether to carry an umbrella, choosing what clothes to wear, deciding whether to wear snow boots, deciding whether conditions are suitable for driving, and many other small decisions. At the organizational level, events may be canceled or relocated based on forecasts of adverse weather. In locations with variable weather, it's considered irresponsible to plan an event without checking the weather forecast.
  • People get quick feedback on whether the forecast was accurate: The next day, people know whether what was forecast transpired.

The upshot: people are exposed to weather forecasts, pay attention to them, base decisions on them, and then come to know whether the forecast was correct. This happens on a daily basis. Therefore, they have both the incentives and the information to form their own mental model of the reliability and uncertainty in forecasts. Note also that because the reliability of forecasts varies considerably by location, people who move from one location to another may take time adjusting to the new location. (For instance, when I moved to Chicago, I didn't pay much attention to weather forecasts in the beginning, but soon learned that the high variability of the weather combined with reasonable accuracy of forecasts made then worth paying attention to. Now that I'm in Berkeley, I probably pay too much attention to the forecast relative to its value, given the stability of weather in Berkeley).

With these general thoughts in mind, let's look at the paper Communicating Uncertainty in Weather Forecasts: A Survey of the U. S. Public by Rebecca E. Morss, Julie L. Demuth, and Jeffrey K. Lazo. The paper is based on a survey of about 1500 people in the United States. The whole paper is worth a careful read if you find the issue fascinating. But for the benefits of those of you who find the issue somewhat interesting but not enough to read the paper, I include some key takeaways from the paper.

Temperature forecasts: the authors find that even though temperature forecasts are generally made as point estimates, people interpret these point estimates as temperature ranges. The temperature ranges are not even necessarily centered at the point estimates. Further, the range of temperatures increases with the forecast horizon. In other words, people (correctly) realize that forecasts made for three days later have more uncertainty attached to them than forecasts made for one day later. In other words, peoples understanding of the nature of forecast uncertainty in temperatures is correct, at least in the broad qualitative sense.

The authors believe that people arrive at these correct models through their own personal history of seeing weather forecasts and evaluating how they compare with the reality. Clearly, most people don't keep close track of how forecasts compare with the reality, but they are still able to get the general idea over several years of exposure to weather forecasts. The authors also believe that since the accuracy of weather forecasts varies by region, people's models of uncertainty may also differ by region. However, the data they collect does not allow for a test of this hypothesis. For more, read Sections 3a and 3b of the paper.

Probability-of-precipitation (PoP) forecasts: The authors also look at people's perception of probability-of-precipitation (PoP) forecasts. The correct meteorological interpretation of PoP is "the probability that precipitation occurs given these meteorological conditions." The frequentist operationalization of this would be "the fraction (situations with meteorological conditions like this where precipitation does occur)/(situations with meteorological conditions like this)." To what extent are people aware of this meaning? One of the questions in the survey elicits information on this front:

TABLE 2. Responses to Q14a, the meaning of the forecast
“There is a 60% chance of rain for tomorrow” (N 1330).
It will rain tomorrow in 60% of the region. 16% of respondents
It will rain tomorrow for 60% of the time. 10% of respondents
It will rain on 60% of the days like tomorrow.* 19% of respondents
60% of weather forecasters believe that it will rain tomorrow. 22% of respondents
I don’t know. 9% of respondents
Other (please explain). 24% of respondents
* Technically correct interpretation, according to how PoP forecasts are verified, as interpreted by Gigerenzer et al. (2005).

So about 19% of participants choose the correct meteorological interpretation. However, of the 24% who offer other explanations, many suggest that they are not so much interested in the meteorological interpretation as in how this affects their decision-making. So it might be the case that even if people aren't aware of the frequentist definition, they are still using the information approximately correctly as it applies to their lives. One such application would be a comparison with the cost-loss ratio to determine whether to engage in precautionary measures. Note that, as noted earlier in the post, it may be the case that people overestimate their own cost-loss ratio, but this is a distinct problem from incorrectly interpreting the probability.

I also found the following resources, that I haven't had the time to read through, but that might help people interested in exploring the issue in more detail (I'll add more to this list if I find more):

#2: Extreme rare events (usually weather-related) that require significant response

For some rare events (such as earthquakes) we don't know how to make specific predictions of their imminent arrival. But for others, such as hurricanes, cyclones, blizzards, tornadoes, and thunderstorms, specific probabilistic predictions can be made. Based on these predictions, significant action can be undertaken, ranging from everybody deciding to stock up on supplies and stay at home, to a mass evacuation. Such responses are quite costly, but the loss they would avert if the event did occur is even bigger. In the cost-loss framework discussed above, we are dealing with both a high cost and a loss that could be much higher. However, unlike the binary case discussed above, the loss spans more of a continuum: the amount of loss that would occur without precautionary measures depends on the intensity of the event. Similarly, the costs span a continuum: the cost depends on the extent of precautionary measures taken.

Since both the cost and loss are huge, it's quite important to get a good handle on the probability. But should the correct probability be communicated, or should it be massaged or simply converted to a "yes/no" statement? We discussed earlier the (alleged) problem of people overestimating their cost-loss ratio, and therefore not taking adequate precautionary measures, and how the Weather Channel addresses this by deliberately introducing a wet bias. But the stakes are much higher when we are talking of shutting down a city for a day or ordering a mass evacuation.

Another complication is that the rarity of the event means that people's own mental models haven't had a lot of data to calibrate the accuracy and reliability of forecasts. When it comes to temperature and precipitation forecasts, people have years of experience to rely on. They will not lose faith in a forecast based on a single occurrence. When it comes to rare events, even a few memories of incorrect forecasts, and the concomitant huge costs or huge losses, can lead people to be skeptical of the forecasts in the future. In The Signal and the Noise, Nate Silver extensively discusses the case of Hurricane Katrina and the dilemmas facing the mayor of New Orleans that led him to delay the evacuation of the city, and led many people to ignore the evacuation order even after it was announced.

A direct strike of a major hurricane on New Orleans had long been every weather forecaster’s worst nightmare. The city presented a perfect set of circumstances that might contribute to the death and destruction there. [...]

The National Hurricane Center nailed its forecast of Katrina; it anticipated a potential hit on the city almost five days before the levees were breached, and concluded that some version of the nightmare scenario was probable more than forty-eight hours away . Twenty or thirty years ago, this much advance warning would almost certainly not have been possible, and fewer people would have been evacuated. The Hurricane Center’s forecast, and the steady advances made in weather forecasting over the past few decades, undoubtedly saved many lives.

Not everyone listened to the forecast, however. About 80,000 New Orleanians —almost a fifth of the city’s population at the time— failed to evacuate the city, and 1,600 of them died. Surveys of the survivors found that about two-thirds of them did not think the storm would be as bad as it was. Others had been confused by a bungled evacuation order; the city’s mayor, Ray Nagin, waited almost twenty-four hours to call for a mandatory evacuation, despite pleas from Mayfield and from other public officials. Still other residents— impoverished, elderly, or disconnected from the news— could not have fled even if they had wanted to.

Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (pp. 109-110). Penguin Group US. Kindle Edition.

So what went wrong? Silver returns to this later in the chapter:

As Max Mayfield told Congress, he had been prepared for a storm like Katrina to hit New Orleans for most of his sixty-year life. Mayfield grew up around severe weather— in Oklahoma, the heart of Tornado Alley— and began his forecasting career in the Air Force, where people took risk very seriously and drew up battle plans to prepare for it. What took him longer to learn was how difficult it would be for the National Hurricane Center to communicate its forecasts to the general public.

“After Hurricane Hugo in 1989,” Mayfield recalled in his Oklahoma drawl, “I was talking to a behavioral scientist from Florida State. He said people don’t respond to hurricane warnings. And I was insulted. Of course they do. But I have learned that he is absolutely right. People don’t respond just to the phrase ‘hurricane warning.’ People respond to what they hear from local officials. You don’t want the forecaster or the TV anchor making decisions on when to open shelters or when to reverse lanes.”

Under Mayfield’s guidance, the National Hurricane Center began to pay much more attention to how it presented its forecasts. It contrast to most government agencies, whose Web sites look as though they haven’t been updated since the days when you got those free AOL CDs in the mail, the Hurricane Center takes great care in the design of its products, producing a series of colorful and attractive charts that convey information intuitively and accurately on everything from wind speed to storm surge.

The Hurricane Center also takes care in how it presents the uncertainty in its forecasts. “Uncertainty is the fundamental component of weather prediction,” Mayfield said. “No forecast is complete without some description of that uncertainty.” Instead of just showing a single track line for a hurricane’s predicted path, for instance, their charts prominently feature a cone of uncertainty—“ some people call it a cone of chaos,” Mayfield said. This shows the range of places where the eye of the hurricane is most likely to make landfall. Mayfield worries that even this isn’t enough. Significant impacts like flash floods (which are often more deadly than the storm itself) can occur far from the center of the storm and long after peak wind speeds have died down. No people in New York City died from Hurricane Irene in 2011 despite massive media hype surrounding the storm, but three people did from flooding in landlocked Vermont once the TV cameras were turned off.

[...]


Mayfield told Nagin that he needed to issue a mandatory evacuation order, and to do so as soon as possible.

Nagin dallied, issuing a voluntary evacuation order instead. In the Big Easy, that was code for “take it easy”; only a mandatory evacuation order would convey the full force of the threat. Most New Orleanians had not been alive when the last catastrophic storm, Hurricane Betsy, had hit the city in 1965. And those who had been, by definition, had survived it. “If I survived Hurricane Betsy, I can survive that one, too. We all ride the hurricanes, you know,” an elderly resident who stayed in the city later told public officials. Reponses like these were typical. Studies from Katrina and other storms have found that having survived a hurricane makes one less likely to evacuate the next time one comes. 

The reasons for Nagin’s delay in issuing the evacuation order is a matter of some dispute— he may have been concerned that hotel owners might sue the city if their business was disrupted. Either way, he did not call for a mandatory evacuation until Sunday at 11 A.M. —and by that point the residents who had not gotten the message yet were thoroughly confused . One study found that about a third of residents who declined to evacuate the city had not heard the evacuation order at all. Another third heard it but said it did not give clear instructions. Surveys of disaster victims are not always reliable— it is difficult for people to articulate why they behaved the way they did under significant emotional strain, and a small percentage of the population will say they never heard an evacuation order even when it is issued early and often. But in this case, Nagin was responsible for much of the confusion.

There is, of course, plenty of blame to go around for Katrina— certainly to FEMA in addition to Nagin. There is also credit to apportion— most people did evacuate, in part because of the Hurricane Center’s accurate forecast. Had Betsy topped the levees in 1965, before reliable hurricane forecasts were possible, the death toll would probably have been even greater than it was in Katrina. One lesson from Katrina, however, is that accuracy is the best policy for a forecaster. It is forecasting’s original sin to put politics, personal glory, or economic benefit before the truth of the forecast. Sometimes it is done with good intentions, but it always makes the forecast worse. The Hurricane Center works as hard as it can to avoid letting these things compromise its forecasts. It may not be a concidence that, in contrast to all the forecasting failures in this book, theirs have become 350 percent more accurate in the past twenty-five years alone.

“The role of a forecaster is to produce the best forecast possible,” Mayfield says. It’s so simple— and yet forecasters in so many fields routinely get it wrong.

Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (pp. 138-141). Penguin Group US. Kindle Edition. 

Silver notes similar failures of communication of forecast uncertainty in other domains, including exaggeration of the 1976 swine flu outbreak.

I also found a few related papers that may be worth reading if you're interested in understanding the communication of weather-related rare event forecasts:

#3: Long-run changes that might necessitate policy responses or long-term mitigation or adaptation strategies, such as climate change

In marked contrast to daily weather forecasting as well as extreme rare event forecasting is the forecasting of gradual long-term structural changes. Examples include climate change, economic growth, changes in the size and composition of the population, and technological progress. Here, the general recommendation is clear and detailed communication of uncertainty using multiple formats, with the format tailored to the types of decisions that will be based on the information.

On the subject of communicating uncertainty in climate change, I found the paper Communicating uncertainty: lessons learned and suggestions for climate change assessment by Anthony Patt and Suraje Dessai. The paper is quite interesting (and has been referenced by some of the other papers mentioned in this post).

The paper identifies three general sources of uncertainty:

  • Epistemic uncertainty arises from incomplete knowledge of processes that influence events.
  • Natural stochastic uncertainty refers to the chaotic nature of the underlying system (in this case, the climate system).
  • Human reflexive uncertainty refers to uncertainty in human activity that could affect the system. Some of the activity may be undertaken specifically in response to the forecast.

This is somewhat similar to, but not directly mappable to, the classification of sources of uncertainty by Gavin Schmidt from NASA that I discussed in my post on weather and climate forecasting:

  • Initial condition uncertainty: This form of uncertainty dominates short-term weather forecasts (though not necessarily the very short term weather forecasts; it seems to matter the most for intervals where numerical weather prediction gets too uncertain but long-run equilibrating factors haven't kicked in). Over timescales of several years, this form of uncertainty is not influential.
  • Scenario uncertainty: This is uncertainty that arises from lack of knowledge of how some variable (such as carbon dioxide levels in the atmosphere, or levels of solar radiation, or aerosol levels in the atmosphere, or land use patterns) will change over time. Scenario uncertainty rises over time, i.e., scenario uncertainty plagues long-run climate forecasts far more than it plagues short-run climate forecasts.
  • Structural uncertainty: This is uncertainty that is inherent to the climate models themselves. Structural uncertainty is problematic at all time scales to a roughly similar degree (some forms of structural uncertainty affect the short run more whereas some affect the long run more).

Section 2 of the paper has a general discussion of interpreting and communicating probabilities. One of the general points made is that the more extreme the event, the lower people's mental probability threshold for verbal descriptions of likelihood. For instance, for a serious disease, the probability threshold for "very likely" may be 30%, whereas for a minor ailment, it may be 90% (these numbers are my own, not from the paper). The authors also discuss the distinction between frequentist and Bayesian approaches and claim that the frequentist approach is better suited to assimilating multiple pieces of information, and therefore, frequentist framings should be preferred to Bayesian framings when communicating uncertainty:

As should already be evident, whether the task of estimating and responding to uncertainty is framed in stochastic (usually frequentist) or epistemic (often Bayesian) terms can strongly influence which heuristics people use, and likewise lead to different choice outcomes [23]. Framing in frequentist terms on the one hand promotes the availability heuristic, and on the other hand promotes the simple acts of multiplying, dividing, and counting. Framing in Bayesian terms, by contrast, promotes the representativeness heuristic, which is not well adapted to combining multiple pieces of information. In one experiment, people were given the problem of estimating the chances that a person has a rare disease, given a positive result from a test that sometimes generates false positives. When people were given the problem framed in terms of a single patient receiving the diagnostic test, and the base probabilities of the disease (e.g., 0.001) and the reliability of the test (e.g., 0.95), they significantly over-estimate the chances that the person has the disease (e.g., saying there is a 95% chance). But when people were given the same problem framed in terms of one thousand patients being tested, and the same probabilities for the disease and the test reliability, they resorted to counting patients, and typically arrived at the correct answer (in this case, about 2%). It has, indeed, been speculated that the gross errors at probability estimation, and indeed errors of logic, observed in the literature take place primarily when people are operating within the Bayesian probability framework, and that these disappear when people evaluate problems in frequentist terms [23,58].

The authors offer the following suggestions in the discussion section (Section 4) of their paper:

The challenge of communicating probabilistic information so that it will be used, and used appropriately, by decision-makers has been long recognized. [...] In some cases, the heuristics that people use are not well suited to the particular problem that they are solving or decision that they are making; this is especially likely for types of problems outside their normal experience. In such cases, the onus is on the communicators of the probabilistic information to help people find better ways of using the information, in such a manner that respects the users’ autonomy, full set of concerns and goals, and cognitive perspective.

That these difficulties appear to be most pronounced when dealing with predictions of one-time events, where the probability estimates result from a lack of complete confidence in the predictive models. When people speak about such epistemic or structural uncertainty, they are far more likely to shun quantitative descriptions, and are far less likely to combine separate pieces of information in ways that are mathematically correct. Moreover, people perceive decisions that involve structural uncertainty as riskier, and will take decisions that are more risk averse. By contrast, when uncertainty results from well-understood stochastic processes, for which the probability estimate results from counting of relative frequencies, people are more likely to work effectively with multiple pieces of information, and to take decisions that are more risk neutral.

In many ways, the most recent approach of the IPCC WGI responds to these issues. Most of the uncertainties with respect to climate change science are in fact epistemic or structural, and the probability estimates of experts reflect degrees of confidence in the occurrence of one-time events, rather than measurement of relative frequencies in relevant data sets. Using probability language, rather than numerical ranges, matches people’s cognitive framework, and will likely make the information both easier to understand, and more likely to be used. Moreover, defining the words in terms of specific numerical ranges ensures consistency within the report, and does allow comparison of multiple events, for which the uncertainty may derive from different sources.

We have already mentioned the importance of target audiences in communicating uncertainties, but this cannot be emphasized enough. The IPCC reports have a wide readership so a pluralistic approach is necessary. For example, because of its degree of sophistication, the water chapter could communicate uncertainties using numbers, whereas the regional chapters might use words and the adaptive capacity chapter could use narratives. “Careful design of communication and reporting should be done in order to avoid information divide, misunderstandings, and misinterpretations. The communication of uncertainty should be understandable by the audience. There should be clear guidelines to facilitate clear and consistent use of terms provided. Values should be made explicit in the reporting process” [32].

However, by writing the assessment in terms of people’s intuitive framework, the IPCC authors need to understand that this intuitive framework carries with it several predictable biases. [...]

The literature suggests, and the two experiments discussed here further confirm, that the approach of the IPCC leaves room for improvement. Further, as the literature suggests, there is no single solution for these potential problems, but there are communication practices that could help. [...]

Finally, the use of probability language, instead of numbers, addresses only some of the challenges in uncertainty communication that have been identified in the modern decision support literature. Most importantly, it is important in the communication process to address how the information can and should be used, using heuristics that are appropriate for the particular decisions. [...] Obviously, there are limits to the length of the report, but within the balancing act of conciseness and clarity, greater attention to full dimensions of uncertainty could likely increase the chances that users will decide to take action on the basis of the new information.

Forecasting rare events

5 VipulNaik 11 July 2014 10:48PM

In an earlier post, I looked at some general domains of forecasting. This post looks at some more specific classes of forecasting, some of which overlap with the general domains, and some of which are more isolated. The common thread to these classes of forecasting is that they involve rare events.

Different types of forecasting for rare events

When it comes to rare events, there are three different classes of forecasts:

  1. Point-in-time-independent probabilistic forecasts: Forecasts that provide a probability estimate for the event occurring in a given timeframe, but with no distinction based on the point in time. In other words, the forecast may say "there is a 5% chance of an earthquake higher than 7 on the Richter scale in this geographical region in a year" but the forecast is not sensitive to the choice of year. These are sufficient to inform decisions on general preparedness. In the case of earthquakes, for instance, the amount of care to be taken in building structures can be determined based on these forecasts. On the other hand, it's useless for deciding the timing of specific activities.
  2. Point-in-time-dependent probabilistic forecasts: Forecasts that provide a probability estimate that varies somewhat over time based on history, but aren't precise enough for a remedial measure that substantially offsets major losses. For instance, if I know that an earthquake will occur in San Francisco in the next 6 months with probability 90%, it's still not actionable enough for a mass evacuation of San Francisco. But some preparatory measures may be undertaken.
  3. Predictions made with high confidence (i.e., a high estimated probability when the event is predicted) and a specific time, location, and characteristics: Precise predictions of date and time, sufficient for remedial measures that substantially offset major losses (but possibly at huge, if much smaller, cost). The situation with hurricanes, tornadoes, and blizzards is roughly in this category.

Statistical distributions: normal distributions versus power law distributions

Perhaps the most ubiquitous distribution used in probability and statistics is the normal distribution. The normal distribution is a symmetric distribution whose probability density function decays superexponentially with distance from the mean (more precisely, it is exponential decay in the square of the distance). In other words, the probability decays slowly at the beginning, and faster later. Thus, for instance, the ratio of pdfs for 2 standard deviations from the mean and 1 standard deviation from the mean is greater than the ratio of pdfs for 3 standard deviations from the mean and 2 standard deviations from the mean. To give explicit numbers: about 68.2% of the distribution lies between -1 and +1 SD, 95.4% lies between -2 and +2 SD, 99.7% lies between -3 and +3 SD, and 99.99% lies between -4 and +4 SD. So the probability of being more than 4 standard deviations is less than 1 in 10000.

If the probability distribution for intensity looks (roughly) like a normal distribution, then high-intensity events are extremely unlikely. So, if the probability distribution for intensity is normal, we do not have to worry about high-intensity events much.

The types of situations where rare event forecasting becomes more important is where events that are high-intensity, or "extreme" in some sense, occur rarely but not as rarely as in a normal distribution. We say that the tails of such distributions are thicker than those of the normal distribution, and the distributions are termed "thick-tailed" or "fat-tailed" distributions. [Formally, the thickness of tails is measured using a quantity called excess kurtosis, which sees how the fourth central moment compares with the square of the second central moment (the second central moment is the variance, and it is the square of the standard deviation), then subtracts off the number 3, which is the corresponding value for the normal distribution. If the excess kurtosis for a distribution is positive, it is a thick-tailed distribution.]

The most common example of such distributions that is of interest to us is power law distributions. Here, the probability is proportional to a negative power. So the decay is like a power. If you remember some basic precalculus/calculus, you'll recall that power functions (such as the square function or cube function) grow more slowly than exponential functions. So power law distributions decay more subexponentially: they decay more slowly than exponential decay (to be more precise, the decay starts off as fast, then slows down). As noted above, the pdf for the normal distribution decays exponentially in the square of the distance from the mean, so the upshot is that power law distributions decay more slowly than normal distributions.

For most of the rare event classes we discuss, to the extent that it has been possible to pin down a distribution, it has looked a lot more like a power law distribution than a normal distribution. Thus, rare events need to be heeded. (There's obviously a selection effect here: for those cases where the distributions are close to normal, forecasting rare events just isn't that challenging, so they wouldn't be included in my post).

UPDATE: Aaron Clauset, who appears in #4, pointed me (via email) to his Rare Events page, containing the code (Matlab and Python) that he used in his terrorism statistics paper mentioned as an update at the bottom of #4. He noted in the email that the statistical methods are fairly general, so interested people could use the code if they were interested in cross-applying to rare events in other domains.

Talebisms

One of the more famous advocates of the idea that people overestimate the ubiquity of normal distributions and underestimate the prevalence of power law distributions is Nassim Nicholas Taleb. Taleb calls the world of normal distributions Mediocristan (the world of mediocrity, where things are mostly ordinary and weird things are very very rare) and the world of power law distributions Extremistan (the world of extremes, where rare and weird events are more common). Taleb has elaborated on this thesis in his book The Black Swan, though some parts of the idea are also found in his earlier book Fooled by Randomness.

I'm aware that a lot of people swear by Taleb, but I personally don't find his writing very impressive. He does cover a lot of important ideas but they didn't originate with him, and he goes off on a lot of tangents. In contrast, I found Nate Silver's The Signal and the Noise a pretty good read, and although it wasn't focused on rare events per se, the parts of it that did discuss such forecasting were used by me in this post.

(Sidenote: My criticism of Taleb is broadly similar to that offered by Jamie Whyte here in Standpoint Magazine. Also, here's a review by Steve Sailer of Taleb. Sailer is much more favorably inclined to the normal distribution than Taleb is, and this is probably related to his desire to promote IQ distributions/The Bell Curve type ideas, but I think many of Sailer's criticisms are spot on).

Examples of rare event classes that we discuss in this post

The classes discussed in this post include:

  1. Earthquakes: Category #1, also, hypothesized to follows a power law distribution.
  2. Volcanoes: Category #2.
  3. Extreme weather events (hurricanes/cyclones, tornadoes, blizzards): Category #3.
  4. Major terrorist acts: Questionable, at least Category #1, some argue it is Category #2 or Category #3. Hypothesized to follow a power law distribution.
  5. Power outages (could be caused by any of 1-4, typically 3)
  6. Server outages (could be caused by 5)
  7. Financial crises
  8. Global pandemics, such as the 1918 flu pandemic (popularly called the "Spanish flu") that, according to Wikipedia, "infected 500 million people across the world, including remote Pacific islands and the Arctic, and killed 50 to 100 million of them—three to five percent of the world's population." They probably fall under Category #2, but I couldn't get a clear picture. (Pandemics were not in the list at the time of original publication of the post; I added them based on a comment suggestion).
  9. Near-earth object impacts (not in the list at the time of original publication of the post; I added them based on a comment suggestion).

Other examples of rare events would also be appreciated.

#1: Earthquakes

Earthquake prediction remains mostly in category 1: there are probability estimates of the occurrence of earthquakes of a given severity or higher within a given timeframe, but these estimates do not distinguish between different points in time. In The Signal and the Noise, statistician and forecasting expert Nate Silver talks to Susan Hough (Wikipedia) of the United States Geological Survey and describes what she has to say about the current state of earthquake forecasting:

What seismologists are really interested in— what Susan Hough calls the “Holy Grail” of seismology— are time-dependent forecasts, those in which the probability of an earthquake is not assumed to be constant across time.

Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (p. 154). Penguin Group US. Kindle Edition.

The whole Silver chapter is worth reading, as is the Wikipedia page on earthquake prediction, which covers much of the same ground.

In fact, even for the time-independent earthquake forecasting, currently the best known forecasting method is the extremely simple Gutenberg-Richter law, which says that for a given location, the frequency of earthquakes obeys a power law with respect to intensity. Since the Richter scale is logarithmic (to base 10), this means that adding a point on the Richter scale makes the frequency of earthquakes decrease to a fraction of the previous value. Note that the Gutenberg-Richter law can't be the full story: there are probably absolute limits on the intensity of the earthquake (some people believe that an earthquake of intensity 10 or higher is impossible). But so far, it seems to have the best track record.

Why haven't we been able to come up with better models? This relates to the problem of overfitting common in machine learning and statistics: when the number of data points is very small, and quite noisy, then trying a more complicated law (with more freely varying parameters) ends up fitting the noise in the data rather than the signal, and therefore ends up being a poor fit for new, out-of-sample data. The problem is dealt with in statistics using various goodness of fit tests and measures such as the Akaike information criterion, and it's dealt with in machine learning using a range of techniques such as cross-validation, regularization, and early stopping. These approaches can generally work well in situations where there is lots of data and lots of parameters. But in cases where there is very little data, it often makes sense to just manually select a simple model. The Gutenberg-Richter law has two parameters, and can be fit using a simple linear regression. There isn't enough information to reliably fit even modestly more complicated models, such as the characteristic earthquake models, and past attempts based on characteristic earthquakes failed in both directions (a predicted earthquake at Parkfield never materialized, and the probability of the 2011 Japan earthquake was underestimated by the model relative to the Gutenberg-Richter law).

Silver's chapter and other sources do describe some possibilities for short-term forecasting based on foreshocks and aftershocks, and seismic disturbances, but note considerable uncertainty.

The existence of time-independent forecasts for earthquakes has probably had major humanitarian benefits. Building codes and standards, in particular, can adapt to the probability of earthquakes. For instance, building standards are greater in the San Francisco Bay Area than in other parts of the United States, partly because of the greater probability of earthquakes. Note also that Gutenberg-Richter does make out-of-sample predictions: it can use the frequency of low-intensity earthquakes to predict the frequency of high-intensity earthquakes, and therefore obtain a time-independent forecast of such an earthquake in a region that may never have experienced it.

#2: Volcanic eruptions

Volcanoes are an easier case than earthquakes. Silver's book doesn't discuss them, but the Wikipedia article offers basic information. A few points:

  • Volcanic activity falls close to category #2: time-dependent forecasts can be made, albeit with considerable uncertainty.
  • Volcanic activity poses less immediate risk because fewer people live close to the regions where volcanoes typically erupt.
  • However, volcanic activity can affect regional and global climate for a few years (in the cooling direction), and might even shift the intercept of other long-term secular and cyclic trends in climate (the reason is that the dust particles released by volcanoes into the atmosphere reduce the extent to which solar radiation is absorbed). For instance, the 1991 Mount Pinatubo eruption is credited with causing the next 1-2 years to be cooler than they otherwise would be, masking the heating effect of a strong El Nino.

#3:  Extreme weather events (lightning, hurricanes/cyclones, blizzards, tornadoes)

Forecasting for lightning and thunderstorms has improved quite a bit over the last century, and falls squarely within Category #3. In The Signal and the Noise, Nate Silver notes that the probability of an American dying from lightning has dropped from 1 in 400,000 in 1940 to 1 in 11,000,000 today, and a large part of the credit goes to better weather forecasting causing people to avoid the outdoors at the times and places that lightning might strike.

Forecasting for hurricanes and cyclones (which are the same weather phenomenon, just at different latitudes) is quite good, and getting better. It falls squarely in category #3: in addition to having general probability estimates of the likelihood of particular types of extreme weather events, we can forecast them a day or a few days in advance, allowing for preparation and minimization of negative impact.

The precision for forecasting the eye of the storm has increased about 3.5-fold in length terms (so about 12-fold in area terms) over the last 25 years. Nate Silver notes that 25 years ago, the National Hurricane Center's forecasts for where a hurricane would hit on landfall, made three days in advance, were 350 miles off on average. Now they're about 100 miles off on average. Most of the major hurricanes that hit the United States, and many other parts of the world, were forecast well in advance, and people even made preparations (for instance, by declaring holidays, or stocking up on goods). Blizzard forecasting is also fairly impressive: I was at Chicago in 2011 when a blizzard hit, and it had been forecast at least a day in advance. With tornadoes, tornado warning alerts are often issued, albeit the tornado often doesn't actually touch down even after the alert is issued (fortunately for us).

See also my posts on weather forecasting and climate forecasting.

#4: Major terrorist acts

Terrorist attacks are interesting. It has been claimed that the frequency-damage relationship for terrorist attacks follows a power law. The academic paper that popularized this observation is a paper by Aaron Clauset, Maxwell Young and Kristian Gleditsch titled "On the Frequency of Severe Terrorist Attacks" (Journal of Conflict Resolution 51(1), 58 - 88 (2007)), here. Bruce Schneier wrote a blog post about a later paper by Clauset and Frederick W. Wiegel, and see also more discussion here, here, here, and here (I didn't select these links through a very discerning process; I just picked the top results of a Google Search).

Silver's book does allude to power laws for terrorism, but I couldn't find any reference to Clauset in his book (oops, seems like my Kindle search was buggy!) and says the following about Clauset:

Clauset’s insight, however, is actually quite simple— or at least it seems that way with the benefit of hindsight. What his work found is that the mathematics of terrorism resemble those of another domain discussed in this book: earthquakes.

Imagine that you live in a seismically active area like California. Over a period of a couple of decades, you experience magnitude 4 earthquakes on a regular basis, magnitude 5 earthquakes perhaps a few times a year, and a handful of magnitude 6s. If you have a house that can withstand a magnitude 6 earthquake but not a magnitude 7, would it be right to conclude that you have nothing to worry about?

Of course not. According to the power-law distribution that these earthquakes obey, those magnitude 5s and magnitude 6s would have been a sign that larger earthquakes were possible—inevitable, in fact, given enough time. The big one is coming, eventually. You ought to have been prepared.

Terror attacks behave in something of the same way. The Lockerbie bombing and Oklahoma City were the equivalent of magnitude 7 earthquakes. While destructive enough on their own, they also implied the potential for something much worse— something like the September 11 attacks, which might be thought of as a magnitude 8. It was not an outlier but instead part of the broader mathematical pattern.

Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (pp. 427-428). Penguin Group US. Kindle Edition.

So terrorist attacks are at least in category 1. What about categories 2 and 3? Can we forecast terrorist attacks the way we can forecast volcanoes, or the way we can forecast hurricanes. One difference between terrorist acts and the "acts of God" discussed so far is that to the extent one has inside information about a terrorist attack that's good enough to predict it with high accuracy, it's usually also sufficient to actually prevent the terrorist attack. So Category 3 becomes trickier to define. Should we count the numerous foiled terrorist plots as evidence that terrorist acts can be successfully "predicted" or should we only consider successful terrorist acts in the denominator? And another complication is that terrorist acts are responsive to geopolitical decisions in ways that earthquakes are definitely not, with extreme weather events falling somewhere in between.

As for Category 2, the evidence is unclear, but it's highly likely that terrorist acts can be forecast in a time-dependent fashion to quite a degree. If you want to crunch the numbers yourself, the Global Terrorism Database (website, Wikipedia) and Suicide Attack Database (website, Wikipedia) are available for you to use. I discussed some general issues with political and conflict forecasting in my earlier post on the subject.

UPDATE: Clauset emailed me with some corrections to this section of the post, which I have made. He also pointed to a recent paper he co-wrote with Ryan Woodward about estimating the historical and future probabilities of terror events, available on the ArXiV. Here's the abstract:

Quantities with right-skewed distributions are ubiquitous in complex social systems, including political conflict, economics and social networks, and these systems sometimes produce extremely large events. For instance, the 9/11 terrorist events produced nearly 3000 fatalities, nearly six times more than the next largest event. But, was this enormous loss of life statistically unlikely given modern terrorism's historical record? Accurately estimating the probability of such an event is complicated by the large fluctuations in the empirical distribution's upper tail. We present a generic statistical algorithm for making such estimates, which combines semi-parametric models of tail behavior and a nonparametric bootstrap. Applied to a global database of terrorist events, we estimate the worldwide historical probability of observing at least one 9/11-sized or larger event since 1968 to be 11-35%. These results are robust to conditioning on global variations in economic development, domestic versus international events, the type of weapon used and a truncated history that stops at 1998. We then use this procedure to make a data-driven statistical forecast of at least one similar event over the next decade.

#5: Power outages

Power outages could have many causes. Note that insofar as we can forecast the phenomena underlying the causes, this can be used to reduce, rather than simply forecast, power outages.

  • Poor load forecasting, i.e., electricity companies don't forecast how much demand there will be and don't prepare supplies adequately. This is less of an issue in developed countries, where the power systems are more redundant (at some cost to efficiency): Note here that the power outage occurs due to a failure of a more mundane forecasting exercise. Forecasting the frequency of power outages due to this cause is basically an exercise in calibrating the quality of the mundane forecasting exercise.
  • Abrupt or significant shortages in fuel, often for geopolitical reasons. This therefore ties in with the general exercise of geopolitical forecasting (see my earlier post on the subject). This seems rare in the modern world, due to the considerably redundancy built into global fuel supplies.
  • Disruption of power lines or power supply units due to weather events. The most common causes appear to be lightning, ice, wind, rain, and flooding. This ties in with #3, and with my weather forecasting and climate forecasting posts. This is the most common cause of power outages in developed countries with advanced electricity grids (see, for instance, here and here).
  • Disruption by human or animal activity, including car accidents and animals climbing onto and playing with the power lines.
  • Perhaps the most niche source of power outages, that many people may be unaware of, is geomagnetic storms (Wikipedia). These are quite rare, but can result in major power blackouts. Geomagnetic storms were discussed in past MIRI posts (here and here). Geomagnetic storms are low-frequency and low-probability events but with potentially severe negative impact.

My impression is that when it comes to power outages, we are at Category 2 in forecasting. Load forecasting can identify seasons, times of the day, and special occasions when power demand will be high. Note that the infrastructure needs to built for peak capacity.

We can't quite be in Category 3,  because in cases where we can forecast more finely, we could probably prevent the outage anyway.

What sort of preventive measures do people undertake with knowledge of the frequency of power outages? In places where power outages are more likely, people are more likely to have backup generators. People may be more likely to use battery-powered devices. If you know that a power outage is likely to happen in the next few days, you might take more care to charge the batteries on your devices.

#6: Server outages

In our increasingly connected world, websites going down can have a huge effect on the functioning of the Internet and of the world economy. As with power infrastructure, the complexity of server infrastructure needed to increase uptime increases very quickly. The point is that routing around failures at different points in the infrastructure requires redundancy. For instance, if any one server fails 10% of the time, and the failures of different components are independent, you'd need two servers to get to a 1% failure rate. But in practice, the failures aren't independent. For instance, having loads of servers in a single datacenter covers the risk of any given server there crashing, but it doesn't cover the risk of the datacenter itself getting disconnected (e.g., losing electricity, or getting disconnected from the Internet, or catching fire). So now we need multiple datacenters. But multiple datacenters are far from each other, so that increases the time costs of synchronization. And so on. For more detailed discussions of the issues, see here and here.

My impression is that server outages are largely Category 1: we can use the probability of outages to determine the trade-off between the cost of having redundant infrastructure and the benefit of more uptime. There is an element of Category 2: in some cases, we have knowledge that traffic will be higher at specific times, and additional infrastructure can be brought to bear for those times. As with power infrastructure, server infrastructure needs to be built to handle peak capacity.

#7: Financial crises

The forecasting of financial crises is a topic worthy of its own post. As with climate science, financial crisis forecasting has the potential for heavy politicization, given the huge stakes both of forecasting financial crises and of any remedial or preventative measures that may be undertaken. In fact, the politicization and ideology problem is probably substantially worse in financial crisis forecasting. At the same time, real-world feedback occurs faster, providing more opportunity for people to update their beliefs and less scope for people getting away with sloppiness because their predictions take too long to evaluate.

A literally taken strong efficient market hypothesis (EMH) (Wikipedia) would suggest that financial crises are almost impossible to forecast, while a weaker reading of the EMH would suggest that the financial market is efficient (Wikipedia): it's hard to make money off the business of forecasting financial crises (for instance, you may know that a financial crisis is imminent with high probability, but the element of uncertainty, particularly with regards to timing, can destroy your ability to leverage that information to make money). On the other hand, there are a lot of people, often subscribed to competing schools of economic thought, who successfully forecast the 2007-08 financial crisis, at least in broad strokes.

Note that there are people who reject the EMH, yet claim that financial crises are very hard to forecast in a time-dependent fashion. Among them is Nassim Nicholas Taleb, as described here. Interestingly, Taleb's claim to fame appears to have been that he was able to forecast the 2007-08 financial crisis, albeit it was more of a time-independent forecast than a specific timed call. The irony was noted by by Jamie Whyte here in Standpoint Magazine.

I found a few sources of information for financial crises, that are discussed below.

Economic Predictions records predictions made by many prominent people and how they compared to what transpired.  In particular, this page on their website notes how many of the top investors, economists, and bureaucrats missed the financial crisis, but also identifies some exceptions: Dean Baker, Med Jones, Peter Schiff, and Nouriel Roubini. The page also discusses other candidates who claim to have forecasted the crisis in advance, and reasons why they were not included. While I think they've put in a fair deal of effort into their project, I didn't see good evidence that they have a strong grasp of the underlying fundamental issues they are discussing.

An insightful general overview of the financial crisis is found in Chapter 1 of Nate Silver's The Signal and the Noise, a book that I recommend you read in its entirety. Silver notes four levels of forecasting failure.

 

  • The housing bubble can be thought of as a poor prediction. Homeowners and investors thought that rising prices implied that home values would continue to rise, when in fact history suggested this made them prone to decline.
  • There was a failure on the part of the ratings agencies, as well as by banks like Lehman Brothers, to understand how risky mortgage-backed securities were. Contrary to the assertions they made before Congress, the problem was not that the ratings agencies failed to see the housing bubble. Instead, their forecasting models were full of faulty assumptions and false confidence about the risk that a collapse in housing prices might present.
  • There was a widespread failure to anticipate how a housing crisis could trigger a global financial crisis. It had resulted from the high degree of leverage in the market, with $ 50 in side bets staked on every $ 1 that an American was willing to invest in a new home.
  • Finally, in the immediate aftermath of the financial crisis, there was a failure to predict the scope of the economic problems that it might create. Economists and policy makers did not heed Reinhart and Rogoff’s finding that financial crises typically produce very deep and long-lasting recessions.


Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (pp. 42-43). Penguin Group US. Kindle Edition.

 

Silver finds a common thread among all the failures (emphases in original):

There is a common thread among these failures of prediction. In each case, as people evaluated the data, they ignored a key piece of context:

  • The confidence that homeowners had about housing prices may have stemmed from the fact that there had not been a substantial decline in U.S. housing prices in the recent past. However, there had never before been such a widespread increase in U.S. housing prices like the one that preceded the collapse.
  • The confidence that the banks had in Moody’s and S& P’s ability to rate mortgage-backed securities may have been based on the fact that the agencies had generally performed competently in rating other types of financial assets. However, the ratings agencies had never before rated securities as novel and complex as credit default options.
  • The confidence that economists had in the ability of the financial system to withstand a housing crisis may have arisen because housing price fluctuations had generally not had large effects on the financial system in the past. However, the financial system had probably never been so highly leveraged, and it had certainly never made so many side bets on housing before.
  • The confidence that policy makers had in the ability of the economy to recuperate quickly from the financial crisis may have come from their experience of recent recessions, most of which had been associated with rapid, “V-shaped” recoveries. However, those recessions had not been associated with financial crises, and financial crises are different.

There is a technical term for this type of problem: the events these forecasters were considering were out of sample. When there is a major failure of prediction, this problem usually has its fingerprints all over the crime scene.

Silver, Nate (2012-09-27). The Signal and the Noise: Why So Many Predictions Fail-but Some Don't (p. 43). Penguin Group US. Kindle Edition.

While I find Silver's analysis plausible and generally convincing, I don't think I have enough of an inside-view understanding of the issue.

A few other resources that I found, but didn't get a chance to investigate, are listed below:

#8: Pandemics

I haven't investigated this thoroughly, but here are a few of my impressions and findings:

  • I think that pandemics stand in relation to ordinary epidemiology in the same way that extreme weather events stand in relation to ordinary weather forecasting. In both cases, the main way we can get better at forecasting the rare and high-impact events is by getting better across the board. There is a difference that makes the relation between moderate disease outbreaks and pandemics even more important than the corresponding case for weather: measures taken quickly to react to local disease outbreaks can help prevent global pandemics.
  • Chapter 7 of Nate Silver's The Signal and the Noise, titled "Role Models", discusses forecasting and prediction in the domain of epidemiology. The goal of epidemiologists is to obtain predictive models that have a level of accuracy and precision similar to those used for the weather. However, the greater complexity of human behavior, as well as the self-fulfilling and self-canceling nature of various predictions, makes the modeling problem harder. Silver notes that agent-based modeling (Wikipedia) is one of the commonly used tools. Silver cites a few examples from recent history where people were overly alarmed about possible pandemics, when the reality turned out to be considerably milder. However, the precautions taken due to the alarm may still have saved lives. Silver talks in particular of the 1976 swine flu outbreak (where the reaction turned out be grossly disproportional to the problem, and caused its own unintended consequences) and the 2009 flu pandemic.
  • In recent years, Google Flu Trends (website, Wikipedia) has been a common technique in identifying and taking quick action against the flu. Essentially, Google uses the volume of web search for flu-related terms by geographic location to identify the incidence of the flu by geographic location. It offers an early "leading indicator" of flu incidence compared to official reports, that are published after a time lag. However, Google Flu Trends has run into problems of reliability: news stories about the flu might prompt people to search for flu-related terms, even if they aren't experiencing symptoms of the flu. Or it may even be the case that Google's own helpful search query completions get people to search for flu-related terms once other people start searching for the term. Tim Harford discusses the problems in the Financial Times here. I think Silver doesn't discuss this (which is a surprise, since it would have fit well with the theme of his chapter).

#9: Near-earth object impacts

I haven't looked into this category in sufficient detail. I'll list below the articles I read.

[QUESTION]: Driverless car forecasts

7 VipulNaik 11 July 2014 12:25AM

Of the technologies that have a reasonable chance of come to mass market in the next 20-25 years and having a significant impact on human society, driverless cars (also known as self-driving cars or autonomous cars) stand out. I was originally planning to collect material discussing driverless cars, but Gwern has a really excellent compendium of statements about driverless cars, published January 2013 (if you're reading this, Gwern, thanks!). There have been a few developments since then (for instance, Google's announcement that it was building its own driverless car, or a startup called Cruise Automation planning to build a $10,000 driverless car) but the overall landscape remains similar. There's been some progress with understanding and navigating city streets and with handling adverse weather conditions, and it's more or less on schedule.

My question is about driverless car forecasts. Driverless Future has a good summary page of forecasts made by automobile manufacturer, insurers, and professional societies. The range of time for the arrival of the first commercial driverless cars varies between 2018 and 2030. The timeline for driverless cars to achieve mass penetration is similarly stagged between the early 2020s and 2040. (The forecasts aren't all directly comparable).

A few thoughts come to mind:

  1. Insurer societies and professional societies seem more conservative in their estimates than manufacturers (both automobile manufacturers and people manufacturing the technology for driverless cars). Note that the estimates of many manufacturers are centered on their projected release dates for their own driverless cars. This suggests an obvious conflict of interest: manufacturers may be incentivized to be optimistic in their projections of when driverless cars will be released, insofar as making more optimistic predictions wins them news coverage and might also improve their market valuation. (At the same time, the release dates are sufficiently far in the future that it's unlikely that they'll be held to account for false projections, so there isn't a strong incentive to be conservative the same way as there is with quarterly sales and earning forecasts). Overall, then, I'd defer more to the judgment of the professional societies, namely the IEEE and the Society of Autonomous Engineers.
  2. The statements compiled by Gwern point to the many legal hurdles and other thorny issues of ethics that would need to be resolved, at least partially, before driverless cars start becoming a big presence in the market.
  3. The general critique made by Schnaars in Megamistakes (that I discussed here) applies to driverless car technology: consumers may be unwilling to pay the added cost despite the safety benefits. Some of the quotes in Gwern's compendium reference related issues. This points further in the direction of forecasts by manufacturers being overly optimistic.

Questions for the people here:

  • Do you agree with my points (1)-(3) above?
  • Would you care to make forecasts for things such as: (a) the date that the first commercial driverless car will hit the market in a major country or US state? (b) the date by which over 10% of new cars sold in a large country or US state will be driverless (i.e., capable of fully autonomous operation), (c) same as (b), but over 50%, (d) the date by which over 10% of cars on the road (in a large country or US state) will be operating autonomously, (e) same as (d), but over 50%. You don't have to answer these exact questions, I'm just providing some suggestions since "forecast the future of driverless cars" is overly vague.
  • What's your overall view on whether it is desirable at the margin to speed up or slow down the arrival of autonomous vehicles on the road? What factors would you consider in answering such a question?

View more: Next